[Python Series] Beginning Python Visualization

Beginning Python Visualization: Crafting Visual Transformation Scripts
Shai Vaingast

Copyright © 2009 by Shai Vaingast

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

ISBN-13 (pbk): 978-1-4302-1843-2
ISBN-13 (electronic): 978-1-4302-1844-9

Printed and bound in the United States of America 9 8 7 6 5 4 3 2 1

Trademarked names may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, we use the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

Lead Editors: Frank Pohlmann, Michelle Lowman
Technical Reviewer: C. Titus Brown
Editorial Board: Clay Andres, Steve Anglin, Mark Beckner, Ewan Buckingham, Tony Campbell, Gary Cornell, Jonathan Gennick, Michelle Lowman, Matthew Moodie, Jeffrey Pepper, Frank Pohlmann, Ben Renow-Clarke, Dominic Shakeshaft, Matt Wade, Tom Welsh
Project Manager: Kylie Johnston
Copy Editor: Ami Knox
Associate Production Director: Kari Brooks-Copony
Production Editor: Kelly Winquist
Compositor: Dina Quan
Proofreader: Liz Welch
Indexer: Julie Grady
Artist: April Milne
Cover Designer: Kurt Krames
Manufacturing Director: Tom Debolski

Distributed to the book trade worldwide by Springer-Verlag New York, Inc., 233 Spring Street, 6th Floor, New York, NY 10013.
Phone 1-800-SPRINGER, fax 201-348-4505, e-mail orders-ny…

    GPS receiver, connected to a Lenovo laptop T60.
    Data was gathered via the serial port stored to clear text files (CSV).
    Measurements were taken to estimate speed and time spent in traffic.
    Gathered by Shai Vaingast.
    Date: throughout 2008, see file timestamps.

CHAPTER 1: NAVIGATING THE WORLD OF DATA VISUALIZATION

Data Analysis

Once data is organized and accessible in files, the next step is to extract information. Information can be a value, a graph, or a report pertaining to the problem at hand.

The idea is to use Python’s scripting abilities and the wide range of readily available packages to write a fully automated application to process, analyze, and visualize data. Scripts are small pieces of code that are written relatively quickly in a high-level programming language. The key word here is productivity, the ability to change and test algorithms and extract results fast. Scripts might not be highly efficient in terms of processing speed, but written properly, they should not slow down running times. For example, a script might generate graphs or search the hard drive for data files, analyze log files, and extract the maximum and minimum temperatures, or in our case, analyze GPS data.

Back to our GPS case study. The following is the algorithm we’ll follow:

1. Compile a list of all the data files.
2. For each file:
   a. Read the data.
   b. Process the data.
   c. Plot the data.

Walking Directories

To compile a list of all the files having GPS data, we’ll use the function os.walk() provided with the module os, which is part of the Python Standard Library. To use os, we issue import os.

    >>> import os
    >>> for root, dirs, files in os.walk('../data'):
    ...     print root, dirs, files
    ...
    ../data [] ['GPS-2008-05-30-09-00-50.csv', 'GPS-2008-05-30-09-10-52.csv', 'Readme.txt']

Note: To be able to change directories within the Python interpreter, first issue import os. Then, to change to a directory, issue os.chdir(directory_path).
To list directory contents, you can use os.listdir(directory_path). Some interpreters, like IPython, let you use, among other enhancements, shell-like commands such as cd and ls, which add considerably to usability.

The function os.walk() iterates through the directory data and its subdirectories recursively, looking for files and folders, storing the results in variables root, dirs, and files. The second line prints out the root directory for our search, in our case ../data (notice the relative path), then the subdirectories, and lastly the files themselves, in a list. I’ve only recorded two data files, but as time progresses, more data is added to this folder, and the number can increase substantially. Since we have no subdirectories in folder data, the output corresponding to dirs should be an empty list, which is denoted by [].

os.walk() is a bit of overkill here. In our case, directory data doesn’t have any subdirectories, and we could have just as easily listed the contents of the directory using the os.listdir() function call, as follows:

    >>> os.listdir('../data')
    ['GPS-2008-05-30-09-00-50.csv', 'GPS-2008-05-30-09-10-52.csv', 'Readme.txt']

However, os.walk() is very useful. It’s not uncommon to have files grouped together in directories and, within those directories, subdirectories holding more files. For example, you might want to group files in accordance with the GPS that recorded the data. Or if another driver is recording GPS data, you might want to put that data in a separate subdirectory within your data directory. In those cases, os.walk() is exactly what’s needed.

Now that we have a list of all the files in directory data, we turn to process only those with the .csv extension.
This is done using the endswith() function, which checks whether a string ends with “csv”. Files that do not end with “csv” are skipped using the continue statement: continue instructs the for loop to skip current execution and proceed to the next element. Files that do end with “csv” are read and processed. We also introduce a function to create a full file name path from the directory and the file name, os.path.join(), as shown in Listing 1-3.

Listing 1-3. Processing Only CSV Files

    for filename in files:
        # create full file name including path
        cur_file = os.path.join(root, filename)
        if filename.endswith('csv'):
            y = read_csv_file(cur_file)
        else:
            continue
        # only files with the .csv extension from here on

Reading CSV Files

Our next step is to read the files. Again, we turn to Python’s built-in modules, this time the csv module. Although the CSV file format is quite popular, there’s no clear definition, and each spreadsheet and database employs its own “dialect.” The files we’ll be processing adhere to the most basic CSV file dialect, so we’ll use the default behavior of Python’s csv module.

Since we’ll be reading several CSV files, it stands to reason to define a function to perform this task. Listing 1-4 shows this function.

Listing 1-4. A Function to Read CSV Files

    def read_csv_file(filename):
        """Reads a CSV file and returns it as a list of rows."""
        data = []
        for row in csv.reader(open(filename)):
            data.append(row)
        return data

The first line defines a function named read_csv_file(). CSV file support is introduced with the csv module, so we have to import csv before calling the function. The function takes one variable, filename, and returns an array of rows holding data in the file.
What I mean by this is that every line read is processed and becomes a list, with every comma-separated value as one element in that list. The function returns an array of such lists. For example:

    >>> import csv
    >>> x = read_csv_file('../data/GPS-2008-06-04-09-03-45.csv')
    >>> len(x)
    3683
    >>> x[10]
    ['$GPGSV', '3', '3', '12', '29', '10', '040', '', '16', '01', '302', '', '26', '01',
    '037', '', '00', '00', '000', '*72']
    >>> x[1676]
    ['$GPGSV', '3', '1', '12', '21', '86', '258', '43', '18', '66', '286', '20', '15',
    '50', '059', '45', '24', '44', '126', '43*72']

len(x) lets us know the size of the array of lists. It’s also a crude way for us to ensure that data was actually read into the array.

The second line in the function is called a docstring, and it is characterized by three quotes (""") surrounding the text in the following manner: """docstring""". In this case, a docstring is used to document the function, that is, what it does. Issuing the command help(funcname) yields its docstring:

    >>> help(read_csv_file)
    Help on function read_csv_file in module __main__:

    read_csv_file(filename)
        Reads a CSV file and returns it as a list of rows.

You should use help() extensively. help() can be invoked with functions as well as modules. For example, the following invokes help on module csv:

    >>> help(csv)
    Help on module csv:

    NAME
        csv - CSV parsing and writing.

    FILE
        /usr/lib/python2.5/csv.py

    MODULE DOCS
        http://www.python.org/doc/current/lib/module-csv.html

    DESCRIPTION
        This module provides classes that assist in the reading and writing
        of Comma Separated Value (CSV) files, and implements the interface
        described by PEP 305. Although many CSV files are simple to parse,
        the format is not formally defined by a stable specification and
        is subtle enough that parsing lines of a CSV file with something
        like line.split(",") is bound to fail. The module supports three
        basic APIs: reading, writing, and registration of dialects.

Next in our dissection is the line data = [], which declares a variable named data and initializes it as an empty list. data will be used to store the values from the CSV file.
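To see where a docstring ends up, here is a minimal sketch written for Python 3 (not the Python 2.5 interpreter used in the sessions above). The function describe_header() and its lookup table are hypothetical and exist only to carry a multi-line docstring; __doc__ and inspect.getdoc() are standard Python.

```python
import inspect


def describe_header(header):
    """Returns a short description of an NMEA header stamp.

    Unknown headers are reported as 'unknown'.
    """
    # Hypothetical lookup table, for illustration only.
    known = {"$GPRMC": "position and velocity", "$GPGSV": "satellites in view"}
    return known.get(header, "unknown")


# The raw docstring is stored on the function object itself.
print(describe_header.__doc__)

# inspect.getdoc() returns the same text with the indentation cleaned up,
# which is roughly what help() displays.
print(inspect.getdoc(describe_header))
```

Reading __doc__ directly is handy in scripts, while help() remains the more convenient choice at the interactive prompt.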
The csv module helps us read CSV files by automating a lot of the tasks associated with reading CSV files. I will discuss CSV files and the csv module in later chapters, so here I’ll only provide an overview. These are the operations to perform in order to read CSV files using the csv module:

1. Open the file for reading.
2. Create a csv.reader object. The csv.reader object has functions that help us read CSV files.
3. Using the csv.reader object, read the data from the file, a row at a time.
4. Append every row to variable data.
5. Close the file.

Let’s try this, a step at a time:

    >>> f = open('../data/GPS-2008-06-04-09-03-45.csv')
    >>> cr = csv.reader(f)
    >>> for row in cr:
    ...     print row
    ...
    ['$GPGSA', 'A', '3', '21', '18', '15', '24', '', '22', '', '', '', '', '', '',
    '03.5', '02.2', '02.7*09']
    ['$GPGSV', '3', '1', '12', '21', '86', '267', '39', '18', '66', '286', '44', '15',
    '51', '060', '43', '24', '45', '125', '30*7A']
    ['$GPGSV', '3', '2', '12', '06', '28', '300', '33', '22', '27', '265', '31', '03',
    '18', '312', '27', '29', '15', '185', '31*7C']
    ['$GPGSV', '3', '3', '12', '09', '15', '138', '31', '16', '00', '301', '', '19',
    '00', '332', '', '00', '00', '000', '*70']
    ['$GPRMC', '140706.24', 'A', '4455.6241', 'N', '09328.0519', 'W', '011.4', '152.7',
    '040608', '001.2', 'E', 'A*25']
    ['$GPGGA', '140706.24', '4455.6241', 'N', '09328.0519', 'W', '1', '04',
    '03.0', '00295.1', 'M', '-030.7', 'M', '', '*51']
    ['$GPGSA', 'A', '3', '21', '18', '15', '24', '', '', '', '', '', '', '', '', '08.9',
    '03.0', '08.4*04']
    >>> f.close()

First, we open the data file and assign it to variable f. The opened file can now be referred to by the variable f. Next, we create a csv.reader object, cr. We associate the csv.reader object, cr, with the file f. We then iterate through every row of the csv.reader object and print that row. Lastly, we close the file by calling f.close(). It is considered good practice to close the file once you’re done with it, but if you neglect to do so, Python will close the file automatically once the variable f is no longer in use.
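The five steps above can be sketched as a single function. This version is written for Python 3 rather than the Python 2.5 of the sessions above, and it uses a with block (not shown in the text) so that the file is closed automatically. The two sample rows are copied from the output above and written to a temporary file, which is an assumption made only so the example runs without the data directory.

```python
import csv
import os
import tempfile

# Two sample rows taken from the session above, written to a temporary file
# so the example does not depend on the ../data directory.
sample = (
    "$GPRMC,140706.24,A,4455.6241,N,09328.0519,W,011.4,152.7,040608,001.2,E,A*25\n"
    "$GPGGA,140706.24,4455.6241,N,09328.0519,W,1,04,03.0,00295.1,M,-030.7,M,,*51\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
    path = tmp.name


def read_csv_file(filename):
    """Reads a CSV file and returns it as a list of rows."""
    data = []
    # Steps 1 and 5: the with block opens the file and closes it when done.
    with open(filename, newline="") as f:
        # Steps 2 to 4: create the reader, read row by row, append each row.
        for row in csv.reader(f):
            data.append(row)
    return data


rows = read_csv_file(path)
os.remove(path)
print(len(rows), rows[0][0], rows[1][0])
```

The with block makes the explicit f.close() call unnecessary, which removes one step that is easy to forget.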
One of the things that you can do in Python is cascade functions. This means you can call functions on results of other functions. This process can be repeated several times. Cascading (usually) adds clarity and produces more elegant scripts. In our case, since variable f isn’t really important to us, we discard it after we attach it to a csv.reader object; so instead of the preceding code, we can write the following:

    >>> cr = csv.reader(open('data/CB401-2005-06-21-013504.csv'))
    >>> for row in cr:
    ...     print row

The same holds true for variable cr, so if we’re feeling particularly brave, we can use this script:

    >>> for row in csv.reader(open('data/CB401-2005-06-21-013504.csv')):
    ...     print row

While the script might be shorter, there’s no performance gain. It is therefore suggested that you cascade functions only if it adds clarity; there’s a good chance you’ll be editing this code later on, and it’s important to be able to understand what’s going on. In fact, not cascading functions might be useful at times because you might need access to intermediate variables (such as f and cr in our case).

The csv.reader object converts each row we read into a row of fields, in the form of a list. That row is then appended to a list of rows, data. This is also the value returned by the function.

Note: By now you’ve seen the dot symbol (.) used several times, and it might be a bit confusing, so an explanation is in order. The dot symbol is used to access function members of modules as well as function members of objects (classes). You’ve seen it in member functions of modules, such as csv.reader(), but also for objects, such as f.read(). In the latter, it means that the file object has a member function read() and that function is called to operate on variable f. To access these functions, we use the dot operator. We’ll touch on this again in Chapter 3. Lastly, we use the ellipsis symbol (...) to denote line continuation when interactively entering commands in Python.
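As a small illustration of the cascading discussion above, the following Python 3 sketch reads the same text both ways and compares the results. io.StringIO is used here as a stand-in for open(), an assumption made so the example needs no file on disk; the two rows are shortened, made-up samples.

```python
import csv
import io

text = "$GPGSV,3,1,12\n$GPRMC,140706.24,A\n"

# Step by step: the intermediate names f and cr stay available for inspection.
f = io.StringIO(text)
cr = csv.reader(f)
rows_stepwise = [row for row in cr]

# Cascaded: the result of one call is passed straight into the next.
rows_cascaded = list(csv.reader(io.StringIO(text)))

print(rows_stepwise == rows_cascaded)

# Keeping cr around lets us query it afterwards, here for the lines consumed.
print(cr.line_num)
```

Both forms yield the same rows. The step-by-step form is the one to prefer when an intermediate object, such as cr above, is still needed later.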
Analyzing GPS Data

Let’s take a closer look at the GPS data.

• Each row seems to start with a text header stamp, beginning with the characters $GP.
• There are several header stamps, for example, $GPGSA and $GPRMC.
• Following the header are additional values, most of which are numeric.

Not being GPS savvy, I looked up the GPS format on the Internet. It turns out the format is known as NMEA 0183. NMEA stands for the National Marine Electronics Association; see http://www.nmea.org for more information. The NMEA 0183 data format is described at http://www.gpsinformation.org/dale/nmea.htm. There are a lot of header stamps in the format, and some might hold useful information for our task.

As mentioned earlier, several $GP header stamps appear in our data files, but which ones exactly are of relevance is a different question. First, it would be nice to know which header stamps from the NMEA standard are even present in our data files. One option would be to open the files, look for the headers, and jot down every new header once we see it. Another, of course, would be to use Python to do that for us.

Python is a very high-level programming language. As such, it has built-in support for dictionaries (also known as associative arrays in Perl), which are data structures that have a one-to-one relationship between a key and a value, very much like real dictionaries. Traditional dictionaries, however, often have several values for a key, that is, several interpretations (values) for one word (key). You can easily implement this in Python using the dictionary object as well, by assigning a list value to a key. That way you can have several entries per key, because the key is associated with a list that can hold several values. In reality, it’s still a one-to-one relationship. But enough about that for now; I’ll cover dictionaries in more detail in future chapters.
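The idea of assigning a list value to a key can be sketched as follows (Python 3). The words and meanings are made-up sample data, and setdefault() is one of several ways to build such a structure; the text does not prescribe a particular method.

```python
# Made-up sample data: each key maps to a list, so one key can carry several
# entries while the dictionary itself still maps a key to exactly one value.
meanings = {}
entries = [
    ("bank", "financial institution"),
    ("bank", "side of a river"),
    ("plot", "graph of data"),
]
for word, meaning in entries:
    # setdefault() returns the existing list, or stores and returns a new one.
    meanings.setdefault(word, []).append(meaning)

print(meanings["bank"])
```

The dictionary holds two keys; the key "bank" is associated with a single list that happens to contain two meanings.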
What we want to do here is use a dictionary object to hold the number of times a header is encountered. Our key will be the GPS header stamp, and our value will be a number, indicating occurrence. We’ll increment the value whenever a key is encountered, as shown in Listing 1-5.

Listing 1-5. Function list_gps_commands()

    def list_gps_commands(data):
        """Counts the number of times a GPS command is observed.

        Returns a dictionary object."""
        gps_cmds = dict()
        for row in data:
            try:
                gps_cmds[row[0]] += 1
            except KeyError:
                gps_cmds[row[0]] = 1
        return gps_cmds

Some notes about this function. First, the docstring spans multiple lines, which is one of the key benefits of docstrings. Docstrings will display all the spaces and line breaks as shown in the function itself. Next we initialize a variable, gps_cmds, to be our dictionary. We then process every list in the GPS data: we only care about the first element of every row, as that’s the value that holds the GPS header stamps. We then increment the value associated with the key: gps_cmds[row[0]] += 1. We use the += operation to increment the value by 1, similar to how it’s done in C (Python, however, does not use the ++ operator). If the key does not exist, which will happen whenever we encounter a new header stamp, an exception will be raised. We catch the exception with our except KeyError statement. In case of an exception, we set the dictionary value associated with the key to 1.

The function list_gps_commands() can be written even more compactly using the dictionary method get(); see Chapter 3 for details.

Let’s analyze some GPS data:

    >>> x = read_csv_file('../data/GPS-2008-05-30-09-00-50.csv')
    >>> list_gps_commands(x)
    {'$GPGSA': 282, '$GPGSV': 846, '$GPGGA': 282, '$GPRMC': 283}

It turns out there are four distinct GPS headers being generated by my GPS. Of those, only two interest me: $GPGSV, which holds the number of satellites in view (Hey! It’s really important!), and $GPRMC, which holds location and velocity information.
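The more compact get() variant mentioned above might look like the following Python 3 sketch. It is one possible reading of that remark, not the version the book gives in Chapter 3, and the rows are made-up samples. collections.Counter from the standard library is shown alongside as an alternative that the text does not mention.

```python
from collections import Counter


def list_gps_commands(data):
    """Counts the number of times each GPS header stamp is observed."""
    gps_cmds = {}
    for row in data:
        # get() returns 0 for a header that has not been seen yet,
        # so no try/except block is needed.
        gps_cmds[row[0]] = gps_cmds.get(row[0], 0) + 1
    return gps_cmds


# Made-up rows; only the first element of each row matters here.
rows = [["$GPGSA", "A"], ["$GPGSV", "3"], ["$GPGSV", "3"], ["$GPRMC", "140706.24"]]
counts = list_gps_commands(rows)
print(counts)

# collections.Counter produces the same tally in one line.
print(dict(Counter(row[0] for row in rows)) == counts)
```

The get() form trades the exception handler for a default value, which keeps the loop body to a single line.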
So what we’d like to do is code a function that takes the GPS data and, whenever the header field is $GPGSV or $GPRMC, extracts the information and stores it in numerical arrays that will be easier to manipulate later on. Numerical arrays are introduced with the NumPy module, so we have to issue import numpy. Since we’ll be using a lot of the functionality of NumPy, SciPy, and matplotlib, an easier approach would be to issue import pylab, which imports all these modules, as follows:

    >>> from pylab import *

Note: The name PyLab comes from Python and MATLAB. PyLab provides MATLAB-like functionality in Python.

Extracting GPS Data

In the case of a $GPGSV header, the number of satellites is the fourth entry. In the case of a $GPRMC header, we have a bit more interesting information. The second field is the timestamp, the fourth field is the latitude, the sixth field is the longitude, and the eighth field is the velocity. Again, turn to the NMEA 0183 format for more details. Table 1-1 summarizes the fields and their values in a $GPRMC line.

Table 1-1. $GPRMC Information (Excerpt)

    Field Name   Index   Format
    Header       0       $GPRMC (fixed)
    Timestamp    1       hhmmss.ss
    Latitude     3       DDMM.MMM
    Longitude    5       DDDMM.MMM
    Velocity     7       VVV.V

Some caveats regarding the information in $GPRMC. We first turn to the timestamp of an arbitrary line:

    >>> x[12]
    ['$GPRMC', '140055.00', 'A', '4454.1740', 'N', '09325.0143', 'W', '000.0', '128.7',
    '300508', '001.1', 'E', 'A*28']

In this output, the timestamp appears as '140055.00'. This follows the format hhmmss.ss, where hh are two digits representing the hour (always two digits: if the hour is one digit, say 7 in the morning, a 0 is added before it), mm are two digits representing the minute (again, always two digits), and ss.ss are five characters (four digits plus the dot) representing seconds and fractions of seconds. (There’s also a North/South field as well as an East/West field.
Here, for simplicity, we assume northern hemisphere, but you can easily change these values by reading the entire $GPRMC structure.)

Note: In the ISO time format, we’ve used HHMMSS to denote hours, minutes, and seconds. Here we follow the convention in NMEA, which uses hhmmss.ss for hours, minutes, and seconds and sets DD and MM to angular degrees and minutes.

The timestamp string is a bit hard to work with, especially when plotting data. The first reason is that it’s a string, not a number. But even if you translated it to a number, the system does not lend itself nicely to plotting because there are 60 seconds in a minute, not 100. So what we want to do is “linearize” the timestamp. To achieve this, we translate the timestamp to seconds elapsed since midnight, as follows: T = hh * 3600 + mm * 60 + ss.ss.

The second issue we have is that hh, mm, and ss.ss are strings, not numbers. Multiplying a string in Python does something completely different from what we want here. So we have to first convert the strings to numerical values, in our case, float, because of the decimal point in the string representing the seconds. This all folds nicely into the following:

    >>> row = x[12]
    >>> row
    ['$GPRMC', '140055.00', 'A', '4454.1740', 'N', '09325.0143', 'W', '000.0', '128.7',
    '300508', '001.1', 'E', 'A*28']
    >>> float(row[1][0:2])*3600 + float(row[1][2:4])*60 + float(row[1][4:6])
    50455.0

The operator [] denotes the index, so row[1] is the second field of row (counting starts at zero), which is a string. The first two characters of a string are denoted by [0:2]; this is known as string slicing. So to access the first two characters of the second field, we write row[1][0:2]. Upcoming chapters will include more about strings and methods of slicing them.

Next we tackle latitude and longitude. We face the same issue as with the timestamp, only here we deal with degrees. Latitude follows the format DDMM.MMM, where DD stands for degrees and MM.MMM stands for minutes. We decide to use degrees this time.
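Before moving on to latitude, the timestamp conversion described above can be restated as a small Python 3 function. The name timestamp_to_seconds() is hypothetical, and unlike the session above, which slices [4:6] and so drops any fraction, this sketch keeps the fractional seconds by slicing [4:]. The first print shows why the strings must be converted before multiplying.

```python
def timestamp_to_seconds(stamp):
    """Converts an NMEA hhmmss.ss string to seconds elapsed since midnight."""
    hours = float(stamp[0:2])
    minutes = float(stamp[2:4])
    seconds = float(stamp[4:])  # keeps the fractional part, unlike [4:6]
    return hours * 3600 + minutes * 60 + seconds


# Multiplying a string repeats it instead of doing arithmetic.
print("14" * 3)

# The timestamp from the row shown above.
print(timestamp_to_seconds("140055.00"))
```

For the sample timestamp both slicing choices agree, because its fractional part is zero.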
To translate the latitude into decimal degrees, we need to divide the minutes by 60:

    >>> row = x[12]
    >>> row
    ['$GPRMC', '140055.00', 'A', '4454.1740', 'N', '09325.0143', 'W', '000.0', '128.7',
    '300508', '001.1', 'E', 'A*28']
    >>> float(row[3][0:2]) + float(row[3][2:])/60.0
    44.902900000000002

For latitude information we require the fourth field, hence row[3]. This example also introduces another notation, [2:], which means the slice of the string from the third character until the end.

Also notice that the code uses 60.0 and not 60. In the Python 2 interpreter used here, dividing an integer by the integer 60 implies that you want an integer division; dividing by 60.0 means you want a floating-point division, which is to say you care about the information past the decimal point. However, seeing as we already specified that we want the information as a floating-point number, as indicated by the float() conversion, the result will be a floating point regardless. Still, it’s good practice to let Python know what kind of division you really want. Here are some examples to further illustrate the point:

    >>> 100/60
    1
    >>> 100/60.0
    1.6666666666666667
    >>> float(100)/60
    1.6666666666666667

Longitude information is similar to latitude with a minor difference: longitude degrees are three characters instead of two (up to 180 degrees, not just up to 90 degrees), so the indices to the strings are different.

Listing 1-6 presents the entire function to process GPS data.

Listing 1-6. Function process_gps_data()

    from pylab import *

    # constant definitions
    NMI = 1852.0

    def process_gps_data(data):
        """Processes GPS data, NMEA 0183 format.

        Returns a tuple of arrays: latitude, longitude, velocity [km/h],
        time [sec] and number of satellites.
        See also: http://www.gpsinformation.org/dale/nmea.htm.
        """
        latitude = []
        longitude = []
        velocity = []
        t_seconds = []
        num_sats = []

        for row in data:
            if row[0] == '$GPGSV':
                num_sats.append(float(row[3]))
            elif row[0] == '$GPRMC':
                t_seconds.append(float(row[1][0:2])*3600 + \
                    float(row[1][2:4])*60 + float(row[1][4:6]))
                latitude.append(float(row[3][0:2]) + \
                    float(row[3][2:])/60.0)
                longitude.append((float(row[5][0:3]) + \
                    float(row[5][3:])/60.0))
                velocity.append(float(row[7])*NMI/1000.0)

        return (array(latitude), array(longitude), \
            array(velocity), array(t_seconds), array(num_sats))

Some notes about the process_gps_data() function:

• NMI is defined as 1852.0, which is one nautical mile in meters and also one minute on the equator. The reason the constant NMI is not defined in the function is that we’d like to use it outside the function as well.
• We initialize the return values latitude, longitude, velocity, t_seconds, and num_sats by setting them to an empty list: []. Initializing the lists creates them and allows us to use the append() method, which adds values to the lists.
• The if and elif statements are self-explanatory: if is a conditional clause, and elif is equivalent to saying “else, if.” That is, if the first condition didn’t succeed, but the next condition succeeds, execute the following block.
• The symbol \ that appears in several calculations and on the return line indicates that the operation continues on the next line.
• Lastly, the return value is a tuple of arrays. A tuple is an immutable sequence, meaning you cannot change it. So tuple means an unchangeable sequence of items (as opposed to a list, which is a mutable sequence).
  The reason we return a tuple and not a two-dimensional array, for example, is that we might have different lengths of lists to return: the length of the number-of-satellites list may be different from the length of the longitude list, since they originated from different header stamps.

Here’s how you call process_gps_data():

    >>> y = read_csv_file('../data/GPS-2008-05-30-09-00-50.csv')
    >>> (lat, long, v, t, sats) = process_gps_data(y)

The second line introduces sequence unpacking, which allows multiple assignments. Armed with all these functions, we’re ready to plot some data!

Data Visualization

Our next step is to visualize the data. We’ll be relying on the matplotlib package heavily. We’ve already imported matplotlib with the command from pylab import *, so there’s no additional importing needed at the moment. It’s time to read the data and plot the course.

Our first problem is that the information is given in latitude and longitude. Latitude and longitude are spherical coordinates, that is, those are points on a sphere, the earth. But we want a map-like plot, which uses Cartesian coordinates, that is, x and y. So first we have to transform the spherical coordinates to Cartesian. We’ll use the quick-and-dirty method shown in Listing 1-7 to do this, one that’s actually quite accurate as long as the distances traveled are small relative to the radius of the earth.

Listing 1-7. “Quick-and-Dirty” Spherical to Cartesian Transformation

    x = longitude*NMI*60.0*cos(latitude)
    y = latitude*NMI*60.0

To justify this to yourself, consider the following reasoning: As you go up to the North Pole, the circumference at the location you’re at gets smaller and smaller, until at the North Pole it’s zero.
So at latitude 0º, the equator, each degree (longitude) means more distance traveled than at a higher latitude. That’s why x is a function of the longitude value itself but also of the latitude: the greater the latitude, the smaller a longitude change is in terms of distance. On the other hand, y, which is north to south, is not dependent on longitude.

The next thing to understand is that the earth is a sphere, and whenever we plot an x-y map, we’re only really plotting a projection of that sphere on a plane of our choosing, hence we denote it by (px,py), where p stands for “projection.” We’ll take the southeastern-most point as the start of the GPS data projection: (px,py) = (0,0). This translates into the code shown in Listing 1-8.

Listing 1-8. Projecting the Traveled Course to Cartesian Coordinates

    py = (lat-min(lat))*NMI*60.0
    px = (long-min(long))*NMI*60.0*cos(D2R*lat)

Some things to note:

• Variables py and px are arrays of floating-point values. We now operate on entire arrays seamlessly. This is part of the NumPy package.
• D2R is a constant equal to π/180, converting degrees to radians.
• To set the y-axis at the minimum latitude and the x-axis at the minimum longitude, we subtract the minimum latitude and minimum longitude values from latitude and longitude values, respectively.

GPS Location Plot

Now the moment we’ve been waiting for, plotting GPS data. To be able to follow along and plot data, be sure to define the functions read_csv_file() and process_gps_data() as previously detailed and set the file name variable to point to your GPS data file. I’ve suppressed matplotlib responses so that the code is cleaner to follow.
    >>> filename = 'GPS-2008-05-30-09-00-50.csv'
    >>> y = read_csv_file('../data/' + filename)
    >>> (lat, long, v, t, sats) = process_gps_data(y)
    >>> px = (long-min(long))*NMI*60.0*cos(D2R*lat)
    >>> py = (lat-min(lat))*NMI*60.0
    >>> figure()
    >>> gca().axes.invert_xaxis()
    >>> plot(px, py, 'b', label='Cruising', linewidth=3)
    >>> title(filename[:-4])
    >>> legend(loc='upper left')
    >>> xlabel('east-west (meters)')
    >>> ylabel('south-north (meters)')
    >>> grid()
    >>> axis('equal')
    >>> show()

Figure 1-1 shows the result, which is rather pleasing.

Figure 1-1. GPS data

We’ve used a substantial number of new functions, all part of the matplotlib package: plot(), grid(), xlabel(), legend(), and more. Most of them are self-explanatory:

• xlabel(string_value) and ylabel(string_value) will print a label on the x- and y-axis, respectively. title(string_value) is used to print a caption above the graph. The string value in the title is the file name up to the end minus four characters (so as to not display “.csv”). This is done using string slicing with a negative value, which means “from the end.”
• legend() prints the labels associated with the graph in a legend box. legend() is highly configurable (see help(legend) for details). The example plots the legend at the top-left corner.
• grid() plots the grid lines. You can control the behavior of the grid quite extensively.
• plot() requires additional explanation as it is the most versatile. The command plot(px, py, 'b', label='Cruising', linewidth=3) plots px and py with the color blue as specified by the character 'b'. The plot is labeled “Cruising” so later on, when we call the legend() function, the proper text will be associated with the data. Finally, we set the line width to 3.
• The function axis() controls the behavior of the graph axis. Normally, I don’t call the axis() function because plot() does a decent job at selecting the right values.
  However, in this case, it’s important to visualize the data properly, and that means to have both x- and y-axes with equal increments so the graph is true to the path depicted. This is achieved by calling axis('equal'). There are other values to control axis behavior as described by help(axis).
• Lastly, gca().axes.invert_xaxis() is a rather exotic addition. It stems from the way we like to view maps and directions. In longitude, increasing values are displayed from right to left. However, in mathematical graphs, increasing values are typically displayed from left to right. This function call instructs the x-axis to be incrementing from right to left, just like maps.
• When you’re done preparing the graph, calling the show() function displays the output.

Matplotlib, which includes the preceding functions, is a comprehensive plotting package and will be explored in Chapter 6.

Annotating the Graph

We’d like to add some more information to the GPS graph: we’d like to know where we’ve stopped and where we were speeding. For this we use the function find(), which is part of the PyLab package. find() returns an array of indices that satisfy the condition, in our case:

    >>> STANDING_KMH = 10.0
    >>> SPEEDING_KMH = 50.0
    >>> Istand = find(v < STANDING_KMH)
    >>> Ispeed = find(v > SPEEDING_KMH)
    >>> Icruise = find((v >= STANDING_KMH) & (v <= SPEEDING_KMH))

We also calculate when we’re cruising (i.e., neither speeding nor standing) for future processing. To annotate the graph with these points, we add another plot on top of our current plot, only this time we change the color of the plot, and we use symbols instead of a solid blue line. The combination 'sg' indicates a green square symbol (g for green, s for square); the combination 'or' indicates a red circle (r for red, o for circle). I suggest you use different symbols for standing and speeding, not just colors, because the graph might be printed on a monochrome printer.
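To make clear what the index selection with find() above computes, here is a plain-Python 3 sketch that needs neither PyLab nor NumPy. The helper find_indices() and the velocity samples are hypothetical. With NumPy arrays, numpy.flatnonzero(v < STANDING_KMH) returns the same indices as an array, which can be a useful substitute if find() is not available in your installation.

```python
STANDING_KMH = 10.0
SPEEDING_KMH = 50.0

# Made-up velocity samples in km/h.
v = [0.0, 5.0, 20.0, 55.0, 60.0, 30.0, 8.0]


def find_indices(values, condition):
    """Returns the indices of the values for which condition(value) is true."""
    return [i for i, value in enumerate(values) if condition(value)]


Istand = find_indices(v, lambda kmh: kmh < STANDING_KMH)
Ispeed = find_indices(v, lambda kmh: kmh > SPEEDING_KMH)
Icruise = find_indices(v, lambda kmh: STANDING_KMH <= kmh <= SPEEDING_KMH)

print(Istand, Ispeed, Icruise)
```

Every index lands in exactly one of the three lists, so together they partition the samples into standing, speeding, and cruising.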
The function lhkp$% supports an assortment of symbols and colors; consult with the interactive help for details. The values we plot are only those returned by the bej`$% function. :::lhkp$ltWEop]j`Y(luWEop]j`Y(#oc#(h]^ah9#Op]j`ejc#% :::lhkp$ltWEolaa`Y(luWEolaa`Y(#kn#(h]^ah9#Olaa`ejc#% :::hacaj`$hk_9#qllanhabp#% &IGURE SHOWSTHEOUTCOME CHAPTER 1 N NAVIGATING THE WORLD OF DATA VISUALIZATION 21 Figure 1-2. GPS data with additional speed information We’d also like to know the direction the car is going. To implement this, we’ll use the patp$% function, which allows the writing of a string to an arbitrary location in the graph. So to add the text “Hi” at location (10, 10), issue the command patp$-,(-,(#De#%. One of the nice features of the patp$% function is that you can rotate the text at an arbitrary angle. So to PLOTh(IvATLOCATION  ATDEGREES YOUISSUEpatp$-,(-,(#De#(nkp]pekj901%. Our implementation of heading information involves rotating the text “>>>” at the angle the car is heading. We’ll only do this ten times so as not to clutter the graph with “>” symbols. Calculat- ing the direction the car is heading at a given point, e, is shown in Listing 1-9. Listing 1-9. Calculating the Heading `t9ltWe'-Y)ltWeY `u9luWe'-Y)luWeY da]`ejc9]n_p]j$`u+`t% Instead of actually using the function ]n_p]j$`u+`t%, we’ll use the function ]n_p]j.$`u( `t%. The benefits of using ]n_p]j.$% over ]n_p]j$% are twofold: 1) there’s no division that might cause a divide-by-zero exception in case `t is zero, and 2) ]n_p]j.$% preserves the angle from –180 degrees to 180 degrees, whereas ]n_p]j$% produces values between 0 degrees and 180 degrees only. The following code adds the direction symbols: :::bkneejn]jca$,(haj$r%(haj$r%+-,)-%6 ***patp$ltWeY(luWeY(:::(X ***nkp]pekj9]n_p]j.$luWe'-Y)luWeY()$ltWe'-Y)ltWeY%%+@.N(X ***d]9#_ajpan#% CHAPTER 1 N NAVIGATING THE WORLD OF DATA VISUALIZATION22 &IGURE SHOWSTHEresulting graph. Figure 1-3. 
GPS graph with heading

Velocity Plot

We now turn to plotting a graph of the speed. This is a lot simpler:

>>> figure()
>>> t = (t-t[0])/60.0
>>> plot(t, v, 'k')
>>> plot([t[0], t[-1]], [STANDING_KMH, STANDING_KMH], '-g')
>>> text(t[0], STANDING_KMH, \
...     " Standing threshold: " + str(STANDING_KMH))
>>> plot([t[0], t[-1]], [SPEEDING_KMH, SPEEDING_KMH], '-r')
>>> text(t[0], SPEEDING_KMH, \
...     " Speeding threshold: " + str(SPEEDING_KMH))
>>> grid()
>>> title('Velocity')
>>> xlabel('Time from start of file (minutes)')
>>> ylabel('Speed (Km/H)')

We start by opening a different figure with the figure() command. We proceed by changing the timescale units to minutes, a value easier for most humans to follow than seconds. Selecting the proper units of measurement is important. Most people will find it easier to follow the sentence "I drove for 30 minutes" as opposed to "I drove for 1800 seconds." We also subtract t[0] so that the time axis starts at zero. Next we plot the velocity as a function of time, in black. Good graphs require annotation, so we choose to add two lines describing the thresholds for standing and speeding as well as text describing those thresholds. To generate the text, we combine the text "Standing threshold" with the threshold value (after casting it to a string) and use the + operator to concatenate strings. Last, of course, are the title, x and y labels, and grid. Figure 1-4 shows the final result.

Figure 1-4. Velocity over time

Subplots

We'd also like to display some statistics. But before we do that, it would be preferable to combine all these plots (GPS, velocity, and statistics) into one figure. To do this, we use the subplot() function. subplot() is a matplotlib function that divides the plot into several smaller sections called subplots and selects the subplot to work with. For example, subplot(1, 2, 1) informs subsequent plotting commands that the area to work on is 1 by 2 subplots and the currently selected subplot is 1, so that's the left side of the plot area.
subplot(2, 2, 2) will choose the top-right subplot; subplot(2, 2, 4) will choose the lower-right subplot. A selection I found most readable in this scenario is to have the GPS data take half of the plot area, the velocity graph a quarter, and the statistics another quarter.

Text

Sometimes, the best way to convey information is using text, not graphics. We'll be limiting our work to the statistics quarter for this section. Our first task is to get rid of the plot frame and the x and y ticks. We just want a plain canvas to display text on. This is achieved by issuing the following:

>>> subplot(2, 2, 4)
>>> axis('off')

The first line, the call to subplot(), selects our region of work as the lower-right quarter. The second line removes the axes and hides the frame box.

It's time to calculate some statistics. It appears that GPS data is being sent in regular intervals, typically one second. So to calculate the time spent standing, in seconds, we calculate the length of the vector Istand. Likewise, to calculate the time speeding, we can calculate the length of Ispeed. To estimate how much these were in percent values, we divide the length of the Istand and Ispeed vectors by the length of the velocity vector and multiply by 100. To calculate the average speed, we use the mean() function, which is part of PyLab.

We also would like to calculate the total distance traveled. The distance can be calculated as the sum of the distances between each two consecutive data points. The function diff() returns a vector of the differences of the input vector.

>>> diff([1, 4, 0, 2])
array([ 3, -4,  2])

This is really useful because now to calculate the distance we can do the following:

>>> sum(sqrt(diff(px)**2 + diff(py)**2))
1562.1444099624528

which in turn yields the total distance traveled.

To automate the whole process of printing the statistics, we store the text to be printed in the variable stats, a list of strings.
We also use a method of formatting strings similar to C's printf() function, although the syntax is a bit different. %s indicates a string; %f indicates a floating point number (in our case, %.1f indicates a float with one digit after the decimal point); and %d indicates an integer. The following generates the statistics text:

>>> Total_distance = float(sum(sqrt(diff(px)**2+diff(py)**2))/1000.0)
>>> Stand_time = len(Istand)/60.0
>>> Cruise_time = len(Icruise)/60.0
>>> Speed_time = len(Ispeed)/60.0
>>> Stand_per = 100*len(Istand)/len(v)
>>> Cruise_per = 100*len(Icruise)/len(v)
>>> Speed_per = 100*len(Ispeed)/len(v)
>>> stats = ['Statistics', \
...     '%s' % filename, \
...     'Number of data points: %d' % len(y), \
...     'Average number of satellites: %d' % mean(sats), \
...     'Total driving time: %.1f minutes:' % (len(v)/60.0), \
...     '    Standing: %.1f minutes (%d%%)' % \
...     (Stand_time, Stand_per), \
...     '    Cruising: %.1f minutes (%d%%)' % \
...     (Cruise_time, Cruise_per), \
...     '    Speeding: %.1f minutes (%d%%)' % \
...     (Speed_time, Speed_per), \
...     'Average speed: %d km/h' % mean(v), \
...     'Total distance traveled: %.1f Km' % Total_distance ]

To print the text on the canvas, we again use the text() function, in a for loop, iterating over every string of the stats list.

>>> for index, stat_line in enumerate(reversed(stats)):
...     text(0, index, stat_line, va='bottom')
...
>>> plot([index-.2, index-.2])
>>> axis([0, 1, -1, len(stats)])

We've introduced two new functions. One is reversed(), which yields the elements of stats in reversed order. The second is enumerate(), which returns not just each row but also the index of each row in the sequence it iterates over. Because we enumerate the reversed list, the last row of stats gets index 0 and the first row gets the highest index. So when variable stat_line is assigned the value 'Average speed...' (the ninth of the ten rows in stats), the variable index is assigned the value 1, which places that row second from the bottom. The reason we want to know the index is that we use it as the location on the y-axis. Lastly, the vertical alignment of the text is selected as bottom, as suggested by the parameter va='bottom' (va is short for vertical alignment).
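The bookkeeping behind the statistics text can be tried without opening a plot window. The following is a minimal sketch under stated assumptions: the sample positions are made up, and total_distance() is a hypothetical helper that mirrors sum(sqrt(diff(px)**2 + diff(py)**2)) using only the standard library. It also shows which index enumerate(reversed(...)) hands to each row.

```python
import math

def total_distance(px, py):
    """Sum of straight-line distances between consecutive points.

    Hypothetical helper: a plain-Python stand-in for
    sum(sqrt(diff(px)**2 + diff(py)**2)).
    """
    return sum(math.hypot(x1 - x0, y1 - y0)
               for x0, x1, y0, y1 in zip(px, px[1:], py, py[1:]))

# Made-up positions in meters, for illustration only.
px = [0.0, 3.0, 3.0]
py = [0.0, 4.0, 10.0]

stats = ['Statistics',
         'Number of data points: %d' % len(px),
         'Total distance traveled: %.1f m' % total_distance(px, py)]

# The last row of stats gets index 0 (bottom of the canvas);
# the 'Statistics' heading gets the highest index (top).
placed = list(enumerate(reversed(stats)))
for index, stat_line in placed:
    print('%d %s' % (index, stat_line))
```

Using the index as the y coordinate is what stacks the rows from the bottom up, so the heading ends up at the top of the text block.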
Tying It All Together

Finally, Listing 1-10 shows the combined code to analyze and plot all GPS files in directory data.

Listing 1-10. Script gps.py

from pylab import *
import csv, os

# constant definitions
STANDING_KMH = 10.0
SPEEDING_KMH = 50.0
NMI = 1852.0
D2R = pi/180.0

def read_csv_file(filename):
    """Reads a CSV file and returns it as a list of rows."""
    data = []
    for row in csv.reader(open(filename)):
        data.append(row)
    return data

def process_gps_data(data):
    """Processes GPS data, NMEA 0183 format.
    Returns a tuple of arrays: latitude, longitude, velocity [km/h],
    time [sec] and number of satellites.
    See also: http://www.gpsinformation.org/dale/nmea.htm.
    """
    latitude = []
    longitude = []
    velocity = []
    t_seconds = []
    num_sats = []

    for row in data:
        if row[0] == '$GPGSV':
            num_sats.append(float(row[3]))
        elif row[0] == '$GPRMC':
            t_seconds.append(float(row[1][0:2])*3600 + \
                float(row[1][2:4])*60+float(row[1][4:6]))
            latitude.append(float(row[3][0:2]) + \
                float(row[3][2:])/60.0)
            longitude.append((float(row[5][0:3]) + \
                float(row[5][3:])/60.0))
            velocity.append(float(row[7])*NMI/1000.0)

    return (array(latitude), array(longitude), \
        array(velocity), array(t_seconds), array(num_sats))

# read every data file, filter, and plot the data
for root, dirs, files in os.walk('../data'):
    for filename in files:
        # create full filename including path
        cur_file = os.path.join(root, filename)
        if filename.endswith('csv'):
            y = read_csv_file(cur_file)
        else:
            continue

        # only files with the .csv extension from here on
        # process GPS data
        (lat, long, v, t, sats) = process_gps_data(y)

        # translate spherical coordinates to Cartesian
        py = (lat-min(lat))*NMI*60.0
        px = (long-min(long))*NMI*60.0*cos(D2R*lat)

        # find out when standing, speeding, or cruising
        Istand = find(v < STANDING_KMH)
        Ispeed = find(v > SPEEDING_KMH)
        Icruise = find((v >= STANDING_KMH) & (v <= SPEEDING_KMH))

        # left side, GPS location graph
        figure()
        subplot(1, 2, 1)

        # longitude values go from right to left,
        # we want increasing values from left to right
        gca().axes.invert_xaxis()
        plot(px, py, 'b', label='Cruising', linewidth=3)
        plot(px[Istand], py[Istand], 'sg', label='Standing')
        plot(px[Ispeed], py[Ispeed], 'or', label='Speeding')

        # add direction of travel
        for i in range(0, len(v), len(v)/10-1):
            text(px[i], py[i], ">>>", \
                rotation=arctan2(py[i+1]-py[i], \
                -(px[i+1]-px[i]))/D2R, ha='center')

        # legends and labels
        title(filename[:-4])
        legend(loc='upper left')
        xlabel('east-west (meters)')
        ylabel('south-north (meters)')
        grid()
        axis('equal')

        # top-right corner, speed graph
        subplot(2, 2, 2)

        # set the start time as t[0]; convert to minutes
        t = (t-t[0])/60.0
        plot(t, v, 'k')

        # plot the standing and speeding threshold lines
        plot([t[0], t[-1]], [STANDING_KMH, STANDING_KMH], '-g')
        text(t[0], STANDING_KMH, \
            " Standing threshold: "+str(STANDING_KMH))
        plot([t[0], t[-1]], [SPEEDING_KMH, SPEEDING_KMH], '-r')
        text(t[0], SPEEDING_KMH, \
            " Speeding threshold: "+str(SPEEDING_KMH))
        grid()

        # legend and labels
        title('Velocity')
        xlabel('Time from start of file (minutes)')
        ylabel('Speed (Km/H)')

        # right-side corner, statistics data
        subplot(2, 2, 4)

        # remove the frame and x-/y-axes. we want a clean slate
        axis('off')

        # generate an array of strings to be printed
        Total_distance = float(sum(sqrt(diff(px)**2+diff(py)**2))\
            /1000.0)
        Stand_time = len(Istand)/60.0
        Cruise_time = len(Icruise)/60.0
        Speed_time = len(Ispeed)/60.0
        Stand_per = 100*len(Istand)/len(v)
        Cruise_per = 100*len(Icruise)/len(v)
        Speed_per = 100*len(Ispeed)/len(v)
        stats = ['Statistics', \
            '%s' % filename, \
            'Number of data points: %d' % len(y), \
            'Average number of satellites: %d' % mean(sats), \
            'Total driving time: %.1f minutes:' % (len(v)/60.0), \
            '    Standing: %.1f minutes (%d%%)' % \
            (Stand_time, Stand_per), \
            '    Cruising: %.1f minutes (%d%%)' % \
            (Cruise_time, Cruise_per), \
            '    Speeding: %.1f minutes (%d%%)' % \
            (Speed_time, Speed_per), \
            'Average speed: %d km/h' % mean(v), \
            'Total distance traveled: %.1f Km' % Total_distance ]

        # display statistics information
        for index, stat_line in enumerate(reversed(stats)):
            text(0, index, stat_line, va='bottom')

        # draw a line below the "Statistics" text
        plot([index-.2, index-.2])

        # set axis properly so all the text is displayed
        axis([0, 1, -1, len(stats)])

show()

Figure 1-5 shows the final results.

Figure 1-5. Output of gps.py on some GPS data

Final Notes and References

The GPS problem described here is research in nature: a computation, an intermediate result, not an end product. Research, or R&D work, especially feasibility studies, requires rapid responses. This means using readily available tools as much as possible and combining them to get the job done. If those tools are inexpensive, or free, that's yet another reason to use them.

Throughout the book, we will examine different packages and modules and see how they may be used to perform data analysis and visualization. The theme we'll be using is open software, including software published under the GNU General Public License (GPL) and the Python Software Foundation (PSF) license. Examples of these tools include GNU/Linux and, of course, Python.

There are several benefits to developing data analysis and visualization scripts in Python:

• Developing and writing code is quick, which is appealing for research work.
• Readily available packages further increase productivity and ensure accurate results.
• Scripts introduce automation. Modifying an algorithm is easily done.

Scripts will be numerous and explained in detail, and I aim to cover most of the issues you are likely to encounter in the real world. Examples include scripts to deal with binary files, to combine data from different sources, to perform text parsing, to use high-level numerical algorithms, and much more. Scripts will be written in Python: some will be simple one-liners, others more complex. Special attention will be given to data visualization and how to achieve pleasing results in Python.
If you'd like to read more about Python in general (and not necessarily for data analysis and visualization), the Python official web site is an excellent resource:

• Python Programming Language (official website), http://www.python.org

CHAPTER 2

The Environment

Tools of the Trade

In the previous chapter we saw a case study involving the collection, analysis, and visualization of GPS data. Unless you're already familiar with Python and the packages we've used, you should read this chapter and build yourself a development environment.

Analyzing and visualizing data requires several software tools: a text editor to write code, Python to run and test the scripts, and perhaps a tool to present the results.

I've decided to break the discussion of software tools into two categories: general-purpose software components and specific software components. The general-purpose software components are merely a recommendation on my part of tools I think improve productivity. If you're already comfortable with another software package, by all means use it over the one suggested here. The specific software components category, on the other hand, is composed of tools required to run the examples in the book. To clarify, whenever a component is required, it is clearly mentioned.

The following is a suggested list of software components that I feel provides a solid development environment.

General software components:

• An operating system (OS)
• A text editor
• An image viewer
• Tools for presenting and viewing the results
• A version-control system

Specific software components:

• Python with its built-in packages
• Additional Python packages for data analysis and visualization

This chapter introduces the different software components in a linear fashion, that is, it builds things from the ground up: first the OS, then Python and Python packages, and lastly supporting software components.
Although the chapter is organized in a linear fashion, feel free to skip the general software components section if you already know what applications you'll be using. You should, however, ensure you have the specific components (Python and additional packages) properly installed; code presented in the book assumes that is the case.

Operating Systems

The development environment is built upon an operating system. There are several options to choose from: UNIX-based operating systems (including GNU/Linux, Mac OS X, and others) and Windows. Of those, we'll focus on Linux and Windows. As for Mac OS, since it is a UNIX-based operating system, most of the discussions regarding Linux apply to it as well. The Python web site (http://www.python.org) is an excellent resource for all things Python, including supported operating systems.

GNU/Linux

Linux is a generic term that describes UNIX-like operating systems based on the Linux kernel. A Linux distribution is a collection consisting of the Linux kernel along with additional software packages that together provide a full OS. Most distributions provide more than basic OS functionality; they provide additional software packages such as multimedia applications, games, office productivity suites, and much more.
A considerable portion of the packages in most Linux distributions is based on the GNU project (http://www.gnu.org), hence the term GNU/Linux.

There is a large number of Linux distributions (distros) available today, including

• Fedora project: http://www.fedoraproject.org
• Debian: http://www.debian.org
• Ubuntu: http://www.ubuntu.com
• Gentoo: http://www.gentoo.org

Most of these are excellent distributions, so if you plan on going the Linux route, spend some time acquainting yourself with these distributions to decide on the one that best suits your needs (or should I say, best fits your personality).

It is especially important that you know how to install applications in the Linux distribution of your choice. Most distributions come with a package management tool (e.g., RPM/Yum on Fedora, apt-get/APT on Debian, and emerge/Portage on Gentoo) that enables downloading applications and installing them on your Linux OS. Typically, package management tools synchronize with an online repository and enable downloading and upgrading software. They also take care of any version conflicts and perform the actual installation tasks such as copying files and updating system information.
As a general rule, opt for using your Linux distribution's built-in package management tool to install the software components discussed in this chapter (Python and packages included) over a manual install; this will ensure a stable Linux system.

In case a software application of your liking is not available via the package management tool of your Linux distribution, you still have the option of manually installing that application. This is not a trivial task and requires some Linux expertise. That being said, in the case of Python packages, a manual install is straightforward, and an example will be provided later in the chapter in section "Manually Installing a Python Package."

Windows

Of the Windows versions available today, any version from Windows XP upward should be fine; previous flavors of Windows are growing obsolete, so support for applications running on older versions of Windows is limited. That being said, most applications still do run on older versions of Windows; however, you should check with the package's online documentation.

Unlike Linux, after selecting Windows as the OS, there's still a decision to be made, and that is what exact environment Python will be running on. There are three main options to choose from:

• Stand-alone (natively)
• Cygwin
• Virtual machines (VMs)

Stand-Alone (Natively)

Unless you have a strong reason against it, this should be your preferred choice if you intend on using Windows: installing Python natively without an additional environment. Python comes as an executable file with an installer application. After downloading, double-click the executable and install Python (more on Python installation shortly). Most other packages we'll be dealing with also come bundled in this fashion, so installing them should be simple as well. In case you'd like to install a package that doesn't come with an installer, you'll have to consult that package's documentation.
By the way, regardless of whether you choose a stand-alone approach or one of the other methods suggested next (or Linux), there are bound to be packages that require a manual installation, so knowing how to do a manual package install is of value.

Cygwin

Cygwin (http://www.cygwin.com) is an environment that runs in Windows and provides UNIX-like functionality. It is an excellent software product even if you are a devoted Windows user.

Cygwin comes with a GUI installer that runs on Windows, named Cygwin Net Release Setup Program (setup.exe), that allows picking and installing software packages. The Cygwin installer is actually a package management tool just like the package management tools in most Linux distributions. As you browse through the list of packages, you'll realize there's an extensive selection to choose from; however, that should not deter you. Install the default options knowing you can always go back and add or remove applications; it's as simple as rerunning the Cygwin installer. After installing Cygwin, run it via Start > All Programs > Cygwin > Cygwin Bash Shell.

Cygwin provides a great number of additional open source software packages, including Python. If you want additional functionality (Bash shell, SSH, editors, viewers, version control systems, X functionality, and more), then Cygwin is an excellent choice. The downside is that it is a bit more complex for a less-experienced user than the stand-alone approach presented earlier. There's also a small performance hit using Cygwin compared with a native installation. For example, on my computer, a simple for loop summing values was 20 percent slower in Python on Cygwin compared with a native Python installation.

Note: Cygwin treats drives differently from Windows as it follows a UNIX directory structure. If you installed Cygwin under c:\cygwin, then this directory is usually denoted as the topmost directory: /. To access directories outside c:\cygwin, use the following notation: /cygdrive/disk. For example, if a file is located in c:\data, it is accessible in Cygwin as /cygdrive/c/data.
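The drive mapping described in the note above can be sketched in a few lines of Python. The helper name to_cygwin_path() is hypothetical, and the sketch assumes the default /cygdrive prefix; it does not special-case the Cygwin installation directory itself, which Cygwin presents as /. Cygwin also ships a cygpath utility for this job, which is the safer choice in real scripts.

```python
import ntpath

def to_cygwin_path(windows_path, prefix='/cygdrive'):
    """Translate a Windows path such as c:\\data into /cygdrive/c/data.

    Hypothetical helper for illustration; assumes an absolute path
    with a drive letter and the default /cygdrive mount prefix.
    """
    drive, rest = ntpath.splitdrive(windows_path)
    if len(drive) != 2 or drive[1] != ':':
        raise ValueError('expected a path with a drive letter')
    letter = drive[0].lower()
    # Windows separators become forward slashes.
    rest = rest.replace('\\', '/')
    return '%s/%s%s' % (prefix, letter, rest)

print(to_cygwin_path(r'c:\data'))
```

The ntpath module is used instead of os.path so that the sketch parses Windows-style paths the same way on any OS.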
Virtual Machines

The third option, which is a bit more exotic, is running a virtual machine (VM). A virtual machine allows the user to run a Linux OS (or another OS for that matter) in the host operating system, which is in this case Windows or Mac OS. This option is for the more experienced user; installing and configuring a VM is not an easy task.

On Windows, there are several VMs available today, including the open source Cooperative Linux (coLinux, http://www.colinux.org) and the commercial VMware (http://www.vmware.com), which has a free version as well. A popular VM on Mac OS is Parallels (http://www.parallels.com), which allows for running both Linux and Windows alongside Mac OS.

Tip: Running a virtual machine might be a good option in case you just want to try out Linux in general but don't want to go the full route of installing an OS. If that is the case, there is also the option of running a live CD, which basically means booting a full-fledged Linux OS from CD-ROM. There's quite a large number of live CDs available today, with one of the well-known ones being Knoppix (http://www.knoppix.net).

INSTALLING COLINUX

As mentioned, installing a Linux VM in Windows is not a trivial task. The process involves several steps that require Linux and networking expertise. Here's a set of steps to install coLinux in Windows XP:

1. First, install coLinux with an image of the Linux distribution of your choice.
2. Set up Internet connectivity on the target OS (Linux) so that you can download and update packages. Update and install packages as needed.
3. Set up a networking connection between the host OS and the target OS so you can transfer data files.

VM packages nowadays, both commercial and open source, automate these tasks and make the installation a lot more user friendly.

One of the downsides of using a VM is that you pay a price in performance. That being said, VM implementations and the increasing power of computing have made this a relatively small price to pay.
Choosing an Operating System

From a data analysis and visualization perspective, Linux is a perfect match. The main reason is that Linux comes with a strong command-line interface (CLI) compared with Windows, which relies heavily on a graphical user interface (GUI).

When working with a significant number of files, a CLI wins hands down. Consider renaming a large number of files, say, pictures you took on your last vacation. Most cameras generate files that follow a sequential naming scheme: DSB00001.jpg, DSB00002.jpg, and so forth, which is rather cryptic. You, on the other hand, would like to rename these files to something a bit more informative, such as Vacation2007-09-20-NNNNN.jpg, where NNNNN is the running index. So a file named DSB00002.jpg will now be named Vacation2007-09-20-00002.jpg. You can perform this task with both a GUI and a CLI:

• With the GUI approach, this means a task of point, click, and rename for each and every file. While this might be perfectly reasonable for a small number of pictures, as the number increases, this becomes a tedious task.

• The CLI approach is to write a command to rename all the files. If you're familiar with Bash, you might issue the following:

  for fn in DSB*.jpg; do mv $fn ${fn/DSB/Vacation2007-09-20-}; done

  There are lots of ways to do it with a CLI, and this is just one I prefer (I will not be discussing Bash in the book). Again, for a small number of pictures, this seems like overkill; however, once the number of files increases, this is the better approach.

Of course, renaming files is a simple task, one that Windows supports via its command prompt as well (which is the Windows version of a CLI), but even this simple task is not trivial in Windows unless you install additional software or write some code to perform the task (although recent versions of Windows also introduce shell capabilities enabling both GUI and CLI interfaces). For more complex data management tasks, a CLI-centric approach is much better than a GUI. An operating system built around a CLI is usually a better choice for managing data files.

Tip: There isn't a right or wrong choice. Whatever OS you choose, the concepts (and code) presented in this book will work just fine.
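The same batch rename can also be written in Python, which behaves the same way on Linux, Windows, and Mac OS. This is a minimal sketch rather than the method used in the book: the function names plan_renames() and rename_all() are hypothetical, and the file names follow the vacation example above.

```python
import fnmatch
import os

def plan_renames(names, old='DSB', new='Vacation2007-09-20-'):
    """Return (old_name, new_name) pairs for names matching old*.jpg.

    Hypothetical helper; it only computes names and touches no files.
    """
    pairs = []
    for name in sorted(names):
        if fnmatch.fnmatchcase(name, old + '*.jpg'):
            # Swap only the leading prefix, keep the running index.
            pairs.append((name, new + name[len(old):]))
    return pairs

def rename_all(directory='.'):
    """Apply the planned renames to the files in directory."""
    for src, dst in plan_renames(os.listdir(directory)):
        os.rename(os.path.join(directory, src),
                  os.path.join(directory, dst))

# Dry run: show what would be renamed without changing anything.
for src, dst in plan_renames(['DSB00001.jpg', 'DSB00002.jpg', 'notes.txt']):
    print('%s -> %s' % (src, dst))
```

Separating the planning step from the call to os.rename() makes it possible to review the new names before any file is touched; rename_all() is only called once the dry run looks right.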
Here are some things to consider when choosing an OS:

• Linux is a stable and able operating system. The benefits of using Linux include low cost (typically, none), a solid CLI, and an active and supportive community. The main disadvantage with Linux is that if you're not familiar with the OS, there is a learning curve, although with today's distributions the curve has leveled off significantly. Also, support for hardware isn't as all-encompassing as is the case in Windows. This might prove a serious disadvantage if your work involves using an already existing piece of hardware that isn't supported in Linux to generate data.

• Windows is a widely popular operating system. Most users have experienced working in Windows to some degree, so the learning curve is very shallow, if any. Support for hardware is very good; most hardware vendors target Windows as their primary OS. The drawbacks of using Windows are the lack of a strong CLI and the cost of the OS and additional software applications.

• Mac OS is gaining popularity; it combines the GUI experience with UNIX power. Although relatively new in the data analysis and visualization scene, due to those two traits, I have a feeling you'll see more and more of Mac OS being used. Mac OS downsides as I view them are cost and support for legacy hardware.

Table 2-1 summarizes the aforementioned pros and cons.

Table 2-1. Linux, Windows, and Mac OS as Development Environments for Data Processing and Visualization

                   Linux                Windows                           Mac OS
CLI                Very good (native)   Good (with Python)                Very good (native)
Applications       Full (mostly free)   Full (possible additional cost)   Full (possible additional cost)
Learning curve     Steep                Gentle                            Gentle
Cost               Low                  Medium                            Medium
Hardware support   Good                 Very good                         Medium
Stability          Very good            Very good                         Very good

Then Again, Why Choose? Using Several Operating Systems

The nice thing about Python is that it eliminates the operating system from the equation.
Python is a complete environment, with a "batteries-included" approach: you should be pretty much good to go, out of the box, after installing Python; the standard library provides full functionality. What that means is that all of a sudden, Windows has a strong CLI as well: the Python interpreter.

With that in mind, the selection of an OS becomes more of a personal preference than anything else. I have both Linux and Windows and use both for data analysis and visualization (my Linux machine is a stationary home server, so I can't use it to record GPS data when driving; my laptop runs Windows and does that for me).

If you require more UNIX-like functionality than Python provides but would still like to use Windows, opt for Cygwin as discussed previously. Cygwin provides a host of GNU tools ported to Windows. In fact, I use Cygwin's X server and connect to my Linux machine if I'd like to do some interactive work plotting data (the Linux machine is tucked under the desk and has no monitor).

If you plan on using both Windows and Linux to analyze data on the same computer (that is, dual-booting), think about how you're going to transfer data between the Linux and Windows partitions. There are several ways: having a shared partition that both Linux and Windows can handle (FAT, NTFS on some), transferring files through a USB device, or even networking to another machine. Each has its benefits, but remember that you might be dealing with a large number of files, so it would be best if you could access the data on a shared resource.

Caution: Installing an OS is a time-consuming task, taking twice as long if you intend to dual-boot. You should consult the Linux documentation of your distribution on how to best achieve dual-booting, and especially on what OS (Linux or Windows) you should install first. Dual-booting is an advanced topic and is not suggested for the beginner.

Using a dual-boot system can be annoying at times, especially since you have to reboot to switch operating systems.
Not to mention that the installation process is a bit risky: there could be scenarios of lost data due to repartitioning of the hard disk (which can be avoided if you know what you're doing). This is exactly why a VM is a good alternative: data is safe from repartitions, and actual reboots are not required. My PC is strong enough to run Linux as a VM in Windows with excellent performance. If you'd like to use this setup, again, think about how you're going to share data between the host OS and the target OS. A common and good approach is to transfer files using a virtual network interface.

On Mac OS, there is less need for these solutions, since Mac OS is already a UNIX-like OS.

The Python Environment

By now you should've already selected and installed the OS of your choice. You should also be comfortable with downloading and installing packages. It's now time to install Python. This section discusses the installation of Python and Python packages to enable programming data analysis and visualization scripts. A more detailed discussion on using Python both in an interactive shell and as a stand-alone application will be given in a later chapter. In this chapter, I'll be covering Python distributions, Python IDEs, and Python packages.

Versions

The book covers Python version 2.5 and should work on version 2.4 as well. As a general rule, you should opt for the most updated Python version. Unfortunately, that's not always possible:

• Some operating systems, for example the Gentoo Linux distribution, rely heavily on Python for system administration, and upgrades require extensive testing to ensure the system is stable. So although a new release of Python becomes available, you might not be able to use it yet. There are workarounds to that, such as installing several versions of Python on one machine; again, refer to your Linux distribution for further information, as this topic is beyond the scope of this chapter.

• At the time of writing, Python 2.6 was released.
However, not all the packages used in the book have caught up yet, so I've had to stick with Python 2.5. We already know a newer version of Python is in the making, and a lot of the information regarding the upcoming changes can already be viewed on the Python web site (http://www.python.org). When applicable, I've tried to point out where Python 2.6 differs.

Tip: Always make sure you're downloading and installing a version of a package that is compatible with the version of Python you're using. Some packages keep older versions if you need them for compatibility reasons.

Python

You can download a Python implementation for your specific OS from http://www.python.org/download/. Read carefully and select the package that fits your OS. Again, if you're running a Linux OS, opt for using that system's package management tool over downloading and installing from the Python web site. The same applies for Cygwin: use the Cygwin installer if you can. On Windows, the common practice is to use the Python binaries distributed with an installer from the preceding URL.

You can install Python from source code, that is, download the source code and compile it on your OS. Personally, I have not found a reason to do this other than to satisfy my curiosity that the code does indeed compile properly.

If you are wondering about Jython (an implementation of Python written purely in Java, see http://www.jython.org) and IronPython (an implementation of Python on Microsoft's .NET platform, see http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython), I'm afraid they're not good options for this book. A lot of the code and examples rely heavily on packages that do not run on Jython or IronPython.
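Which implementation and version a script is actually running on can be checked from Python itself before installing packages built for a particular release. The sketch below uses the standard sys and platform modules; the function names describe_python() and meets_minimum() and the (2, 5) requirement in the usage lines are illustrative assumptions. Note that platform.python_implementation() was added in Python 2.6, so on older interpreters only the sys.version_info part applies.

```python
import platform
import sys

def describe_python():
    """Return (implementation, (major, minor)) for the running interpreter."""
    return platform.python_implementation(), tuple(sys.version_info[:2])

def meets_minimum(required, actual=None):
    """True if the (major, minor) version is at least `required`."""
    if actual is None:
        actual = tuple(sys.version_info[:2])
    return tuple(actual) >= tuple(required)

implementation, version = describe_python()
print('%s %d.%d' % (implementation, version[0], version[1]))
print('at least 2.5: %s' % meets_minimum((2, 5)))
```

The implementation name distinguishes the standard interpreter (reported as CPython) from alternatives such as Jython and IronPython.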
Python Distributions with Scientific Packages

Another option is to use a Python distribution that already bundles a significant number of the Python packages we'll be using. Enthought Python Distribution (EPD, http://www.enthought.com/) and Python(x,y) (http://www.pythonxy.com/) provide Python distributions that should prove a good option if you don't want the hassle of installing individual packages. Opt for this option if you can't wait to be past the installation phase and up and running code.

Tip: If you choose the distribution from Enthought or Python(x,y), you can skip the sections related to SciPy, NumPy, matplotlib, and IPython later in the chapter. Both these distributions include those packages out of the box.

Python Integrated Development Environments

An integrated development environment (IDE) is, simply put, an application that helps programmers write code. Typically an IDE is composed of the language engine (Python), an editor, a debugger, documentation, and possibly additional productivity tools. While it is by all means possible to use Python without an IDE, using one will greatly increase your productivity and will enable a faster learning pace.

There is a wealth of Python IDEs, and a rather extensive list is provided in the books Python in a Nutshell: A Desktop Quick Reference and Beginning Python: From Novice to Professional (see the references at the end of this chapter). In this chapter, we'll limit our discussion to IDLE and IPython (which isn't really an IDE, more of a Python shell enhancement).

IDLE

IDLE (http://www.python.org/doc/2.6/library/idle.html) is a cross-platform Python GUI IDE. If you installed from Windows binaries, IDLE is automatically installed; access it via Start > All Programs > Python 2.5 > IDLE (Python GUI). IDLE is a capable IDE with the following features: seamless integration with the Python interpreter, an editor, a debugger, and a help system. It's an excellent environment to get up and running, especially if you're new to programming.
One of the benefits of using IDLE is that you can write code in an editor specifically designed for Python, and then quite easily execute it in IDLE by pressing F5. With a CLI approach, you’d have to invoke Python with the file you’d like to execute (more on this in Chapter 3).

IPython
As you start working with a CLI, you’ll realize there are some things you’d really like enhanced. IPython (http://ipython.scipy.org/) provides an enhanced interactive Python shell and is highly recommended, mostly because data analysis and visualization is interactive in nature. IPython is supported on most platforms. Here’s a short list of the added features that come with IPython:
• Tab completion, which involves completion of variables, functions, methods, attributes, and file names. Tab completion is achieved with the GNU Readline library (http://tiswww.case.edu/php/chet/readline/rltop.html) and is highly addictive. It’s very hard to go back to a regular CLI after you’ve been exposed to GNU Readline.
• Command history capabilities: issue the command history for a full account of the commands you’ve recently typed. You can copy and paste those into a Python script and save time and effort.
• Seamless integration with the system shell: you can use ls -l or cd /home/user, for example.
• Colored output.

■ Note IPython is not required but is highly recommended. The code in the book will work without IPython as well as with it.

IPython comes bundled with an installer for Windows and is available through most package management tools as well (on Linux and Cygwin). Depending on your OS, you might need to install GNU Readline; on Windows, you’ll also need to install PyReadline (http://ipython.scipy.org/moin/PyReadline/Intro). Consult IPython’s installation documentation.

■ Note IPython should be installed after Python, GNU Readline, and PyReadline are installed.

CHARACTER COMPLETION WITH GNU READLINE
Character completion with GNU Readline is a welcome addition to an interactive CLI.
With IPython, character completion can be used to complete
• Names of variables
• Names of methods and attributes
• File names

To invoke character completion, start by spelling out the first few characters of the word you wish to write and then press the Tab key to have GNU Readline try to complete the word for you. The following is from IPython:

    In [1]: s = 'A string'
    In [2]: s.is
    s.isalnum  s.isalpha  s.isdigit  s.islower  s.isspace  s.istitle  s.isupper
    In [2]: s.is

After typing s.is, the user pressed the Tab key and was presented with a list of options. Had the user spelled the word s.isd and pressed Tab, the entire s.isdigit would have appeared automatically at the prompt.

The way GNU Readline works is that it tries to complete the word by searching for a variable, function, method, attribute, or file name that matches the typed characters. In case of one option, that word is automatically spelled out at the prompt. In case of several options, all the options are displayed. To select which of the options you’d rather have completed, supply the next character and then press Tab again. In case of no matches to the typed word, nothing happens.

You can also use the character completion feature to explore methods and attributes of a class, or any other namespace for that matter. In the following listing, the Tab key is pressed after a. is entered (notice the dot).

    In [1]: a = dict()
    In [2]: a.
    a.__class__         a.__hash__       a.__setattr__   a.itervalues
    a.__cmp__           a.__init__       a.__setitem__   a.keys
    a.__contains__      a.__iter__       a.__str__       a.pop
    a.__delattr__       a.__le__         a.clear         a.popitem
    a.__delitem__       a.__len__        a.copy          a.setdefault
    a.__doc__           a.__lt__         a.fromkeys      a.update
    a.__eq__            a.__ne__         a.get           a.values
    a.__ge__            a.__new__        a.has_key       a.pdf
    a.__getattribute__  a.__reduce__     a.items
    a.__getitem__       a.__reduce_ex__  a.iteritems
    a.__gt__            a.__repr__       a.iterkeys

Scientific Computing
A significant portion of the book is dedicated to the processing of data prior to visualization.
Two packages help us achieve that end: NumPy and SciPy, both of which are covered in later chapters. These two packages, combined with matplotlib (more on this package shortly), behave similarly to most high-end math packages such as the open source GNU Octave (http://www.octave.org) and the commercial MATLAB (http://www.mathworks.com). In fact, there’s even a name for these three packages working together: PyLab, which is a combination of Python and MATLAB. A portal for SciPy and NumPy is located at http://www.scipy.org. SciPy, NumPy, and matplotlib are all open source software packages and are required to run the code presented in the book.

NumPy
NumPy provides a powerful N-dimensional array that is the basis for most of the data processing we’ll perform. You’ve already seen it in action in the GPS example in Chapter 1. NumPy also provides additional numerical capabilities: linear algebra, Fourier transforms, and more. NumPy is a mature and stable package and can be downloaded and installed from http://numpy.scipy.org/. NumPy is discussed in later chapters.

SciPy
SciPy builds on top of NumPy and adds additional scientific computing tools. These include numerical integration, differential equations, interpolation, signal processing, optimization, linear algebra, and more.

Even if you’re not interested in scientific computing, I encourage you to give SciPy a try: it provides additional utility functions to NumPy that are very useful and used extensively in the book. SciPy can be downloaded and installed from http://www.scipy.org/ and is reviewed in a later chapter.

■ Note SciPy relies on NumPy and should be installed after NumPy is installed.

Plotting
Visualization is the final step: displaying data graphically to the audience, portraying an idea, and capturing information efficiently and elegantly. We now turn to two packages that allow easy plotting and graphing.
Matplotlib
Plotting throughout the book will rely heavily on the matplotlib package, maintained at http://matplotlib.sourceforge.net/. Matplotlib is a 2-D plotting package that interfaces well with NumPy and SciPy. The package is cross-platform and works on Linux, Windows, and Mac OS. Matplotlib can produce both interactive and hard-copy plots using various engines. You can therefore use it for interactive work, which is very useful in the early stages of an algorithm design, or in an automatic mode, for example, batch processing, to plot results to, say, a shared directory or a web server.

Matplotlib is both simple to use and highly customizable, yielding an excellent package for our purposes. It allows a range of 2-D plot types and has excellent graph annotation capabilities.

■ Tip Matplotlib has some additional toolkits available, out of which the one that is of interest especially in light of Chapter 1 is the basemap toolkit. The basemap toolkit allows working with map projections. I will not be covering the basemap toolkit in this book.

Gnuplot
An alternative package suggested here is gnuplot (http://www.gnuplot.info/). Gnuplot is a widely popular plotting package that has been ported to numerous platforms including Linux, Windows, and Mac OS. This renders gnuplot a very good graphing and plotting package. Gnuplot also supports both interactive and hard-copy graphs.

One of the benefits of gnuplot over matplotlib is 3-D graph support. If you require such capabilities, opt for gnuplot.

In order to use gnuplot interactively from the Python CLI, a software package to connect the two is required. I have used the Gnuplot.py package (http://gnuplot-py.sourceforge.net/) to do so with good results.

■ Note To use gnuplot from Python, be sure to install both gnuplot and Gnuplot.py. After installing Gnuplot.py, you’ll have to set the variable Gnuplot.GnuplotOpts.gnuplot_command to point to the location of the gnuplot binary executable. Alternatively, you can edit a configuration file to permanently set this variable; consult Gnuplot.py’s documentation.
In Windows, you’ll also require pgnuplot.exe, which is a part of gnuplot for Windows and allows sending commands to wgnuplot (the Windows version of the gnuplot application).

As mentioned previously, most of the examples in the book rely on matplotlib, so you’ll need to modify the code if you wish to use gnuplot solely. Unless you have a strong reason not to use matplotlib, or gnuplot is already installed on your system and heavily used, I suggest you stick with matplotlib.

Image Processing
Image processing provides the final piece of the puzzle. It is an important part of data visualization and will be discussed extensively in a later chapter. We’ll be using the Python Imaging Library (PIL) to provide image processing support.

Python Imaging Library
The Python Imaging Library (http://www.pythonware.com/products/pil/) enhances Python with excellent image processing capabilities. PIL supports most popular image file formats and provides a wealth of functions for manipulating image data. PIL, combined with NumPy, provides a very capable image processing environment for Python.

Additional Python Packages
Numerous Python packages are available, and more are being written every day. The following are good sources of information on Python packages:
• The Python Package Index: http://pypi.python.org/pypi
• SourceForge: http://www.sourceforge.net

PySerial
In Chapter 1 we used pySerial to capture GPS data through the serial port. PySerial is available at http://pyserial.wiki.sourceforge.net/pySerial.

■ Note In Windows, you will also need to install the Python Win32 Extensions (win32all) from http://python.net/crew/mhammond/win32/Downloads.html as well as possibly a real-time library. Consult the pySerial and Python Win32 Extensions documentation.

Example: Manually Installing a Python Package
As mentioned previously, some Python packages do not come with a stand-alone installer. In that case, you’ll have to perform a manual install. Not to worry, this is easier than it sounds.
As a general rule, it’s best to read the documentation and follow the instructions. That being said, most Python packages require a similar set of steps to install:
1. Download the package.
2. Unpack the package to a temporary directory. Most packages are distributed as compressed files, with extensions such as .tar.gz or .zip, or even self-extracting .exe files. Occasionally, files having the extension .tar.gz are downloaded as .tar.tar. If that is the case, rename the file with the extension .tar.gz and continue to unpack as you normally would.
3. Run python setup.py install in the temporary directory. Of course, this has to be done after Python is installed and working properly on your system.

The following documents the steps I took to install pySerial on Cygwin:

    tar zxvf pyserial-2.4.tar.gz
    cd pyserial-2.4
    python setup.py install

The first command unpacks the downloaded file to a newly created directory named pyserial-2.4; the creation of the new directory is done automatically by the application tar and is reported to the user. In case you’re running Windows and not Cygwin, you can use a native Windows utility, such as 7-Zip (http://www.7-zip.org/), to unpack the files (the tar application is available in Linux and usually comes preinstalled with the OS). The second command changes directory to the temporary directory. The third command performs the installation.

You can also use the setuptools package (which includes the easy_install tool), available from http://peak.telecommunity.com/DevCenter/setuptools, for better control over installing and maintaining packages, especially packages that depend on other packages. Another benefit of the package is that you can also install Python packages without worrying about root (superuser) permissions.

Installation Summary
Table 2-2 summarizes the Python packages discussed previously and indicates which software is required to run the examples in the book.

Table 2-2. Package Installation Summary
Software/Package | Functionality | Required?
Python | Python programming language | Yes
IDLE | Python IDE | No
IPython, Readline | Python CLI enhancements | No
NumPy | N-dimensional arrays and math package | Yes
SciPy | Scientific tools | Yes
Matplotlib | Plotting and graphing package | Yes
Gnuplot, gnuplot.py | Plotting and graphing package | No
PIL | Python Imaging Library | Partial (image processing chapter)
PySerial | Serial interface | Partial (Chapter 1)

Additional Applications
By now you should have a working development environment that includes the OS of your choice, Python, and Python packages. We now turn to additional software applications to complete an environment for developing and running data analysis and visualization scripts in Python. This section suggests tools to augment the development environment from the open source software world. While there are excellent commercial applications as well, I will not be covering those. The suggested applications work well for me, but you might have your own preference, even an application that’s not mentioned here. By all means, use your favorite; this section is mostly intended for those who require some starting points.

Editors
The number one tool in a developer’s arsenal is a text editor. Think of it as your Swiss Army knife: it can be used to read, write, or modify scripts, view data files, as a scratchpad for ideas, as a clipboard for intermediate copy and paste, and more. Basic text editors will soon frustrate you as some are limited in the size of files they can edit, others do not allow several open files, and yet others are missing syntax highlighting or bookmark capabilities.

Selecting the Proper Editor for You
Editors play a major role in your development environment. There’s a bit of a learning curve with a new editor, so consider the following points when you select a text editor or switch from your current one.
• Ease of use: This one is obvious. Is the editor easy to use and intuitive? Is there a learning curve, and if so, how long will it take you to master?
• Multiple file editing: You might be dealing with a considerable number of script files or even examining data files in the editor. Having one application deal with all these files removes clutter from your desktop and is generally easier to handle.
• Maximum file size: What’s the largest file you can open in the editor? Again, useful when you’d like to view large data files.
• Syntax highlighting: Syntax highlighting is a feature that displays reserved or specific syntax of a programming language in a different color or font so that the code is easier to view. Most editors that support syntax highlighting have built-in support for several programming languages, including Python. This feature is handy as it will highlight possible syntax errors as well as make the code more readable.
• Line numbering: Errors and warnings typically report the line on which they occurred. Therefore, being able to know what line caused an error without counting lines is important. Some editors also support a jump-to-line command, which can be useful if your code is long. Lastly, line numbers are helpful when communicating with another person.
• Most recently used files list: This is a nice feature that allows you to easily access one of the files you’ve recently viewed or edited, without specifying its full path.
• Bookmarks: Bookmarks allow easy navigation and are especially useful with large files.
• Macro support and macro recording: Macros and the ability to record and play back macros can boost productivity (see the sidebar “Recording Macros”).
• Autocompletion: This feature is similar to character completion, described previously in the sidebar “Character Completion with GNU Readline,” but usually with a different keystroke, such as Ctrl+space. It can boost productivity but requires some getting used to.
• Other features: The preceding is a list of features I consider important.
You might have different needs and different requirements, so jot them down and use those to select the proper editor for you.

RECORDING MACROS
Macro recorders are a quick and effective way to perform automation without actually writing code. Suppose you want to combine every two consecutive lines in a file into one line with “&&” symbols in between. This is not easily done with a search and replace (unless your search and replace also supports new-line characters). Of course, you could write a Python script to do this, but let’s suppose in this particular case there’s no point in automation simply because you’ll only do it once. This is exactly where you would use a macro recorder.

First, move your cursor to the beginning of the file (or press Ctrl+Home on some editors to get there). Now start your macro recorder and perform the following actions: press End to reach the end of the line, press Del to delete the line separator and combine the two lines into one long line, type &&, move down one line with the down arrow, and press Home to get to the beginning of the next line. Stop your macro recorder to finish the recording of your macro. This sequence combines two lines into one, adding “&&” in between. Note that I’ve used the keyboard and not the mouse; this is important, as most macro recorders in editors don’t support mouse recording.

Next, run the macro N times where N is the number of lines in the file divided by 2 (remember you combine two lines per run). Or you can run that macro for each pair of lines you want to combine. Some editors have the option to run the macro to the end of file. The following figure shows a macro recorder in Notepad++.

The macro is highly reliant on the location of the cursor. If you move the cursor to the end of the file and run the macro, you might get some unintended results.

A Short List of Text Editors
Table 2-3 presents a short list of some popular text editors. Use this table as a starting point in selecting an editor.
This is by no means a comprehensive list of available editors, so shop around and use the Internet to find more.

Table 2-3. Short List of Open Source Editors
Editor | OS/Environment | Notes
Notepad++ (http://notepad-plus.sourceforge.net/) | Windows | Has all the features described previously in the chapter and more. Downside: available only in Windows (sorry, Linux folks).
SciTE, Scintilla Text Editor (http://www.scintilla.org/SciTE.html) | Windows, X | A very good text editor, especially if you’re developing on both Windows and X: you can use one editor for both platforms. Limited in the number of open files and in macro recording capabilities.
GNU Emacs (http://www.gnu.org/software/emacs/) | Windows, Linux (X), Cygwin, Mac OS | A very rich editor. Runs on almost any platform including text-based CLI, Linux (X), and Windows as well as Cygwin. Has a bit of a learning curve if you’re new to Emacs.
Vim (http://www.vim.org) | Windows, Linux (X), Cygwin, Mac OS | A very rich editor that runs on almost any platform; has most of the features described previously and more (e.g., hex editor).
GNU Nano (http://www.nano-editor.org/) | Linux, Cygwin, DOS | A text-based (nongraphical), lightweight editor. Missing some features but makes up for that in size and performance. A good candidate when writing code over a telnet or SSH connection.

A BINARY EDITOR
At times it proves useful to edit binary files as well (see Chapter 10 for discussion of binary files). Binary files typically cannot be viewed or edited using regular editors (with maybe the exception of Vim). Hexedit (http://people.mandriva.com/~prigaux/hexedit.html) is a useful utility that allows editing of binary files. It displays the hex values as well as their ASCII representation (if such is available) and allows editing of both the hexadecimal and ASCII values. I wouldn’t recommend writing binary files in hexedit, but rather using it to tweak or modify binary files. Hexedit is available with most Linux distributions as well as Cygwin.
To invoke hexedit, issue the following:

    hexedit filename

While in hexedit, pressing F1 will bring up a help screen. To exit hexedit without saving, press Ctrl+C.

Spreadsheets
Spreadsheets are excellent tools for interactive data processing and visualization. The ease with which a user can import data from various file formats, organize it, and generate graphs is outstanding. CSV, a most useful file format, is supported by virtually all spreadsheet applications. CSV files are used extensively in data analysis and visualization, and being able to edit them easily is a great benefit of spreadsheets.

Most spreadsheets come equipped with additional tools such as linear regression, statistical computations, financial functions, and more. A more experienced user may be able to use macros to automate tasks or to update results when new data is entered. Because of these features, spreadsheets will definitely complement your development environment.

That said, spreadsheets are not ideal for automated data processing. They’re designed with an interactive point-and-click (GUI) user in mind, which makes them less natural at script automation. They’re also limited in the amount of data they can process: you typically have to open the entire file in the spreadsheet, and with large files that’s an issue. Lastly, they lack inherent documentation: it’s hard to capture and document the steps you took to reach a result.

Therefore, we will not be using spreadsheets in this book; however, I will mention their usage when appropriate. For example, it is of value to know how to export and import data to and from spreadsheets.

The following are open source spreadsheet applications:
• Gnumeric (http://www.gnome.org/projects/gnumeric/) is part of the GNOME desktop environment project.
• Calc (http://www.openoffice.org/) is part of the OpenOffice.org project and is available on most platforms.
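To illustrate moving data between a script and a spreadsheet, here is a minimal sketch using the csv module from the Python Standard Library. It is not a listing from the book: the file name, column names, and sample values are invented for illustration, and the code is written for Python 3 (the newline argument to open is not available in Python 2).

```python
import csv
import os
import tempfile

# Hypothetical sample data: a header row plus two measurements.
rows = [
    ["time_s", "speed_kmh"],
    [0, 42.5],
    [1, 43.1],
]

# Write the rows to a CSV file that a spreadsheet application can open.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back, as you would after editing it in a spreadsheet.
with open(path, newline="") as f:
    loaded = list(csv.reader(f))

print(loaded)
```

Note that csv.reader returns every field as a string, so numeric columns need converting (for example with float) before any processing.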
Word Processors
Finally, it might be of value to write a report or a presentation, displaying the results of your work. And you might want to publish the results in HTML or PDF format. Again, several open source applications are available, most notably the following:
• AbiWord (http://www.abisource.com/) is a word processing application available for Windows, GNU/Linux, and Mac OS.
• Writer (http://www.openoffice.org/) is part of the OpenOffice.org project and is available on most platforms.

Image Viewers
If you plan on performing image processing tasks, an image viewing utility is required. Even if you’re not really performing an image processing task, for example, generating a hard-copy graph in known file formats such as PNG and JPG, an image viewing utility is still a must.

Windows has built-in support for most popular image formats. On Linux, both the GNOME and KDE desktop environments come with built-in image viewers. Plus, it’s possible to open an image using a web browser (both on Windows and Linux), as browsers also support most image formats.

Point of the matter: no need to install anything. Use your OS image viewer or web browser.

Version Control Systems
Version control systems (VCSs) enable management of several revisions of a document or documents, with full history, tagging, and date capabilities. Most packages also support several developers working together simultaneously on the same file.

A VCS allows going back to a previous working version, or checking the difference between the current version and an older one, or even viewing a version of the document based on date. It might hold such information as who edited the file or the tag assigned to the document to mark its status.

A VCS is increasingly recognized as a required tool for a team of developers. But there’s also a case to be made for even one developer. These management systems are growing in popularity and for a good reason: they save time and help manage software projects. For this reason, they’re good software to enhance your development environment.
The downside of using a VCS is that it’s not trivial to master and perhaps should be postponed until after you’re comfortable with your programming environment. To help offset the complexity involved with VCSs, some also provide a GUI front end.

WORKING WITH A VERSION CONTROL SYSTEM
In a nutshell, working with a version control system can be described as follows:
1. Check out the project: create a local copy of the most updated version of the documents.
2. Modify your local copy: edit source code, fix bugs, and add features.
3. Review your changes: make sure the right files are modified.
4. Commit changes: save the changes you’ve made in the version control repository.

When you check out a document from the VCS repository, the system ensures you have the most updated version to work with. This is typically done once, and from here on you edit your local copy. You then modify your document, and once you’re satisfied with the results, review the changes. Reviewing the changes can be done by performing a diff of the file you have with the copy in the repository. You then commit your changes (also known as checking in) and possibly add a description of the changes. Subsequent modifications follow steps 2 through 4.

The version control system notifies you in case of a conflict. For example, suppose you checked out version 1 of the document, but by the time you wish to commit your changes, another developer has already checked in his version of the document: the system will alert you of a possible conflict, because you’re trying to update a document which is now version 2, whereas you were working on version 1.

The system also maintains a full history of the project. So even if you’re the only person working on a project, going back to a previous version is as simple as checking out an older revision. Most systems allow checking out of documents based on date, revision, or even a tag that you’ve previously supplied.
Because the system maintains such a complete history, most developers feel that you should commit changes as often as possible: you won’t be negatively affecting “good” releases.

One final note: if you can, choose to use text files over binary files. Performing a diff on text files is supported by most VCSs and is a valuable tool. With the binary version of the file (e.g., an executable), a diff yields very little information other than that the current version is not identical to the one in the repository.

Here’s a set of commands I often use once I’m done editing my local copy. With Mercurial, I issue

    hg status
    hg commit filename
    hg push
    hg update

The first command checks the status of the project: which files are modified. The second and third commands check in the local copy and update the repository (where Mercurial stores the files). The last command ensures I have the most updated version of the project in my local directory. In CVS, I follow a similar set of commands:

    cvs diff
    cvs commit
    cvs update

Here are some pointers to several open source VCS software applications:
• CVS (http://www.nongnu.org/cvs/) is a widely popular system with several graphical user interfaces including web-based ones.
• Subversion (http://subversion.tigris.org/) is another widely popular system available on most platforms.
• Mercurial (http://www.selenic.com/mercurial/wiki/) is a lightweight VCS package designed for distributed projects.

Example: Directory Structure for the Book
In the process of writing this book, I’ve used a VCS to control the documents, images, source code, and data for each chapter.
I’ve used the following directory structure: each chapter has a directory of its own named ChXX, with XX being the chapter number. Within each directory corresponding to a chapter, I’ve added four additional directories named doc, data, images, and src. My actual writing was placed in directory doc; my data files in directory data; images (such as those embedded in documents) in directory images; and source code in directory src.

    Book
        Ch1
            data
            images
            src
            doc
        Ch2
            data
            images
            src
            doc
        ...

Another side benefit of this directory structure is that it is helpful in envisioning how a project will look. If there’s something important you realized in the first piece of code (in my case, the first chapter), but it doesn’t really belong there, simply dump the ideas and code in the relevant directory for future processing.

■ Tip This directory structure is also apparent in the source code listings. Since the source code resides in directory ChXX/src, and data files reside in directory ChXX/data, the relative path to directory data is ../data. Similarly, the relative path to directory images is ../images.

The reason I decided on using a VCS for the book is quite simple: I’ve handed over documents of various revisions to editors, I’ve revisited others, and I’ve sent reviewers yet a different version. Some would return responses to a revision that I’d already updated, and so I had to know what document they’d edited. If you think about it, in a sense, there were really several developers for one document, and managing them all is a lot easier with a version control system.

Licensing
Most of the software described in the chapter is open source and free (with the obvious exception of Windows and other commercial packages, MATLAB and VMware, to name a couple). That being said, there are limitations on what you can do with open source software, especially if you intend on distributing your applications. Several software licenses exist, and I urge you to read each and every one. The same applies for commercial software: ensure you read the license agreement.
The following is a list of some of the license agreements of the software described in this chapter. It is neither complete nor comprehensive, and the licenses might change with time, so be sure to check the most recent license documentation.
• GNU licenses, including GPL and LGPL, which cover a substantial number of the packages described in this chapter: http://www.gnu.org/licenses/licenses.html
• Linux distributions licenses: Refer to the respective web page of the distribution of your choice.
• Cygwin: Refer to the license documents installed in Cygwin, usually under c:\cygwin\usr\share\doc\common-licenses, as well as http://www.redhat.com/software/cygwin/
• VMware: http://www.vmware.com/
• Python: http://www.python.org/psf/license/
• Enthought (EPD): http://www.enthought.com/products/epdlicense.php
• IPython: http://ipython.scipy.org/
• Matplotlib: http://matplotlib.sourceforge.net/users/license.html
• SciPy and NumPy: http://www.scipy.org/License_Compatibility
• Python Imaging Library (PIL): http://www.pythonware.com/products/pil/license.htm
• PySerial: http://pyserial.svn.sourceforge.net/
• Python Windows extensions (win32all): Refer to the license agreement as part of the package.
• Scintilla and SciTE: http://scintilla.sourceforge.net/License.txt
• Subversion: http://subversion.tigris.org/

Final Notes and References
By now you should have a full development environment, one that provides all the tools of the trade. Experiment with your environment, get accustomed to it; in the following chapters you’ll be using it extensively.

The following provide additional useful information in building a Python development environment should you want to investigate some more:
• Beginning Python: From Novice to Professional, Second Edition, by Magnus Lie Hetland (Apress)
• Python in a Nutshell: A Desktop Quick Reference, Second Edition, by Alex Martelli (O’Reilly)

CHAPTER 3
Python for Programmers
The Building Blocks

Python is a very readable language.
Assuming you’ve had some previous experience in programming, you should be able to read the code presented in the book without much trouble; you’ll understand what’s going on.

That being said, the book would be incomplete without coverage of the Python programming language. From a book-design perspective, it stands to reason that this chapter appears in the beginning. But that shouldn’t bind you; feel free to skip it and come back to it later.

Furthermore, this chapter does not cover the full extent of the language. Some Python topics that I felt were not crucial for data analysis and visualization were left out of scope. If you would like to learn more about the Python programming language, I’ve listed several books in the “Final Notes and References” section at the end of the chapter; these books are all Python oriented and should prove valuable resources.

Now to the chapter itself: I’ll be taking you quickly through the Python building blocks and complement the discussion with short examples. We’ll start by going through the basics of invoking and using Python interactively and noninteractively, entering expressions, and running scripts. We then look at the basic building blocks of most modern programming languages: data types, structures, variables, printing, flow control, and functions. We continue with a brief discussion of object-oriented programming (OOP) and finish with a discussion of modules and packages.

What Is Python?
Python is an open source, object-oriented, high-level programming language. This is a rather vague definition; if you’re looking for a more accurate one, have a look at http://www.python.org/ and http://www.python.org/about/. That being said, I think it’s easier to show what Python is, rather than try to define it. This really is the purpose of this book in a narrow sense: using Python effectively for data analysis and visualization and not just learning Python for the purpose of knowing the language.
Python seems to have developed a culture around it. You’ll find such notions as “Pythonic” or “Easier to Ask Forgiveness than Permission” (EAFP) or the “batteries included” approach, all of which show that Python is more than just a programming language.

It is rumored that many developers first use the language as a simple tool to solve a specific problem, but with time they are so captivated that they start writing haikus in Python. I’m afraid I’m not that artistic, so you won’t be seeing any haikus in here.

Here are the language features I view as the most important for the topics presented in the book:

• Open source: Yes, I view this as one of the fundamental aspects of Python. Python and its packages have been developed by an active community. The language evolves and changes, providing a dynamic environment built on discussion, on actual needs, on real problems people have to solve. I think this approach ensures a good language that hopefully will withstand the test of time.
• Ease of learning: It’s easy to learn Python, especially if you’re familiar with other programming languages; Python combines the best of several programming languages and programming paradigms in one.
• “Batteries included”: Python includes a great number of libraries as part of the standard library (several will be explored in this book). Additional packages can be installed and used seamlessly. You should be able to do most, if not all, of the work associated with data analysis and visualization without ever leaving the Python environment.
• Versatility: Python is versatile in that it supports both the early stages of development, as a rapid application development tool, and later phases of the project, when more structured programming paradigms are required.
• Interactive nature: More about this in the next section.
Interactive Python

The ability to run Python interactively, with a command-line interface (CLI), is an enviable one. The CLI helps you understand both the workings of the programming language and your own code as you write it. It’s not a new concept; personally, the first programming environment I ever used was also interactive in nature: Basic on Sinclair’s ZX-81 (see http://en.wikipedia.org/wiki/Zx-81 for some nostalgia). At times, when I write C code, I just wish I could do the same . . .

The interactive nature of Python is elegantly introduced in Guido van Rossum’s “Python Tutorial,” available at http://docs.python.org/tut/tut.html (Guido van Rossum is Python’s creator). Nevertheless, here’s a short introduction to running Python interactively, from a data analysis and visualization perspective.

Invoking Python

How you invoke Python depends on your platform:

• In Windows, assuming you’ve installed the binaries, click Start → All Programs → Python 2.5 → Python (command line), or IDLE (Python GUI) if you prefer a GUI environment. You might have a newer version by now.
• In Windows, under Cygwin, start a Cygwin bash shell and issue the following command:

python

• In Linux, open a terminal and issue the same command:

python

To exit Python, either press Ctrl+D (Ctrl+Z followed by Enter in a Windows command window) or enter

>>> exit()

Entering Commands

After starting Python in interactive mode, you’re presented with version information along with a short list of introductory commands (help, copyright, credits, and license) and the Python prompt, >>>.

Note: Whenever you encounter the >>> prompt in any listings in the book, it indicates that the command was issued interactively with the Python interpreter, and you should try it yourself by repeating the same commands in your Python interpreter. Similarly, when you encounter three dots (...) at the beginning of a line of code, it means that this line continues the text entered interactively in the previous line.
Issue any of these commands by entering the command name and pressing Enter (from now on, I’ll refrain from mentioning to press Enter or discussing how to erase characters; I assume you know how to use a CLI). Here’s the output from issuing the help command:

>>> help
Type help() for interactive help, or help(object) for help about object.

Python’s CLI allows entering statements and evaluating expressions. Some basic ones are described here. Try them to get a feel for the interactive nature of Python:

>>> 1+2+3+4+5+6+7+8+9
45
>>> 22*2
44
>>> a = 4
>>> a*4
16
>>> 'a'*4
'aaaa'
>>> sqrt(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sqrt' is not defined

The first couple of lines use Python to do basic arithmetic. One of the nice benefits of using interactive Python as opposed to using a calculator is that you can edit your previous entries very easily. Plus, you can retrace your steps and find a typo. But that’s hardly the reason for using Python, just an added bonus.

The third line is an assignment: we assign the value 4 to the variable a. The next line prints out the value of a times 4. You’ll learn more about variables, functions, and statements soon, but for now, let’s examine the interactive environment and get you up to speed on how to work with it efficiently.

The following line shows Python’s string capabilities. A string is typically enclosed in quotes, so the next command multiplies the string 'a' by 4. The result is exactly that: 'a' repeated four times gives 'aaaa', which is pretty cool.

The last line shows what happens when the interpreter encounters a problem: it raises an exception and reports the reason back to the user. In this particular case, the interpreter doesn’t know of the function sqrt(); this can be easily remedied if we import the function by issuing from math import sqrt, but that’s reserved for later.
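As a brief sketch of the remedy just mentioned, the session above can be repeated as a small script in which sqrt is imported from the standard math module before it is used. The variable name a and the value 4 follow the listing; writing print() with parentheses is my own choice here, made on the assumption that the snippet should also run unchanged on newer Python versions than the 2.5 sessions shown in this chapter.

```python
from math import sqrt  # makes the name sqrt known to the interpreter

a = 4                  # the same assignment as in the interactive session
print(a * 4)           # integer arithmetic
print('a' * 4)         # string repetition
print(sqrt(a))         # no NameError now that sqrt has been imported
```

Without the import line, the last statement would raise the same NameError shown in the interactive listing.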
The Result Variable

Whenever the interactive interpreter evaluates an expression, the result is stored in a special variable named _. This is useful when you’re doing some manual calculations:

>>> 1+2+3+4+5+6+7+8+9
45
>>> _
45
>>> _+10
55
>>> _
55
>>> _/5
11

The result variable is updated after every evaluated expression, as shown in this example, so bear that in mind.

The Interactive Help System

The interactive help system is a valuable tool both when learning the language and when programming. Python has a considerable number of functions, modules, and packages, so a help system is a must. As the name suggests, the help system is an interactive system. Invoking it is straightforward (notice the required empty parentheses):

>>> help()

Enter quit to exit the system. Enter a function name to read about it (e.g., execfile). If you enter sqrt, the help system will respond that there’s no documentation regarding sqrt. The reason for this is that sqrt is part of the math module, and to view its help information you’ll have to enter math.sqrt instead. Refer to the “Modules and Packages” section later in this chapter for a discussion of modules.

You can also view help for a specific function (noninteractively) using help(function):

>>> help(execfile)
Help on built-in function execfile in module __builtin__:

execfile(...)
    execfile(filename[, globals[, locals]])

    Read and execute a Python script from a file.
    The globals and locals are dictionaries, defaulting to the current
    globals and locals. If only globals is given, locals defaults to it.

What really happens when you issue the command help(function) is that the function’s docstring is printed. More about docstrings in the “Defining Functions” section later in this chapter.

Moving Around

At times, it’s of value to know how to change the current working directory within the Python interpreter. This is especially important if the code and data are located in different directories; it might be easier to just switch to another directory as the situation requires.
Suppose you defined a function that accepts a file name as input, reads it, and does some processing. Furthermore, this function was defined interactively, as you were using the interpreter. Now you’d like to run this function on some files, but the path to these files is long and cumbersome. This is a situation where it’d be much easier to switch to the directory where the files reside and execute the function with the relatively shorter file name, that is, excluding the path.

Module os provides us with this functionality. You’ve already seen the os module in Chapter 1, but this time we use it to move around from within the interpreter:

>>> import os
>>> os.getcwd()
'/home/shai/python'
>>> os.listdir('.')
['src', 'data']
>>> os.chdir('src')
>>> os.getcwd()
'/home/shai/python/src'
>>> os.chdir('/home/shai/python/data')
>>> os.getcwd()
'/home/shai/python/data'
>>> os.chdir('../src')
>>> os.getcwd()
'/home/shai/python/src'
>>> os.listdir('/home/shai/python/data')
['GPS-2008-06-05-13-02-56.csv']

In this listing, the first line imports the os module, which contains the functions required to move around. I’ve then used several functions from the os module:

• The function os.listdir(path) lists directory contents. The function must be supplied with a string argument. If the string argument is a single dot ('.'), the function returns the contents of the current directory. An empty string ('') is not accepted on every platform, so prefer '.'.
• To figure out what your current working directory is, issue the command os.getcwd(). This function takes no arguments.
• Finally, I’ve shown the usage of the os.chdir(path) function. The function accepts one string as an argument and, depending on the string, changes directories accordingly. The function accepts both relative directory paths (such as '../src') and full directory paths ('/home/shai/python/data').
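The three functions in the list above can be tried without touching your own files. The following is only a sketch: the scratch directory created with tempfile.mkdtemp() and the subdirectory names src and data are stand-ins chosen to mirror the listing, not paths from the book's example machine.

```python
import os
import shutil
import tempfile

start = os.getcwd()                    # remember where we started
base = tempfile.mkdtemp()              # hypothetical scratch directory
os.mkdir(os.path.join(base, 'src'))
os.mkdir(os.path.join(base, 'data'))

os.chdir(base)                         # full (absolute) path
listing = sorted(os.listdir('.'))      # contents of the current directory
print(listing)

os.chdir('src')                        # relative path
os.chdir('../data')                    # relative path through the parent
here = os.path.basename(os.getcwd())   # last component of the current path
print(here)

os.chdir(start)                        # go back before cleaning up
shutil.rmtree(base)                    # remove the scratch directory
```

Restoring the original directory at the end is a habit worth keeping in scripts, since a changed working directory silently affects every relative path used afterward.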
Tip: In IPython (see Chapter 2), you can use the commands cd, ls, and pwd as you would in any Linux shell instead of using the os module functions (it’s faster to type!).

Running Scripts

The interactive environment can only get you so far. Eventually, you will want to write programs (scripts) and run them either noninteractively or from within the Python interpreter.

There could be various reasons to write scripts, and most come down to the fact that you might perform a task more than once. Say the accountants in your company use nonlinear depreciation equations, and since you’re their favorite programmer, they ask you for a personal favor, so you decide to write a web-based depreciation calculator. Or the clinical people in your medical device company often require access to log files that are large and require processing, so they ask you to write an end-of-day report per patient summarizing the day’s events based on those log files, or . . . and the list goes on and on.

The path I typically follow is to use the interactive environment in parallel with coding the script. That is, I run Python interactively, run a few statements, assign some variables, plot some graphs; if things look good, I copy the commands I issued over to an editor where my script resides.

Tip: In IPython you can issue the command history to view a list of recently entered commands.

The benefit of coding interactively is that you can examine the variables and data structures of your code without additional debugging tools. If the script raises an exception, you have all the variables and data structures at your fingertips: you can reproduce the error and possibly fix the bug.

Once your script is ready (well, it’s never really ready; let’s just agree that it’s ready to be test-driven), you have several options to run it:

• Run the script from the GUI environment: running IDLE (Python GUI), select File → Open and choose the script to run.
This will open the Python script in the IDLE editor (if it’s already open, there’s no need to reopen it). To run the script, press F5 or select Run Module from the Run menu. The output should appear in the Python GUI shell.
• Run the script from Python’s CLI: using execfile('path/to/filename.py') is my favorite option when developing. The reason I prefer this method over the GUI environment is that I like editing my code in an editor that is not part of an IDE.

Tip: If you’re using IPython, you could issue the command run path/to/filename.py instead of execfile('filename.py'); the benefits are 1) you can use character completion to select the file name, and 2) you can supply command-line parameters to the script: run path/to/filename.py param1 param2.

• Invoke the script from the shell or a command window (noninteractive mode): even though you might have developed your script in interactive Python, it’s a good idea to test your script in a shell as well, especially if you’re distributing your code for others to use: they might not want to run the code interactively. To run the script from a Linux shell or Cygwin, use this command:

python path/to/filename.py

Or in Windows:

c:\python25\python.exe path\to\filename.py

In Windows, you could also set the PATH variable to include the Python directory path, in this case, c:\python25, so invoking the script will not require a full path to the Python executable:

path=%path%;c:\python25
python path\to\filename.py

• Finally, it’s also possible to enjoy both worlds: interactive and noninteractive mode! This is done by running the script with the -i switch, which opens up a Python shell after the script has run and lets you examine variables interactively:

python -i somescript.py

Tip: Since the backslash character (\) has a special meaning in strings (we’ll get to that later) and is also used as a path separator in Windows, it’s best to use the slash (/) character whenever you’re working with file names and file paths.
If you can, opt to use relative paths (e.g., ../data instead of c:\data); your code will be portable across operating systems and much easier to read.

That’s it. I think we’re ready for the language itself now.

Data Types

Python data types are similar to data types in other programming languages; you’ll see here strings and numbers just as you would in, say, Basic. But there are some niceties you should know about even in those basic data types; for example, the long data type supports unlimited integer precision.

Numbers

We’ll start off with numbers. Python natively supports int, long, float, and complex.

Int and Long

The data type int is equivalent to C’s long data type, and its precision is system dependent. I run a 32-bit machine, so on my system, int defaults to a 4-byte integer. This means the maximum int I can represent on my system is 2³¹–1, and the minimum is –2³¹. If you’re uncertain of the bit count on your system or if your code might be running on different platforms (e.g., both 32-bit and 64-bit platforms), you can use the following to determine the maximum int value:

>>> import sys
>>> sys.maxint
2147483647

The data type long provides unlimited integer precision. It’s not limited by the platform. However, there is a price to be paid: performance. Long integer numbers are denoted in Python with a trailing L character:

>>> 2**70
1180591620717411303424L
>>> 2**700
5260135901548373507240989882880128665550339802823173859498280903068732
1542970808221136665362775884512269829688561782177130194322501838038631
2781477065188084955223671128444598191663757884322717271293251735781376L

I’ve introduced the power operator, denoted by **, so 2**70 is 2⁷⁰.

Once you leave the int range (4 bytes on a 32-bit machine, 8 bytes on a 64-bit machine), that is, once your calculation extends to a number greater than sys.maxint, Python automatically converts the number to a long integer value, giving it unlimited precision.
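The automatic switch to unlimited precision can be checked with a few lines. This is a minimal sketch rather than a listing from the original sessions: the fallback to sys.maxsize is my own addition so that the snippet also runs on Python 3, where int and long were merged into a single type and sys.maxint no longer exists.

```python
import sys

# sys.maxint exists only in Python 2; fall back to sys.maxsize elsewhere.
limit = getattr(sys, 'maxint', sys.maxsize)

big = 2 ** 70              # well beyond a 4-byte or 8-byte machine integer
print(limit)
print(big)
print(big > limit)         # the value was not truncated to the machine range
print(big + 1 - big)       # arithmetic stays exact even at this size
```

Printing a long with print shows the digits without the trailing L; the L appears only in the interpreter's echo of the value.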
So if your plan is to use an int, make sure you don’t accidentally turn it into a long. Here’s a possible pitfall:

>>> 2**31-2
2147483646L

As you can see, the result is a long, denoted by the trailing L. But surely this number is less than sys.maxint! The problem is that the first calculation, 2³¹, already exceeded sys.maxint, and from then on all computations are performed in unlimited precision, denoted by the trailing L. Once a number is long, it will keep on being treated as long unless you specifically convert it back to an int using the int() function, assuming the number can indeed be represented as an int.

Personally, I haven’t used long all that much. I typically use integer values when counting things (for example, in loops), and 2³¹–1 is more than enough. However, had I required such a large number, I’d have to jump through a series of hoops in, say, C, but in Python it’s a lot easier (not effortless, but still easier).

WHY NOT EFFORTLESS?

This is a bit off-topic and is an advanced discussion that assumes some knowledge of Python. The reason it’s not effortless to do unlimited-precision integer arithmetic is that a lot of the functions we’re used to working with in Python return int, and not long.

To illustrate this problem, suppose you’d like to compute a sum of numbers from 1 to N, where N is greater than 2³² (yes, there are easier ways, but I’m trying to make a point here). A typical approach would be to use a for loop with an xrange() iterator as follows:

>>> total = 0
>>> for x in xrange(1000):
...     total += x
...
>>> total
499500

Note that I’ve used a variable named total and not sum because sum is a built-in function in Python.

Now the problem lies with the call to xrange(): the iterator accepts only int values.
So if you were to replace the number 1000 with, say, 2**32, you’d get an error:

>>> xrange(2**32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int

which means you’d have to resort to other techniques such as (caution: this is a long run)

>>> x, total = 0, 0
>>> while (x < 2**33):
...     x += 1
...     total += x

It’s definitely doable in Python, but it’s not effortless. That said, doing the same in C is even harder.

Other Useful Bases

Bases that are powers of 2 are native to computing systems. One byte, for instance, holds 2⁸ distinct values rather than some power of 10. For this reason, the ability to convert values between the decimal system and bases that are powers of 2 (such as the hexadecimal base or the octal base) is important.

Note: The octal base is less popular nowadays. However, some octal notations are still active, for example, file permissions in Linux systems.

Hexadecimal values are denoted with a leading 0x. Thus, 0x20 is 32 (decimal). You can use both capital and noncapital letters for digits A–F:

>>> 0xaB
171
>>> 0xff
255

Octal values are denoted with a leading 0 (that’s a zero, not the character O). Thus 020 is 16 (decimal):

>>> 020
16

Regardless of how you enter numbers, that is, what base you’ve used, they’re still retained as numbers in Python. Should you want to look up the representation in a different base, use the oct() and hex() function calls. Both these functions return a string:

>>> hex(100)
'0x64'
>>> oct(100)
'0144'

You can also perform any other base conversion using the function int(str[, radix]), which returns a number, not a string. In case radix isn’t specified, it is assumed to be 10:

>>> int('100')
100
>>> int('100', 3)
9

The argument to the function int() is a string and not a number. So in case you’d like to convert 101 in base 3 to a decimal value, write int('101', 3) or int(str(101), 3).
The latter is more useful if you’d like to use a variable, that is, int(str(variable), base).

It’s possible to use higher bases than hexadecimal (base 16), using an increasing number of letters from the alphabet as the new digits for the base. In base 17, the character g is added; in base 18, the character h is used, and so on. So the number 'ggg' in base 17 should be 17³–1:

>>> int('ggg', 17)
4912
>>> 17**3-1
4912

Bases up to 36 are supported; in base 36 the highest digit is the letter “z”.

Comparisons

You can compare values using the regular operators: > and < for greater than and less than, respectively. Equality checks are done using a double equal sign (==) to differentiate from the assignment symbol denoted by a single equal sign (=). Inequality is !=, and you can also use >= and <= for greater-than-or-equal and less-than-or-equal comparisons.

>>> 2*3>5
True
>>> 2*3!=5
True

Some comparisons are not allowed, for instance, comparing a complex number (described in the next section) with an integer value:

>>> 1+1j>2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: no ordering relation is defined for complex numbers

Bitwise Operations

Bitwise operators are similar to C’s bitwise operators, as shown in Table 3-1.

Table 3-1. Bitwise Operations

Operator  Description                 Example
~         Bitwise not                 ~0x0ff+0x100 returns 0.
<<        Shift left                  1<<8 returns 256.
>>        Shift right                 256>>2 returns 64.
^         Bitwise exclusive OR (XOR)  0xff^0xf0 returns 15 (0x0f).
&         Bitwise AND                 0xff&0x0f returns 15 (0x0f).
|         Bitwise OR                  0x0f|0xf0 returns 255 (0xff).

Augmented Assignments

Augmented assignments introduce the operators +=, -=, *=, /=, %=, **=, <<=, >>=, &=, ^=, and |=. This notation is similar to C/C++ syntax. That is, instead of writing a=a+1, you can write a+=1. Similarly, instead of writing a=a>>1, you can write a>>=1. Please note that Python does not support the increment operator ++.
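To tie the last few sections together, here is a short sketch that repeats the base conversions and the Table 3-1 examples as a script and then applies a few augmented assignments. The particular sequence of augmented operators applied to a is my own illustration; oct() is left out because its output format differs between Python versions.

```python
# Base conversions: int() parses a string in a given base; hex() formats one.
print(int('100', 3))       # the string '100' read in base 3
print(int('ggg', 17))      # highest three-digit number in base 17
print(hex(100))            # hexadecimal representation, as a string

# The bitwise operators from Table 3-1.
print(~0x0ff + 0x100)      # bitwise not, then add 0x100
print(1 << 8)              # shift left
print(256 >> 2)            # shift right
print(0xff ^ 0xf0)         # exclusive OR
print(0xff & 0x0f)         # AND
print(0x0f | 0xf0)         # OR

# Augmented assignments: a += 1 is shorthand for a = a + 1.
a = 4
a += 1                     # a is now 5
a <<= 2                    # a is now 20
a **= 2                    # a is now 400
print(a)
```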
Float and Complex

Floating-point values have been around for quite some time, and there’s no escaping them. Python’s float data type is equivalent to C’s double, so it’s really more accurate than C’s float (C’s float has fewer bytes than C’s double).

Floating-point values are written with a dot or with the e or E character denoting exponential notation. So if you want to ensure your value is a float, either append a dot (or dot zero, or use e/E notation) or convert it explicitly with the function float():

>>> 2.0
2.0
>>> 2
2
>>> float(2)
2.0
>>> 1e3
1000.0

The reason specifying a float is important is that you might get an integer operation where you really want a floating-point operation.

>>> 2/3
0
>>> 2./3
0.66666666666666663
>>> 2/3.
0.66666666666666663
>>> 2.0/3.0
0.66666666666666663
>>> float(2)/3
0.66666666666666663
>>> float(2/3)
0.0

Note: In Python 3.0, 2/3 will return a floating-point value; for an integer division, use 2//3 instead. See http://www.python.org/dev/peps/pep-0238/ for more details.

In the first operation and the last operation, the division is integer division, returning the value 0. As a general rule, whenever a floating-point number is introduced, any integers (both int and long) are converted to a float, and from that point onward the calculation continues with floating-point values. This is also known as promotion or coercion; long and int are promoted to a float.

You can force a value into a floating-point value by using the float() function. This works on strings as well as numbers, as long as the conversion is possible.

The complex data type represents complex numbers and is composed of two floating-point values, one representing the real part and one representing the imaginary part. The imaginary part is marked with the trailing letter j (or J).
Accessing the real and imaginary parts is possible using the .real and .imag attributes, as follows:

>>> a = 1+2j
>>> a.real
1.0
>>> a.imag
2.0

You can use almost any operator on complex numbers just as you would on floating-point numbers. Once a computation involves a complex number, the rest of the computation remains complex; that is, integers and floating-point values are promoted to complex values.

You can convert a number to a complex number using the complex(real[, imag]) function. In case imag is provided, it holds the imaginary value of the complex number:

>>> complex(10)
(10+0j)
>>> complex(10, 2)
(10+2j)

The complex data type as well as examples on using it will be discussed in Chapter 7.

Strings

Per the classic Python definition, strings are an immutable sequence of characters. This means a string is a sequence of characters, and it is unchangeable: you can’t change the characters.

I know that this might seem odd at first: you’re probably thinking, “How do I work with strings if I can’t modify them?” The answer is that you create new strings based on your current string.

Expressing Strings

There are several ways to express a string: single quotes, 'string'; double quotes, "string"; and triple-double-quotes, """string""" (phew), to name a few. And there are even more: raw strings denoted by the letter r, such as r"string", and Unicode strings denoted by the letter u, for example, u"unicode".

To express a basic string, use single quotes as follows:

>>> 'split'
'split'

In case your string has a quote in it, you’ll have to escape it with a backslash (\):

>>> 'it\'s a split'
"it's a split"

The reason we escaped the quote that’s part of the word it's is so that the quote before the letter s won’t terminate the string.

Single quotes and double quotes are interchangeable.
Therefore, we could’ve achieved the same result, without escaping the quote that’s part of the word it's, by replacing the enclosing quotes (the ones at the beginning and end) with double quotes:

>>> "it's a split"
"it's a split"

But what if we wanted a string that actually does have the backslash before the quote as well, a string that looks exactly like this: it\'s a split second? Well, one option is to escape the backslash as well as the quote:

>>> 'it\\\'s a split second'
"it\\'s a split second"
>>> print 'it\\\'s a split second'
it\'s a split second

Notice how the interpreter represents that string differently from how it’s printed. Well, this pattern can keep on going, making things harder to understand. Instead, we could use a raw string:

>>> r"it\'s a split second"
"it\\'s a split second"
>>> print r"it\'s a split second"
it\'s a split second

A raw string means that everything following the character r and the starting quote and before the ending quote should be taken literally. In effect, you’re telling Python: escape whatever needs escaping and hand me back a proper string!

Note: Raw strings will be used extensively in regular expressions so as not to escape special meaning characters on several levels. See Chapter 5 for details.

Strings can also span multiple lines with a backslash:

>>> "it's a \
... split second"
"it's a split second"

This obviously could bring about more disasters: what if you really wanted that backslash to appear, as well as the line break? Not to worry, time to use triple-double-quotes (or triple-single-quotes, they’re interchangeable) and escape the backslash:

>>> print """it's a \\
... split second"""
it's a \
split second

If all this sounds too confusing, you’re in good company. To acquaint yourself with these caveats, launch Python interactively and experiment! I haven’t talked about Unicode strings here; I’ll touch on that in Chapter 5.

String Operations

So what can you do with strings? Table 3-2 lists some operations that can be performed on strings, along with examples.
In the examples, I’ve selected strings that don’t require escaping so they’re easier to follow, but the same can be applied to any string expression described previously.

Table 3-2. String Operations

Operator   Description                                      Example

Adding and Multiplying
str1+str2  Concatenates strings str1 and str2.              'split'+'second' returns 'splitsecond'.
str*n      Concatenates the string str n times.             'second'*3 returns 'secondsecondsecond'.

Indexing and Slicing (n and m are positive integer values less than the length of str; negative values are counted from the end of the string)
s[n]       Retrieves the character at index n of s          'split'[3] returns 'i'.
           (indices start at 0).
s[n:m]     Retrieves a string slice from the nth            'split second'[6:12] returns 'second'.
           character to the mth character, excluding        'split second'[-6:-2] returns 'seco'.
           the mth character. If n or m is negative,
           it is counted from the end of the string.
s[:m]      Equals s[n:m] with n=0.                          'split second'[:3] returns 'spl'.
s[n:]      Retrieves a string slice from the nth            'split second'[6:] returns 'second'.
           character to the end.

You can check whether a character is in a string using the in operator:

>>> 'd' in 'abcde'
True

or count the number of characters in a string using the len() function:

>>> len('abcde')
5

Both in and len() operate on other sequences, as you’ll soon see. I’ll discuss strings (including Unicode strings and raw strings) in more detail in Chapter 5.

Booleans

I’ve postponed discussion of Boolean values until after you’ve seen some other data types because Boolean values shine in the context of other data types. Booleans can take two values: True (1) or False (0).

>>> a = True
>>> a > 1
False
>>> a == 1
True
>>> bool(5)
True
>>> type(a)
<type 'bool'>

You can cast a value to a Boolean by using the bool() function.
Empty strings, as well as other empty sequences, and the value zero of any form are considered False:

>>> bool(0)
False
>>> bool(5)
True
>>> bool("")
False
>>> bool("s")
True

Logical Operations

Logical operations and, or, and not operate on Booleans. I assume you know how to use them. Let’s see if you know the answer to the following . . .

>>> 5>1 and ((4<3) or 2+4<5 and not 6<2)

Data Structures

Python, being a high-level programming language, also provides additional, more complex, data types, which I refer to as data structures. These include lists, tuples, dictionaries, and sets, to name a few. Data structures make the programming experience a lot more enjoyable.

Python documentation does not necessarily differentiate between data types and data structures the way I have. My purpose in this distinction is to split the discussion into two categories: simple data types, which you’re likely to encounter in popular programming languages (such as C), and more complex data types, or data structures, which you’re likely to see in higher-level programming languages such as Python and Perl. Regardless of the classification presented in this chapter, both are built-in data types as far as Python is concerned.

In a sense, you’ve already been exposed to data structures: strings and complex numbers. The string is an immutable sequence, hardly a “simple” data type. By comparison, the C programming language does not support a native string data type, only an array of characters, which goes to show that strings aren’t really all that basic. But a string is still limited: it’s a sequence of characters. What about sequences of other objects? And what about mutable (changeable) sequences?

Not to worry, Python provides those as well. A list in Python is a mutable sequence of arbitrary data types. A tuple is quite similar to a list, only that it’s immutable.

We’ll also talk about some more complex data structures that can make programming yet more entertaining.
You’ve already seen a dictionary object in Chapter 1, and we’ll explore that data structure as well as the set object. Python is also an object-oriented programming language; therefore, a discussion of the class object will be presented after we have talked about functions.

Lastly, there are also additional native data types and structures in Python, but most of them will be left out of the scope of this book; they’re not a must for data analysis and visualization (with the possible exception of file data types, which will be discussed in Chapter 5).

Lists

A list is a mutable sequence of objects. A list is denoted by brackets:

>>> [1, 'hey', 1+2j]
[1, 'hey', (1+2j)]

You can also create a list using the list() function. This is useful when converting different sequences to a list, say, from a string:

>>> list('some text')
['s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't']

A list can be modified. You can add another element to a list by using the + operator. The + operator concatenates lists, so you have to supply another list:

>>> [1, 'hey', 1+2j] + ['hey', 'hey']
[1, 'hey', (1+2j), 'hey', 'hey']

The following, however, will fail, since you cannot add an integer to a list:

>>> [1, 2, 3] + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list

The proper way to do this would be to form another list, made solely of the value 2:

>>> [1, 2, 3] + [2]
[1, 2, 3, 2]

If you’re looking to add the value 2 to each and every element of the list [1, 2, 3], that is, to modify the list to [3, 4, 5], you’ll get the details in the sections “The for Statement” and “List Comprehensions” later in this chapter.

A list is an object too, so you can also have a list inside a list:

>>> [[1, 2], [3, 4]]
[[1, 2], [3, 4]]

Now things get trickier, both in describing the object and in actually performing operations. Say you’d like to add another list, [5, 6], to the preceding example. How exactly would you like to add it?
Should the updated list be [[1, 2], [3, 4], [5, 6]] or [[1, 2], [3, 4], 5, 6] or [[[1, 2], [3, 4]], [5, 6]] (which really is shamelessly tricky)?

The way I like to describe the data structure [[1, 2], [3, 4]] is as a list of rows. The first row is [1, 2] and the second row is [3, 4]. Here are some of the things you can do to concatenate lists:

    >>> [[1, 2], [3, 4]] + [5, 6]
    [[1, 2], [3, 4], 5, 6]
    >>> [[1, 2], [3, 4]] + [[5, 6]]
    [[1, 2], [3, 4], [5, 6]]

The first line adds the elements 5 and 6. The second line adds the row [5, 6].

Another option is to use a variable to hold the list, L, and use the append() and extend() methods:

    >>> L = [[1, 2], [3, 4]]
    >>> L.append([5, 6])
    >>> L
    [[1, 2], [3, 4], [5, 6]]
    >>> L.extend([7, 8])
    >>> L
    [[1, 2], [3, 4], [5, 6], 7, 8]

The method append() adds an item to the list, in this case the list [5, 6]. The method extend() adds elements from the sequence one by one to the list, in this case, the elements 7 and 8. It’s a bit hard to follow at first, but experiment with lists interactively to get a feel for how to use them properly.

Lists can also be indexed, similarly to strings:

    >>> L = [['hey', '1'], [2, 3, 4], '']
    >>> L[0]
    ['hey', '1']
    >>> L[1]
    [2, 3, 4]
    >>> L[1][1]
    3

The last statement, L[1][1], requires some explanation. The statement L[1] returns the second element in the list (indices start at 0, so index 1 is the second element). For our purposes, let’s mentally assign L[1] to variable M. But variable M is a list as well: [2, 3, 4]. So clearly we can index M as well: M[1] is 3. Instead of doing those two steps, we can write this more compactly as L[1][1].
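The difference between the + operator, append(), and extend() is easiest to see side by side. The following is only an illustrative sketch: the variable names (rows, flat, nested, appended, extended) are made up for this example, and the print calls use parentheses so that it also runs on Python 3.

```python
# Start from the "list of rows" used in the text.
rows = [[1, 2], [3, 4]]

# The + operator builds a new list and leaves rows untouched.
flat = rows + [5, 6]        # adds the elements 5 and 6
nested = rows + [[5, 6]]    # adds the row [5, 6]

# append() and extend() modify the list in place.
appended = [[1, 2], [3, 4]]
appended.append([5, 6])     # the whole list becomes one new item

extended = [[1, 2], [3, 4]]
extended.extend([7, 8])     # each element is added separately

# Chained indexing: take the second row, then its second element.
second_row = rows[1]
value = rows[1][1]

print(flat)
print(nested)
print(appended)
print(extended)
print(second_row, value)
```

Note that rows itself is unchanged after the two + expressions, whereas appended and extended were modified by their method calls.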
Lists, much like strings, can also be sliced:

    >>> L = [['hey', '1'], [2, 3, 4], '']
    >>> L[:-1]
    [['hey', '1'], [2, 3, 4]]
    >>> L[2:]
    ['']

You can check whether an item is in a list using the in operator:

    >>> 'hey' in ['hey', 'hey', 'split', 'second']
    True

You can count the number of elements in a list using the len() function:

    >>> len([['hey', '1'], [2, 3, 4], ''])
    3

Since lists are mutable, their items can be reassigned:

    >>> L = [['hey', '1'], [2, 3, 4], '']
    >>> L[1] = [4, 5, 6]
    >>> L
    [['hey', '1'], [4, 5, 6], '']

or have items removed using the del statement:

    >>> L = [['hey', '1'], [2, 3, 4], '']
    >>> del L[0]
    >>> L
    [[2, 3, 4], '']

Lists also have methods, functions that operate only on list objects, such as append() and extend(), shown previously. To use a method, follow the list object with a dot and the function name, with the parameters in parentheses (empty parentheses if there are no parameters):

    >>> L = ['hey', 'hey', 'split', 'second']
    >>> L.count('hey')
    2
    >>> L.sort()
    >>> L
    ['hey', 'hey', 'second', 'split']

I’ve used the methods count(), which counts the occurrences of an item in a list, and sort(), which sorts a list. Table 3-3 describes the list methods along with some examples. In the examples, assume that L is ['second', 'second', 8].

Table 3-3. List Methods

append(obj)
    Description: Adds an element to the end of a list.
    Example: L.append('hey') changes L to ['second', 'second', 8, 'hey'].

count(val)
    Description: Returns the number of times val appears in the list.
    Example: L.count('second') returns 2.

extend(iterable)
    Description: Adds elements to the list from iterable (more on iterators and iterables later in this chapter).
    Example: L.extend(xrange(2)) changes L to ['second', 'second', 8, 0, 1].

index(val[, start[, stop]])
    Description: Returns the first index of val in the list. If start is supplied, the search begins at index start; if stop is supplied, the index also has to be less than stop.
    Example: L.index('second') returns 0. L.index('second', 1) returns 1. L.index('second', 2, 3) raises an exception: "x not in list".
insert(n, obj)
    Description: Inserts an object at index n.
    Example: L.insert(2, 'me') changes L to ['second', 'second', 'me', 8].

pop([n])
    Description: Returns the nth element in the list and removes it. If n is not supplied, this method returns the last element.
    Example: L.pop() returns 8, and the modified list is ['second', 'second']. L.pop(-3) returns 'second', and the modified list is ['second', 8].

remove(val)
    Description: Removes the first occurrence of val in the list.
    Example: L.remove('second') changes L to ['second', 8].

reverse()
    Description: Reverses the list.
    Example: L.reverse() modifies L to [8, 'second', 'second'].

sort()
    Description: Sorts the list. You can supply a sort function to the list; see help(list.sort).
    Example: L.sort() modifies L to [8, 'second', 'second'].

Tuples

A tuple is an immutable (unchangeable) sequence of objects. A tuple is denoted by parentheses and can be created using the tuple() function:

    >>> (1, 2, 3)
    (1, 2, 3)
    >>> tuple('hey')
    ('h', 'e', 'y')

Tuples don’t necessarily require parentheses; merely adding a comma makes the expression a tuple:

    >>> 1, 2
    (1, 2)
    >>> 1,
    (1,)
    >>> (1)
    1

The expression (1) is not a tuple: it’s the value 1 within parentheses, which is treated simply as 1.

Tuples behave similarly to lists, with the exception of modification: you can’t modify a tuple. But you can create a new one based on an existing one:

    >>> tuple([1, 2, 3])
    (1, 2, 3)
    >>> _*2
    (1, 2, 3, 1, 2, 3)

In the first statement, I’ve created a tuple based on a list. Note that tuple(1, 2, 3) would raise an exception, because the function tuple() expects one argument, not three. In the preceding code I passed a list as an argument: [1, 2, 3]. I could’ve also written tuple((1, 2, 3)), effectively achieving the same thing: the outer set of parentheses in the expression is the function parentheses; the inner one is the tuple parentheses. In the second statement listed, I’ve created a second tuple based on the first one, by multiplying the result variable _, which in the interactive interpreter holds the last value displayed.
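The rules above about commas, parentheses, and the single argument of tuple() can be checked in a short script. This is a sketch for illustration only; the variable names are invented here, and it is written so that it runs on Python 3 as well.

```python
# Parentheses alone do not make a tuple; the comma does.
not_a_tuple = (1)
one_item = (1,)

# tuple() takes a single iterable argument.
from_list = tuple([1, 2, 3])
from_string = tuple('hey')

# Tuples cannot be changed in place, but new tuples can be built from them.
doubled = from_list * 2
longer = from_list + (4,)

# Passing three separate arguments is an error.
try:
    tuple(1, 2, 3)
    raised = False
except TypeError:
    raised = True

print(type(not_a_tuple).__name__, type(one_item).__name__)
print(from_string)
print(doubled, longer, raised)
```

The original tuple, from_list, is still (1, 2, 3) at the end; both doubled and longer are new objects.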
Tuples can contain different data types and data structures:

    >>> ([1, 2], (3, 4))
    ([1, 2], (3, 4))

The preceding is a tuple containing a list and a tuple. Tuples can also be indexed, similarly to lists and strings. Remember that indexing requires brackets, not parentheses:

    >>> ([1, 2], (3, 4))[0]
    [1, 2]
    >>> ([1, 2], (3, 4))[1]
    (3, 4)
    >>> ([1, 2], (3, 4))[1][0]
    3

A tuple can be sliced, generating a new tuple:

    >>> ([1, 2], (3, 4))[1:]
    ((3, 4),)

However, the items of a tuple cannot be reassigned:

    >>> a = (['hey', '1'], '')
    >>> a[0] = 0
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment

But the lists within them can be changed, since lists are mutable:

    >>> a = (['hey', '1'], '')
    >>> a[0][0] = 'wow'
    >>> a
    (['wow', '1'], '')

Checking whether an item is in a tuple can be done using the in operator:

    >>> 1 in (2, 3)
    False

Finally, it’s common practice to use tuples to perform multiple assignments, also known as unpacking:

    >>> a, b = 1, 2
    >>> a+b
    3

Dictionaries

Dictionaries are mutable mappings that connect a key with a value. The key must be unique, whereas the value need not be. I like to use a phonebook analogy when I think about dictionaries. Every phone number (key) has but one entry (value) associated with it, usually a person; however, one person (value) can have several phones (keys).

Values can be of any data type. Keys can be most data types, with the exception of mutable ones (e.g., a list or another dictionary).

There are several ways to create a dictionary: using the dict() function with a sequence of (key, value) tuples or using the curly braces ({}) with colons separating keys and values:

    >>> dict((('split', 8), ('second', 1)))
    {'second': 1, 'split': 8}
    >>> {'split': 8, 'second': 1}
    {'second': 1, 'split': 8}

There are many parentheses in the first expression: the outermost are the parentheses for the function dict(), the innermost are specific tuple pairs, and the ones in between denote a tuple of tuples, because dict() can only accept one argument.
A more readable approach would be to pass dict() a list of tuples, instead of a tuple of tuples:

    >>> dict([('split', 8), ('second', 1)])
    {'second': 1, 'split': 8}

Retrieving values from a dictionary is achieved using brackets:

    >>> D = dict([('split', 8), ('second', 1)])
    >>> D['split']
    8

Checking for membership in a dictionary is done using the in operator, which defaults to checking against the keys of the dictionary, not the values. If you wish to check against the values, use the values() method:

    >>> D = dict([('split', 8), ('second', 1)])
    >>> 'split' in D
    True
    >>> 8 in D
    False
    >>> 8 in D.values()
    True

Changing values and assigning new values is done using brackets as well:

    >>> D = dict([('split', 8), ('second', 1)])
    >>> D['python'] = 'snake'
    >>> D
    {'python': 'snake', 'second': 1, 'split': 8}
    >>> D['python'] = 'programming language'
    >>> D
    {'python': 'programming language', 'second': 1, 'split': 8}

In the preceding example, the second assignment to the key 'python' has overwritten the previous value, 'snake', with the value 'programming language'.

If you think about it, real-world dictionaries may have several entries for one key: the word “Python” can mean the Python snake or the Python programming language. This behavior can be mimicked in Python dictionaries as well; simply have the value contain a list:

    >>> D = dict()
    >>> D['python'] = ['snake', 'programming language']
    >>> D
    {'python': ['snake', 'programming language']}

Dictionaries are implemented using a hashing algorithm. This means that retrieving a value from a key is extremely efficient. There’s a lot of information regarding hashing algorithms and hashing functions on the Internet, so look that up if you’re interested in knowing how they work. There’s also good discussion on specific Python dictionary implementation in the Python Cookbook (see “Final Notes and References”). Used properly, a dictionary can simplify your code and make it a lot more efficient.
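To make the “several entries for one key” idea concrete, here is a small sketch that grows the list of meanings one entry at a time. The helper add_meaning() and the dictionary name meanings are hypothetical names chosen for this illustration, not something defined elsewhere in the book; the code is written to run on Python 3 as well.

```python
# Maps a word to the list of its meanings.
meanings = {}

def add_meaning(word, meaning):
    """Record one more meaning for word (hypothetical helper)."""
    if word in meanings:              # membership is checked against the keys
        meanings[word].append(meaning)
    else:
        meanings[word] = [meaning]    # first meaning: start a new list

add_meaning('python', 'snake')
add_meaning('python', 'programming language')
add_meaning('split', 8)

print(meanings['python'])
print('python' in meanings, 'snake' in meanings)
```

The last line shows again that in looks at keys only: 'snake' is stored inside a value, so it is not found.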
In Chapter 4, I present an example of using a dictionary to locate duplicate files on a hard drive.

Table 3-4 lists dictionary member functions. In the examples in the table, assume D is {'second': 1, 'split': 8}.

Table 3-4. Dictionary Methods

Functions

clear()
    Description: Removes all items from the dictionary.
    Example: D.clear() changes D to {}.

copy()
    Description: Returns a shallow copy of D (see the “Variables” section later in the chapter).
    Example: D2 = D.copy().

fromkeys(K[, v])
    Description: Creates a dictionary from keys K. If v is provided, all values are set to v.
    Example: {}.fromkeys(['split', 'second'], 8) returns {'second': 8, 'split': 8}.

get(k[, def])
    Description: Returns the value associated with key k. If k is not in the dictionary, this method returns def if provided.
    Example: D.get('first', 1) returns 1.

has_key(k)
    Description: Returns True if k is a key.
    Example: D.has_key('na') returns False.

items()
    Description: Returns key-value tuples. In a sense this is the opposite of dict().
    Example: D.items() returns [('second', 1), ('split', 8)].

keys()
    Description: Returns the list of keys.
    Example: D.keys() returns ['second', 'split'].

pop(k[, def])
    Description: Returns the value associated with key k and removes it from the dictionary. If k is not in the dictionary, this method returns def if provided; otherwise, it raises an exception.
    Example: D.pop('split') returns 8 and changes D to {'second': 1}.

popitem()
    Description: Returns an arbitrary key-value tuple and removes the pair from the dictionary.
    Example: D.popitem() returns ('second', 1) and changes D to {'split': 8}.

setdefault(k[, def])
    Description: Returns the value associated with the key k. If k is not in the dictionary, this method returns def if provided and sets D[k] to def.
    Example: D.setdefault('hey', 6) returns 6 and changes D to {'second': 1, 'split': 8, 'hey': 6}.

update(e)
    Description: Updates the dictionary with data from dictionary e.
    Example: See the upcoming example in this section.

values()
    Description: Returns the list of values.
    Example: D.values() returns [1, 8].

Iterators

Iterators will be discussed later in the chapter.
For reference purposes, I’ve listed dictionary iterator methods in this table.

iteritems()
    Description: Returns an iterator holding key-value pairs.

iterkeys()
    Description: Returns an iterator holding the dictionary keys.

itervalues()
    Description: Returns an iterator holding the dictionary values.

While most of these member functions are easy to follow (with the exception of iterators, which we’ll soon get to), I’d like to talk about two member functions that I feel require more explanation: update() and get().

The method update() updates the dictionary with key-value pairs from another dictionary. For ease of discussion, I’ll refer to the function call D1.update(D2). In case a key exists in both dictionaries D1 and D2, the value associated with the key in the dictionary D1 is updated with the value from dictionary D2. If a key from D2 does not exist in D1, it is added to D1 along with its value. The following illustrates this behavior:

    >>> D1 = {'second': 1, 'split': 8}
    >>> D2 = {'second': 3, 'hey': 7}
    >>> D1.update(D2)
    >>> D1
    {'second': 3, 'split': 8, 'hey': 7}

The value associated with the key 'second' was updated, and the key-value pair 'hey': 7 was added.

The next member function I want to talk about is get(). At first, this seems rather odd; how is get() different from simply accessing the key using brackets? The difference is that if you use brackets and the key is not in the dictionary, a KeyError exception is raised. The method get() lets you look up a key without risking an exception: if the key is missing, it returns a default value instead. A good way to show how this is useful is perhaps with an example.

Consider the function list_gps_commands() presented in Chapter 1 (I’ve removed the docstring), shown here in Listing 3-1.

Listing 3-1.
Function list_gps_commands()

    def list_gps_commands(data):
        gps_cmds = dict()
        for row in data:
            try:
                gps_cmds[row[0]] += 1
            except KeyError:
                gps_cmds[row[0]] = 1
        return gps_cmds

To further illustrate the example, let’s build a short list of GPS commands (L) to later tally in a dictionary so you can try the example for yourself. First, we execute a set of commands similar to those detailed in the function list_gps_commands():

    >>> L = ['$GPGSA', '$GPGSV', '$GPGSV', '$GPGSV', '$GPRMC', '$GPGGA']
    >>> D1 = dict()
    >>> for elem in L:
    ...     try:
    ...         D1[elem] += 1
    ...     except KeyError:
    ...         D1[elem] = 1
    ...
    >>> D1
    {'$GPGSA': 1, '$GPGSV': 3, '$GPGGA': 1, '$GPRMC': 1}

The approach is simple. We first try to access a key in the dictionary. If the key exists, we increment the count. If the key doesn’t exist, an exception is raised, which means it’s a new entry, so we set it to 1.

A second approach is to check whether a key exists in a dictionary using the in operator inside an if statement, as follows:

    >>> L = ['$GPGSA', '$GPGSV', '$GPGSV', '$GPGSV', '$GPRMC', '$GPGGA']
    >>> D2 = dict()
    >>> for elem in L:
    ...     if elem in D2:
    ...         D2[elem] += 1
    ...     else:
    ...         D2[elem] = 1
    ...
    >>> D2
    {'$GPGGA': 1, '$GPGSA': 1, '$GPGSV': 3, '$GPRMC': 1}

It’s also possible to use the has_key() member function in a similar manner.

A much more elegant approach would be to use the get() method with a default value of 0:

    >>> L = ['$GPGSA', '$GPGSV', '$GPGSV', '$GPGSV', '$GPRMC', '$GPGGA']
    >>> D3 = dict()
    >>> for elem in L:
    ...     D3[elem] = D3.get(elem, 0) + 1
    ...
    >>> D3
    {'$GPGSA': 1, '$GPGSV': 3, '$GPGGA': 1, '$GPRMC': 1}

I chose the first approach in Chapter 1 because I think it’s clearer to those unfamiliar with the language. However, the last approach presented here is a clear winner in my mind.

Sets

Our last data structure for now will be a set. Sets are unordered collections of unique items.
To create a set, use the set() function:

    >>> set(['split', 'second'])
    set(['second', 'split'])
    >>> set(['split', 'second']*8)
    set(['second', 'split'])

If you pass a duplicate to the set() function, it will not be added to the set. This is shown in the second statement where a list multiplied by 8 is passed as an argument. In a sense, you’ve already been introduced to sets: the keys in a dictionary form a set since they are unique items.

Set operations are a bit different from those of the sequences you’ve seen. They are derived from mathematical set operations and include intersection, union, and difference, to name a few:

    >>> S1 = set(['split', 'second'])
    >>> S2 = set(['split', 8])
    >>> S1 | S2
    set([8, 'second', 'split'])
    >>> S1.union(S2)
    set([8, 'second', 'split'])
    >>> S1 & S2
    set(['split'])
    >>> S1 - S2
    set(['second'])
    >>> S1.difference(S2)
    set(['second'])
    >>> S2.difference(S1)
    set([8])

The operator | is equivalent to the member function union(). The operator & is equivalent to the member function intersection(). The operator - is equivalent to the member function difference(), and much like regular subtraction, the order is important: S1-S2 is different from S2-S1.

Table 3-5 lists some set functions. In the examples, assume S1 equals set([8, 'hey']).

Table 3-5. Set Methods

add(obj)
    Description: Adds obj to the set.
    Example: S1.add(9) changes S1 to set([8, 9, 'hey']).

clear()
    Description: Removes all elements from the set.
    Example: S1.clear() changes S1 to set([]).

copy()
    Description: Returns a shallow copy of S1 (see a discussion of shallow copy in the “Variables” section later in the chapter).
    Example: S2 = S1.copy().

difference(S2)
    Description: Returns the difference of two sets. This is equivalent to S1-S2.
    Example: S1.difference(set([8])) returns set(['hey']).

difference_update(S2)
    Description: Similar to difference() but modifies the set (not merely returns a copy).
    Example: S1.difference_update(set([8])) changes S1 to set(['hey']).

discard(v)
    Description: Removes the element v from the set.
    If v is not in the set, nothing happens (no exception is raised).
    Example: S1.discard(8) changes S1 to set(['hey']).

intersection(S2)
    Description: Returns the intersection of S1 and S2. This is equivalent to S1&S2.
    Example: S1.intersection(['hey']) returns set(['hey']).

intersection_update(S2)
    Description: Similar to intersection() but modifies the set (not merely returns a copy).
    Example: S1.intersection_update(['hey']) changes S1 to set(['hey']).

issubset(S2)
    Description: Returns True if S1 is a subset of S2 (all elements of S1 appear in S2).
    Example: S1.issubset(set(['hey', 8, 'na'])) returns True.

issuperset(S2)
    Description: Returns True if S1 is a superset of S2 (all elements of S2 appear in S1).
    Example: S1.issuperset(set([8])) returns True.

pop()
    Description: Returns an arbitrary element and removes it from the set.
    Example: S1.pop() returns 8 and changes S1 to set(['hey']).

remove(val)
    Description: Removes val from the set. If val is not in the set, this method raises an exception.
    Example: S1.remove('hey') changes S1 to set([8]).

symmetric_difference(S2)
    Description: Returns the symmetric difference. This is equivalent to (S1-S2)|(S2-S1).
    Example: S2 = set(['jude', 'hey']). S1.symmetric_difference(S2) returns set([8, 'jude']).

symmetric_difference_update(S2)
    Description: Similar to symmetric_difference() but modifies the set (not merely returns a copy).

union(S2)
    Description: Returns the union of S1 and S2 (all unique elements that appear in either set).
    Example: S1.union(set(['na', 8])) returns set([8, 'na', 'hey']).

update(S2)
    Description: Similar to union() but modifies the set (not merely returns a copy).
    Example: S1.update(set(['na', 8])) changes S1 to set([8, 'na', 'hey']).

I find I use sets much less than dictionaries. However, using sets at times can be quite elegant. Consider the example shown in our previous discussion about dictionaries that enumerates GPS commands. Now suppose you don’t care how many times a GPS command appears, only what types of GPS commands exist.
Then this is easily done with a set:

    >>> L = ['$GPGSA', '$GPGSV', '$GPGSV', '$GPGSV', '$GPRMC', '$GPGGA']
    >>> S = set(L)
    >>> S
    set(['$GPGSA', '$GPGSV', '$GPGGA', '$GPRMC'])

Variables

The next topic of our discussion is variables. Variables in Python are similar to variables in most other programming languages. Variable names can consist of letters, digits, and underscores, but they have to start with a letter or an underscore and must not contain spaces. I recommend you avoid odd variable names such as _02 (which is a legitimate variable name) as it might lead to some confusing code. Consider _02=3; that just doesn’t look right.

An important concept regarding variables of data structures in Python is that of binding. When you assign variable b to be equal to variable a, which we’ll suppose is a list, Python does not copy the contents of a to b. Rather, it sets both a and b to refer to the same object. This is to achieve speed and performance.

    >>> a = [1, 2]
    >>> b = a
    >>> b[0] = 'hey'
    >>> a
    ['hey', 2]
    >>> b
    ['hey', 2]

In case you do want a real copy of the data structure, and not merely another reference, you have several options:

• Some data structures provide the copy() method, such as dictionaries.
• In some cases, you can create another item using the constructor, for example, L2=list(L1).
• You can use the copy module from the standard library:

    >>> import copy
    >>> a = [1, 2]
    >>> b = copy.copy(a)
    >>> b[0] = 0
    >>> a, b
    ([1, 2], [0, 2])

Note: In case a variable is a more complex structure (e.g., a list of rows), it’s not enough to use copy.copy(), as the newly constructed list still points to the rows in the original list. In this case, you might want to use copy.deepcopy() instead. For more information about shallow copy, deep copy, and lazy copy, see http://en.wikipedia.org/wiki/Object_copy.

Statements

We now turn to Python statements. You’ve already seen the use of statements, but here I’ll cover more ground by talking about statements I haven’t discussed yet.
Python is a rich language that keeps evolving, so I will not be covering the entire language here. But the statements I cover should be enough to get you going.

I’ve split the discussion into three statement categories: printing, user input, and flow control. We’ll have some off-track discussions about comments, iterators, and list comprehensions as well.

Printing

One of the basic statements in most programming languages is the print statement. You can use print to display Python objects:

    >>> print(2**100)
    1267650600228229401496703205376
    >>> print(1+1j)
    (1+1j)
    >>> print(0x20)
    32
    >>> print "String"
    String
    >>> print(['short list'])
    ['short list']
    >>> print(('a', 'tuple'))
    ('a', 'tuple')
    >>> print dict([('hey', 'jude'), (8, 1)])
    {8: 1, 'hey': 'jude'}
    >>> print set([1, 2, 1])
    set([1, 2])

Tip: The function pprint from module pprint provides an alternative to the print statement, one that formats the output in a “prettier” fashion, such as avoiding word breaks. This is especially useful if you’re displaying large data structures. To use it, import pprint and issue the command pprint.pprint(object).

Suppressing Line Breaks

If you follow a print command with a comma, the next print statement will continue on the same line after printing a space:

    >>> for i in [1, 2, 3]:
    ...     print i
    ...
    1
    2
    3
    >>> for i in [1, 2, 3]:
    ...     print i,
    ...
    1 2 3

Format Specifications

The print statement is similar to C’s printf() function in that it accepts format specifications in the form %[flags][w][.pre]type. Other than the % and type fields, all parameters are optional. The simplest use of the format specifications is with the % operator, as follows:

    >>> print "%d" % 2**4
    16

If more than one specifier is present, provide a tuple after the % operator:

    >>> print "%d:%s=%d" % (1, 'hey', 8)
    1:hey=8

The operator % is present after the string to be printed and before the tuple containing the values to be formatted.
Note: The function printf() (on which print’s format specifications are based) is a complex function with a considerable number of options and parameters. This section is quite detailed and should provide most of your daily programming needs. However, should you wish to explore print and printf() some more, a good source of information is the printf() manual page (also known as the man page). In any Linux (or Cygwin) prompt, enter man 3 printf for an accurate overview. This is C-level documentation, but C programming skills are not required.

There are several values type can have, but only one is allowed in each specification (e.g., the format specifier %sd will be interpreted as a string, followed by the character 'd'). Table 3-6 provides a distilled list of types.

Table 3-6. Print Format Specification Types

d
    Integer

e, E
    Exponential (scientific) notation of a floating-point number with e or E, respectively (mantissa and exponent are always present)

f
    Floating-point number

g
    Floating-point number in either f or e form, omitting trailing zeros and the decimal point if it’s not needed

o
    Octal

s
    String

x, X
    Hexadecimal (lowercase), hexadecimal (uppercase)

Note: Starting from Python 3.0, print becomes a function and not a statement, and to use print you’ll have to add parentheses: print(obj).

We now turn to optional flags in the format specifier.

The value flags can take several of the following values: 1) a number, specifying the width of the field in characters (the value is right-aligned within it by default), 2) the character +, specifying that in case of a numeric value, the sign must be present (either + or -), 3) the character -, specifying that the text should be left-aligned, 4) the character #, which modifies behavior of some numeric types (out of the scope of this discussion; refer to the documentation), and 5) the character 0, used to left-pad values with zeros. Here are some examples:

    >>> print "%d" % 2
    2
    >>> print "%5d" % 2
        2
    >>> print "%+5d" % 2
       +2
    >>> print "%-+5d**" % 2
    +2   **
    >>> print "%05d" % 2
    00002
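Because padding spaces are hard to see on the printed page, it can help to build the formatted strings first and then display them with repr(). The sketch below is an added illustration rather than part of the original listing; the variable names are invented, and the % operator on strings behaves the same way on Python 3, which is what the parenthesized print calls assume.

```python
value = 2

plain = "%d" % value        # no flags
wide = "%5d" % value        # width 5, right-aligned by default
signed = "%+5d" % value     # sign always shown
left = "%-+5d**" % value    # left-aligned within the five-character field
zeros = "%05d" % value      # padded on the left with zeros

for label, text in [("plain", plain), ("wide", wide), ("signed", signed),
                    ("left", left), ("zeros", zeros)]:
    # repr() wraps the text in quotes, making the padding visible
    print("%-7s %r" % (label, text))
```

The loop itself uses a format specification, %-7s, to left-align the labels in a seven-character column.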
The value w specifies minimum width. If the width of the object to print is less than w, the output is left-padded with spaces. If it is greater than w, the value is displayed as is:

    >>> print "%10s" % 'Really long string'
    Really long string
    >>> print "%10s" % 'shorter'
       shorter

The value pre is preceded with a dot and specifies the number of digits after the decimal point in floating-point numbers, the maximum number of characters to print in a string, or the minimum number of digits in integers:

    >>> print "%.2f, %.3s, %.4d" % (1.0/3, 'this will be truncated', 1)
    0.33, thi, 0001

You can mix and match format specifiers. Here’s a print statement that makes use of several format specifiers:

    >>> print "%+08.3f" % (1.0/9)
    +000.111

The + character forces the sign to appear in the output, the digit 0 takes care of the zero padding, the digit 8 forces the output to be at least eight characters long (the plus symbol, three digits, the dot symbol, and three more digits), the dot followed by 3 ensures three digits are displayed after the decimal point, and lastly the character f announces that this is a floating-point number.

Employing print in this manner is especially useful when you want to create text output that’s properly aligned and can be displayed in a report.

Format specifiers, with the use of the % operator, can also be used to format strings, not only print them:

    >>> s = "%+08.4f" % (1.0/3)
    >>> s
    '+00.3333'

User Input

We complement our output (printing) discussion with some input discussion, specifically, user input. Other sorts of input, for example, files and command-line parameters, will be discussed in future chapters.

User input in Python is done using the raw_input([prompt]) function. The function prints the prompt string, reads a string from the standard input, and returns it, stripped of end-of-line characters.
The prompt argument is optional:

    >>> s = raw_input("How many times? ")
    How many times? 7
    >>> print "split" * s
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't multiply sequence by non-int of type 'str'
    >>> print "split" * int(s)
    splitsplitsplitsplitsplitsplitsplit

The function raw_input() returns a string, thus even though I’ve input a numeric value, the function returns the string '7'. I’ve converted the string to a number using the int() function.

In Windows, it’s common to see raw_input() at the end of a script. This ensures that the command window stays open, waiting for user input and displaying the results of running the script. The default behavior in Windows is that this box is automatically closed, preventing the user from reading the output, and so raw_input() overrides this behavior.

Comments

Comments start at the symbol #, provided it’s not part of a string:

    >>> print "Some text" # This is a comment
    Some text
    >>> # This entire line is a comment
    ...
    >>> print "Text after this sign # is not a comment"
    Text after this sign # is not a comment

Flow Control

Flow control statements control the behavior of a script. Python provides several flow control statements, some similar to other programming languages. Typically, a flow control statement is followed by a block, which is indented relative to the statement.

if, elif, else

The if statement follows this syntax:

    if Condition1:
        Block1
    elif Condition2:
        Block2
    elif Condition3:
        Block3
    . . .
    else:
        ElseBlock

Behavior is as follows: if Condition1 evaluates to True, the code in Block1 is executed. Block1 can be more than one line long and must be indented to the same level. If Condition1 is False, Condition2 is evaluated, causing Block2 to be executed if it is True. This continues on to Block3, and so forth. If none of the conditions are met, the ElseBlock is executed.

The statements if, elif, and else should be left-aligned with one another. Statements in each block should be left-aligned as well, but further in than the if clause.
The colon after the if, elif, and else statements is required. Here’s an example:

    >>> if 3>10:
    ...     print "Checked whether 3 is greater than 10"
    ...     print "It is"
    ... elif ord('A')==65:
    ...     print "Ordinal of 'A' is 65"
    ... else:
    ...     print "All failed, nothing works"
    ...
    Ordinal of 'A' is 65

Other than the if statement, all other statements (elif, else) are optional. In case of a short if statement, you can write the block on the same line as the if statement:

    >>> if 's'<'t': print "Yeap"
    ...
    Yeap

Conditions can be more complex and can include logical operators such as and and or:

    >>> x = 25
    >>> if x>20 and x%2==1:
    ...     print "Odd *and* over 20"
    ...
    Odd *and* over 20

The pass Statement

The pass statement does nothing, and can be used as a placeholder, for example, in an if statement with several branches:

    >>> x = 0.2
    >>> if x<0.1:
    ...     print "Too small"
    ... elif x<0.3:
    ...     pass
    ... elif x>0.5:
    ...     print "large"
    ... else:
    ...     print "huge"
    ...
    >>>

As you can see, nothing happened, which is exactly what I wanted.

Exceptions: try, except, and finally

Exceptions are Python’s mechanism of dealing with runtime issues. You’ve already seen exceptions reported and also how to catch them, that is, prevent them from halting program execution, in Chapter 1.

You can catch, or intercept, exceptions before they stop program execution with the following syntax:

    try:
        TryBlock
    except [ExceptionType1]:
        ExceptBlock1
    except [ExceptionType2]:
        ExceptBlock2
    finally:
        FinallyBlock

If an exception happens someplace inside the TryBlock, ExceptBlock1 is executed. In case ExceptionType1 is specified, only exceptions that are of type ExceptionType1 are caught. You can have several except clauses to deal with different types of exceptions. The FinallyBlock is optional and is executed after the try and except sections have completed execution.
First, let’s see an exception in action, without catching it:

    >>> str = 'second'
    >>> n = '7'
    >>> print str*n
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't multiply sequence by non-int of type 'str'

The reason for this exception is that the operator * doesn’t know how to multiply 'second' by '7' (it does know how to do 'second'*7, but that’s a different statement). As you can see, the exception that was raised was a TypeError exception. Let’s catch it and print it:

    >>> str = 'second'
    >>> n = '7'
    >>> try:
    ...     print str*n
    ... except TypeError, e:
    ...     print "Exception caught!"
    ...     print e
    ... finally:
    ...     print "This will be run regardless"
    ...
    Exception caught!
    can't multiply sequence by non-int of type 'str'
    This will be run regardless

We’ve caught the exception in the except block, plus we printed what the exception was in the second print line. Lastly, the code in the finally block was executed.

Let’s run it again, this time without triggering an exception:

    >>> str = 'second'
    >>> n = 7
    >>> try:
    ...     print str*n
    ... except TypeError, e:
    ...     print "Exception caught!"
    ...     print e
    ... finally:
    ...     print "This will be run regardless"
    ...
    secondsecondsecondsecondsecondsecondsecond
    This will be run regardless

As you can see, the code in the finally block was executed regardless of whether the exception was raised or not.

Now let’s trigger an exception that’s not a TypeError exception. I’ll modify the line print str*n to print 1/0, which raises a different exception:

    >>> try:
    ...     print 1/0
    ... except TypeError, e:
    ...     print "Exception caught!", e
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    ZeroDivisionError: integer division or modulo by zero

This time, the exception wasn’t caught by the code (it didn’t print “Exception caught!”) and was handled by the interpreter because it wasn’t of type TypeError.
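Since several except clauses can follow one try, both of the exceptions seen so far can be handled in a single place. The sketch below is an added illustration, not code from the book: describe() is a hypothetical helper, and it uses the except ... as e spelling (available from Python 2.6 and required in Python 3) rather than the except TypeError, e form shown above.

```python
def describe(operation):
    """Call operation() and report what happened (hypothetical helper)."""
    try:
        return "result: %s" % operation()
    except TypeError as e:
        # Reached only for TypeError, e.g. 'second' * '7'
        return "TypeError caught: %s" % e
    except ZeroDivisionError as e:
        # Reached only for division by zero, e.g. 1/0
        return "ZeroDivisionError caught: %s" % e

print(describe(lambda: 'ab' * 2))
print(describe(lambda: 'second' * '7'))
print(describe(lambda: 1 / 0))
```

Only the first matching except clause runs; an exception of any other type would still propagate to the caller.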
If you don’t specify an exception condition, all exceptions are caught:

>>> try:
...     print 1/0
... except:
...     print "Exception caught!"
...
Exception caught!

As a general rule, try to make your exception specific, that is, try to specify the exception condition. If the list of exceptions is too long, maybe wide-range exception catching (i.e., without a condition) is a better approach.

Exceptions are a fundamental part of flow control. The EAFP concept is built around the idea that it’s at times simpler to just try to perform an operation, later catching the exception in case of an issue.

Exceptions can occur deep within your code. For instance, say function1() calls function2(), which calls function3(). Now let’s suppose an exception occurred in function3(). In case function3() doesn’t handle the exception with the try/except mechanism, the exception moves to function2(). If function2() doesn’t handle the exception, function1() has a chance. And finally, if function1() doesn’t handle the exception, the interpreter will issue an exception and print the cause.

In the preceding scenario, in case function3() does handle the exception, it will not resurface in function2(). However, if you wish to catch an exception and pass it to the calling function, you can do that. That’s left out of the scope of this discussion; refer to the online documentation for more details at http://docs.python.org/reference/executionmodel.html under the section Exceptions.

You can also raise exceptions of your own. This is of value if you write code and want to ensure it’s being used properly.
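The two ideas above, an exception travelling up a chain of calls and an exception you raise yourself, can be sketched together. This is a Python 3 illustration only; the class EmptyDataError and the bodies of the three functions are assumptions invented for the example.

```python
class EmptyDataError(ValueError):
    """Hypothetical user-defined exception, named for this sketch only."""

def function3(values):
    if not values:
        raise EmptyDataError("no values to average")
    return sum(values) / len(values)

def function2(values):
    # No try/except here: an exception raised in function3() passes through
    return round(function3(values), 1)

def function1(values):
    try:
        return function2(values)
    except EmptyDataError as e:
        # The exception surfaces here, two calls above where it was raised
        return "function1 handled: %s" % e

print(function1([1, 2, 3]))
print(function1([]))
```

Had function1() not caught the exception either, the interpreter would have printed a traceback listing all three functions.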
Suppose your algorithm only works on odd numbers; a good approach would be to check whether a parameter passed to the algorithm is odd, and if not, raise an exception:

>>> n=6
>>> if not n%2==1:
...     raise ValueError, "value must be odd"
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: value must be odd

In the preceding example, I’ve used an existing exception, ValueError. You can create exceptions of your own or use existing exceptions. For more details and a list of existing exceptions, refer to Python’s online documentation: http://docs.python.org/library/exceptions.html.

Iterators

Before we move to the for statement, I’d like to cover an important concept, iterators. Iterators are objects that return one element at a time, instead of returning a full sequence. An object that can be iterated over is known as iterable. Using iterators is more memory efficient than using a sequence. For example, the function range(1000) creates a list of a thousand values, whereas xrange(1000) creates an iterator object that consumes much less memory: it yields the values from zero up to, but excluding, 1000, one at a time.

Python relies heavily on iterators and provides a great number of iterators that work on the data structures I’ve covered. Iterators are best understood in the context of the for statement, so let’s now take a look at this statement.

The for Statement

The for statement is one of the most versatile statements in Python. The statement follows this syntax:

for element in sequence:
    ForBlock

In case of a one-line block, the ForBlock can appear on the same line as the for statement. Indentation rules for blocks are the same as those described in the if statement (and for any block, for that matter: the lines of a block must be indented to the same level).

The for statement assigns element to be a value from sequence and executes the ForBlock.
This happens for all the values in sequence:

>>> for elem in ['hey','jude',8]:
...     print elem,
...
hey jude 8

If you’re interested in a format similar to that of C’s for loop, use the range() function:

>>> for x in range(10):
...     print x,
...
0 1 2 3 4 5 6 7 8 9

The for statement can also operate on an iterator. The function xrange() creates an iterator object, whereas the function range() creates a list. Both can be used in the context of a for statement:

>>> for x in range(5):
...     print
...     for y in xrange(5):
...         print "%4d" % (x*5+y),
...

   0    1    2    3    4
   5    6    7    8    9
  10   11   12   13   14
  15   16   17   18   19
  20   21   22   23   24

In the preceding example, I’ve used both xrange() and range(), effectively yielding the same result. Also, as the preceding code suggests, for loops can be nested.

The for statement shines in the context of iterators. Let’s cover a few. The reversed(seq) iterator returns one element at a time from a sequence in reversed order:

>>> for x in reversed(['split','second']): print x,
...
second split

The iterator enumerate(seq) returns both the index to the item in the sequence and the item, as a tuple:

>>> for i,elem in enumerate(['split',8,'second']): print i,"-->",elem
...
0 --> split
1 --> 8
2 --> second

Some data structures provide iterators themselves. The iterator iteritems() returns a (key, value) tuple and is used to iterate over items in a dictionary:

>>> d={'split':8,'second':1}
>>> for k,v in d.iteritems(): print k,"-->",v
...
second --> 1
split --> 8

List Comprehensions

List comprehensions are a topic I’ve postponed until after we talked about the for statement. They really do apply to lists, but they’re rather hard to explain unless you understand for statements. List comprehensions are an efficient method to create lists from lists, but with a slightly different notation than a regular for loop.
List comprehensions follow this syntax:

[f(x) for x in list if condition]

The condition clause is optional:

>>> [x*x for x in range(10) if x>5]
[36, 49, 64, 81]
>>> [x**2 for x in range(6,10)]
[36, 49, 64, 81]

You can also write a nested list comprehension, similar to nested for loops:

>>> [(x,y) for x in range(3) for y in range(3)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

You’ll encounter numerous uses of list comprehensions throughout the book.

The while Statement

The while statement complements for loops and is best used in case a condition has to occur before the loop is terminated. You’ve seen the while statement in use in Chapter 1, where it allows recording of GPS data until Ctrl+C is pressed, and also previously in this chapter. The while syntax is as follows:

while condition:
    WhileBlock

As long as condition evaluates to True, the WhileBlock is executed:

>>> import random
>>> while random.random()<0.9: print "*",
...
* * * * * * * * * * * * * * * * * * * * * * * * * * *

This example will print a star as long as a random number between 0 and 1 is less than 0.9. I’ve used the function random() from module random (see Chapter 7).

Statements break and continue

The statements break and continue are used to modify behavior within a loop or a block. The statement break exits a flow control block, and the statement continue stops execution of the block but picks up on the next iteration.

>>> for x in range(5):
...     if x==3: break
...     print x,
...
0 1 2
>>> for x in range(5):
...     if x==3: continue
...     print x,
...
0 1 2 4

In the first for statement, I’ve used the statement break when x is equal to 3, effectively terminating the for loop. In the second for statement, I’ve merely skipped execution of the block in case x is equal to 3, suppressing the print statement, but resuming on the next value.

Some Built-in Functions

Let’s now turn to built-in Python functions that weren’t covered in any of the previous sections. By built-in, I mean functions that do not require any import command prior to using them.
Table 3-7 presents these functions, in alphabetical order.

Table 3-7. Some Python Built-in Functions

Function | Description | Example
all(s) | Returns True if all elements of s are not False. | all(['hi',2]) returns True. all(['',2]) returns False.
any(s) | Returns True if some elements of s are True. | any(['',2]) returns True. any([]) returns False.
chr(n) | Returns the character whose ASCII value is n. | chr(65) returns 'A'.
cmp(x,y) | Returns -1 if x<y, 0 if x==y, and 1 if x>y. | cmp('a','bc') returns -1. cmp(2,1) returns 1.
ord(ch) | Returns the ordinal value of ch. This is the inverse of chr(n). | ord('A') returns 65. ord(chr(80)) returns 80.
range([i,]j[,k]) | Returns a list starting at i (if supplied; default is zero), ending right before j, with an increment step of k (if supplied; default is 1). | range(5) returns [0,1,2,3,4]. range(2,5) returns [2,3,4]. range(2,5,2) returns [2,4]. range(5,2,-2) returns [5,3].
sorted(s) | Returns sequence s, sorted. | sorted('hey') returns ['e','h','y'].
sum(s) | Returns sum of elements in s. | sum(range(10)) returns 45.
type(obj) | Returns the type of obj. | type(1j) returns <type 'complex'>.
zip(s1[,s2]) | Returns a list of tuples, each composed of elements at the same location in the sequences. s2 is optional. | zip(range(2),['hey','jude']) returns [(0,'hey'),(1,'jude')]. zip(range(2)) returns [(0,),(1,)].

Some of these functions are very useful. For example, have a look at the Newton fractal example in Chapter 7 for an interesting use of the zip() function.

Defining Functions

Functions are a convenient way to reuse code. Functions in Python are similar to procedures, subroutines, and functions in other programming languages. There’s no distinction between a function that returns a value and a function that does not; both are considered functions. (In some programming languages, if a function doesn’t return a value, it is named differently: procedure or subroutine, for example.)
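As a small illustration of that last point, a function with no return statement can still be called in an expression; it simply evaluates to None. The sketch below assumes Python 3, and both function names are made up for the example.

```python
def log_message(text):
    """Print text; there is no return statement in this function."""
    print("log:", text)

def double(x):
    """Return twice the value of x."""
    return 2 * x

result = log_message("hello")   # the call works like any other function call
print(result is None)           # a function without return yields None
print(double(4))
```

Both are defined and called the same way; only the value handed back differs.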
Functions are declared as follows:

def funcname(arguments):
    FunctionBody

The keyword def defines the start of a function. The name of the function is funcname; arguments are optional:

>>> def f1():
...     print "F1"
...
>>> f1()
F1
>>> def f2(n):
...     print "F2"*n
...
>>> f2(10)
F2F2F2F2F2F2F2F2F2F2

I’ve defined two functions: f1() and f2(). Function f1() requires no parameters, while function f2() requires one parameter. Using the functions (calling them) requires the addition of a set of parentheses.

You can also specify optional parameters using an assignment in the list of arguments in the function definition, as follows:

>>> def f3(n,s="F3"):
...     print s*n
...
>>> f3(2)
F3F3
>>> f3(2,'F4')
F4F4

In the first call to f3(), s takes its default value, "F3". In the second call, s is assigned the string 'F4'.

Functions can return values using the return statement:

>>> def f5(n):
...     return "f5"*n
...
>>> f5(3)
'f5f5f5'
>>> a=f5(3)
>>> a
'f5f5f5'

The return statement doesn’t necessarily have to appear at the end of the function; however, the function ends execution when it reaches a return.

Functions are typically documented with docstrings (the triple-quoted string right after the def line in the following code):

>>> def f6(n=1,s="f6"):
...     """Returns a string composed of the string s, repeated n times.
...
...     n and s are both optional."""
...     return s*n
...
>>> help(f6)
Help on function f6 in module __main__:

f6(n=1, s='f6')
    Returns a string composed of the string s, repeated n times.

    n and s are both optional.

>>> f6()
'f6'
>>> f6(2,'f7')
'f7f7'

The benefit of using a docstring immediately after the function declaration is that executing help(funcname) returns the docstring, which is an excellent way to document a function.

Generators

Generators are functions used to create iterators.
The main difference between a generator and a regular function is that generators return one element at a time using the yield statement, while functions return one element using the return statement (it could be a sequence or tuple, but it’s essentially one object).

>>> def odd(s):
...     """A generator function to iterate through odd elements of s."""
...     i=0
...     while (i<len(s)):
...         yield s[i]
...         i+=2
...
>>> for i in odd(['hey','split','second',8]):
...     print i,
...
hey second

In the preceding example, I’ve defined an iterator named odd() that yields the odd elements in a list (i.e., the first, third, fifth, and so forth). I’ve implemented the iterator using a while loop and proper indexing. There are also other methods I could’ve used to implement the iterator, but it’s important to understand that the motivation behind using an iterator is that of efficiency. A different implementation could be one that makes use of the indexing operator with a step value of 2, as follows:

>>> def odd(s):
...     """A generator function to iterate through odd elements of s"""
...     for elem in s[::2]:
...         yield elem

While this might look like more elegant code, in my mind it’s not as good. The reason is that the slice s[::2] in the for loop creates an entire list (albeit half the size), and in case of large lists, this is not memory efficient. The first implementation, on the other hand, is quite memory efficient. It’s also possible to implement the function odd() using a for loop instead of a while loop, in which case I would suggest using the iterator xrange() (over a list comprehension) to avoid creating additional large data structures.

Generator Expressions

Generator expressions, or genexps, are a compact method to implement simple generators. Generator expressions follow this syntax:

(f(x) for x in sequence if condition)

In a sense, they are very similar to list comprehensions, with the difference being that they are iterators and not lists, and hence are more memory efficient.
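The memory difference can be observed directly with sys.getsizeof(). This is a rough Python 3 sketch under stated assumptions: getsizeof() reports only the size of the container object itself (not the integers it refers to), and the exact byte counts vary between Python versions, so none are quoted here.

```python
import sys

squares_list = [x * x for x in range(100000)]   # all values stored at once
squares_gen = (x * x for x in range(100000))    # values produced on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(gen_size < list_size)   # the generator object is far smaller

# A generator expression can be consumed only once
total = sum(squares_gen)
print(total == sum(squares_list))
print(sum(squares_gen))       # already exhausted, so nothing is left to add
```

The single-use behavior is the price of the saving: if you need to go over the values more than once, a list is the better fit.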
Here’s an implementation of the odd() generator function using a genexp:

>>> L=['hey','split','second',8]
>>> odd=(x for x in L[::2])
>>> for i in odd:
...     print i,
...
hey second

or in one big line:

>>> L=['hey','split','second',8]
>>> for i in (x for x in L[::2]):
...     print i,
...
hey second

If I were a bit more conscious about memory usage, I’d notice that I’ve created another list in the for loop: L[::2], which probably is not a good idea (from the point of view of a memory-conscious application). A different approach is to use the xrange() iterator as follows:

>>> L=['hey','split','second',8]
>>> odd=(L[i] for i in xrange(0,len(L),2))
>>> for elem in odd:
...     print elem,
...
hey second

This might be a bit less clear, but it is a more memory-conscious implementation. Alternatively, you could also use the enumerate() iterator, iterating over list elements and only yielding an element if its zero-based index is even (which selects the first, third, fifth, and so forth). Deciding whether an index is odd or even can be done using the modulo (%) operator, which returns the remainder from dividing by a number, in our case 2:

>>> L=['hey','split','second',8]
>>> odd=(elem for i,elem in enumerate(L) if not (i%2))
>>> for elem in odd:
...     print elem,
...
hey second

Opt for using genexps over list comprehensions if you just want to iterate over items and don’t require the list itself. Unless you’re using really large data structures (on the order of scale of the memory you have in your computer), using either is fine.

Object-Oriented Programming

Per the description I’ve given of the Python language in the beginning of the chapter, you can deduce that Python is an object-oriented programming language. You’ve already seen this. For example, the data structure list, whose methods are in essence member functions, is an object.

The purpose of this section is to quickly (very quickly!) go over the syntax of object-oriented programming and to show how to implement a basic object.
The reason I won’t be covering OOP in detail is that this book mostly deals with using objects, rather than coding them. If you’d like to know more about coding an object, refer to the online Python documentation and the references at the end of this chapter.

The basic data structure to implement object-oriented programming in Python is a class. Classes have functions, called methods, and variables, called attributes. Listing 3-2 shows a simple class named Odd that implements the odd functionality, that is, retrieves odd elements.

Listing 3-2. Listing of odd.py

class Odd:
    def __init__(self,s=[]):
        self.sequence=s
    def odd(self):
        return self.sequence[::2]

The first line defines a class named Odd. From here, functions and variables indented per the usual block rules denote functions and variables belonging to class Odd.

I’ve defined two functions. The first function is the constructor __init__ (double underscores on both sides). The constructor function is called whenever an object is instantiated, or created. To instantiate a class object, call the Odd class with parentheses. Here are some ways you can instantiate the Odd class object (be sure to execute the preceding script first):

>>> odd1=Odd()
>>> odd2=Odd('a string')
>>> odd3=Odd(['hey','split','second',8])

The implementation I chose is that in case a parameter is provided, the variable self.sequence is assigned this parameter. An important note here is the use of the argument self: the word self is a convention and not a reserved word. Whenever you call a class property or method, the argument self is passed automatically but not spelled out. That is, to instantiate an Odd object, you enter Odd(s) and not Odd(self,s). By passing the argument self (hidden), Python identifies one created object from another. The analogy I like to use is that self is similar to C++’s this pointer.

Another important concept here is that of scope.
Had I not used the notation self.sequence and written sequence instead, the local variable sequence, that is, local to the function (and not the class), would have been updated. Once the function returned, that variable would have disappeared. To ensure that the class variable sequence is updated (and not the function’s local variable), I’ve used the notation self.sequence.

The second function I defined is odd(), which returns the odd elements in a sequence. To call the function, use the dot operator after the Odd object, as follows:

>>> odd3=Odd(['hey','split','second',8])
>>> odd3.odd()
['hey', 'second']

So far, I’ve only shown methods, but the class Odd also contains a variable: sequence. To access this variable, you can use the dot operator as well:

>>> odd3=Odd(['hey','split','second',8])
>>> odd3.sequence
['hey', 'split', 'second', 8]

There’s a lot more to object-oriented programming in Python, including most of the concepts that appear in other object-oriented programming languages such as inheritance and operator overloading, to name a couple. Again, the references at the end of the chapter should prove valuable resources should you need to learn more about object-oriented programming and design in Python.

Modules and Packages

One of Python’s strong suits is the extensive number of packages readily available. You’ve seen how to install packages in Chapter 2; now it’s time to see how to use them.

A module is a set of functions and data structures. In essence, it is similar to a class. Accessing modules is performed using the module’s namespace, followed by a dot to access functions and variables. Packages are collections of modules. Accessing modules within packages is performed using the dot operator.

It’s also of value to know that it’s possible to extend Python with modules from C and C++.
From a Python user’s perspective, you just import a module and use it as is, regardless of whether it was written in another programming language.

The import Statement

The import statement loads a module, effectively allowing us to access the functions and variables within the module. You can issue the import statement in several ways:

import module
import module as name
from module import function
from module import function as name
from module import *

The first method, import module, loads a module with its namespace. To access the module functions, use module.function(). The second method loads the module but renames it, so to use its functions, use name.function(). The third statement imports only one function from the module; to access it simply use its name: function(). You can have multiple functions imported in this manner by separating the functions with commas. The fourth statement is identical to the third, only the name of the function is now name; to call the function, enter name(). Lastly, the last import statement loads all functions from a module; to access the functions you enter their name (without the module name). Here are some examples:

>>> import math
>>> math.pi
3.1415926535897931
>>> math.sqrt(4)
2.0
>>> import math as m
>>> m.pi
3.1415926535897931
>>> m.sqrt(4)
2.0
>>> from math import sqrt
>>> sqrt(4)
2.0
>>> from math import sqrt as square_root
>>> square_root(4)
2.0
>>> from math import *
>>> sin(0)
0.0

Whether you’ll be loading the entire module or just some pieces of the module is totally up to you (and a function of the amount of memory you have). At times, though, it’s easier to load entire modules, and yet at other times it’s important to be able to load modules with their namespace, for example, when two modules have the same function names (such as modules math and cmath).
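The math and cmath case can be sketched as follows (Python 3 syntax assumed). Both standard library modules provide a function named sqrt(), and importing them with their namespaces keeps the two apart, whereas importing the bare name twice silently replaces the first function with the second.

```python
import math
import cmath

# With namespaces, the two sqrt() functions stay distinct
real_root = math.sqrt(4)         # a float
complex_root = cmath.sqrt(-4)    # a complex number; math.sqrt(-4) would raise ValueError
print(real_root, complex_root)

# Without namespaces, the second import replaces the first
from math import sqrt
from cmath import sqrt           # the name sqrt now refers to cmath.sqrt
clashing = sqrt(4)
print(type(clashing).__name__)   # complex, not float
```

The clash produces no warning, which is why keeping the module name in front of the function is the safer habit when names overlap.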
Modules Installed in a System

Before you start importing modules and reading about their functions, it would be valuable to know what modules are currently installed and available in your system. Don’t forget that the Python Standard Library is vast, with a substantial number of modules and packages to choose from. Maybe a function you’re looking for already exists in the standard library?

Of course, you can refer to the online documentation, but you can also refer to the interactive help system. Invoke the interactive help system by entering help(). At the help prompt, enter modules. This will provide a list of available modules in your system. Enter help(module) to read more about that module.

The dir Function

Another useful built-in is the dir() function, which lists the contents of a specific object (for example, a class); in this context, it lists the methods and properties of a module as well:

>>> import math
>>> dir(math)
['__doc__', '__name__', 'acos', 'asin', 'atan', 'atan2', 'ceil', 'cos', 'cosh',
'degrees', 'e', 'exp', 'fabs', 'floor', 'fmod', 'frexp', 'hypot', 'ldexp', 'log',
'log10', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh']

This is very useful if you’re exploring the functions in a module or if you forgot the exact name of a function.

Final Notes and References

It is far beyond the scope of this chapter and this book to cover the entire Python programming language. However, this chapter should get you up and running, and you’ll be able to follow through with the rest of the book with very little need for additional references. That being said, one of the purposes of the book is to introduce the language and provide additional resources should you want to expand your knowledge.
I have found the following references of value, and I hope you find them useful as well:

• “The Python Tutorial” by Guido van Rossum, http://docs.python.org/tutorial/index.html
• The Python Standard Library, http://docs.python.org/library/index.html
• Beginning Python: From Novice to Professional, Second Edition by Magnus Lie Hetland (Apress, 2008)
• Dive into Python by Mark Pilgrim (Apress, 2004; free online version also available at http://diveintopython.org/)
• Python in a Nutshell: A Desktop Quick Reference by Alex Martelli (O’Reilly, 2006)
• Python Cookbook: Recipes from the Python Community by Alex Martelli, Anna Martelli Ravenscroft, and David Ascher (O’Reilly, 2005)

CHAPTER 4

Data Organization

Organizing Chaos

A preliminary step to designing and programming an algorithm is gathering data and sorting it. When you first go out to test a thesis or write code to analyze network traffic, only part of the information is readily available; some of the data is still unknown. First estimations are made based on the first set of data files. As data is gathered, new insights and understandings arise, resulting in possible changes to the processing script and data gathering application, such as adding a previously unlogged parameter and graphing it over time. Some changes may include data gathering over substantially longer time periods than originally anticipated. Consequently, to keep data files manageable, a reduction in the sampling rate is required, implemented by logging only every nth value. Another plausible scenario is that of parsing log files, where the generating application, for example, a web server, recently went through a software upgrade altering the file format and the file name scheme.

The situation can get more complex.
Some files may have an error due to a hardware malfunction of the recording apparatus; or some portions of the file are corrupt due to hard drive issues (back up!), or the application that stored the file had a bug and generated incorrect data. By now, you realize you need to modify the erroneous data or remove it from your analysis, be it manually or automatically.

In some cases, part of the data should be used as a teacher set to help define the algorithm, while another set of data should be used as a tester set to estimate performance. In this case, you may need to feed the algorithm additional information regarding the contents of the files so that more complex tests can be carried out.

Documenting file contents is important so that the knowledge of what each file contains is not lost. A few years from now I doubt you’ll remember what each and every file is; but you might be expected to reuse your previous work. So annotating, or note taking, is of value. Ideally you’d like the annotations and documentation to reside with the data, and not in an inaccessible notebook.

By now you have quite a number of different file types: varying number of parameters, different file lengths, different logging periods, various file formats, several file name schemes, clean and raw data, annotated data, and much more. Ideally, you’d like to use data from all the files, even if some of them have partial information or conform to a different file format; they still hold valuable information. Or it could be that you’d like to use historical information to ensure backward compatibility with older versions of the software.

A lot of the work has many unknowns. Data gathering is an iterative process in nature, and if you don’t manage your data files properly, you’ll lose control. I’m not suggesting that we stop and design an entire data management infrastructure from the get-go. On the contrary, I think data should be gathered as I’ve described.
However, following some simple guidelines and conventions can make life a lot easier. The purpose of this chapter is to address all these issues: file names, file formats, data organization, data cleaning, and annotation and data documentation. I’ll touch on each topic, suggesting guidelines and conventions to help manage data more easily for the programmer and the processing application.

File Name Conventions

Our first step in data organization is deciding on a file name convention. You’d be surprised at the odd names people choose for their files. Not because they’re not inventive enough, but rather because they’ve never given it much thought.

File name conventions are also of value when more than one person accesses the data. A good convention will help all data users locate files and manage them: your administrator will find it easier to restore previously backed-up files if he knows the file name pattern. A good naming convention should also have in mind scripts, or programs, so that automation is easier to implement. For example, if the file names contain the day of the week, it’s easier to have those limited to three letters, Sun, Mon, Tue, Wed, Thu, Fri, Sat, instead of full day names, allowing the script that processes them to be less complex.

Date and Time in a File Name

We remember a lot based on date. “Remember that time when we ran that test? That was when you joined the group, about a year and a half ago.” One of the best ways to capture date and time information is to use it to name a file. Following this guideline allows easy file searches. Instead of going through the files one at a time, opening them, and looking at the contents, you can browse the directory contents and find data based on date.
The following are benefits of using date and time in a file name:

• Date is useful information. Just looking at the file name tells you a lot about the file.
• File names are almost guaranteed to be unique. This is important when your data logging application is creating file names, because it won’t overwrite existing files. If you want to further ensure uniqueness, include the time in seconds along with the date information.
• File names are retained when copying or moving, even if modifying. However, if you rely on the operating system to record the file dates, you will find that there are issues with that: copying files using different media and/or over a network might not always retain all the date information such as creation date. They will, however, retain the file name.
• It’s easy to automate and write scripts in this manner. A script to display all the graphs from last month is straightforward to implement.
• The convention is easily followed on a wide range of systems and programming languages. The application that records the data can be written in the C programming language and not necessarily Python.

We therefore would like our file names to embed the date and time, preferably up to a second resolution. That being said, there are a lot of possible ways to denote date and time. Personally, I follow the date and time format suggested in ISO 8601: YYYY-MM-DDThh:mm:ss (see the section “Final Notes and References”) with some modifications, as it is not possible to have a file name with a colon (:) as is required by the format. Instead of colons, I use a dash (-). Another possible modification is replacing the letter T, used to separate the date and time portions in the ISO standard, with a dash as well.

The side benefit of those two replacements (replacing both the colons and the T with a dash) is that now there’s a single field separator that separates year, month, day, hour, minute, and second. This is quite valuable for automation and is easily implemented in most programming languages.
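Producing such a timestamp takes one call to time.strftime() from the standard library. The following is a minimal Python 3 sketch; the function name timestamp is an assumption for this illustration, and the format string simply spells out the dash-separated fields described above.

```python
import time

def timestamp(t=None):
    """Return a dash-separated timestamp such as 2008-04-02-22-14-14."""
    if t is None:
        t = time.localtime()      # default to the current local time
    return time.strftime("%Y-%m-%d-%H-%M-%S", t)

# A fixed moment, so the result is reproducible
fixed = time.strptime("2008-04-02 22:14:14", "%Y-%m-%d %H:%M:%S")
stamp = timestamp(fixed)
print(stamp)
print(stamp.split("-"))           # six fields, one separator
```

Because every field is zero-padded, names built this way also sort chronologically when sorted as plain strings.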
Some prefer keeping the character T as it does help remind where the date ends and where the time starts, and it’s not all that complex to manage either. Leaving the T or replacing it with a dash are both good options and mostly are a matter of personal preference. As you’ll soon see, we have a dedicated function for parsing dates, strptime(), that can handle the T quite easily.

Python provides us with the split(substr) function, which splits a string into a list of substrings once substr is encountered. In this case, split('-') will split the date-time format:

>>> a_date="2008-04-02-22-14-14"
>>> a_date.split('-')
['2008', '04', '02', '22', '14', '14']

The following example extracts the month as an integer:

>>> int("2004-04-12-11-11-11".split('-')[1])
4

In the latter example, I chose to operate directly on the string, not saving it in a variable. The month is the second element in the list, hence to access it I index it: [1] (counting starts at 0). The function int() converts the string value to an integer.

If you follow the scheme where T is used instead of a dash, you can use the function strptime(), which is part of the time module. I assume strptime() is short for string-parse-time; regardless of whether that’s true, it helps to remember the function name:

>>> from time import strptime
>>> strptime("2003-06-28T09-29-22","%Y-%m-%dT%H-%M-%S")
(2003, 6, 28, 9, 29, 22, 5, 179, -1)

Note: Small and capital letters are used to distinguish between date and time fields, mainly because the character m can mean both month and minutes. So the convention is that time is denoted by caps (HH, MM, SS) and date is denoted by small letters (yy, mm, dd). There’s one exception and that’s the year: when using a four-digit notation (e.g., 2008), the characters are capitalized: YYYY.

As you can see, it’s quite easy to extract date and time information in Python from a file name so long as one conforms to the convention.
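For comparison, here is a minimal Python 3 sketch of the same parsing done with datetime.strptime(), which returns a datetime object whose fields can be read by name. The two helper names are assumptions for this illustration; the format strings are the ones discussed above, one for each variant of the convention.

```python
from datetime import datetime

def parse_dashed(stamp):
    """Parse the all-dash variant, e.g. 2008-04-02-22-14-14."""
    return datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S")

def parse_with_t(stamp):
    """Parse the variant that keeps the T, e.g. 2003-06-28T09-29-22."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S")

when = parse_dashed("2008-04-02-22-14-14")
print(when.year, when.month, when.day)
print(parse_with_t("2003-06-28T09-29-22").hour)
```

A string that does not match the format raises a ValueError, which is a convenient way to spot file names that break the convention.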
Processing all the files from, say, April 2008 can be done using a single split() command followed by an if statement.

Useful File Name Titles

Another important aspect of a file name is a useful title. A short, descriptive title can be a time-saver. SystemY or MarsTelescopeA are good candidates. Avoid titles that describe the data such as Logfiles or TemperatureAndFlow. You want to describe the system more than the data; the data will speak for itself when you analyze it. If you do want to describe the data, do so in addition to describing the system: Sys736Logs is a good option. The following sample titles further clarify this point:

• PumpRawData is lacking system description. What if you have several pumps you want to test for flow? One alternative is to use the pump’s serial number: Pump472RawData (assuming 472 is the pump’s serial number).
• VoltageSys2AMay2008 is probably not a good title either. If you append the date to this title, you might end up with a title that looks like this: VoltageSys2AMay2008-2009-01-01-01-01-01. So which one is it: year 2009 or year 2008?
• VoltageCurrentSystem2A is OK; however, I’d opt to rename it to be less specific, or should I say, more general: ElectricalDataSystem2A. The reason for the renaming is that it’s possible you’ll decide to record additional values, say, power, as well as voltage and current, and unless you want to rename your code to look for different headers, having a file name titled VoltageCurrentSystem2A that also has power values will be a bit misleading.

File Name Extensions

The last part of the file name convention is an indication of the file format, usually denoted by the file name extension. File name extensions are typically three characters long (some are less, such as .gz, and some are longer, such as .html). We’ll try to follow a convention of three characters for the extension, again because it will be easier for the processing application.
I suggest thinking about the following file name extension subcategories:

• Known file formats: Image formats follow very specific extensions: .jpg, .png, .bmp, .tiff, and more. These extensions have a meaning, so if you're recording data in those file formats, use the known extensions. There are also known extensions for compressed file formats, video file formats, and others, so use them accordingly.
• Text file formats: Here I suggest using either a .txt or a .csv extension. If the text file format is not the Comma Separated Values (CSV) format, use the .txt extension, suggesting it is viewable by most text editors. Exceptions to this guideline include files that already have a known extension, for example, INI files: although they are text files, you really want to capture that they're files holding initialization values. The same would apply to batch files and shell scripts. But those typically are not data files.
• Binary file formats: Binary file formats are not as self-descriptive as CSV files. And unlike CSV or plain text files, they are hard to view without knowing in advance the specific file format. For this reason, binary file formats should be accompanied by a header file that describes the contents and format of the binary files. However, it's still valuable to know a bit more about the binary file format even if the exact format is unknown. The following is the suggested convention: one character denoting whether the data is signed (i), unsigned (u), or floating point (f), followed by the number of bits used to store the data, as described in Table 4-1.

Table 4-1. Suggested Binary File Name Extensions

    Description         Precision (bits)    Extension
    Signed integers     8, 16, 32, 64       .i08, .i16, .i32, .i64 (respectively)
    Unsigned integers   8, 16, 32, 64       .u08, .u16, .u32, .u64 (respectively)
    Floating point      32 (float)          .f32
    Floating point      64 (double)         .f64

• Other binary file formats: When binary files contain several values of different precisions, the convention described in Table 4-1 is not feasible, at least not in a three-character extension notation. In that case use .bin or .x.bin, where x is a number. The reason for the x is that it's conceivable you'll have several file formats of varying precisions, and a good way to tell them apart would be to add an integer prefix. Notice that they still all end with .bin, enabling easy file distinction.

In Conclusion

Three items are important to file naming conventions: date and time in a file name, useful and descriptive file name titles, and proper file name extensions. If you follow these conventions, you'll find that writing scripts to manipulate these files is simple.

Using these conventions, we have file names that follow the scheme Title-YYYY-mm-dd-HH-MM-SS.ext, with the placeholders detailed in Table 4-2.

Table 4-2. Convention Scheme for File Name Title-YYYY-mm-dd-HH-MM-SS.ext

    Placeholder   Description
    Title         A descriptive title of your choice.
    YYYY          Year the file was created.
    mm            Month the file was created. In the case of January, mm is 01.
    dd            Day the file was created. In the case of the 7th, dd is 07.
    HH            Hours in 24-hour notation. 11 p.m. would be represented as 23. Values are from 00 to 23.
    MM            Minutes. 5 minutes past the hour is 05.
    SS            Seconds. 7 seconds past the minute is 07.
    ext           An extension describing the file format, three characters long (if possible).

Note: If a value occupies fewer than the assigned number of digits, it is padded with a leading zero. So if the time is 5 minutes past 1 o'clock, the value of HH will be 01 and the value of MM will be 05.
Example: Automating File Name Creation

Listing 4-1 presents an implementation, unique.py, that conforms to the file name conventions suggested previously.

Listing 4-1. Creating a Unique File Name, unique.py

    from time import localtime

    # a script to create unique file names based on title,
    # date and timestamp and an extension
    datetime_stamp = '%4d-%02d-%02dT%02d-%02d-%02d' % localtime()[:6]
    title = 'SysALogs'
    ext = 'csv'
    print 'Unique filename: %s-%s.%s' % (title, datetime_stamp, ext)

Here's the result I got from executing python unique.py:

    Unique filename: SysALogs-2008-09-03T09-29-36.csv

Note: We're assuming that files are generated no faster than one file per second and that there's only one application logging data, hence a file name based on seconds is unique. Also, in case of a system time change, there's a chance of file names not being unique. Before creating a file, we could check whether a file with the same name exists, but for the sake of clarity that check is left out of the script.

The function localtime() is part of the time module and provides a tuple of values representing the year, month, day of the month, hours, minutes, seconds, week day, day of the year, and daylight saving time (phew). We only require the first six elements of the tuple returned by localtime() to create our unique file name. To access the first six elements of the tuple, we use the slicing operator [:6]. So localtime()[:6] returns exactly the six elements we're interested in for creating our unique file name.

Next we use the % operator to format the string containing the timestamp: '%4d-%02d-%02dT%02d-%02d-%02d'. The substring '%4d' formats an integer in a field at least four characters wide; the substring '%02d' means two digits, and in case there are fewer than two digits, the value is padded with zeros. We also use the % operator to output the final unique file name, which is composed of the strings stored in variables title, datetime_stamp, and ext. In this case we use '%s' to format strings instead of integers.
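The existence check that the note leaves out of Listing 4-1 might look like the following. This is a minimal sketch in Python 3 syntax, not part of the listing: the function name unique_filename() and its directory parameter are my own, and it uses strftime() from the time module in place of the % operator to build the timestamp.

```python
import os.path
from time import localtime, strftime

def unique_filename(title, ext, directory="."):
    """Build Title-YYYY-mm-ddTHH-MM-SS.ext and refuse to reuse an existing name."""
    stamp = strftime("%Y-%m-%dT%H-%M-%S", localtime())
    name = "%s-%s.%s" % (title, stamp, ext)
    if os.path.exists(os.path.join(directory, name)):
        # Two files within the same second, or a system clock change.
        raise IOError("file already exists: %s" % name)
    return name

print(unique_filename("SysALogs", "csv"))
```

Raising an exception is only one possible policy; a logging application might instead wait a second and try again.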
Other Schemes

Unfortunately, automating file name creation and using the date and time mostly applies if you're writing the application that generates the data files. That's not always the case: you might be using an embedded system's output files and have no control of the source code. As long as the system generating the files has a real-time clock, and assuming you can change the code or later change the file names, following the preceding convention is doable. On the occasions where a real-time clock is unavailable, a different naming scheme should be employed.

One of the alternatives to using a timestamp in a file name is a running index. That's a bit more complex than using the date because now we have to figure out the last index used. That being said, it's still a good option: it provides consistency, and unless files are randomly deleted, it also provides some sort of chronological order. Incidentally, that's the scheme used by most digital cameras.

Example: Running Index

Listing 4-2 is a suggested running index implementation. The script will look for files according to a title and extension and determine a running index (up to 999). It will then create a file accordingly. Repeatedly running the script will create files with incrementing index values.

Listing 4-2. Running Index Implementation

    # a script to create unique filenames using a running index
    from os.path import exists

    index_stamp = 1
    max_index = 999 # maximum number of files
    title = '../data/SysALogs'
    ext = 'txt'

    while index_stamp < max_index:
        unique_filename = '%s-%03d.%s' % (title, index_stamp, ext)
        if exists(unique_filename):
            index_stamp += 1
            continue
        f = open(unique_filename, 'wt')
        f.write("Data")
        f.close()
        break

    # report status
    if index_stamp >= max_index:
        print "Could not create a unique filename"
    else:
        print "Created unique file:", unique_filename

The general operation of this script is as follows: first we create a file name string with the current index.
Next, we check to see whether the file exists by calling the function exists(), which is part of the os.path module (more on os.path in Chapter 10). If the file exists, we increment the index and restart the loop; this is done with the statement continue. In case the file name we've created does not exist, we proceed with writing the data to the file and breaking out of the while loop. Lastly, in case a unique file name was not available (we check up to index 999, per variable max_index), the script reports that a unique file name could not be created.

Notice that we choose to pad the running index with zeros, as denoted by the substring '%03d' in the line unique_filename = '%s-%03d.%s' % (title, index_stamp, ext). This is generally a good idea and allows easier processing of file names, as they have identical lengths, and the strings representing the file names can be easily sliced.

Note: If you change the value of max_index, be sure to change the format string accordingly. For example, if max_index is 99999, replace %03d with %05d in the format specifications for unique_filename. This can also be done automatically by calculating the number of digits using int(log10(max_index)+1) and using the result in the format specifications (see the section "Example: Searching Inside a Text File" in Chapter 5).

File Formats

Up to this point we've discussed the form of the file names. Now it is time to discuss the format of the contents, that is, file formats. As previously pointed out, you may not be able to choose the file format used to store the data. Assuming you do have influence over the file format, the question is what format to use. A good file format is portable, easily recognizable, and does not impact performance drastically, be it size or computation overhead, depending on the nature of the application.

When you select a file format, consider the amount of data you'll be dealing with.
If you're looking at large amounts of data, you want to be as efficient as possible in both storing the data and accessing it, sacrificing a bit of portability and using a less self-descriptive file format. This means choosing a binary format. If the amount of data is not large and you want the data to be as self-descriptive and portable as possible, choose text file formats, specifically CSV. To decide whether you're dealing with large amounts of data, consider the following:

• How much storage space do you have? If you're running a desktop PC, a reasonable size to be dealing with is less than 1 terabyte. Of course, this number is ever-changing as storage space and processing power increase. At times you will find that due to storage space limitations your only option is going with binary files. The reason for this is that text representation is not as efficient as binary representation. 8-bit integers (characters) require 1 byte of storage in binary form and from 1 to 3 bytes in the text form used in CSV. Storing floating-point values, which typically require 4 or 8 bytes in binary form, will require a considerably larger number of bytes in text form. The value 0.00000095367431640625 (which is 2 to the power of minus 20) requires 22 bytes to represent properly in a CSV file. And that's not counting the separators and delimiters.
• How critical is performance to your application? The smaller the data files, the faster you can process them. With binary files there's no need to parse the data; you simply read it. If performance is your major concern, opt for a binary file format.

Note: The sentence "The smaller the data files, the faster you can process them" is not always correct. In the case of compressed files, data files are smaller but require more processing power to work with, hence performance is worse, not better. However, assuming no compression, performance of binary files is usually better.
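The storage figures quoted above for a floating-point value can be checked with a few lines of Python. The following is a small sketch in Python 3 syntax; the choice of a fixed-point text format with 20 decimal places and of a little-endian double are my assumptions for the illustration.

```python
import struct

value = 2.0 ** -20                       # exactly 0.00000095367431640625

# Fixed-point text, as the value might appear in a CSV file.
text_form = "%.20f" % value
# The same value as an 8-byte IEEE 754 double, little-endian.
binary_form = struct.pack("<d", value)

print(text_form, len(text_form))         # text needs 22 characters
print(len(binary_form))                  # binary needs 8 bytes

# The binary form restores the value exactly.
restored = struct.unpack("<d", binary_form)[0]
```

Shorter text forms such as scientific notation narrow the gap, but every text form still has to be parsed back into a number when the file is read.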
So at the highest level, you want to decide whether you'll be looking at binary data or text data. Table 4-3 lists the pros and cons of using either.

Table 4-3. Pros and Cons of Binary and Text File Formats

    Text
        Pros: Self-descriptive (usually). Does not require specific knowledge of the file format. Can be viewed by any text editor.
        Cons: Not storage efficient. Medium read/write access. Requires "text" parsers.
    Binary
        Pros: Relatively small storage space. Fast read and write access.
        Cons: Not so self-descriptive. Requires knowledge of the file format. Requires a specific application to view data.

Text and binary are high-level categorizations. When dealing with text files, we will mostly limit our discussion to plain text files and CSV files and touch lightly on other file formats. When dealing with binary files, we'll talk mostly about straightforward file formats such as u16 and i32 and not complex file formats such as MP3 and GZ that might support compression and/or encryption.

CSV File Format

The CSV file format is a text file format and can be viewed by any text editor. Furthermore, most spreadsheet applications are capable of reading and writing CSV files, parsing the values properly into rows and cells.

In CSV files, values are separated by commas; values are strings that represent numbers, dates, titles, or any other textual fields. If the string value has a comma in it, quoting is required, that is, the string will have beginning and ending quotes. Alternatively, the comma in the field can be escaped (more on this in Chapter 5).

The CSV format does not require a fixed number of fields per line (also called a row), which can be quite useful: it allows easy annotation of headers or descriptions of the data, which in turn can later be read by almost any spreadsheet and/or editor with all the information recorded still intact and easily accessible.
The following are the contents of a valid CSV file:

    SystemA
    Data generated by logger 1
    Header,1,Header2
    Value1,1
    Value2,AA

Example: Stock Price Charts

Following a convention that stores a short description of the data in the beginning lines of the CSV files can be very useful for annotating a graph or a report associated with the data in the file.

To follow along with the example, ensure your directory structure is similar to that presented in Chapter 2 in the section "Example: Directory Structure for the Book." Your base directory should be Ch4; within Ch4 there should be three subdirectories named src, data, and images. If you wish to use a different scheme, be sure to change the file path variable and the call to function savefig() in the script in Listing 4-3, which appears a little later in this section.

For this example you can download data from the NASDAQ stock exchange web site (http://www.nasdaq.com). Select the stock whose chart you wish to display on your intranet web site, for instance, the NASDAQ-100 (IXNDX) or your company's stock. You will be presented with a chart of the stock. When you click the chart, the NASDAQ web site presents the actual values used to create the chart. You can choose to download the file in Excel format: do so, and save the file as Ch4/data/charts.xls.

If you open the file Ch4/data/charts.xls in a text editor, you'll notice that there's header information describing what each column means:

    Date        Open     High     Low      Close/Last  Volume
    09/02/2008  1904.75  1912.72  1843.07  1850.14     0
    08/29/2008  1897.56  1899.56  1866.81  1872.54     0
    08/28/2008  1907.17  1921.19  1904.20  1915.12     0
    08/27/2008  1886.76  1913.53  1881.54  1900.30     0

In reality, the file format is a form of CSV, the separator being a tab instead of a comma. We can easily overcome this with Python's csv module by specifying the delimiter to be a tab, '\t'.
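Before turning to the full script, here is a minimal sketch of the tab delimiter in action. It is written in Python 3 syntax and reads from an in-memory string rather than from charts.xls, so that it is self-contained; the two data rows are re-typed from the sample, with explicit tab characters between the fields.

```python
import csv
import io

# The header and two rows from the sample, with tab separators.
sample = (
    "Date\tOpen\tHigh\tLow\tClose/Last\tVolume\n"
    "09/02/2008\t1904.75\t1912.72\t1843.07\t1850.14\t0\n"
    "08/29/2008\t1897.56\t1899.56\t1866.81\t1872.54\t0\n"
)

# csv.reader accepts any iterable of lines; delimiter='\t' selects tabs.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
header, values = rows[0], rows[1:]

# Every field arrives as a string, so numbers must be converted explicitly.
closing_prices = [float(row[4]) for row in values]

print(header)
print(closing_prices)
```

Reading from a real file works the same way: pass the open file object to csv.reader() in place of the StringIO object.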
Listing 4-3 shows our implementation, stock_charts.py, which reads a stock chart file and presents a graph with the header information properly displayed. Be sure to save it in folder Ch4/src. The result will be a PNG image, stock_price.png, in directory Ch4/images.

Listing 4-3. stock_charts.py, Plotting NASDAQ charts.xls File

    from pylab import *
    import csv
    from time import gmtime, mktime

    # modify the following to point to your data file
    filepath = '../data/charts.xls'

    # read the entire CSV file and store it in an array of lists
    # use tab ('\t') as a delimiter
    data = []
    for row in csv.reader(open(filepath), delimiter='\t'):
        data.append(row)

    # split the data to header and values
    header = data[0]
    values = array(data[1:])

    # the first column is date information in a string format
    # we transform it to a day of year format
    # notice that this will not work over year boundary (need to add 365)
    yearday = zeros(len(values[:, 0]))
    for i, day in enumerate(values[:, 0]):
        market_close_time = (int(day[6:]), int(day[:2]), int(day[3:5]), \
            16, 0, 0, 0, 0, 0)
        yearday[i] = gmtime(mktime(market_close_time)).tm_yday

    # plot the data
    for i in range(1, 5):
        plot(yearday, values[:, i], label=header[i], linewidth=3)

    # annotate the start and end dates
    text(yearday[0], values[0, 1], values[0, 0])
    text(yearday[-1], values[-1, 1], values[-1, 0])
    grid()
    legend()
    ylabel('Stock price [USD]')
    xlabel('Days from start of the year ' + values[0, 0][6:])
    title('NASDAQ-100 (IXNDX) Stock price, period %s-%s' % (values[-1, 0], values[0, 0]))
    savefig('../images/stock_price.png')

We start by reading the CSV data file and passing a tab as a delimiter. The first line in variable data is the header information, describing what each column means: Date, Open, High, Low, Close/Last, and Volume. The remaining lines are the values to plot. We therefore split the variable data into header and values, accordingly. We also convert the values to a NumPy array using the function call array(). Using a NumPy array, the data will be easier to process and plot; more about NumPy in Chapter 7.
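The date-to-day-of-the-year step in Listing 4-3 can also be sketched with the datetime module, which works on calendar dates directly and so does not depend on the time zone of the machine running the script. This is an alternative I'm suggesting rather than the listing's method; it is written in Python 3 syntax, and the function names day_of_year() and days_since() are hypothetical.

```python
from datetime import datetime

def day_of_year(date_string):
    """Convert an MM/DD/YYYY string to its day-of-the-year value (1 to 366)."""
    return datetime.strptime(date_string, "%m/%d/%Y").timetuple().tm_yday

def days_since(date_string, baseline_year):
    """Count days from January 1 of baseline_year, where January 1 is day 1."""
    date = datetime.strptime(date_string, "%m/%d/%Y")
    return (date - datetime(baseline_year, 1, 1)).days + 1

print(day_of_year("09/02/2008"))
print(days_since("01/02/2009", 2008))
```

The second function keeps counting past December 31, so data spanning a year boundary stays on one continuous time base.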
The following is not so much an explanation of working with CSV files but is important to fully understand the script.

Next is the so-called linearization process. Much like in the GPS example of Chapter 1, data in charts.xls is not linear. The information is stock prices on a daily basis; however, stocks are not traded every day, weekends being the prime example but also holidays. If we plot the information as is, neglecting these "holes" in the data, the picture presented will be skewed. So instead, we need to choose a different time base, one that will take into consideration nontrade days. I chose to use the day-of-the-year value: January 1 is 1, January 2 is 2, and December 31 is 365 or 366 (leap year dependent).

Since I don't want to get into the process of determining leap years or summing up the days in each month, I've decided to use the time module again. The idea here is to use the function gmtime() and, as a side effect, retrieve the day-of-the-year value. Function gmtime() receives a value representing the number of seconds elapsed since the epoch, a fixed point in time (see more about the epoch in Chapter 5). While this sounds even more complicated than calculating the day of the year, in reality it's easier because of function mktime(). Function mktime() receives a tuple of nine values, detailed previously, and returns the number of seconds since the epoch. So we first construct a tuple of those nine values, the first three being year, month, and day, which are known to us, and arbitrarily assign the hour to be 4 p.m. (which coincides with the end of trade). We leave the remaining fields zero. We then feed the number returned by mktime() to gmtime() and receive a new tuple, now properly populated with the day of the year, the eighth element of the tuple, accessible with tm_yday, which we save in vector yearday.

Note: The script does not take into account data over more than one year.
To accommodate this, you could take into consideration the number of days in a year (365 or 366, depending on a leap year) and use the lowest year as a baseline.

We then plot the data and annotate the graph. For the legend, we use the header values of the CSV file stored in variable header. We also use actual values from the variable values to annotate the start and end of period on the graph itself, the title, and the x-axis label (see Figure 4-1).

Note: If you look closely at the data in charts.xls, you'll notice that it's reversed, that is, backward in time. One of the side effects of using the day-of-the-year value is that values are now plotted from lower to higher values, that is, older times are on the left, and newer events are on the right. If you'd like to reverse this behavior, issue the command gca().axes.invert_xaxis().

Figure 4-1. Stock price chart output

Example: Automatically Reading Yahoo! Financial Data

The following discussion is a bit off-topic, but as it is a direct continuation of the previous example, this is probably a logical spot for it.

There's an alternative to manually saving the charts.xls file from NASDAQ: using the matplotlib.finance module. The two core functions that fetch the data and parse it are fetch_historical_yahoo() and parse_yahoo_historical() (although you could easily parse the data yourself). Another function of interest is the candlestick() function, which plots a candlestick graph of the stocks.

Listing 4-4 is a modification of the previous example to use the functions from the matplotlib.finance module. Notice that there are some other minor changes to the code because the data structure is a bit different from the NASDAQ charts.xls file. You can control the stock you wish to view and the start and end dates by changing the values stock_name, t_start, and t_end.

Listing 4-4. Fetching and Plotting Yahoo! Data

    from pylab import *
    from matplotlib.finance import *

    # stock name and period
    stock_name = 'NDX'
    t_start = datetime.datetime(2008, 1, 1)
    t_end = datetime.datetime(2008, 1, 31)
    year_start = datetime.datetime(2008, 1, 1)

    # retrieve and parse stock data
    data = fetch_historical_yahoo(stock_name, t_start, t_end)
    y = array(parse_yahoo_historical(data))

    # dates might not be trade days, so update values
    # to show actual dates retrieved
    t_start = num2date(y[0, 0])
    t_end = num2date(y[-1, 0])

    # normalize the x-axis to show values from the start of year
    y[:, 0] = y[:, 0] - date2num(year_start) + 1

    # plot a candlestick graph
    figure()
    candlestick(gca(), y)

    # annotate the graph with additional text
    start_str = "%d-%02d-%02d" % (t_start.year, t_start.month, t_start.day)
    end_str = "%d-%02d-%02d" % (t_end.year, t_end.month, t_end.day)
    title('Stock: %s, period %s to %s' % (stock_name, start_str, end_str))
    xlabel('Days from start of the year %d' % t_start.year)
    ylabel('%s Stock price [USD]' % stock_name)
    text(y[0, 0], y[0, 1], start_str)
    text(y[-1, 0], y[-1, 1], end_str)
    grid()
    savefig('../images/%s_candlestick_yahoo-%s-%s.png' % \
        (stock_name, start_str, end_str))

Some notes:

• The time base is normalized, that is, the dates are shown from the start of the year 2008 and not the epoch. This is implemented in the line y[:, 0] = y[:, 0] - date2num(year_start) + 1.
• The actual dates requested might not be trade days. Therefore, the start and end times are updated after the data is fetched and parsed. This is done in the lines t_start = num2date(y[0, 0]) and t_end = num2date(y[-1, 0]).

Figure 4-2 shows the results of the example in Listing 4-4.

Figure 4-2. Automatically generated candlestick graph

Example: Creating a CSV File

The following is an example of writing a list to a CSV file. I assign some arbitrary mixed data (strings and numbers) to a list named L and write it to file. Try it yourself, and then open the created file test.csv to view the file contents.
    >>> L = [['Time', 'Value', 'Notes'], [0, 20, 'Start point'], \
    ...     [0.1, 'Middle point'], [2]]
    >>> import csv
    >>> f = open('../data/test.csv', 'wb')
    >>> csv.writer(f).writerows(L)
    >>> f.close()

Here are the contents of the test file, test.csv:

    Time,Value,Notes
    0,20,Start point
    0.1,Middle point
    2

Try changing the values of the list, such as adding a comma to one of the strings. Now, open the file in a spreadsheet application: did the application manage to read the comma properly? Open the file in a text editor and notice the string containing the comma is now quoted. The csv module took care of adding quotes as required. More about the csv module in Chapter 5.

USING THE CSV MODULE INSTEAD OF THE SPLIT() FUNCTION

So far we've used Python's csv module liberally. You might be wondering why we're not using the function split(',') instead of the csv.reader object. The answer is that the csv module also addresses special cases such as a string that includes a comma. Consider the following row:

    "Surname, Name",2008,450

Module csv will handle this properly and return three elements. However, split(',') will return four elements: the quoted string will be broken in two.

CSV Limitations

All's not roses in the world of CSV. Here are some things to consider:

• Size: CSV files are typically not size efficient, compared with binary file formats.
• Performance: There's also a performance hit with CSV files because they require parsing. An application, be it a spreadsheet application or even our code in Python, calls a function to translate the CSV file into values more easily used by the application. That is, it parses fields and rows and translates from text to integer or floating point in the case of number values. Running the parser to read the CSV file takes time, so reading a large file will take considerable time. If performance is of importance and your application reads very large files, consider using a binary file format instead.
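Returning for a moment to the sidebar's comparison of split(',') with the csv module, the difference is easy to see side by side. The following is a minimal sketch in Python 3 syntax; the row is my own example of a field holding a comma inside double quotes.

```python
import csv

# One row whose first field contains a comma and is therefore quoted.
row = '"Surname, Name",2008,450'

# Naive splitting treats every comma as a separator.
naive = row.split(",")

# csv.reader accepts any iterable of lines; a one-element list will do.
parsed = next(csv.reader([row]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The csv module also strips the surrounding quotes, so the first field comes back as plain text ready for use.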
What to Store

As a general rule, store as much information as possible. Unfortunately, sometimes that's simply not possible. Consider the data rate of an uncompressed HDTV video signal at 1280×720 pixels, 30 frames per second, true colors (24 bits). That's 1280×720×30×3 bytes per second, or roughly 83 megabytes per second, which is on the order of today's hardware limitations. This means you'll have to discard some of the information, compress it, or get better hardware.

Deciding what to store and what not to store will be very much system dependent. Some opt for decimating the data, which has its implications. Others decide on discarding a parameter they deem less important. Barring file size limitations, consider the following guidelines in deciding what to store:

• Write header information in the beginning of the file, describing the system and the data, including units of measurement. You can use free-form text for this. Some even go a step further by adding a special character (e.g., #) at the beginning of every line, ensuring the reader understands those are remarks and not part of the data.
• Include a header for each column, explaining what each column means. It's very useful both for viewing the files using a spreadsheet and for automated scripts that visualize the data.
• Always try to store the time and date. Store the date and time values in the first column. You can follow the ISO 8601 specifications, or you might opt to use a different notation. A valuable alternative to the ISO 8601 format is to store the number of seconds that have elapsed since the epoch (January 1, 1970, on most Linux machines). That way you have a number that is very easy to manipulate, as opposed to a date and time that requires parsing. There's also a side benefit: if you have several files, you can use the same time base for all of them. The seconds-since-the-epoch notation is very useful in binary formats.
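To illustrate the last guideline, converting between an ISO 8601 timestamp and seconds since the epoch takes only a few lines. This is a minimal sketch in Python 3 syntax; the function names iso_to_epoch() and epoch_to_iso() are hypothetical, and the timestamps are assumed to be in UTC so that the result does not depend on the local time zone.

```python
import calendar
import time

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"

def iso_to_epoch(stamp):
    """Convert 'YYYY-mm-ddTHH:MM:SS' (taken to be UTC) to seconds since the epoch."""
    return calendar.timegm(time.strptime(stamp, ISO_FORMAT))

def epoch_to_iso(seconds):
    """Convert seconds since the epoch back to an ISO 8601 string in UTC."""
    return time.strftime(ISO_FORMAT, time.gmtime(seconds))

t0 = iso_to_epoch("2005-09-15T01:07:08")
t1 = iso_to_epoch("2005-09-15T01:07:14")

# With plain numbers, elapsed time is a simple subtraction.
print(t1 - t0)
print(epoch_to_iso(t0))
```

A script comparing samples from several files can subtract these numbers directly, which is the common time base mentioned above.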
Here's an example of the contents of a file that follows the preceding guidelines:

    Units,Celsius
    Sensor,A1
    System serial number,401
    Date and Time,Temperature,Pressure
    2005-09-15T01:07:08,42.0,53.1
    2005-09-15T01:07:14,42.0,53.2
    2005-09-15T01:07:19,39.0,51.8

When to Use CSV

Use CSV whenever possible, with the following exceptions:

• Performance is an issue.
• File size is an issue.
• Data is already in a different format.

Binary Files

Binary files are an efficient method of storing data. The term "binary files" means files that are not represented as ASCII text; that is, if you open these files in a text editor, the data will appear to be gibberish. In reality there's no difference between binary files and text files, other than what the data in the files represent. From the computer's perspective, they're both just files. So in essence, if the file is not a text file, it's a binary file, but that's a loose definition.

As discussed previously, there are merits to using binary file formats, and those are typically size and performance. There's also another reason, and that's the nature of the data. A digital picture is not easily represented as a text file (it can be, though: for example, every pixel value could be an integer in a CSV file). The same applies to compressed files. Regardless of the reason, it's almost impossible to avoid using binary files.

In this book, when I refer to binary files, I typically mean one of the following file formats: an array of values, an array of structs, or other commonly used binary file formats.

An Array of Values

The simplest binary file format we'll be using is an array of values, that is, a repeating single data type. The file could be holding 16-bit signed values or unsigned bytes. The array-of-values file format lends itself nicely to storing simple binary data.

Example: Reading and Writing an Array of Binary Values

The Python array data type is an ideal candidate for this sort of binary file handling.
The array data type is part of the array module, so to use it, issue the following command:

    >>> from array import *

To create an array, call the array() function with the data type and optional initialization parameters, as follows:

    >>> a = array('h') # array of signed shorts, of zero size
    >>> a
    array('h')
    >>> b = array('L', [1000, 2000, 3000]) # array of three unsigned longs
    >>> b
    array('L', [1000L, 2000L, 3000L])
    >>> c = array('d', range(10)) # array of doubles, from 0 to 9 including
    >>> c
    array('d', [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

The data types listed in Table 4-4 can be used in initializing array objects.

Table 4-4. Array Data Types

    Data Type   Data Meaning and Size
    'c'         Character, 1 byte.
    'u'         Unicode character, 2 bytes.
    'b'         Signed character, 1 byte.
    'B'         Unsigned character, 1 byte.
    'h'         Signed short, 2 bytes.
    'H'         Unsigned short, 2 bytes.
    'i'         Signed int, size is CPU dependent.
    'I'         Unsigned int, size is CPU dependent.
    'l'         Signed long, 4 bytes (a minimum; larger on some platforms).
    'L'         Unsigned long, 4 bytes (a minimum; larger on some platforms).
    'f'         Floating-point value, 4 bytes.
    'd'         Floating-point value, 8 bytes.

Of these data types, as a guideline, try not to use the 'i' and 'I' data types, since they're system dependent and might prove problematic when you transfer your code to another system (unless of course that functionality is exactly what you require).

Writing array values to file is done using the tofile() member function of the array data type:

    >>> f = open('b.u32', 'wb')
    >>> b.tofile(f)
    >>> f.close()

Reading is performed using the fromfile() member function of the array data type. The function fromfile() also requires the number of values to read. If you supply a number greater than the number of elements in the file, an exception (EOFError) is raised; however, the values that were available are still added to the array.

    >>> d = array('L')
    >>> f = open('b.u32', 'rb')
    >>> d.fromfile(f, 3)
    >>> f.close()
    >>> d == b
    True

An Array of Structs

A more complex binary data structure we'll be dealing with is an array of structs.
The word "struct" is taken from the C programming language and describes a structure composed of several data types. Suppose data is stored as follows: long, float, float, long, float, float, and so forth. This series can be viewed as an array of structures, with the structure being {long, float, float}. In this sense, an array of values, discussed previously, is also an array of structs with the struct being a single data type, for example, {char}. If you're familiar with C, the preceding structure might be described as in Listing 4-5.

Listing 4-5. A Struct in C

    struct some_binary_file_format
    {
        long epoch;
        float fTemperature;
        float fPressure;
    };

Note that unlike our previous binary file formats, this one doesn't lend itself to a nice extension naming convention such as .u16 or .f32, so we simply choose the extension .bin, noting that it's a binary file.

Example: Reading and Writing an Array of Structs

In this example, we'll create a structure containing two data types (long and float), write it to file, and then read it using two different methods: a structure at a time and the entire file at once. You can follow along by entering the commands interactively at the Python shell.

First, we have to import the struct module:

    >>> import struct

To illustrate the concept of an array of structs, we'll create a list of rows. Each row is a list of three values: a long and two floats, which represent a structure. We'll generate a relatively short list, only two rows long:

    >>> L = [[10L, 1.0, 2.0], [20L, 0.125, 0.25]]

Next, we define two variables, filename and format, so we don't have to enter them every time:

    >>> filename = '../data/structs.bin'
    >>> format = 'Lff'

I'm assuming there's a directory named ../data; if one does not exist, either create it or change the value of the variable filename accordingly. The format 'Lff' means a long, followed by a float and a float, per Table 4-4.
Next, we write the list to file:

>>> fout = open(filename, 'wb')
>>> for row in L:
...     data = struct.pack(format, row[0], row[1], row[2])
...     fout.write(data)
...
>>> fout.close()

The first call to open() opens a file in binary mode. We then use a for loop and iterate over the rows in the list L. Every row is packed using the function struct.pack(). The function struct.pack() accepts a format and then the values to pack. The return value is a string that can be written to file. We then write the string to file. Finally, the last line closes the file.

So now we should have a file named ../data/structs.bin. This file contains the list of values from the list L. Let's read it a struct at a time. First, we'll start by defining a variable equivalent to the size of the struct format:

>>> struct_size = struct.calcsize(format)

The function struct.calcsize() calculates the size in bytes of the format. Armed with the struct size, we start reading the data, a struct at a time:

>>> fin = open(filename, 'rb')
>>> data = fin.read(struct_size)
>>> data
'\n\x00\x00\x00\x00\x00\x80?\x00\x00\x00@'

The first line opens the file for reading in binary mode. We then use the function read(n) to read n bytes and store them in the variable data. So now the variable data holds the first structure from the binary file, but it isn't legible yet. We'll need to unpack it, using struct.unpack(), that is, convert it from a string to a tuple of values using the format specifier 'Lff'. But since we'll be reading and unpacking several values, it stands to reason to use a while loop as follows:

>>> while data:
...     values = struct.unpack(format, data)
...     print values
...     data = fin.read(struct_size)
...
(10, 1.0, 2.0)
(20, 0.125, 0.25)
>>> fin.close()

The while condition evaluates to True as long as variable data is nonempty, hence data will be processed until the end of the file. Each struct read is unpacked to a tuple of values using the struct.unpack() function.
Once a struct is unpacked, we read the next structure. This continues until all the structs are read from the input file. Lastly, we close the file.

The second method we'll examine here is reading the entire file at once. To do so, we first read the entire file to memory, using the read() function:

>>> data = open(filename, 'rb').read()
>>> len(data)
24

If no parameters are provided for read(), the entire file is read into memory until an end of file (EOF) is reached. This might not be a problem with small files, but with larger files be wary; your computer might not be able to handle all the data at once, so you will need to read the files in chunks per the previous method. Note that I've chosen not to assign a file handle for the data file and let Python handle the closing of the file for me.

The function struct.unpack() accepts format as a parameter and unpacks the data to a tuple. However, we need to unpack the entire array, not just the first structure. We can take the obvious route of using a for loop to unpack the binary data a piece at a time. An alternative approach is to change the format value passed to unpack() from a single 'Lff' to a repetitive 'LffLffLff...'. This allows unpacking of the entire binary data in one call to struct.unpack(). Luckily, Python provides us with a very useful tool for multiplying strings, the multiplication operator:

>>> 'Lff'*5
'LffLffLffLffLff'

We can calculate the size of the array we want to unpack by dividing the length of the data by the size of one struct. In our case, that's len(data)/struct_size. So to generate a format to unpack by, we multiply the format by that value, which folds neatly into the following:

>>> print struct.unpack(format*(len(data)/struct_size), data)
(10, 1.0, 2.0, 20, 0.125, 0.25)

Note (Advanced readers): This implementation assumes the file is in accordance with the native operating system's byte order.
If you try to unpack data in this manner with any of the struct's byte order, size, and alignment format characters, such as @, =, <, >, and !, the function will fail.

Other Binary File Formats

Binary files can be more complex and can follow a different scheme from the repeating fixed-size structure. Some employ compression, which typically involves a non-fixed-size structure. Others might store data sequentially, that is, using the data of the preceding example, you could write all the long values, followed by the float values. In that case, a different method to read the file should be employed, but it's quite straightforward if you know the file format. In this book I'll touch lightly on this topic, specifically about known file types such as pictures and compressed files. Since the number of file formats is virtually unlimited, the topic is too vast for one book to cover.

Header Files

Unlike CSV files, with binary files you can't really tell whether the information is in integer representation, floating point, or an altogether different scheme. This means that you, the programmer, need to know in advance what file format you're dealing with. At first that might not seem such a complex task, but in reality, it's not trivial. Even with the same notation as explained previously in this chapter, say, .u16, you still don't know what the values represent: are they sampled voltage values? Is there a timestamp? And you might have several binary file formats you're dealing with.

To resolve this, we use a header file to describe each file type, or directory, in case all the files conform to the same format. A header file is a text file that describes the format of the binary file. But if we're using a text file, we might as well use CSV!

It's a good idea to have the same base file name for the header file as the binary file (excluding extension). I typically add an .hdr.csv extension for my header files; for example, for file Lava2001-03-21T08-22-23.f32
I name the header file Lava2001-03-21T08-22-23.f32.hdr.csv.

Here's an example of header file contents for an array-of-structs file format:

Name,Number of bytes,Format,Units
Time Elapsed since epoch,4,integer,seconds
Temperature,4,float,Degrees Celsius
Pressure,4,float,Psi

The nice thing about this structure is that it's quite self-explanatory. It lends itself easily to automation and scripting.

I've also added a column titled Units. This column is obvious; however, you will find later that it's quite useful. Say you know the temperature is an integer, but what exactly does it represent? Degrees? And if so, are those in Kelvin, Fahrenheit, or Celsius?

If the file format is different and does not follow the repeating fixed-size structure format, you can come up with a header that best describes that file format. In the case of sequential data, the header file might look like this:

Name,Number of bytes,Format,Units,Values
Time Elapsed since epoch,4,integer,seconds,100
Temperature,4,float,Degrees Celsius,100
Filtered values,4,float,Degrees Celsius,100

This format implies that the data is sequential, having 100 values for each parameter. This is a more complex file format and not at all popular due to the complexity associated with implementing a format that behaves like this; you'd have to remember all the information and then store it to file instead of gathering values and storing them one at a time. Again, at times, you're given data files to work with and can't control the file format.

Readme Files

Readme files are documentation files placed in a directory describing the contents of the files in that directory. There's no clear definition of the contents of Readme files, only that the information should be in clear text so as to be viewed by any text editor. Some Readme files have directions on what should be run and how to use the software. Others add author information and credentials.
Using Readme files is an excellent way to document what you've done without the overhead of writing a user's manual.

Here are occasions where I found using Readme files of value:

• They are helpful for describing the contents of data in directories: file formats, origin of data, date and time, person in charge, and so forth. See Chapter 1 for an example of a Readme file describing data.
• When directories contain both data and scripts to analyze them, there's bound to be a multitude of scripts. Describing the entry point, or what the user should run first, is a time saver, especially if a process is required before running the scripts, for example, uncompressing the data. Describe that in your Readme file.

Readme files can be as detailed or as cryptic as you'd like. Just remember that they're there to help; include detail in them according to the level of the user or developer so they understand what's going on.

The common full file name for Readme files is Readme.txt.

INI Files

As you add content and capabilities to your scripts, you'll find that you need to control the scripts' behavior using options, such as running the script but only generating a text output without graphs, or running the scripts on a different set of data points. As the number of options increases, you'll need methods for controlling the options. There are several ways to implement options. Following are the common ones:

• Interactive input from the user, for example, "Generate graphs (y/n)?"
• Command-line parameters such as the -l in the command ls -l.
• An external configuration file holding the choices and parameters. To change the behavior of the script, the user changes the values in the configuration file. The script reads the configuration file and acts accordingly.

The latter option, a configuration file, is also referred to as an INI file.
The reason is that back in the days before the registry was introduced in Windows, applications used to store parameters in files having the application name and ending with an .INI extension. In Linux this is commonly referred to as a configuration file; configuration files typically reside in the directory /etc and have a .conf extension. Python supports INI files natively with the ConfigParser module.

Much like Readme files describing the data, INI files describe the parameters, options, and choices used to run a script. They provide a clean way of explaining what the options mean.

The general markup of an INI file (config file) is a section, denoted by brackets, followed by a list of parameters and their assigned values and optional remarks, as outlined in Table 4-5.

Table 4-5. INI File Format

INI/Config Line   Format                           Notes
Section           [section]                        Used to group parameters logically
Parameter         param1=value1 or param1:value1   Used to set a parameter to a value
Remark            ; remark or # remark             Used to document sections and parameters

Example: Reading and Writing INI Files

Listing 4-6 shows an implementation of writing an INI file using the ConfigParser module.

Listing 4-6. Creating an INI (Config) File

# creating an INI (config) file
import ConfigParser

options = ConfigParser.ConfigParser()
options.add_section('UserOptions')
options.set('UserOptions', 'all_data', True)
options.set('UserOptions', 'graph', 1)
options.add_section('Plot')
options.set('Plot', 'grid', True)

f = open('../data/options.ini', 'w')
options.write(f)
f.close()

First we import the ConfigParser module. We then set sections with the add_section() method and parameters and values with the set() method. Lastly, we create a file and output the ConfigParser object to file, generating an INI file. The following are the results from running the script in Listing 4-6:

[Plot]
grid = True

[UserOptions]
graph = 1
all_data = True

Reading an INI file is even easier.
Assuming you have run the previous script, you should now have an INI file named ../data/options.ini. The script in Listing 4-7 will read that file and parse its contents.

Listing 4-7. Reading an INI (Config) File

# read an INI (config) file
import ConfigParser

read_opts = ConfigParser.ConfigParser()
read_opts.read('../data/options.ini')

# print parameters and values
for section in read_opts.sections():
    print "[%s]" % section
    for param in read_opts.items(section):
        print param

The function ConfigParser.read() accepts a file name (use readfp() if you want to use a file object) and parses the INI file with the ConfigParser object. The code following the read() function call prints the sections, options, and values. Here are the results from running the script in Listing 4-7:

[Plot]
('grid', 'True')
[UserOptions]
('all_data', 'True')
('graph', '1')

XML

XML, or Extensible Markup Language, has been growing in popularity as a data file format. XML is more descriptive than CSV and definitely more descriptive than binary, hence its popularity. XML is a very good format for data files, but it has its overhead. Mainly it requires a complex parser to read the data and check data validity. While that's true for CSV as well, CSV is much less complex.

XML, however, is left out of scope for this book, mainly because CSV provides us with the functionality we require, but also because the topic is too large to be addressed properly in this book. If you do require XML processing, rest assured that Python has extensive XML support. There's also a large selection of books available on XML, and I suggest you consult with them or the Internet should you require XML support.

Other File Formats

There are a large number of other file formats you're likely to encounter. These include image formats such as PNG, JPEG, bitmaps and GIF, or compressed file formats such as ZIP or GZ, and yes, XML too. It is far beyond the scope of this book to detail and discuss all these file formats.
One of the benefits of using Python is its popularity and an active developer base with an extensive number of freely available packages contributed by the Python community. There's a good chance there's already a module out there that's suitable for reading different file formats and converting them to programmer-friendly values. For example, a module we'll be exploring in Chapter 9, the Python Imaging Library (PIL), supports most popular image formats.

Locating Data Files

As described in the introduction to this chapter, as you gather data, you're bound to end up with files of various types: raw data files, clean data files, processed data files, files of older file formats, and the list goes on and on. The question is, how do you organize all this data, and furthermore, how do you later locate it for analysis?

This section suggests several approaches to organizing files and, what's more important, maintaining well-organized data. One approach is storing files in directories and subdirectories, and we'll discuss methods to locate the files using that approach; another is to use catalog files and annotate them.

Organization into Directories

The most popular method of organizing files is in directories. If you go with this approach, try to have all your subdirectories containing data files in a parent directory named data or similar. If you intend to preprocess the data, split the directory into "raw" and "clean" data. The reason you want to do this is that you may find out later that the preprocessing algorithm has a bug or that a different method should be employed to preprocess the data. Or if you manually preprocessed the data (that is, cleaned up the data files, removed wrong files, edited others, etc.), you may later realize you accidentally erased the wrong data file or that you made some other mistake.

From here on, there are several options, for example, putting all the data files in one directory or creating subdirectories and organizing files there.
Personally, I like to split the directories further for several reasons. One is that it gives me greater control over documentation: it's possible to generate Readme files for every directory. The other is that it allows greater control over what files to process, for example, I could process all files from directory systemA. Lastly, it helps provide a more aesthetic view, and that's an important part of any engineering work.

The actual breakdown into subdirectories is very problem specific. It could be based on dates, type of files, contents, and pretty much anything else you would like. However, do try to group the directories in one root directory, as it will be a lot easier to iterate through the data.

data
    raw
        systemA
        systemB
        systemC
    clean
        systemA
        systemB
        systemC

Searching for Files

One of the obvious methods for searching for a file is by recursively going through all the subdirectories and looking for files that match a given pattern.

Example: Storing Directory Contents in an Array

When you first look for a file, you don't always find it on your first search, maybe because you chose the wrong file name pattern or because of a simple typo. There's a good chance you'll require additional searches. Now if you have a significant number of data files, it can be tedious to rewalk the entire directory again. Every search is laborious, and time spent finding files will increase dramatically. Instead, it's possible to store the intermediate result in a data structure.

Try this yourself. Define the function get_all_files(), as shown in Listing 4-8, and call it interactively in Python by issuing allfiles = get_all_files(some_path). Observe the results by issuing print allfiles at the Python shell.

Listing 4-8.
A Function to Retrieve All Files in a Directory and Store Them in an Array

import os

def get_all_files(srchpath):
    """Get the names, paths and sizes of all the files in a directory."""
    allfiles = []
    for root, dirs, files in os.walk(srchpath):
        for file in files:
            pathname = os.path.join(root, file)
            filesize = os.path.getsize(pathname)
            allfiles.append([file, pathname, filesize])
    return allfiles

The function stores an entry for each file in a list (allfiles). Each entry in the list holds the file name, path name, and file size. A path name is the full path plus a name of a file (e.g., /home/shai/file.txt); a file name is the name of the file excluding the path (e.g., file.txt), and file size is given in bytes.

The function os.walk() was described in Chapter 1 and should not require additional clarifications. I've made use of the function os.path.getsize() to retrieve the size of a file.

Note: In cases where file names contain non-English characters, I've seen the function getsize() raise an exception because it was unable to read the file. If you're dealing with such files, either rename them or add a try/except clause to catch the exception.

Indexing

The act of going through directories and recording file information in an organized manner is called indexing. Done properly, indexing can allow fast searches.

Example: Searching for Duplicate Files

Continuing our previous example, now that we have an array containing all the files in a directory, we can perform fast searches on the array. We can sort the array based on file size and find the ten largest files; or we can look for files matching a given pattern. In this example we'll explore a more complex search, one that checks for duplicate files. This is a true need, one that arises especially when dealing with a large number of files.

Assuming you have followed the unique file name convention suggested earlier, there shouldn't be any duplicate file names. However, that's not always the case. Consider the following: data is generated by copying pictures from a digital camera. Many digital cameras follow a simple running index scheme (see the section "Other Schemes" earlier in this chapter) whereby file names follow the pattern Header0001.jpg, Header0002.jpg, and so on, with each camera having its own Header string. After you copy the files to your computer, you delete the old files in the camera, clearing space for new pictures. New pictures taken by the camera will in turn start from index 1 and eventually, as they're copied to your computer, will have nonunique file names. To ensure files are not accidentally overwritten, you copy over each batch of pictures to a directory of its own, each directory named uniquely based on date and time. So you end up with several directories, but their contents could contain nonunique file names. Maybe some are the same. Can we clear some up?

Another scenario is that of backups, or that of using several storage locations, say, your laptop and your home PC. You may have copies of data lying around in several spots, and the question again is whether you have multiple copies of the same data.

Of course, if you follow a central server approach and that server is backed up on a regular basis, you'll find that these occasions are rare. Still, it's nice to be able to identify duplicate files, and that's the motivation behind this example.

In the example we'll confine ourselves to the following: we assume files to be identical if they have the same file name and file size. While this isn't necessarily true, the example can be easily modified to compare contents as well. We'll show three different implementations and discuss the best solution of the three. In all three methods we'll use a dictionary object.

Note: To be able to follow along, ensure you've defined the function get_all_files() from the previous example.
Run it in interactive Python and store the results in an array as follows: allfiles = get_all_files(pathname).

Method 1: We use the file name as the unique key in our dictionary, mydict1. The value is a list of [filepath, filesize]. At first, mydict1 is empty. For every entry, we ask whether the file name is a key to the dictionary. If it wasn't encountered, we add the list [filepath, filesize] as a value to the key, file name. If the key is in the dictionary, it means that this file name has been encountered in the past. We then retrieve the file size and compare it with the current entry file size. Listing 4-9 shows the implementation.

Listing 4-9. Looking for Duplicate Files, Method 1

def find_dupes_1(thefiles):
    """Searches for file duplicates, method 1."""
    result1 = []
    mydict1 = dict()
    for filename, pathname, filesize in thefiles:
        if filename in mydict1:
            [dup_file, dup_size] = mydict1[filename]
            if dup_size == filesize:
                result1.append(pathname)
        else:
            mydict1[filename] = [pathname, filesize]
    return result1

One of the obvious shortcomings of this method is that there might be several files with the same file name but different sizes; the algorithm might not catch some of them. For example, if the first file is of size A, and several other files have the same file name but are of size B, the algorithm will not identify files of size B as duplicates.

Method 2: This method uses the path name as the unique key in the dictionary mydict2 and the list [filename, filesize] as the value. Since we're using the path name as the key, it's guaranteed to be unique; there are no two files with the same file name and path name. To check whether a file name already exists in the dictionary, we iterate through all the elements in the dictionary using the iteritems() method. If the file name and the file size are identical, we announce them to be a duplicate.
If not, we add the associated path name as key and the [filename, filesize] as a new value to the dictionary (see Listing 4-10).

Listing 4-10. Looking for Duplicate Files, Method 2

def find_dupes_2(thefiles):
    """Searches for file duplicates, method 2."""
    result2 = []
    mydict2 = dict()
    for filename, pathname, filesize in thefiles:
        for k, v in mydict2.iteritems():
            if v[0] == filename and v[1] == filesize:
                result2.append(pathname)
        else:
            # for-else: runs once the inner loop finishes (there is no break)
            mydict2[pathname] = [filename, filesize]
    return result2

While this method does resolve the shortcoming of method 1 in that if there are several files with the same file name, they will all be checked, the implementation is not a good one. The major issue is that we use a dictionary object to store values and neglect to use the inherent hashing mechanism properly: we iterate through all the items linearly. We probably could've just as well used an array.

Method 3: This method uses the file name as the key in the dictionary object mydict3. The difference from method 1 is that instead of a list holding [pathname, filesize], we now hold an array of [pathname, filesize] lists for every key, much like in a real dictionary where one entry (key) might have several definitions (values). The second change we introduce is that we don't ask whether the file name (key) is part of the current set of keys. Instead, we simply access the dictionary object with the file name using the method get(). If there's an entry, we go through the array of [pathname, filesize] values and check for duplicate files. If one matches, it's a duplicate. If none matches, we append our new [pathname, filesize] to the array of current values. In case there's no entry for the file name, we add it as a new entry to the dictionary object (see Listing 4-11).

Listing 4-11.
Finding Duplicate Files, Method 3

def find_dupes_3(thefiles):
    """Searches for file duplicates, method 3."""
    result3 = []
    mydict3 = dict()
    for filename, pathname, filesize in thefiles:
        if mydict3.get(filename):
            for [dup_file, dup_size] in mydict3[filename]:
                if dup_size == filesize:
                    result3.append(pathname)
            mydict3[filename].append([pathname, filesize])
        else:
            mydict3[filename] = [[pathname, filesize]]
    return result3

Of the three methods, the third one is the best because it uses hashing properly.

To check performance for yourself, copy the function implementations per Listings 4-9, 4-10, and 4-11 to a text editor, save them under scriptname.py, and then issue execfile('scriptname.py') in an interactive Python shell. Once that's done, here's a short set of commands you can use to measure performance. Be sure to change the srchpath variable to point to a directory containing a large number of files, with some duplicates.

>>> srchpath = 'c:/Python25'
>>> allfiles = get_all_files(srchpath)
>>> t = []
>>> from time import clock as clk
>>> t.append(clk()); res1 = find_dupes_1(allfiles); t.append(clk())
>>> t.append(clk()); res2 = find_dupes_2(allfiles); t.append(clk())
>>> t.append(clk()); res3 = find_dupes_3(allfiles); t.append(clk())
>>> len(allfiles) # number of data files processed
8371
>>> print "method 1: %5.5f; method 2: %5.5f; method 3: %5.5f" % \
...     (t[1]-t[0], t[3]-t[2], t[5]-t[4])
method 1: 0.00761; method 2: 5.61522; method 3: 0.01945
>>> len(res1), len(res2), len(res3)
(41, 802, 802)

I've imported the function clock() and renamed it to clk() (to save a few characters). The function clock(), part of the time module, returns the system clock and is very useful for comparing performance. Notice how I've entered three function calls in one line. This is important: if you split those into three separate statements, the time it actually took you to type the commands is also added to the time difference, offsetting results.
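The idea behind method 3, letting the dictionary's hashing do the lookup, can also be sketched in Python 3 syntax. The following is a variation under stated assumptions, not the book's listing: the function name find_dupes_by_key and the sample entries are hypothetical, the key is the (file name, file size) pair rather than the file name alone, and each group reports every path after the first once, so its counts will not necessarily match those of find_dupes_3(). Timing uses time.perf_counter(), since time.clock() is no longer available in recent Python 3 releases.

```python
from collections import defaultdict
from time import perf_counter

def find_dupes_by_key(thefiles):
    """Group entries by (file name, file size); return path names of repeats."""
    seen = defaultdict(list)
    for filename, pathname, filesize in thefiles:
        seen[(filename, filesize)].append(pathname)
    dupes = []
    for pathnames in seen.values():
        # The first path in each group is treated as the original.
        dupes.extend(pathnames[1:])
    return dupes

# Hypothetical entries in the [file name, path name, file size] layout
# produced by get_all_files().
sample = [
    ['a.jpg', 'd1/a.jpg', 100],
    ['a.jpg', 'd2/a.jpg', 200],
    ['a.jpg', 'd3/a.jpg', 200],
    ['b.jpg', 'd1/b.jpg', 100],
    ['a.jpg', 'd4/a.jpg', 100],
]

start = perf_counter()
dupes = find_dupes_by_key(sample)
elapsed = perf_counter() - start
print(sorted(dupes), '%.5f s' % elapsed)
```

Because the timing calls sit in the same script as the function call, the elapsed time does not include typing time, which is the same concern the one-line trick above addresses at the interactive prompt.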
Note: Because method 2 is quite inefficient, for a large number of files or a slow machine it might take considerable time to compute.

Although method 1 seems the fastest, in reality it's inaccurate and shouldn't be used.

In the preceding implementations, we do not check the contents of the files to ensure they are indeed identical. It is quite possible to add that capability by modifying the functions and comparing the contents of two files, file1 and file2, as well:

>>> if open(file1, 'rb').read() == open(file2, 'rb').read():
...     print 'identical files'

This method reads the entire files to memory and compares them byte by byte. Note that this is not a good option if the files are large; reading chunks or using other mechanisms may be better (see the "Comparing Files" section in Chapter 10).

Catalogs

We've discussed splitting data files into directories and subdirectories and mentioned that it's a good habit to group files in that manner. While this is an excellent method of maintaining what's what, it's limited to one division. That is, if you'd like to split files into directories based on several criteria, what do you do with a data file that fits several of those criteria?

This is where catalogs come in handy. Catalogs are text files that hold data in columns: the first column contains the file names, and subsequent columns contain subcategories (other criteria). Ideally you'd like to use CSV because there's a good chance you'll be editing the catalog file manually in a spreadsheet application or automatically with Python; CSV fits that role perfectly.

Once you have a catalog file, it's easy to select only files meeting a specific criterion and run a script on those selected files.

Example: Creating a Clean Catalog File

The first step is to generate a basic catalog file, or a clean catalog file. This clean catalog file is generated automatically, using Python. For every file encountered, the full path as well as
the file size is retrieved. Listing 4-12 shows an example of creating a clean catalog of files with extension .py.

Listing 4-12. Creating a Clean Catalog

import os, csv

# rename the following to a directory of your choosing
srchpath = '../src'

# the CSV header
catalog = [['Filename', 'pathname', 'size']]

# walk directory tree
for root, dirs, files in os.walk(srchpath):
    for file in files:
        # process only .py files
        if file.lower().endswith('.py'):
            pathname = os.path.join(root, file)
            filesize = os.path.getsize(pathname)
            catalog.append([file, pathname, filesize])

# create the clean catalog
f = open('../data/clean_catalog.csv', 'wb')
csv.writer(f).writerows(catalog)
f.close()

To follow along, change the srchpath variable to point to a directory containing Python files, such as the root Python directory (c:\Python25). I chose to list the contents of my ../src directory.

The script walks the search directory looking for Python files (files ending with the extension .py, case insensitive). For every file encountered, we retrieve the file size. We then store all the information in a CSV file as shown in previous examples.

Filename,pathname,size
get_all_files.py,../src/get_all_files.py,385
read_ini.py,../src/read_ini.py,282
write_ini.py,../src/write_ini.py,330
cmp_fd.py,../src/cmp_fd.py,2285
unique.py,../src/unique.py,273
tips.py,../src/tips.py,151
create_catalog.py,../src/create_catalog.py,595
stock_charts.py,../src/stock_charts.py,1290
yahoo_data.py,../src/yahoo_data.py,1218
read_write_structs.py,../src/read_write_structs.py,795
running_index.py,../src/running_index.py,613

Next you take notes. For example, if a script is a draft, you mark it as such.
So now you have an additional column: "Draft?" The contents of the catalog file will look something like this:

Filename,pathname,size,Draft?
get_all_files.py,../src/get_all_files.py,385,
read_ini.py,../src/read_ini.py,282,
write_ini.py,../src/write_ini.py,330,
cmp_fd.py,../src/cmp_fd.py,2285,Yes
unique.py,../src/unique.py,273,
tips.py,../src/tips.py,151,Yes
create_catalog.py,../src/create_catalog.py,595,
stock_charts.py,../src/stock_charts.py,1290,
yahoo_data.py,../src/yahoo_data.py,1218,
read_write_structs.py,../src/read_write_structs.py,795,
running_index.py,../src/running_index.py,613,

For the purpose of this exercise, I chose to use .py files, but you could just as well use the script on data files. In this manner, running a script on only clean data from the annotated catalog is manageable and reproducible.

Note: Maintaining catalog files is a delicate job. Ensure your catalog files are always under version control, or better yet, a software configuration management system (for example, CVS, Subversion, or Mercurial; see Chapter 2). You will constantly need to re-create clean (unannotated) catalogs if data is added. Consider investing time in maintaining your catalogs to keep them clean and up to date.

If you find that the number of columns in your catalog files has increased and is unmanageable, consider using a database instead of a CSV file.

Files vs. a Database

There are a lot of pros for using databases over the management of files in directories. If your data becomes too complex to manage, rethinking and redesigning your data infrastructure is not a bad idea. That being said, I personally have found that databases do not add to my productivity. In my mind, the reasons are as follows:

• The nature of the work: When you design a database, it's important to know a lot of the information up front. A good database relies on a good database design. And good database design relies on knowing the information and structure beforehand.
The work described here does not follow that path. As presented in the beginning of the chapter, it's an iterative process; you do not know all the information before you start. And your application is mostly for your usage, not for end users (at least at first). It's not "production-level" code yet. When it does get to production level, that is, it's an application to be used by end users, rethinking the data organization is a good idea, at which point you should consider using a database as well.

• The nature of the data: The nature of the data described here is somewhat flat. There are not a lot of connections and interconnections and hierarchy and logic. There's simply a lot of data. There's a need to analyze it, fast. Some of the files are quite large, and while it's possible to store large files in a database, it's probably not the most efficient way.

• Overhead: Databases introduce overhead. Some may argue that it's not significant, and they may be right. However, there's another piece of code, a database engine, that needs interfacing. Yes, Python provides good database support, but it's not the same as opening a file natively in your operating system. The overhead is in several layers: backup is more complex, code writing requires additional libraries, designing databases requires some experience (which you might not have), transferring the work to another computer is not easy, and maintenance is also required.

Note: It's worth mentioning that the SQLite database module (sqlite3), which is part of the Python Standard Library, has very little overhead and is an excellent package for working with databases should you require one.

• Immediate interaction: Say you'd like to browse for data and view files. With a database you'd have to write an application just to extract data, and then to view it. The interaction is less immediate in my mind.

I know I'm not being fair in my analysis; I'm mostly showing the cons of databases.
So to offset that, I'll say that databases do have their role. If you feel that you'd like to store your data in a database, you should at least know that Python provides a great number of tools for you to choose from, so even then, Python is the right programming language for you.

Final Notes and References

Data organization is an important part of any serious data analysis and visualization project. If you follow through with the guidelines suggested in this chapter, I believe you will find that the overhead associated with maintaining data coherently is minimal, and furthermore that it's easy to write scripts to process large sets of data.

I have found the following of great value when deciding on the file name format or the date and time format in a log file:

• "Numeric representation of Dates and Time: The ISO solution to a long-standing source of confusion," http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/date_and_time_format.htm

CHAPTER 5

Processing Text Files

Text Is Everywhere

A considerable amount of data we process is text based. From a simplistic approach, text files are files that contain characters. The Python scripts we write are text files, the HTML files our web browser receives are text files, the e-mail messages we read are text files. They're simply everywhere. Because of the abundance of text files, you're likely to analyze data that comes in some form of a text file.

But in reality, there's no difference between a text file and another file, say, a binary file. They're both just files that occupy space on your hard drive. The important difference is what text files represent. If you look at data in a text file, a byte at a time, and convert every value using the ASCII table, you will be able to find (usually) intelligible text.

Note: ASCII, short for American Standard Code for Information Interchange, is a 7-bit character encoding. Each character has a number (0–127) associated with it.
Characters can also include digits and symbols. To view the ASCII table in Python, issue the following:

    for i in range(128): print i, repr(chr(i))

Note that nonprintable characters (usually values below 32) will be displayed as escape sequences, most of them in hexadecimal notation.

In a sense, text files are regular files that have information encoded in accordance with a known code. Nontext files, that is, binary files, will have values that don't necessarily correspond to the ASCII table, and if you use the ASCII table to decode a binary file, you'll probably end up with gibberish, not with text.

Text files can conform to yet another set of rules, say, the CSV format or the XML markup language. Text files that don't necessarily conform to any mapping other than the ASCII table are called plain text files. You'll mostly encounter plain text files and CSV files in this chapter.

The goal of this chapter is to present tools to work with text. First, we'll talk about strings and how to process them, and then continue with a discussion of reading and writing files, complementing the discussion with a considerable number of examples. We then turn to some topics that are likely to pop up when dealing with text files: handling CSV files, reading date and time information and parsing it, and working with regular expressions, a powerful tool for processing text. Lastly, since date and time are denoted differently around the world, we turn to a discussion about reading data that originated in a different locale.

Text and Strings

Text is composed of strings of characters, usually separated by spaces or other separators, such as commas, dots, and punctuation marks. Processing text is therefore based on processing strings. You've already seen a discussion of strings in Chapter 3, one that deals with strings as sequences of characters: slicing, indexing, and concatenating. But in essence, we didn't deal with the string as a text object.
You could've just as well thought of the string as a sequence of numbers, and the discussion would still be valid. While that approach is correct, it's too simplistic. When we view strings that way, we lose important information. Consider this book: it's made of text. But as you read it, there's more information than just a sequence of characters. There are words, lines, and punctuation marks. And even then, there's still yet more information: for example, words that begin with a capital letter have a different meaning. Those distinctions are important to us when we're reading text.

The following section deals with functions and ideas that help us write code to process higher-level textual concepts; "string" is no longer merely a sequence of characters.

Splitting Text

The first tool at our disposal is the split() function, which is a string method:

    >>> "split second".split()
    ['split', 'second']

The split() function splits a string into a list of strings once a separator is encountered. The default separator is a whitespace string and is one of the following: carriage return '\r', line feed '\n', tab '\t', vertical tab '\v', form feed '\f', and a space. Vertical tabs and form feeds are less frequently used.

The split() function does not include the separators in the list, nor does it care how long the separator string is. This is especially useful if you're splitting text that's made of columns, with a varying number of spaces between fields:

    >>> grocery_list = """Milk   2
    ... Eggs  12"""
    >>> print grocery_list
    Milk   2
    Eggs  12
    >>> grocery_list.split()
    ['Milk', '2', 'Eggs', '12']

Much like it's useful to split text on words, it's also useful to split text on lines. The function splitlines([keepends]) splits a string based on line endings and removes the line endings, that is, removes the characters '\n' or '\r\n' if they exist. In case the optional value keepends is True, end-of-line characters are retained:

    >>> grocery_list = """Milk   2
    ... Eggs  12"""
    >>> grocery_list.splitlines()
    ['Milk   2', 'Eggs  12']
    >>> grocery_list.splitlines(1)
    ['Milk   2\n', 'Eggs  12']

Example: Counting the Number of Words and Number of Lines in a String

At times you'd like to count the number of words, or the number of lines in a string. This can be done by using the function len() to count the number of elements in the lists generated from the calls to functions split() and splitlines(), as demonstrated in Listing 5-1.

Listing 5-1. Counting the Number of Words and Lines in a String

    def word_line_count(s):
        """Returns the number of words and the number of lines in a string."""
        return (len(s.split()), len(s.splitlines()))

The function returns a tuple: the first element is the number of words in the string, and the second element is the number of lines in the string. Once you define the function, use it as follows:

    >>> grocery_list = """Milk   2
    ... Eggs  12"""
    >>> word_line_count(grocery_list)
    (4, 2)

Joining Strings

Much like you can split a string into a list of strings, you can join a list of strings into a new string using the join() member function. Remember, though, that join() is a string method; therefore you must have a string to operate on to begin with. So if you'd just like to combine a list of strings with no spaces in between, you should write the following:

    >>> SOS = ['...', '---', '...']
    >>> "".join(SOS)
    '...---...'

Converting Strings to Numbers

A common use of the split() function is to parse the text and then extract numeric data, which usually comes in the form of a string of digits. Once extracted, the strings representing numbers can be converted to an actual Python numeric representation instead of a sequence of digits. Converting strings to numbers can be done with either float(), int(), or long() function calls:
Converting strings to numbers can be done with either float(), int(), or long() function calls:

    >>> float('3.25')
    3.25
    >>> int('100')
    100

If you try to convert a string that doesn't represent a number to a number, an exception is raised:

    >>> float('split')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for float(): split

This can be used to your advantage: say you're looking to print only the number of items from the grocery list in previous examples. Simply employ the EAFP principle to convert every string to a number and print it.

    >>> grocery_list = """Milk   2
    ... Eggs  12"""
    >>> for item in grocery_list.split():
    ...     try:
    ...         print int(item),
    ...     except ValueError:
    ...         pass
    ...
    2 12

In this example, I took special care to discard only ValueError exceptions, which occur in case of a conversion problem.

Example: Base Conversion (Binary, Octal, Decimal, and Hexadecimal)

At times, it's useful to convert a number from one base representation to another. In case you're dealing with octal, decimal, and hexadecimal bases, this is easily achieved using the functions int(), hex(), and oct(). Since we're dealing with representations of numbers, it stands to reason we use strings. I've therefore chosen to define several functions, all of which accept a string as an argument. As shown in Listing 5-2, these functions are hex2dec(), hex2oct(), dec2hex(), dec2oct(), oct2dec(), and oct2hex(). The names of these functions are self-explanatory.

Listing 5-2. Base Conversion Helper Functions

    def oct2dec(s):
        return str(int(s, 8))

    def hex2dec(s):
        return str(int(s, 16))

    def dec2oct(s):
        return oct(int(s))

    def dec2hex(s):
        return hex(int(s))

    def hex2oct(s):
        return dec2oct(hex2dec(s))

    def oct2hex(s):
        return dec2hex(oct2dec(s))

I've left out the docstrings: I think the function names are documentation enough. I also chose to use int(); the function long() would've worked just as well.
The functions do not perform any sort of error checking (e.g., ensuring that they received a string as an input). Here's a possible use of these functions:

    >>> hex2dec('ffff')
    '65535'
    >>> oct2hex('777')
    '0x1ff'

Note: In Python 2.6, the notation of the octal base accepts a zero and the o character: 0o or 0O. This change is accompanied with the introduction of binary numbers in Python 2.6 and above. Binary numbers are denoted with a leading 0B or 0b (zero and the character b).

BINARY CONVERSION IN PYTHON 2.5

At the time of the writing of this book, the external packages used to create much of the code did not yet catch up to Python version 2.6, and so I resorted to using version 2.5. In case you're in the same boat and would like to use the binary base-conversion helper functions, here's a short implementation of the dec2bin() function that works in Python 2.5 as well. Combine this function with the function int() or long() to implement all other base conversions.

    def dec2bin(s):
        bin_list, num = [], int(s)
        if num < 0:   # we don't convert negative numbers
            raise ValueError("value must be positive")
        if not num:   # special case, number is zero
            return '0'
        # regular case
        while num:
            bin_list.append('1' if (num & 1) else '0')
            num >>= 1
        return "".join(reversed(bin_list))

The way the function works is as follows. The string representing a decimal value is converted into a number. The number is then checked for special cases (negative, zero) and proceeds to the conversion in the while loop.

Within the while loop, if the number's least significant bit is equal to 1 (condition num & 1), a '1' is added to the list of digits, bin_list; otherwise, a '0' is added to the list. I've used a conditional expression similar to the ?: expression in C (go to http://docs.python.org/whatsnew/2.5.html and scroll down to PEP 308). The number is then right-shifted 1 bit, and the whole cycle repeats itself.
The while loop ends when the shifted number reaches zero, effectively meaning all its binary digits were converted.

Finally, the function returns a string from the list of digits, only the string has to be reversed since we've converted from the least significant bit first to the most significant bit last. I've used the iterator reversed() to present the binary digits in the proper sequence.

The function does not accept negative values. It shouldn't be too hard to support negative numbers as well, but I've never used negative binary numbers, and so the need has never arisen.

Let's complete the preceding helper function by implementing the functions bin2oct(), bin2dec(), bin2hex(), oct2bin(), and hex2bin() using dec2bin(), as shown in Listing 5-3. If you're running version 2.6, I would suggest implementing the function dec2bin() as a simple return bin(int(s)); if you're using version 2.5, use the implementation suggested in the sidebar "Binary Conversion in Python 2.5."

Listing 5-3. Binary Conversion Helper Functions

    def bin2oct(s):
        return dec2oct(bin2dec(s))

    def bin2dec(s):
        return str(int(s, 2))

    def bin2hex(s):
        return dec2hex(bin2dec(s))

    def oct2bin(s):
        return dec2bin(oct2dec(s))

    def hex2bin(s):
        return dec2bin(hex2dec(s))

Testing Your Implementation: exec and assert

This is a bit of an off-track discussion and somewhat advanced, but I thought it appropriate in the context of the preceding discussion.

As you implement the base-conversion helper functions, you'll find that it's quite possible that you've made a mistake. Those are implementations of nested function calls and are prone to human error.

Python provides several testing modules: doctest and unittest (see the Python Standard Library, http://docs.python.org/library/test.html). However, I chose a different approach, one that does not make use of these modules, in hopes of shedding light on two new statements: exec and assert.

The first statement, assert, will raise an AssertionError in case a condition isn't met.
This is quite useful for testing purposes:

    >>> assert 1 == 2
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError

Adding assert statements in your code is a good way to ensure things behave the way you expect them to, for example, making certain an argument passed to a function is of a specific type.

Tip: assert statements are not executed when you run Python with the optimization switch turned on (-O).

The statement exec executes a string as if you typed it in the interpreter:

    >>> exec "print 1+2"
    3

The exec statement can be used for automating command execution.

At first, exec might not seem such a big deal. But consider those functions in the previous example: there are 12 functions corresponding to all combinations of base conversions, bin2oct(), bin2dec(), bin2hex(), oct2bin(), and so forth. Testing all these functions is tedious. If you watch closely though, you'll find there's a pattern. And when there's a pattern, it stands to reason to write a computer program to perform the task for us. This is exactly where exec comes to life. The idea is to create a list of strings, each string detailing a function to be executed, and then executing each and every string (see Listing 5-4).

Listing 5-4. Testing Base-Conversion Function Implementations

    def testbases():
        """Tests implementation of base conversion functions"""
        v0 = {'bin': '0', 'oct': '0', 'dec': '0', 'hex': '0x0'}
        v1 = {'bin': '1111', 'oct': '017', 'dec': '15', 'hex': '0xf'}
        for v in [v0, v1]:
            perms = [(a, b) for b in v for a in v if a != b]
            for (s1, s2) in perms:
                tc = "assert %s2%s(v['%s'])==v['%s']" % (s1, s2, s1, s2)
                exec tc

I created two test vectors: v0 and v1. Variables v0 and v1 are dictionaries containing a string representing the base as the key and a string representing the number as the value. I took care in ensuring that the string representing the base names follows the three-letter notations I've used for the function names. I then iterate through my test vector list and execute each test case (tc).
Let's break this down into smaller chunks.

I first create a list comprehension named perms that generates all permutations of bases as long as they're not identical (hence the condition a != b):

    >>> v = {'bin': '0', 'oct': '0', 'dec': '0', 'hex': '0x0'}
    >>> perms = [(a, b) for b in v for a in v if a != b]
    >>> from pprint import pprint
    >>> pprint(perms)
    [('dec', 'bin'),
     ('hex', 'bin'),
     ('oct', 'bin'),
     ('bin', 'dec'),
     ('hex', 'dec'),
     ('oct', 'dec'),
     ('bin', 'hex'),
     ('dec', 'hex'),
     ('oct', 'hex'),
     ('bin', 'oct'),
     ('dec', 'oct'),
     ('hex', 'oct')]

To modify this list comprehension to generate actual assertion calls requires some string manipulation, but you already know how to use format specifiers, so here it is:

    >>> for (s1, s2) in perms:
    ...     tc = "assert %s2%s(v['%s'])==v['%s']" % \
    ...         (s1, s2, s1, s2)
    ...     print tc
    ...
    assert dec2bin(v['dec'])==v['bin']
    assert hex2bin(v['hex'])==v['bin']
    assert oct2bin(v['oct'])==v['bin']
    assert bin2dec(v['bin'])==v['dec']
    assert hex2dec(v['hex'])==v['dec']
    assert oct2dec(v['oct'])==v['dec']
    assert bin2hex(v['bin'])==v['hex']
    assert dec2hex(v['dec'])==v['hex']
    assert oct2hex(v['oct'])==v['hex']
    assert bin2oct(v['bin'])==v['oct']
    assert dec2oct(v['dec'])==v['oct']
    assert hex2oct(v['hex'])==v['oct']

I've printed a string associated with the command to be executed. The strings represent commands that check the functionality of the base helper functions. Now all that's needed is exec tc.

Note: If you're using the built-in function bin() available in Python 2.6 and not the implementation of dec2bin() in the sidebar "Binary Conversion in Python 2.5," be sure to change the notation to include a leading 0b in case of binary values. The octal notation 0o is optional in version 2.6, and the default behavior is just a leading zero without an o, so there's no need to change those. Vector v0 would then be v0 = {'bin': '0b0', 'oct': '0', 'dec': '0', 'hex': '0x0'}.
The nice thing about this implementation is that you can easily add other bases, say, the functions that convert base 5: qui2bin(), qui2oct(), qui2dec(), and so on (I've used the notation qui, which is short for quinary, base 5).

Find and Replace

The next set of interesting functions are find() and replace(). The method find() locates the first occurrence of a substring in a string and takes the general form find(substring[, start[, end]]). The parameters start and end are optional and are used to limit the search to indices that are greater than or equal to start and are less than end, if those arguments are provided:

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> grocery_list.find('2')
    7
    >>> grocery_list.find('2', 10)
    16
    >>> grocery_list.find('2', 10, 16)
    -1

In case a substring isn't found, the return value is -1.

Caution: Be sure to compare the value of find() with -1, and not with True, as -1 (substring not found) is considered True, while a match at index 0 is considered False. That is, instead of writing if str.find(substr), write if str.find(substr) != -1.

The function replace() doesn't really replace items in a string, as strings are immutable. Instead, it creates a new string, with every occurrence of the old substring replaced with the new substring:

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> grocery_list.replace('Eggs', 'Organic Eggs')
    'Milk   2\nOrganic Eggs  12'

The replace() method will replace as many occurrences as are possible unless the count argument is provided, as follows: replace(old, new[, count]), in which case at most count occurrences will be replaced.

In case you'd like to know in advance how many substitutions will occur, you can use the count(substr) method, which counts the number of occurrences of a substring in a string:

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> grocery_list.count('2')
    2
    >>> grocery_list.count('Eggs')
    1

The method count() also accepts optional start-of-search and end-of-search indices: count(substr[, start[, end]]); the behavior is similar to that of function find().

Stripping Strings

Stripping strings is the process of removing extra whitespace characters or another set of characters from a string. The method strip([chars]) removes whitespace characters from both the right side and the left side of a string. If chars is provided, the characters in the string chars are stripped instead of whitespace characters; chars is treated as a set of individual characters, not as a prefix or suffix. Methods rstrip([chars]) and lstrip([chars]) do so on the right or left sides only, respectively:

    >>> "Hello   ".rstrip()
    'Hello'
    >>> '*-*-*SECTION BREAK*-*-*'.strip('*-')
    'SECTION BREAK'

Example: Removing Extra Spaces

In this example, we'd like to remove extra spaces from some text. We could try to use the replace() method, replacing two spaces with one:

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> new_grocery_list = grocery_list.replace("  ", " ")
    >>> print new_grocery_list
    Milk  2
    Eggs 12

That didn't work well: there are two spaces between Milk and 2 after the call to replace(). The reason for this is that replace() is not a recursive search and replace. After a replace has been made, the function keeps on looking for other occurrences, but the ones that have already been replaced add up together to form extra spaces again: those are not replaced. For example, four spaces become two, and not one.

Of course, you could keep on replacing until there are no more changes to the string, but that's a bit cumbersome. Another approach is to use split(), splitlines(), strip(), and join():

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> for line in grocery_list.splitlines():
    ...     clear_line = [s.strip() for s in line.split()]
    ...     print " ".join(clear_line)
    ...
    Milk 2
    Eggs 12

I've used a for loop to iterate through the split lines.
For every line, I've created the list clear_line, which is each word stripped of separators (in our case extra spaces). I then joined the list of words with a space.

If all this seems considerable effort for a simple task, you're absolutely right. There are other, better ways to perform this: regular expressions. More about these in the section "Regular Expressions" toward the end of this chapter.

String Formatting

Using format specifiers, presented in Chapter 3, you can control string format very accurately. But format specifiers do not take into consideration that we're dealing with text and words; they treat strings mostly as a sequence of characters. The functions presented in this section add string formatting options that are more suited for working with words and text.

The methods upper() and lower() return strings with all characters in uppercase or lowercase, respectively:

    >>> "Middle of Town".upper()
    'MIDDLE OF TOWN'
    >>> "Middle of Town".lower()
    'middle of town'

The function swapcase() returns a string with the characters' case inverted:

    >>> "Middle of Town".swapcase()
    'mIDDLE OF tOWN'

The method capitalize() returns a string with the first letter in uppercase and the remaining letters in lowercase. Note that this affects only the first character of the string, disregarding English grammar rules or whether there are punctuation marks or line breaks:

    >>> "first sentence.\nSecond sentence. Third Sentence".capitalize()
    'First sentence.\nsecond sentence. third sentence'

The method title() capitalizes the first letter of every word.
Again, this is not in accordance with English grammar rules, as some words in titles should not be capitalized (e.g., "the"):

    >>> "first sentence.\nSecond sentence. Third Sentence".title()
    'First Sentence.\nSecond Sentence. Third Sentence'

The method center(n[, char]) returns a string of length n with left and right padding of the string with char (default is space) so as to have the string centered in the middle:

    >>> "Middle of Town".center(26, '*')
    '******Middle of Town******'

The methods ljust(n[, char]) and rjust(n[, char]) perform left and right justification, respectively, with the optional fill character being char:

    >>> "Eastside".rjust(20)
    '            Eastside'
    >>> "Westside".ljust(20, '-')
    'Westside------------'

String Conditionals

The following is a set of string conditionals: methods that ask questions about strings.

The method endswith(substr[, start[, end]]) returns True if a string ends with substr. The start and end arguments limit the search indices similarly to previously discussed string functions; from now on I'll refrain from explaining their effect. The endswith() function is useful for checking file name extensions, for example:

    >>> "a20.jpg".endswith('jpg')
    True
    >>> "a20.jpg".endswith('JPG')
    False

The second expression evaluates to False because endswith() is case sensitive. However, substr can be a tuple as well, accommodating several condition tests:

    >>> "a20.JPG".endswith(('jpg', 'JPG'))
    True

The method startswith(substr[, start[, end]]) is similar to endswith(), except that it checks the beginning of a string.

The methods isalpha(), isdigit(), and isalnum() return True if all the characters in the string are alphabetic, digits, or both, respectively:

    >>> "a20.jpg".isalnum()
    False

The reason the method isalnum() returns False in the preceding example is that the character '.' (dot) is neither alphabetic nor a digit.
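The extension check with endswith() can be sketched as a small filter. This is only an illustration: the helper name has_extension() and the default extension tuple are assumptions, and the code is written so that it also runs on Python 3. Lowercasing the name first avoids listing every capitalization, and including the dot keeps a name such as 'jpg' from matching.

```python
def has_extension(filename, extensions=('.jpg', '.jpeg')):
    """Return True if filename ends with one of the extensions, ignoring case."""
    # endswith() accepts a tuple of suffixes; tuple() allows lists as input too
    return filename.lower().endswith(tuple(extensions))

names = ['a20.jpg', 'a21.JPG', 'notes.txt', 'jpg']
print([n for n in names if has_extension(n)])
```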
Similarly, the methods islower(), isspace(), istitle(), and isupper() return True if the string is all lowercase, all whitespace, of the title form (first letter in every word capitalized), or all uppercase, respectively:

    >>> "a20.jpg".islower()
    True

Note: The conditionals starting with is, such as islower(), will return False if the string is empty.

More on Strings

The preceding isn't a full account of strings and string methods. For example, in Python 2.6 a new formatting function, str.format(), is provided (see http://docs.python.org/library/string.html). If your work is text-heavy, have a look at the online documentation for additional information. The discussion that follows relies on the preceding string methods but not on ones that were not discussed.

Files

Text files are files that contain textual data, that is, text strings. We've talked about strings and text extensively; now it's time to talk about files.

In Python, you access files using the file data type. Working with files is quite similar to doing so in other programming languages: open a file and receive a file object, read from the file or write to the file using the file object, and lastly close the file, again using the file object. You can open a file either for reading, writing, or appending, and you can open it in binary mode or text mode.

Opening a File

To open a file, use the open(filename[, mode[, buffering]]) built-in function. The function open() returns a file object that is used for subsequent file operations. The first argument, filename, is required. The second argument, mode, is optional and can take the values listed in Table 5-1.

Table 5-1. File Open Modes

    Mode   Description
    r      Opens a file for reading. This is the default value if mode isn't specified.
    w      Opens a file for writing, overwriting an existing file.
    a      Opens a file for writing in append mode; all write operations are performed at the end. If the file doesn't exist, it is created.
    r+     Opens a file for reading and updating. If the file doesn't exist, an exception is raised.
    w+     Creates a new file for writing and updating, overwriting an existing one.
    a+     Opens a file for reading and writing in append mode. All write operations are performed at the end. If the file doesn't exist, it is created.

Adding the character 'b' to the mode ensures the file is opened in binary mode (e.g., 'rb'). Adding the character 't' to the mode ensures the file is opened in text mode (e.g., 'wt'). A file can be opened either in text mode or in binary mode, but not both (default is text mode).

The difference between binary mode and text mode is whether Python tries to convert line-ending characters it encounters to '\n'. In Windows, the characters '\r\n' are used to denote the end of the line; in Linux, it's just '\n'. To have a consistent method to access text files, use text mode. When dealing with binary files, or when it's important for you to have end-of-line characters unmodified, use the binary mode.

Note: In append mode, write operations are performed at the end of the file, effectively guarding existing data, whereas in write mode, you're allowed to write anywhere in the file, possibly overwriting data.

The third parameter, buffering, is optional as well and determines the file buffering mode. See http://docs.python.org/library/functions.html for information about buffering.

Closing a File

Contrary to the open() function, which is a built-in function, the close() function is a member function of the file object and not a built-in function. To close files, use the file method f.close() (assuming f is a file object).

It's generally good practice to close a file once you're done with it. But in case you don't, Python closes the file for you automatically once the file object is no longer in use.
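One way to make sure a file is closed without calling close() yourself is the with statement. It is not used in this chapter's examples, so treat the following as a side sketch under assumptions: it requires Python 2.6 or later (or Python 3), and the scratch path is hypothetical, standing in for the ../data directory used in this book.

```python
import os
import tempfile

# Hypothetical scratch file; the book's examples write to ../data instead.
path = os.path.join(tempfile.mkdtemp(), 'tobuy.txt')

with open(path, 'w') as f:   # f.close() is called when the block ends
    f.write("Milk   2\nEggs  12")

print(f.closed)   # the file object still exists, but it is closed
```

The file is closed even if the body of the with block raises an exception, which an explicit f.close() at the end of a script does not guarantee.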
The following sample shows how to open and close a file:

    >>> f = open('somefile.txt')
    >>> f
    <open file 'somefile.txt', mode 'r' at 0x00BD03C8>
    >>> f.close()
    >>> f
    <closed file 'somefile.txt', mode 'r' at 0x00BD03C8>
    >>> type(f)
    <type 'file'>

Writing Text

Once a file is open for writing (or appending) and before it's closed, you can write strings to it using the methods write(str) and writelines(strseq).

The method write() writes a string to the file:

    >>> grocery_str = "Milk   2\nEggs  12"
    >>> f = open('../data/tobuy.txt', 'wt')
    >>> f.write(grocery_str)
    >>> f.close()

The contents of the file tobuy.txt are

    Milk   2
    Eggs  12

The method writelines(strseq) writes a sequence of strings to a file:

    >>> grocery_list = ["Milk", "2", "Eggs", "12"]
    >>> f = open('../data/tobuylist.txt', 'wt')
    >>> f.writelines(grocery_list)
    >>> f.close()

The contents of file tobuylist.txt are

    Milk2Eggs12

Notice how writelines() adds neither spaces nor line breaks.

Note: I've assumed you are following the book convention of having your source code reside in directory Ch5/src and your data directory in Ch5/data. If that's not the case, change the path to your data files in the preceding examples.

Reading Text

Once a file is open for reading, you can make use of the methods read(), readline(), or readlines() to read the file contents. You can also iterate over the file object to read a line at a time.

I found that when dealing with text files, my code typically falls into one of three categories:

1. Reading the entire file at once using the methods read() or readlines(). This option is preferable if the files are not too large.

2. Iterating over the file object, reading a line at a time. This option is preferable for larger text files.

3. Using a while loop with the method read(n). This option is a good candidate in case you don't necessarily want to treat the file as lines of text.
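The three categories above can be sketched side by side. This is a minimal illustration, not the book's code: it uses Python 3 compatible syntax, a hypothetical scratch file in place of ../data/tobuy.txt, a chunk size of 4 characters chosen arbitrarily, and the with statement (Python 2.6+) so that each file is closed when its block ends.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'tobuy.txt')
with open(path, 'w') as f:
    f.write("Milk   2\nEggs  12")

# 1. Read the entire file at once.
with open(path) as f:
    whole = f.read()

# 2. Iterate over the file object, one line at a time.
with open(path) as f:
    lines = [line.rstrip() for line in f]

# 3. Read fixed-size chunks in a while loop; read() returns '' at end-of-file.
chunks = []
with open(path) as f:
    chunk = f.read(4)
    while chunk:
        chunks.append(chunk)
        chunk = f.read(4)

print(lines)
print(len(chunks))
```

The first form holds the whole file in memory, the second holds one line at a time, and the third holds at most one chunk, which is why the latter two suit larger files.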
Reading the Entire File at Once

Assuming you're dealing with not-too-large files, using the method read() or readlines() to read the entire file at once should be your first choice. The method read([n]) reads n bytes from the file, returning them as a string. If n is not specified or is negative, the entire file is read. For this example, we'll use the file tobuy.txt generated in the previous section, "Writing Text" (first snippet of code):

    >>> f = open('../data/tobuy.txt')
    >>> text = f.read()
    >>> f.close()
    >>> print text
    Milk 2
    Eggs 12

or more compactly:

    >>> print open('../data/tobuy.txt').read()
    Milk 2
    Eggs 12

The method readlines() reads the file at once, returning a list of strings:

    >>> open('../data/tobuy.txt').readlines()
    ['Milk 2\n', 'Eggs 12']

Iterating Over the File Object

This option is suited for cases when you want to process your file a line at a time, but you don't want to read the entire file at once, due to, say, memory constraints. Here's an example:

    >>> for i, line in enumerate(open('../data/tobuy.txt')):
    ...     print "%d: %s" % (i, line.rstrip())
    ...
    0: Milk 2
    1: Eggs 12

Using a while Loop

Use this method in conjunction with read() to process chunks of the file at a time. Again, this is best suited for larger files and for cases where you don't want to treat a file as a list of lines:

    >>> f = open('../data/tobuy.txt')
    >>> ch = f.read(1)
    >>> while ch != 'k' and ch:
    ...     print ch,
    ...     ch = f.read(1)
    ...
    M i l
    >>> f.close()

This example reads the file a byte at a time and stops upon encountering the character 'k' or the end of the file, where ch is an empty string and evaluates to False.

Working with Text Files

Now that we have the basics covered, that is, reading and writing files and processing strings, it's time to combine the two new skills. This section is presented as a list of examples. The examples can be used for educational purposes, but they can also form the basis of helper functions for text-based data processing.
With time, I hope you modify the code presented here to best fit your needs. It is important that you treat these examples for what they are: examples, not production code. Most of the functions do not perform any error checking or handle exceptions, and they should not be used as-is but only for educational purposes. When possible, I've added a discussion on how these examples can be improved upon.

To work with files larger than the contrived grocery list used previously, I've chosen the electronic version of the book Flatland, by Edwin A. Abbott, available for download from Project Gutenberg, located at http://www.gutenberg.org. A direct link to the e-book at the time of the writing of this book is http://www.gutenberg.org/files/97/97.txt. Once you download the file, save it in folder data under the original file name 97.txt. Your directory structure should look similar to that presented near the end of Chapter 2 in the section "Example: Directory Structure for the Book":

    Ch5
        src
        data

In this directory structure, src is where your code is, as well as your current working directory, and data holds the data files (namely 97.txt, the Flatland e-book).

Example: Character, Word, and Line Count

Similar to the example presented at the beginning of the chapter, you're now confronted with the task of counting the characters, words, and lines in a file, and not just in a string. The solution is an immediate extension of the earlier example, using split(), splitlines(), and len(). I've named the function wc(), after the popular Linux shell command wc filename (see Listing 5-5).

Listing 5-5. Counting the Number of Characters, Words, and Lines in a File

    def wc(filename):
        """Returns the number of characters, words and lines in a file.

        The result is a tuple of the form (characters, words, lines).

        """
        data = open(filename, 'rb').read()
        return (len(data), len(data.split()), len(data.splitlines()))

The function returns a tuple of three elements: the first element is the number of characters in the file, the second is the number of words, and the third is the number of lines. I've also chosen to open the file in binary mode and not in text mode, so that characters are counted properly, without any conversion of '\r\n' to '\n' that would throw the count off.

Note: Notice how I've laid out the docstring. It's standard practice that when more than one line is needed to document the function, a blank line is added immediately after the short function description. Subsequent documentation lines are aligned with the first line, with no extra indentation.

It's possible that you're dealing with truly large files, in which case a different approach should be employed: iterating over the file object (see Listing 5-6).

Listing 5-6. Counting the Number of Characters, Words, and Lines in a Very Large File

    def wc_large(filename):
        """Returns the number of characters, words and lines in a large file.

        The result is a tuple of the form (characters, words, lines).

        """
        num_chars, num_words, num_lines = 0, 0, 0
        for line in open(filename, 'rb'):
            num_chars += len(line)
            num_words += len(line.split())
            num_lines += 1
        return (num_chars, num_words, num_lines)

Here are the results from running both functions:

    >>> wc('../data/97.txt')
    (216562, 36553, 3920)
    >>> wc_large('../data/97.txt')
    (216562, 36553, 3920)

Example: head and tail

Most Linux system administrators know and love the head and tail command-line utilities. They offer a fast check on how an installation is coming along, they're great for probing message logs, and they're a good way to see whether any errors occurred during boot time.
The head and tail command-line utilities print n lines from the beginning or end of a file, respectively. You'd typically use these commands to look at log files because most log files are plain text files with data written sequentially: a recent event is logged at the end of a file. The following command prints the last 20 lines of the dmesg log file (a common Linux log file):

    tail -20 /var/log/dmesg

Using the method readlines(), both the head() and tail() functions are easily implemented, as shown in Listing 5-7.

Listing 5-7. head() and tail() Functions

    def head(filename, n=10):
        """Prints the first n lines of the file."""
        for line in open(filename).readlines()[:n]:
            print line.rstrip()

    def tail(filename, n=10):
        """Prints the last n lines of the file."""
        for line in open(filename).readlines()[-n:]:
            print line.rstrip()

Note: It's also possible to replace the line print line.rstrip() with print line, (notice the comma) to suppress the extra line breaks.

In case your files are too large to be read entirely into memory, things get trickier. Implementing the head() function is possible by iterating over the file object, as shown in Listing 5-8.

Listing 5-8. head() Function for Very Large Files

    def head_large(filename, n=10):
        """Prints the first n lines of a very large file."""
        for i, line in enumerate(open(filename)):
            if i >= n:
                break
            print line.rstrip()

You can convince yourself that both functions return proper results by modifying the code to return a list and then comparing the values returned from the two functions.

Unfortunately, implementing the function tail() in a similar manner, iterating over the file object, is much more complex. First, you'd have to go through the entire file, reading every line. Remember, this is a large file, so doing so will take considerable time.
And even then, the second problem you encounter is that you don't know in advance which line is the last line, so you'd either have to perform two passes on the file (the first counting the number of lines and the second printing the last n lines), or you'd have to continually store the last n lines read. In both cases, yikes. A third approach is to use random access functions to start reading the file at the end and work your way backward. This requires use of the function seek(), which will be covered in Chapter 10.

Example: Splitting and Combining Files

Back in the day, computer users used to transfer files on 360K diskettes. Since this was rather painful, they opted to use a file compression utility that did both splitting and compressing. Alas, one of the disks always seemed to be misplaced, rendering the data totally useless.

Which brings up a common task when dealing with large files: you may need to split them into smaller, more manageable files. By that, I don't mean that each chunk contains legible information, merely that you can now use size-limited media (e-mail, flash drive) to transfer the split files. The receiving end will then need to reconstruct the original file from the split files.

The function splitfile() splits a file into n smaller files, each with a modified file name that is composed of the original file name plus the split file index (e.g., 97.txt.1). The function combinefiles() combines several files of the preceding pattern into one, as demonstrated in Listing 5-9.

Listing 5-9. Splitting and Combining Files

    def splitfile(filename, size=1024**2):
        """Splits a file into n smaller files.

        Files are created with a running index.

        """
        fin, index = open(filename, 'rb'), 0
        data = fin.read(size)
        while data:
            index += 1
            outfilename = filename + '.' + str(index)
            fout = open(outfilename, 'wb')
            fout.write(data)
            fout.close()
            print "Created file %s, size %d" % (outfilename, len(data))
            data = fin.read(size)
        return

    def combinefiles(filename):
        """Combines a previously split file.

        Filename extensions are assumed a running index.
        Important note: if a file named 'filename' exists it will be overwritten.

        """
        fout, index = open(filename, 'wb'), 0
        while True:
            index += 1
            try:
                data = open(filename + '.' + str(index), 'rb').read()
                fout.write(data)
            except IOError:
                break
        fout.close()
        print "Created file %s from %d file(s)\n" % (filename, index-1)

The functions themselves are self-explanatory and should prove easy enough to follow. Here's how you would use them:

    >>> splitfile('97.txt', 100000)
    Created file 97.txt.1, size 100000
    Created file 97.txt.2, size 100000
    Created file 97.txt.3, size 16562

Now copy the files 97.txt.* to a temporary folder, say, /tmp and issue

    >>> combinefiles('/tmp/97.txt')
    Created file /tmp/97.txt from 3 file(s)

To satisfy yourself that the files are indeed identical, issue the following:

    >>> data1 = open('../data/97.txt', 'rb').read()
    >>> data2 = open('/tmp/97.txt', 'rb').read()
    >>> data1 == data2
    True

In the implementation of the function combinefiles() I've chosen to again use the EAFP approach. Alternatively, I could've listed the directory contents using os.listdir() or used the glob module to achieve the same; the glob module will be discussed in Chapter 10.

While the topic of this section is text files, the functions splitfile() and combinefiles() should work on binary files just as well, since we've opened the files in binary mode.

Caution: The function combinefiles() will overwrite an existing file if it already exists.
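The alternative mentioned above, listing the split parts with the glob module instead of probing for them until an exception is raised, might look like the following. It is a rough sketch in Python 3 (not the book's Python 2 syntax), and the names list_parts and combine_parts are hypothetical. Sorting by the numeric index matters because a plain string sort would place 97.txt.10 before 97.txt.2. Like combinefiles(), the sketch overwrites the target file.

```python
import glob
import os
import tempfile

def list_parts(filename):
    """Return the existing split parts of filename, ordered by index."""
    parts = []
    for path in glob.glob(glob.escape(filename) + '.*'):
        suffix = path.rsplit('.', 1)[-1]
        if suffix.isdecimal():    # skip unrelated files such as 97.txt.bak
            parts.append((int(suffix), path))
    return [path for _, path in sorted(parts)]

def combine_parts(filename):
    """Concatenate all parts into filename; return the number of parts."""
    parts = list_parts(filename)
    with open(filename, 'wb') as fout:
        for part in parts:
            with open(part, 'rb') as fin:
                fout.write(fin.read())
    return len(parts)

# Hypothetical demo: three small parts in a temporary folder.
demo_target = os.path.join(tempfile.mkdtemp(), '97.txt')
for index, chunk in enumerate([b'Flat', b'la', b'nd'], 1):
    with open('%s.%d' % (demo_target, index), 'wb') as fpart:
        fpart.write(chunk)

print(combine_parts(demo_target))   # 3
```

Unlike the EAFP loop, this version does not stop at the first missing index, so a gap in the numbering goes unnoticed; whether that is acceptable depends on how the parts were transferred.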
Lastly, the function combinefiles() will overwrite a file if it exists; if you'd prefer different behavior, you can use the function os.path.exists() to first ensure the file does not exist and act accordingly (ask for user preference or return without overwriting the file).

Example: Searching Inside a Text File

The examples up to this point dealt with the files themselves, not really with the information they hold. The examples going forward will look at the contents as well, that is, read the file and process text.

A common programming task involves searching for a string inside a file. An even more common task is searching for a string in multiple files, but we leave that to future discussions (see Chapter 10). In Linux, a handy utility achieves this: grep. grep also provides more complex searches, ones that include regular expressions, but in this example we limit our discussion to simple string searches. Searching inside a text file is easily implemented in Python, as you can see in Listing 5-10.

Listing 5-10. Searching Inside a Text File

    def srchfile(filename, substr):
        """Searches for a substring in a file."""
        for i, line in enumerate(open(filename, 'rt')):
            if line.find(substr) != -1:
                print "%5d: %s" % (i, line.rstrip())

I've used the iterator enumerate() to retrieve both the line and the line number, as the line number is displayed in the print statement later on. I've also used the method rstrip() to remove the extra new-line character. Here's the output from running the command srchfile('../data/97.txt', 'Guide'):

     2645: Prostrating myself mentally before my Guide, I cried, "How is it, O
     2685: about your Wife," said my Guide: "she will not be long left in anxiety;
     2732: "Here we descend," said my Guide. It was now morning, the first hour
     2762: yet," said my Guide, "the time will come for that. Meantime I must
     2803: of my own. I absolutely depended on the volition of my Guide, who said
     3091: "Look yonder," said my Guide, "in Flatland thou hast lived; of Lineland

There's room for improvement of the function srchfile().
First, the function is case-sensitive. To allow for case-insensitive searches, I would suggest adding a parameter to the function that controls whether searches are case-sensitive. For a case-insensitive search, srchfile() would use the method upper() (or lower()) to convert both the line to be searched and the search string itself prior to calling find().

Another improvement would be to fix the line number alignment. Currently the line number is right-aligned in a field five characters wide. If the file has 100,000 lines or more, the columns will no longer line up. Changing the implementation to read the entire file at once gives us a total line count, allowing the calculation of the maximum number of digits using the function log10(), which is part of the math module (see Chapter 7), as follows:

    >>> numlines = 1234
    >>> from math import log10
    >>> maxdigits = int(log10(numlines)) + 1
    >>> maxdigits
    4

This only works if you know the number of lines in advance, hence the change to the function to read the entire file at once. Listing 5-11 shows a possible implementation that fixes the alignment problem.

Listing 5-11. Searching Inside a Text File (Proper Indentation)

    from math import log10

    def srchfile_ex(filename, substr):
        """Searches for a substring in a file."""
        lines = open(filename).readlines()
        fmt = r'%' + str(int(log10(len(lines))) + 1) + r'd: %s'
        for index, line in enumerate(lines):
            if line.find(substr) != -1:
                print fmt % (index, line.rstrip())

Notice how I've first created a format specifier, fmt, using a raw string and then used it to print the line number and the string.

Example: Working with Comments

I find that I repeatedly turn back to code I once wrote. And what's funny, I seem to remember mostly the comments. An interesting approach to viewing comments is to think of the comment symbol (#) as a separator. With this in mind, you can implement some interesting searches, as shown in Listing 5-12.
For example, you can search inside comments only, or perform the complement search, that is, one that ensures you're not searching comments.

Listing 5-12. Working with Comments

    def srchcomments(filename, substr):
        """Searches inside Python source comments."""
        for index, line in enumerate(open(filename, 'rt')):
            L = line.split('#')
            if len(L) == 2:
                if L[1].find(substr) != -1:
                    print "%5d: %s" % (index, line.rstrip())

This code is hardly foolproof. It assumes that the # symbol appears only once, that text before the # is always code, and that text after the # is a comment. This is not always true; for example, in the fourth line of the preceding code, L = line.split('#'), the # is hardly a comment separator. Nevertheless, on many occasions the function works fine.

Using the function in Listing 5-12 as a starting point, you could, for example, write a function to convert comments that are not on a line of their own into single-line comments. That is, convert

    print x,  # prints x and suppresses a line break

to

    # prints x and suppresses a line break
    print x,

C++ STYLE COMMENTS

I originally encountered this problem with C/C++. A compiler I was using to write code accepted the C++ style comments, whose behavior is similar to Python's comments (only with the symbols // instead of #). However, another, older compiler was used that didn't accept the C++ style comment. I had to convert all my single-line comments of the form // comment to C-style comments of the form /* comment */. I used a script similar to the preceding script. Luckily for me, the symbols // didn't appear anywhere in the code other than as part of the comments themselves.

Example: Extracting Numbers from a Text File

At times, it's useful to be able to extract only the numbers in a text file. This could be for the purpose of creating files based on existing ones, say, for testing purposes. Another scenario might be that you have a system that maintains the number of users in a text file, and you'd like to write a script to increment that number.
The function presented in Listing 5-13 reads a text file and creates a modified version of the file with all the numbers incremented. For the purpose of this example, numbers are separated by whitespace characters.

Listing 5-13. Incrementing File Contents

    def increment_contents(filename):
        """Increments values in a file, creating a new file."""
        data = open(filename, 'rt').readlines()
        for i, line in enumerate(data):
            for word in line.split():
                try:
                    data[i] = line.replace(word, str(int(word)+1))
                except ValueError:
                    # uncomment the following if you'd like feedback
                    # print word, "is not a number"
                    pass
        open(filename+'.inc', 'wt').writelines(data)

The function reads the entire file into memory and then processes the data a line at a time (it's also possible to achieve the same functionality by iterating over a file object). Every line is split into words. The code then tries to convert every word to a number and, upon success, replaces the number with an incremented value. I've used the replace() method to do that, and that's the reason you see another str() function call after the increment: the replace() method requires a string, not a number. Be aware that replace() substitutes every occurrence of the word in the line and that each substitution starts again from the original line, so the listing is reliable only when a line holds a single number, as in the test file that follows.

In implementing the function, I've chosen to use the EAFP approach; that is, I've tried to convert every single word I've encountered into an int and then increment it. On success, the data is modified. If a ValueError occurs, it means that the string cannot be converted to an int, in which case I've ignored the word. A different approach might have been to check whether the word is composed of digits using the isdigit() method, but that approach will fail on other characters such as the plus symbol. I think that EAFP here is a clear winner.

To test the function, I again resort to the grocery list. Assume the contents of file ../data/grocery_list.txt are as follows:

    Milk 12
    Eggs 12.0
    Olives 1e2
    Voodoo dolls 1+1j

After executing increment_contents('../data/grocery_list.txt'), a new file, ../data/grocery_list.txt.inc, is created with the following contents:

    Milk 13
    Eggs 12.0
    Olives 1e2
    Voodoo dolls 1+1j

The function only increments the value 12 because I've used the function int(). Had I used the function float(), the first three values would all have been incremented. However, the first number would then be converted into a float value, and I wanted to leave it as an int.

Handling both float and int values is possible with nested exception handling. First, the function tries to use int(), and if int() fails, it tries to use float() (see Listing 5-14).

Listing 5-14. Incrementing File Contents Using float() and int()

    def increment_contents_both(filename):
        """Increments values in a file, creating a new file.

        Works with both ints and floats.

        """
        data = open(filename, 'rt').readlines()
        for i, line in enumerate(data):
            for word in line.split():
                try:
                    data[i] = line.replace(word, str(int(word)+1))
                except ValueError:
                    try:
                        data[i] = line.replace(word, str(float(word)+1))
                    except ValueError:
                        # uncomment the following if you'd like feedback
                        # print word, "is not a number"
                        pass
        open(filename+'.inc', 'wt').writelines(data)

Lastly, the functions assume that numbers are always separated by spaces, which might not always be correct. An alternative approach would be to split based on punctuation marks as well as spaces (see "Example: Words Used Only Once" later in this chapter for splitting on punctuation marks as well).

CSV Files

Up to this point we've been working with plain text files. Plain text files typically do not follow any format other than that their contents are text based. But in reality, the data files you deal with are more structured than plain text files. As discussed previously in Chapter 4, the CSV file format is a good format for structured text-based data files.
The purpose of this section is to provide tools for the more advanced log file processing that will be presented in the next section.

The csv Module

The csv module, which is part of the Python Standard Library, provides simple methods to read and write CSV files. To use the csv module, issue import csv; the remaining discussion assumes you've imported the csv module properly.

There are two basic objects you'll be working with: the csv.reader and the csv.writer. As their names suggest, one is used for reading, while the other is used for writing.

The csv.reader object splits a line of text into a list of words (also referred to as fields). While you might think it's simpler to split the line of text using the split() string method, some caveats make this a bit more complex than is apparent.

Consider a CSV file with contents as follows (assume it is stored in the file ../data/sometext.csv):

    1/1/2000,"Some text, and a comma"
    1/1/2000, "Some text, and a comma"

If I were to use the split() function, I'd get

    >>> L = open('../data/sometext.csv').read().split(',')
    >>> from pprint import pprint
    >>> pprint(L)
    ['1/1/2000',
     '"Some text',
     ' and a comma"\n1/1/2000',
     ' "Some text',
     ' and a comma"\n']

Not really what I wanted; fields and rows aren't separated properly. Using the csv.reader object, I get a better result:

    >>> f = open('../data/sometext.csv')
    >>> for line in csv.reader(f, skipinitialspace=True):
    ...     print line
    ...
    ['1/1/2000', 'Some text, and a comma']
    ['1/1/2000', 'Some text, and a comma']

The csv.reader Object

To create a csv.reader object, use the following syntax: csv.reader(f[, dialect='excel'][, params]). The first parameter, f, is the CSV file, but it can also be any iterable object. The second parameter is the dialect. Since there is no clear definition of what constitutes a CSV file (and there are quite a few nuances), you can specify a dialect, which is a set of rules instructing the csv.reader parser how to handle those differences.
Furthermore, you can use an existing dialect and override some parameters of its behavior using the params field.

In the previous example, I didn't specify a dialect, so it defaulted to the 'excel' dialect. I did provide a format parameter, instructing the csv.reader to ignore the space at the beginning of a field (there's an extra space after the comma in the second line of text in my input file). You can view the list of dialects in your system by issuing csv.list_dialects().

Once you have a csv.reader object, you can iterate through the object and retrieve a list of fields for each row.

The csv.writer Object

The csv.writer object complements the csv.reader in that it allows writing of CSV files. Creating a csv.writer object is similar to creating the csv.reader object: csv.writer(f[, dialect='excel'][, params]). The difference (aside from the write vs. read operation) is that the csv.writer object requires a file-like object with a write() method, and not just any iterable object. Once you have a csv.writer object, you can use the writerow() or writerows() methods:

    >>> lines = [["1/1/2000", "Some text, and a comma"]]
    >>> csv.writer(open('../data/outtext.csv', 'w')).writerows(lines)

Note: I've created a list of rows (notice the double brackets), even though there is but one row. This is to match the input expected by the writerows() method. If you pass it a single list, and not a list of rows, the results might not match what you expect.

More csv Functionality

The csv module allows considerable customization, including the use and creation of user-defined dialects. I won't be covering this topic; however, I will be covering some parameters and their meaning.

The delimiter specifies the field separator. That is, the row is split on an occurrence of the delimiter, provided it is not escaped or quoted.
To change the delimiter, add delimiter=char as an argument to the csv object; char must be a single character:

    >>> for line in csv.reader(open('../data/sometext.csv'), delimiter='/'):
    ...     print line
    ...
    ['1', '1', '2000,"Some text, and a comma"']
    ['1', '1', '2000, "Some text, and a comma"']

The parameter quotechar specifies the quoting character used to denote a string in a CSV file.

    >>> lines = [["1/1/2000", "Some text, and a comma"]]
    >>> f = open('../data/outtext.csv', 'w')
    >>> csv.writer(f, quotechar='|').writerows(lines)
    >>> f.close()

The preceding example results in the following:

    1/1/2000,|Some text, and a comma|

In this example, the date wasn't quoted, while the text was, the reason being that the text contained the delimiter. To override quoting behavior, change the value of quoting to csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, or csv.QUOTE_NONE. The names quite obviously indicate their functionality. In case you select csv.QUOTE_NONE and a field contains the delimiter, you'd have to supply an escapechar as well:

    >>> lines = [["1/1/2000", "Some text, and a comma"]]
    >>> f = open('../data/outtext.csv', 'w')
    >>> cw = csv.writer(f, quoting=csv.QUOTE_NONE, escapechar='~')
    >>> cw.writerows(lines)
    >>> f.close()

The preceding example results in the following:

    1/1/2000,Some text~, and a comma

DictReader and DictWriter Objects

The csv module provides us with additional useful objects: the DictReader and DictWriter objects, which are similar to the csv.reader and csv.writer objects. If you follow the convention that places a header at the beginning of a CSV file, that is, that each column in the CSV file starts with a field name (see Chapter 4 for a discussion of this), values can be accessed through a dictionary with the field name as the key.

Let's turn to an example. To follow along, create the file ../data/tobuy.csv with the following content:

    Item,Count
    Milk,2
    Eggs,12
    Tomatoes,5

Now let's create a DictReader object:

    >>> import csv
    >>> fcsv = open('../data/tobuy.csv')
    >>> for row in csv.DictReader(fcsv):
    ...     print "Please buy", row['Count'], row['Item']
    ...
    Please buy 2 Milk
    Please buy 12 Eggs
    Please buy 5 Tomatoes

I've accessed the values in the CSV file using 'Count' and 'Item' as keys to the dictionary object, row: row['Count'] and row['Item']. If the columns were switched, the code would still work as expected.

Similarly, you can create a DictWriter object as follows:

    >>> header = ['Item', 'Count']
    >>> rows = [['Organic Eggs', 5], ['Cucumbers', 12]]
    >>> fcsv = open('../data/tobuy_more.csv', 'wb')
    >>> dict_wr = csv.DictWriter(fcsv, header)
    >>> dict_wr.writerow(dict(zip(header, header)))
    >>> for row in rows:
    ...     dict_wr.writerow(dict(zip(header, row)))
    ...
    >>> fcsv.close()

I first created header, a list of strings, and rows, a list of rows. I then opened a file named ../data/tobuy_more.csv for writing and attached it to a csv.DictWriter object named dict_wr. As you can see, csv.DictWriter requires the header information as well.

Now this is where it gets a little tricky. First, I'd like to write the header information to the CSV file. To do so, I create the following dictionary:

    >>> dict(zip(header, header))
    {'Count': 'Count', 'Item': 'Item'}

I then pass this dictionary as a parameter to the function writerow(), a method of DictWriter, in essence creating the header row. Now, all that's required is to do the same for the data, that is:

    >>> for row in rows:
    ...     dict_wr.writerow(dict(zip(header, row)))

And here are the results:

    Item,Count
    Organic Eggs,5
    Cucumbers,12

As you can see, the DictWriter object is not as simple to work with. That's why I rarely use it, although I use DictReader quite a bit. For a full account of DictReader and DictWriter, please consult the Python Library Reference.

Date and Time

While text files are important, text isn't really the object of this book; it has to do more with numbers.
The most universal form of data you're bound to see when processing data files is date and time. Text files that record information based on date and time, especially recording the events that transpired, are commonly referred to as log files.

I recently performed a search for file names containing the word "log" in my C:\Windows directory. I came up with dozens of files that are indeed log files. I opened some of them in my favorite editor, and of those that did record date and time, the date and time information was in varying forms. One file had timestamps that looked like this:

    06/21/2008 21:57:20: Some text
    06/21/2008 21:57:25: Some text

Unfortunately, not every line started with a timestamp. Another file looked like this:

    2008-10-21 20:37:26 Some text
    2008-10-21 20:37:26 Some text

And yet another one looked like this:

    === Logging stopped: 6/26/2008 15:53:09 ===
    Some text [15:53:09:109]: Some text
    Some text [15:53:09:109]: Some text
    === Verbose logging stopped: 6/26/2008 15:53:09 ===

and this:

    [Thu Aug 12 22:36:49 2004] Some text
    [Sat Aug 14 12:54:01 2004] Some text
    [Sat Aug 14 13:25:21 2004] Some text
    [Sat Aug 14 14:41:48 2004] Some text

Of those log files, the ones I particularly like are those that don't save hard disk space and always dump a timestamp in every entry. The reason for this is that parsing the data later can be done without state machines, simplifying the code considerably. (For a discussion of state machines, refer to Text Processing in Python by David Mertz [Addison-Wesley, 2003], also available online at http://gnosis.cx/TPiP.)

Log files that write date and time information only sporadically are even harder to swallow. Clearly, the person writing those did not think of the parsing application; did they assume that a person, and not a computer, would actually read this? So here's a tip for you when you implement log files:

Tip: When writing log files, start a line of text with full date and time information.
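To see why the tip pays off, consider a log in which every line begins with a fixed-width timestamp, like the second sample above. Each line can then be parsed on its own, with no state carried over from earlier lines. The following is a minimal sketch in Python 3 (the book's listings use Python 2 syntax); the function name split_entry and its fmt and width parameters are hypothetical.

```python
import time

def split_entry(line, fmt='%Y-%m-%d %H:%M:%S', width=19):
    """Split a log line into (struct_time, message).

    Assumes every line starts with a timestamp exactly `width`
    characters long that matches `fmt`.
    """
    stamp = time.strptime(line[:width], fmt)
    return stamp, line[width:].strip()

stamp, message = split_entry('2008-10-21 20:37:26 Some text')
print(stamp.tm_year, stamp.tm_mon, stamp.tm_mday)   # 2008 10 21
print(message)                                      # Some text
```

A log such as the third sample, where only some lines carry the date, would instead require remembering the most recent date while reading, which is the kind of state the tip helps avoid.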
The time module, available as part of the Python Standard Library, allows easy handling of date and time information.

Time Module

The time module provides a set of helper functions and structures that facilitate handling of date and time in a simple manner. A point in time, that is, both date and time, can be represented in one of two ways in the module:

- A nine-element tuple: This tuple includes year, month, day, hours, minutes, seconds, weekday, Julian day, and DST (DST stands for daylight saving time). Of this tuple, the values weekday and Julian day are redundant: you can calculate them from the other values (year, month, and day). We'll refer to this tuple as the struct_time tuple.
- Number of seconds since the epoch: The epoch is a fixed time reference point and is system dependent. On my system, the epoch is Thu Jan 01 00:00:00 1970. At times people use the word "epoch" to mean the number of seconds elapsed since the epoch and not the fixed time reference; this is usually clear from context and should be easy enough to discern.

Some time functions accept the struct_time tuple, while others accept the number of seconds since the epoch. Switching between the two representations is not complex. With those two notations covered, let's explore the time module. But first, to use the module, be sure to issue import time.

The struct_time Tuple

It's possible to access a specific struct_time tuple element by indexing it; for example, if t is a struct_time tuple, t[0] is the year value. However, code that uses indices is quite hard to follow, and you'll constantly have to look up the documentation to figure out which index corresponds to the value you're looking for.
Instead, you can use the member variables tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, and tm_isdst to retrieve these values:

    >>> from time import localtime
    >>> localtime()
    (2008, 10, 26, 9, 37, 35, 6, 300, 0)
    >>> localtime().tm_year
    2008
    >>> localtime().tm_yday
    300

In this example, I've introduced the function localtime(), which returns the current time as a struct_time tuple.

Parsing and Formatting Date and Time

The functions strftime() and strptime() are the two functions you're most likely to use when dealing with log files. The function strptime(), which was introduced in Chapter 4, accepts the string to parse and a format template, and returns a struct_time representation of the time. The function strftime() does the opposite: it transforms a struct_time tuple into a string based on a supplied pattern. Both functions use a similar notation to indicate the values in the struct_time tuple, as listed in Table 5-2.

Table 5-2. Selected Identifiers for strftime() and strptime()

    Identifier  Description                                    Values Range
    %Y          Year with century as a decimal number.
    %m          Month as a decimal number.                     1–12
    %d          Day of the month as a decimal number.          1–31
    %H          Hour as a decimal number.                      0–23
    %M          Minutes as a decimal number.                   0–59
    %S          Seconds as a decimal number.                   00–61 (61 for leap seconds)
    %w          Weekday as a decimal number.                   0–6 (where 0 is Sunday)
    %j          Day of the year as a decimal number.           001–366 (366 for leap years)
    %z          Time zone field. DST doesn't have an identifier of its own and is part of this field.
    %a, %A      Locale's weekday name, abbreviated and full.
    %b, %B      Locale's month name, abbreviated and full.

The full table includes additional identifiers and is available online at http://docs.python.org/library/time.html.
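As a quick illustration of how the two functions mirror each other, the following sketch parses a timestamp with strptime() and formats it back with strftime(), using only numeric identifiers from Table 5-2 so the result does not depend on the locale. It is written for Python 3 (print is a function there, unlike in the book's listings), and the constant name ISO_FMT is hypothetical.

```python
import time

ISO_FMT = '%Y-%m-%dT%H:%M:%S'

# Parse a string into a struct_time, then format it back to a string.
parsed = time.strptime('2008-10-26T09:37:35', ISO_FMT)
formatted = time.strftime(ISO_FMT, parsed)

print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)   # 2008 10 26
print(parsed.tm_wday, parsed.tm_yday)                  # 6 300
print(time.strftime('%w %j', parsed))                  # 0 300
print(formatted == '2008-10-26T09:37:35')              # True
```

Note the two weekday conventions: the tm_wday member counts from Monday as 0, so this Sunday is 6, whereas the %w identifier counts from Sunday as 0.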
In the section "Example: Extracting Date and Time Information from File Contents" later in this chapter, we'll extract the date and time using the strptime() function from some of the samples I've provided at the beginning of the section.

Example: Logging Information with a Date and Timestamp

The purpose of this example is to create a log file in accordance with the tip presented previously in the "Date and Time" section and the ISO time format recommendation (see Chapter 4). We'll use the function localtime(), which returns a struct_time tuple containing the current time. We then format the current time using the strftime() function:

    >>> from time import localtime, strftime
    >>> strftime("%Y-%m-%dT%H-%M-%S", localtime())
    '2008-10-23T12-16-41'

(From now on, I'll assume you either imported the functions by name or imported the entire time module so I don't have to write that import statement every time.)

While this notation is handy for file names, in log files you might prefer something a little more self-explanatory. In such cases, you can use the asctime() function:

    >>> asctime(localtime())
    'Thu Oct 23 12:18:42 2008'

However, I'll use the following format:

    >>> strftime('%d-%b-%Y %H:%M:%S', gmtime())
    '23-Oct-2008 12:23:48'

This is to show formats other than the ISO format and because there's little room to misunderstand it. However, I recommend following the ISO time format whenever possible.

Tip: The ISO date and time format, in my mind, is the preferred method of writing date and time information. It might be a little more cryptic at first and might take some getting used to, but it's consistent and very easy to work with. For example, to sort dates in ISO time format, you can sort the actual strings without converting them to numerical values.

The output from asctime() adds, in my mind, an unnecessary value, redundant information if you will: the day of the week.
The format I've selected is problematic as well in case of a different locale (see the section "Locale" later in this chapter), but it's pretty hard to get it wrong; it's probably the most self-explanatory of the formats presented. But it's more a matter of personal taste.

So now that we have the date and time as a string, it's time to write it to a log file. Listing 5-15 presents an example script you can run that generates a log file that adheres to the guidelines I've given in this chapter and previous ones.

Listing 5-15. Creating Log Files

    from time import strftime, gmtime, localtime, sleep

    time_str = strftime("%Y-%m-%dT%H-%M-%S", gmtime())
    flog = open("../data/LogExample%s.txt" % time_str, 'wt')
    for i in range(5):
        sleep(1.7)
        logline = "%s|Some data %d\n" % \
            (strftime('%d-%b-%Y %H:%M:%S', localtime()), i)
        flog.write(logline)
    flog.close()

I've introduced another function from the time module: sleep(). The sleep() function accepts as an argument the amount of time in seconds it should sleep and returns once that time period has elapsed. I chose to use a fractional value so the log file might appear more "real," that is, not in fixed time increments:

    23-Oct-2008 12:59:25|Some data 0
    23-Oct-2008 12:59:27|Some data 1
    23-Oct-2008 12:59:29|Some data 2
    23-Oct-2008 12:59:30|Some data 3
    23-Oct-2008 12:59:32|Some data 4

Another benefit of the strftime() function is that it adds a leading zero for values that do not require the full length of the field; thus 1 a.m. will show as 01 in the hour field. This is extremely useful when parsing, as the time format has a fixed length and can be string sliced.

One of the problems associated with logging is that if your program crashes, you might lose important information. To overcome this, you can open and close the log file every time you log data (or once in a while) to protect data in case of a crash; your file is still updated. Another alternative is to use the logging module from the Python Standard Library, which I won't be covering here.
Refer to http://docs.python.org/library/logging.html for information about the logging module.

Example: Extracting Date and Time Information from File Contents

We've already seen an example of using strptime() in Chapter 4. Let's cover it in more detail here. This time, we'd like to parse some of the date and time formats presented in the beginning of the section "Date and Time."

For the purpose of this example, I'll assign a string to each different time format and parse every string. By now you should be able to write the wrapper functions to implement reading and writing from a file using either the regular file operations or the csv module. To show that the format was read properly, I'll print the asctime() version of the time.

    >>> log1 = '06/21/2008 21:57:20: Some text'
    >>> time1 = strptime(log1[:19], '%m/%d/%Y %H:%M:%S')
    >>> asctime(time1)
    'Sat Jun 21 21:57:20 2008'

If you look at the time information in the string log1, it appears that it follows a fixed length. That's especially evident because of the leading zero in the month field (06). Therefore, to extract just the date and time information, I've sliced the string at the beginning. From there, I used the strptime() function to do the rest of the work. However, had the day been a value less than or equal to 12 (it's currently 21), I wouldn't know in advance whether the format is %m/%d or %d/%m. Furthermore, in case the timestamp isn't a fixed-size string, I would need to resort to other methods such as splitting the string and working with substrings.

On to the next string.

    >>> log2 = '2008-10-21\t20:37:26 Some text'
    >>> print log2
    2008-10-21      20:37:26 Some text
    >>> time2 = strptime(log2[:19], '%Y-%m-%d\t%H:%M:%S')
    >>> asctime(time2)
    'Tue Oct 21 20:37:26 2008'

This example includes a string with the date and time separated by a tab and a slightly different format. But again, hardly a problem for strptime(). Parsing the date and time information of the remaining strings shouldn't be too complex.
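When the timestamp is not a fixed-size string, slicing no longer works and the line has to be split instead. The following is a minimal sketch of that approach, written for Python 3 (an assumption; the sessions above are Python 2). The helper name parse_log_line, its sep parameter, and the sample lines are hypothetical and only serve the illustration.

```python
from time import strptime, asctime

def parse_log_line(line, fmt="%m/%d/%Y %H:%M:%S", sep=": "):
    """Split a log line into (struct_time, message).

    Hypothetical helper: assumes the timestamp and the message are
    separated by the first occurrence of sep.
    """
    stamp, _, message = line.partition(sep)
    return strptime(stamp.strip(), fmt), message

# The first timestamp is not zero padded, so a fixed-width slice would fail.
t1, msg1 = parse_log_line("6/1/2008 9:57:20: Some text")
t2, msg2 = parse_log_line("06/21/2008 21:57:20: Some text")
print(asctime(t1), msg1)
print(asctime(t2), msg2)
```

The separator ": " works here only because the colons inside the time field are never followed by a space; a log with a different layout would need a different separator or a regular expression.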
The Epoch: "Linearizing" the Time Base

Up to this point we've been using the struct_time tuple exclusively. It's time to talk about the epoch representation. As I mentioned, seconds elapsed since the epoch is another time representation supported by the time module. At times, it's more beneficial to use an epoch representation than a struct_time representation.

The first reason that springs to my mind is that of visualization. If you want to plot data as a function of time, and your time base is in the form of a struct_time, it's pretty hard to do. You'd have to come up with ways to "linearize" the time base so that the time base won't be skewed.

You've seen two examples of linearizing the data. One was in Chapter 1, where I manually linearized the data by multiplying the hours value by 3600, adding the minutes value multiplied by 60, and then adding the seconds. The second example was given in Chapter 4, where I used mktime() and gmtime() to calculate the day of the year value as my linear time base. While these are good options, there's a more standardized way, and that is using the epoch notation.

As I mentioned previously, the epoch is system dependent and serves as a reference point in time against which time is measured by the total number of seconds elapsed (fractional values are allowed as well). To figure out the epoch on your system, issue asctime(gmtime(0)). From here on out, you already have a linear time base, the epoch representation!

Let's modify the GPS example from Chapter 1. As you recall, the time information was as follows: hhmmss.ss, where the second set of ss values, after the decimal point, represents fractions of a second. Let's consider the values '140055.00' and '140156.00', which are one minute and one second apart.
The original calculation used multiplications and additions:

    >>> vals = ['140055.00', '140156.00']
    >>> [float(x[0:2])*3600+float(x[2:4])*60+float(x[4:6]) for x in vals]
    [50455.0, 50516.0]
    >>> _[1]-_[0]
    61.0

The difference between the two values (and what we're really after, as you recall we set the start of the time base at zero) is 61 seconds, as should be expected. We can alternatively calculate the epoch representation using the function mktime() from the time module and use that to calculate the time difference:

    >>> [mktime((0, 0, 0, int(x[0:2]), int(x[2:4]), int(x[4:6]), 0, 0, 0)) \
    ... for x in vals]
    [943963255.0, 943963316.0]
    >>> _[1]-_[0]
    61.0

I've used filler values for other unknown fields such as day and month. (Module datetime of the Python Standard Library provides functionality that deals with time differences as well but is beyond the scope of this discussion.) We get the same result.

So what's the benefit of using the epoch-based functions from the time module? There are several:

• The preceding example was pretty simple and did not contain date information as well. Suppose your data recorder also records the date. What happens at midnight? You'll get a rollover if you don't take into account the date, which makes things considerably more complex. You can of course add the day into calculations. But what happens when a month changes? That's a bit more complex: months don't have the same number of days in them, so you'd need a lookup table. And what about leap years? You see, it gets complicated. Instead, use mktime(), which takes all of these issues into consideration.

• Using the epoch enables sharing a time base across files. This means that you can pull in information from all sorts of data sources and treat them as one.

• If you're dealing with timestamped binary data files, writing the time base can be challenging as well.
Instead of coming up with complex representations for the time base, use the epoch representation (see Chapter 10 for a general discussion of binary files and an example of time-based binary files).

Example: End-of-Day Report

The end-of-day report is a summary report of a log file presenting data at the end of the day. Here are two scenarios where this report is useful:

• Ending quote for a stock: Suppose you have a log file with the stock prices over a long period of time, say, a month. The end-of-day report prints the stock price at the end of trade days.

• Patient discharge: A patient is being treated at the hospital, and during his treatment all information is logged. You'd like to know the patient discharge state, that is, you want to receive an end-of-day report with the patient's status.

To illustrate the problem, we'll create a log file, ../data/SystemALogs.txt, that will be composed of timestamps and a log message, as follows:

    Sat Oct 25 08:15:01 2008, Message 1
    Sat Oct 25 09:30:31 2008, Message 2
    Sun Oct 26 09:19:32 2008, Message 3
    Mon Oct 27 08:05:31 2008, Message 4
    Wed Oct 29 07:17:40 2008, Message 5
    Wed Oct 29 08:44:04 2008, Message 6
    Wed Oct 29 09:44:05 2008, Message 7
    Wed Oct 29 12:30:50 2008, Message 8
    Thu Oct 30 11:22:03 2008, Message 9

In this case, the log file is sorted, which is to be expected. But even if that's not the case, the algorithm we employ should still work properly.

To print an end-of-day report, I'll use the following algorithm. A dictionary object stores an end-of-day report per day. The key of the dictionary should uniquely identify a day. I chose to use the tuple (year, day of the year) for that purpose. For the dictionary value, I'll be using a tuple containing the epoch and the message. The reason I'm using the epoch is that first, I can easily compute it using mktime(), and second, it's a simple value to check against. A larger epoch value means a more recent event.

We process the file line by line and extract the information to create the key and value to access the dictionary.
In case the key is already in the dictionary, we check whether the information in the current line is from a later time, and if it is, we update the value accordingly. If the key doesn't exist in the dictionary, we add the computed key and value pair to the dictionary. Once we're done processing the file, we print the dictionary object. As you can see, the algorithm doesn't rely on the log file being sorted to produce correct results.

I've chosen to use the csv module to show usage of the module in a full script, presented in Listing 5-16. In this specific case, a simple split, or string slicing, would've worked just as well.

Listing 5-16. End-of-Day Report Implementation

    # end-of-day report
    from time import mktime, strptime, ctime
    import csv

    d = {}
    for row in csv.reader(open('../data/SystemALogs.txt')):
        # t is a struct_time tuple
        t = strptime(row[0], '%a %b %d %H:%M:%S %Y')
        # calculate seconds since the epoch
        t_epoch = mktime(t)
        # construct a key and value
        key = (t.tm_year, t.tm_yday)
        val = (t_epoch, row[1])
        try:
            # do we have a more recent entry?
            if d[key][0] < t_epoch:
                d[key] = val
        except KeyError:
            # current date is not in dictionary
            d[key] = val

    for epoch, line in d.itervalues():
        print ctime(epoch), line

I've also introduced a new function from the time module in Listing 5-16, ctime(). The function ctime() accepts the number of seconds since the epoch and returns a date string representation. Here are the results from running the script on the preceding log file:

    Sat Oct 25 09:30:31 2008 Message 2
    Thu Oct 30 11:22:03 2008 Message 9
    Mon Oct 27 08:05:31 2008 Message 4
    Wed Oct 29 12:30:50 2008 Message 8
    Sun Oct 26 09:19:32 2008 Message 3

The results are correct, but they aren't sorted by date. This can be easily remedied if, instead of printing the entries directly, you store them in a list and then call the function sort() to sort the list.

Example: Combining Data from Several Sources Based on the Epoch

One of the benefits of using the epoch is that it is a standard time base to work with.
The intention of this example is to combine data from several sources, in this case two log files, and present a coherent report.

For the purpose of this example, we'll split the file SystemALogs.txt from the previous example into two files: SystemBLogs.txt and SystemCLogs.txt. Our script should combine them back into a sorted log file. Following are the contents of SystemBLogs.txt:

    Sun Oct 26 09:19:32 2008, Message 3
    Wed Oct 29 07:17:40 2008, Message 5
    Thu Oct 30 11:22:03 2008, Message 9

And the contents of SystemCLogs.txt:

    Sat Oct 25 08:15:01 2008, Message 1
    Sat Oct 25 09:30:31 2008, Message 2
    Mon Oct 27 08:05:31 2008, Message 4
    Wed Oct 29 08:44:04 2008, Message 6
    Wed Oct 29 09:44:05 2008, Message 7
    Wed Oct 29 12:30:50 2008, Message 8

Trying to sort these lists based on text will generate the wrong results. So the idea is to convert every timestamp string into an epoch representation and sort based on that value. For this purpose we'll create a list of rows. Each row will be composed of [t_epoch, line], where t_epoch is the converted time representation and line is the entire line of text. We then use the sorted() function to sort the list and dump the data back to file, removing the epoch information (see Listing 5-17).

Listing 5-17. Script to Combine Two Time-Based Log Files

    from time import mktime, strptime

    data = []
    data1 = open('../data/SystemBLogs.txt').readlines()
    data2 = open('../data/SystemCLogs.txt').readlines()
    fmt = '%b %d %H:%M:%S %Y'
    for line in data1+data2:
        # t is a struct_time tuple
        t = strptime(line[4:24], fmt)
        # calculate seconds since the epoch
        t_epoch = mktime(t)
        # append data
        data.append([t_epoch, line])

    data = [line[1] for line in sorted(data)]
    open('../data/SystemsBCLogs.txt', 'wt').writelines(data)

This script assumes you're combining two files. More often than not, if you need to combine log files, you'll need to combine more than two files. Refer to Chapter 10 for discussion on working with several input files.
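The same idea extends to any number of sources. The following is a minimal sketch under a few assumptions of mine: it targets Python 3 rather than the Python 2 used in Listing 5-17, the helper name merge_logs is hypothetical, and the log lines are held in in-memory lists (including a made-up third source, log_d) so the example runs without the data files.

```python
from time import mktime, strptime

FMT = "%a %b %d %H:%M:%S %Y"

def merge_logs(*sources):
    """Merge any number of iterables of log lines, sorted by epoch.

    Hypothetical helper: each line is assumed to look like
    'Sat Oct 25 08:15:01 2008, Message 1'.
    """
    rows = []
    for lines in sources:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            stamp = line.split(",", 1)[0]
            rows.append((mktime(strptime(stamp, FMT)), line))
    # Sort on the epoch only, so equal timestamps keep their input order.
    rows.sort(key=lambda row: row[0])
    return [line for _, line in rows]

log_b = ["Sun Oct 26 09:19:32 2008, Message 3",
         "Thu Oct 30 11:22:03 2008, Message 9"]
log_c = ["Sat Oct 25 08:15:01 2008, Message 1",
         "Mon Oct 27 08:05:31 2008, Message 4"]
log_d = ["Wed Oct 29 12:30:50 2008, Message 8"]

merged = merge_logs(log_b, log_c, log_d)
print("\n".join(merged))
```

Because open file objects are also iterables of lines, the same function could be called with files instead of lists. Splitting on the first comma rather than slicing at fixed positions is a design choice of this sketch; it keeps working if the timestamp width changes.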
Additional Time and Date Functions

We've covered most of the functionality available in the time module. For most of your log file processing needs and other time-based processing requirements, the module is comprehensive and complete.

There are additional time- and date-related modules available in Python. The datetime module provides functionality that includes operations on dates using a more object-oriented approach. The calendar module provides general calendar-related operations. Refer to the Python Standard Library for additional information.

Regular Expressions

Regular expressions are pattern-matching expressions used for searching and replacing text. At times, they are more flexible than the string operations presented previously. To use regular expressions, you'll have to import the regular expression module (named re), which is part of the standard library. The module re is similar to Perl's built-in support of regular expressions.

Your next step is to design a pattern to match against and decide what function to use on that pattern. The most notable functions are findall(pat, str), which finds all occurrences of a regular expression pattern pat in the string str; split(pat, str), which splits the string str whenever a regular expression pattern pat is encountered; and sub(pat, repl, str), which substitutes the occurrences of a pattern pat in a string str with a supplied substitute repl.

There are additional functions in the module, including match(), search(), and compile() to name a few, but I will not be covering those here: the preceding three functions should take care of most of our data processing needs.

Regular Expression Patterns

A regular expression pattern is basically a string. The pattern can contain both regular characters and special characters. A regular character matches itself.
So the pattern 'a' matches the character 'a' whenever encountered, and the pattern 'txt' matches the string 'txt' whenever encountered:

    >>> import re
    >>> re.split(r'a', 'Flatland')
    ['Fl', 'tl', 'nd']
    >>> re.split(r'txt', '97.txt')
    ['97.', '']

So far, re.split() is similar to the split() function introduced previously in the chapter. However, the strength of regular expressions lies in the special characters. These characters provide additional functionality to the pattern itself. Let's take a look at some. The first one is the dot character ('.'). The dot character matches any single character except for a new line:

    >>> re.findall(r'a.', 'Flatland')
    ['at', 'an']

The character '+' means one or more occurrences of the pattern, the character '?' means zero or one occurrences of the pattern, and the character '*' means zero or more repetitions of the pattern. Note that these are modifiers of the pattern, that is, they change the behavior of the pattern, whereas the dot symbol is not a modifier of the pattern; it's part of the pattern. Here are some examples:

    >>> re.findall(r'.?a', 'Flatland')
    ['la', 'la']
    >>> re.findall(r'a.*', 'Flatland')
    ['atland']

The first example finds the pattern composed of zero or one characters, followed by the character 'a'. The second example matches the character 'a' followed by any number of characters.

One might question why the first pattern matched one character before the character 'a', since obviously zero characters would've worked as well. Or why did the second pattern match the string 'atland' where 'a' would've worked just as well? The answer is that regular expressions are greedy by default, that is, they try to match as many characters as possible. You can turn off the greedy behavior by adding a '?' character after the modifiers presented previously, that is, '??', '+?', and '*?'.
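To see the difference between greedy and nongreedy modifiers side by side, here is a short sketch on the same 'Flatland' string. It is written for Python 3 (an assumption; the sessions above are Python 2), and the variable names are mine.

```python
import re

text = "Flatland"

# Greedy: '.*' grabs as much as it can, up to the last 'a'.
greedy = re.findall(r"F.*a", text)

# Nongreedy: '.*?' stops at the first 'a' it can reach.
lazy = re.findall(r"F.*?a", text)

# '+?' needs at least one character, but takes as few as possible.
lazy_plus = re.findall(r"a.+?", text)

# '??' prefers zero characters, so only the 'a' itself is matched.
lazy_optional = re.findall(r"a.??", text)

print(greedy, lazy, lazy_plus, lazy_optional)
```

The greedy pattern returns the longest candidate, 'Flatla', while each nongreedy variant returns the shortest match that still satisfies the pattern.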
The next special characters are the '^' and '$' characters, which match the start and end of a string, respectively. Here are some examples that demonstrate these special characters:

    >>> re.findall(r'^.*a', 'Flatland')
    ['Flatla']
    >>> re.findall(r'^.*?a', 'Flatland')
    ['Fla']
    >>> re.findall(r'a.*$', 'Flatland')
    ['atland']

These searches can be expressed in English as follows. The first line matches as many characters as possible between the start of the string and the character 'a'. The second line finds as few characters as possible between the start of the string and the character 'a' (notice the nongreediness modifier, '*?'). The third line matches the character 'a' and the remaining characters until the end of the line.

I think we're ready for an example now.

Example: Removing Extra Spaces with Regular Expressions

Previously in this chapter, I've shown how to remove extra spaces in a string. We've used split() to split the words on spaces and then strip() to remove the excess spaces. We then used join() to combine the list back into a string. With regular expressions, the same can be achieved more easily:

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> re.sub(r' +', ' ', grocery_list)
    'Milk 2\nEggs 12'

I've used the function sub() with the pattern ' +' (a space followed by the plus sign) to replace one or more spaces with one space.

Special Sequences

Special sequences are used to match some interesting combinations. Here's a short list: '\d' matches a decimal digit, '\s' matches a whitespace, and '\w' matches an alphanumeric character. If you use uppercase, the opposite is achieved, that is, '\D' matches anything but a decimal digit, '\S' matches anything but a whitespace, and '\W' matches anything but an alphanumeric character.
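The following sketch applies each special sequence and its uppercase opposite to one short string. It assumes Python 3 (the sessions in this chapter are Python 2), and the sample string and variable names are made up for illustration.

```python
import re

line = "Eggs  12"

digits = re.findall(r"\d", line)       # each decimal digit
spaces = re.findall(r"\s", line)       # each whitespace character
word_chars = re.findall(r"\w", line)   # letters, digits, and underscore

non_digits = re.findall(r"\D", line)   # everything that is not a digit
non_spaces = re.findall(r"\S", line)   # everything that is not whitespace
non_word = re.findall(r"\W", line)     # everything '\w' rejects

print(digits, spaces, non_word)
```

Each uppercase sequence returns exactly the characters its lowercase counterpart skips, so the two result lists together always cover the whole string.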
Since the '\' character is also the escape character in ordinary Python strings, it's a good idea to use the raw string format, r'', as you've seen previously, so as not to escape characters on several levels (i.e., on the string level and on the regular expression level).

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> re.findall(r'\d+', grocery_list)
    ['2', '12']

Alternatives

The special character '|' is used to match either pat1 or pat2 in the regular expression r'pat1|pat2':

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> re.sub(r'Milk|Eggs', 'Chocolate', grocery_list)
    'Chocolate   2\nChocolate  12'

Ranges

You can also match a range of values using brackets. The pattern '[12]' will match both the character '1' and the character '2':

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> re.sub('[12]', '0', grocery_list)
    'Milk   0\nEggs  00'

You can also denote a range of characters using the '-' character. The pattern '[1-5]' matches any character from 1 to 5, inclusive. Lastly, follow up with a '^' after the left bracket to negate the range:

    >>> grocery_list = "Milk   2\nEggs  12"
    >>> re.sub('[^0-5]', '*', grocery_list)
    '*******2*******12'

When to Use Regular Expressions

It's hard to decide whether to use a regular expression or just plain string operations. Regular expressions are hard to master, but practice makes perfect, as the saying goes. Try solving the same problem with string functions and with regular expressions to get a feel for which is the better approach.

If using string operations makes things more complicated to follow, resort to regular expressions. At times, a simple regular expression makes the code more readable and elegant. Such a case was presented previously in the example of removing extra spaces. At other times, it's the other way around. Opt for simplicity and clarity of your code whenever possible.

That being said, there is a special case where I've found that using regular expressions is far better than using string methods, and that is when I'd like to split or replace a string based on several options.
The main reason is that the string method split() requires a separator and does not accept several options, whereas with regular expressions you can provide a range of separators.

Example: Words Used Only Once

We finalize this discussion of regular expressions with an example that uses a dictionary in conjunction with text. The idea is to find words used only once in a file. The motivation behind this example is that words used only once might be typographical errors in source code.

To implement the solution, we'll use a dictionary to count the number of occurrences of each and every word in a file (see Listing 5-18).

Listing 5-18. Finding Words Used Only Once

    from string import punctuation, whitespace
    import re

    def nonce(filename):
        """Returns words used only once in a file."""
        data = open(filename, 'rt').read()
        d, result = dict(), []
        # re.escape() keeps characters such as ']' and '\' literal
        # inside the range specifier
        separators = '[' + re.escape(punctuation + whitespace) + ']'
        for word in re.split(separators, data):
            d[word.lower()] = d.get(word.lower(), 0) + 1
        for word, occur in d.iteritems():
            if occur == 1:
                result.append(word)
        return result

The function nonce() should prove quite readable. The heart of the script lies in the first for loop, which splits the text using a regular expression. This is a prime example where a regular expression is better than the string method split(): the split happens on either a punctuation character or a whitespace character by means of the regular expression range specifier, '[]'.

Bear in mind that this script is suited mostly to source code and not plain-text English. The reason is that it doesn't take into consideration such things as plural forms (e.g., "girls" and "girl" are considered two different words) and other spoken language characteristics.

Internationalization and Localization

At times you're faced with working with data files that originated in a different locale. This could pose some problems: date and time notations can be different from what your code expects, or the text characters can be of another language.
The purpose of this section is to introduce the topic of internationalization (i18n) and localization (l10n). I'll touch on two topics: the locale and its impact on date notations, and Unicode, which is a convenient method to support different languages, at least when it comes to text files.

Note: The abbreviation i18n comes from the number of characters between the "i" and the "n" in the string "internationalization." Similarly, the abbreviation l10n refers to "localization."

Locale

In the context of software, locale is a set of rules governing the behavior of some functions that are either country or language oriented. From a data analysis perspective, if a log file containing a timestamp of both date and time is used, and the locale is not identical to the one in use, the function strptime() might fail. For example, some countries use day/month/year notation while others use the month/day/year notation.

To accommodate different locales, Python provides support via the locale module, which is part of the Python Standard Library. To use it, issue the command import locale. To set a locale, issue the command locale.setlocale(category, locale). You can either enable the entire set of rules using the category locale.LC_ALL or set specific ones. In our case, we'll be using LC_ALL, which also controls the behavior of the functions strftime() and strptime().

Note: The locale module relies on OS locale support. Different operating systems might have different locale abbreviations. For example, to run the following script on Linux, I had to use the locale 'fr_FR.ISO-8859-1', while on Windows I've used 'fr'. On some Linux distributions, a list of locale aliases can be found in the file /usr/share/locale/locale.alias.
Unfortunately, I was unable to run the locale module properly on Cygwin; some poking around suggests that Cygwin does not currently support locales other than the basic one (named C, which is to imply C language implementation).

Generally speaking, you should set your locale upon program entry.

    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, 'fr')
    'French_France.1252'
    # if the above doesn't work, try replacing 'fr' with 'fr_FR.ISO-8859-1' or 'french'
    >>> from time import localtime, strftime, strptime
    >>> fmt = '%d-%B-%Y'
    >>> strftime(fmt, localtime())
    '26-octobre-2008'
    >>> strptime(_, fmt)
    (2008, 10, 26, 0, 0, 0, 6, 300, -1)

The first line imports the locale module. I then set the locale to 'fr'; the value returned, 'French_France.1252', basically means the language French, the country France. It's possible to mix language and country, for example, passing the string 'French_Canada' to set the language to French but the country to Canada. In Windows, I've found that the notation 'Language_Country' works well, whereas I've had some issues with abbreviations (e.g., 'fr_CA' didn't work well for me). In Linux, I've found that the notation in the form 'en_US.ISO-8859-1' works well, so for French Canada, on Linux, I've set the locale to 'fr_CA.ISO-8859-1'.

For additional information regarding the locale module, refer to the Python Standard Library: http://docs.python.org/library/locale.html.

We're not done with locale yet; I'll present another example after talking a bit about Unicode.

Unicode Strings

The original ASCII table is based on the English language and does not account for a lot of other languages and symbols. Several designs were introduced to try to resolve this, and lately it appears that Unicode (http://www.unicode.org) is the industry standard. The Unicode standard addresses such topics as character encoding, character properties, visual representation, and more.
Put very simply, Unicode tries to support characters and symbols from other languages by assigning every character (and there are tens of thousands of those, if not more) a unique integer number while maintaining compatibility with the ASCII table. This means that some Unicode characters are represented by 4 bytes, not 1 byte.

The problem starts when you want to write your Unicode string to file. Writing 4 bytes instead of 1 every time is space consuming. If the characters you use are simple English characters, they are all well within the ASCII table and thus fit in 7 bits. To write them, you don't need 4 bytes; 1 byte will suffice. In this case, you can choose to encode your Unicode string using 8-bit units, an encoding known as UTF-8. UTF stands for Unicode Transformation Format.

If the characters you use are not all from the English alphabet, you might need more bytes to represent your Unicode string. In most cases, 2 bytes is more than enough, which means you can encode your Unicode string using UTF-16 encoding. However, you're also likely to use characters from the English alphabet (or other ASCII symbols), in which case some characters may be encoded with 8 bits. Some encodings support such variable-size schemes as well; UTF-8, for example, uses a single byte for ASCII characters and several bytes for other characters.

From our perspective, it is sufficient to know that Python natively supports Unicode strings. And furthermore, we can encode and decode Unicode strings using a host of encoding schemes, including UTF-8 and UTF-16. Unicode strings follow the notation u''.

Working with Unicode

Unicode strings behave similarly to regular strings. As mentioned previously, if a character in the Unicode string matches the ASCII value, that value is used. So to construct a Unicode string made of ASCII characters, simply call the unicode() function with the string:

    >>> unicode('a string')
    u'a string'

However, if that is the case, and you are using ASCII characters, what's the point of Unicode?
Fair enough, let's add nonstandard characters, that is, characters with ordinal value above 127.

The value 0xa9 corresponds to © and is pretty hard to type on most keyboards. So instead, I've used its ordinal value in Unicode. Python retains the value as an ordinal value and does not print the symbol associated with it. If we dump the Unicode string to file, it will be possible to view the special characters in most editors or web browsers. In this case, the generated file is ../data/special.txt.

    >>> u = 'a string' + u'\u00a9'
    >>> u
    u'a string\xa9'
    >>> open('../data/special.txt', 'wb').write(u.encode('utf-8'))

The write() method accepts only strings, not Unicode strings. So for us to be able to use the write() function and write the Unicode strings to file, we'll have to use the encode() method, which accepts the encoding to be used. In this case I've selected the UTF-8 encoding, which is widely popular. The function decode() complements encode() and returns the decoded Unicode string:

    >>> u.encode('utf-8')
    'a string\xc2\xa9'
    >>> u.encode('utf-16')
    '\xff\xfea\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00\xa9\x00'
    >>> u.encode('utf-16').decode('utf-16')
    u'a string\xa9'

For example, to read an entire text file encoded in UTF-16, issue the following: open('../data/somefile.txt').read().decode('utf-16').

Example: The Hebrew Alphabet

The purpose of this example is to generate a file with the Hebrew alphabet. If you don't have Hebrew installed on your system (which I suppose is the case for most folks out there), you can try other, more popular characters, as I'll show shortly.

Note: Alphabet is a Hebrew word. It is composed of the first two letters of the Hebrew language: Aleph and Bet.

The Hebrew alphabet, shown in Figure 5-1, starts with the letter Aleph, mapped to the value 0x5D0, and ends at the letter Tav, mapped to the value 0x5EA in Unicode.
Unless you have the Hebrew keyboard installed on your system, manually typing Hebrew letters is not a trivial task. Therefore, you’ll have to construct the alphabet using a Unicode string. Since we don’t want to (or can’t) manually type the values, we’ll construct a list of Unicode letters and then generate a string from the list. The function unichr() will be used to construct the characters from their ordinal value, similar to the function chr():

>>> letters = [unichr(letter) for letter in range(0x5d0, 0x5eb)]
>>> alephbet = ''.join(letters)
>>> open('../data/alephbet.txt', 'w').write(alephbet.encode('utf-16'))

Figure 5-1. The Hebrew alphabet

Once we have the Unicode string, all that’s required is to write it to file. Since the write() function only accepts strings, and not Unicode strings, we have to encode the string. In our case, the Hebrew Unicode values require 16 bits, so we encode with UTF-16.

The Latin alphabet has special characters, as shown in Figure 5-2: the accented letters, starting at value 0xC0 and ending at 0xFF. Therefore, you could modify the preceding script to generate the Latin special characters as follows:

>>> letters = [unichr(letter) for letter in range(0xc0, 0x100)]
>>> latin = ''.join(letters)
>>> open('../data/latin.txt', 'w').write(latin.encode('utf-16'))

Figure 5-2. Some interesting Latin characters

Example: Writing Today’s Date in the Current Locale

The purpose of this example is to print the current date and time, in a specific locale, to file. If you’re using the English_United States locale, this will be rather boring. So instead, I’ve decided to use the Hebrew locale again, since we’re all familiar with it by now (see Listing 5-19). Since Python doesn’t always print other character sets in the interpreter, it’s best to write the results to file and view them in a text viewer, editor, or web browser.

Listing 5-19. Today’s Date in the Current Locale

import locale
from time import strftime, strptime, localtime
from sys import platform

if platform == 'linux2':
    locale.setlocale(locale.LC_ALL, 'hebrew')
elif platform == 'win32':
    locale.setlocale(locale.LC_ALL, 'Hebrew_Israel')
elif platform == 'cygwin':
    raise Exception('Cygwin not supported')
else:
    print "Untested platform:", platform

today = strftime('%B %d, %Y', localtime())
todayU = unicode(today, 'cp1255')
open('../data/today.txt', 'w').write(todayU.encode('utf-16'))

First, I try to guess the locale name for Hebrew based on the current platform using the sys.platform value. As I’ve mentioned earlier, Linux, Windows, and Cygwin all have different locale abbreviations. After I set the locale, I can query the preferred locale encoding with the function locale.getpreferredencoding(), which is quite useful in determining how to encode the Unicode string. Unfortunately, I have found that the preferred encoding in the case of Hebrew should be cp1255 and not the one returned by the getpreferredencoding() function. Lastly, I encode the string and write it to file using UTF-16. Figure 5-3 shows the results.

Figure 5-3. A date in Hebrew

More on Unicode

The topic of Unicode is vast, and numerous books are available that discuss it. As for online information, I have found the Python library reference a valuable resource. If you’re looking for information regarding i18n and l10n in general, including code pages and locale information, the following might prove useful:

• The Unicode Consortium: http://www.unicode.org.
• Wikipedia: http://en.wikipedia.org/wiki/Locale. Be sure to follow the links to topics such as character encoding and Unicode.
• International Components for Unicode: http://www.icu-project.org/.

Final Notes and References

String and text processing is a very large field, especially with the popularity of the Internet and search engines; most of the data available online is in some form of text.
This chapter has covered a considerable number of the topics associated with text processing in the context of data analysis. However, there’s a lot more to learn. Two topics presented here were but briefly discussed: regular expressions and i18n and l10n. Considerable documentation is available on the Internet on these two topics, so by all means refer to online resources. I hope that I’ve covered the basics properly to allow you to proceed without much trouble.

The following book is of great value for the topics discussed in the chapter:

• Text Processing in Python by David Mertz (Addison-Wesley, 2003; also available online at http://gnosis.cx/TPiP)

CHAPTER 6

Graphs and Plots

Visualizing Data

Graphs and plots are efficient methods to present data. Done properly, a graph can convey an idea better than an entire article.

What a graph should portray is a function of your target audience, so when you plot a graph, bear that in mind. If your target audience is technical people, they might require additional technical information. If your target audience is investors, another approach is required. In this chapter we won’t be discussing what to present and what not; instead, assuming you know what you want to present, I’ll show you how to do so. Examples include how to plot bar charts and pie charts, how to add markers and control line ticks, how to annotate the graphs with text and arrows, and more.

Regardless of your target audience, some ideas and methodologies always hold true. Sources for data should be accurate and verified. Graphs should be easily reproducible by running the code that generated them. (How many times did your boss ask you to modify the report? If the graph was generated with a documented script, doing so should prove easy enough.) And lastly, your graphs should be aesthetically pleasing. Consulting with colleagues could be beneficial as well: What do they understand from the graph? Was the key idea captured? Is the output professional?
In this chapter we’ll discuss the basics of creating and annotating graphs. We’ll start by exploring the plot() function, continue with text and grid annotation, explore some other types of graphs, and lastly introduce patches, a method to attach graphical objects to a figure.

The Matplotlib Package

The matplotlib package, available at http://matplotlib.sourceforge.net/, is the main graphing and plotting tool used throughout this book. The package is versatile and highly configurable, supporting several graphing interfaces. Matplotlib, together with NumPy and SciPy (see Chapters 7 and 8), provides MATLAB-like graphing capabilities, with perhaps the limitation of 3-D plots, which matplotlib does not support.

The benefits of using matplotlib in the context of data analysis and visualization are as follows:

• Plotting data is simple and intuitive.
• Performance is great; output is professional.
• Integration with NumPy and SciPy (used for signal processing and numerical analysis) is seamless.
• The package is highly customizable and configurable, catering to most people’s needs.

The package is quite extensive and allows, for example, embedding plots in a graphical user interface. Currently, the package supports several graphical interfaces including wxPython (http://www.wxpython.org/) and PyGTK (http://www.pygtk.org/), to name a few. However, GUI topics are beyond the scope of the book. We will focus on plotting graphs rather than discussing the GUI engine itself. For a full account of matplotlib, try the online documentation, available as a PDF document, at http://matplotlib.sourceforge.net/Matplotlib.pdf.

Going forward, you should ensure that you have matplotlib installed and working properly. Refer to Chapter 2 if you require additional information on installing the package or visit the package’s web site.

Interactive Graphs vs. Image Files

There are several ways you can use matplotlib:

• Create dynamic content to be served on a web server, for example, generating stock price images on the fly or displaying traffic information on top of a map.
• Embed it in a graphical user interface, allowing users to interact with an application to visualize data.
• Automatically process data and generate output in a variety of file formats, including JPG, PNG (Portable Network Graphics, see http://www.libpng.org/pub/png/), PDF, and PostScript (PS). This option is best suited for batch processing of a large number of files.
• Run it interactively, with the Python shell in X or Windows. This option is good during the development phases of the code.

Of the preceding options, and in view of the book topics, we’ll explore two: 1) generating plots of varying file formats and 2) using matplotlib interactively. I typically use both options depending on the stage of my code. In the early stages of development, I work interactively with a small sample of the data: plot, zoom in, change graph parameters, annotate, rinse and repeat. Once my code is ready, I let it loose, so to speak, on the full set of data files. Since that might mean tens if not hundreds of graph windows, I prefer to write them as files instead, and then use an image viewer to view the results one at a time.

So let’s start. First and foremost, import PyLab as follows:

>>> from pylab import *

WHERE DOES THIS FUNCTION COME FROM?

A frustration to some, especially those experienced with Python, is that issuing the command from pylab import * will import several packages. Some feel they’d like to know whether a specific function is a part of NumPy or matplotlib. The solution is simple: use help!
For example, here’s the output from help(diff):

>>> help(diff)
Help on function diff in module numpy.lib.function_base:

diff(a, n=1, axis=-1)
    Calculate the nth order discrete difference along given axis.

As you can see in the first line, diff() is a function from the NumPy package, or more specifically, numpy.lib.function_base.

As previously mentioned, this imports matplotlib, NumPy, and SciPy. Although generally speaking you shouldn’t import everything quite so liberally, in the case of PyLab, make an exception: it’s considerably easier to work with the entire package loaded into memory, and unless memory is a constraint, the usability is great. Going forward, I’ll assume you’ve imported PyLab as just described.

Our next step is to plot a graph. We’ll plot the list [0, 1, 2]:

>>> plot(range(3))
[<matplotlib.lines.Line2D object at 0x018680F0>]

There’s no visible output yet (other than matplotlib’s response), and the reason for this is we haven’t specified how exactly we want the graph drawn: interactive figures or hard copy files.

Interactive Graphs

Interactive graphs, like the one shown in Figure 6-1, plot the graph in a separate window in X or Windows. If you’d like this option, enter show() at the Python shell or call the function show() in a script.

Figure 6-1. Interactive graph

The function show() opens up an interactive window. Several notes about this window:

• The window is numbered, as you can see by the label “Figure 1.” This is useful if you have several windows and would like subsequent plots to either override or appear on a specific figure. To switch between figures, use the figure(n) function, where n stands for the figure index. If you’d like a new figure, and don’t particularly care about the figure index, issue the command figure(), which will create an empty figure with the next available index.
• The x-axis and y-axis were created automatically to fit the data. In a lot of cases, matplotlib does an excellent job of automatically selecting the right axis (as in this example). However, if you want a different range of values to be displayed, that’s doable with the axis() command, more on which appears later in the chapter in the section “Axis.”
• The location of the mouse is printed on the right corner of the figure. This is very useful if you’re trying to zoom in on data and find a specific data point. This functionality is not available when you plot graphs to file (i.e., noninteractive mode).
• You have several buttons on the lower-left side of the figure to allow interaction with the graph. The five leftmost buttons are used for zooming and zooming history. The first button from the left (with the house icon) is used to change axes to the original plot axes. The left and right arrow buttons cycle backward and forward through previous axes changes. The fourth button allows changing of the axes origin, and the fifth button from the left enables zooming. The sixth button from the left controls the margins of the plot with respect to the containing window, and lastly the seventh button allows saving the image to disk.

Note: If you’re not using matplotlib interactively in Python, be sure to call the function show() after all graphs have been generated, as it enters a user interface main loop that will stop execution of the rest of your code. The reason behind this behavior is that matplotlib is designed to be embedded in a GUI as well. In Windows, if you’re working from interactive Python, you need only issue show() once; close the figure (or figures) to return to the shell. Subsequent plots will be drawn automatically without issuing show(), and you’ll be able to plot graphs interactively.

Saving Graphs to Files

The function savefig() enables writing images of varying formats to a file. Out of the box, matplotlib supports several file formats including PDF, PNG, and PS.
The simplest way to generate a file containing a graph is to issue savefig(filename), where filename has the extension associated with your selected image format:

>>> figure()
>>> plot(arange(3))
[<matplotlib.lines.Line2D object at 0x0186A2D0>]
>>> savefig('line.png')
>>> import os
>>> [file for file in os.listdir('.') if file.endswith('png')]
['line.png']

You should be able to view the figure in most image viewers or your web browser.

Note: Matplotlib returns objects as they’re created. In the preceding example, the returned object is noted by the line [<matplotlib.lines.Line2D object at 0x0186A2D0>]. Going forward I’ll omit these responses in the interest of making the interactive code easier to follow.

I called the function savefig() with the file name extension as part of the string holding the file name, instructing savefig() to create a PNG file. Similarly, I could’ve created a PostScript file by issuing savefig('line.ps'). The dictionary object FigureCanvasBase.filetypes holds a list of supported file types in your system:

>>> from pprint import pprint
>>> pprint(FigureCanvasBase.filetypes)
{'emf': 'Enhanced Metafile',
 'eps': 'Encapsulated Postscript',
 'pdf': 'Portable Document Format',
 'png': 'Portable Network Graphics',
 'ps': 'Postscript',
 'raw': 'Raw RGBA bitmap',
 'rgba': 'Raw RGBA bitmap',
 'svg': 'Scalable Vector Graphics',
 'svgz': 'Scalable Vector Graphics'}

FINDING WHAT YOU’RE LOOKING FOR IN COMPLEX MODULES

So how did I figure out FigureCanvasBase.filetypes holds the supported file types? Did I read the entire manual? Hardly. Some of the packages we work with are really large, and mastering all the intricacies of variables and objects that control their behavior is not a trivial task. So I use some quick-and-dirty tricks, and although they might not be the “proper” way to do things, they help me get the job done, and that’s what really counts. So let me show you what I’ve done to figure out the available file types.
To figure out the base file types, I issued savefig() with a bogus file format: savefig('a.ext'). The result was, of course, an exception (ValueError) but I also received some useful information:

Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz.

But that’s not exactly what I wanted; I wanted an enumeration of the file formats so I can index them rather than parse a string returned by an exception. So I traced back the source of the error from the exception output: the error originated in file C:\Python25\lib\site-packages\matplotlib\backend_bases.py line 1290.

Next, I opened up the file backend_bases.py, jumped to line 1290, and started reading the code. Python is a very readable language, and it didn’t take me long to figure out that the formats are stored in variable self.filetypes.keys. Since self points to the container object, I scrolled up some more and found that the calling method is named print_figure and is part of the class FigureCanvasBase, hence FigureCanvasBase.filetypes.

Reading the exceptions generated by matplotlib, along with exploring modules and their names, also helped me find the objects matplotlib.colors.cnames and matplotlib.colors.ColorConverter.colors, both listing possible colors (see the section “Colors” later in the chapter).

That being said, reading the manual is also a very viable option.

You can pass a format argument to savefig() to control the output file generated instead of specifying it in the file name string. You can also control other parameters such as dots per inch (dpi) and face color for the color of the figure. A more general form of savefig() is savefig(fname[, param1=value1][, param2=value2]). Table 6-1 lists some parameters. In the examples, assume fn is a string containing a file name.

Table 6-1. savefig() Parameters

Parameter    Description                   Default Value                                Example
dpi          Resolution in dots per inch   None (On my system the actual dpi is 100.)   savefig(fn, dpi=150)
facecolor*   The figure’s face color       'w' for white background                     savefig(fn, facecolor='b')
format       File format                   'png'                                        savefig('image', format='pdf')

* Refer to Table 6-4 for a list of color values.

You can combine several parameters: savefig('file', dpi=150, format='png'). The function savefig() supports additional options; see help(savefig) for a full account.

Plotting Graphs

This section details the building blocks of plotting graphs: the plot() function and how to control it to generate the output we require. We’ve used the plot() command extensively throughout the book. It’s now time to examine it more closely.

The plot() function is highly customizable, accommodating various options, including plotting lines and/or markers, line widths, marker types and sizes, colors, and legend to associate with each plot. The functionality of plot() is similar to that of MATLAB (http://www.mathworks.com) and GNU-Octave (http://www.gnu.org/software/octave/) with some minor differences, mostly due to the fact that Python has a different syntax from MATLAB and GNU-Octave.

Lines and Markers

First, we’ll create a vector to plot using NumPy (see Chapter 7 for a full account of the NumPy package):

>>> figure()
>>> y = array([1, 2, -1, 1])
>>> plot(y)
>>> show()

If you don’t have a GUI installed with matplotlib, replace show() with savefig('filename') and open the generated image file in an image viewer.

Note: Going forward, I’ll omit the show() call from listings. Be sure to issue show() or savefig() if you’d like to follow along.

I’ve passed the vector y as an input to plot(). As a result, plot() drew a graph of the vector y using auto-incrementing integers for an x-axis. Which is to say that if you don’t supply x-axis values, plot() will automatically generate them for you: plot(y) is equivalent to plot(range(len(y)), y).
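The equivalence between plot(y) and plot(range(len(y)), y) can be checked directly by asking the returned line object for its x data. The sketch below is one way to do that without a GUI. It assumes a current matplotlib release, uses the explicit matplotlib.pyplot interface with the off-screen Agg backend rather than from pylab import *, and writes the PNG to an in-memory buffer instead of a named file; those choices are mine, not part of the listings above.

```python
import io

import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

y = np.array([1, 2, -1, 1])

fig, ax = plt.subplots()

(line,) = ax.plot(y)                     # no x values supplied
implicit_x = list(line.get_xdata())

(line2,) = ax.plot(range(len(y)), y)     # the explicit equivalent
explicit_x = list(line2.get_xdata())

print(implicit_x, explicit_x)

# Save with explicit format and dpi arguments, as in Table 6-1,
# but into a buffer so the sketch leaves no file behind.
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=150)
png_bytes = buffer.getvalue()
plt.close(fig)
```

Both lines end up with the same x positions, one per element of y, which is why the two curves drawn here lie exactly on top of each other.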
So let’s supply x-axis values (denoted by variable t):

>>> figure()
>>> t = array([10, 11, 12, 13])
>>> plot(t, y)

The call to function figure() generates a new figure to plot on so we don’t overwrite the previous figure.

On to more options. Next we want to plot y as a function of t but display only markers, not lines. This is easily done:

>>> figure()
>>> plot(t, y, 'o')

To select a different marker, replace the character 'o' with any of the markers in Table 6-2. For a full account of available markers, issue help(plot).

Table 6-2. Some Plot Markers

Character   Marker Symbol
'o'         Circle
'^'         Upward-pointing triangle
's'         Square
'+'         Plus
'x'         Cross (multiplication)
'D'         Diamond

Much like there are different markers, there are also different line styles, a few of which are listed in Table 6-3.

Table 6-3. Some Plot Line Styles

Character(s)   Line Style
'-'            Solid line
'--'           Dashed line
'-.'           Dash-dot line
':'            Dotted line

If you’d like both markers and lines, concatenate the symbols for line styles and markers. To plot a dash-dot line and diamond symbols as markers, issue the following:

>>> plot(t, y, 'D-.')

Figure 6-2 shows the output of the examples in this section.

Figure 6-2. Output of previous examples. Order is from left to right, top to bottom.

Plotting Several Graphs on One Figure

We use graphs to visualize data and compare it. What’s more natural than displaying several graphs in one plot so we can compare results? There are two ways to do that in matplotlib. The first one is by adding more vectors to the plot() function:

>>> plot(t, y, t, 2*y)

or

>>> plot(t, y, '+', t, 2*y, 's-')

The second method is by calling plot() repeatedly. Sometimes you might have only partial data to plot. Say you have vector y, but then you modify it and want to plot both the original vector and the newly modified vector. What do you do? One option would be to store the intermediate value, but what if you have 20 of those? That means calling plot() with some 20+ arguments.
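Calling plot() once per intermediate vector avoids the long argument list. Here is a minimal sketch, assuming a current matplotlib release in which repeated plot() calls on the same axes always add lines (as far as I know, the separate hold switch is no longer available in such releases). The loop and the way y is modified are hypothetical, chosen only to produce 20 different curves.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

t = np.array([10, 11, 12, 13])
y = np.array([1, 2, -1, 1])

fig, ax = plt.subplots()
for step in range(20):
    y_modified = y + step       # hypothetical modification of the vector
    ax.plot(t, y_modified)      # each call adds one more line to the same axes

line_count = len(ax.lines)
print(line_count)
plt.close(fig)
```

Nothing needs to be stored between iterations: each intermediate vector is drawn as soon as it exists, and the axes object keeps track of all the lines.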
When you call plot() with an already existing figure, there are two possible outcomes. One is that the figure is erased, and the new plot is drawn. The other is that the figure is not erased, and the new plot is added to the figure. This behavior is determined by the hold status of the figure. You can control the hold status with the hold() function: calling hold(True) will ensure new plots don’t erase the figure, whereas hold(False) will do the opposite. Issuing the command hold() with no arguments will toggle the hold status. As a general rule, it’s best to specify the hold behavior you require, that is, call hold(True) or hold(False) explicitly, and not rely on the default behavior.

Line Widths and Marker Sizes

Next in this discussion of customization is controlling line widths and marker sizes. This is done by passing linewidth (or lw for short) and markersize (or ms for short) arguments to plot(), as shown in Listing 6-1. Both arguments accept a floating-point value; the default value is 1.

Listing 6-1. Plotting Several Lines in One Graph with Different Line Styles and Markers

I = arange(6)
plot(I, sin(I), 'o', I, cos(I), '-', lw=3, ms=8)
title("plot(I, sin(I), 'o', I, cos(I), '-', lw=3, ms=8)")

Figure 6-3 shows the results of this example.

Figure 6-3. Plotting several graphs in one figure

When you plot multiple lines in one plot() function call, the parameters linewidth and markersize control all the lines in the same plot() command. If you’d like different lines with different line styles or different marker sizes in the same figure, draw each plot with an individual call to the plot() function and use the hold() function, as shown in Listing 6-2.

Listing 6-2. Different Line Widths in One Graph

figure(); hold(True)
I = arange(6)
plot(I, sin(I), lw=4)
plot(I, cos(I), lw=2)

Colors

Finally, on our list of plotting basics is controlling color.
Just like marker style and line style, you can control color with one character, according to the list in Table 6-4.

Table 6-4. Color-Character Lookup Table

Character   Color
'b'         Blue
'c'         Cyan
'g'         Green
'k'         Black
'm'         Magenta
'r'         Red
'w'         White
'y'         Yellow

As you might have noticed, matplotlib automatically chooses a different color for subsequent line plots if a color is not specified. You can select your preferred color by supplying a color character:

>>> figure()
>>> I = arange(6)
>>> plot(I, sin(I), 'k+-', I, cos(I), 'm:')

This will plot two lines: the first is a black line with plus markers and a connecting solid line, and the second is a magenta dotted line.

If you’d like a color that does not appear in Table 6-4, you can choose one from the dictionary object matplotlib.colors.cnames. The dictionary contains a better color selection and has over a hundred values. And lastly, if that dictionary is not enough, you can provide an explicit Red-Green-Blue (RGB) value. If you’re using the dictionary values or an explicit RGB value, you have to provide the color argument as a parameter to a plot() call:

>>> plot(randn(5), 'y', lw=5)                    # 'y' from the color table
>>> plot(randn(5), color='yellowgreen', lw=5)    # using matplotlib.colors.cnames
>>> plot(randn(5), color='#ffff00', lw=5)        # explicit yellow RGB

See help(matplotlib.colors) for additional color information.

Note: The function randn(n) in the preceding example generates a random vector of size n.

Controlling the Graph

For a graph to convey an idea aesthetically, the data, although highly important, is not everything. The grid and grid lines, combined with a proper selection of axis and labels, present additional layers of information that add clarity and contribute to overall graph presentation.

Now that we have the basics of plotting lines and markers covered, we turn to controlling the figure: controlling the x-axis and y-axis behavior and setting grid lines.
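Before moving on to the axes, the ways of naming a color described under “Colors” can be compared by converting each to an RGB tuple. The sketch below assumes a current matplotlib release that provides matplotlib.colors.to_rgb(); that helper is not mentioned in the text above, and I use it here only to make the color values visible.

```python
from matplotlib import colors

from_hex = colors.to_rgb('#ffff00')           # explicit RGB as a hex string
from_name = colors.to_rgb('yellowgreen')      # a name found in matplotlib.colors.cnames
from_tuple = colors.to_rgb((1.0, 1.0, 0.0))   # RGB given directly as floats in 0..1

print(from_hex, from_name, from_tuple)
print(colors.cnames['yellowgreen'])           # the hex string stored for that name
print(len(colors.cnames))                     # how many named colors are available
```

Whichever form you pass as the color argument, matplotlib reduces it to the same kind of value: three floats between 0 and 1.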
Axis

The axis() function controls the behavior of the x-axis and y-axis ranges. If you do not supply a parameter to axis(), the return value is a tuple in the form (xmin, xmax, ymin, ymax). You can use axis() to set the new axis ranges by specifying new values: axis([xmin, xmax, ymin, ymax]). In the special case you’d like to set or retrieve only the x-axis values or y-axis values, use the functions xlim(xmin, xmax) or ylim(ymin, ymax), respectively.

Other than the range limits, the function axis() also accepts the following values: 'auto', 'equal', 'tight', 'scaled', and 'off'. The value 'auto', the default behavior, allows plot() to select what it thinks are the best values. The value 'equal' forces one unit along the x-axis to have the same length as one unit along the y-axis, which is important if you’re trying to convey physical distances, say, in a GPS plot. The value 'tight' causes the axis to change so that the maximum and minimum values of x and y both touch the edges of the graph. The value 'scaled' changes the x-axis and y-axis ranges so that x and y have both the same length (i.e., aspect ratio of 1). Lastly, calling axis('off') removes the axis and labels.

To illustrate these axis behaviors, I’ve plotted a circle, as demonstrated in Listing 6-3.

Listing 6-3. Plotting a Circle

R = 1.2
I = arange(0, 2*pi, 0.01)
plot(sin(I)*R, cos(I)*R)

Note: The reason I chose a circle of radius 1.2 is that in the case of a radius of “nicer” numbers (say 1.0 or 2.0), the automatic axis solution works very well, and it’s hard to show the effects of the different axis options.

Figure 6-4 shows the results of applying different axis values to this circle.

Figure 6-4. Controlling axis behavior

Grid and Ticks

The function grid() draws a grid in the current figure. The grid is composed of a set of horizontal and vertical dashed lines coinciding with the x ticks and y ticks. You can toggle the grid by calling grid() or set it to be either visible or hidden by using grid(True) or grid(False), respectively.
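The get-and-set behavior of axis(), and the explicit grid(True) call, can be sketched on the circle from Listing 6-3. This is a minimal example that assumes a current matplotlib release and uses the object-oriented interface (ax.axis(), ax.set_xlim(), ax.grid()) in place of the pylab functions named above; the particular limits are arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

R = 1.2
I = np.arange(0, 2 * np.pi, 0.01)

fig, ax = plt.subplots()
ax.plot(np.sin(I) * R, np.cos(I) * R)

auto_limits = ax.axis()              # (xmin, xmax, ymin, ymax) chosen automatically
ax.axis([-2, 2, -1.5, 1.5])          # set all four limits at once
manual_limits = ax.axis()

ax.set_xlim(-3, 3)                   # change only the x range
x_only = ax.get_xlim()

ax.grid(True)                        # switch the grid on explicitly rather than toggling

print(auto_limits)
print(manual_limits, x_only)
plt.close(fig)
```

Reading the limits back after each change is a quick way to confirm what a given axis() call actually did before looking at the rendered figure.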
To control the ticks (and effectively change the grid lines as well), use the functions xticks() and yticks(), as shown in Listing 6-4. The functions behave similarly to axis() in that they return the current ticks in case no parameters are passed and are used to set ticks once parameters are provided. The functions take an array holding the tick values as numbers and an optional tuple containing text labels. In case the tuple of labels is not provided, the tick numbers are used as labels.

Listing 6-4. Grid and Tick Example

R = 1.2
I = arange(0, 4*pi, 0.01)
plot(sin(I)*R, cos(0.5*I)*R)
axhline(color='gray')
axvline(color='gray')
grid()
xticks([-1, 0, 1], ('Negative', 'Neutral', 'Positive'))
yticks(arange(-1.5, 2.0, 1))

Figure 6-5 shows the output generated from Listing 6-4 without issuing the last two calls to xticks() and yticks() (left graph) and with xticks() and yticks() calls (right graph). Notice the labels on the x-axis.

Figure 6-5. Controlling grid and axis: the left graph shows the default xticks(), the right graph displays labels.

I’ve also made use of the functions axhline() and axvline(), which plot a line across the x-axis and y-axis, respectively. The axhline() and axvline() functions accept many parameters, including color, linewidth, and linestyle, to name a few.

Subplots

In some of the previous figures in this chapter, I’ve displayed several smaller graphs in one figure; these are known as subplots. The subplot() function splits the figure into subplots and selects the current subplot. The subplots are numbered from left to right, top to bottom, so the upper-left subplot is 1, and the lower-right subplot is equivalent to the number of subplots. Notice that this is different from the default counting behavior used in Python: numbers start at 1 and not at 0.

To split the figure into 2-by-2 subplots and select the upper-left subplot for plotting, issue subplot(2, 2, 1).
Alternatively, you can pass the string '221', which does the same thing: subplot('221').

It’s also possible to combine subplots of different sizes in one figure. This is a bit tricky and requires subplotting with different subplot sizes. Listing 6-5 gives an example that generates a subplot on the upper half of the figure and two subplots on the lower part of the figure, the results of which you can see in Figure 6-6.

Listing 6-5. Subplots of Varying Sizes

figure()
subplot(2, 1, 1)
title('Upper half')
subplot(2, 2, 3)
title('Lower half, left side')
subplot(2, 2, 4)
title('Lower half, right side')

Figure 6-6. Subplots of varying sizes

Tip: Subplots are especially useful in visualizing several aspects of the same data. For example, the GPS example in Chapter 1 shows x and y coordinates in one subplot and velocity in another subplot. Events (e.g., speeding) are marked in both, providing a visual link between the two subplots.

Erasing the Graph

The functions cla() and clf() clear the axes and the figure, respectively. These functions are useful when you’re working with an interactive environment and would like to clear the current axes (i.e., setting the axes to default values and clearing the plotted lines). It’s also possible to clear the figure altogether, erasing also the axes and subplots, using the clf() function. Lastly, you can choose to close the figure window; this is done by calling the function close(). If you provide a number to close(), the figure associated with the number is closed. So close(1) will close Figure 1, leaving other figures open. If you’d like to close all the figures, issue close('all').

Adding Text

There are several options to annotate your graph with text. You’ve already seen some using the xticks() and yticks() functions. The following functions will give you more control over text in a graph.

Title

The function title(str) sets str as a title for the graph and appears above the plot area.
The function accepts the arguments listed in Table 6-5.

Table 6-5. Text Arguments

Argument                    Description                         Values
fontsize                    Controls the font size              'large', 'medium', 'small', or an actual size (i.e., 50)
verticalalignment or va     Controls the vertical alignment     'top', 'baseline', 'bottom', 'center'
horizontalalignment or ha   Controls the horizontal alignment   'center', 'left', 'right'

All alignments are based on the default location, which is above the graph, centered. The alignment names the side of the text that is placed at that location. So setting ha='left' will print the title starting at the middle (horizontally) and extending to the right. Similarly, setting ha='right' will print the title ending in the middle of the graph (horizontally). The same applies for vertical alignment. Here’s an example of using the title() function:

>>> title('Left aligned, large title', fontsize=24, va='baseline')

Axis Labels and Legend

The functions xlabel() and ylabel() are similar to title(), only they’re used to set the x-axis and y-axis labels, respectively. Both these functions accept the text arguments from Table 6-5:

>>> xlabel('time [seconds]')

Next on our list of text functions is legend(). The legend() function adds a legend box, associating a plot with text:

>>> I = arange(0, 2*pi, 0.1)
>>> plot(I, sin(I), '+-', I, cos(I), 'o-')
>>> legend(['sin(I)', 'cos(I)'])

The legend order associates the text with the plot. Had I called legend() with the inverted list, the result would be a wrong legend. An alternative approach is to specify the label argument with the plot() function call, and then issue a call to legend() with no parameters:

>>> I = arange(0, 2*pi, 0.1)
>>> plot(I, sin(I), '+-', label='sin(I)')
>>> plot(I, cos(I), 'o-', label='cos(I)')
>>> legend()

Figure 6-7 shows the addition of an x-axis label and legend.

Figure 6-7. Adding an x-axis label and legend

You can also control the location of the legend box via the loc parameter. This is important if you don’t want the legend text to hide the graph line.
loc can take one of the following values: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. Instead of using strings, you can use numbers: 'best' corresponds to 0, 'upper right' corresponds to 1, and 'center' corresponds to 10. Using the value 'best' moves the legend to a spot less likely to hide data; however, performance-wise there may be some impact.

The function legend() has additional options that let you add a drop shadow and control the spacing between the text within the legend, among other things. Consult the interactive help for additional information.

Text Rendering

The text(x, y, str) function accepts the coordinates in graph units x, y and the string to print, str, and renders the string on the figure. You can modify the text alignment using the arguments in Table 6-5. The following will print text at location (0, 0):

>>> figure()
>>> plot([-1, 1], [-1, 1])
>>> text(0, 0, 'origin', va='center', ha='center')

The function text() has many other arguments such as rotation (which was used in Chapter 1) and fontsize. See help(text) for a complete list of arguments.

Mathematical Symbols and Expressions

Last on our list of text-related functions is one that renders mathematical symbols and expressions. The syntax to use mathematical symbols provided by matplotlib is similar to that of TeX. To render mathematical expressions, use a raw string and enclose your mathematical expression with $ signs. For Greek letters, start with a backslash followed by the name of the letter. So to print alpha (α), your string should be r'$\alpha$'. Fractions can be created using the \frac{num}{den} notation; for example, r'$\frac{\pi}{4}$' is the symbol π divided by four. Subscripts are denoted with an underscore, so to render the text a with subscript i, write r'$a_i$'.
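The notation above can be tried out with a short script. This is only a minimal sketch: it assumes the matplotlib.pyplot interface and the non-interactive Agg backend (choices made here so the script runs without a display, not something the text prescribes), and the variable names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# Raw strings keep the backslashes intact; the $ signs delimit the math parts.
ax.set_title(r'$\alpha$ and $\frac{\pi}{4}$')   # Greek letter and a fraction
ax.set_xlabel(r'$a_i$')                         # subscript

# Drawing the canvas forces matplotlib to parse the math expressions.
fig.canvas.draw()

title_string = ax.get_title()
label_string = ax.get_xlabel()
```

The strings are stored exactly as written; the conversion to typeset symbols happens only when the figure is drawn.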
It is beyond the scope of the book to cover the entire TeX syntax supported by matplotlib. For additional information, refer to the online matplotlib web site (at the time of the writing of this book, the following link was available: http://matplotlib.sourceforge.net/users/mathtext.html). That being said, whenever you encounter a mathematical expression in this book, you'll more than likely be able to figure out how it works with the small subset of commands presented in this section.

Example: A Summary Graph

The script in Listing 6-6 is an example summarizing the functions we've discussed up to this point: plot() for plotting; title(), xlabel(), ylabel(), and text() for text annotations; and xticks(), ylim(), and grid() for grid control.

Listing 6-6. Plot Summary Example

    I = arange(0, 2*pi+0.1, 0.1)
    plot(I, sin(I), label='sin(I)')
    title('y = sin(x)')
    xlabel('x [rad]')
    ylabel('Function y = sin(x)')
    text(pi/2, 1, 'Max value', ha='center', va='bottom')
    text(3*pi/2, -1, 'Min value', ha='center', va='top')
    xticks(arange(0, 2*pi, pi/2), \
        ('0', r'$\frac{\pi}{2}$', r'$\pi$', r'$\frac{3\pi}{2}$'))
    xlim([0, 2*pi])
    ylim([-1.2, 1.2])
    grid()

The result of this example appears in Figure 6-8.

Figure 6-8. Plot summary example

More Graph Types

While the regular line and marker plots are excellent visualization tools, they're hardly the only ones. This section provides a quick overview of some other 2-D graph options.

Bar Charts

A favorite of many, bar charts allow quantitative comparison of several values. To use a bar chart, call the function bar(left, height), where left is the x coordinates of the bars and height is the bar heights. The function bar() allows for considerable customization; issuing help(bar) will provide most of the details.
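Before turning to real data, here is a minimal sketch of a bar chart with made-up values. It assumes the matplotlib.pyplot interface and the Agg backend, and passes the two arguments positionally, since the name of the first parameter may differ between matplotlib versions. The positions, heights, and tick labels are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

positions = [0, 1, 2]          # x coordinates of the bars (made-up data)
heights = [3.0, 1.5, 2.25]     # bar heights (made-up data)

fig, ax = plt.subplots()
# Both arguments are positional: x coordinates first, heights second.
bars = ax.bar(positions, heights, align='center')

# Put a text label under each bar.
ax.set_xticks(positions)
ax.set_xticklabels(['A', 'B', 'C'])

# Each bar is a Rectangle patch whose height can be queried back.
bar_heights = [b.get_height() for b in bars]
```

The return value is a container of rectangle patches, one per bar, which is handy when you want to annotate or recolor individual bars afterward.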
Example: GDP, N Top Countries

For this example, which plots the purchasing power parity (GDP) of various countries, you'll need the CIA GDP Rank Order file, available from the CIA World Factbook (https://www.cia.gov/library/publications/the-world-factbook/rankorder/2001rank.txt); this is a tab-delimited file, perfect for easy data processing. I'll assume that you've downloaded the file and saved it in folder Ch6/data; the source code resides in Ch6/src, and the output files are located in Ch6/images.

First, we'll define a function to read the data, as we will use it in several examples in the chapter. The code in Listing 6-7 should be saved under file src/read_world_data.py.

Listing 6-7. Function read_world_data()

    import csv, re

    def read_world_data(N=10, fn='../data/2001rank.txt'):
        """
        A function to read CIA World Factbook file.
        N is the number of countries to process.
        See https://www.cia.gov/library/publications/the-world-factbook/
        rankorder/2001rank.txt.
        """
        # initialize return lists
        gdp, labels = [], []
        # read the data and process it
        for i, row in enumerate(csv.reader(open(fn), delimiter='\t')):
            # skip first several lines
            if i > 3:
                # remove the dollar, comma and space characters
                gdp_value = re.sub(r'[\$, ]', '', row[2])
                # store data in billions of dollars
                gdp.append(float(gdp_value)/1e9)
                labels.append(row[1].strip())
            # stop analyzing the data after N countries have been processed
            if i > N+2:
                break
        return (gdp, labels)

The function reads data from the first N countries and returns their GDP alongside the country names. I've made use of two modules. The first, the csv module, reads the data, which is tab delimited. The second, the re module, gets rid of the dollar sign, comma, and space characters in the GDP value field.

Armed with the read_world_data() function, we turn to plot the bar chart (see Listing 6-8).

Listing 6-8.
Plotting the GDP Bar Chart

    # a script to plot GDP bar chart
    from pylab import *

    # initialize variables, N is the number of countries
    N = 5
    execfile('read_world_data.py')
    gdp, labels = read_world_data(N)

    # plot the bar chart
    bar(arange(N), gdp, align='center')

    # annotate with text
    xticks(arange(N), labels)
    for i, val in enumerate(gdp):
        text(i, val/2, str(val), va='center', ha='center', color='yellow')
    ylabel('$ (Billions)')
    title('GDP rank, data from CIA World Factbook')

The script by now should be quite readable. Notice that I've decided to put the read_world_data() function in a separate file, and so to be able to use the function, I've called the function execfile('read_world_data.py').

If you scroll down to the end of the CIA GDP rank order file, you'll find a note similar to this:

    This file was last updated on 23 October, 2008

It's a good idea to extract the date information and add it to the title (or some other spot of your choice):

>>> last_line = open('../data/2001rank.txt').readlines()[-1]
>>> title('GDP rank, data from CIA World Factbook, ' + last_line[31:-1])

Alternatively, you can modify the function read_world_data() to return this string as well. Figure 6-9 shows our bar chart.

Figure 6-9. Bar chart showing World GDP rank

It's also possible to add error bars. To add an error bar equivalent to ±1000 billion dollars (talk about an error, eh?), add this line to the script shown in Listing 6-8, just after the bar() function call:

    errorbar(arange(N), gdp, 1000*ones(N), color='k')

Finally, the function barh() plots a horizontal bar chart instead of a vertical one should you require one.

Histograms

Histograms are charts that show the frequency, or occurrence, of values. In matplotlib, the function hist() is used to calculate and draw the histogram chart. At a minimum, you must supply an array of values. You can control the number of cells in a histogram by specifying them as follows: hist(values, numcells).
Alternatively, you can specify the histogram bins hist(values, bins), where bins is a list holding histogram bin values. The return value from hist() is a tuple of probabilities, bins, and patches. Patches are used to create the bars; I'll go into more detail in the "Patches" section later in the chapter. The function hist() has other customization options, including the histogram orientation (vertical or horizontal), the alignment of bars, and more. Again, refer to the interactive help: help(hist).

Example: GDP, Histogram

We turn again to the GDP ranks from the CIA World Factbook; this time we plot a histogram of the N largest economies. Again, we use the read_world_data() function implemented in the previous example (see Listing 6-9).

Listing 6-9. Plotting GDP Histogram

    # a script to plot GDP histogram
    from pylab import *

    # initialize variables; N is the number of countries, B is the bin size
    N, B = 50, 1000
    execfile('read_world_data.py')
    gdp, labels = read_world_data(N)

    # plot the histogram
    prob, bins, patches = hist(gdp, arange(0, max(gdp)+B, B), align='center')

    # annotate with text
    for i, p in enumerate(prob):
        percent = int(float(p)/N*100)
        # only annotate non-zero values
        if percent:
            text(bins[i], p, str(percent)+'%',
                rotation=45, va='bottom', ha='center')
    ylabel('Number of countries')
    xlabel('Income, billions of dollars')
    title('GDP histogram, %d largest economies' % N)

    # some axis manipulations
    xlim(-B/2, xlim()[1]-B/2)

Figure 6-10 shows the resulting graph.

Figure 6-10. GDP histogram, N largest economies

Again, the script should prove quite readable. I'd like to turn your attention to what might appear to be an odd modification I've made to the x-axis using the call to function xlim(). The purpose of this call is to modify the default behavior of the x-axis ranges.
The motivation behind this modification is that since I've chosen 'center' for the histogram bins, the automatic x-axis range includes negative values, because the leftmost bin is centered at zero but has a width, part of it in the negative x-axis. I didn't like this behavior and chose to override it by manually setting the axis. Instead of setting a fixed number, I've first retrieved the current axis by calling xlim(), and then modified the x-axis by subtracting and adding half the bin width, B/2, to the axis.

As a general rule, when you modify default behavior like this, try to use parameters as much as possible (in the preceding example, using the parameter B, not the value 1000, and retrieving current values with xlim()); this will allow for more flexible scripts that cater to a wider range of input values.

Pie Charts

Pie charts are as simple to use as bar charts. The function that implements pie charts is pie(x), where x holds the values to be charted.

Example: GDP, Pie Chart

Listing 6-10 presents a script to generate a pie chart, shown in Figure 6-11, again making use of the function read_world_data().

Listing 6-10. Plotting a GDP Pie Chart

    # a script to plot GDP pie chart
    from pylab import *

    # initialize variables, N is the number of countries
    N = 10
    execfile('read_world_data.py')
    gdp, tags = read_world_data(N)

    # plot the pie chart
    pie(gdp, labels=tags, shadow=True)
    title('GDP rank, data from CIA World Factbook')

Figure 6-11. GDP pie chart, N largest economies

Note: I've decided to use the variable tags instead of labels so that the call to pie() would be a little less confusing. Had I stuck with the original name, labels, the call to pie would've been pie(gdp, labels=labels, shadow=True), which still would've worked, but would seem a bit confusing in my opinion.

Logarithmic Plots

The functions semilogx() and semilogy() are used to plot the x-axis and y-axis in a logarithmic scale, respectively.
Logarithmic plots of type semilogy() are common when plotting power or intensity values, for example, those of the Richter magnitude scale, which measures seismic energy. Likewise, measurements of quantities used with frequencies, for example, are commonly plotted on a logarithmic x-scale denoting octaves and decades. There's also the option of using a loglog() plot, which means both x-axis and y-axis are logarithmic. This is the case in Bode plots, common in engineering fields.

All three functions, semilogx(), semilogy(), and loglog(), can be modified with arguments similar to those presented with the plot() function.

The function logspace(start, stop, num=50, endpoint=True, base=10.0) can be useful in creating a range of values to be plotted with the preceding functions. The start and stop values are the exponent values. logspace() generates logarithmically spaced values between 10**start and 10**stop. You can decide whether the end value, 10**stop, is returned by specifying endpoint=True. If you'd like a base other than 10, set base to the value you require.

>>> figure()
>>> I = 2*logspace(1, 5, 5)
>>> I
array([  2.00000000e+01,   2.00000000e+02,   2.00000000e+03,
         2.00000000e+04,   2.00000000e+05])
>>> semilogx(I, [20, 19, 8, 2, 2], '+-')
>>> grid()
>>> title('Logarithmic plot, semilogx()')
>>> xlabel('Frequency [Hz]')
>>> ylabel('Amplitude [dB]')

Figure 6-12 shows the results of the preceding example.

Figure 6-12. Logarithmic plot

Notice that when plotting with semilogx(), semilogy(), and loglog(), the labels are the original values, not the logarithms of the values. If you'd like to print the logarithmic values, you should probably use a regular plot() function with log() or log10() of the values. This is useful, for example, in estimating the energy in decibels (dB):

>>> def db(x):
...     """Returns the value of x, in decibels."""
...     return 20*log10(abs(x))
>>> plot(db(array([1000, 980, 970, 400, 30, 2, 1, 1])))

Polar Plots

Polar plots draw polar coordinate values: a radius at a given angle.
Polar plots are commonly used to draw antenna radiation patterns, as they depict the energy the antenna transmits at any given angle. Polar plots are implemented using the polar(theta, r) function.

To set the labels along the radius, use the rgrids(radii, labels) function, which works similarly to xticks() and yticks(). If you don't provide the labels value, the radii values are used as labels. You can also set the angle at which the labels are plotted (the default is 22.5 degrees). Similarly, the function thetagrids() plots the angle ticks and labels, as demonstrated in Listing 6-11.

Listing 6-11. A Polar Plot

    theta = arange(0, 2*pi, 0.01)
    polar(theta, cos(theta), theta, -cos(theta))
    rgrids([0.5, 1.0], ['Half', 'Full'])
    theta_labels = ['0', r'$\frac{\pi}{2}$', r'$\pi$', r'$\frac{3*\pi}{2}$']
    thetagrids(arange(0, 360, 90), theta_labels)
    title(r'A polar plot of $\pm cos(\theta)$')

Figure 6-13 shows the resulting polar plot.

Figure 6-13. Polar plot

In the title, I've used the ± symbol denoted by '$\pm$'.

Stem Plots

Stem plots draw a vertical line from (x, 0) to (x, y) for every (x, y) value as well as a marker at (x, y). Stem plots are used to denote discrete data and are popular for plotting filtering windows (see Listing 6-12).

Listing 6-12. A Stem Plot of Filter Windows

    from pylab import *

    N = [4, 8, 16, 64]
    for i, n in enumerate(N):
        subplot(2, 2, i+1)
        stem(arange(n), hamming(n))
        xticks(arange(0, n+1, n/4))
        yticks([0, 0.5, 1])
        xlim(-0.5, n+0.5)
        legend(['N=%d' % n])

Figure 6-14 shows the results of this listing.

Figure 6-14. Stem plots of a Hamming window with different N values

In the preceding example, I've made use of the legend() function to denote the number of elements used in the plot, as I think it's clearer than a title. Notice that I had to supply a list to legend(['N=%d' % n]) (notice the brackets). Had I not supplied a list, the string 'N=%d' would have been split because legend() assumes a sequence of elements and assigns each one a plot line.
I've also made use of the hamming() function to create a Hamming window, commonly used in filtering values.

Additional Graphs

Matplotlib also supports a great number of graphs used to depict more complex data. Here's a short list of some of the graphs available:

• Functions contour() and contourf() are used for contour plots. Contour plots draw lines connecting (x, y) points of equal value. They're used in weather maps, detailing lines of equal pressure or temperature; in topographical maps, detailing the terrain; in physics graphs, to describe fields; and more.

• Function specgram() displays the frequency contents of data over time. specgram() can be used, for example, to plot the frequencies of a sound wave as a function of time.

• Both the contour() and specgram() functions rely on a color map to depict the data. Color maps are a relation between a value and a color. Matplotlib provides a set of color maps that include such names as autumn() and hot() to ease the selection of a color map.

• Function quiver() implements quiver plots, which are typically used to describe force fields in physics. The quiver plot is a set of arrows depicting the force at each point (direction and magnitude).

Example: Plotting Frequency Content of a Signal

At times it's of value to plot the frequencies a signal is composed of as a function of time. For example, in an audio signal, a different frequency means a different note, so plotting frequencies as a function of time is a possible "musical visualization."

In this example, shown in Listing 6-13, we create a signal composed of several discrete frequencies and display those frequencies as a function of time using a specgram().

Listing 6-13.
Specgram of a Signal

    from pylab import *

    Fs = 256
    times = [3, 7, 5]
    frequencies = [100, 20, 80]
    y = array([])
    for t, f in zip(times, frequencies):
        x = cos(2*pi*arange(t*Fs)/Fs*f)
        y = append(y, x)
    specgram(y, 256, Fs)
    xlabel('Time [sec]')
    ylabel('Frequency [Hz]')

I've set the frequency of sampling at 256 samples per second and created a signal composed of 100 Hertz (Hz) for 3 seconds, 20 Hz for 7 seconds, and then 80 Hz for 5 seconds. I then plot the signal using specgram(), with the results shown in Figure 6-15.

Figure 6-15. A specgram

Figure 6-15 clearly shows that in the first 3 seconds the frequency is 100 Hz, in the next 7 seconds the frequency is 20 Hz, and in the last 5 seconds the signal's frequency is 80 Hz.

Note: If you look closely at the figure, you'll notice there's a half-a-second shift in the specgram. This is due to an overlapping window of size 128 samples. See help(specgram) for information on the overlapping window.

You can change the colors used to display the specgram using a color map function. Simply issue hot() or autumn() at the end of the script, and observe the results. See help(colormaps) for a full account of available colormaps.

Example: A Repelling Force Field

The following example illustrates the use of quiver() to depict a force field. At each point in the figure, an arrow points in the direction of the acting force, and the size of the arrow denotes the force's magnitude.

    from pylab import *

    x = arange(-5, 6, 1)
    y = arange(-5, 6, 1)
    u, v = meshgrid(x, y)
    quiver(u, v)
    xticks(range(len(x)), x)
    yticks(range(len(y)), y)
    axis([-1, 11, -1, 11])
    axis('scaled')
    title('A repelling force field')

I've made use of the function meshgrid(x, y), which generates two matrices: the first is a matrix of repeating values of x, and the second is a matrix of repeating values of y. The output is used to plot the quiver, shown in Figure 6-16. I then update the axis to reflect the proper ranges.

Figure 6-16.
A quiver plot depicting a force field

Getting and Setting Values

As you start plotting and generating visual output, you'll find that you're using more and more of the "helper" functions, functions that don't necessarily plot the data, but rather control the graph behavior and arrange labels just the way you want them.

So far we've used two methods to modify a plot's behavior. One was using dedicated functions such as axis(), xlim(), and ylim() to control the plot ranges. The other method you've seen was passing arguments to functions, for example, the rotation argument in the text() function.

A third method is available, one that makes use of the object-oriented design of matplotlib. It involves two functions, setp() and getp(), which set and retrieve a matplotlib object's parameters. The benefit of using setp() and getp() is that automation is easily achieved.

Up to this point we've suppressed the output from matplotlib so that the interactive scripts are easier to follow. We now turn to looking at those outputs. Whenever you issue a plot() command, matplotlib returns a list of matplotlib objects. This is important; the return value from calling plot() is a list of objects, not the matplotlib object itself, even if you only have one line to plot.

>>> from pylab import *
>>> p = plot(arange(5))
>>> type(p)
<type 'list'>
>>> type(p[0])
<class 'matplotlib.lines.Line2D'>

The function setp(matobj) prints a list of properties you can set for matobj, where matobj is a matplotlib object.
The function accepts either a list of matplotlib objects or just one object:

>>> setp(p[0])
<output suppressed>

If you're not sure of what values a parameter can take, issue setp(p, 'param'):

>>> setp(p[0], 'visible')
visible: [True | False]

So to hide the plot, you could issue

>>> setp(p[0], visible=False)

or to set the label associated with a line, issue

>>> setp(p[0], label='Line 1')
>>> legend()

The function setp() also accepts lists of matplotlib objects, in which case all the matplotlib objects in the list will be set.

Note: To query acceptable parameters, enclose the parameter to be queried in quotes: setp(p, 'linewidth'). To set a parameter value, do not include the quotes, but do use an assignment: setp(p, linewidth=4).

Similarly, to retrieve values, use the getp() function. The function getp() is a little less forgiving in that it requires one matplotlib object, not a list of objects.

>>> getp(p[0], 'linewidth')
1.0

Setting Figure and Axis Parameters

In the preceding examples we stored the return value from the call to the function plot(), which is a matplotlib object of a line, specifically the line we drew (actually, a list containing one line). But how do we modify the behavior of the figure or the axis?

The function gcf() returns a handle to the current figure. The function gca() returns a handle to the current axis. Armed with these, we can now modify the axis and figure parameters. To set the y label, instead of calling ylabel('Y value'), we could issue the command

>>> setp(gca(), ylabel='Y value')

But what are the benefits of using setp() in this manner over simply calling ylabel()? The answer is automation. Let's turn to an example.

Example: Modifying Subplot Parameters

Suppose you'd like to write a function that receives a figure number and then modifies all the subplot titles in the figure (if they exist) to numbered titles.
For example, for a figure of 2-by-2 subplots in use, you'd like the subplot titles to be from 1 to 4 (if they all exist). You don't know in advance how many subplots are in a figure. This is an ideal case for using setp() and getp(), as demonstrated in Listing 6-14.

Listing 6-14. Numbering Subplots

    from pylab import *

    def number_subplots(fignum):
        """Numbers the subplots in a figure."""
        # switch to the requested figure
        figure(fignum)
        fig = gcf()
        for i, fig_axe in enumerate(getp(fig, 'axes')):
            fig_axe.set_title(str(i+1))
        axis()

Some notes regarding the function number_subplots(). First, we set the focus to the figure we'd like to work on by calling figure(fignum). Next, we retrieve a handle to the figure with gcf(). The following step assumes some knowledge of the matplotlib object structure. But even if you're not familiar with the structure, it's pretty simple to figure out what's going on by exploring the objects. To illustrate this, create a simple figure with two empty subplots:

>>> figure()
>>> ax1 = subplot(2, 1, 1)
>>> ax2 = subplot(2, 1, 2)

Now retrieve the current figure properties with getp(gcf()):

>>> getp(gcf())
    alpha = 1.0
    animated = False
    axes = [<matplotlib.axes.AxesSubplot object at 0x035B4350>,
            <matplotlib.axes.AxesSubplot object at 0x035B7AF0>]
    children = [<matplotlib.patches.Rectangle object at 0x03594FB0>,
                <matplotlib.axes.AxesSubplot object at 0x035B4350>,
                <matplotlib.axes.AxesSubplot object at 0x035B7AF0>]

(I've removed the extra output lines as they're not important for the discussion.) Look closely at two properties: axes and children. The parameter axes holds a list of two values, and the parameter children holds a list of three values. Further examination shows that the axes objects are all contained within the children values. In reality, these are the two axes for the two subplots. So to get a list of these, we can simply call getp(gcf(), 'axes'), as the code indeed does. We then set the titles and call the axis() function to force a redraw.
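The same idea can be sketched without relying on the current-figure state. The variant below is an illustration rather than the listing's code: it assumes the matplotlib.pyplot interface and the Agg backend, takes a figure object instead of a figure number, and uses a hypothetical function name. It also shows setp() applied to a whole list of axes in one call.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

def number_subplots_in(fig):
    """Title every subplot of fig with its creation order, starting at 1."""
    for i, ax in enumerate(plt.getp(fig, 'axes')):
        ax.set_title(str(i + 1))

fig = plt.figure()
fig.add_subplot(2, 1, 1)
fig.add_subplot(2, 1, 2)

number_subplots_in(fig)

# setp() accepts a list of objects, so one call labels every subplot.
plt.setp(fig.axes, ylabel='Y value')

titles = [ax.get_title() for ax in fig.axes]
ylabels = [ax.get_ylabel() for ax in fig.axes]
```

Passing the figure explicitly keeps the function free of side effects on which figure is current, at the cost of departing from the figure-number interface described in the text.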
There's a caveat in the implementation of the function number_subplots(): numbering is performed in accordance with the creation of the subplots. That is, if the bottom-left subplot was created before the top-left subplot, it will have the smaller title value associated with it and not the regular subplot numbering (left to right, top to bottom). If you'd like to change this, you'll have to look at the positions of the subplots and assign numbers accordingly. This is somewhat more complex and not all that educational, so I've opted to leave it out of the discussion.

A lot of the parameters that are accessible via setp() and getp() are also accessible by means of dedicated functions. Instead of setting the y-axis label parameter with setp(), you can call the ylabel() function. When possible, I prefer using the function version over setp() and getp() because I think it's easier to follow.

Exploring the matplotlib object by use of the dir() function is also a very good method to probe the capabilities of a matplotlib object. Most of the functions are self-explanatory and let you set and retrieve values associated with a matplotlib object. In case you're not sure, use the help() function in an interactive Python session. From a partial comparison I've made, matplotlib object methods are equivalent to the properties available with getp() and setp(), so you can use either:

>>> matobj = gcf()
>>> [func for func in dir(matobj) if func.startswith('get')]
['get_alpha', 'get_animated', 'get_axes', 'get_children', 'get_clip_box',
 'get_clip_on', 'get_clip_path', 'get_contains', 'get_dpi', 'get_edgecolor',
 'get_facecolor', 'get_figheight', 'get_figure', 'get_figwidth',
 'get_frameon', 'get_label', 'get_picker', 'get_size_inches',
 'get_transform', 'get_transformed_clip_path_and_affine', 'get_visible',
 'get_window_extent', 'get_zorder']

Final note: working with setp() and getp() or the set and get methods of the matplotlib object is an advanced topic.
These functions allow closer control of the behavior of plots and graphs and are not easy to master. They require a good understanding of the matplotlib object hierarchy. Regardless of the complexity, I believe this is an important concept. As you draw more graphs and deal with more data, you'll find that the default functionality, although great, isn't exactly what you want. And in these cases, turning to setp() and getp() is a good option. I hope that I've exposed you enough to the topic to let you experiment on your own.

Patches

So far we've worked with text and lines, which are both implemented as matplotlib objects. But those two objects at times are not enough. A third object, the patch, allows drawing other types of shapes that don't necessarily fall under the category of a line or text.

The way you work with patches is that you assign them to an already existing graph, because in a sense patches are "patched" on top of a figure. Table 6-6 gives a partial listing of available patches. In this table, the notation xy indicates a list or tuple of (x, y) values.

Table 6-6. Available Patches

Patch                            Description
Arrow(x, y, dx, dy)              An arrow, starting at location (x, y) and ending at location (x+dx, y+dy)
Circle(xy, r)                    A circle centered at xy and radius r
Ellipse(xy, w, h, angle)         An ellipse centered at xy, of width w, height h, and rotated angle degrees
Polygon([xy1, xy2, xy3, ...])    A polygon made of vertices specified by xy points
Wedge(xy, r, theta1, theta2)     A wedge (part of a circle) centered at xy, of radius r, starting at angle theta1 and ending at angle theta2
Rectangle(xy, w, h)              A rectangle, starting at xy, of width w and height h

To use patches, follow these steps:

1. Draw a graph.
2. Create a patch object.
3. Attach the patch object to the graph's axes, using the add_patch() function.

Note: Although this might seem like a considerable effort to add, say, an arrow patch, in reality these three steps can be folded into one line.
To draw an arrow from (0, 0) to (1, 1), issue gca().add_patch(Arrow(0, 0, 1, 1)).

Example: Adding Arrows to a Graph

In this example we'll draw a graph and connect every two points on the graph with an arrow. First draw a simple graph:

>>> x = arange(10)
>>> y = x**2
>>> plot(x, y)

Now create a list of all the arrows:

>>> ars = [(x0, y0, dx, dy) for (x0, y0, dx, dy) in zip(x, y, diff(x), diff(y))]

This is a bit tricky. First, the function diff() creates a difference of every two elements in a vector, for example, diff([1, 2, 3, 30]) is [1, 1, 27]. This is exactly what we need for our dx and dy values for the Arrow() function. Second, we combine x, y, dx, and dy using the zip() function and return a list of tuples by using a list comprehension. Luckily for us, zip() uses the shortest vector, so even though diff() vectors are shorter by 1, it's not an issue.

Now, all that's left is to iterate through the list and attach an arrow to the graph:

>>> cur_axes = gca()
>>> for x0, y0, dx, dy in ars:
...     cur_axes.add_patch(Arrow(x0, y0, dx, dy))
>>> title('Arrows')

Figure 6-17 shows the added arrows.

Figure 6-17. Patching arrows

Needless to say, Arrow(), as well as other patches, can be customized considerably; you can adjust color, length, width, and more.

Example: Some Other Patches

The code in Listing 6-15 generates a list of patch objects and attaches them to a figure. The figure is originally empty.

Listing 6-15.
Some Patches

    from pylab import *

    # Import Ellipse and Wedge to current namespace
    from matplotlib.patches import Ellipse, Wedge

    # a list of some patches
    my_patches = [
        Arrow(0, 4, 0, -4, facecolor='g'),
        Circle([-2, 2], 1.5, linewidth=4, fc='orange'),
        Ellipse([2, 3], 4, 1, 45.0, edgecolor='r'),
        Polygon([[4, 2], [3, 3], [1, -1], [3, -1]], ls='dashed', fill=False),
        Wedge([-1, 0], 3, 200, 300, fc='m', ec='m'),
        Rectangle([1, -2], 3, -2, fill=False, lw=5, ec='r')
    ]

    # draw a figure
    figure()
    axis([-5, 5, -5, 5])

    # add the patches
    cur_ax = gca()
    for p in my_patches:
        cur_ax.add_patch(p)
    title('Patches')

Figure 6-18 shows the results of the code in Listing 6-15.

Figure 6-18. Some patches

The patch objects Ellipse and Wedge are not automatically imported to the current namespace when you issue from pylab import * (unlike Arrow, Circle, Polygon, and Rectangle), so I've manually imported them to the namespace with the statement from matplotlib.patches import Ellipse, Wedge. I've also passed arguments to the patches to show how to use them: facecolor (or fc), edgecolor (or ec), linestyle (or ls), linewidth (or lw), and fill.

Final Notes and References

We've explored the matplotlib package, a rich package that supports plotting in Python. The strong suit of matplotlib is easy plotting of simple and complex graphs with a high number of customization options. If you're not familiar with the package, exploring it with IPython's tab completion, complemented by help(), trial and error, and the manual, should yield excellent results in no time.

For the purposes of the book and the examples provided, this chapter covers all the topics you'll need. However, your needs may be different, and I hope that you now have the tools to explore this package on your own. The matplotlib web site is an excellent source of information, and I encourage you to explore it and learn more about the package.
• The matplotlib web site, http://matplotlib.sourceforge.net/

CHAPTER 7

Math Games

Preprocessing Data Prior to Visualization

Math is a fundamental tool in data visualization. Python provides outstanding math support and as such is an ideal development environment for analysis prior to visualization. There are several reasons I find using Python for this purpose so appealing. First is Python's interactive nature: it's easy to manipulate data and observe intermediate results, as well as modify and quickly plot them. The second reason, and probably the factor contributing the most, is the wide range and popularity of freely available, mature numerical packages. Lastly, Python is also structured, allowing the development of production-level code used to generate quality reports.

In this chapter we'll explore Python's math capabilities, the built-in modules math, cmath, and random, and the excellent package we'll use extensively (and have used in previous chapters), NumPy.

Modules math and cmath

Python provides two flavors of math modules: math and cmath. The math module has functions that are common to most programming languages and in essence is using the C math function calls. Functions from module math return floating-point numbers. In case of improper arguments an error will be raised:

>>> import math
>>> math.sqrt(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error

Module cmath returns complex results:

>>> import cmath
>>> cmath.sqrt(-1)
1j

Note: If you see -1.#IND in response to sqrt(-1), it means that NumPy is already imported but without complex math. This is probably due to a previously issued from numpy import * or from pylab import * command, or to a Python distribution that loads NumPy automatically and issues these commands for you.

Complex numbers are supported natively in Python with the complex built-in data type.
This is probably a contributing factor to Python’s popularity as a platform for numerical computation. The imaginary part of a complex number has a trailing j, as shown in the preceding example. Most arithmetic operations and function calls can be performed on complex numbers.

If you do not require complex number support, opt for using module math over cmath. First, it will provide you with valuable exception information if the parameter to a function is out of the domain, as shown previously. Second, cmath always returns complex results, even if results can be represented as noncomplex numbers, in which case the imaginary value will be zero. Lastly, some functions are only available in the math module, as listed in Table 7-1.

Table 7-1. Functions Available Only in the math Module

Function | Description | Example
--- | --- | ---
ceil(x) | Returns the smallest integer greater than or equal to x | ceil(2.5) returns 3.0. ceil(-2.5) returns -2.0.
floor(x) | Returns the largest integer less than or equal to x | floor(2.5) returns 2.0. floor(-2.5) returns -3.0.
fabs(x) | Returns the absolute value of x | fabs(-2.5) returns 2.5.
fmod(x,y) | Returns the remainder of x divided by y | fmod(2.5,2) returns 0.5. fmod(-2.5,2) returns -0.5. fmod(2.5,-2) returns 0.5. fmod(-2.5,-2) returns -0.5.
modf(x) | Returns the fractional and integer parts of x, in that order | modf(2.5) returns (0.5, 2.0). modf(-2.5) returns (-0.5, -2.0).
frexp(x) | Returns the mantissa, m, and exponent, e, such that $x = m \times 2^e$ | frexp(2.5) returns (0.625, 2). frexp(-2.5) returns (-0.625, 2).
ldexp(m,e) | Returns $m \times 2^e$ | ldexp(0.625,2) returns 2.5. ldexp(-0.625,2) returns -2.5.

Power, logarithmic, trigonometric, and hyperbolic functions are available in both math and cmath modules, as listed in Table 7-2, with the exception of the functions pow(x,y), atan2(y,x), and hypot(x,y), which are available only in math.

Table 7-2.
Power, Logarithmic, Trigonometric, and Hyperbolic Functions in the math and cmath Modules

Function | Description | Example (math) | Example (cmath)
--- | --- | --- | ---
Power | | |
exp(x) | $e^x$ | exp(1) returns 2.7182818284590451 ($e$). | exp(pi*1j) returns -1 (to within floating-point error).
pow(x,y) | $x^y$ | pow(-2,2) returns 4. | N/A, use operator **.
sqrt(x) | Square root of x | sqrt(4) returns 2. | sqrt(2j) returns 1+1j.
Logarithmic | | |
log(x[,base]) | Logarithm of x; if base is not specified, defaults to the natural logarithm | log(16,2) returns 4. | log(-1) returns 3.1415926535897931j ($j\pi$).
log10(x) | Logarithm of x, base 10 | log10(3) returns 0.47712125471966244. | log10(-1) returns 1.3643763538418414j.
Trigonometric | | |
sin(x), cos(x), tan(x) | Sine, cosine, and tangent of x | sin(pi/2) returns 1.0. | cos(1j) returns (1.5430806348152437+0j).
asin(x), acos(x), atan(x) | Arc sine, arc cosine, and arc tangent of x | asin(1) returns 1.5707963267948966 ($\pi/2$). | acos(2) returns 1.3169578969248164j.
atan2(y,x) | Arc tangent of y/x; preserves quadrant information and avoids division by zero | atan2(-1,1) returns -0.78539816339744828 ($-\pi/4$). | N/A
hypot(x,y) | $\sqrt{x^2+y^2}$ | hypot(3,4) returns 5.0. | N/A
Hyperbolic | | |
sinh(x), cosh(x), tanh(x) | Hyperbolic sine, cosine, and tangent of x | cosh(0) returns 1.0. | sinh(1j*pi) returns 1.2246063538223773e-016j (0).
Constants | | |
e, pi | | 2.7182818284590451, 3.1415926535897931 |

FUNCTION ATAN2

Function atan2(y,x) is very useful in that it maintains angle values of a point in a plane, as shown previously in Chapter 1. That is, if x and y represent coordinates in a plane, atan2(y,x) returns the angle between the positive x-axis and the line from the origin to the point. Consider the point located at (1,1): its angle is 45 degrees ($\pi/4$); point (–1,–1) has an angle of –135 degrees (or 225 degrees). If you were to use atan(y/x), both points (1,1) and (–1,–1) would yield 45 degrees, losing quadrant information. However, using atan2(y,x) the correct values are returned. There’s also a side benefit that if x is zero, the angle is calculated properly, whereas atan(y/x) would raise an exception.
Function atan2() is not particularly useful in complex math as values already represent Cartesian points.

Example: A Newton Fractal

In this example we use complex math to create a fractal based on the Newton-Raphson method (or simply Newton’s method). Fractals are used by scientists to investigate chaotic systems: systems whose state over time is highly dependent on initial conditions. The purpose of this example is to show the capabilities of Python’s complex math and explore some ways to visualize data other than a regular plot; fractals are a perfect match.

Newton’s method is an iterative procedure to find numerical solutions, or roots, to an equation of the form $f(z) = 0$ using an initial guess. One of the characteristics of the method is that when there are several solutions, we usually cannot tell in advance, based on the initial guess, which solution the iteration will converge to. If you color each initial guess according to the solution it converges to, the result is an image known as Newton’s fractal, which is geometrically interesting.

If you’d like to read more about Newton’s method, have a look at http://en.wikipedia.org/wiki/Newton%27s_method; there’s a lot of additional information available on the Internet.

The function we’ll map is $f(z) = z^4 + 1$. This function has four roots:

    >>> from cmath import pi, cos, sin
    >>> solutions = [cos((2*n+1)*pi/4)+1j*sin((2*n+1)*pi/4) for n in range(4)]

To verify that these are indeed solutions to the equation, we raise each one to the fourth power and expect -1:

    >>> [z**4 for z in solutions]
    [(-1+4.4408920985006262e-016j), (-1+4.4408920985006262e-016j),
    (-1.0000000000000004+6.6613381477509412e-016j), (-1+8.8817841970012523e-016j)]

The imaginary parts are on the order of $10^{-16}$ and are due to inaccuracies of the trigonometric functions, $\pi$, and the floating-point representation; the imaginary parts are actually zero.
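Rather than reading the residuals by eye, the same check can be made with an explicit tolerance. The following is a minimal sketch, written for Python 3; the helper name is_root and the tolerance of 1e-12 are illustrative assumptions and are not part of the book's listings.

```python
from cmath import pi, cos, sin

# The four roots of z**4 + 1 = 0, from Euler's formula
solutions = [cos((2*n + 1)*pi/4) + 1j*sin((2*n + 1)*pi/4) for n in range(4)]

def is_root(z, tol=1e-12):
    """Hypothetical helper: True if z satisfies z**4 + 1 = 0 to within tol."""
    return abs(z**4 + 1) < tol

# Test every candidate root against the tolerance
results = [is_root(z) for z in solutions]
print(results)
```

Comparing the absolute value of the residual against a small threshold is the same idea the fractal script uses for its convergence test, only with a tighter threshold.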
Newton’s method takes an initial guess and calculates the next guess by applying the equation $z_{n+1} = z_n - f(z_n) / f'(z_n)$, where $f'(z)$ is the derivative of $f(z)$, or in our case z = z-(z**4+1)/(4*z**3). To check whether the new value is a “good” solution, we substitute it into the original equation, $f(z)$, and check how close the result is to zero. In practice, we check whether the absolute value is less than delta, a predefined small value.

The number of iterations is an indication of how fast the solution was reached. We’ll use this to select the color depth of each solution: solutions that converged fast will be brighter. Once our guess converges, we check what solution it converged to and color it accordingly. Since there are four solutions, there will be four colors (at varying color depths) in the fractal. Listing 7-1 generates said Newton’s fractal in the region (0, 0)–(1, 1).

Listing 7-1. fractal.py

    from PIL import Image
    from cmath import *

    # creates a z**4+1=0 fractal using the Newton-Raphson
    # root finding method

    delta = 0.000001  # convergence criteria
    res = 800         # image size
    iters = 30        # number of iterations

    # create an image to draw on, paint it black
    img = Image.new("RGB", (res, res), (0, 0, 0))

    # these are the solutions to the equation z**4+1=0 (Euler's formula)
    solutions = [cos((2*n+1)*pi/4)+1j*sin((2*n+1)*pi/4) for n in range(4)]
    colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]

    for re in range(0, res):
        for im in range(0, res):
            z = (re+1j*im)/res
            for i in range(iters):
                try:
                    z -= (z**4+1)/(4*z**3)
                except ZeroDivisionError:
                    # possibly divide by zero exception
                    continue
                if(abs(z**4+1) < delta):
                    break

            # color depth is a function of the number of iterations
            color_depth = int((iters-i)*255.0/iters)

            # find to which solution this guess converged to
            err = [abs(z-root) for root in solutions]
            distances = zip(err, range(len(colors)))

            # select the color associated with the solution
            color = [i*color_depth for i in colors[min(distances)[1]]]
            img.putpixel((re, im), tuple(color))

    img.save('../images/fractal_z4s_%03d_%03d_%03d.png' % \
        (iters, res, abs(log10(delta))), dpi=(150, 150))

We
use the Python Imaging Library (PIL) to draw the fractal. We start by creating an RGB image of size res specifying the fractal’s resolution. We then implement Newton’s method with a for loop, and an if statement to check for convergence.

While the iteration is straightforward, deciding which of the four solutions a specific guess converges to and then mapping to the right color and color depth requires some clarification. The colors list is composed of the colors red, green, blue, and yellow, each represented by a tuple of Red-Green-Blue (RGB) values:

    colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]

Variable color_depth is directly responsible for the color depth (or shade) of the displayed color. For a small number of iterations, color_depth is closer to 255, and for a greater number of iterations, this number is closer to 0, resulting in a brighter color for faster converging points (smaller number of iterations).

Once color_depth is calculated, we find the solution closest to our converging value. Since we’re using complex numbers, the value closest is the one with the minimum distance, or in complex math, the one with the smallest value of err = abs(guess-solution). To implement this, we generate a list of values corresponding to the distances using a list comprehension. Here’s an example using an arbitrary point:

    >>> z
    (0.70710678118883818+0.70710678118883818j)
    >>> err = [abs(z-root) for root in solutions]
    >>> err
    [3.23949326e-012, 1.41421356, 2.00000000, 1.41421356]

Next, we combine these values with the numbers 0–3, which represent the indices to the colors list, using the zip() function:

    >>> zip(err, range(len(colors)))
    [(3.23949326e-012, 0), (1.41421356, 1), (2.0000000, 2), (1.4142135, 3)]

We then find the minimum error by calling the function min(). To find the correct color, we index the color associated with min(distances)[1], which is the second element in the zipped tuple.
Maybe it’s easier to show interactively than explain:

    >>> distances = zip(err, range(len(colors)))
    >>> min(distances)
    (3.23949326e-012, 0)
    >>> min(distances)[1]
    0
    >>> colors[min(distances)[1]]
    (1, 0, 0)

Finally, we use a list comprehension to multiply the RGB values by the color depth, and convert the result to a tuple because the putpixel() method requires a tuple detailing the RGB values:

    color = [i*color_depth for i in colors[min(distances)[1]]]
    img.putpixel((re, im), tuple(color))

Tip: As you experiment with parameters, you may wish to save some of the outputs. These runs can take a considerable time to complete, so it’s a good idea to have different file names for the outputs as opposed to a single file name, ensuring files are not accidentally overwritten. Unlike data files, the outputs of these runs are dependent on input parameters and the code (e.g., version of the script) that generated them and are not date or time dependent. It doesn’t matter whether the run was performed last week or last year; the results should be the same. The notation I’ve used is one that details all the parameters used to create the output within the file name: '../images/fractal_z4s_%03d_%03d_%03d.png' % (iters, res, abs(log10(delta))). Names of the output files detail the inputs that generated them. An even better approach (one that in this case will somewhat disturb the pleasing output) is annotating the images with text describing the parameters used. And lastly, if you use a version control system (see Chapter 2), the version number of the script that generated the output is a very welcome addition either in the file name or in an annotation.

Figure 7-1 is a collage of outputs generated by the script with resolution=200 and values of iters ranging from 1 to 9 (top left is i=1; bottom right is i=9). We’ll touch on collages in Chapter 9. Figure 7-2 is the result of a longer run with resolution=800 and iters=30.

Figure 7-1.
Collage of Newton’s fractals with iterations from 1 (top left) to 9 (bottom right)

Figure 7-2. Newton’s fractal, max number of iterations equaling 30

Tip: The preceding example explores the region (0, 0)–(1, 1). If you wish to explore around the origin, that is, around (0, 0), change the line z = (re+1j*im)/res to z = ((re-res/2)+1j*(im-res/2))/res.

Module random

Other than mathematical functions, Python also provides a rich library for random numbers. Random numbers are important in a variety of software applications. In game programming, random numbers are used to change the behavior of elements in the game to make it more interesting or unpredictable. When writing simulations, random numbers are used to generate data that simulates the real world. Random numbers can also be used to answer probability questions, as you’ll soon see.

The random module provides random values based on a wide variety of distribution functions including uniform distribution, Gaussian distribution, and more. Module random also supports Python’s lists naturally, with random functions operating on sequences. Table 7-3 gives a partial list of some useful random functions.

Table 7-3. Functions of the random Module

Function | Description | Example/Note
--- | --- | ---
Integers | |
randint(a,b) | Returns a random integer between a and b (including a and b) | randint(0,1) returns 0 or 1 (randomly).
randrange([start,] stop[,step]) | Returns a random integer from range(start, stop, step); unlike randint(), stop is excluded and a step value is allowed | randrange(3,7,2) returns 3 or 5 (randomly).
Floating-Point Numbers | |
random() | Returns a real value between 0.0 and 1.0 (excluding 1.0) | random()
uniform(a,b) | Returns a real value between a and b (excluding b) | uniform(120,220) returns a random number between 120 and 220 (excluding 220).
gauss(mu,sigma) | Returns a Gaussian distributed value with mu as mean and sigma as standard deviation | gauss(1,2)

Module random provides a number of additional distributions: log-normal and Weibull, to name a couple. Refer to the Python Standard Library documentation for a full account.

Using random to Solve Probability Questions

The following examples use the random module to solve probability-based questions numerically.

Example: Hard Disk Head

Return to zero: Consider the following: a hard disk head is normally resting at location 0, representing the start of the disk. Files (of size zero) are evenly distributed between location 0 and 1, where 1 represents the end of the disk. The head is required to access files randomly. After each read, the head returns to location zero. The question is, what is the average distance the head moves?

The answer is not hard: a file is on average at location 0.5, and the head has to travel there and back to location 0, so on average the head moves a distance of 1.0. You can easily verify this using a simple script:

    >>> from random import random
    >>> N = 1000 # number of files the head seeks
    >>> tot_dist = 0
    >>> for i in range(N):
    ...     tot_dist += random()*2
    ...
    >>> tot_dist/N
    1.0253061728681021

The larger the value of N (and assuming a good random() implementation), the more accurate the result.

Not returning to zero: Now consider the scenario where the head does not go back to location 0, but stays where it was before. Finding the average distance the head moves is a bit harder analytically, but numerically, with a simple script, the solution emerges quickly.

    >>> from random import random
    >>> N = 1000 # number of files the head seeks
    >>> tot_dist, cur_loc = 0, 0
    >>> for i in range(N):
    ...     new_loc = random()
    ...     tot_dist += abs(cur_loc-new_loc)
    ...     cur_loc = new_loc
    ...
    >>> tot_dist/N
    0.33576448266202374

The exact value turns out to be 1/3.

Example: Friends Meeting

We turn to another example, one that makes use of a visual output as well. Two friends decide to meet between 8 p.m. and 9 p.m.
Once one of the friends arrives at the designated meeting spot, he waits for 10 minutes for his friend to show up. So if, for example, Friend 1 arrives at 8:40, he’ll wait until 8:50 for Friend 2 to show up. Friend 1 doesn’t know if Friend 2 already showed up earlier (the same is true for Friend 2; he doesn’t know if Friend 1 showed up). But both friends are smart enough to know that if they arrive at 8:55, for example, they only need wait until 9:00 and not 9:05. The question: what’s the probability that these two friends meet?

We again turn to the random module to help us solve this problem (see Listing 7-2). Only this time, we also visualize the result, hopefully gaining some insight as to how to solve the question analytically.

Listing 7-2. Friends Meeting

    from random import random
    from pylab import *

    N = 40000 # number of events

    # generate N events of friends times
    friend1, friend2 = [], []
    for i in range(N):
        friend1.append(random())
        friend2.append(random())

    # find all occurrences of friends meeting
    met = array([(x, y) for (x, y) in zip(friend1, friend2) \
        if abs(y-x) < 1.0/6])
    not_met = array([(x, y) for (x, y) in zip(friend1, friend2) \
        if abs(y-x) >= 1.0/6])

    # plot the result, this might shed some light on the problem
    plot(met[:, 0], met[:, 1], '+m')
    plot(not_met[:, 0], not_met[:, 1], 'og')
    title("Probability of meeting: %1.3f" % (float(len(met))/N))
    xlabel('Time of arrival of Friend 1')
    ylabel('Time of arrival of Friend 2')
    axis('scaled')

The first step is to generate a considerable number of events, in this case 40,000. An event is composed of two numbers: one associated with Friend 1’s time of arrival and one associated with Friend 2’s time of arrival. We store both their times in lists. The process of generating the events is performed in the first for loop. The function random() returns a value between 0 and 1, which maps out to the time of arrival: 0 is 8 p.m., 1 is 9 p.m.

Now that we have a considerable number of events, we ask at what times the friends meet.
The friends meet if the difference between their times of arrival is less than 10 minutes, or 10 minutes / 60 minutes * 1.0 = 1/6 (1.0 is the range of random values). But it’s also possible that Friend 1 arrives after Friend 2 and not the other way around. So we should be asking whether friend1 – friend2 is less than 1/6, as well as whether friend2 – friend1 is less than 1/6. This can be elegantly coded as abs(friend1-friend2)<1.0/6.

The actual implementation makes use of a list comprehension, returning a list of (x, y) tuples that match the condition abs(x-y)<1.0/6, which means the friends have met. We then construct an array of these values (a NumPy array, more on this shortly) so we can easily access the x and y vectors, without any for loops. We also build a list of times the friends did not meet because we want to plot both, in different colors and markers.

Next we plot the results and calculate the probability of the friends meeting, numerically, as shown in Figure 7-3.

This visualization really helps. The corridor in the middle describes the events corresponding to the two friends meeting. The probability is the area of this corridor and can be calculated by the area of the entire square minus the area of the top-left triangle and the bottom-right triangle. Each triangle has an area of $0.5 \times (5/6)^2$, and the total probability of meeting is $1 - (5/6)^2 = 11/36 = 0.3055\ldots$, which is pretty close to the estimated numerical value (displayed in the figure title).

Figure 7-3. Visualizing friends meeting

Random Sequences

Another set of functions available under the random module operates on sequences. These include the functions listed in Table 7-4.

Table 7-4.
Functions from the random Module for Operating on Sequences

Function | Description
--- | ---
choice(s) | Returns a random element from the sequence s
shuffle(s) | Shuffles the sequence s
sample(s,n) | Returns a subsequence of size n from s

For the examples in this section, we create a deck-of-cards sequence using the zip() built-in function. Each card is represented as a tuple holding a number 1–13 and a character, 'S', 'H', 'D', 'C', corresponding to spades, hearts, diamonds, and clubs.

    >>> from random import *
    >>> cards = zip(range(1, 14)*4, 'S'*13+'H'*13+'D'*13+'C'*13)
    >>> cards[0:5]
    [(1, 'S'), (2, 'S'), (3, 'S'), (4, 'S'), (5, 'S')]
    >>> choice(cards)
    (7, 'C')
    >>> shuffle(cards)
    >>> cards[:5]
    [(4, 'C'), (10, 'D'), (1, 'C'), (7, 'C'), (12, 'D')]
    >>> sample(cards, 5)
    [(12, 'S'), (12, 'H'), (8, 'C'), (8, 'D'), (5, 'S')]

A DECK OF CARDS

There are lots of ways to implement a deck of cards, and the method described here is a bit tricky. The reason I chose it is that it shows another way of creating a deck of cards other than a double for loop (see Beginning Python: From Novice to Professional for an implementation using a double for loop in a list comprehension). There are benefits to using for loops: they’re straightforward to implement and read, and in this specific case, we can use full names for the sign of the card (e.g., 'Spade' instead of 'S').

If I were to use NumPy’s ndarray object (discussed in the next section), I’d opt to use the line zip(arange(52)/4+1, 'SHDC'*13), but this is tricky too because the division by 4 might yield noninteger values in future versions. Maybe a more prudent approach would be to add a floor() function call.

In any case, what should concern you more is the readability of your code. Don’t forget that there’s a good chance you’ll be the person maintaining it as well. If you’re more comfortable with a double for loop, use the for loop approach. If you’re more comfortable zipping flat arrays, the options shown here are viable approaches.
It’s a matter of personal preference, as performance is hardly an issue.

This brings up another point: performance. Opt for readability over performance if possible. After all, Python is a high-level programming language: if you really need code performance, other programming languages might prove a better choice. Even better, you can extend Python with other programming languages.

Module NumPy

NumPy’s ndarray object has been the basic building block for a lot of the data processing and visualization scripts presented throughout the book. We now turn to exploring this package and discussing its usage.

Note: Although used in previous chapters, we have not explicitly seen calls to import NumPy. Nevertheless, we did use NumPy’s ndarray object extensively. The reason we have not seen NumPy imports is that we have been using the from pylab import * command instead, which imports, among other packages, the NumPy package as well.

The ndarray object provides substantial added functionality to Python’s array object and has a lot in common with Matlab’s matrix data structure. Such functionality includes matrix operations, linear algebra, and more. It also provides the basic building blocks for more complex numerical methods as will be explored in future chapters. The name “ndarray” stands for N-dimensional array, and as it implies, this object supports N-dimensional arrays.

NumPy is a full and rich package. I will only cover topics that are important for the ideas discussed in the book, and as such, this chapter should be considered a quick introduction. If you’d like to learn more about NumPy, consult the references at the end of the chapter.

Note: I’ll use the terms “array” and “ndarray” interchangeably. In both cases I am referring to NumPy’s ndarray object; there’s little use for Python’s array object once NumPy is imported.

Array Creation

Chapter 3 covered Python’s built-in data structures including tuples, lists, and dictionaries.
If you recall, there were several methods to create most of these data structures: we’ve used brackets for lists as well as the list() function, we’ve used curly braces for dictionaries as well as the dict() function, and so on. Unfortunately, there’s no specific symbol set aside for NumPy arrays, so the options are to use either the array() function or functions that return an array, the array creation functions.

The most straightforward method to create and initialize an array is from a list:

    >>> from numpy import *
    >>> v = array([1, 2])
    >>> v
    array([1, 2])
    >>> m = array([[1, 0], [0, 4]])
    >>> m
    array([[1, 0],
           [0, 4]])

Other methods to create arrays are available, ones that are more useful when dealing with larger amounts of data points, as described in Table 7-5.

Table 7-5. Array Creation Functions

Function | Description | Example
--- | --- | ---
N-Dimensional Arrays | |
array(s) | Creates an array based on the sequence s. | array(((1,2),(3,4))) returns array([[1,2],[3,4]]).
ones(t) | Creates an N-dimensional array initialized with 1s based on the tuple t. | ones(2) returns array([1., 1.]).
zeros(t) | Similar to ones(t), only initialized with zeros. | zeros((2,2)) returns array([[0.,0.],[0.,0.]]).
Two-Dimensional Arrays (Matrices) | |
eye(n[,m]) | Creates a 2-D array of size $n \times m$, the major diagonal filled with ones and the remaining matrix zeros. If m is not provided, it is assumed equal to n. | eye(2,3) returns array([[1.,0.,0.],[0.,1.,0.]]).
One-Dimensional Arrays (Vectors) | |
arange([start,] stop[,step]) | Creates an array of values starting at start, ending at (but excluding) stop with an increment step. This is similar to array(range(start, stop, step)), only that arange() can return noninteger values as well. | arange(1,2,.5) returns array([1., 1.5]).
linspace(start, stop, num=50) | Creates a linearly spaced vector of size num from start to stop; refer to the interactive help for additional options. | linspace(1,10,3) returns array([1., 5.5, 10.]).
logspace(start, stop, num=50) | Similar to linspace, only values are spaced evenly from $10^{start}$ to $10^{stop}$ on a logarithmic scale; refer to the online help for additional options. | logspace(0,1,3) returns array([1., 3.16227766, 10.]).

Some additional array creation functions (fromfile(), empty()) exist, but in most cases, you’ll find the ones in Table 7-5 sufficient. There’s some redundancy in those as well: ones(10) results in the same array as zeros(10)+1.

Slicing, Indexing, and Reshaping

Arrays can be resized using the reshape() and resize() functions and indexed and sliced using Python’s slicing and indexing operators, [] and [:]. The difference between the two functions is that resize() resizes an existing array, whereas reshape() returns a new array based upon the data in the original array.

    >>> a = arange(12).reshape(4, 3)
    >>> a
    array([[ 0,  1,  2],
           [ 3,  4,  5],
           [ 6,  7,  8],
           [ 9, 10, 11]])
    >>> a[1]
    array([3, 4, 5])
    >>> a[-1]
    array([ 9, 10, 11])
    >>> a[1, 1]
    4
    >>> a[:, 1]
    array([ 1,  4,  7, 10])
    >>> a[1, :2]
    array([3, 4])

N-Dimensional Arrays

NumPy arrays are N-dimensional arrays and can be created in the same manner as 1-D and 2-D arrays:

    >>> ones((2, 3, 4))
    array([[[ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.]],
           [[ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.]]])

A useful operator for N-dimensional arrays is ..., which means, “all the remaining dimensions.”

    >>> a = ones((2, 3, 4))
    >>> a
    array([[[ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.]],
           [[ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.]]])
    >>> a[0, ...]
    array([[ 1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.]])
    >>> a[0, 1, ...]
    array([ 1.,  1.,  1.,  1.])

One of the common questions is how useful N-dimensional arrays are. Some people feel that they do pretty well with one or two dimensions and have little use for N-dimensions. My experience with N-dimensional arrays is that they provide excellent data storage when dealing with a combination of several parameters. Consider a simulation with four parameters, each parameter having a range of values.
Suppose you want to map out the simulation, that is, calculate the results for every given combination of parameters and also store the results, because the running time is long. How would you store the results? One method is to write them to a list, flattening the data. An alternative method is using an N-dimensional array.

Example: Comparing Mortgages

The following example discusses how to store data as a function of several parameters (typically more than two) using both N-dimensional arrays and flat data structures.

Since at the time of writing this book the subprime mortgage crisis has hit the world markets, I thought it appropriate to give an example comparing mortgages. I’m by no means financially savvy, so please don’t use this as advice in selecting a mortgage!

Now to define the problem. Fixed mortgage payments are a function of three parameters: the loan amount (which is also called the present value), the interest rate, and the number of payments. Banks typically have different interest rates as a function of the number of payments, a person’s record, and possibly also the loan value.

So based on these three parameters (present value, interest rate, and number of payments), we’d like to map out monthly payments, that is, what the expected monthly payment is for every value in the range of parameters.

For this example, we’ll assume that we’re considering loans in the amounts of $100,000 to $140,000 in increments of $20,000, mortgage interest rates range from 3 percent to 5 percent in increments of 0.5 percent, and number of payments is 60 to 300 in increments of 60 (representing 5 to 25 years in increments of 5 years).

We’ll use the function pmt(), which is part of the NumPy package. The function returns a fixed monthly payment for a fixed-rate mortgage (see http://en.wikipedia.org/wiki/Fixed_rate_mortgage).

We construct lists representing the range of values we’d like to map out.
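For readers who want to see what a fixed-payment calculation involves, here is a minimal sketch of the standard annuity formula, $P \cdot r / (1 - (1 + r)^{-n})$, for a loan of $P$ repaid in $n$ equal payments at a per-period rate $r$. The function name monthly_payment is a hypothetical stand-in and not a NumPy API; it assumes payments are made at the end of each period and that nothing is owed after the last one. It can also serve as a substitute where pmt() is unavailable, since recent NumPy releases no longer ship that function.

```python
def monthly_payment(rate, nper, pv):
    """Fixed payment for a loan of pv over nper periods at periodic rate `rate`.

    Hypothetical helper: end-of-period payments, zero balance at the end.
    """
    if rate == 0:
        # no interest: the principal is simply split evenly
        return pv / nper
    return pv * rate / (1 - (1 + rate) ** -nper)

# a $120,000 loan, 120 monthly payments, 3 percent annual interest
payment = monthly_payment(0.03 / 12, 120, 120000)
print(round(payment, 2))
```

The annual rate is divided by 12 because the formula expects the rate per payment period, which matches how the interest rates are scaled in the listing that follows.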
We implement these lists using the arange() function described previously in this chapter.

THE CONVENIENCE OF USING ARANGE() AND LINSPACE()

Here’s another example of why NumPy provides convenience over non-math-oriented data structures. To implement a list of values with noninteger increments, we can use a list comprehension. For example:

    >>> interests = [x/2.0+3 for x in range(5)]

While this is perfectly OK, it’s less readable than something like this:

    >>> interests = arange(3.0, 5.5, 0.5)

In the former method (using a list comprehension), you’d have to do some math to realize exactly what values are being used. In the second method, they’re clearly spelled out: from 3 to 5.5 (excluding 5.5) in increments of 0.5. I’m assuming the decision not to include the edge value (i.e., 5.5) in the function arange() is to have it behave in a similar manner to range() and xrange(). My personal preference would’ve been to include the edge value.

Alternatively, you could use the linspace() function:

    >>> interests = linspace(3.0, 5.0, 5)

which in this specific example is awkward: the number 5 (the last argument) has to be precalculated to reach an increment of 0.5.

A final note: Those values are annual values, and to use them properly you’d have to divide them by 12 (months) and by 100 (percentage values). Regardless, this is required in both a list comprehension implementation and an arange() implementation. I’ve left it out so that the example would be clearer to follow. The ability to multiply (or divide) an array by a value will be shown in the next section.

Next we iterate over the range of loans, the number of payments, and the interest rates and construct a data structure to hold the results: monthly payments. We examine two data structures:

- A flattened list of rows, with each row being a list containing loan size, number of payments, interest rate, and monthly payment. This is a native Python list.
- An N-dimensional array where each dimension corresponds to a different parameter: interest rate, number of payments, and loan size. This is a 3-D NumPy array.

Listing 7-3 compares these two structures.

Listing 7-3. Flattening Data vs. N-Dimensional Data

    from numpy import *

    loans = arange(100000, 160000, 20000)
    num_payments = arange(5, 30, 5)*12
    interests = arange(3, 5.5, 0.5)/100.0/12.0

    # method 1, storing results in a list
    res1 = []

    # method 2, storing results in an array
    res2 = zeros([len(loans), len(num_payments), len(interests)])

    for i, loan in enumerate(loans):
        for j, num_pay in enumerate(num_payments):
            for k, interest in enumerate(interests):
                res1.append([loan, num_pay, interest, \
                    -pmt(interest, num_pay, loan)])
                res2[i][j][k] = -pmt(interest, num_pay, loan)

The benefit of using an N-dimensional array is that indexing is a lot easier and faster. For example, assuming param1 is fixed and set at 0, the results can be accessed with res2[0, ...]. Achieving the same in a list will probably require iterating over the entire list res1 and comparing the value of the first parameter. There’s overhead both in code and in actual performance:

    >>> for row in res1:
    ...     if (row[0]==120000 and row[1]==120):
    ...         print row
    ...
    [120000, 120, 0.0025000000000000001, 1158.7289363806954]
    [120000, 120, 0.0029166666666666668, 1186.6304095428363]
    [120000, 120, 0.0033333333333333335, 1214.941657978555]
    [120000, 120, 0.0037499999999999999, 1243.6609050842044]
    [120000, 120, 0.0041666666666666666, 1272.7861828689065]
    >>> res2[1, 1, ...]
    array([ 1158.72893638,  1186.63040954,  1214.94165798,  1243.66090508,
            1272.78618287])

However, the results of the list are much more readable: they list all combinations of parameters in human-readable form. You could do the same with res2
but that requires a for loop:

>>> values = res2[1,1,:]
>>> for i, v in enumerate(values):
...     row = [loans[1], num_payments[1], interests[i], v]
...     print row
...
[120000, 120, 0.0025000000000000001, 1158.7289363806954]
[120000, 120, 0.0029166666666666668, 1186.6304095428363]
[120000, 120, 0.0033333333333333335, 1214.9416579785555]
[120000, 120, 0.0037499999999999999, 1243.6609050842044]
[120000, 120, 0.0041666666666666666, 1272.7861828689065]

In the for loops, I’ve used enumerate() on the list of values we’re iterating over. The reason for this is that NumPy arrays require indices, and those are integers, whereas Python lists do not. So in a sense, lists here could be more elegant code-wise (no need to use enumerate()).

Lastly, the list implementation can lend itself very nicely to storage in a CSV file, which in itself is also a flattened data structure. That being said, you could also flatten the array and do the same.

Although N-dimensional arrays are interesting data structures, most examples in this book are based on 1-D arrays (vectors) and 2-D arrays (matrices), as they cover most anything we do. Even 3-D plots are really represented by 2-D matrices: the indices represent x and y, and the cell value represents z. Choosing either N-dimensional arrays or flattened data structures is dependent on the exact problem you’re trying to solve.

Math Functions

Simple arithmetic operations are possible on arrays: addition, subtraction, division, multiplication, and exponentiation as well as most math functions available in math and cmath modules (albeit now they’re implemented as part of NumPy).

Example: Visualizing Fourier Expansion of a Rectangular Wave

The following is an example showing a Fourier expansion of a rectangular wave using a sum of sine waves. Fourier expansion is used in numerous applications ranging from solving differential equations to signal processing.
This example will show how we could treat a NumPy array as a vector of values and operate on that vector as if it were a function. We use a Fourier expansion of sine waves (NumPy arrays) to generate a rectangular wave (another NumPy array).

We implement the equation $f(t) = \sum_{n=1,3,5,\ldots} \frac{4}{\pi n} \sin(2 \pi n t \cdot \mathrm{num\_cycles})$, which is a Fourier series expansion of a rectangular wave (see Listing 7-4). The parameter num_cycles determines the number of cycles we’re expanding. In this example we’ll set the number to 2 to view two cycles.

Listing 7-4. Visualizing a Fourier Expansion

# plots a Fourier expansion of a rectangular wave
from pylab import *

# prepare the plot
figure()
# hold() was removed in matplotlib 3.0, where overplotting is the default
hold(True)

# number of points to display the wave
N = 2**8
t = linspace(0, 1, N)
y = zeros(N)
for n in range(1, 8, 2):
    # the sine waves, added
    y += 4/(pi*n)*sin(2*pi*n*t*2)
    # plot the graph
    plot(t, y)

# annotate the graph
axis([0, 1, -1.4, 1.4])
grid()
xlabel('Time [seconds]')
ylabel('Value []')
title('Fourier expansion of a rectangular wave')
legend()

We import the entire PyLab module, which also includes NumPy and the plotting commands: both are required in this example. We then prepare an empty plot: each new calculation of the expansion will be plotted on top of the previous one, so we issue the command hold(True) to ensure subsequent plots do not erase existing ones.

The first array object is created with the command t = linspace(0, 1, N) (we could’ve also used an arange() function call instead). Array object t is a 1-D array, a vector. All our subsequent operations and math functions will operate on this vector. We then initialize the series expansion variable, y, using the zeros() function.

The heart of the computation lies in the for loop. Each sine wave is added to the previous one, and the result is stored in y. The simple line y += 4/(pi*n)*sin(2*pi*n*t*2) is in reality operating on entire arrays, showing the strength of the array object.
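The whole-array update at the heart of the loop can be tried without any plotting. The following is a minimal sketch written for current Python 3 and NumPy; the function name square_wave_partial_sum and its num_terms parameter are illustrative choices, not part of the listing above.

```python
import numpy as np

def square_wave_partial_sum(t, num_terms, num_cycles=2):
    """Sum the first num_terms odd harmonics of a rectangular wave."""
    y = np.zeros_like(t, dtype=float)
    for n in range(1, 2 * num_terms, 2):   # n = 1, 3, 5, ...
        # a single statement updates every sample of y at once
        y += 4.0 / (np.pi * n) * np.sin(2 * np.pi * n * t * num_cycles)
    return y

t = np.linspace(0, 1, 2**8)
y4 = square_wave_partial_sum(t, 4)     # harmonics 1, 3, 5, 7
y50 = square_wave_partial_sum(t, 50)   # more harmonics, closer to a rectangle
```

Adding harmonics pulls the flat parts of the wave toward +1 and -1, which is why the curves drawn on top of each other look progressively more rectangular.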
We then plot y as it is being calculated and annotate the graph once the expansion is complete, as shown in Figure 7-4.

Figure 7-4. Fourier expansion of a rectangular wave

Array Methods and Properties

Arrays are objects and as such have functions called methods and variables called properties. Using IPython (see Chapter 2), you can list an object’s methods and properties by using character completion, accessible via the Tab key. Alternatively, you can issue the following:

>>> import numpy
>>> [m for m in dir(numpy.ndarray) if not (m.startswith('__'))]
['T', 'all', 'any', 'argmax', 'argmin', 'argsort', 'astype', 'base', 'byteswap',
'choose', 'clip', 'compress', 'conj', 'conjugate', 'copy', 'ctypes', 'cumprod',
'cumsum', 'data', 'diagonal', 'dtype', 'dump', 'dumps', 'fill', 'flags', 'flat',
'flatten', 'getfield', 'imag', 'item', 'itemset', 'itemsize', 'max', 'mean',
'min', 'nbytes', 'ndim', 'newbyteorder', 'nonzero', 'prod', 'ptp', 'put', 'ravel',
'real', 'repeat', 'reshape', 'resize', 'round', 'searchsorted', 'setfield',
'setflags', 'shape', 'size', 'sort', 'squeeze', 'std', 'strides', 'sum', 'swapaxes',
'take', 'tofile', 'tolist', 'tostring', 'trace', 'transpose', 'var', 'view']

I’ve used the preceding list to create Table 7-6; it’s only a subset of the methods and attributes, and I chose to describe those I feel are the most useful for data processing and visualization. I’ve also split the methods into categories for easier viewing. Methods are denoted with (), while properties do not have trailing parentheses. In this table, a refers to an array variable.

Table 7-6. Array Methods and Attributes (Partial)

Function | Description | Examples

Logical
all() | True if all elements of a are true (nonzero). | arange(10).all() returns False (the first element is zero). arange(-5,-2).all() returns True.
any() | True if at least one element of a is true (nonzero). | arange(10).any() returns True.
nonzero() | A tuple of indices to nonzero elements of a. | arange(3).nonzero() returns (array([1,2]),).

Indexing
sort() | Sorts elements in a. | a=arange(3,0,-1) sets a to array([3,2,1]). a.sort() changes a to array([1,2,3]).
searchsorted(x) | Returns indices to insert x such that the array’s order is preserved. Assumes a is already sorted. | arange(4).searchsorted(1.5) returns 2.

Modifying
clip(min,max) | If an element of a is less than min, returns min; if an element of a is greater than max, returns max; otherwise, returns the element. | arange(5).clip(1,3) returns array([1,1,2,3,3]).
compress(cond) | Returns an array whose elements match the condition specified in cond; equivalent to a[cond]. | a=arange(10); a.compress(a>5) returns array([6,7,8,9]). a[a>5] returns array([6,7,8,9]).
fill(x) | Sets all values of an array to x; equivalent to a[:]=x. | a=zeros([2,2]); a[:]=-1 sets a to array([[-1.,-1.],[-1.,-1.]]). a.fill(-2) sets a to array([[-2.,-2.],[-2.,-2.]]).

Math
For math examples, assume a=array([1,1j,-1]), which can also be expressed as a=exp(1j*pi*arange(3)/2).
cumprod() | Cumulative product. Each element is the product of all elements up to and including that position. | a.cumprod() returns array([1.+0.j, 0.+1.j, 0.-1.j]).
cumsum() | Cumulative sum. Each element is the sum of all elements up to and including that position. | a.cumsum() returns array([1.+0.j, 1.+1.j, 0.+1.j]).
real and imag | Real and imaginary values of elements in a. | a.imag returns array([0.,1.,0.]). a.real returns array([1.,0.,-1.]).
conj() | Complex conjugate of a (negation of the imaginary part of each element). | a[1].conj() returns -1j.
max(), min() | Maximum and minimum values of a (performed on real part only). | a.max() returns (1+0j). a.min() returns (-1+0j).
mean() | Mean value of a. | a.mean() returns 0.33333333333333331j (1j/3).
prod() | Product of all the values in a. | a.prod() returns -1j. Note that a.prod() is equal to a.cumprod()[-1].
ptp() | Peak-to-peak value of a; equivalent to a.max()-a.min(). | a.ptp() returns (2+0j).
round() | Rounded values of a. | exp(a).round() returns array([3.+0.j, 1.+1.j, 0.+0.j]).
std() | Standard deviation of elements in a. | arange(2).std() returns 0.5.
sum() | Sum of all the values in a. | a.sum() returns 1j. Note that a.sum() is equal to a.cumsum()[-1].
trace([n]) | Sum of the diagonal of a 2-D array. If n is provided, sums the offset diagonal. | eye(N).trace() returns N.
var() | Variance of elements in a. | arange(2).var() returns 0.25.

Shape Related
flatten() | The values in a as a 1-D array. | eye(2).flatten() returns array([1.,0.,0.,1.]).
ndim | Number of dimensions of a. | eye(4,5).ndim returns 2.
repeat(n) | Repeats each element of a n times, flattened. | eye(2).repeat(2) returns array([1.,1.,0.,0.,0.,0.,1.,1.]).
reshape(d1,d2,...) | Generates a new array of size (d1, d2, ...). | arange(4).reshape(2,2) returns array([[0,1],[2,3]]).
resize(d1,d2,...) | Resizes the current array to size (d1, d2, ...). | a=arange(4); a.resize(2,2) sets a to array([[0,1],[2,3]]).
shape | A tuple representing the shape of a. | eye(3,4).shape returns (3,4).
transpose() | Transposes a matrix (swaps its rows and columns); the element values are unchanged. | eye(2,3).transpose() returns array([[1.,0.],[0.,1.],[0.,0.]]).

Conversion
tofile(fname) | Writes an array to file (binary). | eye(2).tofile('eye2.bin')
fromfile(fid) | Reads an array from file (binary). | a=fromfile(open('eye2.bin'))
tolist() | Converts an array to a list. | eye(2).tolist() returns [[1.0,0.0],[0.0,1.0]].

Example: A Magic Square

A magic square is a square of numbers in which every row and every column adds up to the same sum. Typically, magic squares do not allow numbers to repeat. In this example, we’ll generate magic squares, populating the values from 1 to N² in a square of size N by N.

A modern variation on the magic square idea is the Sudoku puzzle game.
The ideas presented in this example can be modified to provide solutions to Sudoku puzzles (see http://en.wikipedia.org/wiki/Sudoku for possible strategies for implementing a computer solution).

Back to our example. We’ll create a magic square implementing the De la Loubère method (also known as the Siamese method), which works for squares of odd values of N only. Constructing a magic square is performed by placing the first value, 1, in the middle column at the top. Incremented values are placed diagonally up and to the right. If the spot up and to the right is outside the square, it is wrapped around to the bottom row (if exceeded at the top) or to the first column (if exceeded to the right) or both. If a cell is already occupied, the value is placed one row below the previously filled cell instead (again, wrapping if needed). Figure 7-5 illustrates the algorithm with example magic squares of sizes 3 and 5.

Figure 7-5. De la Loubère method

An implementation of the algorithm using an array is presented in Listing 7-5.

Listing 7-5. Creating a Magic Square

from numpy import *

def magicsq(n=3):
    """Returns a magic square of size n; n must be odd"""
    if n%2 != 1:
        raise ValueError("Magic(n) requires n to be odd")
    m, row, col = zeros([n, n]), 0, n/2
    for num in xrange(1, n**2+1):
        m[row, col] = num    # fill the cell
        col = (col+1)%n
        row = (row-1)%n
        if m[row, col]:
            col = (col-1)%n
            row = (row+2)%n
    return m

def testmagicsq(m):
    """Returns True if m is a magic square."""
    msum = sum(m[0,:])
    return all(m.sum(0)==msum) and all(m.sum(1)==msum)

The main for loop is quite straightforward and follows the algorithm strictly. However, calculation of the column and row values using the modulo (%) operator is tricky and requires some explanation. Consider the way the algorithm is specified: increment the column value and check whether the new value is within the size of the matrix. If it is not, wrap it around to the beginning. A similar approach is taken with the row: decrement and wrap if required.
Instead of implementing these two steps, an increment/decrement followed by an if statement, we could use the modulo operation, which captures the idea quite elegantly: col = (col+1)%n.

I’ve chosen to initialize the variables m, row, and col with a multiple assignment. Multiple assignments can also be used inside the for loop: col, row = (col+1)%n, (row-1)%n; however, in my mind it’s less clear, and there’s no impact performance-wise. My personal preference is to use multiple assignments in initializations and not calculations.

I’ve defined another function here, testmagicsq(), which checks whether a square is indeed a magic square. The function also works on squares of even size (which is a plus) and makes use of the sum() member function of the array object.

Tip: Python supports testing via several built-in packages, including doctest and unittest. However, for the purpose of this example, I’ve chosen to write a dedicated test function, which will further show the properties of NumPy arrays.

The function sum(0) returns an array of the sums of columns (i.e., along axis 0); sum(1) returns an array summing rows (along axis 1). Here’s a listing demonstrating summing along the 0 axis and the 1 axis:

>>> a = eye(2,3)
>>> a
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
>>> a.sum(0)
array([ 1.,  1.,  0.])
>>> a.sum(1)
array([ 1.,  1.])
>>> m = magicsq(5)
>>> m.sum(0)
array([ 65.,  65.,  65.,  65.,  65.])
>>> m.sum(1)
array([ 65.,  65.,  65.,  65.,  65.])

As can be seen, the matrix eye(2,3) has two rows and three columns. Summing along axis 0 via sum(0) returns a 1-D array (a vector) holding the sums of all three columns. Similarly, sum(1) returns a vector holding the sums of the rows. The next lines show how this can be used to check for “magic-ness” of a square: both vectors, sum(0) and sum(1), should be equal element-wise to the sum of any single row or column. In the testmagicsq() function I’ve chosen to compare both sums of columns and of rows with the sum of the first row: sum(m[0,:]).
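To see the wrap-around and the row and column checks in isolation, here is a small sketch of the same method using plain nested lists on current Python 3. The names magic_square and is_magic are illustrative; the steps follow the description of the Siamese method given above rather than reproducing Listing 7-5 line for line.

```python
def magic_square(n=3):
    """Build an n-by-n magic square with the Siamese method; n must be odd."""
    if n % 2 != 1:
        raise ValueError("n must be odd")
    m = [[0] * n for _ in range(n)]
    row, col = 0, n // 2                 # middle column of the top row
    for num in range(1, n * n + 1):
        m[row][col] = num
        # candidate cell: up one row, right one column, wrapping with modulo
        new_row, new_col = (row - 1) % n, (col + 1) % n
        if m[new_row][new_col]:          # occupied: go one row below instead
            new_row, new_col = (row + 1) % n, col
        row, col = new_row, new_col
    return m

def is_magic(m):
    """True if every row and every column has the same sum."""
    target = sum(m[0])
    rows_ok = all(sum(r) == target for r in m)
    cols_ok = all(sum(c) == target for c in zip(*m))
    return rows_ok and cols_ok

square = magic_square(3)   # [[8, 1, 6], [3, 5, 7], [4, 9, 2]]
```

Here zip(*m) plays the role of summing along axis 0: it regroups the nested list by columns so that the same sum() call can be reused.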
If you compare a vector (1-D array) with a scalar (a single value), the result is a vector with each element compared with the scalar. To ensure all are indeed equal to the required sum, you could use the all() member function. I’ve opted to use the notation all(m.sum(0)==msum) over (m.sum(0)==msum).all() because I think it’s more readable, but that again is personal preference; both do the job.

Note: In the function testmagicsq(), it’s not enough to check that m.sum(0) is equal to m.sum(1) because this only checks that the sums of rows are equal to the sums of columns. However, that’s not a sufficient condition. Consider the array m=array([[1,0],[0,2]]): it satisfies the condition m.sum(0)==m.sum(1), but it’s not a magic square.

You might raise the question whether the array eye(N) is a magic square: the function testmagicsq() will return True, but maybe this is a trivial case of a magic square.

One other interesting aspect of the Siamese method is that the sum along the diagonal is also identical to the sum of each row and each column; that’s true for both diagonals. The function trace() calculates the sum along the diagonal (top left to bottom right). To calculate the sum of the second diagonal (bottom left to top right), you could use the fliplr() function.

>>> m = magicsq(5)
>>> m.trace()
65.0
>>> fliplr(m).trace()
65.0

Other Useful Array Functions

Other than ndarray properties and methods, the NumPy package also provides functions that operate on arrays but are not part of the ndarray object class.
For a full account, issue the following:

>>> import numpy
>>> dir(numpy)

As you can see, many functions are available from various fields of interest:

• Vector operations: convolve(), cross(), correlate(), and vdot()
• Matrix operations: diag() and trace()
• Statistical functions: cov(), var(), std(), mean(), and histogram()
• Financial functions: fv(), pv(), and pmt()
• Polynomial operations: polyadd(), polymul(), polydiv(), polyfit(), polyder(), polyint(), and roots()
• Operations that change vector and matrix sizes and orientations: flipud(), fliplr(), and rot90()
• Functions that generate windows for filtering: hamming(), hanning(), bartlett(), blackman(), and kaiser()

We’ll explore some of these functions in Chapter 8. If you’d like to know more about these functions, issue help(numpy.function). For example, here’s a function I particularly like using:

>>> help(numpy.diff)
Help on function diff in module numpy.lib.function_base:

diff(a, n=1, axis=-1)
    Calculate the nth order discrete difference along given axis.

I use diff() to calculate the difference between two consecutive elements in an array. I’ve used it several times already in the book, including in the section “Example: Adding Arrows to a Graph” in Chapter 6. You could also modify the friends meeting example in this chapter to use diff() instead of a list comprehension.

Final Notes and References

The range of applications for which NumPy is of value is large. As evidence, you’ll find that a considerable number of packages rely on NumPy, and for a good reason: NumPy provides a solid base for mathematical arrays.

An interesting module that comes with the Python Standard Library is the decimal module. This module provides support for decimal floating-point values and allows, for example, arbitrary precision. The decimal module is a bit less intuitive than regular numbers in Python, but should you require higher precision, and provided you’re willing to accept some performance loss, this module is a good option.
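As a brief illustration of what the decimal module offers, the following sketch uses only the standard library. The variable names are illustrative, and the precision of 50 digits is an arbitrary choice made for the example.

```python
from decimal import Decimal, getcontext

# binary floating point cannot represent 0.1 or 0.2 exactly
float_sum = 0.1 + 0.2
# decimal values built from strings are stored exactly
decimal_sum = Decimal("0.1") + Decimal("0.2")

# raise the working precision to 50 significant digits
getcontext().prec = 50
one_seventh = Decimal(1) / Decimal(7)

print(float_sum == 0.3)                 # False
print(decimal_sum == Decimal("0.3"))    # True
print(one_seventh)
```

Note that the values are created from strings or integers; passing a float such as 0.1 to Decimal() would carry the binary rounding error along with it.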
Another module, introduced with Python version 2.6, is the fractions module, which supports rational number arithmetic.

Should you require additional information on NumPy or the other topics discussed in this chapter, I hope you find the following references of value:

• NumPy home page, http://numpy.scipy.org/
• Guide to NumPy by Travis E. Oliphant, the lead developer of NumPy, http://www.tramy.us/numpybook.pdf
• “De la Loubère Method,” Wikipedia, http://en.wikipedia.org/wiki/Siamese_method
• “Fourier Series,” Wikipedia, http://en.wikipedia.org/wiki/Fourier_series
• “Newton Fractals,” Wikipedia, http://en.wikipedia.org/wiki/Newton%27s_method
• The Python Standard Library, http://docs.python.org/library/index.html
• Decimal module, http://docs.python.org/library/decimal.html
• Fractions module, http://docs.python.org/library/fractions.html

CHAPTER 8

Science and Visualization

Numerical Analysis and Signal Processing

I’ve covered a great deal of the topics associated with data analysis and visualization: reading and writing files, text processing and converting text to numerical data, plotting and graphing, writing scripts, and implementing algorithms. It’s time to take a deeper dive and analyze numerical data.

This chapter deals with two important topics: numerical analysis and signal processing. These two topics appear in many sciences: mathematics, computing, engineering, and more. From a simplistic point of view, numerical analysis is concerned with algorithms that yield numerical values: a solution to a nonlinear equation, the decimal representation of π, and more. Signal processing deals with processing signals, that is, values that change over time. Signal processing includes such topics as detection and filtering.

Most universities and colleges offer undergraduate courses that teach these topics. But you don’t have to be an engineer or a computer scientist to use the methods and ideas discussed in the chapter.
Most of the topics are easy to follow, as I’ve tried to keep the math to a minimum. If you have a strong numerical analysis and signal processing background, this chapter should prove a good starting point for these topics in Python.

If you’re new to the ideas of numerical analysis and signal processing, I hope to shed some light so that you can pick it up from here with relevant scientific literature. In particular, I’d like to point out one of the books that made a great deal of impact on me (and many others), Numerical Recipes: The Art of Scientific Computing, Third Edition by William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery (Cambridge University Press, 2007; for more information, see http://www.nr.com). Although the book implements algorithms using C/C++ (my original copy was in the Pascal programming language), it provides a wealth of information on numerical algorithms and should prove easy enough to port to Python.

In my view, the field of numerical analysis is a cookbook of algorithms to numerically solve mathematical problems. And so in a sense, that’s how the chapter is organized as well: as a list of problems and solutions. Each topic will be explored with examples in hopes that you’ll modify the examples to fit your needs. And that’s also how I suggest you refer to the chapter: as a cookbook of algorithms. While it’s quite possible to read through and learn the algorithms one at a time, it’s probably easier to read specific sections as you engage problems associated with them in real life. So my suggestion is this: skim through the table of contents to acquaint yourself with what’s available, and then try to solve a specific problem by reading the relevant section.

In this chapter, I’ve used SciPy, matplotlib, and NumPy extensively. These three packages are rich and complex, and as a result, I was only able to cover some of the functionality, not all of it.
I therefore chose to cover topics and show examples of problems I personally encountered. I hope you’ll find the examples of value.

Finding Your Way: Variables and Functions

The NumPy package provides us with two useful helper functions. I call them helper functions because they don’t fall into any specific numerical analysis or signal processing category.

When one works in an interactive environment, one constantly defines variables. It’s hard to remember what variables are defined and what they mean. The function who() prints a list of all ndarray variables (NumPy arrays):

>>> who()
Upper bound on total bytes = 0
>>> up, down = arange(10), arange(10, 0, -1)
>>> who()
Name            Shape            Bytes            Type
===========================================================
down            10               40               int32
up              10               40               int32

Upper bound on total bytes = 80

The function lookfor() is great for searching inside docstrings. So to look for functions that perform numerical integration, issue

>>> lookfor('integrate')
Search results for 'integrate'
------------------------------
numpy.trapz
    Integrate y(x) using samples along the given axis and the composite

SciPy

SciPy (http://www.scipy.org/) is an open source scientific library for Python. The idea of SciPy is similar to that of Octave-Forge (http://octave.sourceforge.net/), which provides extra packages for GNU-Octave (http://www.octave.org) and toolboxes that enhance MATLAB (http://www.mathworks.com). SciPy is built on top of NumPy and so requires NumPy to work properly.

SciPy is organized into several modules, some of which are detailed in Table 8-1.

Table 8-1. SciPy Packages

Package | Description
Fftpack | Fast Fourier Transform
Integrate | Integration functions, including ordinary differential equations
Interpolate | Interpolation of functions
Linalg | Linear algebra
Optimize | Optimization functions, including root-solving algorithms
Signal | Signal processing
Special | Special functions (Airy, Bessel, etc.)
We’ll be exploring most of the SciPy modules that deal with numerical analysis and signal processing. Additional SciPy modules include sparse matrices (module sparse), statistics (module stats), and more; they will not be covered in this book.

To import a SciPy module, issue import scipy.modulename:

>>> import scipy.linalg

or

>>> from scipy import linalg

Personally, I prefer the latter option: linalg.eig() is shorter to code than scipy.linalg.eig() (plus I think it’s easier to read).

Linear Algebra

Linear algebra is a branch in mathematics that deals with matrices, vectors, and solving systems of linear equations. SciPy and NumPy provide us with many functions to deal with these topics: solving systems of linear equations, matrix and vector operations, and matrix decompositions.

Solving a System of Linear Equations

To solve a system of linear equations, we first write the problem in matrix notation.

2*x + 3*y = 10
3*x - y = -1.5

We start by defining a matrix, M, and a vector, V. The matrix is composed of the coefficients of x and y, which are 2 and 3 on the first row, hence [2,3], and 3 and -1 on the second row, hence [3,-1]:

>>> from pylab import *
>>> M = array([[2,3],[3,-1]])

Next we define the vector of the results, [10,-1.5]:

>>> V = array([10,-1.5])

Now all that’s required is to use the function solve():

>>> solve(M,V)
array([ 0.5,  3. ])

meaning that x is equal to 0.5 and y is equal to 3.

It’s also possible to reach the solution by calculating the inverse of the matrix M and multiplying it by vector V:

>>> dot(inv(M),V)
array([ 0.5,  3. ])

I’ve introduced two functions here: inv() and dot(). The function inv() calculates the inverse of a matrix, and the function dot() performs a dot product. Had I multiplied inv(M) with V using the * operator, I would’ve received an element-by-element multiplication, and not the result we’re interested in:

>>> inv(M)*V
array([[ 0.90909091, -0.40909091],
       [ 2.72727273,  0.27272727]])

Generally speaking, you should use solve() instead of inv().
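The three computations above can be compared side by side. This is a minimal sketch for current Python 3 that calls the functions through the numpy.linalg namespace instead of the pylab star import; the variable names are illustrative.

```python
import numpy as np

M = np.array([[2.0, 3.0], [3.0, -1.0]])
V = np.array([10.0, -1.5])

# solve() works from a factorization of M and never forms the inverse
x_solve = np.linalg.solve(M, V)

# the same answer via an explicit inverse and a matrix-vector product
x_inv = np.dot(np.linalg.inv(M), V)

# the * operator broadcasts V across the rows of inv(M): a 2-by-2 array of
# element-wise products, not a solution of the system
elementwise = np.linalg.inv(M) * V

print(x_solve)
print(np.allclose(x_solve, x_inv))
print(elementwise.shape)
```

Substituting the result back, as in np.allclose(np.dot(M, x_solve), V), is a cheap way to confirm a solution regardless of which route produced it.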
The function solve() can handle what mathematicians call “less-behaved” matrices.

Vector and Matrix Operations

Much like dot(), the function vdot() returns the dot product of two vectors. So if you’re only interested in the value of x in the previous example, you can write

>>> dot(inv(M)[0],V)
0.50000000000000022

The function inner(v1,v2) will perform an inner product, that is, multiply every element in v1 with the corresponding element in v2 and then sum them together:

>>> V1 = array([10,-1.5])
>>> V2 = array([1,2])
>>> total = 0
>>> for i in range(len(V1)):
...     total += V1[i]*V2[i]
...
>>> total
7.0
>>> inner(V1,V2)
7.0

I’ve implemented an inner product operation with a for loop and compared the results with the results of the function inner(). As can be expected, the results are the same. Note that the function inner() does not multiply an element with its conjugate (negative imaginary part).

The function inner() works on matrices as well:

>>> M = array([[2,3],[3,-1]])
>>> M
array([[ 2,  3],
       [ 3, -1]])
>>> inner(M,inv(M))
array([[  1.00000000e+00,   1.11022302e-16],
       [  5.55111512e-17,   1.00000000e+00]])

Similarly, outer() performs an outer product of two vectors or matrices:

>>> V1 = array([10,-1.5])
>>> V2 = array([1,2])
>>> outer(V1,V2)
array([[ 10. ,  20. ],
       [ -1.5,  -3. ]])

The function transpose() will permute axes, and conjugate() will negate the imaginary part of each element of a matrix or vector:

>>> V1 = array([10,-1.5])
>>> V2 = array([1,2])
>>> outer(V1,V2)
array([[ 10. ,  20. ],
       [ -1.5,  -3. ]])
>>> outer(V2,V1)
array([[ 10. ,  -1.5],
       [ 20. ,  -3. ]])
>>> all(outer(V1,V2)==transpose(outer(V2,V1)))
True
>>> conjugate(V1+1j*V2)
array([ 10.0-1.j,  -1.5-2.j])

The function det(m) will return the determinant of matrix m:

>>> det(array([[2,3],[3,-1]]))
-11.0

Matrix Decomposition

Matrix decomposition is the rewriting of a matrix to a specific form. There are many decompositions including LU decomposition, singular value decomposition, and QR decomposition.
NumPy’s linear algebra module supports some matrix decompositions via the functions in Table 8-2.

Table 8-2. Some Matrix Decomposition Functions

Function | Description
cholesky(m) | Cholesky decomposition
eig(m) | Eigenvalue decomposition
qr(m) | QR decomposition
svd(m) | Singular value decomposition

The following code performs eigenvalue decomposition with verification of the results:

>>> A = array([[1,2],[0,1]])
>>> L, v = eig(A)  # calculate eigenvalues and eigenvectors
>>> det(A-eye(2)*L)  # verify eigenvalues (should be zero)
0.0
>>> dot(A,v[:,0])-L[0]*v[:,0]  # verify eigenvector (should be 0)
array([ 0.,  0.])
>>> dot(A,v[:,1])-L[1]*v[:,1]  # verify eigenvector (should be 0)
array([  2.22044605e-16,   0.00000000e+00])

I’ve created a matrix A and calculated its eigenvalues λ1, λ2 (stored in vector L) and eigenvectors v1, v2 (stored in the columns of matrix v). Once the eigenvalues are evaluated, they can be verified by calculating det(A − λI), which should be zero; this is done in the third line. Also, every eigenvector v satisfies λv = Av; this is verified in the last two lines.

We will not be covering other matrix decompositions here; if you require additional information, help() is quite informative.

Additional Linear Algebra Functionality

Additional linear algebra functionality is available with the scipy.linalg module. To access SciPy’s linear algebra functions, issue import scipy.linalg or from scipy import linalg. SciPy’s added functionality includes

• Matrix decomposition functions: lu() for LU decomposition and qr() for QR matrix decomposition, as well as functions for other decompositions.
• Matrix and vector operators such as norm() to calculate a matrix or vector norm.
• Matrix functions, for example, expm() and tanm(). Matrix function names are similar to regular function names but with an added character m.

Numerical Integration

Numerical integration is the process of numerically computing a definite integral. There are many occasions where numerical integration is important.
Examples include calculating the area of a shape or the area under a graph, and solving differential equations.

For the purpose of this discussion we’ll calculate the area of half a circle of radius 1. We already know this area to be π/2. So in a sense, calculating the area of half a circle is equivalent to calculating the numerical value of π.

First, we create two vectors: x and y. These two vectors satisfy the circle equation x² + y² = 1:

>>> N = 7
>>> x = linspace(-1,1,N)
>>> y = sqrt(1-x**2)
>>> x**2+y**2
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.])

I chose the value of N arbitrarily; N is the number of points in the vectors x and y. To visualize the numerical integration, I plot rectangles that approximate the area of the circle:

>>> figure()
>>> dx = x[1]-x[0]
>>> for i in range(len(x)-1):
...     rect = Rectangle((x[i],0), dx, 0.5*(y[i]+y[i+1]))
...     gca().add_patch(rect)
...
>>> title('Approximating the area of half a circle')
>>> axis('equal')

The area under the curve, that is, the integral, is approximately the sum of these rectangles. Each rectangle’s area is 0.5*(y[i]+y[i+1])*dx, so twice the total sum can be written as follows:

>>> dx*(sum(y[0:-1]+y[1:]))
2.9175533787759904

I’ve multiplied the result by 2 (which cancels the factor of 0.5) so we can compare with π instead of π/2. Obviously, the bigger N is, the closer this number will be to π:

>>> for N in [5, 10, 20, 100]:
...     x = linspace(-1,1,N)
...     dx = x[1]-x[0]
...     y = sqrt(1-x**2)
...     est_pi = dx*sum(y[0:-1]+y[1:])
...     print "N=%d, estimated pi is %f" % (N, est_pi)
...
N=5, estimated pi is 2.732051
N=10, estimated pi is 3.019232
N=20, estimated pi is 3.101560
N=100, estimated pi is 3.138218

As you can see, for N = 100, the estimate is within about 0.1 percent of π. Figure 8-1 captures this visually.

Figure 8-1. Calculating the area of a circle

In calculating the area of the circle, I chose values that are evenly spaced. In case you’d like to use non-evenly spaced values, the implementation is more complex.
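The estimate above can also be reproduced with nothing but the standard library, which makes the arithmetic of each strip explicit. This is a sketch for current Python 3; the function name estimate_pi is illustrative.

```python
import math

def estimate_pi(n):
    """Estimate pi as twice the area under y = sqrt(1 - x**2) on [-1, 1],
    using n evenly spaced sample points."""
    dx = 2.0 / (n - 1)
    xs = [-1.0 + i * dx for i in range(n)]
    # max() guards against a tiny negative value from round-off near x = 1
    ys = [math.sqrt(max(0.0, 1.0 - x * x)) for x in xs]
    # each strip has width dx and the mean height of its two edges
    half_area = sum(0.5 * (ys[i] + ys[i + 1]) * dx for i in range(n - 1))
    return 2.0 * half_area

for n in (5, 10, 20, 100):
    est = estimate_pi(n)
    rel_err = abs(est - math.pi) / math.pi
    print("N=%d, estimated pi is %f (relative error %.2f%%)" % (n, est, 100 * rel_err))
```

The error shrinks slowly here because the half circle is vertical at x = -1 and x = 1, where a straight-edged strip is a poor fit.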
Also, the method uses rectangles to approximate the area under the curve, but in this particular example (and many others), trapezoids are probably better suited, which brings us to the function trapz(y,x). The function accepts vectors y and x and returns the numerical integral. The following performs numerical integration of non-evenly spaced x values using the function trapz():

>>> x = array([-1, -0.9, -0.4, 0.0, 0.4, 0.9, 1])
>>> y = sqrt(1-x**2)
>>> trapz(y,x)*2
2.9727951234089831

Figure 8-2 shows a visual representation of the trapezoidal integration.

Figure 8-2. Calculating the area of a circle using the trapezoidal method and non-evenly spaced values

More Integration Methods

Additional integration algorithms are available with the module scipy.integrate. To use this module, issue from scipy import integrate. We’ll limit our discussion to the algorithm quad(), which uses a Gaussian quadrature to numerically integrate a mathematical function. Unlike previous methods such as trapz(), using quad() requires supplying a mathematical function and not the x and y vectors.

Note: I’ve used the term “mathematical function” to differentiate this type of function from a general-purpose Python function. A mathematical function is one that returns a numerical value given an input numerical value, for example, y = f(x). In reality, we implement a mathematical function as a Python function.

>>> from scipy.integrate import quad
>>> def half_circle(x):
...     return sqrt(1-x**2)
...
>>> pi_half, err = quad(half_circle, -1, 1)
>>> (pi_half*2, err)
(3.1415926535897967, 1.0002354500215915e-009)

I defined a mathematical function half_circle() that returns the y coordinate value of the upper half circle of radius 1, given an x coordinate value. I then called quad() with the arguments half_circle, the function to integrate, and -1 and 1, the range of values to integrate. The function quad() returns a value and an error.
The module scipy.integrate also supports solving of ordinary differential equations using functions ode() and odeint(). We will not be discussing these functions. If you’re interested in solving differential equations, refer to the SciPy home page: http://www.scipy.org/SciPy.

Interpolation and Curve Fitting

Interpolation and curve fitting deal with fitting functions to discrete known values. There are several reasons you would want to fit functions to points of data, among which are

• Fitting a known function to gathered experimental data. This can be helpful in determining other parameters of the experiment.
• Evaluating the numerical values of functions at additional points (other than the given ones).

Interpolation allows efficient implementations that are tailor-made to a specific problem. Instead of writing a lookup table for all the possible values, you could come up with an interpolation polynomial that is more efficient, albeit with possible loss of performance and accuracy. At other times, you might choose to implement a known function such as sqrt() instead of using a library-supplied algorithm to increase performance (again, at the possible cost of accuracy).

INVERSE SQUARE ROOT AND QUAKE III

If you’re interested in efficient algorithms to calculate numerical functions, you may find the article “Fast Inverse Square Root” by Chris Lomont, http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf, interesting. The article describes a very efficient algorithm to implement the inverse square root of a number that appeared in the source code of the computer game Quake III. The implementation makes use of the Newton-Raphson method (and not interpolation). The article assumes knowledge of C.

Piecewise Linear Interpolation

Let’s turn back to our half-a-circle example. This time, we’ll limit ourselves to a quarter of a circle, that is, positive values of x and y. We start by calculating the y values for x equal to 0, 0.2, . . . , 1.
We’ll store the results in vectors xp and yp:

>>> xp=linspace(0,1,6)
>>> xp
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])
>>> yp=sqrt(1-xp**2)

We’d like to calculate the values of y for x values equal to 0.1, 0.3, . . . , 0.9 given xp and yp. We’ll use the function interp(x,xp,yp) for this. The function returns the value of the piecewise linear function defined by xp, yp at a requested point x. What this means is interp() returns the value of a point on a line connecting two adjacent (xp, yp) points:

>>> xi=arange(0.1,1.0,0.2)
>>> yi=interp(xi,xp,yp)

The vector yi holds the interpolated values at points 0.1, 0.3, . . ., 0.9. The following visualizes a piecewise linear interpolation for the quarter of a circle:

>>> figure()
>>> hold(True)
>>> x=linspace(0,1,500)
>>> y=sqrt(1-x**2)
>>> plot(x,y,'b',label='ideal')
>>> plot(xp,yp,'or',label='interpolation points')
>>> plot(xp,yp,'--r',label='piecewise linear function')
>>> plot(xi,yi,'sg',label='interpolated values')
>>> legend(loc='best')
>>> grid()
>>> axis('scaled')
>>> axis([0,1.1,0,1.1])
>>> title('Piecewise linear interpolation')

Figure 8-3 shows the results of this visualization.

Figure 8-3. Piecewise linear interpolation

For the purpose of the example, the values xp and yp are computed, but in reality, these values can originate from sampled data. As you can see from the graph, the interpolated value at 0.9 is considerably less accurate than other interpolated values. Typically, the more points you add, the more accurate the result.

Polynomials

Polynomials are mathematical expressions that involve a sum of integer powers of a variable multiplied by a coefficient. Examples include 2x^2 + x – 1 as well as x. However, sin(x) is not a polynomial.
The reason polynomials are so important is that they involve only basic operations: addition, subtraction, and multiplication (integer powers can be implemented with several multiplications), and this property makes them very easy to implement in computing. Taylor series expansion (http://en.wikipedia.org/wiki/Taylor_series) is a prime example of transforming a function to a polynomial, easily computed.

To be able to operate on polynomials with NumPy and SciPy, we represent a polynomial as a vector. The first element in the vector is the coefficient to the highest power, and the last element in the array is the coefficient to the lowest power, 0. So to express the polynomial x^2 + 3x + 2, issue the following:

>>> p=array([1,3,2])

To solve the equation x^2 + 3x + 2 = 0, use the function roots(p):

>>> roots(p)
array([-2.+0.j, -1.+0.j])

Notice that the imaginary parts are zero, and so the roots are –2 and –1. If you’d like to construct a polynomial from its roots instead of its coefficients, use the function poly():

>>> p=poly([-2,-1])
>>> p
array([1, 3, 2])

Adding and subtracting polynomials is done using polyadd() and polysub():

>>> p1=poly([-2,-1])
>>> p2=array([1,0,0,0])
>>> polyadd(p1,p2)
array([1, 1, 3, 2])

I’ve added x^2 + 3x + 2 to x^3 and got x^3 + x^2 + 3x + 2 as a result. Multiplying and dividing polynomials is done using polymul() and polydiv(). The return value from polydiv() is a quotient and a remainder:

>>> p=polymul(array([1,2]),array([1,3]))
>>> p
array([1, 5, 6])
>>> polydiv(p,array([1,3]))
(array([ 1.,  2.]), array([0]))

Performing integration and differentiation on polynomials is done using the functions polyint() and polyder(), respectively:

>>> p=poly([-1j,1j])
>>> p
array([ 1.,  0.,  1.])
>>> polyder(p)
array([ 2.,  0.])
>>> polyint(p)
array([ 0.33333333,  0.        ,  1.        ,  0.        ])

In the first line I created a polynomial from complex numbers; the polynomial created is stored in p and is x^2 + 1. Using polyder() I calculated the derivative of p and got 2x.
Using polyint() I calculated the integral of p and got (1/3)x^3 + x.

Uses of Polynomials

So why is all this polystuff important? The main reason is that you can use polynomials to approximate functions both from gathered data and from analytical functions. And since polynomials only require multiplications and additions, implementing polynomials in an embedded system, for example, is straightforward.

Fitting polynomials to data is done using the function polyfit(x,y,n). Given a vector of x points and a vector of y points, polyfit(x,y,n) will return a polynomial of degree n (highest power of x) that best fits the set of data points in the least-squares sense. Another function that is of use is polyval(p,x); this function returns the value of the polynomial at x (x can be a vector).

Example: Linear Regression

A known curve-fitting algorithm is linear regression. The idea is to draw a straight line in such a way that the sum of the squared vertical distances of the points from the line is minimal.

For the purpose of this example, we’ll create a straight line and then add “measurement noise” to the values. Confronted with the new “noisy” data, we’ll try to estimate the first-order polynomial that fits the data. We’ll compare the results with the known true values (see Listing 8-1).

Listing 8-1. Linear Regression with polyfit()

from pylab import *

# number of data points
N=100
start=0
end=1

A=rand()
B=rand()

# our linear line will be y=A*x+B
x=linspace(start,end,N)
y=A*x+B
y+=randn(N)/10

# linear regression
p=polyfit(x,y,1)

figure()
title('Linear regression with polyfit()')
plot(x,y,'o',
    label='Measured data; A=%.2f, B=%.2f' % (A,B))
plot(x,polyval(p,x),'-',
    label='Linear regression; A=%.2f, B=%.2f' % tuple(p))
legend(loc='best')

I’ve randomly selected two values for A and B, and constructed a straight line with noise using randn(). Then, I used polyfit() to fit the data to a first-degree polynomial, a straight line. Lastly, I plot the data along with the newly constructed line.
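As a non-plotting cross-check of the same idea, the sketch below fits a line to noisy data and compares the fitted coefficients with the values used to generate it. It assumes only NumPy; the seed and the "true" slope and intercept (0.7 and 0.2) are made-up values chosen so the run is repeatable, not values from the listing.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily

true_slope, true_intercept = 0.7, 0.2   # hypothetical "true" A and B
x = np.linspace(0.0, 1.0, 100)
y = true_slope * x + true_intercept + rng.normal(scale=0.1, size=x.size)

# polyfit returns coefficients from the highest power down: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Residuals of a least-squares line with an intercept sum to (nearly) zero.
residuals = y - np.polyval([slope, intercept], x)
print("slope=%.3f intercept=%.3f rms residual=%.3f"
      % (slope, intercept, np.sqrt(np.mean(residuals**2))))
```

The fitted values should land close to, but not exactly on, the true ones; the gap shrinks as the number of points grows or the noise level drops.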
Figure 8-4 shows the results of this linear regression.

Figure 8-4. Linear regression with polyfit()

Example: Linear Regression of Nonlinear Functions

In cases where the function you’re trying to fit isn’t linear, at times it’s still possible to perform linear regression. The following is an example of fitting exponential data:

from pylab import *

# number of data points
N=100
start=0
end=2

A=rand()
B=rand()

x=linspace(start,end,N)
y=exp(A*x+B)
y+=randn(N)/5

# linear regression
p=polyfit(x,log(y),1)

figure()
title(r'Linear regression with polyfit(), $y=Be^{Ax}$')
plot(x,y,'o',
    label='Measured data; A=%.2f, B=%.2f' % (A,exp(B)))
plot(x,exp(polyval(p,x)),'-',
    label='Linear regression; A=%.2f, B=%.2f' % (p[0],exp(p[1])))
legend(loc='best')

The regression is performed in the call to the function polyfit(). This time, I’ve passed x and log(y) as values, allowing a linear regression on log(y), or equivalently an exponential regression on y. You can see the results of this regression in Figure 8-5.

Figure 8-5. Fitting exponential data

Example: Approximating Functions with Polynomials

Another set of problems solvable with polyfit() is approximation of functions using interpolation. The motivation behind this is a simple implementation of known functions.

For the purpose of this example, we’ll approximate the function sin(x). The idea is to create a polynomial that passes through known interpolation points; that is, calculate the value of sin(x) for n known values of x, and then create a polynomial of degree n – 1 that passes through all these points.

We start by selecting a set of points from 0 to π/2; these will be our interpolation points. Values outside this range can be computed using trigonometric identities and the interpolation function. We select five points for interpolation, thus deciding the degree of the interpolation polynomial to be 4. Once the points are selected, we calculate the sine of these points.
For the purpose of this example, I’ve chosen sine values that can be easily computed using the sqrt() function. You might argue that I’m cheating here, because I’m using a nonlinear function (square root) to calculate sin(x) and not purely polynomials. But you’ve already seen how to calculate the square root of a number using Newton’s method in Chapter 7.

Tip: The selection of interpolation points is an interesting topic, and work by the mathematician Pafnuty Chebyshev has contributed much to the topic. See http://en.wikipedia.org/wiki/Pafnuty_Chebyshev and http://en.wikipedia.org/wiki/Chebyshev_nodes.

The values I’ll select for interpolation are 0, 30, 45, 60, and 90 degrees. The reason I chose these values is that I know their exact sine values: 0, 1/2, √2/2, √3/2, and 1, respectively. Or, in vector form:

>>> values=[0,pi/6,pi/4,pi/3,pi/2]
>>> sines=sqrt(arange(5))/2
>>> sines
array([ 0.        ,  0.5       ,  0.70710678,  0.8660254 ,  1.        ])

Given these, interpolation is straightforward:

>>> p=polyfit(values,sines,len(values)-1)
>>> p
array([  2.87971125e-02,  -2.04340969e-01,   2.13730075e-02,
         9.95626184e-01,   2.22044605e-16])

So if you were to implement sin(x), all you need is to store the values of p given previously and then write a simple routine to calculate the value of sin(x) using the polynomial. If you’re using NumPy, simply call polyval().

Let’s plot the difference between our implementation of sin(x) and NumPy’s sin(x) function:

>>> figure()
>>> x=linspace(0,pi/2,100)
>>> plot(x,polyval(p,x)-sin(x),label='error',lw=3)
>>> grid()
>>> ylabel('polyval(p,x)-sin(x)')
>>> xlabel('x')
>>> title('Error approximating sin(x) using polyfit()')
>>> xlim(0,pi/2)

Figure 8-6 illustrates this difference.

Figure 8-6. Interpolation accuracy

The results are quite accurate, with an error of less than 0.003 at worst.

Spline Interpolation

The scipy.interpolate module adds additional interpolation functions.
One of these is the spline(xp,yp,x) interpolation function. Notice that the arguments to the function spline() are ordered differently from those of the function interp(). Spline interpolation is a piecewise polynomial interpolation that adheres to specific rules to yield smooth results. Let’s turn to the previous circle example:

from scipy.interpolate import spline
from pylab import *

xp=linspace(0,1,6)
yp=sqrt(1-xp**2)
xi=linspace(0,1,100)
yi=interp(xi,xp,yp)
ys=spline(xp,yp,xi)

figure()
hold(True)
plot(xi,yi,'--',label='piecewise linear',lw=2)
plot(xi,ys,'-',label='spline',lw=2)
legend(loc='best')
grid()
title(r'Spline interpolation of $y=\sqrt{1-x^2}$')
xlabel('x')
ylabel('y')
axis('scaled')
axis([0,1.2,0,1.2])

In Figure 8-7, I’ve compared a piecewise linear interpolation with a spline interpolation. The spline interpolation appears “smoother.”

Figure 8-7. Spline interpolation

Solving Nonlinear Equations

In Chapter 7 we talked about Newton’s method and used it to draw fractals. Newton’s method was used to solve a nonlinear equation. The module scipy.optimize provides us with additional tools to solve nonlinear equations, as well as other optimization routines that will not be discussed here. Of those routines, I’d like to highlight three: fsolve(f,x0), bisect(f,a,b), and newton(f,x0). All these functions try to solve the equation f = 0, where f is a mathematical function implemented in Python.

Suppose we’d like to calculate √3, which appeared in the previous example. The idea is to construct a function whose root is √3. This is easily done by setting f = x^2 – 3 = 0:

>>> def f(x):
...     """Returns x**2-3"""
...     return x**2-3
...
>>> f(10)
97

Let’s use the functions fsolve(), bisect(), and newton() to calculate the roots. For fsolve() and newton(), we’ll use x0 = 1, which is called the initial guess. The initial guess is a value that is close to the desired result.
For bisect(), we need to provide a region for the search. We’ll set the region to (1, 2) because we know the square root of 3 is less than 2 but greater than 1:

>>> from scipy import optimize
>>> optimize.fsolve(f,1)
1.7320508075688772
>>> optimize.newton(f,1)
1.7320508075688772
>>> optimize.bisect(f,1,2)
1.7320508075690668
>>> _**2
3.0000000000006564

Although in the simple case of the square root of 3 all these functions provide accurate results, the algorithms are computationally intensive. In most of these functions you can control how accurate you’d like your result to be by passing the proper arguments. Of course, for a question as simple as the one presented here, it’s best to use sqrt(3).

Special Functions

The scipy.special module provides a host of special functions that usually surface in higher mathematics and physics. These include

• Bessel functions; integrals, derivatives, and zeros of Bessel functions
• Airy functions
• Gamma functions and error functions
• Special polynomials (Legendre, Chebyshev)

and many more. To use the functions, issue the following:

>>> from scipy import special
>>> special.chebyt(2)
poly1d([  2.00000000e+00,  -4.44089210e-16,  -1.00000000e+00])

For a full account, issue help(special).

Signal Processing

Up to this point in the chapter, we’ve dealt with numerical analysis. Going forward, the topics are related to signal processing. Signal processing is a vast field that deals with signals: values that change over time. Popular signal processing algorithms include the processing of sound, such as an equalizer; others include algorithms for radars, CAT scanning, and many more.

This part of the chapter will cover some of the functionality available with the module scipy.signal and complement the discussion with examples. You’ll learn about some basic algorithms to detect signals in the presence of noise and functions to design filters.
However, this section is only a taste of the topic, and I encourage you to consult the references at the end of the chapter and the professional literature for efficient signal processing algorithms.

Functions where, select, and find

The first set of functions we’ll cover is where(), select(), and find(). The function find(cond) finds the indices to an array for which a condition is met:

>>> from pylab import *
>>> squares=arange(10)**2
>>> squares
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
>>> I=find(squares<50)
>>> I
array([0, 1, 2, 3, 4, 5, 6, 7])
>>> squares[I]
array([ 0,  1,  4,  9, 16, 25, 36, 49])

We created a vector holding the squares of the numbers 0–9 and found all the indices to the vector that satisfy the condition that the squares are less than 50. Notice that the return value is a vector of indices, and if you require the values and not the indices, you have to access the original array, which is squares[I] in the preceding example.

The function where(cond,x,y) accepts three arrays of the same size: cond, x, and y, and then evaluates every element in cond. If the element evaluates to True, the return value is the corresponding element from x; if the element evaluates to False, the return value is the corresponding element from y:

>>> up=arange(10)
>>> up
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> down=arange(10,0,-1)
>>> down
array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1])
>>> highest=where(up>down,up,down)
>>> highest
array([10,  9,  8,  7,  6,  5,  6,  7,  8,  9])

The function select(cond,vals,default=0) adds functionality to the function where() by allowing for several conditions. The function accepts a list of conditions specified in cond and returns the corresponding element associated with vals if a condition is met; if none of the conditions are met, the default value is selected.
>>> up=arange(10)
>>> ramp=select([up<4,up>7],[4,7],up)
>>> ramp
array([4, 4, 4, 4, 4, 5, 6, 7, 7, 7])

The first four elements of up are less than 4, so the condition up<4 is met, causing the selection of the value 4. The last two elements are greater than 7, causing the selection of the value 7, and values from 4 through 7 are retained as is because the default is set to be equal to up. As a matter of fact, this functionality is called clipping and is available both as a method of the NumPy ndarray object and as a stand-alone function, clip().

So now that you know of the functions find(), where(), and select(), what can you do with them? The answer is simple: they’re great for picking up values, what we call detection in signal processing.

Example: Simple Detection of Signal in Noise, Part 1

The detection of signals in the presence of noise plays an integral role in a great number of applications. For example, it is used in communication systems in the detection of signals such as radio or television broadcasts and differentiating them from noise, in medicine with the detection of an ECG signal, and more.

For the purpose of this example, we’ll first construct a clean signal. By a signal, I mean a one-dimensional array (vector), where values are stored as a function of time. Our purpose will be to detect “events,” which will be represented by narrow triangles placed randomly in time. There can be several events in a signal.

To generate a triangular pulse, I’ll use the signal.triang() function (which is really a window function; more on that later in the chapter in the section “Window Functions”). The function generates a triangular window of a specified size. We randomly place triangular pulses in the signal vector, as shown in Listing 8-2.

Listing 8-2.
Randomly Placing Triangular Spikes

from pylab import *
from scipy import signal

# parameters controlling the signal
n=100
t=arange(n)
y=zeros(n)
num_pulses=3
pw=11
amp=20

for i in range(num_pulses):
    # int() so that loc can be used as a slice index
    loc=int(floor(rand()*(n-pw+1)))
    y[loc:loc+pw]=signal.triang(pw)*amp

# add some noise
y+=randn(n)

figure()
title('Signal and noise')
xlabel('t')
ylabel('y')
plot(t,y)

First I defined some parameters I’ll be using. The number of points in the signal is n and is equal to 100. The number of triangular pulses I’ll place is 3, denoted by num_pulses. Each triangular pulse will be generated using pw=11 points. The maximum value for the triangular spike will be amp, denoting amplitude.

Once I have all the parameters defined, I create two vectors: the time vector, t, and the values vector, y. The vector t is some arbitrary timestamp, in this example incrementing values starting at zero and ending at n – 1. The vector y is initially set at zero.

Next I randomly place triangular spikes. The location, loc, where the triangular spike will be placed is randomly generated with the call to the function rand(), which generates a value between 0 and 1. Scaling that value by n – pw + 1 and rounding down picks an integer between 0 and n – pw, which ensures spikes aren’t placed outside the vector y. Once I have all the spikes placed, I add noise by use of the function randn(), which generates normally distributed (Gaussian) noise, often used to model “white noise.” I’ve chosen to use a normal distribution with variance 1 and mean 0. Notice that randn() is different from rand().

Figure 8-8 shows a randomly generated signal.

Figure 8-8. Three triangular spikes with noise

I did not check to see that spikes do not overlap, so when you run the script you’ll at times see one or two spikes instead of three. This is fine, since we want to add some randomness to the example.

So far we’ve just created the signal. Now let’s detect it.
For detection, we’ll use a simple algorithm: whenever a value is above a set threshold, we’ll declare this as an event, or detection. We’ll set the threshold at amp/2 and make use of the function find(), as shown in Listing 8-3.

Listing 8-3. Detecting Signals

# detect signals
thr=amp/2
I=find(y>thr)

# plot signal with noise plus detection
figure()
hold(True)
plot(t,y,'b',label='signal with noise')
plot(t[I],y[I],'ro',label='detections')
plot([0,n],[thr,thr],'g--')

# annotate the threshold
text(2,thr+.2,'Threshold',va='bottom')
title('Simple signal detection in noise')
legend(loc='best')

Figure 8-9 shows the result.

Figure 8-9. Simple signal detection in the presence of noise

Functions diff and split

Another set of functions that’s of use in signal detection is diff() and split(). The function diff(v), which was introduced in previous chapters, returns a vector composed of differences of elements in v. The function split(v,indices) splits a vector on indices.

>>> v=arange(10)
>>> split(v,[4,8])
[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]

Example: Simple Detection of Signal in Noise, Part 2

In the previous example, you saw how to perform simple detection using find(). We’ve displayed all points that were above a specific threshold. On many occasions, we’re less interested in all the points above a threshold, because the threshold is arbitrarily chosen; we’re more interested in the highest points above a threshold.

Here we pick up from the previous example. This time, we’d like to spot the peak in each detection. Listing 8-4 presents the code to do that.

Listing 8-4. Peak Detections

# peak detections
J=find(diff(I)>1)
for K in split(I,J+1):
    ytag=y[K]
    peak=find(ytag==max(ytag))
    plot(peak+K[0],ytag[peak],'sg',ms=7)

The implementation is a bit tricky, so let’s walk through it. The idea is this: we split the detections into separate groups, and in each group, we find the peak and plot it.
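The split-then-find-the-peak idea can also be tried on a tiny hand-made signal, where the outcome is easy to check by eye. The sketch below is an illustration under stated assumptions rather than the listing itself: it uses only NumPy, with np.flatnonzero() standing in for PyLab's find() and np.argmax() replacing the comparison against max(). The signal values and the threshold are made up.

```python
import numpy as np

# A small noise-free signal with two triangular bumps (made-up values).
y = np.zeros(30)
y[5:10] = [4.0, 8.0, 12.0, 8.0, 4.0]
y[20:25] = [5.0, 10.0, 15.0, 10.0, 5.0]
thr = 6.0                                     # hypothetical threshold

above = np.flatnonzero(y > thr)               # indices of all detections
breaks = np.flatnonzero(np.diff(above) > 1)   # positions where the indices jump
groups = np.split(above, breaks + 1)          # one index array per event

# Peak of each group: the index holding the largest sample value.
peaks = [int(g[np.argmax(y[g])]) for g in groups]
print(peaks)   # [7, 22]
```

Here the detections fall into two runs of consecutive indices, 6 to 8 and 21 to 23, and the peak of each run is its middle sample.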
The first problem, splitting the detections, makes use of the indices of detected values. A group is considered one detection if the indices are consecutive. Whenever there’s a jump in indices, it means a new group:

>>> I=find(y>thr)
>>> I
array([ 9, 10, 11, 12, 13, 14, 42, 43, 44, 45, 46, 47, 66, 67, 68, 69, 70, 71])

So the group [9,10,11,12,13,14] is one group, the group [42,43,44,45,46,47] is the second group, and the group [66,67,68,69,70,71] is the last group.

The function diff(I) will return values other than 1 whenever there’s a new group. Whenever the difference is greater than 1, it means the start of a new group:

>>> diff(I)
array([ 1,  1,  1,  1,  1, 28,  1,  1,  1,  1,  1, 19,  1,  1,  1,  1,  1])
>>> J=find(diff(I)>1)
>>> J
array([ 5, 11])

So we’d like to split after the sixth element (index 5) and after the twelfth element (index 11), which is why J+1 is passed to the split() function:

>>> split(I,J+1)
[array([ 9, 10, 11, 12, 13, 14]), array([42, 43, 44, 45, 46, 47]), array([66, 67, 68,
69, 70, 71])]

All that’s needed now is finding the peak, which is coded as find(ytag==max(ytag)). In Figure 8-10, peak detections are marked by squares.

Figure 8-10. Peak detections

Waveforms

Additional SciPy functionality includes several waveforms that can be used when you’re designing a signal processing algorithm or testing it. These include sawtooth(), square(), gausspulse(), and chirp():

from pylab import *
from scipy import signal

cycles=10
t=arange(0,2*pi*cycles,pi/10)

waveforms=['sawtooth','square']
for i,waveform in enumerate(waveforms):
    subplot(2,2,i+1)
    # look up the waveform function in scipy.signal by name
    y=getattr(signal,waveform)(t)
    plot(t,y)
    title(waveform)
    axis([0,2*pi*cycles,-1.1,1.1])

Figure 8-11 shows the resulting waveforms.

Figure 8-11. Some waveforms

The difference between waveforms and the triangular window used earlier is that they’re repetitive, whereas triang() generates a single window.
The functions gausspulse() and chirp() are a bit more specialized; refer to the interactive help for information on using them.

Fourier Transform

The Fourier transform is a linear operation that transforms a function from the time domain to the frequency domain. Just as the sound you hear can be viewed as amplitude as a function of time, it can also be viewed in terms of its frequency components: basses are the low frequencies of audio, for example.

The topic of Fourier transforms is quite large and requires some mathematical rigor. I will not be trying to address the topic in depth here; instead, I will show how you can use PyLab to perform Fourier transforms on sampled data.

To convert a signal from the time domain to the frequency domain, use fft(x). FFT, which stands for Fast Fourier Transform, is an efficient implementation of the transformation. Generally speaking, if the number of elements in x is a power of 2, the computation is quite fast:

>>> from time import time as t
>>> t1=t(); sum(fft(arange(2**20))); print(t()-t1)
(-0.013107186881825328-0.00555419921875j)
0.233999967575
>>> t1=t(); sum(fft(arange(2**20-1))); print(t()-t1)
(-0.0054081712733022869+0.048980712890625j)
0.547000169754

The first fft() was performed on a vector of size 2^20, which is a power of 2; the second one was performed on a shorter vector but took more than twice as long to compute, because its size is not a power of 2.

To transform data from the frequency domain to the time domain, use ifft(x).

Example: FFT of a Sampled Cosine Wave

A cosine wave is made of one frequency (actually, two frequencies if you include the negative frequency). Let’s generate a cosine wave and calculate its frequency using fft(), as shown in Listing 8-5.

Listing 8-5. Fourier Transform of a Cosine Wave

N=2**9 # we prefer powers of 2
F=25 # a wave at 25 Hz
t=arange(N)/float(N) # sampled over 1 second
x=cos(2*pi*t*F) # the signal

subplot(2,1,1)
plot(t,x)
ylabel('x[]')
xlabel('t [seconds]')
title('A cosine wave')
grid()

subplot(2,1,2)
f=t*N
xf=fft(x)
plot(f,abs(xf))
title('Fourier transform of a cosine wave')
xlabel('f [Hz]')
ylabel('xf[]')
xlim([0,N])
grid()

I first defined a few parameters: N is the number of points in the signal, and F is the frequency of the cosine wave. I then created a time vector, t, which is made of evenly spaced samples between 0 and 1, representing 1 second. I then calculated the sampled cosine wave and plotted it along with its Fourier transform. I’ve chosen to plot the absolute value of the transformed signal, seeing as Fourier transforms return complex values (albeit in this case the imaginary parts are essentially zero). Figure 8-12 shows the results.

Figure 8-12. FFT of a signal

Note: There’s frequency content at 25 Hz (the left spike), but there’s also another spike at 487 Hz. That’s really the value corresponding to –25 Hz, that is, 512 – 25. If you’d like to view the frequency domain centered around 0 Hz, use the function fftshift().

Window Functions

In the FFT example I carefully chose a cosine wave that has a whole number of cycles in 1 second, which is the case for any integer frequency value. The reason is that had I chosen a noninteger value, I would’ve ended up with a signal that does not have full wave cycles. The problem with such a signal is that when you perform the FFT, you’ll start seeing other frequencies and not just the frequency of your original signal.

The reason for this is, in essence, that the FFT assumes the signal is repetitive; that is, it’s not just from 0 to 1 second, it’s from minus infinity to infinity. And so it treats the signal as if it’s copied left and right an infinite number of times. If the signal has an integer number of cycles, it will nicely fit when copied left and right.
But in reality, you can’t guarantee an integer number of waves in your sampled signal, so you’ll start seeing these sampling effects. To minimize the effect, we can use a window function.

Several window functions such as hamming(), hanning(), bartlett(), and kaiser() help minimize this effect but with a cost: the signal itself is also distorted. To use a window, multiply it by the time-domain vector, as shown in Listing 8-6.

Listing 8-6. Hamming Window

N=2**9 # we prefer powers of 2
F=25.5 # wave frequency
t=arange(N)/float(N) # sampled over 1 second
f=t*N # frequency domain
x=cos(2*pi*t*F) # the signal
xh=x*hamming(512) # multiply with a hamming window

plot(f,abs(fft(x)),'s-',label='original')
plot(f,abs(fft(xh)),'o-',label='with Hamming')
xlim([0,50])
xticks(arange(0,55,5))
legend()
grid()
title('Signal with Hamming window')
xlabel('Frequency [Hz]')
ylabel('Amplitude []')

I’ve plotted the FFT of two vectors: the original and the one with a Hamming window. In Figure 8-13, you can see I’ve zoomed in on the 25.5 Hz frequency to show the effects of the window function.

Figure 8-13. Signal with Hamming window

The scipy.signal module provides additional window functions. To access these, issue

>>> from scipy import signal
>>> help(signal)

and scroll down to the window functions section.

Filtering

One of the reasons to transform a time-domain signal to a frequency-domain signal is for the purpose of filtering. A filter is an operation that changes a signal. Much like filters in your kitchen sink, filters let some frequencies pass (water), while stopping other frequencies (large food remains). Filters are used in a variety of applications, ranging from audio to radar systems.

Filters are categorized by their behavior. A filter that lets through low frequencies and stops high frequencies is called a low-pass filter (LPF). Similarly, a high-pass filter (HPF) will allow only high frequencies to pass.
There are also other categorizations such as band-pass filters (which allow only a specific band of frequencies), band-stop filters (which allow anything but a specific band of frequencies), and notch filters (which suppress very few frequencies).

Filters are further categorized by their behavior to an impulse input, that is, the output of the filter as a function of time assuming you were to input a short spike to the filter. Filters that eventually will forget the impulse are known as finite-impulse-response (FIR) filters, and filters that never forget are known as infinite-impulse-response (IIR) filters. From a very simplistic approach, if a filter does not rely on previous outputs (no feedback), it is considered an FIR; otherwise, it’s an IIR.

Filter Design

Assuming you know what filter you wish to design, this section will help you do so. Filter design is an advanced topic, and as such this section is meant for those who require a few pointers on designing filters in Python with SciPy.

The scipy.signal module includes several functions to help design a filter. The function iirdesign() is used for designing an IIR filter. It is quite complete, and it’s best to read the online help and follow it through. Other useful IIR filter design functions include butter(), cheby1(), cheby2(), and ellip().

FIR filter design functionality is provided with functions remez() and firwin(). I won’t be covering those, but should you need to use them, the online help is quite informative.

Finally, if you’d like to view the frequency response of a filter, use the functions freqz() and freqs(). The code in Listing 8-7 will design a low-pass Butterworth filter (an IIR filter) and plot its frequency response.

Listing 8-7.
Frequency Response of a Filter

    N = 256        # number of points for freqz
    Wc = 0.2       # 3dB point
    Order = 3      # filter order

    # design a Butterworth filter
    [b, a] = signal.butter(Order, Wc)

    # calculate the frequency response
    [w, h] = signal.freqz(b, a, N)

    # plot the results
    figure()
    subplot(2, 1, 1)
    plot(arange(N)/float(N), 20*log10(abs(h)), lw=2)
    title('Frequency response')
    xlabel('Frequency (normalized)')
    ylabel('dB')
    ylim(ylim()[0], ylim()[1]+5)
    grid()
    subplot(2, 1, 2)
    plot(arange(N)/float(N), 20*log10(abs(h)), lw=2)
    title('Frequency response (3dB point)')
    xlabel('Frequency (normalized)')
    ylabel('dB')
    text(Wc+.02, -3, '3dB point', va='bottom')
    ylim([-3, 0.1])
    grid()

I’ve made use of two functions: butter() and freqz(). The function butter() designs an IIR filter with the specified parameters (order and cutoff frequency), and the function freqz() returns the frequency response. Note that the frequency response is complex valued, so I’ve plotted the amplitude of its absolute value in dB: 20*log10(abs(h)), as shown in Figure 8-14.

Figure 8-14. Frequency response of a low-pass filter

To filter data given a specific filter, use the function scipy.signal.lfilter(b, a, x). Let’s turn to an example.

Example: Heart-Rate Monitor

For the purpose of this example, I’ll generate a signal that simulates the data generated from a heart-rate monitor connected to a patient. Please do not use this in any sort of production system; it’s merely for educational purposes (and not meant to truly represent heart signals!).

The patient walks around, and as a result, two signals are picked up: 1) the heart signal and 2) a signal associated with the patient’s movement, or what is typically referred to as a movement artifact. Listing 8-8 shows these signals in my simulation.

Listing 8-8.
Heart Rate Simulation

    # heart signal simulation
    N = 256        # number of samples per second
    T = 2          # number of seconds
    hr = 1.67      # 100 beats per minute
    F1 = 0.5       # movement frequency
    t = arange(T*N)/float(N)
    y1 = 5*sin(2*pi*t*F1)    # movement artifact

    # add heart signals
    y2 = zeros(size(y1))
    for i in range(int(T*hr)):
        start = int(i*N/hr)  # slice indices must be integers
        y2[start:start+10] = signal.triang(10)

    # combine movement with beats
    y = y1+y2

    # create a high-pass filter
    [b, a] = signal.butter(3, 0.05, 'high')

    # filter the signal
    yn = signal.lfilter(b, a, y)

    # plot the graphs
    figure()
    subplot(2, 1, 1)
    title('Heart signal with movement artifact (simulation)')
    plot(t, y, lw=2)
    xlabel('t [seconds]')
    ylabel('Amplitude []')
    subplot(2, 1, 2)
    title('Filtered signal')
    plot(t, yn, lw=2)
    xlabel('t [seconds]')
    ylabel('Amplitude []')

I’ve defined several parameters that control the script. The value N is equal to the number of samples per second (some prefer to name this value Fs, which stands for frequency of sampling). The value T is the total number of seconds, in this case 2 whole seconds. The value hr is the patient’s heart rate, 100 beats per minute: 100 / 60 = 1.67 Hz. Lastly, I defined the movement artifact frequency, F1, as 0.5 Hz. I then construct a time vector, t, and a movement artifact vector, y1, and add “beats” with triangular waveforms using the signal.triang() function.

Now that I have a heart signal with a movement artifact, I turn to filtering out the movement artifact. I design a third-order Butterworth HPF to do so via the call to signal.butter() and use the filter parameters to filter the signal using signal.lfilter(). Figure 8-15 shows the resulting plot.

Figure 8-15. Filtering a signal

Example: Moving Average

On many occasions, filtering is used to “smooth” a signal. A simple algorithm is that of a moving average. For every two consecutive points, we calculate the average and use that value instead. The points are overlapping, so the result of using the algorithm on the vector [1, 2, 0, 2] would be [1.5, 1, 1].
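The two-point case is small enough to check by hand. The following is a minimal pure-Python sketch of it; the function name moving_average_2() is a hypothetical helper used only here, not part of NumPy or SciPy.

```python
def moving_average_2(values):
    """Average every pair of consecutive (overlapping) points."""
    return [(values[i] + values[i + 1]) / 2.0
            for i in range(len(values) - 1)]


result = moving_average_2([1, 2, 0, 2])
print(result)
```

Note that the output is one element shorter than the input, because the last point has no successor to be averaged with.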
But why stop at two samples? A moving average can be computed over several points, returning the average of those points. In Python, you could write

    >>> from pylab import *
    >>> N = 512
    >>> t = linspace(0, 10, N)
    >>> x = 1-exp(-t)+randn(N)/10
    >>> W = 32   # num points in moving average
    >>> xf = zeros(len(x)-W+1)
    >>> for i in range(len(x)-W+1):
    ...     xf[i] = mean(x[i:i+W])
    ...
    >>> plot(t, x)
    >>> hold(True)
    >>> plot(t[W-1:], xf, lw=3)
    >>> title('Moving average')
    >>> legend(['signal with noise', 'filtered signal'])
    >>> xlabel('t [seconds]')
    >>> ylabel('x []')

This is a straightforward implementation using a for loop. The input to the filter is arbitrarily chosen as 1 – exp(–t) plus noise.

There is an easier approach. A moving average is an FIR filter with all its elements equal to 1/W, where W is the length of the moving average window. In this case, a quick-and-simple way to implement a moving average filter instead of the for loop is by calling the signal.lfilter() function and passing ones(W)/W as the filter values:

    >>> from scipy import signal
    >>> xf = signal.lfilter(ones(W)/W, 1, x)

Figure 8-16 shows the results of plotting a moving average.

Figure 8-16. Moving average

Final Notes and References

The purpose of this chapter is to serve as a cookbook of algorithms in numerical analysis and signal processing. I took great care to limit the amount of math used in the examples and yet still be informative. The topics covered in the chapter are far too broad to be explored in one book, let alone one chapter. If you find these topics of interest, the following may provide additional information:

• Numerical Recipes: The Art of Scientific Computing, Third Edition by William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P.
Flannery (Cambridge University Press, 2007; for more information, see http://www.nr.com)
• SciPy: http://www.scipy.org/

CHAPTER 9

Image Processing

Two-Dimensional Data

Up to this point we’ve mostly dealt with one-dimensional data, that is, graphs and data made of essentially a series of values. We’ve plotted the data, analyzed it, and created an image that was later saved to file or displayed on screen. However, an image, whether stored in a file or shown on your screen, is two-dimensional. It is made of pixels (short for picture elements), each pixel having a location in two dimensions, x and y, and a value corresponding to the color. In this chapter we turn to manipulating images on the pixel level, that is, operating on the two-dimensional matrix of pixels.

Operations on images are similar to operations performed on one-dimensional data. Slicing a 1-D array of values and adding those values to another array is equivalent to copying and pasting images. Saving an array to file is equivalent to writing images in TIFF or JPEG format. So in a sense, image processing is equivalent to signal processing, only the signals now have two dimensions.

Copying and pasting, resizing, and cropping are all simple operations supported by most GUI-based graphics applications. But with a GUI application, it’s harder to perform these operations in a systematic and automated manner, for example, resizing several images and then combining them to form a collage of images. It’s doable in GUI applications, but it’s typically not as easy and requires some user skills. Other simple image operations include converting file formats, rotating images, and cropping them. I’ll cover these image operations, as well as how to automate them, so results are consistent.

We’ll also deal with data on a numerical level, that is, reading the image, transforming it into a NumPy matrix, and then operating on the matrix itself.
I’ll show how you can implement some interesting algorithms that involve image processing. Lastly, I’ll touch on some more complex topics such as filtering, which is the act of modifying a picture to enhance the visual output.

The basic package we’ll use for all these operations is the Python Imaging Library, or PIL for short. Be sure to install PIL per the guidelines in Chapter 2. We’ll also be relying on NumPy and matplotlib.

It is assumed you have read Chapter 6 and Chapter 7; the text here will rely on the material covered in those chapters to discuss topics related to image processing.

Reading, Writing, and Displaying Images

The Python Imaging Library provides us with several classes that enable image processing. The basic class, Image, supports image operations such as reading an image from file, writing an image to file, copying and pasting, resizing, and rotating.

Reading Images from File

Let’s get started. To use the Image class, import it as follows:

    >>> from PIL import Image

Note: Going forward, I’ll assume you’ve issued from PIL import Image and will refrain from mentioning the import statement. It’s also possible simply to write import Image instead of from PIL import Image.

Our first operation is reading an image file. Since we currently don’t have any images, let’s generate an image and read it from file. We’ll use matplotlib patches for this, as shown in Listing 9-1. (If you’re not familiar with matplotlib patches, refer to Chapter 6.)

Listing 9-1. Creating an Image

    >>> from pylab import *
    >>> figure()
    >>> gca().add_patch(Circle((0, 0), 1))
    >>> axis('off')
    >>> axis('scaled')
    >>> savefig('../images/circle.png')

The code in Listing 9-1 draws a filled circle and saves it to the file ../images/circle.png.

Tip: I’ve used matplotlib to create an image file to work with, but you could just as well use any image file, for example, a JPEG picture you took with your digital camera.
Our first operation is to read the file and attach it to an Image object:

    >>> im = Image.open('../images/circle.png')

Note: In accordance with the directory structure presented in Chapter 2, I assume that you’re currently running from directory Ch9/src and that directory Ch9/images holds image files. If that is not the case, be sure to change the path to the images in the examples provided in this chapter.

Image Attributes

Now that we have an Image object associated with an image file, we can query the object’s attributes:

    >>> im.size
    (815, 615)
    >>> im.info
    {'dpi': (100, 100)}
    >>> im.mode
    'RGBA'
    >>> im.format
    'PNG'
    >>> im.filename
    '../images/circle.png'

This is quite a bit of information. We know that the image size is 815 × 615 pixels (width by height), the resolution is 100 dpi, the mode is RGBA (we’ll get to modes a little later in the chapter in the section “Creating New Images”), the image format is PNG, and the file associated with the object we’ve created is ../images/circle.png.

Example: Image Catalog

My experience with analyzing image data is that images are not always taken in a consistent manner. This means that you, the programmer, have to manually crop, resize, enhance, or even delete images. This also translates into maintaining a catalog file of some sort. An approach I found helpful is creating an automated catalog file and then annotating information as I work with the data (see Chapter 4 for a discussion of catalog files).

The purpose of this example is to create a basic image catalog file. The script makes use of the Image attributes presented in the previous section and creates a CSV catalog file in the parent directory of the searched directory. The catalog file has the extension .cat.csv. That is, if you’re searching /home/user, a catalog file will be created named /home/user.cat.csv. The catalog file includes the name, size, format, and resolution of each image (see Listing 9-2).

Listing 9-2.
Creating an Image Catalog

    from PIL import Image
    import os, csv

    def image_catalog(srchpath):
        """Creates a catalog file named srchpath.cat.csv."""

        # the CSV header
        catalog = [['Filename', 'Pathname', 'Format', 'Size', 'Resolution']]

        # walk directory tree
        for root, dirs, files in os.walk(srchpath):
            for file in files:
                pathname = os.path.join(root, file)
                try:
                    img = Image.open(pathname)
                    filesize = os.path.getsize(pathname)
                    catalog.append([file, pathname, img.format,
                        img.size, img.info])
                except IOError:  # not an image
                    pass

        # create the clean catalog
        f = open(srchpath+'.cat.csv', 'wb')
        csv.writer(f).writerows(catalog)
        f.close()

The script defines a function named image_catalog(), which accepts the directory to search and produces an image catalog file in CSV format. The variable catalog is a list of rows containing image information. We iterate through the directory and look for images with the Easier to Ask Forgiveness than Permission (EAFP) approach: try to open a file as if it were an image file. In case of success, the catalog is updated. If the file is not an image, the exception IOError is raised, and we pass over this file.

Note: If your directory is supposed to contain strictly images, you might want to add a print statement before the pass, notifying the user that a nonimage file was encountered.

Here are the results I got from running the script in directory images (contents of file images.cat.csv):

    Filename,Pathname,Format,Size,Resolution
    nightsky1.png,../images/nightsky1.png,PNG,"(815, 615)","{'dpi': (100, 100)}"
    nightsky2.png,../images/nightsky2.png,PNG,"(815, 615)","{'dpi': (100, 100)}"
    circle.png,../images/circle.png,PNG,"(815, 615)","{'dpi': (100, 100)}"
    nightsky.png,../images/nightsky.png,PNG,"(815, 615)","{'dpi': (100, 100)}"
    collage.png,../images/collage.png,PNG,"(600, 600)",{}

Displaying Images

You can view an image by calling the Image method show(). The method in turn calls the operating system’s default image viewer, which is usually provided by the OS.
To use a different viewer from the one supplied by the OS, associate the image with the image viewer you desire. The following will display the image ../images/circle.png created previously:

    >>> Image.open('../images/circle.png').show()

Converting File Formats

One of the common operations we perform on images is converting the image file format. It’s important for a couple of reasons: because we want to store images in a more efficient format using compression, or because the application that accepts the image requires a different format.

The Image method save() enables saving an image to file in a specified image format. There are two ways to specify a format: with a file name extension and by explicitly specifying the format argument.

Assuming you’ve created an image file ../images/circle.png per the previous listing, you can read the image and convert the file format to a JPEG file format, as shown in Listing 9-3.

Listing 9-3. Converting PNG to JPEG

    >>> im = Image.open('../images/circle.png')
    >>> im.save('../images/circle.jpg')
    >>> import os
    >>> [fn for fn in os.listdir('../images') if fn.startswith('circle')]
    ['circle.jpg', 'circle.png']

In this particular example, you’re not really converting the file, but rather creating another file with a different image format (converting would mean that you also delete the original file).

Or, you could create a function to convert an image to JPEG format, as shown in Listing 9-4.

Listing 9-4. A Function to Convert an Image to JPEG Format

    from PIL import Image
    from os.path import splitext

    def ConvertToJpeg(filename):
        """Convert an image file to a Jpeg file."""
        jpegname = splitext(filename)[0]+'.jpg'
        Image.open(filename).save(jpegname)

I’ve made use of the splitext() function, which is part of the os.path module, to replace the original extension with a .jpg extension. The .jpg extension instructs the save() function to create a JPEG file.
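The extension swap can be tried on its own, without opening any image. The sketch below uses only os.path.splitext() from the standard library; the helper name jpeg_name() is hypothetical and simply isolates the first line of the function body shown in Listing 9-4.

```python
from os.path import splitext


def jpeg_name(filename):
    """Return filename with its extension replaced by '.jpg'."""
    return splitext(filename)[0] + '.jpg'


# splitext() returns a (root, extension) pair.
parts = splitext('../images/circle.png')
print(parts)

print(jpeg_name('../images/circle.png'))
print(jpeg_name('circle'))           # no extension: '.jpg' is appended
print(jpeg_name('archive.tar.gz'))   # only the last extension is replaced
```

Only the final extension is replaced, so a name with several dots keeps everything up to the last one.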
As mentioned, you can also explicitly specify a format:

    >>> im = Image.open('../images/circle.png')
    >>> im.save('../images/circle', format='Jpeg')
    >>> [fn for fn in os.listdir('../images') if fn.startswith('circle')]
    ['circle.png', 'circle']

In this case, save() does not add an extension to the file name (that is, the file created is circle, not circle.jpg).

PIL supports a large number of file formats. Most popular image formats can be read by the Image class. Furthermore, most images can be saved using known file formats including JPEG, TIFF, and PNG. However, some image formats can only be read. Other formats such as MPEG (video files) are supported in identify mode only. For a full account, refer to the PIL handbook: http://www.pythonware.com/library/pil/handbook/index.htm.

Example: A Function to Convert All Images in a Directory to JPEG Format

A direct continuation of the idea presented previously is to write a function that iterates through a directory and converts all images to JPEG format, as shown in Listing 9-5. We’ll keep the original image as well because JPEG uses a lossy compression algorithm, which might lower the original image quality. However, you can easily modify the example to remove the original images.

Listing 9-5. Converting All Images in a Directory to JPEG

    import os
    from PIL import Image

    def ConvertDirToJpeg(srchdir):
        """Converts all images in a directory to a jpeg file."""

        # walk directory tree
        for root, dirs, files in os.walk(srchdir):
            for file in files:
                # pathname holds the image file name
                pathname = os.path.join(root, file)
                try:
                    # convert the file to a JPEG file
                    img = Image.open(pathname)
                    jpegname = os.path.splitext(pathname)[0]+'.jpg'
                    if os.path.exists(jpegname):
                        print "Did not create %s; file already exists." % jpegname
                    else:
                        img.save(jpegname)
                        print "Created file "+jpegname
                except IOError:  # oops, not an image
                    pass

The script again makes use of the EAFP approach: try to open a file as an image and if all goes well, convert it to a JPEG image.
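The directory walk and the skip-if-it-exists decision can be exercised without PIL. The following is a sketch under that assumption: plan_jpeg_targets() is a hypothetical helper that only computes file names and never opens or writes an image, and the demo files are empty placeholders created in a temporary directory.

```python
import os
import tempfile


def plan_jpeg_targets(srchdir):
    """Yield (pathname, jpegname, exists) for every file under srchdir.

    Hypothetical helper: it only computes names and never opens an image.
    """
    for root, dirs, files in os.walk(srchdir):
        for name in sorted(files):
            pathname = os.path.join(root, name)
            jpegname = os.path.splitext(pathname)[0] + '.jpg'
            yield pathname, jpegname, os.path.exists(jpegname)


# Try it on a throwaway directory holding two empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ('a.png', 'b.jpg'):
    open(os.path.join(demo_dir, name), 'w').close()

plan = [(os.path.basename(p), os.path.basename(j), e)
        for p, j, e in plan_jpeg_targets(demo_dir)]
print(plan)

# os.walk() yields nothing for a directory that does not exist.
missing = list(plan_jpeg_targets(os.path.join(demo_dir, 'no-such-dir')))
print(missing)
```

A file that is already a .jpg maps onto itself, so its target always exists; with the same naming rule, a conversion function would report such a file as already existing rather than overwrite it.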
To run the function, enter ConvertDirToJpeg(dirname).

Note: If the function ConvertDirToJpeg() is called with a nonexistent directory, no output is generated, not even a warning message. If you require such functionality, be sure to modify the function and include it.

Image Manipulation

So now we can read images, display them, and convert file formats. But in converting file formats, we haven’t really changed the image; we merely saved it in a different format.

In this section we turn to performing basic image manipulations, that is, modifying the contents of an image: cutting and pasting, cropping, and rotating.

Creating New Images

PIL provides us with the ability to create images, not just read them from file. This is especially useful when you want to copy and paste images from other sources to a new image. The syntax for creating a new image is Image.new(mode, size, color=0). The mode argument can take one of the values listed in Table 9-1.

Table 9-1. Image Modes

Mode | Description
'1' | 1 bit per pixel; useful for black-and-white images.
'L' | 1 byte per pixel (values from 0 to 255), black and white; useful for working with one color band (see the discussion about color later in the chapter).
'RGB' | Red, green, and blue, 1 byte per color, also known as true color. RGB is common when the image background is black, such as on a screen monitor.
'RGBA' | Red, green, blue, and a transparency mask, 1 byte per color; common in several file formats including PNG.
'CMYK' | Cyan, magenta, yellow, and black, 1 byte per color. CMYK is common in print.

There are additional image modes, but I won’t be covering them in this chapter. To view the list of available modes, issue

    >>> from PIL import Image
    >>> Image.MODES
    ['1', 'CMYK', 'F', 'I', 'L', 'P', 'RGB', 'RGBA', 'RGBX', 'YCbCr']

Refer to the PIL web site for additional information: http://www.pythonware.com/products/pil/index.htm.
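As a rough illustration of what the modes in Table 9-1 imply, the sketch below computes the nominal uncompressed size of an image from its mode and its (width, height). The BITS_PER_PIXEL table and the nominal_size_bytes() function are hypothetical helpers derived from the table's descriptions; the figures are not necessarily how PIL lays out pixels in memory, nor the size of a compressed file on disk.

```python
# Bits per pixel as described in Table 9-1 (nominal, uncompressed).
BITS_PER_PIXEL = {'1': 1, 'L': 8, 'RGB': 24, 'RGBA': 32, 'CMYK': 32}


def nominal_size_bytes(mode, size):
    """Return the nominal uncompressed size, in bytes, of an image."""
    width, height = size
    bits = BITS_PER_PIXEL[mode] * width * height
    return (bits + 7) // 8   # round up to a whole number of bytes


sizes = {}
for mode in ('1', 'L', 'RGB', 'RGBA', 'CMYK'):
    sizes[mode] = nominal_size_bytes(mode, (800, 600))
    print(mode, sizes[mode])
```

For an 800 by 600 image, a one-band 'L' image needs 480,000 bytes by this count, and an 'RGBA' image four times as much.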
The size argument in the Image.new() function is a two-element tuple detailing the width and height of the image. The color argument is a function of the mode. For example, in the case of an RGB image, the color is a tuple in the form (red, green, blue); in the case of CMYK, the color takes the form (cyan, magenta, yellow, black).

    >>> im1 = Image.new('L', (800, 600))                    # black, one-band image
    >>> im2 = Image.new('CMYK', (800, 600), (0, 255, 0, 0)) # magenta image
    >>> im3 = Image.new('RGB', (800, 600), (255, 0, 0))     # red image

Copy and Paste

The methods copy() and paste() allow copying images and pasting images into other images, respectively. The method copy() requires no parameters and creates a copy of the current image. The method paste(im, xy) pastes the image im into the current image; the xy argument is a tuple indicating the (x, y) location to paste (top left). Let’s turn to an example that uses the paste() method.

Example: Fractal Collage

In this example we make use of the functions new(), open(), paste(), and save() to create a collage of images. To follow along, you’ll need to modify the fractal script presented in Chapter 7 in Listing 7-1 so that it’s a function instead of a script. The function should be named fractal(delta, res, iter) and return an Image object representing the fractal. Refer to the Appendix for a listing of the function. Once you create the function, save it under Ch9/src/fractal_func.py.

Armed with the function fractal(delta, res, iter), we create a fractal collage, as shown in Listing 9-6.

Listing 9-6.
A Collage of Fractals

    from PIL import Image

    # define the function fractal; assuming it's in file 'fractal_func.py'
    execfile('fractal_func.py')

    fsize = 200   # small fractal image width and height
    nx = 3        # number of images, width
    ny = 3        # number of images, height

    collage = Image.new("RGB", (fsize*nx, fsize*ny))
    for i in range(ny):
        for j in range(nx):
            im = fractal(0.000001, fsize, i*nx+j+1)
            print "Processing image %d of %d" % (i*nx+j+1, nx*ny)
            collage.paste(im, (fsize*j, fsize*i))
    collage.save('../images/collage.png')

The script generates fractals with increasing numbers of iterations and pastes them into an image that serves as the image collage. The arguments to the paste() method are chosen so that the images are pasted from top left to bottom right. I’ve saved the image to the file ../images/collage.png. The result from running this script is shown in Figure 7-1 in Chapter 7.

Crop and Resize

Cropping and resizing modify an existing image. The function crop() selects part of the original image, and the function resize() resizes an existing image, that is, scales it so it fits the new size.

Assuming you have run the previous collage example, you should now have a file named collage.png. The function crop() accepts a tuple of four values, detailing the box to crop: (x0, y0, x1, y1). Let’s read the collage.png file and crop it to show only 2 by 2 images from the fractal collage:

    >>> img = Image.open('../images/collage.png')
    >>> img.size
    (600, 600)
    >>> cropped_img = img.crop((0, 0, 400, 400))
    >>> cropped_img.show()

Suppose you want to show the entire image, but scaled to size (400, 400); in this case you’d use the resize(xy) function, where xy is a two-element tuple detailing the width and height of the resized image:

    >>> img = Image.open('../images/collage.png')
    >>> img.size
    (600, 600)
    >>> resized_img = img.resize((400, 400))
    >>> resized_img.show()
    >>> resized_img.size
    (400, 400)

You can also use the method thumbnail(), which is similar to resize().
The difference is that resize() returns a modified copy of the image, whereas thumbnail() modifies the image itself (and preserves the aspect ratio, so the result is never larger than the requested size).

    >>> img.thumbnail((400, 400))
    >>> img.size
    (400, 400)

In both the resize() and thumbnail() methods you can provide a filter argument that determines the method of resampling. The acceptable values are Image.NEAREST (default), Image.BILINEAR, Image.BICUBIC, and Image.ANTIALIAS (best quality). Antialiasing gives the best results but might take longer to compute:

    >>> img.thumbnail((400, 400), Image.ANTIALIAS)

Rotate

Last on our list of basic operations is the rotate() function. The function rotate(theta) rotates an image theta degrees. From a user’s perspective, rotating is a basic operation, for example, rotating a scanned document by a few degrees so it’s properly displayed. But in reality, rotation isn’t such a basic operation; it requires changing the width and height of the image. In the case of rotating by 90 degrees (or multiples), the rotate() function knows to swap the x-axis and the y-axis, but in the case of other rotation values, both axes change, so the total area of the image changes. You can control whether you want rotate() to expand the image so it includes the entire rotated image or not by passing expand=True or expand=False, respectively.

    >>> img = Image.new('RGB', (200, 300), (0, 0, 255))
    >>> img30 = img.rotate(30)
    >>> img30.show()
    >>> img30.size
    (200, 300)
    >>> img30e = img.rotate(30, expand=True)
    >>> img30e.show()
    >>> img30e.size
    (324, 360)

In the first line I’ve created a simple blue image that is 200 pixels wide and 300 pixels high. I’ve then rotated the image 30 degrees with and without expanding. The results are shown in Figure 9-1.

Figure 9-1. Rotated images: the left is not expanded, the right is expanded.

Image Annotation

Annotating images is just as important as annotating graphs. However, in some cases, annotating an image with text disrupts the pleasing visual result. That’s probably why it’s less common in pictures.
There’s also the issue of what color to choose. In cases where the picture is mostly white, you probably want to choose a nonwhite annotation.

In this section we’ll cover text annotation as well as geometrical shapes to highlight specific image features.

Annotating with Geometrical Shapes

PIL provides us with the ImageDraw object, which allows annotations of an existing image. To import the ImageDraw object, issue from PIL import ImageDraw. To use the ImageDraw object, attach it to an existing image:

    >>> from PIL import Image, ImageDraw
    >>> img = Image.new('RGB', (200, 300), (0, 0, 255))
    >>> draw = ImageDraw.Draw(img)

I’ve created an ImageDraw object named draw and attached it to the image img. Going forward, operations performed with the ImageDraw object will be performed on the Image object:

    >>> draw.line((100, 100, 200, 200))
    >>> img.show()

This will draw a line from (100, 100) to (200, 200).

You can use the functions in Table 9-2 to annotate an image. In the table, assume draw = ImageDraw.Draw(Image).

Table 9-2. Some ImageDraw Functions

Function | Description | Example
arc(xybox, start, end) | Draws an arc, a part of the circle bound by the rectangle xybox (tuple of four elements), starting at angle start and ending at angle end. | draw.arc((100, 100, 200, 200), 90, 180) will draw a quarter of a circle.
chord(xybox, start, end) | Similar to arc(), but also draws a line connecting the arc edges. | draw.chord((0, 0, 100, 100), 90, 180)
ellipse(xybox) | Draws an ellipse bound by the four-element tuple xybox. If you’d like a circle, use a square for the xybox values. | draw.ellipse((50, 50, 150, 100))
line(xyseq) | Draws lines connecting elements in the sequence xyseq. | draw.line([0, 0, 10, 10, 20, 10, 20, 20])
point(xy) | Draws a point at location xy. | draw.point((40, 40))
polygon(xyseq) | Draws a polygon connecting elements in the sequence xyseq. The difference from the line() function is that the polygon is always a closed shape, allowing the use of the fill argument. | draw.polygon([10, 20, 40, 40, 50, 30, 70, 80])
rectangle(xybox) | Draws a rectangle specified by the four-element tuple xybox. | draw.rectangle((20, 60, 80, 140), fill=128)

The ImageDraw annotation functions accept the following optional arguments: fill, which determines the color of the annotation or the fill of the object (similar to the facecolor argument in matplotlib); outline, which determines the color of the line used to draw the object (similar to the matplotlib edgecolor argument); and font, in the case of text annotations.

Text Annotations

Other than geometrical shapes, ImageDraw also provides text annotation with the function text(xy, string):

    >>> from PIL import Image, ImageDraw
    >>> img = Image.new('L', (160, 160), 255)
    >>> draw = ImageDraw.Draw(img)
    >>> draw.ellipse((0, 0, 160, 160), fill=128)
    >>> draw.text((80, 80), 'A long string')
    >>> img.show()

Originally, I had intended to have the text centered. However, the text string has width, so I require a method to calculate the width and height of the text in pixels. Once I have the width and height, I can draw the text at location (80 – width/2, 80 – height/2). This is done using the function textsize():

    >>> from PIL import Image, ImageDraw
    >>> img = Image.new('L', (160, 160), 255)
    >>> draw = ImageDraw.Draw(img)
    >>> draw.ellipse((0, 0, 160, 160), fill=128)
    >>> s = 'A long string'
    >>> width, height = draw.textsize(s)
    >>> width, height
    (78, 11)
    >>> draw.text((80-width/2, 80-height/2), s)
    >>> img.show()

Figure 9-2 shows the results with and without taking into consideration the string width and height.

Figure 9-2. Text annotation using text() and textsize()

Fonts

It’s also possible to use other fonts with the text() function. To do so, first create an ImageFont object. The ImageFont object is part of PIL, and to import it, issue from PIL import ImageFont. Once ImageFont is imported, you can use a font with a call to ImageFont.truetype(fontname,
The returned Ei]caBkjp object can be passed as an argument to the patp$% function by means of the bkjp argument. Of course, to be able to use fonts, they must first be installed in your system. Windows typically comes with built-in fonts; and on Linux fonts are usually installed with X (as well as other applications). You can also use fonts from the GNU FreeFont project (dppl6++sss*cjq* knc+okbps]na+bnaabkjp+). However, font and font names are different on varying systems, and not just different operating systems: my Windows system might have different fonts than your Windows sys- tem. This means calling Ei]caBkjp*pnqapula$bkjpj]ia(oeva% might work on one system and not on another. To overcome this problem, I use the function bej`bkjp$%, which is part of the matplotlib.font_manager module. The function bej`bkjp$% returns a string with the location of a font that best matches the requested font. The following script annotates text with the Vera font using the bej`bkjp$% function: :::bnkii]plhkphe^eilknpbkjp[i]j]can :::bnkiLEHeilknpEi]ca(Ei]ca@n]s(Ei]caBkjp :::eic9Ei]ca*jas$#H#($.1,(-,,%(-5.% :::`n]s9Ei]ca@n]s*@n]s$eic% :::bkjp[opn9bkjp[i]j]can*bej`bkjp$#Ran]#% :::bkjp[opn #+qon+he^+lupdkj.*1+oepa)l]_g]cao+i]plhkphe^+ilh)`]p]+bkjpo+ppb+Ran]*ppb# :::ppb9Ei]caBkjp*pnqapula$bkjp[opn(10% :::o9#=>?]^_# :::$s(d%9`n]s*patpoeva$o(bkjp9ppb% :::`n]s*patp$$$.1,)s%+.($-,,)d%+.%(#=>?]^_#(bkjp9ppb% :::eic*odks$% The first two statements import the proper objects from PIL as well as matplotlib’s font manager. I then create a one-band image of size (250, 100) followed by the instantiation of an Ei]ca@n]s object attached to the image. Next I use matplotlib’s bej`bkjp$% function to find a font that’s closest to the font Vera. The path to the font is stored in the string bkjp[opn. Following that I create an Ei]caBkjp object named ppb and use that font object to render the text. 
I then calculate the size of the text and render it in the middle of a gray background, as shown in Figure 9-3.

Figure 9-3. Font rendering

Note: To use a font, you must supply the font argument in calls to both functions text() and textsize().

Example: Thumbnail Index Image

In a previous example we created a catalog of images. While that catalog is quite useful, it doesn’t show the contents of those images. A more useful catalog perhaps would be a collage of the images annotated with text showing each image’s file name (see Listing 9-7).

Listing 9-7. Thumbnail Index Image

    # thumbnail index
    import os
    from PIL import Image, ImageDraw

    def thumbnail_index(dirpath):
        """Create a thumbnail index from images in dirpath."""

        num_images = 5
        thumb_size = (128, 96)
        cat_size = (num_images*thumb_size[0], num_images*thumb_size[1])
        fn_index = 0    # filename index
        img_index = 0   # image index

        # go through all the pictures in a directory
        for file in os.listdir(dirpath):
            # get the pathname for the file
            pathname = os.path.join(dirpath, file)
            try:  # is this an image file?
                # open the image file
                img = Image.open(pathname)
            except IOError:
                print file, "is not an image file"
                continue

            # create a thumbnail
            img.thumbnail(thumb_size, Image.ANTIALIAS)
            draw = ImageDraw.Draw(img)
            draw.text((2, 2), file)

            # do we need to create a new catalog image?
            if img_index == 0:
                thumbs_img = Image.new('RGB', cat_size)

            # calculate the location for this image
            x = img_index % num_images
            y = img_index // num_images

            # paste the thumbnail
            thumbs_img.paste(img, (x*thumb_size[0], y*thumb_size[1]))

            # increment the image index
            img_index += 1

            # have we reached the end of the catalog image?
            if img_index == num_images**2:
                img_index = 0
                thumbs_img.save('%s-%03d.cat.jpg' % (dirpath, fn_index))
                fn_index += 1

        # save the last catalog file
        if img_index:
            thumbs_img.save('%s-%03d.cat.jpg' % (dirpath, fn_index))

The function thumbnail_index() accepts a directory and produces a thumbnail index image. Figure 9-4 shows the result from running the function on a collection of images my daughter particularly likes.

Figure 9-4.
Thumbnail index image

For the purpose of this example, I decided not to use os.walk() to go through the directory, using os.listdir() instead. I've defined two parameters: num_images, which holds the number of images along each of the x- and y-axes, and thumb_size, which holds the thumbnail width and height. Next I go through the list of all the files in the requested directory. For every file, the script tries to open the file as if it were an image file. If the file is indeed an image file, a thumbnail of the image is created and pasted to the index image. Additionally, the thumbnail is annotated with the file name in the top-left corner. There's some indexing used to determine the exact location of an image in the thumbnail index image, as well as to create a new thumbnail index image once the current one has filled up.

Tip: An alternative approach to displaying the text directly on the thumbnail image is to display it below the image. This can be done by adding a black (or white) stripe between rows of images.

Image Processing

So far we've performed tasks that can also be performed by most GUI-based image editing applications such as GIMP, the GNU Image Manipulation Program (http://www.gimp.org/). However, GUI-based applications have a GUI user in mind and are not easily automated. We now turn to explore possibilities of writing scripts to automatically perform operations on images. Furthermore, as you start thinking about higher-level image processing algorithms, you may realize you require access to the actual data, the numbers that represent the image. In this section, we'll also show how this can be achieved.

A word of caution: image processing is a vast field. I won't be covering even the basics here, rather showing that if you do have an image processing algorithm, it's quite likely you can implement it in Python.
Matrix Representation and Colors

An image can be represented by a matrix, each (x, y) point corresponding to a column and row in the matrix, and the value corresponding to the color. The color value is a function of the mode (see Table 9-1 earlier for details). For example, in the case of an RGB image, each value of the matrix is a tuple of 3 bytes, each byte representing a different color. So in a sense, you can think of the entire image as three matrices, each matrix corresponding to one of the colors red, green, and blue, also known as color bands or channels.

Furthermore, each image can be split into these colors (depending on the mode, of course; there's no splitting of a 1-bit image into individual colors). This is done with the Image method split():

>>> im = Image.open('../images/circle.png')
>>> im.mode
'RGBA'
>>> R, G, B, A = im.split()
>>> R.mode, G.mode, B.mode, A.mode
('L', 'L', 'L', 'L')

Note: I've assumed you have followed along with the chapter and created a file named ../images/circle.png; if not, follow Listing 9-1 to create a circle.png image.

Each split image is an image by itself, but it now contains the information of only one color, hence its mode is 'L', not 'RGBA'. To retrieve the data associated with the color, that is, the actual values, call the function getdata(). We can then transform the values to a NumPy array for some interesting numerical processing. Continuing our previous listing:

>>> from pylab import *
>>> data = array(R.getdata())
>>> type(data)
<type 'numpy.ndarray'>
>>> data.size
501225
>>> data.shape
(501225,)

The image data is stored as a list of all the values in the image, row after row, not as a matrix representation of the image. To change it to a matrix, we use the NumPy method reshape(). Because im.size is (width, height) and the values are stored one image row at a time, the matrix has height rows and width columns:

>>> im.size
(815, 615)
>>> data = data.reshape(615, 815)  # (rows, columns) = (height, width)
>>> data.size   # size should be the same
501225
>>> data.shape  # reshaped as a matrix
(615, 815)

Now that we have the data as a matrix, we can operate on the matrix instead of the image. This gives us great flexibility.
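To make the row-by-row layout concrete, here is a minimal sketch that runs without PIL or NumPy. It assumes only what getdata() provides, a flat sequence with one image row following another; the helper names to_rows() and pixel_at() and the tiny 3-by-2 "image" are hypothetical and not part of PIL, and plain Python lists stand in for the NumPy array.

```python
def to_rows(flat, width, height):
    """Split a flat, row-by-row pixel sequence into a list of rows."""
    if len(flat) != width * height:
        raise ValueError("flat data does not match width*height")
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


def pixel_at(flat, width, x, y):
    """Value of the pixel in column x, row y of the flat sequence."""
    return flat[y * width + x]


# a hypothetical image that is 3 pixels wide and 2 pixels high
width, height = 3, 2
flat = [10, 11, 12,
        20, 21, 22]

matrix = to_rows(flat, width, height)

# the matrix is indexed [row][column], that is, [y][x]
value = matrix[1][2]
same_value = pixel_at(flat, width, 2, 1)
```

The matrix ends up with height rows and width columns, which is why the image size (width, height) has to be swapped when it is used as a matrix shape.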
Say we want to arbitrarily draw a magenta stripe in the middle of the circle; all that we need to do is modify the matrix associated with the red channel and then merge it to form a new, modified image.

Note: Why does changing the red channel generate a magenta output? The way I'm going to modify the matrix is by setting the red channel to 255. This means that my previously blue circle will now be a combination of blue and red, which is magenta, while the rest of the background, which is white, will remain white.

Let's do this a step at a time, from the top:

>>> from pylab import *
>>> from PIL import Image
>>> im = Image.open('../images/circle.png')
>>> im.mode
'RGBA'
>>> im.size
(815, 615)
>>> R, G, B, A = im.split()
>>> cols, rows = im.size
>>> data = array(R.getdata()).reshape(rows, cols)
>>> data[rows//2-100:rows//2+100, :] = 255*ones((200, cols))
>>> R.putdata(data.reshape(rows*cols))
>>> new_img = Image.merge('RGB', (R, G, B))
>>> new_img.show()

The first lines import the modules, read the image from file, and display some image information. I then split the image into four channels: red, green, blue, and the transparency mask. From here on, I'll restrict myself to dealing with the red channel only.

First, I retrieve the actual numerical values associated with the red channel. This is done with a call to getdata(). In the same line, I also transform the data into a NumPy array and reshape the list to matrix form, with one matrix row per image row (im.size is (width, height), so the matrix shape is (rows, cols)). Next I change the data values associated with 200 rows in the middle and set their value to 255. This in effect creates the magenta stripe. I then update the red channel with the modified data by calling the function putdata(). The function putdata() complements getdata() and expects a list, not a matrix, so I reshape the data back to a 1-D array.

Lastly, I create a new image, this time in RGB mode (I don't require transparency) by combining the original green and blue channels with the modified red channel.
This is done by calling the merge() function, which is the opposite of the split() function. Figure 9-5 shows the results.

Note: It's also possible to perform this task using the ImageDraw object.

Figure 9-5. Circle with a stripe

So far we've covered an interesting number of functions that enable working on images with numerical values: split(), merge(), getdata(), and putdata(). We'll use some of them in a more complex example.

Example: Counting Objects (Five Parts)

The following example is rather long and deals with an interesting aspect of image processing: counting objects in an image. The idea is to write a script that counts the number of elements in a picture. Counting elements is a complex task, even for the human mind: What objects should I count? What constitutes an object? And so on.

The task of counting objects is very useful in a wide variety of applications, as indicated by just a few examples:

• Biology: Estimating the number of bacteria in an image from a microscope
• Medicine: Counting the number of axons in a tissue cross-section
• Electronics board manufacturing: Counting the number of imperfections in a printed circuit board or counting the number of resistors
• Astronomy: Counting the number of stars

For the purpose of this example, we'll create an image of the sky at night, with stars placed randomly. We'll then write a script to count the number of stars. We'll have a very sterile image, one that has a very clean background (black, night sky) and most information in one channel. However, we'll add a bit of complexity by varying shapes and sizes of stars.

Once we have an image of the sky at night, I'll talk a bit about recursion, a topic I have been avoiding thus far. Recursion will be used to implement an algorithm to fill an image.

Lastly, I'll discuss some ideas and methods you could use to expand upon this example and add more capabilities to the algorithm.
Part 1: Twinkle, Twinkle, Little Star

First, we create the stars for our image of the sky at night. The night sky will be composed of white stars and a black background. Since we want the stars to be of varying sizes and shapes, we'll define a function named star() that creates a matplotlib patch object (see Listing 9-8).

Listing 9-8. A Star Patch, Source of star_patch.py

    # create a star patch object
    from pylab import *

    def star(R, x0, y0, color='w', N=5, thin=0.5):
        """Returns an N-pointed star of size R at (x0, y0) (matplotlib patch)."""

        polystar = zeros((2*N, 2))
        for i in range(2*N):
            angle = i*pi/N
            r = R*(1-thin*(i%2))
            polystar[i] = [r*cos(angle)+x0, r*sin(angle)+y0]
        return Polygon(polystar, fc=color, ec=color)

The values that control the star patch are R, which determines the star's radius; x0 and y0, which control the star's location; color, which determines both the fill and edge color; N, which controls the number of pointy edges a star has; and thin, which controls how thin or thick a star is (on the range of 0 to 1, 1 being very thin).

Tip: The default star patch is white because we'll be using it for the night sky. Be sure to change it to a different color if you're using a white background.

I've used the Polygon object to create the star patch, with some mathematical trickery. The idea is this: I place N pointy edges on a circle of radius R with the center at (x0, y0) at fixed angle increments. I then place another set of points at a smaller radius to serve as the inner edges of the star, again at fixed angle increments but shifted so that each inner point resides exactly in the middle of the outer edge's points. The thin parameter determines the radius of the inner circle: the larger the value, the smaller the radius, and the "thinner" the star is. Lastly, I draw a line connecting all these points using the Polygon patch object.

Note: Be sure to save the star patch listing as file star_patch.py; we'll use it in future scripts.
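To check the geometry without opening a figure, the vertex calculation can be sketched on its own. This is only an illustration that mirrors the arithmetic of Listing 9-8: the function name star_vertices() is hypothetical, it uses the standard math module instead of pylab, and it returns plain (x, y) tuples rather than a matplotlib patch.

```python
from math import cos, sin, pi, hypot


def star_vertices(R, x0, y0, N=5, thin=0.5):
    """Return the 2*N (x, y) vertices of an N-pointed star."""
    vertices = []
    for i in range(2 * N):
        angle = i * pi / N
        # even vertices are tips (radius R); odd ones are inner corners
        r = R * (1 - thin * (i % 2))
        vertices.append((r * cos(angle) + x0, r * sin(angle) + y0))
    return vertices


pts = star_vertices(10, 0, 0, N=5, thin=0.5)

# distance of every vertex from the center, rounded to hide float noise
radii = [round(hypot(x, y), 6) for x, y in pts]
```

The radii alternate between R for the tips and R*(1-thin) for the inner corners, which is the pattern the Polygon patch connects into a star.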
USING A LIST COMPREHENSION

It's also possible to implement the star patch with list comprehensions. The idea is to zip together the elements in the polygon list:

    def another_star(R, x0, y0, color='w', N=5, thin=0.5):
        """Returns an N-pointed star of size R at (x0, y0) (matplotlib patch)."""

        a = arange(0, 2*pi, 2*pi/N)
        r = (1-thin)*R  # inner radius, as in star()
        polystar = array(zip(R*cos(a)+x0, R*sin(a)+y0, \
            r*cos(a+pi/N)+x0, r*sin(a+pi/N)+y0))
        return Polygon(polystar.reshape(N*2, 2), fc=color, ec=color)

Some would prefer this implementation over the previous implementation. Personally, I think both are fine; choose whichever is easier for you to follow. It's also possible to code the entire function as a single return statement, but I strongly recommend against it, as the code would be hard to understand.

The script in Listing 9-9 generates some interesting stars.

Listing 9-9. Generating Some Interesting Stars

    # show some star examples
    from pylab import *

    # ensure the star patch is defined properly
    execfile('star_patch.py')

    examples = [
        "star(10, 0, 0, 'k')",
        "star(10, 0, 0, 'k', 10)",
        "star(10, 0, 0, 'k', 5, 0.2)",
        "star(10, 0, 0, 'k', 3, 0.9)"]
    for i, example in enumerate(examples):
        subplot(2, 2, i+1)
        exec "new_star=" + example
        gca().add_patch(new_star)
        title(example)
        axis('scaled')
        axis([-10, 10, -10, 10])

In this script I've decided to iterate over a list of strings and use the exec statement. The same string used for the exec statement is also used to create the title for the subplots (see Figure 9-6).

Figure 9-6. Some star patches

Tip: To ensure a patch is displayed, issue a call to axis('normal') (or similar) to force a refresh of the figure.

There's room for additional work on the star() patch object; for example, you could add a rotation parameter, rotating the entire star by rotation degrees. This can be done by changing the argument to the functions sin() and cos() in the star() function. Another modification could include a hollow star, implemented by splitting the edgecolor and facecolor functionality.
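One way the rotation idea could look is sketched below. It is an assumption about how you might extend the function, not part of the star patch listing: the name rotated_star_vertices() and the rotation parameter (in degrees, counterclockwise) are hypothetical, and the function returns vertex tuples rather than a patch so it can run with the standard library only.

```python
from math import cos, sin, pi, radians


def rotated_star_vertices(R, x0, y0, N=5, thin=0.5, rotation=0.0):
    """Vertices of an N-pointed star rotated by `rotation` degrees."""
    offset = radians(rotation)
    vertices = []
    for i in range(2 * N):
        # the rotation is simply added to the argument of cos() and sin()
        angle = i * pi / N + offset
        r = R * (1 - thin * (i % 2))
        vertices.append((r * cos(angle) + x0, r * sin(angle) + y0))
    return vertices


# without rotation the first tip points along the x-axis;
# rotating by 90 degrees makes it point straight up
tip = rotated_star_vertices(10, 0, 0, rotation=90)[0]
```

The resulting list can be passed to a Polygon patch in the same way as the unrotated vertices.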
Armed with the star patch, we turn to part 2 of this example: creating an image of the sky at night.

Part 2: The Sky at Night

To create a simulated image of the night sky, we use the script in Listing 9-10.

Listing 9-10. Creating an Image of the Sky at Night, nightsky.py

    # create a fictitious night sky
    from random import randrange as rr
    execfile('star_patch.py')

    # parameters for the simulated night sky image
    img_size = 800
    num_stars = 25

    # star parameters: number of pointy edges and radius
    min_num_points = 5
    max_num_points = 11
    min_star_radius = 2
    max_star_radius = 10

    # star parameter 'thinness' is on a scale of 1 to 10
    min_thin = 5
    max_thin = 9

    # draw the night sky
    figure(facecolor='k')
    cur_axes = gca()

    # patch stars
    for i in range(num_stars):
        new_star = star(rr(min_star_radius, max_star_radius),
            rr(0, img_size), rr(0, img_size), 'w', \
            rr(min_num_points, max_num_points), \
            rr(min_thin, max_thin)/10.0)
        cur_axes.add_patch(new_star)

    # modify axis behavior
    axis([0, img_size, 0, img_size])
    axis('scaled')
    axis('off')
    savefig('../images/nightsky', facecolor='k', edgecolor='k')

I've imported the function randrange() from the module random and decided to rename it to rr(), which I think is clearer to read. I then define a set of values you can tweak and observe the results. The values are self-explanatory and include such values as the image size and number of stars in the image.

The patching of stars is done in the for loop, which creates a new star with random values and adds it to the current figure. I then follow up by updating the image size and removing the axes. Finally, I save the image to file. Figure 9-7 shows random output from the simulated night sky.

Figure 9-7. Simulated (random) night sky

Part 3: Flood Fill and Recursion

We now turn to something completely different: recursion. Recursion describes a scenario where a function calls itself. Some known recursion algorithms implement the factorial operation and Fibonacci sequences.
We'll use recursion for image processing, specifically to fill an image using a flood-fill algorithm.

Flood fill, sometimes also referred to as bucket fill, is an algorithm to fill a closed area of a specific color with a different color. This is a quite common operation in most image processing applications. Kids love to use it to paint digital coloring images. To implement flood fill, we'll use recursion.

In the implementation, we'll assume that the image to fill is given to us as a NumPy array, more specifically as a 2-D array (i.e., a matrix). It's also possible to manipulate a PIL Image object, but I prefer using a matrix for two reasons:

• It's more generic; I can port the flood-fill algorithm to other objects as long as I can convert the objects to a NumPy array (matrix).
• It's easier to view the code by indexing over matrix elements than to use the getpixel() and putpixel() methods provided by the Image object.

So how does flood fill work? Flood fill starts by receiving a point to start filling from. If the point is the color to be converted, flood fill will change the color to the desired color. It then moves to a point adjacent to it, say to the right, and calls itself. As the process continues, points to the right will start filling up with the new color. If the point to the right is not in the desired color (that is, shouldn't be painted), the point to the top is checked, and the process resumes. This process is repeated for left and bottom points surrounding each point. The end result is a filled, closed object.

FLOOD FILL AND MINESWEEPER

The flood-fill algorithm can also be used in the coding of the game Minesweeper (shipped with Windows). You can use the algorithm to expand an area and reveal points adjacent to mines.
The algorithm will follow a similar path, and one option would be to create a matrix of values corresponding to whether a square is empty (value 0) or adjacent to a mine (value equal to the number of mines it is adjacent to), with a different value indicating a mine (say, value -1). When the user clicks on a square, the flood-fill algorithm kicks in and decides how many squares to reveal. If you're not familiar with Minesweeper, I suggest you refrain from trying it; the game is addictive.

Listing 9-11 presents a simple flood-fill implementation.

Listing 9-11. Flood-Fill Implementation Using Recursion, flood_fill.py

    from numpy import *
    from sys import getrecursionlimit

    def flood_fill(x, y, m, total):
        """A function to flood fill an image (matrix)."""

        if total > getrecursionlimit():
            return total

        # nothing to fill
        if m[x, y] != 255:
            return total

        m[x, y] = 128

        if (x-1 >= 0):
            total = flood_fill(x-1, y, m, total+1)
        if (x+1 <= m.shape[0]-1):
            total = flood_fill(x+1, y, m, total+1)
        if (y-1 >= 0):
            total = flood_fill(x, y-1, m, total+1)
        if (y+1 <= m.shape[1]-1):
            total = flood_fill(x, y+1, m, total+1)
        return total+1

The function flood_fill() is an implementation of the flood-fill algorithm described previously. The recursion actually happens in the four calls to flood_fill() inside the function body (the function calling itself). The function accepts the values x and y, denoting the point to fill; m, which is the NumPy matrix; and total, which is a running count used to keep track of how far the recursion has gone (i.e., how many times the function has called itself so far).

I've chosen to fill all values corresponding to 255 with 128. In essence this means that if the object is fully red (or green or blue, depending on the band selected), it will be changed to "half" red. You can modify the function flood_fill() to accept an original color and a new color as parameters; I chose not to do so, as I think the code looks clearer that way.

Every time a function is called in a recursion, additional memory is consumed.
Python limits the recursion depth with the value sys.getrecursionlimit(). If the running code exceeds this limit, a recursion exception is raised. It's possible to increase this number by calling sys.setrecursionlimit(), but that's only a small fix; inevitably, you'll reach a memory limit, which might cause a system crash. Therefore, it's best if your code can detect these events beforehand and alert the user if such an event transpired. I have chosen to do so by returning the value total. In case total is greater than the maximum recursion depth, I can notify the user of the event.

It's also important to note that if your night sky image gets larger or the size of the stars gets larger (e.g., a larger radius), or if you save the image at a higher resolution (more points per star), inevitably you will hit a recursion limit because the areas to fill get larger and larger. So while this is a viable option to fill objects, maybe a different algorithm should be employed for production-level code, such as using ImageDraw's floodfill() function.

USING IMAGEDRAW FOR FLOOD FILL

The ImageDraw module also provides a floodfill() function, which may be used for the algorithm presented here. There are several reasons I chose to implement flood_fill() instead of using the ImageDraw function:

• I wanted to talk about recursion.
• ImageDraw's floodfill() doesn't return information such as the size of the filled region, which can be used to enhance the algorithm. That being said, it's quite possible to use other methods to complement this, such as comparing the image before and after flood filling it.
• I wanted to show you how to tweak flood fill, for example, to include diagonal cells as adjacent cells (and not just up, down, left, and right).

So now that we have the flood_fill() function, how does that help us count the number of stars at night?

Part 4: Counting Objects

Counting objects is really easy, once you have an implementation of flood fill (see Listing 9-12).
The idea is simple: go through every point in your image and fill it. The return value from flood fill indicates how much was filled and grows with the number of points filled. In case there was nothing to fill, the value will be zero, but in case flood fill fills an object, the return value will be nonzero, which indicates that flood fill found and filled an object. Future calls to flood_fill() for that pixel will not fill the object, as it is already filled. Now all that's required is to count the number of times flood fill returns a nonzero value, and you have the number of objects!

Listing 9-12. Counting Objects in a Picture

    from pylab import *
    from PIL import Image
    from sys import getrecursionlimit
    execfile('flood_fill.py')

    # read the image
    im = Image.open('../images/nightsky.png')

    # split the image into individual bands
    cols, rows = im.size
    R, G, B, A = im.split()

    # retrieve the data from the red band as a matrix (array)
    data = array(R.getdata())

    # set all values that are nonzero to 255
    # (could be due to antialiasing)
    data[find(data != 0)] = 255
    data = data.reshape(rows, cols)

    # count the stars
    count, recursion_limit_reached = 0, 0
    for i in range(rows):
        for j in range(cols):
            tot = flood_fill(i, j, data, 0)
            if tot > getrecursionlimit():
                recursion_limit_reached += 1
            elif tot > 0:
                count += 1

    if recursion_limit_reached:
        print "Recursion limit reached %d times" % recursion_limit_reached
    print "I counted %d stars" % count
    R.putdata(data.reshape(cols*rows))
    Image.merge('RGB', (R, G, B)).show()

The script is an implementation of the preceding algorithm. We start by importing the proper modules and calling the script flood_fill.py, which contains the flood_fill() function implementation.

Next we open the image of the sky at night and split it into individual color bands. I decided to work strictly on the red band, but in reality, because we are dealing with black-and-white pictures, I could just as well have chosen any other channel (other than the transparency). I then access the data and convert it to a NumPy array. This is done in the call data=array(R.getdata()).
Next I implement a simple threshold. What I do is change all nonblack values to white by setting all nonzero values to 255. Other algorithms use a different approach such as setting all values above and including 128 to 255 and all values below 128 to 0. In this particular case, the results would be very similar. Notice that I run the threshold before I reshape the matrix, the reason being that I can use the find() function quite easily that way. I then reshape the image into a matrix.

Up to this point I've been dealing with reading the image, splitting it, and applying a threshold to the image. Now, I turn to using the flood_fill() function. I go through every pixel in the matrix and call the function flood_fill(). If the return value is greater than the recursion limit, I increment the number of times a recursion limit has been reached. If the return has not reached the recursion limit and is nonzero, I increment the count of objects.

Lastly, I report my results: the number of recursions that exceeded the maximum allowed value (for debugging purposes more than anything) and the number of stars counted. Here's a result from running the script on the night sky image presented earlier:

I counted 23 stars

(We'll get to why that number is not 25 in the next section.)

To be sure I've counted all the stars in the night sky picture, and furthermore, that I did not accidentally count objects that are not stars, I decided on some sort of visual feedback of the result. I do this in the last two lines of the script: I convert the matrix data back to the red channel and construct a new image with the newly modified red channel.

If you look closely at the newly constructed image (zoom in if you'd like), you'll see that at times the edges around stars are not filled properly. I believe that the reason for this is the quantization effect we've used (threshold) that modifies all half-values to black.
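Since the recursive version is bound by the recursion limit, it may help to see the same counting idea without recursion. The following is a minimal sketch, not the book's listing: it swaps the recursion for an explicit stack, works on a plain list of lists rather than a NumPy matrix, and the names flood_fill_iter() and count_objects() as well as the tiny sample "sky" are hypothetical.

```python
def flood_fill_iter(m, x, y, old=255, new=128):
    """Fill the 4-connected region of `old` values that contains (x, y).

    m is a list of rows; returns the number of cells that were filled.
    """
    if m[x][y] != old:
        return 0
    rows, cols = len(m), len(m[0])
    stack = [(x, y)]
    m[x][y] = new
    filled = 0
    while stack:
        i, j = stack.pop()
        filled += 1
        for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= a < rows and 0 <= b < cols and m[a][b] == old:
                m[a][b] = new  # mark when pushed so each cell is queued once
                stack.append((a, b))
    return filled


def count_objects(m):
    """Return the size of every object found, in scan order."""
    sizes = []
    for i in range(len(m)):
        for j in range(len(m[0])):
            n = flood_fill_iter(m, i, j)
            if n:
                sizes.append(n)
    return sizes


# a hypothetical thresholded "sky": 255 marks a star pixel
sky = [[0, 255, 255, 0, 0],
       [0, 255, 0, 0, 255],
       [0, 0, 0, 0, 255],
       [255, 0, 0, 0, 0]]
sizes = count_objects(sky)
num_stars = len(sizes)
```

Here the return value is the exact number of filled cells, so the list of sizes can be used directly for filtering out small specks or finding the largest object.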
Part 5: Optimizing the Algorithm

So why did the algorithm return 23 stars and not 25? (See the num_stars value in Listing 9-10.) A plausible reason is that several stars overlapped. This would cause the algorithm to combine several objects into one. In real pictures (nonsterile, unlike those presented in the example), there could be other reasons, and this is where you can start tweaking your image processing algorithm.

But as you start working with "real" data, you'll find that sometimes the opposite happens, that is, the algorithm counts more objects than there really are. The reason for this could be that the images are not ideal, and even small specks, or noise, could throw off the number count. In that case, a possible solution would be to count only elements whose size is greater than a fixed value, that is, reading the value returned by flood_fill() and discarding objects whose size is too small. Another option would be to preprocess the image using a filter (see the section "Image Filtering" later in this chapter).

Another improvement to the algorithm could be giving it the capability to find the largest object. Again, this is quite possible by reading the value returned from the function flood_fill() and then sorting the results or finding the maximum.

And you can also try to evaluate the luminosity of the night sky, by summing the areas of all the stars. This might be used to estimate how clear the skies are or, in the case of a microscopic image, help determine whether the size of a bacteria colony has changed.

Some real images might have objects so small that you'll need to think about flood filling diagonals as well. That is, consider the character "x" drawn on a 3×3 pixel grid: there's no pixel that's adjacent to another unless you count diagonals. Modifying flood fill to include diagonals will combine the pixels that make up this "x" into one object.
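The effect of including diagonals, and of discarding small objects, can be seen on the 3-by-3 "x" itself. This is a minimal sketch under stated assumptions: it uses a stack-based fill on a plain list of lists instead of the recursive flood_fill() on a NumPy matrix, and the names object_sizes(), FOUR, and EIGHT and the minimum size of 2 are hypothetical choices for illustration.

```python
# neighbor offsets: up, down, left, right, and optionally the diagonals
FOUR = ((-1, 0), (1, 0), (0, -1), (0, 1))
EIGHT = FOUR + ((-1, -1), (-1, 1), (1, -1), (1, 1))


def object_sizes(grid, neighbors=FOUR, on=255):
    """Return the size of every connected object of `on` cells in grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] != on or seen[i][j]:
                continue
            seen[i][j] = True
            stack, size = [(i, j)], 0
            while stack:
                a, b = stack.pop()
                size += 1
                for da, db in neighbors:
                    p, q = a + da, b + db
                    if (0 <= p < rows and 0 <= q < cols
                            and grid[p][q] == on and not seen[p][q]):
                        seen[p][q] = True
                        stack.append((p, q))
            sizes.append(size)
    return sizes


# the character "x" on a 3-by-3 pixel grid
x_shape = [[255, 0, 255],
           [0, 255, 0],
           [255, 0, 255]]

four = object_sizes(x_shape, FOUR)    # diagonals ignored
eight = object_sizes(x_shape, EIGHT)  # diagonals count as adjacent

# size filter: keep only objects of at least 2 pixels
big_only = [s for s in four if s >= 2]
```

With four neighbors the "x" falls apart into five one-pixel objects (all of which the size filter would discard as specks); with eight neighbors it is counted as a single object of five pixels.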
The point of the matter, now that data is accessible as a NumPy matrix, is that you can implement whatever algorithm or image processing idea you might have. But on many occasions, you don't have to resort to the matrix level; PIL provides a good number of support functions.

Image Arithmetic

PIL provides a set of arithmetic operations via the module ImageChops (Chops is short for channel operations). In the night sky example, some people would prefer working on a white background (which could save quite a bit of ink if you're printing the images). Per the previous section, you could transfer the image to a NumPy array and then convert it, but in such a simple case, it makes more sense to use the ImageChops invert() function:

    # display an image and its inverse
    from PIL import Image, ImageChops
    im = Image.open('../images/nightsky.png')
    new_img = Image.new('RGB', (im.size[0]*2, im.size[1]))
    new_img.paste(im, (0, 0))
    new_img.paste(ImageChops.invert(im), (im.size[0], 0))
    new_img.show()

In Figure 9-8, I've used the image generated by the script nightsky.py (Listing 9-10) with num_stars=10, min_star_radius=10, and max_star_radius=30 to show a more pronounced effect of the image inversion.

Figure 9-8. Inverting an image: the original is on the left, and the inverted image is on the right.

Table 9-3 lists some additional ImageChops operations. Notice that ImageChops operations operate only on one-channel (L) or RGB images.

Table 9-3. Some ImageChops Operations

Function
    Description

add(img1, img2, scale=1.0, offset=0)
    Adds two images as follows: (img1+img2)/scale+offset. The default values of scale and offset mean a simple addition.

constant(img1, value)
    Returns an image of size img1 filled with color value.

darker(img1, img2)
    Returns an image with the darker pixel from both images. This is a minimum of the two images, on a pixel-by-pixel level.

difference(img1, img2)
    Returns the absolute difference of two images. This is abs(img1-img2), on a pixel-by-pixel level.
lighter(img1, img2)
    Returns an image with the lighter pixel from both images. This is a maximum of the two images, on a pixel-by-pixel level.

subtract(img1, img2, scale=1.0, offset=0)
    Subtracts two images as follows: (img1-img2)/scale+offset. The default values of scale and offset mean a simple subtraction.

There are additional functions available in ImageChops; check out either help(ImageChops) or the PIL web site (http://www.pythonware.com/library/pil/handbook/imagechops.htm).

You can create some interesting effects using these simple operations. And these effects can in turn be used for some fast image processing algorithms. Listing 9-13 presents a script that makes use of the lighter() method on two night sky images. To follow along, run the nightsky.py script and rename the generated file images/nightsky.png to images/nightsky1.png; do it again, this time renaming the generated image to images/nightsky2.png.

Listing 9-13. Using lighter() on Two Images

    from PIL import Image, ImageDraw, ImageFont, ImageChops
    from matplotlib import font_manager

    # read the images
    img1 = Image.open('../images/nightsky1.png')
    img2 = Image.open('../images/nightsky2.png')

    # create a new image, made of the lighter of the two
    img3 = ImageChops.lighter(img1, img2)

    # create a collage of three images
    width, height = img1.size
    delta = 10
    img = Image.new('RGB', (width*2+delta, height*2+delta), (255, 255, 255))
    img.paste(img1, (0, 0))
    img.paste(img2, (width+delta, 0))
    img.paste(img3, ((width+delta)/2, height+delta))

    # annotate the images with text
    font_str = font_manager.findfont('Vera')
    ttf = ImageFont.truetype(font_str, 54)
    draw = ImageDraw.Draw(img)
    draw.text((delta, delta), 'Night Sky (1)', fill='white', font=ttf)
    draw.text((delta*2+width, delta), 'Night Sky (2)', fill='white', font=ttf)
    draw.text(((width+delta)/2+delta, height+delta*2), \
        'Combined', fill='white', font=ttf)

    # display the final image
    img.show()

I've made a collage and separated the images with a white band. Figure 9-9 shows the result.

Figure 9-9.
Using lighter()

It's interesting to note that in this specific case, using the function add() would have resulted in a similar image.

Image Filtering

Most GUI-based image processing applications come with a bundle of image filters. There's a wide variety of filters available, and different applications group them into different categories. Some of the common filtering categories are blur, enhancement, edge detection, and more.

From an image processing standpoint, image filters are known operations that help us achieve a specific effect. For example, I once used the counting objects algorithm presented as an example in this book to try to count the number of bubbles in a printed circuit board soaked in water. As you probably realize, pictures obtained from a real-life image are not as sterile as those presented in the sky at night example. And so prior to using the algorithm, I had to clean up the images. By "clean up" I mean filtering the image using known filters. I ended up using a threshold combined with a median filter, and then converting the image to a 1-bit (black-and-white) version prior to running the algorithm.

The following text assumes you have some background in image filtering. If not, my suggestion is that you experiment with a GUI application such as GIMP to get a feel for what filters to use and how they can help you with basic image processing. Once you have the preprocessing figured out, that is, you know what filters you want to run on your image prior to the final algorithm, you can implement the filters with a Python script that makes use of PIL filters. (You might not even require a final algorithm if you select the proper filters.)

PIL provides us with the class ImageFilter, which supports a good number of filters. To use ImageFilter, import it as follows: from PIL import ImageFilter (or simply import ImageFilter).
Once you've imported ImageFilter, call the filter() method that's part of the Image object (not the ImageFilter object) to filter an image:

>>> from PIL import Image, ImageChops, ImageFilter
>>> img = Image.open('../images/nightsky.png')
>>> inv_img = ImageChops.invert(img)
>>> fil_img = inv_img.filter(ImageFilter.MinFilter(15))

In the preceding example, I've used the night sky images you've seen before and inverted the output so as to work on black stars over a white background. I then filtered the image using a MinFilter filter (see Figure 9-10). The MinFilter works on a pixel-by-pixel level. For every pixel, it returns the minimum pixel from the square of size n (in the example, 15) centered on the given pixel. As you can see, even from this small example, there's quite a bit to be gained by working with image filters.

Figure 9-10. Filtering an image: left is the original, and right shows the image filtered with a MinFilter set to 15.

ImageFilter provides fixed image enhancement filters easily distinguishable due to their capitalized names:

>>> [filt for filt in dir(ImageFilter) if filt.isupper()]
['BLUR', 'CONTOUR', 'DETAIL', 'EDGE_ENHANCE', 'EDGE_ENHANCE_MORE', 'EMBOSS',
'FIND_EDGES', 'SHARPEN', 'SMOOTH', 'SMOOTH_MORE']

By the term "fixed" I mean that they accept no parameters. To use these filters, call the filter() method with the fixed filter, as follows:

>>> new_img = img.filter(ImageFilter.CONTOUR)

The names of these filters should provide direction as to what they perform.

ImageFilter also provides nonfixed filters (i.e., filters that accept parameters). Table 9-4 lists some additional filters supported by the ImageFilter object.

Table 9-4. Some Image Filters

Function
    Description

MaxFilter(size=3)
    For every pixel in the original image, returns the pixel with the maximum value from a square of width size placed around the original pixel. size must be odd (3, 5, 7, . . . ).
Function: MedianFilter(size=3)
Description: For every pixel in the original image, returns the median pixel from a square of width size placed around the original pixel. size must be odd (3, 5, 7, ...).

Function: MinFilter(size=3)
Description: For every pixel in the original image, returns the pixel with the minimum value from a square of width size placed around the original pixel. size must be odd (3, 5, 7, ...).

Function: ModeFilter(size=3)
Description: For every pixel in the original image, returns the most common pixel from a square of width size placed around the original pixel. size must be odd (3, 5, 7, ...).

Final Notes and References

Image processing is a large field and is gaining more and more popularity as computers increase in performance. And image processing is only two dimensional; nowadays we see more and more 3-D data processing as well, including video.

Armed with Python, the Python Imaging Library, and NumPy, even complex image processing tasks can be prototyped. However, image processing requires a great deal of memory and processing power; as you work with images you'll realize you may require faster tools, and you may even need to port parts of your code to a lower-level programming language such as C to gain performance. Nevertheless, Python is an excellent prototyping environment; it provides fast responses in an interactive environment and can help you define your image processing algorithm.

Additional information can be found at the following sites:

- The Python Imaging Library, http://www.pythonware.com/library/pil/handbook/
- GIMP, http://www.gimp.org/docs/

CHAPTER 10

Advanced File Processing

More on Files

A common task of programmers is working with files: not merely reading and writing files, but also organizing them, moving them around, deleting, compressing, archiving, and more. I often find myself borrowing code from my previous projects, especially code that deals with reading and parsing files, typically via copy and paste.
But that seems such a waste. Why not come up with a library of functions that addresses most people's needs? Other programmers must have felt the same way, and so they turned to writing modules, libraries of functions to perform these tasks. And many of these are now included with the Python Standard Library; more are added on a regular basis.

This chapter expands on ideas discussed in Chapters 4 and 5 and examines additional file-related topics. I also build on some examples from previous chapters as a way to introduce new topics and create more reusable code.

Binary Files and Random Access

The term binary file describes a nontext file: executable files, image files, or simply data files. In this section we'll show some methods of dealing with binary data files.

Working with text files, we've used readline() or read(n) to read chunks of data from a file. The function readline(), in a sense, splits the file into smaller chunks of data (i.e., lines of text). With binary files, it's more common to see random access, that is, arbitrary reading of chunks of data from anywhere in the file. With text files, this is a bit harder because you don't know in advance how many characters and words are in a line, so randomly picking the nth line is not a trivial task. With binary files composed of fixed-length records, random access allows access to an arbitrary field.

The methods seek(offset[, whence]) and tell() are random-access file functions. To better understand what these functions do, you need to understand the concept of file pointers. A file pointer points to a location in the file: subsequent read or write operations will happen at that specific location (assuming the file was opened in a mode that allows random access read and/or write). Whenever we read or write data from the file, the file pointer is incremented accordingly.
The function seek() sets the file pointer to a value of our choosing; subsequent calls to read() will pick up from the newly "seeked" location. The function tell() returns the current file pointer value in bytes. Here's a short interactive Python session describing the workings of seek() and tell():

    >>> f = open('../data/example.bin', 'wb')
    >>> f.tell()
    0L
    >>> f.write('0123456789')
    >>> f.tell()
    10L

Note: As in previous chapters, I assume you're running an interactive Python session in directory Ch10/src and that directory Ch10/data exists.

I've created a binary file. Once created, the file pointer associated with the file is set to 0, as shown by the result from f.tell(). After writing ten values, the file pointer is at 10.

    >>> f.seek(5)
    >>> f.write('0123456789')
    >>> f.tell()
    15L
    >>> f.close()

I've changed the file pointer to point to location 5 and wrote the same ten values again. As a result, the file pointer has changed to 15. Let's print the contents of the file:

    >>> open('../data/example.bin', 'rb').read()
    '012340123456789'

As expected, the result is the string '0123456789' overwritten from location 5 onward by another copy of the same string.

The argument whence in the function seek(offset[, whence]) instructs how the offset should be calculated. If whence is 0 (the default), seek() moves offset bytes relative to the start of the file. If whence is 1, seek() moves the file pointer offset bytes relative to the current location; both negative and positive values are allowed. If whence is 2, seek() moves the file pointer offset bytes relative to the end of the file. Notice that in order to change the file pointer to n bytes before the end of the file, you pass a negative value as the offset. On many systems, it's possible to seek past the end of the file (which is a feature, not a bug, as you'll soon see). Negative values for seek() are not allowed if whence is 0.
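The three ways of calculating the offset can be sketched in a few lines. This is only an illustration under stated assumptions: it is written for Python 3 (the sessions in this chapter use Python 2), it uses the named constants os.SEEK_SET, os.SEEK_CUR, and os.SEEK_END in place of the numbers 0, 1, and 2, and the file name demo.bin is made up and lives in a temporary directory.

```python
import os
import tempfile

# Work in a throwaway directory; 'demo.bin' is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')
    with open(path, 'wb') as f:
        f.write(b'0123456789')

    with open(path, 'rb') as f:
        f.seek(3, os.SEEK_SET)    # whence=0: 3 bytes from the start of the file
        from_start = f.read(2)    # reads bytes 3 and 4; the pointer is now at 5
        f.seek(2, os.SEEK_CUR)    # whence=1: 2 bytes forward from the current location
        from_current = f.read(1)  # reads byte 7
        f.seek(-2, os.SEEK_END)   # whence=2: 2 bytes before the end of the file
        from_end = f.read()       # reads the last two bytes
        final_pos = f.tell()      # the pointer now sits at the end of the file
```

Using the named constants rather than bare numbers makes it harder to mix up the "current location" and "end of file" modes.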
Continuing our previous example:

    >>> f = open('../data/example.bin', 'rb')
    >>> f.seek(-2, 2)
    >>> print f.read()
    89
    >>> f.close()

I've set the file pointer to 2 bytes before the end of the file and printed the contents of the file from that point forward.

Example: Reading the Nth Field

The functions seek() and tell() are especially useful for accessing large binary files that contain fixed-length records. Unlike text files, with binary files of fixed-length records, you can calculate in advance the location of a field in the file. Combined with seek(), it's possible to read a single field. This is especially important in large files where reading the entire file or even reading the file a value at a time (without seeking directly to the required field) can take a considerable amount of time.

In this example, shown in Listing 10-1, we combine seek() and tell() with the struct module (see Chapter 4).

Listing 10-1. Reading the Nth Field

    import struct
    from math import sqrt
    from random import randrange

    # binary file name
    bin_fn = '../data/large_file.bin'
    Nfields = 1000  # number of fields
    N = 766         # field to retrieve
    fmt = 'cdL'     # format: char, float, long
    fmt_size = struct.calcsize(fmt)

    # create a random binary file
    fout = open(bin_fn, 'wb')
    for i in xrange(Nfields):
        data = struct.pack(fmt, chr(randrange(32, 128)), sqrt(float(i)), i)
        fout.write(data)
    fout.close()

    # read the nth value
    fin = open(bin_fn, 'rb')
    fin.seek((N-1)*fmt_size)
    data = fin.read(fmt_size)
    (c, d, l) = struct.unpack(fmt, data)
    print "At location %d, I read:" % (fin.tell()/fmt_size), (c, d, l)

The first part of the script creates a binary file with some made-up data. The second part reads a single field at location 766 without reading the entire file. This is done by changing the file pointer to point to location (N-1)*fmt_size and reading only one field.
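The record arithmetic in Listing 10-1 can also be wrapped in a small helper. The following is a sketch under assumptions, not the listing itself: it targets Python 3, the helper name read_record and the file name records.bin are hypothetical, and it uses the explicit little-endian format '<cdq' (no padding, so every record is 17 bytes on any platform) instead of the native 'cdL' layout used in the listing.

```python
import os
import struct
import tempfile

RECORD_FMT = '<cdq'                        # char, 8-byte float, 8-byte integer, no padding
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # size of one fixed-length record in bytes

def read_record(path, n):
    """Return record number n (1-based) without reading the records before it."""
    with open(path, 'rb') as f:
        f.seek((n - 1) * RECORD_SIZE)      # jump straight to the start of record n
        return struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))

# Build a small file of made-up records in a temporary directory, then read one back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.bin')
    with open(path, 'wb') as f:
        for i in range(1000):
            f.write(struct.pack(RECORD_FMT, b'x', i ** 0.5, i))
    record = read_record(path, 766)
```

Because every record has the same size, the cost of reading record 766 does not depend on how many records precede it.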
Here's the result from running the script in Listing 10-1:

    At location 766, I read: ('', 27.658633371878661, 765)

The first item is the randomly generated character, so it will differ from run to run.

Example: Efficient Tail Implementation

In Chapter 5 you saw a possible implementation of head and tail functionality. The tail functionality was harder to implement for a very large file. The reason for this was explained in Chapter 5. In this example, we implement tail functionality for large files with the use of seek() and tell(), as demonstrated in Listing 10-2.

Listing 10-2. tail() Function for Large Files

    from os.path import getsize

    def tail_large(filename, n=10):
        """Returns the last n lines of a very large file."""
        N, data = 1024, ''

        # open the file and retrieve its size
        f = open(filename, 'rb')
        fsize = getsize(filename)

        # seek to the end of file
        f.seek(0, 2)
        last_loc = f.tell()
        for i in xrange(fsize-N, -N, -N):
            # read the next chunk of data, moving backward
            f.seek(max(i, 0))
            chunk = f.read(last_loc-f.tell())
            last_loc = max(i, 0)

            # the new chunk precedes what was read so far
            data = chunk + data

            # do we have enough lines?
            if data.count('\n') > n:
                break
        f.close()

        # print the last n lines
        lines = data.splitlines()
        for line in lines[-n:]:
            print line

The idea is this: read N bytes from the end and store the result in data. The parameter N is an arbitrary number and describes the number of bytes to read in one chunk. I've set it to 1024. If data contains more than n line breaks (counted as the number of times '\n' is encountered), break out of the for loop and print the last n lines of data. If data does not contain n lines, read the next chunk of N bytes (that is, backward) and prepend the read bytes to data, so that data always holds one contiguous piece of the end of the file. So in a sense, we're going backward from the end of the file, reading chunks of N=1024 bytes and checking whether we encountered enough line breaks. If we have, we print those lines; if we haven't, we keep reading more data until we either have read the required number of lines or have reached the beginning of the file.

This implementation is not as straightforward as the one presented in Chapter 5.
However, there is a substantial performance gain for large files.

Example: Creating a Fixed-Size File

When dealing with binary files, at times there's a requirement to create a large file (of noninitialized values). A trick I use to create a file is to seek past the end of the file to a location equivalent to the required length minus one, and then write a single zero byte and close the file. This (in many systems) creates a file of the required size. The following creates an uninitialized file of size 1GB (2^30 bytes):

    >>> f = open('../data/1gb_file.bin', 'wb')
    >>> f.seek(2**30-1)
    >>> f.write(chr(0))
    >>> f.close()

Now to ensure that the file was indeed created:

    >>> from os.path import getsize
    >>> getsize('../data/1gb_file.bin')
    1073741824L

Note: The ability to seek past the end of a file is system dependent and not supported by all systems.

Example: Recording Time-Based Binary Data

When recording time-based binary data, a method I particularly like is using the epoch notation (see Chapter 5). For this example, I'll be using functions from the time module and from Python's array module (not to be confused with NumPy's array object).

In case you're simply recording a variable as a function of time, it's easier if the recorded variable is in floating-point notation, because now both the time and the value use the same data type. This allows for a simple use of the array module, as shown in Listing 10-3.

Listing 10-3. Writing Epoch-Based Data in Binary Form

    import random, time, array

    N = 10
    fname = '../data/binary_data.f64'
    data = array.array('d')

    # create data
    for value in range(N):
        time.sleep(random.random())
        data.append(time.time())
        data.append(value)

    # store data to file
    f = open(fname, 'wb')
    data.tofile(f)
    f.close()

The script runs on average 5 seconds and generates timestamps and values. I've made use of the array method tofile() to store binary values to file.

Retrieving data from the binary file is simple as well, as you can see in Listing 10-4.

Listing 10-4. Reading Binary Data Stored with Epoch Notation

    import random, time, array

    N = 10
    fname = '../data/binary_data.f64'
    data = array.array('d')

    # read data
    f = open(fname, 'rb')
    data.fromfile(f, N*2)
    f.close()

    # display data
    L = data.tolist()
    for t, val in zip(L[::2], L[1::2]):
        print time.ctime(t), val

Most of the work is performed in the line data.fromfile(f, N*2), which reads values and stores them in a Python array. I then rearrange the data and display the results:

    Thu Dec 04 11:15:51 2008 0.0
    Thu Dec 04 11:15:52 2008 1.0
    Thu Dec 04 11:15:53 2008 2.0
    Thu Dec 04 11:15:53 2008 3.0
    Thu Dec 04 11:15:54 2008 4.0
    Thu Dec 04 11:15:54 2008 5.0
    Thu Dec 04 11:15:54 2008 6.0
    Thu Dec 04 11:15:55 2008 7.0
    Thu Dec 04 11:15:56 2008 8.0
    Thu Dec 04 11:15:56 2008 9.0

I've used a trick to rearrange the data. When I convert the data from an array to a list, L, the values are interlaced: time, value, time, value, and so on. To print values, I could just iterate through L, converting every other value (the first, third, fifth, and so on) to a time format. Instead I've opted to zip slices of the odd-numbered and even-numbered items. The following code illustrates this:

    >>> L = [1, 'Value', 2, 'Another value', 3, 'Last value']
    >>> L[::2] # these are the odd values
    [1, 2, 3]
    >>> L[1::2] # these are the even values
    ['Value', 'Another value', 'Last value']
    >>> zip(L[::2], L[1::2])
    [(1, 'Value'), (2, 'Another value'), (3, 'Last value')]
    >>> for i, s in zip(L[::2], L[1::2]):
    ...     print i, s
    ...
    1 Value
    2 Another value
    3 Last value

Object Serialization

At times, working with an interactive session in Python, it's useful to be able to save variables to file. Prior to writing them to file, variables should be serialized, that is, converted into a stream of bytes. The stream of bytes can then be written to file and later retrieved. Instead of creating dedicated file formats to deal with all sorts of variable types (lists, strings, NumPy arrays, and the like), Python provides us with a built-in object serialization module that is ideal for this purpose: Pickle.
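To make the idea of serialization concrete, here is a minimal round trip: an object is turned into a stream of bytes and then rebuilt from that stream. The sketch assumes Python 3 and its standard pickle module, and the dictionary session is made-up sample data.

```python
import pickle

# Made-up sample data standing in for variables from an interactive session.
session = {'a': 3, 'b': 'A string', 'c': {'dict': 10}, 'd': [1.0, 0.0, 0.0]}

stream = pickle.dumps(session)    # serialize: object -> stream of bytes
restored = pickle.loads(stream)   # deserialize: bytes -> a new, equal object
```

The bytes in stream are what would be written to a file; restored is an independent copy, equal to the original but not the same object.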
The Pickle Module

Pickle comes in two flavors: the Pickle module and the faster C implementation named cPickle. With the better performance of cPickle comes a price: you can't subclass its pickler and unpickler. Personally, I have not found this limitation an issue, so for me, cPickle is a better choice. To use Pickle, issue import pickle; to use cPickle, issue import cPickle.

The function cPickle.dump(obj, file[, protocol]) serializes an object and writes it to file. The protocol argument can take the values 0 for ASCII (the default), 1 for binary, and 2 for a binary format that handles new-style Python objects more efficiently. Both protocols 1 and 2 create binary files. If you provide a negative value for protocol, the highest version protocol will be used. This is to accommodate future protocol versions of Pickle and cPickle. The function cPickle.load(file) will read an object from file.

The function cPickle.dumps(obj[, protocol]) serializes the object and returns its string representation without writing it to file. Similarly, cPickle.loads(str) creates an object from a string.

Example: Saving and Retrieving Python Session Variables

The example in Listing 10-5 makes use of the cPickle module to write variables of varying data types to file.

Listing 10-5. Pickling Several Objects to File

    import cPickle
    from numpy import *

    fname = '../data/mysession.pickle'
    a = 3
    b = "A string"
    c = {'dict': 10}
    d = eye(3)
    fout = open(fname, 'wb')
    for var in [a, b, c, d]:
        cPickle.dump(var, fout)
    fout.close()

To pickle objects (i.e., serialize them) and write them to file, I've used the function cPickle.dump(var, file). You can issue subsequent calls to cPickle.dump() to store additional values to file as shown in Listing 10-5.

Now to read the objects from file (see Listing 10-6).

Listing 10-6. Reading Objects from File

    import cPickle

    fname = '../data/mysession.pickle'
    fin = open(fname, 'rb')
    var_index = 0
    while True:
        try:
            var_index += 1
            exec "v_%d = cPickle.load(fin)" % var_index
            exec "var_type = type(v_%d)" % var_index
            print "Read v_%d, type is: %s" % (var_index, var_type)
        except EOFError:
            break

Whenever you issue a call to cPickle.load(), the return value is a Python object; once the end of the file is reached, the call raises EOFError instead. However, the name of the object is not stored. Therefore, I've made use of the exec statement to create variables named v_1, v_2, and so forth to store the objects. Here are the results from running the script:

    Read v_1, type is: <type 'int'>
    Read v_2, type is: <type 'str'>
    Read v_3, type is: <type 'dict'>
    Read v_4, type is: <type 'numpy.ndarray'>
    >>> v_4
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])

If you're using NumPy arrays, you can make use of the functions save() and load() provided by matplotlib. These functions accept a file name and read and write a NumPy array object to and from file:

    >>> from pylab import *
    >>> fname = '../data/session.npy'
    >>> save(fname, eye(3))
    >>> load(fname)
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])

If the file name used in save() and load() ends with .gz, gzip compression is automatically used (see the section "File Compression" later in this chapter).

Command-Line Parameters

This section, which covers command-line parameters, is a bit of an off-topic discussion. (The reason I've decided to give an overview of command-line parameters before getting into the details of the FileInput module is that the FileInput module makes sense in the context of command-line parameters, as you'll soon see.)

A possible progression from an interactive Python session is creating a stand-alone utility or application, that is, a Python script callable from the shell, be it the command prompt in Windows or a bash shell in Linux. One of the options of interacting with such a script is by passing command-line arguments.
For example, in the tail command-line utility, a command-line parameter could be the number of lines tail will display. So to list the last 20 lines of a file, you would write python tail.py -n 20 filename.

argv

The sys module enables command-line processing with the sys.argv variable, demonstrated in Listing 10-7. sys.argv is a list of strings containing the split shell command entered. The value in sys.argv[0] is the name of the Python script.

Listing 10-7. Command-Line Arguments

    import sys
    for i, cmd in enumerate(sys.argv):
        print "argv[%d] = '%s'" % (i, cmd)

Save the file as parse_args.py and run python parse_args.py 20 myfile in a shell. The results should look like this:

    argv[0] = 'parse_args.py'
    argv[1] = '20'
    argv[2] = 'myfile'

Example: Creating a Fixed-Size File (Stand-Alone Script)

We now modify the code from the section "Example: Creating a Fixed-Size File" earlier in this chapter into a stand-alone script callable from a CLI (shell or command window). The script, shown in Listing 10-8 (empty_file.py), accepts the number of bytes and a file name and creates a file of the specified name and size.

Listing 10-8. Creating a Fixed-Size File (Stand-Alone Script), empty_file.py

    from sys import argv, exit

    usage = "Usage: python empty_file.py nbytes filename"

    # we expect three arguments: script name, size, and filename
    if len(argv) != 3:
        print "Improper number of arguments."
        print usage
        exit()

    # is size an integer?
    try:
        nbytes = long(argv[1])
    except ValueError:
        print "First argument is not an integer number."
        print usage
        exit()

    # retrieve the requested filename
    filename = argv[2]

    # can we create the file?
    # here a failure could be due to a nonexisting path
    try:
        f = open(filename, 'wb')
    except IOError:
        print "Unable to create file", filename
        print usage
        exit()

    # finally create the file
    f.seek(nbytes-1)
    f.write(chr(0))
    f.close()
    print "Successfully created file %s of size %d." % (filename, nbytes)

I've carefully checked the parameters passed by the user to determine whether the right number of parameters was given and whether those values are valid. I took special care to ensure the file can indeed be created. Finally, the code that generates the empty file is simple.

I've also introduced the function exit(), provided with the sys module. The function is especially useful when you're writing a stand-alone script, as it exits the script immediately.

OptParse Module

Enforcing a strict syntax for command-line parameters renders a script less user friendly. For instance, in the previous example, you might want the script to automatically create a file of default size, say 1KB, in case no length is provided. Or you might want to add additional parameters with default values, further controlling the behavior of the script so that it creates a path to the file name if it does not exist, for example.

Accommodating additional options as well as default options will cause the code in the previous listing to grow larger and less maintainable. When the number of options increases, consider using the OptParse module; the OptParse module is designed to handle command-line parameters through an easy set of library functions.
Tip: The module getopt (http://docs.python.org/library/getopt.html) is an older module that also provides functions to parse command-line options.

To use the OptParse module, we follow these steps:

1. Create an OptionParser object.
2. Add options to the parser using the add_option() method.
3. Parse the command-line arguments using the parse_args() method.

The first step is simple: instantiate an OptionParser object by setting parser = OptionParser(). Adding options is a bit more complex as there are many possibilities to choose from (as you'll soon see). The last step is calling the function parse_args(), which parses the command line and returns a tuple of options and arguments. The difference between an option and an argument is that options are, of course, optional, and arguments (positional arguments per OptParse documentation) are required.

Example: Processing Command-Line Parameters

We'll modify our previous example so that now the number of bytes per file is an option followed by the requested number of bytes (i.e., -n 1000), as in Listing 10-9 (empty_opt.py). Furthermore, we'll add an option switch, also known as an option flag, indicating whether a .bin extension should be added to the file name. The existence of the option flag -x instructs the script whether to create the extension: there's no additional value following it.

Listing 10-9. Processing Command-Line Parameters Using OptParse, empty_opt.py

    from optparse import OptionParser
    from sys import exit

    usage = "Usage: python empty_opt.py [options] filename"

    # create an OptionParser instance
    parser = OptionParser(usage)

    # these are the options
    parser.add_option("-n", "--numbytes", dest="nbytes",
        type="int", default=1000, help="number of bytes in file")
    parser.add_option("-x", "--ext", dest="ext",
        action="store_true", default=False, help="adds 'bin' extension to filename")
    (opt, args) = parser.parse_args()

    # must have a filename
    if len(args) != 1:
        print "Improper number of arguments."
        exit()

    # append extension if switch is on
    filename = args[0] + '.bin' if opt.ext else args[0]

    # create the file
    try:
        f = open(filename, 'wb')
    except IOError:
        print "Unable to create file", filename
        exit()
    f.seek(opt.nbytes-1)
    f.write(chr(0))
    f.close()
    print "Successfully created file %s of size %d." % (filename, opt.nbytes)

First, I've imported the OptParse module. I then instantiate an OptionParser object and provide it with the default usage string. The usage string will be displayed as the first line whenever the user issues the command-line switch -h or --help, so: python empty_opt.py -h.

I then add options using the add_option() method. The add_option() method has many parameters to control how options should be parsed. In my first add_option() call, I've set how the user invokes this option: by entering either -n NBYTES or --numbytes NBYTES. I set the destination for this option to be named nbytes. This means that after the option is parsed, I can access the option value through variable opt.nbytes. The type of variable is int, as detailed by the type argument, and the default value is 1000 in case nbytes isn't provided by the user. Lastly, the help string associated with this option is detailed: help="number of bytes in file".
Similarly, I set another option named ext; this option is a switch, meaning the user will invoke the switch simply by entering -x or --ext; there are no additional values following the switch (in contrast, the -n option was accompanied by an NBYTES value). The action argument instructs OptParse to treat this as a positively acting switch: if -x is provided, set the flag to True. Lastly, I've set the default value to False and added a help string: help="adds 'bin' extension to filename".

Parsing the command-line options is performed with the call to parse_args(). Both options and arguments are then retrieved. The options are accessed as attributes of the returned options object, and the arguments are provided in a list. Following that is the actual creation of the file.

The following are the results from running the script with various options in a bash shell:

    $ python empty_opt.py
    Improper number of arguments.
    $ python empty_opt.py -h
    Usage: python empty_opt.py [options] filename

    Options:
      -h, --help            show this help message and exit
      -n NBYTES, --numbytes=NBYTES
                            number of bytes in file
      -x, --ext             adds 'bin' extension to filename
    $ python empty_opt.py file1
    Successfully created file file1 of size 1000.
    $ python empty_opt.py -x file1
    Successfully created file file1.bin of size 1000.
    $ python empty_opt.py -n 2000 --ext file1
    Successfully created file file1.bin of size 2000.
    $ python empty_opt.py -n 2000 --ext file1 file2
    Improper number of arguments.
    $ python empty_opt.py -n 2a --ext file1
    Usage: python empty_opt.py [options] filename

    empty_opt.py: error: option -n: invalid integer value: '2a'

The script expects an input as follows: [options] filename. Calling the script with command-line parameter -h or --help prints out the usage help message. This is implemented automatically when you use the OptParse module. Next, I've issued some valid command-line parameters and some invalid ones. OptParse handles the parsing of the values, while my code handles the number of arguments (only one: filename). I've also called the script with full option names (--ext) and abbreviated option names (-x).
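The parsing behavior described above can be tried without a shell by passing the argument list to parse_args() directly. The sketch below assumes Python 3, where the optparse module is still available; the helper name build_parser is hypothetical, and the two options mirror those of empty_opt.py.

```python
from optparse import OptionParser

def build_parser():
    """Build a parser with the same two options as empty_opt.py (helper name is made up)."""
    parser = OptionParser("Usage: python empty_opt.py [options] filename")
    parser.add_option("-n", "--numbytes", dest="nbytes", type="int",
                      default=1000, help="number of bytes in file")
    parser.add_option("-x", "--ext", dest="ext", action="store_true",
                      default=False, help="adds 'bin' extension to filename")
    return parser

parser = build_parser()

# Pass an explicit list instead of relying on sys.argv[1:].
opt, args = parser.parse_args(['-n', '2000', '--ext', 'file1'])

# With no options given, the defaults apply.
defaults, default_args = parser.parse_args(['file1'])
```

Parsing an explicit list like this is a convenient way to exercise option handling from an interactive session or a test.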
Module OptParse is a rich module with many features. Refer to the online help at http://docs.python.org/library/optparse.html for a detailed description of the module.

Tip: As the number of options increases, consider using the ConfigParser module instead. See Chapter 4 for an introduction to ConfigParser and the online help (http://docs.python.org/library/configparser.html).

The FileInput Module

Closing our command-line parameters discussion is the FileInput module. The module provides an easy method for accessing several files (or streams) passed by the command line (i.e., python somescript.py file1 file2 file3). To use the module, issue import fileinput.

Using the module is straightforward: iterate over fileinput.input(). The result from the iteration is the next line in the current file. Once the end of the file is reached, the next file is opened automatically, and the process resumes until all lines from all files have been iterated over.

Table 10-1 lists some useful FileInput methods that can be used to further enhance scripts that make use of the module.

Table 10-1. Useful FileInput Methods

Method: fileinput.close()
Description: Ends the processing, closing all opened files

Method: fileinput.filelineno()
Description: Returns the line number in the current file

Method: fileinput.filename()
Description: Returns the name of the file currently being read

Method: fileinput.fileno()
Description: Returns the integer file descriptor of the current file

Method: fileinput.isfirstline()
Description: Returns True if this is the first line in a file

Method: fileinput.lineno()
Description: Returns the cumulative line number of all lines read from all the files

Method: fileinput.nextfile()
Description: Stops processing the current file and jumps to the next file

Let's turn to an example.

Example: Combining Data from Several Sources Based on the Epoch

Here we pick up on an example previously presented in Chapter 5 in a section with the same title as this one. This time we allow for more than two files to be combined, based on the epoch (see Listing 10-10).
Listing 10-10. Combining Several Files Based on the Epoch, combine_epoch.py

    import fileinput
    from time import mktime, strptime

    data = []
    fmt = '%b %d %H:%M:%S %Y'
    for line in fileinput.input():
        data.append([mktime(strptime(line[4:24], fmt)), line])
    for line in sorted(data):
        print line[1],

The contents of the files are detailed in Chapter 5. Use the script as follows: python combine_epoch.py file1 file2 .... The source code should prove easy to follow.

Example: Searching for Text in Multiple Files

Again, building from an example previously shown in Chapter 5 in the section "Example: Searching Inside a Text File," we now search for text in multiple files. To use the script, srchfile.py, shown in Listing 10-11, issue the command python srchfile.py search_string file1 file2 ....

Listing 10-11. Searching for Text in Multiple Files, srchfile.py

    import fileinput, sys

    # string to search is the first argument
    for line in fileinput.input(sys.argv[2:]):
        if line.find(sys.argv[1]) != -1:
            print "File %s, %d: %s" % (fileinput.filename(),
                fileinput.filelineno(), line.rstrip())

The main difference from the previous example is that now the first parameter is the string to search instead of a file. So I access the command-line parameters and pass the values from the third parameter onward (sys.argv[2:]) to fileinput.input(). Doing so will skip the script name (argv[0]) and the search string (argv[1]).

The fileinput module also provides support for modifying files as you process the lines via the inplace argument. Refer to the online help for more on this: http://docs.python.org/library/fileinput.html.

File and Directory Manipulation

Other than reading and writing files and processing command-line parameters, manipulating files is also a task commonly required of a developer. You've seen the os.walk() function and some directory operations in previous chapters; here I expand on those, as well as file operations: deleting files, moving files, and more.
Module glob

The glob module enables searching for files given a file name pattern. The function glob(pattern) will return a list of all the files matching pattern; the function iglob(pattern) returns an iterator (as opposed to a list in glob()) of all the files matching pattern. I usually just use the list version, glob(pattern):

    >>> from glob import glob
    >>> glob('*.py')
    ['extract3.py', 'tail_large.py', 'cmp_dirs.py', 'cmp_files.py']

glob() accepts shell-like wildcards such as * (matches a string of characters), ? (matches one character), [chars] (matches any character from a list of characters), and [!chars] (matches anything but those characters listed). The following will match a file name that ends with py and contains a number:

    >>> glob('*[0-9]*py')
    ['extract3.py']

This will match a file name that ends with py and does not start with c:

    >>> glob('[!c]*py')
    ['extract3.py', 'tail_large.py']

Please note that glob expressions contain shell wildcards, and so are not regular expressions.

Tip: See also module fnmatch (http://docs.python.org/library/fnmatch.html).

Additional os Module Functionality

You've already seen a considerable number of functions from the os and os.path modules (see the section "Moving Around" in Chapter 3). Table 10-2 lists a few more, not mentioned earlier, that are especially useful for manipulating files and directories. In the table, assume the current working directory is /home/user and the file in the directory is file.ext.

Table 10-2. os Module Functions for Manipulating Files and Directories

Function: os.chmod(path, mode)
Description: Changes file permissions (in Windows, only read and write permissions are changed, all else is ignored).
Example: os.chmod('file.ext', 0777) changes the file permissions to read, write, and execute for all.

Function: os.chown(path, uid, gid)
Description: Changes the group and user ownership of a file (not available in Windows).
If you wish to change only uid, set gid to -1; if you wish to change only gid, set uid to -1.
Example: os.chown('file.ext', 0, 0) will set the group and user ownership of file file.ext to root (assuming root has a uid of 0).

Function: os.remove(pathname), os.unlink(pathname)
Description: Deletes the file specified in pathname.
Example: os.unlink('file.ext') will delete the file file.ext.

Function: os.rmdir(path)
Description: Removes a directory if it's empty.
Example: os.rmdir('/home/user') will remove directory /home/user if it's empty.

Function: os.mkdir(path)
Description: Creates a directory.
Example: os.mkdir('another') will create directory /home/user/another.

Function: os.makedirs(path)
Description: Creates a directory as well as any intermediate subdirectories.
Example: os.makedirs('dir1/dir2') will create directories /home/user/dir1 and /home/user/dir1/dir2.

Function: os.rename(old, new)
Description: Renames a path or file.
Example: os.rename('file.ext', 'file2.ext') renames file file.ext to file2.ext.

Function: os.renames(old, new)
Description: Renames a path or file including the creation of intermediate directories and removal of empty ones.
Example: os.renames('/home/user', '/home/user2/dir1/dir2') will rename the directory /home/user to /home/user2/dir1/dir2, create any intermediate directories that do not exist, and prune directories of the old path that are left empty.

Additional os.path Module Functionality

The module os.path provides functions that help manage file names and file paths. Table 10-3 lists some useful os.path functions. In the table, assume the current working directory is /home/user and the file in the directory is file.ext.

Table 10-3. Useful os.path Functions

Function: os.path.abspath(s)
Description: Returns the absolute path of a file
Example: os.path.abspath('file.ext') returns '/home/user/file.ext'.

Function: os.path.basename(s)
Description: Returns the file name, excluding path
Example: os.path.basename('/home/user/file.ext') returns 'file.ext'.

Function: os.path.dirname(s)
Description: Returns the directory name of a path
Example: os.path.dirname('/home/user/file.ext') returns '/home/user'.
ko*l]pd*ateopo$o% Returns Pnqa if the path or file specified by o exists ko*l]pd*ateopo$#+dkia+qoan#% returns Pnqa. ko*l]pd*cap]peia$o% Returns the last access time of a file peia*_peia$ko*l]pd*cap]peia$#+ dkia+qoan+beha*atp#%% will print the access time (_peia$% is part of the time module). ko*l]pd*cap_peia$o% Returns creation time of a file Similar to ko*l]pd*cap]peia$% example. Continued CHAPTER 10 N ADVANCED FILE PROCESSING336 Function Description Example ko*l]pd*capipeia$o% Returns the last modification time of a file Similar to ko*l]pd*cap]peia$% example. ko*l]pd*capoeva$o% Returns the file size in bytes ko*l]pd*capoeva$#beha*atp#% returns the size of file beha*ptp in bytes. ko*l]pd*eo]^o$o% Returns Pnqa if the path specified by o is an absolute path ko*l]pd*eo]^o$#beha*atp#% returns B]hoa. ko*l]pd*eo]^o$#+dkia+qoan+beha* atp#% returns Pnqa. ko*l]pd*eo`en$o% Returns Pnqa if o is a directory ko*l]pd*eo`en$#+dkia#% returns Pnqa. ko*l]pd*eobeha$o% Returns Pnqa if o is a file ko*l]pd*eobeha$#beha*atp#% returns Pnqa. ko*l]pd*fkej$^]oa(oam% Joins two or more paths, adding slashes as needed ko*l]pd*fkej$#+dkia+qoan#( #beha*atp#% returns #+dkia+qoan+ beha*atp#. ko*l]pd*fkej$#+dkia#(#qoan#( #beha*atp#% returns #+dkia+qoan+ beha*atp#. ko*l]pd*olhep$o% Splits a pathname, returning the path and the file name ko*l]pd*olhep$#+dkia+qoan+beha* atp#% returns $#+dkia+qoan#( #beha*atp#%. ko*l]pd*olhepatp$o% Splits a pathname returning the extension, including the dot ko*l]pd*olhepatp$#+dkia+qoan+ beha*atp#% returns $#+dkia+qoan+ beha#(#*atp#%. Module shutil The shutil module provides higher-level functions for copying, moving, and renaming files. Of those, we’ll explore the following: _klu$on_(`aop%,_klupnaa$on_(`aop%, nipnaa$l]pd%, and ikra$on_(`op%. For a full account of the module, refer to dppl6++`k_o*lupdkj*knc+he^n]nu+ odqpeh*dpih. I assume a file named beha-*ptp exists in the current directory. If yours doesn’t have this file, create one if you wish to follow along. 
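For readers without a file1.txt at hand, the following is a minimal sketch (not from the original text, and written for Python 3 rather than the Python 2 used in this book's listings) that creates the file inside a temporary directory and tries copy(), move(), and rmtree() there, so nothing in the working directory is touched. The directory layout and the variable names are illustrative assumptions.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so the current directory is untouched.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'file1.txt')

# Create the file the walkthrough assumes exists.
with open(src, 'w') as f:
    f.write('some data\n')

# copy() accepts either a target directory or a full target file name.
target_dir = os.path.join(workdir, 'dir1', 'dir2')
os.makedirs(target_dir)
shutil.copy(src, target_dir)
shutil.copy(src, os.path.join(target_dir, 'file2.txt'))
copied = sorted(os.listdir(target_dir))

# move() relocates file2.txt one level up, into dir1.
shutil.move(os.path.join(target_dir, 'file2.txt'),
            os.path.join(workdir, 'dir1'))
after_move = sorted(os.listdir(target_dir))
moved = os.path.exists(os.path.join(workdir, 'dir1', 'file2.txt'))

# rmtree() removes the whole tree, including non-empty directories.
shutil.rmtree(workdir)
print(copied, after_move, moved, os.path.exists(workdir))
```

Using tempfile.mkdtemp() is a design choice for safety only; the functions behave the same way on ordinary directories.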
First, let's create a directory with subdirectories and copy file1.txt to the newly created directory:

    >>> import shutil
    >>> from os import makedirs
    >>> from glob import glob
    >>> makedirs('dir1/dir2/dir3/dir4')
    >>> shutil.copy('file1.txt', 'dir1/dir2/dir3/dir4')
    >>> shutil.copy('file1.txt', 'dir1/dir2/dir3/dir4/file2.txt')
    >>> glob('dir1/dir2/dir3/dir4/*')
    ['dir1/dir2/dir3/dir4/file1.txt', 'dir1/dir2/dir3/dir4/file2.txt']

First, I imported the module shutil and the functions makedirs() (from os) and glob() (from glob). I then created a directory (as well as its parent directories): dir1/dir2/dir3/dir4. I made use of the function copy() in two ways: first, to copy the file file1.txt to the newly created directory, and second, to copy the file file1.txt to the same directory under a different name, file2.txt.

    >>> shutil.move('dir1/dir2/dir3/dir4/file2.txt', 'dir1/dir2')
    >>> glob('dir1/dir2/*')
    ['dir1/dir2/dir3', 'dir1/dir2/file2.txt']
    >>> glob('dir1/dir2/dir3/dir4/*')
    ['dir1/dir2/dir3/dir4/file1.txt']

I've moved the file file2.txt to directory dir1/dir2. The results from glob() confirm the move. Now I copy the entire directory tree under dir1 to a new directory named Dir_1:

    >>> shutil.copytree('dir1', 'Dir_1')
    >>> glob('Dir_1/dir2/dir3/dir4/*')
    ['Dir_1/dir2/dir3/dir4/file1.txt']

And lastly, it's time for cleanup: I remove both directories as well as their subdirectories:

    >>> shutil.rmtree('dir1')
    >>> shutil.rmtree('Dir_1')
    >>> glob('dir1')
    []

File Compression

File compression is the process of representing a file in fewer bytes. Compression is typically divided into two categories: lossy compression and nonlossy (lossless) compression. In lossy compression, the compressed data is not identical to the original data; data is lost in the process of reducing the file size (hopefully only unimportant information is lost). Nonlossy compression uses clever schemes to represent the same data more efficiently, so the original can be recovered exactly.
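A quick way to see what nonlossy means in practice is a round trip: compress, decompress, and check that nothing changed. The sketch below is not from the original text; it is written for Python 3 and uses the standard library's zlib module on made-up, highly repetitive sample data.

```python
import zlib

# A hundred identical values, stored as bytes (made-up sample data).
original = b'42,' * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Nonlossy: the round trip returns exactly the original bytes,
# and repetitive input takes far fewer bytes once compressed.
print(len(original), len(compressed), restored == original)
```

The exact compressed size depends on the zlib version and compression level, so it is printed rather than assumed.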
For example, instead of writing a hundred identical values to a file, a nonlossy compression scheme might write the value 100, representing the count, followed by the repeating value.

Python provides us with several compression and archiving modules. Archiving modules bundle files into a single archive file, optionally compressed; compression modules deal with the compression itself and can be used on strings, not only on files. The distinction is somewhat blurred, as some modules perform both compression and archiving. The modules bz2 (http://docs.python.org/library/bz2.html), gzip (http://docs.python.org/library/gzip.html), and zlib (http://docs.python.org/library/zlib.html as well as http://www.zlib.net/) provide nonlossy compression functionality; the modules tarfile (http://docs.python.org/library/tarfile.html) and zipfile (http://docs.python.org/library/zipfile.html) provide archiving capabilities. The names of the packages are also the import names, so to use gzip, issue import gzip.

The modules differ somewhat in compression ratio, performance, and popularity. They're all easy to use and provide excellent results. In this section we'll explore the tarfile module.

Example: A Compressed tar File

In the open source community, it's common to see files distributed with extensions .tar.gz or .tar.bz2. These are compressed tar files; tar stands for tape archive, but in reality there's no need for tapes. The example in Listing 10-12 creates several files and archives them; Listings 10-13 and 10-14 then retrieve them from the archive.

Listing 10-12. Creating an Archive

    import tarfile, glob

    # create some files
    for i in range(5):
        f = open('file%d.txt' % i, 'w')
        # write some data
        for j in range(100):
            f.write('Some data: %d\n' % j)
        f.close()

    # archive the files using bz2 compression
    tf = tarfile.open('files.tar.bz2', 'w:bz2')
    for filename in glob.glob('file*'):
        tf.add(filename)
    tf.close()

The first section of the script generates five files with some made-up data.
Once files are created, I create a tar file for archiving. The file mode is specified as 'w:bz2', which stands for writing (creating) a tar file compressed with compression algorithm bz2. Other modes include 'w:gz' for gzip compression and 'w' for no compression. Similarly, an archive is opened for reading by specifying 'r', 'r:gz', or 'r:bz2'. Once the tarfile object is created, we add files to the archive using the add(path) method. If you provide a directory to add(), the entire directory is added to the archive. I've decided to add the files one at a time in case other files exist in the directory that I don't wish to include. Finally, I close the tar file, effectively creating the file files.tar.bz2.

Retrieving files from an archive is simple as well, as demonstrated in Listing 10-13. The method extractall() will extract all files from an archive. The method extract(member, path) will extract a file that is a member of the archive to a location specified by path. The method getmembers() lists the members (files) in an archive.

Listing 10-13. Extracting All Files from an Archive

    import tarfile, os

    if not os.path.exists('new'):
        os.mkdir('new')
    tf = tarfile.open('files.tar.bz2', 'r:bz2')
    tf.extractall('new')
    tf.close()

Listing 10-14 shows how to extract just the first three files in the archive.

Listing 10-14. Extracting Three Files from an Archive

    import tarfile, os

    if not os.path.exists('new'):
        os.mkdir('new')
    tf = tarfile.open('files.tar.bz2', 'r:bz2')
    for member in tf.getmembers()[:3]:
        tf.extract(member, 'new')
    tf.close()

I've made use of the method getmembers() to retrieve the list of files in the archive and then sliced off only the first three.

Comparing Files

Checking whether two files are identical is a common task. For input data files, finding a duplicate means we can remove the copy, and our script will both run faster and provide better statistics because the data is no longer used twice.
The reasons for duplicate files can be numerous, as discussed in Chapter 4. A simple mechanism for comparing two files is to open both files, read them entirely into memory, and then compare the values:

    >>> data1 = open('../data/file1.txt', 'rb').read()
    >>> data2 = open('../data/file2.txt', 'rb').read()
    >>> data1 == data2
    True

The main benefit of this method is that it's simple. However, it has two shortcomings:

• Inefficiency: Suppose one file is 10GB in size and the other is 1 byte long. Looking at the file sizes alone is enough to tell the files are not identical. Reading a 10GB file into memory, on the other hand, can bring the system to a crawl.

• Lack of information: If two files are not identical, what exactly are the differences?

Modules filecmp and difflib from the Python Standard Library provide us with functionality to compare files and find the differences.

Module filecmp

The module filecmp provides functions for file and directory comparisons. The function cmp(file1, file2[, shallow]) compares file1 with file2. If shallow is not provided (or is True), files that have the same stat signature are considered equal. By this I mean files that have the same system information: file type, size, and modification time (see http://docs.python.org/library/os.html for an explanation of stat). If shallow is False, files are also compared for content.

    >>> filenames = ['../data/file1.bin', '../data/file2.bin']
    >>> for fn in filenames:
    ...     f = open(fn, 'wb')
    ...     f.write('some data')
    ...     f.close()
    ...
    >>> import filecmp
    >>> filecmp.cmp(filenames[0], filenames[1])
    True

The class dircmp(dir1, dir2) enables the comparison of directories dir1 and dir2. The method report() prints the result of comparing the two directories; the method report_full_closure() also descends into the subdirectories they have in common.

For the following example, I assume you've created the file files.tar.bz2 in the previous compression example. Here, we'll create two directories, new1 and new2.
Directory new1 will contain the extracted files from the archive; directory new2 will contain the extracted files from the archive as well as another subdirectory, new3, which will also contain the contents of the archive. We'll compare the directory contents (see Listing 10-15).

Listing 10-15. Comparing Directories

    import tarfile, os, filecmp

    if not os.path.exists('new1'):
        os.mkdir('new1')
    if not os.path.exists('new2/new3'):
        os.makedirs('new2/new3')
    tf = tarfile.open('files.tar.bz2', 'r:bz2')
    tf.extractall('new1')
    tf.extractall('new2')
    tf.extractall('new2/new3')
    tf.close()

    cmp = filecmp.dircmp('new1', 'new2')
    cmp.report()

The results are as follows:

    diff new1 new2
    Only in new2 : ['new3']
    Identical files : ['file0.txt', 'file1.txt', 'file2.txt', 'file3.txt', 'file4.txt']

As you can see, comparing directory contents using the filecmp module is simple.

Module difflib

The module difflib provides several objects and functions to help compare lists of strings (e.g., the lines of text files). Several functions provide a diff result in different formats. These include context_diff(), ndiff(), and unified_diff(). In this section we'll examine the context_diff(f1, f2[, fromfile][, tofile]) function; the other functions behave similarly.

First we create two files, ../data/file1.txt and ../data/file2.txt, with similar but not identical content, as shown in Listing 10-16.

Listing 10-16. Creating Files for Comparison

    content = """A string
    123,456
    789
    some text\n"""

    fname1 = '../data/file1.txt'
    fname2 = '../data/file2.txt'

    f1 = open(fname1, 'wb')
    f1.write('before\n')
    f1.write(content)
    f1.close()

    f2 = open(fname2, 'wb')
    f2.write(content)
    f2.write('after\n')
    f2.close()

The two files differ in that the first file contains an extra line at the beginning, and the second file contains an extra line at the end. We call context_diff() to display those differences (see Listing 10-17).

Listing 10-17. Comparing File Contents

    import difflib

    fname1 = '../data/file1.txt'
    fname2 = '../data/file2.txt'
    lines1 = open(fname1).readlines()
    lines2 = open(fname2).readlines()
    for line in difflib.context_diff(lines1, lines2, fname1, fname2):
        print line,

I've included the names of the files as parameters to context_diff(); this generates a report that displays the file names in the header information. Here are the results:

    *** ../data/file1.txt
    --- ../data/file2.txt
    ***************
    *** 1,5 ****
    - before
      A string
      123,456
      789
      some text
    --- 1,5 ----
      A string
      123,456
      789
      some text
    + after

A section starting with *** means the report addresses the file ../data/file1.txt; a section starting with --- means the report addresses the file ../data/file2.txt. A line starting with a - sign is present in the first file but missing from the second; a line starting with a + sign is present in the second file but missing from the first. The output is similar to that generated by the UNIX diff command-line utility.

Additional difflib functionality can be found online at http://docs.python.org/library/difflib.html.

Final Notes and References

Python provides a wealth of libraries that deal with common programming tasks: file processing, command-line parameters, file and directory manipulation, compressing and archiving files, and many more. There are a great number of additional modules available with the Python Standard Library:

• The Python Standard Library, http://docs.python.org/library/index.html

APPENDIX

Additional Source Listing

This appendix is a collection of source listings that didn't quite belong in the chapters themselves, but nevertheless might be of interest to you.

Nudge Subplots

In generating subplots of size 2 by 2 for this book, I've noticed that the text for the x-axis of the top subplots clashes with the titles of the lower subplots.
To overcome this, I've defined nudge_subplot(), a function designed to modify the location of subplots within a figure (see Listing A-1).

Listing A-1. Source Listing of nudge_subplot()

    def nudge_subplot(subp, dy):
        """A helper function to move subplots."""
        sp_ax = subp.get_position()
        subp.set_position([sp_ax.x0, sp_ax.y0+dy,
            sp_ax.x1-sp_ax.x0, sp_ax.y1-sp_ax.y0])

To use the function, store the return value from subplot() and then "nudge" it by calling nudge_subplot(sp, dy), as shown in Listing A-2, where sp is the subplot and dy is the amount to nudge (a value of 0.02 for dy usually works well).

Listing A-2. Using nudge_subplot()

    from pylab import *

    # values to plot
    t = arange(5)
    y = array([1, 2, -1, 1, -2])

    plot_cmds = [
        "plot(y)",
        "plot(-y)",
        "plot(y**2)",
        "plot(sin(y))"
        ]

    figure()
    for i, plot_cmd in enumerate(plot_cmds):
        sp = subplot(2, 2, i+1)
        if i == 1:
            nudge_subplot(sp, 0.02)
        if i == 3:
            nudge_subplot(sp, -0.02)
        exec plot_cmd
        title(plot_cmd, fontsize='large')
        xlabel('x values')

In this code, I've nudged the right subplots and left the left ones as is, as you can see in Figure A-1.

Figure A-1. The left subplots are unmoved (the default), and the right subplots are nudged.

The function nudge_subplot() is not backward compatible with older versions of matplotlib. For example, with matplotlib version 0.91.4, the function set_position() accepts different arguments, and so the code needs revising. Nevertheless, the ideas are similar. Listing A-3 is an implementation that runs on matplotlib version 0.91.4.

Listing A-3. Source Listing of nudge_subplot_old(), for Older Versions of Matplotlib

    def nudge_subplot_old(subp, dy):
        """A helper function to move subplots.
        Works on matplotlib version 0.91.4."""
        sp_ax = subp.get_position()
        sp_ax[1] += dy
        subp.set_position(sp_ax)

Magic Square Arrows

In Chapter 7 I presented a figure describing the magic square algorithm.
I used matplotlib patch arrows embedded in the algorithm to plot that figure. Listing A-4 is the source code used to generate the diagram.

Listing A-4. Magic Square Diagram Creation

    from pylab import *

    def magic_arrow(x, y, top_right, n, c=0):
        """Draws an arrow from point x, y."""
        d, my_colors = 0.15, 'rbymg'
        if top_right:
            # top-right arrow
            mc = my_colors[c % len(my_colors)]
            ar = Arrow(x+0.5+d, n-y-0.5+d, 1-2*d, 1-2*d,
                width=0.2, fc=mc, ec=mc)
        else:
            # down arrow
            ar = Arrow(x+0.5, n-y-0.5-d, 0, 2*d-1,
                width=0.2, fc='k', ec='k')
        # patch the arrow
        gca().add_patch(ar)

    def show_alg(n=3):
        """Draws a magic square, n must be odd."""
        if n % 2 != 1:
            raise ValueError, "Magic(n) requires n to be odd."

        # prepare the figure, draw grid lines, hide ticks
        axis('scaled')
        axis([0, n, 0, n])
        for i in range(n):
            plot([0, n], [i, i], 'b')
            plot([i, i], [0, n], 'b')
        xticks([])
        yticks([])

        # alternating color index
        altc = 0

        # initialize variables
        m, row, col = zeros([n, n]), 0, n/2

        # go through all the numbers from 1 to n**2
        for num in xrange(1, n**2+1):
            # assign the current number and display it on the figure
            m[row, col] = num
            text(col+0.5, n-row-0.5, str(num), va='center', ha='center')

            # store current row and col
            pcol, prow = col, row

            # increment row and col
            col = (col+1) % n
            row = (row-1) % n

            # if location (col, row) is nonzero, it means the cell
            # is occupied, move down
            if m[row, col]:
                col = pcol % n
                row = (prow+1) % n

            # if current location minus previous location is (1, 1)
            # draw a top-right arrow
            if col-pcol == 1 and prow-row == 1:
                magic_arrow(pcol, prow, True, n, altc)

            # if previous col location is identical to current
            # col location, draw a down arrow (unless it's the last cell)
            elif pcol == col and num != n**2:
                magic_arrow(pcol, prow, False, n)
                altc += 1

            # the following two elif clauses take care of drawing two
            # arrows in case of wrapping: one originating from the current
            # location, the other to the next location
            elif col-pcol == 1 and prow-row != 1:
                magic_arrow(pcol, prow, True, n, altc)
                magic_arrow(pcol, n, True, n, altc)
            elif col-pcol != 1 and prow-row == 1:
                magic_arrow(pcol, prow, True, n, altc)
                magic_arrow(-1, prow, True, n, altc)

            # last cell
            elif num == n**2:
                pass

            # if we've reached this point, there's a bug
            else:
                raise ValueError, "We should never be here."

    def show_some():
        figure()
        for i in range(4):
            subplot(2, 2, i+1)
            show_alg(2*i+3)
            title('N='+str(2*i+3))

    show_some()

I've defined the function magic_arrow() that draws an arrow at a given position using a matplotlib arrow patch. The arrow's direction is determined by comparing the current location with the previous location. Other than that, the code is similar to the one discussed in Chapter 7.

Fractal Function Source Code

In Chapter 9 I made use of a variation of the fractal script in Chapter 7 to create a collage by wrapping it within a function. Listing A-5 shows the function used in creating the fractal collage in Chapter 9.

Listing A-5. Fractal Collage Function

    from PIL import Image
    from cmath import *

    def fractal(delta=0.000001, res=800, iters=30):
        """Creates a z**4+1=0 fractal using the Newton-Raphson method."""

        # create an image to draw on, paint it black
        img = Image.new("RGB", (res, res), (0, 0, 0))

        # these are the solutions to the equation z**4+1=0 (Euler's formula)
        solutions = [cos((2*n+1)*pi/4)+1j*sin((2*n+1)*pi/4) for n in range(4)]
        colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]

        for re in range(0, res):
            for im in range(0, res):
                z = (re+1j*im)/res
                for i in range(iters):
                    try:
                        z -= (z**4+1)/(4*z**3)
                    except ZeroDivisionError:
                        # possibly divide by zero exception
                        continue
                    if abs(z**4+1) < delta:
                        break

                # color depth is a function of the number of iterations
                color_depth = int((iters-i)*255.0/iters)

                # find to which solution this guess converged to
                err = [abs(z-root) for root in solutions]
                distances = zip(err, range(len(colors)))

                # select the color associated with the solution
                color = [i*color_depth for i in colors[min(distances)[1]]]
                img.putpixel((re, im), tuple(color))
        return img
(dot) symbol in regular expressions, 173 == (double equal sign), 63 … (ellipsis symbol), 3, 12, 236 > (greater than), 63 >= (greater-than-or-equal), 63 != (inequality), 63 < (less than), 63 <= (less-than-or-equal), 63 % (modulo) operator, 96 % (string formatting), 82–84 ) (parenthesis), 71–72, 93, 97 + (plus) character in regular expressions, 174 ? (question mark) in regular expressions, 174 << (shift left) operator, 63 >> (shift right) operator, 63 A AbiWord, 48 abspath() function, 335 acos() function, 223 add() function in sets, 79 ImageChops operation, 313, 315 add_option() function, 329, 331 algebra. See linear algebra all() function, 92, 242 Alphabet, Hebrew (example), 179 any() function, 92, 242 append() function, 70–71 arc() function, 295 archive files creating, 338 extracting, 338–339 archiving modules, 337 arctan2(dy, dx) function, 21 argv variable, 327 arithmetic operations, on arrays, 239–240 arange() function, 235–237 array() function, 234 array of values, 118–119 arrays creating, 234–235 data types, 118–119 functions, 234–235, 247 indexing, 235 math functions, 239–240 methods and properties, 241–246 N-dimensional arrays, 234–239 numerical, 14 one-dimensional, 235 reshaping, 235 slicing, 235 storing directory contents in, 127–128 of structs, 119–122 tuples of, 17 two-dimensional, 235 Arrow() function, 219 arrows, adding to graph, 218–219 ASCII (American Standard Code for Informa- tion Interchange), 135 asctime() function, 166–167, 169 asin() function, 223 Index NINDEX350 C Calc, 48 capitalize() function, 145 Cartesian coordinates, 17–18 cascading, functions, 12 catalogs, 131–133 ceil() function, 222 center() function, 145 character completion, with GNU Readline, 40–41 character count, 151–152 chdir() function, 58 children parameter, 216 chirp() function, 274 chmod() function, 334 choice() function, 232 cholesky() function, 253 chord() function, 295 chown() function, 334 chr() function, 92, 179 circles calculating area of, 255–256 plotting, 194 cla() 
function, 197 classes, 96–97 clear() method, 76–78 clf() function, 197 clip() function, 242 clock() function, 131 close() function, 148, 197 cmath module, 221–227 functions, 223 Newton fractal (example), 224–227 cmp() function, 92 coLinux, 34 color depth (fractal example), 226 color maps, 211–212 colors image, 300–303 for plots, 193 COM ports, 2–4 combining files (example), 153–155 combining data based on the epoch, 172–173 command-line interface (CLI), 35, 54–55 command-line parameters, 327–333 commands, entering, 55–56 Comma Separated Values (CSV) files. See CSV (Comma Separated Values) files comments, 5, 85, 157 comment symbol (#), 157 comparison operators, 63 assert statement, 140–143 atan() function, 223 atan2() function, 223–224 attributes, 96 array, 241–246 image, 287–288 augmented assignments, 63 autocompletion feature, 46 axes parameter, 216 axhline() function, 196 axis parameters, setting, 215–217 axis behavior, controlling, 194 axis() function, 20, 186, 194, 216 axis labels, 198–199 axvline() function, 196 B backslash character (\), 59 bar charts, 201–204 Bash, 35 bartlett() function, 278 base conversions, 138–143 base conversions (example), 138–143 basename() function, 335 bases, 61–62 baud rate, 3, 5 binary conversion, in Python 2.5, 139–140 binary editor, 48 binary files, 117–123, 135 array of structs, 119–122 array of values, 118–119 file formats, 104–105 header files with, 122–123 pros and cons, 109 random access and, 319–325 binding, variables, 80 bin() function, 143 bisect() function, 267–268 bitwise AND (%), 63 bitwise exclusive OR (^), 63 bitwise not (~) operator, 63 bitwise operations, 63 bitwise OR (|), 63 Booleans, 67–68 bool() function, 68 break statement, 91–92 bucket fill, 308–312 built-in functions, 92–93 butter() function, 280 bz2 module, 337 NINDEX 351 comparing mortgages (example), 237–238 compiled programming languages, 2–3 compile() function, 173 complex data type, 64–65 complex numbers, 222 compress() function, 242 compression (file 
compression), 337–339 concatenation, of lists, 69 ConfigParser module, 124, 332 configuration files, 123–125 conj() function, 243 conjugate() function, 253 constant() function, 313 constructors, 97 context_diff() function, 341 continue statement, 91–92 contour() function, 210 contour plots, 210 Cooperative Linux, 34 copy() method, 76, 78, 80, 292 cos() function, 223 cosh() function, 223 cosine wave, Fourier transform of, 276–277 count() function, 144 counting objects (image processing exam- ple), 303–312 cPickle module, 325–327 crop() function, 292–293 cropping images, 292–293 C++ style comments, 157 CSV (Comma Separated Values) files, 6–7, 109–117, 159–163 creating, 115–116 limitations of, 116 processing, 9 reading, 9–12 spreadsheets and, 48 when to use, 117 .csv extension, 7 csv module, 9–12, 116, 159–163 csv.reader object, 160 csv.writer object, 160–161 ctime() function, 171 cumprod() function, 242 cumsum() function, 242 curly braces {}, 74 curve fitting, 258–267 Cygwin, 33–34 Cygwin Net Release Setup Program, 33 D darker() function, 313 data combining, based on epoch, 172–173, 332–333 exponential, fitting, 263–264 gathering, 2–6 GPS, 2–6, 12–25 two-dimensional, 285 data analysis, 8–17 GPS data, 12–17 reading CSV files, 9–12 databases, vs. files, 133–134 data files catalogs, 131–133 compiling list of, 8–9 indexing, 128–131 locating, 126–134 searching for, 127–128 storage location, 7 data organization, 6–7 catalogs, 131–133 directories, 126 file formats, 108–126 file name conventions, 102–108 files vs. 
databases, 133–134 indexing, 128–131 introduction to, 101–102 searches, 127–128 data storage decisions on what to store, 116–117 using binary files, 117–123 data structures, 68–80 dictionaries, 68, 74–78 flattened, 238–239 lists, 68–72 ndarrays (NumPy arrays), 233–236 sets, 78–80 tuples, 68, 72–73 data types array, 118–119 Booleans, 67–68 complex, 64–65 file, 147 float, 63–64 int, 60–61 long, 60–61 strings, 65–67 data visualization, 17–25 annotating the graph, 20–22 plotting GPS data, 18–20 NINDEX352 dot() function, 252 dot products, 252 dot (.) symbol, 12, 173 double equal sign (==), 63 double quotes, 65 dual-boot systems, 37 duplicate files, searching for, 128–131 E EAFP (It’s Easier to Ask Forgiveness than Permission) motto, 4, 138, 155, 158, 288, 290 editors, 45–48 eig() function, 253 element-by-element multiplication, 252 elif statement, 17, 85–86 ellipse() (ImageDraw function), 295 Ellipse (matplotlib patch object), 217 ellipsis symbol (...), 3, 12, 236 else statement, 85–86 encode() function, 179 end-of-day report, 170–171 endswith() function, 9, 146 Enthought Python Distribution (EPD), 38 enumerate() function, 25, 90, 96, 155 epoch, 165, 168–173, 332–333 exceptions, 56, 86–89 execfile() function, 3, 59 exec statement, 140–143 exists() function, 107–108, 335 exit() function, 329 exp() function, 223 expm() function, 254 exponential data, fitting, 263 extend() function, 70–71 eye() function, 235 F fabs() function, 222 Fast Fourier Transform, 275–277 fft() function, 275–277 Fedora project (Linux), 32 field names (in CSV files), 162 figure() function, 22–23, 186, 190 filecmp module, 339–340 file compression, 337–339 file formats, 6–7, 104–105, 108–126 binary, 104–105, 109, 117–123 converting image, 289–290 CSV, 6–7, 109–117 header files, 122–123 preprocessing prior to, 221 subplots, 23 using text, 23–25 velocity plot, 22–23 date extracting from file contents, 168 in file name, 102–103 parsing and formatting, 165–168 writing in current locale (example), 180–181 
Debian Linux, 32 decimal module, 247 deck of cards, 233 decode() function, 179 deep copy, 81 def keyword, 93 De la Loubere method, 244–246 delimiter, 161 del statement, 71 determinant of matrix, 253 detection, signal in noise (example), 270–274 det() function, 253 development environment image viewers, 49 operating systems, 32–37 Python environment, 37–44 software components for, 31–52 spreadsheets, 48 text editors, 45–48 version control systems (VCSs), 49–51 word processors, 48 dict() function, 74 dictionaries, 13, 68, 74–78 dictionary methods, 75 DictReader object, 162–163 DictWriter object, 162–163 diff() function, 24, 218, 247, 273–274 difference() function, 78, 313 difference_update() method, 78 difflib module, 341–342 directories, 126 changing, 8, 57 comparing, 340 compiling list of files in, 8–9 listing contents of, 8 storing in arrays, 127–128 directory manipulation, 333–337 dirname() function, 335 dir statement, 99, 216 discard() method, 78 docstrings, 10, 13–14, 94, 250 doctest module, 140, 245 NINDEX 353 image, 104 INI files, 123–125 Readme files, 123 selecting, 108 text, 109 XML, 125 FileInput module, 332–333 file manipulation, 333–337 file names, 7, 102–108 automating creation of, 106 date and time in file name, 102–103 extensions, 104–105 pattern matching, 334 running index implementation, 107–108 titles, 104 file pointers, 319 files archive, 338–339 binary, 117–123, 135, 319–325 catalog, 131–133 closing, 148 comparing, 339–342 configuration, 123–125 CSV files, 6–12, 109–117, 159–163 data. See data files vs. databases, 133–134 decisions on what to store, 116–117 directories for, 126 documenting contents of, 101 duplicate, 128–131 fixed-size, 323, 328–329 header, 122–123 indexing, 128–131 log files, 163–168, 172–173 multiple, 45, 333 opening, 147–148 reading, 149–150 reading images from, 286 Readme, 7, 123 saving graphs to, 187–189 searching for, 127–128 tar, 338–339 text. 
See text files writing to, 148–149 fill() function, 242 filter design, 279–281 filtering, 279–284 filter() method, 316 filters finite-impulse-response (FIR) filters, 279 high-pass filters (HPFs), 279 image, 315–317 infinite-impulse-response (IIR) filters, 279 low-pass filters (LPFs), 279 finally statement, 87 find() function, 20, 143–144, 269, 271 findall() function, 173 findfont() function, 297 finite-impulse-response (FIR) filters, 279 firwin() function, 279 fixed-length records, 321–322 fixed-size files, 323, 328–329 flattened data structures, 238–239 flatten() function, 243 float data type, 63–64 float() function, 16, 64, 137–138, 159 floating-point numbers, 16, 63–64 flood fill, 308–312 floor() function, 222 flow control statements, 85–92 fmod() function, 222 fonts, 296–297 formatting date and time, 165–168 with print statement, 82–84 strings, 145–146 for statement, 89–90 Fourier expansion, 239–240 Fourier transform, 275–279 of cosine wave, 276–277 window functions, 277–279 fractals, 224–227, 347 fractions module, 248 freqs() function, 279 frequency domain, 275 freqz() function, 279–280 frexp() function, 222 fromfile() function, 119, 244 fromkeys() function, 76 fsolve() function, 267–268 functions, 68 approximating, with polynomials, 264–266 built-in, 92–93 cascading, 12 defining, 93–96 fitting to discrete known values, 258–267 Fourier transform, 275–279 generators, 94–95 searching for, 250 special functions, 268 See also specific functions NINDEX354 G gauss() function, 229 gausspulse() function, 274 gca() function, 215 gcf() function, 215 generator expressions (genexps), 95–96 generators, 94–95 Gentoo Linux, 32, 38 get() function, 76–77 getatime() function, 335 getctime() function, 335 getcwd() function, 58 getdata() function, 301–302 getmtime() function, 335 getopt module, 329 getp() function, 214–217 getsize() function, 336 glob module, 334 gmtime() function, 112, 169 GNU Emacs, 47 GNU/Linux, 32–33 Gnumeric, 48 GNU Nano, 47 GNU Octave, 41, 189 gnuplot, 
42–43 GNU Public License (GPL), 29 GNU Readline, 40–41 GPS data analyzing, 12–14 case study, 2–3, 8 extracting, 14–17 plotting, 18–20 recording, 2–6 visualization, 17–25 GPS graphs, annotating, 20–22 GPS values, 2 graphical user interface (GUI), 35 graphs, 183 adding arrows to, 218–219 additional, 210–213 annotating with text, 197–200 axis, 194 axis labels, 198–199 bar charts, 201–204 colors, 193 controlling, 194–197 erasing, 197 getting and setting values, 213–217 grids and ticks, 195–196 histograms, 204–205 vs. image files, 184–187 interactive, 185–187 legends, 198–199 line widths, 192 logarithmic plots, 207–208 marker sizes, 192 matplotlib package. See matplotlib package patches, 217–220 pie charts, 206–207 plotting, 189–193 polar plots, 208–209 saving to files, 187–189 stem plots, 209–210 subplots, 196–197 summary example, 200–201 target audience and, 183 titles, 198 types, 201–213 See also plots greater than (>), 63 greater-than-or-equal (>=), 63 grep, 155 grid() function, 19, 195–196 grids, 195–196 GUI (graphical user interface), 35 gzip module, 337 H hamming() function, 210, 278 hanning() function, 278 hashing algorithm, 75 has_key() method, 76–77 header files, 122–123 header stamps, 13 head() function, 152–153 head utility, 152–153 heart-rate monitor (example), 281–282 Hebrew alphabet (example), 179–180 help() function, 10–11, 185, 99 help system, 56–57 hex() function, 62, 138–140 hexadecimal base, 62 hexedit, 48 high-pass filters (HPFs), 279 hist() function, 204–205 histograms, 204–205 history command, 58 hyperbolic function, 223 hypot() function, 223 NINDEX 355 I i18n (internationalization), 177 IDEs (integrated development environ- ments), 39–41 IDLE, 39 ifft() function, 276 if statement, 17, 85–86 iirdesign() function, 279 imag (imaginary) attribute, 243 image annotation, 294–300 fonts, 296–297 with geometrical shapes, 294–295 text annotations, 295–300 image arithmetic, 312–315 image attributes, 287–288 image catalog, 287–288, 298 ImageChops module, 
312–315 Image class, 286 ImageDraw object, 294–300, 310 ImageFilter class, 316–317 image filtering, 315–317 image formats, 104 image modes, 291 image processing, 300–315 counting objects, 303–312 matrix representation and colors, 300–303 packages for, 43 two-dimensional data, 285 images colors, 300–303 converting file formats, 289–290 copying and pasting, 292 creating, 286, 291 cropping and resizing, 292–293 displaying, 288 manipulation of, 291–294 reading from file, 286 rotating, 293–294 split, 300–301 thumbnail, 298–300 image viewers, 49 import statement, 3, 98–99 indentation (tabs), 5 index() function, 71 indexing, 128–131 arrays, 235 lists, 70 tuples, 73 inequality (!=), 63 infinite-impulse-response (IIR) filters, 279 INI files, 123–125 __init__ function, 97 inner() function, 252–253 inner products, 252 in operator, 67, 70, 74 insert() function, 71 int() function, 60, 62, 103, 137–140, 159 int data type, 60–61 integer division, 64 integrated development environments (IDEs), 39–41 integration algorithms, 254–258 interactive graphs, 185–187 interactive help system, 56–57 interactive Python, 54–58 interactive sessions, vs. 
Python scripts, 3 internationalization, 176–181 interp() function, 259, 266 interpolation, 258–267 approximation of functions using, 264–266 piecewise linear interpolation, 258–260 spline interpolation, 266–267 interpreted programming languages, 2–3 intersection() function, 78 intersection_update() method, 78 inverse square root, 258 inv() function, 252 IPython, 39–40 IronPython, 38 isabs() function, 336 isalnum() function, 146 isdigit() function, 158 isdir() function, 336 isfile() function, 336 islower() function, 146 ISO date and time format, 15, 167 isspace() function, 146 issubset() method, 78 issuperset() method, 78 istitle() function, 146 isupper() function, 146 -i switch, 59 items() method, 76 iterators, 89, 90, 94–95 iteritems() method, 76, 90 iterkeys() method, 76 itervalues() method, 76 J join() function, 137, 144, 336 JPEG (Joint Photographic Experts Group), 184 justification, text, 145 Jython, 38 K kaiser() function, 278 keys, 74 keys() method, 76 L l10n (localization), 177 Latin alphabet, 180 latitude, 15, 17 lazy copy, 81 ldexp() function, 222 legend() function, 19, 198–199, 210 legends, 198–199 len() function, 67, 70, 137, 151 less than (<), 63 less-than-or-equal (<=), 63 licensing, 51–52 lighter() function, 313 linear algebra, 251–254 additional functionality, 254 matrix decomposition, 253–254 solving systems of linear equations, 251–252 vector and matrix operations, 252–253 linear equations, solving systems of, 251–252 linear interpolation, piecewise, 258–260 linearization process, 15, 112 linear regression of nonlinear functions, 263 with polyfit(), 261–262 line breaks, suppressing, 82 line count, 151–152 line() function, 295 line numbering, 46 lines, 137, 189–190 line widths, 192 linspace() function, 235, 237 Linux, 32–36 list comprehensions, 91, 237–238, 304 listdir() function, 58 list() function, 69 list methods, 71 lists, 68–72
ljust() function, 145 locale.getpreferredencoding() function, 181 locale module, 177–178 localization, 176–181 localtime() function, 106, 165, 166 loc parameter, 199 log10() function, 223 logarithmic function, 223 logarithmic plots, 207–208 log files, 163–168, 172–173 log() function, 223 logical operations, 68 loglog() function, 207–208 logspace() function, 207, 235 long data type, 60–61 longitude, 15–17 lookfor() function, 250 lower() function, 145 low-pass filters (LPFs), 279 lstrip() function, 144 M Mac OS, 32, 36 macros recording, 46–47 support for, 46 magic square arrows, 345–347 magic squares, 244–246 makedirs() function, 335 manually installing packages (example), 44 markers, 189–190 marker sizes, 192 match() function, 173 math math module, 221–227 cmath module, 221–227 data visualization and, 221 Newton fractal (example), 224–227 NumPy module, 233–247 random module, 228–233 mathematical expressions, 200 mathematical symbols, 200 math functions, 239–240 math module, 221–224 MATLAB, 1, 41, 189 matplotlib.finance module, 113 matplotlib objects, 214–216 matplotlib package, 17–19, 41–42, 183–184, 286 file formats supported by, 187–188 getting and setting values, 213–217 interactive graphs, 185–187 plotting graphs, 189–193 ways to use, 184 matrix calculating inverse of, 252 decomposition, 253–254 operations, 252–253 representation, 300–303 MaxFilter, 317 max() function, 243 mean() function, 243 MedianFilter, 317 Mercurial, 50 merge() function, 302 meshgrid() function, 213 methods, 96 array, 241–246 See also functions Minesweeper, 308 MinFilter, 317 min() function, 226, 243 mkdir() function, 335 mktime() function, 112, 169 ModeFilter, 317 modf() function, 222 modules, 97–99 modulo (%) operator, 96 mortgage comparison (example), 237–239 movement artifact (example), 281 moving average (example), 283–284 multiple files editing, 45 searching for text in, 333 N naming conventions.
See file names National Marine Electronics Association (NMEA), 13 ndarray (NumPy) object, 233–234 ndim attribute, 243 N-dimensional (NumPy) arrays, 234–239 functions for creating, 234 mortgage comparison (example), 237–239 usefulness of, 236 newton() function, 267–268 Newton’s method (also Newton-Raphson method), 224–227, 258, 267 NMEA 0183 format, 13–14 noise, detection of signal in presence of, 270–274 nonlinear equations, solving, 267–268 nonlinear functions, linear regression of, 263 nonzero() function, 242 Notepad++, 47 nudging subplots, 343–344 numbers base conversions, 138–143 bases, 61–62 bitwise operations, 63 comparisons, 63 complex, 64–65, 222 converting strings to, 15, 137–143 extracting from text file, 157–159 floating-point, 63–64 int data type, 60–61 long data type, 60–61 random, 228–233 numerical analysis, 249–268 curve fitting, 258–267 integration, 254–258 interpolation, 258–267 linear algebra, 251–254 numerical integration, 254–258 polynomials, 260–266 solving nonlinear equations, 267–268 splines, 266–267 special functions, 268 root finding (polynomials), 260 numerical arrays, 14 numerical integration, 254–258 NumPy module, 14, 41–42, 222 array creation, 234–235 array methods and properties, 241–247 lookfor() function, 250 math functions, 239–240 ndarray object, 233–234 N-dimensional arrays, 236–239 slicing, indexing, and reshaping arrays, 235 who() function, 250 O object-oriented programming, 96–97 objects counting, in image processing, 303–312 lists, 69–72 tuples, 72–73 object serialization, 325–327 octal base, 62 Octave-Forge, 250 oct() function, 62, 138–140 one-dimensional arrays (vectors), 235
ones() function, 234 open() function, 147–148 operating systems, 32–37 choosing, 35–36 GNU/Linux, 32–33 Mac OS, 32 using several, 36–37 Windows, 33–35 OptParse module, 329–332 ord() function, 92 os.chdir(path) function, 58 os.chmod() function, 334 os.chown() function, 334 os.getcwd() function, 58 os.listdir(path) function, 58 OS locale support, 177 os.makedirs() function, 335 os.mkdir() function, 335 os module, 57–58, 334–335 os.path.abspath() function, 335 os.path.basename() function, 335 os.path.dirname() function, 335 os.path.exists() function, 107–108, 335 os.path.getatime() function, 335 os.path.getctime() function, 335 os.path.getmtime() function, 335 os.path.getsize() function, 336 os.path.isabs() function, 336 os.path.isdir() function, 336 os.path.isfile() function, 336 os.path.join() function, 137, 144, 336 os.path module, 335–336 os.path.splitext() function, 336 os.path.split() function, 336 os.remove() function, 71, 78, 335 os.rename() function, 335 os.renames() function, 335 os.rmdir() function, 335 os.walk() function, 8–9 outer() function, 253 outer products, 253 output files, naming, 227 P packages, 41–44, 97–99 packages, manually installing (example), 44 Parallels, 34 parameters, command-line, 327–333 parse_args() method, 329, 331 parsing, date and time, 165–168 pass statement, 4, 86 paste() function, 292 patches, 217–220 path names, 127 PATH variable, 59 patterns, regular expression, 173–174 PDF, 184 Pickle module, 325–327 piecewise linear interpolation, 258–260 pie charts, 206–207 plain text files, 135 plot() function, 19–20, 189–193, 214 plot lines, 189–190 plot markers, 189–190 plots, 183 changing color of, 20 contour, 210 displaying several graphs in one, 191 GPS location, 18–20 logarithmic, 207–208 matplotlib package, 183–184 plot summary example, 200–201 polar, 208–209 stem, 209–210 subplots, 23, 196–197 velocity, 22–23 See also graphs plotting, 189–193 colors, 193 lines and markers, 189–190 line widths, 192 marker sizes, 192 multiple graphs on one figure, 191 packages for, 42–43 PNG (Portable Network Graphics), 184 point() function, 295 polar plots, 208–209 poly() function, 260 polyadd() function, 260 polyder() function, 261 polydiv() function, 260 polyfit() function, 261 approximation of functions, 264–266 linear regression with, 261–262 polygon() function, 295 polyint() function, 261 polymul() function, 260 polynomials, 260–266 approximating functions with, 264–266 linear regression, 261–263 representing as vectors, 260 uses of, 261–266 polysub() function, 260
polyval() function, 261 pop() function, 71, 76, 78 popitem() method, 76 port numbers, 3–4 PostScript, 184 pow() function, 223 power functions, 223 pprint() function, 81 printf() function, 82 print statement, 81–84 probability questions, solving using random module, 229–231 prod() function, 243 programming languages compiled, 2–3 interpreted, 2–3 projections, plotting, 18 properties, array, 241–246 ptp() function, 243 putdata() function, 302 putpixel() function, 226 .py extension, 3 PyGTK, 184 PyLab module, 14, 41, 184–185 PyReadline, 40 pySerial module, 3–4, 43 Python about, 53–54 as interpreted programming language, 2–3 comments in, 5 data structures, 68–80 data types, 60–68 downloading, 38 entering commands, 55–56 functions, 92–96 help system, 56–57 image processing packages, 43 installation, 37–44 integrated development environments (IDEs), 39–41 interactive mode, 54–58 invoking, 54–55 language features, 54 math capabilities, 221–248 modules and packages, 97–99 operating systems and, 32–37 packages (additional), 43 plotting packages, 42–43 running interactively, 2–3 running scripts in, 3, 58–59 scientific computing packages, 38, 41–42 stand-alone (natively) environment, 33 statements, 81–92 variables, 80–81 versions, 37–38 Python 2.5, 38, 139–140 Python 2.6, 38 Python 3.0, 38 Python Imaging Library (PIL), 43, 226, 285, 290 Python scripts vs.
interactive sessions, 3 running, 3, 58–59 Python Software Foundation (PSF), 29 Python Standard Library, 8 Python Win32 Extensions, 44 Python(x,y), 38 Q qr() function, 253 quad() function, 257–258 Quake III, 258 quiver() function, 211–213 quotechar parameter, 161 quotes double, 65 single, 65 triple-double-quotes, 65–66 R randint() function, 229 randn() function, 193, 271 random access, 319–321 random() function, 229, 231 random module, 228–233 functions, 229, 232 random sequences, 232 solving probability questions using, 229–231 random numbers, 228–233 random sequences, 232 randrange() function, 229, 307 range() function, 90, 92 ranges, 175 raw_input() function, 84–85 raw strings, 65–66 read() function, 121, 149–150 readline() function, 319 readlines() function, 149–150, 152 Readme files, 7, 123 read(n) function, 319 real attribute, 243 recording GPS data, 5–6 rectangle() function, 295 recursion, 308–310 regular expressions, 173–176 patterns, 173–174 ranges, 175 removing extra spaces with, 174 special sequences, 175 when to use, 175–176 remez() function, 279 remove() function, 71, 78, 335 rename() function, 335 renames() function, 335 replace() function, 143–145, 158 report() function, 340 research and development (R&D), 1, 29 reshape() function, 235, 243 reshaping, arrays, 235 resize() function, 235, 243, 292–293 resizing images, 292–293 re.split() function, 173 result variable, 56 return statement, 93 reverse() function, 71 reversed() function, 25, 90 rgrids() function, 208 rjust() function, 145 rmdir() function, 335 Rossum, Guido van, 54 rotate() function, 293–294 round() function, 243 rstrip() function, 144 running index, 107–108 run (IPython) command, 3 S sample() function, 232 savefig() function, 187–189 save() function, 289 sawtooth() function, 274 scanning serial ports, 3–4 scientific computing packages, 41–42 SciPy module, 41–42, 250–251 importing modules, 251 scipy.interpolate module, 266–267 scipy.integrate module, 257 scipy.optimize module, 267
scipy.signal module, 279 scipy.special module, 268 SciTE (Scintilla Text Editor), 47 scope, 97 scripts, 4 Python, 3, 58–59 running, 3, 58–59 stand-alone, 328–329 storage location, 7 use of, 8 search() function, 173 searching, text files, 155–156 searchsorted() function, 242 seek() function, 319–323 select() function, 269 self argument, 97 semilogx() function, 207 semilogy() function, 207–208 sequences, random, 232 sequence unpacking, 17 Serial() function, 4 serial port parameters, 3 serial ports, 2 accessing, 3 closing, 4, 6 scanning, 3–4 set() function, 78 set operations, 78 setdefault() method, 76 setp() function, 183–217 sets, 78–80 setuptools package, 44 shallow copy, 81 shape attribute, 243 shift left (<<) operator, 63 shift right (>>) operator, 63 show() function, 20, 185–187, 189, 288 shuffle() function, 232 shutil module, 336–337 Siamese method, 244–246 signal processing, 249–250, 268–284 detection of signal in noise, 270–274 diff() function, 273–274 filtering, 279–284 filter design, 279–284 find() function, 269 Fourier transforms, 275–277 select() function, 269 split() function, 273–274 waveforms, 274–275 where() function, 251 window functions, 277–279 signal.triang() function, 270 simulations, random numbers and, 228–229 sin() function, 223, 264–266 single quotes, 65 sinh() function, 223 sleep() function, 167 slicing arrays, 235 lists, 70 tuples, 73 software components, 31–52 image viewers, 49 licensing, 51–52 operating systems, 32–37 Python, 37–45 spreadsheets, 48 text editors, 45–48 version control systems, 49–51 word processors, 48–49 software licensing, 51–52 solve() function, 252 sort() function, 71, 242 sorted() function, 92 source listing (additional), 343–347 spaces, removing extra, 144–145, 174 specgram() function, 211–212 special characters, 173–174 special functions, 268 special sequences, 175 spherical coordinates, converting to Cartesian coordinates, 17–18 spline() function, 266–267 spline interpolation, 266–267 split() function,
336 csv module vs., 116 image processing and, 300, 336 regular expressions and, 173 removing extra spaces, 144 signal detection and, 273–274 splitting text, 136–137 splitfile() function, 153–155 splitext() function, 336 split images, 300–301 split() function, 103 splitlines() function, 136, 144, 151 spreadsheets, 48 sqrt() function, 223, 258, 264 square() function, 274 stand-alone (natively) environment, 33 stand-alone scripts, creating, 328–329 star patch (example), 303–306 startswith() function, 146 state machines, 164 statements, 81–92 break, 91–92 comments, 85 continue, 91–92 dir, 99 elif, 85–86 else, 85–86 exceptions, 86–89 flow control, 85–92 for, 89–90 if, 85–86 import, 98–99 pass, 86 print, 81–84 return, 93 try, 86–89 user input, 84–85 while, 91 yield, 94 statistics (GPS example) calculating, 24 printing, 24–25 std() function, 243 stem plots, 209–210 storage location, of data, 7 str() function, 158 strftime() function, 165–168 string conditionals, 146 string operations, 66–67 strings, 56, 65–68, 136–149 comparing, 341–342 converting to numbers, 15, 137–143 counting number of words and lines in (example), 137 expressing, 65–66 find and replace, 143–144 formatting, 145–146 joining, 137 raw, 65, 66 splitting, 136–137 stripping, 144–145 Unicode, 65, 178–181 writing to files, 148–149 string slicing, 15 strip() function, 144 strptime() function, 103, 165–166, 168 struct.calcsize() function, 120 structs, array of, 119–122 struct_time tuple, 165–166 struct.unpack() function, 121 subdirectories, 126 sub() function, 173–174 subplot() function, 23, 196–197 subplot parameters, modifying, 215–217 subplots, 23, 196–197, 343–344 subtract() function, 313 Subversion, 50 Sudoku puzzles, 244 sum() function, 92, 243, 245–246 svd() function, 253 swapcase() function, 145 symmetric_difference() method, 78 symmetric_difference_update() method, 78 syntax highlighting, 46 sys.argv variable, 327 T tabs, 5 tail() function, 152–153, 322–323 tail functionality, 322–323 tail utility, 152–153 tan() function, 223 tanh() function, 223 tanm() function, 254 tarfile module, 337 tar files, 338–339 target audience, 183 Taylor series expansion, 260 tell() function, 319–323 TeX syntax, 200 text, 23–25 adding to graphs, 197–200 find and replace, 143–144 removing extra spaces from, 144–145, 174 searching for, in multiple files, 333 splitting, 136–137 strings, 136–147 text annotations, 295–300 text editors, 45–48 text file formats, 104, 109–117 text files, 135–136 character, word, and line count, 151–152 closing, 148 comments, working with (example), 157 date and time, 163–173 extracting numbers from, 157–159 head and tail utilities, 152–153 internationalization and localization, 176–181 log files, 163–168 opening, 147–148
plain, 135 reading, 149–150 regular expressions, 173–176 searching inside, 155–156 splitting and combining, 153–155 working with, 150–159 writing to, 148–149 See also CSV files text() function, 21, 199, 295–300 text rendering, 199 textsize() function, 296 thetagrids() function, 208 thumbnail() function, 293 thumbnail index image, 298–300 ticks, 195–196 time epoch representation, 168–173 extracting from file contents, 168 in file name, 102–103 linearizing the time base, 168–170 parsing and formatting, 165–168 time-based binary data, 323–325 time domain, 275 time module, 5, 164–165 timestamps, 107, 163 timestamp string, 15 title() function, 145, 198 titles adding to graph, 198 file name, 104 tofile() function, 244, 324 tolist() function, 244 trace() function, 243 transpose() function, 243, 253 trapz() function, 256 triang() function, 275 trigonometric function, 223 triple-double-quotes, 65–66 try statement, 86–89 tuple() function, 72 tuples, 17, 68, 72–73 two-dimensional arrays, 235 two-dimensional data, 285 type() function, 92 U Ubuntu Linux, 32 unichr() function, 179 Unicode strings, 65, 178–181 uniform() function, 229 union() function, 78 unittest module, 140, 245 UNIX-like operating systems, 32–33 unpacking, tuples,
73 update() method, 76, 78 upper() function, 145 USB GPS receivers, 2 UTF (Unicode Transformation Format), 178 user input, 84–85 V ValueError exceptions, 138 values() method, 74–76 var() function, 243 variables, 80–81 binding, 80 printing list of, 250 saving and retrieving, 326–327 scope, 97 serialization of, 325–327 vdot() function, 252 vector operations, 252–253 vectors, 235, 260 velocity plot, 22–23 version control systems (VCSs), 49–51 Vim, 47 virtual machines (VMs), 34–37 W walk() function, 8–9 walking directories, 8–9 waveforms, 274–275 where() function, 251 while statement, 91 who() function, 250 window functions, 277–279 Windows, 33–36 Cygwin, 33–34 stand-alone (natively), 33 virtual machines (VMs), 34–35 word count (example), 151–152 word processors, 48 words, counting in strings, 137 words, used only once (example), 176 World factbook, CIA, 201 Write, 48 writelines() method, 148 write() method, 148–149, 179 wxPython, 184 X x-axis, 194 xlabel() function, 19, 198 xlim() function, 205 XML (Extensible Markup Language), 125 xrange() function, 90, 95–96 xticks() function, 195–196 X windows, 47 Y Yahoo! financial data, reading and plotting, 113–114 y-axis, 194 yield statement, 94 ylabel() function, 19, 198, 216 yticks() function, 195–196 Z zeros() function, 234 zipfile module, 337 zip() function, 92, 226, 232 zlib module, 337



