one line at a time:

    for line in page:
        sys.stdout.write(line)

Understanding Web page errors

If there's a problem opening the page, Python raises an exception specifying the kind of error. Sometimes the message is long, so we won't reproduce the whole thing here. The critical part of the message is the last line. Here's part of the message you get if the site exists but the page doesn't exist:

    >>> page = urllib2.urlopen('http://www.python.org/idontexist.html')
    [...]
    urllib2.HTTPError: HTTP Error 404: Not Found

Detailed error reporting is useful when you want to write code that handles different Web errors in different ways. For example, you might write code that checks a Web page's links and logs whether the link is still valid, whether it redirects to another page, or whether the linked page no longer exists. (The spider.py program in Chapter 4 uses this functionality.)

Finding information about a URL

Python has two useful tools for getting the scoop on a URL.

info()

To get meta-information (server type, last-modified date, content type, and so on) about a URL you've opened, use the info() method, like this:

    >>> print page.info()
    Date: Sun, 21 May 2006 07:52:15 GMT
    Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e
    Last-Modified: Mon, 17 Apr 2006 12:04:05 GMT
    ETag: "240110-8ac5-3811b740"
    Accept-Ranges: bytes
    Content-Length: 35525
    Connection: close
    Content-Type: text/html

geturl()

To get the actual URL that was opened when you made your request, use the geturl() method, like this:

    >>> redirect = urllib2.urlopen('http://www.livejournal.com/users/firecat')
    >>> redirect.geturl()
    'http://firecat.livejournal.com/'

Tip: Use geturl() to find out whether the URL you opened redirected you somewhere else.

Processing special characters in a URL

To process special characters (such as a space or tilde) in a URL, use the urllib module's quote_plus() function.
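As a minimal sketch of what this quoting does to spaces and plus signs, the snippet below round-trips a string through quote_plus() and unquote_plus(). The sample string is made up for illustration, and the try/except import is an assumption added only so that the sketch also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib import quote_plus, unquote_plus        # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import quote_plus, unquote_plus  # Python 3 location (assumption: newer interpreter)

raw = 'cats + dogs & birds'      # hypothetical form value, not from the chapter

quoted = quote_plus(raw)         # spaces become +, a literal + becomes %2B, & becomes %26
restored = unquote_plus(quoted)  # reverses the quoting

print(quoted)
print(restored)
```

Because the literal + is escaped before spaces are turned into +, the round trip is unambiguous: unquote_plus() gives back exactly the string that went in.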
Warning: Although the functions quote() and unquote() are similar to quote_plus(), we recommend using quote_plus() to avoid problems with URLs that contain + and space characters.

The quote_plus() function has a required argument and an optional argument.

Required argument

The quote_plus() function requires a URL string (or part of a URL) as an argument and does the following:

• Replaces special characters (anything but letters, digits, underscores, dots, and hyphens) by using the "%xx" escape format
• Replaces spaces with + (required for quoting HTML form contents)
• Escapes + characters in the original string

This example processes a URL with slash characters and a tilde (~):

    >>> import urllib
    >>> urllib.quote_plus('http://cat-and-dragon.com/~stef')
    'http%3A%2F%2Fcat-and-dragon.com%2F%7Estef'

Tip: To reverse this process, use unquote_plus(). (This function doesn't use any optional parameters.)

Optional argument

The quote_plus() function takes an optional safe argument, which specifies characters to leave alone. The following example processes the same URL as the preceding example, but the slash character has been designated as safe:

    >>> urllib.quote_plus('http://cat-and-dragon.com/~stef', safe='/')
    'http%3A//cat-and-dragon.com/%7Estef'

Submitting form data

When communicating via HTTP, the urlopen() function usually makes a GET request to the HTTP server. A GET request carries its data in the URL itself, and browsers and servers limit how long a URL can be (the exact limit varies). To send more data (for example, to submit the results of a form), you need to make a POST request.

Sending data via a POST request requires the following steps. Note that these steps require you to import both urllib (for the urlencode() function) and urllib2 (for the urlopen() function):

1. Type the following:

       import urllib, urllib2

2. Use the urllib.urlencode() function to encode the data.

   The function takes one of the following arguments:

   o A mapping object (such as a dictionary of form fields)
   o A sequence of two-element tuples

   The function converts its argument to the format application/x-www-form-urlencoded. Here's an example:

       >>> mylist = [('1', 'one'), ('2', 'two'), ('3', 'three')]
       >>> mydata = urllib.urlencode(mylist)
       >>> mydata
       '1=one&2=two&3=three'

3. Call urllib2.urlopen() and pass the data from urlencode() as the second argument, like this:

       x = urllib2.urlopen('http://www.company.com', mydata)

Taming the Wild URL

The urlparse module splits URLs into components, combines components into a URL, and converts a relative URL to an absolute URL.

To split a URL string into components, pass it to the urlparse() function. There are two optional parameters for urlparse():

• default_scheme: The addressing scheme to assume (for instance, 'http') if the URL doesn't include one. The default is an empty string (which means it's off).
• allow_fragments: Allows URLs with elements following a # character (designating a location somewhere inside a page). Defaults to 1, which means to allow them.

The urlparse() function returns a six-item tuple of strings containing

1. Addressing scheme (http or ftp)
2. Network location (www.python.org)
3. Path (/doc/2.4.2/lib/module-urlparse.html)
4. Parameters (the part that follows a ; in the URL, such as type="a")
5. Query (the part that follows a ?, such as filter=16400)
6. Fragment identifier (the part that follows a #, such as top)

The tuple items don't include delimiters (colons and slashes), except for a leading slash in the path component. If any of the items isn't part of the URL being parsed, the tuple contains an empty string for that item.
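The six components described above can be sketched with a URL that happens to contain all of them. The URL below is hypothetical, chosen only so that no component is empty, and the try/except import is an assumption added so that the sketch also runs on Python 3, where the urlparse module became urllib.parse.

```python
try:
    from urlparse import urlparse, urlunparse, urljoin      # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urlparse, urlunparse, urljoin  # Python 3 location (assumption)

# Hypothetical URL with all six components present.
url = 'http://www.python.org/doc/lib/module.html;type=a?filter=16400#top'

parts = urlparse(url)
scheme, netloc, path, params, query, fragment = parts

print(scheme)    # addressing scheme, without the colon
print(netloc)    # network location, without the slashes
print(path)      # path, with its leading slash
print(params)    # parameters, without the semicolon
print(query)     # query, without the question mark
print(fragment)  # fragment identifier, without the hash sign

# urlunparse() reverses the split; urljoin() resolves a relative URL against a base.
rebuilt = urlunparse(parts)
joined = urljoin('http://python.org/doc/2.4.2/mac/', '../lib/')
print(rebuilt)
print(joined)
```

Splitting and then rebuilding gives back the original string here, which is a quick way to check that a URL survived processing unchanged.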
The following example contains an addressing scheme, network location, and path, but doesn't contain parameters, a query, or a fragment identifier:

    >>> urlparse.urlparse("http://python.org/doc/2.4.2/lib/module-urlparse.html")
    ('http', 'python.org', '/doc/2.4.2/lib/module-urlparse.html', '', '', '')

The urlparse module has a couple of other capabilities:

• To turn a tuple created by urlparse() back into a URL string, pass the tuple to urlunparse().
• To stick together a base URL and a relative URL to create a complete URL, use urljoin(). It takes the base URL and relative URL as strings. It also takes the optional allow_fragments argument. This example joins a single Web page to its location:

      >>> urlparse.urljoin('http://python.org/doc/2.4.2/mac/', 'mac.html')
      'http://python.org/doc/2.4.2/mac/mac.html'

Getting Hip with Hypertext

To read and output HTML-formatted text files and to find links in HTML documents, use the htmllib module. It supports all of XHTML 1.0, all of HTML up to 2.0, and much of HTML 3.0 and 3.2.

Of parsers, formatters, and writers

Reading and outputting an HTML file is a three-step process. Here's what's going on inside Python while it's happening:

1. A parser receives information from an HTML-formatted file. This file might be an HTML document you have on disk, or it might be a Web page you've opened with urllib2.urlopen() (see "Downloading Web Data," earlier in this chapter). When the parser encounters an opening tag, it stores whatever is inside that tag in a buffer until it encounters the matching closing tag.
2. The parser passes the information to a formatter, which reformats it into a human-readable form by reading the HTML tags and applying instructions to them.
3. The formatter passes the reformatted information to a writer, which outputs the information.

Setting up a read-and-output process

The parser, formatter, and writer are all instance objects, so you need to create the instances before they can act.
The parser works on the contents of the file or Web page, so you have to open and read a file or Web page before the parser, formatter, and writer can act.

To set up a read-and-output process, follow these steps:

1. Import the htmllib and formatter modules and the StringIO class, like so:

       import htmllib, formatter
       from cStringIO import StringIO

2. Open an HTML-formatted file:

       myfile = open("memo.html", 'rb')

3. Read the file:

       html = myfile.read()

4. Create a writer instance. The simplest writer is DumbWriter(), which outputs plain text.

       dumdum = formatter.DumbWriter(StringIO())

5. Create a formatter instance and pass it the writer instance. AbstractFormatter() is the most commonly used formatter class.

       fermat = formatter.AbstractFormatter(dumdum)

6. Create an HTMLParser instance and pass it the formatter instance.

       parsley = htmllib.HTMLParser(fermat)

7. Feed the file contents to the parser.

       parsley.feed(html)

8. Close the parser and the file.

       parsley.close()
       myfile.close()

Outputting the links of a Web page

After the parser has read the document, the document's links are available in the data attribute anchorlist.

    >>> parsley.anchorlist
    ['http://www.portmeirion-village.com/']

Getting help for messy HTML

If you have to deal with badly formatted HTML (and there's a lot of it out there), don't despair; a third-party tool can help. It's called Beautiful Soup (after the poem in Alice's Adventures in Wonderland), and full details are available at http://www.crummy.com/software/BeautifulSoup.

The Great XML

Python comes with modules that support Extensible Markup Language (XML), a mechanism for creating structured documents in plain text.

TECHNICAL STUFF: Although we wrote plain text above, XML is actually often encoded in UTF-8. See Chapter 6 for details about character encoding.

TECHNICAL STUFF: Both HTML and XML data include tags, which are text surrounded by brackets (<>). But they use tags differently:

• In HTML, tags are specific formatting instructions.
  For example, <p> starts a new paragraph and <b> marks boldface type.
• In XML, tags label kinds of data. An XML tag can also have subtags for kinds of data that are part of a larger group: the data for an order can include customer data and item data; the data for an item can include a name, a price, and so on. These tags clarify the data structure, so it's easy for a computer to separate the data into parts.

Tip: If XML tags and subtags remind you of Python's classes and their data attributes, you might be on to something….

The ElementTree XML implementation

The best way to use XML in Python is with ElementTree.

• ElementTree is included as xml.etree in Python 2.5.
• ElementTree is available for download from http://www.effbot.org/zone/element-index.htm if you're using an earlier version of Python.

ElementTree is based on an Element data type, which stores hierarchical data structures (such as XML) in memory.

TECHNICAL STUFF: Ordering up some XML

Here's an application for which XML is useful. Suppose you need to send an order. If you were writing it on a piece of paper, it would look something like this:

    Order:
    Eric Half-a-Bee
    123 Main St
    Anytown, CA, 95432

    Items:
    Foo    1    $12    $12
    Bar    2    $15    $30

    Subtotal: $42
    Shipping: $5
    Total: $47

In XML, it would look something like this:

    <order>
      <customer>
        <name>Eric Half-a-Bee</name>
        <address>123 Main St</address>
        <city>Anytown</city>
        <state>CA</state>
        <zip>95432</zip>
      </customer>
      <item>
        <name>Foo</name>
        <quantity>1</quantity>
        <price>12</price>
        <total>12</total>
      </item>
      <item>
        <name>Bar</name>
        <quantity>2</quantity>
        <price>15</price>
        <total>30</total>
      </item>
      <subtotal>42</subtotal>
      <shipping>5</shipping>
      <total>47</total>
    </order>

Importing ElementTree

How you import ElementTree depends on your version of Python.

• In Python 2.5, type this:

      from xml.etree import ElementTree as ET

• In earlier versions, after you download ElementTree, type the following:

      import elementtree.ElementTree as ET

Tip: You don't have to use the as ET part, but it will help if you are following along with our examples.

Creating an element

An ElementTree structure consists of Element instances.

• An Element instance must have a tag, which is a string identifying the element type.
• An Element instance can optionally have attributes, a text string, and subelements (child elements).

To create an element, type ET.Element and pass the tag string as an argument, like this:

    tree = ET.Element("tree")

Element attributes

Element attributes are dictionary key:value pairs. (The keys must be unique.)

Assigning attributes

To assign attributes to an existing element, use its attrib dictionary, like this:

    branch = ET.Element("branch")
    branch.attrib["one"] = "1"
    branch.attrib["two"] = "2"

Manipulating attributes

To manipulate attributes, use the following techniques, which work a lot like dictionary methods (see Chapter 9):

• To assign attributes when creating an element, use keyword arguments, like this:

      branch = ET.Element("branch", one="1", two="2")

• To return attribute values, use the get() method with the key:

      branch.get("one")

• To get a list of attribute keys, use the keys() method, like so:

      branch.keys()

• To get a list of attributes (key:value tuples), use the items() method:

      branch.items()

• To set attribute values, use the set() method, like this:

      branch.set("third", "3")

• To set an element's text attribute, assign it a string, like so:

      branch.text = "this element also contains text"

Subelements

Element instances can have subelements, also called child elements.
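The attribute and text operations above, together with child elements, can be sketched in a few lines. This is only an illustration: the tag and attribute names (order, item, id, status, sku) are made up for the example, and it uses the SubElement(), append(), findtext(), and findall() calls that this chapter describes for working with child elements.

```python
from xml.etree import ElementTree as ET   # Python 2.5 and later

# Hypothetical tags and attributes, chosen only for illustration.
order = ET.Element("order", id="1001")    # attribute set through a keyword argument
order.set("status", "open")               # attribute set after creation

item = ET.SubElement(order, "item", sku="foo")  # create a child and attach it in one step
item.text = "Foo"

second = ET.Element("item", sku="bar")    # create a child first...
second.text = "Bar"
order.append(second)                      # ...and attach it afterward

print(order.get("id"))
print(sorted(order.keys()))
print(order.findtext("item"))             # text of the first matching child
print([child.get("sku") for child in order.findall("item")])
```

Both ways of attaching a child end up in the same list of subelements, so findall() returns them in the order they were added.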
Creating subelements

There are a couple of different ways to build a tree:

• The SubElement() function: You can create subelements and attach them to the parent element by using the SubElement() function. The resulting code looks like this:

      ET.SubElement(tree, "branch")

• The append() method: You can attach subelements to a parent element with the append() method, like this:

      tree.append(ET.Element("branch"))

Tip: Subelements (child elements) are stored as a list, so you can access, add, and delete them by using list methods, like this:

    >>> tree.insert(0, ET.Element("fruit flies"))

Searching for subelements

To search for or within subelements, use these methods:

• find(pattern) returns the first matching subelement.
• findtext(pattern) returns the value of the text attribute for the first matching subelement.
• findall(pattern) returns a list of all matching subelements.

Tip: The pattern argument in find() can be either

• A tag: If you use a tag, only direct subelements with that tag are checked.
• A path: You can use a path to search the entire subtree.

Tip: To search the whole tree, use the getiterator() method. getiterator(tag) returns a list of all subelements with the tag; getiterator() (without an argument) returns a list of all subelements in the subtree.

TECHNICAL STUFF: getiterator() searches the tree in depth-first order; that is, it searches down one branch to the end, and then down the next branch, and so on.

Using XML files

ElementTree is designed to automatically read XML files, represent their structures, and write them to disk or to a Python file object.

Loading documents

To load an XML document and turn it into an ElementTree instance, use the file keyword argument to create a tree from a file in a single operation, like this:

    order = ET.ElementTree(file='order.xml')

Writing an Element tree to disk as an XML file

To save an ElementTree instance back to disk, use the write() method. It takes either a filename or a file object.
The output might not be human-readable.

    order.write(output_file)

To save an Element structure to disk, use the following code:

    ET.ElementTree(tree).write(output_file)

Other useful XML modules

Python includes a few other XML modules:

• xml.dom: New in Python 2.0. Supports Document Object Model (DOM), a method for reading and modifying XML documents that works in a variety of programming languages.

  Tip: DOM creates XML documents in a tree structure and is useful for random-access reading.

• xml.dom.minidom: Supports Document Object Model (DOM) but is simpler and smaller than the full xml.dom module.
• xml.sax: This package implements Simple API for XML (SAX). It's good for huge XML documents because it doesn't read the whole thing in one chunk. However, SAX is more difficult to use than the other modules, so we recommend you use it only when you really need it.

MIME-ing Success: Managing E-Mail Messages

The email package is a library that includes tools for reading the text of e-mail messages, transforming messages from text to Python objects and back again, and generating e-mail objects, including MIME documents.

REMEMBER: The email package doesn't send messages; the smtplib module sends them.

Representing an e-mail message in Python

In Python, e-mail objects are represented as instances of the Message class. A Message instance has two parts:

• Message headers: Addressing information, subject, date, and so on are stored in a dictionary-like format (a mapping), but there are a few differences from dictionaries:

  o Duplicate message headers are possible.
  o Message headers are stored in the order they appeared in the original message.

• Payload: The body of the message can be either a string (a text-only message) or a list of Message objects (a multipart message; for instance, a message that has text, HTML, a GIF file, and so on).

Creating e-mail and MIME objects

To build a message structure, you create a Message instance and add attachments and headers.
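As a rough sketch of that build-up, the snippet below creates a multipart container, adds headers to it, and attaches one text part. The addresses are placeholders, and MIMEText is used here as an assumption (a text-specific MIME class from the same package). The try/except import is also an assumption, added so that the sketch runs both with the older module names used in this chapter and with the email.mime names used by later versions of Python.

```python
try:
    from email.MIMEMultipart import MIMEMultipart   # older module names, as in this chapter
    from email.MIMEText import MIMEText
except ImportError:
    from email.mime.multipart import MIMEMultipart  # later naming (assumption: newer interpreter)
    from email.mime.text import MIMEText

outer = MIMEMultipart()                              # container for the parts
outer['Subject'] = 'Islands of the world'
outer['From'] = 'grue@example.com'                   # placeholder address
outer['To'] = 'geography101@example.com'             # placeholder address

# Attach a plain-text part; attach() takes one message object at a time.
outer.attach(MIMEText('Farallons:\nSeal Rock\nMaintop Island\n'))

print(outer.is_multipart())
print(len(outer.get_payload()))
print(outer.get_content_type())
```

The payload of the container is a list of message objects, so each additional attachment is just another attach() call.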
Creating a Message object

The simplest way to create a Message object is to use the message_from_string() or message_from_file() function.

To create a Message object from a string, use message_from_string(). Pass the function a string, like this:

    >>> import email
    >>> msg = email.message_from_string(mystring)

To create a Message object from a file, use message_from_file(). Pass the function an open file object.

Creating MIME objects via subclasses

The email module's MIME submodules are for attaching particular kinds of data, such as formatted text documents, to an e-mail message. The subclasses automatically set up some of the MIME headers for you.

The name of the class is always the same as the name of its corresponding submodule. To import a class from a MIME submodule, type the following line of code (substitute the name of the class/submodule you want to import):

    from email.MIMEBase import MIMEBase

To create an instance of the class after using the preceding import statement, type code like this:

    mime_msg = MIMEBase(parameters)

The following MIME subclasses create messages that include MIME objects.

MIMEMultipart()

The MIMEMultipart() subclass is an outer layer for MIME messages that have multiple parts. To use Message objects (or other MIME subclass instances), you can either

• Specify the objects as payload subparts when you create an instance of the MIMEMultipart subclass.
• Add the objects later by using the Message.attach() method.

TECHNICAL STUFF: The MIMEMultipart subclass is new in Python 2.2.2.

MIMEBase()

The MIMEBase() subclass is for creating MIME messages with message parts that aren't of specific other MIME types (text, audio, or image).
MIMEBase() requires two Content-Type arguments:

• Major type (for example, application)
• Minor type (for example, msword)

TECHNICAL STUFF: If you aren't sure what types to use, import the mimetypes module and use the guess_type() function, like this:

    >>> import mimetypes, email.MIMEBase
    >>> mimetypes.guess_type('foo.doc')
    ('application/msword', None)
    >>> x = email.MIMEBase.MIMEBase('application', 'msword')

Tip: To attach a Word document to an e-mail message, use the MIMEBase() subclass:

1. Import the modules you need, like so:

       >>> from email.MIMEBase import MIMEBase

2. Open the Word file for reading:

       >>> word_file = file('foo.doc', 'rb')

3. Read the file, and then close it:

       >>> file_contents = word_file.read()
       >>> word_file.close()

4. Make a MIMEBase() object by typing this:

       >>> mime_msg = MIMEBase('application', 'msword')

5. Set the payload of the MIMEBase() object:

       >>> mime_msg.set_payload(file_contents)

Specifying a character set

To specify the character set of a payload, use the set_charset() method.

TECHNICAL STUFF: The set_charset() method argument can be either

• A Charset() instance
• A string with the name of a character set
• None (which removes the charset parameter from the Content-Type: header)

The following line specifies the character set using a string:

    >>> msg.set_charset('ISO-8859-1')

Adding and changing message headers

Message object headers are stored in a dictionary-like object:

• The header field is the key.
• The content of the header is the value.

To create a header, assign a value to the key, like this:

    >>> msg['Subject'] = 'Pictures of My Cat'

Assigning to a key adds a header without removing any existing header of the same name. To change a header, delete it first (del msg['Subject']) or use the replace_header() method.

The add_header() method creates a new header, a value, and optional parameters for the value:

    >>> msg.add_header('MyCat', 'Angus', coat='Tuxedo')

The set_param() method sets or changes a parameter in the Content-Type: header, creating the header if it doesn't exist.
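The header operations above can be tried out on a small message. This is a sketch under stated assumptions: the message text and the replacement subject lines are made up, and get_all() and del are used here only to make the effect of each step visible. It also shows one detail worth seeing in action: assigning to a header key appends a header and leaves any existing header of the same name in place.

```python
import email

# Hypothetical message text used only for this illustration.
msg = email.message_from_string('Subject: Pictures of My Cat\n\nSee attached.\n')

# add_header() stores a value plus optional parameters for that value.
msg.add_header('MyCat', 'Angus', coat='Tuxedo')

# Assignment appends; the message now carries two Subject headers.
msg['Subject'] = 'More Pictures'
subjects_before = msg.get_all('Subject')

# To really change a header, delete the old one(s) first.
del msg['Subject']
msg['Subject'] = 'Pictures of My Cat, Part 2'

# set_param() creates the Content-Type header if the message has none.
msg.set_param('charset', 'ISO-8859-1')

print(subjects_before)
print(msg.get_all('Subject'))
print(msg['MyCat'])
print(msg.get_param('charset'))
```

Checking get_all() before and after the deletion is a simple way to confirm that a message ends up with exactly one copy of a header.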
Adding content to an existing message

To add a payload to a multipart e-mail message that already has an existing payload, use attach() and pass it a Message object. (Call attach() once for each part you want to add.) If the message doesn't have any payload yet, use set_payload() to add a payload stored as a string.

Warning: If the message already has a payload, the set_payload() method replaces it.

Generating MIME documents from message structures

After you have set up an e-mail message as a structure of Python objects, you need to convert it to plain text if you want to send it or print it. This is called flattening the message. There are several ways to flatten a message object. We discuss two of them.

Printing a message object as a string

The easiest way to print the text of a message is to use the as_string() method of Message() objects, like so:

    >>> print littlemsg.as_string()
    Date: Wed, 5 Jul 2006 01:04:42 -0700
    From: Professor Grue
    To: Geography 101 students
    Subject: Islands of the world

    Farallons:
    Seal Rock
    Maintop Island

Encoding binary data into string data

The Encoders module includes functions for turning binary data into data that can be sent as e-mail. Two of its functions are

• encode_quopri(): Use for a message that has mostly text but some unprintable characters.
• encode_base64(): Use for a message that has mostly binary (unprintable) data. The format is more compact than quoted-printable, but humans can't read it.

The encoding functions add a Content-Transfer-Encoding: header. They change a message in place and return None.

To encode the Message() object littlemsg as base 64, type this code:

    >>> import email.Encoders
    >>> email.Encoders.encode_base64(littlemsg)
    >>> print littlemsg
    From nobody Wed Jul 5 01:04:42 2006
    From: Professor Grue
    To: Geography 101 students
    Subject: Islands of the world
    Content-Transfer-Encoding: base64

    RmFyYWxsb25zOgpTZWFsIFJvY2sKTWFpbnRvcCBJc2xhbmQKCgoK

Reading e-mail messages

This section shows how Python can read and manipulate existing e-mail messages.
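A minimal sketch of that round trip follows: message text is turned into a Message object, inspected, flattened again with as_string(), and parsed a second time. The addresses in the sample text are placeholders added for the illustration; the body text matches the example message used in this chapter.

```python
import email

# Sample message text; the addresses are placeholders.
raw = (
    'From: Professor Grue <grue@example.com>\n'
    'To: Geography 101 students <geo101@example.com>\n'
    'Subject: Islands of the world\n'
    '\n'
    'Farallons:\nSeal Rock\nMaintop Island\n'
)

msg = email.message_from_string(raw)    # text -> Message object

print(msg['Subject'])                   # headers are read like dictionary items
print(msg.is_multipart())               # a text-only message has a string payload
print(msg.get_payload())

flat = msg.as_string()                  # Message object -> text (flattening)
again = email.message_from_string(flat) # and back to an object once more
print(again.get_payload() == msg.get_payload())
```

Parsing the flattened text again and comparing payloads is a quick sanity check that nothing was lost in the conversion.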
Getting information and payloads

The simplest way to return an e-mail message as a text string is to use the as_string() method of a Message() instance. To include an envelope header in the string, specify unixfrom=True as the argument.

    mymessage.as_string(unixfrom=True)

To find whether the e-mail message has multiple parts, use the is_multipart() method. It returns True or False. False means the payload is a string.

To see a message's payload, use the get_payload() method. If the message is multipart, it returns a list of Message() objects. If not, it returns a string.

    >>> msg.get_payload()
    [<email.Message.Message instance at 0x...>, <email.Message.Message instance at 0x82e260c>]
    >>> littlemsg.get_payload()
    'Farallons:\nSeal Rock\nMaintop Island\n'

Parsing e-mail messages

The classes of the Parser module process more complex messages. When you already have the whole message as a string or file, use its "classic" Parser class. Follow these steps:

1. Import the Parser module, like so:

       >>> import email.Parser

2. Create an instance of the Parser() class:

       >>> p = email.Parser.Parser()

3. Use the parse() method for a file (pass it a file object) or the parsestr() method for a string (pass it a string).

   The text must be a block of headers followed by the body of the message (which can have MIME-encoded subparts). To read only the headers, pass the optional parameter headersonly=True.

       >>> x = p.parse(fp)
       >>> print x
       From nobody Wed Jul 5 01:04:42 2006
       From: Professor Grue
       To: Geography 101 students
       Subject: Islands of the world

       Farallons:
       Seal Rock
       Maintop Island

Using e-mail utilities

The email.Utils module includes the following message tools:

• Functions for managing quotation marks and other delimiters

  o quote() takes a string and escapes backslashes and double quotes by preceding them with \.
  o unquote() removes double quotes and angle brackets from the beginning and end of a string.

• Functions for reading address headers

  o parseaddr() splits the header into two parts: real name and e-mail address. It returns a tuple.
  o formataddr() takes the parseaddr() tuple and turns it back into a string suitable for an address header.
  o getaddresses() takes a list of header values and returns a parsed list of addresses found in the headers. This code gets all the addresses in a message's To: header:

        from email.Utils import getaddresses
        to_addrs = msg.get_all('to', [])
        to_recips = getaddresses(to_addrs)

• Tools for working with dates

  o parsedate() reads a date in a message header and returns a nine-item tuple that works with time.mktime() (see Chapter 17). If the date doesn't follow the standard format, parsedate() tries to guess how the date is formatted.
  o formatdate() takes a floating point time value (or uses the current time) and returns a date string of the following format:

        Fri, 09 Nov 2001 01:08:47 -0000

• Functions for decoding and encoding character sets

  o decode_rfc2231() decodes a string according to RFC 2231.
  o encode_rfc2231() encodes a string and accepts optional character set and language arguments.
  o collapse_rfc2231_value() turns a get_param() three-item tuple (character set, language, and value) into a Unicode string.
  o decode_params() decodes a list of two-item tuples in the format (content-type, string-value).

Simply SMTP

The email package only generates and processes e-mail messages. To send e-mail messages, you need the smtplib module. SMTP stands for Simple Mail Transfer Protocol.

The primary way to use the smtplib module is to create an instance of the SMTP class (its parameter is a hostname); then use the sendmail() and quit() methods in that order.

The sendmail() method requires the following arguments:

• From-address string
• List of to-address strings, or a string containing a single to-address
• Message (a text string, usually a flattened email.Message object)

TECHNICAL STUFF: The optional arguments for sendmail() are advanced, and we don't cover them in this book.
The response from sendmail() depends on the following factors:

• If the server accepts the mail for at least one of the to-addresses, sendmail() returns normally, which counts as success. If the server refuses every to-address, sendmail() raises an exception instead.
• The return value is a dictionary containing one entry for each recipient who didn't get the mail, along with error codes. The dictionary is empty if every recipient was accepted.

To end the SMTP session and close the connection with the server, use the quit() method.

Tip: The simplest code for sending a message is the following:

    import smtplib
    server = smtplib.SMTP('localhost')
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

CGI: Gateway to the Web

Common Gateway Interface, or CGI, is a standard interface that lets a Web server run scripts to generate Web pages. CGI scripts work like this:

1. A browser makes a request to a Web server.
2. The Web server executes a CGI script.
3. The CGI script output is returned to the browser as a Web page.

You can write CGI scripts in Python. The cgi module includes tools that let your scripts work with Web pages and receive data from form fields. Debugging tools are also provided.

Setting up CGI output in Python

A CGI script usually outputs two sections of text separated by a blank line.

Tip: To generate this text from your Python program, use the print statement:

1. The first section contains headers describing the type of data in the second section. Code for this section might look like this:

       print "Content-Type: text/html"

2. Use an empty print statement to generate the blank line separator:

       print

3. The second section usually includes HTML-formatted text that generates the Web page, like this:

       print "<html><head><title>Message in a browser</title></head>"
       print "<body><h1>Hello, world!</h1>"
       print "<p>Help! I'm trapped in this Web browser"
       print "and I can't get out!</p></body></html>"

Reading data from Web input

CGI scripts usually process input from a Web form created with the HTML tag <form>