STORE = (bo(\P.all x.(girl(x) -> P(x)),z1), bo(\P.exists x.(dog(x) & P(x)),z2))

It should be clearer now why the address variables are an important part of the binding operator. Recall that during S-retrieval, we will be taking binding operators off the STORE list and applying them successively to the CORE. Suppose we start with bo(\P.all x.(girl(x) -> P(x)),z1), which we want to combine with chase(z1,z2). The quantifier part of the binding operator is \P.all x.(girl(x) -> P(x)), and to combine this with chase(z1,z2), the latter needs to first be turned into a λ-abstract. How do we know which variable to abstract over? This is what the address z1 tells us, i.e., that every girl has the role of chaser rather than chasee.

The module nltk.sem.cooper_storage deals with the task of turning storage-style semantic representations into standard logical forms. First, we construct a CooperStore instance, and inspect its STORE and CORE.

>>> from nltk.sem import cooper_storage as cs
>>> sentence = 'every girl chases a dog'
>>> trees = cs.parse_with_bindops(sentence, grammar='grammars/book_grammars/storage.fcfg')
>>> semrep = trees[0].node['sem']
>>> cs_semrep = cs.CooperStore(semrep)
>>> print cs_semrep.core
chase(z1,z2)
>>> for bo in cs_semrep.store:
...     print bo
bo(\P.all x.(girl(x) -> P(x)),z1)
bo(\P.exists x.(dog(x) & P(x)),z2)

Finally, we call s_retrieve() and check the readings.

>>> cs_semrep.s_retrieve(trace=True)
Permutation 1
   (\P.all x.(girl(x) -> P(x)))(\z1.chase(z1,z2))
   (\P.exists x.(dog(x) & P(x)))(\z2.all x.(girl(x) -> chase(x,z2)))
Permutation 2
   (\P.exists x.(dog(x) & P(x)))(\z2.chase(z1,z2))
   (\P.all x.(girl(x) -> P(x)))(\z1.exists x.(dog(x) & chase(z1,x)))
>>> for reading in cs_semrep.readings:
...     print reading
exists x.(dog(x) & all z3.(girl(z3) -> chase(z3,x)))
all x.(girl(x) -> exists z4.(dog(z4) & chase(x,z4)))

10.5 Discourse Semantics

A discourse is a sequence of sentences. Very often, the interpretation of a sentence in a discourse depends on what preceded it. A clear example of this comes from anaphoric pronouns, such as he, she, and it. Given a discourse such as Angus used to have a dog. But he recently disappeared., you will probably interpret he as referring to Angus's dog. However, in Angus used to have a dog. He took him for walks in New Town., you are more likely to interpret he as referring to Angus himself.

Discourse Representation Theory

The standard approach to quantification in first-order logic is limited to single sentences. Yet there seem to be examples where the scope of a quantifier can extend over two or more sentences. We saw one earlier, and here's a second example, together with a translation.

(54) a. Angus owns a dog. It bit Irene.
     b. ∃x.(dog(x) & own(Angus, x) & bite(x, Irene))

That is, the NP a dog acts like a quantifier which binds the it in the second sentence. Discourse Representation Theory (DRT) was developed with the specific goal of providing a means for handling this and other semantic phenomena which seem to be characteristic of discourse. A discourse representation structure (DRS) presents the meaning of discourse in terms of a list of discourse referents and a list of conditions. The discourse referents are the things under discussion in the discourse, and they correspond to the individual variables of first-order logic. The DRS conditions apply to those discourse referents, and correspond to atomic open formulas of first-order logic.
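As an aside, a translation like (54b) can be checked against a model using the evaluation tools introduced earlier in this chapter. The following is only a minimal sketch: the valuation, the individuals a, c, and i, and the lowercased constants angus and irene are invented purely for illustration.

>>> v = """
... angus => a
... irene => i
... dog => {c}
... own => {(a, c)}
... bite => {(c, i)}
... """
>>> val = nltk.parse_valuation(v)
>>> dom = val.domain
>>> m = nltk.Model(dom, val)
>>> g = nltk.Assignment(dom)
>>> m.evaluate('exists x.(dog(x) & own(angus, x) & bite(x, irene))', g)
True

Since c is a dog owned by a, and c bit i, the formula comes out true in this little model.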
Figure 10-4 illustrates how a DRS for the first sentence in (54a) is augmented to become a DRS for both sentences.

Figure 10-4. Building a DRS: The DRS on the lefthand side represents the result of processing the first sentence in the discourse, while the DRS on the righthand side shows the effect of processing the second sentence and integrating its content.

When the second sentence of (54a) is processed, it is interpreted in the context of what is already present in the lefthand side of Figure 10-4. The pronoun it triggers the addition of a new discourse referent, say, u, and we need to find an anaphoric antecedent for it—that is, we want to work out what it refers to. In DRT, the task of finding the antecedent for an anaphoric pronoun involves linking it to a discourse referent already within the current DRS, and y is the obvious choice. (We will say more about anaphora resolution shortly.) This processing step gives rise to a new condition u = y. The remaining content contributed by the second sentence is also merged with the content of the first, and this is shown on the righthand side of Figure 10-4.

Figure 10-4 illustrates how a DRS can represent more than just a single sentence. In this case, it is a two-sentence discourse, but in principle a single DRS could correspond to the interpretation of a whole text. We can inquire into the truth conditions of the righthand DRS in Figure 10-4. Informally, it is true in some situation s if there are entities a, c, and i in s corresponding to the discourse referents in the DRS such that all the conditions are true in s; that is, a is named Angus, c is a dog, a owns c, i is named Irene, and c bit i.

In order to process DRSs computationally, we need to convert them into a linear format. Here's an example, where the DRS is a pair consisting of a list of discourse referents and a list of DRS conditions:

([x, y], [angus(x), dog(y), own(x,y)])

The easiest way to build a DRS object in NLTK is by parsing a string representation.

>>> dp = nltk.DrtParser()
>>> drs1 = dp.parse('([x, y], [angus(x), dog(y), own(x, y)])')
>>> print drs1
([x,y],[angus(x), dog(y), own(x,y)])

We can use the draw() method to visualize the result, as shown in Figure 10-5.

>>> drs1.draw()

Figure 10-5. DRS screenshot.

When we discussed the truth conditions of the DRSs in Figure 10-4, we assumed that the topmost discourse referents were interpreted as existential quantifiers, while the conditions were interpreted as though they are conjoined. In fact, every DRS can be translated into a formula of first-order logic, and the fol() method implements this translation.

>>> print drs1.fol()
exists x y.((angus(x) & dog(y)) & own(x,y))

In addition to the functionality available for first-order logic expressions, DRT Expressions have a DRS-concatenation operator, represented as the + symbol. The concatenation of two DRSs is a single DRS containing the merged discourse referents and the conditions from both arguments. DRS-concatenation automatically α-converts bound variables to avoid name-clashes.

>>> drs2 = dp.parse('([x], [walk(x)]) + ([y], [run(y)])')
>>> print drs2
(([x],[walk(x)]) + ([y],[run(y)]))
>>> print drs2.simplify()
([x,y],[walk(x), run(y)])

While all the conditions seen so far have been atomic, it is possible to embed one DRS within another, and this is how universal quantification is handled.
In drs3, there are no top-level discourse referents, and the sole condition is made up of two sub-DRSs, connected by an implication. Again, we can use fol() to get a handle on the truth conditions.

>>> drs3 = dp.parse('([], [(([x], [dog(x)]) -> ([y],[ankle(y), bite(x, y)]))])')
>>> print drs3.fol()
all x.(dog(x) -> exists y.(ankle(y) & bite(x,y)))

We pointed out earlier that DRT is designed to allow anaphoric pronouns to be interpreted by linking to existing discourse referents. DRT sets constraints on which discourse referents are "accessible" as possible antecedents, but is not intended to explain how a particular antecedent is chosen from the set of candidates. The module nltk.sem.drt_resolve_anaphora adopts a similarly conservative strategy: if the DRS contains a condition of the form PRO(x), the method resolve_anaphora() replaces this with a condition of the form x = [...], where [...] is a list of possible antecedents.

>>> drs4 = dp.parse('([x, y], [angus(x), dog(y), own(x, y)])')
>>> drs5 = dp.parse('([u, z], [PRO(u), irene(z), bite(u, z)])')
>>> drs6 = drs4 + drs5
>>> print drs6.simplify()
([x,y,u,z],[angus(x), dog(y), own(x,y), PRO(u), irene(z), bite(u,z)])
>>> print drs6.simplify().resolve_anaphora()
([x,y,u,z],[angus(x), dog(y), own(x,y), (u = [x,y,z]), irene(z), bite(u,z)])

Since the algorithm for anaphora resolution has been separated into its own module, this facilitates swapping in alternative procedures that try to make more intelligent guesses about the correct antecedent.

Our treatment of DRSs is fully compatible with the existing machinery for handling λ-abstraction, and consequently it is straightforward to build compositional semantic representations that are based on DRT rather than first-order logic. This technique is illustrated in the following rule for indefinites (which is part of the grammar drt.fcfg). For ease of comparison, we have added the parallel rule for indefinites from simple-sem.fcfg.

Det[NUM=sg,SEM=<\P Q.([x],[]) + P(x) + Q(x)>] -> 'a'
Det[NUM=sg,SEM=<\P Q. exists x.(P(x) & Q(x))>] -> 'a'

To get a better idea of how the DRT rule works, look at this subtree for the NP a dog:

(NP[NUM='sg', SEM=<\Q.(([x],[dog(x)]) + Q(x))>]
  (Det[NUM='sg', SEM=<\P Q.((([x],[]) + P(x)) + Q(x))>] a)
  (Nom[NUM='sg', SEM=<\x.([],[dog(x)])>]
    (N[NUM='sg', SEM=<\x.([],[dog(x)])>] dog)))

The λ-abstract for the indefinite is applied as a function expression to \x.([],[dog(x)]), which leads to \Q.(([x],[]) + ([],[dog(x)]) + Q(x)); after simplification, we get \Q.(([x],[dog(x)]) + Q(x)) as the representation for the NP as a whole.

In order to parse with grammar drt.fcfg, we specify in the call to load_parser() that SEM values in feature structures are to be parsed using DrtParser in place of the default LogicParser.

>>> from nltk import load_parser
>>> parser = load_parser('grammars/book_grammars/drt.fcfg', logic_parser=nltk.DrtParser())
>>> trees = parser.nbest_parse('Angus owns a dog'.split())
>>> print trees[0].node['sem'].simplify()
([x,z2],[Angus(x), dog(z2), own(x,z2)])

Discourse Processing

When we interpret a sentence, we use a rich context for interpretation, determined in part by the preceding context and in part by our background assumptions. DRT provides a theory of how the meaning of a sentence is integrated into a representation of the prior discourse, but two things have been glaringly absent from the processing approach just discussed.
First, there has been no attempt to incorporate any kind of inference; and second, we have only processed individual sentences. These omissions are redressed by the module nltk.inference.discourse.

Whereas a discourse is a sequence s1, ... sn of sentences, a discourse thread is a sequence s1-ri, ... sn-rj of readings, one for each sentence in the discourse. The module processes sentences incrementally, keeping track of all possible threads when there is ambiguity. For simplicity, the following example ignores scope ambiguity:

>>> dt = nltk.DiscourseTester(['A student dances', 'Every student is a person'])
>>> dt.readings()
s0 readings:
s0-r0: exists x.(student(x) & dance(x))
s1 readings:
s1-r0: all x.(student(x) -> person(x))

When a new sentence is added to the current discourse, setting the parameter consistchk=True causes consistency to be checked by invoking the model checker for each thread, i.e., each sequence of admissible readings. In this case, the user has the option of retracting the sentence in question.

>>> dt.add_sentence('No person dances', consistchk=True)
Inconsistent discourse d0 ['s0-r0', 's1-r0', 's2-r0']:
    s0-r0: exists x.(student(x) & dance(x))
    s1-r0: all x.(student(x) -> person(x))
    s2-r0: -exists x.(person(x) & dance(x))
>>> dt.retract_sentence('No person dances', verbose=True)
Current sentences are
s0: A student dances
s1: Every student is a person

In a similar manner, we use informchk=True to check whether a new sentence φ is informative relative to the current discourse. The theorem prover treats existing sentences in the thread as assumptions and attempts to prove φ; it is informative if no such proof can be found.

>>> dt.add_sentence('A person dances', informchk=True)
Sentence 'A person dances' under reading 'exists x.(person(x) & dance(x))':
Not informative relative to thread 'd0'

It is also possible to pass in an additional set of assumptions as background knowledge and use these to filter out inconsistent readings; see the Discourse HOWTO at http://www.nltk.org/howto for more details.

The discourse module can accommodate semantic ambiguity and filter out readings that are not admissible. The following example invokes both Glue Semantics and DRT. Since the Glue Semantics module is configured to use the wide-coverage Malt dependency parser, the input (Every dog chases a boy. He runs.) needs to be tagged as well as tokenized.

>>> from nltk.tag import RegexpTagger
>>> tagger = RegexpTagger(
...     [('^(chases|runs)$', 'VB'),
...      ('^(a)$', 'ex_quant'),
...      ('^(every)$', 'univ_quant'),
...      ('^(dog|boy)$', 'NN'),
...      ('^(He)$', 'PRP')
...     ])
>>> rc = nltk.DrtGlueReadingCommand(depparser=nltk.MaltParser(tagger=tagger))
>>> dt = nltk.DiscourseTester(['Every dog chases a boy', 'He runs'], rc)
>>> dt.readings()
s0 readings:
s0-r0: ([],[(([x],[dog(x)]) -> ([z3],[boy(z3), chases(x,z3)]))])
s0-r1: ([z4],[boy(z4), (([x],[dog(x)]) -> ([],[chases(x,z4)]))])
s1 readings:
s1-r0: ([x],[PRO(x), runs(x)])

The first sentence of the discourse has two possible readings, depending on the quantifier scoping. The unique reading of the second sentence represents the pronoun He via the condition PRO(x).
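Before doing so, it may help to see in rough outline what combining these readings into a single thread involves. The following sketch is not the discourse module's actual procedure; it simply replays, by hand, the DRS concatenation and anaphora resolution steps introduced earlier on readings s0-r1 and s1-r0, and the renamed variables will not necessarily match those reported by the module.

>>> dp = nltk.DrtParser()
>>> s0 = dp.parse('([z4],[boy(z4), (([x],[dog(x)]) -> ([],[chases(x,z4)]))])')  # reading s0-r1
>>> s1 = dp.parse('([x],[PRO(x), runs(x)])')                                    # reading s1-r0
>>> combined = (s0 + s1).simplify()         # merge the two DRSs, renaming clashing variables
>>> resolved = combined.resolve_anaphora()  # replace PRO(...) with an equation over accessible referents

The resolved DRS should equate the pronoun's referent with the boy discourse referent, which is essentially what thread d1 below reports.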
Now let's look at the discourse threads that result:

>>> dt.readings(show_thread_readings=True)
d0: ['s0-r0', 's1-r0'] : INVALID: AnaphoraResolutionException
d1: ['s0-r1', 's1-r0'] : ([z6,z10],[boy(z6), (([x],[dog(x)]) -> ([],[chases(x,z6)])), (z10 = z6), runs(z10)])

When we examine threads d0 and d1, we see that reading s0-r0, where every dog outscopes a boy, is deemed inadmissible because the pronoun in the second sentence cannot be resolved. By contrast, in thread d1 the pronoun (relettered to z10) has been bound via the equation (z10 = z6). Inadmissible readings can be filtered out by passing the parameter filter=True.

>>> dt.readings(show_thread_readings=True, filter=True)
d1: ['s0-r1', 's1-r0'] : ([z12,z15],[boy(z12), (([x],[dog(x)]) -> ([],[chases(x,z12)])), (z17 = z15), runs(z15)])

Although this little discourse is extremely limited, it should give you a feel for the kind of semantic processing issues that arise when we go beyond single sentences, and also a feel for the techniques that can be deployed to address them.

10.6 Summary

• First-order logic is a suitable language for representing natural language meaning in a computational setting since it is flexible enough to represent many useful aspects of natural meaning, and there are efficient theorem provers for reasoning with first-order logic. (Equally, there are a variety of phenomena in natural language semantics which are believed to require more powerful logical mechanisms.)

• As well as translating natural language sentences into first-order logic, we can state the truth conditions of these sentences by examining models of first-order formulas.

• In order to build meaning representations compositionally, we supplement first-order logic with the λ-calculus.

• β-reduction in the λ-calculus corresponds semantically to application of a function to an argument. Syntactically, it involves replacing a variable bound by λ in the function expression with the expression that provides the argument in the function application.

• A key part of constructing a model lies in building a valuation which assigns interpretations to non-logical constants. These are interpreted as either n-ary predicates or as individual constants.

• An open expression is an expression containing one or more free variables. Open expressions receive an interpretation only when their free variables receive values from a variable assignment.

• Quantifiers are interpreted by constructing, for a formula φ[x] open in variable x, the set of individuals which make φ[x] true when an assignment g assigns them as the value of x. The quantifier then places constraints on that set.

• A closed expression is one that has no free variables; that is, the variables are all bound. A closed sentence is true or false with respect to all variable assignments.

• If two formulas differ only in the label of the variable bound by the binding operator (i.e., λ or a quantifier), they are said to be α-equivalents. The result of relabeling a bound variable in a formula is called α-conversion.

• Given a formula with two nested quantifiers Q1 and Q2, the outermost quantifier Q1 is said to have wide scope (or scope over Q2). English sentences are frequently ambiguous with respect to the scope of the quantifiers they contain.

• English sentences can be associated with a semantic representation by treating SEM as a feature in a feature-based grammar.
The SEM value of a complex expression typically involves functional application of the SEM values of the component expressions.

10.7 Further Reading

Consult http://www.nltk.org/ for further materials on this chapter and on how to install the Prover9 theorem prover and Mace4 model builder. General information about these two inference tools is given by (McCune, 2008).

For more examples of semantic analysis with NLTK, please see the semantics and logic HOWTOs at http://www.nltk.org/howto. Note that there are implementations of two other approaches to scope ambiguity, namely Hole semantics as described in (Blackburn & Bos, 2005), and Glue semantics, as described in (Dalrymple et al., 1999).

There are many phenomena in natural language semantics that have not been touched on in this chapter, most notably:

1. Events, tense, and aspect
2. Semantic roles
3. Generalized quantifiers, such as most
4. Intensional constructions involving, for example, verbs such as may and believe

While (1) and (2) can be dealt with using first-order logic, (3) and (4) require different logics. These issues are covered by many of the references in the following readings.

A comprehensive overview of results and techniques in building natural language front-ends to databases can be found in (Androutsopoulos, Ritchie & Thanisch, 1995).

Any introductory book to modern logic will present propositional and first-order logic. (Hodges, 1977) is highly recommended as an entertaining and insightful text with many illustrations from natural language.

For a wide-ranging, two-volume textbook on logic that also presents contemporary material on the formal semantics of natural language, including Montague Grammar and intensional logic, see (Gamut, 1991a, 1991b). (Kamp & Reyle, 1993) provides the definitive account of Discourse Representation Theory, and covers a large and interesting fragment of natural language, including tense, aspect, and modality. Another comprehensive study of the semantics of many natural language constructions is (Carpenter, 1997).

There are numerous works that introduce logical semantics within the framework of linguistic theory. (Chierchia & McConnell-Ginet, 1990) is relatively agnostic about syntax, while (Heim & Kratzer, 1998) and (Larson & Segal, 1995) are both more explicitly oriented toward integrating truth-conditional semantics into a Chomskyan framework.

(Blackburn & Bos, 2005) is the first textbook devoted to computational semantics, and provides an excellent introduction to the area. It expands on many of the topics covered in this chapter, including underspecification of quantifier scope ambiguity, first-order inference, and discourse processing.

To gain an overview of more advanced contemporary approaches to semantics, including treatments of tense and generalized quantifiers, try consulting (Lappin, 1996) or (van Benthem & ter Meulen, 1997).

10.8 Exercises

1. ○ Translate the following sentences into propositional logic and verify that they parse with LogicParser. Provide a key that shows how the propositional variables in your translation correspond to expressions of English.
   a. If Angus sings, it is not the case that Bertie sulks.
   b. Cyril runs and barks.
   c. It will snow if it doesn't rain.
   d. It's not the case that Irene will be happy if Olive or Tofu comes.
   e. Pat didn't cough or sneeze.
   f. If you don't come if I call, I won't come if you call.

2. ○ Translate the following sentences into predicate-argument formulas of first-order logic.
   a. Angus likes Cyril and Irene hates Cyril.
   b. Tofu is taller than Bertie.
   c. Bruce loves himself and Pat does too.
   d. Cyril saw Bertie, but Angus didn't.
   e. Cyril is a four-legged friend.
   f. Tofu and Olive are near each other.

3. ○ Translate the following sentences into quantified formulas of first-order logic.
   a. Angus likes someone and someone likes Julia.
   b. Angus loves a dog who loves him.
   c. Nobody smiles at Pat.
   d. Somebody coughs and sneezes.
   e. Nobody coughed or sneezed.
   f. Bruce loves somebody other than Bruce.
   g. Nobody other than Matthew loves Pat.
   h. Cyril likes everyone except for Irene.
   i. Exactly one person is asleep.

4. ○ Translate the following verb phrases using λ-abstracts and quantified formulas of first-order logic.
   a. feed Cyril and give a cappuccino to Angus
   b. be given 'War and Peace' by Pat
   c. be loved by everyone
   d. be loved or detested by everyone
   e. be loved by everyone and detested by no-one

5. ○ Consider the following statements:

   >>> lp = nltk.LogicParser()
   >>> e2 = lp.parse('pat')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   exists y.love(pat, y)

   Clearly something is missing here, namely a declaration of the value of e1. In order for ApplicationExpression(e1, e2) to be β-convertible to exists y.love(pat, y), e1 must be a λ-abstract which can take pat as an argument. Your task is to construct such an abstract, bind it to e1, and satisfy yourself that these statements are all satisfied (up to alphabetic variance). In addition, provide an informal English translation of e3.simplify(). Now carry on doing this same task for the further cases of e3.simplify() shown here:

   >>> print e3.simplify()
   exists y.(love(pat,y) | love(y,pat))
   >>> print e3.simplify()
   exists y.(love(pat,y) | love(y,pat))
   >>> print e3.simplify()
   walk(fido)

6. ○ As in the preceding exercise, find a λ-abstract e1 that yields results equivalent to those shown here:

   >>> e2 = lp.parse('chase')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   \x.all y.(dog(y) -> chase(x,pat))
   >>> e2 = lp.parse('chase')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   \x.exists y.(dog(y) & chase(pat,x))
   >>> e2 = lp.parse('give')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   \x0 x1.exists y.(present(y) & give(x1,y,x0))

7. ○ As in the preceding exercise, find a λ-abstract e1 that yields results equivalent to those shown here:

   >>> e2 = lp.parse('bark')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   exists y.(dog(x) & bark(x))
   >>> e2 = lp.parse('bark')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   bark(fido)
   >>> e2 = lp.parse('\\P. all x. (dog(x) -> P(x))')
   >>> e3 = nltk.ApplicationExpression(e1, e2)
   >>> print e3.simplify()
   all x.(dog(x) -> bark(x))

8. ◑ Develop a method for translating English sentences into formulas with binary generalized quantifiers. In such an approach, given a generalized quantifier Q, a quantified formula is of the form Q(A, B), where both A and B are expressions of type 〈e, t〉. Then, for example, all(A, B) is true iff A denotes a subset of what B denotes.

9. ◑ Extend the approach in the preceding exercise so that the truth conditions for quantifiers such as most and exactly three can be computed in a model.

10. ◑ Modify the sem.evaluate code so that it will give a helpful error message if an expression is not in the domain of a model's valuation function.
11. ● Select three or four contiguous sentences from a book for children. A possible source of examples are the collections of stories in nltk.corpus.gutenberg: bryant-stories.txt, burgess-busterbrown.txt, and edgeworth-parents.txt. Develop a grammar that will allow your sentences to be translated into first-order logic, and build a model that will allow those translations to be checked for truth or falsity.

12. ● Carry out the preceding exercise, but use DRT as the meaning representation.

13. ● Taking (Warren & Pereira, 1982) as a starting point, develop a technique for converting a natural language query into a form that can be evaluated more efficiently in a model. For example, given a query of the form (P(x) & Q(x)), convert it to (Q(x) & P(x)) if the extension of Q is smaller than the extension of P.

CHAPTER 11
Managing Linguistic Data

Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them. The goal of this chapter is to answer the following questions:

1. How do we design a new language resource and ensure that its coverage, balance, and documentation support a wide range of uses?
2. When existing data is in the wrong format for some analysis tool, how can we convert it to a suitable format?
3. What is a good way to document the existence of a resource we have created so that others can easily find it?

Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the life cycle of a corpus. As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

11.1 Corpus Structure: A Case Study

The TIMIT Corpus was the first annotated speech database to be widely distributed, and it has an especially clear organization. TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name. It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.

The Structure of TIMIT

Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials. For each of eight dialect regions, 50 male and female speakers having a range of ages and educational backgrounds each read 10 carefully chosen sentences. Two sentences, read by all speakers, were designed to bring out dialect variation:

(1) a. she had your dark suit in greasy wash water all year
    b. don't ask me to carry an oily rag like that

The remaining sentences were chosen to be phonetically rich, involving all phones (sounds) and a comprehensive range of diphones (phone bigrams). Additionally, the design strikes a balance between multiple speakers saying the same sentence in order to permit comparison across speakers, and having a large range of sentences covered by the corpus to get maximal coverage of diphones. Five of the sentences read by each speaker are also read by six other speakers (for comparability). The remaining three sentences read by each speaker were unique to that speaker (for coverage).

NLTK includes a sample from the TIMIT Corpus. You can access its documentation in the usual way, using help(nltk.corpus.timit).
Print nltk.corpus.timit.fileids() to see a list of the 160 recorded utterances in the corpus sample. Each filename has internal structure, as shown in Figure 11-1.

Figure 11-1. Structure of a TIMIT identifier: Each recording is labeled using a string made up of the speaker's dialect region, gender, speaker identifier, sentence type, and sentence identifier.

Each item has a phonetic transcription which can be accessed using the phones() method. We can access the corresponding word tokens in the customary way. Both access methods permit an optional argument offset=True, which includes the start and end offsets of the corresponding span in the audio file.

>>> phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')
>>> phonetic
['h#', 'sh', 'iy', 'hv', 'ae', 'dcl', 'y', 'ix', 'dcl', 'd', 'aa', 'kcl',
's', 'ux', 'tcl', 'en', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w', 'aa',
'sh', 'epi', 'w', 'aa', 'dx', 'ax', 'q', 'ao', 'l', 'y', 'ih', 'ax', 'h#']
>>> nltk.corpus.timit.word_times('dr1-fvmh0/sa1')
[('she', 7812, 10610), ('had', 10610, 14496), ('your', 14496, 15791),
('dark', 15791, 20720), ('suit', 20720, 25647), ('in', 25647, 26906),
('greasy', 26906, 32668), ('wash', 32668, 37890), ('water', 38531, 42417),
('all', 43091, 46052), ('year', 46052, 50522)]

In addition to this text data, TIMIT includes a lexicon that provides the canonical pronunciation of every word, which can be compared with a particular utterance:

>>> timitdict = nltk.corpus.timit.transcription_dict()
>>> timitdict['greasy'] + timitdict['wash'] + timitdict['water']
['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']
>>> phonetic[17:30]
['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']

This gives us a sense of what a speech processing system would have to do in producing or recognizing speech in this particular dialect (New England). Finally, TIMIT includes demographic data about the speakers, permitting fine-grained study of vocal, social, and gender characteristics.

>>> nltk.corpus.timit.spkrinfo('dr1-fvmh0')
SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86',
birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS',
comments='BEST NEW ENGLAND ACCENT SO FAR')

Notable Design Features

TIMIT illustrates several key features of corpus design. First, the corpus contains two layers of annotation, at the phonetic and orthographic levels. In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels. Moreover, even at a given level there may be different labeling schemes or even disagreement among annotators, such that we want to represent multiple versions. A second property of TIMIT is its balance across multiple dimensions of variation, for coverage of dialect regions and diphones. The inclusion of speaker demographics brings in many more independent variables that may help to account for variation in the data, and which facilitate later uses of the corpus for purposes that were not envisaged when the corpus was created, such as sociolinguistics. A third property is that there is a sharp division between the original linguistic event captured as an audio recording and the annotations of that event. The same holds true of text corpora, in the sense that the original text usually has an external source, and is considered to be an immutable artifact.
Any transformations of that artifact which involve human judgment—even something as simple as tokenization—are subject to later revision; thus it is important to retain the source material in a form that is as close to the original as possible.

A fourth feature of TIMIT is the hierarchical structure of the corpus. With 4 files per sentence, and 10 sentences for each of 500 speakers, there are 20,000 files. These are organized into a tree structure, shown schematically in Figure 11-2. At the top level there is a split between training and testing sets, which gives away its intended use for developing and evaluating statistical models.

Figure 11-2. Structure of the published TIMIT Corpus: The CD-ROM contains doc, train, and test directories at the top level; the train and test directories both have eight sub-directories, one per dialect region; each of these contains further subdirectories, one per speaker; the contents of the directory for female speaker aks0 are listed, showing 10 wav files accompanied by a text transcription, a word-aligned transcription, and a phonetic transcription.

Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus. Therefore, many of the computational methods described in this book are applicable. Moreover, notice that all of the data types included in the TIMIT Corpus fall into the two basic categories of lexicon and text, which we will discuss later. Even the speaker demographics data is just another instance of the lexicon data type. This last observation is less surprising when we consider that text and record structures are the primary domains for the two subfields of computer science that focus on data management, namely text retrieval and databases. A notable feature of linguistic data management is that it usually brings both data types together, and that it can draw on results and techniques from both fields.

Fundamental Data Types

Despite its complexity, the TIMIT Corpus contains only two fundamental data types, namely lexicons and texts. As we saw in Chapter 2, most lexical resources can be represented using a record structure, i.e., a key plus one or more fields, as shown in Figure 11-3. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated. It could also be a phrasal lexicon, where the key field is a phrase rather than a single word. A thesaurus also consists of record-structured data, where we look up entries via non-key fields that correspond to topics. We can also construct special tabulations (known as paradigms) to illustrate contrasts and systematic variation, as shown in Figure 11-3 for three verbs. TIMIT's speaker table is also a kind of lexicon.

Figure 11-3. Basic linguistic data types—lexicons and texts: Amid their diversity, lexicons have a record structure, whereas annotated texts have a temporal organization.

At the most abstract level, a text is a representation of a real or fictional speech event, and the time-course of that event carries over into the text itself. A text could be a small unit, such as a word or sentence, or a complete narrative or dialogue. It may come with annotations such as part-of-speech tags, morphological analysis, discourse structure, and so forth.
As we saw in the IOB tagging technique (Chapter 7), it is possible to represent higher-level constituents using tags on individual words. Thus the abstraction of text shown in Figure 11-3 is sufficient.

Despite the complexities and idiosyncrasies of individual corpora, at base they are collections of texts together with record-structured data. The contents of a corpus are often biased toward one or the other of these types. For example, the Brown Corpus contains 500 text files, but we still use a table to relate the files to 15 different genres. At the other end of the spectrum, WordNet contains 117,659 synset records, yet it incorporates many example sentences (mini-texts) to illustrate word usages. TIMIT is an interesting midpoint on this spectrum, containing substantial free-standing material of both the text and lexicon types.

11.2 The Life Cycle of a Corpus

Corpora are not born fully formed, but involve careful preparation and input from many people over an extended period. Raw data needs to be collected, cleaned up, documented, and stored in a systematic structure. Various layers of annotation might be applied, some requiring specialized knowledge of the morphology or syntax of the language. Success at this stage depends on creating an efficient workflow involving appropriate tools and format converters. Quality control procedures can be put in place to find inconsistencies in the annotations, and to ensure the highest possible level of inter-annotator agreement. Because of the scale and complexity of the task, large corpora may take years to prepare, and involve tens or hundreds of person-years of effort. In this section, we briefly review the various stages in the life cycle of a corpus.

Three Corpus Creation Scenarios

In one type of corpus, the design unfolds in the course of the creator's explorations. This is the pattern typical of traditional "field linguistics," in which material from elicitation sessions is analyzed as it is gathered, with tomorrow's elicitation often based on questions that arise in analyzing today's. The resulting corpus is then used during subsequent years of research, and may serve as an archival resource indefinitely. Computerization is an obvious boon to work of this type, as exemplified by the popular program Shoebox, now over two decades old and re-released as Toolbox (see Section 2.4). Other software tools, even simple word processors and spreadsheets, are routinely used to acquire the data. In the next section, we will look at how to extract data from these sources.

Another corpus creation scenario is typical of experimental research where a body of carefully designed material is collected from a range of human subjects, then analyzed to evaluate a hypothesis or develop a technology. It has become common for such databases to be shared and reused within a laboratory or company, and often to be published more widely. Corpora of this type are the basis of the "common task" method of research management, which over the past two decades has become the norm in government-funded research programs in language technology. We have already encountered many such corpora in the earlier chapters; we will see how to write Python programs to implement the kinds of curation tasks that are necessary before such corpora are published.
Finally, there are efforts to gather a "reference corpus" for a particular language, such as the American National Corpus (ANC) and the British National Corpus (BNC). Here the goal has been to produce a comprehensive record of the many forms, styles, and uses of a language. Apart from the sheer challenge of scale, there is a heavy reliance on automatic annotation tools together with post-editing to fix any errors. However, we can write programs to locate and repair the errors, and also to analyze the corpus for balance.

Quality Control

Good tools for automatic and manual preparation of data are essential. However, the creation of a high-quality corpus depends just as much on such mundane things as documentation, training, and workflow. Annotation guidelines define the task and document the markup conventions. They may be regularly updated to cover difficult cases, along with new rules that are devised to achieve more consistent annotations. Annotators need to be trained in the procedures, including methods for resolving cases not covered in the guidelines. A workflow needs to be established, possibly with supporting software, to keep track of which files have been initialized, annotated, validated, manually checked, and so on. There may be multiple layers of annotation, provided by different specialists. Cases of uncertainty or disagreement may require adjudication.

Large annotation tasks require multiple annotators, which raises the problem of achieving consistency. How consistently can a group of annotators perform? We can easily measure consistency by having a portion of the source material independently annotated by two people. This may reveal shortcomings in the guidelines or differing abilities with the annotation task. In cases where quality is paramount, the entire corpus can be annotated twice, and any inconsistencies adjudicated by an expert.

It is considered best practice to report the inter-annotator agreement that was achieved for a corpus (e.g., by double-annotating 10% of the corpus). This score serves as a helpful upper bound on the expected performance of any automatic system that is trained on this corpus.

Caution!
Care should be exercised when interpreting an inter-annotator agreement score, since annotation tasks vary greatly in their difficulty. For example, 90% agreement would be a terrible score for part-of-speech tagging, but an exceptional score for semantic role labeling.

The Kappa coefficient κ measures agreement between two people making category judgments, correcting for expected chance agreement. For example, suppose an item is to be annotated, and four coding options are equally likely. In this case, two people coding randomly would be expected to agree 25% of the time. Thus, an agreement of 25% will be assigned κ = 0, and better levels of agreement will be scaled accordingly. For an agreement of 50%, we would get κ = 0.333, as 50 is a third of the way from 25 to 100. Many other agreement measures exist; see help(nltk.metrics.agreement) for details.

We can also measure the agreement between two independent segmentations of language input, e.g., for tokenization, sentence segmentation, and named entity recognition. In Figure 11-4 we see three possible segmentations of a sequence of items which might have been produced by annotators (or programs). Although none of them agree exactly, S1 and S2 are in close agreement, and we would like a suitable measure.
Figure 11-4. Three segmentations of a sequence: The small rectangles represent characters, words, sentences, in short, any sequence which might be divided into linguistic units; S1 and S2 are in close agreement, but both differ significantly from S3.

Windowdiff is a simple algorithm for evaluating the agreement of two segmentations by running a sliding window over the data and awarding partial credit for near misses. If we preprocess our tokens into a sequence of zeros and ones, to record when a token is followed by a boundary, we can represent the segmentations as strings and apply the windowdiff scorer.

>>> s1 = "00000010000000001000000"
>>> s2 = "00000001000000010000000"
>>> s3 = "00010000000000000001000"
>>> nltk.windowdiff(s1, s1, 3)
0
>>> nltk.windowdiff(s1, s2, 3)
4
>>> nltk.windowdiff(s2, s3, 3)
16

In this example, the window had a size of 3. The windowdiff computation slides this window across a pair of strings. At each position it totals up the number of boundaries found inside this window, for both strings, then computes the difference. These differences are then summed. We can increase or shrink the window size to control the sensitivity of the measure.

Curation Versus Evolution

As large corpora are published, researchers are increasingly likely to base their investigations on balanced, focused subsets that were derived from corpora produced for entirely different reasons. For instance, the Switchboard database, originally collected for speaker identification research, has since been used as the basis for published studies in speech recognition, word pronunciation, disfluency, syntax, intonation, and discourse structure. The motivations for recycling linguistic corpora include the desire to save time and effort, the desire to work on material available to others for replication, and sometimes a desire to study more naturalistic forms of linguistic behavior than would be possible otherwise. The process of choosing a subset for such a study may count as a non-trivial contribution in itself.

In addition to selecting an appropriate subset of a corpus, this new work could involve reformatting a text file (e.g., converting to XML), renaming files, retokenizing the text, selecting a subset of the data to enrich, and so forth. Multiple research groups might do this work independently, as illustrated in Figure 11-5. At a later date, should someone want to combine sources of information from different versions, the task will probably be extremely onerous.

Figure 11-5. Evolution of a corpus over time: After a corpus is published, research groups will use it independently, selecting and enriching different pieces; later research that seeks to integrate separate annotations confronts the difficult challenge of aligning the annotations.

The task of using derived corpora is made even more difficult by the lack of any record about how the derived version was created, and which version is the most up-to-date.

An alternative to this chaotic situation is for a corpus to be centrally curated, and for committees of experts to revise and extend it at periodic intervals, considering submissions from third parties and publishing new releases from time to time. Print dictionaries and national corpora may be centrally curated in this way. However, for most corpora this model is simply impractical.

A middle course is for the original corpus publication to have a scheme for identifying any sub-part.
Each sentence, tree, or lexical entry could have a globally unique identifier, and each token, node, or field (respectively) could have a relative offset. Annotations, including segmentations, could reference the source using this identifier scheme (a method which is known as standoff annotation). This way, new annotations could be distributed independently of the source, and multiple independent annotations of the same source could be compared and updated without touching the source. If the corpus publication is provided in multiple versions, the version number or date could be part of the identification scheme. A table of correspondences between identifiers across editions of the corpus would permit any standoff annotations to be updated easily.

Caution!
Sometimes an updated corpus contains revisions of base material that has been externally annotated. Tokens might be split or merged, and constituents may have been rearranged. There may not be a one-to-one correspondence between old and new identifiers. It is better to cause standoff annotations to break on such components of the new version than to silently allow their identifiers to refer to incorrect locations.

11.3 Acquiring Data

Obtaining Data from the Web

The Web is a rich source of data for language analysis purposes. We have already discussed methods for accessing individual files, RSS feeds, and search engine results (see Section 3.1). However, in some cases we want to obtain large quantities of web text.

The simplest approach is to obtain a published corpus of web text. The ACL Special Interest Group on Web as Corpus (SIGWAC) maintains a list of resources at http://www.sigwac.org.uk/. The advantage of using a well-defined web corpus is that it is documented, stable, and permits reproducible experimentation.

If the desired content is localized to a particular website, there are many utilities for capturing all the accessible contents of a site, such as GNU Wget (http://www.gnu.org/software/wget/). For maximal flexibility and control, a web crawler can be used, such as Heritrix (http://crawler.archive.org/). Crawlers permit fine-grained control over where to look, which links to follow, and how to organize the results. For example, if we want to compile a bilingual text collection having corresponding pairs of documents in each language, the crawler needs to detect the structure of the site in order to extract the correspondence between the documents, and it needs to organize the downloaded pages in such a way that the correspondence is captured. It might be tempting to write your own web crawler, but there are dozens of pitfalls having to do with detecting MIME types, converting relative to absolute URLs, avoiding getting trapped in cyclic link structures, dealing with network latencies, avoiding overloading the site or being banned from accessing the site, and so on.

Obtaining Data from Word Processor Files

Word processing software is often used in the manual preparation of texts and lexicons in projects that have limited computational infrastructure. Such projects often provide templates for data entry, though the word processing software does not ensure that the data is correctly structured. For example, each text may be required to have a title and date. Similarly, each lexical entry may have certain obligatory fields. As the data grows in size and complexity, a larger proportion of time may be spent maintaining its consistency.
How can we extract the content of such files so that we can manipulate it in external programs? Moreover, how can we validate the content of these files to help authors create well-structured data, so that the quality of the data can be maximized in the context of the original authoring process?

Consider a dictionary in which each entry has a part-of-speech field, drawn from a set of 20 possibilities, displayed after the pronunciation field, and rendered in 11-point bold type. No conventional word processor has search or macro functions capable of verifying that all part-of-speech fields have been correctly entered and displayed. This task requires exhaustive manual checking. If the word processor permits the document to be saved in a non-proprietary format, such as text, HTML, or XML, we can sometimes write programs to do this checking automatically.

Consider the following fragment of a lexical entry: "sleep [sli:p] v.i. condition of body and mind...". We can key in such text using MSWord, then "Save as Web Page," then inspect the resulting HTML file:

<p class=MsoNormal>sleep
  <span style='mso-spacerun:yes'> </span>
  [<span class=SpellE>sli:p</span>]
  <span style='mso-spacerun:yes'> </span>
  <b><span style='font-size:11.0pt'>v.i.</span></b>
  <span style='mso-spacerun:yes'> </span>
  <i>a condition of body and mind ...<o:p></o:p></i>
</p>
Observe that the entry is represented as an HTML paragraph, using the <p> element, and that the part of speech appears inside a <span style='font-size:11.0pt'> element. The following program defines the set of legal parts-of-speech, legal_pos. Then it extracts all 11-point content from the dict.htm file and stores it in the set used_pos. Observe that the search pattern contains a parenthesized sub-expression; only the material that matches this subexpression is returned by re.findall. Finally, the program constructs the set of illegal parts-of-speech as the set difference between used_pos and legal_pos:

>>> legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
>>> pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
>>> document = open("dict.htm").read()
>>> used_pos = set(re.findall(pattern, document))
>>> illegal_pos = used_pos.difference(legal_pos)
>>> print list(illegal_pos)
['v.i', 'intrans']

This simple program represents the tip of the iceberg. We can develop sophisticated tools to check the consistency of word processor files, and report errors so that the maintainer of the dictionary can correct the original file using the original word processor.

Once we know the data is correctly formatted, we can write other programs to convert the data into a different format. The program in Example 11-1 strips out the HTML markup using nltk.clean_html(), extracts the words and their pronunciations, and generates output in "comma-separated value" (CSV) format.

Example 11-1. Converting HTML created by Microsoft Word into comma-separated values.

def lexical_data(html_file):
    SEP = '_ENTRY'
    html = open(html_file).read()
    html = re.sub(r'<p', SEP + '<p', html)   # mark the start of each entry
    text = nltk.clean_html(html)             # strip out the HTML markup
    text = ' '.join(text.split())            # normalize whitespace
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)

>>> import csv
>>> writer = csv.writer(open("dict1.csv", "wb"))
>>> writer.writerows(lexical_data("dict.htm"))

Obtaining Data from Spreadsheets and Databases

Spreadsheets are often used for acquiring wordlists or paradigms. For example, a comparative wordlist may be created using a spreadsheet, with a row for each cognate set and a column for each language (see nltk.corpus.swadesh and www.rosettaproject.org). Most spreadsheet software can export their data in CSV format. As we will see later, it is easy for Python programs to access these using the csv module.

Sometimes lexicons are stored in a full-fledged relational database. When properly normalized, these databases can ensure the validity of the data. For example, we can require that all parts-of-speech come from a specified vocabulary by declaring that the part-of-speech field is an enumerated type or a foreign key that references a separate part-of-speech table. However, the relational model requires the structure of the data (the schema) be declared in advance, and this runs counter to the dominant approach to structuring linguistic data, which is highly exploratory. Fields which were assumed to be obligatory and unique often turn out to be optional and repeatable. A relational database can accommodate this when it is fully known in advance; however, if it is not, or if just about every property turns out to be optional or repeatable, the relational approach is unworkable.

Nevertheless, when our goal is simply to extract the contents from a database, it is enough to dump out the tables (or SQL query results) in CSV format and load them into our program. Our program might perform a linguistically motivated query that cannot easily be expressed in SQL, e.g., select all words that appear in example sentences for which no dictionary entry is provided.
For this task, we would need to extract enough information from a record for it to be uniquely identified, along with the headwords and example sentences. Let's suppose this information was now available in a CSV file dict.csv:

"sleep","sli:p","v.i","a condition of body and mind ..."
"walk","wo:k","v.intr","progress by lifting and setting down each foot ..."
"wake","weik","intrans","cease to sleep"

Now we can express this query as shown here:

>>> import csv
>>> lexicon = csv.reader(open('dict.csv'))
>>> pairs = [(lexeme, defn) for (lexeme, _, _, defn) in lexicon]
>>> lexemes, defns = zip(*pairs)
>>> defn_words = set(w for defn in defns for w in defn.split())
>>> sorted(defn_words.difference(lexemes))
['...', 'a', 'and', 'body', 'by', 'cease', 'condition', 'down', 'each',
'foot', 'lifting', 'mind', 'of', 'progress', 'setting', 'to']

This information would then guide the ongoing work to enrich the lexicon, work that updates the content of the relational database.

Converting Data Formats

Annotated linguistic data rarely arrives in the most convenient format, and it is often necessary to perform various kinds of format conversion. Converting between character encodings has already been discussed (see Section 3.3). Here we focus on the structure of the data.

In the simplest case, the input and output formats are isomorphic. For instance, we might be converting lexical data from Toolbox format to XML, and it is straightforward to transliterate the entries one at a time (Section 11.4). The structure of the data is reflected in the structure of the required program: a for loop whose body takes care of a single entry.

In another common case, the output is a digested form of the input, such as an inverted file index. Here it is necessary to build an index structure in memory (see Example 4.8), then write it to a file in the desired format. The following example constructs an index that maps the words of a dictionary definition to the corresponding lexeme for each lexical entry, having tokenized the definition text, and discarded short words. Once the index has been constructed, we open a file and then iterate over the index entries, to write out the lines in the required format.

>>> idx = nltk.Index((defn_word, lexeme)
...                  for (lexeme, defn) in pairs
...                  for defn_word in nltk.word_tokenize(defn)
...                  if len(defn_word) > 3)
>>> idx_file = open("dict.idx", "w")
>>> for word in sorted(idx):
...     idx_words = ', '.join(idx[word])
...     idx_line = "%s: %s\n" % (word, idx_words)
...     idx_file.write(idx_line)
>>> idx_file.close()

The resulting file dict.idx contains the following lines. (With a larger dictionary, we would expect to find multiple lexemes listed for each index entry.)

body: sleep
cease: wake
condition: sleep
down: walk
each: walk
foot: walk
lifting: walk
mind: sleep
progress: walk
setting: walk
sleep: wake

In some cases, the input and output data both consist of two or more dimensions. For instance, the input might be a set of files, each containing a single column of word frequency data. The required output might be a two-dimensional table in which the original columns appear as rows. In such cases we populate an internal data structure by filling up one column at a time, then read off the data one row at a time as we write data to the output file.

In the most vexing cases, the source and target formats have slightly different coverage of the domain, and information is unavoidably lost when translating between them.
For example, we could combine multiple Toolbox files to create a single CSV file containing a comparative wordlist, losing all but the \lx field of the input files. If the CSV file was later modified, it would be a labor-intensive process to inject the changes into the original Toolbox files. A partial solution to this "round-tripping" problem is to associate explicit identifiers with each linguistic object, and to propagate the identifiers with the objects.

Deciding Which Layers of Annotation to Include

Published corpora vary greatly in the richness of the information they contain. At a minimum, a corpus will typically contain at least a sequence of sound or orthographic symbols. At the other end of the spectrum, a corpus could contain a large amount of information about the syntactic structure, morphology, prosody, and semantic content of every sentence, plus annotation of discourse relations or dialogue acts. These extra layers of annotation may be just what someone needs for performing a particular data analysis task. For example, it may be much easier to find a given linguistic pattern if we can search for specific syntactic structures; and it may be easier to categorize a linguistic pattern if every word has been tagged with its sense. Here are some commonly provided annotation layers:

Word tokenization
    The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.

Sentence segmentation
    As we saw in Chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.

Paragraph segmentation
    Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.

Part-of-speech
    The syntactic category of each word in a document.

Syntactic structure
    A tree structure showing the constituent structure of a sentence.

Shallow semantics
    Named entity and coreference annotations, and semantic role labels.

Dialogue and discourse
    Dialogue act tags and rhetorical structure.

Unfortunately, there is not much consistency between existing corpora in how they represent their annotations. However, two general classes of annotation representation should be distinguished. Inline annotation modifies the original document by inserting special symbols or control sequences that carry the annotated information. For example, when part-of-speech tagging a document, the string "fly" might be replaced with the string "fly/NN", to indicate that the word fly is a noun in this context. In contrast, standoff annotation does not modify the original document, but instead creates a new file that adds annotation information using pointers that reference the original document. For example, this new document might contain the string "<token id=8 pos='NN'/>", to indicate that token 8 is a noun.

Standards and Tools

For a corpus to be widely useful, it needs to be available in a widely supported format. However, the cutting edge of NLP research depends on new kinds of annotations, which by definition are not widely supported. In general, adequate tools for creation, publication, and use of linguistic data are not widely available. Most projects must develop their own set of tools for internal use, which is no help to others who lack the necessary resources.
Furthermore, we do not have adequate, generally accepted stand- ards for expressing the structure and content of corpora. Without such standards, gen- eral-purpose tools are impossible—though at the same time, without available tools, adequate standards are unlikely to be developed, used, and accepted. One response to this situation has been to forge ahead with developing a generic format that is sufficiently expressive to capture a wide variety of annotation types (see Sec- tion 11.8 for examples). The challenge for NLP is to write programs that cope with the generality of such formats. For example, if the programming task involves tree data, and the file format permits arbitrary directed graphs, then input data must be validated to check for tree properties such as rootedness, connectedness, and acyclicity. If the input files contain other layers of annotation, the program would need to know how to ignore them when the data was loaded, but not invalidate or obliterate those layers when the tree data was saved back to the file. 11.3 Acquiring Data | 421 Another response has been to write one-off scripts to manipulate corpus formats; such scripts litter the filespaces of many NLP researchers. NLTK’s corpus readers are a more systematic approach, founded on the premise that the work of parsing a corpus format should be done only once (per programming language). Instead of focusing on a common format, we believe it is more promising to develop a common interface (see nltk.corpus). Consider the case of treebanks, an important corpus type for work in NLP. There are many ways to store a phrase structure tree in a file. We can use nested parentheses, or nested XML elements, or a dependency no- tation with a (child-id, parent-id) pair on each line, or an XML version of the dependency notation, etc. However, in each case the logical structure is almost the same. It is much easier to devise a common interface that allows application programmers to write code to access tree data using methods such as children(), leaves(), depth(), and so forth. Note that this approach follows accepted practice within computer science, viz. ab- stract data types, object-oriented design, and the three-layer architecture (Fig- ure 11-6). The last of these—from the world of relational databases—allows end-user applications to use a common model (the “relational model”) and a common language (SQL) to abstract away from the idiosyncrasies of file storage. It also allows innovations in filesystem technologies to occur without disturbing end-user applications. In the same way, a common corpus interface insulates application programs from data formats. Figure 11-6. A common format versus a common interface. In this context, when creating a new corpus for dissemination, it is expedient to use a widely used format wherever possible. When this is not possible, the corpus could be accompanied with software—such as an nltk.corpus module—that supports existing interface methods. Special Considerations When Working with Endangered Languages The importance of language to science and the arts is matched in significance by the cultural treasure embodied in language. Each of the world’s ~7,000 human languages 422 | Chapter 11: Managing Linguistic Data is rich in unique respects, in its oral histories and creation legends, down to its gram- matical constructions and its very words and their nuances of meaning. 
Threatened remnant cultures have words to distinguish plant subspecies according to therapeutic uses that are unknown to science. Languages evolve over time as they come into contact with each other, and each one provides a unique window onto human pre-history. In many parts of the world, small linguistic variations from one town to the next add up to a completely different language in the space of a half-hour drive. For its breathtaking complexity and diversity, human language is as a colorful tapestry stretching through time and space. However, most of the world’s languages face extinction. In response to this, many linguists are hard at work documenting the languages, constructing rich records of this important facet of the world’s linguistic heritage. What can the field of NLP offer to help with this effort? Developing taggers, parsers, named entity recognizers, etc., is not an early priority, and there is usually insufficient data for developing such tools in any case. Instead, the most frequently voiced need is to have better tools for collecting and curating data, with a focus on texts and lexicons. On the face of things, it should be a straightforward matter to start collecting texts in an endangered language. Even if we ignore vexed issues such as who owns the texts, and sensitivities surrounding cultural knowledge contained in the texts, there is the obvious practical issue of transcription. Most languages lack a standard orthography. When a language has no literary tradition, the conventions of spelling and punctuation are not well established. Therefore it is common practice to create a lexicon in tandem with a text collection, continually updating the lexicon as new words appear in the texts. This work could be done using a text processor (for the texts) and a spreadsheet (for the lexicon). Better still, SIL’s free linguistic software Toolbox and Fieldworks provide sophisticated support for integrated creation of texts and lexicons. When speakers of the language in question are trained to enter texts themselves, a common obstacle is an overriding concern for correct spelling. Having a lexicon greatly helps this process, but we need to have lookup methods that do not assume someone can determine the citation form of an arbitrary word. The problem may be acute for languages having a complex morphology that includes prefixes. In such cases it helps to tag lexical items with semantic domains, and to permit lookup by semantic domain or by gloss. Permitting lookup by pronunciation similarity is also a big help. Here’s a simple dem- onstration of how to do this. The first step is to identify confusible letter sequences, and map complex versions to simpler versions. We might also notice that the relative order of letters within a cluster of consonants is a source of spelling errors, and so we normalize the order of consonants. 11.3 Acquiring Data | 423 >>> mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'), ... ('[aeiou]+', 'a'), (r'(.)\1', r'\1')] >>> def signature(word): ... for patt, repl in mappings: ... word = re.sub(patt, repl, word) ... pieces = re.findall('[^aeiou]+', word) ... return ''.join(char for piece in pieces for char in sorted(piece))[:8] >>> signature('illefent') 'lfnt' >>> signature('ebsekwieous') 'bskws' >>> signature('nuculerr') 'nclr' Next, we create a mapping from signatures to words, for all the words in our lexicon. We can use this to get candidate corrections for a given input word (but we must first compute that word’s signature). 
>>> signatures = nltk.Index((signature(w), w) for w in nltk.corpus.words.words())
>>> signatures[signature('nuculerr')]
['anicular', 'inocular', 'nucellar', 'nuclear', 'unicolor', 'uniocular', 'unocular']

Finally, we should rank the results in terms of similarity with the original word. This is done by the function rank(). The only remaining function provides a simple interface to the user:

>>> def rank(word, wordlist):
...     ranked = sorted((nltk.edit_distance(word, w), w) for w in wordlist)
...     return [word for (_, word) in ranked]
>>> def fuzzy_spell(word):
...     sig = signature(word)
...     if sig in signatures:
...         return rank(word, signatures[sig])
...     else:
...         return []
>>> fuzzy_spell('illefent')
['olefiant', 'elephant', 'oliphant', 'elephanta']
>>> fuzzy_spell('ebsekwieous')
['obsequious']
>>> fuzzy_spell('nucular')
['nuclear', 'nucellar', 'anicular', 'inocular', 'unocular', 'unicolor', 'uniocular']

This is just one illustration where a simple program can facilitate access to lexical data in a context where the writing system of a language may not be standardized, or where users of the language may not have a good command of spellings. Other simple applications of NLP in this area include building indexes to facilitate access to data, gleaning wordlists from texts, locating examples of word usage in constructing a lexicon, detecting prevalent or exceptional patterns in poorly understood data, and performing specialized validation on data created using various linguistic software tools. We will return to the last of these in Section 11.5.

11.4 Working with XML

The Extensible Markup Language (XML) provides a framework for designing domain-specific markup languages. It is sometimes used for representing annotated text and for lexical resources. Unlike HTML with its predefined tags, XML permits us to make up our own tags. Unlike a database, XML permits us to create data without first specifying its structure, and it permits us to have optional and repeatable elements. In this section, we briefly review some features of XML that are relevant for representing linguistic data, and show how to access data stored in XML files using Python programs.

Using XML for Linguistic Structures

Thanks to its flexibility and extensibility, XML is a natural choice for representing linguistic structures. Here’s an example of a simple lexical entry. (We have made up the element names; any well-formed names would do.)

(2)  <entry>
       <headword>whale</headword>
       <pos>noun</pos>
       <gloss>any of the larger cetacean mammals having a streamlined
         body and breathing through a blowhole on the head</gloss>
     </entry>

It consists of a series of XML tags enclosed in angle brackets. Each opening tag, such as <gloss>, is matched with a closing tag, </gloss>; together they constitute an XML element. The preceding example has been laid out nicely using whitespace, but it could equally have been put on a single long line. Our approach to processing XML will usually not be sensitive to whitespace. In order for XML to be well formed, all opening tags must have corresponding closing tags, at the same level of nesting (i.e., the XML document must be a well-formed tree). XML permits us to repeat elements, e.g., to add another gloss field, as we see next. We will use different whitespace to underscore the point that layout does not matter.

(3)  <entry><headword>whale</headword><pos>noun</pos><gloss>any of the larger
     cetacean mammals having a streamlined body and breathing through a blowhole
     on the head</gloss><gloss>a very large person; impressive in size or
     qualities</gloss></entry>

A further step might be to link our lexicon to some external resource, such as WordNet, using external identifiers.
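Identifiers such as whale.n.02 in the examples that follow name WordNet synsets, so they can be resolved programmatically. Here is a minimal sketch using NLTK’s WordNet interface; the two identifiers are simply the ones that appear in (4) below:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('whale.n.02')     # the cetacean sense used in (4)
Synset('whale.n.02')
>>> wn.synset('giant.n.04')     # the "very large person" sense used in (4)
Synset('giant.n.04')

Storing only these names in the lexicon keeps the WordNet data itself out of our file, while still letting programs recover the full synset on demand.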
In (4) we group the gloss and a synset identifier inside a new element, which we have called “sense.”

(4)  <entry>
       <headword>whale</headword>
       <pos>noun</pos>
       <sense>
         <gloss>any of the larger cetacean mammals having a streamlined
           body and breathing through a blowhole on the head</gloss>
         <synset>whale.n.02</synset>
       </sense>
       <sense>
         <gloss>a very large person; impressive in size or qualities</gloss>
         <synset>giant.n.04</synset>
       </sense>
     </entry>

Alternatively, we could have represented the synset identifier using an XML attribute, without the need for any nested structure, as in (5).

(5)  <entry>
       <headword>whale</headword>
       <pos>noun</pos>
       <gloss synset="whale.n.02">any of the larger cetacean mammals having
         a streamlined body and breathing through a blowhole on the head</gloss>
       <gloss synset="giant.n.04">a very large person; impressive in size
         or qualities</gloss>
     </entry>

This illustrates some of the flexibility of XML. If it seems somewhat arbitrary, that’s because it is! Following the rules of XML, we can invent new attribute names, and nest them as deeply as we like. We can repeat elements, leave them out, and put them in a different order each time. We can have fields whose presence depends on the value of some other field; e.g., if the part of speech is verb, then the entry can have a past_tense element to hold the past tense of the verb, but if the part of speech is noun, no past_tense element is permitted. To impose some order over all this freedom, we can constrain the structure of an XML file using a “schema,” which is a declaration akin to a context-free grammar. Tools exist for testing the validity of an XML file with respect to a schema.

The Role of XML

We can use XML to represent many kinds of linguistic information. However, the flexibility comes at a price. Each time we introduce a complication, such as by permitting an element to be optional or repeated, we make more work for any program that accesses the data. We also make it more difficult to check the validity of the data, or to interrogate the data using one of the XML query languages.

Thus, using XML to represent linguistic structures does not magically solve the data modeling problem. We still have to work out how to structure the data, then define that structure with a schema, and then write programs to read and write the format and convert it to other formats. Similarly, we still need to follow some standard principles concerning data normalization. It is wise to avoid making duplicate copies of the same information, so that we don’t end up with inconsistent data when only one copy is changed. For example, a cross-reference that was represented as <xref>headword</xref> would duplicate the storage of the headword of some other lexical entry, and the link would break if the copy of the string at the other location was modified. Existential dependencies between information types need to be modeled, so that we can’t create elements without a home. For example, if sense definitions cannot exist independently of a lexical entry, the sense element can be nested inside the entry element. Many-to-many relations need to be abstracted out of hierarchical structures. For example, if a word can have many corresponding senses, and a sense can have several corresponding words, then both words and senses must be enumerated separately, as must the list of (word, sense) pairings. This complex structure might even be split across three separate XML files. As we can see, although XML provides us with a convenient format accompanied by an extensive collection of tools, it offers no panacea.

The ElementTree Interface

Python’s ElementTree module provides a convenient way to access data stored in XML files.
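As a quick taste of that convenience before the fuller example below, here is a minimal sketch that parses an entry laid out like (5), using the made-up element names from above and the standard-library fromstring() helper for XML held in a string:

>>> from xml.etree.ElementTree import fromstring
>>> entry = fromstring('''<entry><headword>whale</headword><pos>noun</pos>
... <gloss synset="whale.n.02">any of the larger cetacean mammals</gloss>
... <gloss synset="giant.n.04">a very large person</gloss></entry>''')
>>> entry.find('headword').text
'whale'
>>> [(gloss.get('synset'), gloss.text) for gloss in entry.findall('gloss')]
[('whale.n.02', 'any of the larger cetacean mammals'), ('giant.n.04', 'a very large person')]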
ElementTree is part of Python’s standard library (since Python 2.5), and is also provided as part of NLTK in case you are using Python 2.4. We will illustrate the use of ElementTree using a collection of Shakespeare plays that have been formatted using XML. Let’s load the XML file and inspect the raw data, first at the top of the file , where we see some XML headers and the name of a schema called play.dtd, followed by the root element PLAY. We pick it up again at the start of Act 1 . (Some blank lines have been omitted from the output.) >>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml') >>> raw = open(merchant_file).read() >>> print raw[0:168] The Merchant of Venice >>> print raw[1850:2075] ACT I SCENE I. Venice. A street. Enter ANTONIO, SALARINO, and SALANIO ANTONIO In sooth, I know not why I am so sad: We have just accessed the XML data as a string. As we can see, the string at the start of Act 1 contains XML tags for title, scene, stage directions, and so forth. The next step is to process the file contents as structured XML data, using Element Tree. We are processing a file (a multiline string) and building a tree, so it’s not sur- prising that the method name is parse . The variable merchant contains an XML ele- ment PLAY . This element has internal structure; we can use an index to get its first child, a TITLE element . We can also see the text content of this element, the title of the play . To get a list of all the child elements, we use the getchildren() method . >>> from nltk.etree.ElementTree import ElementTree >>> merchant = ElementTree().parse(merchant_file) >>> merchant 11.4 Working with XML | 427 >>> merchant[0] >>> merchant[0].text 'The Merchant of Venice' >>> merchant.getchildren() [, , , , , , , , ] The play consists of a title, the personae, a scene description, a subtitle, and five acts. Each act has a title and some scenes, and each scene consists of speeches which are made up of lines, a structure with four levels of nesting. Let’s dig down into Act IV: >>> merchant[-2][0].text 'ACT IV' >>> merchant[-2][1] >>> merchant[-2][1][0].text 'SCENE I. Venice. A court of justice.' >>> merchant[-2][1][54] >>> merchant[-2][1][54][0] >>> merchant[-2][1][54][0].text 'PORTIA' >>> merchant[-2][1][54][1] >>> merchant[-2][1][54][1].text "The quality of mercy is not strain'd," Your Turn: Repeat some of the methods just shown, for one of the other Shakespeare plays included in the corpus, such as Romeo and Ju- liet or Macbeth. For a list, see nltk.corpus.shakespeare.fileids(). Although we can access the entire tree this way, it is more convenient to search for sub- elements with particular names. Recall that the elements at the top level have several types. We can iterate over just the types we are interested in (such as the acts), using merchant.findall('ACT'). Here’s an example of doing such tag-specific searches at ev- ery level of nesting: >>> for i, act in enumerate(merchant.findall('ACT')): ... for j, scene in enumerate(act.findall('SCENE')): ... for k, speech in enumerate(scene.findall('SPEECH')): ... for line in speech.findall('LINE'): ... if 'music' in str(line.text): ... print "Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text) Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice; Act 3 Scene 2 Speech 9: Fading in music: that the comparison Act 3 Scene 2 Speech 9: And what is music then? Then music is Act 5 Scene 1 Speech 23: And bring your music forth into the air. 
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music 428 | Chapter 11: Managing Linguistic Data Act 5 Scene 1 Speech 23: And draw her home with music. Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music. Act 5 Scene 1 Speech 25: Or any air of music touch their ears, Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet Act 5 Scene 1 Speech 25: But music for the time doth change his nature. Act 5 Scene 1 Speech 25: The man that hath no music in himself, Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music. Act 5 Scene 1 Speech 29: It is your music, madam, of the house. Act 5 Scene 1 Speech 32: No better a musician than the wren. Instead of navigating each step of the way down the hierarchy, we can search for par- ticular embedded elements. For example, let’s examine the sequence of speakers. We can use a frequency distribution to see who has the most to say: >>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')] >>> speaker_freq = nltk.FreqDist(speaker_seq) >>> top5 = speaker_freq.keys()[:5] >>> top5 ['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO', 'ANTONIO'] We can also look for patterns in who follows whom in the dialogues. Since there are 23 speakers, we need to reduce the “vocabulary” to a manageable size first, using the method described in Section 5.3. >>> mapping = nltk.defaultdict(lambda: 'OTH') >>> for s in top5: ... mapping[s] = s[:4] ... >>> speaker_seq2 = [mapping[s] for s in speaker_seq] >>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2)) >>> cfd.tabulate() ANTO BASS GRAT OTH PORT SHYL ANTO 0 11 4 11 9 12 BASS 10 0 11 10 26 16 GRAT 6 8 0 19 9 5 OTH 8 16 18 153 52 25 PORT 7 23 13 53 0 21 SHYL 15 15 2 26 21 0 Ignoring the entry of 153 for exchanges between people other than the top five, the largest values suggest that Othello and Portia have the most significant interactions. Using ElementTree for Accessing Toolbox Data In Section 2.4, we saw a simple interface for accessing Toolbox data, a popular and well-established format used by linguists for managing data. In this section, we discuss a variety of techniques for manipulating Toolbox data in ways that are not supported by the Toolbox software. The methods we discuss could be applied to other record- structured data, regardless of the actual file format. We can use the toolbox.xml() method to access a Toolbox file and load it into an ElementTree object. This file contains a lexicon for the Rotokas language of Papua New Guinea. 11.4 Working with XML | 429 >>> from nltk.corpus import toolbox >>> lexicon = toolbox.xml('rotokas.dic') There are two ways to access the contents of the lexicon object: by indexes and by paths. Indexes use the familiar syntax; thus lexicon[3] returns entry number 3 (which is actually the fourth entry counting from zero) and lexicon[3][0] returns its first field: >>> lexicon[3][0] >>> lexicon[3][0].tag 'lx' >>> lexicon[3][0].text 'kaa' The second way to access the contents of the lexicon object uses paths. The lexicon is a series of record objects, each containing a series of field objects, such as lx and ps. We can conveniently address all of the lexemes using the path record/lx. 
Here we use the findall() function to search for any matches to the path record/lx, and we access the text content of the element, normalizing it to lowercase: >>> [lexeme.text.lower() for lexeme in lexicon.findall('record/lx')] ['kaa', 'kaa', 'kaa', 'kaakaaro', 'kaakaaviko', 'kaakaavo', 'kaakaoko', 'kaakasi', 'kaakau', 'kaakauko', 'kaakito', 'kaakuupato', ..., 'kuvuto'] Let’s view the Toolbox data in XML format. The write() method of ElementTree ex- pects a file object. We usually create one of these using Python’s built-in open() func- tion. In order to see the output displayed on the screen, we can use a special predefined file object called stdout (standard output), defined in Python’s sys module. >>> import sys >>> from nltk.etree.ElementTree import ElementTree >>> tree = ElementTree(lexicon[3]) >>> tree.write(sys.stdout) kaa N MASC isi cooking banana banana bilong kukim itoo FLORA 12/Aug/2005 Taeavi iria kaa isi kovopaueva kaparapasia. Taeavi i bin planim gaden banana bilong kukim tasol long paia. Taeavi planted banana in order to cook it. Formatting Entries We can use the same idea we saw in the previous section to generate HTML tables instead of plain text. This would be useful for publishing a Toolbox lexicon on the Web. It produces HTML elements , (table row), and (table data). 430 | Chapter 11: Managing Linguistic Data >>> html = "\n" >>> for entry in lexicon[70:80]: ... lx = entry.findtext('lx') ... ps = entry.findtext('ps') ... ge = entry.findtext('ge') ... html += " %s | %s | %s | \n" % (lx, ps, ge) >>> html += " " >>> print html kakae | ??? | small | kakae | CLASS | child | kakaevira | ADV | small-like | kakapikoa | ??? | small | kakapikoto | N | newborn baby | kakapu | V | place in sling for purpose of carrying | kakapua | N | sling for lifting | kakara | N | arm band | Kakarapaia | N | village name | kakarau | N | frog | 11.5 Working with Toolbox Data Given the popularity of Toolbox among linguists, we will discuss some further methods for working with Toolbox data. Many of the methods discussed in previous chapters, such as counting, building frequency distributions, and tabulating co-occurrences, can be applied to the content of Toolbox entries. For example, we can trivially compute the average number of fields for each entry: >>> from nltk.corpus import toolbox >>> lexicon = toolbox.xml('rotokas.dic') >>> sum(len(entry) for entry in lexicon) / len(lexicon) 13.635955056179775 In this section, we will discuss two tasks that arise in the context of documentary lin- guistics, neither of which is supported by the Toolbox software. Adding a Field to Each Entry It is often convenient to add new fields that are derived automatically from existing ones. Such fields often facilitate search and analysis. For instance, in Example 11-2 we define a function cv(), which maps a string of consonants and vowels to the corre- sponding CV sequence, e.g., kakapua would map to CVCVCVV. This mapping has four steps. First, the string is converted to lowercase, then we replace any non-alphabetic characters [^a-z] with an underscore. Next, we replace all vowels with V. Finally, any- thing that is not a V or an underscore must be a consonant, so we replace it with a C. Now, we can scan the lexicon and add a new cv field after every lx field. Exam- ple 11-2 shows what this does to a particular entry; note the last line of output, which shows the new cv field. 11.5 Working with Toolbox Data | 431 Example 11-2. Adding a new cv field to a lexical entry. 
from nltk.etree.ElementTree import SubElement def cv(s): s = s.lower() s = re.sub(r'[^a-z]', r'_', s) s = re.sub(r'[aeiou]', r'V', s) s = re.sub(r'[^V_]', r'C', s) return (s) def add_cv_field(entry): for field in entry: if field.tag == 'lx': cv_field = SubElement(entry, 'cv') cv_field.text = cv(field.text) >>> lexicon = toolbox.xml('rotokas.dic') >>> add_cv_field(lexicon[53]) >>> print nltk.to_sfm_string(lexicon[53]) \lx kaeviro \ps V \pt A \ge lift off \ge take off \tkp go antap \sc MOTION \vx 1 \nt used to describe action of plane \dt 03/Jun/2005 \ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu. \xp Pita i go antap na lukim haus win i bagarapim. \xe Peter went to look at the house that the wind destroyed. \cv CVVCVCV If a Toolbox file is being continually updated, the program in Exam- ple 11-2 will need to be run more than once. It would be possible to modify add_cv_field() to modify the contents of an existing entry. However, it is a safer practice to use such programs to create enriched files for the purpose of data analysis, without replacing the manually curated source files. Validating a Toolbox Lexicon Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way. Manually inspecting thousands of lexical entries is not practicable. However, we can easily iden- tify frequent versus exceptional field sequences, with the help of a FreqDist: >>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon) >>> fd.items() [('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41), ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37), 432 | Chapter 11: Managing Linguistic Data ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27), ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20), ..., ('lx:alt:rt:ps:pt:ge:eng:eng:eng:tkp:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe', 1)] After inspecting the high-frequency field sequences, we could devise a context-free grammar for lexical entries. The grammar in Example 11-3 uses the CFG format we saw in Chapter 8. Such a grammar models the implicit nested structure of Toolbox entries, building a tree structure, where the leaves of the tree are individual field names. We iterate over the entries and report their conformance with the grammar, as shown in Example 11-3. Those that are accepted by the grammar are prefixed with a '+' , and those that are rejected are prefixed with a '-' . During the process of developing such a grammar, it helps to filter out some of the tags . Example 11-3. Validating Toolbox entries using a context-free grammar. 
grammar = nltk.parse_cfg(''' S -> Head PS Glosses Comment Date Sem_Field Examples Head -> Lexeme Root Lexeme -> "lx" Root -> "rt" | PS -> "ps" Glosses -> Gloss Glosses | Gloss -> "ge" | "tkp" | "eng" Date -> "dt" Sem_Field -> "sf" Examples -> Example Ex_Pidgin Ex_English Examples | Example -> "ex" Ex_Pidgin -> "xp" Ex_English -> "xe" Comment -> "cmt" | "nt" | ''') def validate_lexicon(grammar, lexicon, ignored_tags): rd_parser = nltk.RecursiveDescentParser(grammar) for entry in lexicon: marker_list = [field.tag for field in entry if field.tag not in ignored_tags] if rd_parser.nbest_parse(marker_list): print "+", ':'.join(marker_list) else: print "-", ':'.join(marker_list) >>> lexicon = toolbox.xml('rotokas.dic')[10:20] >>> ignored_tags = ['arg', 'dcsv', 'pt', 'vx'] >>> validate_lexicon(grammar, lexicon, ignored_tags) - lx:ps:ge:tkp:sf:nt:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe - lx:rt:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe - lx:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe - lx:ps:ge:tkp:nt:sf:dt - lx:ps:ge:tkp:dt:cmt:ex:xp:xe:ex:xp:xe - lx:ps:ge:ge:ge:tkp:cmt:dt:ex:xp:xe - lx:rt:ps:ge:ge:tkp:dt - lx:rt:ps:ge:eng:eng:eng:ge:tkp:tkp:dt:cmt:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe - lx:rt:ps:ge:tkp:dt:ex:xp:xe - lx:ps:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe 11.5 Working with Toolbox Data | 433 Another approach would be to use a chunk parser (Chapter 7), since these are much more effective at identifying partial structures and can report the partial structures that have been identified. In Example 11-4 we set up a chunk grammar for the entries of a lexicon, then parse each entry. A sample of the output from this program is shown in Figure 11-7. Figure 11-7. XML representation of a lexical entry, resulting from chunk parsing a Toolbox record. Example 11-4. Chunking a Toolbox lexicon: A chunk grammar describing the structure of entries for a lexicon for Iu Mien, a language of China. from nltk_contrib import toolbox grammar = r""" lexfunc: {(*)*} example: {*} sense: {***} record: {+} """ >>> from nltk.etree.ElementTree import ElementTree >>> db = toolbox.ToolboxData() >>> db.open(nltk.data.find('corpora/toolbox/iu_mien_samp.db')) >>> lexicon = db.parse(grammar, encoding='utf8') >>> toolbox.data.indent(lexicon) >>> tree = ElementTree(lexicon) >>> output = open("iu_mien_samp.xml", "w") >>> tree.write(output, encoding='utf8') >>> output.close() 434 | Chapter 11: Managing Linguistic Data 11.6 Describing Language Resources Using OLAC Metadata Members of the NLP community have a common need for discovering language re- sources with high precision and recall. The solution which has been developed by the Digital Libraries community involves metadata aggregation. What Is Metadata? The simplest definition of metadata is “structured data about data.” Metadata is de- scriptive information about an object or resource, whether it be physical or electronic. Although the term “metadata” itself is relatively new, the underlying concepts behind metadata have been in use for as long as collections of information have been organized. Library catalogs represent a well-established type of metadata; they have served as col- lection management and resource discovery tools for decades. Metadata can be gen- erated either “by hand” or automatically using software. The Dublin Core Metadata Initiative began in 1995 to develop conventions for finding, sharing, and managing information. 
The Dublin Core metadata elements represent a broad, interdisciplinary consensus about the core set of elements that are likely to be widely useful to support resource discovery. The Dublin Core consists of 15 metadata elements, where each element is optional and repeatable: Title, Creator, Subject, De- scription, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights. This metadata set can be used to describe resources that exist in digital or traditional formats. The Open Archives Initiative (OAI) provides a common framework across digital re- positories of scholarly materials, regardless of their type, including documents, data, software, recordings, physical artifacts, digital surrogates, and so forth. Each repository consists of a network-accessible server offering public access to archived items. Each item has a unique identifier, and is associated with a Dublin Core metadata record (and possibly additional records in other formats). The OAI defines a protocol for metadata search services to “harvest” the contents of repositories. OLAC: Open Language Archives Community The Open Language Archives Community, or OLAC, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practices for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. OLAC’s home on the Web is at http: //www.language-archives.org/. OLAC Metadata is a standard for describing language resources. Uniform description across repositories is ensured by limiting the values of certain metadata elements to the use of terms from controlled vocabularies. OLAC metadata can be used to describe data and tools, in both physical and digital formats. OLAC metadata extends the 11.6 Describing Language Resources Using OLAC Metadata | 435 Dublin Core Metadata Set, a widely accepted standard for describing resources of all types. To this core set, OLAC adds descriptors to cover fundamental properties of language resources, such as subject language and linguistic type. Here’s an example of a complete OLAC record: A grammar of Kayardild. With comparative notes on Tangkic. Evans, Nicholas D. Kayardild grammar Kayardild English Kayardild Grammar (ISBN 3110127954) Berlin - Mouton de Gruyter Nicholas Evans hardcover, 837 pages related to ISBN 0646119966 Australia Text Participating language archives publish their catalogs in an XML format, and these records are regularly “harvested” by OLAC services using the OAI protocol. In addition to this software infrastructure, OLAC has documented a series of best practices for describing language resources, through a process that involved extended consultation with the language resources community (e.g., see http://www.language-archives.org/ REC/bpr.html). OLAC repositories can be searched using a query engine on the OLAC website. 
Search- ing for “German lexicon” finds the following resources, among others: • CALLHOME German Lexicon, at http://www.language-archives.org/item/oai: www.ldc.upenn.edu:LDC97L18 • MULTILEX multilingual lexicon, at http://www.language-archives.org/item/oai:el ra.icp.inpg.fr:M0001 • Slelex Siemens Phonetic lexicon, at http://www.language-archives.org/item/oai:elra .icp.inpg.fr:S0048 Searching for “Korean” finds a newswire corpus, and a treebank, a lexicon, a child- language corpus, and interlinear glossed texts. It also finds software, including a syn- tactic analyzer and a morphological analyzer. Observe that the previous URLs include a substring of the form: oai:www.ldc.upenn.edu:LDC97L18. This is an OAI identifier, using a URI scheme regis- tered with ICANN (the Internet Corporation for Assigned Names and Numbers). These 436 | Chapter 11: Managing Linguistic Data identifiers have the format oai:archive:local_id, where oai is the name of the URI scheme, archive is an archive identifier, such as www.ldc.upenn.edu, and local_id is the resource identifier assigned by the archive, e.g., LDC97L18. Given an OAI identifier for an OLAC resource, it is possible to retrieve the complete XML record for the resource using a URL of the following form: http://www.language- archives.org/static-records/oai:archive:local_id. 11.7 Summary • Fundamental data types, present in most corpora, are annotated texts and lexicons. Texts have a temporal structure, whereas lexicons have a record structure. • The life cycle of a corpus includes data collection, annotation, quality control, and publication. The life cycle continues after publication as the corpus is modified and enriched during the course of research. • Corpus development involves a balance between capturing a representative sample of language usage, and capturing enough material from any one source or genre to be useful; multiplying out the dimensions of variability is usually not feasible be- cause of resource limitations. • XML provides a useful format for the storage and interchange of linguistic data, but provides no shortcuts for solving pervasive data modeling problems. • Toolbox format is widely used in language documentation projects; we can write programs to support the curation of Toolbox files, and to convert them to XML. • The Open Language Archives Community (OLAC) provides an infrastructure for documenting and discovering language resources. 11.8 Further Reading Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. The primary sources of linguistic corpora are the Linguistic Data Consortium and the European Language Resources Agency, both with extensive online catalogs. More de- tails concerning the major corpora mentioned in the chapter are available: American National Corpus (Reppen, Ide & Suderman, 2005), British National Corpus (BNC, 1999), Thesaurus Linguae Graecae (TLG, 1999), Child Language Data Exchange Sys- tem (CHILDES) (MacWhinney, 1995), and TIMIT (Garofolo et al., 1986). Two special interest groups of the Association for Computational Linguistics that or- ganize regular workshops with published proceedings are SIGWAC, which promotes the use of the Web as a corpus and has sponsored the CLEANEVAL task for removing HTML markup, and SIGANN, which is encouraging efforts toward interoperability of 11.8 Further Reading | 437 linguistic annotations. An extended discussion of web crawling is provided by (Croft, Metzler & Strohman, 2009). 
Full details of the Toolbox data format are provided with the distribution (Buseman, Buseman & Early, 1996), and with the latest distribution freely available from http:// www.sil.org/computing/toolbox/. For guidelines on the process of constructing a Tool- box lexicon, see http://www.sil.org/computing/ddp/. More examples of our efforts with the Toolbox are documented in (Bird, 1999) and (Robinson, Aumann & Bird, 2007). Dozens of other tools for linguistic data management are available, some surveyed by (Bird & Simons, 2003). See also the proceedings of the LaTeCH workshops on language technology for cultural heritage data. There are many excellent resources for XML (e.g., http://zvon.org/) and for writing Python programs to work with XML http://www.python.org/doc/lib/markup.html. Many editors have XML modes. XML formats for lexical information include OLIF (http://www.olif.net/) and LIFT (http://code.google.com/p/lift-standard/). For a survey of linguistic annotation software, see the Linguistic Annotation Page at http://www.ldc.upenn.edu/annotation/. The initial proposal for standoff annotation was (Thompson & McKelvie, 1997). An abstract data model for linguistic annotations, called “annotation graphs,” was proposed in (Bird & Liberman, 2001). A general- purpose ontology for linguistic description (GOLD) is documented at http://www.lin guistics-ontology.org/. For guidance on planning and constructing a corpus, see (Meyer, 2002) and (Farghaly, 2003). More details of methods for scoring inter-annotator agreement are available in (Artstein & Poesio, 2008) and (Pevzner & Hearst, 2002). Rotokas data was provided by Stuart Robinson, and Iu Mien data was provided by Greg Aumann. For more information about the Open Language Archives Community, visit http://www .language-archives.org/, or see (Simons & Bird, 2003). 11.9 Exercises 1. ◑ In Example 11-2 the new field appeared at the bottom of the entry. Modify this program so that it inserts the new subelement right after the lx field. (Hint: create the new cv field using Element('cv'), assign a text value to it, then use the insert() method of the parent element.) 2. ◑ Write a function that deletes a specified field from a lexical entry. (We could use this to sanitize our lexical data before giving it to others, e.g., by removing fields containing irrelevant or uncertain content.) 3. ◑ Write a program that scans an HTML dictionary file to find entries having an illegal part-of-speech field, and then reports the headword for each entry. 438 | Chapter 11: Managing Linguistic Data 4. ◑ Write a program to find any parts-of-speech (ps field) that occurred less than 10 times. Perhaps these are typing mistakes? 5. ◑ We saw a method for adding a cv field (Section 11.5). There is an interesting issue with keeping this up-to-date when someone modifies the content of the lx field on which it is based. Write a version of this program to add a cv field, replacing any existing cv field. 6. ◑ Write a function to add a new field syl which gives a count of the number of syllables in the word. 7. ◑ Write a function which displays the complete entry for a lexeme. When the lexeme is incorrectly spelled, it should display the entry for the most similarly spelled lexeme. 8. ◑ Write a function that takes a lexicon and finds which pairs of consecutive fields are most frequent (e.g., ps is often followed by pt). (This might help us to discover some of the structure of a lexical entry.) 9. 
◑ Create a spreadsheet using office software, containing one lexical entry per row, consisting of a headword, a part of speech, and a gloss. Save the spreadsheet in CSV format. Write Python code to read the CSV file and print it in Toolbox format, using lx for the headword, ps for the part of speech, and gl for the gloss. 10. ◑ Index the words of Shakespeare’s plays, with the help of nltk.Index. The result- ing data structure should permit lookup on individual words, such as music, re- turning a list of references to acts, scenes, and speeches, of the form [(3, 2, 9), (5, 1, 23), ...], where (3, 2, 9) indicates Act 3 Scene 2 Speech 9. 11. ◑ Construct a conditional frequency distribution which records the word length for each speech in The Merchant of Venice, conditioned on the name of the char- acter; e.g., cfd['PORTIA'][12] would give us the number of speeches by Portia consisting of 12 words. 12. ◑ Write a recursive function to convert an arbitrary NLTK tree into an XML coun- terpart, with non-terminals represented as XML elements, and leaves represented as text content, e.g.: Pierre Vinken , 13. ● Obtain a comparative wordlist in CSV format, and write a program that prints those cognates having an edit-distance of at least three from each other. 14. ● Build an index of those lexemes which appear in example sentences. Suppose the lexeme for a given entry is w. Then, add a single cross-reference field xrf to this entry, referencing the headwords of other entries having example sentences con- taining w. Do this for all entries and save the result as a Toolbox-format file. 11.9 Exercises | 439 Afterword: The Language Challenge Natural language throws up some interesting computational challenges. We’ve ex- plored many of these in the preceding chapters, including tokenization, tagging, clas- sification, information extraction, and building syntactic and semantic representations. You should now be equipped to work with large datasets, to create robust models of linguistic phenomena, and to extend them into components for practical language technologies. We hope that the Natural Language Toolkit (NLTK) has served to open up the exciting endeavor of practical natural language processing to a broader audience than before. In spite of all that has come before, language presents us with far more than a temporary challenge for computation. Consider the following sentences which attest to the riches of language: 1. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (William Faulkner, As I Lay Dying, 1935) 2. When using the toaster please ensure that the exhaust fan is turned on. (sign in dormitory kitchen) 3. Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activi- ties with Ki values of 45.1-271.6 μM (Medline, PMID: 10718780) 4. Iraqi Head Seeks Arms (spoof news headline) 5. The earnest prayer of a righteous man has great power and wonderful results. (James 5:16b) 6. Twas brillig, and the slithy toves did gyre and gimble in the wabe (Lewis Carroll, Jabberwocky, 1872) 7. There are two ways to do this, AFAIK :smile: (Internet discussion archive) Other evidence for the riches of language is the vast array of disciplines whose work centers on language. Some obvious disciplines include translation, literary criticism, philosophy, anthropology, and psychology. Many less obvious disciplines investigate language use, including law, hermeneutics, forensics, telephony, pedagogy, archaeol- ogy, cryptanalysis, and speech pathology. 
Each applies distinct methodologies to gather 441 observations, develop theories, and test hypotheses. All serve to deepen our under- standing of language and of the intellect that is manifested in language. In view of the complexity of language and the broad range of interest in studying it from different angles, it’s clear that we have barely scratched the surface here. Addi- tionally, within NLP itself, there are many important methods and applications that we haven’t mentioned. In our closing remarks we will take a broader view of NLP, including its foundations and the further directions you might want to explore. Some of the topics are not well supported by NLTK, and you might like to rectify that problem by contributing new software and data to the toolkit. Language Processing Versus Symbol Processing The very notion that natural language could be treated in a computational manner grew out of a research program, dating back to the early 1900s, to reconstruct mathematical reasoning using logic, most clearly manifested in work by Frege, Russell, Wittgenstein, Tarski, Lambek, and Carnap. This work led to the notion of language as a formal system amenable to automatic processing. Three later developments laid the foundation for natural language processing. The first was formal language theory. This defined a language as a set of strings accepted by a class of automata, such as context-free lan- guages and pushdown automata, and provided the underpinnings for computational syntax. The second development was symbolic logic. This provided a formal method for cap- turing selected aspects of natural language that are relevant for expressing logical proofs. A formal calculus in symbolic logic provides the syntax of a language, together with rules of inference and, possibly, rules of interpretation in a set-theoretic model; examples are propositional logic and first-order logic. Given such a calculus, with a well-defined syntax and semantics, it becomes possible to associate meanings with expressions of natural language by translating them into expressions of the formal cal- culus. For example, if we translate John saw Mary into a formula saw(j, m), we (im- plicitly or explicitly) interpret the English verb saw as a binary relation, and John and Mary as denoting individuals. More general statements like All birds fly require quan- tifiers, in this case ∀, meaning for all: ∀x (bird(x) → fly(x)). This use of logic provided the technical machinery to perform inferences that are an important part of language understanding. A closely related development was the principle of compositionality, namely that the meaning of a complex expression is composed from the meaning of its parts and their mode of combination (Chapter 10). This principle provided a useful corre- spondence between syntax and semantics, namely that the meaning of a complex ex- pression could be computed recursively. Consider the sentence It is not true that p, where p is a proposition. We can represent the meaning of this sentence as not(p). 442 | Afterword: The Language Challenge Similarly, we can represent the meaning of John saw Mary as saw(j, m). Now we can compute the interpretation of It is not true that John saw Mary recursively, using the foregoing information, to get not(saw(j,m)). The approaches just outlined share the premise that computing with natural language crucially relies on rules for manipulating symbolic representations. 
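NLTK makes this kind of symbolic manipulation directly available. Here is a minimal sketch using the LogicParser class introduced in Chapter 10 (the class name reflects the NLTK version used throughout this book) to build the representations just discussed:

>>> lp = nltk.LogicParser()
>>> saw = lp.parse('saw(j,m)')                         # "John saw Mary"
>>> not_saw = lp.parse('-saw(j,m)')                    # "It is not true that John saw Mary"
>>> birds_fly = lp.parse('all x.(bird(x) -> fly(x))')  # "All birds fly"
>>> print not_saw
-saw(j,m)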
For a certain period in the development of NLP, particularly during the 1980s, this premise provided a common starting point for both linguists and practitioners of NLP, leading to a family of grammar formalisms known as unification-based (or feature-based) grammar (see Chapter 9), and to NLP applications implemented in the Prolog programming lan- guage. Although grammar-based NLP is still a significant area of research, it has become somewhat eclipsed in the last 15–20 years due to a variety of factors. One significant influence came from automatic speech recognition. Although early work in speech processing adopted a model that emulated the kind of rule-based phonological pho- nology processing typified by the Sound Pattern of English (Chomsky & Halle, 1968), this turned out to be hopelessly inadequate in dealing with the hard problem of rec- ognizing actual speech in anything like real time. By contrast, systems which involved learning patterns from large bodies of speech data were significantly more accurate, efficient, and robust. In addition, the speech community found that progress in building better systems was hugely assisted by the construction of shared resources for quanti- tatively measuring performance against common test data. Eventually, much of the NLP community embraced a data-intensive orientation to language processing, cou- pled with a growing use of machine-learning techniques and evaluation-led methodology. Contemporary Philosophical Divides The contrasting approaches to NLP described in the preceding section relate back to early metaphysical debates about rationalism versus empiricism and realism versus idealism that occurred in the Enlightenment period of Western philosophy. These debates took place against a backdrop of orthodox thinking in which the source of all knowledge was believed to be divine revelation. During this period of the 17th and 18th centuries, philosophers argued that human reason or sensory experience has priority over revelation. Descartes and Leibniz, among others, took the rationalist position, asserting that all truth has its origins in human thought, and in the existence of “innate ideas” implanted in our minds from birth. For example, they argued that the principles of Euclidean geometry were developed using human reason, and were not the result of supernatural revelation or sensory experience. In contrast, Locke and others took the empiricist view, that our primary source of knowledge is the experience of our faculties, and that human reason plays a secondary role in reflecting on that experience. Often- cited evidence for this position was Galileo’s discovery—based on careful observation of the motion of the planets—that the solar system is heliocentric and not geocentric. In the context of linguistics, this debate leads to the following question: to what extent does human linguistic experience, versus our innate “language faculty,” provide the Afterword: The Language Challenge | 443 basis for our knowledge of language? In NLP this issue surfaces in debates about the priority of corpus data versus linguistic introspection in the construction of computa- tional models. A further concern, enshrined in the debate between realism and idealism, was the metaphysical status of the constructs of a theory. Kant argued for a distinction between phenomena, the manifestations we can experience, and “things in themselves” which can never been known directly. 
A linguistic realist would take a theoretical construct like noun phrase to be a real-world entity that exists independently of human percep- tion and reason, and which actually causes the observed linguistic phenomena. A lin- guistic idealist, on the other hand, would argue that noun phrases, along with more abstract constructs, like semantic representations, are intrinsically unobservable, and simply play the role of useful fictions. The way linguists write about theories often betrays a realist position, whereas NLP practitioners occupy neutral territory or else lean toward the idealist position. Thus, in NLP, it is often enough if a theoretical ab- straction leads to a useful result; it does not matter whether this result sheds any light on human linguistic processing. These issues are still alive today, and show up in the distinctions between symbolic versus statistical methods, deep versus shallow processing, binary versus gradient clas- sifications, and scientific versus engineering goals. However, such contrasts are now highly nuanced, and the debate is no longer as polarized as it once was. In fact, most of the discussions—and most of the advances, even—involve a “balancing act.” For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and use these methods to identify meaningful patterns in their sensory language experience (empiri- cism). We have seen many examples of this methodology throughout this book. Statistical methods inform symbolic models anytime corpus statistics guide the selection of pro- ductions in a context-free grammar, i.e., “grammar engineering.” Symbolic methods inform statistical models anytime a corpus that was created using rule-based methods is used as a source of features for training a statistical language model, i.e., “grammatical inference.” The circle is closed. NLTK Roadmap The Natural Language Toolkit is a work in progress, and is being continually expanded as people contribute code. Some areas of NLP and linguistics are not (yet) well sup- ported in NLTK, and contributions in these areas are especially welcome. Check http: //www.nltk.org/ for news about developments after the publication date of this book. Contributions in the following areas are particularly encouraged: 444 | Afterword: The Language Challenge Phonology and morphology Computational approaches to the study of sound patterns and word structures typically use a finite-state toolkit. Phenomena such as suppletion and non-concat- enative morphology are difficult to address using the string-processing methods we have been studying. The technical challenge is not only to link NLTK to a high- performance finite-state toolkit, but to avoid duplication of lexical data and to link the morphosyntactic features needed by morph analyzers and syntactic parsers. High-performance components Some NLP tasks are too computationally intensive for pure Python implementa- tions to be feasible. However, in some cases the expense arises only when training models, not when using them to label inputs. NLTK’s package system provides a convenient way to distribute trained models, even models trained using corpora that cannot be freely distributed. Alternatives are to develop Python interfaces to high-performance machine learning tools, or to expand the reach of Python by using parallel programming techniques such as MapReduce. 
Lexical semantics This is a vibrant area of current research, encompassing inheritance models of the lexicon, ontologies, multiword expressions, etc., mostly outside the scope of NLTK as it stands. A conservative goal would be to access lexical information from rich external stores in support of tasks in word sense disambiguation, parsing, and semantic interpretation. Natural language generation Producing coherent text from underlying representations of meaning is an impor- tant part of NLP; a unification-based approach to NLG has been developed in NLTK, and there is scope for more contributions in this area. Linguistic fieldwork A major challenge faced by linguists is to document thousands of endangered lan- guages, work which generates heterogeneous and rapidly evolving data in large quantities. More fieldwork data formats, including interlinear text formats and lexicon interchange formats, could be supported in NLTK, helping linguists to curate and analyze this data, while liberating them to spend as much time as pos- sible on data elicitation. Other languages Improved support for NLP in languages other than English could involve work in two areas: obtaining permission to distribute more corpora with NLTK’s data col- lection; and writing language-specific HOWTOs for posting at http://www.nltk .org/howto, illustrating the use of NLTK and discussing language-specific problems for NLP, including character encodings, word segmentation, |