Natural Language Processing | (2) Simple Text Processing with Python

osa46z67, 9 years ago
<p>Hello everyone. I am an undergraduate at the Wang Yanan Institute for Studies in Economics, Xiamen University. Today I will introduce some ways to process text with Python.</p>
<p>Note: for non-commercial reposting, crediting the author is sufficient; for commercial reposting, please contact the author for authorization and pay the author's fee. This column has authorized the "维权骑士" (RightKnights) website ( <a href="/misc/goto?guid=4959676853861969517" rel="nofollow,noindex">http://rightknights.com</a> ) to pursue copyright infringement of the articles I publish on Zhihu.</p>
<p>---------------------------------------------------------------------------------------------------------------------------</p>
<p>NLP is mainly about processing text. In more advanced applications we can process whatever text we need (for example, the "buyer reviews on shopping sites" mentioned last time). When starting out, though, we usually practice on the corpora that ship with NLTK: NLTK provides not only text-processing tools but also some text material.</p>
<p>The \nltk-3.2.1\nltk folder we downloaded contains a module named book.py. Running "from nltk.book import *" in the Python command window imports the texts that this module provides: nine texts (several of them classic books) and nine sentences, as shown below:</p>
<pre><code class="language-python">>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>></code></pre>
<p>The output shows that the nine texts are named text1 to text9 and the nine sentences are named sent1 to sent9. In later commands we use these names to refer to the corresponding texts when processing them.</p>
<p>The rest of this article introduces some methods and functions at two levels: the text level and the word level. First, at the text level, which methods can accomplish the following tasks?</p>
<p>1. Find the contexts in which a given word occurs in a text;</p>
<p>2. Find words that are used similarly to a given word, and identify the contexts they share in the text;</p>
<p>3. See how a word, or several words, are distributed across the whole text.</p>
<p>In tasks 2 and 3 we may need to handle several words ("they", "several words"). With some Python background it is not hard to guess that a single word can be represented by a <em>string</em>, while several words need a <em>list</em>. A list is delimited by square brackets "[]" and contains a finite number (possibly zero) of ordered strings (words or other symbols), separated by commas. Try running:</p>
<pre><code class="language-python">>>> print sent1
['Call', 'me', 'Ishmael', '.']
>>></code></pre>
<p>The result is a list.</p>
<p>Next, let us look at what these tasks are good for.</p>
<p>The contexts of certain words can give us valuable information. text3 is <em>The Book of Genesis</em>; if we want to know how long some of its characters lived, we can apply task 1 to the word "lived" to get the relevant information:</p>
<pre><code class="language-python">>>> text3.concordance('lived')
Displaying 25 of 38 matches:
ay when they were created . And Adam lived an hundred and thirty years , and be
ughters : And all the days that Adam lived were nine hundred and thirty yea and
nd thirty yea and he died . And Seth lived an hundred and five years , and bega
ve years , and begat Enos : And Seth lived after he begat Enos eight hundred an
welve years : and he died . And Enos lived ninety years , and begat Cainan : An
 years , and begat Cainan : And Enos lived after he begat Cainan eight hundred
ive years : and he died . And Cainan lived seventy years and begat Mahalaleel :
rs and begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight hund
years : and he died . And Mahalaleel lived sixty and five years , and begat Jar
s , and begat Jared : And Mahalaleel lived after he begat Jared eight hundred a
and five yea and he died . And Jared lived an hundred sixty and two years , and
o years , and he begat Eno And Jared lived after he begat Enoch eight hundred y
 and two yea and he died . And Enoch lived sixty and five years , and begat Met
 ; for God took him . And Methuselah lived an hundred eighty and seven years ,
 , and begat Lamech . And Methuselah lived after he begat Lamech seven hundred
nd nine yea and he died . And Lamech lived an hundred eighty and two years , an
ch the LORD hath cursed . And Lamech lived after he begat Noah five hundred nin
naan shall be his servant . And Noah lived after the flood three hundred and fi
xad two years after the flo And Shem lived after he begat Arphaxad five hundred
at sons and daughters . And Arphaxad lived five and thirty years , and begat Sa
ars , and begat Salah : And Arphaxad lived after he begat Salah four hundred an
begat sons and daughters . And Salah lived thirty years , and begat Eber : And
y years , and begat Eber : And Salah lived after he begat Eber four hundred and
 begat sons and daughters . And Eber lived four and thirty years , and begat Pe
y years , and begat Peleg : And Eber lived after he begat Peleg four hundred an
>>></code></pre>
<p>"concordance" is a <em>method</em> (or <em>function</em>; no distinction is drawn between the two here) of the Text class (see the concept of a "class" in Python). Pass the word you want to look up as a string inside the parentheses to get its contexts.</p>
<p>Similarly, task 2 can be done with the following code:</p>
<pre><code class="language-python">>>> text2.similar('monstrous')
very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast
>>> text2.common_contexts(['monstrous', 'very'])
a_pretty is_pretty a_lucky am_glad be_glad
>>></code></pre>
<p>The first command returns the words in text2 (<em>Sense and Sensibility</em>) that are used similarly to "monstrous". The second command uses the <em>method</em> "common_contexts" (the "_" plays the role of a hyphen in the function name) and returns the contexts shared by the two words in the list passed inside the parentheses.</p>
<p>Task 3 looks more practical: the result can be output as a dispersion plot. This requires two packages, NumPy and Matplotlib. (They can be installed via <a href="/misc/goto?guid=4959676853943021410" rel="nofollow,noindex">http://www.nltk.org/</a> , or downloaded directly from <a href="/misc/goto?guid=4959676854034517928" rel="nofollow,noindex">http://pan.baidu.com/s/1slSsSsH</a> .)</p>
<p>Running the following code (with a list of character names as the argument):</p>
<pre><code class="language-python">>>> text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])</code></pre>
<p>shows roughly where these four main characters appear over the course of the novel. Now, if you are told that two of the four are husband and wife, even a reader who has not read the novel can guess from the plot which two they are.</p>
<p>Next come some word-level tools. Four are covered briefly here: len(), set(), sorted(), and count(). (To be clear, "word level" does not mean that these tools operate on individual words; it means that when we use them, our purpose is largely independent of the context of the whole text.)</p>
<p>len() accepts a text or a sent (or a list; likewise below) and returns its length, that is, the number of words and other symbols it contains (a word or symbol that occurs repeatedly is counted every time, which distinguishes this from "vocabulary size"). Note that punctuation marks such as the comma ',' are counted as separate tokens, while a combination such as '."' (a period followed by a closing double quotation mark) is counted as a single token. For example:</p>
<pre><code class="language-python">>>> len(text1)
260819
>>> len(sent1)
4
>>></code></pre>
<p>set() and sorted() also take a text or a sent. set() returns every distinct word or other symbol in the text (text or sent; likewise below) exactly once, as a set, which amounts to an unordered vocabulary. sorted() is often used together with set(), and you have probably guessed what it does: it returns the items of its argument as a list in the default sort order.</p>
<p>So the following code yields the vocabulary of a text (using text2, <em>Sense and Sensibility</em>, as an example; the vocabulary also contains symbols other than alphabetic words):</p>
<pre><code class="language-python">>>> sorted(set(text2))</code></pre>
<p>Combining the three functions above lets us build more functions that examine properties of a text. As an exercise, think about how to compute the lexical diversity of a text. (Hint: it can be measured by the average number of times each word occurs.)</p>
<p>The count() method takes a word in string form, for example:</p>
<pre><code class="language-python">>>> text2.count('monstrous')
11
>>></code></pre>
<p>The result is the number of times "monstrous" occurs in text2.</p>
<p>Combined with what was covered above, it is easy to work out the relative frequency of this word in the text. There is, of course, a more convenient tool for counts and frequencies (NLTK's built-in FreqDist() function). It is not covered here; once we have richer text material, we will use it together with another frequency-distribution function to do more meaningful things.</p>
<p>---------------------------------------------------------------------------------------------------------------------------</p>
<p>Note: text1 to text9 and sent1 to sent9 can be indexed, sliced, and iterated over just like <em>sequences</em>. See a Python tutorial for the concepts, and the earlier article:</p>
<p><a href="/misc/goto?guid=4959676854113642002" rel="nofollow,noindex">https://zhuanlan.zhihu.com/p/21360064?refer=xmucpp</a></p>
<p>Code examples:</p>
<pre><code class="language-python">>>> text2.count('monstrous')
11
>>> text2[173]  # token at index 173, i.e. the 174th token (indexing starts at 0)
u'nephew'
>>> text2.index('monstrous')  # index of the first occurrence of the word
38430
>>> text2[173:177]  # tokens at indices 173 to 176; index 177 is excluded
[u'nephew', u'and', u'niece', u',']
>>> for word in text2:
        print word,  # prints all of Sense and Sensibility (takes a while to finish)</code></pre>
<p>-----------------------------------------------------END------------------------------------------------------</p>
<p>For more about the project, follow our project column: <a href="/misc/goto?guid=4959676854191570185" rel="nofollow,noindex">China's Prices Project - Zhihu Column</a></p>
<p>Project contacts:</p>
<ul>
<li>Project email (@iGuo's email): zhangguocpp@163.com</li>
<li>To apply to join the project, contact the HR lead @Suri: liuxiaomancpp@163.com</li>
<li>Zhihu: @iGuo @Suri (project lead) @林行健 @Dementia (technical lead) @张土不 (finance lead)</li>
</ul>
<p>Author: CPP</p>
<p>Link: <a href="/misc/goto?guid=4959676854276610371" rel="nofollow,noindex">https://zhuanlan.zhihu.com/p/21511857</a></p>
<p>Source: Zhihu</p>
<p>Copyright belongs to the author. For commercial reposting, please contact the author for authorization; for non-commercial reposting, please credit the source.</p>
<p>From: https://zhuanlan.zhihu.com/p/22059714</p>
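<p>As a rough illustration of the word-level measures discussed in this article (vocabulary, lexical diversity, and the relative frequency of a word), here is a minimal sketch in plain Python 3. It does not use NLTK: the list <em>tokens</em> is a made-up sample standing in for a text such as text2, and the function names lexical_diversity, average_occurrences, and relative_frequency are hypothetical helpers introduced only for this example. Under Python 2, which the sessions in this article use, "/" on integers truncates, so "from __future__ import division" would be needed first.</p>

```python
# A pure-Python sketch; no NLTK install or corpus download is required.
# `tokens` is a made-up sample standing in for an NLTK text such as text2.
tokens = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'again', '.']


def lexical_diversity(words):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(words)) / len(words)


def average_occurrences(words):
    """Average number of times each distinct token occurs (the hint's measure)."""
    return len(words) / len(set(words))


def relative_frequency(words, target):
    """Share of all tokens that are equal to `target`."""
    return words.count(target) / len(words)


print(sorted(set(tokens)))                  # the sorted vocabulary
print(lexical_diversity(tokens))            # 5 distinct / 8 total
print(average_occurrences(tokens))          # 8 total / 5 distinct
print(relative_frequency(tokens, 'Call'))   # 2 occurrences / 8 total
```

<p>The same functions should also accept an NLTK text or sent, since they rely only on len(), set(), and count(), all of which appear above; that has not been verified here against every NLTK version.</p>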