TextBlob: A Python Text-Processing Toolkit

jopen, 10 years ago

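TextBlob is an interesting Python toolkit for text processing. It is essentially a wrapper around two other Python toolkits, NLTK and Pattern ("TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both"), and it exposes interfaces for many text-processing tasks, including part-of-speech tagging, noun phrase extraction, sentiment analysis, text classification, and spelling correction. It even offers translation and language detection, although these rely on Google's API and are subject to a limit on the number of calls. TextBlob is still a relatively young project, so interested readers may want to keep an eye on it.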

    from textblob import TextBlob

    text = '''
    The titular threat of The Blob has always struck me as the ultimate movie
    monster: an insatiably hungry, amoeba-like mass able to penetrate
    virtually any safeguard, capable of--as a doomed doctor chillingly
    describes it--"assimilating flesh on contact.
    Snide comparisons to gelatin be damned, it's a concept with the most
    devastating of potential consequences, not unlike the grey goo scenario
    proposed by technological theorists fearful of
    artificial intelligence run rampant.
    '''

    blob = TextBlob(text)
    blob.tags           # [(u'The', u'DT'), (u'titular', u'JJ'),
                        #  (u'threat', u'NN'), (u'of', u'IN'), ...]

    blob.noun_phrases   # WordList(['titular threat', 'blob',
                        #            'ultimate movie monster',
                        #            'amoeba-like mass', ...])

    for sentence in blob.sentences:
        print(sentence.sentiment.polarity)
    # 0.060
    # -0.341

    blob.translate(to="es")  # 'La amenaza titular de The Blob...'

Features:

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration
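
To make two of the listed features more concrete, here is a minimal pure-Python sketch of what "n-grams" and "word frequencies" compute. It is not TextBlob's implementation: the helper names `tokenize`, `ngrams`, and `word_counts` are hypothetical, and the regex tokenizer is a deliberately naive stand-in for a real one.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def word_counts(tokens):
    """Map each token to the number of times it occurs."""
    return Counter(tokens)


tokens = tokenize("Now is better than never. Now is good.")
trigrams = ngrams(tokens, n=3)
counts = word_counts(tokens)

print(trigrams[0])     # first three-word window
print(counts["now"])   # occurrences of "now", case-insensitive
```

In TextBlob itself, the comparable accessors on a `TextBlob` object are `blob.ngrams(n=3)` and `blob.word_counts`, which rely on the library's own tokenizer and so may split text differently from this sketch.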

Official homepage: http://textblob.readthedocs.org/en/dev/
GitHub repository: https://github.com/sloria/textblob