Automatic text summarization

JerHma, 8 years ago

Source: https://github.com/miso-belica/sumy

Automatic text summarizer

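Automatic text summarization. Sumy is a simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The summarization methods it implements include LexRank, Luhn, Edmundson, and LSA, which are the ones used in the examples below.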

Here are some other summarizers:

Installation

Make sure you have Python 2.7/3.3+ and pip (Windows, Linux) installed. The preferred way to install is simply:

$ [sudo] pip install sumy

Or, for the latest development version:

$ [sudo] pip install git+https://github.com/miso-belica/sumy.git

Usage

Sumy includes a command-line utility for quick summarization of documents.

$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
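The commands above pass `--length` both as a plain number (`10`) and as a percentage (`3%`). As a rough illustration of how such a value could be turned into a number of sentences, here is a minimal Python 3 sketch. The helper name `sentences_to_keep` and its rounding rule are assumptions made for this example; sumy's own handling may round differently.

```python
import math


def sentences_to_keep(length, total_sentences):
    """Turn a length such as "10" or "3%" into a sentence count.

    Hypothetical helper for illustration only; not part of sumy's API.
    """
    text = str(length).strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        # Assumption: round up, and keep at least one sentence
        # whenever the document is not empty.
        count = math.ceil(total_sentences * percent / 100)
        return min(total_sentences, max(1, count))
    # A plain number is an absolute count, capped at the document size.
    return min(total_sentences, int(text))


if __name__ == "__main__":
    print(sentences_to_keep("10", 200))
    print(sentences_to_keep("3%", 200))
```

Under these assumptions, a percentage scales with the document while a plain number stays fixed, which is why the percentage form is handy for inputs of very different sizes.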

The evaluation methods can be run against a chosen summarization method with the commands below:

$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
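To give a feel for what comparing a generated summary against `reference_summary.txt` can mean, here is a small Python 3 sketch of sentence-level precision, recall, and F-score. It is only an illustration of the general idea: the function name `precision_recall_f1` is made up for this example, and the metrics that `sumy_eval` actually reports are not listed in this post.

```python
def precision_recall_f1(candidate_sentences, reference_sentences):
    """Compare two summaries as sets of sentences.

    Hypothetical helper for illustration; not sumy's evaluation code.
    """
    candidate = set(candidate_sentences)
    reference = set(reference_sentences)
    shared = candidate & reference

    # Precision: share of generated sentences that appear in the reference.
    precision = len(shared) / len(candidate) if candidate else 0.0
    # Recall: share of reference sentences that the summary recovered.
    recall = len(shared) / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    generated = ["A.", "B.", "C."]
    reference = ["B.", "C.", "D.", "E."]
    print(precision_recall_f1(generated, reference))
```

Exact sentence matching like this only makes sense for extractive summaries, where both texts are built from sentences of the same source document.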

Python API

You can also use sumy as a library in your project.

# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "czech"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
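For readers who want to see the general shape of what a call like `summarizer(parser.document, SENTENCES_COUNT)` produces, here is a dependency-free Python 3 sketch of extractive summarization. It uses a plain word-frequency heuristic, not LSA, and it is not sumy's code. The tiny `STOP_WORDS` set, the naive sentence splitter, and every function name are assumptions made for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "to", "in", "it"})


def split_sentences(text):
    # Naive splitter: break after ".", "!" or "?" followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased words with stop words removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, sentences_count):
    """Return the highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        tokens = words(sentence)
        if not tokens:
            return 0.0
        # Average document-wide frequency of the sentence's words.
        return sum(frequencies[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentences_count])  # restore document order
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    sample = ("Cats sleep a lot. Cats chase mice. Dogs bark. "
              "Cats and mice are enemies.")
    for sentence in summarize(sample, 2):
        print(sentence)
```

The point of the sketch is the overall pattern: score every sentence, keep the top few, and return them in document order. The stemmer and stop-word list in the sumy example above play the same role as `words()` here, normalizing tokens before scoring.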

Tests

Setup:

$ pip install pytest pytest-cov

Run the tests with:

$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5