newspaper: an open-source Python library for full-text and article metadata extraction

jopen · 9 years ago

newspaper is an open-source Python library for extracting news articles, full text, and article metadata. It supports many natural languages, including Chinese, extracts several kinds of metadata such as keywords, images, and summaries, and provides multi-threaded downloading.

  • Full Python3 and Python2 support
  • Multi-threaded article download framework (sketched after the examples below)
  • News url identification
  • Text extraction from html
  • Top image extraction from html
  • All image extraction from html
  • Keyword extraction from text
  • Summary extraction from text
  • Author extraction from text
  • Google trending terms extraction (also sketched after the examples below)
  • Works in 10+ languages (English, Chinese, German, Arabic, ...)
>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)

>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()

>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]

>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'

>>> import newspaper

>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
>>>     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
>>>     print(category)
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...

>>> import requests
>>> from newspaper import fulltext

>>> html = requests.get(...).text
>>> text = fulltext(html)

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect the language.

>>> from newspaper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'

>>> a = Article(url, language='zh')  # Chinese

>>> a.download()
>>> a.parse()

>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(12月10日)的答问大会开始之际
在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的
意图和动机。 一些亲北京阵营议员欢迎梁振英道歉,
且认为应能获得香港民众接受,但这些议员也质问梁振英有

>>> print(a.title)
港特首梁振英就住宅违建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same API. :)

>>> import newspaper
>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')

>>> for category in sina_paper.category_urls():
>>>     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...

>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()

>>> print(article.text)
新浪武汉汽车综合 随着汽车市场的日趋成熟,
传统的“集全家之力抱得爱车归”的全额购车模式已然过时,
另一种轻松的新兴购车模式――金融购车正逐步成为时下消费者购
买爱车最为时尚的消费理念,他们认为,这种新颖的购车
模式既能在短期内
...

>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽车网_新浪汽车_新浪网
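The feature list above advertises a multi-threaded article download framework, which the session above does not show. The following is a minimal sketch based on the library's news_pool helper; the source URLs and thread count are illustrative, not taken from the original article.

>>> import newspaper
>>> from newspaper import news_pool

>>> cnn_paper = newspaper.build('http://cnn.com')
>>> tc_paper = newspaper.build('http://techcrunch.com')

>>> papers = [cnn_paper, tc_paper]
>>> news_pool.set(papers, threads_per_source=2)  # 2 sources x 2 threads = 4 download threads
>>> news_pool.join()                             # blocks until every article has been downloaded

>>> # downloads are finished at this point; parse individual articles as usual
>>> papers[0].articles[0].parse()

Since the threads are allocated per source, a slow site mostly only delays its own downloads.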
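The list also mentions Google trending terms extraction, which is likewise absent from the examples. Here is a short sketch of the top-level hot() and popular_urls() helpers; the returned values shown are illustrative.

>>> import newspaper

>>> newspaper.hot()            # current trending search terms, as reported by Google Trends
['benghazi', 'vladimir putin', ...]

>>> newspaper.popular_urls()   # popular news source URLs, handy as input to newspaper.build()
['http://www.huffingtonpost.com', 'http://cnn.com', ...]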

Project homepage: http://www.open-open.com/lib/view/home/1432176001646