A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm: rake

jopen, 9 years ago

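A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm, which automatically extracts keywords from a single document.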

    import rake
    import operator

    # EXAMPLE ONE - SIMPLE
    stoppath = "SmartStoplist.txt"

    # Example one is disabled by wrapping it in a string literal;
    # remove the triple quotes to run it (it needs the data file below).
    '''
    # 1. initialize RAKE by providing a path to a stopwords file
    # the notation is: (1) each word has at least 5 characters,
    # (2) each phrase has at most 3 words,
    # (3) each keyword appears in the text at least 4 times
    rake_object = rake.Rake(stoppath, 5, 3, 4)

    # 2. run RAKE on a given text
    with open("data/docs/fao_test/w2167e.txt", 'r') as sample_file:
        text = sample_file.read()

    keywords = rake_object.run(text)  # returns all the keywords and their scores

    # 3. print results
    print("Keywords:", keywords)

    print("----------")
    '''

    # EXAMPLE TWO - BEHIND THE SCENES (from https://github.com/aneesha/RAKE/rake.py)

    # initialize RAKE by providing a path to a stopwords file
    rake_object = rake.Rake(stoppath)

    text = ("Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility "
            "of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. "
            "Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating"
            " sets of solutions for all types of systems are given. These criteria and the corresponding algorithms "
            "for constructing a minimal supporting set of solutions can be used in solving all the considered types of "
            "systems and systems of mixed types.")

    # Split text into sentences
    # (sentences are split on punctuation marks such as commas and periods)
    sentenceList = rake.split_sentences(text)

    for sentence in sentenceList:
        print("Sentence:", sentence)

    # generate candidate keywords
    stopwordpattern = rake.build_stop_word_regex(stoppath)
    # each phrase is a candidate keyword
    phraseList = rake.generate_candidate_keywords(sentenceList, stopwordpattern)
    # This method does not work for phrases in which these boundaries are
    # part of the actual phrase (e.g. .Net or Dr. Who).
    # Improvements can be made here.
    # Read more at https://www.airpair.com/nlp/keyword-extraction-tutorial#4Lc4GeP5t5cYe7OR.99
    print("Phrases:", phraseList)

    # calculate individual word scores
    wordscores = rake.calculate_word_scores(phraseList)

    # generate candidate keyword scores
    keywordcandidates = rake.generate_candidate_keyword_scores(phraseList, wordscores)
    # One issue here is that the candidates are not normalized in any way.
    # As a result we may have keywords that look nearly identical:
    # "small scale production" and "small scale producers",
    # or "skim milk powder" and "skimmed milk powder".
    # Ideally, stemming and other kinds of normalization should be applied
    # before keyword extraction. This can be another improvement.

    for candidate in keywordcandidates.keys():
        print("Candidate: ", candidate, ", score: ", keywordcandidates.get(candidate))

    # sort candidates by score to determine top-scoring keywords
    sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1), reverse=True)
    totalKeywords = len(sortedKeywords)

    # for example, you could just take the top third as the final keywords
    for keyword in sortedKeywords[0:(totalKeywords // 3)]:
        print("Keyword: ", keyword[0], ", score: ", keyword[1])

    # this call outputs all the keywords and their scores
    print(rake_object.run(text))
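
The steps in example two (split into sentences, cut candidate phrases at stop words, score words, then score phrases) can be illustrated without the rake module. The following is a minimal sketch under stated assumptions, not the code from aneesha/RAKE: all function names are hypothetical, the stop word set is a tiny stand-in for SmartStoplist.txt, and each word is scored as degree divided by frequency, with a phrase scored as the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word set (an assumption; the post uses SmartStoplist.txt).
STOP_WORDS = {"of", "the", "a", "and", "over", "for", "are", "in", "is",
              "can", "be", "all", "these", "used"}


def split_into_sentences(text):
    """Split text on common punctuation marks (hypothetical helper)."""
    parts = re.split(r'[.,;:!?()\[\]"]+', text)
    return [part.strip() for part in parts if part.strip()]


def candidate_phrases(sentences, stop_words):
    """Cut each sentence at stop words; the word runs in between are candidates."""
    phrases = []
    for sentence in sentences:
        current = []
        for word in re.findall(r"[a-z][a-z'-]*", sentence.lower()):
            if word in stop_words:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def score_words(phrases):
    """Score each word as degree / frequency.

    The degree of a word is the total length of the phrases it occurs in,
    so words that tend to appear in longer phrases score higher.
    """
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    return {word: degree[word] / frequency[word] for word in frequency}


def score_phrases(phrases, word_scores):
    """Score each candidate phrase as the sum of its word scores."""
    return {" ".join(phrase): sum(word_scores[word] for word in phrase)
            for phrase in phrases}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = candidate_phrases(split_into_sentences(text), stop_words)
    scores = score_phrases(phrases, score_words(phrases))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for phrase, score in extract_keywords(
            "Compatibility of linear Diophantine equations."):
        print(phrase, score)
```

Because the sketch applies no stemming or other normalization, near-duplicate candidates remain separate entries, which is the same limitation the comments in the example above point out.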

Project home page: http://www.open-open.com/lib/view/home/1421808449078