Python-tesseract: a Python wrapper for the Tesseract OCR engine

jopen, 11 years ago

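Python-tesseract is a Python wrapper for the Tesseract OCR (optical character recognition) engine. It can read any common image format (JPG, GIF, PNG, TIFF, etc.) and decode it into readable text. No temporary files are created during OCR processing.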
Example 1:

import tesseract

api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
# Restrict recognition to digits and lowercase letters
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)

mImgFile = "eurotext.jpg"
# Read the whole image into memory and pass the buffer to the engine
with open(mImgFile, "rb") as f:
    mBuffer = f.read()
result = tesseract.ProcessPagesBuffer(mBuffer, len(mBuffer), api)
print("result(ProcessPagesBuffer)=%s" % result)

Example 2:

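import cv2.cv as cv  # legacy OpenCV 2.x interface
import tesseract

api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

# Load the image as grayscale and hand it to the OCR engine
image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image, api)
text = api.GetUTF8Text()     # recognized text
conf = api.MeanTextConf()    # mean confidence of the recognized text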
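
The second example stores the recognized text and its mean confidence but does not use them. The following is a minimal sketch of one way they might be used, assuming the confidence is an integer from 0 to 100 (as the Tesseract API documents for MeanTextConf). The helper name accept_ocr_result and the threshold of 60 are hypothetical and not part of python-tesseract. The sample values stand in for real OCR output.

```python
def accept_ocr_result(text, conf, min_conf=60):
    """Return the stripped text if the mean confidence is high enough, else None.

    Hypothetical helper: `min_conf` is an arbitrary illustrative threshold.
    """
    if text is None:
        return None
    cleaned = text.strip()
    # Reject empty output and output below the confidence threshold
    if not cleaned or conf < min_conf:
        return None
    return cleaned


# Stand-in values; in practice these would come from
# api.GetUTF8Text() and api.MeanTextConf().
sample_text = "The quick brown fox\n\n"
sample_conf = 87
print(accept_ocr_result(sample_text, sample_conf))
```

A suitable threshold depends on image quality and on how costly a misread is, so it would need tuning against real samples.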

Project homepage: http://www.open-open.com/lib/view/home/1352354768500