Hadoopy: A Python Wrapper for Hadoop Written in Cython

jopen, 11 years ago

Hadoopy is a Python wrapper for Hadoop Streaming, written in Cython. It is simple, fast, and easy to modify, and it has been tested on clusters of more than 700 nodes. Hadoopy's goals are:

Interface similar to the Hadoop API (design patterns usable between Python/Java interfaces)
General compatibility with dumbo to allow users to switch back and forth
Usable on Hadoop clusters without Python or admin access
Fast conversion and processing
Stay small and well documented
Be transparent with what is going on
Handle programs with complicated .so’s, ctypes, and extensions
Code written for hack-ability
Simple HDFS access (e.g., reading, writing, ls)
Support (and not replicate) the greater Hadoop ecosystem (e.g., Oozie, Whirr)

Killer features (unique to Hadoopy):

Automated job parallelization ‘auto-oozie’ available in the hadoopy flow project (maintained out of branch)
Local execution of unmodified MapReduce job with launch_local
Read/write sequence files of TypedBytes directly to HDFS from Python (readtb, writetb)
Allows printing to stdout and stderr in Hadoop tasks without causing problems (uses the ‘pipe hopping’ technique, both are available in the task’s stderr)
Works on clusters without any extra installation, Python, or any Python libraries (uses Pyinstaller that is included in this source tree)
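
To make the job model behind these features concrete, here is a minimal word-count sketch in plain Python. It assumes the usual Hadoopy convention that a mapper takes a key and a value and yields key/value pairs, and that a reducer takes a key and an iterable of values. The `run_locally` helper is hypothetical: it only imitates the map, group, and reduce steps in memory so the snippet runs without Hadoop or Hadoopy installed, and it is not how launch_local is implemented.

```python
from collections import defaultdict


def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in a line of text."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def run_locally(kvs, mapper, reducer):
    """Hypothetical stand-in for a local run: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in kvs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce keys in sorted order, as a shuffle phase would deliver them.
    for out_key in sorted(groups):
        for result in reducer(out_key, groups[out_key]):
            yield result


# Input pairs of (line offset, line text), similar to a text input format.
lines = [(0, "a b a"), (1, "b c")]
counts = dict(run_locally(lines, mapper, reducer))
print(counts)
```

In a real job the same two functions would be handed to Hadoopy in the job script and the script launched on the cluster or through launch_local; the exact call signatures should be checked against the Hadoopy documentation.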

Additional features:

Works on OS X
Critical path is in Cython
Simple HDFS access (readtb and ls) inside Python, even inside running jobs
Unit test interface
Reporting using status and counters (and print statements! no need to be scared of them in Hadoopy)
Supports design patterns in the Lin & Dyer book
TypedBytes support (very fast)
Oozie support
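
As an illustration of one design pattern from the Lin & Dyer book, the sketch below shows in-mapper combining: the mapper accumulates partial counts in memory and emits them once at the end of the task instead of emitting one pair per word. It assumes a mapper can be written as a class with a `map` method and a `close` method called after the last record; whether Hadoopy calls these methods exactly this way is an assumption to verify against its documentation. The `drive` helper is hypothetical and only simulates one map task in memory.

```python
from collections import Counter


class CombiningMapper:
    """Word-count mapper that aggregates counts inside the map task."""

    def __init__(self):
        self.counts = Counter()

    def map(self, key, value):
        # Accumulate locally; nothing is emitted per input record.
        self.counts.update(value.split())
        return ()

    def close(self):
        # Emit one (word, partial_count) pair per distinct word seen by this task.
        for word, count in sorted(self.counts.items()):
            yield word, count


def drive(mapper_obj, kvs):
    """Hypothetical driver: feed records to map(), then flush with close()."""
    out = []
    for key, value in kvs:
        out.extend(mapper_obj.map(key, value))
    out.extend(mapper_obj.close())
    return out


emitted = drive(CombiningMapper(), [(0, "a b a"), (1, "b c")])
print(emitted)
```

The trade-off is memory: the mapper holds one counter entry per distinct key until the task finishes, in exchange for far fewer pairs passing through the shuffle.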