Rosetta: A Set of Python Tools (Libraries) for Data Science

jopen, 9 years ago

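Rosetta is a set of Python tools (libraries) that supports data science work, especially text processing. It is particularly well optimized for parallel execution and for handling large files.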

Tools for data science with a focus on text processing.

  • Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
  • Integrates with existing scientific Python stack as well as select outside tools.
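To illustrate the "medium data" idea, here is a minimal sketch that processes a CSV one row at a time, so memory use stays flat regardless of file size. It uses only the Python standard library and is not Rosetta's API; the function name `count_column_values` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def count_column_values(lines, column):
    """Tally the values of one CSV column, reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Any iterable of lines works; an open file object streams from disk the same way.
sample = io.StringIO("user,action\nann,click\nbob,view\nann,view\n")
counts = count_column_values(sample, "user")
```

Only the running tally is held in memory, never the whole file, which is what makes this pattern workable on data that does not fit in RAM.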

Examples

  • See the examples/ directory.
  • The docs contain plots of example output.

Packages

cmdutils

  • Unix-like command line utilities. Filters (read from stdin/write to stdout) for files.
  • Focus on stream processing and csv files.
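A filter of the kind described above can be sketched as a function that copies selected CSV columns from one stream to another. This is a standard-library illustration only, not one of the cmdutils tools; the names `cut_columns` and `main` are hypothetical.

```python
import csv
import io
import sys


def cut_columns(infile, outfile, keep):
    """Copy only the named columns from one CSV stream to another."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(
        outfile, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Rows are written as they are read, so the file is never held in memory.
        writer.writerow(row)


def main():
    """Entry point for shell use: column names come from the command line."""
    cut_columns(sys.stdin, sys.stdout, sys.argv[1:])


# In-memory demonstration of the same function.
src = io.StringIO("name,age,city\nann,31,Oslo\nbob,27,Lima\n")
dst = io.StringIO()
cut_columns(src, dst, ["name", "city"])
```

Calling `main()` from a script would let it sit in a shell pipeline between other filters, reading stdin and writing stdout.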

parallel

  • Wrappers for Python multiprocessing that add ease of use
  • Memory-friendly multiprocessing
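As a rough sketch of what a memory-friendly multiprocessing wrapper might look like, the generator below uses `multiprocessing.Pool.imap`, which hands out work in chunks and yields results in order instead of first building complete input and output lists as `Pool.map` does. The name `imap_bounded` and its parameters are assumptions for illustration, not Rosetta's interface.

```python
import multiprocessing


def imap_bounded(func, iterable, n_jobs=2, chunksize=100):
    """Yield func(item) in input order while worker processes run in parallel."""
    with multiprocessing.Pool(processes=n_jobs) as pool:
        # imap returns results lazily, so the caller can consume them one by one.
        for result in pool.imap(func, iterable, chunksize=chunksize):
            yield result


if __name__ == "__main__":
    words = ["medium", "data", "rosetta"]
    print(list(imap_bounded(len, words, n_jobs=2, chunksize=1)))
```

The `__main__` guard matters on platforms where worker processes re-import the main module, and `func` must be picklable (a built-in or a module-level function).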

text

  • Stream text from disk to formats used in common ML processes
  • Write processed text to sparse formats
  • Helpers for ML tools (e.g. Vowpal Wabbit, Gensim)
  • Other general utilities
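The idea of streaming text into a sparse format can be sketched as follows: each document becomes one line of token counts in the basic `label | feature:value` layout that Vowpal Wabbit reads. This is a standard-library illustration, not Rosetta's implementation; the tokenizer and the names `to_sparse_line` and `stream_sparse` are assumptions.

```python
import re
from collections import Counter

# Simple tokenizer: lowercase alphanumeric runs only, so tokens never
# contain the ':' or '|' characters that are special in the output format.
TOKEN = re.compile(r"[a-z0-9]+")


def to_sparse_line(text, label=None):
    """Turn raw text into one 'label | token:count ...' line."""
    counts = Counter(TOKEN.findall(text.lower()))
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    prefix = "" if label is None else f"{label} "
    return f"{prefix}| {features}"


def stream_sparse(docs):
    """Lazily convert an iterable of (label, text) pairs, one line per document."""
    for label, text in docs:
        yield to_sparse_line(text, label)


lines = list(stream_sparse([(1, "Big data, big files"), (-1, "small data")]))
```

Because `stream_sparse` is a generator, documents can be read from disk and written out one at a time without holding the corpus in memory.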

workflow

  • High-level wrappers that have helped with our workflow and provide additional examples of code use

modeling

  • General ML modeling utilities

Install

Check out the master branch from the rosetta repo. Then (so long as you have pip):

cd rosetta
make
make test

If you update the source, you can do

make reinstall
make test

The above make targets use pip, so you can of course do pip uninstall at any time.

Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) here. Then

pip install rosetta-X.X.X.tar.gz

Development

Code

You can get the latest sources with

git clone git://github.com/columbia-applied-data-science/rosetta

Contributing

Feel free to contribute a bug report or a request by opening an issue.

The preferred method to contribute is to fork and send a pull request. Before doing this, read CONTRIBUTING.md.

Dependencies

  • Major dependencies on Pandas and numpy.
  • Minor dependencies on Gensim and statsmodels.
  • Some examples need scikit-learn.
  • Minor dependencies on docx.
  • Minor dependencies on the Unix utilities pdftotext and catdoc.

Testing

From the base repo directory, rosetta/, you can run all tests with

make test

Project homepage: http://www.open-open.com/lib/view/home/1422504070611