Apache Spark 1.3 Released: An Open-Source, In-Memory Cluster Computing System


Apache Spark 1.3 has been released. Version 1.3 introduces the long-awaited DataFrame API, an evolution of Spark's RDD abstraction designed to make working with large datasets simple and fast. The release also brings a large number of improvements across Streaming, ML, and SQL.

Spark is an open-source cluster computing system based on in-memory computation, designed to make data analysis faster. Spark itself is remarkably compact: it was developed by a small team at UC Berkeley's AMP Lab led by Matei Zaharia, it is written in Scala, and the core of the project consists of only 63 Scala files.
Spark is an open-source cluster computing environment similar to Hadoop, but there are useful differences between the two that make Spark superior for certain workloads: Spark enables in-memory distributed datasets, which let it optimize iterative workloads in addition to providing interactive queries.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, and Scala code can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is really a complement to Hadoop and can run in parallel over the Hadoop file system; this is enabled through a third-party cluster framework called Mesos. Spark was developed at UC Berkeley's AMP Lab (Algorithms, Machines, and People Lab) and can be used to build large-scale, low-latency data analysis applications.
Spark Cluster Computing Architecture
Although Spark shares similarities with Hadoop, it provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of cluster computing workload: workloads that reuse a working dataset across parallel operations, such as machine learning algorithms. To optimize these workloads, Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory to reduce access latency.
Spark also introduces an abstraction called the Resilient Distributed Dataset (RDD). An RDD is a read-only collection of objects distributed across a set of nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt. Rebuilding a portion of a dataset relies on a fault-tolerance mechanism that maintains its "lineage", the information that allows that portion to be reconstructed from the process that derived it. An RDD is represented as a Scala object and can be created from a file, from a parallelized collection sliced across the nodes, or as a transformation of another RDD; the persistence of an existing RDD can also be changed, for example by requesting that it be cached in memory.
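To make the RDD model concrete, here is a minimal PySpark sketch (the file path and data are hypothetical; Python is used for consistency with the examples later in this article) showing RDDs created from a file and from a local collection, a new RDD derived by a transformation, and a request to cache an RDD in memory:

from pyspark import SparkContext

sc = SparkContext("local", "rdd-example")

# An RDD created from a file (path is hypothetical)
lines = sc.textFile("hdfs:///path/to/data.txt")

# An RDD created by parallelizing a local collection across the nodes
numbers = sc.parallelize(range(1, 1001))

# An RDD derived from another RDD via a transformation;
# the lineage of this derivation is what allows lost partitions to be rebuilt
errors = lines.filter(lambda line: "ERROR" in line)

# Change the persistence of an existing RDD by caching it in memory
errors.cache()

# Actions trigger computation over the (now cached) dataset
print(errors.count())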
Applications in Spark are called driver programs; they implement operations that run either on a single node or in parallel across a set of nodes. Like Hadoop, Spark supports both single-node and multi-node clusters. For multi-node operation, Spark relies on the Mesos cluster manager, which provides an efficient platform for resource sharing and isolation among distributed applications. This setup allows Spark to coexist with Hadoop in a single shared pool of nodes.

Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.

We’ll be publishing in-depth overview posts covering Spark’s new features over the coming weeks. Some of the salient features of this release are:

A new DataFrame API

The DataFrame API that we recently announced officially ships in Spark 1.3. DataFrames evolve Spark’s RDD model, making operations with structured datasets even faster and easier. They are inspired by, and fully interoperable with, Pandas and R data frames, and are available in Spark’s Java, Scala, and Python APIs as well as the upcoming (unreleased) R API. DataFrames introduce new simplified operators for filtering, aggregating, and projecting over large datasets. Internally, DataFrames leverage the Spark SQL logical optimizer to intelligently plan the physical execution of operations so that they work well on large datasets. This planning permeates all the way down to physical storage, where optimizations such as predicate pushdown are applied based on analysis of user programs. Read more about the DataFrame API in the SQL programming guide.

# Constructs a DataFrame from a JSON dataset.
users = context.load("s3n://path/to/users.json", "json")

# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
young = users[users.age < 21]

# DataFrames support existing RDD operators
print("Young users: " + str(young.count()))
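Beyond filtering, the same DataFrame can be aggregated and projected with the new operators; a short illustrative sketch (the name and gender columns are hypothetical):

# Project a subset of columns
users.select(users.name, users.age)

# Aggregate: count young users per group (the "gender" column is hypothetical)
young.groupBy("gender").count().show()

# Compute an aggregate over the whole DataFrame
users.agg({"age": "max"}).show()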


Spark SQL Graduates from Alpha

Spark SQL graduates from an alpha component in this release, guaranteeing compatibility for its SQL dialect and semantics in future releases. Spark SQL’s data source API now fully interoperates with the new DataFrame component, allowing users to create DataFrames directly from Hive tables, Parquet files, and other sources. Users can also intermix SQL and DataFrame operators on the same data sets. New in 1.3 is the ability to read and write tables over a JDBC connection, with built-in support for Postgres, MySQL, and other RDBMS systems. The API also has write support for producing output tables, whether to JDBC or any other source.

> CREATE TEMPORARY TABLE impressions
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:postgresql:dbserver",
    dbtable "impressions"
  )

> SELECT COUNT(*) FROM impressions
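The same data sources can be used programmatically; below is a hedged Python sketch (file path, table name, and column names are hypothetical) that loads a Parquet file into a DataFrame, registers it as a temporary table, and then intermixes SQL with DataFrame operators:

# Load a DataFrame directly from a Parquet file (path is hypothetical)
logs = sqlContext.load("hdfs:///path/to/logs.parquet", "parquet")

# Register the DataFrame so it can be queried with SQL
logs.registerTempTable("logs")

# Intermix SQL and DataFrame operators on the same data set
top = sqlContext.sql("SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")
top.filter(top.hits > 100).show()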

Built-in Support for Spark Packages

At the end of 2014 we announced an initiative to create a community package repository for Spark. Today Spark Packages hosts 45 community projects catering to Spark developers, including data source integrations, testing utilities, and tutorials. To make these packages easy to use, Spark 1.3 includes support for pulling published packages into the Spark shell or a program with a single flag.

# Launching Spark shell with a package
./bin/spark-shell --packages databricks/spark-avro:0.2


For developers, Spark Packages also provides an SBT plugin that makes publishing packages easy, and it has introduced automatic Spark compatibility checks for new releases.

Lower Level Kafka Support in Spark Streaming

Over the last few releases, Kafka has become a popular input source for Spark Streaming. Spark 1.3 adds a new Kafka streaming source that leverages Kafka’s replay capabilities to provide reliable delivery semantics without the use of a write-ahead log. It also provides primitives that enable exactly-once guarantees for applications with strong consistency requirements. Kafka support gains a Python API in this release, along with new primitives for building further Python APIs in the future. For a full list of Spark Streaming features, see the upstream release notes.
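As a rough illustration, here is a minimal sketch using the Python Kafka API (the lower-level, receiver-less source described above is exposed through the Scala and Java APIs); the ZooKeeper address, consumer group, and topic name are hypothetical, and the spark-streaming-kafka artifact must be on the classpath:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "kafka-example")
ssc = StreamingContext(sc, batchDuration=10)

# Consume a Kafka topic via ZooKeeper (addresses and names are hypothetical)
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "example-group", {"impressions": 1})

# Count the records received in each batch
stream.count().pprint()

ssc.start()
ssc.awaitTermination()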

New Algorithms in MLlib

Spark 1.3 provides a rich set of new algorithms. Latent Dirichlet allocation (LDA) is one of the first topic modeling algorithms to appear in MLlib; we’ll be documenting LDA in more detail in a follow-up post. Spark’s logistic regression has been generalized to multinomial logistic regression for multiclass classification. This release also adds improved clustering through Gaussian mixture models and power iteration clustering, and frequent itemset mining through FP-growth. Finally, an efficient block matrix abstraction is introduced for distributed linear algebra. Several other algorithms and utilities are added and discussed in the full release notes.
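As one example, here is a minimal sketch of Gaussian mixture clustering from Python, assuming the Python MLlib API exposes it in this release (the toy data below is purely illustrative):

from pyspark.mllib.clustering import GaussianMixture

# A toy dataset with two well-separated clusters (illustrative only)
data = sc.parallelize([[0.1, 0.2], [0.2, 0.1], [9.8, 9.9], [10.1, 9.7]])

# Fit a two-component Gaussian mixture model
model = GaussianMixture.train(data, k=2)

# Inspect the learned mixture weights and assign each point to a component
print(model.weights)
print(model.predict(data).collect())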


This post only scratches the surface of the interesting features in Spark 1.3. Overall, this release contains more than 1000 patches from 176 contributors, making it our largest release yet. Head over to the official release notes to learn more about this release, and watch the Databricks blog for more detailed posts about the major features over the next few weeks!