Apache Spark 1.4 发布，开源集群计算系统-Spark-中存储网

2015-07-06 23:51:16

来源
中存储网

Spark

Apache Spark 1.4 发布，该版本将 R API 引入 Spark，Spark core 有多各方面的改进，主要集中在操作，性能和兼容性上。

Apache Spark 1.4 发布，该版本将 R API 引入 Spark，同时提升了 Spark 的核心引擎和 MLlib ，以及 Spark Streaming 的可用性。部分重要更新如下：

Spark Core

Spark core 有多各方面的改进，主要集中在操作，性能和兼容性上：

SPARK-6942: Visualization for Spark DAGs and operational monitoring
SPARK-4897: Python 3 support
SPARK-3644: A REST API for application information
SPARK-4550: Serialized shuffle outputs for improved performance
SPARK-7081: Initial performance improvements in project Tungsten
SPARK-3074: External spilling for Python groupByKey operations
SPARK-3674: YARN support for Spark EC2 and SPARK-5342: Security for long running YARN applications
SPARK-2691: Docker support in Mesos and SPARK-6338: Cluster mode in Mesos

DataFrame API and Spark SQL

The DataFrame API 在 Spark 1.4 有重要扩展 (see this link for a full list)，主要集中在分析和数学函数。 Spark SQL 引入新的实用工具，并且支持 ORCFile.

SPARK-2883: Support for ORCFile format
SPARK-2213: Sort-merge joins to optimize very large joins
SPARK-5100: Dedicated UI for the SQL JDBC server
SPARK-6829: Mathematical functions in DataFrames
SPARK-8299: Improved error message reporting for DataFrame and SQL
SPARK-1442: Window functions in Spark SQL and DataFrames
SPARK-6231 / SPARK-7059: Improved API support for self joins
SPARK-5947: Partitioning support in Spark’s data source API
SPARK-7320: Rollup and cube functions
SPARK-6117: Summary and descriptive statistics

Spark ML/MLlib

Spark’s ML pipelines API graduates from alpha in this release, with new transformers and improved Python coverage. MLlib 增加了几种新算法。

SPARK-5884: A variety of feature transformers for ML pipelines
SPARK-7381: Python API for ML pipelines
SPARK-5854: Personalized PageRank for GraphX
SPARK-6113: Stabilize DecisionTree and ensembles APIs
SPARK-7262: Binary LogisticRegression with L1/L2 (elastic net)
SPARK-7015: OneVsRest multiclass to binary reduction
SPARK-4588: Add API for feature attributes
SPARK-1406: PMML model evaluation support via MLib
SPARK-5995: Make ML Prediction Developer APIs public
SPARK-3066: Support recommendAll in matrix factorization model
SPARK-4894: Bernoulli naive Bayes
SPARK-5563: LDA with online variational inference to the release note

Spark Streaming

Spark streaming 添加了新的视觉仪表图形，并大大提高了 UI 调试器的信息，同时增强了对 Kafka 和 Kinesis 的支持。

SPARK-7602: Visualization and monitoring in the streaming UI including batch drill down (SPARK-6796, SPARK-6862)
SPARK-7621: Better error reporting for Kafka
SPARK-2808: Support for Kafka 0.8.2.1 and Kafka with Scala 2.11
SPARK-5946: Python API for Kafka direct mode
SPARK-7111: Input rate tracking for Kafka
SPARK-5960: Support for transferring AWS credentials to Kinesis
SPARK-7056 A pluggable interface for write ahead logs

更多内容请查看发行说明。

Spark 1.4下载请点这里： downloads 。

Apache Spark 是一种与 Hadoop 相似的开源集群计算环境，但是两者之间还存在一些不同之处，这些有用的不同之处使 Spark 在某些工作负载方面表现得更加优越，换句话说，Spark 启用了内存分布数据集，除了能够提供交互式查询外，它还可以优化迭代工作负载。

Spark 是在 Scala 语言中实现的，它将 Scala 用作其应用程序框架。与 Hadoop 不同，Spark 和 Scala 能够紧密集成，其中的 Scala 可以像操作本地集合对象一样轻松地操作分布式数据集。

尽管创建 Spark 是为了支持分布式数据集上的迭代作业，但是实际上它是对 Hadoop 的补充，可以在 Hadoo 文件系统中并行运行。通过名为 Mesos 的第三方集群框架可以支持此行为。Spark 由加州大学伯克利分校 AMP 实验室 (Algorithms, Machines, and People Lab) 开发，可用来构建大型的、低延迟的数据分析应用程序。

声明： 此文观点不代表本站立场；转载须要保留原文链接；版权疑问请联系我们。

Apache Spark 1.4 发布，开源集群计算系统

Spark Core

DataFrame API and Spark SQL

Spark ML/MLlib

Spark Streaming

Spark入门教程（Python版）

Tachyon架构分析和现存问题讨论

IBM全力支持 spark或成未来最重要开源项目

TalkingData大规模机器学习的应用

英特尔段建钢：Spark将成为下一代大数据的标准

科技要闻

前亚马逊工程师因黑客攻击交易所并盗窃1200多万美元获刑3年

漂亮手机来了，华为P系列改名Pura

聊聊网络安全的那些事

富捷电子：国内贴片电阻头部制造商

微星推出MPG EZ120 ARGB风扇：磁性连接设计最多可接18个

Apache Spark 1.4 发布，开源集群计算系统

Spark Core

DataFrame API and Spark SQL

Spark ML/MLlib

Spark Streaming

相关推荐

科技要闻