Big Data Beyond Hadoop*: Future Research Directions


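ACAS002
Jason Dai, Engineering Director and Principal Engineer, Software and Solutions Group
Dr. 芮勱恪, Research Program Director, University Research Collaboration Office

Agenda
• Big data and the Hadoop* ecosystem
• Intel's university collaborations on big data research
• Efficient in-memory implementations of MapReduce
• Efficient graph analytics algorithms
• Intel's efforts to move research into production

The PDF of this session presentation is published on the technical session catalog website: intel.com/go/idfsessionsBJ. The URL is also printed at the top of the session agenda page in the conference guide.

What Is Big Data?
Big data is data that is too big, too fast, or too hard for existing systems and algorithms to handle.
• Too big
  – Terabytes moving to petabytes
  – Needs smart (not brute-force) massively parallel processing
• Too fast
  – Ubiquitous sensors produce massive amounts of new data
  – Hard to ingest
• Too hard
  – Needs sophisticated analytics (for example, finding patterns, trends, and relationships)
  – Needs to integrate many data types (no schema, uncurated, inconsistent syntax and semantics)
Data should be a resource, not a burden. Existing data processing tools are not up to the task.
Samuel Madden, ISTC Director and Professor, EECS, MIT

Example: Web Analytics
Large web enterprises have thousands of servers, countless users, and terabytes of "click data" per day.
The need goes beyond simple reporting. For example: analyzing in real time what a user will do next, which ad to show them, or which user type they belong to.
Existing analytics systems either do not scale to the required size or do not provide the required sophistication.
Samuel Madden, ISTC Director and Professor, EECS, MIT

Example: Sensor Analytics
Smartphone providers, tolling agencies, municipalities, insurance companies, doctors, and businesses collect large-scale streams of video, location, acceleration, and other data from phones and other devices.
This data needs to be stored, processed, and mined, for example to measure traffic volume, assess driving risk, or support medical diagnosis.
Samuel Madden, ISTC Director and Professor, EECS, MIT

Hadoop* in the Big Data Ecosystem
The era of data exchange: traditional business solutions combined with new analytics models create opportunities for real-time value.
[Figure: layered diagram. Labels: new analytics models; cost-effective vertical solutions (healthcare, energy, science, manufacturing, FSI, e-commerce, big data); traditional business solutions (business process innovation; in-memory databases, integrated analytics, systems and appliances); compute platform technology and fabric (MIC, EP, EX, Exalytics)]

Overview of Intel Big Data Initiatives
[Figure: matrix of initiatives. Columns: data transport, compute, and storage platforms; data management and processing; analytics; data use (visualization, end-user tools, applications, services). Rows: Intel Software, Intel Architecture, Intel Labs, Intel IT, other.]
Initiatives named in the figure:
• Distributed machine learning (university collaborators)
• Internet of Things / M2M (Intel Labs and university collaborators)
• HiTune* and other tools for Hadoop
• Business intelligence and Hadoop*
• Compression and decompression IPs
• Microservers
• Hadoop distribution and services
• Trust broker (McAfee*)
• Location-based services (Telmap)
• End-to-end data security
• Federated device architecture
• Video analytics and distributed video analytics
• Distributed architecture (Guavus)
• Healthcare, telecom, ...
• Large object storage
• Enterprise data solutions program for big data and analytics
• Big data market sizing and segmentation (with Bain)
• Hadoop performance and architecture

Algorithms, Machines, People (AMPLab)
• Algorithms: adaptive/active machine learning and analytics
• Machines: cloud computing
• People: crowdsourcing / human computation
• Massive and diverse data
All software is released as open source under the BSD license.

Berkeley Data Analytics System
• Mesos*: resource management platform
• SCADS: scale-independent storage system
• PIQL, Spark: processing frameworks
[Figure: software stack, with a legend that distinguishes AMPLab components from third-party components.]
• Higher-level query languages / processing frameworks: Hadoop*, Hive*, Pig*, MPI, PIQL, Shark, Spark, ...
• Resource management: Mesos
• Storage: HDFS, SCADS

Datacenter Programming: Spark
• In-memory cluster computing framework for applications that reuse working sets of data
  – Iterative algorithms: machine learning, graph processing, optimization
  – Interactive data mining: faster than disk-based tools
• Key idea: RDDs ("resilient distributed datasets") that are automatically rebuilt after a failure
  – Store large working sets in memory
  – Fault tolerance based on "lineage"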
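The lineage idea on the Spark slide can be illustrated with a toy sketch in plain Python. This is not Spark's API: the class `ToyRDD` and its methods (`from_source`, `map`, `filter`, `partition`, `lose_partition`, `collect`) are hypothetical names chosen for illustration, and a real system spreads partitions across machines rather than keeping them in one process. The sketch only shows the principle that each dataset remembers how it was derived from its parent, so a lost in-memory partition can be recomputed rather than restored from a replica.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: cached partitions plus the lineage to rebuild them.

    Hypothetical illustration only; this is not Spark's API.
    """

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]],
                 parent: Optional["ToyRDD"] = None) -> None:
        self.num_partitions = num_partitions
        self._compute = compute                 # how to (re)build partition i
        self.parent = parent                    # lineage pointer; None for a source
        self._cache: Dict[int, List[int]] = {}  # partitions currently held in memory
        self.compute_calls = 0                  # how many partitions were (re)built

    @classmethod
    def from_source(cls, chunks: List[List[int]]) -> "ToyRDD":
        # A source dataset can always be re-read from stable storage (here, a list).
        return cls(len(chunks), lambda i: list(chunks[i]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # The child records only the transformation and its parent, not the data.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)],
                      parent=self)

    def partition(self, i: int) -> List[int]:
        # Serve from memory when possible; otherwise rebuild from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.compute_calls += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_source([[1, 2, 3], [4, 5, 6]])
evens = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens.collect()   # builds both partitions
evens.lose_partition(1)    # simulated failure
after = evens.collect()    # only the lost partition is rebuilt

print(before == after, evens.compute_calls)
```

In this run the intermediate `map` result is still cached, so only one partition is recomputed. Had that partition been lost as well, the same call would walk one step further back along the lineage to the source.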
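
Spark: Motivation
Complex jobs, interactive queries, and online processing all need one thing that Hadoop* MR lacks:
• Efficient data sharing
[Figure: three workloads that pass data between steps: a job with multiple stages (Stage 1, Stage 2, Stage 3), interactive mining (Query 1, Query 2, Query 3), and stream processing (Job 1, Job 2, ...)]

Data Passing and Sharing in Hadoop*
[Figure: each iteration (Iter. 1, Iter. 2, ...) reads its input from HDFS and writes its output back to HDFS; each query (Query 1, Query 2, Query 3) re-reads the input from HDFS to produce its result (Result 1, Result 2, Result 3)]

Spark: In-Memory Data Sharing
[Figure: the input is processed once and held in distributed memory; iterations (Iter. 1, Iter. 2, ...) and queries (Query 1, Query 2, Query 3, ...) then share data through memory]

Introducing Shark
• Spark + Hive* (SQL on NoSQL)
• Uses Spark's in-memory RDD caching and flexible language features for result reuse and low latency
• Scalable, fault tolerant, and fast
• Query-compatible with Hive

Benchmark: Query 1
  SELECT * FROM grep WHERE field LIKE '%XYZ%';
30 GB input table

Benchmark: Query 2
  SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;
5 GB input table

Data-Parallel (MapReduce)
[Figure: CPU 1 through CPU 4 each process independent blocks of numeric data]
Solves a huge number of independent subproblems.

MapReduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
• Data-parallel tasks suited to MapReduce: cross validation, feature extraction, computing sufficient statistics
• Graph-parallel tasks: is there more to machine learning?

Machine Learning Pipeline
• Data: images, docs, videos, rankings
• Extract features: faces, important words, side information
• Graph formation: similar faces, shared words, home videos
• Structured machine learning algorithm: belief propagation, LDA, collaborative filtering
• Value from data: face labels, doc topics, recommended videos

Parallelizing Machine Learning
• Pipeline: data, extract features, graph formation, structured machine learning algorithm, value from data
• Graph ingress: mostly data-parallel
• Graph-structured computation: graph-parallel

Addressing Graph-Parallel ML
• Data-parallel (Map Reduce): cross validation, feature extraction, computing sufficient statistics
• Graph-parallel (Map Reduce? A graph-parallel abstraction instead):
  – Graphical models: Gibbs sampling, belief propagation, variational optimization
  – Semi-supervised learning: label propagation, CoEM
  – Data mining: PageRank, triangle counting
  – Collaborative filtering: tensor factorization

Example: Never-Ending Learner Project (CoEM)
[Figure: speedup versus number of CPUs (both axes 0 to 16) for GraphLab CoEM, plotted against the optimal line; higher is better]
• Hadoop*: 95 cores, 7.5 hours
• GraphLab: 16 cores, 30 minutes (15x faster, with 6x fewer CPUs)
• Distributed GraphLab: 32 EC2 machines, 80 seconds (0.3% of the Hadoop time)

Example: PageRank
40 million web pages, 1.4 billion links
• Hadoop: 5.5 hours
• Twister: 1 hour
• GraphLab: 8 minutes

Intel's Contributions to Hadoop*
• Intel® Distribution for Apache Hadoop*
  – Performance, security, and management
  – Download: http://hadoop.intel.com/
• Intel open source projects for Hadoop
  – HiBench: a comprehensive benchmark suite for Hadoop
    . https://github.com/intel-hadoop/hibench
  – Project Panthera: efficient support for standard SQL features on Hadoop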
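    . https://github.com/intel-hadoop/project-panthera
  – Project Rhino: enhanced data protection for the Apache Hadoop ecosystem
    . https://github.com/intel-hadoop/project-rhino
  – Graph Builder: a scalable graph construction tool built on Hadoop
    . http://graphlab.org/intel-graphbuilder/

In-Memory Real-Time Data Analytics with Spark/Shark
• Use case 1: ad hoc and interactive queries
  – Interactive queries (exploratory ad hoc queries, BI charts and mining)
  – Similar projects: Google* Dremel, Facebook* Peregrine, Cloudera* Impala, Apache* Drill, and others (latency of several seconds)
  – Use Shark/Spark to achieve sub-second latency for interactive queries
• Use case 2: in-memory real-time analytics
  – Iterative data mining and online analytics (for example, loading tables into memory for online analysis, or caching intermediate results for iterative machine learning)
  – Similar project: Google PowerDrill
  – Use Shark/Spark to reliably load data into distributed memory for online analytics
• Use case 3: stream processing
  – Stream analytics and CEP (for example, intrusion detection and real-time statistics)
  – Similar projects: Twitter* Storm, Apache* S4, Facebook* Puma
  – Use Spark to simplify stream processing
    . Better reliability
    . A unified framework for offline, online, and stream analytics
• Use case 4: graph-parallel analytics and machine learning
  – Use cases: graph algorithms and machine learning (for example, social network analysis and recommendation engines)
  – Similar projects: Google* Pregel, CMU GraphLab*
  – Use Bagel (Pregel on Spark) to support graph-parallel analytics and machine learning on Spark

Summary
• MapReduce as deployed in Hadoop* is very useful, but:
  – In-memory implementations show significant advantages
  – Graph algorithms may be a better fit for the problems at hand
• Intel continues to collaborate with university researchers
• Intel is committed to moving research results into production

Call to Action
• Bring Intel research results into your own big data research!
• Join us in exploring next-generation in-memory real-time analytics with Spark/Shark

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death.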
SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
• Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
• The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.
• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number.
• Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
• Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
• Intel, Sponsors of Tomorrow and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright ©2013 Intel Corporation.

Legal Disclaimer
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Risk Factors
The above statements and any others in this document that refer to plans and expectations for the first quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements.
Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. 
The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel’s current chief executive officer plans to retire in May 2013 and the Board of Directors is working to choose a successor. The succession and transition process may have a direct and/or indirect effect on the business and operations of the company. In connection with the appointment of the new CEO, the company will seek to retain our executive management team (some of whom are being considered for the CEO position), and keep employees focused on achieving the company’s strategic goals and objectives. 
Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent Form 10-Q, report on Form 10-K and earnings release. Rev. 1/17/13
