Hadoop Big Data Processing Lecture Notes: VMware Talk


© 2014 VMware Inc. All rights reserved.
CONFIDENTIAL

Big Data in Virtualization and Cloud
Gavin Lu, Sr R&D Manager
June 4, 2014

Agenda
• What’s VMware?
• What’s big data?
• Why virtualized big data?
  – Virtual Hadoop
• What’s Big Data Extensions?
• How BDE works?
• What’s next?

What’s VMware?

What’s Big Data?

Four V’s of Big Data

Big Data Landscape

Generic Big Data System Architecture
[Diagram labels: ETL; Real Time Streams; Unstructured Data (HDFS); Real Time Structured Database; Big SQL; Data Parallel Batch Processing; Real-Time Processing; Analytics]

Why Virtualized Big Data?

vHadoop: Just Another Type of Workload
• More than 50% of resources sit idle while a high-priority job is burning up its own cluster.
• Utilize all resources from the pool on demand.
• Dynamic elastic scaling on a shared resource pool

vSphere HA/FT
[Diagram labels: App/OS virtual machines on VMware ESX hosts; FT; HA]
• Single identical VMs running in lockstep on separate hosts
• Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
• Integrated with VMware HA/DRS
• No complex clustering or specialized hardware required
• Single common mechanism for all applications and operating systems

HA Overview
• Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

What’s BDE?

Project Serengeti
• Deploy a cluster within 10 minutes
• Supports MapReduce/HBase
• Automated cluster-wide operations
• Supports mainstream Hadoop distros

vSphere BDE 1.0 GA on Sept 22, 2013

Showing up
Partnerships
• Distro partners certified
• EMC Hadoop Starter Kit (EMC Isilon + vSphere BDE)
Recognitions
• Project Serengeti won InfoWorld’s 2013 Bossie Award (best open source big data tools category)
• VMware named #10 among the most powerful big data companies (Network World, August 2013)

vCAC/BDE Solution on VSX

How BDE Works?
Provisioning

Data/Compute Separation
[Diagram labels: Current Hadoop: Combined Storage/Compute; Storage; Compute; T1; T2; VM; Slave Node]

Elasticity
[Diagram labels: Experimentation; Production recommendation engine; Dynamic resource pool; Data layer; Compute layer; Compute VM; Job Tracker (Experimentation); Job Tracker (Production); VMware vSphere + Serengeti]

Hadoop Topology Awareness (HVE)
[Diagram: topology trees rooted at "/", with nodes labeled D1-D2, R1-R4, N1-N8, and H1-H12]

Performance Comparison

VM Placement Policy

Disk Placement Policy
[Diagram labels: Host; DN; CN; System disk; Data disks; Separated virtual system disks on specified local storage; Separated virtual system disks on shared storage]

Server Architecture

Runtime Manager
[Diagram labels: State, stats (Slots used, Pending work); Commands (Decommission, Recommission); Stats and VM configuration; Serengeti; Job Tracker; vCenter DB; Manual/Auto Power on/off]

Virtual Hadoop Manager (VHM)
[Diagram labels: Job Tracker; Task Tracker; vCenter Server; Serengeti Configuration; VC state and stats; Hadoop state and stats; VC actions; Hadoop actions; Algorithms; Cluster Configuration]

Performance Tuning on virtual platform

Performance comparison between physical and virtual
[Two bar charts (Apache and CDH4): Teragen, Terasort, and Teravalidate results for host, 1vm/host, 2vm/host, and 4vm/host]

[Deployment diagrams. Common labels: Virtualization Host; Hadoop Virtual Node 1; Hadoop Virtual Node 2; Datanode; Tasktracker; Ext4; VMDK; OS Image VMDK; Local disks; Shared storage SAN/NAS; mapred.local.dir. Captions:]
• Single Node per Host (a)
• Single Node per Host (b)
• Multiple Nodes per Host
• Data/Compute Separated Deployment Mode

Temp Storage Optimization
[Diagram labels: Virtualization Host; Hadoop Virtual Node 1; Hadoop Virtual Node 2; TaskTracker; DataNode; Virtual Switch]
• Need separate storage
  – Different size for different applications
  – Hard to forecast, waste of space

D/C Separated Deployment Based on Shared Storage
[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 (Tasktracker); Hadoop Virtual Node 2 (Datanode); Ext4; VMDK; OS Image VMDK; Local disks; Shared storage SAN/NAS]

D/C Separated Deployment Mode on NFS
[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 (Tasktracker); Hadoop Virtual Node 2 (Datanode); Ext4; dir; VMDK; OS Image VMDK; Local disks; Shared storage SAN/NAS; NFS Client; NFS Server]

Hadoop Clusters Based on Shared Storage
[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 (Tasktracker); Hadoop Virtual Node 2 (Datanode); Ext4; VMDK; OS Image VMDK; Shared storage SAN/NAS]

What’s Next?

Big Data Center?
Big Data Computer
• AMD’s SeaMicro SM15000™ Fabric Compute Systems
• HP Moonshot System

Big Data Storage [3 slides]

Big Data Network [3 slides]

Big Data NIC: RDMA and Sockets

Big Data NIC: RDMA Options on vSphere
• a) DirectPath I/O pass-through
• b) SR-IOV VF DirectPath I/O

Big Data NIC: vRDMA over VMCI Architecture

Big Data Business: the Final Call

How to Tackle Big Data?

Thanks!
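
Supplementary sketch for the "Hadoop Topology Awareness" slide. The slide shows only tree labels, so the following is an illustration under stated assumptions, not the BDE or Hadoop implementation. It assumes that D, R, N, and H stand for data center, rack, node group, and host, and that the node group layer groups VMs that share one physical host. The `PLACEMENT` table and both function names are hypothetical. The distance is the usual tree distance: the number of hops from each node up to their nearest common ancestor.

```python
from typing import Dict, Tuple

# Hypothetical placement: host -> (data center, rack, node group).
# Labels follow the D/R/N/H naming on the slide; the assignments are invented.
PLACEMENT: Dict[str, Tuple[str, str, str]] = {
    "H1": ("D1", "R1", "N1"),
    "H2": ("D1", "R1", "N1"),  # same node group as H1
    "H3": ("D1", "R1", "N2"),  # same rack, different node group
    "H4": ("D1", "R2", "N3"),  # same data center, different rack
    "H7": ("D2", "R3", "N5"),  # different data center
}


def topology_path(host: str) -> str:
    """Return a four-layer path such as /D1/R1/N1/H1 for a known host."""
    data_center, rack, node_group = PLACEMENT[host]
    return f"/{data_center}/{rack}/{node_group}/{host}"


def distance(host_a: str, host_b: str) -> int:
    """Count tree hops between two hosts via their nearest common ancestor."""
    path_a = topology_path(host_a).strip("/").split("/")
    path_b = topology_path(host_b).strip("/").split("/")
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


if __name__ == "__main__":
    for other in ("H1", "H2", "H3", "H4", "H7"):
        print(f"H1 -> {other}: {topology_path(other)} distance {distance('H1', other)}")
```

With the extra layer, two virtual nodes in the same node group are closer (distance 2) than two nodes that merely share a rack (distance 4), which is the kind of information a placement policy could use to avoid putting all replicas of a block on VMs backed by the same physical host.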

PDF contributor: pwgw (contributed 2016-02-04)
