Sun Jianliang: NetEase's New-Generation Object Storage Engine


NetEase's New-Generation Object Storage Engine
Sun Jianliang, SACC2017

About me
• Sun Jianliang
• NetEase
• Image processing system
• Small-file caching system
• WAN upload acceleration system
• New-generation object storage engine
blog: work-jlsun.github.io

Object Storage vs HDFS

HDFS
• Summary
✓ Unstructured data in arbitrary formats
✓ Blocks, usually 64 MB
✓ Blocks are replicated
✓ Write once (append allowed)
✓ (Often) collocated with compute capacity

Object Store
• What is an object store?
✓ Key
✓ Value
✓ Attribute
✓ Bucket
✓ RESTful HTTP (https://bucket.nos.netease.com/doc.txt) and SDK

Object Store
• Good things about object stores
✓ (Effectively) infinitely scalable: EB and beyond
✓ Various security models: data is safe
✓ Low-cost, long-term storage solution

Object Store
• Object storage is not a file system
✓ Write once: no append in place
✓ Usually eventually consistent
✓ No real directories

Outline
• Basic architecture
• Background
• NEFS

Object storage basic architecture
✓ PUT
✓ GET
✓ DELETE
[Diagram labels: bucket.nos.netease.com, DNS, Nginx (load balance), HTTP RESTful Service Proxy Cluster, DBI, DDB Cluster, FSI, DFS Cluster; K: metadata, V: data; stateful, stateful, stateless]

The evolution of NetEase's cloud storage service

Background: DFS
• Distributed framework
✓ Data writes
✓ Replica organization
[Diagram labels: SN, MDS, ZooKeeper; remaining labels not recoverable]

"Everything should be made as simple as possible, but not simpler." - Albert Einstein

Background
• Strengths
✓ Simple, simple, simple
✓ Replication groups, consistency, engine
• Weaknesses
• It is simpler
✓ Performance
✓ Reliability
✓ Cost

Design Goals
✓ Capacity: 100 PB+
✓ Workload: handles both large and small files
✓ Durability: 8 nines, 11 nines
✓ Availability: rack awareness, highly available components, fewer dependencies
✓ Easy to scale: flexible, no performance impact, supports rebalance
✓ Multi-tenant: supports multiple tenants
✓ Simple: keep it simple

NEFS

Overview
• Netease File System (NEFS)
✓ Key-value blob storage
✓ Key: FID (16 bytes, 8+8)
✓ Value: blob (an arbitrary-sized byte chunk)
• Interface
✓ PutFile :: User, Blob -> FID
✓ GetFile :: FID -> Blob
✓ DeleteFile :: FID -> bool
✓ GetFileInfo :: FID -> FileStatus

Topology
✓ User
✓ Pool
✓ Zone
✓ Server
✓ Disk
[Diagram: a user's pool (Pool 1) contains zones, each zone contains servers, each server contains disks]

Architecture
✓ PS: Partition Server
✓ MDS, MySQL
✓ ZooKeeper
✓ FSI
[Diagram labels: FSI, ZooKeeper, MDS, MySQL, PS nodes each hosting several PS instances; control flow and data flow drawn separately]

PartitionServer
[Diagram labels: PS A (metadata; Partition X with log files 00001-00.log, 01000-00.log, 03000-00.log, 04000-00.log; entries f1 data, f2 data, f3 data), PS B (Partition X replica), FSI, ZooKeeper, MDS, MySQL]

MDS
• Data location: topology, (PartitionID -> "ps1-ps2-ps3")
• Data distribution, placement, balancing

VS
• Decentralized vs metadata
✓ Consistent hashing & CRUSH
✓ Little metadata
✓ Not flexible enough: scaling out, data migration

Choose
• Reality
✓ There is little metadata to begin with: for 100 PB, only tens of M
✓ Scale out on demand; forced rebalancing is not wanted

NEFS
• Storage unit (Bitcask storage model)
[Diagram labels: Partition = active data file, older data files each with a hint file, index; block file = BlockFileHeader followed by LogEntry records]

Data replication
• Consistency algorithms
✓ Paxos 1990
✓ PacificA 2008
✓ Raft 2013
[Diagram: Replicated State Machine Architecture]

Basic PacificA
[Diagram labels: client, Primary, Backup, Backup; Write, prepare list, Ack, Commit (piggybacked)]

Membership Change
[Diagram labels: Server m, Server p, Server q, each hosting leader or follower replicas of Partition1 and Partition2; MDS operations: Change Leader, Remove Replica, Add Replica]

PacificA vs Raft
            Basic              Membership                    Performance  Availability  Durability
PacificA    Write-all          Relies on an external service Low          Low           High
Raft        Majority (quorum)  Self-managed                  High         High          Low

Choose
• Reality
• Write any replica (partition) group
• MDS in system
• Durability is more important than performance
• Easy implementation

NEFS
• Performance
• Durability
• Cost

Performance
✓ Just one IO per write
✓ Big files split into 1 MB slices
✓ IO optimization
✓ Limit concurrent IO
✓ Group commit
✓ Deletes do not force a flush
[Diagram labels: Partition with active data file, older data files, hint files, index; Append Write, Lazy Update Index, Lazy Write Hint]
[Figure: quick disk performance test]

Durability
• AWS product line SLA
  S3 Standard        11 nines
  S3 Standard - IA   11 nines
  Glacier            11 nines
• Out of 10 billion files, only 1 file might be lost in a year

Durability: influencing factors
✓ AFR: annual disk failure rate
✓ RepNum: storage replication factor
✓ T: recovery time for a failed disk
✓ S: number of copysets in the system
✓ N: number of disks in the system

"In a 3-replica storage system with 999 disks, what is the probability of data loss when three disks fail at the same time?"

Design 1: group the 999 disks into 333 fixed disk groups (disk1, disk1-copy2, disk1-copy3; disk2, disk2-copy2, disk2-copy3; ...; disk333, disk333-copy2, disk333-copy3).
333 / C(999,3) ≈ 2.01e-06

Design 2: scatter the data randomly across all 999 disks.
C(999,3) / C(999,3) = 1

How to measure
• How to quantify: discretize time into intervals of length T
[Formula garbled in extraction; recoverable terms: C(n,k), 360*24/T, 1 - ...]

NEFS
• Durability design considerations
✓ Replica count
✓ Recovery time
✓ Number of copysets

NEFS
• Recovery time
✓ Layout
✓ Replication unit placement & size
✓ Network IO rate limiting
[Diagram labels: Pool 1 with zone1, zone2, zone3; Rack1, Rack2, Rack3; annotation 40*2=80; remaining labels not recoverable]

Cost
• Improve the replication factor
• Replication
• EC
[Diagram: files stored as replica-1, replica-2, replica-3 on storage nodes, compared with files stored as data blocks and parity blocks on storage nodes]
[Diagram: a Partition's active data file and older data files mapped to EC blocks (data blocks and parity blocks); caption garbled, mentions EC and Partition]

Summary
• A scalable, highly available, log-based distributed key-value blob storage system
✓ Key-value: Put, Get, Delete
✓ Storage engine: log-based (Bitcask) storage engine
✓ Strongly consistent (PacificA)
✓ Durability: 3-copy replication & erasure coding
✓ It is simple

Future Work
• Load balancing
• EC enhancements
• Performance
• Metrics, ops
• Remove ZooKeeper & MySQL
• ……

Hiring
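The copyset question on the Durability slides (999 disks, 3 replicas, three disks failing at once) comes down to counting how many 3-disk combinations hold all copies of some data. The sketch below evaluates the two expressions from the slides, 333 / C(999,3) and C(999,3) / C(999,3). The function name `loss_probability` and the assumption that the three failed disks are a uniformly random subset are illustrative and not taken from the talk.

```python
from math import comb


def loss_probability(num_disks: int, replicas: int, num_copysets: int) -> float:
    """Chance that `replicas` simultaneous disk failures hit a whole copyset.

    Assumption: the failed disks form a uniformly random subset, and data is
    lost exactly when that subset equals one of the copysets.
    """
    possible_failure_sets = comb(num_disks, replicas)
    return min(1.0, num_copysets / possible_failure_sets)


total_sets = comb(999, 3)

# Design 1: 999 disks partitioned into 333 fixed groups of three.
design_one = loss_probability(999, 3, 333)

# Design 2: replicas scattered at random, so with enough data every
# 3-disk combination ends up being a copyset.
design_two = loss_probability(999, 3, total_sets)

print(f"C(999,3) = {total_sets}")
print(f"design 1 = {design_one:.3e}")
print(f"design 2 = {design_two:.1f}")
```

Fewer copysets lower the chance that a given triple failure loses data, at the price of losing more data when it does; this is why the number of copysets S sits next to recovery time T among the influencing factors.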
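The storage-unit slides describe each partition as append-only data files plus an index, with one IO per write. A minimal single-file sketch of that idea follows. It is not NEFS code: the record layout (16-byte FID, 4-byte length, value), the tombstone marker, the class name `LogPartition`, and the helper `make_fid` are all assumptions for illustration, and the slides do not say what the two 8-byte halves of a FID contain.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">16sI")   # assumed layout: 16-byte FID + 4-byte value length
TOMBSTONE = 0xFFFFFFFF            # assumed length marker meaning "deleted"


def make_fid(high: int, low: int) -> bytes:
    """Pack two 8-byte integers into a 16-byte FID (field meaning is hypothetical)."""
    return struct.pack(">QQ", high, low)


class LogPartition:
    """One append-only data file plus an in-memory index."""

    def __init__(self, path: str) -> None:
        self.log = open(path, "a+b")   # append mode: every write lands at the end
        self.index = {}                # FID -> (value offset, value length)
        self.rebuild_index()

    def put(self, fid: bytes, blob: bytes) -> None:
        offset = self.log.seek(0, os.SEEK_END)
        self.log.write(HEADER.pack(fid, len(blob)) + blob)   # one append per write
        self.log.flush()               # hands the bytes to the OS; this is not an fsync
        self.index[fid] = (offset + HEADER.size, len(blob))

    def get(self, fid: bytes) -> bytes:
        offset, length = self.index[fid]
        self.log.seek(offset)
        return self.log.read(length)

    def delete(self, fid: bytes) -> bool:
        if fid not in self.index:
            return False
        self.log.seek(0, os.SEEK_END)
        self.log.write(HEADER.pack(fid, TOMBSTONE))   # tombstone record, no explicit flush
        del self.index[fid]
        return True

    def rebuild_index(self) -> None:
        """Replay the whole log, as recovery would have to do without a hint file."""
        self.index = {}
        pos = self.log.seek(0)
        while True:
            header = self.log.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            fid, length = HEADER.unpack(header)
            pos += HEADER.size
            if length == TOMBSTONE:
                self.index.pop(fid, None)
                continue
            self.index[fid] = (pos, length)
            pos = self.log.seek(length, os.SEEK_CUR)   # skip over the value bytes

    def close(self) -> None:
        self.log.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "00001-00.log")
    part = LogPartition(path)
    fid_a, fid_b = make_fid(1, 1), make_fid(1, 2)
    part.put(fid_a, b"hello")
    part.put(fid_b, b"object storage")
    first_read = part.get(fid_a)
    deleted = part.delete(fid_a)
    part.close()

    reopened = LogPartition(path)      # index rebuilt from the log alone
    recovered = {fid: reopened.get(fid) for fid in reopened.index}
    reopened.close()
```

Replaying the full log on startup is the slow path. In the Bitcask model a hint file stores the index entries for an older data file so that recovery can skip reading the values, which fits the "Lazy Write Hint" label on the Performance slide.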