Fastsocket: Speed up your socket


Xiaofeng Lin
jerrylin.lxf@gmail.com

Contributors
SINA: Xiaofeng Lin, Xiaodong Li
Tsinghua: Yu Chen, Junjie Mao, Jiaquan He

Socket API Problem
The socket API is fundamental to most network applications.
• The kernel consumes too many CPU cycles
• Fewer CPU cycles are left for the application
• This calls for a more efficient kernel socket API

Performance on Multicore System
Theoretical overall performance: single-core performance * available CPU cores
How close we can get to that limit: scalability.

Outline
• Scalability Performance
• Single Core Performance
• Production System Feasibility
• Future Work

Fastsocket: Scalability Performance

Testing Environment
• HAProxy: open source TCP/HTTP load balancer
• OS: CentOS-6.2 (Kernel: 2.6.32-220.23.1.el6)
• CPU: Intel Ivy-Bridge E5-2697-v2 (12 core) * 2
• NIC: Intel X520 (supports Flow-Director)
• Scenario: short TCP connections
[Figure: test topology with http_load clients, HAProxy (Ivy Bridge E5-2697 v2), and a backend server]

Kernel Inefficiency
More than 90% of CPU time is consumed by the kernel.

Synchronization Overhead
Almost 80% of CPU time is spent on locking.

Scalability Problem
[Figure: HTTP CPS (connections per second) throughput with different numbers of CPU cores]
Scalability is KEY to multicore system capacity.

Dilemma
How to upgrade single-machine capacity?
• Spend more for more CPU cores
• Capacity does not improve but gets worse
• Just awkward

What to do
• Hardware assistance
  – Limited effect (NIC offload features)
• Data plane mode
  – You need to implement your own TCP/IP stack (DPDK)
  – Lacks kernel generality and full features
  – Other production limits
• Change the Linux kernel
  – Keeps improving, but not enough (Linux upstream)
  – Requires modifying application code (MegaPipe, OSDI)

Fastsocket Scalability
Great! The CPU is doing more for HAProxy.
Scalability is key for multicore performance.
Fastsocket: 5.8X the Base kernel and 2.3X the Latest kernel.
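The theoretical limit stated under "Performance on Multicore System" (single-core performance times available CPU cores) suggests a simple way to put a number on scalability: divide the measured throughput by that limit. The sketch below is only an illustration of that arithmetic. The function name and the sample throughput figures are hypothetical and are not measurements from this talk.

```python
def scaling_efficiency(single_core_cps: float, cores: int, measured_cps: float) -> float:
    """Return the fraction of the theoretical limit that was actually reached.

    The theoretical limit is single-core throughput multiplied by core count.
    """
    if single_core_cps <= 0 or cores <= 0:
        raise ValueError("single_core_cps and cores must be positive")
    theoretical_limit = single_core_cps * cores
    return measured_cps / theoretical_limit


# Hypothetical numbers for illustration only (not taken from the slides).
single_core = 20_000.0  # connections per second measured on one core

# Throughput that grows linearly with core count reaches the limit.
print(scaling_efficiency(single_core, 8, 160_000.0))

# Throughput that barely grows with core count stays far below it.
print(scaling_efficiency(single_core, 8, 40_000.0))
```

A value near 1 means the system scales almost linearly with core count; a value that falls as cores are added corresponds to the "Dilemma" described above.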
Production System Evaluation
Two 8-core HAProxy (HTTP proxy) servers handling the same amount of traffic, with and without Fastsocket.

Kernel Bottleneck
• Non-local processing of connections
• Global TCP Control Block (TCB) management
• Synchronization overhead from VFS

Non-Local Processing of Connections
A given connection is processed in two phases:
• Net-RX SoftIRQ in interrupt context
• Application and system call in process context
The two phases are often handled by different CPU cores, which:
• Introduces lock contention
• Causes CPU cache bouncing

Non-Local Processing of Connections
Scenario in which the server actively connects out:
[Figure: HAProxy on CPU0 and CPU1 above the kernel network stack; RSS distributes packets to NIC Queue-0 and Queue-1]

Global TCB Management
A TCB is represented as a socket in the Linux kernel:
• Listen socket:
  – A single listen socket is used for connection setup
• Established socket:
  – A global hash table is used for connection management

Global TCB Management
[Figure: HAProxy on CPU0 and CPU1 sharing one kernel listen socket, fed by NIC Queue-0 and Queue-1]
Single listen socket: HOTSPOT

VFS Overhead
• Sockets are abstracted under VFS
• Intensive synchronization for inodes and dentries in VFS
• These overheads are inherited by sockets

Methodology
• Resource partitioning
• Local processing

Design Components
• Receive Flow Deliver
• Local Listen Table & Local Established Table
• Fastsocket-aware VFS

Fastsocket Architecture
[Figure: Fastsocket architecture. Each core has a per-core process zone with the application process, Fastsocket-aware VFS, TCP layer (Local Listen Table and Local Established Table), and NET-RX SoftIRQ; Receive Flow Deliver and RSS connect the zones to the NIC for incoming and outgoing packets]

Receive Flow Deliver (RFD)
[Figure: HAProxy on CPU0 and CPU1 above the kernel network stack, NIC Queue-0 and Queue-1 with RSS; RFD marks a Good Path and a Bad Path]
RFD delivers packets to the CPU core where the application will further process them.
RFD can leverage advanced NIC features (Flow-Director).

Local Listen Table
[Figure: HAProxy on CPU0 and CPU1, each with its own Local Listen Table and local listen socket, fed by NIC Queue-0 and Queue-1]
Clone the listen socket for each CPU core in a LOCAL table.

Local Established Table
Established sockets are managed in a LOCAL table.
[Figure: HAProxy on CPU0 and CPU1, each with its own Local Established Table, fed by NIC Queue-0 and Queue-1]

Fastsocket-aware VFS
Provide a FAST path for sockets in VFS:
• Inodes and dentries are useless for sockets
• Bypass the unnecessary lock-intensive routines
• Retain enough to stay compatible

Fastsocket Architecture
[Figure: Fastsocket architecture with per-core process zones, Receive Flow Deliver, and the NIC]

Optimization Effects

Intel Hyper-Threading
Further boosts performance by 20% with Intel HT.

E5-2697-v2       CPS
Fastsocket       406632
Fastsocket-HT    476331 (117.2%)

Fastsocket-HT: 476.3k cps, 3.87M pps, 3.1 Gbps (short connections)

Fastsocket: Single Core Performance

Methodology
• Network specialization
• Cross-layer design

Network Specialization
General services provided inside the Linux kernel:
• Slab
• Epoll
• VFS
• Timer, etc.
Customize these general services for networking.

Fastsocket Skb-Pool
• Per-core skb pool
• Combine skb header and data
• Local Pool and Recycle Pool (Flow-Director)

Fastsocket Fast-Epoll
Motivation:
• Epoll entries are managed in an RB-tree
• The RB-tree is searched on every call to epoll_ctl()
• Memory access is a performance killer
Solution:
• TCP connections are rarely shared across processes
• Cache the epoll entry in the file structure to save the search

Cross-Layer Optimization
• What is cross-layer optimization?
• Overall optimization with cross-layer design
  – Direct-TCP
  – Receive-CPU-Selection

Fastsocket Direct-TCP
• Record input route information in the TCP socket
• Look up the socket directly, before the network stack
• Read the input route information from the socket
• Mark the packet as routed

Fastsocket Receive-CPU-Selection
Similar to Google RFS:
• The application marks the current CPU id in the socket
• Look up the socket directly, before the network stack
• Read the CPU id from the socket and deliver accordingly
Lighter, more accurate, and thus faster than Google RFS.

Redis Benchmark
Testing environment:
• Redis: key-value cache and store
• CPU: Intel E5 2640 v2 (6 core) * 2
• NIC: Intel X520
Configuration:
• Persistent TCP connections
• 8 Redis instances serving on different ports
• Only 8 CPU cores are used

Redis Benchmark
• Flow-Director disabled: 20% throughput increase
• Flow-Director enabled: 45% throughput increase

Fastsocket: Production System Feasibility

Compatibility & Generality
• Fully compatible with the BSD socket API
• Full kernel network feature support
• Requires no change to application code
• Nginx, HAProxy, Lighttpd, Redis, Memcached, etc.
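The three Receive-CPU-Selection steps listed earlier (mark the CPU id in the socket, look the socket up before the network stack, deliver to the recorded CPU) can be pictured with a toy user-space model. This is not Fastsocket kernel code. The class names, the per-CPU queues, and the fallback of staying on the arrival CPU when no mark exists are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A flow is identified here by (src_ip, src_port, dst_ip, dst_port).
FlowKey = Tuple[str, int, str, int]


@dataclass
class ToySocket:
    """Hypothetical stand-in for a kernel TCP socket."""
    cpu_id: Optional[int] = None  # CPU id last marked by the application


class ToyReceivePath:
    """Toy model of steering incoming packets by a CPU id stored in the socket."""

    def __init__(self) -> None:
        self.sockets: Dict[FlowKey, ToySocket] = {}
        self.per_cpu_queue: Dict[int, List[bytes]] = defaultdict(list)

    def app_mark_cpu(self, flow: FlowKey, current_cpu: int) -> None:
        # Step 1: the application side records its current CPU id in the socket.
        self.sockets.setdefault(flow, ToySocket()).cpu_id = current_cpu

    def deliver(self, flow: FlowKey, payload: bytes, arrival_cpu: int) -> int:
        # Step 2: look up the socket before any further stack processing.
        sock = self.sockets.get(flow)
        # Step 3: read the CPU id from the socket and deliver accordingly.
        # Assumption: with no socket or no mark, stay on the arrival CPU.
        if sock is not None and sock.cpu_id is not None:
            target = sock.cpu_id
        else:
            target = arrival_cpu
        self.per_cpu_queue[target].append(payload)
        return target


rx = ToyReceivePath()
flow = ("10.0.0.1", 40000, "10.0.0.2", 80)
rx.app_mark_cpu(flow, current_cpu=3)
# The packet arrives on CPU 0 but is steered to the CPU the application marked.
print(rx.deliver(flow, b"GET /", arrival_cpu=0))
```

In this model the steering decision costs one lookup keyed by the flow, which is the property the slide summarizes as "lighter" than a separate flow table.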
Deployment
• Install RPM (kernel, fastsocket.ko and libfsocket.so)
• Load fastsocket.ko
• Start the application with libfsocket.so preloaded:
  LD_PRELOAD=./libfsocket.so haproxy

Maintainability
• Flexible control
  – Selectively enable Fastsocket for certain applications (nginx)
  – Compatible with regular sockets (ssh)
• Quick rollback
  – Restart the application without libfsocket.so
• Easy updating
  – Most of the code is in fastsocket.ko
  – Updating fastsocket.ko and libfsocket.so is enough

SINA Deployment
• Adopted in the HTTP load balancing service (HAProxy)
• Deployed in half of the major IDCs
• Running stably for 8 months
• Fastsocket will be rolled out to all HAProxy servers by the end of the year

Future Work
• Make it faster
  – Improving the interrupt mechanism
  – System-call batching
  – Zero copy
  – Further customization and cross-layer optimization
• Linux mainstream

Open Source
https://github.com/fastos/fastsocket

Recruitment
Join us!
xiaodong2@staff.sina.com

Fastsocket
Thank you!
Q & A