ABSTRACT
This paper describes the use of Storm at Twitter. Storm is a real-time fault-tolerant and distributed stream data processing system. Storm is currently being used to run various critical computations in Twitter at scale, and in real-time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This paper also describes how queries (aka. topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures. Storm is under active development at Twitter and we also present some potential directions for future work.
- Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Rajeev Motwani, Itaru Nishizawa, Utkarsh Srivastava, Dilys Thomas, Rohit Varma, Jennifer Widom: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19--26 (2003)Google Scholar
- Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Eduardo F. Galvez, Jon Salz, Michael Stonebraker, Nesime Tatbul, Richard Tibbetts, Stanley B. Zdonik: Retrospective on Aurora. VLDB J. 13(4): 370--383 (2004) Google ScholarDigital Library
- Minos N. Garofalakis, Johannes Gehrke: Querying and Mining Data Streams: You Only Get One Look. VLDB 2002Google Scholar
- Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, Stanley B. Zdonik: The Design of the Borealis Stream Processing Engine. CIDR 2005: 277--289Google Scholar
- S4 Distributed stream computing platform. http://incubator.apache.org/s4/Google Scholar
- Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033--1044 (2013) Google ScholarDigital Library
- Apache Samza. http://samza.incubator.apache.orgGoogle Scholar
- Spark Streaming. http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.htmlGoogle Scholar
- Mohamed H. Ali, Badrish Chandramouli, Jonathan Goldstein, Roman Schindlauer: The extensibility framework in Microsoft StreamInsight. ICDE 2011: 1242--1253 Google ScholarDigital Library
- Sankar Subramanian, Srikanth Bellamkonda, Hua-Gang Li, Vince Liang, Lei Sheng, Wayne Smith, James Terry, Tsae-Feng Yu, Andrew Witkowski: Continuous Queries in Oracle. VLDB 2007: 1173--1184 Google ScholarDigital Library
- IBM Infosphere Streams. http://www-03.ibm.com/software/products/en/infosphere-streams/Google Scholar
- Namit Jain, Shailendra Mishra, Anand Srinivasan, Johannes Gehrke, Jennifer Widom, Hari Balakrishnan, Ugur Çetintemel, Mitch Cherniack, Richard Tibbetts, Stanley B. Zdonik: Towards a streaming SQL standard. PVLDB 1(2): 1379--1390 (2008) Google ScholarDigital Library
- Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: a distributed messaging system for log processing. SIGMOD Workshop on Networking Meets Databases, 2011.Google Scholar
- Kestrel: A simple, distributed message queue system. http://robey.github.com/kestrelGoogle Scholar
- Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: a platform for fine-grained resource sharing in the data center. In NSDI, 2011. Google ScholarDigital Library
- Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD Conference 1990: 102--111 Google ScholarDigital Library
- Apache Zookeeper. http://zookeeper.apache.org/Google Scholar
- Summingbird. https://github.com/Twitter/summingbirdGoogle Scholar
- Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, Shivakumar Venkataraman: Photon: fault-tolerant and scalable joining of continuous data streams. SIGMOD Conference 2013: 577--588 Google ScholarDigital Library
- Nathan Marz: (Storm) Tutorial. https://github.com/nathanmarz/storm/wiki/TutorialGoogle Scholar
- Storm, Stream Data Processing: http://hortonworks.com/labs/storm/Google Scholar
- Apache Storm: http://hortonworks.com/hadoop/storm/Google Scholar
- Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, Eric Baldeschwieler: Apache Hadoop YARN: yet another resource negotiator. SoCC 2013: 5Google Scholar
- Nathan Marz: Trident API Overview. https://github.com/nathanmarz/storm/wiki/Trident-API-OverviewGoogle Scholar
- ZeroMQ: http://zeromq.org/Google Scholar
Index Terms
- Storm@twitter
Recommendations
R-Storm: Resource-Aware Scheduling in Storm
Middleware '15: Proceedings of the 16th Annual Middleware ConferenceThe era of big data has led to the emergence of new systems for real-time distributed stream processing, e.g., Apache Storm is one of the most popular stream processing systems in industry today. However, Storm, like many other stream processing systems ...
Comments