Hacker News new | past | comments | ask | show | jobs | submit login
Twitter's Storm (complex event processing system) is now open source (github.com/nathanmarz)
461 points by harrigan on Sept 19, 2011 | hide | past | favorite | 56 comments



Hey all, I'm the author of Storm. Just wanted to point you to a few resources:

I've written a lot of documentation on the wiki, which you can find here: https://github.com/nathanmarz/storm/wiki

There's a few companion projects to Storm. These are:

One-click deploy for Storm on EC2: https://github.com/nathanmarz/storm-deploy

Adapter to use Kestrel as a Spout within Storm: https://github.com/nathanmarz/storm-kestrel

Starter project with example topologies that you can run in local mode: https://github.com/nathanmarz/storm-starter

Feel free to ask me questions here or on Storm's mailing list ( http://groups.google.com/group/storm-user ), and I'll answer as best I can!


Thank you for releasing such thorough documentation with your project; it is a great indicator of professionalism and passion. ^.^


Thanks! Good documentation is a critical part of having a usable project, and that will always be a focus of Storm. If you have any feedback for how the documentation can be better, please let me know.


Just to point out, Nathan will be speaking at the Clojure/Conj conference this year. Talking about Cascalog.

I have a schedule conflict this year :( but would recommend the conference to everyone - the first year was informative with contagious enthusiasm. The other speakers also look cool.


Why did you guys go with clojure, instead of something maybe like scala?


I've had a ton of success using Clojure with projects like Cascalog and ElephantDB. It's a phenomenal language.


And one more resource... here are the slides from my launch presentation of Storm:

http://www.slideshare.net/nathanmarz/storm-distributed-and-f...


Nathan, Congratulations on the awesome work, Storm looks really great.

A minor nit, I was looking at the code, and all the hanging parens make it look really sad :-(


Bah, why the downvote? I was pointing out a major Lisp convention.


How does Storm compare to Kafka?


Hi nathan, how does this compare to S4?


The biggest difference is that Storm guarantees that data will be processed, whereas S4 can lose data. I also think Storm is significantly easier to use.


Woah. Launch a Storm cluster on AWS with one line:

https://github.com/nathanmarz/storm-deploy/wiki

All of these storm projects with their project.clj files betray the Clojure roots (using Leiningen as the build tool, which is amazingly great). Here's to hoping for more Clojure examples/docs.


The Storm deploy is awesome. It's built using an awesome tool called Pallet: https://github.com/pallet/pallet

I'm going to try to get more documentation on using Storm from Clojure in the next few weeks. I write all my topologies in Clojure using a small Clojure DSL that ships with Storm:

https://github.com/nathanmarz/storm/blob/master/src/clj/back...


It is a little buried in the github wiki, but this 'Rationale' page is a good overview of the project - https://github.com/nathanmarz/storm/wiki/Rationale


thanks. given that, can anyone answer the following?

- if there's no intermediate queuing, how are "overloads" handled? what if the system can't handle a transient load? is there some kind of flow control signalling back up the path to control the rate (i assume not!)?

- what happens if a bolt accepts data then crashes before processing. is that data resent? if so, how does the system handle bolts that are not idempotent? (the problem is that if you assume that data is not handled until the handling process terminates then you may send duplicates; if you assume that data are handled if they are received then you lose data on crashes).

- how can "fields grouping" scale? surely if you add more bolts they won't receive any messages unless a new field appears.

- is there some kind of automatic restarting? how are failures handled? [ok, this is handled by supervisor nodes and is described in the link above. although it's difficult to understand how they can be stateless. what happens to the managed processes if a supervisor dies?]

thanks again! [edit: thanks for the great answers; should have guessed the consistent hashing one ;o]


You generally feed a Storm topology with a queueing system at the source. For example, we use Kestrel to feed Storm topologies. There's a setting called "TOPOLOGY_MAX_SPOUT_PENDING" which controls how many tuples can be pending on any given spout task at a time. This is more than sufficient for controlling bursts of data. If things back up, they will queue on the queueing server (s). In the future (not for a few months, at least), Storm will have auto-scaling where it will automatically scale the topology to the data.

If a tuple fails to be processed, the tuple(s) that triggered that tuple are replayed from the spout. See https://github.com/nathanmarz/storm/wiki/Guaranteeing-messag... for more info on that. In that sense, Storm is an at-least-once delivery system (but messages are sent more than once only in failure scenarios). It is up to you to architect your systems to handle this. The approach I take is to build systems using a hybrid of batch processing (Hadoop) and realtime processing (Storm). With batch processing, you can run idempotent functions even with duplication, which lets you correct what's happening at the realtime layer. In that sense, Hadoop and Storm are extremely complementary. Here are some slides from a presentation I gave about this technique: http://www.slideshare.net/nathanmarz/the-secrets-of-building...

Fields grouping uses consistent hashing underneath. So if you redeploy with more parallelism, it scales naturally and easily.

If a supervisor dies, it starts back up like nothing happened. Most notably, nothing happens to the worker processes. All state is kept either in disk or in Zookeeper. All daemons in Storm are fail-fast, and the Supervisor uses kill -9's to kill workers, so that makes things extremely robust.

I hope that answered your questions!


The supervisor/worker thing sounds quite a bit like OTP.


Hey Nathan, you should incorporate this page in the Readme markdown so people can get an overview on the front page


If you don't want to read through the wiki, here's what I've gathered (though I may have misunderstood what they're doing).

This looks like a workflow management system, where you define a dependency graph and their system automatically puts messages in queues, pops them, and executes a step. It seems like it solves the boilerplate part of distributed computing - managing message queues and fault tolerance. Please correct me if I got this wrong or missed something.


Sounds a bit like http://en.wikipedia.org/wiki/Flow-based_programming

I'll have to read through their stuff to see if they have some interesting ideas for my FBP implementation :-)


It does, doesn't it? I still haven't made it through the FBP book, and I'm having a hard time with the author's savior-of-the-world attitude.


Hmm... So you're saying it's a light weight version of Apache Service Mix/Camel? The same way Kestrel is a lightweight implementation of JMS.


"distributed computing boilerplate" ha, I like it


Hi Nathan! this is awesome. I'm really excited to dive deeper.

some questions:

- I'm trying to understand the relationship between ZeroMQ and Kestrel in your architecture. is ZeroMQ used for message passing? and Kestrel used as a stream source/sink - aka a sprout? in other words, my assumptions are: zookeeper helps manage node discovery and coordination while message passing between nimble managed bolt processes' are through zeromq. kestrel queues are used for external integration (data stream sources). Is this correct or am I missing something?

- do you have any tutorials on using cascalog with Storm? are they compatible or have you developed a different clojure programming model/DSL for working with Storm?

thanks and again - nice work!


You have it correct. ZeroMQ is used for message passing between components, and Zookeeper is used for node discovery and coordination.

Realtime processing is fundamentally different than batch processing, so you can't maintain the same semantics of Cascalog on top of Storm. Storm has a small DSL for writing topologies in pure Clojure, but it's not a higher level abstraction like Cascalog is. I've started thinking about what a great higher level abstraction would look like, but what that should look like is still an open question.



I just knew that someone would use the "a storm is coming" quote. Let's see if this piece of software will shake the universe…


nathanmarz does the work at BackType, Twitter gets the credit


keep in mind that through the acquisition, nathanmarz got something out of it as well :)


BackType has been acquired by Twitter.


we know that, but all the work was done pre-twitter aquisition.


He gets the credit he deserves as it's under his Github. Who cares about the company name?


Saying it is used by twitter makes it a lot easier to sell to management.


I'll be taking a look at this today. I'm most excited about it's fault tolerance features. If this sufficiently abstracts out the details of providing robust fault tolerance, it could be a great tool to use with cloud computing.


What are some potential applications of this?


Any big data application where the majority of the working set changes continuously and requires more processing than just "bucketing".

The example Twitter gives here is for trending topics, but Storm is basically to message queues what Rails is to CRUD web apps, and so you can draw use cases from eg everything Tibco and JMS are used for today.

Storm is nothing you couldn't have done a year ago, or ten years ago, with a message-oriented architecture. But it has a very attractive feature, which is that it bakes in all the fiddley details and problem solving Twitter did while scaling it to their architecture. Systems like this tend to be easy to prototype and a nightmare to mature and manage, so that's not a small feature.


By their very nature, decentralized processing systems have exponential complexity, so this is very welcome to see released. But oh how I hate having to go back into Java :)


Storm can be used with any programming language. I still need to finish the documentation on this, but I have notes here: https://github.com/nathanmarz/storm/wiki/Using-non-JVM-langu...

The gist is: Storm topologies are just Thrift structures, and Storm can execute processing components in any language by communicating with subprocesses over a simple protocol based on JSON messages over stdin/stdout. Storm has adapters implementing this protocol for Ruby and Python, and it's easy to make a new adapter for any other language. The adapter libraries are ~100 lines of code and have no dependencies other than JSON.


Would it be difficult to use protostuff or protobuf instead? Or maybe Thrift is better for this purpose in some way I don't anticipate?


Ahh, that is super sweet then. Missed the Thrift.


Nathan Marz is a Clojurian, large chunks of Storm are written in Clojure. I think the examples are in Java just to not scare off potential users.


Why would you have to go back to Java? It looks like the only requirement here is a JVM.

At least in Rubyland, JVM dependencies have tended to be a win.


Is it common practice for a corporation to release code via one user's Github account? I would expect that if Twitter were open-sourcing something, it would show up as something like https://github.com/twitter/storm (that link is 404, to save you the trouble).


This is really more nathanmarz's Storm more than Twitter's Storm. He wrote it at BackType with the intention of open sourcing it, and when Twitter bought BackType they apparently didn't ask him not to. So, it makes sense that it's being released through his Github account. If Twitter uses it significantly internally, they'll clone it to their organization account, like they did for most of the projects that are on Twitter's Github profile.


This is fantastic! My mind is spinning with the industries that you could benefit from this, but didn't have the time/resources/focus to roll this sort of (very difficult to scale) system on their own.


Really glad that you guys now release and than announce ! the best way to avoid the let down of ending not opening something announced earlier (whatever the reason). Keep the trend going !


Just out of curiosity, to what degree was Clojure chosen because of the ability to use Java libraries vs the language design + community?


Clojure is an amazing language, and part of that is the access to a huge breadth of Java libraries. Storm makes use of Java libraries/projects like JZMQ, Zookeeper, and Thrift.


This is awesome, really excited to see this get released!


Yeeeeessssss! Thank you Twitter!!!


Someone want to implement something like that without 100 jars of dependencies and 8Gb of memory required to just run, in old-fashioned C/Lisp way (or more modern nginx-way)? Update: on top of Plan9?! ^_^

Or, at least, in more suitable Erlang? ^_^

Isn't it an obvious startup-idea?


Storm does not have 100's of jars of dependencies, nor does it require much memory to run. In fact, it's quite straightforward to spin up a Storm cluster (see https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-... ) and Storm has a local mode where you can develop/test topologies on your local mode completely in-process.


As far as I know, even clojure-contrib requires lots of jars, and I guess Scala is also very costly, but let it go.

I appreciate your innovative idea and amount of work you have done, so this small efficiency issue does not really matter.

btw, who cares about resources when hardware is so cheap and purchased in ocean containers? ^_^


No.


Don't bother replying to this guy. Read his comment history. He's either a professional troll or mentally ill.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: