The document describes a company's use of Storm to process internet advertising data and identify fraudulent traffic over time. Initially, Storm helped scale their system from 60 million to over 1.5 billion impressions per month. Later, as data volumes kept growing and customers proved unable to act on real-time data, they switched to batch processing with Hadoop, which offered a better cost and resiliency profile. Now, as fraud detection demands faster responses, they are returning to Storm paired with Mahout for real-time processing and online machine learning to identify emerging threats. Storm solved their initial scaling problem and remains a reliable option whenever real-time analysis is needed.
2. Ashley Brown
Chief Architect
Using Storm since September 2011
Based in the West End
Founded in early 2010
Focused on (fighting) advertising fraud since 2011
3. What I'll cover
This is a case study: how Storm fits into our architecture and business
NO CODE
NO WORKER-QUEUE DIAGRAMS
I assume you've seen them all before
5. Use the right tool for the job
If a piece of software isn't helping the business, chuck it out
Only jump on a bandwagon if it's going in your direction
Douglas de Jager, CEO
6. Don't write it if you can download it
If there's an open source project that does what's needed, use that
Sometimes this means throwing out home-grown code when a new project is released (e.g. Storm)
Ben Hodgson, Software Engineer
8. Find fraudulent website traffic
Collect billions of server- and client-side signals per month
Sift, sort, analyse and condense
Identify automated, fraudulent and suspicious traffic
Joe Tallett, Software Engineer
9. Protect against it
Give our customers the information they need to protect their businesses from bad traffic
This means: give them clean data to make business decisions
Simon Overell, Chief Scientist
10. Expose it
Work with partners to reveal the scale of the problem
Drive people to solutions which can eliminate fraud
11. Cut it off
Eliminate bad traffic sources by cutting off their revenue
Build reactive solutions that stop fraud being profitable
Vegard Johnsen, CCO
13. Storm solved a scaling problem
[Timeline diagram: from pre-history (Summer 2010), through Storm's release, to the present day; impressions and signals per month grow from 60 million through 240 million and 1.5 billion to 6 billion, and on to "billions and billions". The markers 1, 2, 3 are the key moments covered on the following slides.]
● Stage 1: RabbitMQ queues; Python workers; Hypertable as API datastore; Hadoop batch analysis
● Stage 2: RabbitMQ queues; Python workers; VoltDB as API datastore + real-time joins; no batch analysis (this setup was pretty reliable!); custom cluster management + worker scaling
● Stage 3: RabbitMQ queues; Storm topologies with in-memory joins; HBase as API datastore; Cascading for post-failure restores (too much data to do without!)
● Stage 4: logging via CDN; Cascading for data analysis; Hive for aggregations; high-level aggregates in MySQL
14. Key moments
● Enter the advertising anti-fraud market
○ 30x increase in impression volume
○ Bursty traffic
○ Existing queue + worker system not robust enough
○ I can do without the 2am wake-up calls
● Enter Storm. [key moment 1]
16. What's wrong with that?
● Queue/worker scaling system relied on our code; it worked, but:
○ Only one, maybe two pairs of eyes had ever looked at it
○ Code becomes a maintenance liability as soon as it is written
○ Writing infrastructure software is not one of our goals
● Maintaining an in-memory database at the volume we wanted was not cost-effective (mainly due to AWS memory costs)
● Couldn't scale dynamically: a full DB cluster restart was required
17. The Solution
● Migrate our internal event-stream-based workers to Storm
○ Whole community to check and maintain the code
● Move to HBase for the long-term API datastore
○ Keep data for longer - better trend decisions
● VoltDB joins → in-memory joins & HBase (see the sketch below)
○ Small in-memory join window, then flushed
○ Full 15-minute join achieved by reading from HBase
○ Trident solves this now - it wasn't around then
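As a rough illustration of the join described above (the talk deliberately shows no code, so this is a minimal sketch, not the actual implementation): a Storm bolt pairs the two halves of each impression inside a small in-memory window and flushes unmatched halves to HBase, where a later read completes the full 15-minute join. The field names (impression_id, payload), the window size, the writeToHBase stub, and the modern org.apache.storm packages are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SignalJoinBolt extends BaseRichBolt {
    private static final long WINDOW_MS = 60_000;  // assumed in-memory window size
    private OutputCollector collector;
    private Map<String, Tuple> pending;            // impression id -> first-seen half
    private Map<String, Long> seenAt;              // impression id -> arrival time

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.pending = new HashMap<>();
        this.seenAt = new HashMap<>();
    }

    @Override
    public void execute(Tuple tuple) {
        evictExpired();
        String id = tuple.getStringByField("impression_id");
        Tuple other = pending.remove(id);
        if (other != null) {
            // both halves arrived inside the window: join entirely in memory
            seenAt.remove(id);
            collector.emit(Arrays.asList(other, tuple),
                    new Values(id, other.getValueByField("payload"),
                               tuple.getValueByField("payload")));
            collector.ack(other);
            collector.ack(tuple);
        } else {
            // hold this half until its partner arrives or the window expires
            pending.put(id, tuple);
            seenAt.put(id, System.currentTimeMillis());
        }
    }

    private void evictExpired() {
        long cutoff = System.currentTimeMillis() - WINDOW_MS;
        Iterator<Map.Entry<String, Long>> it = seenAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() < cutoff) {
                Tuple stale = pending.remove(e.getKey());
                writeToHBase(e.getKey(), stale); // full 15-minute join finishes via HBase
                collector.ack(stale);
                it.remove();
            }
        }
    }

    private void writeToHBase(String id, Tuple half) {
        // stub: persist the unmatched half so a later lookup can complete the join
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // the two payload halves, in arrival order
        declarer.declare(new Fields("impression_id", "payload_a", "payload_b"));
    }
}
```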
19. How long (Storm migration)?
● 17 Sept 2011: Storm released
● 21 Sept 2011: Test cluster processing
● 29 Sept 2011: Substantial implementation of core workers
● 30 Sept 2011: Python workers running under Storm control
● Total engineers: 1
20. The Results (redacted)
● Classifications available within 15 minutes
● Dashboard provides overview of 'legitimate' vs other traffic
● Better data on which to make business decisions
21. Lessons
● Storm is easy to install & run
● First iteration: use Storm for control and scaling of existing queue+worker systems (see the sketch below)
● Second iteration: use Storm to provide redundancy via acking/replays
● Third iteration: remove intermediate queues to realise performance benefits
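To make the first iteration concrete: Storm's multilang support lets an existing Python worker run as a shell bolt, so Storm supplies supervision, scaling and (in the second iteration) acking while the worker code stays put. A minimal sketch, assuming a hypothetical signal_worker.py already adapted to speak Storm's multilang protocol, with assumed output fields:

```java
import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Wraps a pre-existing Python worker so Storm manages its lifecycle.
public class PythonSignalWorker extends ShellBolt implements IRichBolt {
    public PythonSignalWorker() {
        super("python", "signal_worker.py"); // the existing worker script (hypothetical name)
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("signal_id", "classification")); // assumed fields
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```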
22. A Quick Aside on DRPC
● Our initial API implementation on HBase was slow
● Large number of partial aggregates to consume, all handled by a single process
● Storm's DRPC provided a 10x speedup: machines across the cluster pulled partials from HBase and generated 'mega-partials'; a final reducer step produced the final totals (see the sketch below)
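A sketch of how such a DRPC topology could be wired, following Storm's LinearDRPCTopologyBuilder pattern with a batch bolt as the final reducer. The shard count, field names and the HBase read are assumptions; the actual aggregation logic is stubbed out:

```java
import java.util.Map;

import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseBatchBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class TotalsTopology {
    static final int SHARDS = 64; // assumed partitioning of the stored partials

    // Fan out: one tuple per HBase shard, carrying the DRPC request id.
    public static class ShardSplitter extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            Object id = tuple.getValue(0);    // DRPC request id
            String args = tuple.getString(1); // e.g. the time range to total
            for (int shard = 0; shard < SHARDS; shard++) {
                collector.emit(new Values(id, args, shard));
            }
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("id", "args", "shard"));
        }
    }

    // Runs across the cluster: each task sums one shard's partials into a 'mega-partial'.
    public static class MegaPartialBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            long megaPartial = sumShardFromHBase(tuple.getStringByField("args"),
                                                 tuple.getIntegerByField("shard"));
            collector.emit(new Values(tuple.getValue(0), megaPartial));
        }
        private long sumShardFromHBase(String args, int shard) {
            return 0L; // stub: scan this shard's partial aggregates and sum them
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("id", "mega_partial"));
        }
    }

    // Reduce step: one batch per request id, emitting the final total.
    public static class TotalReducer extends BaseBatchBolt<Object> {
        private BatchOutputCollector collector;
        private Object id;
        private long total = 0;
        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                            BatchOutputCollector collector, Object id) {
            this.collector = collector;
            this.id = id;
        }
        @Override
        public void execute(Tuple tuple) {
            total += tuple.getLongByField("mega_partial");
        }
        @Override
        public void finishBatch() {
            collector.emit(new Values(id, total)); // DRPC result: [id, result]
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("id", "result"));
        }
    }

    public static LinearDRPCTopologyBuilder build() {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("totals");
        builder.addBolt(new ShardSplitter(), 2);
        builder.addBolt(new MegaPartialBolt(), 16).shuffleGrouping();
        builder.addBolt(new TotalReducer(), 8).fieldsGrouping(new Fields("id"));
        return builder;
    }
}
```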
23. Storm solved a scaling problem
[Same architecture timeline as slide 13, now at key moment 2.]
24. Key moments
● Enabled across substantial internet ad inventory
○ 10x increase in impression volume
○ Low-latency, always-up requirements
○ Competitive marketplace
● Exit Storm. [key moment 2]
25. What happened?
● Stopped upgrading at 0.6.2
○ Big customers were unable to use real-time data at the time
○ An unnecessary cost
○ Batch options provided a better resiliency and cost profile
● Too expensive to provide very low-latency data collection in a compatible way
● Legacy systems continue to run...
28. The Results
● Identification of a botnet cluster attracting international press
● Many other sources of fraud under active investigation
● Using Amazon EC2 spot instances for batch analysis when cheapest - not paying for always-up
30. Lessons
● The benefit of real-time processing is a business decision - batch may be more cost-effective
● Storm is easy and reliable to use, but you need supporting infrastructure around it (e.g. queue servers)
● It may be the supporting infrastructure that gives you problems...
31. Storm solved a scaling problem
[Same architecture timeline as slide 13, now at key moment 3.]
32. Key moments
● Arms race begins
○ Fraudsters in control of large botnets able to respond quickly
○ Source and signatures of fraud will change faster and faster in the future, as we close off more avenues
○ Growing demand for more immediate classifications than batch-only processing provides
● Welcome back Storm. [key moment 3]
33. What now?
● Returning to Storm, paired with Mahout
● Crunching billions and billions of impressions using Cascading + Mahout
● Real-time response using Trident + Mahout (see the sketch below)
○ Known-bad signatures identify new botnet IPs and suspect publishers
○ Online learning adapts models to emerging threats
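A minimal sketch of what a Trident + Mahout scoring step could look like: a Trident function holds a Mahout OnlineLogisticRegression model, scores each impression, and folds known-bad labels back in as online learning. The feature hashing, the field names (ip, signature, known_bad) and the two-category model are assumptions, not the deck's actual implementation:

```java
import java.util.Map;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.TridentOperationContext;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

// Hypothetical Trident function: scores impressions with a Mahout online
// logistic regression model and keeps learning from known-bad labels.
public class BotScorer extends BaseFunction {
    private static final int NUM_FEATURES = 10_000; // assumed hashed feature space
    private transient OnlineLogisticRegression model;

    @Override
    public void prepare(Map<String, Object> conf, TridentOperationContext context) {
        // two categories: legitimate (0) vs fraudulent (1), L1-regularised
        model = new OnlineLogisticRegression(2, NUM_FEATURES, new L1());
    }

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        Vector features = featurize(tuple.getStringByField("ip"),
                                    tuple.getStringByField("signature"));
        // emit the probability that this impression is fraudulent
        collector.emit(new Values(model.classifyScalar(features)));
        // online learning: known-bad signatures update the model immediately
        // (a real model would also need to see known-good examples)
        if (tuple.getBooleanByField("known_bad")) {
            model.train(1, features);
        }
    }

    // Crude hashed features from the raw signals (illustrative only).
    private Vector featurize(String ip, String signature) {
        Vector v = new RandomAccessSparseVector(NUM_FEATURES);
        v.setQuick(Math.floorMod(ip.hashCode(), NUM_FEATURES), 1.0);
        v.setQuick(Math.floorMod(signature.hashCode(), NUM_FEATURES), 1.0);
        return v;
    }
}
```

Wired into a topology with something like newStream(...).each(new Fields("ip", "signature", "known_bad"), new BotScorer(), new Fields("p_fraud")).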
34. Lessons
● As your business changes, your architecture must change
● Choose Storm if:
○ you have existing ad-hoc event-streaming systems that could use more resiliency
○ your business needs a new real-time analysis component that fits an event-streaming model
○ you're happy to run appropriate infrastructure around it
● Don't choose Storm if:
○ you have no use for real-time data
○ you only want to use it because it's cool
35. More Lessons
● Using Cascading for Hadoop jobs and Storm for real-time is REALLY handy
○ Retains the event-streaming paradigm
○ No need to completely rethink the implementation when switching between them
○ In some circumstances the two can share code
○ We have a library which provides common analysis components for both implementations (see the sketch after this slide)
● A reasonably managed Storm cluster will stay up for ages.
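One way such a shared library could be structured (class, field and score logic here are all hypothetical, not the actual library): keep the analysis logic in a plain serializable class with no framework types, then wrap it once for Storm and once for Cascading.

```java
import java.io.Serializable;

// Framework-agnostic analysis component: pure logic, shared by both wrappers.
public class SuspicionScorer implements Serializable {
    public double score(String userAgent, long interEventMs) {
        double s = 0.0;
        if (userAgent == null || userAgent.isEmpty()) s += 0.5; // missing UA is suspicious
        if (interEventMs < 10) s += 0.5;                        // inhumanly fast events
        return s;
    }
}

// Storm wrapper: one tuple in, one scored tuple out.
class ScorerBolt extends org.apache.storm.topology.base.BaseBasicBolt {
    private final SuspicionScorer scorer = new SuspicionScorer();

    @Override
    public void execute(org.apache.storm.tuple.Tuple t,
                        org.apache.storm.topology.BasicOutputCollector collector) {
        collector.emit(new org.apache.storm.tuple.Values(
                scorer.score(t.getStringByField("ua"), t.getLongByField("gap_ms"))));
    }

    @Override
    public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer d) {
        d.declare(new org.apache.storm.tuple.Fields("score"));
    }
}

// Cascading wrapper: the same logic inside a batch flow.
class ScorerFunction extends cascading.operation.BaseOperation<Void>
        implements cascading.operation.Function<Void> {
    private final SuspicionScorer scorer = new SuspicionScorer();

    public ScorerFunction() {
        super(2, new cascading.tuple.Fields("score"));
    }

    @Override
    public void operate(cascading.flow.FlowProcess flowProcess,
                        cascading.operation.FunctionCall<Void> call) {
        cascading.tuple.TupleEntry args = call.getArguments();
        call.getOutputCollector().add(new cascading.tuple.Tuple(
                scorer.score(args.getString("ua"), args.getLong("gap_ms"))));
    }
}
```

Because the scorer itself never touches Storm or Cascading types, the same tested logic runs in both the real-time and the batch pipeline, which is the code-sharing point the slide makes.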