Xinglang Wang edited this page Aug 6, 2015 · 5 revisions

In this tutorial you will learn how to deploy and customize the Pulsar pipeline.

Pipeline components

The Pulsar pipeline is built on top of the Jetstream framework. It includes the following components:

  1. Collector - Ingests events through a REST endpoint and applies geo-location classification and device-type detection
  2. Sessionizer - Sessionizes events, maintains session state, and generates marker events
  3. Distributor - Filters and mutates events for different consumers, acting as an event router
  4. [Metrics Calculator](Metrics Calculator) - Calculates metrics across various dimensions and persists them to the metrics store
  5. Replay - Pipeline applications persist events to Kafka when they cannot deliver them to the downstream application. Replay picks those events up from Kafka and forwards them to the original downstream application to ensure 100% reliability in the pipeline.
  6. ConfigApp - Dynamically provisions configuration for the whole pipeline. This is a built-in app from the Jetstream framework

Demo apps

  1. MetricService - Provides metrics for the metrics dashboard
  2. MetricUI - A dashboard that visualizes the metrics
  3. TwitterSample - An app that uses the Twitter sample API, through twitter4j, to feed sampled tweets into the pipeline

See detail

Dependencies

In order to run the pipeline, the following external dependencies are required:

  1. Zookeeper
  2. Mongo
  3. Kafka
  4. Cassandra

These components have already been dockerized, either by Pulsar or by third parties, and are available on Docker Hub.

For other compile-time dependencies, please refer to Pulsar External Dependencies

Tested with Zookeeper 3.4.6, Mongo 2.6.7, kafka_2.10-0.8.1.1 and Cassandra 2.1.2

Deployment

The pipeline has been fully dockerized and can be deployed on a single node or in a distributed environment. Each component is delivered as a docker image. By default, each app component assumes that the zookeeper server name is zkserver, the mongo url is mongo://mongoserver:27017/config, the kafka zookeeper url is kafkaserver:2181, and the kafka broker list is kafkaserver:9092.

All the external dependencies are also dockerized, either by Pulsar or through existing images on Docker Hub.

  1. jetstream/zookeeper - A single-node zookeeper server; exposes port 2181
  2. jetstream/kafka - A single-node kafka with a zk server and a broker; exposes ports 2181 and 9092
  3. mongo - The official mongo docker image, a single-node mongo server; exposes port 27017
  4. poklet/cassandra - A third-party Cassandra docker image. You can configure multiple nodes and multiple seed nodes, and optionally start OpsCenter. A CQL client is also included in the image.

The zookeeper server info can be overridden with the environment variables JETSTREAM_ZKSERVER_HOST and JETSTREAM_ZKSERVER_PORT. The mongo url can be overridden with JETSTREAM_MONGOURL. The JVM options can be overridden with JETSTREAM_JAVA_OPTS. The kafka server info can be overridden with PULSAR_KAFKA_ZK and PULSAR_KAFKA_BROKERS. The Cassandra server info can be overridden with PULSAR_CASSANDRA.
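
As a rough illustration of these overrides, the sketch below collects them in a file that can be passed to docker's --env-file option. The function name write_pulsar_env and the shell variables ZK_HOST, ZK_PORT, MONGO_URL, KAFKA_ZK and KAFKA_BROKERS are hypothetical helpers introduced here; the fallback values are the defaults described above.

```shell
#!/bin/sh
# Sketch: write the Jetstream/Pulsar override variables to an env file.
# ZK_HOST, ZK_PORT, MONGO_URL, KAFKA_ZK and KAFKA_BROKERS are hypothetical
# shell variables; when unset, the defaults assumed by the images are used.
write_pulsar_env() {
  # $1 = path of the env file to create
  cat > "$1" <<EOF
JETSTREAM_ZKSERVER_HOST=${ZK_HOST:-zkserver}
JETSTREAM_ZKSERVER_PORT=${ZK_PORT:-2181}
JETSTREAM_MONGOURL=${MONGO_URL:-mongo://mongoserver:27017/config}
PULSAR_KAFKA_ZK=${KAFKA_ZK:-kafkaserver:2181}
PULSAR_KAFKA_BROKERS=${KAFKA_BROKERS:-kafkaserver:9092}
EOF
}

# Possible usage (not executed here):
#   ZK_HOST=10.0.0.5 write_pulsar_env ./pulsar.env
#   sudo docker run -d --net=host --env-file ./pulsar.env -t "pulsar/replay"
```

Values in an env file are taken literally, so they should not be wrapped in quotes.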

If you don't want to use docker, you can follow Setup local development to run the pipeline in an IDE.

Single node deployment

Inside one docker host, container instances can communicate with each other over the virtual network by using docker's link functionality.

Prepare a node with docker support:

  1. Prepare a Linux instance with docker support; the docker version should be 1.3.0 or later.
  2. Minimum hardware requirement: 8GB RAM | 2 VCPU | 30.0GB Disk
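
The version requirement can be checked before starting anything. The following is a minimal sketch, assuming `docker --version` prints a line of the form "Docker version X.Y.Z, build ..."; the helper name version_ge is hypothetical and compares at most three numeric fields.

```shell
#!/bin/sh
# Sketch: succeed when dotted version $1 is greater than or equal to $2.
# Any suffix starting with '-' or '+' (for example "-rc1") is ignored.
version_ge() {
  left=${1%%[-+]*}
  right=${2%%[-+]*}
  old_ifs=$IFS
  IFS=.
  set -- $left
  l1=${1:-0}; l2=${2:-0}; l3=${3:-0}
  set -- $right
  r1=${1:-0}; r2=${2:-0}; r3=${3:-0}
  IFS=$old_ifs
  if [ "$l1" -ne "$r1" ]; then [ "$l1" -gt "$r1" ]; return; fi
  if [ "$l2" -ne "$r2" ]; then [ "$l2" -gt "$r2" ]; return; fi
  [ "$l3" -ge "$r3" ]
}

# Only run the check when a docker client is installed.
if command -v docker >/dev/null 2>&1; then
  installed=$(docker --version | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p')
  if version_ge "$installed" 1.3.0; then
    echo "docker $installed is new enough"
  else
    echo "docker $installed is older than 1.3.0"
  fi
fi
```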

Quick start

Here is a quick start script that starts all the docker containers. You can copy the script below, download it from here, or just run sh <(curl -s https://raw.githubusercontent.com/pulsarIO/realtime-analytics/master/Demo/rundemo.sh)

#!/bin/sh

echo '>>> Start external dependency'
sudo docker run -d --name zkserver -t "jetstream/zookeeper"
sleep 5
sudo docker run -d --name mongoserver -t "mongo"
sleep 5
sudo docker run -d --name kafkaserver -t "jetstream/kafka"
sleep 5
sudo docker run -d --name cassandraserver -t "poklet/cassandra"

echo '>>> Sleep 30 seconds'
sleep 30
mkdir -p /tmp/pulsarcql
wget https://raw.githubusercontent.com/pulsarIO/realtime-analytics/master/metriccalculator/pulsar.cql -O /tmp/pulsarcql/pulsar.cql
sudo docker run -it --rm --link cassandraserver:cass1 -v /tmp/pulsarcql:/data poklet/cassandra bash -c 'cqlsh $CASS1_PORT_9160_TCP_ADDR -f /data/pulsar.cql'

echo '>>> Sleep 10 seconds'
sleep 10

echo '>>> Start Pulsar pipeline'
sudo docker run -d -p 0.0.0.0:8000:9999 -p 0.0.0.0:8081:8080 --link zkserver:zkserver --link mongoserver:mongoserver -t "jetstream/config"
sleep 5
sudo docker run -d -p 0.0.0.0:8001:9999 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/replay"
sleep 5
sudo docker run -d -p 0.0.0.0:8003:9999 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/sessionizer"
sleep 5
sudo docker run -d -p 0.0.0.0:8004:9999 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/distributor"
sleep 5
sudo docker run -d -p 0.0.0.0:8005:9999 --name metriccalculator --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver  --link cassandraserver:cassandraserver -t "pulsar/metriccalculator"
sleep 5
sudo docker run -d -p 0.0.0.0:8002:9999 -p 0.0.0.0:8080:8080 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/collector"
sleep 5

echo '>>> Start demo'
sudo docker run -d -p 0.0.0.0:8006:9999 -p 0.0.0.0:8083:8083 --name metricservice --link zkserver:zkserver --link mongoserver:mongoserver --link cassandraserver:cassandraserver -t "pulsar/metricservice"
sleep 5
sudo docker run -d -p 0.0.0.0:8007:9999 -p 0.0.0.0:8088:8088 --link zkserver:zkserver --link mongoserver:mongoserver --link metricservice:metricservice --link metriccalculator:metriccalculator -t "pulsar/metricui"

# The twitter sample uses the twitter4j API. To run the app, a twitter OAuth token is required (https://dev.twitter.com/oauth/overview/application-owner-access-tokens)
#sleep 5
#sudo docker run -d -p 0.0.0.0:8008:9999 --link zkserver:zkserver --link mongoserver:mongoserver -e TWITTER4J_OAUTH_CONSUMERKEY= -e TWITTER4J_OAUTH_CONSUMERSECRET=  -e TWITTER4J_OAUTH_ACCESSTOKEN=  -e TWITTER4J_OAUTH_ACCESSTOKENSECRET= -t pulsar/twittersample
#sleep 5
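
The script above relies on fixed sleeps to give each container time to start. As an alternative sketch (not part of the original rundemo.sh), a small polling helper can wait until a probe command succeeds. The name wait_for is hypothetical, and the probe in the usage comment assumes the Configuration UI answers plain HTTP requests on port 8081 of the docker host.

```shell
#!/bin/sh
# Sketch: poll a command until it succeeds or the attempts run out.
wait_for() {
  # $1 = maximum attempts, $2 = delay between attempts in seconds,
  # remaining arguments = the probe command to run
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Probe output is discarded; only the exit status matters.
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Possible usage (not executed here), waiting up to 60 seconds:
#   wait_for 30 2 curl -s -f http://localhost:8081/ || echo 'config app not ready'
```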

Start dependency docker containers:

  1. Run zookeeper: sudo docker run -d --name zkserver -t "jetstream/zookeeper"
  2. Run mongo: sudo docker run -d --name mongoserver -t "mongo"
  3. Run kafka: sudo docker run -d --name kafkaserver -t "jetstream/kafka"
  4. Run Cassandra: sudo docker run -d --name cassandraserver -t "poklet/cassandra"

Create Cassandra Tables and keyspace

  1. Run the CQL client to install the tables, where $PATH_CQL is the local directory that contains pulsar.cql: sudo docker run -it --rm --link cassandraserver:cass1 -v $PATH_CQL:/data poklet/cassandra bash -c 'cqlsh $CASS1_PORT_9160_TCP_ADDR -f /data/pulsar.cql'
  2. Pulsar init CQL: pulsar.cql

Start config app and replay app:

  1. Run Config server: sudo docker run -d -p 0.0.0.0:8000:9999 -p 0.0.0.0:8081:8080 --link zkserver:zkserver --link mongoserver:mongoserver -t "jetstream/config"
  2. Run Replay app: sudo docker run -d -p 0.0.0.0:8001:9999 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/replay"

Start pipeline apps:

  1. Run Collector: sudo docker run -d -p 0.0.0.0:8002:9999 -p 0.0.0.0:8080:8080 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/collector"
  2. Run Sessionizer: sudo docker run -d -p 0.0.0.0:8003:9999 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/sessionizer"
  3. Run Distributor: sudo docker run -d -p 0.0.0.0:8004:9999 --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver -t "pulsar/distributor"
  4. Run Metriccalculator: sudo docker run -d -p 0.0.0.0:8005:9999 --name metriccalculator --link zkserver:zkserver --link mongoserver:mongoserver --link kafkaserver:kafkaserver --link cassandraserver:cassandraserver -t "pulsar/metriccalculator"

Start demo apps:

  1. Run MetricService: sudo docker run -d -p 0.0.0.0:8006:9999 -p 0.0.0.0:8083:8083 --name metricservice --link zkserver:zkserver --link mongoserver:mongoserver --link cassandraserver:cassandraserver -t "pulsar/metricservice"
  2. Run MetricUI: sudo docker run -d -p 0.0.0.0:8007:9999 -p 0.0.0.0:8088:8088 --link zkserver:zkserver --link mongoserver:mongoserver --link metricservice:metricservice --link metriccalculator:metriccalculator -t "pulsar/metricui"
  3. Run twittersample: sudo docker run -d -p 0.0.0.0:8008:9999 --link zkserver:zkserver --link mongoserver:mongoserver -e TWITTER4J_OAUTH_CONSUMERKEY= -e TWITTER4J_OAUTH_CONSUMERSECRET= -e TWITTER4J_OAUTH_ACCESSTOKEN= -e TWITTER4J_OAUTH_ACCESSTOKENSECRET= -t pulsar/twittersample (twittersample requires a twitter OAuth token, see https://dev.twitter.com/oauth/overview/application-owner-access-tokens)

Check pipeline apps:

Each app exposes an HTTP port for monitoring:

  1. Replay - http://<hostname>:8001
  2. Collector - http://<hostname>:8002, Collector REST endpoint - http://<hostname>:8080
  3. Sessionizer - http://<hostname>:8003
  4. Distributor - http://<hostname>:8004
  5. MetricCalculator - http://<hostname>:8005
  6. Configuration - http://<hostname>:8000, Configuration UI - http://<hostname>:8081
  7. Query data via CQL Client: sudo docker run -it --rm --link cassandraserver:cass "poklet/cassandra" cqlsh cass
  8. MetricService - http://<hostname>:8006
  9. MetricUI - http://<hostname>:8007, Dashboard port: http://<hostname>:8088
  10. TwitterSample - http://<hostname>:8008
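
The monitoring ports listed above can be probed in one pass. This is only a sketch: the function names monitor_urls and check_all are hypothetical, TwitterSample (8008) is left out because it is optional, and a port is reported as UP whenever curl can complete a request within five seconds, regardless of the HTTP status returned.

```shell
#!/bin/sh
# Sketch: list the monitoring URLs of the single node deployment.
monitor_urls() {
  # $1 = hostname of the docker host
  for port in 8000 8001 8002 8003 8004 8005 8006 8007; do
    echo "http://$1:$port"
  done
}

# Sketch: report which monitoring URLs accept a connection.
check_all() {
  for url in $(monitor_urls "$1"); do
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "UP   $url"
    else
      echo "DOWN $url"
    fi
  done
}

# Possible usage (not executed here):
#   check_all localhost
```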

Check Demo dashboard

Open http://<hostname>:8088 in Chrome or Firefox.

Distributed deployment

In a distributed environment, pass the zkserver and mongoserver addresses as parameters when starting each pulsar app container. Each pulsar app should also bind directly to the network of its container host (--net=host), and each app should run on a different host.

Start dependency docker containers:

  1. Run zkServer: sudo docker run -d -p 0.0.0.0:2181:2181 -t "jetstream/zookeeper"
  2. Run mongo: sudo docker run -d -p 0.0.0.0:27017:27017 -t "mongo"
  3. Run kafka: sudo docker run -d -p 0.0.0.0:9092:9092 -p 0.0.0.0:2181:2181 -t "jetstream/kafka"
  4. Run cassandra: sudo docker run -d -p 0.0.0.0:9160:9160 -p 0.0.0.0:22:22 -p 0.0.0.0:61621:61621 -p 0.0.0.0:7000:7000 -p 0.0.0.0:7001:7001 -p 0.0.0.0:7199:7199 -p 0.0.0.0:8012:8012 -p 0.0.0.0:9042:9042 -t "poklet/cassandra"

Start config app and replay app:

  1. Run Config server: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -t "jetstream/config"
  2. Run Replay app: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e PULSAR_KAFKA_ZK="kafkazkconn" -e PULSAR_KAFKA_BROKERS="kafkabrokers" -t "pulsar/replay"

Start pipeline apps:

  1. Run Collector: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e PULSAR_KAFKA_BROKERS="kafkabrokers" -t "pulsar/collector"
  2. Run Sessionizer: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e PULSAR_KAFKA_BROKERS="kafkabrokers" -t "pulsar/sessionizer"
  3. Run Distributor: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e PULSAR_KAFKA_BROKERS="kafkabrokers" -t "pulsar/distributor"
  4. Run Metriccalculator: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e PULSAR_KAFKA_BROKERS="kafkabrokers" -e PULSAR_CASSANDRA="cassandraseeds" -t "pulsar/metriccalculator"

Start demo apps:

  1. Run MetricService: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e PULSAR_CASSANDRA="cassandraserverip" -t "pulsar/metricservice"
  2. Run MetricUI: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e METRIC_SERVER_HOST="metricserverip" -e METRIC_CALCULATOR_HOST="metriccalculatorserverip" -t "pulsar/metricui"
  3. Run twittersample: sudo docker run -d --net=host -e JETSTREAM_ZKSERVER_HOST="zkserverip" -e JETSTREAM_MONGOURL="mongo://mongoserverip:27017/config" -e TWITTER4J_OAUTH_CONSUMERKEY= -e TWITTER4J_OAUTH_CONSUMERSECRET= -e TWITTER4J_OAUTH_ACCESSTOKEN= -e TWITTER4J_OAUTH_ACCESSTOKENSECRET= -t pulsar/twittersample (twittersample requires a twitter OAuth token, see https://dev.twitter.com/oauth/overview/application-owner-access-tokens)
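
Because the Collector, Sessionizer and Distributor commands above differ only in the image name, the command line can be generated from a few variables. The sketch below only prints the command so it can be reviewed or sent to the target host; the function name pipeline_app_cmd and the variables ZK_HOST, MONGO_URL and KAFKA_BROKERS are hypothetical, and their fallback values are the same placeholders used in the steps above.

```shell
#!/bin/sh
# Placeholder addresses; replace them with the hosts running each dependency.
ZK_HOST=${ZK_HOST:-zkserverip}
MONGO_URL=${MONGO_URL:-mongo://mongoserverip:27017/config}
KAFKA_BROKERS=${KAFKA_BROKERS:-kafkabrokers}

# Sketch: print the docker run command for one pipeline app image.
pipeline_app_cmd() {
  # $1 = image name, for example pulsar/collector
  echo "sudo docker run -d --net=host" \
    "-e JETSTREAM_ZKSERVER_HOST=$ZK_HOST" \
    "-e JETSTREAM_MONGOURL=$MONGO_URL" \
    "-e PULSAR_KAFKA_BROKERS=$KAFKA_BROKERS" \
    "-t $1"
}

# Possible usage (not executed here), one app per host:
#   pipeline_app_cmd pulsar/collector
#   pipeline_app_cmd pulsar/sessionizer
#   pipeline_app_cmd pulsar/distributor
```

The Metriccalculator additionally needs PULSAR_CASSANDRA and the Replay app needs PULSAR_KAFKA_ZK, so this helper does not cover them as written.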

Deploy multiple apps on single host

Port assignment is required to avoid port conflicts in --net=host mode. Here is an example: sudo docker run -d --net=host -e JETSTREAM_REST_BASEPORT=8076 -e JETSTREAM_APP_PORT=8007 -e JETSTREAM_CONTEXT_BASEPORT=16000 -e JETSTREAM_ZKSERVER_HOST="172.17.0.63" -e JETSTREAM_MONGOURL="mongo://172.17.0.64:27017/config" -e METRIC_SERVER_HOST="172.17.0.74" -e METRIC_CALCULATOR_HOST="172.17.0.72" -t "pulsar/metricui"

When starting the container, use JETSTREAM_REST_BASEPORT, JETSTREAM_APP_PORT, and JETSTREAM_CONTEXT_BASEPORT to reassign the ports. For details, see Port Assignment.
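
Before starting several apps on one host, it can help to check a planned port table for reuse. The sketch below reads lines of the form "name rest_baseport app_port context_baseport" and reports any value that appears more than once. The function name find_port_conflicts and the collector row in the usage comment are hypothetical; the check compares exact values only and does not account for any additional ports an app may derive from a base port.

```shell
#!/bin/sh
# Sketch: report ports that are assigned to more than one app.
# Input (stdin): one line per app, "name port [port ...]".
find_port_conflicts() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i in seen)
        print $i " used by " seen[$i] " and " $1
      else
        seen[$i] = $1
    }
  }'
}

# Possible usage (not executed here); the collector values are made up:
#   printf 'metricui 8076 8007 16000\ncollector 8080 8007 16100\n' | find_port_conflicts
```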

EC2 deployment

Follow https://docs.docker.com/installation/amazon/ to set up the docker host, then follow the steps above to start the pipeline.

Build from source

Each component of the pipeline can be built from source using maven. Please use maven 3.0 or later.

Customize the pipeline

Each component of the pipeline can be customized by changing its spring xml. Each component can also be used as a library: you can create your own app and reference the processors and channels defined in the pulsar pipeline.

Create new app by Archetype

Jetstream has an archetype for creating a new app quickly.