2. Giving operational responsibility back to the feature teams
(read: developers) instead of having a monolithic SRE team
❖ Capacity planning
❖ Configuration management
❖ Monitoring, alerting and on-call
etc.
But we provide the tools and infrastructure for them!
Distributed Operational Responsibility
3. ❖ Organizational Scalability - changes are too frequent for
a monolithic SRE team to keep up with
❖ Getting The Right Person(tm) on the problem faster
❖ Accountability - making the right people hurt
❖ Autonomy - feature teams do their own planning and
make their own decisions
So let’s talk about monitoring....
But... but why...????
4. ❖ Developers need training, but not a new education
❖ Developers need autonomy, but will do stupid things
❖ Developers need to care about metrics and analytics,
but not the pipeline
So how does that affect tooling?
Human challenges
5. Alerting - What developers should care about
[Diagram: metrics and events, a "magic monitoring pipeline", alerting rules]
6. Alerting - The reality
[Diagram: metrics and events, FFWD, Apache Kafka, "other stuff", "even more stuff"]
7. ...but we provide several different abstraction levels
depending on complexity of the task
❖ Script hooks, i.e. drop a script in a folder
❖ Python scripts using the Riemann library
❖ Talk directly to FFWD using a supported protocol
Developers collect their own metrics
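The lowest abstraction level listed on this slide, talking directly to the agent, might look roughly like the sketch below. It assumes a local agent that accepts one JSON object per UDP datagram; the address, the port, and the field names (`type`, `key`, `value`, `time`, `attributes`) are assumptions made for illustration and should be checked against the FFWD documentation for the protocol actually in use.

```python
import json
import socket
import time

# Assumed address of a local metrics agent; not taken from the slides.
AGENT_ADDR = ("127.0.0.1", 19000)


def encode_metric(key, value, attributes=None):
    """Serialize one metric sample as a JSON-encoded datagram payload."""
    payload = {
        "type": "metric",
        "key": key,
        "value": float(value),
        "time": int(time.time() * 1000),  # milliseconds since the epoch
        "attributes": dict(attributes or {}),
    }
    return json.dumps(payload).encode("utf-8")


def send_metric(key, value, attributes=None, addr=AGENT_ADDR):
    """Fire-and-forget: send one sample to the agent over UDP."""
    data = encode_metric(key, value, attributes)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(data, addr)  # number of bytes handed to the OS
```

A service would call something like `send_metric("checkout.requests", 42, {"what": "request-count"})` from its own code. UDP keeps the call cheap and non-blocking for the application, at the cost of silently dropping samples when the agent is down.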
8. ....but we help them by providing....
❖ Continuous integration with integration tests
❖ Abstractions from externals like PagerDuty
❖ Shared common functionality
Developers write their own alerting rules
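One way to read the points on this slide: if a rule is a plain function and the paging service sits behind a small interface, CI can exercise the rule without touching PagerDuty. The following is a minimal sketch of that idea, not the tooling described in the talk; `Event`, `Notifier`, `RecordingNotifier`, and `error_rate_rule` are hypothetical names, and no real PagerDuty API is called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple


@dataclass
class Event:
    key: str
    value: float
    attributes: Dict[str, str] = field(default_factory=dict)


class Notifier(Protocol):
    """Hypothetical abstraction over an external paging service."""

    def page(self, team: str, message: str) -> None: ...


class RecordingNotifier:
    """Test double that records pages instead of sending them."""

    def __init__(self) -> None:
        self.pages: List[Tuple[str, str]] = []

    def page(self, team: str, message: str) -> None:
        self.pages.append((team, message))


def error_rate_rule(event: Event, notifier: Notifier, threshold: float = 0.05) -> bool:
    """Page the owning team when an error-rate sample exceeds the threshold."""
    if event.key != "error-rate" or event.value <= threshold:
        return False
    team = event.attributes.get("team", "unknown")
    notifier.page(team, f"error rate {event.value:.2%} above {threshold:.2%}")
    return True


def test_error_rate_rule() -> None:
    """The kind of check a CI job could run on every rule change."""
    notifier = RecordingNotifier()
    assert not error_rate_rule(Event("error-rate", 0.01, {"team": "search"}), notifier)
    assert error_rate_rule(Event("error-rate", 0.20, {"team": "search"}), notifier)
    assert notifier.pages == [("search", "error rate 20.00% above 5.00%")]
```

Keeping the external service behind an interface also means a change of paging provider touches one adapter rather than every team's rules.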
9. ❖ We build monitoring as a platform with many levels of
entry
❖ Self-service is king!
❖ We spend a lot of our time teaching and talking rather
than typing
....and that’s a good thing!
Impact on the monitoring team
10. Distributed Operational Responsibility is a work in
progress
❖ We don’t know if this will work well
❖ We will run into new problems
❖ We will keep changing the way we work
anyway
......and finally