Monitoring with Ganglia

Vladimir Vuksan (Fastly)
Operations
Location: E146
Slides:   1-ODP 

In this talk I will describe a monitoring setup that uses Ganglia and Nagios for a highly available CDN service. CDNs pose a number of unique problems that make them more difficult to monitor, such as:

  • geographically distributed nodes running on multiple networks and network providers, which makes them more susceptible to network splits, isolated routing issues, etc.
  • extreme levels of traffic (multiple gigabits per box), which quickly uncover poorly chosen configuration values
  • being “invisible” to the end web user: a web user may complain not to the CDN directly but to the web site they are visiting, which makes the feedback loop longer

In order to be more proactive, we have built out a significant monitoring infrastructure. In the talk I will cover the following:

  • How we utilize Ganglia to push metrics to multiple geographically separate data centers
  • How we use Ganglia to correlate metrics, build quick comparisons, identify metrics to investigate
  • Some of the metric groups we follow, including some less well-known ones you may want to look at, e.g. detailed network statistics such as TCP retransmission rates, TCP failure rates, etc.
  • How we proactively look for things to monitor, i.e. we collect more than 1,100 metrics per monitored host
  • Ways of utilizing Nagios and Ganglia to verify configs, software versions, and network configs. We use them to verify the state of a globally distributed network of nodes in near real time
  • How we use IRC / chat as a common message bus for humans, which keeps everyone on the same page and helps us be proactive, e.g.
      * Alerting: internal and external alerts are displayed for everyone to see, so anyone can react
      * Config changes being applied to the system(s): helps to discover misuse or wrong
      * Support tickets being reported: allows anyone to respond quickly or discuss things in the timeline
      * Twitter mentions: for customer support issues
      * Critical application errors
  • How we use Ganglia to label graphs when we do work on individual nodes
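To make the TCP retransmission metric mentioned above concrete, here is a minimal Python sketch of one way such a value could be derived and handed to Ganglia. It is not taken from the talk. It assumes a Linux host, where `/proc/net/snmp` exposes cumulative `OutSegs` and `RetransSegs` counters, and it assumes the stock `gmetric` command-line tool is installed. The metric name `tcp_retrans_percent` and the helper function names are hypothetical.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-style text as a dict.

    The file holds two "Tcp:" lines: a header line with the counter
    names, followed by a line with the matching values.
    """
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmit_rate(before, after):
    """Percentage of sent segments that were retransmitted between samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    # Avoid dividing by zero when no segments were sent in the interval.
    return 100.0 * retrans / sent if sent > 0 else 0.0


def gmetric_command(name, value, units):
    """Build (but do not run) a gmetric invocation for a float metric.

    The metric name passed in is the caller's choice; nothing here is
    a name used by the speaker's actual setup.
    """
    return [
        "gmetric",
        f"--name={name}",
        f"--value={value:.3f}",
        "--type=float",
        f"--units={units}",
    ]
```

In use, one would read `/proc/net/snmp` twice, some seconds apart, pass both texts through `parse_tcp_counters`, and hand the result of `retransmit_rate` to `gmetric_command("tcp_retrans_percent", rate, "percent")`, running the returned argument list with `subprocess.run`. Working from the difference of two samples matters because the kernel counters are cumulative since boot.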

Thanks in part to our monitoring, uptime has been close to 100%, since we have been able to spot problems early and address them before they became serious issues.

Our Pingdom stats

http://stats.pingdom.com/wmhe8yvtvkcx/397359

I intend to borrow liberally from my blog post about my monitoring setup, available here:

http://blog.vuksan.com/2012/09/01/my-monitoring-setup/

Vladimir Vuksan

Fastly

Vladimir Vuksan (Fastly Operations) has worked in technical operations, systems engineering and software development for over 15 years. Prior to Fastly, he worked at Broadcom, Mocospace, Rave Mobile Safety, Demandware, and the University of New Mexico, implementing high availability solutions and building tools to make managing and running infrastructure easier. He is also one of the developers on Ganglia, an incredibly useful performance tracking and trending solution.
