[This article by David Gildeh comes to you from the DZone Guide to Performance & Monitoring — 2015 Edition. For additional information including insight from industry experts and luminaries, performance statistics and strategies, and an overview of how modern companies are handling application monitoring, download the guide below.]
In 2011, several champions of the DevOps movement started a Twitter campaign with the hashtag “#monitoringsucks.” Today, I wanted to revisit this sentiment in conjunction with a study I was doing before my team and I launched Dataloop.IO, a new monitoring tool for online services. We wanted to ensure that our new tool was solving real problems that other companies were facing, such as the monitoring pains we had experienced at our previous company, Alfresco.
We interviewed more than 60 online services in both London and the US, ranging in size from large organizations like the BBC to small startups. Because we focused only on online services and their specific needs, our results are specific to that sector. Most of the services were running on public cloud infrastructure (like AWS) and were fully committed to a DevOps approach.
Since we were trying to figure out how to develop a better monitoring tool, we needed to answer two questions. First, we needed to know what monitoring tools companies were currently using. Second, we needed to see how many servers these companies were monitoring with these tools. With both answers in hand, we wanted to see how these two variables related, so that we could get a sense of which tools were solving (or causing!) which problems at which company sizes.
On a higher level, we learned two things. First, in some ways, monitoring still sucks, in ways we’ll explain in more detail below. Second, the pain is actually going to get worse as more companies move to microservices.
What monitoring tools companies are using
Although a lot of new tools have arrived since 2011, it’s clear that older open source tools like Nagios, Zabbix, and Icinga still dominate the market; 70% of the companies we spoke to are still using these older tools for their core monitoring and alerting (see Chart 01). Among DZone’s enterprise developer audience, 43% used Nagios, Zabbix, or Icinga. Nagios was also the most popular monitoring tool, with 29% market share.
Around 70% of the companies used more than one monitoring tool, and the average was two. Nagios and Graphite configurations were the most common, with many also using New Relic. However, only two of the companies we spoke to actually paid for New Relic. Most users did not upgrade from the free version because they thought the paid version was too expensive.
There were a lot of different tools in the “Other” category, but no particular tool stood out. The tools in the “Other” category were mainly used by startups and included some of the newer SaaS monitoring tools such as Librato and Datadog. Older open source tools like Cacti and Munin were also well-represented in this group, along with AWS CloudWatch. However, a significant number of tools in this category were home-grown solutions. Some of the larger services had invested significant resources into building custom monitoring solutions from the ground up.
How many servers companies are monitoring
If we look at tool usage versus the number of servers the companies manage (ranging from fewer than 20 servers for a startup to more than 1000 servers for the large online services), the proportion of older open source tools like Nagios and paid on-premise tools goes up as the service gets larger, whereas the smaller services are more likely to use developer-focused tools like Graphite, LogStash, and New Relic.
This makes sense insofar as many of the larger services are older (> 5 years old) and have legacy monitoring infrastructure. They also have the resources to hire a dedicated operations team that tends to bring in the tools they’re most familiar with, namely Nagios and its alternatives. They also have the budget to pay for the enterprise versions of monitoring tools like Splunk and AppDynamics.
On the other hand, the smaller services often don’t have any DevOps or dedicated operations people in their company, so their developers tend to use simple-to-install SaaS monitoring tools or popular tools in the developer community such as Graphite and LogStash. There seems to be a tipping point between 50 and 100 servers where the company has the resources to bring in a DevOps/operations person or team, and they in turn tend to bring in time-tested infrastructure monitoring tools that they trust, like Nagios.
Here are four of the major trends we uncovered from interviewing more than 60 online service companies about their monitoring strategies.
1. Building & Scaling the “Kit car”
With 78% of online services running their own open-source monitoring solution, many spent four to six months building that solution from open-source components and then tuning it to work well with their environment. The key issue is that many of the tools were originally designed 10-15 years ago, before cloud architecture, DevOps, and microservices. A significant amount of time is spent adapting these older tools to work in today’s dynamic environments.
Once they had built and tuned their “kit car” monitoring stack, their services started growing, and they needed to spend more time modifying their monitoring system so that it could handle the increasing amount of data. For example, a large online service with over 1000 instances on AWS had a Zabbix server fall over after the MySQL database behind it filled up to 2TB of data. In the end, they just dropped the database and restarted again, rather than try to make Zabbix scale.
2. Spammy Alerts
If there was one consistent complaint from all the companies we spoke to, it was about overenthusiastic alerting. It’s clear that none of the tools, even the monitoring tools that claim to have advanced machine learning algorithms, have solved the problem of alert fatigue. The problem is only getting worse as companies scale out on more servers or run microservices on continuously changing cloud environments. It was also clear that, despite the marketing claims of many companies in the space, none of the machine learning algorithms for anomaly detection or predictive alerting really worked well in practice. This indicated to us that there is still a long way to go before these tools help automatically filter out the noise of alerts in monitoring.
In one company’s case, they were receiving around 5000 alert emails a day. With that volume, alerting had just become noise, and most of the team simply filtered the alerts into a folder or automatically deleted them altogether.
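To make the scale of the noise concrete, here is a minimal sketch of collapsing a stream of repeated alerts into a short digest. It is only an illustration, not something the companies we interviewed reported doing, and the `Alert` fields (`host`, `check`) and the `summarize` helper are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Alert(NamedTuple):
    host: str   # hypothetical field: machine that raised the alert
    check: str  # hypothetical field: name of the failing check


def summarize(alerts: Iterable[Alert]) -> list[tuple[Alert, int]]:
    """Collapse repeated alerts into (alert, count) pairs, most frequent first."""
    return Counter(alerts).most_common()


if __name__ == "__main__":
    raw = [Alert("web-1", "disk"), Alert("web-1", "disk"), Alert("db-1", "cpu")]
    for alert, count in summarize(raw):
        print(f"{alert.host}/{alert.check}: {count} alert(s)")
```

Grouping like this only reduces volume. It does not decide which alerts matter, which is the part the tools in our study had not solved.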
3. Data Silos
Many companies we talked to were collecting real-time data. These data sources even included business metrics such as number of signups, checkouts, or revenue figures, which teams used to keep an eye on the service. However, most of the monitoring tools they used suffered from poor usability and dated UIs, so the collected data was siloed away for the eyes of the operations team alone. This means that real-time data about the service was less accessible to other stakeholders who would get value out of seeing this data too.
Many services solved the data silo problem by building custom dashboards that could be displayed on TVs around the office or shared via URL. But these dashboards were usually very static and required a developer to make changes when needed.
On the other hand, for companies where monitoring data was easily shared, the monitoring tool became a much more valuable tool for different teams to collaborate around, both to identify areas for improvement and to gain visibility into real-time performance across the business.
4. Microservices
A key trend in online services is the microservices deployment model, which involves separate cross-functional development teams deploying and supporting their own services in production. This strategy enables a large, complex application to become highly scalable as it grows. However, it dramatically increases the number of servers and services the DevOps/operations team needs to support, so it only works if the development teams become the first line of support when things go wrong.
In this model, operations staff become a “platforms” team, providing common tools and processes for development teams. This ops-provided platform includes self-service monitoring, whereby developers have the ability to add their own checks and create their own dashboards and alerts.
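As a rough illustration of what self-service checks could look like, the sketch below lets a development team register its own health check with a shared registry that the platform team runs. Every name here (`CHECKS`, `check`, `run_checks`, and the two example checks) is hypothetical and does not correspond to the API of any particular monitoring tool.

```python
from typing import Callable, Dict

# Hypothetical registry owned by the platform team: check name -> function
# that returns True when the service is healthy.
CHECKS: Dict[str, Callable[[], bool]] = {}


def check(name: str):
    """Decorator a development team uses to register its own health check."""
    def register(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = func
        return func
    return register


def run_checks() -> Dict[str, bool]:
    """Run every registered check; a check that raises counts as failing."""
    results: Dict[str, bool] = {}
    for name, func in CHECKS.items():
        try:
            results[name] = bool(func())
        except Exception:
            results[name] = False
    return results


@check("signup-queue-depth")
def signup_queue_ok() -> bool:
    queue_depth = 12  # stand-in value; a real check would query the service
    return queue_depth < 100


@check("payments-api")
def payments_ok() -> bool:
    raise ConnectionError("simulated outage")  # simulated failure for the example


if __name__ == "__main__":
    for name, healthy in run_checks().items():
        print(f"{name}: {'OK' if healthy else 'FAILING'}")
```

The point of the design is the division of labor: the platform team owns the registry and the runner, while each development team owns the checks for its own service.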
Keeping up with high-speed deployments and ephemeral instances in a microservices model is a huge challenge for today’s monitoring tools. It’s also difficult to visualize the complex flow of tasks through various services and to deal with the highly dynamic scale. Unfortunately, it’s clear that current monitoring tools have not been designed around this microservice-centric model, and most suffer from poor usability and adoption in teams outside operations. New tools designed specifically to handle microservices are required so that both operations and development can collaborate easily around a single source of performance information, instead of having developers use their own tool (typically New Relic) and operations use theirs (typically Nagios).
Many more monitoring tools have become available in the four years since “#monitoringsucks” became a DevOps meme, but our research shows that many organizations are still struggling with monitoring. We believe this is mainly because new tools focus on the technical aspects of monitoring, but do not adequately drive adoption in teams outside operations. This is the problem we’re focusing on at Dataloop.IO, because we believe that the glue between Dev and Ops is reliable monitoring that everyone in the organization uses to make data-driven decisions that improve their software.
2015 DZone Performance & Monitoring Survey
http://www.slideshare.net/adriancockcroft/software-architecture-monitoring-microservices-a-challenge