{"id":842,"date":"2015-05-20T08:44:35","date_gmt":"2015-05-20T17:44:35","guid":{"rendered":"http:\/\/blog.box.kr\/?p=842"},"modified":"2015-05-20T08:44:35","modified_gmt":"2015-05-20T17:44:35","slug":"scrapdoes-monitoring-still-suck","status":"publish","type":"post","link":"https:\/\/blog.box.kr\/?p=842","title":{"rendered":"[scrap]Does Monitoring Still Suck?"},"content":{"rendered":"<p><a href=\"http:\/\/java.dzone.com\/articles\/does-monitoring-still-suck?mz=110215-high-perf&amp;utm_content=buffer692ef&amp;utm_medium=social&amp;utm_source=facebook.com&amp;utm_campaign=buffer\">http:\/\/java.dzone.com\/articles\/does-monitoring-still-suck?mz=110215-high-perf&amp;utm_content=buffer692ef&amp;utm_medium=social&amp;utm_source=facebook.com&amp;utm_campaign=buffer<\/a><\/p>\n<p>&nbsp;<\/p>\n<div>\n<div>\n<p><i>[This article by David Gildeh\u00a0comes to you from the\u00a0<\/i><a href=\"http:\/\/dzone.com\/research\/2015-guide-to-performance-and-monitoring\" target=\"_blank\">DZone Guide to Performance &amp; Monitoring &#8212; 2015 Edition<\/a><i>. For additional information including insight\u00a0from industry experts and luminaries, performance statistics and strategies, and an overview of how modern companies are handling application monitoring, download the guide below.]<\/i><\/p>\n<p><i>In 2011, several champions of the DevOps movement started a Twitter campaign with the hashtag \u201c<\/i><i>#monitoringsucks.<\/i><i>\u201d Today, I wanted to revisit this sentiment in conjunction with a study I was doing before my team and I launched Dataloop.IO, a new monitoring tool for online services. We wanted to ensure that our new tool was solving real problems that other companies were facing, such as the monitoring pains we had experienced at our previous company, Alfresco.<\/i><\/p>\n<p>We interviewed over 60 other online services, ranging in size from huge online services like the BBC to smaller startups in both London and the US. 
We focused only on online services and their specific needs, so our results are specific to the online services sector. Most services were running on public cloud infrastructure (like AWS) and fully committed to a DevOps approach.<\/p>\n<p>Since we were trying to figure out how to develop a better monitoring tool, we needed to answer two questions. First, we needed to know what monitoring tools companies were currently using. Second, we needed to see how many servers these companies were monitoring with these tools. Finally, we wanted to see how these two variables related, so that we could get a sense of which tools were solving (or causing!) which problems at which company sizes.<\/p>\n<p>On a higher level, we learned two things. First, monitoring still sucks, in ways we\u2019ll explain in more detail below. Second, the pain is actually going to get worse as more companies move to microservices.<\/p>\n<h3><b>What monitoring tools companies are using<\/b><\/h3>\n<p>Although a lot of new tools have arrived\u00a0since 2011, it\u2019s clear that older open source tools like Nagios, Zabbix, and Icinga still dominate the market; <b>70%<\/b> of the companies we spoke to are still using these older tools for their core monitoring and alerting (see Chart 01). Among DZone\u2019s enterprise developer audience, 43% used Nagios, Zabbix, or Icinga. Nagios was also the most popular monitoring tool at 29% market share [1].<\/p>\n<p>Around 70% of the companies used more than one monitoring tool, with most using an average of two. Nagios and Graphite configurations were the most common, with many also using New Relic. However, only two of the companies we spoke to actually paid for New Relic. Most users did not upgrade from the free version because they thought the paid version was too expensive.<\/p>\n<p>There were a lot of different tools in the \u201cOther\u201d category, but no particular tool stood out. 
The tools in the \u201cOther\u201d category were mainly used by startups and included some of the newer SaaS monitoring tools such as Librato and Datadog. Older open source tools like Cacti and Munin were also well-represented in this group, along with AWS CloudWatch. However, a significant number of tools in this category were home-grown solutions. Some of the larger services had invested significant resources into building custom monitoring solutions from the ground up.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/java.dzone.com\/sites\/all\/files\/perfmonarticle1img1.png?w=623\" alt=\"\" data-recalc-dims=\"1\" \/><\/p>\n<\/div>\n<\/div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<h3><b>How many servers companies are monitoring<\/b><\/h3>\n<p>If we look at tool usage versus the number of\u00a0servers the companies manage (ranging from &lt;20 servers for a startup service to &gt;1000 servers for the large online services), the proportion of older open source tools like Nagios and paid on-premise tools goes up as the service gets larger, whereas the smaller services are more likely to use developer-focused tools like Graphite, LogStash, and New Relic.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/java.dzone.com\/sites\/all\/files\/Screen%20Shot%202015-05-19%20at%2012.03.07%20PM.png?w=623\" alt=\"\" data-recalc-dims=\"1\" \/><\/p>\n<div>\n<div>\n<div>\n<p>This makes sense insofar as many of the larger services are older (&gt; 5 years old) and have legacy monitoring infrastructure. They also have the resources to hire a dedicated operations team that tends to bring in the tools they\u2019re most familiar with, namely Nagios and its alternatives. 
They also have the budget to pay for the enterprise versions of monitoring tools like Splunk and AppDynamics.<\/p>\n<p>On the other hand, the smaller services often don\u2019t have any DevOps or dedicated operations people in their company, so their developers tend to use simple-to-install SaaS monitoring tools or popular tools\u00a0in the developer community such as Graphite and LogStash. There seems to be a tipping point between 50 and 100 servers where the company has the resources to bring in a DevOps\/operations person or team, and they in turn tend to bring in time-tested infrastructure monitoring tools that they trust, like Nagios.<\/p>\n<h3><b>Key Trends<\/b><\/h3>\n<p>Here are four of the major trends we uncovered from interviewing more than 60 online service companies about their monitoring strategies.<\/p>\n<p>1. Building &amp; Scaling the \u201cKit car\u201d<\/p>\n<p>With 78% of online services running their own\u00a0open-source monitoring solution, many spent 4-6 months building their monitoring solution with open-source components and then tuning it to work well with their environment. The key issue is that many of the tools were originally designed 10-15 years ago, before cloud architecture, DevOps, and microservices. A significant amount of time is spent adapting these older tools to work in today\u2019s dynamic environments.<\/p>\n<p>Once they had built and tuned their \u201ckit car\u201d monitoring stack, their services started growing, and they needed to spend more time modifying their monitoring system so that it could handle the increasing amount of data. For example, a large online service with over 1000 instances on AWS had a Zabbix server fall over after the MySQL database behind it grew to 2TB of data. In the end, they simply dropped the database and started again, rather than try to make Zabbix scale.<\/p>\n<p>2. 
Spammy Alerts<\/p>\n<p>If there was one consistent complaint from\u00a0all the companies we spoke to, it was about overenthusiastic alerting. It\u2019s clear that none of the tools, even the monitoring tools that claim to have advanced machine learning algorithms, have solved the problem of alert fatigue. The problem is only getting worse as companies scale out on more servers or run microservices in continuously changing cloud environments. It was also clear that, despite the marketing claims of many companies in the space, none of the machine learning algorithms for anomaly detection or predictive alerting really worked well in practice. This indicated to us that there is still a long way to go before these tools help automatically filter out the noise of alerts in monitoring.<\/p>\n<p>In one company\u2019s case, they were receiving around <b>5000 alert emails a day<\/b>. With that volume, alerting had just become noise, and most of the team simply filtered the alerts into a folder or automatically deleted them altogether.<\/p>\n<p>3. Data Silos<\/p>\n<p>Many companies we talked to were collecting\u00a0real-time data. These data sources even included business metrics such as number of signups, checkouts, or revenue figures, which teams used to keep an eye on the service. However, most of the monitoring tools they used suffered from poor usability and dated UIs, so the collected data was siloed away, visible to the operations team alone. This means that real-time data about the service was less accessible to other stakeholders who would also get value from seeing it.<\/p>\n<p>Many services solved the data silo problem by building custom dashboards that could be displayed on TVs around the office or shared via URL. 
But these dashboards were usually very static and required a developer to make changes when needed.<\/p>\n<p>On the other hand, for companies where monitoring data was easily shared, the monitoring tool became much more valuable, giving different teams something to collaborate around, both to identify areas for improvement and to gain visibility into real-time performance across the business.<\/p>\n<p>4. Microservices<\/p>\n<p>A key trend in online services is the\u00a0microservices deployment model, which involves separate cross-functional development teams deploying and supporting their own services in production. This strategy enables a large, complex application to become highly scalable as it grows. However, it dramatically increases the number of servers and services the DevOps\/operations team needs\u00a0to support, so it only works if the development teams become the first line of support when things go wrong.<\/p>\n<p>In this model, operations staff become a \u201cplatforms\u201d team, providing common tools and processes for development teams. This ops-provided platform includes self-service monitoring, whereby developers can add their own checks and create their own dashboards and alerts.<\/p>\n<p>Keeping up with high-speed deployments and ephemeral instances in a microservices model is a huge challenge for today\u2019s monitoring tools. It\u2019s also difficult to visualize the complex flow of tasks through various services and to deal with the highly dynamic scale [2]. Unfortunately, it\u2019s clear that current monitoring tools have not been designed around this microservice-centric model, and most suffer from poor usability and adoption in teams outside operations. 
New tools designed specifically to handle microservices are required so that both operations and development can collaborate easily around a single source of performance information, instead of having developers use their own tool (typically New Relic) and operations use theirs (typically Nagios).<\/p>\n<h3><b>Conclusion<\/b><\/h3>\n<p>There are many more monitoring tools available in the four years since \u201c#monitoringsucks\u201d became a DevOps meme, but our research shows that many organizations are still struggling with monitoring. We believe this is mainly because new tools focus on the technical aspects of monitoring, but do not adequately drive adoption\u00a0in teams outside operations. This is the problem we\u2019re focusing on at Dataloop.IO, because we believe that the glue between Dev and Ops is reliable monitoring that everyone in the organization uses to make data-driven decisions that improve their software.<\/p>\n<p><b>[1] 2015 DZone Performance &amp; Monitoring Survey<br \/>[2] <a href=\"http:\/\/www.slideshare.net\/adriancockcroft\/software-architecture-monitoring-microservices-a-challenge\">http:\/\/www.slideshare.net\/adriancockcroft\/software-architecture-monitoring-microservices-a-challenge<\/a><\/b><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p align=\"center\"><a href=\"http:\/\/library.dzone.com\/assets\/request\/whitepaper\/207037?oid=cd15_lp\">DOWNLOAD YOUR FREE COPY TODAY<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>http:\/\/java.dzone.com\/articles\/does-monitoring-still-suck?mz=110215-high-perf&amp;utm_content=buffer692ef&amp;utm_medium=social&amp;utm_source=facebook.com&amp;utm_campaign=buffer &nbsp; [This article by David Gildeh\u00a0comes to you from the\u00a0DZone Guide to Performance &amp; Monitoring &#8212; 2015 Edition. 
For additional information including insight\u00a0from industry experts and luminaries, performance statistics and strategies, and an overview of how modern companies are handling application monitoring, download the guide below.] In 2011, several champions of the DevOps movement started a Twitter campaign with the hashtag \u201c#monitoringsucks.\u201d Today, I wanted to revisit this sentiment in conjunction with a study I was doing before my team and I launched Dataloop.IO, a new monitoring tool for online services. We wanted to ensure that our new tool was solving real problems that other companies were facing, such as the monitoring pains we had experienced at our previous company, Alfresco. We interviewed over 60 other online services, ranging in size from huge online services like the BBC to smaller startups in both London and the US. We focused only on online services and their specific needs, so our results are specific to the online services sector. Most services were running on public cloud infrastructure (like AWS) and fully committed to a DevOps approach. 
Since we were trying to figure out how to develop\u2028a better monitoring tool, we needed [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"ngg_post_thumbnail":0,"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[4,5],"tags":[],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p5q9Zn-dA","jetpack-related-posts":[{"id":373,"url":"https:\/\/blog.box.kr\/?p=373","url_meta":{"origin":842,"position":0},"title":"Application Performance Monitoring (APM) Framework for J2EE Applications","date":"2014-09-15","format":false,"excerpt":"https:\/\/code.google.com\/p\/monitor-24x7\/ \u00a0 Description 24x7Monitoring is an Open Source Application Performance Monitoring (APM) Framework for J2EE Applications that uses Aspect Oriented Programming to collect Performance metrics about the running JVM and display the data to the user in a tabular\/graphical format. 
24x7Monitoring does not require any modification to the source code\u2026","rel":"","context":"In &quot;\ucc38\uace0\ub97c \uc704\ud55c \uc800\uc7a5\ubb3c&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":959,"url":"https:\/\/blog.box.kr\/?p=959","url_meta":{"origin":842,"position":1},"title":"server monitoring solutions.","date":"2015-08-26","format":false,"excerpt":"https:\/\/www.nagios.org\/\u00a0 \u00a0( Open Source ) \u00a0 http:\/\/www.whatap.io\u00a0 \u00a0( SaaS ) opinions for whatap ( http:\/\/archmond.net\/?p=3609 )","rel":"","context":"In &quot;\uae30\uc220&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":828,"url":"https:\/\/blog.box.kr\/?p=828","url_meta":{"origin":842,"position":2},"title":"[scrap]Top 25 Best Linux Performance Monitoring and Debugging Tools","date":"2015-05-20","format":false,"excerpt":"http:\/\/www.thegeekstuff.com\/2011\/12\/linux-performance-monitoring-tools\/ I\u2019ve compiled 25 performance monitoring and debugging tools that will be helpful when you are working on Linux environment. This list is not comprehensive or authoritative by any means. However this list has enough tools for you to play around and pick the one that is suitable your specific\u2026","rel":"","context":"In &quot;\uae30\uc220&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":838,"url":"https:\/\/blog.box.kr\/?p=838","url_meta":{"origin":842,"position":3},"title":"[scrap]Server and Storage I\/O Benchmarking and Performance Resources","date":"2015-05-20","format":false,"excerpt":"http:\/\/java.dzone.com\/articles\/server-and-storage-io?utm_content=bufferf3e10&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer \u00a0 The following are a list of various articles, tips, post and other resources about\u00a0server storage\u00a0I\/O performance\u00a0benchmarking for legacy, virtual, cloud and software defined environments along with associated tools. 
The best server and storage I\/O\u00a0(input\/output operation) is the one that you do not have to do, the second best\u2026","rel":"","context":"In &quot;\uae30\uc220&quot;","img":{"alt_text":"server storage I\/O locality of reference","src":"https:\/\/i0.wp.com\/storageio.com\/images\/SIO_IndustryTrends_Locality.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":813,"url":"https:\/\/blog.box.kr\/?p=813","url_meta":{"origin":842,"position":4},"title":"[scrap]How to Monitor MySQL Replication?","date":"2015-05-18","format":false,"excerpt":"Just setting up MySQL replication is not enough, you would need to periodically monitor your slaves to ensure they continue to work seamlessly. Here is a basic overview of the Slave variables to monitor and the tools that will help you monitor those with ease. Top variables to monitor on\u2026","rel":"","context":"In &quot;DB\uad00\ub828&quot;","img":{"alt_text":"how to monitor mysql replication","src":"https:\/\/i0.wp.com\/blog.webyog.com\/wp-content\/uploads\/2012\/11\/how-to-monitor-mysql-replication.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":947,"url":"https:\/\/blog.box.kr\/?p=947","url_meta":{"origin":842,"position":5},"title":"System Monitoring command","date":"2015-07-30","format":false,"excerpt":"1. OS\/system $\u00a0vmstat 2 10 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \/\/\u00a010 system resource status to every 2 sec. $ iostat 2 10 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0\/\/ 10 I\/O status to every 2 sec. 
$\u00a0sar\u00a02 10 \u00a0 \u00a0 \u00a0\u2026","rel":"","context":"In &quot;\uae30\uc220&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/blog.box.kr\/index.php?rest_route=\/wp\/v2\/posts\/842"}],"collection":[{"href":"https:\/\/blog.box.kr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.box.kr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.box.kr\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.box.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=842"}],"version-history":[{"count":0,"href":"https:\/\/blog.box.kr\/index.php?rest_route=\/wp\/v2\/posts\/842\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.box.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=842"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.box.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=842"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.box.kr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=842"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}