Monitoring – the pitfalls that no one wants to talk about

Over the years of having to deal with various monitoring systems, I have learned a lesson or two.  I am sharing some of them in the hope that you can avoid the pitfalls and end up with fewer scars than I have.

  1. Every alert needs to be actionable.
  2. False positives will quickly overload the team and drive up OPEX like no other operational line item.
  3. Hard-coded thresholds are a maintenance nightmare and require dedicated staff to maintain them.
  4. Event and alert naming is crucial.  Each name needs to include the data center, device name, a unique identifier, and a brief human-readable description.  Ideally it also includes a link to a runbook and a reference to the automation that did / didn’t catch the issue.
  5. To ease troubleshooting, all monitoring systems need to use the same time zone (UTC is recommended).
  6. As of October 2013, I have not seen any commercial solution that works properly.  In fact, there are quite a few commercial monitoring technologies that just do not work the moment you move beyond the basics.  To validate them, ask the vendor to show you their test case matrix, especially for storage devices.
  7. Most engineers will avoid working on the monitoring definitions because they don’t see the value and, based on their experience, expect it to create more work without helping them.  As such, you will need a strong automation capability / mindset / understanding in the team in advance in order to keep things under control.
  8. People do not like to wake other people up in the middle of the night and therefore will avoid it.
  9. Most people do not answer their phones when called the first time.  Based on my experience, only 30% will answer on the first call.  So use an automated system to notify people and don’t rely on the on-call engineers to call other people.
  10. Most people require approximately 7 minutes to wake up when called.
  11. Calling the on-call people for trivial things really irritates them.  As such, everything possible needs to be done to minimize trivial notifications.
  12. The on-call rotation needs a clean handover from the previous on-call rotation.  In my experience, handing over a physical item helps with the handover.
  13. Contact lists and on-call rosters ensure that the appropriate roles are contacted when needed.  These lists need to be easily accessible from multiple locations; my recommendation is at least 5.  Each list needs to contain at least the name, subject matter, contact details, and the primary and secondary on-call roster.
  14. Predefining escalation criteria is often overlooked, and this delays getting the correct people onto the issue.
  15. Averaging will skew your metrics because it hides the high and low outliers, and those outliers are often where the issues are.
  16. What will be monitoring the monitoring system?  This is almost always overlooked, and it is critical to know when elements of the monitoring system have failed.  This is one of the reasons why I do not believe in a single monolithic monitoring system that a vendor claims will solve all monitoring problems.
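
To make point 15 concrete, here is a minimal Python sketch of how an average can hide outliers that a percentile exposes.  The latency samples, the 100 ms threshold, and the `summarize` helper are all hypothetical and chosen only for illustration; they do not come from any particular monitoring product.

```python
import statistics


def summarize(samples):
    """Return the mean, median (p50), 99th percentile, and max of the samples."""
    ordered = sorted(samples)

    def percentile(p):
        # Nearest-rank percentile using integer math to avoid float rounding.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Hypothetical latency samples in milliseconds:
# 98 fast requests and 2 requests that took three seconds.
latencies_ms = [20] * 98 + [3000] * 2
summary = summarize(latencies_ms)

# Hypothetical alert threshold, for illustration only.
THRESHOLD_MS = 100
alert_on_mean = summary["mean"] > THRESHOLD_MS
alert_on_p99 = summary["p99"] > THRESHOLD_MS

print(summary)
print("alert on mean:", alert_on_mean)
print("alert on p99:", alert_on_p99)
```

With this made-up data the mean stays below the threshold, so an alert based on the average would stay silent, while the 99th percentile and the maximum both show the three-second requests.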








© 2008-2021 Gavin McMurdo aka SparkPilot All Rights Reserved