Metrics for Measuring Your Service Level Agreements
One of the buzzwords we constantly come across when answering PRTG requests is “SLA Reporting”. To keep up with demand, one of our partners created a PRTG plugin for SLA monitoring and back in March of this year, my colleague Sascha wrote a blog post on it. But what exactly is an SLA, when is it required and what does it have to do with monitoring?
First things first, SLA is an abbreviation for service-level agreement. These agreements are usually made between a (service) provider and their customers and contain the details on what services will be provided and how the stability of the services can be ensured.
And whenever you hear that something needs to be guaranteed, you will of course want to keep an eye on it, or in other words, monitor it. Common metrics for SLAs are the mean time between/to failures, the mean time to repair/recovery, and uptime. To understand SLA monitoring a bit better, let’s dive into what these numbers are and how they differ:
The first metric measures how much time elapses, on average, before a failure occurs. If the system can be repaired, the metric is referred to as “mean time between failures”, since we have to be realistic and expect more than one failure. If we are referring to something that cannot be salvaged after a failure, we call it “mean time to failure” (Oh, and in case you were wondering: yes, my source is Wikipedia).
Once a failure has occurred, the goal is to get everything back up and running as fast as possible. The average time that elapses between the start of an outage and the moment everything is back up and running is the mean time to repair. Ideally, this number is as low as possible.
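To make the three numbers a bit more tangible, here is a minimal Python sketch that derives them from a list of outages in a reporting period. It is only an illustration: the function name `sla_metrics` and the input format are made up for this example and have nothing to do with PRTG or the plugin, and it assumes one common convention, namely that mean time between failures is the total uptime divided by the number of failures and mean time to repair is the total downtime divided by the number of failures.

```python
from datetime import datetime, timedelta


def sla_metrics(period_start, period_end, outages):
    """Compute MTBF, MTTR and uptime for one reporting period.

    Hypothetical helper for illustration only.
    `outages` is a list of (down_at, up_at) datetime pairs that are
    assumed to lie inside the period and not to overlap.
    """
    total = period_end - period_start
    # Sum the length of every outage; start the sum from a zero timedelta.
    downtime = sum((up - down for down, up in outages), timedelta())
    uptime = total - downtime
    failures = len(outages)

    return {
        # Assumed convention: average uptime per failure.
        "mtbf": uptime / failures if failures else None,
        # Assumed convention: average downtime per failure.
        "mttr": downtime / failures if failures else None,
        # Dividing two timedeltas yields a plain float ratio.
        "uptime_pct": 100.0 * (uptime / total),
    }


if __name__ == "__main__":
    metrics = sla_metrics(
        datetime(2024, 1, 1),
        datetime(2024, 1, 31),
        [
            (datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 11)),  # 1 hour down
            (datetime(2024, 1, 20, 0), datetime(2024, 1, 20, 3)),  # 3 hours down
        ],
    )
    print("MTBF:", metrics["mtbf"])
    print("MTTR:", metrics["mttr"])
    print("Uptime: {:.2f} %".format(metrics["uptime_pct"]))
```

In this made-up 30-day period (720 hours) with two outages totalling 4 hours, the mean time between failures works out to 358 hours, the mean time to repair to 2 hours, and the uptime to roughly 99.44 percent. An actual SLA may define the measurement window or the handling of planned maintenance differently, so treat the formulas as a starting point rather than a contractual definition.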