Saving money with network monitoring? Of course!

Mon, 24/10/2016 - 11:16

Over years of supporting Linux systems on critical infrastructure, we have come to depend on thorough and reliable monitoring, one that notifies us in real time of any software or hardware malfunction in our infrastructure.
Hence the need for real-time monitoring that overcomes the many shortcomings of the most common NMS software and extends its functionality.

Presentation transcript:

Recovery using network monitoring

The processes that lead to a remote or on-site intervention can be time-consuming and wasteful for our company, resulting in lost time and sunk costs.
To improve efficiency, we need to analyze the steps that lead to an intervention after a service disruption.
In particular, we compare:

Standard process (without monitoring)
Monitoring with notification

As the diagrams in the slides show, detection and notification move the whole evaluation phase into the customer's uptime, which shortens the downtime.
Furthermore, by alerting support when predefined warning thresholds are exceeded, monitoring lets us intervene proactively, before the outage, without any disruption for the customer.
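
The idea of a warning threshold that fires before the critical one can be sketched in a few lines of Python. This is only an illustration, not the monitoring software's own logic: the disk-usage readings and the 80% / 95% thresholds are assumptions made up for the example, while the numeric states follow the exit-code convention used by Nagios-compatible plugins (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from enum import IntEnum


class Status(IntEnum):
    """Check states, numbered like Nagios-compatible plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify(value: float, warning: float, critical: float) -> Status:
    """Classify a metric where higher values are worse (e.g. disk usage %)."""
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.OK


# Assumed thresholds and hypothetical readings, for illustration only.
WARNING_AT = 80.0
CRITICAL_AT = 95.0

for reading in (42.0, 85.0, 97.0):
    state = classify(reading, WARNING_AT, CRITICAL_AT)
    print(f"disk usage {reading:.0f}% -> {state.name}")
```

The WARNING state is the window in which support can act while the customer is still working; CRITICAL is the point at which the service is at risk of stopping.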

Why does the customer always notice first?

Common limitations in monitoring software

Most monitoring systems suffer from problems such as:

Hard to implement: Complex command-line configurations
Little flexibility in the supported services and metrics
Little or no ability to extend supported functions and software and poor availability of plugins
Inability to apply recursive rules and templates
High cost of software or configuration

What is the impact on our company?

Reputational damage (difficult to quantify)
Inefficiency (time spent on diagnostics)
Loss of work

How much does an outage cost us?

The cost is calculated by multiplying the number of users, the hourly cost and the hours of downtime.
Example: 50 users * 28.3 €/hour * 2 hours = 2,830 €*
*Average Italian hourly cost in 2015 (source: Eurostat)
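
The same arithmetic as a small Python sketch, handy for plugging in your own figures. The function and parameter names are ours, chosen for the example; the numbers are the ones from the slide.

```python
def downtime_cost(users: int, hourly_cost_eur: float, hours_down: float) -> float:
    """Cost of an outage: users affected * hourly cost * hours of downtime."""
    return users * hourly_cost_eur * hours_down


# Figures from the example above: 50 users, 28.3 EUR/hour, 2 hours of downtime.
cost = downtime_cost(users=50, hourly_cost_eur=28.3, hours_down=2)
print(f"Estimated cost of the outage: {cost:,.2f} EUR")
```

Since the three factors simply multiply, halving the hours of downtime halves the cost, which is where earlier detection pays off.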

Nagios

Open-source, plugin-based monitoring with built-in email notification when a failure occurs and a centralized view of the infrastructure status

Nagios Core vs XI

The paid version of Nagios, called Nagios XI, offers advanced graphs, configuration wizards, host editing from the web interface, and bulk editing of multiple hosts

Best features

Open-source plugins through which virtually any device or service can be monitored.
Nagios Exchange contains thousands of ready-to-use plugins

Biggest issue

Hard to define objects:

Host
Check
Thresholds
Template
Plugins
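
To give a feel for how these pieces depend on each other, here is a sketch in Python that renders two Nagios-style object definitions as text. It is an illustration under assumptions, not a complete configuration: the host name `web01` and the address are hypothetical, and the `linux-server` and `generic-service` templates and the `check_ping` command are assumed to exist in the target setup, as they do in the sample configuration shipped with Nagios Core.

```python
def render_object(kind: str, directives: dict) -> str:
    """Render one 'define <kind> { ... }' block in Nagios object syntax."""
    width = max(len(name) for name in directives)
    lines = [f"define {kind} {{"]
    lines += [f"    {name.ljust(width)}  {value}" for name, value in directives.items()]
    lines.append("}")
    return "\n".join(lines)


# Host: inherits its defaults from a template (assumed to be defined elsewhere).
host = render_object("host", {
    "use": "linux-server",
    "host_name": "web01",        # hypothetical host
    "address": "192.0.2.10",     # documentation-range address
})

# Check: ties a plugin (check_ping) to the host, with thresholds passed as
# arguments in the form round-trip-time-ms,packet-loss%.
warning, critical = "100.0,20%", "500.0,60%"
service = render_object("service", {
    "use": "generic-service",
    "host_name": "web01",
    "service_description": "PING",
    "check_command": f"check_ping!{warning}!{critical}",
})

print(host)
print()
print(service)
```

Even this minimal case touches all five concepts (host, check, thresholds, template, plugin), and each one has to be written consistently by hand across the configuration files.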

Evaluation:

17 years after its release, Nagios has proven to be one of the best network monitoring tools in history, but...

Is it still suitable for our times?
Is it ready for the challenges of tomorrow?
Is it possible that nothing has changed since then?

What has changed?

Multiplying value of data and systems

Cloud: Private/Public/Hybrid
Increasingly complex infrastructures

Increasing criticality of services

24/7 usage
High reliability
Geographical distribution

Automation of resource allocation

Instant deployment
"Disposable" instances

In essence:

More services
More reliability
More complexity
Less time to manage them

In addition, there is a widespread lack of awareness of this criticality and complexity:
...we've always done it this way!
...it's always worked, now it doesn't work anymore!
...how do you not know this?
...no one warned me that I had to change the tape!
...for what we need it for, it's already more than enough!

How to survive?

Granular monitoring of applications and systems
Real-time reporting and collaboration among the teams involved
Integration into the development and testing process
Detailed and visual reporting
Flexibility and reliability of monitoring software

In a word? Icinga2

Simple interface
Responsive
Team integration
- Slack notifications, with team chat, invitation management, and bots
Advanced notifications
- Multiple notification channels: mobile, mobile apps, email, SMS
Advanced performance graphs
- Time-range selection, comparison between multiple hosts and services...
Infrastructure
- Based on Docker containers
- Multi-master on international high-reliability cloud servers
- Open-source software designed for high reliability in clusters
- Encrypted communications between nodes, zones and clusters
- Speed: 10x faster than Nagios + Gearman (benchmark on 1,000,000 service checks)
- To date, 200 checks available on our infrastructure; compatible with Nagios plugins, of which over 3,000 are available
- Results stored in MySQL or PostgreSQL
- Deployment of hosts and services via Ansible and the API
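
As an illustration of what creating a host through the API can look like, the sketch below builds (but does not send) a request against the Icinga 2 REST API, which creates objects with a PUT on `/v1/objects/hosts/<name>`. This is not our deployment code: the server URL, host name, address and credentials are placeholders, and the `generic-host` template and `hostalive` check command are assumed to be available on the target instance.

```python
import base64
import json
import urllib.request


def build_create_host_request(base_url, name, address, user, password):
    """Build the PUT request that asks Icinga 2 to create a host object."""
    body = {
        "templates": ["generic-host"],   # assumed to exist on the server
        "attrs": {"address": address, "check_command": "hostalive"},
    }
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/v1/objects/hosts/{name}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
    )


# Placeholder values: replace with a real endpoint and API user before sending.
request = build_create_host_request(
    "https://icinga.example.org:5665", "web01", "192.0.2.10", "api-user", "change-me"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it; in practice the same payload can be generated from an Ansible inventory, so that disposable instances register themselves as soon as they are deployed.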

After evaluating the available black-box and white-box monitoring solutions, our choice is Icinga2 and Grafana.
At this link you can find the network monitoring we use for all our clients.

Software used:

Debian
Docker
Ansible
REST API
Graylog
Logstash
GitLab
Grafana

Share our passion for open source too!

Thanks

Special thanks to those who made this event possible and to those who supported the development and testing of the infrastructure

Relug
CNA Digitale
Particles

Business hour