Saving money with network monitoring? Of course!
Over years of supporting Linux systems on critical infrastructure, we came to need a thorough and reliable monitoring system, one that gives us real-time notification of any software or hardware malfunction in our infrastructure.
Hence the need to adopt real-time monitoring that overcomes the many shortcomings of the most common NMS software and extends its functionality.
Presentation transcript:
Recovery using network monitoring
The processes that lead to remote or on-site intervention can be time-consuming and wasteful for our company, resulting in lost time and unrecoverable costs.
In order to improve efficiency, it is necessary to analyze the steps that lead to service disruption interventions.
In particular, we compare:
- Standard process (without monitoring)
- Monitoring with notification
As the diagrams in the slides show, detection and notification move the whole evaluation phase into the customer's uptime, which reduces downtime.
Furthermore, monitoring makes it possible to intervene proactively, before an outage occurs and without disrupting the customer, by alerting support when predetermined warning thresholds are exceeded.
Why does the customer always notice first?
Common limitations in monitoring software
Most monitoring systems suffer from problems such as:
- Hard to implement: Complex command-line configurations
- Low flexibility on supported services and metrics
- Little or no ability to extend supported functions and software and poor availability of plugins
- Inability to apply recursive rules and templates
- High cost of software or configuration
What is the impact on our company?
- Reputational damage (difficult to quantify)
- Inefficiency (time spent on diagnostics)
- Lost work
How much does an outage cost us?
The cost is calculated by multiplying the number of users, the hourly cost, and the hours of downtime.
Example: 50 users * 28.3 €/hour * 2 hours = 2,830 €*
*Average Italian hourly cost, 2015 (source: Eurostat)
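The same arithmetic can be written as a small Python sketch. The function name `downtime_cost` and its parameter names are ours, chosen for illustration; the numbers are the ones from the slide.

```python
def downtime_cost(users, hourly_cost_eur, hours_down):
    """Cost of an outage: users affected * hourly cost * hours of downtime."""
    return users * hourly_cost_eur * hours_down


# Values from the slide: 50 users, 28.3 EUR/hour, 2 hours of downtime.
cost = downtime_cost(users=50, hourly_cost_eur=28.3, hours_down=2)
print(f"Outage cost: {cost:.2f} EUR")
```

Passing a shorter downtime to the same function gives a rough idea of what earlier detection is worth.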
Nagios
Open-source, plugin-based monitoring, with an integrated email notification system for failures and a centralized view of infrastructure status
Nagios Core vs XI
The paid version of Nagios, called Nagios XI, offers advanced graphs, configuration wizards, web-based host editing, and bulk editing of multiple hosts
Best features
Open source plugins through which virtually any device or service can be monitored.
Nagios Exchange contains thousands of ready-to-use plugins
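To give an idea of what such a plugin looks like, here is a minimal sketch in Python. It follows the usual plugin convention: the exit code carries the state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and the plugin prints one status line, optionally followed by performance data after a `|`. The check itself (disk usage), the function name `check_disk` and the 80/90 thresholds are assumptions made for this example, not something shown in the presentation.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (exit_code, status_line) for the disk usage of `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"

    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= crit_pct:
        code = CRITICAL
    elif used_pct >= warn_pct:
        code = WARNING
    else:
        code = OK

    # Status text, then performance data: label=value[unit];warn;crit;min;max
    line = (
        f"DISK {STATE_NAMES[code]} - {used_pct:.1f}% used on {path}"
        f" | used_pct={used_pct:.1f}%;{warn_pct};{crit_pct};0;100"
    )
    return code, line


code, line = check_disk("/")
print(line)
# A real plugin would end with sys.exit(code) so the scheduler sees the state.
```

The warning threshold is what allows support to be alerted before the disk actually fills up.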
Biggest issue
Hard to define objects:
- Host
- Check
- Thresholds
- Template
- Plugins
Evaluation:
17 years after its release, Nagios remains one of the best network monitoring tools in history, but...
- Is it still suitable for our times?
- Is it ready for the challenges of tomorrow?
- Is it possible that nothing has changed since then?
What has changed?
Growing value of data and systems
- Cloud: Private/Public/Hybrid
- Increasingly complex infrastructures
Increasing criticality of services
- 24/7 usage
- High reliability
- Geographical distribution
Automation of resource allocation
- Instant deployment
- "Disposable" instances
In essence:
- More services
- More reliability
- More complexity
- Less time to manage them
In addition, criticality and complexity are widely underestimated:
...we've always done it this way!
...it's always worked, now it doesn't work anymore!
...how do you not know this?
...no one warned me that I had to change the tape!
...for what we have to do with it, it's already too much!
How to survive?
- Granular monitoring of applications and systems
- Real-time reporting and collaboration among the teams involved
- Integration into the development and testing process
- Detailed and visual reporting
- Flexibility and reliability of monitoring software
In a word? Icinga2
- Simple interface
- Responsive
- Team integration
- Slack notifications, with team chat, invitation management, and bots
- Advanced notifications
- Multiple notification channels: mobile apps, email, SMS
- Advanced performance graphs
- Range selection, comparison between multiple hosts and services...
- Infrastructure
- Based on Docker containers
- Multi-master setup on international high-reliability cloud servers
- Open-source software designed for high reliability in clusters
- Encrypted communications between nodes, zones and clusters
- Speed: 10x faster than Nagios + Gearman (benchmark on 1,000,000 service checks)
- To date, 200 checks available on our infrastructure; compatible with Nagios plugins, of which over 3000 are available
- Results stored in MySQL or PostgreSQL
- Deployment of hosts and services via Ansible and the API
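As an illustration of the API point, here is a hedged sketch of how a host could be registered through the Icinga 2 REST API, which accepts a PUT on `/v1/objects/hosts/<name>` with a JSON body. The server URL, credentials, host name, address and the `generic-host` template are placeholders we made up for the example; the request is only built and printed, not sent.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_create_host_request(base_url, user, password, name, address):
    """Build (but do not send) a PUT request that creates a host object."""
    body = {
        "templates": ["generic-host"],  # assumed template name
        "attrs": {"address": address, "check_command": "hostalive"},
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/v1/objects/hosts/{urllib.parse.quote(name, safe='')}",
        data=json.dumps(body).encode(),
        method="PUT",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values, for illustration only.
req = build_create_host_request(
    "https://icinga.example.org:5665", "apiuser", "secret", "web01", "192.0.2.10"
)
print(req.get_method(), req.full_url)
print(req.data.decode())
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` with a TLS context that trusts the Icinga 2 certificate; the same call can be driven from an Ansible playbook.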
After evaluating the proposed black-box and white-box monitoring solutions, our choice is Icinga2 and Grafana
At this link you can find the network monitoring setup we use for all our clients
Software used:
- Debian
- Docker
- Ansible
- REST API
- Graylog
- Logstash
- GitLab
- Grafana
Share our passion for open source too!
Thanks
A special thanks to those who made this event possible and those who supported the infrastructure development and testing