Slow Monitoring System

Monitoring Tuning

Background

A little background first. I'm using two different tools for monitoring:

Opsview: a configuration tool for Nagios. It also generates some graphs of response times etc., which is nice.
Observium: an awesome and easily configured tool for SNMP-enabled hosts. It really surprised me with the range of devices it detected:
- pfSense
- NAS4Free
- Cisco, Brocade, the usual stuff
- Huawei gear (not always supported out of the box by such tools)
- MikroTik (awesome)
- my printer (misdetected as HP)
- not working yet: Snom phones (it would be cool to graph the missed, received and dialed calls)

When the problems started

At first I was fully happy with the setup. The monitoring system is a VMware VM on an ESXi host with a dual-socket Opteron (16 cores total) and 64 GB RAM. The VM had something in the range of 2 GB RAM (never fully used) and ran on a hardware RAID 1 of some old 1 TB drives. It worked fine with up to 100 services and 30+ hosts in Nagios and about 20 machines in Observium.

Then I decided to add most of the routers and WiFi access points of the local American coffee chain, where I am contracted through a third party to operate the WiFi (it is publicly known that we run it), so the load increased. Heavily. The system got so clogged that the graphs lost half their measurements. I want proper graphs and not something I could comb even MY long hair with.

Procedure

First attempt

The first attempt was to temporarily move the VM over to a dedicated box. A dual-core Core 2 Duo with a single 2 TB drive, running ESXi, was all I had available. Unfortunately this did not really help. We still got a crazy number of downtime messages even though I could ping the affected hosts fine. I checked our Internet line (the whole setup runs from our office), but we were using less than 1 Mbit/s of upload, and the majority of the monitoring traffic goes through a VPN tunnel opened directly from the monitoring server. So neither the router's CPU nor its "Max Connections per Second" limit nor anything similar was the culprit.

Second attempt

So we added an SSD. Unfortunately the bare-metal recovery we tried with R1Soft did not really work out, so we had to use a regular hard disk (at least as large as the old VM's disk) in combination with the SSD.

The whole system was restored to a 250 GB hard disk (notebook format for now, because no shop was open at that time), and surprise: this was a successful V2P migration, with a VMware VM as the source and a native install as the restore target. KUDOS to R1Soft / Idera for actually being capable of this with no changes to the installed Linux. I did not even have to delete /etc/udev/rules.d/70-persistent-net.rules, where I would normally have to remove the old MAC addresses. It seems R1Soft takes care of this somehow. I'm really impressed; this is the 3rd bare-metal restore that actually WORKS.

After the restore we formatted the SSD with ext4, mounted it at /ssd and moved the following folders onto it:

SSD enabled folders:

/var/lib/mysql
/opt/observium
/usr/local/nagios

I simply stopped all services, moved each folder and created a symlink back at its old path. This is the easiest way, with no config file to adjust.
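The move-and-symlink step can be sketched in Python as below. This is only an illustration and not the exact commands I ran: the helper name relocate_with_symlink and the flat target layout (/var/lib/mysql ending up as /ssd/mysql) are assumptions, and the demo runs on a throwaway temp directory rather than the real paths. Services must be stopped before doing this for real.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(src, ssd_root):
    """Move the folder src below ssd_root and leave a symlink at the old path.

    Hypothetical helper; assumes the flat layout <ssd_root>/<folder name>.
    """
    src = os.path.abspath(src)
    dst = os.path.join(ssd_root, os.path.basename(src))
    if os.path.islink(src):
        raise ValueError(f"{src} is already a symlink")
    if os.path.exists(dst):
        raise FileExistsError(f"{dst} already exists")
    # Across filesystems shutil.move copies and then deletes the source.
    # File ownership may not survive such a copy, so check it afterwards.
    shutil.move(src, dst)
    # The old path keeps working, so no config file needs to change.
    os.symlink(dst, src)
    return dst


# Dry run in a temp directory instead of the real /var/lib/mysql.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "mysql")
    ssd_root = os.path.join(tmp, "ssd")
    os.makedirs(old_path)
    os.makedirs(ssd_root)
    with open(os.path.join(old_path, "ibdata1"), "w") as fh:
        fh.write("dummy")
    new_path = relocate_with_symlink(old_path, ssd_root)
    is_link = os.path.islink(old_path)
    still_readable = os.path.isfile(os.path.join(old_path, "ibdata1"))
    print(is_link, still_readable)
```

On the real box this would be called once per folder, e.g. relocate_with_symlink("/var/lib/mysql", "/ssd"), as root and with MySQL, Observium and Nagios stopped.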

Result

Load average dropped from a typical 2-4, with peaks of 7+, to a mere 0.3. That still leaves plenty of headroom.
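To put those numbers in perspective, it helps to divide the load average by the number of cores. On Linux the load average also counts processes waiting on disk I/O, which fits the fact that faster storage brought it down. The sketch below uses the figures from above; the core count of 2 is an assumption (a dual-core box), and load_per_core is just an illustrative helper.

```python
import os


def load_per_core(load, cores):
    """Normalise a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


# Figures from the post; CORES = 2 is an assumption (dual-core machine).
CORES = 2
for label, load in [("before, typical", 4.0), ("before, peak", 7.0), ("after", 0.3)]:
    print(f"{label}: {load_per_core(load, CORES):.2f} per core")

# Live value for the machine this runs on (os.getloadavg is Unix only).
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
        cores_here = os.cpu_count() or 1
        print(f"now: {load_per_core(one_min, cores_here):.2f} per core (1 min)")
    except OSError:
        print("load average not available")
```

A value well above 1.0 per core means work is queueing up; a value far below it, like the result after the SSD move, means the box is mostly idle.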

Silvan Michael Gebhardt