Currently, if list_nodes fails in the define-nagios-hosts.py
script, the entire script fails with an unspecified error.
This change updates the script to catch and report any
exceptions that occur.
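A minimal sketch of the kind of handling described above, not the
script's actual code. The lookup is passed in as a callable so the
example runs without a cluster; the name get_node_names, the message
format, and the exit code are all assumptions.

```python
import sys


def get_node_names(list_nodes):
    """Return node names, or exit with a readable message on failure.

    `list_nodes` is a hypothetical callable standing in for the real
    lookup; it is assumed to return an iterable of node names.
    """
    try:
        return list(list_nodes())
    except Exception as err:
        # Report what went wrong instead of dying with a bare traceback.
        print("Unable to list nodes: {}: {}".format(type(err).__name__, err))
        sys.exit(1)
```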
Change-Id: I0e33f47af8ad8f69f2f1e4a5b377d0e31d0c0819
The issue was that a successful response from Prometheus did not
trigger an exit from the retry loop. Now a successful query breaks
out of the while retry loop and the check exits successfully.
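The fix can be pictured with the sketch below. It is illustrative
only: query is a hypothetical callable that returns a response or
raises, and the function name, retry count, and return shape are
assumptions rather than the plugin's real interface.

```python
import time


def query_with_retry(query, retries=3, delay=0.0):
    """Call `query` until it succeeds or `retries` attempts are used.

    Returns (True, response) on success and (False, last_error) otherwise.
    """
    attempt = 0
    succeeded = False
    response = None
    last_error = None
    while attempt < retries:
        attempt += 1
        try:
            response = query()
            succeeded = True
            break  # a successful response must leave the retry loop
        except Exception as err:
            last_error = err
            time.sleep(delay)
    if succeeded:
        return True, response
    return False, last_error
```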
Change-Id: I528c1c17d2131256097cac5a67ec7ea17541c685
Added a retry around the Nagios request commands. Updated the code
based on comments and feedback.
Change-Id: I24588c112e2b5ec954f857550bda7d78bdf6d03e
Added support for talking to TLS-enabled Prometheus and
Elasticsearch by passing the CA cert to the request object.
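As a rough illustration, assuming the plugins use the requests
library: requests accepts a CA bundle path through its verify
argument. The helper name below is hypothetical and only builds the
keyword arguments, so it runs without a network or the library.

```python
def build_request_kwargs(ca_cert_path=None):
    """Build keyword arguments for a `requests` call.

    With a CA cert path, `verify` points at that bundle; without one,
    `verify=True` keeps the default trust store.
    """
    return {"verify": ca_cert_path if ca_cert_path else True}
```

A caller would then pass these along, for example
requests.get(url, **build_request_kwargs(ca_path)), where url and
ca_path are supplied by the plugin's own configuration.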
Change-Id: I0616b3e5d251cc6c9cd3cc28bc44977ff5164b3c
Updated the default timeout from 30 sec to 120 sec because the larger
query was taking longer than 30 sec, resulting in UNKNOWN alerts.
Updated the service checks from "unknown" to "warning" since
the unknown alerts were causing issues with ticketing.
Change-Id: I65919207be8b5422ffb13f3d1ccfff0323f23168
The Nagios API password was being revealed in several
error messages. This uses a regex to remove the password
and replace it with a placeholder to avoid exposing it.
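A minimal sketch of this kind of redaction. It assumes the password
appears in the message as part of a URL of the form
scheme://user:password@host; the pattern, the placeholder text, and
the function name are assumptions, not the plugin's actual values.

```python
import re

# Matches "://user:" and the "@" around a password embedded in a URL.
_CREDENTIALS = re.compile(r"(://[^:/@\s]+:)[^@\s]+(@)")


def redact_password(message, placeholder="****"):
    """Replace a URL-embedded password in `message` with a placeholder."""
    return _CREDENTIALS.sub(
        lambda m: m.group(1) + placeholder + m.group(2), message
    )
```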
Change-Id: I8771cbc3127edba47dff8db5de0990659d4d2b49
Updated the script so that it works with multiple storage classes.
Also corrected the script's behavior at certain failure points.
Change-Id: Ic3d7c6b4877fc5ce4e1ce3b58b05bb7b138b0c80
The current copyright refers to a non-existent group
"openstack helm authors" with often out-of-date references that
are confusing when adding a new file to the repo.
This change removes all references to this copyright by the
non-existent group and any blank lines underneath.
Change-Id: Ic78d29883364378cc14b11402f16d99dcec1fc96
This commit aligns the Nagios plugins with Python 3.
Since dict.iteritems() was removed in Python 3 and replaced
by dict.items(), the plugins have to be changed accordingly.
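For illustration, the change looks like the snippet below. The
labels dictionary is made-up sample data, not taken from the plugins.

```python
# Hypothetical sample data for illustration only.
labels = {"severity": "warning", "alertname": "NodeDown"}

# labels.iteritems() exists only on Python 2 and raises AttributeError
# on Python 3. dict.items() works on both; on Python 3 it returns a view.
pairs = sorted("{}={}".format(key, value) for key, value in labels.items())
```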
Change-Id: I782f90a91e8dadd959c4d8537a80c44180c0b78d
This commit changes the condition in the define-host plugin to
prevent it from failing with "KeyError: 'NODE_DOMAIN'" when that
environment variable is not set.
Change-Id: I030e3f01ca9d25f3946fd621635f422d3278f21e
This fixes a typo in the Nagios define-host plugin that
would append a domain name to the hostname.
Change-Id: I62c3eca27d3ced28d2abe18d75bb6e71889c8ee3
The Nagios-to-Prometheus connection sometimes
takes longer than 20 seconds, so this extends
the timeout to 40 seconds.
Change-Id: I20b8d47cbc8eeb08d93bf902f922f8bbf8769839
This adds logic to check whether an environment variable has been
set for the domain name and, if so, appends it to the host name
so that the FQDN appears in the Nagios dashboard.
If the domain name has not been set, only the host name is
given, as in the previous behavior.
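A minimal sketch of that logic. The variable name NODE_DOMAIN is
taken from the KeyError reported against this plugin elsewhere in
this log; the function name and the env parameter are assumptions
added so the example can run without touching the real environment.

```python
import os


def nagios_host_name(host_name, env=None):
    """Return the FQDN when a domain is configured, else the bare name."""
    env = os.environ if env is None else env
    # .get() returns None instead of raising KeyError when unset.
    domain = env.get("NODE_DOMAIN")
    if domain:
        return "{}.{}".format(host_name, domain)
    return host_name
```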
Change-Id: Id42edb073d4701ddb61f4957af7e5ac5f931dfbf
The plugin script used [hits][hits] for checking the total
result count. It has been updated to use the right field.
Change-Id: I371302bc24e59320a59bd815922d41e387e23e3a
Adds the script, which reports a status and a message.
This helps the calling plugins check the status.
Change-Id: I17f7db72240dd53513064f5180f5914c9c638ed5
This updates the Nagios plugin for checking health metrics exposed
by an exporter endpoint directly. To address issues where multiple
exporter replicas may not all be active, this moves the plugin to
instead use the python kubernetes client to programmatically
determine which endpoint tied to the exporter service is active
and returns the metrics exposed by the active endpoint. This
allows for more robust checking of service health in scenarios
where circumventing Prometheus is desired.
Change-Id: I14e21936d1808a4f41b20368451da95100075dda
Signed-off-by: Steve Wilkerson <sw5822@att.com>
Fixes include type validation, index field changes, counts,
and more meaningful details in the critical messages.
Change-Id: Ib8d8a87be4e0526378aa04ccd8ff5631805adfeb
The plugin previously triggered Critical for every severity fired by
the Prometheus alerts.
The code now handles Prometheus alerts with severity=warning
as well as severity=page.
This should help with alarm tuning in Prometheus and Nagios.
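One way to picture the new behavior is the sketch below. The exit
codes are the standard Nagios plugin return codes. The mapping of
severity=warning to WARNING and severity=page to CRITICAL, the
fallback for other severities, and the alert dictionary shape are
assumptions for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed mapping from Prometheus alert severity to Nagios state.
SEVERITY_TO_STATE = {"warning": WARNING, "page": CRITICAL}


def nagios_state(alerts):
    """Return the worst Nagios state across the firing alerts.

    Each alert is assumed to be a dict carrying a "severity" label.
    Unmapped severities fall back to CRITICAL, mirroring the earlier
    always-critical behavior (an assumption).
    """
    if not alerts:
        return OK
    return max(
        SEVERITY_TO_STATE.get(alert.get("severity"), CRITICAL)
        for alert in alerts
    )
```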
Change-Id: I89c1880ab05b896590391db611354b069ade363a
This script previously reported OK for null values in the
metrics dictionary variable.
Edited the script to handle a null "metrics" dictionary.
Also updated the indentation.
Change-Id: I760d6ac4fc5341361d064a8a15f6e44287d48f40
This renames the check_update_prometheus_hosts plugin to be more
representative of what the current functionality does, which is
to simply define nagios hosts. This also updates the behavior of
the plugin to no longer force a reload of nagios via a hangup
signal when attempting to update the hosts file. The result is a
significant reduction in the logs output by the Nagios service,
which will better enable tracking history of service checks and
hosts.
Instead of this plugin being run as a recurring check, it can now
be run as an init container for the Nagios pod so Nagios has a
comprehensive list of its hosts and host groups before starting
the service.
Change-Id: Ife2cdf2112db3798dbde73bafe436ef3c0c8a870
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This updates the plugin responsible for defining Nagios's hosts
and hostgroups to use the Kubernetes python client instead of
querying Prometheus for this information. This results in a more
predictable and reliable list of hosts for Nagios to use, as
querying Prometheus for scalar metrics in a point-in-time could
result in a host not being added correctly in scenarios where
a host is down when Nagios is attempting to query Prometheus to
generate the list of hosts.
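A rough sketch of the rendering half of such a plugin. In practice
the (name, address) pairs would come from the Kubernetes python
client (for example from CoreV1Api.list_node()); here they are passed
in directly so the example runs without a cluster. The function name
is hypothetical, and a real host definition would need further
directives or a template beyond the two shown.

```python
def render_host_definitions(nodes):
    """Render Nagios host objects for an iterable of (name, address) pairs."""
    lines = []
    for name, address in sorted(nodes):
        lines.extend([
            "define host {",
            "    host_name " + name,
            "    address " + address,
            "}",
        ])
    return "\n".join(lines) + "\n"
```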
Change-Id: I962696eac7c9cc94650666a1d3a60c610d1ae867
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This makes the entrypoint script and the plugins included with
the Nagios image executable.
Change-Id: Iaeb2fad62ac213b74637dadc329e7ea304602ab8
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This adds the Prometheus-aware Nagios Core 4 image built for
openstack-helm-infra to the openstack-helm-images repository.
Change-Id: Icd7bcdee59f1dc719d0dc5e7517294ac922f680e
Signed-off-by: Steve Wilkerson <sw5822@att.com>