Introduction
This document describes the node-exporter disk full problem noticed in a user's network.
Background
When an audit of the Cluster Manager Common Execution Environment (CEE) is performed, the audit result indicates the node-exporter disk is full.
Problem
A critical severity alert condition exists because a disk full condition is projected to occur in the next 24 hours. This alert was noticed on the CEE:
"Device /dev/sda3 of node-exporter cee03/node-exporter-4dd4a4dd4a is projected to be full within the next 24 hours"
Analysis
The alert is reported by the CEE, which tracks hardware issues for the rack, and it projects that the disk becomes full within the next 24 hours.
cisco@deployer-cm-primary:~$ kubectl get pods -A -o wide | grep node
cee03 node-exporter-4dd4a4dd4a 1/1 Running 1 111d 10.10.1.1 deployer-cm-primary <none> <none>
root@deployer-cm-primary:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 568G 171G 368G 32% /
tmpfs 64M 0 64M 0% /dev
tmpfs 189G 0 189G 0% /sys/fs/cgroup
tmpfs 189G 0 189G 0% /host/sys/fs/cgroup
/dev/sda1 9.8G 3.5G 5.9G 37% /host/root
udev 189G 0 189G 0% /host/root/dev
tmpfs 189G 0 189G 0% /host/root/dev/shm
tmpfs 38G 15M 38G 1% /host/root/run
tmpfs 5.0M 0 5.0M 0% /host/root/run/lock
/dev/sda3 71G 67G 435M 100% /host/root/var/log
When an audit is performed, it appears to fill up the /dev/sda3 disk, which is mounted on /host/root/var/log and is at 100% use.
root@deployer-cm-primary:/host/root/var/log# du -h --max-depth=1
76M ./sysstat
16K ./lost+found
4.0K ./containers
4.0K ./landscape
9.3M ./calico
1.1G ./apiserver
808K ./pods
5.6G ./journal
60G ./audit
36K ./apt
67G .
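To make the largest consumer stand out, the same du output can be sorted by size. This is a minimal sketch that assumes GNU du (for --max-depth) and uses a hypothetical helper name, largest_dirs, which is not part of any CEE tooling:

```shell
# largest_dirs DIR [COUNT]
# Hypothetical helper: list DIR and its immediate subdirectories, largest first.
largest_dirs() {
    dir="${1:-.}"
    count="${2:-5}"
    # -k reports sizes in KiB so that a plain numeric sort orders them correctly.
    # The first line is the total for DIR itself.
    du -k --max-depth=1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (path taken from the output above):
#   largest_dirs /host/root/var/log 3
```

In the output above, this ordering would place ./audit directly under the total.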
A check of the auditd configuration shows that max_log_file_action is set to keep_logs. Audit logs are therefore kept rather than overwritten, and as a result the node-exporter disk full condition is likely to occur.
cisco@deployer-cm-primary:~$ sudo cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /var/log/audit/audit.log
log_group = adm
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = keep_logs
space_left = 75
space_left_action = email
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = halt
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 60
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
cisco@deployer-cm-primary:~$
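To check only the relevant settings without reading the whole file, a small lookup can be sketched as follows. The helper name get_audit_setting is hypothetical, and the sketch assumes the simple "key = value" layout shown above:

```shell
# get_audit_setting KEY [FILE]
# Hypothetical helper: print the value of KEY from an auditd.conf-style file.
get_audit_setting() {
    key="$1"
    conf="${2:-/etc/audit/auditd.conf}"
    awk -F'=' -v k="$key" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)              # strip blanks around the key
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the value
            if (name == k) print value           # exact key match only
        }' "$conf"
}

# Example (reading the file typically requires root):
#   get_audit_setting max_log_file_action
#   get_audit_setting num_logs
```

Against the configuration above, the first example would print keep_logs.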
Solution
Perform the commands listed next on both the deployer-cm-primary and the deployer-cm-secondary to remediate the potential node-exporter disk full condition.
sudo vim /etc/audit/auditd.conf
Then, inside the file, change the max_log_file_action value from keep_logs to rotate.
max_log_file_action = rotate
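As an alternative to editing the file by hand, the same change can be made non-interactively, which may be convenient when it has to be repeated on both nodes. This is only a sketch: set_rotate is a hypothetical helper name, and it assumes the setting appears once in the file in the "key = value" form shown earlier:

```shell
# set_rotate [FILE]
# Hypothetical helper: set max_log_file_action to rotate in FILE and
# keep the previous version of the file as FILE.bak.
set_rotate() {
    conf="${1:-/etc/audit/auditd.conf}"
    # The pattern names the full key, so max_log_file itself is left untouched.
    sed -i.bak 's/^[[:space:]]*max_log_file_action[[:space:]]*=.*/max_log_file_action = rotate/' "$conf"
}

# Example (from a root shell, because the file is owned by root):
#   set_rotate /etc/audit/auditd.conf
```

According to the auditd.conf man page, rotate honors num_logs, so with max_log_file = 8 (megabytes) and num_logs = 5 as configured above, the rotated set is bounded at roughly 5 × 8 MB = 40 MB.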
After the file is changed, restart the service.
sudo systemctl restart auditd.service
Verify the critical alert is removed.
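In addition to the alert itself, the usage of the log filesystem can be checked directly. This is a minimal sketch with a hypothetical helper name, fs_use_percent. The /var/log path in the example is an assumption based on the log_file setting shown earlier:

```shell
# fs_use_percent [PATH]
# Hypothetical helper: print the Use% figure (digits only) of the
# filesystem that holds PATH.
fs_use_percent() {
    # -P forces the POSIX one-line-per-filesystem format, so the
    # percentage is always the fifth column of the second line.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example: warn if the log filesystem is still above 90 percent.
#   [ "$(fs_use_percent /var/log)" -gt 90 ] && echo "log filesystem still nearly full"
```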