OS Agent - Troubleshooting

The host does not appear in the Monitor tab

Issue

You ran the setup script on the host
The Telegraf service should be running
The Monitor tab in the Collector or Cockpit shows no entry for this host (even after waiting 30 seconds and clicking Refresh)

Solution

1. Check Telegraf is running on the host

# Linux
sudo systemctl status telegraf
journalctl -u telegraf -n 50
 
# Windows (PowerShell)
Get-Service telegraf
Get-EventLog -LogName Application -Source Telegraf -Newest 20

If the service is stopped or in a crash loop, the log shows the reason. Fix it and start the service again.

2. Check Telegraf can reach the Collector

The most common Telegraf log line:

Error writing to outputs.http: Post "https://...": dial tcp …: connect: connection refused

Test from the host:

curl -v https://collector.example.com/api/v1/os-agent/push

Any HTTP response is a good sign, even 401 Unauthorized: it means the network path works.

If you get connection refused or timeout → check:

The URL in telegraf.conf (scheme, host, port)
The Collector is running and listening on that port
No firewall blocks the connection from the host to the Collector
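
The URL and port checks above can be partly scripted. The following is a minimal Python sketch, not part of the product: parse_collector_url and can_connect are hypothetical helper names, and the URL in the usage note is the example one from this guide. It splits the configured URL into scheme, host and port, then tries a plain TCP connection to tell "nothing reachable on that port" apart from "the port answers".

```python
import socket
from urllib.parse import urlsplit

# Default ports per scheme, used when the URL does not name a port explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_collector_url(url):
    """Return (scheme, host, port) for the URL configured in telegraf.conf."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"not a usable http(s) URL: {url!r}")
    return parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS[parts.scheme]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False
```

For example, unpack parse_collector_url("https://collector.example.com/api/v1/os-agent/push") into scheme, host and port, then call can_connect(host, port) on the monitored host. A False result points at the Collector not listening or at a firewall; a True result means the problem is further up (TLS, URL path, or the API key).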

3. Check the API key

If Telegraf logs received status code: 401:

Open the API Keys tab in the Collector
Copy the current global key
Update the X-API-Key line in telegraf.conf
Restart Telegraf
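
To confirm which key a host is actually using before restarting Telegraf, a small check like the one below can help. It is a sketch under assumptions: read_api_key and key_is_current are hypothetical names, and the code assumes the key sits on a line of the form X-API-Key = "..." in telegraf.conf, as the step above suggests. Commented-out lines are ignored.

```python
import re

# Matches a line such as:  X-API-Key = "abc123"
# Leading whitespace and a quoted key name are allowed; lines starting with # are skipped.
_KEY_LINE = re.compile(r'^\s*"?X-API-Key"?\s*=\s*"([^"]*)"', re.MULTILINE)


def read_api_key(conf_text):
    """Return the X-API-Key value found in the config text, or None if absent."""
    match = _KEY_LINE.search(conf_text)
    return match.group(1) if match else None


def key_is_current(conf_text, current_key):
    """True if the key deployed in telegraf.conf equals the Collector's current key."""
    deployed = read_api_key(conf_text)
    return deployed is not None and deployed.strip() == current_key.strip()
```

Read the file contents on the host, paste the key copied from the API Keys tab as current_key, and a False result means the host still carries an old or mistyped key.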

4. Check the Active flag

If the Active checkbox in the API Keys tab is unchecked, every push is rejected with 401. Tick it and click Save.

The host appears but the status is ERROR

Issue

The host is listed in the Monitor tab
The status column shows ERROR (red)
The Last Push is more than 5 minutes old, or Never

Solution

1. The Telegraf service is stopped

Restart it:

# Linux
sudo systemctl restart telegraf
 
# Windows
Restart-Service telegraf

2. Telegraf is running but pushes fail

Check the Telegraf log on the host - the latest error tells you what is wrong (network, authentication, host set to inactive…).
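
When the log is long, the triage can be sketched in a few lines of Python. This is an illustration only: classify_log_line and latest_cause are hypothetical names, and the patterns are limited to the two log fragments quoted in this guide, so any other error returns None and still needs to be read by hand.

```python
# Substrings taken from the log lines quoted in this guide, mapped to a likely cause.
KNOWN_PATTERNS = [
    ("connection refused", "network: Collector not listening, wrong port, or firewall"),
    ("received status code: 401", "auth: wrong or regenerated API key, or key not Active"),
]


def classify_log_line(line):
    """Return a likely cause for one Telegraf log line, or None if unrecognized."""
    lowered = line.lower()
    for needle, cause in KNOWN_PATTERNS:
        if needle in lowered:
            return cause
    return None


def latest_cause(log_lines):
    """Return the cause of the most recent recognized error, scanning newest first."""
    for line in reversed(list(log_lines)):
        cause = classify_log_line(line)
        if cause is not None:
            return cause
    return None
```

Feed it the lines printed by journalctl -u telegraf -n 50 (oldest first, as journalctl prints them) and it reports the cause of the newest recognized error.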

3. The API key was regenerated

After you click Regenerate in the Collector, every deployed agent that still uses the old key is rejected. Update telegraf.conf with the new key and restart Telegraf.

4. The host is set Inactive

Open the host detail in the Monitor tab and click Activate.

The host shows but the Top Processes table is empty

Issue

The host is in the Monitor tab with status OK
CPU / Memory / Disk cards are filled
The Top Processes table is empty or shows No process data

Solution

1. The "process" input is not enabled

Open the Configuration tab
Tick process in the Enabled Inputs list
Click Save
Re-run the setup script on the host (or copy the new telegraf.conf over) and restart Telegraf

2. The "process" input is enabled but data has not arrived yet

procstat pushes every 30 seconds with the default top-K filter. Wait one cycle and click Refresh.

3. Telegraf runs in a Docker container without SYS_PTRACE

Inside Docker without SYS_PTRACE, the procstat input only sees Telegraf's own process. Add the capability when starting the container:

docker run --cap-add SYS_PTRACE ... telegraf:latest

Or run Telegraf on the host instead of in a container.

Disk I/O on processes is always zero

Issue

The Top Processes table is filled
The Disk I/O toggle shows zero everywhere - no read or write bytes

Solution

This is not a bug - per-process disk I/O needs read access to /proc/<pid>/io on Linux:

Linux bare metal as root → works
Linux Docker without SYS_PTRACE → file is unreadable even for root inside the container - field stays at zero
Windows admin → works

If the host is in Docker, add --cap-add SYS_PTRACE or --privileged to the docker run command, or accept that the field stays at zero.
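
To check the access requirement directly, the file can be read by hand. The Python sketch below is an assumption-labeled illustration, not how Telegraf itself collects the data: parse_proc_io and read_proc_io are hypothetical names. It reads /proc/<pid>/io and returns the counters, or None when the file cannot be read.

```python
def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def read_proc_io(pid="self"):
    """Return the I/O counters for a process, or None if the file is unreadable."""
    try:
        with open(f"/proc/{pid}/io", encoding="ascii") as handle:
            return parse_proc_io(handle.read())
    except OSError:
        # PermissionError (no access) and FileNotFoundError (no such pid, or not Linux).
        return None
```

Run it as the same user and in the same container as Telegraf, with the PID of another process. A None result for a process that exists matches the zero-everywhere symptom described above; a dict containing read_bytes and write_bytes means access is fine.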

One specific input shows nothing

Issue

Most metrics are filled
One specific category (swap, kernel, temp…) is empty

Solution

swap, system, processes, kernel, temp are Linux only - empty on Windows is normal
temp on Linux needs lm-sensors installed (on Debian/Ubuntu: sudo apt install lm-sensors && sudo sensors-detect)
diskio on Windows reports per-physical-drive only, not per-partition

Auth Failures counter rises

Issue

The Statistics tab shows a non-zero Auth Failures counter
The number grows over time

Solution

The counter rises every time a push gets 401. Possible causes:

An agent was deployed with an old key and not updated after the key was regenerated → redeploy the new key
Someone is probing the endpoint with the wrong key → check the Collector log for the source IP
A user typed the wrong URL or wrong port and another service is answering → check the URL on the agent

Click Clear in the Statistics tab, then watch the counter. If it stays at zero, the issue is fixed.

Parse Errors counter rises

Issue

The Statistics tab shows a non-zero Parse Errors counter
The Collector log has Failed to parse influx body: messages

Solution

The Collector received a body it cannot parse as Influx Line Protocol. Causes:

A test client sends JSON or some other format - the OS Agent push expects text/plain Influx LP only
A buggy agent is sending malformed lines
The body was truncated by a proxy or load balancer

Open the Collector log, find the Failed to parse influx body: line - it shows the offending content. Fix the sender.
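
For comparison with a failing sender, this is roughly what a well-formed push looks like. The Python sketch below only builds the request and does not send it. It rests on assumptions: build_push_request is a hypothetical name, the URL is the example one from this guide, and the header name follows the X-API-Key line mentioned for telegraf.conf. The sample line uses the standard Influx Line Protocol shape: measurement, tag set, field set, timestamp.

```python
import urllib.request


def build_push_request(url, api_key, lines):
    """Build (but do not send) a POST carrying Influx Line Protocol as text/plain."""
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "text/plain",
            "X-API-Key": api_key,
        },
        method="POST",
    )


# One well-formed line: measurement, tag set, field set, nanosecond timestamp.
SAMPLE_LINE = "cpu,host=web01 usage_idle=97.5 1700000000000000000"
```

A test client that sends JSON, a different Content-Type, or a body cut off mid-line differs from this shape, and that difference is usually what the Failed to parse influx body: message is reporting. To actually send the request, pass it to urllib.request.urlopen with a timeout.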

Host cap reached

Issue

The Collector log has the line:

OS Agent: host cap reached (10000), rejecting auto-discovery of <hostname>

New hosts no longer appear in the Monitor tab

Solution

This almost always means someone (or a script) is pushing with random host tag values using a valid API key. Steps:

Open the Monitor tab and look at the recent entries - delete obvious junk hostnames
Regenerate the global API key
Redeploy the new key to legitimate agents only - the abuser is locked out

Old hosts I do not want anymore

Issue

A host you have decommissioned still appears in the Monitor tab
Status is ERROR because it is not pushing

Solution

Stop Telegraf on the dead host first

If Telegraf is still running with a valid key, the host will re-appear after every Delete. Either:

Stop the Telegraf service on the host (sudo systemctl stop telegraf or Stop-Service telegraf)
Or uninstall Telegraf entirely

Then delete in the Collector

Open the Monitor tab
Expand the host
Click Delete

If you cannot stop Telegraf (the host is unreachable), use Deactivate instead - the Collector rejects all pushes from an inactive host, and the host stays listed.

Configuration changes do not reach deployed agents

Issue

You changed the inputs or regenerated a key in the Collector
Deployed agents keep their old behavior

Solution

This is by design. The Collector does not push config to agents. Apply changes manually:

Re-run the setup script on the host (sudo ./setup.sh), or
Download the new telegraf.conf from the Configuration tab, copy it to the host, restart Telegraf:

# Linux
sudo systemctl restart telegraf
 
# Windows
Restart-Service telegraf
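
If it is unclear whether a host already carries the latest configuration, comparing the deployed file with the one downloaded from the Configuration tab settles it. The Python sketch below is one possible way to do that, with hypothetical names (file_digest, config_is_current). The comparison is byte-for-byte, so a file whose line endings were converted during the copy will count as different.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def config_is_current(deployed_path, downloaded_path):
    """True if the deployed telegraf.conf is byte-identical to the downloaded one."""
    return file_digest(deployed_path) == file_digest(downloaded_path)
```

On a Linux package install the deployed file is typically /etc/telegraf/telegraf.conf; the path on your hosts may differ. A False result means the host still runs the old configuration and needs the copy and restart described above.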
