OS Agent - Troubleshooting
The host does not appear in the Monitor tab
Issue
- You ran the setup script on the host
- The Telegraf service should be running
- The Monitor tab in the Collector or Cockpit shows no entry for this host (even after waiting 30 seconds and clicking Refresh)
Solution
1. Check Telegraf is running on the host
# Linux
sudo systemctl status telegraf
journalctl -u telegraf -n 50
# Windows (PowerShell)
Get-Service telegraf
Get-EventLog -LogName Application -Source Telegraf -Newest 20
If the service is stopped or in a crash loop, the log shows the reason. Fix it and start the service again.
2. Check Telegraf can reach the Collector
The most common Telegraf log line:
Error writing to outputs.http: Post "https://...": dial tcp …: connect: connection refused
Test from the host:
curl -v https://collector.example.com/api/v1/os-agent/push
Even a 401 Unauthorized answer is good - it means the network path works.
If you get connection refused or timeout → check:
- The URL in telegraf.conf (scheme, host, port)
- The Collector is running and listening on that port
- No firewall blocks the connection from the host to the Collector
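The curl test above can be turned into a small helper that names the likely cause. This is only a sketch: the function names classify_probe and probe_collector are made up for illustration, and the URL is the placeholder from this guide, not a value read from your telegraf.conf. The curl exit codes used (6 for DNS failure, 7 for failed connection, 28 for timeout) are curl's documented ones.

```shell
#!/bin/sh
# Sketch: turn the result of a curl probe into a verdict.
# $1 = curl exit code
# $2 = HTTP status code printed by curl ("000" when no response arrived)
classify_probe() {
    curl_exit="$1"
    http_code="$2"
    case "$curl_exit" in
        0)  ;;  # curl received an HTTP response, inspect the status below
        6)  echo "dns: the host name in the URL does not resolve"; return 1 ;;
        7)  echo "refused: nothing listens on that host/port, or a firewall rejects it"; return 1 ;;
        28) echo "timeout: packets are dropped, check firewalls and routing"; return 1 ;;
        *)  echo "curl failed with exit code $curl_exit"; return 1 ;;
    esac
    case "$http_code" in
        401) echo "network ok: the Collector answered 401, check the API key next" ;;
        *)   echo "network ok: the Collector answered HTTP $http_code" ;;
    esac
}

# Hypothetical wrapper: pass the URL from telegraf.conf as the first argument.
probe_collector() {
    url="${1:-https://collector.example.com/api/v1/os-agent/push}"
    http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    # $? still holds curl's exit status from the assignment above
    classify_probe "$?" "$http_code"
}
```

Run probe_collector on the host with the real push URL. Any "network ok" line means the path to the Collector works and the next check is the API key.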
3. Check the API key
If Telegraf logs received status code: 401:
- Open the API Keys tab in the Collector
- Copy the current global key
- Update the X-API-Key line in telegraf.conf
- Restart Telegraf
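If you prefer to script the key update, the steps above can be sketched as follows. The helper name update_api_key is hypothetical, and the sketch assumes the key sits on a single line of the form X-API-Key = "value" (optionally indented); check your generated telegraf.conf before relying on it.

```shell
#!/bin/sh
# Sketch: replace the value on the X-API-Key line of a Telegraf config.
# Assumes the line has the form  X-API-Key = "value"  (optionally indented).
# $1 = path to telegraf.conf, $2 = new key
update_api_key() {
    conf="$1"
    tmp=$(mktemp) || return 1
    # The key is passed through the environment so awk does not
    # interpret backslashes or other special characters in it.
    NEW_KEY="$2" awk '
        /^[ \t]*X-API-Key[ \t]*=/ {
            indent = $0
            sub(/X-API-Key.*/, "", indent)   # keep the original indentation
            printf "%sX-API-Key = \"%s\"\n", indent, ENVIRON["NEW_KEY"]
            next
        }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"   # cat keeps owner and mode of the original
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, update_api_key /etc/telegraf/telegraf.conf "<key copied from the API Keys tab>", followed by a Telegraf restart. The path shown is the usual Linux package location and may differ on your host.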
4. Check the Active flag
If the Active checkbox in the API Keys tab is unchecked, every push is rejected with 401. Tick it and click Save.
The host appears but the status is ERROR
Issue
- The host is listed in the Monitor tab
- The status column shows ERROR (red)
- The Last Push is more than 5 minutes old, or Never
Solution
1. The Telegraf service is stopped
Restart it:
# Linux
sudo systemctl restart telegraf
# Windows
Restart-Service telegraf
2. Telegraf is running but pushes fail
Check the Telegraf log on the host - the latest error shows what is wrong (network, auth, host set inactive…).
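To read the log faster, a small filter can name the likely cause of the most recent error. This is a sketch under assumptions: diagnose_log is a made-up helper, it recognizes only the two messages quoted in this guide (connection refused and received status code: 401), and it treats lines containing "E!" or "Error" as error lines.

```shell
#!/bin/sh
# Sketch: read Telegraf log lines on stdin and name the likely cause
# of the most recent error line. Unknown errors are echoed as "other".
diagnose_log() {
    last=$(grep -E 'E! |Error' | tail -n 1)
    case "$last" in
        "")                      echo "none: no error lines found" ;;
        *"connection refused"*)  echo "network: the Collector is unreachable" ;;
        *"status code: 401"*)    echo "auth: check the API key and the Active flag" ;;
        *)                       echo "other: $last" ;;
    esac
}
```

On Linux this can be fed from the journal, for example journalctl -u telegraf -n 50 | diagnose_log.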
3. The API key was regenerated
After you click Regenerate in the Collector, every deployed agent still using the old key is rejected. Update telegraf.conf with the new key and restart Telegraf.
4. The host is set Inactive
Open the host detail in the Monitor tab and click Activate.
The host shows but the Top Processes table is empty
Issue
- The host is in the Monitor tab with status OK
- CPU / Memory / Disk cards are filled
- The Top Processes table is empty or shows No process data
Solution
1. The "process" input is not enabled
- Open the Configuration tab
- Tick process in the Enabled Inputs list
- Click Save
- Re-run the setup script on the host (or copy the new telegraf.conf over) and restart Telegraf
2. The "process" input is enabled but data has not arrived yet
procstat pushes every 30 seconds with the default top-K filter. Wait one cycle and click Refresh.
3. Telegraf runs in a Docker container without SYS_PTRACE
Inside Docker without SYS_PTRACE, the procstat input only sees Telegraf's own process. Add the capability when starting the container:
docker run --cap-add SYS_PTRACE ... telegraf:latest
Or run Telegraf on the host instead of in a container.
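To confirm whether a running container actually has the capability, you can inspect the effective capability mask that Linux exposes on the CapEff line of /proc/<pid>/status. The sketch below tests bit 19, which is CAP_SYS_PTRACE in the Linux capability numbering. The helper names has_sys_ptrace and check_self are made up for illustration.

```shell
#!/bin/sh
# Sketch: test whether an effective-capability mask contains CAP_SYS_PTRACE (bit 19).
# $1 = hex mask as shown on the CapEff line of /proc/<pid>/status
has_sys_ptrace() {
    mask=$(( 0x$1 ))
    [ $(( (mask >> 19) & 1 )) -eq 1 ]
}

# Reads the mask of the calling process; meant to be run inside the container.
check_self() {
    capeff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if has_sys_ptrace "$capeff"; then
        echo "SYS_PTRACE present"
    else
        echo "SYS_PTRACE missing: procstat will only see Telegraf's own process"
    fi
}
```

Calling check_self from a shell inside the Telegraf container shows which case applies. A mask such as 00000000a80425fb has bit 19 clear, while 00000000a80c25fb has it set.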
Disk I/O on processes is always zero
Issue
- The Top Processes table is filled
- The Disk I/O toggle shows zero everywhere - no read or write bytes
Solution
This is not a bug - per-process disk I/O needs read access to /proc/<pid>/io on Linux:
- Linux baremetal as root → works
- Linux Docker without SYS_PTRACE → file is unreadable even for root inside the container - field stays at zero
- Windows admin → works
If Telegraf runs in Docker, add --cap-add SYS_PTRACE or --privileged to the docker run command, or accept that the field stays at zero.
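A quick way to see which case applies is to try reading the file directly. The sketch below performs a real read rather than a permission test, because the kernel can refuse access at read time even when the file mode looks readable. The helper name can_read_proc_io and its optional second argument (an alternative proc root, useful only for testing) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: succeed if the per-process I/O counters of a PID can actually be read.
# $1 = pid
# $2 = proc root, defaults to /proc (overridable only to make the function testable)
can_read_proc_io() {
    cat "${2:-/proc}/$1/io" > /dev/null 2>&1
}

# Report the result for one PID in words.
report_proc_io() {
    if can_read_proc_io "$1"; then
        echo "pid $1: io counters readable"
    else
        echo "pid $1: io counters not readable, Disk I/O will stay at zero"
    fi
}
```

Run report_proc_io with the PID of a process other than Telegraf, as the same user and in the same container as Telegraf.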
One specific input shows nothing
Issue
- Most metrics are filled
- One specific category (swap, kernel, temp…) is empty
Solution
- swap, system, processes, kernel, temp are Linux only - empty on Windows is normal
- temp on Linux needs lm-sensors installed (sudo apt install lm-sensors && sudo sensors-detect)
- diskio on Windows reports per-physical-drive only, not per-partition
Auth Failures counter rises
Issue
- The Statistics tab shows a non-zero Auth Failures counter
- The number grows over time
Solution
The counter rises every time a push gets 401. Possible causes:
- An agent was deployed with an old key and not updated after a regeneration → redeploy the new key
- Someone is probing the endpoint with the wrong key → check the Collector log for the source IP
- A user typed the wrong URL or wrong port and another service is answering → check the URL on the agent
Click Clear in the Statistics tab, then watch the counter. If it stays at zero, the issue is fixed.
Parse Errors counter rises
Issue
- The Statistics tab shows a non-zero Parse Errors counter
- The Collector log has Failed to parse influx body: messages
Solution
The Collector received a body it cannot parse as Influx Line Protocol. Causes:
- A test client sends JSON or some other format - the OS Agent push expects text/plain Influx LP only
- A buggy agent is sending malformed lines
- The body was truncated by a proxy or load balancer
Open the Collector log, find the Failed to parse influx body: line - it shows the offending content. Fix the sender.
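For comparison with what a broken sender produces, here is a sketch of a well-formed Influx Line Protocol record and a manual push. It assumes the general line protocol shape (measurement, comma-separated tags, a space, fields, a space, a nanosecond timestamp). The helper names lp_line and push_line, the cpu measurement, and the usage_idle field are illustrative only, and whether the Collector accepts a hand-made record like this is an assumption.

```shell
#!/bin/sh
# Sketch: format one Influx Line Protocol record:
#   <measurement>,host=<host> <field>=<value> <timestamp>
# Inputs must not contain spaces, commas, or equals signs; this sketch does no escaping.
# $1 = measurement, $2 = host tag, $3 = field name, $4 = field value, $5 = timestamp (ns)
lp_line() {
    printf '%s,host=%s %s=%s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical manual push; nothing is sent until you call it.
# $1 = push URL, $2 = API key, $3 = line protocol body
push_line() {
    curl -s -o /dev/null -w '%{http_code}\n' \
         -H 'Content-Type: text/plain' \
         -H "X-API-Key: $2" \
         --data-binary "$3" "$1"
}
```

A body such as the output of lp_line cpu web01 usage_idle 97.5 1700000000000000000 is plain text with one record per line. A JSON object or a record cut off mid-line does not match this shape and is what the parser rejects.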
Host cap reached
Issue
- The Collector log has the line:
OS Agent: host cap reached (10000), rejecting auto-discovery of <hostname>
- New hosts no longer appear in the Monitor tab
Solution
This almost always means that someone (or a script) is pushing with random host tag values using a valid API key. Steps:
- Open the Monitor tab and look at the recent entries - delete obvious junk hostnames
- Regenerate the global API key
- Redeploy the new key to legitimate agents only - the abuser is locked out
Old hosts I do not want anymore
Issue
- A host you have decommissioned still appears in the Monitor tab
- Status is ERROR because it is not pushing
Solution
Stop Telegraf on the dead host first
If Telegraf is still running with a valid key, the host will re-appear after every Delete. Either:
- Stop the Telegraf service on the host (sudo systemctl stop telegraf or Stop-Service telegraf)
- Or uninstall Telegraf entirely
Then delete in the Collector
- Open the Monitor tab
- Expand the host
- Click Delete
If you cannot stop Telegraf (the host is unreachable), use Deactivate instead - inactive hosts reject all pushes and stay listed.
Configuration changes do not reach deployed agents
Issue
- You changed the inputs or regenerated a key in the Collector
- Deployed agents keep their old behavior
Solution
This is by design. The Collector does not push config to agents. Apply changes manually:
- Re-run the setup script on the host (sudo ./setup.sh), or
- Download the new telegraf.conf from the Configuration tab, copy it to the host, restart Telegraf:
# Linux
sudo systemctl restart telegraf
# Windows
Restart-Service telegraf
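When copying a downloaded telegraf.conf over the live one on Linux, keeping a backup makes it easy to roll back. The following is a minimal sketch: install_conf is a made-up helper, and /etc/telegraf/telegraf.conf is the usual Linux package location, which may not match what the setup script uses on your host.

```shell
#!/bin/sh
# Sketch: replace a Telegraf config with a newly downloaded one, keeping a backup.
# $1 = new file downloaded from the Configuration tab
# $2 = live config (for example /etc/telegraf/telegraf.conf)
install_conf() {
    new="$1"
    live="$2"
    # Refuse to overwrite the live config with a missing or empty download.
    [ -s "$new" ] || { echo "new config is missing or empty: $new" >&2; return 1; }
    if [ -f "$live" ]; then
        cp -p "$live" "$live.bak" || return 1
    fi
    # cat into the existing file keeps its owner and mode.
    cat "$new" > "$live"
}
```

After install_conf succeeds, restart Telegraf as shown above. If the agent misbehaves, copy the .bak file back and restart again.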
