====== OS Agent - Troubleshooting ======
===== The host does not appear in the Monitor tab =====
=== Issue ===
* You ran the setup script on the host
* The Telegraf service is expected to be running
* The **Monitor** tab in the Collector or Cockpit shows no entry for this host (even after waiting 30 seconds and clicking Refresh)
=== Solution ===
=== 1. Check Telegraf is running on the host ===
<code>
# Linux
sudo systemctl status telegraf
journalctl -u telegraf -n 50
# Windows (PowerShell)
Get-Service telegraf
Get-EventLog -LogName Application -Source Telegraf -Newest 20
</code>
If the service is stopped or in a crash loop, the log shows the reason. Fix it and start the service again.
=== 2. Check Telegraf can reach the Collector ===
The most common error line in the Telegraf log is:
''Error writing to outputs.http: Post "https://...": dial tcp ...: connect: connection refused''
Test from the host:
<code>
curl -v https://collector.example.com/api/v1/os-agent/push
</code>
Even a //401 Unauthorized// answer is good - it means the network path works.
If you get //connection refused// or //timeout// → check:
- The URL in ''telegraf.conf'' (scheme, host, port)
- The Collector is running and listening on that port
- No firewall blocks the connection from the host to the Collector
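The same test can be sketched as a small script that runs unchanged on Linux and Windows. This is a minimal sketch and not part of the product: the function names ''probe'' and ''interpret'' are hypothetical, and the URL in the last comment is the example URL from this page. It applies the rule from this step: any HTTP answer (including //401//) proves the network path works, while a refused connection or a timeout points to the URL, the Collector, or a firewall.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Send one GET request. Return (http_status, error); one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as exc:
        # 4xx/5xx answers (401 included) still prove that something answered.
        return exc.code, None
    except OSError as exc:
        # URLError, timeouts and refused connections are all OSError subclasses.
        return None, exc


def interpret(status, error):
    """Turn a probe result into the verdict used in this step."""
    if status is not None:
        return f"network path OK (HTTP {status})"
    reason = getattr(error, "reason", error)  # URLError wraps the real cause
    if isinstance(reason, ConnectionRefusedError):
        return "connection refused: check URL, port, and that the Collector is listening"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout: check firewall rules between host and Collector"
    return f"network error: {reason}"


# Usage on the host (replace the URL with the one from telegraf.conf):
# print(interpret(*probe("https://collector.example.com/api/v1/os-agent/push")))
```

Run it from the same host and, if possible, as the same user as Telegraf, so that proxy and firewall rules apply in the same way.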
=== 3. Check the API key ===
If Telegraf logs ''received status code: 401'':
- Open the **API Keys** tab in the Collector
- Copy the current global key
- Update the ''X-API-Key'' line in ''telegraf.conf''
- Restart Telegraf
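Before restarting, it can help to confirm which key the agent is really configured with. The following is a minimal sketch, assuming the key is stored as a line of the form ''X-API-Key = "..."'' in ''telegraf.conf''; the exact layout of the generated file, the sample content, and the function names are assumptions made for illustration.

```python
import re

# Assumption: the key appears on a line such as  X-API-Key = "..."
# Commented-out lines (starting with #) are ignored.
KEY_LINE = re.compile(r'^[ \t]*"?X-API-Key"?[ \t]*=[ \t]*"([^"]*)"', re.MULTILINE)


def configured_key(conf_text):
    """Return the X-API-Key value found in telegraf.conf text, or None."""
    match = KEY_LINE.search(conf_text)
    return match.group(1) if match else None


def key_is_current(conf_text, current_key):
    """True when the configured key equals the key copied from the Collector."""
    return configured_key(conf_text) == current_key


# Hypothetical file content, for illustration only.
sample_conf = '''
[[outputs.http]]
  url = "https://collector.example.com/api/v1/os-agent/push"
  [outputs.http.headers]
    # X-API-Key = "commented-out-key"
    X-API-Key = "old-key-123"
'''
print(configured_key(sample_conf))                 # old-key-123
print(key_is_current(sample_conf, "new-key-456"))  # False
```

On a real host, pass the content of the deployed ''telegraf.conf'' instead of ''sample_conf''.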
=== 4. Check the Active flag ===
If the **Active** checkbox in the **API Keys** tab is unchecked, **every** push is rejected with //401//. Tick it and click Save.
----
===== The host appears but the status is ERROR =====
=== Issue ===
* The host is listed in the **Monitor** tab
* The status column shows //ERROR// (red)
* The //Last Push// is more than 5 minutes old, or //Never//
=== Solution ===
=== 1. The Telegraf service is stopped ===
Restart it:
<code>
# Linux
sudo systemctl restart telegraf
# Windows
Restart-Service telegraf
</code>
=== 2. Telegraf is running but pushes fail ===
Check the Telegraf log on the host. The most recent error shows what is wrong: a network problem, an authentication failure, or the host being set Inactive.
=== 3. The API key was regenerated ===
After you click **Regenerate** in the Collector, every deployed agent that still uses the old key is rejected. Update ''telegraf.conf'' with the new key and restart Telegraf.
=== 4. The host is set Inactive ===
Open the host detail in the **Monitor** tab and click **Activate**.
----
===== The host shows but the Top Processes table is empty =====
=== Issue ===
* The host is in the **Monitor** tab with status //OK//
* CPU / Memory / Disk cards are filled
* The **Top Processes** table is empty or shows //No process data//
=== Solution ===
=== 1. The "process" input is not enabled ===
- Open the **Configuration** tab
- Tick **process** in the Enabled Inputs list
- Click Save
- Re-run the setup script on the host (or copy the new ''telegraf.conf'' over) and restart Telegraf
=== 2. The "process" input is enabled but data has not arrived yet ===
The ''procstat'' input pushes every 30 seconds with the default top-K filter. Wait one cycle (30 seconds) and click Refresh.
=== 3. Telegraf runs in a Docker container without SYS_PTRACE ===
Inside Docker without ''SYS_PTRACE'', the procstat input only sees Telegraf's own process. Add the capability when starting the container:
<code>
docker run --cap-add SYS_PTRACE ... telegraf:latest
</code>
Or run Telegraf on the host instead of in a container.
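To verify from inside the container whether the capability is really present, the effective capability mask of a process can be read from ''/proc/self/status''. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the two sample masks are values commonly seen for Docker's default capability set, without and with ''SYS_PTRACE''. The bit number 19 for ''CAP_SYS_PTRACE'' comes from the Linux capability definitions.

```python
CAP_SYS_PTRACE = 19  # bit number from <linux/capability.h>


def effective_caps(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_sys_ptrace(status_text):
    """True when bit 19 (CAP_SYS_PTRACE) is set in the effective mask."""
    return bool((effective_caps(status_text) >> CAP_SYS_PTRACE) & 1)


# Sample masks: Docker's usual default set, then the same set with SYS_PTRACE.
print(has_sys_ptrace("Name:\ttelegraf\nCapEff:\t00000000a80425fb\n"))  # False
print(has_sys_ptrace("Name:\ttelegraf\nCapEff:\t00000000a80c25fb\n"))  # True
```

Inside the container, call ''has_sys_ptrace(open("/proc/self/status").read())''. A result of ''False'' matches the symptom described here: only Telegraf's own process is visible.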
----
===== Disk I/O on processes is always zero =====
=== Issue ===
* The Top Processes table is filled
* The **Disk I/O** toggle shows zero everywhere - no read or write bytes
=== Solution ===
This is **not a bug**. Per-process disk I/O needs read access to ''/proc/<pid>/io'' on Linux:
* **Linux baremetal as root** → works
* **Linux Docker without SYS_PTRACE** → file is unreadable even for root inside the container - field stays at zero
* **Windows admin** → works
If Telegraf runs in Docker, add ''--cap-add SYS_PTRACE'' or ''--privileged'' to the ''docker run'' command, or accept that the field stays at zero.
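Whether the per-process I/O file is readable can be checked directly on the host or inside the container. The following is a minimal sketch, not the way Telegraf itself reads the data: the function names are hypothetical, and the sample text only mimics the ''name: value'' layout of the Linux per-process ''io'' file.

```python
def parse_proc_io(text):
    """Parse the 'name: value' lines of a per-process io file into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        if value.strip():
            counters[name.strip()] = int(value)
    return counters


def read_proc_io(pid):
    """Return the I/O counters of a process, or None when the file is unreadable."""
    try:
        with open(f"/proc/{pid}/io") as handle:
            return parse_proc_io(handle.read())
    except OSError:
        # Missing file (not Linux), vanished process, or permission denied.
        return None


sample = "rchar: 4096\nwchar: 1024\nread_bytes: 8192\nwrite_bytes: 0\n"
print(parse_proc_io(sample)["read_bytes"])  # 8192
```

Call ''read_proc_io'' with the PID of a process other than the script itself, as the user Telegraf runs as. A result of ''None'' means the counters cannot be read, which is the case where the **Disk I/O** field stays at zero.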
----
===== One specific input shows nothing =====
=== Issue ===
* Most metrics are filled
* One specific category (swap, kernel, temp...) is empty
=== Solution ===
* ''swap'', ''system'', ''processes'', ''kernel'', ''temp'' are **Linux only** - empty on Windows is normal
* ''temp'' on Linux needs ''lm-sensors'' installed (on Debian/Ubuntu: ''sudo apt install lm-sensors && sudo sensors-detect'')
* ''diskio'' on Windows reports per-physical-drive only, not per-partition
----
===== Auth Failures counter rises =====
=== Issue ===
* The **Statistics** tab shows a non-zero **Auth Failures** counter
* The number grows over time
=== Solution ===
The counter rises every time a push gets //401//. Possible causes:
* An agent was deployed with an old key and not updated after the key was regenerated → deploy the new key to that agent
* Someone is probing the endpoint with the wrong key → check the Collector log for the source IP
* A user typed the wrong URL or wrong port and another service is answering → check the URL on the agent
Click **Clear** in the Statistics tab, then watch the counter. If it stays at zero, the issue is fixed.
----
===== Parse Errors counter rises =====
=== Issue ===
* The **Statistics** tab shows a non-zero **Parse Errors** counter
* The Collector log has ''Failed to parse influx body:'' messages
=== Solution ===
The Collector received a body it cannot parse as Influx Line Protocol. Causes:
* A test client sends JSON or another format. The OS Agent push endpoint accepts only ''text/plain'' Influx Line Protocol
* A buggy agent is sending malformed lines
* The body was truncated by a proxy or load balancer
Open the Collector log and find the ''Failed to parse influx body:'' line. It shows the offending content. Fix the sender.
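To tell quickly whether the content shown in the log has the general shape of Influx Line Protocol, a rough check can be run on it. This is a simplified sketch with hypothetical function names, not the Collector's parser: it only verifies the overall layout (measurement and tags, then fields, then an optional integer timestamp) and does not validate field types or every escaping rule.

```python
def split_on_unquoted_spaces(line):
    """Split on spaces that are neither backslash-escaped nor inside double quotes."""
    parts, current, in_quotes, escaped = [], [], False, False
    for char in line:
        if escaped:
            current.append(char)
            escaped = False
        elif char == "\\":
            current.append(char)
            escaped = True
        elif char == '"':
            current.append(char)
            in_quotes = not in_quotes
        elif char == " " and not in_quotes:
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current))
    return parts


def looks_like_line_protocol(line):
    """Rough shape check: measurement[,tags] SPACE fields [SPACE integer timestamp]."""
    parts = split_on_unquoted_spaces(line.strip())
    if len(parts) not in (2, 3) or "=" not in parts[1]:
        return False
    return len(parts) == 2 or parts[2].lstrip("-").isdigit()


print(looks_like_line_protocol('cpu,host=web01 usage_idle=98.5 1700000000000000000'))  # True
print(looks_like_line_protocol('disk,host=web01 path="/var/log data"'))  # True
print(looks_like_line_protocol('{"cpu": {"usage_idle": 98.5}}'))  # False
print(looks_like_line_protocol('cpu,host=web01 usage_'))  # False
```

The third example corresponds to a client sending JSON, the fourth to a line truncated on the way to the Collector.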
----
===== Host cap reached =====
=== Issue ===
* The Collector log has the line:
''OS Agent: host cap reached (10000), rejecting auto-discovery of ''
* New hosts no longer appear in the **Monitor** tab
=== Solution ===
This almost always means that someone (or a script) is pushing with random ''host'' tag values using a valid API key. Steps:
- Open the **Monitor** tab and look at the recent entries - delete obvious junk hostnames
- Regenerate the global API key
- Redeploy the new key to **legitimate** agents only - the abuser is locked out
----
===== Old hosts I do not want anymore =====
=== Issue ===
* A host you have decommissioned still appears in the **Monitor** tab
* Status is //ERROR// because it is not pushing
=== Solution ===
=== Stop Telegraf on the dead host first ===
If Telegraf is still running with a valid key, the host will re-appear after every Delete. Either:
* Stop the Telegraf service on the host (''sudo systemctl stop telegraf'' or ''Stop-Service telegraf'')
* Or uninstall Telegraf entirely
=== Then delete in the Collector ===
- Open the **Monitor** tab
- Expand the host
- Click **Delete**
If you cannot stop Telegraf (the host is unreachable), use **Deactivate** instead. All pushes from an inactive host are rejected, and the host stays listed.
----
===== Configuration changes do not reach deployed agents =====
=== Issue ===
* You changed the inputs or regenerated a key in the Collector
* Deployed agents keep their old behavior
=== Solution ===
This is **by design**. The Collector does not push config to agents. Apply changes manually:
* Re-run the setup script on the host (''sudo ./setup.sh''), or
* Download the new ''telegraf.conf'' from the **Configuration** tab, copy it to the host, and restart Telegraf:
<code>
# Linux
sudo systemctl restart telegraf
# Windows
Restart-Service telegraf
</code>
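Because nothing is pushed automatically, it is easy to lose track of which hosts already run the current configuration. One way to check is to compare checksums of the deployed file and the freshly downloaded one. The sketch below uses hypothetical function names, and the file paths in the usage note are assumptions that depend on how Telegraf was installed.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def config_is_current(deployed_path, downloaded_path):
    """True when the deployed telegraf.conf is byte-identical to the downloaded one."""
    return sha256_of(deployed_path) == sha256_of(downloaded_path)
```

For example, ''config_is_current("/etc/telegraf/telegraf.conf", "telegraf.conf")'' compares a common Linux location with a file downloaded into the current directory. The comparison is byte-exact, so a file whose line endings were converted during the copy is reported as different.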