====== OS Agent - Troubleshooting ======

===== The host does not appear in the Monitor tab =====

=== Issue ===

  * You ran the setup script on the host
  * The Telegraf service should be running
  * The **Monitor** tab in the Collector or Cockpit shows no entry for this host (even after waiting 30 seconds and clicking Refresh)

=== Solution ===

=== 1. Check Telegraf is running on the host ===

  # Linux
  sudo systemctl status telegraf
  journalctl -u telegraf -n 50

  # Windows (PowerShell)
  Get-Service telegraf
  Get-EventLog -LogName Application -Source Telegraf -Newest 20

If the service is stopped or in a crash loop, the log shows the reason. Fix it and start the service again.

=== 2. Check Telegraf can reach the Collector ===

The most common Telegraf log line: ''Error writing to outputs.http: Post "https://...": dial tcp ...: connect: connection refused''

Test from the host:

  curl -v https://collector.example.com/api/v1/os-agent/push

Even a //401 Unauthorized// response is good - it means the network path works. If you get //connection refused// or //timeout// → check:

  - The URL in ''telegraf.conf'' (scheme, host, port)
  - The Collector is running and listening on that port
  - No firewall blocks the connection from the host to the Collector

=== 3. Check the API key ===

If Telegraf logs ''received status code: 401'':

  - Open the **API Keys** tab in the Collector
  - Copy the current global key
  - Update the ''X-API-Key'' line in ''telegraf.conf''
  - Restart Telegraf

=== 4. Check the Active flag ===

If the **Active** checkbox in the **API Keys** tab is unchecked, **every** push is rejected with //401//. Tick it and click Save.

----

===== The host appears but the status is ERROR =====

=== Issue ===

  * The host is listed in the **Monitor** tab
  * The status column shows //ERROR// (red)
  * The //Last Push// is more than 5 minutes old, or //Never//

=== Solution ===

=== 1. The Telegraf service is stopped ===

Restart it:

  # Linux
  sudo systemctl restart telegraf

  # Windows
  Restart-Service telegraf

=== 2. Telegraf is running but pushes fail ===

Check the Telegraf log on the host - the latest error tells you what is wrong (network, auth, host set inactive...).

=== 3. The API key was regenerated ===

After you click **Regenerate** in the Collector, every deployed agent still using the old key is rejected. Update ''telegraf.conf'' with the new key and restart Telegraf.

=== 4. The host is set Inactive ===

Open the host detail in the **Monitor** tab and click **Activate**.

----

===== The host shows but the Top Processes table is empty =====

=== Issue ===

  * The host is in the **Monitor** tab with status //OK//
  * CPU / Memory / Disk cards are filled
  * The **Top Processes** table is empty or shows //No process data//

=== Solution ===

=== 1. The "process" input is not enabled ===

  - Open the **Configuration** tab
  - Tick **process** in the Enabled Inputs list
  - Click Save
  - Re-run the setup script on the host (or copy the new ''telegraf.conf'' over) and restart Telegraf

=== 2. The "process" input is enabled but data has not arrived yet ===

procstat pushes every 30 seconds with the default top-K filter. Wait one cycle and click Refresh.

=== 3. Telegraf runs in a Docker container without SYS_PTRACE ===

Inside Docker without ''SYS_PTRACE'', the procstat input only sees Telegraf's own process. Add the capability when starting the container:

  docker run --cap-add SYS_PTRACE ... telegraf:latest

Or run Telegraf on the host instead of in a container.
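
To confirm whether a container actually holds ''SYS_PTRACE'', one option is to read the effective capability mask from ''/proc/self/status''. The sketch below is an illustration, not part of the OS Agent tooling: the helper names ''has_cap'' and ''check_ptrace'' are hypothetical, and it relies on ''CAP_SYS_PTRACE'' being capability number 19 on Linux.

```shell
# CAP_SYS_PTRACE is capability number 19 in the Linux capability list.
CAP_SYS_PTRACE_BIT=19

# has_cap MASK BIT: succeed if BIT is set in the hexadecimal capability MASK
# (the format of the CapEff line in /proc/<pid>/status).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# check_ptrace [STATUS_FILE]: report whether SYS_PTRACE is in the effective set.
# STATUS_FILE defaults to the status file of the calling shell.
check_ptrace() {
  cap_mask=$(awk '/^CapEff:/ {print $2}' "${1:-/proc/self/status}")
  if has_cap "$cap_mask" "$CAP_SYS_PTRACE_BIT"; then
    echo "SYS_PTRACE present"
  else
    echo "SYS_PTRACE missing"
  fi
}
```

Run ''check_ptrace'' in a shell inside the Telegraf container (for example through ''docker exec''). If it reports the capability as missing, restart the container with ''--cap-add SYS_PTRACE''.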
----

===== Disk I/O on processes is always zero =====

=== Issue ===

  * The Top Processes table is filled
  * The **Disk I/O** toggle shows zero everywhere - no read or write bytes

=== Solution ===

This is **not a bug** - per-process disk I/O needs read access to ''/proc/<pid>/io'' on Linux:

  * **Linux bare metal as root** → works
  * **Linux Docker without SYS_PTRACE** → the file is unreadable even for root inside the container - the field stays at zero
  * **Windows admin** → works

If the host is in Docker, add ''--cap-add SYS_PTRACE'' or ''--privileged'' to the ''docker run'' command, or accept that the field stays at zero.

----

===== One specific input shows nothing =====

=== Issue ===

  * Most metrics are filled
  * One specific category (swap, kernel, temp...) is empty

=== Solution ===

  * ''swap'', ''system'', ''processes'', ''kernel'', ''temp'' are **Linux only** - empty on Windows is normal
  * ''temp'' on Linux needs ''lm-sensors'' installed (''sudo apt install lm-sensors && sudo sensors-detect'')
  * ''diskio'' on Windows reports per-physical-drive only, not per-partition

----

===== Auth Failures counter rises =====

=== Issue ===

  * The **Statistics** tab shows a non-zero **Auth Failures** counter
  * The number grows over time

=== Solution ===

The counter rises every time a push gets //401//. Possible causes:

  * An agent was deployed with an old key and not updated after a regeneration → redeploy the new key
  * Someone is probing the endpoint with the wrong key → check the Collector log for the source IP
  * A user typed the wrong URL or wrong port and another service is answering → check the URL on the agent

Click **Clear** in the Statistics tab, then watch the counter. If it stays at zero, the issue is fixed.

----

===== Parse Errors counter rises =====

=== Issue ===

  * The **Statistics** tab shows a non-zero **Parse Errors** counter
  * The Collector log has ''Failed to parse influx body:'' messages

=== Solution ===

The Collector received a body it cannot parse as Influx Line Protocol.
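
For orientation, a line protocol record has the shape ''measurement[,tag=value...] field=value[,field=value...] [timestamp]''. The helper below is a rough sketch for eyeballing a suspect body before digging into the sender. The name ''looks_like_lp'' and the sample host ''web01'' are hypothetical, the check ignores escaping and quoted string fields that contain spaces, and the Collector's real parser may be stricter or more lenient.

```shell
# looks_like_lp LINE: succeed if LINE has the basic line protocol shape
#   measurement[,tag=value...] field=value[,field=value...] [timestamp]
# Rough check only: escaped characters and quoted strings with spaces are not handled.
looks_like_lp() {
  printf '%s\n' "$1" | grep -Eq \
    '^[^ ,]+(,[^ ,=]+=[^ ,=]+)* [^ ,=]+=[^ ,]+(,[^ ,=]+=[^ ,]+)*( -?[0-9]+)?$'
}

# Example: classify a few candidate bodies.
for body in \
  'cpu,host=web01 usage_idle=97.5 1700000000000000000' \
  '{"cpu": 1}' \
  'cpu,host=web01 usage_idle='
do
  if looks_like_lp "$body"; then
    echo "plausible: $body"
  else
    echo "suspect:   $body"
  fi
done
```

The first sample passes the shape check. The JSON body and the truncated record are flagged as suspect.
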
Causes:

  * A test client sends JSON or some other format - the OS Agent push expects ''text/plain'' Influx LP only
  * A buggy agent is sending malformed lines
  * The body was truncated by a proxy or load balancer

Open the Collector log and find the ''Failed to parse influx body:'' line - it shows the offending content. Fix the sender.

----

===== Host cap reached =====

=== Issue ===

  * The Collector log has the line: ''OS Agent: host cap reached (10000), rejecting auto-discovery of ''
  * New hosts no longer appear in the **Monitor** tab

=== Solution ===

This almost always means someone (or a script) is pushing random ''host'' tag values using a valid API key. Steps:

  - Open the **Monitor** tab and look at the recent entries - delete obvious junk hostnames
  - Regenerate the global API key
  - Redeploy the new key to **legitimate** agents only - the abuser is locked out

----

===== Old hosts I do not want anymore =====

=== Issue ===

  * A host you have decommissioned still appears in the **Monitor** tab
  * Status is //ERROR// because it is not pushing

=== Solution ===

=== Stop Telegraf on the dead host first ===

If Telegraf is still running with a valid key, the host will reappear after every Delete. Either:

  * Stop the Telegraf service on the host (''sudo systemctl stop telegraf'' or ''Stop-Service telegraf'')
  * Or uninstall Telegraf entirely

=== Then delete in the Collector ===

  - Open the **Monitor** tab
  - Expand the host
  - Click **Delete**

If you cannot stop Telegraf (the host is unreachable), use **Deactivate** instead - pushes from an inactive host are rejected, and the host stays listed.

----

===== Configuration changes do not reach deployed agents =====

=== Issue ===

  * You changed the inputs or regenerated a key in the Collector
  * Deployed agents keep their old behavior

=== Solution ===

This is **by design**. The Collector does not push config to agents.
Apply changes manually:

  * Re-run the setup script on the host (''sudo ./setup.sh''), or
  * Download the new ''telegraf.conf'' from the **Configuration** tab, copy it to the host, restart Telegraf:

  # Linux
  sudo systemctl restart telegraf

  # Windows
  Restart-Service telegraf
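
On Linux, the copy-and-restart step could be wrapped in a small helper that restarts Telegraf only when the downloaded file differs from the deployed one. This is a minimal sketch under stated assumptions: the function names ''conf_changed'' and ''apply_conf'' are hypothetical, and ''/etc/telegraf/telegraf.conf'' is assumed as the deployed path, so adjust it if your install differs.

```shell
# conf_changed DEPLOYED NEW: succeed if the two files differ
# (also succeeds when DEPLOYED does not exist yet).
conf_changed() {
  ! cmp -s "$1" "$2"
}

# apply_conf NEW [DEPLOYED]: copy NEW over DEPLOYED and restart Telegraf,
# but only when the content actually changed.
# DEPLOYED defaults to /etc/telegraf/telegraf.conf (an assumption, see above).
apply_conf() {
  new_conf="$1"
  deployed_conf="${2:-/etc/telegraf/telegraf.conf}"
  if conf_changed "$deployed_conf" "$new_conf"; then
    sudo cp "$new_conf" "$deployed_conf" && sudo systemctl restart telegraf
  else
    echo "telegraf.conf unchanged, no restart needed"
  fi
}
```

Usage: after downloading the file from the **Configuration** tab and copying it to the host, call ''apply_conf ./telegraf.conf''. Skipping the restart when nothing changed avoids a needless gap in pushes.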