====== OS Agent - Troubleshooting ======


===== The host does not appear in the Monitor tab =====

=== Issue ===

  * You ran the setup script on the host
  * The Telegraf service should be running
  * The **Monitor** tab in the Collector or Cockpit shows no entry for this host (even after waiting 30 seconds and clicking Refresh)

=== Solution ===

=== 1. Check Telegraf is running on the host ===

<code bash>
# Linux
sudo systemctl status telegraf
sudo journalctl -u telegraf -n 50 --no-pager

# Windows (PowerShell)
Get-Service telegraf
Get-EventLog -LogName Application -Source Telegraf -Newest 20
</code>

If the service is stopped or in a crash loop, the log shows the reason. Fix it and start the service again.
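
As a rough aid for reading the service state on Linux, the sketch below maps the one-word state printed by ''systemctl is-active'' to a next step. The function name ''explain_state'' and the wording of the hints are illustrative assumptions, not part of the OS Agent.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: turn a systemd unit state into a suggested next step.
explain_state() {
  case "$1" in
    active)     echo "running: continue with the network and API key checks" ;;
    activating) echo "starting or restarting repeatedly: read the journal for the crash reason" ;;
    failed)     echo "crashed: read the journal, fix the cause, then start the service" ;;
    inactive)   echo "stopped: start the service" ;;
    *)          echo "unexpected state '$1': inspect 'systemctl status telegraf'" ;;
  esac
}

# Usage on the host (Linux with systemd):
#   explain_state "$(systemctl is-active telegraf)"
</code>

A unit stuck in a crash loop usually reports ''activating'' or ''failed'', which is why both hints point to the journal.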

=== 2. Check Telegraf can reach the Collector ===

The most common error line in the Telegraf log is:

''Error writing to outputs.http: Post "https://...": dial tcp ...: connect: connection refused''

Test from the host:

<code bash>
curl -v https://collector.example.com/api/v1/os-agent/push
</code>

Even a //401 Unauthorized// response is good here - it proves the network path works and only the authentication is failing.

If you get //connection refused// or //timeout// instead, check:

  - The URL in ''telegraf.conf'' (scheme, host, port)
  - The Collector is running and listening on that port
  - No firewall blocks the connection from the host to the Collector
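
The interpretation above can be sketched as a small helper that reads the HTTP status code reported by ''curl''. The function name ''classify_http_code'' is a hypothetical helper, and the URL in the usage comment is the placeholder already used on this page. When no HTTP response arrives at all (connection refused, timeout), ''curl'' reports the code as ''000''.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: interpret the status code curl reports for the push URL.
classify_http_code() {
  case "$1" in
    000) echo "no HTTP response: check URL, Collector process, and firewall" ;;
    401) echo "network path works: check the API key" ;;
    *)   echo "network path works: HTTP $1" ;;
  esac
}

# Usage from the host (placeholder URL, replace with the one in telegraf.conf):
#   code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
#           https://collector.example.com/api/v1/os-agent/push)"
#   classify_http_code "$code"
</code>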

=== 3. Check the API key ===

If Telegraf logs ''received status code: 401'':

  - Open the **API Keys** tab in the Collector
  - Copy the current global key
  - Update the ''X-API-Key'' line in ''telegraf.conf''
  - Restart Telegraf
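
If you prefer to script the key update on Linux, a minimal sketch follows. It assumes the key sits on a line of the form ''X-API-Key = "..."'', that the key contains only letters, digits, ''-'' and ''_'', and that GNU ''sed'' is available; check your generated ''telegraf.conf'' before relying on it. The function name ''set_api_key'' is hypothetical.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: replace the value on the X-API-Key line of a Telegraf config.
# Assumes a line like:   X-API-Key = "old-value"
set_api_key() {
  local conf="$1" new_key="$2"
  # Keep a backup next to the original before editing in place.
  cp -p "$conf" "$conf.bak" || return 1
  sed -i -E "s|^([[:space:]]*X-API-Key[[:space:]]*=[[:space:]]*)\"[^\"]*\"|\\1\"${new_key}\"|" "$conf"
}

# Usage as root (the path is the usual package default and may differ on your host):
#   set_api_key /etc/telegraf/telegraf.conf "PASTE-NEW-KEY-HERE"
#   systemctl restart telegraf
</code>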

=== 4. Check the Active flag ===

If the **Active** checkbox in the **API Keys** tab is unchecked, **every** push is rejected with //401//. Tick it and click Save.

----

===== The host appears but the status is ERROR =====

=== Issue ===

  * The host is listed in the **Monitor** tab
  * The status column shows //ERROR// (red)
  * The //Last Push// is more than 5 minutes old, or //Never//

=== Solution ===

=== 1. The Telegraf service is stopped ===

Restart it:

<code bash>
# Linux
sudo systemctl restart telegraf

# Windows
Restart-Service telegraf
</code>

=== 2. Telegraf is running but pushes fail ===

Check the Telegraf log on the host - the latest error tells you what is wrong (network, authentication, host set to inactive...).

=== 3. The API key was regenerated ===

After you click **Regenerate** in the Collector, every deployed agent still using the old key is rejected. Update ''telegraf.conf'' with the new key and restart Telegraf.

=== 4. The host is set Inactive ===

Open the host detail in the **Monitor** tab and click **Activate**.

----

===== The host shows but the Top Processes table is empty =====

=== Issue ===

  * The host is in the **Monitor** tab with status //OK//
  * CPU / Memory / Disk cards are filled
  * The **Top Processes** table is empty or shows //No process data//

=== Solution ===

=== 1. The "process" input is not enabled ===

  - Open the **Configuration** tab
  - Tick **process** in the Enabled Inputs list
  - Click Save
  - Re-run the setup script on the host (or copy the new ''telegraf.conf'' over) and restart Telegraf

=== 2. The "process" input is enabled but data has not arrived yet ===

The ''procstat'' input pushes every 30 seconds with the default top-K filter. Wait one full cycle, then click Refresh.

=== 3. Telegraf runs in a Docker container without SYS_PTRACE ===

Inside Docker without ''SYS_PTRACE'', the procstat input only sees Telegraf's own process. Add the capability when starting the container:

<code bash>
docker run --cap-add SYS_PTRACE ... telegraf:latest
</code>

Or run Telegraf on the host instead of in a container.
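
To check whether a running container was started with the capability, one option is to decode the capability bitmask the kernel exposes in ''/proc/self/status''. The sketch below tests bit 19, which is ''CAP_SYS_PTRACE'' on Linux. The function name ''has_sys_ptrace'' is hypothetical, and the check covers the container's bounding set only; whether the Telegraf process itself holds the capability also depends on the user it runs as.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: succeed if a hex capability mask has CAP_SYS_PTRACE (bit 19) set.
has_sys_ptrace() {
  local cap_hex="$1"
  (( (0x$cap_hex >> 19) & 1 ))
}

# Usage inside the container:
#   cap_bnd="$(awk '/^CapBnd:/ {print $2}' /proc/self/status)"
#   if has_sys_ptrace "$cap_bnd"; then echo "SYS_PTRACE present"; else echo "SYS_PTRACE missing"; fi
</code>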

----

===== Disk I/O on processes is always zero =====

=== Issue ===

  * The Top Processes table is filled
  * The **Disk I/O** toggle shows zero everywhere - no read or write bytes

=== Solution ===

This is **not a bug** - per-process disk I/O needs read access to ''/proc/<pid>/io'' on Linux:

  * **Linux bare metal as root** → works
  * **Linux Docker without SYS_PTRACE** → the file is unreadable even for root inside the container, so the field stays at zero
  * **Windows admin** → works

If the host is in Docker, add ''--cap-add SYS_PTRACE'' or ''--privileged'' to the ''docker run'' command, or accept that the field stays at zero.
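
To see how many per-process I/O files are actually readable in a given environment, a quick count like the one below can help. It is a sketch under assumptions: the function name ''count_readable_io'' is hypothetical, and the optional argument for the proc root exists only so the function can be tried against a test directory.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: print "<readable>/<total>" for the per-process io files under a proc root.
count_readable_io() {
  local root="${1:-/proc}" readable=0 total=0 f
  for f in "$root"/[0-9]*/io; do
    [ -e "$f" ] || continue            # skip the unexpanded glob when nothing matches
    total=$((total + 1))
    # Actually read the file: a permission test alone can be misleading here.
    if cat "$f" >/dev/null 2>&1; then
      readable=$((readable + 1))
    fi
  done
  echo "$readable/$total"
}

# Usage on the host or inside the container:
#   count_readable_io
</code>

Run it as the same user that runs Telegraf. A result where only a handful of files are readable matches the zero values seen in the **Disk I/O** toggle.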

----

===== One specific input shows nothing =====

=== Issue ===

  * Most metrics are filled
  * One specific category (swap, kernel, temp...) is empty

=== Solution ===

  * ''swap'', ''system'', ''processes'', ''kernel'', ''temp'' are **Linux-only** - an empty category on Windows is normal
  * ''temp'' on Linux needs ''lm-sensors'' installed (on Debian/Ubuntu: ''sudo apt install lm-sensors && sudo sensors-detect'')
  * ''diskio'' on Windows reports per-physical-drive only, not per-partition

----

===== Auth Failures counter rises =====

=== Issue ===

  * The **Statistics** tab shows a non-zero **Auth Failures** counter
  * The number grows over time

=== Solution ===

The counter rises every time a push gets //401//. Possible causes:

  * An agent was deployed with an old key and not updated after the key was regenerated → redeploy the new key
  * Someone is probing the endpoint with the wrong key → check the Collector log for the source IP
  * A user typed the wrong URL or wrong port and another service is answering → check the URL on the agent

Click **Clear** in the Statistics tab, then watch the counter. If it stays at zero, the issue is fixed.

----

===== Parse Errors counter rises =====

=== Issue ===

  * The **Statistics** tab shows a non-zero **Parse Errors** counter
  * The Collector log has ''Failed to parse influx body:'' messages

=== Solution ===

The Collector received a body it cannot parse as Influx Line Protocol. Causes:

  * A test client sends JSON or some other format - the OS Agent push endpoint accepts only Influx Line Protocol sent as ''text/plain''
  * A buggy agent is sending malformed lines
  * The body was truncated by a proxy or load balancer

Open the Collector log and find the ''Failed to parse influx body:'' line - it shows the offending content. Fix the sender.
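
For a first look at a suspicious body, a rough shape check of each line can tell JSON or truncated input apart from Influx Line Protocol. The sketch below is deliberately simplified and is not a full parser: it does not handle escaped spaces or quoted string values containing spaces or commas. The function name ''looks_like_line_protocol'' and the sample metric are illustrative assumptions.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: succeed if a line has the basic shape
#   measurement[,tag=value...] field=value[,field=value...] [timestamp]
looks_like_line_protocol() {
  local line="$1"
  local kv='[^ ,=]+=[^ ,]+'   # one key=value pair without spaces or commas
  local re="^[^ ,]+(,${kv})* ${kv}(,${kv})*( -?[0-9]+)?\$"
  [[ "$line" =~ $re ]]
}

# Usage:
#   looks_like_line_protocol 'cpu,host=web01 usage_idle=98.2 1700000000000000000' && echo "shape ok"
#   looks_like_line_protocol '{"cpu": 98.2}' || echo "not line protocol"
</code>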

----

===== Host cap reached =====

=== Issue ===

  * The Collector log has the line:

''OS Agent: host cap reached (10000), rejecting auto-discovery of <hostname>''

  * New hosts no longer appear in the **Monitor** tab

=== Solution ===

This almost always means someone (or a script) is pushing with random ''host'' tag values using a valid API key. Steps:

  - Open the **Monitor** tab and look at the recent entries - delete obvious junk hostnames
  - Regenerate the global API key
  - Redeploy the new key to **legitimate** agents only - the abuser is locked out

----

===== Old hosts I do not want anymore =====

=== Issue ===

  * A host you have decommissioned still appears in the **Monitor** tab
  * Status is //ERROR// because it is not pushing

=== Solution ===

=== Stop Telegraf on the dead host first ===

If Telegraf is still running with a valid key, the host will re-appear after every Delete. Either:

  * Stop the Telegraf service on the host (''sudo systemctl stop telegraf'' or ''Stop-Service telegraf'')
  * Or uninstall Telegraf entirely

=== Then delete in the Collector ===

  - Open the **Monitor** tab
  - Expand the host
  - Click **Delete**

If you cannot stop Telegraf (the host is unreachable), use **Deactivate** instead - the Collector rejects all pushes from an inactive host, and the host stays listed.

----

===== Configuration changes do not reach deployed agents =====

=== Issue ===

  * You changed the inputs or regenerated a key in the Collector
  * Deployed agents keep their old behavior

=== Solution ===

This is **by design**. The Collector does not push config to agents. Apply changes manually:

  * Re-run the setup script on the host (''sudo ./setup.sh''), or
  * Download the new ''telegraf.conf'' from the **Configuration** tab, copy it to the host, and restart Telegraf:

<code bash>
# Linux
sudo systemctl restart telegraf

# Windows
Restart-Service telegraf
</code>
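
When the manual copy is repeated across many Linux hosts, it can be wrapped in a small helper that keeps a backup and skips the restart when nothing changed. This is a sketch under assumptions: the function name ''install_conf'' is hypothetical, and ''/etc/telegraf/telegraf.conf'' is the usual package default path, which may differ on your hosts.

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: install a new config only if it differs from the current one.
# Prints "updated" and returns 0 on change, prints "unchanged" and returns 1 otherwise.
install_conf() {
  local new="$1" current="$2"
  if cmp -s "$new" "$current"; then
    echo "unchanged"
    return 1
  fi
  # Back up the existing file (if any) before overwriting it.
  if [ -f "$current" ]; then
    cp -p "$current" "$current.bak" || return 2
  fi
  cp "$new" "$current" && echo "updated"
}

# Usage as root on the host:
#   install_conf ./telegraf.conf /etc/telegraf/telegraf.conf && systemctl restart telegraf
</code>

Because the function returns non-zero when the files are identical, the ''&&'' in the usage line restarts Telegraf only after a real change.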