OS Agent - Troubleshooting

The host does not appear in the Monitor tab

Issue

  • You ran the setup script on the host
  • The Telegraf service is supposed to be running
  • The Monitor tab in the Collector or Cockpit shows no entry for this host (even after waiting 30 seconds and clicking Refresh)

Solution

1. Check Telegraf is running on the host

# Linux
sudo systemctl status telegraf
journalctl -u telegraf -n 50
 
# Windows (PowerShell)
Get-Service telegraf
Get-EventLog -LogName Application -Source Telegraf -Newest 20

If the service is stopped or in a crash loop, the log shows the reason. Fix it and start the service again.
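To find the reason quickly in a long log, a small filter can help. The sketch below keeps only error-level lines. It assumes Telegraf's usual log format, in which error lines carry an `E!` marker; `telegraf_errors` is a hypothetical helper name, not part of Telegraf or the Collector.

```shell
# Keep only error-level lines from a Telegraf log read on stdin.
# Assumption: error lines carry the " E! " level marker.
telegraf_errors() {
  grep -F ' E! ' || true   # "|| true": no match is not a failure
}

# Usage on Linux (not run here):
#   journalctl -u telegraf -n 200 --no-pager | telegraf_errors | tail -n 5
```

If the filter prints nothing, the service may be failing before it logs anything, so read the unfiltered output as well.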

2. Check Telegraf can reach the Collector

The most common Telegraf log line:

Error writing to outputs.http: Post "https://...": dial tcp …: connect: connection refused

Test from the host:

curl -v https://collector.example.com/api/v1/os-agent/push

Any HTTP response is a good sign, even 401 Unauthorized: it means the network path works.

If you get connection refused or a timeout → check:

  1. The URL in telegraf.conf (scheme, host, port)
  2. The Collector is running and listening on that port
  3. No firewall blocks the connection from the host to the Collector
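The checks above can be sketched as a small probe that separates "the Collector answered" from "the network path is broken". This is a minimal sketch: `classify_push_probe` is a hypothetical helper, and the verdict texts are suggestions only. The curl exit codes it relies on (6, 7, 28) are standard curl behavior.

```shell
# Map a curl exit code and HTTP status to a verdict on the network path.
# curl exit codes used: 0 = got a response, 6 = DNS failure,
# 7 = connection failed (e.g. refused), 28 = timeout.
classify_push_probe() {
  probe_rc=$1
  probe_http=$2
  case $probe_rc in
    0)  echo "network path OK (HTTP $probe_http)" ;;
    6)  echo "DNS failure: check the host name in the URL" ;;
    7)  echo "connection failed: check port, Collector service, firewall" ;;
    28) echo "timeout: check firewall and routing" ;;
    *)  echo "curl failed with exit code $probe_rc" ;;
  esac
}

# Usage (not run here; the URL is the placeholder used on this page):
#   http=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
#          https://collector.example.com/api/v1/os-agent/push)
#   classify_push_probe "$?" "$http"
```

Without the `-f` option, curl exits with 0 for a 401 answer, which matches the rule that any HTTP response proves the path works.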

3. Check the API key

If Telegraf logs received status code: 401:

  1. Open the API Keys tab in the Collector
  2. Copy the current global key
  3. Update the X-API-Key line in telegraf.conf
  4. Restart Telegraf
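When several hosts need the new key, step 3 can be scripted. The following is a sketch under one assumption: the key sits on a line of its own in the form `X-API-Key = "..."`, as is usual for a header entry of the Telegraf HTTP output. `set_api_key` is a hypothetical helper name; check the generated telegraf.conf before relying on it.

```shell
# Replace the value on the X-API-Key line of a Telegraf config file.
# Assumption: the key is set as  X-API-Key = "..."  on a line of its own.
set_api_key() {
  key_conf=$1
  key_tmp=$(mktemp) || return 1
  # Pass the key through the environment so awk does not interpret
  # backslashes or other special characters in it.
  NEW_API_KEY=$2 awk '
    /^[ \t]*"?X-API-Key"?[ \t]*=/ {
      indent = $0
      sub(/[^ \t].*$/, "", indent)   # keep only the leading whitespace
      printf "%sX-API-Key = \"%s\"\n", indent, ENVIRON["NEW_API_KEY"]
      next
    }
    { print }
  ' "$key_conf" > "$key_tmp" && cat "$key_tmp" > "$key_conf"
  key_status=$?
  rm -f "$key_tmp"
  return "$key_status"
}
```

The file is rewritten in place with `cat`, so its owner and permissions stay unchanged. Run the function as a user allowed to write the file, then restart Telegraf as in step 4.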

4. Check the Active flag

If the Active checkbox in the API Keys tab is unchecked, every push is rejected with 401. Tick it and click Save.


The host appears but the status is ERROR

Issue

  • The host is listed in the Monitor tab
  • The status column shows ERROR (red)
  • The Last Push is more than 5 minutes old, or Never

Solution

1. The Telegraf service is stopped

Restart it:

# Linux
sudo systemctl restart telegraf
 
# Windows
Restart-Service telegraf

2. Telegraf is running but pushes fail

Check the Telegraf log on the host. The latest error indicates the cause (network, authentication, host set to inactive…).
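As a rough aid, the error text can be mapped to a likely cause. The sketch below matches the two messages quoted on this page (`connection refused` and `status code: 401`) plus two standard Go network error texts (`i/o timeout`, `no such host`). The message logged for an inactive host is not documented here, so it falls into the unknown case. `classify_telegraf_error` is a hypothetical helper name.

```shell
# Guess the cause of a push failure from one Telegraf error line.
classify_telegraf_error() {
  case $1 in
    *"status code: 401"*)   echo "auth: check the API key and its Active flag" ;;
    *"connection refused"*) echo "network: Collector not listening or wrong port" ;;
    *"i/o timeout"*)        echo "network: firewall or routing" ;;
    *"no such host"*)       echo "network: host name in the URL does not resolve" ;;
    *)                      echo "unknown: read the full log line" ;;
  esac
}

# Usage on Linux (not run here):
#   classify_telegraf_error "$(journalctl -u telegraf -n 200 --no-pager | grep -F ' E! ' | tail -n 1)"
```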

3. The API key was regenerated

After you click Regenerate in the Collector, every deployed agent still using the old key is rejected. Update telegraf.conf with the new key and restart Telegraf.

4. The host is set Inactive

Open the host detail in the Monitor tab and click Activate.


The host shows but the Top Processes table is empty

Issue

  • The host is in the Monitor tab with status OK
  • CPU / Memory / Disk cards are filled
  • The Top Processes table is empty or shows No process data

Solution

1. The "process" input is not enabled

  1. Open the Configuration tab
  2. Tick process in the Enabled Inputs list
  3. Click Save
  4. Re-run the setup script on the host (or copy the new telegraf.conf over) and restart Telegraf

2. The "process" input is enabled but data has not arrived yet

procstat pushes every 30 seconds with the default top-K filter. Wait one cycle and click Refresh.

3. Telegraf runs in a Docker container without SYS_PTRACE

Inside Docker without SYS_PTRACE, the procstat input only sees Telegraf's own process. Add the capability when starting the container:

docker run --cap-add SYS_PTRACE ... telegraf:latest

Or run Telegraf on the host instead of in a container.
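To check an existing container before recreating it, its settings can be read with `docker inspect`. The sketch below decides from that output whether the container has the capability. `container_has_ptrace` and the container name `telegraf` are hypothetical; treating a privileged container or `--cap-add ALL` as sufficient follows from Docker granting all capabilities in those cases.

```shell
# Decide from `docker inspect` output whether a container can trace
# other processes. $1 = CapAdd list, $2 = Privileged flag (true/false).
container_has_ptrace() {
  case $2 in true) return 0 ;; esac
  case $1 in *SYS_PTRACE*|*ALL*) return 0 ;; esac
  return 1
}

# Usage (not run here; "telegraf" is a hypothetical container name):
#   caps=$(docker inspect --format '{{.HostConfig.CapAdd}}' telegraf)
#   priv=$(docker inspect --format '{{.HostConfig.Privileged}}' telegraf)
#   container_has_ptrace "$caps" "$priv" && echo "SYS_PTRACE available"
```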


Disk I/O on processes is always zero

Issue

  • The Top Processes table is filled
  • The Disk I/O toggle shows zero everywhere - no read or write bytes

Solution

This is not a bug - per-process disk I/O needs read access to /proc/<pid>/io on Linux:

  • Linux bare metal as root → works
  • Linux Docker without SYS_PTRACE → the file is unreadable even for root inside the container, so the field stays at zero
  • Windows admin → works

If the host is in Docker, add --cap-add SYS_PTRACE or --privileged to the docker run command, or accept that the field stays at zero.
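Whether the file is readable can be tested directly in the environment where Telegraf runs. This is a minimal sketch: `can_read_proc_io` is a hypothetical helper, and its second argument exists only so the function can be tried against a directory other than /proc.

```shell
# Return success if /proc/<pid>/io can actually be read.
# $2 is an optional alternative proc root, useful only for testing.
can_read_proc_io() {
  cat "${2:-/proc}/$1/io" > /dev/null 2>&1
}

# Usage (not run here): run as the same user as Telegraf, inside the
# container if Telegraf runs in one.
#   can_read_proc_io 1 && echo "per-process I/O readable" || echo "blocked"
```

The function reads the file instead of only testing its permission bits, because the permission bits alone do not show whether the kernel will allow the read.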


One specific input shows nothing

Issue

  • Most metrics are filled
  • One specific category (swap, kernel, temp…) is empty

Solution

  • swap, system, processes, kernel, temp are Linux only - empty on Windows is normal
  • temp on Linux needs lm-sensors installed (sudo apt install lm-sensors && sudo sensors-detect)
  • diskio on Windows reports per-physical-drive only, not per-partition

Auth Failures counter rises

Issue

  • The Statistics tab shows a non-zero Auth Failures counter
  • The number grows over time

Solution

The counter rises every time a push gets 401. Possible causes:

  • An agent was deployed with an old key and not updated after the key was regenerated → redeploy the new key
  • Someone is probing the endpoint with the wrong key → check the Collector log for the source IP
  • A user typed the wrong URL or wrong port and another service is answering → check the URL on the agent

Click Clear in the Statistics tab, then watch the counter. If it stays at zero, the issue is fixed.


Parse Errors counter rises

Issue

  • The Statistics tab shows a non-zero Parse Errors counter
  • The Collector log has Failed to parse influx body: messages

Solution

The Collector received a body it cannot parse as Influx Line Protocol. Causes:

  • A test client sends JSON or some other format - the OS Agent push expects text/plain Influx LP only
  • A buggy agent is sending malformed lines
  • The body was truncated by a proxy or load balancer

Open the Collector log and find the Failed to parse influx body: line, which shows the offending content. Then fix the sender.
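For comparison, a well-formed test push can look like the sketch below. It builds one Influx Line Protocol line (measurement, a `host` tag, one field, a timestamp) and shows in comments how it could be sent as text/plain. The measurement and field names, the nanosecond timestamp, and the `API_KEY` variable are assumptions for illustration; the URL is the placeholder used earlier on this page, and whether the Collector accepts this particular measurement is not confirmed here.

```shell
# Build one Influx Line Protocol line: measurement,host=<h> <field>=<v> <ts>
# Values are not escaped; spaces and commas in them would need a backslash.
influx_line() {
  printf '%s,host=%s %s=%s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Usage (not run here; API_KEY is a hypothetical variable holding the key):
#   influx_line cpu web01 usage_idle 97.5 "$(date +%s)000000000" |
#     curl -s -o /dev/null -w '%{http_code}\n' \
#          -H 'Content-Type: text/plain' -H "X-API-Key: $API_KEY" \
#          --data-binary @- https://collector.example.com/api/v1/os-agent/push
```

`--data-binary` sends the body unchanged, whereas `-d` strips newlines and can itself produce malformed multi-line bodies.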


Host cap reached

Issue

  • The Collector log has the line:

OS Agent: host cap reached (10000), rejecting auto-discovery of <hostname>

  • New hosts no longer appear in the Monitor tab

Solution

This almost always means that someone (or a script) is pushing with random host tag values using a valid API key. Steps:

  1. Open the Monitor tab and look at the recent entries - delete obvious junk hostnames
  2. Regenerate the global API key
  3. Redeploy the new key to legitimate agents only - the abuser is locked out

Old hosts I do not want anymore

Issue

  • A host you have decommissioned still appears in the Monitor tab
  • Status is ERROR because it is not pushing

Solution

Stop Telegraf on the dead host first

If Telegraf is still running with a valid key, the host will re-appear after every Delete. Either:

  • Stop the Telegraf service on the host (sudo systemctl stop telegraf or Stop-Service telegraf)
  • Or uninstall Telegraf entirely

Then delete in the Collector

  1. Open the Monitor tab
  2. Expand the host
  3. Click Delete

If you cannot stop Telegraf (the host is unreachable), use Deactivate instead. The Collector rejects all pushes from an inactive host, and the host stays listed.


Configuration changes do not reach deployed agents

Issue

  • You changed the inputs or regenerated a key in the Collector
  • Deployed agents keep their old behavior

Solution

This is by design. The Collector does not push config to agents. Apply changes manually:

  • Re-run the setup script on the host (sudo ./setup.sh), or
  • Download the new telegraf.conf from the Configuration tab, copy it to the host, restart Telegraf:
# Linux
sudo systemctl restart telegraf
 
# Windows
Restart-Service telegraf
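For the manual copy on Linux, keeping a backup of the previous file makes it easy to roll back. The sketch below is one way to do that. `install_telegraf_conf` is a hypothetical helper, and /etc/telegraf/telegraf.conf is the usual location for packaged Telegraf installs; the setup script may use a different path.

```shell
# Copy a new config over the current one, keeping a backup of the old file.
install_telegraf_conf() {
  new_conf=$1
  target_conf=$2
  [ -r "$new_conf" ] || { echo "cannot read $new_conf" >&2; return 1; }
  if [ -f "$target_conf" ]; then
    cp -p "$target_conf" "$target_conf.bak" || return 1
  fi
  cp "$new_conf" "$target_conf"
}

# Usage as root on Linux (not run here; the target path is an assumption):
#   install_telegraf_conf ./telegraf.conf /etc/telegraf/telegraf.conf
#   systemctl restart telegraf
```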
products/promonitor/latest/userguide/administration/os-agent/troubleshooting.txt · Last modified: by jtbeduchaud