===== Alarms and Metrics =====

==== Purpose ====

  * Defines the global Alarms and Metrics configuration.
  * All alarms and metrics generated by Pro.Monitor can be propagated, by email or to third-party applications, only through plugins.
  * To use different plugins depending on the origin of an alarm (SAP or internal), use Alarm rules.
  * Internal alarms are typically sent by email to the Pro.Monitor admin.

==== How to access Alarms and Metrics feature ====

  * From the top right of the screen, click the settings icon.
  * Select the admin configuration sub-menu.
  * Click the Alarms/Metrics tab.

\\

==== System availability alerts ====

  * **Max connection resp. time (sec):** An alert is generated if a system has not responded within the number of seconds set in this field. The severity of the alert can be set using the corresponding dropdown list.
  * **Max system down time (sec):** An alert is generated when a system has remained unreachable for the number of seconds set in this field. The severity of the alert can be set using the corresponding dropdown list.
  * **Time zone alarm:** An alert is generated if the time zone of a system is not properly set or cannot be resolved. This option defines the severity used for this alert.

{{..:..:..:userguide:administration:adminconfig:pasted:20190329-181135.png}}

\\

==== Internal alerts ====

  * **Monitor job execution error:** An alert is generated if a Monitor job encounters an error during its execution. The severity of the alert can be set using the corresponding dropdown list.
  * **CCMS errors:** An alert is generated if a CCMS job encounters an error during its execution. The severity of the alert can be set using the corresponding dropdown list.
  * **Monitor Tree loading errors:** An alert is generated if a Monitor Tree job encounters an error while loading data from SAP. The severity of the alert can be set using the corresponding dropdown list.
{{..:..:..:userguide:administration:adminconfig:pasted:20190227-111002.png}}

==== Agents ====

These alarm settings detect problems on an agent and send a notification when one occurs:

  * **Max agent down time (sec):**
    * Sends a notification when an agent is not responding.
    * Defines the maximum time, in seconds, that the agent can be unavailable before a notification is sent.
  * **Min schedule ratio (%):**
    * Detects when an agent does not have enough time to execute all its monitors.
    * The server computes the ratio between executed monitors and rescheduled ones and compares it to the threshold.
    * A ratio of 100% is expected on well-configured agents.
  * **Min successful exec. ratio (%):**
    * Detects when an agent returns many execution errors for its monitors.
    * The server computes the ratio between successful executions and failed ones.
    * It is normal for some monitors to fail from time to time, but many failures might indicate a problem on the agent (resources or network).
  * **Max result send time (sec):**
    * Detects when sending the results from the agent to the primary server takes too long.
    * This can be caused by network problems, or by a resource problem on the agent or the primary server.
    * A notification is sent if the send time is over the threshold.
  * **Max time without results (sec):**
    * Detects when an agent is not sending any results to the server.
    * This can indicate a resource problem on the agent.
    * A notification is sent if the time since the last received result is over the threshold.
  * **Max VM Heap usage (%):**
    * Detects when an agent is using all its allocated memory.
    * If the agent memory usage reaches 100%, this may indicate memory starvation and instability.
    * A notification is sent if the VM memory usage reaches the threshold.
  * **Max OS RAM usage (%):**
    * Detects when the overall OS memory usage is too high.
    * High OS memory usage may prevent the server from using its allocated memory and may cause paging, which degrades performance.
    * A notification is sent if the OS memory usage is over the threshold.
  * **Max OS disk usage (%):**
    * Detects when the application disk space is running low.
    * A full disk must be avoided, because it may bring the service down.
    * A notification is sent if the used disk space is over the threshold.

{{..:..:..:userguide:administration:adminconfig:pasted:20190227-111235.png}}

\\

==== Plugins ====

  * **Max plugin down time (sec):**
    * Detects when a plugin is failing to send events.
    * This is usually a critical case, because monitoring might not be visible in the corresponding third-party platform.
    * A notification is sent if the plugin error lasts longer than the threshold.
{{..:..:..:userguide:administration:adminconfig:pasted:20190227-112140.png}}

\\

==== Licenses ====

  * **Max expiration delay (days):**
    * Sends a notification when a license is about to expire.
  * **Invalid license severity:**
    * Sends a notification, with the selected severity, when a license is not valid.

{{..:..:..:userguide:administration:adminconfig:pasted:20190227-111642.png}}

\\

==== Internal alarms settings ====

  * **Clear alarms:**
    * If set, all clearable alarms are cleared once the problem is no longer detected (by sending an alarm with the //toClear// parameter set to true).

{{..:..:..:userguide:administration:adminconfig:pasted:20190227-111723.png}}

\\

==== Metrics sources ====

  * **Alarm source:** SID, HOST, FQDN, TITLE, INSTANCE, IP
  * **Metric source:** SID, HOST, FQDN, TITLE, INSTANCE, IP

{{..:..:..:userguide:administration:adminconfig:pasted:20190227-111809.png}}
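
As an illustration of the two ratio alarms in the Agents section, the comparison against the thresholds can be sketched as below. This is only a sketch, not Pro.Monitor code: the names ''AgentStats'', ''percent'' and ''agent_ratio_alarms'' are hypothetical, and the formula (part divided by the sum of both counters) is an assumption, chosen because it yields 100% when no monitor is rescheduled or failing.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    """Hypothetical counters for one agent over an evaluation window."""
    executed: int     # monitors executed as scheduled
    rescheduled: int  # monitors that had to be rescheduled
    succeeded: int    # executions that finished without error
    failed: int       # executions that returned an error


def percent(part: int, other: int) -> float:
    """Share of `part` in `part + other`, in percent (assumed formula).

    Returns 100.0 when both counters are zero, so an idle agent raises nothing.
    """
    total = part + other
    return 100.0 if total == 0 else 100.0 * part / total


def agent_ratio_alarms(stats: AgentStats,
                       min_schedule_ratio: float,
                       min_success_ratio: float) -> list[str]:
    """Return the names of the ratio alarms whose value is below its threshold."""
    alarms = []
    if percent(stats.executed, stats.rescheduled) < min_schedule_ratio:
        alarms.append("Min schedule ratio")
    if percent(stats.succeeded, stats.failed) < min_success_ratio:
        alarms.append("Min successful exec. ratio")
    return alarms


# Example: 90 executed and 10 rescheduled monitors give a 90% schedule ratio,
# which is below a 95% threshold; 95 successes and 5 failures give 95%,
# which is above a 90% threshold.
stats = AgentStats(executed=90, rescheduled=10, succeeded=95, failed=5)
print(agent_ratio_alarms(stats, min_schedule_ratio=95.0, min_success_ratio=90.0))
```

With these sample counters only the schedule ratio alarm is reported. Lowering the schedule threshold to 90% or below would leave the list empty.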