In version 11g of Oracle database, there is a new feature whereby current performance data (obtained from AWR snapshots) can be compared against an AWR baseline and an alarm triggered if a given metric exceeds a certain threshold. From what I understand, there are 3 types of thresholds : fixed value, percent of maximum and significance level. The first type (fixed value) is very easy to understand - alarms are triggered whenever the metric in question exceeds certain fixed values specified for the warning and critical alerts (without reference to the baseline). The 2nd type (percent of maximum) presumably means that an alert is triggered whenever the current value of the metric exceeds the specified percent of the maximum value of the metric that was observed in the whole baseline period (if I understood this correctly - correct me if I'm wrong).
However, the 3rd type (significance level) is not at all easy to understand. The Oracle documentation is not at all clear on that point, nor could I find any Metalink notes on the subject. I also tried searching the OTN forums, to no avail. Could you please explain, in very simple terms, when exactly an alarm would be triggered if "significance level" is specified for the threshold type, if possible by giving a simple example. There are apparently 4 levels of such thresholds (high, very high, severe and extreme).
and we said...
I asked Graham Wood and John Beresniewicz for their input on this, they are the experts in this particular area
Graham Wood wrote:
> Copying JB as this is his specialty area, in case I don't get it right. :-)
> The basic idea of using significance level thresholds for alerting is that we are trying to detect outliers in the distribution of metric values, rather than setting a simple threshold value.
> By looking at the historical metric data from AWR we can identify values for 25th, 50th (median), 75th, 90th, 95th and 99th percentiles. Using a curve fitting algorithm we also extrapolate the 99.9th and 99.99th percentiles. We derive these percentiles based on time grouping, such as day, night, and hour of day.
> In the adaptive baselines feature in 11g we allow the user to specify the alert level, which equates to one of these percentile values:
> High 95th percentile
> Very High 99th percentile
> Severe 99.9th percentile
> Extreme 99.99th percentile
> Using the AWR history (actually the SYSTEM_MOVING_WINDOW baseline) the database will automatically determine the threshold level for a metric that corresponds to the selected significance level for the current time period.
> Setting a significance level of Extreme means that we would only alert on values that we would only expect to see once in a 10000 observations (approximately once in every years for hourly thresholds).
> Cheers, Graham
The significance level thresholds are intended to produce alert threshold values for key performance metrics that represent the following:
"Automatically set threshold such that values observed above the threshold are statistically unusual (i.e. significant) at the Nth percentile based on actual data observed for this metric over the SYSTEM_MOVING_WINDOW baseline."
The premise here is that systems with relatively stable performance characteristics should show statistical stability in core performance metric values, and when unusual but high-impact performance events occur we expect these will be reflected in highly unusual observations in one or more (normally statistically stable) metrics. The significance level thresholds give users a way to specify alerting in terms of "how unusual" rather than "how much".
Longer (original) reply:
Hi Tom -
Graham did a pretty good job, but I'll add some stuff.
Fixed thresholds are set explicitly by user, and change only when user unsets or sets a different threshold. They are based entirely on user understanding of the underlying metrics in relation to the underlying application and workload. This is the commonly understood paradigm for detecting performance issues: trigger an alert when metric threshold is crossed. There are numerous issues we perceived with this basic mechanism:
1) "Performance" expectations, and thus alert thresholds, often vary by application, workload, database size, etc. This results in what I call the MxN problem, which is that M metrics over N systems becomes MxN threshold decisions each of which can be very specific (i.e. threshold decisions not transferable.) This is potentially very manually intensive for users with many databases.
2) Workload may vary predictably on system (e.g. online day vs. batch night) and different performance expectations (and thus alert thresholds) may pertain to different workloads, so one threshold for all workloads is inappropriate.
3) Systems evolve over time and thresholds applicable for the system supporting 1,000 users may need to be altered when system supports 10,000 users.
The adaptive thresholds feature tries to address these issues as follows:
A) Thresholds are computed by the system based on a context of prior observations of this metric on this system. System-and-metric-specific thresholds are developed without obliging user to understand the specifics (helps relieve the MxN problem.)
B) Thresholds are periodically recomputed using statistical characterizations of metric values over the SYSTEM_MOVING_WINDOW baseline. Thus the thresholds adapt to slowly evolving workload or demand, as the moving window moves forward.
C) Metric statistics for adaptive thresholds are computed over grouping buckets (which we call "time groups") that can accommodate the common workload periodicities (day/night, weekday/weekend, etc.) Thresholds resets can happen as frequently as every hour.
So the net-net is that metric alert thresholds are determined and set automatically by the system using actual metric observations as their basis and using metric-and-system-independent semantics (significance level or pct of max.)
From Tom - Thanks both!
Is this answer out of date? If it is, please let us know via a Comment