I have been talking to a number of people, both online as well as in person (at MMS, TechED and in other occasions) and one of the recurring questions I get, as people plan for their deployment of System Center 2012 – Operations Manager and want to use the APM feature, is “how much more space will I need for APM data?”.
Now this really is a tough question… and, as much as I hate having to say so, the answer is “it depends”.
We do have APM factored in the
Sizing Helper Spreadsheet
– but as you might know from
some other earlier posts of mine, I am no big fan of this type of calculations
. I admit they are useful, but deep within me I consider them a necessary evil. This is because I realize how HARD it really is to accurately predict the amount of data for things (such as events) that are really the result of so many different factors and variables coming into play… that is almost impossible to give an accurate measure of how things will look like, without actually looking at the data a specific environment produces. APM events are not different: the sizer uses some math based off our tests, and generally errs on the side of caution, but it is still a very rough approximation when it comes to APM events. Every environment is unique, and the real amount of APM events that will be generated (hence the size they’ll take in your database) depends on how your applications behave in production, on the way your users interact with those, when they do it, how many of them there are, the thresholds and advanced settings you chose… Basically, trying to predict this is a bit of a divining act. And I don’t believe that is truly possible to do it accurately in advance.
What IS possible, instead, and much more reliable in my experience, is OBSERVATION and ADAPTATION (call it “tuning” if you wish): in short, taking a look of what data certain applications with certain settings produce over a period of time… and try to use that as a baseline for a forecast. And then, as time passes and the environment evolves, re-baseline and take it from there (tuning configuration and the applications themselves). This is what I have been trying to teach people about Tuning Management Packs when I was working in Premier Field Engineering, this is what I also write in the already-linked-post about ACS, and I am trying to apply the same logic here to APM.
So, what would a first step be in understanding the amount of database space that APM data is taking?
Wait for a couple of business cycles (typically one or two weeks for a good idea on size, albeit you should be able to spot problems in your applications much sooner!)
Look at how much space the sheer amount of data for that type actually takes up – my favorite/easy way for this is to go to SQL Management Studio and use the “Disk Usage by Top Tables” canned report; alternatively, there are several “top tables” queries out there,
i.e. on Kevin’s blog
that you can use
Look at numbers of Events per day with the query I am giving you below
Spikes are by nature possible (and likely to occur) but, in general, if you are looking at the “typical” week or two, you should see some “trends” (i.e. business days vs weekends) and you should be able to approximately map the amount of used space to days and understand how your data usage grows over time. This is far from perfect, but it is MUCH MORE realistic than what any sizing helper can possibly give you – because it shows the REAL data from your REAL applications with the REAL settings that you chose!
So, here’s the query from step #4: (to be executed against the Data Warehouse – note that it is a single query):
---APM Events and Alerts DW Version
--APM Events by day
select CONVERT(VARCHAR(20),UTCEVENTDATE,102) as Day, COUNT(*) as TotalAPMEvents
group by CONVERT(VARCHAR(20),UTCEVENTDATE,102)
order by CONVERT(VARCHAR(20),UTCEVENTDATE,102) DESC
--CSM Events by day
select CONVERT(VARCHAR(20),UTCDATE,102) as Day, COUNT(*) as TotalCSMEvents
group by CONVERT(VARCHAR(20),UTCDATE,102)
order by CONVERT(VARCHAR(20),UTCDATE,102) DESC
--get count of alerts per day
select CONVERT(VARCHAR(20),a.DBCreatedDateTime,102) as Day, count(*) as TotalAPMAlerts
from Alert.vAlert a
left outer join vRule r
on a.WorkflowRowId = r.RuleRowId
left outer join vMonitor m
on a.WorkflowRowId = m.MonitorRowId
where r.ManagementPackRowId in (select ManagementPackRowId from ManagementPack mp with (nolock) where mp.ManagementPackSystemName in ('Microsoft.SystemCenter.Apm.Web','Microsoft.SystemCenter.Apm.Library'))
or m.ManagementPackRowId in (select ManagementPackRowId from ManagementPack mp with (nolock) where mp.ManagementPackSystemName in ('Microsoft.SystemCenter.Apm.Web','Microsoft.SystemCenter.Apm.Library'))
group by CONVERT(VARCHAR(20),a.DBCreatedDateTime,102)
order by CONVERT(VARCHAR(20),a.DBCreatedDateTime,102) desc
--combined table to see ratio
CASE WHEN e.TotalAPMEvents IS NULL THEN 0 ELSE e.TotalAPMEvents END as 'Total APM Events',
CASE WHEN ce.TotalCSMEvents IS NULL THEN 0 ELSE ce.TotalCSMEvents END as 'Total CSM Events',
((CASE WHEN e.TotalAPMEvents IS NULL THEN 0 ELSE e.TotalAPMEvents END) + (CASE WHEN ce.TotalCSMEvents IS NULL THEN 0 ELSE ce.TotalCSMEvents END)) as 'Total APM/CSM Events',
CASE WHEN a.TotalAPMAlerts IS NULL THEN 0 ELSE a.TotalAPMAlerts END as 'Total APM Alerts'
from #APMAlerts a
right outer join #APMEvents e
on a.Day = e.Day
left outer join #CSMEvents ce
on a.Day = ce.Day
order by e.Day desc
drop table #APMAlerts
drop table #CSMEvents
drop table #APMEvents
The result set should look a bit like the following screenshot:
And by quickly copying and pasting the results in Excel you can get a graph like the following, which is fairly eloquent in terms of showing you ratio of APM events versus Alerts that have been generated based off those events:
Other useful queries could get fancier and look at things like:
which applications are generating the most events
what are the most common “problems” found
are those problems generating “performance” events or “exceptions”
These can also be accomplished with queries, of course, but before going the hard route, and very good starting point is using the report provided by Application Advisor, that give us this type of info already, i.e.:
Based on your findings you then can triage the issues and do a few things:
Fix the applications that are misbehaving
(always the preferred approach, if possible: healthier applications will produce less data, and your users will be happier too!)
Optimize the way you are collecting data
(i.e. increasing your threshold to generate less data – note that this might hide away important issues… the right threshold is a fine line)
Be more aggressive in Grooming
in one of the following two ways:
using Problem Management Rules (note that these will only apply to events after they have already been stored in the Operations Database, and won’t prevent alerting rules from firing in the first place); for ideas on how to prevent alerts, instead, please read my earlier post on
Custom APM Rules
Change throttling settings for the agents
this can be achieved by setting overrides on the “Apply APM Agent configuration rule” rule (which is targeted to “.NET monitoring agent”) – there are 15 (yes, fifteen!) override-able properties that govern how many events of each type an agent can send per each minute, hour and day – their names and descriptions are pretty self-explanatory, but we might consider documenting them in the future in case they aren’t.
A sample flow on how to use the Advisor Reports is illustrated by the following diagram:
I hope this gives you useful tips on how to approach slicing and dicing the data collected by APM in order to get the most value out of it while also controlling the amount of space that your database requires… and eventually improving your applications!
This posting is provided "AS IS" with no warranties, and confers no rights.