SCOM performance has been one of the top user voice items over the years. It directly affects how our customers interact with SCOM, including the web console and the Operations console. This blog highlights specific user scenarios that we addressed in the SCOM 2019 release, as well as those we intend to address in subsequent SCOM releases. Our customers brought these scenarios to our attention, and they are reflected in the SCOM performance issues that our customer support team frequently resolves.
In SCOM 2019, here are some of the scenarios that have been fixed:
1. Windows computer view in SCOM console: A few of our customers told us that opening the Windows computer view in the SCOM console took an unreasonable amount of time. As an example, one of our customers with 1600+ Windows computers reported this issue. About 1400 servers and 50 clients (engineers) were affected. While on average it took about 8-10 minutes for this view to load, in extreme cases it took more than 20 minutes. Because our customers depend on SCOM to provide timely information, including but not limited to alerts, health and performance metrics of their applications and workloads, the performance of the SCOM console is critical to their monitoring experience. To decrease the load time for this view, we optimized the SQL query behind it.
2. Changing the settings of a User Role: One of our customers had an environment with a lot of user roles, views, classes and relationship types in the database. The customer reached out to the System Center team and explained that, on average, changing the settings of a user role (for instance, granting or revoking permissions on specific views or dashboards) took about 30 minutes. Further research suggested that many other SCOM customers with similar environments experienced the same problem. In fact, many customers concurred that changing the settings of, say, 15-20 user roles took an entire day, which hurt their productivity and their ability to use the SCOM console effectively. At a high level, we optimized the SQL queries that fetch the relevant data and apply the changes to a user role's settings. This optimization led to significant improvements in the load time for our customers.
3. Grooming of Maintenance Mode Staging Table: Our customer support team received a case from a customer reporting that grooming (emptying) of the maintenance mode staging table in the SCOM Operations Manager Data Warehouse was not occurring. This meant that the table grew every day into millions of rows and eventually filled up the database, which could lead to additional cost for the customer to spin up a new database. Furthermore, an increase in database utilization usually correlates with a decrease in SCOM console performance. To fix this issue, we added an index to the maintenance mode staging table. This ensured that the table was groomed properly.
4. SDK service not starting and severe perf degradation leading to SCOM console not loading: While the technical details of these issues are beyond the scope of this blog, suffice it to say that a couple of SQL queries running in the back end were causing them. In fact, a few of our customers mentioned that they had faced severe performance degradation since upgrading from SCOM 1807 to 2019. The SCOM console took a long time to load, and when it did, even basic tasks such as adding a management server to a management group couldn't be completed. To fix this, we optimized the relevant SQL queries, which led to a significant performance improvement.
5. Reliability and performance improvement in XPlat agent
Prior to 2019, monitoring data related to health and performance was fetched through requests running in the same thread in the back end. Because of this design, any flaw in the perf channel affected the heartbeat requests and vice versa. This often led to the system going into a greyed-out state.
In 2019, we isolated the heartbeat threads from the threads that handle performance data, which means that a malfunctioning performance provider no longer affects heartbeat requests, thereby improving the reliability of SCOM.
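To illustrate why this separation helps, here is a minimal Python sketch of the general idea, not the XPlat agent's actual implementation. Every name in it (`run_channel`, `stalled_perf_provider`, `heartbeat`, `demo`) is hypothetical. Each channel gets its own queue and its own thread, so a performance provider that blocks only stalls the perf channel while heartbeats keep being answered.

```python
import queue
import threading


def run_channel(name, requests, results):
    """Serve one channel's requests on its own thread until a None sentinel arrives."""
    while True:
        handler = requests.get()
        if handler is None:
            break
        results.append((name, handler()))


def heartbeat():
    return "alive"


def demo():
    results = []  # shared log of (channel, response) pairs
    release = threading.Event()

    def stalled_perf_provider():
        # Hypothetical malfunctioning provider: blocks its thread until released.
        release.wait(timeout=5)
        return "perf sample"

    heartbeat_q, perf_q = queue.Queue(), queue.Queue()
    hb_thread = threading.Thread(
        target=run_channel, args=("heartbeat", heartbeat_q, results))
    perf_thread = threading.Thread(
        target=run_channel, args=("perf", perf_q, results))
    hb_thread.start()
    perf_thread.start()

    # The perf channel stalls on its first request...
    perf_q.put(stalled_perf_provider)
    perf_q.put(None)
    # ...while heartbeats are served independently on their own thread.
    for _ in range(3):
        heartbeat_q.put(heartbeat)
    heartbeat_q.put(None)

    hb_thread.join()
    answered_while_stalled = list(results)  # snapshot taken while perf is blocked

    release.set()  # let the stalled provider finish
    perf_thread.join()
    return answered_while_stalled, results
```

Calling `demo()` returns the heartbeats that were answered while the perf channel was still blocked, followed by the full log once the provider is released. If both kinds of request shared one queue and one thread, the heartbeats queued behind the stalled provider would wait until it returned.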
We also introduced filters in XPlat MPs to help customers limit their discovery and monitoring scope to the entities of interest. With these filters, customers can define OMI queries to limit their workloads. For instance, on the SUSE platform there is a file system called "ReiserFS" that is not supported in the core XPlat MPs, yet this file system was discovered with inappropriate performance data. Similarly, in hypervisor and container environments, a large set of logical entities is created that should not be monitored. This happens because of the generic nature of the XPlat agent, which discovers all of these entities. With the introduction of filters in XPlat MPs, discovery and monitoring of all such entities can be fine-tuned further to improve the performance and scale of the XPlat agent.
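The effect of such a filter can be sketched in a few lines of Python. This is only a conceptual illustration: the real filters are expressed as OMI queries inside the management packs, and the `FileSystem` type, the `apply_discovery_filter` helper, and the sample mount points below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSystem:
    mount_point: str
    fs_type: str


def apply_discovery_filter(discovered, excluded_types):
    """Keep only the file systems whose type is not in the exclusion list."""
    excluded = {t.lower() for t in excluded_types}
    return [fs for fs in discovered if fs.fs_type.lower() not in excluded]


# Hypothetical result of an unfiltered, generic discovery run.
discovered = [
    FileSystem("/", "ext4"),
    FileSystem("/data", "reiserfs"),
    FileSystem("/var/lib/docker/overlay2/abc/merged", "overlay"),
    FileSystem("/home", "xfs"),
]

# Exclude the unsupported file system and the container-created entries.
monitored = apply_discovery_filter(discovered, ["reiserfs", "overlay"])
```

Here only `/` and `/home` remain in `monitored`, so the agent would collect performance data for two entities instead of four. The fewer irrelevant entities discovered, the less work the agent and the management servers have to do.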
If you have faced any of the issues mentioned above in your SCOM environment, please let us know your current experience. Many of our customers are still using SCOM 2012 and SCOM 2016. Given that we will continue to invest significantly in improving SCOM performance in 2020, we strongly recommend that you upgrade your environment to SCOM 2019 Update Rollup 2 to get better performance.
Lastly, in 2020, we're planning to invest in improving the performance of the SCOM consoles for other top user scenarios such as alert, health and performance views.