Behind the Scenes: Monitoring as a Service at UITS

Filed in Collaboration, Featured on June 2, 2015

As you use various services every day, from that weather app on your phone to the e-mail software on your computer, have you ever wondered how companies monitor the health of those services? And what happens when a service goes down? Does the service provider receive a notification immediately, or at all?

This all depends on how the service provider monitors its services. Here at the UMass Systems Office, University Information Technology Services (UITS) wants to know when an outage happens before our end users report it, so that we can prevent or reduce the impact of user-facing outages.

As users begin to expect more from technology, the ability to proactively or rapidly detect instability or unavailability of critical services, such as the University-wide HR system, becomes one of the most strategically important aspects of improving the efficiency and effectiveness of the University’s operations.

Although UITS has always had monitoring tools, a formal Monitoring as a Service strategy and program were implemented in February 2014. Through this program, monitoring was transformed by defining tools, metrics, and processes to mature the University’s monitoring capabilities and optimize the stability and availability of our technology resources. As the program evolves, our 24×7 Operations Center staff gives us the ability to rapidly identify, analyze, and escalate technology issues. This leads to fewer outages and improved response and recovery times.

Currently, UITS and UMass Amherst are collaborating to leverage the 24×7 Operations Center staff for monitoring and escalation of critical system outages on the Amherst campus.

So what tools and processes do we use to monitor University Shared Services?

Oracle Enterprise Manager (OEM)

OEM systems monitoring is one of the core components that enables UITS to manage and monitor 600+ host machines and 1000+ entities, such as databases, middleware components, and key web services. The objective of OEM is to monitor the health of servers and services, while providing application owners with rapid identification and notification of potential issues for impact review as well as defined ownership.

In addition to systems monitoring, OEM beacons are implemented to remotely monitor one or more services within our data center from participating campus locations at 5-minute intervals. The purpose of having beacons at each campus is to provide an additional level of monitoring that assists UITS in rapid identification of instability, availability, or connectivity issues between the campuses and our University Shared Services (e.g., Finance, HR, etc.).
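To illustrate why probing the same service from several campuses is useful, here is a minimal Python sketch of how one round of beacon results might be interpreted. It is not OEM's own logic or API; the `BeaconResult` record, the `classify` function, and the campus labels are all hypothetical names chosen for this example. The idea is simply that a failure seen from every campus points at the service itself, while a failure seen from only some campuses points at connectivity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconResult:
    """One probe of one service from one campus (hypothetical record)."""
    campus: str
    service: str
    ok: bool
    response_ms: float


def classify(results):
    """Interpret a single round of probes for one service."""
    if not results:
        return "no data"
    failed = sorted(r.campus for r in results if not r.ok)
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        # Every campus sees the failure, so the service itself is suspect.
        return "service unavailable from all campuses"
    # Only some campuses fail, which suggests a campus-side network problem.
    return "possible connectivity issue: " + ", ".join(failed)


# Illustrative data only: campus names and timings are made up.
probe_round = [
    BeaconResult("Campus A", "HR", True, 420.0),
    BeaconResult("Campus B", "HR", False, 0.0),
    BeaconResult("Campus C", "HR", True, 455.0),
]
print(classify(probe_round))  # possible connectivity issue: Campus B
```

In a real deployment the probe results would come from the monitoring tool itself; the sketch only shows the comparison step.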

Campuses can also opt in to get access to monitoring data via OEM dashboard capabilities. The OEM dashboard provides the ability to view the real-time performance of key services executing from a campus beacon, as well as the performance results from other participating campuses for comparison purposes.

AlertSite

AlertSite, a hosted monitoring solution, provides UITS Systems, Application Administrators, and Operators with lights-on monitoring 24×7, 365 days a year, for public-facing websites and web transactions, such as logins and website navigation, from external locations. AlertSite currently monitors 38 critical production services and has proven to be an effective tool for rapidly detecting customer-facing issues in real time. In addition, AlertSite provides a solid means to define and capture service level objectives, including uptime, availability, and response time metrics, and to produce service level reports that assure service level compliance and allow comparison of actual performance with the designated objectives.
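As a rough illustration of comparing actual performance with designated objectives, the Python sketch below computes availability and average response time from a list of check results and compares them with targets. It does not use AlertSite's API or report format; the `Check` record, the `service_level_report` function, the sample of 288 checks, and the target values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Outcome of one monitored transaction (hypothetical record)."""
    ok: bool
    response_ms: float


def service_level_report(checks, availability_target_pct, response_target_ms):
    """Compare measured availability and response time with objectives."""
    if not checks:
        raise ValueError("at least one check is required")
    successes = [c for c in checks if c.ok]
    availability = 100.0 * len(successes) / len(checks)
    # Average response time is computed over successful checks only.
    avg_response = (
        sum(c.response_ms for c in successes) / len(successes)
        if successes else None
    )
    return {
        "availability_pct": availability,
        "avg_response_ms": avg_response,
        "availability_met": availability >= availability_target_pct,
        "response_met": avg_response is not None
        and avg_response <= response_target_ms,
    }


# Illustrative sample: 288 checks, 2 of which failed.
sample = [Check(True, 500.0)] * 286 + [Check(False, 0.0)] * 2
report = service_level_report(
    sample, availability_target_pct=99.5, response_target_ms=2000.0
)
print(round(report["availability_pct"], 2), report["availability_met"])
```

With these made-up numbers, two failed checks out of 288 are enough to miss a 99.5% availability objective, which is why a stated target and a measured value need to sit side by side in a service level report.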

Operations Center

Our 24×7 live operations staff assists with rapid detection, validation, and flexible escalation of application, service, or server-based availability events. The Operations Center staff is responsible for actively monitoring critical production servers, websites, and web-based transactions via the AlertSite and OEM dashboards. When a monitored event occurs, the Operations Center staff attempts to validate the error condition to reduce false positives and escalates based on criteria that are unique to each monitored service. Following escalation procedures, the Operations Center staff enters a ticket to record the monitored event and assign ownership.
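The validate-then-escalate flow described above is carried out by people, but its shape can be sketched in a few lines of Python. Everything here is hypothetical: the `EscalationRule` record, the `handle_event` function, the number of rechecks, and the owner names are illustrative assumptions, not UITS's actual criteria or ticketing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    """Per-service escalation criteria (hypothetical)."""
    rechecks: int  # failed rechecks required before escalating
    owner: str     # who receives the escalation and the ticket


def handle_event(service, recheck, rules):
    """Validate an alert, then dismiss it or escalate and record a ticket.

    `recheck` is a callable that returns True when the service responds.
    """
    rule = rules[service]
    for _ in range(rule.rechecks):
        if recheck():
            # The service answered on a recheck: treat as a false positive.
            return {"service": service, "action": "dismissed"}
    return {
        "service": service,
        "action": "escalated",
        "owner": rule.owner,
        "ticket": f"{service} unavailable, assigned to {rule.owner}",
    }


# Illustrative rules: each service has its own criteria.
rules = {
    "HR": EscalationRule(rechecks=2, owner="HR application team"),
    "Finance": EscalationRule(rechecks=3, owner="Finance application team"),
}

outcomes = iter([False, False])  # both rechecks fail
print(handle_event("HR", lambda: next(outcomes), rules)["action"])  # escalated
```

Keeping the criteria in a per-service table mirrors the point that escalation rules are unique to each monitored service.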

Next Steps

UITS will broaden the scope of monitoring by introducing additional processes and monitoring technology, such as System Center Operations Manager (SCOM). In addition, we plan to evolve the Operations Center to further assist UITS and the campuses with rapid identification and remediation of monitored events, and we will mature our ability to present service level metrics. Lastly, we would like to visit each campus and discuss ways in which campuses could leverage our current monitoring processes.

For more information on the UMass Monitoring as a Service Program and collaboration opportunities, please contact me at khudzikiewicz@umassp.edu.