Cloud Monitoring

Cloud monitoring

Cloud monitoring tools

Cloud monitoring tools aggregate data in real time, tracking performance, resource allocation, network availability, and other key performance indicators (KPIs).
Capabilities that cloud monitoring software provides include...

Real-time, 24x7 monitoring of virtual machines, services, databases, and applications
Multilayer visibility into application, user, and file access behavior
Reporting and auditing capabilities for ensuring regulatory standards are being met
Monitoring integrations across multicloud and hybrid cloud environments
Trends in user activity
Avoiding downtime
Avoiding under-provisioned workloads

Cloud monitoring tools include...

Website monitoring
Identify minor and large-scale hardware failures and security gaps.
Database monitoring
Monitor cloud database resources, track processes, queries, and availability of services.
Application Performance Management (APM)
Measure application availability and performance. Provide development teams tools for troubleshooting. Improve user experience, meet application and user service level agreements (SLAs), minimize downtime, and lower overall operational costs.

Hybrid cloud monitoring

Hybrid cloud environments combine the use of public cloud services and private on-premises infrastructure. Organizations can keep sensitive elements of their business, such as client data and transaction processes, on-premises while running other applications and services in cloud environments.
However, a lack of end-to-end visibility into applications and services in a hybrid environment can make it difficult to identify and address critical failures or bottlenecks in software development pipelines, website and application performance, network configurations, and other IT-related processes. Hybrid cloud monitoring solutions integrate with cloud vendor performance data and provide data visualizations, allowing teams to make better decisions regarding service deprecation, application resource provisioning, mobile agility, and database management.

Multicloud monitoring

Multicloud environments are similar to a hybrid cloud in that they leverage the use of an on-premises solution in combination with cloud-based computing environments, but they add the complexity of utilizing multiple public cloud providers.
Multicloud adds complexity to management of infrastructure. Every cloud provider operates under different SLAs when it comes to availability, compliance, and security.
Multicloud monitoring listens for the golden signals of latency, traffic, errors, and saturation. Organizations can receive policy-driven notifications as incidents arise and run automated processes to resolve them.
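The four golden signals can be derived from plain request records. The following is a minimal Python sketch, not any vendor's implementation: the Request fields, the golden_signals and percentile functions, the choice of the 95th percentile for latency, and CPU utilization as the saturation measure are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (0.0 if empty)."""
    if not sorted_values:
        return 0.0
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    durations = sorted(r.duration_ms for r in requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": percentile(durations, 95),        # latency
        "traffic_rps": len(requests) / window_s,            # traffic
        "error_rate": failed / len(requests) if requests else 0.0,  # errors
        "saturation": cpu_used / cpu_capacity,              # saturation (assumed: CPU)
    }
```

A policy-driven notification could then compare each value against a per-provider limit, since each cloud provider operates under its own SLAs.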

Cloud monitoring best practices

Moving to hybrid or multicloud environments can provide many advantages for scaling enterprises, especially when looking to create more agile operations. But to get the most benefit out of cloud-based deployments, we should follow some standard cloud monitoring best practices.

Utilize end-user experience monitoring

While creating better internal efficiencies around process management is necessary, the primary goal of every business should be to monitor and address user experience at all levels. Gaining insights on how to improve application performance and availability for users can have significant impacts on the bottom line and the overall sustainability of products and services.
There are two ways that organizations can deploy digital experience monitoring in an enterprise setting:

Synthetic monitoring: Also known as active monitoring, synthetic monitoring simulates end-user interactions to provide feedback on application performance under various conditions. This allows us to benchmark and baseline the entire connected infrastructure, and to see how it responds to complex processes and heavy workloads before applications are deployed, helping to maximize availability and overall reliability.

Real user monitoring (RUM): Real user monitoring uses "real" user metrics to gain a better understanding of overall digital experiences. RUM is designed to collect all user activities in real time, following the user's journey while measuring how backend services, application performance metrics, server load times, and other KPIs are performing.

In complex infrastructures and hybrid cloud deployments, synthetic and real user monitoring work in collaboration with one another to provide complete visibility into the digital experience. This includes providing detailed analysis of network, backend, and frontend performance, as well as deep user insights that help organizations isolate key issues and address them.
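To make the synthetic side concrete, the sketch below runs one scripted transaction and classifies the result. It is only an illustration: run_synthetic_check, the "up", "degraded", and "down" status names, and the one-second default limit are assumptions, and the transaction is any callable standing in for a simulated user journey.

```python
import time


def run_synthetic_check(name, transaction, slow_after_s=1.0):
    """Run one scripted transaction and classify the outcome.

    `transaction` is a zero-argument callable that raises on failure
    (a hypothetical stand-in for a simulated end-user journey).
    """
    started = time.perf_counter()
    try:
        transaction()
        error = None
    except Exception as exc:  # a probe should report failures, not crash
        error = str(exc)
    elapsed = time.perf_counter() - started

    if error is not None:
        status = "down"
    elif elapsed > slow_after_s:
        status = "degraded"
    else:
        status = "up"
    return {"check": name, "status": status, "elapsed_s": elapsed, "error": error}


def simulated_login():
    """Placeholder journey; a real probe would call the application here."""
    time.sleep(0.01)


result = run_synthetic_check("login", simulated_login)
```

Running such checks on a schedule before and after deployment gives the baseline that real user measurements can later be compared against.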

Move to a unified platform

Moving all aspects of the infrastructure under one unified monitoring platform allows us to efficiently manage all KPIs in one place and have visibility into performance optimization.

Increase automation

Automation increases operational efficiency by acting on intelligence and predictive golden signals. It also gives teams better visibility and control over website performance, resource management, application availability, and more.

Cloud monitoring and IBM

IBM Cloud monitoring solutions...

IBM Cloud Pak for Multicloud Management
Cloud-Monitoring-as-a-Service (CMaaS)

Types of monitoring

IBM has defined the discipline of Cloud Service Management & Operations (CSMO) as "all the activities that an organization does to plan, design, deliver, operate, and control the IT and cloud services that it offers to customers."
Monitoring can be roughly divided into three types:

Metrics - collecting numerical information from the application and platform. This may be a number that is calculated by the application (e.g., how many items are in a queue) or exposed by the platform (e.g., how much memory the process is consuming).
Logging - collecting textual information (e.g., an error message generated by the application).
Synthetic monitoring - sending an external message to the application and examining the response to determine the component's status (e.g., sending a ping to a server or simulating an entire customer transaction).

Once the monitoring system has discovered that a specific metric has passed a threshold or a log entry matches a test, it will forward an event up the incident management toolchain so that the issue can be solved either automatically or manually.
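The threshold and log-match step can be sketched in a few lines of Python. This is an illustrative model rather than the behavior of any particular product: the Event shape, the function names, and the severity labels are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event forwarded to the incident management toolchain."""
    source: str
    severity: str
    message: str


def check_metric(name, value, threshold, severity="critical"):
    """Return an Event if the metric value is above its threshold, else None."""
    if value > threshold:
        return Event(name, severity, f"{name}={value} exceeded threshold {threshold}")
    return None


def check_log_line(line, pattern, severity="warning"):
    """Return an Event if the log line matches the regular expression, else None."""
    if re.search(pattern, line):
        return Event("log", severity, line.strip())
    return None


def collect_events(metrics, thresholds, log_lines, pattern):
    """Evaluate every metric and log line; keep only the checks that fired."""
    events = [
        check_metric(name, value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds
    ]
    events += [check_log_line(line, pattern) for line in log_lines]
    return [event for event in events if event is not None]


events = collect_events(
    metrics={"queue_depth": 120, "memory_mb": 300},
    thresholds={"queue_depth": 100, "memory_mb": 512},
    log_lines=["INFO service started", "ERROR database timeout\n"],
    pattern=r"\bERROR\b",
)
```

Here the queue depth and the ERROR line each produce one event, while the memory reading stays under its limit and produces none.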

Practices and tools

The first level of monitoring is that of the platform (when in the cloud) and the datacenter infrastructure (when on-premises). While each platform and infrastructure usually has a dedicated (siloed) monitoring solution, we can use Netcool Operations Insight (NOI) or Cloud Event Management (CEM) to collect events from these solutions and use Application Performance Management (APM) to monitor them independently.
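Collecting events from several siloed monitoring solutions usually means merging reports that describe the same problem. The sketch below shows that idea only; it is not the NOI or CEM API, and the event fields (source, resource, problem), the sample monitor names, and the consolidate function are hypothetical.

```python
def consolidate(events):
    """Merge events that describe the same problem on the same resource.

    Each event is a dict with "source", "resource", and "problem" keys
    (an assumed shape for this sketch).
    """
    merged = {}
    for event in events:
        key = (event["resource"], event["problem"])
        entry = merged.setdefault(
            key,
            {"resource": key[0], "problem": key[1], "count": 0, "sources": set()},
        )
        entry["count"] += 1
        entry["sources"].add(event["source"])
    # Return sources as a sorted list so the result is stable and serializable.
    return [dict(entry, sources=sorted(entry["sources"])) for entry in merged.values()]


incidents = consolidate([
    {"source": "datacenter-monitor", "resource": "db-01", "problem": "disk_full"},
    {"source": "cloud-status", "resource": "db-01", "problem": "disk_full"},
    {"source": "cloud-status", "resource": "web-02", "problem": "high_latency"},
])
```

The two disk_full reports for db-01 collapse into a single incident with a count of 2, which is the kind of reduction that keeps a shared event console readable.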

IBM's monitoring solutions for cloud platforms

Netcool Operations Insight (NOI) and Cloud Event Management (CEM) are designed to collect, correlate, and consolidate millions of events and alarms from your on- and off-premises environments. We use them to leverage siloed monitoring systems and gather information and events. NOI and CEM have a role in event management and incident management that goes beyond monitoring.
IBM Cloud has a status console that displays the state of the IBM Cloud platform, services and runtimes.

Monitoring cloud-ready workloads

Cloud-Ready workloads (virtualized servers, middleware and so on) are also monitored using APM for the application performance and NOI/CEM to collect information from other monitoring solutions.
Those using Cloud Automation Manager (CAM) in IBM Cloud Private can orchestrate and control multiple clouds, but the monitoring of these resources is not performed under IBM Cloud Private itself. In other words, if you use CAM to provision a traditional virtual server within your datacenter, then you will use your traditional solution to monitor the servers and not the IBM Cloud Private monitoring solution.

Monitoring cloud native workloads

Cloud native workloads are workloads that are specifically designed to benefit from the features of automation and orchestration that cloud platforms provide. These include containers running under Kubernetes and Cloud Foundry runtimes in both IBM Cloud and IBM Cloud Private, as well as IBM Cloud Functions in IBM Cloud.
The monitoring solutions for Cloud-Ready workloads also apply here, and Cloud-Native workloads have further options available:
Prometheus is an open-source systems monitoring and alerting toolkit which is part of the Cloud Native Computing Foundation, together with Kubernetes. It can monitor multiple workloads, but is mostly used with the Container workloads. Prometheus comes built-in with IBM Cloud Private and can be deployed manually to monitor IBM Cloud workloads too.
IBM Cloud Monitoring automatically collects metric data from IBM Cloud applications and services, eliminating the need for agents. APIs make it easy to add custom metrics and to query your monitoring data. Cloud Monitoring can monitor all types of workloads in the IBM Cloud.
APM for DevOps is a new member of the APM solution suite, dedicated to ensuring the optimal performance of your applications and making the most efficient use of containerized resources.
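Prometheus collects metrics by scraping a plain-text endpoint exposed by each workload. As a rough illustration of what a custom application metric looks like in that text exposition format, the sketch below renders a gauge using only the standard library. The render_gauge helper, the queue_depth metric, and the app label are made up for this example; a real service would more likely use an official Prometheus client library.

```python
def escape_label(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{escape_label(val)}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: number of items waiting in an application queue.
body = render_gauge(
    "queue_depth",
    "Items waiting in the queue.",
    {(("app", "orders"),): 42},
)
```

Serving this text over HTTP on a metrics path is what allows a Prometheus server to scrape the workload on a schedule.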

Collecting logs

While Cloud-Ready applications may still write logfiles to disks and depend on an external collector to read them, Cloud-Native applications will usually simply stream messages out. These log entries will be lost unless an existing log collector saves them.
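A Cloud-Native application that streams its log messages typically writes one structured entry per line to standard output and leaves storage to the collector. The sketch below shows one way to do this with Python's standard logging module; the JSON field names and the build_logger helper are assumptions, not a format required by any of the collectors listed here.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, stream=None):
    """Return a logger that streams JSON lines to stdout (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        target = stream if stream is not None else sys.stdout
        handler = logging.StreamHandler(target)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger("orders")
log.error("payment failed for order %s", 17)
```

Nothing is written to disk here, so the entry survives only if a log collector is attached to the process output.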
The following is the list of IBM's solutions for collecting and analyzing logs:

IBM Cloud Log Analysis collects and aggregates application and platform logs for consolidated insights. It enables "zero configuration" out-of-the-box automated log collection of Cloud Foundry and Containers workloads. Log Analysis can collect logs from all types of workloads.
The Elastic Stack, previously known as the ELK stack (Elasticsearch, Logstash, and Kibana), enables you to securely and reliably search, analyze, and visualize your data. It is installed as part of IBM Cloud Private.
IBM Operations Analytics - Log Analysis helps turn terabytes of big operational log data into understandable and actionable insights for quicker problem solving and better overall service. It accelerates problem isolation, identification, and repair by providing dashboard views into analyzed sources of log data from solutions and devices across the service management infrastructure.
Log File Agents are components of APM that read and correlate logs.

Service Management toolchain

While each of the cloud platforms and workloads may benefit from using a dedicated monitoring solution, the rest of the service management toolchain benefits from being consolidated. For example, it is simpler and easier for the organization if there is a single dashboard solution so everyone is looking at the same dashboard and a central ticketing solution to facilitate the tracking and transferring of tickets within the organization. These considerations are shown in the final and topmost row of the table, Service Management:
Further details about Service Management principles in general and the Incident Management process in particular may be found on the IBM Garage Method website.

Tools

Sysdig
LogDNA
New Relic
PagerDuty
Runbook Automation
Slack

More info

Monitoring and alerting capabilities
How do I monitor my IBM Cloud applications?
Service Management architecture