Centralized Monitoring
Objective
The document outlines a scalable monitoring architecture using Prometheus and Grafana within cloud environments like Azure and AWS, where Prometheus collects and stores metrics from various systems, while Grafana visualizes these metrics in customizable dashboards. Azure Managed Grafana integrates with services like Azure Kubernetes Service (AKS) to provide centralized, real-time monitoring across multiple environments, including hybrid cloud setups. The architecture supports persistent storage, secure access controls, and alerting, offering a comprehensive solution for tracking application and infrastructure performance.
Introduction
Grafana is an open-source tool for analyzing and monitoring data. It lets you build dashboard visualizations of the metrics that matter to you, and it supports a large number of data sources. Because the application is open source, support for a newly released data source is often contributed by the community. The most common use case of Grafana is displaying time series data, such as memory or CPU usage over time, alongside the current usage data.
Prometheus is a technology that collects and stores time series data. Time series are fundamental to Prometheus; its data model is arranged into:
metrics, each recorded as a series of samples, where a sample pairs a timestamp with a numeric value, such as how many disk bytes have been read or a stock price
a set of labels called dimensions, for example, job and device
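The data model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Prometheus stores data internally; the metric name disk_read_bytes, the store dictionary, and the helper names are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: float  # seconds since the Unix epoch
    value: float      # the numeric measurement


def series_key(name: str, labels: dict[str, str]) -> tuple:
    """Identify a series by its metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


# Hypothetical in-memory store: one list of samples per series identity.
store: dict[tuple, list[Sample]] = defaultdict(list)


def append(name: str, labels: dict[str, str], timestamp: float, value: float) -> None:
    """Add one sample to the series identified by name and labels."""
    store[series_key(name, labels)].append(Sample(timestamp, value))


# Same name and labels (in any order) extend one series;
# a different device label starts a separate series.
append("disk_read_bytes", {"job": "node", "device": "sda"}, 1700000000.0, 4096.0)
append("disk_read_bytes", {"device": "sda", "job": "node"}, 1700000015.0, 8192.0)
append("disk_read_bytes", {"job": "node", "device": "sdb"}, 1700000000.0, 512.0)
```

The point of the sketch is that labels are part of a series' identity: changing any label value produces a new time series.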
Monitoring Tracker: Click here
Azure Managed Grafana
Azure Managed Grafana is a fully managed service for analytics and monitoring solutions. It's supported by Grafana Enterprise, which provides extensible data visualizations. It lets you deploy Grafana dashboards quickly and easily, with built-in high availability, and control access with Azure security.
Architecture Overview
This section gives an architecture overview of a monitoring system using Prometheus and Grafana.
Key Components:
Targets:
These are systems or services that Prometheus monitors. In the image, the targets include:
Application Server: Represents a software server hosting an application.
API Server: A server providing API services to applications.
Node Exporter: A Prometheus exporter used to collect hardware and OS metrics from Linux systems.
Consul Exporter: Used to expose metrics from Consul (a service networking tool) to Prometheus.
These targets expose their metrics over HTTP, and the Prometheus server scrapes them from there.
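What a target serves over HTTP is plain text in the Prometheus exposition format, one sample per line. The following is a minimal sketch of rendering such lines, assuming only the basic "name{labels} value" form; the helper names are hypothetical, and a real target would normally use a Prometheus client library rather than hand-written formatting.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a line of the text exposition format."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_page(samples: list[tuple[str, dict[str, str], float]]) -> str:
    """Join samples into the body a target would return from its metrics endpoint."""
    return "\n".join(render_metric(*sample) for sample in samples) + "\n"


page = render_page([
    ("node_disk_read_bytes_total", {"device": "sda"}, 1234.0),
    ("up", {}, 1.0),
])
```

Prometheus requests this page on every scrape interval and attaches its own timestamp to each value it reads.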
Prometheus Server:
This is the central component of the architecture. It scrapes metrics from the targets via HTTP.
Prometheus stores the metrics in a time-series database and uses PromQL (Prometheus Query Language) to query the data.
The server processes and organizes the collected metrics, preparing them for visualization.
Grafana Web UI:
Grafana is a visualization tool that allows users to query, visualize, and alert on metrics collected by Prometheus.
It communicates with the Prometheus server using PromQL to pull the relevant metrics data and then presents it through customizable dashboards.
Data Flow:
The metrics are collected from various targets (Application Server, API Server, Node Exporter, and Consul Exporter) via HTTP.
The Prometheus server stores and processes this data.
Users can visualize the metrics using the Grafana Web UI by querying Prometheus using PromQL.
This architecture is commonly used for real-time monitoring of services, infrastructure, and applications, offering robust alerting and visualization capabilities.
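The query step in the data flow above goes through the Prometheus HTTP API. The sketch below builds an instant-query URL and parses the JSON shape that the API returns for an instant vector. The host name prometheus.monitoring.svc is a hypothetical address, and the sample response body is hand-written for illustration rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant PromQL query against /api/v1/query."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(body: str) -> list[tuple[dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hypothetical in-cluster address; 9090 is the default Prometheus port.
url = instant_query_url("http://prometheus.monitoring.svc:9090/", "up")

# Hand-written example of the response shape.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node"}, "value": [1700000000.0, "1"]}
        ],
    },
})
results = parse_instant_vector(sample_body)
```

Grafana's Prometheus data source issues requests of this kind on the user's behalf each time a dashboard panel refreshes.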
Current Architecture
The current setup is a monitoring architecture within an Azure Kubernetes Service (AKS) environment. Here’s a detailed breakdown of the components and data flow:
Key Components:
Resource Group & Kubernetes Cluster:
The architecture is deployed in a Kubernetes cluster that is part of an Azure Resource Group. A resource group in Azure is a container for resources like Kubernetes clusters, disks, and networking components.
Internal Team:
The internal team accesses the monitoring setup via a web interface. This is facilitated by an ingress controller, typically through HTTP or HTTPS traffic.
Ingress (Nginx):
The Nginx ingress controller is responsible for routing the external traffic (from the internal team) into the Kubernetes cluster. It serves as the entry point to the monitoring services like Prometheus and Grafana.
It exposes services within the cluster to the external users securely.
Monitoring Namespace:
Inside the Kubernetes cluster, a monitoring namespace is dedicated to monitoring services. It contains key components:
Prometheus Server: Collects and stores metrics from the Kubernetes cluster and other sources. It is the core of the monitoring system.
Alertmanager: Works in conjunction with Prometheus to deduplicate, group, and route the alerts that Prometheus fires, based on predefined conditions, to notification channels.
Grafana: Used for visualizing the metrics stored in Prometheus. It offers dashboards to the internal team for monitoring system performance, health, and metrics.
Azure Disk & Persistent Volumes (PV/PVC):
Azure Disk provides persistent storage for the Prometheus data.
Persistent Volume (PV) is provisioned to ensure Prometheus has a dedicated storage location for the metrics it collects.
Persistent Volume Claim (PVC) for grafana-prometheus ensures Prometheus can claim storage from Azure Disk, enabling it to persist the collected data even if the pods restart.
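The claim described above could look roughly like the manifest built below. Kubernetes accepts manifests as JSON as well as YAML, so the sketch builds one as a Python dictionary. The claim name grafana-prometheus comes from this document; the namespace name monitoring, the 50Gi size, and the managed-csi storage class are assumptions and should be checked against the actual cluster.

```python
import json


def prometheus_pvc(name: str, namespace: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest for Prometheus data."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # A disk-backed volume is mounted read-write by a single node.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


manifest = prometheus_pvc(
    name="grafana-prometheus",
    namespace="monitoring",        # assumed namespace name
    size="50Gi",                   # assumed size
    storage_class="managed-csi",   # assumed Azure Disk storage class
)
manifest_json = json.dumps(manifest, indent=2)
```

Because the claim outlives any single pod, a restarted Prometheus pod that mounts the same claim sees the data written before the restart.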
Data Flow:
The internal team sends traffic to the system via a browser or monitoring tool.
The traffic passes through the Nginx ingress controller, which directs it to the appropriate service (in this case, Grafana or Prometheus).
Prometheus collects metrics from various sources within the Kubernetes cluster and stores the data using persistent volumes (backed by Azure Disk).
Grafana fetches the metrics from Prometheus and presents them in customizable dashboards for visualization.
Prometheus evaluates the alert conditions defined in its rules and sends firing alerts to Alertmanager, which notifies the internal team so they can respond to issues and events in real time.
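The alerting step in this flow can be illustrated with a small state machine. In Prometheus, an alerting rule with a "for" duration is first pending and only starts firing once its condition has held for that long. The sketch below mimics that behavior for a simple threshold; the class, the rule name HighCpu, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return the alert state after one evaluation at time `now`."""
        if value <= self.threshold:
            # Condition no longer holds: reset the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Hypothetical rule: CPU utilization above 90% for at least 60 seconds.
rule = AlertRule(name="HighCpu", threshold=0.9, for_seconds=60)
states = [
    rule.evaluate(0.95, now=0),   # condition just became true
    rule.evaluate(0.95, now=30),  # still within the waiting period
    rule.evaluate(0.95, now=60),  # held long enough
    rule.evaluate(0.50, now=75),  # recovered
]
```

Only alerts in the firing state are handed on for notification, which keeps short spikes from paging the team.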
Summary:
This is a robust monitoring system leveraging Prometheus for metrics collection and Grafana for visualization, all running in a Kubernetes cluster on Azure. The system is integrated with persistent storage using Azure Disk to ensure that metrics data is safely stored and available even after pod restarts. The ingress controller enables secure and organized traffic routing from users (internal team) to the services.
High-Level Architecture
This is a centralized monitoring architecture using Prometheus and Azure Managed Grafana across multiple Azure Kubernetes Service (AKS) clusters.
Key Components and Flow:
Users:
Users access the system through a Grafana interface, which is hosted and managed by Azure.
Grafana is used for querying, visualizing, and monitoring data collected from the Kubernetes clusters and presented in dashboards.
Azure Managed Grafana:
This component represents a managed instance of Grafana hosted on Azure.
It pulls data from a central Data Collection component to provide visualizations, dashboards, and alerting features.
Grafana enables users to monitor the metrics collected by Prometheus from different Kubernetes clusters.
Data Collection:
The Data Collection component serves as a central hub or data store for metrics gathered from various Prometheus instances across multiple AKS clusters.
This centralized collection ensures that the data from different sources (AKS clusters) is aggregated for analysis.
Prometheus Instances:
There are multiple instances of Prometheus, each responsible for collecting metrics from a specific AKS (Azure Kubernetes Service) cluster:
Prometheus (AKS1): Collects metrics from the first Kubernetes cluster (AKS1).
Prometheus (AKS2): Collects metrics from the second Kubernetes cluster (AKS2).
Prometheus (AKS3): Collects metrics from the third Kubernetes cluster (AKS3).
Each Prometheus instance runs inside its respective Kubernetes cluster and monitors various services, applications, and infrastructure metrics.
The metrics are then sent to the Data Collection component for centralized processing and storage.
AKS Clusters:
The diagram shows three different AKS clusters (AKS1, AKS2, AKS3), which are likely used for different environments (e.g., production, development, or testing) or different workloads.
These AKS clusters are the source of the metrics that Prometheus collects.
Data Flow:
Prometheus collects the metrics from the services running in the respective AKS clusters (AKS1, AKS2, AKS3).
The collected data from Prometheus is sent to the Data Collection component, where it is aggregated and centralized.
Azure Managed Grafana connects to the Data Collection and pulls the necessary data for visualization.
Users access the Grafana dashboards to monitor the metrics and system performance.
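When metrics from several clusters land in one place, series from different clusters must stay distinguishable. A common approach, and the one Prometheus supports through external labels, is to attach a cluster label to everything a given instance sends. The sketch below assumes that approach; this document does not state how the Data Collection component actually receives data, and the function names and sample values are hypothetical.

```python
Series = list[tuple[dict[str, str], float]]  # (labels, latest value)


def with_cluster_label(cluster: str, series: Series) -> Series:
    """Attach the originating cluster name to every series."""
    return [({**labels, "cluster": cluster}, value) for labels, value in series]


def aggregate(per_cluster: dict[str, Series]) -> Series:
    """Merge series from all clusters into one central collection."""
    central: Series = []
    for cluster, series in per_cluster.items():
        central.extend(with_cluster_label(cluster, series))
    return central


# Two clusters report a series with identical labels.
central = aggregate({
    "AKS1": [({"job": "api", "instance": "pod-0"}, 0.42)],
    "AKS2": [({"job": "api", "instance": "pod-0"}, 0.17)],
})
clusters_seen = sorted(labels["cluster"] for labels, _ in central)
```

With the label in place, a single Grafana dashboard can filter or group by cluster instead of needing one data source per cluster.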
Summary:
This architecture demonstrates a centralized monitoring solution across multiple Kubernetes clusters in Azure. Prometheus instances in each cluster gather metrics, which are then aggregated into a single Data Collection repository. Azure Managed Grafana serves as the interface for users to visualize and analyze the aggregated data. This setup ensures efficient and scalable monitoring across distributed environments.
Low-Level Architecture Design
The diagram depicts a system architecture focusing on monitoring and logging services integrated with Microsoft Azure and Amazon Web Services (AWS). Here's a breakdown of the key components and their interconnections:
User Interaction: Users interact with this architecture primarily through Azure Managed Gateways, which likely serve as entry points for accessing services hosted on Azure.
Azure Integration:
Azure Managed Gateway: This component is linked to an "Azure Monitor Workspace" and a "Log Analytics Workspace," indicating that telemetry and log data from the gateway are sent to Azure for monitoring and analysis.
Data Collection: There are three separate flows for data collection, each connected to what appears to be a data endpoint, suggesting that data is being aggregated or processed from these endpoints.
AWS Integration:
Prometheus Rule Group: Each of the three data collection flows sends data to a Prometheus Rule Group in a different cluster (AKS1, AKS2, and another group). The diagram places these clusters under AWS, although AKS is the Azure Kubernetes Service, so the clusters themselves likely run in Azure. Prometheus is used for monitoring and alerting based on predefined rules, which are probably used here to watch for specific conditions or thresholds in the data.
AWS Resources: There are multiple instances of resources such as ama-logs-efs, ama-metrics, ama-metrics-node, ama-metrics-hsm, ama-metrics-operator-targets, and ama-kms-logs. Although the diagram groups them under AWS, the ama- prefix is the one used by Azure Monitor agent workloads on AKS, so these are likely the agents that collect logs and metrics from the clusters rather than custom services.
Error Notification: In the AKS1 cluster, there's an error symbol next to the data flow into the Prometheus Rule Group, which might indicate a problem or a critical alert in that specific cluster path.
Data Flow: Data flows from the Azure environment through data collection nodes and then into AWS-specific Prometheus rule groups, suggesting a hybrid cloud scenario where monitoring and logging are managed across both Azure and AWS environments.
This diagram is useful for understanding how data moves through and is processed in a multi-cloud environment involving both Azure and AWS, with an emphasis on monitoring, logging, and alerting through Prometheus.
Deployment Setup
This section outlines a typical deployment setup for monitoring, logging, and visualization using Microsoft Azure and integrated services. Here’s a detailed explanation of each component:
1. Log Analytics:
Log Analytics is a tool within the Azure Monitor ecosystem used to collect, correlate, and analyze log data generated by your applications and infrastructure.
Purpose: It helps administrators and developers query logs, detect trends, troubleshoot errors, and create custom alerts.
How It Works:
Log data from different resources (such as virtual machines, applications, containers, and other Azure services) is collected and stored in a Log Analytics Workspace.
You can use Kusto Query Language (KQL) to run queries on this data, visualizing metrics, troubleshooting issues, and setting up custom alerts for critical events.
Log Analytics supports integration with Azure Monitor and Application Insights to gather data on infrastructure health and performance.
2. Azure Managed Grafana:
Azure Managed Grafana is a managed service that runs Grafana dashboards on Azure. Grafana is an open-source platform used for monitoring and visualizing metrics and logs collected from multiple sources.
SMTP:
Grafana uses Simple Mail Transfer Protocol (SMTP) to send email notifications based on the configured alert rules.
In this setup, it likely means that Grafana is configured to send alert notifications (such as error messages, threshold breaches, etc.) via email using an SMTP server.
Administrators configure alerts for critical system or application performance thresholds, which will trigger an email through SMTP when conditions are met.
Use Case: Users can create customizable dashboards to track the health and performance of applications, monitor logs, and view historical data.
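To make the SMTP notification path above concrete, the sketch below builds the kind of email an alert would produce and shows how such a message is handed to an SMTP server. Grafana does this itself once SMTP is configured; the code is only an illustration. The addresses, alert name, host, and port are hypothetical, and the send function is defined but not called.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(
    sender: str, recipients: list[str], alert_name: str, summary: str
) -> EmailMessage:
    """Compose a plain-text alert notification."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = f"[ALERT] {alert_name}"
    message.set_content(summary)
    return message


def send(message: EmailMessage, host: str, port: int) -> None:
    """Deliver the message through an SMTP server, upgrading to TLS first."""
    with smtplib.SMTP(host, port) as client:
        client.starttls()
        client.send_message(message)


# Hypothetical alert and addresses.
email = build_alert_email(
    sender="grafana@example.com",
    recipients=["ops@example.com", "dev@example.com"],
    alert_name="HighCpu",
    summary="CPU usage has been above 90% for 5 minutes on AKS1.",
)
```

In practice the SMTP server usually also requires credentials, which would be supplied through the client's login step before sending.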
3. IAM (Identity and Access Management):
Azure Identity and Access Management (IAM) controls access to Azure resources. It defines who can do what within Azure.
Roles and Permissions:
IAM enables role-based access control (RBAC) to assign specific roles to users, groups, or applications. This limits their access to only the resources they need.
Common roles include Reader, Contributor, Owner, and custom roles for more fine-grained control.
Security: IAM ensures that only authorized users or applications have access to specific resources, increasing security and compliance.
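The role-based model above can be sketched as a lookup from roles to allowed actions, combined with scope inheritance: a role assigned at a parent scope also applies to the resources beneath it. The action names below are a deliberate simplification of the real Azure permission model, and the principals and shortened scope paths are hypothetical.

```python
# Simplified view of three built-in roles (real roles list many granular actions).
ROLE_ACTIONS = {
    "Reader": {"read"},
    "Contributor": {"read", "write", "delete"},
    "Owner": {"read", "write", "delete", "assign_roles"},
}

Assignment = tuple[str, str, str]  # (principal, scope, role)


def is_allowed(
    assignments: list[Assignment], principal: str, resource: str, action: str
) -> bool:
    """Check whether any role assignment grants the action on the resource."""
    for assigned_principal, scope, role in assignments:
        if assigned_principal != principal:
            continue
        # A scope covers itself and everything nested under it.
        in_scope = resource == scope or resource.startswith(scope.rstrip("/") + "/")
        if in_scope and action in ROLE_ACTIONS[role]:
            return True
    return False


# Hypothetical, shortened scope paths.
assignments = [
    ("alice", "/subscriptions/demo/resourceGroups/monitoring-rg", "Reader"),
    ("bob", "/subscriptions/demo", "Contributor"),
]
grafana = "/subscriptions/demo/resourceGroups/monitoring-rg/grafana"
```

For example, is_allowed(assignments, "alice", grafana, "read") is true while the same call with "write" is false, which is the least-privilege behavior the section describes.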
4. Azure Monitor Workspace:
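An Azure Monitor workspace is a dedicated store for data collected by Azure Monitor, and it is where the managed Prometheus service keeps Prometheus metrics. It is part of Azure Monitor, the central platform for monitoring your entire Azure environment, including applications, infrastructure, and networks.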
Key Features:
Collects data from VMs, containers, apps, databases, and networks.
Provides metrics and logs to understand resource performance.
Visualization: Data collected can be visualized in various ways, including charts, tables, and reports, within the Azure portal or through tools like Grafana.
Alerts and Notifications: Azure Monitor helps to create alerts that notify users based on specific thresholds or performance deviations.
Integration with Log Analytics: Azure Monitor integrates with Log Analytics for deep analysis, allowing users to run queries on their log data to detect issues or optimize performance.
5. Application Insights:
Application Insights is a part of Azure Monitor that focuses on application performance management (APM).
Purpose:
It collects detailed telemetry data (such as request rates, response times, failure rates, etc.) from applications, whether they are hosted in the cloud or on-premises.
Exception Handling: Application Insights helps monitor exceptions, dependencies, and request traces to quickly troubleshoot problems in production environments.
Insights:
Tracks the performance and health of your applications in real-time, giving insights into bottlenecks or performance degradation.
Provides custom dashboards and live metrics to track key performance indicators (KPIs).
Also allows integration with Log Analytics, which provides advanced querying and analysis capabilities.
Deployment Flow Example:
Log Analytics Workspace collects logs and metrics from resources.
Application Insights gathers detailed telemetry from your apps.
Azure Monitor Workspace consolidates this data into a central platform where you can configure alerts and visualize logs.
Azure Managed Grafana is used to visualize all the data from Azure Monitor and Log Analytics, setting up alerts through SMTP for email notifications.
IAM secures access by restricting who can view, modify, or manage the monitoring resources, ensuring that only authorized personnel can make changes.
Summary:
This setup describes a comprehensive solution for monitoring and managing applications and infrastructure using Azure. By integrating Log Analytics, Azure Monitor Workspace, Application Insights, and Grafana, you get a unified platform to collect logs, analyze telemetry, visualize key metrics, and get alerts based on application and infrastructure performance. IAM ensures that all these tools are securely accessed, and SMTP in Grafana ensures timely alert notifications to your team.
Data Sources
Azure Monitor, Kubernetes, and plugins each feed data into the monitoring, logging, and alerting ecosystem to help administrators and developers maintain infrastructure and application performance. Let’s explore how each of these works as a data source:
1. Azure Monitor as a Data Source:
Azure Monitor is a comprehensive monitoring service for collecting, analyzing, and acting on telemetry from your Azure and on-premises environments.
What Data is Collected:
Metrics: Azure Monitor collects numeric values that describe some aspect of a system at a particular time. Examples include CPU usage, memory usage, disk I/O, and network traffic.
Logs: These are events that are time-stamped data generated from Azure resources, applications, and third-party systems. Logs include details such as user actions, errors, and performance data.
Alerts: Azure Monitor generates alerts based on the telemetry data to notify administrators of potential issues or performance degradation.
Supported Sources:
Azure Resources: Virtual machines, web apps, databases, storage accounts, etc., within Azure, can send their telemetry data to Azure Monitor.
Application Insights: Used for application performance monitoring, this part of Azure Monitor collects telemetry from web applications, including user activity, error rates, and response times.
Custom Sources: You can also send data from custom applications or on-premises infrastructure to Azure Monitor using Custom Logs and Metrics.
Use Cases:
Monitoring the performance and health of your Azure infrastructure (VMs, databases, networking).
Creating custom dashboards and alerts based on the collected data.
Integrating with Log Analytics for more in-depth querying and analysis.
2. Kubernetes as a Data Source:
Kubernetes is an open-source platform for managing containerized workloads and services. Azure provides the Azure Kubernetes Service (AKS), which integrates with Azure Monitor to collect performance and health data.
What Data is Collected:
Cluster Metrics: CPU, memory usage, node health, pod status, and network I/O of the cluster. Kubernetes generates rich data about the workloads running within the cluster.
Container Metrics: Collects telemetry data from individual containers, such as resource utilization and logs from containerized applications.
Kubernetes Events: These are real-time notifications about changes in the cluster, such as when a pod is scheduled or when a service is restarted.
Kube-state-metrics: Provides detailed information on the state of the objects within the cluster, including deployment status, pod count, and resource limits.
Azure Monitor for Containers:
Azure Monitor integrates with AKS and other Kubernetes environments to collect metrics and logs. It allows you to visualize the health of your Kubernetes clusters and identify issues with deployments.
Log Analytics: Kubernetes data, including logs from containers and pods, is sent to Log Analytics for deeper analysis.
Use Cases:
Monitoring the health of AKS clusters or on-prem Kubernetes installations.
Managing container performance and resource allocation across multiple clusters.
Creating dashboards for real-time visualization of container metrics.
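Many of the container metrics listed above, CPU time in particular, are exposed as ever-increasing counters, so dashboards show their rate of change rather than the raw value. The sketch below computes a per-second rate from counter samples and treats a drop in value as a counter reset, for example after a container restart. It is a simplified stand-in for what PromQL's rate() does (which also extrapolates to the window boundaries), and the sample values are made up.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a counter.

    `samples` holds (timestamp_seconds, counter_value) pairs in time order.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value.
        increase += (current - previous) if current >= previous else current
    return increase / elapsed


# Made-up CPU-seconds counter sampled every 15 seconds.
cpu_seconds = [(0.0, 10.0), (15.0, 13.0), (30.0, 16.0)]
cpu_rate = per_second_rate(cpu_seconds)  # cores in use on average
```

A rate of CPU seconds per second reads directly as the number of cores in use, which is why this is the usual form for CPU panels.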
3. Plugins as Data Sources:
Plugins refer to additional tools, libraries, or extensions that allow you to collect data from various systems and integrate it into your monitoring environment.
What Data is Collected:
Depending on the plugin used, data sources can range from custom applications, databases, third-party services, or on-prem infrastructure.
Metrics, logs, traces, and events from the systems are integrated through these plugins into the monitoring and observability stack.
Types of Plugins:
Azure Monitor Plugins: Azure Monitor supports data collection from multiple sources through plugins. For example:
Azure Monitor Container Insights for AKS and other Kubernetes environments.
VM Insights for collecting telemetry from Azure Virtual Machines.
App Insights SDK to collect telemetry from custom applications.
Grafana Plugins: In a Grafana setup, multiple data sources can be integrated using plugins. For example:
Prometheus data source plugin for querying metrics stored in Prometheus.
Elasticsearch data source plugin for querying logs stored in Elasticsearch.
MySQL data source plugin for querying data, such as database performance metrics, held in MySQL.
Custom Plugins: You can develop custom plugins for collecting telemetry from any application or service that isn’t natively supported.
Use Cases:
Extending the capabilities of your monitoring system to include data from custom or non-standard applications.
Using plugins to integrate third-party services (e.g., Datadog, Prometheus) into Azure Monitor or Grafana dashboards.
Automating alerting and telemetry collection from complex systems through plugin configuration.
Summary:
Azure Monitor as a data source provides comprehensive monitoring for Azure-based resources like VMs, applications, storage, and databases. It collects telemetry data (metrics, logs, and events) for analysis and reporting.
Kubernetes (via AKS or other distributions) collects performance and health data from containers, nodes, and clusters, offering deep visibility into containerized workloads.
Plugins extend monitoring functionality by allowing third-party services and custom applications to contribute their telemetry data, enabling a more holistic view of infrastructure and applications.
This setup ensures that performance, error tracking, and overall application health are well-monitored across diverse environments.
Grafana Dashboard
1. Dashboards & Variables:
Dashboards: A Grafana dashboard is a collection of panels organized on a single page to monitor and visualize real-time data (metrics, logs, etc.) from various sources.
Variables: Dynamic placeholders used in dashboards to filter and change data across panels. They make dashboards more interactive and allow users to switch between data sources, environments, or time ranges.
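The way variables act as placeholders can be shown with a small substitution function. Grafana lets a query refer to a variable as $name or ${name}; the sketch below replaces both forms with the currently selected value. It covers single-value variables only and ignores Grafana's formatting options for multi-value variables. The variable names cluster and job are hypothetical.

```python
import re

# Matches ${name} or $name, where name is letters, digits, or underscores.
_VARIABLE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def interpolate(query: str, variables: dict[str, str]) -> str:
    """Substitute dashboard variable values into a query string."""

    def replace(match: re.Match) -> str:
        name = match.group(1) or match.group(2)
        # Unknown variables are left untouched.
        return variables.get(name, match.group(0))

    return _VARIABLE.sub(replace, query)


# Hypothetical dashboard variables selected by the user.
selected = {"cluster": "AKS1", "job": "node"}
query = interpolate('up{cluster="$cluster", job="${job}"}', selected)
```

Switching the cluster variable to another value re-runs every panel query with the new value, which is what makes one dashboard reusable across environments.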
2. Panels:
Panels: The individual building blocks of a dashboard. Each panel visualizes data in a specific way (e.g., graphs, tables, heatmaps). Panels can pull data from different data sources and use different visualization types.
3. Visualization & Configuration:
Visualization: Grafana offers multiple visualization options like line graphs, bar charts, gauges, heatmaps, and more to represent your data clearly.
Configuration: This involves configuring data queries, transformations, and panel settings (colors, labels, legends, thresholds) to tailor the data presentation to your needs.
4. Settings and Time:
Settings: General dashboard settings include layout adjustments, themes, permissions, and panel arrangement.
Time: Grafana provides flexible time ranges (e.g., last 1 hour, last 7 days) and time settings to specify and visualize data over a selected period. Users can set global time filters that apply to all panels on a dashboard.
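Relative ranges such as "last 1 hour" are written in Grafana as expressions like now-1h and resolved against the current time on each refresh. The sketch below resolves a small subset of that syntax (seconds, minutes, hours, days, and weeks); month and year units and rounding such as now/d are left out. The function name is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}
_RELATIVE = re.compile(r"now(?:-(\d+)([smhdw]))?")


def resolve(expression: str, now: datetime) -> datetime:
    """Turn 'now' or 'now-<amount><unit>' into an absolute time."""
    match = _RELATIVE.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported time expression: {expression!r}")
    if match.group(1) is None:
        return now
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: amount})


# "Last 7 days" as a (from, to) pair, resolved against a fixed reference time.
reference = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
time_range = (resolve("now-7d", reference), resolve("now", reference))
```

Because the range is relative, the same dashboard always shows the most recent window without anyone editing the dates.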
5. Alert & Report:
Alerts: Grafana supports configuring alerts for real-time monitoring. You can define thresholds and get notifications when certain conditions are met via email, Slack, or other services.
Reports: Reporting allows users to export dashboards or specific visualizations as PDFs, send periodic reports via email, or share them for regular updates on performance or system health.
In summary, Grafana Dashboards offer interactive, real-time visualization and alerting by combining panels, flexible variables, and robust configuration options.
References
Session Recordings:
Training session 1: https://drive.google.com/file/d/16a7U75udC6VJwMA21PseKZbSd-N1qeep/view?usp=drive_link
Training session 2: https://drive.google.com/file/d/1YbNgADBoNbSObhwM2eXENkWnhPR0Z0N1/view?usp=drive_link