Metrics Monitoring and Alerting System

Questions

Who

  • Who are we building the system for? (In-house system for internal use)

What

  • What metrics do we want to collect? (Operational system metrics, not business metrics)
  • What is the scale of the infrastructure? (100 million daily active users, 1,000 server pools, 100 machines per pool)
  • What are the supported alert channels? (Email, phone, PagerDuty, webhooks)

How

  • How long should we keep the data? (1-year retention)
  • How should we reduce the resolution of the metrics data over time? (Full resolution for 7 days, 1-minute resolution for the next 30 days, 1-hour resolution thereafter)

Why

  • Why are we not collecting logs or supporting distributed system tracing? (Out of scope)

Overview

Components

  1. Data Collection

    • Function: Collect metric data from various sources.
    • Details: Collects operational metrics, system performance data, and application-specific metrics.
  2. Data Transmission

    • Function: Transfer collected data to the metrics monitoring system.
    • Details: Uses protocols like HTTP, gRPC, or custom protocols for efficient data transfer.
  3. Data Storage

    • Function: Organize and store the incoming data.
    • Details: Utilizes time-series databases optimized for high write loads and bursty read loads.
  4. Alerting

    • Function: Analyze data, detect anomalies, and generate alerts.
    • Details: Configurable to send alerts to various channels (e.g., email, SMS, Slack).
  5. Visualization

    • Function: Present data visually in graphs and charts.
    • Details: Enables engineers to identify patterns, trends, or issues more effectively.

Introducing a Queuing Component to Mitigate Data Loss

  • Risk Mitigation:

    • Address the risk of data loss when the time-series database is unavailable by introducing a queuing component.
  • Queuing System:

    • Metrics Collection:
      • Metrics collector sends data to a queuing system like Kafka.
    • Processing and Storage:
      • Consumers or streaming processing services (e.g., Apache Storm, Flink, Spark) process and push data to the time-series database.
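The collector-to-queue-to-database flow above can be sketched as follows. This is a minimal illustration that stands in an in-memory deque for Kafka and a callable for the time-series database write; a real deployment would use a Kafka producer/consumer client and a stream processor such as Flink.

```python
import json
from collections import deque

# In-memory stand-in for a Kafka topic (a real system would use a Kafka client).
metrics_topic = deque()

def collect_metric(name, value, timestamp):
    """Metrics collector: publish the data point to the queue."""
    metrics_topic.append(json.dumps({"name": name, "value": value, "ts": timestamp}))

def consume_batch(write_to_tsdb, max_events=100):
    """Consumer: drain events and push them to the time-series database.
    If a database write fails, events stay queued instead of being lost."""
    written = 0
    while metrics_topic and written < max_events:
        event = json.loads(metrics_topic[0])
        try:
            write_to_tsdb(event)
        except IOError:
            break  # TSDB unavailable: leave events in the queue
        metrics_topic.popleft()
        written += 1
    return written

# Usage: simulate a TSDB outage, then recovery.
collect_metric("cpu.load", 0.72, 1_700_000_000)
collect_metric("cpu.load", 0.81, 1_700_000_060)

def failing_write(event):
    raise IOError("TSDB down")

consume_batch(failing_write)       # nothing written, nothing lost
assert len(metrics_topic) == 2
stored = []
consume_batch(stored.append)       # TSDB back: both events flushed
assert len(stored) == 2 and not metrics_topic
```

The key property the queue buys is visible in the usage: a database outage leaves events retained rather than dropped.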

Advantages of Using Kafka

  • Reliability and Scalability:
    • Kafka provides a highly reliable and scalable distributed messaging platform.
  • Decoupling Services:
    • Decouples data collection from data processing services.
  • Data Retention:
    • Prevents data loss by retaining data in Kafka when the database is unavailable.

Scaling the System with Kafka

  • Partition Configuration:
    • Configure the number of Kafka partitions based on throughput requirements.
  • Data Partitioning:
    • Partition metrics data by metric names.
    • Further partition metrics data with tags/labels for finer granularity.
  • Prioritization:
    • Categorize and prioritize metrics to ensure important metrics are processed first.
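Partitioning by metric name (optionally refined by tags) can be sketched as a hash of the series key modulo the partition count. The partition count and key format below are assumptions for illustration; Kafka's default partitioner behaves similarly when given a record key.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed partition count; tune to throughput requirements

def partition_for(metric_name, tags=None):
    """Route a metric to a partition by hashing its name and, optionally,
    selected tags/labels for finer-grained partitioning."""
    key = metric_name if not tags else metric_name + "|" + ",".join(
        f"{k}={tags[k]}" for k in sorted(tags))
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# All points of the same series land on the same partition, preserving order.
assert partition_for("cpu.load") == partition_for("cpu.load")
assert 0 <= partition_for("cpu.load", {"host": "i-123"}) < NUM_PARTITIONS
```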

Zoom in: Alert System

  1. Load Config Files to Cache Servers

    • Config Format: Rules defined as config files, typically in YAML format.
    • Purpose: Cache servers hold the configuration files for quick access.
  2. Fetch Configs from Cache

    • Component: Alert manager retrieves alert configurations from the cache.
  3. Evaluate Alerts

    • Process:
      • Alert manager queries the service at predefined intervals based on config rules.
      • If a value violates the threshold, an alert event is created.
    • Responsibilities of Alert Manager:
      • Filter, Merge, and Dedupe Alerts: Consolidates multiple alerts to avoid duplicates.
      • Access Control: Restricts alert management operations to authorized individuals to prevent human error and enhance security.
      • Retry Mechanism: Ensures notifications are sent at least once by checking alert states.
  4. Store Alert States

    • Component: Alert store (key-value database such as Cassandra).
    • Function: Keeps the state of all alerts (inactive, pending, firing, resolved).
  5. Insert Alerts into Kafka

    • Action: Eligible alerts are inserted into Kafka for further processing.
  6. Pull Alert Events from Kafka

    • Component: Alert consumers pull alert events from Kafka.
  7. Process and Send Notifications

    • Component: Alert consumers process the pulled alert events.
    • Notification Channels: Sends notifications via different channels such as:
      • Email
      • Text message
      • PagerDuty
      • HTTP endpoints
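Steps 3 and 4 above (evaluating rules, deduping, and tracking alert states) can be sketched as follows. The rule shape mirrors a YAML config loaded into a dict, and the in-memory dict stands in for the alert store; both are illustrative assumptions.

```python
import time

# Assumed rule shape, mirroring a YAML alert config loaded into a dict.
rule = {"name": "instance_down", "metric": "up", "threshold": 1, "operator": "<"}

alert_states = {}  # stand-in for the alert store: name -> state

def evaluate(rule, value):
    """Evaluate one rule against the latest metric value; return an alert
    event only on a state transition, so repeated violations are deduped."""
    if rule["operator"] == "<":
        violated = value < rule["threshold"]
    else:
        violated = value > rule["threshold"]
    prev = alert_states.get(rule["name"], "inactive")
    if violated and prev != "firing":
        alert_states[rule["name"]] = "firing"
        return {"alert": rule["name"], "state": "firing", "ts": time.time()}
    if not violated and prev == "firing":
        alert_states[rule["name"]] = "resolved"
        return {"alert": rule["name"], "state": "resolved", "ts": time.time()}
    return None  # no transition: suppress duplicate notifications

assert evaluate(rule, 0)["state"] == "firing"
assert evaluate(rule, 0) is None              # duplicate firing is deduped
assert evaluate(rule, 1)["state"] == "resolved"
```

In the full design, each returned event would be inserted into Kafka (step 5) rather than handled inline.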

Data Model

  • Time Series: Metrics data is recorded as a time series, with each series uniquely identified by its name and optional labels.
  • Write Load: Heavy and constant, with around 10 million data points written per day.
  • Read Load: Spiky, driven by visualization and alerting services.
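The data model above can be sketched as a small record type: a series is identified by its metric name plus its sorted label set, and each point carries a timestamp and value. The field names here are illustrative, not a specific database's schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    """One point in a time series; the series is uniquely identified by
    the metric name plus its sorted label set."""
    name: str
    timestamp: int      # Unix epoch seconds
    value: float
    labels: tuple = ()  # e.g. (("host", "i-123"), ("env", "prod"))

    def series_key(self):
        return self.name + "{" + ",".join(f"{k}={v}" for k, v in sorted(self.labels)) + "}"

p = MetricPoint("cpu.load", 1_700_000_000, 0.72, (("host", "i-123"),))
assert p.series_key() == "cpu.load{host=i-123}"
```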

Data Storage System

  • Requirements:

    • High performance for heavy write loads.
    • Efficient handling of spiky read loads.
    • Support for time-series operations like moving averages and rolling time windows.
    • Tagging/labeling data with efficient indexing.
  • Challenges with General-Purpose Databases:

    • Relational Databases: Require expert-level tuning, not optimized for time-series operations, and struggle under heavy write loads.
    • NoSQL Databases: Require deep knowledge for scalable schema design, making industrial-scale time-series databases a more attractive option.
  • Recommended Time-Series Databases:

    • OpenTSDB: Distributed, based on Hadoop/HBase, adds complexity.
    • MetricsDB (Twitter): Used internally by Twitter.
    • Amazon Timestream: AWS’s time-series database service.
    • InfluxDB: Popular for its high performance, ease of use, in-memory caching, and on-disk storage.
    • Prometheus: Widely used, supports real-time analysis, in-memory caching, and durable on-disk storage.

Metrics Data Collection Methods

Pull Model

  • Mechanism:

    • Dedicated metric collectors pull metrics from running applications periodically.
    • Requires the complete list of service endpoints.
    • Service Discovery: Utilizes etcd, Zookeeper, etc., for dynamic endpoint management.
    • Collectors fetch configuration metadata (pulling interval, IP addresses, timeout, retry parameters) from Service Discovery.
    • Metrics data is pulled via a predefined HTTP endpoint (e.g., /metrics).
    • Services expose the endpoint using a client library.
  • Implementation:

    • Endpoint Management: Metrics collectors can register change event notifications or periodically poll for endpoint updates.
    • Scaling: Uses a pool of collectors to handle high demand.
    • Coordination: A consistent hash ring can be used to avoid duplicate data collection by mapping each monitored server to exactly one collector.
  • Advantages:

    • Easy debugging: View metrics anytime via the /metrics endpoint.
    • Health check: Quickly identify if an application server is down.
  • Disadvantages:

    • Firewall/network complications: Requires all metric endpoints to be reachable.
    • Performance: Typically uses TCP, which might have higher latency compared to UDP used in push models.
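The consistent-hash-ring coordination mentioned above can be sketched as follows: servers and collectors are hashed onto the same ring, and each server is pulled by the first collector clockwise from it. This is a minimal illustration without virtual nodes, which production rings typically add for balance.

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class CollectorRing:
    """Consistent hash ring assigning each monitored server to exactly
    one collector, so no two collectors pull the same endpoint."""
    def __init__(self, collectors):
        self.ring = sorted((_hash(c), c) for c in collectors)
        self.keys = [h for h, _ in self.ring]

    def collector_for(self, server):
        # First collector clockwise from the server's position (wraps around).
        idx = bisect.bisect(self.keys, _hash(server)) % len(self.ring)
        return self.ring[idx][1]

ring = CollectorRing(["collector-1", "collector-2", "collector-3"])
owner = ring.collector_for("10.0.0.5")
assert owner == ring.collector_for("10.0.0.5")  # assignment is deterministic
```

Because assignments move only locally when a collector joins or leaves, scaling the collector pool reshuffles a minimal share of endpoints.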

Push Model

  • Mechanism:

    • Collection agent installed on each monitored server.
    • Agents collect and push metrics periodically to the metrics collector.
    • Agents can aggregate metrics locally to reduce data volume before sending.
  • Implementation:

    • Handling High Traffic: Agents can buffer data locally and resend if the metrics collector rejects pushes due to high load.
    • Auto-scaling: Metrics collectors in an auto-scaling cluster with a load balancer to manage varying loads.
  • Advantages:

    • Handles short-lived jobs effectively.
    • Better for complex network setups: Collectors can receive data from anywhere when fronted by a load balancer and an auto-scaling group.
    • Lower latency: Typically uses UDP for metric transport.
  • Disadvantages:

    • Potential data loss: locally buffered metrics can be lost when servers are rotated out (e.g., during auto-scaling) before the buffer is flushed.
    • Data authenticity: Push systems might require whitelisting or authentication to ensure authenticity of metrics.
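A push-model agent combining the two implementation points above (local aggregation plus buffer-and-retry on rejection) might look like this sketch. The `send` callback standing in for the network push, and its accepted/rejected boolean, are assumptions for illustration.

```python
class PushAgent:
    """Collection agent: aggregates counters locally over an interval,
    buffers the batch, and retries if the collector rejects the push."""
    def __init__(self, send):
        self.send = send   # callable: batch -> bool (True if accepted)
        self.counters = {}
        self.buffer = []

    def incr(self, name, amount=1):
        # Local aggregation: sum counter increments before sending.
        self.counters[name] = self.counters.get(name, 0) + amount

    def flush(self):
        if self.counters:
            self.buffer.append(dict(self.counters))
            self.counters.clear()
        # Retry buffered batches oldest-first; stop on the first rejection
        # so rejected batches remain buffered for the next flush.
        while self.buffer and self.send(self.buffer[0]):
            self.buffer.pop(0)

accepted = []
agent = PushAgent(send=lambda batch: (accepted.append(batch), True)[1])
agent.incr("http.requests")
agent.incr("http.requests")
agent.flush()
assert accepted == [{"http.requests": 2}]
```

Note how this sketch also exhibits the push model's main risk: if the process dies while `buffer` is non-empty, those batches are lost.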

Examples

  • Pull Architectures:

    • Prometheus: Uses a pull model for collecting metrics.
  • Push Architectures:

    • Amazon CloudWatch: Uses a push model.
    • Graphite: Another example of a push-based system.

Pull vs. Push: Considerations

  • Debugging:

    • Pull: Easy to view metrics anytime via the /metrics endpoint.
    • Push: Might involve additional steps to verify metrics.
  • Health Check:

    • Pull: Quick identification if an application server is down.
    • Push: More complicated to diagnose if metrics are not received due to network issues.
  • Short-lived Jobs:

    • Pull: Can be challenging; may require push gateways.
    • Push: Handles short-lived jobs better.
  • Network Setups:

    • Pull: Requires reachable metric endpoints, problematic in multi-data center setups.
    • Push: Collectors can receive data from anywhere with load balancers and auto-scaling.
  • Performance:

    • Pull: Uses TCP, potentially higher latency.
    • Push: Uses UDP, lower latency but less reliable.
  • Data Authenticity:

    • Pull: Metrics collected from predefined, authentic endpoints.
    • Push: Requires whitelisting or authentication to ensure data authenticity.

Things to consider: Aggregation Points for Metrics

  1. Collection Agent

    • Location: Client-side
    • Capabilities: Supports simple aggregation logic.
    • Example: Aggregating a counter every minute before sending it to the metrics collector.
    • Pros:
      • Reduces data volume sent to the collector.
    • Cons:
      • Limited to basic aggregation logic.
  2. Ingestion Pipeline

    • Location: Before writing to storage
    • Tools: Requires stream processing engines such as Flink.
    • Function: Aggregates data before storing it in the database.
    • Pros:
      • Significantly reduces write volume as only aggregated results are stored.
    • Cons:
      • Handling late-arriving events can be challenging.
      • Loss of data precision and flexibility due to not storing raw data.
  3. Query Side

    • Location: After writing to storage
    • Function: Aggregates raw data at query time over a specified period.
    • Pros:
      • No data loss; raw data is preserved.
    • Cons:
      • Slower queries, since aggregation runs at read time over all the raw data in the requested range.
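Query-side aggregation (option 3) can be sketched as bucketing raw points into fixed windows at read time. The function shape below is illustrative, not any particular database's query API.

```python
def query_aggregate(points, start, end, window, fn=sum):
    """Query-side aggregation: bucket raw (timestamp, value) points into
    fixed windows over [start, end) and aggregate each bucket at read time.
    Raw data is untouched, but every query rescans the range."""
    buckets = {}
    for ts, value in points:
        if start <= ts < end:
            buckets.setdefault((ts - start) // window, []).append(value)
    return {start + b * window: fn(vals) for b, vals in sorted(buckets.items())}

raw = [(0, 1.0), (30, 2.0), (60, 4.0), (90, 8.0)]
assert query_aggregate(raw, 0, 120, 60) == {0: 3.0, 60: 12.0}
```

Agent-side or ingestion-pipeline aggregation would compute these same buckets once, before storage, trading precision and flexibility for write volume.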

Things to consider: Choosing a Time-Series Database

  • Query Patterns
    • Research Insight: According to Facebook’s Gorilla paper, at least 85% of queries to their operational data store were for data collected in the past 26 hours.
    • Implication: A time-series database that optimizes for recent data can significantly enhance performance.
    • Example: InfluxDB’s storage engine is designed to leverage this query pattern effectively.

Things to consider: Data Management Strategies

  1. Downsampling

    • Purpose: Reduces overall disk usage by converting high-resolution data to lower resolution over time.
    • Retention Rules Example:
      • First 7 days: full resolution (no downsampling).
      • 7–30 days: downsampled to 1-minute resolution.
      • 30 days–1 year: downsampled to 1-hour resolution.
  2. Cold Storage

    • Definition: Storage of inactive data that is rarely accessed.
    • Cost: Financially more cost-effective than active storage.
  3. Space Optimization Through Data Encoding and Compression

    • Benefits: Significantly reduces the size of stored data.
    • Built-in Features: Good time-series databases typically include encoding and compression capabilities.
    • Example: InfluxDB and similar databases offer advanced compression algorithms to minimize storage footprint.
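The downsampling strategy above can be sketched as re-bucketing full-resolution points into coarser buckets, e.g. averaging each bucket. The averaging choice is an assumption; real systems often keep several aggregates (min, max, sum, count) per bucket.

```python
def downsample(points, resolution, agg=lambda vs: sum(vs) / len(vs)):
    """Downsample (timestamp, value) points to the given resolution in
    seconds, aggregating each bucket. Per the retention rules above, a
    job would apply 60s after 7 days and 3600s after 30 days."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % resolution, []).append(value)
    return sorted((ts, agg(vals)) for ts, vals in buckets.items())

raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0)]
assert downsample(raw, 60) == [(0, 2.0), (60, 6.0)]
```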