Metrics Monitoring and Alerting System
- Adapted from ByteByteGo courses
Questions
Who
- Who are we building the system for? (In-house system for internal use)
What
- What metrics do we want to collect? (Operational system metrics, not business metrics)
- What is the scale of the infrastructure? (100 million daily active users, 1,000 server pools, 100 machines per pool)
- What are the supported alert channels? (Email, phone, PagerDuty, webhooks)
How
- How long should we keep the data? (1-year retention)
- How can we reduce the resolution of the metrics data? (Full resolution for 7 days, 1-minute resolution for 30 days, 1-hour resolution thereafter)
Why
- Why are we not collecting logs or supporting distributed system tracing? (Out of scope)
Overview
Components
Data Collection
- Function: Collect metric data from various sources.
- Details: Collects operational metrics, system performance data, and application-specific metrics.
Data Transmission
- Function: Transfer collected data to the metrics monitoring system.
- Details: Uses protocols like HTTP, gRPC, or custom protocols for efficient data transfer.
Data Storage
- Function: Organize and store the incoming data.
- Details: Utilizes time-series databases optimized for high write loads and bursty read loads.
Alerting
- Function: Analyze data, detect anomalies, and generate alerts.
- Details: Configurable to send alerts to various channels (e.g., email, SMS, Slack).
Visualization
- Function: Present data visually in graphs and charts.
- Details: Enables engineers to identify patterns, trends, or issues more effectively.
Introducing a Queuing Component to Mitigate Data Loss
Risk Mitigation:
- Address the risk of data loss when the time-series database is unavailable by introducing a queuing component.
Queuing System:
- Metrics Collection:
- Metrics collector sends data to a queuing system like Kafka.
- Processing and Storage:
- Consumers or streaming processing services (e.g., Apache Storm, Flink, Spark) process and push data to the time-series database.
Advantages of Using Kafka
- Reliability and Scalability:
- Kafka provides a highly reliable and scalable distributed messaging platform.
- Decoupling Services:
- Decouples data collection from data processing services.
- Data Retention:
- Prevents data loss by retaining data in Kafka when the database is unavailable.
Scaling the System with Kafka
- Partition Configuration:
- Configure the number of Kafka partitions based on throughput requirements.
- Data Partitioning:
- Partition metrics data by metric names.
- Further partition metrics data with tags/labels for finer granularity.
- Prioritization:
- Categorize and prioritize metrics to ensure important metrics are processed first.
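The partitioning scheme above can be sketched in plain Python. This is illustrative only: the function name, the partition count, and the key format are assumptions, not part of any specific Kafka client API. Keying by metric name keeps each series ordered within one partition; folding in labels spreads hot metrics across partitions at finer granularity.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed; configure based on throughput requirements

def partition_for(metric_name, labels=None):
    """Map a metric (and optionally its labels) to a Kafka partition index."""
    key = metric_name
    if labels:
        # Sort labels so the same label set always yields the same key.
        key += "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

In a real producer, this value (or the raw key) would be passed as the message key so Kafka routes all points of one series to the same partition.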
Zoom in: Alert System
Load Config Files to Cache Servers
- Config Format: Rules defined as config files, typically in YAML format.
- Purpose: Cache servers hold the configuration files for quick access.
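A hypothetical alert rule in YAML might look like the following. All field names here are illustrative assumptions, not the schema of any particular alerting product:

```yaml
- name: high_cpu_load          # rule name (illustrative)
  check_interval_sec: 60       # how often the alert manager evaluates the rule
  metric: cpu.load_avg
  filter: "host =~ 'web-*'"
  threshold: 0.9               # create an alert event when the value exceeds this
  for: 5m                      # must stay in violation this long before firing
  severity: page
  channels: [email, pagerduty]
```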
Fetch Configs from Cache
- Component: Alert manager retrieves alert configurations from the cache.
Evaluate Alerts
- Process:
- Alert manager queries the service at predefined intervals based on config rules.
- If a value violates the threshold, an alert event is created.
- Responsibilities of Alert Manager:
- Filter, Merge, and Dedupe Alerts: Consolidates multiple alerts to avoid duplicates.
- Access Control: Restricts alert management operations to authorized individuals to prevent human error and enhance security.
- Retry Mechanism: Ensures notifications are sent at least once by checking alert states.
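The filter/merge/dedupe responsibility can be sketched as a fingerprint plus a quiet window: alerts with the same rule and labels within the window are treated as duplicates. This is a minimal illustration; the class and field names are invented.

```python
import time

class AlertDeduper:
    """Suppress duplicate alert events within a quiet window (sketch only)."""

    def __init__(self, window_sec=300):
        self.window_sec = window_sec
        self._last_sent = {}  # fingerprint -> timestamp of last sent alert

    @staticmethod
    def fingerprint(alert):
        # Same rule + same label set => same underlying alert.
        return (alert["rule"], tuple(sorted(alert.get("labels", {}).items())))

    def should_send(self, alert, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.window_sec:
            return False  # duplicate within the window: merge/suppress it
        self._last_sent[fp] = now
        return True
```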
Store Alert States
- Component: Alert store (key-value database such as Cassandra).
- Function: Keeps the state of all alerts (inactive, pending, firing, resolved).
Insert Alerts into Kafka
- Action: Eligible alerts are inserted into Kafka for further processing.
Pull Alert Events from Kafka
- Component: Alert consumers pull alert events from Kafka.
Process and Send Notifications
- Component: Alert consumers process the pulled alert events.
- Notification Channels: Sends notifications via different channels such as:
- Text message
- PagerDuty
- HTTP endpoints
Data Model
- Time Series: Metrics data is recorded as a time series, with each series uniquely identified by its name and optional labels.
- Write Load: Heavy and constant, with around 10 million data points written per day.
- Read Load: Spiky, driven by visualization and alerting services.
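A minimal sketch of this data model in Python (field names are illustrative): a series is identified by its name plus sorted labels, and each sample adds a (timestamp, value) pair.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    """One sample in a time series."""
    name: str        # e.g. "cpu.load_avg"
    labels: tuple    # e.g. (("host", "web-01"), ("env", "prod"))
    timestamp: int   # Unix epoch seconds
    value: float

def series_id(point):
    # A series is uniquely identified by its name plus its sorted label set.
    return (point.name, tuple(sorted(point.labels)))
```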
Data Storage System
Requirements:
- High performance for heavy write loads.
- Efficient handling of spiky read loads.
- Support for time-series operations like moving averages and rolling time windows.
- Tagging/labeling data with efficient indexing.
Challenges with General-Purpose Databases:
- Relational Databases: Require expert-level tuning, not optimized for time-series operations, and struggle under heavy write loads.
- NoSQL Databases: Require deep knowledge for scalable schema design, making industrial-scale time-series databases a more attractive option.
Recommended Time-Series Databases:
- OpenTSDB: A distributed time-series database built on Hadoop/HBase; operating Hadoop/HBase adds complexity.
- MetricsDB (Twitter): Used internally by Twitter.
- Amazon Timestream: AWS’s time-series database service.
- InfluxDB: Popular for its high performance, ease of use, in-memory caching, and on-disk storage.
- Prometheus: Widely used, supports real-time analysis, in-memory caching, and durable on-disk storage.
Metrics Data Collection Methods
Pull Model
Mechanism:
- Dedicated metric collectors pull metrics from running applications periodically.
- Requires the complete list of service endpoints.
- Service Discovery: Utilizes etcd, Zookeeper, etc., for dynamic endpoint management.
- Collectors fetch configuration metadata (pulling interval, IP addresses, timeout, retry parameters) from Service Discovery.
- Metrics data is pulled via a predefined HTTP endpoint (e.g., /metrics).
- Services expose the endpoint using a client library.
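A minimal sketch of a pull-style /metrics endpoint using only the Python standard library. The metric names and the simple line-oriented text format are illustrative; a real service would use a client library (e.g., a Prometheus client) rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counters; a real client library would update these safely.
METRICS = {"http_requests_total": 1024, "process_open_fds": 32}

def render_metrics(metrics):
    """Render counters in a simple `name value` line-per-metric format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

# To expose the endpoint (blocks forever):
# HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```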
Implementation:
- Endpoint Management: Metrics collectors can register change event notifications or periodically poll for endpoint updates.
- Scaling: Uses a pool of collectors to handle high demand.
- Coordination: Consistent hash ring can be used to avoid duplicate data collection by assigning each collector to a unique range.
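The consistent hash ring for collector assignment can be sketched as follows. This is a simplified illustration with invented names; production rings add replica tuning and rebalancing logic when collectors join or leave.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class CollectorRing:
    """Each collector owns ranges of the ring via virtual nodes, so every
    target server is scraped by exactly one collector (sketch only)."""

    def __init__(self, collectors, vnodes=100):
        self._ring = sorted(
            (_hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def collector_for(self, server):
        # The first ring position at or after the server's hash owns it.
        idx = bisect.bisect(self._keys, _hash(server)) % len(self._ring)
        return self._ring[idx][1]
```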
Advantages:
- Easy debugging: View metrics anytime via the /metrics endpoint.
- Health check: Quickly identify if an application server is down.
Disadvantages:
- Firewall/network complications: Requires all metric endpoints to be reachable.
- Performance: Typically uses TCP, which might have higher latency compared to UDP used in push models.
Push Model
Mechanism:
- Collection agent installed on each monitored server.
- Agents collect and push metrics periodically to the metrics collector.
- Agents can aggregate metrics locally to reduce data volume before sending.
Implementation:
- Handling High Traffic: Agents can buffer data locally and resend if the metrics collector rejects pushes due to high load.
- Auto-scaling: Metrics collectors in an auto-scaling cluster with a load balancer to manage varying loads.
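The agent's local buffering can be sketched like this. It is illustrative only: `send` is an assumed callable that returns whether the collector accepted the batch, and a bounded deque drops the oldest points on overflow.

```python
import collections

class PushAgent:
    """Collection agent that buffers metrics locally and retries when the
    metrics collector rejects a push, e.g. under high load (sketch only)."""

    def __init__(self, send, max_buffer=10_000):
        self._send = send  # callable: batch -> bool (accepted?)
        # Bounded buffer: oldest points are dropped if the collector is
        # unreachable for too long.
        self._buffer = collections.deque(maxlen=max_buffer)

    def record(self, point):
        self._buffer.append(point)

    def flush(self):
        batch = list(self._buffer)
        if batch and self._send(batch):
            self._buffer.clear()  # accepted: drop the buffered copy
            return True
        return False              # rejected: keep data, retry next flush
```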
Advantages:
- Handles short-lived jobs effectively.
- Better for complex network setups: Collectors receive data from anywhere if set up with a load balancer and auto-scaling group.
- Lower latency: Typically uses UDP for metric transport.
Disadvantages:
- Potential data loss if local buffering is used and servers are frequently rotated out.
- Data authenticity: Push systems might require whitelisting or authentication to ensure authenticity of metrics.
Examples
Pull Architectures:
- Prometheus: Uses a pull model for collecting metrics.
Push Architectures:
- Amazon CloudWatch: Uses a push model.
- Graphite: Another example of a push-based system.
Pull vs. Push: Considerations
Debugging:
- Pull: Easy to view metrics anytime via the /metrics endpoint.
- Push: Might involve additional steps to verify metrics.
Health Check:
- Pull: Quick identification if an application server is down.
- Push: More complicated to diagnose if metrics are not received due to network issues.
Short-lived Jobs:
- Pull: Can be challenging; may require push gateways.
- Push: Handles short-lived jobs better.
Network Setups:
- Pull: Requires reachable metric endpoints, problematic in multi-data center setups.
- Push: Collectors can receive data from anywhere with load balancers and auto-scaling.
Performance:
- Pull: Uses TCP, potentially higher latency.
- Push: Uses UDP, lower latency but less reliable.
Data Authenticity:
- Pull: Metrics collected from predefined, authentic endpoints.
- Push: Requires whitelisting or authentication to ensure data authenticity.
Things to consider: Aggregation Points for Metrics
Collection Agent
- Location: Client-side
- Capabilities: Supports simple aggregation logic.
- Example: Aggregating a counter every minute before sending it to the metrics collector.
- Pros:
- Reduces data volume sent to the collector.
- Cons:
- Limited to basic aggregation logic.
Ingestion Pipeline
- Location: Before writing to storage
- Tools: Requires stream processing engines such as Flink.
- Function: Aggregates data before storing it in the database.
- Pros:
- Significantly reduces write volume as only aggregated results are stored.
- Cons:
- Handling late-arriving events can be challenging.
- Loss of data precision and flexibility due to not storing raw data.
Query Side
- Location: After writing to storage
- Function: Aggregates raw data at query time over a specified period.
- Pros:
- No data loss; raw data is preserved.
- Cons:
- Slower query speeds because aggregation is performed at query time against the entire dataset.
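Query-time aggregation over raw data, such as a moving average, might look like the sketch below. It operates on a plain list of values for clarity; in practice the time-series database evaluates such windows itself.

```python
def moving_average(values, window):
    """Average of each sliding window of `window` consecutive samples."""
    if window <= 0 or window > len(values):
        return []
    out = []
    total = sum(values[:window])
    out.append(total / window)
    for i in range(window, len(values)):
        # Slide the window: add the new sample, drop the oldest one.
        total += values[i] - values[i - window]
        out.append(total / window)
    return out
```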
Things to consider: Choosing a Time-Series Database
- Query Patterns
- Research Insight: According to Facebook’s research, 85% of queries to the operational data store are for data collected in the past 26 hours.
- Implication: A time-series database that optimizes for recent data can significantly enhance performance.
- Example: InfluxDB’s storage engine is designed to leverage this query pattern effectively.
Things to consider: Data Management Strategies
Downsampling
- Purpose: Reduces overall disk usage by converting high-resolution data to lower resolution over time.
- Retention Rules Example:
- 7 Days: Full resolution (no downsampling).
- 30 Days: Downsample to 1-minute resolution.
- 1 Year: Downsample to 1-hour resolution.
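The retention rules above boil down to bucketing samples by time and keeping one aggregate per bucket. A sketch, assuming `(timestamp, value)` pairs and averaging as the aggregate; a bucket width of 60 s gives 1-minute resolution, 3600 s gives 1-hour resolution:

```python
from collections import defaultdict

def downsample(points, bucket_sec=60):
    """Average (timestamp, value) samples within each bucket_sec-wide bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_sec].append(value)  # align to bucket start
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())
```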
Cold Storage
- Definition: Storage of inactive data that is rarely accessed.
- Cost: Financially more cost-effective than active storage.
Space Optimization Through Data Encoding and Compression
- Benefits: Significantly reduces the size of stored data.
- Built-in Features: Good time-series databases typically include encoding and compression capabilities.
- Example: InfluxDB and similar databases offer advanced compression algorithms to minimize storage footprint.