Metrics Monitoring and Alerting System
- Adapted from ByteByteGo courses
Questions
Who
- Who are we building the system for? (In-house system for internal use)
What
- What metrics do we want to collect? (Operational system metrics, not business metrics)
- What is the scale of the infrastructure? (100 million daily active users, 1,000 server pools, 100 machines per pool)
- What are the supported alert channels? (Email, phone, PagerDuty, webhooks)
How
- How long should we keep the data? (1-year retention)
- How can we reduce the resolution of the metrics data? (Full resolution for 7 days, 1-minute resolution for 30 days, 1-hour resolution thereafter)
Why
- Why are we not collecting logs or supporting distributed system tracing? (Out of scope)
Overview
Components
Data Collection
- Function: Collect metric data from various sources.
- Details: Collects operational metrics, system performance data, and application-specific metrics.
Data Transmission
- Function: Transfer collected data to the metrics monitoring system.
- Details: Uses protocols like HTTP, gRPC, or custom protocols for efficient data transfer.
Data Storage
- Function: Organize and store the incoming data.
- Details: Utilizes time-series databases optimized for high write loads and bursty read loads.
Alerting
- Function: Analyze data, detect anomalies, and generate alerts.
- Details: Configurable to send alerts to various channels (e.g., email, SMS, Slack).
Visualization
- Function: Present data visually in graphs and charts.
- Details: Enables engineers to identify patterns, trends, or issues more effectively.
Introducing a Queuing Component to Mitigate Data Loss
Risk Mitigation:
- Address the risk of data loss when the time-series database is unavailable by introducing a queuing component.
Queuing System:
- Metrics Collection:
- Metrics collector sends data to a queuing system like Kafka.
- Processing and Storage:
- Consumers or streaming processing services (e.g., Apache Storm, Flink, Spark) process and push data to the time-series database.
Advantages of Using Kafka
- Reliability and Scalability:
- Kafka provides a highly reliable and scalable distributed messaging platform.
- Decoupling Services:
- Decouples data collection from data processing services.
- Data Retention:
- Prevents data loss by retaining data in Kafka when the database is unavailable.
Scaling the System with Kafka
- Partition Configuration:
- Configure the number of Kafka partitions based on throughput requirements.
- Data Partitioning:
- Partition metrics data by metric names.
- Further partition metrics data with tags/labels for finer granularity.
- Prioritization:
- Categorize and prioritize metrics to ensure important metrics are processed first.
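The partitioning scheme above can be sketched in plain Python. This is illustrative only: the function name, the partition count, and the key format are assumptions, not part of any specific Kafka client API. Keying by metric name keeps each series ordered within one partition; folding in labels spreads hot metrics across partitions at finer granularity.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed; configure based on throughput requirements

def partition_for(metric_name, labels=None):
    """Map a metric (and optionally its labels) to a Kafka partition index."""
    key = metric_name
    if labels:
        # Sort labels so the same label set always yields the same key.
        key += "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

In a real producer, this value (or the raw key) would be passed as the message key so Kafka routes all points of one series to the same partition.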
Zoom in: Alert System
Load Config Files to Cache Servers
- Config Format: Rules defined as config files, typically in YAML format.
- Purpose: Cache servers hold the configuration files for quick access.
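A hypothetical alert rule in YAML might look like the following. All field names here are illustrative assumptions, not the schema of any particular alerting product:

```yaml
- name: high_cpu_load          # rule name (illustrative)
  check_interval_sec: 60       # how often the alert manager evaluates the rule
  metric: cpu.load_avg
  filter: "host =~ 'web-*'"
  threshold: 0.9               # create an alert event when the value exceeds this
  for: 5m                      # must stay in violation this long before firing
  severity: page
  channels: [email, pagerduty]
```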
Fetch Configs from Cache
- Component: Alert manager retrieves alert configurations from the cache.
Evaluate Alerts
- Process:
- Alert manager queries the service at predefined intervals based on config rules.
- If a value violates the threshold, an alert event is created.
- Responsibilities of Alert Manager:
- Filter, Merge, and Dedupe Alerts: Consolidates multiple alerts to avoid duplicates.
- Access Control: Restricts alert management operations to authorized individuals to prevent human error and enhance security.
- Retry Mechanism: Ensures notifications are sent at least once by checking alert states.
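The filter/merge/dedupe responsibility can be sketched as a fingerprint plus a quiet window: alerts with the same rule and labels within the window are treated as duplicates. This is a minimal illustration; the class and field names are invented.

```python
import time

class AlertDeduper:
    """Suppress duplicate alert events within a quiet window (sketch only)."""

    def __init__(self, window_sec=300):
        self.window_sec = window_sec
        self._last_sent = {}  # fingerprint -> timestamp of last sent alert

    @staticmethod
    def fingerprint(alert):
        # Same rule + same label set => same underlying alert.
        return (alert["rule"], tuple(sorted(alert.get("labels", {}).items())))

    def should_send(self, alert, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.window_sec:
            return False  # duplicate within the window: merge/suppress it
        self._last_sent[fp] = now
        return True
```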
Store Alert States
- Component: Alert store (key-value database such as Cassandra).
- Function: Keeps the state of all alerts (inactive, pending, firing, resolved).
Insert Alerts into Kafka
- Action: Eligible alerts are inserted into Kafka for further processing.
Pull Alert Events from Kafka
- Component: Alert consumers pull alert events from Kafka.
Process and Send Notifications
- Component: Alert consumers process the pulled alert events.
- Notification Channels: Sends notifications via different channels such as:
- Text message
- PagerDuty
- HTTP endpoints
Data Model
- Time Series: Metrics data is recorded as a time series, with each series uniquely identified by its name and optional labels.
- Write Load: Heavy and constant, with around 10 million data points written per day.
- Read Load: Spiky, driven by visualization and alerting services.
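A minimal sketch of this data model in Python (field names are illustrative): a series is identified by its name plus sorted labels, and each sample adds a (timestamp, value) pair.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    """One sample in a time series."""
    name: str        # e.g. "cpu.load_avg"
    labels: tuple    # e.g. (("host", "web-01"), ("env", "prod"))
    timestamp: int   # Unix epoch seconds
    value: float

def series_id(point):
    # A series is uniquely identified by its name plus its sorted label set.
    return (point.name, tuple(sorted(point.labels)))
```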
Data Storage System
Requirements:
- High performance for heavy write loads.
- Efficient handling of spiky read loads.
- Support for time-series operations like moving averages and rolling time windows.
- Tagging/labeling data with efficient indexing.
Challenges with General-Purpose Databases:
- Relational Databases: Require expert-level tuning, not optimized for time-series operations, and struggle under heavy write loads.
- NoSQL Databases: Require deep knowledge for scalable schema design, making industrial-scale time-series databases a more attractive option.
Recommended Time-Series Databases:
- OpenTSDB: A distributed time-series database built on Hadoop/HBase; operating Hadoop/HBase adds complexity.
- MetricsDB (Twitter): Used internally by Twitter.
- Amazon Timestream: AWS’s time-series database service.
- InfluxDB: Popular for its high performance, ease of use, in-memory caching, and on-disk storage.
- Prometheus: Widely used, supports real-time analysis, in-memory caching, and durable on-disk storage.
Metrics Data Collection Methods
Pull Model
Mechanism:
- Dedicated metric collectors pull metrics from running applications periodically.
- Requires the complete list of service endpoints.
- Service Discovery: Utilizes etcd, Zookeeper, etc., for dynamic endpoint management.
- Collectors fetch configuration metadata (pulling interval, IP addresses, timeout, retry parameters) from Service Discovery.
- Metrics data is pulled via a predefined HTTP endpoint (e.g., /metrics).
- Services expose the endpoint using a client library.
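A minimal sketch of a pull-style /metrics endpoint using only the Python standard library. The metric names and the simple line-oriented text format are illustrative; a real service would use a client library (e.g., a Prometheus client) rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counters; a real client library would update these safely.
METRICS = {"http_requests_total": 1024, "process_open_fds": 32}

def render_metrics(metrics):
    """Render counters in a simple `name value` line-per-metric format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

# To expose the endpoint (blocks forever):
# HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```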
Implementation:
- Endpoint Management: Metrics collectors can register change event notifications or periodically poll for endpoint updates.
- Scaling: Uses a pool of collectors to handle high demand.
- Coordination: Consistent hash ring can be used to avoid duplicate data collection by assigning each collector to a unique range.
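The consistent hash ring for collector assignment can be sketched as follows. This is a simplified illustration with invented names; production rings add replica tuning and rebalancing logic when collectors join or leave.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class CollectorRing:
    """Each collector owns ranges of the ring via virtual nodes, so every
    target server is scraped by exactly one collector (sketch only)."""

    def __init__(self, collectors, vnodes=100):
        self._ring = sorted(
            (_hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def collector_for(self, server):
        # The first ring position at or after the server's hash owns it.
        idx = bisect.bisect(self._keys, _hash(server)) % len(self._ring)
        return self._ring[idx][1]
```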
Advantages:
- Easy debugging: View metrics anytime via the /metrics endpoint.
- Health check: Quickly identify if an application server is down.
Disadvantages:
- Firewall/network complications: Requires all metric endpoints to be reachable.
- Performance: Typically uses TCP, which might have higher latency compared to UDP used in push models.
Push Model
Mechanism:
- Collection agent installed on each monitored server.
- Agents collect and push metrics periodically to the metrics collector.
- Agents can aggregate metrics locally to reduce data volume before sending.
Implementation:
- Handling High Traffic: Agents can buffer data locally and resend if the metrics collector rejects pushes due to high load.
- Auto-scaling: Metrics collectors in an auto-scaling cluster with a load balancer to manage varying loads.
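The agent's local buffering can be sketched like this. It is illustrative only: `send` is an assumed callable that returns whether the collector accepted the batch, and a bounded deque drops the oldest points on overflow.

```python
import collections

class PushAgent:
    """Collection agent that buffers metrics locally and retries when the
    metrics collector rejects a push, e.g. under high load (sketch only)."""

    def __init__(self, send, max_buffer=10_000):
        self._send = send  # callable: batch -> bool (accepted?)
        # Bounded buffer: oldest points are dropped if the collector is
        # unreachable for too long.
        self._buffer = collections.deque(maxlen=max_buffer)

    def record(self, point):
        self._buffer.append(point)

    def flush(self):
        batch = list(self._buffer)
        if batch and self._send(batch):
            self._buffer.clear()  # accepted: drop the buffered copy
            return True
        return False              # rejected: keep data, retry next flush
```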
Advantages:
- Handles short-lived jobs effectively.
- Better for complex network setups: Collectors receive data from anywhere if set up with a load balancer and auto-scaling group.
- Lower latency: Typically uses UDP for metric transport.
Disadvantages:
- Potential data loss if local buffering is used and servers are frequently rotated out.
- Data authenticity: Push systems might require whitelisting or authentication to ensure authenticity of metrics.
Examples
Pull Architectures:
- Prometheus: Uses a pull model for collecting metrics.
Push Architectures:
- Amazon CloudWatch: Uses a push model.
- Graphite: Another example of a push-based system.
Pull vs. Push: Considerations
Debugging:
- Pull: Easy to view metrics anytime via the /metrics endpoint.
- Push: Might involve additional steps to verify metrics.
Health Check:
- Pull: Quick identification if an application server is down.
- Push: More complicated to diagnose if metrics are not received due to network issues.
Short-lived Jobs:
- Pull: Can be challenging; may require push gateways.
- Push: Handles short-lived jobs better.
Network Setups:
- Pull: Requires reachable metric endpoints, problematic in multi-data center setups.
- Push: Collectors can receive data from anywhere with load balancers and auto-scaling.
Performance:
- Pull: Uses TCP, potentially higher latency.
- Push: Uses UDP, lower latency but less reliable.
Data Authenticity:
- Pull: Metrics collected from predefined, authentic endpoints.
- Push: Requires whitelisting or authentication to ensure data authenticity.
Things to consider: Aggregation Points for Metrics
Collection Agent
- Location: Client-side
- Capabilities: Supports simple aggregation logic.
- Example: Aggregating a counter every minute before sending it to the metrics collector.
- Pros:
- Reduces data volume sent to the collector.
- Cons:
- Limited to basic aggregation logic.
Ingestion Pipeline
- Location: Before writing to storage
- Tools: Requires stream processing engines such as Flink.
- Function: Aggregates data before storing it in the database.
- Pros:
- Significantly reduces write volume as only aggregated results are stored.
- Cons:
- Handling late-arriving events can be challenging.
- Loss of data precision and flexibility due to not storing raw data.
Query Side
- Location: After writing to storage
- Function: Aggregates raw data at query time over a specified period.
- Pros:
- No data loss; raw data is preserved.
- Cons:
- Slower query speeds because aggregation is performed at query time against the entire dataset.
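Query-time aggregation over raw data, such as a moving average, might look like the sketch below. It operates on a plain list of values for clarity; in practice the time-series database evaluates such windows itself.

```python
def moving_average(values, window):
    """Average of each sliding window of `window` consecutive samples."""
    if window <= 0 or window > len(values):
        return []
    out = []
    total = sum(values[:window])
    out.append(total / window)
    for i in range(window, len(values)):
        # Slide the window: add the new sample, drop the oldest one.
        total += values[i] - values[i - window]
        out.append(total / window)
    return out
```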
Things to consider: Choosing a Time-Series Database
- Query Patterns
- Research Insight: According to Facebook’s research, 85% of queries to the operational data store are for data collected in the past 26 hours.
- Implication: A time-series database that optimizes for recent data can significantly enhance performance.
- Example: InfluxDB’s storage engine is designed to leverage this query pattern effectively.
Things to consider: Data Management Strategies
Downsampling
- Purpose: Reduces overall disk usage by converting high-resolution data to lower resolution over time.
- Retention Rules Example:
- 7 Days: Full resolution (no downsampling).
- 30 Days: Downsample to 1-minute resolution.
- 1 Year: Downsample to 1-hour resolution.
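The retention rules above boil down to bucketing samples by time and keeping one aggregate per bucket. A sketch, assuming `(timestamp, value)` pairs and averaging as the aggregate; a bucket width of 60 s gives 1-minute resolution, 3600 s gives 1-hour resolution:

```python
from collections import defaultdict

def downsample(points, bucket_sec=60):
    """Average (timestamp, value) samples within each bucket_sec-wide bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_sec].append(value)  # align to bucket start
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())
```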
Cold Storage
- Definition: Storage of inactive data that is rarely accessed.
- Cost: Financially more cost-effective than active storage.
Space Optimization Through Data Encoding and Compression
- Benefits: Significantly reduces the size of stored data.
- Built-in Features: Good time-series databases typically include encoding and compression capabilities.
- Example: InfluxDB and similar databases offer advanced compression algorithms to minimize storage footprint.