Notification System
- Adapted from ByteByteGo courses
Questions
What
- What types of notifications does the system support? (Push notification, SMS message, and email)
- What are the supported devices? (iOS devices, Android devices, laptop/desktop)
- What triggers notifications? (Client applications, scheduled on the server-side)
- What is the volume of notifications sent out each day? (10 million push notifications, 1 million SMS messages, 5 million emails)
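The daily volumes above translate into average send rates as follows. This is a back-of-the-envelope sketch that assumes traffic is spread evenly over the day, which real traffic is not; the names `DAILY_VOLUME` and `average_per_second` are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes taken from the requirement above.
DAILY_VOLUME = {
    "push": 10_000_000,
    "sms": 1_000_000,
    "email": 5_000_000,
}


def average_per_second(daily_count: int) -> float:
    """Average send rate, assuming an even spread over the day (an assumption)."""
    return daily_count / SECONDS_PER_DAY


average_rates = {channel: average_per_second(n) for channel, n in DAILY_VOLUME.items()}
total_rate = sum(average_rates.values())

for channel, rate in average_rates.items():
    print(f"{channel}: ~{rate:.0f} notifications/second on average")
print(f"total: ~{total_rate:.0f} notifications/second on average")
```

This works out to roughly 116 push notifications, 12 SMS messages, and 58 emails per second, about 185 per second in total. Peak rates will be higher than these averages, so capacity should be sized with headroom.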
When
- When should notifications be delivered? (As soon as possible, soft real-time with acceptable delays under high workload)
How
- How do users manage notifications? (Users can opt out to stop receiving notifications)
Overview
Rate Limiting
- To avoid overwhelming users, cap the number of notifications each user can receive in a given period. Users who are notified too often may turn notifications off completely.
- Before sending, the notification system checks the user's settings (for example, whether the user has opted out).
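A minimal sketch of the two checks above, assuming a per-user sliding-window limit. The class name `NotificationGate`, its parameters, and the in-memory storage are hypothetical; a real system would keep this state in a shared store.

```python
from collections import defaultdict, deque


class NotificationGate:
    """Decides whether a notification may be sent to a user (illustrative only)."""

    def __init__(self, max_per_window: int, window_seconds: float):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.opted_out = set()             # user ids that opted out
        self.sent_at = defaultdict(deque)  # user id -> timestamps of recent sends

    def opt_out(self, user_id: str) -> None:
        self.opted_out.add(user_id)

    def allow(self, user_id: str, now: float) -> bool:
        # 1. User settings come first: opted-out users receive nothing.
        if user_id in self.opted_out:
            return False
        # 2. Forget sends that have fallen out of the sliding window.
        recent = self.sent_at[user_id]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        # 3. Enforce the per-user limit.
        if len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        return True


gate = NotificationGate(max_per_window=2, window_seconds=60)
gate.opt_out("bob")
decisions = [
    gate.allow("alice", now=0),
    gate.allow("alice", now=10),
    gate.allow("alice", now=20),  # third send within 60 s: rejected
    gate.allow("alice", now=61),  # the send at t=0 has expired: allowed
    gate.allow("bob", now=0),     # opted out: rejected
]
print(decisions)
```

Passing `now` in explicitly keeps the example deterministic; production code would read a clock instead.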
You Cannot Have Exactly-Once Delivery
Common Misconceptions in Distributed Systems:
- Many engineers fundamentally misunderstand how distributed systems behave.
- These misconceptions are common and often stem from a lack of exposure or education.
Exactly-Once Delivery:
- Impossible in Distributed Systems:
- Web browser and server, server and database, server and message queue are all distributed systems.
- Exactly-once delivery semantics cannot be achieved in these systems.
- Delivery Semantics:
- At-Most-Once: Message might be delivered once or not at all.
- At-Least-Once: Message is delivered one or more times.
- Exactly-Once: Desired but unachievable in practice.
Challenges:
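- Network partitions and dropped connections mean a sender can never be sure whether a message, or its acknowledgment, actually arrived.
- The Two Generals Problem shows that two parties cannot reach guaranteed agreement over an unreliable channel, and the FLP result shows that consensus cannot be guaranteed in an asynchronous system where a process may fail.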
Trade-offs and Practical Solutions:
- At-Most-Once Delivery: Acknowledging before processing; risk of data loss if the receiver crashes.
- At-Least-Once Delivery: Acknowledging after processing; risk of duplication if the ack is lost or receiver crashes post-processing.
- Idempotent Operations: Ensuring that applying the same state change multiple times doesn’t lead to inconsistencies.
- Deduplication: Handling message duplications to simulate exactly-once delivery.
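The difference between the first two options comes down to where the acknowledgment sits relative to processing. The sketch below simulates a consumer that crashes between the two steps; `ToyQueue`, `consume`, and `simulate` are hypothetical names, not the API of any real broker.

```python
class ConsumerCrash(Exception):
    """Raised to simulate the consumer dying between its two steps."""


class ToyQueue:
    """In-memory queue: a message stays pending until it is acknowledged."""

    def __init__(self, messages):
        self.pending = list(messages)

    def peek(self):
        return self.pending[0] if self.pending else None

    def ack(self, message):
        self.pending.remove(message)


def consume(queue, process, ack_first: bool, crash_between: bool) -> None:
    message = queue.peek()
    if message is None:
        return
    # At-most-once acks before processing; at-least-once acks after.
    steps = [queue.ack, process] if ack_first else [process, queue.ack]
    steps[0](message)
    if crash_between:
        raise ConsumerCrash()
    steps[1](message)


def simulate(ack_first: bool) -> list:
    queue = ToyQueue(["welcome-email"])
    processed = []
    try:
        consume(queue, processed.append, ack_first, crash_between=True)
    except ConsumerCrash:
        pass  # the consumer restarts below
    # After the restart, drain whatever the queue still holds.
    while queue.peek() is not None:
        consume(queue, processed.append, ack_first, crash_between=False)
    return processed


at_most_once = simulate(ack_first=True)    # message lost: never processed
at_least_once = simulate(ack_first=False)  # message processed twice
print(at_most_once, at_least_once)
```

With the ack first, the crash loses the message; with the ack last, the crash causes a redelivery and the handler runs twice. Neither ordering yields exactly one processing under failure.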
Protocols and Systems:
- Atomic Broadcast Protocols: Ensure messages are delivered reliably and in order, but require high coordination.
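- Zab Protocol: The atomic broadcast protocol behind ZooKeeper; its state changes are idempotent, so applying one more than once is harmless as long as the order of application is preserved.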
Examples and Real-world Applications:
- Apache Kafka: Historically used ZooKeeper for cluster coordination (such as controller election and metadata); newer versions replace it with the built-in KRaft mode.
- RabbitMQ: Producers retransmit unacknowledged messages, leading to potential duplication which consumers must handle.
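The retransmission behavior can be illustrated with a toy broker whose confirmation gets lost. This is a simulation under stated assumptions, not RabbitMQ's client API; `LossyBroker` and `publish_with_retry` are invented for the example.

```python
class LossyBroker:
    """Toy broker: stores every publish but drops the first `lost_acks` confirmations."""

    def __init__(self, lost_acks: int):
        self.lost_acks = lost_acks
        self.stored = []

    def publish(self, message: dict) -> bool:
        self.stored.append(message)  # the message itself always arrives
        if self.lost_acks > 0:
            self.lost_acks -= 1
            return False             # the confirmation is lost in transit
        return True


def publish_with_retry(broker: LossyBroker, message: dict, max_attempts: int = 5) -> int:
    """Retransmit until a confirmation is seen; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        if broker.publish(message):
            return attempt
    raise RuntimeError("no confirmation received")


broker = LossyBroker(lost_acks=1)
attempts = publish_with_retry(broker, {"id": "n-1", "body": "Your order shipped"})
print(attempts, len(broker.stored))
```

The producer cannot tell a lost message from a lost confirmation, so it resends. The broker ends up holding two copies of the same message, and the consumer has to cope with that.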
Design Implications:
- Distributed systems need to be designed with failure and asynchrony in mind.
- Understanding and choosing appropriate delivery semantics is crucial for system reliability.
- The focus should be on ensuring idempotency or handling duplicates to achieve reliable outcomes.
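One common way to handle duplicates is to give every message a unique id and have the consumer skip ids it has already processed. A minimal sketch, assuming messages carry an `id` field; `DeduplicatingConsumer` and the in-memory set are hypothetical stand-ins for a durable store with expiry.

```python
class DeduplicatingConsumer:
    """Sends each notification once per message id (illustrative only)."""

    def __init__(self, send):
        self.send = send
        self.seen_ids = set()  # assumption: a real system uses a durable store

    def handle(self, message: dict) -> bool:
        """Return True if the message was sent, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self.seen_ids:
            return False
        self.send(message)
        self.seen_ids.add(message_id)
        return True


outbox = []
consumer = DeduplicatingConsumer(send=outbox.append)
message = {"id": "n-1", "body": "Your order shipped"}

results = [consumer.handle(message), consumer.handle(message)]  # redelivery
print(results, len(outbox))
```

The redelivered copy is dropped, so the user sees one notification. A crash between `send` and recording the id would still produce a duplicate, which is why this only simulates exactly-once rather than achieving it.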
Conclusion:
- Exactly-once delivery is a myth in distributed systems; at-least-once delivery is the practical choice.
- Design systems with the understanding that perfect reliability isn’t possible, but resilience and fault-tolerance can be achieved.
There Is No Now
Simultaneity Issues in Distributed Systems:
- Perception of “Now”:
- Writing: Significant delay between writing and reading.
- Speaking: Perceived immediacy, but actual delay due to sound travel.
- Visual: Perception delay due to light travel.
- Physical Limitations:
- Information transfer takes time.
- Electricity in a wire travels at a finite speed.
- Computing systems must operate within these physical constraints.
Synchronizing Time:
- NTP (Network Time Protocol):
- Estimates message round-trip time and uses it to compute the offset between the client and server clocks.
- GPS:
- Satellites with atomic clocks synchronize time and provide precise measurements.
- Challenges:
- Even with advanced technology, perfect synchronization is unattainable due to failures and delays.
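The NTP estimate rests on four timestamps per exchange. The sketch below applies the standard offset and delay formulas to made-up numbers; the function name and the sample timestamps are assumptions for illustration.

```python
def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """Estimate clock offset and round-trip delay from one request/reply.

    t0: client sends request   (client clock)
    t1: server receives it     (server clock)
    t2: server sends reply     (server clock)
    t3: client receives reply  (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Server clock is really 5.0 s ahead; each direction takes 0.1 s,
# and the server spends 0.02 s handling the request.
symmetric = ntp_offset_and_delay(t0=100.0, t1=105.1, t2=105.12, t3=100.22)

# Same clocks, but the request takes 0.3 s and the reply 0.1 s.
asymmetric = ntp_offset_and_delay(t0=100.0, t1=105.3, t2=105.3, t3=100.4)

print(symmetric, asymmetric)
```

With symmetric paths the estimate recovers the true 5.0 s offset. With asymmetric paths it comes out near 5.1 s, off by half the difference between the two one-way delays. The client cannot observe that asymmetry, which is one reason perfect synchronization stays out of reach.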
Impossibility Results:
- FLP Result:
- Shows that no deterministic protocol can guarantee consensus in a fully asynchronous system if even one process may fail.
- CAP Theorem:
- States that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance; when a network partition occurs, it must give up either consistency or availability.
- Practical Implications:
- Systems must be designed with the understanding that components will fail.
Fault Tolerance:
- Google’s Spanner:
- Uses NTP, GPS, and atomic clocks to minimize time uncertainty.
- Its TrueTime API confronts clock uncertainty directly by returning a range of possible times rather than a single timestamp.
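The idea of reporting time as a range can be sketched as an interval type, where two events are ordered only if their intervals do not overlap. This is an illustration of the concept, not Spanner's actual API; `TimeInterval`, `now_interval`, and the 4 ms uncertainty are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """The true time is assumed to lie somewhere in [earliest, latest]."""

    earliest: float
    latest: float

    def definitely_before(self, other: "TimeInterval") -> bool:
        # Only non-overlapping intervals can be ordered with certainty.
        return self.latest < other.earliest


def now_interval(clock_reading: float, uncertainty: float) -> TimeInterval:
    """Widen a local clock reading by its assumed error bound."""
    return TimeInterval(clock_reading - uncertainty, clock_reading + uncertainty)


a = now_interval(100.000, uncertainty=0.004)
b = now_interval(100.005, uncertainty=0.004)  # overlaps with a
c = now_interval(100.010, uncertainty=0.004)  # entirely after a

print(a.definitely_before(b), a.definitely_before(c))
```

Events `a` and `b` are too close to order safely, while `a` is certainly before `c`. Shrinking the uncertainty, which is what better clock hardware buys, shrinks the window in which events cannot be ordered.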
Coordination and Consensus Protocols:
- Paxos, Zab, Raft:
- Provide mechanisms to achieve consensus despite failures.
- Logical Time:
- Techniques like vector clocks abstract over unreliable physical clocks.
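A minimal vector clock sketch, assuming each node is identified by a string and clocks are plain dictionaries. The function names are chosen for illustration.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, applied when a node receives another node's clock."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


a1 = increment({}, "A")             # event on node A
b1 = increment({}, "B")             # independent event on node B
b2 = increment(merge(b1, a1), "B")  # B receives A's clock, then records an event

print(compare(a1, b1), compare(a1, b2))
```

The first comparison reports the events as concurrent, since neither node had heard from the other. The second reports `a1` as before `b2`, because `b2` was recorded after B received A's clock. No wall-clock reading is involved in either result.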
Design Trade-offs:
- Coordination vs. Performance:
- Constant coordination incurs latency and throughput costs.
- Designing for minimal necessary coordination can improve performance.
- CRDTs (Conflict-Free Replicated Data Types):
- Avoid the need for strict ordering by ensuring that merging updates is commutative, associative, and idempotent, so replicas converge regardless of delivery order or duplication.
- Enable strong eventual consistency.
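A grow-only counter (G-Counter) is one of the simplest CRDTs and shows these properties in a few lines. This is a sketch with hypothetical names; it could, for example, count notifications sent across replicas.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count contributed by that replica

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max: commutative, associative, and idempotent.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a = GCounter("replica-a")
b = GCounter("replica-b")
a.increment(2)
b.increment(3)

a.merge(b)
a.merge(b)  # merging the same state twice changes nothing
b.merge(a)

print(a.value(), b.value())
```

Both replicas converge to 5 without coordinating, and the repeated merge is harmless. That makes the structure a good fit for at-least-once delivery of state updates.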
Practical Examples:
- TCP:
- Provides stronger guarantees (reliable, ordered delivery within a connection) than the fully asynchronous network assumed by theoretical models, which gives distributed systems useful properties to build on.
- ZooKeeper and Zab:
- Built on TCP's reliability guarantees, providing safety guarantees that are practical yet formally backed.
Ad-hoc Solutions and Their Pitfalls:
- “Last Write Wins” Policies:
- Misleading, because "last" has no reliable meaning when each node reads its own clock; a write that actually happened later can be silently discarded, causing unpredictable data loss.
- Ad-hoc Coordination:
- Custom solutions should be well-documented to avoid future issues and assist in debugging.
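The data loss from a timestamp-based "last write wins" policy is easy to reproduce with one skewed clock. In this sketch `LwwRegister` is a hypothetical register, and the 5-second skew is an assumed value.

```python
class LwwRegister:
    """Keeps whichever write carries the highest timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = float("-inf")

    def write(self, value, timestamp: float) -> bool:
        """Return True if the write was kept, False if it was discarded."""
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
            return True
        return False


CLOCK_SKEW_A = 5.0  # assumption: node A's clock runs 5 seconds fast

register = LwwRegister()

# Real time 100: node A mutes notifications, stamped by its fast clock.
kept_first = register.write("muted", timestamp=100.0 + CLOCK_SKEW_A)

# Real time 102: node B unmutes them, stamped by an accurate clock.
kept_second = register.write("unmuted", timestamp=102.0)

print(register.value, kept_first, kept_second)
```

The register still says `muted` even though the unmute happened two seconds later in real time. The later write is dropped without any error, which is the unpredictable loss described above.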