Notification System

Modified from ByteByteGo courses

Questions

What

What types of notifications does the system support? (Push notification, SMS message, and email)
What are the supported devices? (iOS devices, Android devices, laptop/desktop)
What triggers notifications? (Client applications, or jobs scheduled on the server side)
What is the volume of notifications sent out each day? (10 million push notifications, 1 million SMS messages, 5 million emails)
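
As a rough capacity check, the daily volumes above can be converted into average per-second rates. The sketch below is only back-of-the-envelope arithmetic; the peak_factor of 3 is an assumption for illustration and does not come from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes taken from the requirements above.
DAILY_VOLUME = {
    "push": 10_000_000,
    "sms": 1_000_000,
    "email": 5_000_000,
}


def average_per_second(daily_count: int) -> float:
    """Average send rate if traffic were spread evenly over the day."""
    return daily_count / SECONDS_PER_DAY


def estimate_rates(peak_factor: float = 3.0) -> dict:
    """Average and assumed peak rate per channel.

    peak_factor is a hypothetical multiplier; real traffic is rarely uniform.
    """
    rates = {}
    for channel, count in DAILY_VOLUME.items():
        avg = average_per_second(count)
        rates[channel] = {"avg": avg, "peak": avg * peak_factor}
    return rates


if __name__ == "__main__":
    for channel, rate in estimate_rates().items():
        print(f"{channel}: avg {rate['avg']:.0f}/s, assumed peak {rate['peak']:.0f}/s")
```

On average this is about 116 push notifications, 12 SMS messages, and 58 emails per second, or roughly 185 notifications per second in total.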

When

When should notifications be delivered? (As soon as possible, soft real-time with acceptable delays under high workload)

How

How do users manage notifications? (Users can opt out to stop receiving notifications)

Overview

Rate Limiting

To avoid overwhelming users, we can cap the number of notifications a user receives within a given time window. This matters because recipients may turn off notifications completely if we send too often.
Before sending anything, the notification system checks the user's settings, for example whether the user has opted out.
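
A minimal sketch of these two checks, assuming an in-memory, single-process gate with a sliding time window. The class name NotificationGate and its parameters are hypothetical; a real deployment would more likely keep this state in a shared store.

```python
import time
from collections import defaultdict, deque


class NotificationGate:
    """Checks opt-out settings first, then a per-user sliding-window limit."""

    def __init__(self, max_per_window: int, window_seconds: float, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self.opted_out = set()
        self.sent_at = defaultdict(deque)  # user_id -> timestamps of recent sends

    def opt_out(self, user_id: str) -> None:
        self.opted_out.add(user_id)

    def allow(self, user_id: str) -> bool:
        """Return True and record the send if the notification may go out."""
        if user_id in self.opted_out:
            return False
        now = self.clock()
        history = self.sent_at[user_id]
        # Forget sends that have fallen out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_per_window:
            return False
        history.append(now)
        return True
```

For example, NotificationGate(max_per_window=2, window_seconds=60) allows a user's first two notifications within a minute and rejects the third until the window has passed.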

You Cannot Have Exactly-Once Delivery

Common Misconceptions in Distributed Systems:
- Many engineers hold fundamental misunderstandings about how distributed systems behave.
- These misconceptions are common and often stem from a lack of exposure or education.
Exactly-Once Delivery:
- Impossible in Distributed Systems:
  - Web browser and server, server and database, server and message queue are all distributed systems.
  - Exactly-once delivery semantics cannot be achieved in these systems.
- Delivery Semantics:
  - At-Most-Once: Message might be delivered once or not at all.
  - At-Least-Once: Message is delivered one or more times.
  - Exactly-Once: Desired but unachievable in practice.
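
The difference between the first two semantics comes down to where the acknowledgement sits relative to processing. The toy simulation below is a sketch only: the function deliver and its parameters are hypothetical, and a "crash" is modelled as the receiver stopping between its two steps on the first attempt.

```python
def deliver(ack_first: bool, crash_on_first_attempt: bool, max_attempts: int = 3) -> int:
    """Return how many times one message ends up being processed.

    ack_first=True models at-most-once (acknowledge, then process).
    ack_first=False models at-least-once (process, then acknowledge).
    The sender retries until it has seen an acknowledgement.
    """
    processed = 0
    acked = False
    for attempt in range(max_attempts):
        if acked:
            break  # sender saw the ack and stops retrying
        crash = crash_on_first_attempt and attempt == 0
        if ack_first:
            acked = True
            if crash:
                continue  # crashed after the ack, before processing: message lost
            processed += 1
        else:
            processed += 1
            if crash:
                continue  # crashed after processing, before the ack: sender retries
            acked = True
    return processed
```

With a crash, the ack-first receiver processes the message zero times and the ack-last receiver processes it twice. Without a crash, both process it exactly once, which is why the difference only shows up under failure.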
Challenges:
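- Network partitions and interruptions make exactly-once delivery infeasible.
- The Two Generals Problem (no guaranteed agreement over a lossy channel) and the FLP result (no guaranteed consensus in an asynchronous system with failures) show why guaranteed agreement and reliable delivery are impossible in general.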
Trade-offs and Practical Solutions:
- At-Most-Once Delivery: The receiver acknowledges before processing; the message is lost if the receiver crashes before it finishes.
- At-Least-Once Delivery: The receiver acknowledges after processing; the message is redelivered, and so duplicated, if the ack is lost or the receiver crashes after processing but before acknowledging.
- Idempotent Operations: Ensuring that applying the same state change multiple times doesn’t lead to inconsistencies.
- Deduplication: Detecting and discarding duplicate messages (for example, by message ID) to simulate exactly-once delivery.
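
A minimal sketch of deduplication on top of at-least-once delivery, assuming every notification event carries a unique ID. The names DedupingSender, handle, and outbox are hypothetical, and the seen-ID set is kept in memory here; a real system would need to persist it and update it atomically with the side effect.

```python
class DedupingSender:
    """Sends each notification event at most once, even if it is redelivered."""

    def __init__(self):
        self.seen_event_ids = set()
        self.outbox = []  # stands in for the real push/SMS/email call

    def handle(self, event_id: str, text: str) -> bool:
        """Process one delivery; return False if it was a duplicate."""
        if event_id in self.seen_event_ids:
            return False  # already handled, so redelivery has no further effect
        self.outbox.append(text)
        self.seen_event_ids.add(event_id)
        return True


if __name__ == "__main__":
    sender = DedupingSender()
    sender.handle("evt-1", "Your order has shipped")
    sender.handle("evt-1", "Your order has shipped")  # redelivered by the queue
    print(len(sender.outbox))
```

Handling the same event ID twice leaves the outbox unchanged, so the user sees one notification even though the message arrived twice.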
Protocols and Systems:
- Atomic Broadcast Protocols: Ensure messages are delivered reliably and in order, but require high coordination.
- Zab Protocol: Used in ZooKeeper, enforces idempotent operations.
Examples and Real-world Applications:
- Apache Kafka: Historically relied on ZooKeeper for cluster coordination and consistent metadata (newer releases use the built-in KRaft quorum instead).
- RabbitMQ: Producers retransmit unacknowledged messages, leading to potential duplication which consumers must handle.
Design Implications:
- Distributed systems need to be designed with failure and asynchrony in mind.
- Understanding and choosing appropriate delivery semantics is crucial for system reliability.
- The focus should be on ensuring idempotency or handling duplicates to achieve reliable outcomes.
Conclusion:
- Exactly-once delivery is a myth in distributed systems; at-least-once delivery is the practical choice.
- Design systems with the understanding that perfect reliability isn’t possible, but resilience and fault-tolerance can be achieved.

There is no NOW

Simultaneity Issues in Distributed Systems:
- Perception of “Now”:
  - Writing: Significant delay between writing and reading.
  - Speaking: Perceived immediacy, but actual delay due to sound travel.
  - Visual: Perception delay due to light travel.
- Physical Limitations:
  - Information transfer takes time.
  - Electricity in a wire travels at a finite speed.
  - Computing systems must operate within these physical constraints.
Synchronizing Time:
- NTP (Network Time Protocol):
  - Calculates message travel time to synchronize clocks.
- GPS:
  - Satellites with atomic clocks synchronize time and provide precise measurements.
- Challenges:
  - Even with advanced technology, perfect synchronization is unattainable due to failures and delays.
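
The core of NTP's travel-time calculation can be sketched with four timestamps: the client's send time, the server's receive and reply times, and the client's receive time. The function name and variable names below are illustrative; the calculation assumes the network delay is the same in both directions, which is exactly the assumption that fails in practice and limits accuracy.

```python
def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """Estimate clock offset and round-trip delay from one request/response.

    t0: client clock when the request was sent
    t1: server clock when the request was received
    t2: server clock when the response was sent
    t3: client clock when the response was received
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # how far the server clock is ahead
    delay = (t3 - t0) - (t2 - t1)         # time on the wire, excluding server work
    return offset, delay


if __name__ == "__main__":
    # Server clock is 5 s ahead; each direction takes 0.1 s; server works for 0.01 s.
    offset, delay = ntp_offset_and_delay(100.0, 105.1, 105.11, 100.21)
    print(f"offset={offset:.3f}s delay={delay:.3f}s")
```

If the two directions are not equally fast, the offset estimate is wrong by up to half the round-trip delay, so the result is an estimate with an error bound rather than an exact correction.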
Impossibility Results:
- FLP Result:
  - Shows that no deterministic algorithm can guarantee consensus in a fully asynchronous system if even one process may fail.
- CAP Theorem:
  - States that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance; when a network partition occurs, it must give up either consistency or availability.
- Practical Implications:
  - Systems must be designed with the understanding that components will fail.
Fault Tolerance:
- Google’s Spanner:
  - Uses NTP, GPS, and atomic clocks to minimize time uncertainty.
  - Confronts the issue of time synchronization by providing a range of possible times (TrueTime).
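
The idea of reporting time as a range rather than a single instant can be sketched as follows. This is not Spanner's actual API: UncertainClock, TimeInterval, and the fixed epsilon_seconds bound are all hypothetical stand-ins for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float
    latest: float


class UncertainClock:
    """A clock that admits its error: the true time lies somewhere in the interval."""

    def __init__(self, epsilon_seconds: float, clock=time.time):
        self.epsilon = epsilon_seconds  # assumed bound on clock error
        self.clock = clock

    def now(self) -> TimeInterval:
        t = self.clock()
        return TimeInterval(t - self.epsilon, t + self.epsilon)

    def after(self, timestamp: float) -> bool:
        """True only if the timestamp has definitely passed."""
        return self.now().earliest > timestamp

    def before(self, timestamp: float) -> bool:
        """True only if the timestamp has definitely not arrived yet."""
        return self.now().latest < timestamp


def definitely_before(a: TimeInterval, b: TimeInterval) -> bool:
    """Two events can be ordered by time only if their intervals do not overlap."""
    return a.latest < b.earliest
```

When two intervals overlap, the honest answer is that the order is unknown; a system can then either wait out the uncertainty or fall back on another ordering mechanism.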
Coordination and Consensus Protocols:
- Paxos, Zab, Raft:
  - Provide mechanisms to achieve consensus despite failures.
- Logical Time:
  - Techniques like vector clocks abstract over unreliable physical clocks.
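
A minimal vector clock sketch, assuming each node is identified by a string and clocks are plain dictionaries mapping node to counter. The function names are illustrative. The point is that comparison yields "before", "after", "equal", or "concurrent" without consulting any wall clock.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two clocks."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither event could have known about the other
```

For example, if node A produces {"A": 1} and then nodes A and B each update independently to {"A": 2} and {"A": 1, "B": 1}, compare reports the two results as concurrent, and merge gives {"A": 2, "B": 1}.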
Design Trade-offs:
- Coordination vs. Performance:
  - Constant coordination incurs latency and throughput costs.
  - Designing for minimal necessary coordination can improve performance.
- CRDTs (Conflict-Free Replicated Data Types):
  - Avoid the need for strict ordering by ensuring updates are commutative and idempotent.
  - Enable strong eventual consistency.
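
One of the simplest CRDTs is a grow-only counter (G-Counter), sketched below. Each replica increments only its own slot, and merging takes the per-replica maximum, which makes merges commutative and idempotent. The class is an illustrative sketch rather than a production implementation.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by per-slot maximum."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> number of increments seen from it

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to repeat and to reorder."""
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    a, b = GCounter("a"), GCounter("b")
    a.increment(2)
    b.increment(3)
    a.merge(b)
    a.merge(b)  # a duplicate merge changes nothing
    b.merge(a)
    print(a.value(), b.value())
```

Both replicas converge to the same value regardless of the order or number of merges, which is what lets this kind of structure tolerate at-least-once, out-of-order message delivery.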
Practical Examples:
- TCP:
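  - Provides stronger guarantees (such as ordered, retransmitted delivery within a connection) than the fully asynchronous network assumed in theoretical models, which makes it a useful building block for distributed systems.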
- ZooKeeper and Zab:
  - Designed with TCP’s reliability assumptions, providing practical yet formally backed safety guarantees.
Ad-hoc Solutions and Their Pitfalls:
- “Last Write Wins” Policies:
  - Misleading, because "last" has no reliable meaning when each node has its own clock; the policy silently discards writes and leads to unpredictable data loss.
- Ad-hoc Coordination:
  - Custom solutions should be well-documented to avoid future issues and assist in debugging.
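
The data loss caused by a "Last Write Wins" policy can be shown with a small sketch. The LwwRegister class and the timestamps below are hypothetical; the scenario assumes one node's clock runs a few seconds fast.

```python
class LwwRegister:
    """Keeps whichever write carries the highest timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = float("-inf")

    def write(self, value, timestamp: float) -> bool:
        """Apply the write only if its timestamp is newer; return whether it was kept."""
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp
            return True
        return False  # the write is silently dropped


if __name__ == "__main__":
    register = LwwRegister()
    # Node A's clock runs 5 s fast: it writes at real time 10 but stamps the write 15.
    register.write("draft", timestamp=15.0)
    # Node B writes later in real time (12) with a correct clock.
    kept = register.write("final", timestamp=12.0)
    print(register.value, kept)
```

The write that actually happened later is discarded, and nothing signals the loss to either writer.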