Notification System

Questions

What

  • What types of notifications does the system support? (Push notifications, SMS messages, and email)
  • What are the supported devices? (iOS devices, Android devices, laptop/desktop)
  • What triggers notifications? (Client applications, or jobs scheduled on the server side)
  • What is the volume of notifications sent out each day? (10 million push notifications, 1 million SMS messages, 5 million emails)
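The daily volumes above can be turned into average send rates with simple arithmetic. This sketch assumes traffic is spread evenly over 24 hours, which real traffic is not, so `PEAK_FACTOR` is a hypothetical multiplier rather than a measured value. On average the volumes work out to roughly 116 push notifications, 12 SMS messages, and 58 emails per second, or about 185 sends per second combined.

```python
# Back-of-the-envelope load estimate from the daily volumes above.
# Assumption: traffic is uniform over the day; PEAK_FACTOR is hypothetical.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_volume = {
    "push": 10_000_000,
    "sms": 1_000_000,
    "email": 5_000_000,
}

PEAK_FACTOR = 5  # hypothetical peak-to-average ratio, not from the requirements


def per_second(count: int) -> float:
    """Average sends per second for a given daily count."""
    return count / SECONDS_PER_DAY


for channel, count in daily_volume.items():
    avg = per_second(count)
    print(f"{channel:>5}: avg {avg:7.1f}/s, assumed peak ~{avg * PEAK_FACTOR:7.1f}/s")

total = sum(daily_volume.values())
print(f"total: avg {per_second(total):7.1f}/s")
```

Even with a generous peak multiplier the totals are modest, which suggests the harder problems are third-party provider limits and retries rather than raw throughput.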

When

  • When should notifications be delivered? (As soon as possible; soft real-time, so slight delays are acceptable under high load)

How

  • How do users manage notifications? (Users can opt out to stop receiving notifications)
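Honoring opt-outs can be as simple as a preference lookup before anything is sent. This is a hypothetical sketch: it assumes opt-out is tracked per channel, `Channel` and `NotificationSettings` are illustrative names, and preferences live in memory, whereas a real system would load them from a settings store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Channel(Enum):
    PUSH = "push"
    SMS = "sms"
    EMAIL = "email"


@dataclass
class NotificationSettings:
    """Hypothetical per-user preferences: the channels the user has opted out of."""
    user_id: str
    opted_out: Set[Channel] = field(default_factory=set)

    def opt_out(self, channel: Channel) -> None:
        self.opted_out.add(channel)

    def opt_in(self, channel: Channel) -> None:
        self.opted_out.discard(channel)

    def allows(self, channel: Channel) -> bool:
        return channel not in self.opted_out

    def filter_channels(self, requested: List[Channel]) -> List[Channel]:
        """Keep only the requested channels the user still accepts."""
        return [c for c in requested if self.allows(c)]


settings = NotificationSettings(user_id="user-42")
settings.opt_out(Channel.SMS)
print(settings.allows(Channel.SMS))    # False
print(settings.allows(Channel.EMAIL))  # True
print(settings.filter_channels([Channel.PUSH, Channel.SMS, Channel.EMAIL]))
```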

Overview

Rate Limiting

  • To avoid overwhelming users, we can cap the number of notifications a single user receives within a time window. This matters because users who are notified too often may turn notifications off entirely.
  • Before sending, the notification system checks the user's settings (for example, whether the user has opted out).
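A per-user cap like the one described above is often implemented as a sliding-window limiter. This sketch is an assumption rather than the document's design: `SlidingWindowLimiter` is a hypothetical name, it keeps an in-memory log of send timestamps per user, the limit of 3 per hour in the usage is made up, and the clock is injectable so the behavior can be shown without waiting. A production service would keep this state in a shared store so every sender instance sees the same counts.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Sliding-log limiter: at most `limit` notifications per user in any
    `window_seconds` interval. In-memory, single-process sketch only."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        log = self._sent[user_id]
        # Drop timestamps that have slid out of the window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False  # over budget: drop or defer this notification
        log.append(now)
        return True


# Usage with a fake clock (hypothetical limit: 3 notifications per hour).
t = [0.0]
limiter = SlidingWindowLimiter(limit=3, window_seconds=3600, clock=lambda: t[0])
print([limiter.allow("user-42") for _ in range(4)])  # [True, True, True, False]
t[0] = 3601.0
print(limiter.allow("user-42"))  # True: the earlier sends have left the window
```

A sliding log is exact but stores one timestamp per recent notification; fixed-window counters or token buckets use less memory at the cost of allowing some bursts.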

You Cannot Have Exactly-Once Delivery 

  • Common Misconceptions in Distributed Systems:

    • Many engineers hold fundamental misconceptions about how distributed systems behave.
    • These misconceptions are common and often stem from a lack of exposure or education.
  • Exactly-Once Delivery:

    • Impossible in Distributed Systems:
      • A web browser and a server, a server and a database, or a server and a message queue: each pair already forms a distributed system.
      • Exactly-once delivery semantics cannot be guaranteed in these systems.
    • Delivery Semantics:
      • At-Most-Once: Message might be delivered once or not at all.
      • At-Least-Once: Message is delivered one or more times.
      • Exactly-Once: Message is delivered once and only once; desirable, but not achievable when messages or acknowledgments can be lost.
  • Challenges:

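    • Network partitions and interruptions make exactly-once delivery infeasible: a sender cannot distinguish a lost message from a lost acknowledgment.
    • The Two Generals Problem (agreement over an unreliable channel) and the FLP result (consensus in an asynchronous system where a process may crash) formalize these impossibilities.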
  • Trade-offs and Practical Solutions:

    • At-Most-Once Delivery: Acknowledging before processing; risk of data loss if the receiver crashes.
    • At-Least-Once Delivery: Acknowledging after processing; risk of duplication if the ack is lost or the receiver crashes after processing but before acknowledging.
    • Idempotent Operations: Ensuring that applying the same state change multiple times has the same effect as applying it once.
    • Deduplication: Detecting and discarding duplicate messages (for example, by a unique message ID) to achieve exactly-once processing on top of at-least-once delivery.
  • Protocols and Systems:

    • Atomic Broadcast Protocols: Ensure messages are delivered reliably and in order, but require high coordination.
    • Zab Protocol: The atomic broadcast protocol used by ZooKeeper; it relies on state changes being idempotent.
  • Examples and Real-world Applications:

    • Apache Kafka: Historically relied on ZooKeeper for strongly consistent cluster coordination; newer releases replace it with the built-in KRaft protocol.
    • RabbitMQ: Producers retransmit unacknowledged messages, leading to potential duplication which consumers must handle.
  • Design Implications:

    • Distributed systems need to be designed with failure and asynchrony in mind.
    • Understanding and choosing appropriate delivery semantics is crucial for system reliability.
    • The focus should be on ensuring idempotency or handling duplicates to achieve reliable outcomes.
  • Conclusion:

    • Exactly-once delivery is a myth in distributed systems; at-least-once delivery combined with idempotency or deduplication is the practical choice.
    • Design systems with the understanding that perfect reliability isn’t possible, but resilience and fault-tolerance can be achieved.
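The trade-offs above can be simulated in a few lines. This is a simplified sketch: `Message`, `Consumer`, and the two `deliver_*` functions are hypothetical names, the broker is reduced to a retry loop, and a "crash" is simulated by skipping a step. It shows at-most-once losing a message, at-least-once duplicating it, and a deduplicating consumer turning at-least-once delivery into exactly-once processing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    msg_id: str   # unique ID assigned by the producer (hypothetical idempotency key)
    payload: str


@dataclass
class Consumer:
    dedupe: bool
    sent: list = field(default_factory=list)  # side effects, e.g. SMS handed to a provider
    seen: set = field(default_factory=set)    # IDs of messages already processed

    def process(self, msg: Message) -> None:
        if self.dedupe and msg.msg_id in self.seen:
            return  # duplicate delivery: skip the side effect
        self.sent.append(msg.payload)
        self.seen.add(msg.msg_id)


def deliver_at_most_once(msg: Message, consumer: Consumer, crash_before_process: bool) -> None:
    # Ack first: the broker forgets the message before it is processed.
    if crash_before_process:
        return  # consumer died after acking, so the message is lost
    consumer.process(msg)


def deliver_at_least_once(msg: Message, consumer: Consumer, crash_before_ack: int) -> int:
    """Redeliver until acked; the first `crash_before_ack` attempts crash after
    processing but before acking. Returns the number of delivery attempts."""
    attempts = 0
    while True:
        attempts += 1
        consumer.process(msg)
        if attempts > crash_before_ack:
            return attempts  # ack reached the broker
        # Ack lost: the broker cannot tell this from a lost message, so it redelivers.


otp = Message("msg-1", "Your code is 123456")

lost = Consumer(dedupe=False)
deliver_at_most_once(otp, lost, crash_before_process=True)
print(lost.sent)   # []

naive = Consumer(dedupe=False)
deliver_at_least_once(otp, naive, crash_before_ack=1)
print(naive.sent)  # ['Your code is 123456', 'Your code is 123456']

safe = Consumer(dedupe=True)
deliver_at_least_once(otp, safe, crash_before_ack=1)
print(safe.sent)   # ['Your code is 123456']
```

In a real system the deduplication record has to be committed atomically with the side effect (or passed downstream as an idempotency key) and old IDs need to expire; otherwise a crash between the two steps reintroduces duplicates and the `seen` set grows without bound.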

There Is No Now

  • Simultaneity Issues in Distributed Systems:

    • Perception of “Now”:
      • Writing: Significant delay between writing and reading.
      • Speaking: Perceived immediacy, but actual delay due to sound travel.
      • Visual: Perception delay due to light travel.
    • Physical Limitations:
      • Information transfer takes time.
      • Electricity in a wire travels at a finite speed.
      • Computing systems must operate within these physical constraints.
  • Synchronizing Time:

    • NTP (Network Time Protocol):
      • Estimates the network delay of each exchange with a time server and compensates for it when adjusting the local clock.
    • GPS:
      • Satellites with atomic clocks synchronize time and provide precise measurements.
    • Challenges:
      • Even with advanced technology, perfect synchronization is unattainable due to failures and delays.
  • Impossibility Results:

    • FLP Result:
      • Shows that no deterministic algorithm can guarantee consensus in an asynchronous system if even one process may crash.
    • CAP Theorem:
      • States that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance.
    • Practical Implications:
      • Systems must be designed with the understanding that components will fail.
  • Fault Tolerance:

    • Google’s Spanner:
      • Uses NTP, GPS, and atomic clocks to minimize time uncertainty.
      • Makes clock uncertainty explicit: its TrueTime API returns an interval of possible times, bounded by an earliest and a latest value, rather than a single timestamp.
  • Coordination and Consensus Protocols:

    • Paxos, Zab, Raft:
      • Provide mechanisms to achieve consensus despite failures.
    • Logical Time:
      • Techniques like Lamport timestamps and vector clocks order events causally without relying on unreliable physical clocks.
  • Design Trade-offs:

    • Coordination vs. Performance:
      • Constant coordination incurs latency and throughput costs.
      • Designing for minimal necessary coordination can improve performance.
    • CRDTs (Conflict-Free Replicated Data Types):
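      • Avoid the need for strict ordering by making updates (or merges of replica state) commutative, associative, and idempotent.
      • Enable strong eventual consistency: replicas that have received the same updates converge to the same state.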
  • Practical Examples:

    • TCP:
      • Provides reliable, in-order delivery within a connection, a stronger network model than the fully asynchronous one assumed by impossibility results, which gives distributed systems useful properties to build on.
    • ZooKeeper and Zab:
      • Designed around TCP's ordering and reliability guarantees, providing practical safety guarantees that are formally backed.
  • Ad-hoc Solutions and Their Pitfalls:

    • “Last Write Wins” Policies:
      • Misleading, because without a global clock "last" is ill-defined; clock skew between nodes causes writes to be discarded silently and unpredictably.
    • Ad-hoc Coordination:
      • Custom solutions should be well-documented to avoid future issues and assist in debugging.
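Logical time, and why "last write wins" is risky, can be illustrated with vector clocks. This is a minimal sketch under simplifying assumptions: clocks are plain dictionaries mapping node names to counters, and `tick`, `merge`, and `compare` are illustrative names rather than any library's API. Two replicas that update the same value without seeing each other's write end up with clocks that neither dominates, so the system can detect the conflict instead of letting whichever wall-clock timestamp happens to be larger silently discard the other write.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node name -> number of events seen from that node


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Record a local event on `node`; returns a new clock."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum: a clock that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'equal', 'before' (a happened before b), 'after', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = tick({}, "A")        # {'A': 1}: initial write on replica A
write_a = tick(base, "A")   # replica A updates again
write_b = tick(base, "B")   # replica B updates without having seen write_a

print(compare(base, write_a))     # before
print(compare(write_a, write_b))  # concurrent: a genuine conflict, not a "last" write
merged = merge(write_a, write_b)  # clock of a replica that has seen both writes
print(compare(write_b, merged))   # before
```

What to do with a detected conflict (keep both versions, merge them, or use a CRDT) remains an application decision; the clocks only make the conflict visible.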