Reliable Data Delivery

Reliability is a system-wide concern that requires coordination between Kafka administrators, Linux administrators, network/storage administrators, and application developers.

Core Reliability Guarantees

Kafka provides these key guarantees:

  • Order is preserved within a partition: messages are read in the order they were written
  • A message is "committed" once it has been written to all in-sync replicas
  • Committed messages are not lost as long as at least one replica remains alive
  • Consumers only ever read committed messages

Broker Configuration for Reliability

Three key broker configurations affect reliability:

  1. Replication Factor

    • Higher replication = better availability but more storage/network cost
    • Consider:
      • Availability needs
      • Durability requirements
      • Throughput impact
      • End-to-end latency
      • Hardware costs
  2. Unclean Leader Election

    • Controls whether out-of-sync replicas can become leader
    • Default: false (safer)
    • Setting it to true improves availability but risks losing messages that were committed only on the old leader
  3. Minimum In-Sync Replicas

    • Sets the minimum number of in-sync replicas a partition needs to accept writes (with acks=all)
    • Guards against writing to a single replica that could later be lost
    • Example: min.insync.replicas=2 with 3 replicas means writes succeed only while at least 2 replicas are in sync; otherwise producers receive a NotEnoughReplicas error
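
The three settings above can be combined in the broker configuration (or overridden per topic). A minimal sketch with illustrative values:

```properties
# server.properties - illustrative broker-level defaults
default.replication.factor=3
unclean.leader.election.enable=false
# With 3 replicas, require 2 in sync before accepting acks=all writes
min.insync.replicas=2
```

Both min.insync.replicas and unclean.leader.election.enable can also be set per topic via topic-level configuration overrides.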

Producer Reliability

Key producer configurations:

  1. acks Setting

    • acks=0: Fire and forget; the producer does not wait for any broker response
    • acks=1: The leader acknowledges after writing to its own log (data can still be lost if the leader fails before replication)
    • acks=all: The leader waits for all in-sync replicas before acknowledging
  2. Retries

    • Use delivery.timeout.ms to bound the total time spent retrying (modern clients default retries to effectively unlimited within that window)
    • enable.idempotence=true prevents duplicates caused by retried sends
    • Handle non-retriable errors (e.g. message too large) in application code
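
The producer settings above can be sketched as a small configuration builder (plain Java Properties; the bootstrap address is a hypothetical placeholder, and real use would pass this to a KafkaProducer along with serializer settings):

```java
import java.util.Properties;

// Sketch: reliability-oriented producer settings collected in one place.
// Values are illustrative; tune delivery.timeout.ms to your latency budget.
public class ReliableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical cluster address
        props.put("acks", "all");                   // wait for all in-sync replicas
        props.put("enable.idempotence", "true");    // dedupe retried sends
        props.put("delivery.timeout.ms", "120000"); // total retry window: 2 minutes
        return props;
    }
}
```

Even with these settings, non-retriable errors still surface to the application and must be handled there.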

Consumer Reliability

Key consumer considerations:

  1. Offset Management

    • Commit offsets only after processing
    • Balance commit frequency vs duplicate processing
    • Handle rebalances properly
  2. Configuration

    • group.id assigns the consumer to a group that shares partitions
    • auto.offset.reset controls where to start reading when no committed offset exists (earliest or latest)
    • enable.auto.commit toggles automatic offset commits
    • auto.commit.interval.ms sets how often automatic commits occur
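
As a sketch, the consumer settings above might be collected like this (plain Java Properties; the group id and bootstrap address are hypothetical):

```java
import java.util.Properties;

// Sketch: reliability-oriented consumer settings.
// Auto-commit is disabled so offsets are committed only after processing.
public class ReliableConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical cluster address
        props.put("group.id", "example-group");           // hypothetical group name
        props.put("auto.offset.reset", "earliest");       // start at beginning if no offset
        props.put("enable.auto.commit", "false");         // commit manually, after processing
        return props;
    }
}
```

With auto-commit off, the application commits (e.g. via commitSync) after each processed batch, trading commit frequency against how many duplicates are replayed after a rebalance.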

Validation Approaches

  1. Configuration Testing

    • Use VerifiableProducer and VerifiableConsumer
    • Test leader elections and restarts
    • Validate behavior matches expectations
  2. Application Testing

    • Test under various failure conditions
    • Validate offset management
    • Test rebalance handling
  3. Production Monitoring

    • Monitor producer error/retry rates
    • Track consumer lag
    • Validate end-to-end data flow
    • Watch broker error metrics

Flashcards

What are Kafka's core reliability guarantees?:: 1) Message order in partition 2) Committed when written to all in-sync replicas 3) No loss if one replica alive 4) Consumers only read committed messages

What are the three key broker configurations affecting reliability?:: 1) Replication Factor 2) Unclean Leader Election 3) Minimum In-Sync Replicas

What does setting acks=all mean for producers?:: The leader will wait for all in-sync replicas to acknowledge before responding. This provides the strongest durability guarantee but highest latency.

What is the main way consumers can lose messages?:: By committing offsets for messages they've read but haven't completely processed yet

What does unclean.leader.election.enable=false prevent?:: It prevents out-of-sync replicas from becoming partition leaders. This avoids the data loss an unclean election could cause, at the cost of reduced availability when no in-sync replica remains.