Kafka Monitoring for Reliability

Key Metrics to Monitor

  1. Producer Metrics

    • Error rate per record
    • Retry rate per record
    • Average request latency
  2. Consumer Metrics

    • Consumer lag
    • Commit success rate
    • Rebalance events
  3. Broker Metrics

    • Under-replicated partitions
    • Request handler utilization
    • Request timing (time taken to serve produce and fetch requests)
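
The three metric groups above can be turned into numbers with a few small helpers. What follows is a minimal sketch, not a Kafka client API: the `ProducerTotals` shape, the function names, and the 0.2 idle threshold are all hypothetical, and fetching the raw values (from JMX, a client's metrics, or an exporter) is left out. It assumes the producer exposes cumulative counters, that lag is the log end offset minus the committed offset, and that the broker reports request handler idleness as a fraction between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProducerTotals:
    """Cumulative producer counters at one sample time (hypothetical shape)."""
    records_sent: int
    record_errors: int
    record_retries: int


def per_record_rates(prev: ProducerTotals, curr: ProducerTotals) -> dict:
    """Errors and retries as a fraction of records sent between two samples."""
    sent = curr.records_sent - prev.records_sent
    if sent <= 0:
        # Nothing was sent in the interval, so a per-record ratio is undefined.
        return {"error_rate": None, "retry_rate": None}
    return {
        "error_rate": (curr.record_errors - prev.record_errors) / sent,
        "retry_rate": (curr.record_retries - prev.record_retries) / sent,
    }


def partition_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: log end offset minus the group's committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        # A partition with no committed offset has no defined lag yet.
        lag[partition] = None if committed is None else max(end_offset - committed, 0)
    return lag


def broker_warnings(under_replicated: int, handler_idle_ratio: float,
                    min_idle_ratio: float = 0.2) -> list:
    """Flag any under-replicated partition and busy request handlers."""
    warnings = []
    if under_replicated > 0:
        warnings.append(f"{under_replicated} under-replicated partition(s)")
    if handler_idle_ratio < min_idle_ratio:
        # Utilization is the complement of the idle fraction.
        utilization = 1.0 - handler_idle_ratio
        warnings.append(f"request handlers {utilization:.0%} utilized")
    return warnings
```

For example, `partition_lag({0: 100, 1: 50}, {0: 90, 1: 50})` gives `{0: 10, 1: 0}`. Reporting the error and retry figures as a fraction of records sent, rather than as raw counts, keeps them comparable across producers with very different throughput.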

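Monitoring Tools

  1. Burrow for consumer lag: provides more sophisticated lag monitoring than simple threshold alerts
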
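Burrow, the lag monitoring tool named in the flashcards, judges a consumer by how its offsets and lag move over a window of samples rather than by a fixed threshold. The sketch below is a simplified illustration of that idea, not Burrow's implementation: the `evaluate_window` function, its `(committed_offset, lag)` sample format, and the three status strings are assumptions made for this example, and it covers only part of what such a tool checks.

```python
def evaluate_window(samples: list) -> str:
    """Classify a consumer from (committed_offset, lag) samples, oldest first."""
    if len(samples) < 2:
        return "OK"  # Not enough history to see a trend.

    offsets = [offset for offset, _ in samples]
    lags = [lag for _, lag in samples]

    # The consumer caught up at some point in the window.
    if any(lag == 0 for lag in lags):
        return "OK"

    # No commits at all while lag holds steady or grows: the consumer is stuck.
    if len(set(offsets)) == 1 and lags[-1] >= lags[0]:
        return "STALLED"

    # Commits are advancing, but lag grows between every pair of samples.
    lag_always_rising = all(later > earlier for earlier, later in zip(lags, lags[1:]))
    if offsets[-1] > offsets[0] and lag_always_rising:
        return "WARNING"

    return "OK"
```

A consumer with a large but shrinking lag is reported as `OK` here, whereas a fixed threshold alert would fire on it. That difference is the main reason to prefer trend-based evaluation.
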
Best Practices

  1. Monitor both client and broker metrics
  2. Track message timestamps for end-to-end latency
  3. Set up alerts for critical thresholds
  4. Track failed request rates broken down by error type
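
Practices 2 and 4 lend themselves to a short illustration. This is a sketch under stated assumptions: the function names are hypothetical, timestamps are assumed to be epoch milliseconds (the unit Kafka record timestamps use), and the error type strings are placeholders for whatever names your clients or brokers report.

```python
import math
from collections import Counter


def end_to_end_latency_ms(record_timestamp_ms: int, received_at_ms: int) -> int:
    """Latency from the record's timestamp to the moment the consumer handled it."""
    # Clamp at zero: clock skew between hosts can make the difference negative.
    return max(received_at_ms - record_timestamp_ms, 0)


def percentile(values: list, pct: float):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def failures_by_type(error_names: list) -> list:
    """Failed requests grouped by error type, most frequent first."""
    return Counter(error_names).most_common()
```

Alerting on a high percentile of the latency values, rather than the average, surfaces the slow tail that an average hides. Because the measurement compares clocks on two different hosts, it is only as accurate as their time synchronization.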

Flashcards

What are the most important producer metrics for reliability?:: Error rate and retry rate per record

What is the most important consumer metric to monitor?:: Consumer lag - indicates how far the consumer is behind the latest message written to the partition

What tool is recommended for monitoring consumer lag?:: Burrow - provides more sophisticated lag monitoring than simple threshold alerts