Kafka Monitoring for Reliability
Key Metrics to Monitor
Producer Metrics
- Error rate per record
- Retry rate per record
- Average request latency
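As a rough illustration of how these three producer metrics might be turned into warnings, the sketch below evaluates a snapshot of values. The field comments name the Kafka producer metrics they are assumed to come from (`record-send-rate`, `record-error-rate`, `record-retry-rate`, `request-latency-avg`); the `ProducerSnapshot` class, the `producer_warnings` helper, and all thresholds are hypothetical and should be tuned per workload.

```python
from dataclasses import dataclass


@dataclass
class ProducerSnapshot:
    """One reading of producer metrics (hypothetical container)."""
    record_send_rate: float        # records/sec, assumed from "record-send-rate"
    record_error_rate: float       # records/sec, assumed from "record-error-rate"
    record_retry_rate: float       # records/sec, assumed from "record-retry-rate"
    request_latency_avg_ms: float  # assumed from "request-latency-avg"


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: no traffic and no events gives 0, events without traffic gives inf."""
    if denominator <= 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def producer_warnings(
    snap: ProducerSnapshot,
    max_error_ratio: float = 0.01,   # illustrative threshold, not a recommendation
    max_retry_ratio: float = 0.05,   # illustrative threshold
    max_latency_ms: float = 500.0,   # illustrative threshold
) -> list[str]:
    """Return human-readable warnings for metrics that exceed their thresholds."""
    warnings = []
    error_ratio = _ratio(snap.record_error_rate, snap.record_send_rate)
    retry_ratio = _ratio(snap.record_retry_rate, snap.record_send_rate)
    if error_ratio > max_error_ratio:
        warnings.append(f"error ratio {error_ratio:.3f} exceeds {max_error_ratio}")
    if retry_ratio > max_retry_ratio:
        warnings.append(f"retry ratio {retry_ratio:.3f} exceeds {max_retry_ratio}")
    if snap.request_latency_avg_ms > max_latency_ms:
        warnings.append(
            f"average request latency {snap.request_latency_avg_ms:.0f} ms "
            f"exceeds {max_latency_ms:.0f} ms"
        )
    return warnings


if __name__ == "__main__":
    snapshot = ProducerSnapshot(1000.0, 20.0, 10.0, 120.0)
    for line in producer_warnings(snapshot):
        print(line)
```

Expressing errors and retries as a share of the send rate is one possible design choice; alerting on the raw rates would work as well.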
Consumer Metrics
- Consumer lag
- Commit success rate
- Rebalance events
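Consumer lag is usually computed per partition as the log end offset minus the last committed offset. A minimal sketch of that arithmetic follows, assuming the offsets have already been fetched into plain dictionaries keyed by partition number; the function names are hypothetical, and how the offsets are obtained is left out.

```python
from typing import Optional


def partition_lag(
    end_offsets: dict[int, int],
    committed_offsets: dict[int, int],
) -> dict[int, Optional[int]]:
    """Lag per partition: log end offset minus committed offset.

    A partition with no committed offset yet maps to None, because its lag
    cannot be derived from these two inputs alone.
    """
    lag: dict[int, Optional[int]] = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            lag[partition] = None
        else:
            # Clamp at zero: the two offsets may be read at slightly different times.
            lag[partition] = max(end - committed, 0)
    return lag


def total_lag(lag_by_partition: dict[int, Optional[int]]) -> int:
    """Sum of the known per-partition lags."""
    return sum(v for v in lag_by_partition.values() if v is not None)


if __name__ == "__main__":
    lag = partition_lag({0: 100, 1: 50, 2: 7}, {0: 90, 1: 50})
    print(lag, total_lag(lag))
```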
Broker Metrics
- Under-replicated partitions
- Request handler utilization
- Request time per request type (for example, produce and fetch requests)
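The first two broker metrics can be checked with very little logic. The sketch below assumes the values were already read from the broker's `UnderReplicatedPartitions` and `RequestHandlerAvgIdlePercent` JMX metrics, and that the idle value is a fraction between 0 and 1. The `broker_alerts` function and the 0.2 idle threshold are illustrative assumptions.

```python
def broker_alerts(
    under_replicated_partitions: int,
    request_handler_avg_idle: float,
    min_idle: float = 0.2,  # illustrative threshold, tune per cluster
) -> list[str]:
    """Return alert messages for a single broker reading.

    request_handler_avg_idle is assumed to be a fraction in [0, 1];
    utilization is derived as 1 minus that value.
    """
    alerts = []
    if under_replicated_partitions > 0:
        alerts.append(
            f"{under_replicated_partitions} under-replicated partition(s)"
        )
    utilization = 1.0 - request_handler_avg_idle
    if request_handler_avg_idle < min_idle:
        alerts.append(
            f"request handler utilization at {utilization:.0%} "
            f"(idle below {min_idle:.0%})"
        )
    return alerts


if __name__ == "__main__":
    for message in broker_alerts(3, 0.1):
        print(message)
```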
Monitoring Tools
- JMX metrics from clients
- Burrow for consumer lag monitoring
- Confluent Control Center for end-to-end monitoring
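Burrow evaluates lag over a sliding window of offset commits instead of comparing a single reading to a fixed threshold. The sketch below is a simplified illustration of that idea, loosely following Burrow's documented evaluation rules; it is not Burrow's implementation, and the `evaluate_window` function and the status strings are hypothetical labels chosen for this example.

```python
def evaluate_window(samples: list[tuple[int, int]]) -> str:
    """Classify one partition from (committed_offset, lag) samples, oldest first.

    Simplified rules, assumed for illustration:
      - fewer than two samples: not enough data ("UNKNOWN")
      - any sample with zero lag: the consumer caught up at some point ("OK")
      - offsets never moved and lag never shrank: "STALLED"
      - offsets moved but lag never shrank between samples: "WARNING"
      - otherwise the consumer is catching up: "OK"
    """
    if len(samples) < 2:
        return "UNKNOWN"
    offsets = [offset for offset, _ in samples]
    lags = [lag for _, lag in samples]
    if any(lag == 0 for lag in lags):
        return "OK"
    lag_never_shrinks = all(later >= earlier for earlier, later in zip(lags, lags[1:]))
    offsets_unchanged = len(set(offsets)) == 1
    if offsets_unchanged and lag_never_shrinks:
        return "STALLED"
    if lag_never_shrinks:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(evaluate_window([(10, 5), (20, 6), (30, 9)]))
```

The point of the window is that a consumer with large but shrinking lag is treated as healthy, while a small lag that never shrinks is flagged.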
Best Practices
- Monitor both client and broker metrics
- Track message timestamps for end-to-end latency
- Set up alerts for critical thresholds
- Monitor failed request rates with error types
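End-to-end latency from message timestamps can be sketched as the difference between the time a record is processed and the timestamp it carries, with an alert on a high percentile. The code below assumes timestamps in epoch milliseconds and reasonably synchronized clocks; the function names, the nearest-rank percentile, and the thresholds are illustrative assumptions.

```python
import math
import time
from typing import Optional


def end_to_end_latency_ms(record_timestamp_ms: int, now_ms: Optional[int] = None) -> int:
    """Latency between the record's timestamp and 'now', in milliseconds.

    Clamped at zero because clock skew between hosts can make it negative.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(now_ms - record_timestamp_ms, 0)


def percentile(values: list[int], q: float) -> int:
    """Nearest-rank percentile for 0 < q <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def latency_alert(latencies_ms: list[int], threshold_ms: int, q: float = 0.99) -> bool:
    """True when the q-th percentile latency exceeds the threshold."""
    return percentile(latencies_ms, q) > threshold_ms


if __name__ == "__main__":
    now = 1_700_000_001_000
    stamps = [now - 10, now - 20, now - 30, now - 40, now - 1000]
    latencies = [end_to_end_latency_ms(ts, now_ms=now) for ts in stamps]
    print(latencies, latency_alert(latencies, threshold_ms=500))
```

Whether the record timestamp reflects creation time or broker append time depends on topic configuration, so the measured latency should be interpreted accordingly.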
Flashcards
What are the most important producer metrics for reliability?:: Error rate and retry rate per record
What is the most important consumer metric to monitor?:: Consumer lag - indicates how far the consumer is behind the latest messages written to the partitions it reads
What tool is recommended for monitoring consumer lag?:: Burrow - provides more sophisticated lag monitoring than simple threshold alerts