Broker
- An Apache Kafka Cluster is composed of multiple Brokers (servers)
- A Broker receives and sends data
- Each broker is identified with Its ID (integer)
- Each broker contains certain Topic Partitions
- After connecting to any Broker (called a bootstrap Broker), you will be connected to the entire Cluster
- A good starting point is 3 Brokers, but big clusters can be 100+
- Brokers enable Horizontal Scaling
- Broker Discovery is accomplished by calling a Bootstrap Server

References
Flashcards
A Broker:: Receives and sends data
A Broker is identified with:: It's ID (integer)
A Broker contains:: certain Topic Partitions
What Brokers are bootstrap brokers?:: All of them
A Bootstrap broker is responsible for:: connecting a consumer or producer to the rest of the Cluster
A good starting point for Broker count is:: 3
Brokers enable Horizontal Scaling
Bootstrap servers enable:: Broker Discovery
You have a Kafka cluster and all the topics have a replication factor of 3. One intern at your company stopped a broker, and accidentally deleted all the data of that broker on the disk. What will happen if the broker is restarted?:: The broker will start, and won't be online until all the data it needs to have is replicated from other leaders
If I produce to a topic that does not exist, and the broker setting auto.create.topic.enable=true, what will happen?:: Kafka will automatically create the topic with the broker settings num.partitions and default.replication.factor
How do Kafka brokers ensure great performance between the producers and consumers? (select two)
- It leverages zero-copy optimisations to send data straight from the page-cache
- It compresses the messages as it writes to the disk
- It transforms the messages into a binary format
- It buffers the messages on disk, and sends messages from the disk reads
- It does not transform the messages
? - It leverages zero-copy optimisations to send data straight from the page-cache
- It does not transform the messages
A bank uses a Kafka cluster for credit card payments. What should be the value of the property unclean.leader.election.enable?:: FALSE, Setting unclean.leader.election.enable to true means we allow out-of-sync replicas to become leaders, we will lose messages when this occurs, effectively losing credit card payments and making our customers very angry.
There are 3 producers writing to a topic with 5 partitions. There are 5 consumers consuming from the topic. How many Controllers will be present in the cluster?:: There is only one controller in a cluster at all times.
Which actions are undertaken by the controller in case it detects the failure of a broker (which was leader for some partitions)? (Pick three)
- Controller waits for a new leader to be nominated by ZooKeeper
- Controller selects a new leader and updates the ISR list
- Controller persists the new leader and ISR list to ZooKeeper
- Controller sends the new leader and ISR list changes to all brokers
? - Controller selects a new leader and updates the ISR list
- Controller persists the new leader and ISR list to ZooKeeper
- Controller sends the new leader and ISR list changes to all brokers
T/F Once sent to a topic, a message can be modified:: false, Kafka logs are append-only and the data is immutable
Kafka is configured with following parameters - log.retention.hours = 168 log.retention.minutes = 168 log.retention.ms = 168 How long will the messages be retained for?:: 168 ms If more than one similar config is specified, the smaller unit size will take precedence.
How often is log compaction evaluated?:: Log compaction is evaluated every time a segment is closed. It will be triggered if enough data is "dirty" (see dirty ratio config)
A producer application was sending messages to a partition with a replication factor of 2 by connecting to Broker 1 that was hosting partition leader. If the Broker 1 goes down, what will happen?:: The producer will automatically produce to the broker that has been elected leader. Once the client connects to any broker, it is connected to the entire cluster and in case of leadership changes, the clients automatically do a Metadata Request to an available broker to find out who is the new leader for the topic. Hence the producer will automatically keep on producing to the correct Kafka Broker
The Controller is a broker that is... (select two)
- elected by broker majority
- is responsible for partition leader election
- is responsible for consumer group rebalances
- elected by Zookeeper ensemble
? - is responsible for partition leader election
- elected by Zookeeper ensemble
Controller is a broker that in addition to usual broker functions is responsible for partition leader election. The election of that broker happens thanks to Zookeeper and at any time only one broker can be a controller
Partition leader election is done by:: The Kafka broker that is the Controller
Your manager would like to have topic availability over consistency. Which setting do you need to change in order to enable that?:: unclean.leader.election.enable=true allows non ISR replicas to become leader, ensuring availability but losing consistency as data loss will occur
How much should be the heap size of a broker in a production setup on a machine with 256 GB of RAM, in PLAINTEXT mode?:: 4GB In Kafka, a small heap size is needed, while the rest of the RAM goes automatically to the page cache (managed by the OS). The heap size goes slightly up if you need to enable SSL
What are the requirements for a Kafka broker to connect to a Zookeeper ensemble? (select two)
- Unique value for each broker's zookeeper.connect parameter
- Unique values for each broker's broker.id parameter
- All the brokers must share the same broker.id
- All the brokers must share the same zookeeper.connect parameter
? - Unique values for each broker's broker.id parameter
- All the brokers must share the same zookeeper.connect parameter
A topic "sales" is being produced to in the Americas region. You are mirroring this topic using Mirror Maker to the European region. From there, you are only reading the topic for analytics purposes. What kind of mirroring is this?:: Active-Passive
In Kafka, every broker... (select three)
- is a bootstrap server
- contains only a subset of the topics and the partitions
- knows all the metadata for all topics and partitions
- contains all the topics and all the partitions
- is a controller
- knows the metadata for the topics and partitions it has on its disk
? - is a bootstrap server
- contains only a subset of the topics and the partitions
- knows all the metadata for all topics and partitions
Kafka topics are divided into partitions and spread across brokers. Each brokers knows about all the metadata and each broker is a bootstrap broker, but only one of them is elected controller
What is the disadvantage of request/response communication?:: Coupling. Point-to-point (request-response) style will couple client to the server.
There are 5 brokers in a cluster, a topic with 10 partitions and replication factor of 3, and a quota of producer_bytes_rate of 1 MB/sec has been specified for the client. What is the maximum throughput allowed for the client?:: 5 MB/s Each producer is allowed to produce @ 1MB/s to a broker. Max throughput 5 * 1MB, because we have 5 brokers.
What happens when broker.rack configuration is provided in broker configuration in Kafka cluster?:: Replicas for a partition are spread across different racks. Partitions for newly created topics are assigned in a rack alternating manner, this is the only change broker.rack does
Suppose you have 6 brokers and you decide to create a topic with 10 partitions and a replication factor of 3. The brokers 0 and 1 are on rack A, the brokers 2 and 3 are on rack B, and the brokers 4 and 5 are on rack C. If the leader for partition 0 is on broker 4, and the first replica is on broker 2, which broker can host the last replica? (select two)
- 1
- 6
- 2
- 3
- 0
- 5
? - 1
- 0
When you create a new topic, partitions replicas are spreads across racks to maintain availability. Hence, the Rack A, which currently does not hold the topic partition, will be selected for the last replica
Kafkas has three quota types:: produce, consume and request
Produce quotas limit:: the rate at which a client can send data measured in bytes per second
Consume quotas limit:: the rate at which a client can receive data measured in bytes per second
Request quotas limit:: the percentage of time the broker spends processing client requestys
Quotas can be applied in what types of ways
?
- Across all clients
- Specific client-ids
- Specific users (only relevant on clusters where clients authenticate)
quota.producer.default:: specifies the default producer quota across all producers to the specific broker
When setting a quota for a user it's best to set that:: in Kafka's dynamic configuration since Kafka's configuration files are static and require a restart to pickup any changes
What might happen if a producer ignores the quota set for it by the broker and continues to send messages at a higher rate than the quota allows?:: Eventually the producer will exceed the buffer space and messages will begin to have TimeoutExceptions thrown
The recommendation for vm.swappiness is:: 1 It is a percentage of how likely the VM subsystem is to use swap space. We don't set to 0 because it means that the OS will not swap under any circumstance
The recommendation for vm.dirty.background_ratio value is:: lower than the default of 10, such as 5. This is because we often want to write log segments to a fast response time disk such as an SSD
The recommendation for vm.dirty_ratio value is:: higher than the default of 20, between 60 and 80, this is the total number of dirty pages allowed before the kernel forces synchronous operations to flush them to disk
The current upper end of a well-configured environment for number of partition replicas per broker and replicas per cluster are:: 14,000 partition replicas per broker and 1 million replicas per cluster
You can determine broker disk space availability by:: adding together all broker's disk space
Disk space requirements are dictated by:: Replication factor * expected data volume <= broker disk space * broker count
Your main constraints on broker hardware are:: disk capacity, disk speed, network capacity
The primary usage of CPU on a kafka broker is:: Handling client connections and decompressing/compressing messages for checksum validation
broker.id:: Every kafka broker must have an integer identifier, which is set using the broker.id configuration.
zookeeper.connect:: The location of the ZooKeeper used for storing the broker metadata
log.dirs:: Persists all messages to disk, and these log segments are stored in the directory specified in the log.dir configuration.
num.recovery.threads.per.data.dir
?
Configurable pool of threads for handling log segments. Currently this thread pool is used:
- When starting normally, to open each partition's log segments
- When starting after a failure, to check and truncate each partition's log segments
- When shutting down, to cleanly close log segments
auto.create.topics.enable:: Whether or not to automatically create a topic
auto.leader.rebalance.enable:: To maintain balance in a Kafka cluster, this config can be specified to ensure leadership is balanced as much as possible. Enables a background thread that checks the distribution of partitions are regular intervals via leader.imbalance.check.interval.seconds
delete.topic.enable:: Enables/disables topic deletion
num.partitions:: Determines the default number of partitions a new topic is created with. Defaults to 1
default.replication.factor:: Determins the default replication factor for new topics if none specified
What is the recommended setting for replication factor?:: At least 1 above the min.insync.replicas setting, but you will get more roboustness if you have a replication factor 2 above the min.insync.replicas setting (called RF++) for easier maintenance and outage management
log.retention.ms:: Default is one week and determines the length of time messages are retained by time. Note this leverages Last Modified Time of the file in question. Timer starts when the log segment is closed and no longer the active segment
log.retention.bytes:: Default of 1 GB and determines per partition how total number of bytes worth of messages to retain. (if 5 partitions, that's 5 GB on disk retained)
log.segment.bytes:: Determines the max size of a log segment in bytes (1 GB by default)
log.roll.ms:: Determines when a log segment is closed by time, default is 7 days (Book was wrong? See here)
min.insync.replicas:: In conjunction with acks=all, determines if a producer will be able to write a new message to the partition leader based on the minimum number of insync replicas set here, 1 is the default
message.max.bytes:: Broker limit on the maximum size of a message that can be produced, default is 1 MB