Kafka Streams
Solves transformations within Apache Kafka
Stateless means processing of each message depends only on the message, so converting from JSON to Avro or filtering a stream are both stateless operations
Although any Kafka Streams application is stateless as the state is stored in Kafka, it can take a while and lots of resources to recover the state from Kafka. In order to speed up recovery, it is advised to store the Kafka Streams state on a persistent volume, so that only the missing part of the state needs to be recovered.
References
Flashcards
Which of the following event processing application is stateless? (select two)
- Read events from a stream and modifies them from JSON to Avro
- Publish the top 10 stocks each day
- Find the minimum and maximum stock prices for each day of trading
- Read log messages from a stream and writes ERROR events into a high-priority stream and the rest of the events into a low-priority stream
? - Read log messages from a stream and writes ERROR events into a high-priority stream and the rest of the events into a low-priority stream
- Read events from a stream and modifies them from JSON to Avro
Stateless means processing of each message depends only on the message, so converting from JSON to Avro or filtering a stream are both stateless operations
The exactly once guarantee in the Kafka Streams is for which flow of data?:: Kafka => Kafka
You are running a Kafka Streams application in a Docker container managed by Kubernetes, and upon application restart, it takes a long time for the docker container to replicate the state and get back to processing the data. How can you improve dramatically the application restart?:: Mount a persistent volume for your RocksDB
Which of the following Kafka Streams operators are stateful? (select all that apply)
- joining
- reduce
- aggregate
- peek
- count
- flatmap
? - joining
- reduce
- aggregate
- count
You want to perform table lookups against a KTable everytime a new record is received from the KStream. What is the output of KStream-KTable join?:: Kstream,Here KStream is being processed to create another KStream.
Which of the following technologies can be used to perform event stream processing? (Pick two)
- Confluent Schema Registry
- Apache Kafka Streams
- Confluent KSQL
- Confluent Replicator
? - Apache Kafka Streams
- Confluent KSQL
We have a store selling shoes. What dataset is a great candidate to be modeled as a KTable in Kafka Streams?
- Inventory contents right now
- Items returned
- The transaction stream
- Money made until now
? - Inventory contents right now
- Money made until now
Aggregations of stream are stored in table, whereas Streams must be modeled as a KStream to avoid data explosion
An ecommerce website maintains two topics - a high volume "purchase" topic with 5 partitions and low volume "customer" topic with 3 partitions. You would like to do a stream-table join of these topics. How should you proceed?:: Model customer as a GlobalKTable. In case of KStream-KStream join, both need to be co-partitioned. This restriction is not applicable in case of join with GlobalKTable, which is the most efficient here.
In Kafka Streams, by what value are internal topics prefixed by?:: application.id In Kafka Streams, the application.id is also the underlying group.id for your consumers, and the prefix for all internal topics (repartition and state)
Select the Kafka Streams joins that are always windowed joins.:: KStream-KStream join
To transform data from a Kafka topic to another one, I should use:: Kafka Streams. Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("word-count-input");
KTable<String, Long> wordCounts = textLines
.mapValues(textLine -> textLine.toLowerCase())
.flatMapValues(textLine -> Arrays.asList(textLine.split("\W+")))
.selectKey((key, word) -> word)
.groupByKey()
.count(Materialized.as("Counts"));
wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));
builder.build();
What is an adequate topic configuration for the topic word-count-output?
?
cleanup.policy=compact Result is aggregated into a table with key as the unique word and value its frequency. We have to enable log compaction for this topic to align the topic's cleanup policy with KTable semantics.
Your streams application is reading from an input topic that has 5 partitions. You run 5 instances of your application, each with num.streams.threads set to 5. How many stream tasks will be created and how many will be active?:: One partition is assigned a thread, so only 5 will be active, and 25 threads (i.e. tasks) will be created
An ecommerce wesbite sells some custom made goods. What's the natural way of modeling this data in Kafka streams?:: Purchase as stream, Product as table, Customer as table. Mostly-static data is modeled as a table whereas business transactions should be modeled as a stream.
Which of these joins does not require input topics to be sharing the same number of partitions?
- KStream-GlobalKTable
- KStream-KTable join
- KStream-KStream join
- KTable-KTable join
?
KStream-GlobalKTable have their datasets replicated on each Kafka Streams instance and therefore no repartitioning is required
We want the average of all events in every five-minute window updated every minute. What kind of Kafka Streams window will be required on the stream?:: Hopping window A hopping window is defined by two properties: the window's size and its advance interval (aka "hop"), e.g., a hopping window with a size 5 minutes and an advance interval of 1 minute.
Which of the following Kafka Streams operators are stateless? (select all that apply)
- groupBy
- map
- filter
- flatmap
- branch
- aggregate
? - groupBy
- map
- filter
- flatmap
- branch