Kafka Streams

Solves transformations within Apache Kafka

Stateless means processing of each message depends only on the message, so converting from JSON to Avro or filtering a stream are both stateless operations

Although any Kafka Streams application is stateless as the state is stored in Kafka, it can take a while and lots of resources to recover the state from Kafka. In order to speed up recovery, it is advised to store the Kafka Streams state on a persistent volume, so that only the missing part of the state needs to be recovered.

References

Stateful transformations

Flashcards

Which of the following event processing application is stateless? (select two)

Read events from a stream and modifies them from JSON to Avro
Publish the top 10 stocks each day
Find the minimum and maximum stock prices for each day of trading
Read log messages from a stream and writes ERROR events into a high-priority stream and the rest of the events into a low-priority stream
?
Read log messages from a stream and writes ERROR events into a high-priority stream and the rest of the events into a low-priority stream
Read events from a stream and modifies them from JSON to Avro
Stateless means processing of each message depends only on the message, so converting from JSON to Avro or filtering a stream are both stateless operations

The exactly once guarantee in the Kafka Streams is for which flow of data?:: Kafka => Kafka

You are running a Kafka Streams application in a Docker container managed by Kubernetes, and upon application restart, it takes a long time for the docker container to replicate the state and get back to processing the data. How can you improve dramatically the application restart?:: Mount a persistent volume for your RocksDB

Which of the following Kafka Streams operators are stateful? (select all that apply)

joining
reduce
aggregate
peek
count
flatmap
?
joining
reduce
aggregate
count

You want to perform table lookups against a KTable everytime a new record is received from the KStream. What is the output of KStream-KTable join?:: Kstream,Here KStream is being processed to create another KStream.

Which of the following technologies can be used to perform event stream processing? (Pick two)

Confluent Schema Registry
Apache Kafka Streams
Confluent KSQL
Confluent Replicator
?
Apache Kafka Streams
Confluent KSQL

We have a store selling shoes. What dataset is a great candidate to be modeled as a KTable in Kafka Streams?

Inventory contents right now
Items returned
The transaction stream
Money made until now
?
Inventory contents right now
Money made until now
Aggregations of stream are stored in table, whereas Streams must be modeled as a KStream to avoid data explosion

An ecommerce website maintains two topics - a high volume "purchase" topic with 5 partitions and low volume "customer" topic with 3 partitions. You would like to do a stream-table join of these topics. How should you proceed?:: Model customer as a GlobalKTable. In case of KStream-KStream join, both need to be co-partitioned. This restriction is not applicable in case of join with GlobalKTable, which is the most efficient here.

In Kafka Streams, by what value are internal topics prefixed by?:: application.id In Kafka Streams, the application.id is also the underlying group.id for your consumers, and the prefix for all internal topics (repartition and state)

Select the Kafka Streams joins that are always windowed joins.:: KStream-KStream join

To transform data from a Kafka topic to another one, I should use:: Kafka Streams. Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics

StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> textLines = builder.stream("word-count-input");
        KTable<String, Long> wordCounts = textLines
                .mapValues(textLine -> textLine.toLowerCase())
                .flatMapValues(textLine -> Arrays.asList(textLine.split("\W+")))
                .selectKey((key, word) -> word)
                .groupByKey()
                .count(Materialized.as("Counts"));
        wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));
        builder.build();

What is an adequate topic configuration for the topic word-count-output?
?
cleanup.policy=compact Result is aggregated into a table with key as the unique word and value its frequency. We have to enable log compaction for this topic to align the topic's cleanup policy with KTable semantics.

Your streams application is reading from an input topic that has 5 partitions. You run 5 instances of your application, each with num.streams.threads set to 5. How many stream tasks will be created and how many will be active?:: One partition is assigned a thread, so only 5 will be active, and 25 threads (i.e. tasks) will be created

An ecommerce wesbite sells some custom made goods. What's the natural way of modeling this data in Kafka streams?:: Purchase as stream, Product as table, Customer as table. Mostly-static data is modeled as a table whereas business transactions should be modeled as a stream.

Which of these joins does not require input topics to be sharing the same number of partitions?

KStream-GlobalKTable
KStream-KTable join
KStream-KStream join
KTable-KTable join
?
KStream-GlobalKTable have their datasets replicated on each Kafka Streams instance and therefore no repartitioning is required

We want the average of all events in every five-minute window updated every minute. What kind of Kafka Streams window will be required on the stream?:: Hopping window A hopping window is defined by two properties: the window's size and its advance interval (aka "hop"), e.g., a hopping window with a size 5 minutes and an advance interval of 1 minute.

Which of the following Kafka Streams operators are stateless? (select all that apply)

groupBy
map
filter
flatmap
branch
aggregate
?
groupBy
map
filter
flatmap
branch