Kafka Streams

Solves transformations within Apache Kafka

Stateless means processing of each message depends only on the message, so converting from JSON to Avro or filtering a stream are both stateless operations

Although any Kafka Streams application is stateless as the state is stored in Kafka, it can take a while and lots of resources to recover the state from Kafka. In order to speed up recovery, it is advised to store the Kafka Streams state on a persistent volume, so that only the missing part of the state needs to be recovered.

References

Stateful transformations

Flashcards

Which of the following event processing application is stateless? (select two)

The exactly once guarantee in the Kafka Streams is for which flow of data?:: Kafka => Kafka

You are running a Kafka Streams application in a Docker container managed by Kubernetes, and upon application restart, it takes a long time for the docker container to replicate the state and get back to processing the data. How can you improve dramatically the application restart?:: Mount a persistent volume for your RocksDB

Which of the following Kafka Streams operators are stateful? (select all that apply)

You want to perform table lookups against a KTable everytime a new record is received from the KStream. What is the output of KStream-KTable join?:: Kstream,Here KStream is being processed to create another KStream.

Which of the following technologies can be used to perform event stream processing? (Pick two)

We have a store selling shoes. What dataset is a great candidate to be modeled as a KTable in Kafka Streams?

An ecommerce website maintains two topics - a high volume "purchase" topic with 5 partitions and low volume "customer" topic with 3 partitions. You would like to do a stream-table join of these topics. How should you proceed?:: Model customer as a GlobalKTable. In case of KStream-KStream join, both need to be co-partitioned. This restriction is not applicable in case of join with GlobalKTable, which is the most efficient here.

In Kafka Streams, by what value are internal topics prefixed by?:: application.id In Kafka Streams, the application.id is also the underlying group.id for your consumers, and the prefix for all internal topics (repartition and state)

Select the Kafka Streams joins that are always windowed joins.:: KStream-KStream join

To transform data from a Kafka topic to another one, I should use:: Kafka Streams. Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics

StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> textLines = builder.stream("word-count-input");
        KTable<String, Long> wordCounts = textLines
                .mapValues(textLine -> textLine.toLowerCase())
                .flatMapValues(textLine -> Arrays.asList(textLine.split("\W+")))
                .selectKey((key, word) -> word)
                .groupByKey()
                .count(Materialized.as("Counts"));
        wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));
        builder.build();

What is an adequate topic configuration for the topic word-count-output?
?
cleanup.policy=compact Result is aggregated into a table with key as the unique word and value its frequency. We have to enable log compaction for this topic to align the topic's cleanup policy with KTable semantics.

Your streams application is reading from an input topic that has 5 partitions. You run 5 instances of your application, each with num.streams.threads set to 5. How many stream tasks will be created and how many will be active?:: One partition is assigned a thread, so only 5 will be active, and 25 threads (i.e. tasks) will be created

An ecommerce wesbite sells some custom made goods. What's the natural way of modeling this data in Kafka streams?:: Purchase as stream, Product as table, Customer as table. Mostly-static data is modeled as a table whereas business transactions should be modeled as a stream.

Which of these joins does not require input topics to be sharing the same number of partitions?

We want the average of all events in every five-minute window updated every minute. What kind of Kafka Streams window will be required on the stream?:: Hopping window A hopping window is defined by two properties: the window's size and its advance interval (aka "hop"), e.g., a hopping window with a size 5 minutes and an advance interval of 1 minute.

Which of the following Kafka Streams operators are stateless? (select all that apply)