KStreams and KTables Simple Operations (Stateless)

KStreams and KTables

KStreams

All inserts
Similar to a log
Infinite
Unbounded data streams

KTables

All upserts on non-null values
Deletes on null values (tombstones)
Similar to a table
Analogous to log-compacted topics
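
The difference between the two interpretations can be sketched in plain Java. This is only a model of the semantics, not the Kafka Streams API: the class name, the sample records and the use of a list and a map as stand-ins for a KStream and a KTable are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the two interpretations; not the Kafka Streams API.
class StreamVsTableSketch {

    // KStream view: every record is an insert, so nothing is ever overwritten.
    static List<Map.Entry<String, String>> asStream(List<Map.Entry<String, String>> records) {
        return new ArrayList<>(records);
    }

    // KTable view: a non-null value upserts the key, a null value deletes it.
    static Map<String, String> asTable(List<Map.Entry<String, String>> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> r : records) {
            if (r.getValue() == null) {
                table.remove(r.getKey());
            } else {
                table.put(r.getKey(), r.getValue());
            }
        }
        return table;
    }

    // Hypothetical topic contents, in offset order.
    static List<Map.Entry<String, String>> sample() {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("alice", "Paris"));
        records.add(new SimpleEntry<>("bob", "Berlin"));
        records.add(new SimpleEntry<>("alice", "London")); // update for alice
        records.add(new SimpleEntry<>("bob", null));       // null value deletes bob
        return records;
    }
}
```

Read as a stream, the sample keeps all four records; read as a table, only alice's latest value remains.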

When should I use KStream vs KTable?

KStream
- Reading from a topic that's not compacted
- When new data is partial information / transactional
KTable
- Reading from a topic that's log-compacted (e.g. aggregations)
- When you need a structure like a database table, where every update is self-sufficient (e.g. total bank balance)

Stateless Vs Stateful Operations

Stateless means the result of a transformation depends only on the data point being processed
- E.g. multiplying a value by 2 is stateless because it needs no memory of past records
Stateful means the result of a transformation also depends on external information, such as previously seen records
- E.g. a count is stateful because the app needs to remember what it has seen since it started running in order to compute the result
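
A minimal plain-Java sketch of the distinction, assuming an in-memory map as the stand-in for whatever store a real application would use (the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model; not the Kafka Streams API.
class StatelessVsStatefulSketch {

    // Stateless: the output depends only on the record being processed.
    static int doubleValue(int value) {
        return value * 2;
    }

    // Stateful: the output depends on every record seen so far for the key,
    // so the running counts have to be kept between records.
    private final Map<String, Long> counts = new HashMap<>();

    long count(String key) {
        return counts.merge(key, 1L, Long::sum);
    }
}
```

Calling `doubleValue` twice with the same input always gives the same answer, while calling `count` twice with the same key does not.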

MapValues / Map

Takes one record and produces one record

MapValues
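- Only affects values
- == does not change keys
- == does not trigger a repartition
- For KStreams and KTables
Map
- Affects both keys and values
- Marks the stream for repartition, since the key may change (the data is actually repartitioned only if a key-based operation such as a grouping or join follows)
- For KStreams only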
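
The two operations can be modelled over a plain list of key/value pairs. This is a sketch of the semantics only: in the Kafka Streams DSL the corresponding methods are `mapValues` and `map`, while the helper names and sample data below are assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java model; not the Kafka Streams API.
class MapSketch {

    static <K, V> Map.Entry<K, V> rec(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Like mapValues: the function only sees the value, so the key cannot change.
    static <K, V, R> List<Map.Entry<K, R>> mapValues(List<Map.Entry<K, V>> in, Function<V, R> f) {
        List<Map.Entry<K, R>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : in) {
            out.add(rec(r.getKey(), f.apply(r.getValue())));
        }
        return out;
    }

    // Like map: the function returns a whole new key/value pair.
    static <K, V, K2, V2> List<Map.Entry<K2, V2>> map(
            List<Map.Entry<K, V>> in, BiFunction<K, V, Map.Entry<K2, V2>> f) {
        List<Map.Entry<K2, V2>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : in) {
            out.add(f.apply(r.getKey(), r.getValue()));
        }
        return out;
    }

    // Hypothetical sample records.
    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(rec("a", 1));
        records.add(rec("b", 2));
        return records;
    }
}
```

For example, `MapSketch.mapValues(MapSketch.sample(), v -> v * 2)` doubles the values and leaves the keys alone, whereas `MapSketch.map(...)` is free to return a different key for every record.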

Warning

Map wasn't really covered in depth; MapValues is closer to what you're used to

Filter / FilterNot

Takes one record and produces zero or one record

Filter
- Does not change keys / values
- == does not trigger a repartition
- For KStreams and KTables
FilterNot
- Inverse of Filter (keeps the records for which the predicate is false)
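
A plain-Java sketch of both operations over a list of key/value pairs. It models the behaviour of the DSL's `filter` and `filterNot` rather than calling them; the class name and sample data are assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Plain-Java model; not the Kafka Streams API.
class FilterSketch {

    // Like filter: keep a record only if the predicate is true (zero or one output).
    static <K, V> List<Map.Entry<K, V>> filter(List<Map.Entry<K, V>> in, BiPredicate<K, V> keep) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : in) {
            if (keep.test(r.getKey(), r.getValue())) {
                out.add(r); // key and value pass through unchanged
            }
        }
        return out;
    }

    // Like filterNot: keep a record only if the predicate is false.
    static <K, V> List<Map.Entry<K, V>> filterNot(List<Map.Entry<K, V>> in, BiPredicate<K, V> drop) {
        return filter(in, drop.negate());
    }

    // Hypothetical sample records.
    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("a", -1));
        records.add(new SimpleEntry<>("b", 2));
        records.add(new SimpleEntry<>("c", 3));
        return records;
    }
}
```

With the predicate `(k, v) -> v > 0`, `filter` keeps b and c, and `filterNot` keeps only a.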

FlatMapValues / FlatMap

Takes one record and produces zero, one or more records

FlatMapValues
- Does not change keys
- == does not trigger a repartition
- For KStreams only
FlatMap
- Changes keys
- == marks the stream for repartition
- For KStreams only
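
The zero, one or more behaviour can be sketched in plain Java. This models the DSL's `flatMapValues` and `flatMap` rather than calling them; the word-splitting helpers and the sample sentences are assumptions chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java model; not the Kafka Streams API.
class FlatMapSketch {

    // Like flatMapValues: each value expands to zero or more values under the same key.
    static <K, V, R> List<Map.Entry<K, R>> flatMapValues(List<Map.Entry<K, V>> in, Function<V, List<R>> f) {
        List<Map.Entry<K, R>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : in) {
            for (R value : f.apply(r.getValue())) {
                out.add(new SimpleEntry<>(r.getKey(), value));
            }
        }
        return out;
    }

    // Like flatMap: each record expands to zero or more records with new keys.
    static <K, V, K2, V2> List<Map.Entry<K2, V2>> flatMap(
            List<Map.Entry<K, V>> in, BiFunction<K, V, List<Map.Entry<K2, V2>>> f) {
        List<Map.Entry<K2, V2>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : in) {
            out.addAll(f.apply(r.getKey(), r.getValue()));
        }
        return out;
    }

    // Splits a sentence into words; an empty sentence yields no words.
    static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.split(" ")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Re-keys each word of a user's sentence by the word itself.
    static List<Map.Entry<String, String>> byWord(String user, String sentence) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : words(sentence)) {
            out.add(new SimpleEntry<>(w, user));
        }
        return out;
    }

    // Hypothetical sample records: user -> sentence.
    static List<Map.Entry<String, String>> sample() {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("alice", "hello kafka"));
        records.add(new SimpleEntry<>("bob", "streams"));
        records.add(new SimpleEntry<>("carol", ""));
        return records;
    }
}
```

Here alice's record produces two outputs, bob's produces one and carol's produces none.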

KStream Branch

Branch a KStream based on one or more predicates
Predicates are evaluated in order and each record goes to the first branch whose predicate matches; if none match, the record is dropped
You get multiple KStreams as a result (one per predicate)
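
The first-match routing can be sketched in plain Java. This is a model of the behaviour described here, not the DSL itself (note that in Kafka Streams 2.8 and later `branch` is deprecated in favor of `split`); the thresholds and sample records are assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Plain-Java model; not the Kafka Streams API.
class BranchSketch {

    // Returns one output list per predicate, in predicate order.
    static <K, V> List<List<Map.Entry<K, V>>> branch(
            List<Map.Entry<K, V>> in, List<BiPredicate<K, V>> predicates) {
        List<List<Map.Entry<K, V>>> branches = new ArrayList<>();
        for (int i = 0; i < predicates.size(); i++) {
            branches.add(new ArrayList<>());
        }
        for (Map.Entry<K, V> r : in) {
            for (int i = 0; i < predicates.size(); i++) {
                if (predicates.get(i).test(r.getKey(), r.getValue())) {
                    branches.get(i).add(r);
                    break; // first matching predicate wins
                }
            }
            // if no predicate matched, the record is simply dropped
        }
        return branches;
    }

    // Hypothetical predicates: > 100, > 10, > 0.
    static List<BiPredicate<String, Integer>> samplePredicates() {
        List<BiPredicate<String, Integer>> ps = new ArrayList<>();
        ps.add((k, v) -> v > 100);
        ps.add((k, v) -> v > 10);
        ps.add((k, v) -> v > 0);
        return ps;
    }

    // Hypothetical sample records.
    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("a", 150)); // matches all three, lands in branch 0
        records.add(new SimpleEntry<>("b", 50));  // lands in branch 1
        records.add(new SimpleEntry<>("c", -5));  // matches nothing, dropped
        records.add(new SimpleEntry<>("d", 5));   // lands in branch 2
        return records;
    }
}
```

Ordering the predicates from most to least specific matters, because a record never reaches a later branch once an earlier predicate has matched.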

KStream SelectKey

Assigns a new key to the record (derived from the old key and value)
== marks the data for repartitioning
Best practice: isolate this transformation so you know exactly where the repartitioning happens
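
Why a new key implies repartitioning can be sketched in plain Java. The `partitionFor` function below is a stand-in and an assumption: Kafka's default partitioner hashes the serialized key with murmur2, whereas this sketch uses `hashCode` only to stay dependency-free. The sample records are also hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java model; not the Kafka Streams API.
class SelectKeySketch {

    // Stand-in partitioner (assumption): records with equal keys map to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Like selectKey: derive a new key from the old key and value; the value is untouched.
    static <K, V, K2> List<Map.Entry<K2, V>> selectKey(List<Map.Entry<K, V>> in, BiFunction<K, V, K2> keyFn) {
        List<Map.Entry<K2, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : in) {
            out.add(new SimpleEntry<>(keyFn.apply(r.getKey(), r.getValue()), r.getValue()));
        }
        return out;
    }

    // Hypothetical records: old keys "a" and "b", both carrying the region "eu".
    static List<Map.Entry<String, String>> sample() {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("a", "eu"));
        records.add(new SimpleEntry<>("b", "eu"));
        return records;
    }
}
```

In the sample, the two records were written under different old keys and so sit on different partitions. After re-keying by region they share the key "eu", so they have to be moved to one partition before any per-key operation can see them together.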

Reading From Kafka

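You can read a topic as a KStream, a KTable or a GlobalKTable

A GlobalKTable is like a KTable, except that every application instance holds a full copy of the data from all partitions, whereas a KTable is partitioned across instances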
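
The main difference between a KTable and a GlobalKTable is which partitions an application instance sees: a KTable instance holds only the partitions assigned to it, while a GlobalKTable instance holds all of them. A plain-Java sketch of that idea follows; it is a model rather than the Kafka Streams API, and the partition layout and names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model; not the Kafka Streams API.
class GlobalTableSketch {

    // KTable-like view: only the partitions assigned to this instance.
    static Map<String, String> kTableView(List<Map<String, String>> partitions, List<Integer> assigned) {
        Map<String, String> view = new TreeMap<>();
        for (int p : assigned) {
            view.putAll(partitions.get(p));
        }
        return view;
    }

    // GlobalKTable-like view: every instance sees all partitions.
    static Map<String, String> globalKTableView(List<Map<String, String>> partitions) {
        Map<String, String> view = new TreeMap<>();
        for (Map<String, String> p : partitions) {
            view.putAll(p);
        }
        return view;
    }

    // Hypothetical topic with two partitions (list index = partition number).
    static List<Map<String, String>> sample() {
        List<Map<String, String>> partitions = new ArrayList<>();
        Map<String, String> p0 = new TreeMap<>();
        p0.put("alice", "Paris");
        Map<String, String> p1 = new TreeMap<>();
        p1.put("bob", "Berlin");
        partitions.add(p0);
        partitions.add(p1);
        return partitions;
    }
}
```

An instance assigned only partition 0 sees alice but not bob in the KTable-like view, and sees both in the GlobalKTable-like view.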

Writing to Kafka

You can write any KStream or KTable back to Kafka
If you write a KTable back to Kafka, think about creating a log-compacted topic
To: terminal operation, writes the records to a topic
Through: writes to a topic and gets a stream / table back from that topic (deprecated since Kafka Streams 2.6 in favor of repartition)

Streams Marked for Repartition

As soon as an operation can possibly change the key, the stream will be marked for repartition:
- Map
- FlatMap
- SelectKey
If you don't need to change the key, use these instead:
- MapValues
- FlatMapValues
Repartitioning is done seamlessly behind the scenes (through an internal repartition topic) but incurs a performance overhead
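
The bookkeeping can be pictured as a boolean flag carried along the chain of operations. The sketch below is a conceptual plain-Java model of that flag, not Kafka Streams internals; every name in it is hypothetical.

```java
// Conceptual model of the "marked for repartition" flag; not the Kafka Streams API.
class RepartitionFlagSketch {

    private final boolean repartitionRequired;

    private RepartitionFlagSketch(boolean repartitionRequired) {
        this.repartitionRequired = repartitionRequired;
    }

    // A stream freshly read from a topic is partitioned by its key already.
    static RepartitionFlagSketch source() {
        return new RepartitionFlagSketch(false);
    }

    // Key-preserving operations pass the flag through unchanged.
    RepartitionFlagSketch mapValues() {
        return new RepartitionFlagSketch(repartitionRequired);
    }

    RepartitionFlagSketch flatMapValues() {
        return new RepartitionFlagSketch(repartitionRequired);
    }

    // Operations that can change the key set the flag, whether or not the key actually changes.
    RepartitionFlagSketch map() {
        return new RepartitionFlagSketch(true);
    }

    RepartitionFlagSketch flatMap() {
        return new RepartitionFlagSketch(true);
    }

    RepartitionFlagSketch selectKey() {
        return new RepartitionFlagSketch(true);
    }

    // A later key-based operation (grouping, join) pays the repartition cost if the flag is set.
    boolean needsRepartitionBeforeGrouping() {
        return repartitionRequired;
    }
}
```

Once set, the flag stays set through later key-preserving steps, which is why a single SelectKey early in a chain is enough to cause a repartition further down.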

Log Compaction

Ensures that the log contains at least the last known value for each key within a partition
Very useful if we just require a snapshot instead of the full history
Only the latest 'update' for a key is kept in the log
Order of messages is maintained
Offsets never change; the offsets of removed messages are simply skipped
Deleted records (tombstones) can still be seen by consumers for a period of delete.retention.ms (default: 24 hours)
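
The effect of compaction on a single partition can be sketched in plain Java. This is a simplified model under stated assumptions: it compacts the whole log in one pass and ignores details such as tombstone retention, and the `Rec` type and sample log are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of compacting one partition; not how the broker implements it.
class CompactionSketch {

    static final class Rec {
        final long offset;
        final String key;
        final String value;

        Rec(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return offset + ":" + key + "=" + value;
        }
    }

    // Keeps only the latest record per key; surviving records keep their offset and order.
    static List<Rec> compact(List<Rec> log) {
        Map<String, Long> lastOffset = new HashMap<>();
        for (Rec r : log) {
            lastOffset.put(r.key, r.offset);
        }
        List<Rec> out = new ArrayList<>();
        for (Rec r : log) {
            if (lastOffset.get(r.key) == r.offset) {
                out.add(r);
            }
        }
        return out;
    }

    // Hypothetical log with updates for keys a and b.
    static List<Rec> sample() {
        List<Rec> log = new ArrayList<>();
        log.add(new Rec(0, "a", "1"));
        log.add(new Rec(1, "b", "1"));
        log.add(new Rec(2, "a", "2"));
        log.add(new Rec(3, "c", "1"));
        log.add(new Rec(4, "b", "2"));
        return log;
    }
}
```

After compacting the sample, offsets 0 and 1 are gone and offsets 2, 3 and 4 remain in their original order, so a consumer reading from the start simply skips the missing offsets.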

Does not:

Prevent you from pushing duplicate data to Kafka
Prevent you from reading duplicate data from Kafka
Guarantee that compaction completes successfully every time (the compaction process can fail)

KStream and KTables Duality

A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table
A table can be considered a snapshot of a stream, at a point in time
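
The duality can be sketched in plain Java with a running count. This is a model of the idea rather than the Kafka Streams API; the class, method names and sample input are assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the stream/table duality; not the Kafka Streams API.
class DualitySketch {

    // Table -> stream: every update to the counts table is emitted as a changelog record.
    static List<Map.Entry<String, Long>> countChangelog(List<String> keys) {
        Map<String, Long> table = new LinkedHashMap<>();
        List<Map.Entry<String, Long>> changelog = new ArrayList<>();
        for (String k : keys) {
            Long updated = table.merge(k, 1L, Long::sum);
            changelog.add(new SimpleEntry<>(k, updated));
        }
        return changelog;
    }

    // Stream -> table: replaying the first n changelog records rebuilds the snapshot at that point.
    static Map<String, Long> snapshot(List<Map.Entry<String, Long>> changelog, int n) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> r : changelog.subList(0, n)) {
            table.put(r.getKey(), r.getValue());
        }
        return table;
    }
}
```

Counting the keys a, b, a yields the changelog a=1, b=1, a=2. Replaying two records gives the table as it stood at that time, and replaying all three gives the current table.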
