KStreams and KTables Simple Operations (Stateless)
KStreams and KTables
KStreams
- All inserts
- Similar to a log
- Infinite
- Unbounded data streams
KTables
- All upserts on non-null values
- Deletes on null values (tombstones)
- Similar to a table
- Analogous to log-compacted topics
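To make the insert vs upsert distinction concrete, here is a minimal plain-Python model. It is not the Kafka Streams API; `as_stream` and `as_table` are hypothetical names used only for illustration, and the sample records are made up.

```python
# Plain-Python model (not the Kafka Streams API): the same records read two ways.
records = [("alice", "Paris"), ("bob", "Berlin"), ("alice", "Rome"), ("bob", None)]

def as_stream(records):
    """KStream view: every record is an independent insert, nothing is overwritten."""
    return list(records)

def as_table(records):
    """KTable view: a non-null value upserts by key, a null value deletes the key."""
    table = {}
    for key, value in records:
        if value is None:
            table.pop(key, None)   # tombstone: remove the key if present
        else:
            table[key] = value     # upsert: insert or overwrite
    return table

stream_view = as_stream(records)
table_view = as_table(records)
print(len(stream_view))  # 4
print(table_view)        # {'alice': 'Rome'}
```

The stream keeps all four records, while the table ends up with only the latest value for alice, because bob was deleted by the null value.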
When should I use KStream vs KTable?
- KStream
- Reading from a topic that's not compacted
- When new data is partial information or transactional
- KTable
- Reading from a topic that's log-compacted (for example, aggregations)
- When you need a structure like a database table, where every update is self-sufficient (e.g. total bank balance)
Stateless Vs Stateful Operations
- Stateless means that the result of a transformation depends only on the data point being processed
- E.g. multiplying a value by 2 is stateless because it needs no memory of the past
- Stateful means that the result of a transformation also depends on external information (state built up from earlier records)
- A count operation is stateful because your app needs to know what has happened since it started running in order to compute the result
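The difference can be sketched in plain Python. This is only a model of the idea, not the Kafka Streams API; the function names are hypothetical, and a dict stands in for a state store.

```python
def double_values(records):
    """Stateless: each output depends only on the record being processed."""
    return [(key, value * 2) for key, value in records]

def count_by_key(records):
    """Stateful: each output depends on a store of everything seen so far."""
    counts = {}   # stands in for a state store
    out = []
    for key, _ in records:
        counts[key] = counts.get(key, 0) + 1
        out.append((key, counts[key]))
    return out

records = [("a", 1), ("b", 5), ("a", 3)]
doubled = double_values(records)   # [('a', 2), ('b', 10), ('a', 6)]
counted = count_by_key(records)    # [('a', 1), ('b', 1), ('a', 2)]
```

Note that the second "a" record produces a different count than the first one, even though the operation is the same. That dependence on history is what makes it stateful.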
MapValues / Map
Takes one record and produces one record
- MapValues
- Affects only values
- == does not change keys
- == does not trigger a repartition
- For KStreams and KTables
- Map
- Affects both keys and values
- Marks the stream for repartitioning (because the key may change)
- For KStreams only
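A small plain-Python model of the two operations follows. The `Stream` class and its `repartition_required` flag are hypothetical and exist only to illustrate the rule above; this is not how the Kafka Streams API is written.

```python
from dataclasses import dataclass

@dataclass
class Stream:
    records: list                       # list of (key, value) pairs
    repartition_required: bool = False  # hypothetical flag, for illustration

def map_values(stream, fn):
    """One record in, one record out. Keys are untouched, so the flag is unchanged."""
    return Stream([(k, fn(v)) for k, v in stream.records],
                  stream.repartition_required)

def map_records(stream, fn):
    """One record in, one record out. The key may change, so flag a repartition."""
    return Stream([fn(k, v) for k, v in stream.records], True)

source = Stream([("alice", "paris"), ("bob", "berlin")])
upper = map_values(source, str.upper)
swapped = map_records(source, lambda k, v: (v, k))
```

Here `upper` keeps the original keys and is not flagged, while `swapped` is flagged even if the function happened to return the same key, because the model cannot know that in advance.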
Warning
Map was not covered in much detail; MapValues is closer to the map operation you're used to
Filter / FilterNot
Takes one record and produces zero or one record
- Filter
- Does not change keys / values
- Does not trigger a repartition
- For KStreams and KTables
- FilterNot
- Inverse of Filter: keeps the records for which the predicate is false
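As a rough illustration in plain Python (not the Kafka Streams API; the function names and sample data are hypothetical):

```python
def filter_records(records, predicate):
    """Keep records for which the predicate is true. Keys and values are unchanged."""
    return [(k, v) for k, v in records if predicate(k, v)]

def filter_not(records, predicate):
    """Keep records for which the predicate is false."""
    return [(k, v) for k, v in records if not predicate(k, v)]

records = [("a", 5), ("b", -2), ("c", 0)]
is_positive = lambda key, value: value > 0

positive = filter_records(records, is_positive)   # [('a', 5)]
non_positive = filter_not(records, is_positive)   # [('b', -2), ('c', 0)]
```

Together the two results cover every input record exactly once, which is what "inverse" means here.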
FlatMapValues / FlatMap
Takes one record and produces zero, one or more records
- FlatMapValues
- Does not change keys
- == does not trigger a repartition
- For KStreams only
- FlatMap
- Changes keys
- == triggers a repartition
- For KStreams only
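The zero, one or more behaviour can be modelled in plain Python. This is a sketch only, not the Kafka Streams API; the function names and sample sentences are hypothetical.

```python
def flat_map_values(records, fn):
    """fn(value) returns a list of new values; each one keeps the original key."""
    return [(k, new_v) for k, v in records for new_v in fn(v)]

def flat_map(records, fn):
    """fn(key, value) returns a list of new (key, value) pairs; keys may change."""
    return [pair for k, v in records for pair in fn(k, v)]

records = [("s1", "hello kafka"), ("s2", "")]

# Split each sentence into words; the empty sentence produces zero records.
words = flat_map_values(records, lambda v: v.split())
# Re-key each word by itself, keeping the sentence id as the value.
rekeyed = flat_map(records, lambda k, v: [(word, k) for word in v.split()])
```

`words` is `[('s1', 'hello'), ('s1', 'kafka')]` and `rekeyed` is `[('hello', 's1'), ('kafka', 's1')]`. Only the second form changes keys, which is why only it leads to a repartition.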
KStream Branch
- Branch a KStream based on one or more predicates
- Predicates are evaluated in order; a record goes to the first branch whose predicate matches, and is dropped if none match
- You get multiple KStreams as a result
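The first-match rule can be sketched in plain Python. This models the behaviour described above and is not the Kafka Streams API; `branch` here is a hypothetical helper.

```python
def branch(records, *predicates):
    """Route each record to the first predicate that matches; drop it otherwise."""
    branches = [[] for _ in predicates]
    for key, value in records:
        for index, predicate in enumerate(predicates):
            if predicate(key, value):
                branches[index].append((key, value))
                break   # first match wins, later predicates are not tried
    return branches

records = [("a", 150), ("b", 20), ("c", 7), ("d", -1)]
high, mid, low = branch(
    records,
    lambda k, v: v > 100,
    lambda k, v: v > 10,
    lambda k, v: v > 0,
)
```

Record "a" also satisfies the second and third predicates but only lands in `high`, and record "d" matches nothing and is dropped.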
KStream SelectKey
- Assigns a new key to the record (computed from the old key and value)
- == marks the data for repartitioning
- Best practice: isolate this transformation so you know exactly where the repartitioning happens
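A minimal plain-Python model of re-keying, assuming hypothetical order records keyed by order id (not the Kafka Streams API):

```python
def select_key(records, key_fn):
    """Replace each key with key_fn(old_key, value); values are left untouched."""
    return [(key_fn(k, v), v) for k, v in records]

orders = [
    ("order-1", {"user": "alice", "total": 30}),
    ("order-2", {"user": "bob", "total": 12}),
]

# Re-key by user so that later per-user operations see the right key.
by_user = select_key(orders, lambda old_key, value: value["user"])
```

Keeping this step separate from value transformations makes it obvious in the topology where the key, and therefore the partitioning, changes.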
Reading From Kafka
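You can read a topic as a KStream, a KTable or a GlobalKTable
A GlobalKTable is like a KTable, except that every application instance holds a full copy of the data from all partitions rather than only the partitions assigned to it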
Writing to Kafka
- You can write any KStream or KTable back to Kafka
- If you write a KTable back to Kafka, think about creating a log-compacted topic
- To: terminal operation that writes the records to a topic
- Through: writes to a topic and returns a stream / table read back from that topic (in newer Kafka versions, through is deprecated in favor of repartition)
Streams Marked for Repartition
- As soon as an operation can possibly change the key, the stream will be marked for repartition:
- Map
- FlatMap
- SelectKey
- If you don't need to change the key, use these instead (they never trigger a repartition):
- MapValues
- FlatMapValues
- Repartitioning is done automatically behind the scenes (through an internal topic) but incurs a performance cost, because the data is written to Kafka and read back
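Why a key change forces a repartition can be sketched in plain Python. This is an illustrative model only: the partition count is hypothetical, and CRC32 is a deterministic stand-in for the hash used by Kafka's default partitioner (murmur2).

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count

def partition_for(key):
    """Stand-in for Kafka's key-based partitioner (which uses murmur2, not CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def repartition(records):
    """Redistribute records so that each one sits in the partition of its key."""
    partitions = {p: [] for p in range(NUM_PARTITIONS)}
    for key, value in records:
        partitions[partition_for(key)].append((key, value))
    return partitions

# Records that were keyed by order id and have just been re-keyed by user.
rekeyed = [("alice", 30), ("bob", 12), ("alice", 5)]
partitions = repartition(rekeyed)

# After repartitioning, all records for one key live in a single partition.
alice_partitions = {
    p for p, recs in partitions.items() for key, _ in recs if key == "alice"
}
```

Before the repartition, the two alice records could have been sitting in different partitions (chosen by their old order-id keys), so a per-key operation would have seen only part of the data.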
Log Compaction
- Ensures that the log contains at least the last known value for each key within a partition
- Very useful if we only need a snapshot rather than the full history
- Only the latest 'update' for each key is kept in the log
- Order of messages is maintained
- Offsets never change; the offsets of removed messages are simply skipped
- Deleted records can still be seen by consumers for a period of delete.retention.ms (default of 24 hours)
Log compaction does not:
- Prevent you from pushing duplicate data to Kafka
- Prevent you from reading duplicate data from Kafka
- Guarantee that compaction completes successfully every time; the compaction process can fail
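The effect of compaction on one partition can be modelled in plain Python. This is a simplified sketch, not how the broker implements it: it compacts the whole log in one pass and ignores tombstones and delete.retention.ms. The `compact` function and the sample log are hypothetical.

```python
def compact(log):
    """log is a list of (offset, key, value). Keep only the latest record per key."""
    latest_offset = {}
    for offset, key, _ in log:
        latest_offset[key] = offset   # later records overwrite earlier ones
    # Surviving records keep their original offsets and their relative order.
    return [record for record in log if latest_offset[record[1]] == record[0]]

log = [
    (0, "alice", "Paris"),
    (1, "bob", "Berlin"),
    (2, "alice", "Rome"),
    (3, "carol", "Oslo"),
    (4, "bob", "Madrid"),
]
compacted = compact(log)
# [(2, 'alice', 'Rome'), (3, 'carol', 'Oslo'), (4, 'bob', 'Madrid')]
```

Offsets 0 and 1 are gone and the remaining offsets are unchanged, so a consumer reading the compacted log simply skips from the start to offset 2.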
KStream and KTable Duality
- A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table
- A table can be considered a snapshot of a stream, at a point in time
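The duality can be illustrated with a plain-Python round trip: a table that emits a changelog as it is updated, and a replay of that changelog that rebuilds the table. This is a model of the idea only, not the Kafka Streams API; both function names are hypothetical.

```python
def count_with_changelog(events):
    """Maintain a count table and emit one changelog record per state change."""
    table, changelog = {}, []
    for key in events:
        table[key] = table.get(key, 0) + 1
        changelog.append((key, table[key]))   # stream = changelog of the table
    return table, changelog

def replay(changelog):
    """Rebuild the table by applying every changelog record as an upsert."""
    table = {}
    for key, value in changelog:
        table[key] = value                    # table = snapshot of the stream
    return table

events = ["alice", "bob", "alice"]
table, changelog = count_with_changelog(events)
rebuilt = replay(changelog)
```

Replaying the full changelog gives back the same table, and replaying only a prefix of it gives the table as it was at that earlier point in time.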