How to use a persisted StateStore between two Kafka Streams - apache-kafka

I'm having some trouble trying to achieve the following via Kafka Streams:
At the startup of the app, the (compacted) topic alpha gets loaded into a key-value StateStore map.
A Kafka Stream consumes from another topic, uses (.get) the map above and finally produces a new record into topic alpha.
The result is that the in-memory map should stay aligned with the underlying topic, even if the streamer gets restarted.
My approach is the following:
val builder = new StreamsBuilderS()
val store = Stores.keyValueStoreBuilder(
  Stores.persistentKeyValueStore("store"), kSerde, vSerde
)
builder.addStateStore(store)
val loaderStreamer = new LoaderStreamer(store).startStream()
[...] // I wait a few seconds until the loading is complete and the stream is running
val map = instance.store("store", QueryableStoreTypes.keyValueStore[K, V]()) // !!!!!!!! ERROR HERE !!!!!!!!
builder
.stream("another-topic")(Consumed.`with`(kSerde, vSerde))
.doMyAggregationsAndgetFromTheMapAbove
.transform(() => new StoreTransformer[K, V]("store"), "store")
.to("alpha")(Produced.`with`(kSerde, vSerde))
LoaderStreamer(store):
[...]
val builder = new StreamsBuilderS()
builder.addStateStore(store)
builder
.table("alpha")(Consumed.`with`(kSerde, vSerde))
builder.build
[...]
StoreTransformer:
[...]
override def init(context: ProcessorContext): Unit = {
this.context = context
this.store =
context.getStateStore(store).asInstanceOf[KeyValueStore[K, V]]
}
override def transform(key: K, value: V): (K, V) = {
store.put(key, value)
(key, value)
}
[...]
...but what I get is:
Caused by: org.apache.kafka.streams.errors.InvalidStateStoreException:
The state store, store, may have migrated to another instance.
while trying to get the store handle.
Any idea on how to achieve this?
Thank you!

You can't share a state store between two Kafka Streams applications.
According to the documentation (https://docs.confluent.io/current/streams/faq.html#interactive-queries), there might be two reasons for the above exception:
The local KafkaStreams instance is not yet ready and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready, but the particular state store was just migrated to another instance behind the scenes.
The easiest way to deal with it is to wait until the state store is queryable:
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
The whole example can be found on the Confluent GitHub.
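In practice the handle has to come from the one KafkaStreams instance that actually hosts the store, and only after that instance is ready. A minimal Java sketch using the helper above (loaderTopology, loaderConfig and the String key/value types are placeholders, not names from the question):
KafkaStreams loaderStreams = new KafkaStreams(loaderTopology, loaderConfig);
loaderStreams.start();
// Retry until the store hosted by this instance becomes queryable.
ReadOnlyKeyValueStore<String, String> map =
        waitUntilStoreIsQueryable("store", QueryableStoreTypes.<String, String>keyValueStore(), loaderStreams);
System.out.println(map.approximateNumEntries());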

Related

Writing to a topic within a Processor using Kafka Stream DSL

I need to use the Kafka Streams API alongside the Processor API. I also want to write different types of objects to different topics within my processor implementation, i.e. emit different objects on process & punctuate. I have seen there is a KIP-313 flatTransform that would probably solve my problem.
If I use:
inputStream.process(processorSupplier, ...)
since this is a "terminating" operation (its return type is void), could I use an internal Kafka producer within my Processor? I have not seen such an implementation; is this a reasonable approach, and are there any side effects?
If you need such a low-level approach, you can build the whole topology on your own:
Topology topology = new Topology();
topology.addSource("inputNode", "input");
topology.addProcessor("inProcessor", InputProcessor::new, "inputNode");
topology.addSink("sink1",
        (k, v, rc) -> "topic1",
        new StringSerializer(),
        new IntegerSerializer(),
        "inProcessor");
topology.addSink("sink2",
        (k, v, rc) -> "topic2",
        new StringSerializer(),
        new StringSerializer(),
        "inProcessor");
InputProcessor, depending on the business logic, produces different types of objects and passes them to different sink nodes (topics).
The sample has the following logic:
If the value of the message can be parsed to an Integer, forward it to both sink nodes (sink1, sink2): to sink1 as an Integer and to sink2 as a String.
If not, forward the message only to sink2.
public class InputProcessor extends AbstractProcessor<String, String> {
    @Override
    public void process(String key, String value) {
        try {
            context().forward(key, Integer.parseInt(value), To.child("sink1"));
            context().forward(key, value, To.child("sink2"));
        } catch (NumberFormatException nfe) {
            context().forward(key, value, To.child("sink2"));
        }
    }
}
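For completeness, the manual topology is then run like any other Kafka Streams application (a sketch; the application.id and bootstrap server values are placeholders):
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "multi-sink-app");   // placeholder id
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));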

Set timestamp in output with Kafka Streams fails for transformations

Suppose we have a transformer (written in Scala)
new Transformer[String, V, (String, V)]() {
  var context: ProcessorContext = _

  override def init(context: ProcessorContext): Unit = {
    this.context = context
  }

  override def transform(key: String, value: V): (String, V) = {
    val timestamp = toTimestamp(value)
    context.forward(key, value, To.all().withTimestamp(timestamp))
    key -> value
  }

  override def close(): Unit = ()
}
where toTimestamp is just a function which returns a timestamp fetched from the record value. Once it gets executed, there's an NPE:
Exception in thread "...-6f3693b9-4e8d-4e65-9af6-928884320351-StreamThread-5" java.lang.NullPointerException
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:110)
at CustomTransformer.transform()
at CustomTransformer.transform()
at org.apache.kafka.streams.scala.kstream.KStream$$anon$1$$anon$2.transform(KStream.scala:302)
at org.apache.kafka.streams.scala.kstream.KStream$$anon$1$$anon$2.transform(KStream.scala:300)
at
what essentially happens is that ProcessorContextImpl fails in:
public <K, V> void forward(final K key, final V value, final To to) {
    toInternal.update(to);
    if (toInternal.hasTimestamp()) {
        recordContext.setTimestamp(toInternal.timestamp());
    }
    final ProcessorNode previousNode = currentNode();
because the recordContext was not initialized (and it could only be done internally by KafkaStreams).
This is a follow-up to the question Set timestamp in output with Kafka Streams.
If you work with a transformer, you need to make sure that a new Transformer object is created when TransformerSupplier#get() is called (cf. https://docs.confluent.io/current/streams/faq.html#why-do-i-get-an-illegalstateexception-when-accessing-record-metadata).
In the original question, I thought it was your context variable that resulted in the NPE, but now I realize it's about the Kafka Streams internals.
The Scala API has a bug in 2.0.0 that may result in the same Transformer instance being reused (https://issues.apache.org/jira/browse/KAFKA-7250). I think you are hitting this bug. Rewriting your code a little should fix the issue. Note that Kafka 2.0.1 and Kafka 2.1.0 contain a fix.
@matthias-j-sax Same behavior if the processor is reused in Java code.
Topology topology = new Topology();
MyProcessor myProcessor = new MyProcessor();
topology.addSource("source", "topic-1")
        .addProcessor(
                "processor",
                () -> {
                    return myProcessor;
                },
                "source"
        )
        .addSink("sink", "topic-2", "processor");
KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();
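That is expected: the supplier above returns the same myProcessor object for every stream task, so all tasks share its internal state. A minimal sketch of the corrected wiring, where the supplier constructs a fresh (hypothetical) MyProcessor on every get() call:
Topology topology = new Topology();
topology.addSource("source", "topic-1")
        // a brand-new processor instance is created for every stream task
        .addProcessor("processor", MyProcessor::new, "source")
        .addSink("sink", "topic-2", "processor");
KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();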

Kafka Streams - How to scale Kafka store generated changelog topics

I have multiple redundant app instances that want to consume all the events of a topic and store them independently for disk lookup (via RocksDB).
For the sake of argument, let's assume these redundant consumers are serving stateless HTTP requests; the load is not shared using Kafka, but rather Kafka is used to replicate data from a producer into each instance's local store.
Looking at the topics generated, each consuming app created 3 extra topics:
{topicname}STATE-STORE-0000000000-changelog
{application-name}-{storename}-changelog
{application-name}-{storename}-repartition
But each of these generated topics is as big as the compacted view of the original topic, meaning each consuming store triples the size of the original topic (which was already compacted).
Why does the Kafka store require these 3 topics? Couldn't we simply configure the stream to reload from the last consumed offset when reconciling the on-disk store?
Is the idea that each instance of the redundant consuming apps gets its own unique set of 3 "store generated" topics, or should they be configured to share the same set of changelog topics? That is, should they share the same applicationId, or rather not, since they need to consume all the events of all the partitions?
In short, I am concerned about storage scalability as we grow the number of consuming apps, which would spawn more changelog topics...
Here is the code that creates the store:
public class ProgramMappingEventStoreFactory {
private static final Logger logger = Logger.getLogger(ProgramMappingEventStoreFactory.class.getName());
private final static String STORE_NAME = "program-mapping-store";
private final static String APPLICATION_NAME = "epg-mapping-catalog_program-mapping";
public static ReadOnlyKeyValueStore<ProgramMappingEventKey, ProgramMappingEvent> newInstance(String kafkaBootstrapServerUrl,
String avroRegistryUrl,
String topic,
String storeDirectory)
{
Properties kafkaConfig = new KafkaConfigBuilder().withBootstrapServers(kafkaBootstrapServerUrl)
.withSchemaRegistryUrl(avroRegistryUrl)
.withApplicationId(createApplicationId(APPLICATION_NAME))
.withGroupId(UUID.randomUUID().toString())
.withClientId(UUID.randomUUID().toString())
.withDefaultKeySerdeClass(SpecificAvroSerde.class)
.withDefaultValueSerdeClass(SpecificAvroSerde.class)
.withStoreDirectory(storeDirectory)
.build();
StreamsBuilder streamBuilder = new StreamsBuilder();
bootstrapStore(streamBuilder, topic);
KafkaStreams streams = new KafkaStreams(streamBuilder.build(), kafkaConfig);
streams.start();
try {
return getStoreAndBlockUntilQueryable(STORE_NAME,
QueryableStoreTypes.keyValueStore(),
streams);
} catch (InterruptedException e) {
throw new IllegalStateException("Failed to create the LiveMediaPolicyIdStore", e);
}
}
private static <T> T getStoreAndBlockUntilQueryable(String storeName,
QueryableStoreType<T> queryableStoreType,
KafkaStreams streams)
throws InterruptedException
{
while (true) {
try {
return streams.store(storeName, queryableStoreType);
} catch (InvalidStateStoreException ignored) {
Thread.sleep(100);
}
}
}
private static void bootstrapStore(StreamsBuilder builder, String topic) {
KTable<ProgramMappingEventKey, ProgramMappingEvent> table = builder.table(topic);
table.groupBy((k, v) -> KeyValue.pair(k, v)).reduce((newValue, aggValue) -> newValue,
(newValue, aggValue) -> null,
Materialized.as(STORE_NAME));
}
private static String createApplicationId(String applicationName) {
try {
return String.format("%s-%s", applicationName, InetAddress.getLocalHost().getHostName());
} catch (UnknownHostException e) {
logger.warning(() -> "Failed to find the hostname, generating a unique applicationId");
return String.format("%s-%s", applicationName, UUID.randomUUID());
}
}
}
If you want to load the same state into multiple instances, you should use a GlobalKTable and a unique application.id over all instances (builder.globalTable()).
If you use KTable, data is partitioned, forcing you to use a different application.id for each instance. This can be considered an anti-pattern.
I am also not sure why you do groupBy((k, v) -> KeyValue.pair(k, v)).reduce() -- this results in an unnecessary repartition topic.
For the changelog topics generated by the table() operator, there is a known bug in the 1.0 and 1.1 releases if StreamsBuilder is used (KStreamBuilder is not affected). It's fixed in the 2.0 release (https://issues.apache.org/jira/browse/KAFKA-6729).
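A minimal sketch of that suggestion applied to the bootstrapStore method from the question (reusing its topic, serdes and STORE_NAME, with all instances sharing the same application.id): the topic is materialized directly into the named global store, so every instance gets a full local copy and no repartition topic is created.
private static void bootstrapStore(StreamsBuilder builder, String topic) {
    // Global store: every instance consumes all partitions and keeps a full
    // local copy under STORE_NAME, restored directly from the source topic.
    builder.globalTable(topic,
            Materialized.<ProgramMappingEventKey, ProgramMappingEvent, KeyValueStore<Bytes, byte[]>>as(STORE_NAME));
}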

Kafka Streams persistent store error: the state store, may have migrated to another instance

I am using Kafka Streams with Spring Boot. In my use case, when I receive a customer event from another microservice I need to store it in a customer materialized view, and when I receive an order event I need to join customer and order and then store the result in a customer-order materialized view. To achieve this I created a persistent key-value store customer-store and update it when a new event comes in.
StoreBuilder customerStateStore = Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("customer"),Serdes.String(), customerSerde).withLoggingEnabled(new HashMap<>());
streamsBuilder.addStateStore(customerStateStore);
KTable<String,Customer> customerKTable=streamsBuilder.table("customer",Consumed.with(Serdes.String(),customerSerde));
customerKTable.foreach(((key, value) -> System.out.println("Customer from Topic: "+value)));
I configured the topology and streams and started the streams object. When I try to access the store using ReadOnlyKeyValueStore, I get the following exception, even though I stored some objects a few moments ago:
streams.start();
ReadOnlyKeyValueStore<String, Customer> customerStore = streams.store("customer", QueryableStoreTypes.keyValueStore());
System.out.println("customerStore.approximateNumEntries()-> " + customerStore.approximateNumEntries());
Code is uploaded to GitHub for reference. Appreciate your help.
Exception:
org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, customer, may have migrated to another instance.
at org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:60)
at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1043)
at com.kafkastream.service.EventsListener.main(EventsListener.java:94)
The state store usually needs some time to be prepared. The simplest approach is shown below (code from the official documentation):
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
You can find additional info in the documentation:
https://docs.confluent.io/current/streams/faq.html#interactive-queries
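Applied to the code in the question, the lines right after streams.start() would become something like the following sketch:
streams.start();
// Block until the "customer" store is ready instead of querying it immediately.
ReadOnlyKeyValueStore<String, Customer> customerStore =
        waitUntilStoreIsQueryable("customer", QueryableStoreTypes.<String, Customer>keyValueStore(), streams);
System.out.println("customerStore.approximateNumEntries() -> " + customerStore.approximateNumEntries());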

Kafka Streams - The state store may have migrated to another instance

I'm writing a basic application to test the Interactive Queries feature of Kafka Streams. Here is the code:
public static void main(String[] args) {
StreamsBuilder builder = new StreamsBuilder();
KeyValueBytesStoreSupplier waypointsStoreSupplier = Stores.persistentKeyValueStore("test-store");
StoreBuilder waypointsStoreBuilder = Stores.keyValueStoreBuilder(waypointsStoreSupplier, Serdes.ByteArray(), Serdes.Integer());
final KStream<byte[], byte[]> waypointsStream = builder.stream("sample1");
final KStream<byte[], TruckDriverWaypoint> waypointsDeserialized = waypointsStream
.mapValues(CustomSerdes::deserializeTruckDriverWaypoint)
.filter((k,v) -> v.isPresent())
.mapValues(Optional::get);
waypointsDeserialized.groupByKey().aggregate(
() -> 1,
(aggKey, newWaypoint, aggValue) -> {
aggValue = aggValue + 1;
return aggValue;
}, Materialized.<byte[], Integer, KeyValueStore<Bytes, byte[]>>as("test-store").withKeySerde(Serdes.ByteArray()).withValueSerde(Serdes.Integer())
);
final KafkaStreams streams = new KafkaStreams(builder.build(), new StreamsConfig(createStreamsProperties()));
streams.cleanUp();
streams.start();
ReadOnlyKeyValueStore<byte[], Integer> keyValueStore = streams.store("test-store", QueryableStoreTypes.keyValueStore());
KeyValueIterator<byte[], Integer> range = keyValueStore.all();
while (range.hasNext()) {
KeyValue<byte[], Integer> next = range.next();
System.out.println(next.value);
}
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
protected static Properties createStreamsProperties() {
final Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "random167");
streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Serdes.Integer().getClass().getName());
//streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
return streamsConfiguration;
}
So my problem is, every time I run this I get this same error:
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, test-store, may have migrated to another instance.
I'm running only 1 instance of the application, and the topic I'm consuming from has only 1 partition.
Any idea what I'm doing wrong?
Looks like you have a race condition. The Kafka Streams javadoc for KafkaStreams::start() says:
Start the KafkaStreams instance by starting all its threads. This function is expected to be called only once during the life cycle of the client.
Because threads are started in the background, this method does not block.
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html
You're calling streams.store() immediately after streams.start(), but I'd wager that you're in a state where it hasn't initialized fully yet.
Since this code appears to be just for testing, add a Thread.sleep(5000) or something in there and give it a go (this is not a solution for production). Depending on your input rate into the topic, that'll probably give the store a bit of time to start filling up with events so that your KeyValueIterator actually has something to process/print.
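If a hard-coded sleep feels too crude even for a test, one alternative sketch (not part of the original answer) is to register a state listener before start() and block until the instance reports RUNNING, then fetch the store (falling back to the retry loop from the other answers if it still throws):
final CountDownLatch running = new CountDownLatch(1);
streams.setStateListener((newState, oldState) -> {
    // Release the latch once all stream threads have transitioned to RUNNING.
    if (newState == KafkaStreams.State.RUNNING) {
        running.countDown();
    }
});
streams.start();
running.await(); // await() can throw InterruptedException
ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        streams.store("test-store", QueryableStoreTypes.keyValueStore());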
Probably not applicable to OP but might help others:
In trying to retrieve a KTable's store, make sure the KTable's topic exists first or you'll get this exception.
I failed to call the StoreBuilder before consuming the store.
