How can I handle old data in the kafka topic?

I have started using Spark Structured Streaming.

I read a stream from a Kafka topic (`startingOffsets: latest`), apply a watermark, group by event time with a window duration, and write the result to a Kafka topic.

My question is: how can I handle the data that was written to the Kafka topic before the Spark Structured Streaming job started?

I first tried running with `startingOffsets: earliest`, but the Kafka topic holds too much data, so the Spark streaming process does not start because of a YARN timeout (even after I increased the timeout value).

If I simply create a batch job and filter by a specific data range, the result is not reflected in the current state of the streaming query, so there seems to be a problem with the consistency and accuracy of the result.

I tried resetting the checkpoint directory, but it did not work.

How can I handle the old and large data?
Help me.

edited Dec 3 '18 at 20:27

Jacek Laskowski

44.7k18132268

asked Nov 26 '18 at 14:40

soy

434

apache-spark spark-structured-streaming

1 Answer
You can try the `maxOffsetsPerTrigger` option of the Kafka source in Structured Streaming to read the old data from Kafka in bounded chunks. Set it to the maximum number of records you want to read from Kafka per trigger.

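Use:

    sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-name")
      .option("startingOffsets", "earliest")
      // Upper bound on the records read per trigger; 1 is only a placeholder, so raise it to a realistic batch size.
      .option("maxOffsetsPerTrigger", 1)
      // "group.id" and "auto.offset.reset" are dropped: the Spark Kafka source manages
      // offsets itself and takes its starting position from "startingOffsets".
      .load()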
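For context, here is a rough sketch of how the rate-limited source could be plugged into the pipeline described in the question (watermark, event-time window, Kafka sink). It is not taken from the question or the answer. The output topic `test-name-aggregated`, the checkpoint path, the window and watermark durations, the `maxOffsetsPerTrigger` value, and the use of the Kafka record timestamp as event time are all assumptions for illustration. It also assumes the `spark-sql-kafka-0-10` package is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, to_json, window}

object BackfillFromEarliest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("backfill-from-earliest").getOrCreate()

    // Start from the earliest offsets, but cap each micro-batch so the first
    // batch does not try to load the whole backlog at once.
    val raw: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-name")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 100000L) // assumed value, tune for your cluster
      .load()

    // Assumption: the Kafka record timestamp is used as the event time.
    val counts = raw
      .select(col("timestamp").as("eventTime"))
      .withWatermark("eventTime", "10 minutes")        // assumed lateness bound
      .groupBy(window(col("eventTime"), "5 minutes"))  // assumed window size
      .count()

    // The Kafka sink expects a "value" column, so serialize each row to JSON.
    val out = counts.select(
      to_json(struct(
        col("window.start").as("start"),
        col("window.end").as("end"),
        col("count")
      )).as("value"))

    val query = out.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "test-name-aggregated")                                  // hypothetical output topic
      .option("checkpointLocation", "/tmp/checkpoints/backfill-from-earliest") // hypothetical path
      .outputMode("update")
      .start()

    query.awaitTermination()
  }
}
```

Because old and new records pass through the same query, the windowed state is built from the backlog and then continues with live data. Note that `startingOffsets` only applies when the query starts without an existing checkpoint; a restarted query resumes from the offsets stored in its checkpoint.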

answered Nov 27 '18 at 4:31

anuj saxena

24917
