Using JDBC: transferring data from a tar.gz to a MySQL DB becomes slow after ~200k entries
Over the last few days I have been experimenting with transforming structured (XML) data from a tar.gz archive (~1.45M distinct, rather small files) into a friendlier format in a database.
I am not sure how much of the data or the use case I may disclose, but I will do my best to explain my efforts.
I have a table with the following column types (MySQL, InnoDB):
int(11) PK NN UQ
varchar(150) NN
varchar(400) NN
text
text NN
varchar(45) NN UQ
varchar(80) NN
date NN
text
varchar(300)
varchar(300)
varchar(500)
varchar(260)
varchar(200)
varchar(45)
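For reference, a minimal sketch of how such a table could be created over JDBC; the column names are hypothetical placeholders (they are not part of the question), only the types and constraints listed above are real:

// Hypothetical DDL matching the column types listed above; assumes java.sql.* imports
// and an open Connection named "connection".
String createTableSql =
      "CREATE TABLE records ("
    + "  id INT(11) NOT NULL PRIMARY KEY,"
    + "  col_a VARCHAR(150) NOT NULL,"
    + "  col_b VARCHAR(400) NOT NULL,"
    + "  col_c TEXT,"
    + "  col_d TEXT NOT NULL,"
    + "  col_e VARCHAR(45) NOT NULL UNIQUE,"
    + "  col_f VARCHAR(80) NOT NULL,"
    + "  col_g DATE NOT NULL,"
    + "  col_h TEXT,"
    + "  col_i VARCHAR(300),"
    + "  col_j VARCHAR(300),"
    + "  col_k VARCHAR(500),"
    + "  col_l VARCHAR(260),"
    + "  col_m VARCHAR(200),"
    + "  col_n VARCHAR(45)"
    + ") ENGINE=InnoDB";
try (Statement statement = connection.createStatement()) {
    statement.execute(createTableSql);
}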
Iterating through the entire tar, just reading the data and parsing it, takes roughly 90 seconds:
try (TarArchiveInputStream tarArchiveInputStream =
         new TarArchiveInputStream(
             new BufferedInputStream(
                 new GzipCompressorInputStream(
                     new FileInputStream(tarLocation))))) {
    ...
    // iterate over every entry in the archive (capped at maxNumber for the experiments)
    while ((entry = tarArchiveInputStream.getNextTarEntry()) != null && processedTarEntries < maxNumber) {
        ...PARSING + SOME STATISTICS....
    }
}
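(The parsing itself is omitted above. For illustration only, one common way to read an entry's contents from a TarArchiveInputStream, not necessarily the exact code used here:)

// Assumption: a plausible per-entry read before parsing the entry as XML.
byte[] content = new byte[(int) entry.getSize()];
org.apache.commons.compress.utils.IOUtils.readFully(tarArchiveInputStream, content);
String xml = new String(content, java.nio.charset.StandardCharsets.UTF_8);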
I hope the following piece of code gives sufficient insight into my iteration process; if not, I will try to provide more (totalCount is used in this example to generate the artificial id). The prepared statement is a "normal" INSERT INTO statement.
setPreparedStatementValues(preparedStatement, record, totalCount[0]++);
preparedStatement.addBatch();
counter[0]++;

// once BATCH_SIZE rows are queued, flush the batch and commit
if (counter[0] == BATCH_SIZE) {
    counter[0] = 0;
    preparedStatement.executeBatch();
    connection.commit();
    watch.stop();
    System.out.println("Elapsed time for batch " + (totalCount[0] / BATCH_SIZE) + ": " + watch.getTime());
    watch.reset();
    watch.start();
}
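For context, a simplified sketch of what the setup around this loop might look like, assuming autocommit is switched off (the explicit connection.commit() per batch suggests this); the JDBC URL, credentials, and column names are placeholders, not the real ones:

// Simplified sketch of the surrounding setup; URL, credentials and column names are placeholders.
Connection connection = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb", "user", "password");
connection.setAutoCommit(false); // batches are committed manually in the loop above

PreparedStatement preparedStatement = connection.prepareStatement(
        "INSERT INTO records (id, col_a, col_b, col_c, col_d, col_e, col_f, col_g, "
      + "col_h, col_i, col_j, col_k, col_l, col_m, col_n) "
      + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)");

Committing manually once per batch keeps each transaction bounded instead of accumulating all ~1.45M rows in a single transaction.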
The relevant part of the console output is the following (batch sizes of 5k and 10k didn't make much of a difference):
Elapsed time for batch 29: 3430
Elapsed time for batch 30: 3400
Elapsed time for batch 31: 3553
Elapsed time for batch 32: 3405
Elapsed time for batch 33: 3509
Elapsed time for batch 34: 3544
Elapsed time for batch 35: 6124
Elapsed time for batch 36: 5273
Elapsed time for batch 37: 9171
Elapsed time for batch 38: 8922
Elapsed time for batch 39: 24878
Elapsed time for batch 40: 68124
Elapsed time for batch 41: 70886
Elapsed time for batch 42: 78856
Elapsed time for batch 43: 80879
Elapsed time for batch 44: 85223
Elapsed time for batch 45: 92639
Elapsed time for batch 46: 80106
Time seems to be linear until shortly before the 40th batch and explodes after that. This output is from an experiment with at most 300k entries, which I tried to split into two separate runs of 150k entries each; the output is very similar to doing all 300k in one go.
I would be very grateful for suggestions on what I may be doing wrong (if I am), or on how to speed this up!
java mysql jdbc
asked Nov 25 '18 at 11:39 by Wolfone