Using JDBC: transfer data from tar.gz to MySQL db is becoming slow after ~200k entries

I have been experimenting quite some time during the last days to transform structured (xml) data from a tar.gz (~1.45M distinct files of rather small size) into a more friendly format into a database.

I am not sure what I may disclose of the data or the usecase but I will try my best to explain my efforts.

I have a table with the following columns-types (MySQL; InnoDB):

int(11) PK NN UQ

varchar(150) NN

varchar(400) NN

text 

text NN

varchar(45) NN UQ

varchar(80) NN

date NN

text 

varchar(300) 

varchar(300) 

varchar(500) 

varchar(260) 

varchar(200) 

varchar(45)

Iterating through the entire tar just looking at the data + parsing takes roughly 90 seconds +/-:

try (TarArchiveInputStream tarArchiveInputStream =

                     new TarArchiveInputStream(

                             new BufferedInputStream(

                                     new GzipCompressorInputStream(

                                             new FileInputStream(tarLocation))))){



...

    while ((entry = tarArchiveInputStream.getNextTarEntry()) != null && processedTarEntries < maxNumber) {

    ...PARSING + SOME STATISTICS....

    }

}

I hope the following piece of code is sufficient insight into my iteration-process; if not I will try to provide more (totalCount is used in this example to generate the artificial id). The prepared statement is a "normal" INSERT INTO statement.

setPreparedStatementValues(preparedStatement, record, totalCount[0]++);



preparedStatement.addBatch();



counter[0]++;



if (counter[0] == BATCH_SIZE){

    counter[0] = 0;

    preparedStatement.executeBatch();

    connection.commit();

    watch.stop();

    System.out.println("Elapsed time for batch " + (totalCount[0] / BATCH_SIZE) + ": " + watch.getTime());

    watch.reset();

    watch.start();

}

The relevant part of the sout-output is the following (batchsize 5k/10k didn't really make a difference):

Elapsed time for batch 29: 3430

Elapsed time for batch 30: 3400

Elapsed time for batch 31: 3553

Elapsed time for batch 32: 3405

Elapsed time for batch 33: 3509

Elapsed time for batch 34: 3544

Elapsed time for batch 35: 6124

Elapsed time for batch 36: 5273

Elapsed time for batch 37: 9171

Elapsed time for batch 38: 8922

Elapsed time for batch 39: 24878

Elapsed time for batch 40: 68124

Elapsed time for batch 41: 70886

Elapsed time for batch 42: 78856

Elapsed time for batch 43: 80879

Elapsed time for batch 44: 85223

Elapsed time for batch 45: 92639

Elapsed time for batch 46: 80106

Time seems to be linear until somewhere short before the 40th batch and explodes after that. This output is from an experiment with 300k entries max but where I tried to split it into two seperate runs of 150k entries each. The output is very similar to trying to do all 300k in one go.

I would be very grateful for suggestions what I may be doing wrong if I am or suggestions how to speed this up!

asked Nov 25 '18 at 11:39

Wolfone

404412

add a comment |

I am not sure what I may disclose of the data or the usecase but I will try my best to explain my efforts.

I have a table with the following columns-types (MySQL; InnoDB):

int(11) PK NN UQ

varchar(150) NN

varchar(400) NN

text 

text NN

varchar(45) NN UQ

varchar(80) NN

date NN

text 

varchar(300) 

varchar(300) 

varchar(500) 

varchar(260) 

varchar(200) 

varchar(45)

Iterating through the entire tar just looking at the data + parsing takes roughly 90 seconds +/-:

try (TarArchiveInputStream tarArchiveInputStream =

                     new TarArchiveInputStream(

                             new BufferedInputStream(

                                     new GzipCompressorInputStream(

                                             new FileInputStream(tarLocation))))){



...

    while ((entry = tarArchiveInputStream.getNextTarEntry()) != null && processedTarEntries < maxNumber) {

    ...PARSING + SOME STATISTICS....

    }

}

setPreparedStatementValues(preparedStatement, record, totalCount[0]++);



preparedStatement.addBatch();



counter[0]++;



if (counter[0] == BATCH_SIZE){

    counter[0] = 0;

    preparedStatement.executeBatch();

    connection.commit();

    watch.stop();

    System.out.println("Elapsed time for batch " + (totalCount[0] / BATCH_SIZE) + ": " + watch.getTime());

    watch.reset();

    watch.start();

}

The relevant part of the sout-output is the following (batchsize 5k/10k didn't really make a difference):

Elapsed time for batch 29: 3430

Elapsed time for batch 30: 3400

Elapsed time for batch 31: 3553

Elapsed time for batch 32: 3405

Elapsed time for batch 33: 3509

Elapsed time for batch 34: 3544

Elapsed time for batch 35: 6124

Elapsed time for batch 36: 5273

Elapsed time for batch 37: 9171

Elapsed time for batch 38: 8922

Elapsed time for batch 39: 24878

Elapsed time for batch 40: 68124

Elapsed time for batch 41: 70886

Elapsed time for batch 42: 78856

Elapsed time for batch 43: 80879

Elapsed time for batch 44: 85223

Elapsed time for batch 45: 92639

Elapsed time for batch 46: 80106

I would be very grateful for suggestions what I may be doing wrong if I am or suggestions how to speed this up!

asked Nov 25 '18 at 11:39

Wolfone

404412

add a comment |

I am not sure what I may disclose of the data or the usecase but I will try my best to explain my efforts.

I have a table with the following columns-types (MySQL; InnoDB):

int(11) PK NN UQ

varchar(150) NN

varchar(400) NN

text 

text NN

varchar(45) NN UQ

varchar(80) NN

date NN

text 

varchar(300) 

varchar(300) 

varchar(500) 

varchar(260) 

varchar(200) 

varchar(45)

Iterating through the entire tar just looking at the data + parsing takes roughly 90 seconds +/-:

try (TarArchiveInputStream tarArchiveInputStream =

                     new TarArchiveInputStream(

                             new BufferedInputStream(

                                     new GzipCompressorInputStream(

                                             new FileInputStream(tarLocation))))){



...

    while ((entry = tarArchiveInputStream.getNextTarEntry()) != null && processedTarEntries < maxNumber) {

    ...PARSING + SOME STATISTICS....

    }

}

setPreparedStatementValues(preparedStatement, record, totalCount[0]++);



preparedStatement.addBatch();



counter[0]++;



if (counter[0] == BATCH_SIZE){

    counter[0] = 0;

    preparedStatement.executeBatch();

    connection.commit();

    watch.stop();

    System.out.println("Elapsed time for batch " + (totalCount[0] / BATCH_SIZE) + ": " + watch.getTime());

    watch.reset();

    watch.start();

}

The relevant part of the sout-output is the following (batchsize 5k/10k didn't really make a difference):

Elapsed time for batch 29: 3430

Elapsed time for batch 30: 3400

Elapsed time for batch 31: 3553

Elapsed time for batch 32: 3405

Elapsed time for batch 33: 3509

Elapsed time for batch 34: 3544

Elapsed time for batch 35: 6124

Elapsed time for batch 36: 5273

Elapsed time for batch 37: 9171

Elapsed time for batch 38: 8922

Elapsed time for batch 39: 24878

Elapsed time for batch 40: 68124

Elapsed time for batch 41: 70886

Elapsed time for batch 42: 78856

Elapsed time for batch 43: 80879

Elapsed time for batch 44: 85223

Elapsed time for batch 45: 92639

Elapsed time for batch 46: 80106

I would be very grateful for suggestions what I may be doing wrong if I am or suggestions how to speed this up!

asked Nov 25 '18 at 11:39

Wolfone

404412

I am not sure what I may disclose of the data or the usecase but I will try my best to explain my efforts.

I have a table with the following columns-types (MySQL; InnoDB):

int(11) PK NN UQ

varchar(150) NN

varchar(400) NN

text 

text NN

varchar(45) NN UQ

varchar(80) NN

date NN

text 

varchar(300) 

varchar(300) 

varchar(500) 

varchar(260) 

varchar(200) 

varchar(45)

Iterating through the entire tar just looking at the data + parsing takes roughly 90 seconds +/-:

try (TarArchiveInputStream tarArchiveInputStream =

                     new TarArchiveInputStream(

                             new BufferedInputStream(

                                     new GzipCompressorInputStream(

                                             new FileInputStream(tarLocation))))){



...

    while ((entry = tarArchiveInputStream.getNextTarEntry()) != null && processedTarEntries < maxNumber) {

    ...PARSING + SOME STATISTICS....

    }

}

setPreparedStatementValues(preparedStatement, record, totalCount[0]++);



preparedStatement.addBatch();



counter[0]++;



if (counter[0] == BATCH_SIZE){

    counter[0] = 0;

    preparedStatement.executeBatch();

    connection.commit();

    watch.stop();

    System.out.println("Elapsed time for batch " + (totalCount[0] / BATCH_SIZE) + ": " + watch.getTime());

    watch.reset();

    watch.start();

}

The relevant part of the sout-output is the following (batchsize 5k/10k didn't really make a difference):

Elapsed time for batch 29: 3430

Elapsed time for batch 30: 3400

Elapsed time for batch 31: 3553

Elapsed time for batch 32: 3405

Elapsed time for batch 33: 3509

Elapsed time for batch 34: 3544

Elapsed time for batch 35: 6124

Elapsed time for batch 36: 5273

Elapsed time for batch 37: 9171

Elapsed time for batch 38: 8922

Elapsed time for batch 39: 24878

Elapsed time for batch 40: 68124

Elapsed time for batch 41: 70886

Elapsed time for batch 42: 78856

Elapsed time for batch 43: 80879

Elapsed time for batch 44: 85223

Elapsed time for batch 45: 92639

Elapsed time for batch 46: 80106

I would be very grateful for suggestions what I may be doing wrong if I am or suggestions how to speed this up!

java mysql jdbc

asked Nov 25 '18 at 11:39

Wolfone

404412

asked Nov 25 '18 at 11:39

Wolfone

404412

asked Nov 25 '18 at 11:39

Wolfone

404412

asked Nov 25 '18 at 11:39

Wolfone

404412

asked Nov 25 '18 at 11:39

Wolfone

404412

add a comment |

0

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53467059%2fusing-jdbc-transfer-data-from-tar-gz-to-mysql-db-is-becoming-slow-after-200k-e%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

UGj7tn2C8aussfN1 PJB08w fvo H9MmLMieSZNEtXpCC

搜尋此網誌

Btukfyl