Combine rows while importing txt file if they don't meet a condition












0


















ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns





I have a large pipe delimitated file, Approx 2 million rows (2gs). The file is supposed to have 418 columns (|) per row, however, many of the rows have been unnecessarily split causing problems when importing the data.



When importing I would like to combine rows until the count of pipes in the subsequent row == 418.



Most of the problems occur at column 90 and are followed by a row with 328. Others are split at 90, followed by a few rows of 0 and then 328. Ideally, all of these rows would be combined into one.



For example, the highlighted rows should be combined into one (rows with 0 still contain information



I thought about appending the incorrect rows to a list, and then combining them, but at 1 second per row, this would take approximately 26 days to complete.



I also tried to combine the lines before appending, but am afraid I will run into the same efficiency problems.



%%time

correct =
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for index, line in enumerate(fh):
if index<20:
if line.count('|')!=418:
incorrect.append(line)









share|improve this question

























  • I suggest that you use the csv package. You can specify what delimiter to use.

    – Code-Apprentice
    Nov 27 '18 at 19:26











  • Yes, I am using import csv package, and I can specify the Pipe delimiter. The problem is not all of the rows contain the correct number of delimiters.

    – schlafdata
    Nov 27 '18 at 19:31











  • Can you give an example of a few rows of input? Are there any text fields that contain new line characters?

    – Code-Apprentice
    Nov 27 '18 at 21:13











  • Install 2 EBKs, KP, install HUBs at teller and Access bar , use HUBs removed from Tellers. TCR connection . OT Labor

    – schlafdata
    Nov 27 '18 at 21:50











  • That is the end of a line that was cut off. Wish I could post more examples but the information is sensitive, there are no n

    – schlafdata
    Nov 27 '18 at 21:51
















0


















ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns





I have a large pipe delimitated file, Approx 2 million rows (2gs). The file is supposed to have 418 columns (|) per row, however, many of the rows have been unnecessarily split causing problems when importing the data.



When importing I would like to combine rows until the count of pipes in the subsequent row == 418.



Most of the problems occur at column 90 and are followed by a row with 328. Others are split at 90, followed by a few rows of 0 and then 328. Ideally, all of these rows would be combined into one.



For example, the highlighted rows should be combined into one (rows with 0 still contain information



I thought about appending the incorrect rows to a list, and then combining them, but at 1 second per row, this would take approximately 26 days to complete.



I also tried to combine the lines before appending, but am afraid I will run into the same efficiency problems.



%%time

correct =
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for index, line in enumerate(fh):
if index<20:
if line.count('|')!=418:
incorrect.append(line)









share|improve this question

























  • I suggest that you use the csv package. You can specify what delimiter to use.

    – Code-Apprentice
    Nov 27 '18 at 19:26











  • Yes, I am using import csv package, and I can specify the Pipe delimiter. The problem is not all of the rows contain the correct number of delimiters.

    – schlafdata
    Nov 27 '18 at 19:31











  • Can you give an example of a few rows of input? Are there any text fields that contain new line characters?

    – Code-Apprentice
    Nov 27 '18 at 21:13











  • Install 2 EBKs, KP, install HUBs at teller and Access bar , use HUBs removed from Tellers. TCR connection . OT Labor

    – schlafdata
    Nov 27 '18 at 21:50











  • That is the end of a line that was cut off. Wish I could post more examples but the information is sensitive, there are no n

    – schlafdata
    Nov 27 '18 at 21:51














0












0








0











ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns





I have a large pipe delimitated file, Approx 2 million rows (2gs). The file is supposed to have 418 columns (|) per row, however, many of the rows have been unnecessarily split causing problems when importing the data.



When importing I would like to combine rows until the count of pipes in the subsequent row == 418.



Most of the problems occur at column 90 and are followed by a row with 328. Others are split at 90, followed by a few rows of 0 and then 328. Ideally, all of these rows would be combined into one.



For example, the highlighted rows should be combined into one (rows with 0 still contain information



I thought about appending the incorrect rows to a list, and then combining them, but at 1 second per row, this would take approximately 26 days to complete.



I also tried to combine the lines before appending, but am afraid I will run into the same efficiency problems.



%%time

correct =
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for index, line in enumerate(fh):
if index<20:
if line.count('|')!=418:
incorrect.append(line)









share|improve this question



















ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns





I have a large pipe delimitated file, Approx 2 million rows (2gs). The file is supposed to have 418 columns (|) per row, however, many of the rows have been unnecessarily split causing problems when importing the data.



When importing I would like to combine rows until the count of pipes in the subsequent row == 418.



Most of the problems occur at column 90 and are followed by a row with 328. Others are split at 90, followed by a few rows of 0 and then 328. Ideally, all of these rows would be combined into one.



For example, the highlighted rows should be combined into one (rows with 0 still contain information



I thought about appending the incorrect rows to a list, and then combining them, but at 1 second per row, this would take approximately 26 days to complete.



I also tried to combine the lines before appending, but am afraid I will run into the same efficiency problems.



%%time

correct =
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for index, line in enumerate(fh):
if index<20:
if line.count('|')!=418:
incorrect.append(line)





ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns





ROW_ID|Quote Number|Status|Status Reason ADT|Name|Account|Alias ADT|......etc 418 Columns






python






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 27 '18 at 19:22









dustinos3

4591717




4591717










asked Nov 27 '18 at 18:51









schlafdataschlafdata

112




112













  • I suggest that you use the csv package. You can specify what delimiter to use.

    – Code-Apprentice
    Nov 27 '18 at 19:26











  • Yes, I am using import csv package, and I can specify the Pipe delimiter. The problem is not all of the rows contain the correct number of delimiters.

    – schlafdata
    Nov 27 '18 at 19:31











  • Can you give an example of a few rows of input? Are there any text fields that contain new line characters?

    – Code-Apprentice
    Nov 27 '18 at 21:13











  • Install 2 EBKs, KP, install HUBs at teller and Access bar , use HUBs removed from Tellers. TCR connection . OT Labor

    – schlafdata
    Nov 27 '18 at 21:50











  • That is the end of a line that was cut off. Wish I could post more examples but the information is sensitive, there are no n

    – schlafdata
    Nov 27 '18 at 21:51



















  • I suggest that you use the csv package. You can specify what delimiter to use.

    – Code-Apprentice
    Nov 27 '18 at 19:26











  • Yes, I am using import csv package, and I can specify the Pipe delimiter. The problem is not all of the rows contain the correct number of delimiters.

    – schlafdata
    Nov 27 '18 at 19:31











  • Can you give an example of a few rows of input? Are there any text fields that contain new line characters?

    – Code-Apprentice
    Nov 27 '18 at 21:13











  • Install 2 EBKs, KP, install HUBs at teller and Access bar , use HUBs removed from Tellers. TCR connection . OT Labor

    – schlafdata
    Nov 27 '18 at 21:50











  • That is the end of a line that was cut off. Wish I could post more examples but the information is sensitive, there are no n

    – schlafdata
    Nov 27 '18 at 21:51

















I suggest that you use the csv package. You can specify what delimiter to use.

– Code-Apprentice
Nov 27 '18 at 19:26





I suggest that you use the csv package. You can specify what delimiter to use.

– Code-Apprentice
Nov 27 '18 at 19:26













Yes, I am using import csv package, and I can specify the Pipe delimiter. The problem is not all of the rows contain the correct number of delimiters.

– schlafdata
Nov 27 '18 at 19:31





Yes, I am using import csv package, and I can specify the Pipe delimiter. The problem is not all of the rows contain the correct number of delimiters.

– schlafdata
Nov 27 '18 at 19:31













Can you give an example of a few rows of input? Are there any text fields that contain new line characters?

– Code-Apprentice
Nov 27 '18 at 21:13





Can you give an example of a few rows of input? Are there any text fields that contain new line characters?

– Code-Apprentice
Nov 27 '18 at 21:13













Install 2 EBKs, KP, install HUBs at teller and Access bar , use HUBs removed from Tellers. TCR connection . OT Labor

– schlafdata
Nov 27 '18 at 21:50





Install 2 EBKs, KP, install HUBs at teller and Access bar , use HUBs removed from Tellers. TCR connection . OT Labor

– schlafdata
Nov 27 '18 at 21:50













That is the end of a line that was cut off. Wish I could post more examples but the information is sensitive, there are no n

– schlafdata
Nov 27 '18 at 21:51





That is the end of a line that was cut off. Wish I could post more examples but the information is sensitive, there are no n

– schlafdata
Nov 27 '18 at 21:51












1 Answer
1






active

oldest

votes


















1














It turns on when you use Enumerate it takes drastically longer. I removed that part of my code, appended all of the problem rows to a list. It took 34 seconds Vs the 26 days I anticipated. Then I joined the entire list, and split the list every 418 pipes






correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]








share|improve this answer


























  • Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

    – Tomothy32
    Nov 28 '18 at 1:28











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53506319%2fcombine-rows-while-importing-txt-file-if-they-dont-meet-a-condition%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














It turns on when you use Enumerate it takes drastically longer. I removed that part of my code, appended all of the problem rows to a list. It took 34 seconds Vs the 26 days I anticipated. Then I joined the entire list, and split the list every 418 pipes






correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]








share|improve this answer


























  • Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

    – Tomothy32
    Nov 28 '18 at 1:28
















1














It turns on when you use Enumerate it takes drastically longer. I removed that part of my code, appended all of the problem rows to a list. It took 34 seconds Vs the 26 days I anticipated. Then I joined the entire list, and split the list every 418 pipes






correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]








share|improve this answer


























  • Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

    – Tomothy32
    Nov 28 '18 at 1:28














1












1








1







It turns on when you use Enumerate it takes drastically longer. I removed that part of my code, appended all of the problem rows to a list. It took 34 seconds Vs the 26 days I anticipated. Then I joined the entire list, and split the list every 418 pipes






correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]








share|improve this answer















It turns on when you use Enumerate it takes drastically longer. I removed that part of my code, appended all of the problem rows to a list. It took 34 seconds Vs the 26 days I anticipated. Then I joined the entire list, and split the list every 418 pipes






correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]








correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]





correct = 
incorrect =

with open('C:/Users/jschlajo/Desktop/export_all_quotes_compass.txt', 'r') as fh:
for line in fh:
if line.count('|')==418:
correct.append(line)
if line.count('|')!=418:
incorrect.append(line)


test_1 = ' '.join(incorrect)
i = iter(test_1.split('|'))


span = 418
words = test_1.split("|")
combined = ["|".join(words[i:i+span]) for i in range(0, len(words), span)]






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 28 '18 at 15:28

























answered Nov 27 '18 at 23:32









schlafdataschlafdata

112




112













  • Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

    – Tomothy32
    Nov 28 '18 at 1:28



















  • Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

    – Tomothy32
    Nov 28 '18 at 1:28

















Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

– Tomothy32
Nov 28 '18 at 1:28





Just a note for the future, please do not post images of code. Instead, paste it into your question/answer and format it as a code block.

– Tomothy32
Nov 28 '18 at 1:28




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53506319%2fcombine-rows-while-importing-txt-file-if-they-dont-meet-a-condition%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

Calculate evaluation metrics using cross_val_predict sklearn

Insert data from modal to MySQL (multiple modal on website)