Extracting all reads from bam file which match read IDs in another file
up vote
2
down vote
favorite
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
add a comment |
up vote
2
down vote
favorite
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
1
Just as a general rule never dofor var in $(cat file). Also known as bash pitfall number 1.
– terdon♦
4 hours ago
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
4 hours ago
Yeah, most of us use it very often until it bites us! :)
– terdon♦
4 hours ago
add a comment |
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
bam samtools reads
asked 6 hours ago
d_kennetz
1458
1458
1
Just as a general rule never dofor var in $(cat file). Also known as bash pitfall number 1.
– terdon♦
4 hours ago
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
4 hours ago
Yeah, most of us use it very often until it bites us! :)
– terdon♦
4 hours ago
add a comment |
1
Just as a general rule never dofor var in $(cat file). Also known as bash pitfall number 1.
– terdon♦
4 hours ago
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
4 hours ago
Yeah, most of us use it very often until it bites us! :)
– terdon♦
4 hours ago
1
1
Just as a general rule never do
for var in $(cat file). Also known as bash pitfall number 1.– terdon♦
4 hours ago
Just as a general rule never do
for var in $(cat file). Also known as bash pitfall number 1.– terdon♦
4 hours ago
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
4 hours ago
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
4 hours ago
Yeah, most of us use it very often until it bites us! :)
– terdon♦
4 hours ago
Yeah, most of us use it very often until it bites us! :)
– terdon♦
4 hours ago
add a comment |
2 Answers
2
active
oldest
votes
up vote
4
down vote
accepted
It is still slow but grep has a -f option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
add a comment |
up vote
2
down vote
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
oh good call on the-wforgot about that
– Bioathlete
3 hours ago
@Bioathlete it's the-mthat really speeds it up though. But yeah, the-wis needed for safety.
– terdon♦
3 hours ago
add a comment |
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
accepted
It is still slow but grep has a -f option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
add a comment |
up vote
4
down vote
accepted
It is still slow but grep has a -f option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
add a comment |
up vote
4
down vote
accepted
up vote
4
down vote
accepted
It is still slow but grep has a -f option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
It is still slow but grep has a -f option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
answered 5 hours ago
Bioathlete
1,695215
1,695215
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
add a comment |
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
just as a warning this can take much longer if the read name file is big.
– Bioathlete
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
5 hours ago
add a comment |
up vote
2
down vote
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
oh good call on the-wforgot about that
– Bioathlete
3 hours ago
@Bioathlete it's the-mthat really speeds it up though. But yeah, the-wis needed for safety.
– terdon♦
3 hours ago
add a comment |
up vote
2
down vote
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
oh good call on the-wforgot about that
– Bioathlete
3 hours ago
@Bioathlete it's the-mthat really speeds it up though. But yeah, the-wis needed for safety.
– terdon♦
3 hours ago
add a comment |
up vote
2
down vote
up vote
2
down vote
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
edited 3 hours ago
answered 3 hours ago
terdon♦
3,8411726
3,8411726
oh good call on the-wforgot about that
– Bioathlete
3 hours ago
@Bioathlete it's the-mthat really speeds it up though. But yeah, the-wis needed for safety.
– terdon♦
3 hours ago
add a comment |
oh good call on the-wforgot about that
– Bioathlete
3 hours ago
@Bioathlete it's the-mthat really speeds it up though. But yeah, the-wis needed for safety.
– terdon♦
3 hours ago
oh good call on the
-w forgot about that– Bioathlete
3 hours ago
oh good call on the
-w forgot about that– Bioathlete
3 hours ago
@Bioathlete it's the
-m that really speeds it up though. But yeah, the -w is needed for safety.– terdon♦
3 hours ago
@Bioathlete it's the
-m that really speeds it up though. But yeah, the -w is needed for safety.– terdon♦
3 hours ago
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5600%2fextracting-all-reads-from-bam-file-which-match-read-ids-in-another-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Just as a general rule never do
for var in $(cat file). Also known as bash pitfall number 1.– terdon♦
4 hours ago
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
4 hours ago
Yeah, most of us use it very often until it bites us! :)
– terdon♦
4 hours ago