Using Python to create a very large binary frequency matrix to run collaborative filtering
I'm trying to run collaborative filtering (CF) on a large data set of medical diagnosis codes where each patient has 2 or more diagnoses. There are ~291K patients and ~8K unique codes. To run CF on this data, I need to create a binary frequency matrix where each patient is a row, each unique code is a column, and each cell is 1 if the patient has that diagnosis and 0 otherwise.
The problem is that this matrix has ~2.3 billion cells and my laptop with 16 GB of RAM can't process it. I tried it in R using the reshape package and it crashes. I wrote code in Python (below). If I subset the data to 500 patients, it takes around 24 hours to process. Does anyone have a better way to do this? Is the loop-within-a-loop structure too inefficient? Or should I somehow apply sparseMatrix in R to this data?
list samples:
subset_patients =
[['1510395', 'R31', 'N359', 'I639', 'C440', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'], ['1275226', 'T810', 'N813', 'N393', 'M8417', 'M679', 'M1997', 'L600', 'K529', 'R634', 'R15', 'N811', 'K573', 'K571', 'K222', 'D120', 'A099', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'], ...... ]
sorted_codes = ['A009', 'A010', 'A011', 'A014', 'A020', 'A021', 'A022', 'A028', 'A029', ... ]
my code:
bin_freq_matrix = pd.DataFrame(0, index = np.arange(len(subset_patients)), columns = sorted_codes)
count = -1
for row in subset_patients: #subset_patients is a small list of the patients
    for col in row:
        if col in sorted_codes: #sorted_codes is the unique codes list
            count = count+1
            bin_freq_matrix.at[count, col]=1
print(bin_freq_matrix.head())
NEWEST VERSION:
subset_patients = patients[0:1]

def marking(row):
    # here the traverse is in the natural order of columns
    hots = {col for col in row if col in sorted_codes_set}
    # here as well there are no jumps around the memory
    return [1 if col in hots else 0 for col in sorted_codes]

bin_freq_matrix = pd.DataFrame(subset_patients).apply(marking)
print(bin_freq_matrix)
for x in bin_freq_matrix[1]:
    if x==1:
        print("yes")
python
edited Nov 29 '18 at 18:00
asked Nov 26 '18 at 20:50
datascienceman1
1 Answer
Welcome to SO! There is indeed room for a more efficient solution here, and at least a few things you can optimize. Let's look at them step by step, moving towards fuller use of pandas functionality.
- Optimize the body of the loop.
Interestingly, you do not need to change the actual code constructing the matrix much. It is enough to change the definition of the data structure to make your code so much more efficient! The following line of code:
if col in sorted_codes: #sorted_codes is the unique codes list
takes a significant performance toll because a membership test on a list is linear, O(n), in the length of the list, whereas the same test on a set takes constant time on average, O(1). You can get that by building a set copy of sorted_codes and using it for the membership check:
sorted_codes_set = set(sorted_codes)
Sorting the list does not help unless you use binary search, and even then each lookup is O(log n), which is still slower than a set's average O(1), and you would have to implement the search yourself (or use the bisect module). The choice is easy: sets.
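To see the difference on data of roughly this size, here is a minimal sketch that times the same membership test against a list and against a set. The 8,000 synthetic code strings and the helper name time_membership are made up for illustration; absolute timings depend on the machine, so none are quoted here.

```python
import timeit

# Hypothetical stand-in for the ~8K unique diagnosis codes (synthetic values).
sorted_codes = ["C%04d" % i for i in range(8000)]
sorted_codes_set = set(sorted_codes)


def time_membership(container, probe, repeats=2000):
    """Return total seconds spent on `repeats` membership tests of `probe`."""
    return timeit.timeit(lambda: probe in container, number=repeats)


probe = sorted_codes[-1]  # worst case for the list: the last element
list_seconds = time_membership(sorted_codes, probe)
set_seconds = time_membership(sorted_codes_set, probe)

print("list: %.6f s, set: %.6f s" % (list_seconds, set_seconds))
```

The probe is the last element on purpose: the list has to scan all 8,000 entries to find it, while the set goes through a hash lookup.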
- Removal of unnecessary operations from the loop.
The code in the loop is going to be repeated billions of times (in your case) so it should be maximally optimized.
The following line writes to the DataFrame one cell at a time, in effectively random order. That is a bad idea: every scalar assignment carries pandas indexing overhead, and scattered single-cell writes can be orders of magnitude slower than filling a whole row at once:
bin_freq_matrix.at[count, col]=1
- Use apply and a function instead of the for loop. This is likely to bring the largest gain.
The final piece of code:
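def marking(row):
    # collect the codes present in this patient's row (one pass over the row)
    hots = {col for col in row if col in sorted_codes_set}
    # emit one 0/1 flag per code, in the column order of sorted_codes
    return [1 if col in hots else 0 for col in sorted_codes]

# axis=1 calls marking once per patient row (the default axis=0 would go column by column);
# result_type='expand' turns each returned list into one row of columns
bin_freq_matrix = pd.DataFrame(subset_patients).apply(marking, axis=1, result_type='expand')
bin_freq_matrix.columns = sorted_codes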
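The question also asks about sparse matrices. The sketch below is not part of the answer above: it assumes SciPy is installed and builds the same 0/1 matrix as a scipy.sparse.csr_matrix, which stores only the cells that are 1. For scale, a dense 291K x 8K matrix of 8-byte integers would need about 291,000 * 8,000 * 8 bytes, roughly 18.6 GB, which is more than the 16 GB mentioned in the question. The function name build_sparse_matrix and the shortened two-patient sample are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_sparse_matrix(patients, sorted_codes):
    """Build a patients x codes 0/1 matrix, storing only the nonzero cells.

    Each patient row is assumed to look like
    [patient_id, code, code, ..., 'NA', ...], as in the question's sample.
    """
    col_index = {code: j for j, code in enumerate(sorted_codes)}
    rows, cols = [], []
    for i, row in enumerate(patients):
        # set() drops repeated codes so each (row, col) pair appears once
        for code in set(row[1:]):
            j = col_index.get(code)  # None for 'NA' or unknown codes
            if j is not None:
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows), dtype=np.int8)
    return csr_matrix((data, (rows, cols)),
                      shape=(len(patients), len(sorted_codes)))


# Tiny illustrative sample (codes taken from the question, rows shortened).
sorted_codes = ['C440', 'I639', 'N359', 'R31', 'T810']
patients = [['1510395', 'R31', 'N359', 'I639', 'C440', 'NA'],
            ['1275226', 'T810', 'NA', 'NA', 'NA', 'NA']]

matrix = build_sparse_matrix(patients, sorted_codes)
print(matrix.toarray())
```

Row i of the result corresponds to patients[i] and column j to sorted_codes[j]. Memory use grows with the number of diagnoses actually recorded rather than with patients times codes.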
@datascienceman1 - if you found it helpful please mark as an answer and upvote.
– sophros
Nov 28 '18 at 10:34
First of all, thank you so much for the assistance. I figured that someone with an advanced CS background could solve this issue. I apologize for the response taking a day. I really want to understand what's going on in your code and I'm thinking about it. I don't have a lot of experience with functions yet, but it looks like the function followed by apply is doing the work of the entire loop, creating the matrix a row at a time? The problem is I'm getting an invalid syntax error on your return statement for some reason. Trying to figure out why. Is it because I'm running 2.7? Thanks again.
– datascienceman1
Nov 28 '18 at 16:45
Yes, it is likely due to my use of Python 3 syntax. It is also a generally faster choice. I suggest you try it too.
– sophros
Nov 28 '18 at 17:26
Sorry, my previous comment was a mess due to loss of indentation. I posted an edit above containing the list error.
– datascienceman1
Nov 28 '18 at 17:29
I have revised my answer. There was a minor mistake where I assumed subset_patients to be a pandas DataFrame.
– sophros
Nov 29 '18 at 7:40
edited Nov 29 '18 at 7:39
answered Nov 27 '18 at 9:10
sophros