Using Python to create a very large binary frequency matrix to run collaborative filtering

I'm trying to run collaborative filtering on a large data set of med codes where each patient has 2 or more diagnoses. There are ~291K patients, and there are ~8K unique codes. In order to run CF on this data, I need to create a binary frequency matrix where each unique code is a column and there is a 0 or 1 in each patient's row and column if the disease is present or not.

The problem is that this data set has ~2.3 billion cells and my laptop with 16 GB of RAM can't process it. I tried it in R using the reshape package and it crashes. I wrote code in Python (below). If I subset the data to 500 patients, it takes around 24 hours to process. Does anyone have a better way to do this? I'm wondering if the loop-within-a-loop structure is too inefficient? Or should I apply sparseMatrix in R somehow to this data?

list samples:

subset_patients =

[['1510395', 'R31', 'N359', 'I639', 'C440', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'], ['1275226', 'T810', 'N813', 'N393', 'M8417', 'M679', 'M1997', 'L600', 'K529', 'R634', 'R15', 'N811', 'K573', 'K571', 'K222', 'D120', 'A099', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'], ...... ]



sorted_codes = ['A009', 'A010', 'A011', 'A014', 'A020', 'A021', 'A022', 'A028', 'A029', ... ]

my code:

bin_freq_matrix = pd.DataFrame(0, index = np.arange(len(subset_patients)), columns = sorted_codes)

count = -1
for row in subset_patients:  #subset_patients is a small list of the patients
    for col in row:
        if col in sorted_codes:  #sorted_codes is the unique codes list
            count = count+1
            bin_freq_matrix.at[count, col]=1

print(bin_freq_matrix.head())

NEWEST VERSION:

subset_patients = patients[0:1]

def marking(row):
    # here the traverse is in the natural order of columns
    hots = {col for col in row if col in sorted_codes_set}
    # here as well there are no jumps around the memory
    return [1 if col in hots else 0 for col in sorted_codes]

bin_freq_matrix = pd.DataFrame(subset_patients).apply(marking)

print(bin_freq_matrix)

for x in bin_freq_matrix[1]:
    if x==1:
        print("yes")

edited Nov 29 '18 at 18:00

asked Nov 26 '18 at 20:50

datascienceman1

python


1 Answer

Welcome to SO! Indeed, there is a more efficient solution here. There are at least a few things you can optimize. Let's look at them step by step, moving towards a more comprehensive use of pandas functionality.

Optimize the body of the loop
Interestingly, you do not need to change the actual code constructing the matrix much. It is enough to change the definition of the data structure to make your code so much more efficient! The following line of code:

if col in sorted_codes: #sorted_codes is the unique codes list

takes a significant performance toll because a membership test on a list is linear, O(n), in the length of the list: Python may have to compare against all ~8K codes for every entry. A set is a hash table and answers the same test in constant time on average. You can take advantage of this by building a set copy of sorted_codes and using it for the membership check:

sorted_codes_set = set(sorted_codes)

Sorting the list does not help unless you use binary search. Binary search is O(log n) per lookup, which is still slower than a set's average O(1), and you would have to implement the search yourself (or use the bisect module). The choice is easy: sets.
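To make the difference concrete, here is a small timing sketch that uses only the standard library. The code list is synthetic (8,000 made-up strings standing in for sorted_codes), and the absolute timings will vary by machine; only the relative gap between the two numbers is of interest.

```python
import timeit

# Synthetic stand-in for sorted_codes: 8,000 made-up code strings (assumption).
sorted_codes = ["C%04d" % i for i in range(8000)]
sorted_codes_set = set(sorted_codes)

# Worst case for the list: the element sits at the very end.
probe = sorted_codes[-1]

# Each measurement repeats the membership test 1,000 times.
list_time = timeit.timeit(lambda: probe in sorted_codes, number=1000)
set_time = timeit.timeit(lambda: probe in sorted_codes_set, number=1000)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

Both containers give the same answer for every lookup; only the cost per lookup differs.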

Remove unnecessary operations from the loop
The code in the loop is going to be repeated billions of times (in your case) so it should be maximally optimized.

The following line writes to the DataFrame one cell at a time, jumping between columns in whatever order the codes happen to appear. That is a bad idea: every scalar write carries pandas overhead, and scattered access like this can be orders of magnitude slower than filling a row in one sequential pass:

bin_freq_matrix.at[count, col]=1

Use apply and a function instead of the for loop. This is likely to bring the largest gain.

The final piece of code:

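sorted_codes_set = set(sorted_codes)

def marking(row):
    # collect the codes present in this patient's row (one set lookup per entry)
    hots = {col for col in row if col in sorted_codes_set}
    # emit one 0/1 flag per code, walking sorted_codes in column order
    return [1 if col in hots else 0 for col in sorted_codes]

# axis=1 calls marking once per patient row; result_type='expand' spreads
# each returned list across columns (requires pandas 0.23 or later)
bin_freq_matrix = pd.DataFrame(subset_patients).apply(marking, axis=1, result_type='expand')
bin_freq_matrix.columns = sorted_codes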
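A dense matrix of this size is also a memory problem in its own right: roughly 2.3 billion cells at 8 bytes each (the int64 that pandas normally uses for integer columns) comes to something like 18 to 19 GB, more than the 16 GB available. Since each patient has only a handful of codes, a sparse representation that stores just the 1s is a natural alternative, in the spirit of the sparseMatrix idea from the question. The sketch below is not part of the original answer. It collects (row, column) coordinates in plain Python and, assuming SciPy is installed, hands them to scipy.sparse.csr_matrix. The two sample rows are shortened from the question, the three-code sorted_codes list is made up for illustration, and the helper names build_coordinates and to_csr are hypothetical. The row index comes from enumerate, so it advances once per patient, whereas count in the question's loop advances once per matched code.

```python
def build_coordinates(patients, codes):
    """Return (rows, cols) index lists, one pair per 1 in the patient x code matrix."""
    col_of = {code: j for j, code in enumerate(codes)}  # code -> column index
    rows, cols = [], []
    for i, record in enumerate(patients):  # i is the matrix row for this patient
        # record[0] is the patient ID; the set drops repeated codes and 'NA' padding
        present = {c for c in record[1:] if c in col_of}
        for c in sorted(present):
            rows.append(i)
            cols.append(col_of[c])
    return rows, cols


def to_csr(rows, cols, shape):
    """Build a 0/1 SciPy CSR matrix from the coordinates (requires NumPy and SciPy)."""
    import numpy as np
    from scipy.sparse import csr_matrix
    data = np.ones(len(rows), dtype=np.int32)
    return csr_matrix((data, (rows, cols)), shape=shape)


# Shortened sample rows from the question; sorted_codes is a made-up subset.
subset_patients = [
    ["1510395", "R31", "N359", "I639", "C440", "NA", "NA"],
    ["1275226", "T810", "N813", "A099", "NA", "NA", "NA"],
]
sorted_codes = ["A099", "C440", "R31"]

rows, cols = build_coordinates(subset_patients, sorted_codes)
print(list(zip(rows, cols)))
```

With SciPy available, to_csr(rows, cols, (len(subset_patients), len(sorted_codes))) returns the matrix. Its storage grows with the number of recorded diagnoses rather than with patients times codes, which is what makes the full data set plausible on a laptop.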

edited Nov 29 '18 at 7:39

answered Nov 27 '18 at 9:10

sophros


@datascienceman1 - if you found it helpful please mark as an answer and upvote.

– sophros
Nov 28 '18 at 10:34

First of all, thank you so much for the assistance. I figured that someone with an advanced CS background could solve this issue. I apologize for the response taking a day. I really want to understand what's going on in your code and I'm thinking about it. I don't have a lot of experience with functions yet, but it looks like the function followed by apply is doing the work of the entire loop, creating the matrix a row at a time? The problem is I'm getting an invalid syntax error on your return statement for some reason. Trying to figure out why. Is it because I'm running 2.7? Thanks again.

– datascienceman1
Nov 28 '18 at 16:45

Yes, it is likely due to my use of Python 3 syntax. It is also a generally faster choice. I suggest you try it too.

– sophros
Nov 28 '18 at 17:26

Sorry, my previous comment was a mess due to loss of indentation. I posted an edit above containing the list error.

– datascienceman1
Nov 28 '18 at 17:29

I have revised my answer. There was a minor mistake where I assumed subset_patients to be a pandas DataFrame.

– sophros
Nov 29 '18 at 7:40
