Using Python to create a very large binary frequency matrix to run collaborative filtering












I'm trying to run collaborative filtering on a large data set of medical codes where each patient has two or more diagnoses. There are ~291K patients and ~8K unique codes. To run CF on this data, I need to create a binary frequency matrix where each unique code is a column and each patient's row holds a 1 if that diagnosis is present and a 0 otherwise.

The problem is that this matrix has ~2.3 billion cells and my laptop with 16 GB of RAM can't process it. I tried it in R using the reshape package and it crashes. I wrote code in Python (below). If I subset the data to 500 patients, it takes around 24 hours to process. Does anyone have a better way to do this? Is the loop-within-a-loop structure too inefficient? Or should I somehow apply sparseMatrix in R to this data?



list samples:



subset_patients =
[['1510395', 'R31', 'N359', 'I639', 'C440', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'], ['1275226', 'T810', 'N813', 'N393', 'M8417', 'M679', 'M1997', 'L600', 'K529', 'R634', 'R15', 'N811', 'K573', 'K571', 'K222', 'D120', 'A099', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'], ...... ]

sorted_codes = ['A009', 'A010', 'A011', 'A014', 'A020', 'A021', 'A022', 'A028', 'A029', ... ]


my code:



bin_freq_matrix = pd.DataFrame(0, index = np.arange(len(subset_patients)), columns = sorted_codes)

count = -1
for row in subset_patients: #subset_patients is a small list of the patients
    for col in row:
        if col in sorted_codes: #sorted_codes is the unique codes list
            count = count+1
            bin_freq_matrix.at[count, col]=1

print(bin_freq_matrix.head())


NEWEST VERSION:



subset_patients = patients[0:1]

def marking(row):
    # here the traverse is in the natural order of columns
    hots = {col for col in row if col in sorted_codes_set}
    # here as well there are no jumps around the memory
    return [1 if col in hots else 0 for col in sorted_codes]

bin_freq_matrix = pd.DataFrame(subset_patients).apply(marking)

print(bin_freq_matrix)
for x in bin_freq_matrix[1]:
    if x==1:
        print("yes")









Tags: python

asked Nov 26 '18 at 20:50, edited Nov 29 '18 at 18:00, by datascienceman1
























          1 Answer

Welcome to SO! There are at least a few things you can optimize here. Let's go through them step by step, moving towards fuller use of pandas functionality.





1. Optimize the body of the loop.
Interestingly, you do not need to change the code that constructs the matrix much. Changing the data structure used for the lookup is enough to make it far more efficient. The following line of code:

    if col in sorted_codes: #sorted_codes is the unique codes list

takes a significant performance toll, because a membership test on a list is linear in the length of the list (O(n)), while the same test on a set is a hash lookup that takes constant time on average (O(1)). You can get the faster behavior by checking against a set copy of sorted_codes:

    sorted_codes_set = set(sorted_codes)

Sorting the list does not help unless you use binary search, which is O(log n): better than a linear scan but still slower than a set lookup, and you would have to implement the search yourself (for example with the bisect module). The choice is easy: sets.
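
To see the difference on data of roughly this shape, here is a small timing sketch. The code list is a made-up stand-in for the ~8K real codes (the names `codes`, `codes_set` and `probe` are illustrative, not from the question), and the exact timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the ~8K unique codes from the question.
codes = ["C%05d" % i for i in range(8000)]
codes_set = set(codes)

# 'NA' padding is never a valid code, so the list has to be scanned to the end.
probe = "NA"

# Time 2000 membership tests against each structure.
list_time = timeit.timeit(lambda: probe in codes, number=2000)
set_time = timeit.timeit(lambda: probe in codes_set, number=2000)

print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

Most cells in the sample rows are 'NA', so the worst case for the list (a full scan that finds nothing) is also the common case here.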






2. Remove unnecessary operations from the loop.
The code in the loop is going to be repeated billions of times (in your case), so it should be as cheap as possible.

The following line writes into the DataFrame one cell at a time, in scattered order. That is a bad idea: every scalar assignment goes through the pandas indexing machinery, so filling a frame cell by cell can be orders of magnitude slower than building whole rows at once:

    bin_freq_matrix.at[count, col]=1
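
One way to avoid the per-cell writes is to build each row as a plain Python list and only create the DataFrame once at the end. The sketch below is an illustration under assumptions: the five-code list and the shortened patient records are made up from the samples in the question, `col_index` and `build_dense_rows` are hypothetical names, and the first element of each record is assumed to be the patient id.

```python
# Tiny hypothetical subset of the code list and two shortened patient records.
sorted_codes = ["A099", "C440", "I639", "N359", "R31"]
patients = [
    ["1510395", "R31", "N359", "I639", "C440", "NA", "NA"],
    ["1275226", "A099", "NA", "NA", "NA", "NA", "NA"],
]

# Map each code to its column position once, outside the loop.
col_index = {code: j for j, code in enumerate(sorted_codes)}


def build_dense_rows(records):
    """Return one 0/1 list per patient, in the column order of sorted_codes."""
    rows = []
    for record in records:
        row = [0] * len(sorted_codes)
        # record[0] is assumed to be the patient id, so it is skipped.
        for code in record[1:]:
            j = col_index.get(code)  # None for 'NA' padding or unknown codes
            if j is not None:
                row[j] = 1
        rows.append(row)
    return rows


rows = build_dense_rows(patients)
print(rows)  # [[0, 1, 1, 1, 1], [1, 0, 0, 0, 0]]
```

The finished list can then be handed to `pd.DataFrame(rows, columns=sorted_codes)` in a single call. For the full ~291K x ~8K data this dense form is still too large for 16 GB of RAM, so it only makes sense for a subset or for one chunk at a time.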





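3. Use apply and a function instead of the for loop. This is likely to bring the largest gain.

The final piece of code:

    sorted_codes_set = set(sorted_codes)

    def marking(row):
        # collect the codes present in this patient's row (fast set lookups)
        hots = {col for col in row if col in sorted_codes_set}
        # emit one 0/1 flag per code, in the column order of sorted_codes
        return [1 if col in hots else 0 for col in sorted_codes]

    # axis=1 calls marking once per patient row (the default, axis=0, goes column by column);
    # result_type='expand' spreads the returned list over separate columns
    bin_freq_matrix = pd.DataFrame(subset_patients).apply(marking, axis=1, result_type='expand')
    bin_freq_matrix.columns = sorted_codes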
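
Since the question mentions sparseMatrix: judging by the samples, each patient carries only a handful of codes, so almost every cell of the full matrix is 0, and storing only the positions of the 1s would avoid the memory problem. The following is a rough sketch of that idea in plain Python and goes beyond the apply-based code above. `to_coordinates` is a hypothetical helper, the data is a shortened, made-up version of the samples, and the first element of each record is assumed to be the patient id.

```python
# Tiny hypothetical subset of the code list and two shortened patient records.
sorted_codes = ["A099", "C440", "I639", "N359", "R31"]
patients = [
    ["1510395", "R31", "N359", "I639", "C440", "NA", "NA"],
    ["1275226", "A099", "NA", "NA", "NA", "NA", "NA"],
]


def to_coordinates(records, codes):
    """Return parallel lists (rows, cols) giving the positions of the 1s."""
    col_index = {code: j for j, code in enumerate(codes)}
    rows, cols = [], []
    for i, record in enumerate(records):
        # Use a set so a code repeated for one patient is recorded once;
        # 'NA' padding and unknown codes are dropped by the membership test.
        present = {col_index[c] for c in record[1:] if c in col_index}
        for j in sorted(present):
            rows.append(i)
            cols.append(j)
    return rows, cols


rows, cols = to_coordinates(patients, sorted_codes)
print(rows)  # [0, 0, 0, 0, 1]
print(cols)  # [1, 2, 3, 4, 0]
```

If SciPy is installed, these two lists can be passed to `scipy.sparse.coo_matrix(([1] * len(rows), (rows, cols)), shape=(len(patients), len(sorted_codes)))` to get an actual sparse matrix. Memory then grows with the number of diagnoses rather than with the ~2.3 billion cells.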






• @datascienceman1 - if you found it helpful please mark as an answer and upvote.
  – sophros, Nov 28 '18 at 10:34

• First of all, thank you so much for the assistance. I figured that someone with an advanced CS background could solve this issue. I apologize for the response taking a day.. I really want to understand what's going on in your code and I'm thinking about it. I don't have a lot of experience with functions yet, but it looks like the function followed by apply is doing the work of the entire loop, creating the matrix a row at a time? The problem is I'm getting an invalid syntax error on your return statement for some reason. Trying to figure out why. Is it because I'm running 2.7? Thanks again..
  – datascienceman1, Nov 28 '18 at 16:45

• Yes, it is likely due to my use of Python 3 syntax. It is also a generally faster choice. I suggest you try it too.
  – sophros, Nov 28 '18 at 17:26

• sorry previous comment was a mess due to loss of indentation.. I posted an edit above containing the list error
  – datascienceman1, Nov 28 '18 at 17:29

• I have revised my answer. There was a minor mistake where I assumed subset_patients to be a pandas DataFrame.
  – sophros, Nov 29 '18 at 7:40











answered Nov 27 '18 at 9:10 by sophros (edited Nov 29 '18 at 7:39)













          • @datascienceman1 - if you found it helpful please mark as an answer and upvote.

            – sophros
            Nov 28 '18 at 10:34











          • First of all, thank you so much for the assistance. I figured that someone with an advanced CS background could solve this issue. I apologize for the response taking a day.. I really want to understand what's going on in your code and I'm thinking about it. I don't have a lot of experience with functions yet, but it looks like the function followed by apply is doing the work of the entire loop, creating the matrix a row at a time? The problem is I'm getting an invalid syntax error on your return statement for some reason. Trying to figure out why. Is it because I'm running 2.7? Thanks again..

            – datascienceman1
            Nov 28 '18 at 16:45











          • Yes, it is likely due to my use of Python 3 syntax. It is also a generally faster choice. I suggest you try it too.

            – sophros
            Nov 28 '18 at 17:26











          • sorry previous comment was a mess due to loss of indentation.. I posted an edit above containing the list error

            – datascienceman1
            Nov 28 '18 at 17:29













          • I have revised my answer. There was a minor mistake where I assumed subset_patients to be a pandas DataFrame.

            – sophros
            Nov 29 '18 at 7:40






























































