Memory allocation error in sklearn random forest classification python












I am trying to run a scikit-learn random forest classification on 279,900 instances with 5 attributes and 1 class label. The call to fit raises a memory allocation error, so the classifier cannot even be trained. Any suggestions on how to resolve this issue?



The data is:

x, y, day, week, Accuracy

x and y are the coordinates,
day is the day of the month (1-30),
week is the day of the week (1-7),
and Accuracy is an integer.



Code:



# Python 2 code (print statements, binary-mode csv reading).
import csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Read the class labels (9th column) from the CSV file.
with open("time_data.csv", "rb") as infile:
    re1 = csv.reader(infile)
    result = []
    # next(re1, None)  # uncomment to skip a header row
    for row in re1:
        result.append(row[8])

trainclass = result[:251900]
testclass = result[251901:279953]

# Read the five feature columns from the same file.
with open("time_data.csv", "rb") as infile:
    re = csv.reader(infile)
    coords = [(float(d[1]), float(d[2]), float(d[3]), float(d[4]), float(d[5])) for d in re if len(d) > 0]

train = coords[:251900]
test = coords[251901:279953]

print "Done splitting data into test and train data"

clf = RandomForestClassifier(n_estimators=500, max_features="log2", min_samples_split=3, min_samples_leaf=2)
clf.fit(train, trainclass)

print "Done training"
score = clf.score(test, testclass)
print "Done Testing"
print score


Error:



line 366, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn/tree/_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn/tree/_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn/tree/_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 10206838784 bytes









Tags: python, scikit-learn, random-forest






asked Nov 28 '18 at 19:02 by Labeo, edited Nov 28 '18 at 19:36


1 Answer

From the scikit-learn documentation: "The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."



I would try adjusting these parameters. You could also run a memory profiler, or try running the job on Google Colab if your machine has too little RAM.
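For example, a minimal sketch (the parameter values below are only illustrative starting points, not settings from this answer; train and trainclass are the arrays built in the question):

from sklearn.ensemble import RandomForestClassifier

# Bound the size of each tree so the whole forest stays small in memory.
# All values here are illustrative, not tuned for this data set.
clf = RandomForestClassifier(
    n_estimators=100,      # fewer trees than the original 500
    max_depth=10,          # hard cap on depth; the default (None) grows trees fully
    min_samples_leaf=10,   # coarser leaves mean far fewer stored nodes
    max_features="log2",
    n_jobs=1,              # parallel workers multiply memory use; keep this low if RAM is tight
)
clf.fit(train, trainclass)

With only five features, max_features="log2" already restricts each split to two candidate features, so most of the memory here comes from tree depth and the number of trees.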






answered Nov 28 '18 at 19:17 by jbuchel
• I tried with max_features="log2", min_samples_split=3, min_samples_leaf=2 as my parameters, but I am still facing the same issue; I might try max_depth. I have 16 GB of RAM.
  – Labeo, Nov 28 '18 at 19:32

• And given the number of features I have, the depth can't be very large anyway, right?
  – Labeo, Nov 28 '18 at 19:35

• I would definitely set a max_depth. Decision trees overfit drastically at high depth; usually a depth of 6 is sufficient, but this of course depends on your model.
  – jbuchel, Nov 28 '18 at 20:19

• Is it possible to run on chunks of the data? When I tried with 25,000 points it ran. I assume it won't help, since in the end the data is the same.
  – Labeo, Nov 28 '18 at 22:31

• I think you can do that, but you cannot train two models on different chunks of the data, since the results will be different. Have you tried max_depth?
  – jbuchel, Nov 28 '18 at 22:37
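A quick way to see how much a depth cap actually shrinks the trees on this data (a sketch, not part of the original thread; it reuses the train and trainclass arrays from the question and the 25,000-point subset mentioned above):

from sklearn.ensemble import RandomForestClassifier

# Compare total stored node counts with and without a depth cap on a small subset.
for depth in (None, 10, 6):
    clf = RandomForestClassifier(n_estimators=10, max_depth=depth,
                                 min_samples_leaf=2, max_features="log2")
    clf.fit(train[:25000], trainclass[:25000])
    nodes = sum(est.tree_.node_count for est in clf.estimators_)
    print("max_depth=%s -> %d nodes across %d trees" % (depth, nodes, len(clf.estimators_)))

If the node count drops sharply at max_depth=10 or 6, the same cap applied to the full 500-tree forest should bring the allocation well under the roughly 10 GB reported in the traceback.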











