Memory allocation error in sklearn random forest classification python
I am trying to run a scikit-learn random forest classification on 279,900 instances with 5 attributes and 1 class. I get a memory allocation error at the fit line; the classifier cannot even be trained. Any suggestions on how to resolve this issue?
The data is:
x, y, day, week, accuracy
x and y are the coordinates,
day is the day of the month (1-30),
week is the day of the week (1-7),
and accuracy is an integer.
Code:
import csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Read the class labels (9th column) from the CSV.
result = []
with open("time_data.csv", "rb") as infile:
    re1 = csv.reader(infile)
    # next(re1, None)  # skip the header row if there is one
    for row in re1:
        result.append(row[8])

trainclass = result[:251900]
testclass = result[251901:279953]

# Read the five feature columns from the CSV.
with open("time_data.csv", "rb") as infile:
    re = csv.reader(infile)
    coords = [(float(d[1]), float(d[2]), float(d[3]), float(d[4]), float(d[5]))
              for d in re if len(d) > 0]

train = coords[:251900]
test = coords[251901:279953]
print "Done splitting data into test and train data"

clf = RandomForestClassifier(n_estimators=500, max_features="log2",
                             min_samples_split=3, min_samples_leaf=2)
clf.fit(train, trainclass)
print "Done training"

score = clf.score(test, testclass)
print "Done Testing"
print score
Error:
line 366, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn/tree/_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn/tree/_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn/tree/_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 10206838784 bytes
python scikit-learn random-forest
1 Answer
From the scikit-learn doc.: "The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."
I would try adjusting these parameters. You can also try a memory profiler, or run it on Google Colaboratory if your machine has too little RAM.
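As a rough sketch of that advice (not a tuned fix): the snippet below assumes the train/trainclass/test/testclass variables from the question's code, caps tree growth with max_depth and a larger min_samples_leaf, trims n_estimators, and converts the inputs to float32 NumPy arrays, which may avoid an intermediate float64 copy during fitting. The specific values are illustrative.

    # Illustrative sketch: limit tree size to cut memory use.
    # Assumes train, trainclass, test, testclass exist as in the question.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.asarray(train, dtype=np.float32)   # float32 halves feature memory
    y_train = np.asarray(trainclass)
    X_test = np.asarray(test, dtype=np.float32)
    y_test = np.asarray(testclass)

    clf = RandomForestClassifier(
        n_estimators=100,        # fewer trees than 500; raise later if RAM allows
        max_depth=12,            # hard cap on tree depth (illustrative value)
        min_samples_leaf=5,      # larger leaves mean fewer nodes per tree
        max_features="log2",
        n_jobs=1,                # parallel tree building multiplies peak memory
    )
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))

Whether a depth of 12 is enough is data-dependent; start shallow and increase it only while validation accuracy keeps improving.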
I tried with max_features="log2", min_samples_split=3, min_samples_leaf=2 as my parameters, but I am still facing the same issue. I might try max_depth. I have 16 GB of RAM.
– Labeo
Nov 28 '18 at 19:32
And depending on the number of features I have, the depth can't be that large anyway, right?
– Labeo
Nov 28 '18 at 19:35
I would definitely set a max_depth. Decision trees overfit drastically at high depth. Usually a depth of 6 is sufficient, but this of course depends on your model.
– jbuchel
Nov 28 '18 at 20:19
Is it possible to run on chunks of the data? It ran when I tried it on 25,000 points. I assume it won't help, since in the end the data is the same.
– Labeo
Nov 28 '18 at 22:31
I think you can do that, but you cannot train two models on different chunks of data, since the results will be different. Have you tried max_depth?
– jbuchel
Nov 28 '18 at 22:37
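On the chunking question, one hedged option: scikit-learn's warm_start flag lets you add trees to an existing forest in batches instead of building all of them in a single fit call. The sketch below assumes the X_train/y_train arrays from the earlier snippet, and it does not shrink the finished model (every tree still lives in memory), so it complements, rather than replaces, the depth and leaf-size limits.

    # Sketch: grow the forest in batches of trees with warm_start.
    # Each batch is fit on the same full training data.
    clf = RandomForestClassifier(n_estimators=20, warm_start=True,
                                 max_depth=12, max_features="log2")
    clf.fit(X_train, y_train)           # first 20 trees
    for _ in range(4):
        clf.n_estimators += 20          # add 20 more trees per call
        clf.fit(X_train, y_train)
    print(clf.n_estimators)             # 100 trees total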