Memory allocation error in sklearn random forest classification python
I am trying to run a scikit-learn random forest classification on 279,900 instances with 5 attributes and 1 class. I get a memory allocation error at the fit line; the classifier cannot even be trained. Any suggestions on how to resolve this issue?
The data is:
x, y, day, week, accuracy
x and y are the coordinates,
day is the day of the month (1-30),
week is the day of the week (1-7),
and accuracy is an integer.
Code:
import csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Read the class labels (9th column) from the CSV.
result = []
with open("time_data.csv", "rb") as infile:
    re1 = csv.reader(infile)
    # next(re1, None)  # skip the header row if there is one
    for row in re1:
        result.append(row[8])

trainclass = result[:251900]
testclass = result[251901:279953]

# Read the five feature columns from the CSV.
with open("time_data.csv", "rb") as infile:
    re = csv.reader(infile)
    coords = [(float(d[1]), float(d[2]), float(d[3]), float(d[4]), float(d[5]))
              for d in re if len(d) > 0]

train = coords[:251900]
test = coords[251901:279953]
print "Done splitting data into test and train data"

clf = RandomForestClassifier(n_estimators=500, max_features="log2",
                             min_samples_split=3, min_samples_leaf=2)
clf.fit(train, trainclass)
print "Done training"

score = clf.score(test, testclass)
print "Done Testing"
print score
Error:
line 366, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn/tree/_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn/tree/_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn/tree/_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 10206838784 bytes
python scikit-learn random-forest
1 Answer
From the scikit-learn doc.: "The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."
I would try adjusting these parameters. You can also try a memory profiler, or run it on Google Colaboratory if your machine has too little RAM.
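As a rough sketch of that advice (not a tuned fix): the snippet below assumes the train/trainclass/test/testclass variables from the question's code, caps tree growth with max_depth and a larger min_samples_leaf, trims n_estimators, and converts the inputs to float32 NumPy arrays, which may avoid an intermediate float64 copy during fitting. The specific values are illustrative.

    # Illustrative sketch: limit tree size to cut memory use.
    # Assumes train, trainclass, test, testclass exist as in the question.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.asarray(train, dtype=np.float32)   # float32 halves feature memory
    y_train = np.asarray(trainclass)
    X_test = np.asarray(test, dtype=np.float32)
    y_test = np.asarray(testclass)

    clf = RandomForestClassifier(
        n_estimators=100,        # fewer trees than 500; raise later if RAM allows
        max_depth=12,            # hard cap on tree depth (illustrative value)
        min_samples_leaf=5,      # larger leaves mean fewer nodes per tree
        max_features="log2",
        n_jobs=1,                # parallel tree building multiplies peak memory
    )
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))

Whether a depth of 12 is enough is data-dependent; start shallow and increase it only while validation accuracy keeps improving.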
I tried with max_features="log2", min_samples_split=3, min_samples_leaf=2 as my parameters, but I am still facing the same issue. I might try max_depth. I have 16 GB of RAM.
– Labeo
Nov 28 '18 at 19:32
And depending on the number of features I have, the depth can't be that large anyway, right?
– Labeo
Nov 28 '18 at 19:35
I would definitely set a max_depth. Decision trees overfit drastically at high depth. Usually a depth of 6 is sufficient, but this of course depends on your model.
– jbuchel
Nov 28 '18 at 20:19
Is it possible to run on chunks of the data? It ran when I tried it on 25,000 points. I assume it won't help, since in the end the data is the same.
– Labeo
Nov 28 '18 at 22:31
I think you can do that, but you cannot train two models on different chunks of data, since the results will be different. Have you tried max_depth?
– jbuchel
Nov 28 '18 at 22:37
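On the chunking question, one hedged option: scikit-learn's warm_start flag lets you add trees to an existing forest in batches instead of building all of them in a single fit call. The sketch below assumes the X_train/y_train arrays from the earlier snippet, and it does not shrink the finished model (every tree still lives in memory), so it complements, rather than replaces, the depth and leaf-size limits.

    # Sketch: grow the forest in batches of trees with warm_start.
    # Each batch is fit on the same full training data.
    clf = RandomForestClassifier(n_estimators=20, warm_start=True,
                                 max_depth=12, max_features="log2")
    clf.fit(X_train, y_train)           # first 20 trees
    for _ in range(4):
        clf.n_estimators += 20          # add 20 more trees per call
        clf.fit(X_train, y_train)
    print(clf.n_estimators)             # 100 trees total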