Issue with h2o package in R using subsetted data frames leading to near-perfect prediction accuracy
I have been stumped on this problem for a long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information from the parent, but it also seems to cause problems when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See the sample code below; I included comments to clarify each step, and it is fairly short:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- row numbers, not a boolean
test = (-train) # Create testing indices via negative indexing
h2o.init(nthreads=2) # Initialize h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
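As a sanity check on the splitting logic above, the disjointness of a base-R index split can be verified directly (a minimal sketch with a made-up row count; `train` holds row indices, not booleans):

```r
set.seed(42)                          # reproducible example
n <- 1000                             # hypothetical number of rows
train <- sample(n, round(n / 1.66))   # training row indices, as in the question
test  <- setdiff(seq_len(n), train)   # explicit test indices (equivalent to dataset[-train,])
stopifnot(length(intersect(train, test)) == 0)   # splits do not overlap
stopifnot(length(train) + length(test) == n)     # splits cover every row
```

In base R the split itself is sound; the question is whether it survives the conversion to H2O frames.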
Has anyone encountered this issue? Does anyone have an idea of how to alleviate this problem?
r h2o
asked Nov 23 '18 at 19:07
Nicklovn
I discovered something new. The problem arises when the response is coerced to type factor. Not sure why this is causing issues but if I allow the categorical response to be numerical, I can get relatively reasonable error. This is not ideal though because I cannot use the Cross Entropy loss function and it's not proper statistics.
– Nicklovn
Nov 23 '18 at 19:33
This issue arises even if you don't explicitly coerce the response. If you read the dataframe in with levels "Yes" and "No" you get the same problem... I don't understand why this is happening.
– Nicklovn
Nov 23 '18 at 19:50
1 Answer
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2                              # Response is the 2nd column; the first is an index
x = 3:ncol(data)                   # Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8)  # Split roughly 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with the training split
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train, activation="Rectifier",
                           hidden=c(6,6), epochs=10, train_samples_per_iteration=-2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not actually being split, and the model is in effect being trained on the test data.
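If in doubt, the split produced by h2o.splitFrame can be checked directly on the resulting H2OFrames (a quick sketch, reusing the y = 2 column index from the code above):

```r
nrow(train)            # number of rows in the training frame
nrow(test)             # number of rows in the test frame
h2o.table(train[, y])  # class counts of the response in the training split
h2o.table(test[, y])   # class counts of the response in the test split
```

If the two frames have the expected sizes and both classes appear in each, the split itself is not the culprit.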
In addition, the problem could be in the dataset itself. Are you sure the data is actually IID?
– vaclav
Nov 26 '18 at 10:59
I had had this on my mind for some time now. I had found the h2o.importFile() function and utilized it successfully. You are 100% correct. This was my first time working with something like h2o. Utilizing the h2o functions led to successful splitting and evaluation. Thank you for your response! I hope if someone else has this issue that they find this :) Have a good day sir!
– Nicklovn
Nov 26 '18 at 20:55
Regarding IID data, generally speaking I am. By that I mean that I don't have issues of singularity. But I don't expect the entire dataset to be IID. Unless I'm misunderstanding, I don't believe identical distribution of regressors is feasible with large datasets.
– Nicklovn
Nov 26 '18 at 20:58
answered Nov 25 '18 at 10:38
Darren Cook