Issue with the h2o package in R: subsetted data.frames lead to near-perfect prediction accuracy












I have been stumped on this problem for a long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information from the parent, and I suspect it also affects training h2o.deeplearning models on what I think is just my training set (though this may not be true). Sample code is below; I have included comments to clarify each step, and it is fairly short:



library(h2o) # Load the h2o package

dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors

X = model.matrix(y~., data=dataset) # Automatically create dummy variables
y = as.factor(y) # Ensure y has factor data type
dataset = data.frame(y, X) # Create final data.frame dataset

train = sample(length(y), length(y)/1.66) # Create training indices -- integer row indices from sample()
test = (-train) # Create testing indices -- negative indices select the held-out rows

h2o.init(nthreads=2) # Initiate h2o

# BELOW: Create h2o.deeplearning model with the training subset of dataset.
mlModel = h2o.deeplearning(y='y', training_frame=as.h2o(dataset[train,,drop=TRUE]), activation="Rectifier",
                           hidden=c(6,6), epochs=10, train_samples_per_iteration=-2)

predictions = h2o.predict(mlModel, newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to a data.frame; as.vector() caused issues for me
predictions = predictions[,1] # Extract the predicted labels

mean(predictions != y[test]) # Misclassification rate on the held-out rows


The problem is that if I evaluate this against my test subset I get almost 0% error:



[1] 0.0007531255


Has anyone encountered this issue? Do you have any idea how to alleviate this problem?
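For reference, here is a minimal, made-up illustration (not my real data) of what I mean by a subset retaining information from its parent data.frame: a subsetted factor keeps all of the parent's levels unless droplevels() is applied.


df <- data.frame(y = factor(c("No", "No", "Yes", "Yes")), x = 1:4)

sub <- df[1:2, ]     # subset contains only "No" rows...
levels(sub$y)        # ...but still reports both levels: "No" "Yes"

sub$y <- droplevels(sub$y)
levels(sub$y)        # now just "No"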










r h2o

asked Nov 23 '18 at 19:07
Nicklovn

  • I discovered something new. The problem arises when the response is coerced to type factor. Not sure why this is causing issues but if I allow the categorical response to be numerical, I can get relatively reasonable error. This is not ideal though because I cannot use the Cross Entropy loss function and it's not proper statistics.
    – Nicklovn
    Nov 23 '18 at 19:33










  • This issue arises even if you don't explicitly coerce the response. If you read the dataframe in with levels "Yes" and "No" you get the same problem... I don't understand why this is happening.
    – Nicklovn
    Nov 23 '18 at 19:50
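For anyone debugging the same symptom, a minimal diagnostic sketch (using the dataset, y, train, and test objects from the question; this only inspects the data, it is not a fix) to see how the response was parsed and how the labels fall across the split:


str(dataset$y)            # factor, character, or numeric?
levels(dataset$y)         # e.g. "No" "Yes"

table(dataset$y[train])   # label counts in the training rows
table(dataset$y[test])    # label counts in the held-out rows (test = -train)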
















1 Answer

It will be more efficient to use the H2O functions to load the data and split it.



data = h2o.importFile("dataset.csv")
y = 2              # Response is the 2nd column; the 1st is just an index
x = 3:ncol(data)   # Learn from all the other columns
data[,y] = as.factor(data[,y])

parts = h2o.splitFrame(data, 0.8)  # Split 80/20
train = parts[[1]]
test = parts[[2]]

# BELOW: Create h2o.deeplearning model with the training split.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train, activation="Rectifier",
                           hidden=c(6,6), epochs=10, train_samples_per_iteration=-2)

h2o.performance(mlModel, test)
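
If you want a single misclassification number comparable to the mean(predictions != y[test]) computed in the question, something along these lines should work with the objects above (perf, preds, and actual are just local names used here for illustration):


perf <- h2o.performance(mlModel, test)
h2o.confusionMatrix(perf)                         # per-class and overall error rates

preds  <- as.data.frame(h2o.predict(mlModel, newdata = test))$predict
actual <- as.data.frame(test)[, y]                # y == 2, the response column index
mean(as.character(preds) != as.character(actual)) # held-out misclassification rate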


It is hard to say what the problem with your original code is without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not actually being split, and the model is effectively being trained on the test data.
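To rule that out with the original base-R objects, a quick sanity check could look like this (just a sketch; the two row sets should be disjoint and together cover the whole data set):


train_rows <- sort(train)                                # integer indices from sample()
test_rows  <- setdiff(seq_len(nrow(dataset)), train)     # rows picked by the negative index

length(intersect(train_rows, test_rows))                 # should be 0
length(train_rows) + length(test_rows) == nrow(dataset)  # should be TRUE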






answered Nov 25 '18 at 10:38
Darren Cook

  • In addition, the problem could be in the dataset itself. Are you sure the data is actually IID?
    – vaclav
    Nov 26 '18 at 10:59












  • I had had this on my mind for some time now. I had found the h2o.importFile() function and utilized it successfully. You are 100% correct. This was my first time working with something like h2o. Utilizing the h2o functions led to successful splitting and evaluation. Thank you for your response! I hope if someone else has this issue that they find this :) Have a good day sir!
    – Nicklovn
    Nov 26 '18 at 20:55










  • Regarding IID data: generally speaking, I am. By that I mean that I don't have issues of singularity, but I don't expect the entire dataset to be IID. Unless I'm misunderstanding, I don't believe an identical distribution of the regressors is feasible with large datasets.
    – Nicklovn
    Nov 26 '18 at 20:58










