Issue with h2o package in R using subsetted data frames leading to near-perfect prediction accuracy
I have been stumped on this problem for a long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information from the parent, but it also seems to cause problems when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See the sample code below; I included comments to clarify each step, and it is fairly short:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- row numbers, not a boolean
test = (-train) # Create testing indices via negative indexing
h2o.init(nthreads=2) # Initialize h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
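As a sanity check on the splitting logic above, the disjointness of a base-R index split can be verified directly (a minimal sketch with a made-up row count; `train` holds row indices, not booleans):

```r
set.seed(42)                          # reproducible example
n <- 1000                             # hypothetical number of rows
train <- sample(n, round(n / 1.66))   # training row indices, as in the question
test  <- setdiff(seq_len(n), train)   # explicit test indices (equivalent to dataset[-train,])
stopifnot(length(intersect(train, test)) == 0)   # splits do not overlap
stopifnot(length(train) + length(test) == n)     # splits cover every row
```

In base R the split itself is sound; the question is whether it survives the conversion to H2O frames.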
Has anyone encountered this issue? Does anyone have an idea of how to alleviate this problem?
r h2o
asked Nov 23 '18 at 19:07
Nicklovn
I discovered something new. The problem arises when the response is coerced to type factor. Not sure why this is causing issues but if I allow the categorical response to be numerical, I can get relatively reasonable error. This is not ideal though because I cannot use the Cross Entropy loss function and it's not proper statistics.
– Nicklovn
Nov 23 '18 at 19:33
This issue arises even if you don't explicitly coerce the response. If you read the dataframe in with levels "Yes" and "No" you get the same problem... I don't understand why this is happening.
– Nicklovn
Nov 23 '18 at 19:50
1 Answer
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2                              # Response is the 2nd column; the first is an index
x = 3:ncol(data)                   # Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8)  # Split roughly 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with the training split
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train, activation="Rectifier",
                           hidden=c(6,6), epochs=10, train_samples_per_iteration=-2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not actually being split, and the model is in effect being trained on the test data.
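If in doubt, the split produced by h2o.splitFrame can be checked directly on the resulting H2OFrames (a quick sketch, reusing the y = 2 column index from the code above):

```r
nrow(train)            # number of rows in the training frame
nrow(test)             # number of rows in the test frame
h2o.table(train[, y])  # class counts of the response in the training split
h2o.table(test[, y])   # class counts of the response in the test split
```

If the two frames have the expected sizes and both classes appear in each, the split itself is not the culprit.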
In addition, the problem could be in the dataset itself. Are you sure the data is actually IID?
– vaclav
Nov 26 '18 at 10:59
I had had this on my mind for some time now. I had found the h2o.importFile() function and utilized it successfully. You are 100% correct. This was my first time working with something like h2o. Utilizing the h2o functions led to successful splitting and evaluation. Thank you for your response! I hope if someone else has this issue that they find this :) Have a good day sir!
– Nicklovn
Nov 26 '18 at 20:55
Regarding IID data, generally speaking I am. By that I mean that I don't have issues of singularity. But I don't expect the entire dataset to be IID. Unless I'm misunderstanding, I don't believe identical distribution of regressors is feasible with large datasets.
– Nicklovn
Nov 26 '18 at 20:58
answered Nov 25 '18 at 10:38
Darren Cook