Ensemble model in case of outliers























I am building a predictive model (linear regression), but my target variable has a very large number of outliers: roughly 30-40% of the data. Is it a good idea to use an ensemble model in this situation? That is:
1. Build one model for the non-outlier data.
2. Build another model for the outlier data.
3. Predict using the average of the predictions from both models (as is done in ensemble modeling).
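
For concreteness, the three steps above could be sketched as follows. This is only an illustration under assumptions that are not in the question: a single feature, synthetic data, an outlier flag supplied by the caller, and hypothetical helper names (`fit_line`, `fit_split_ensemble`).

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_split_ensemble(xs, ys, is_outlier):
    """Steps 1-2: fit one line per group. Step 3: return an averaging predictor."""
    normal = [(x, y) for x, y, out in zip(xs, ys, is_outlier) if not out]
    outliers = [(x, y) for x, y, out in zip(xs, ys, is_outlier) if out]
    a0, b0 = fit_line(*zip(*normal))
    a1, b1 = fit_line(*zip(*outliers))

    def predict(x):
        # Plain (unweighted) average of the two models' predictions.
        return 0.5 * ((a0 + b0 * x) + (a1 + b1 * x))

    return predict


# Synthetic data: every third point comes from a shifted regime (about 34%).
xs = [i / 10 for i in range(100)]
ys = [(50.0 if i % 3 == 0 else 1.0) + 2.0 * x for i, x in enumerate(xs)]
mask = [y > 30.0 for y in ys]  # hypothetical cut-off, chosen for this toy data

predict = fit_split_ensemble(xs, ys, mask)
print(predict(5.0))
```

On this toy data the two fitted lines are y = 1 + 2x and y = 50 + 2x, so the averaged prediction at x = 5 is roughly 35.5, which sits between the two groups rather than on either of them. A weighted average would need some estimate of which group a new point belongs to.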



Note: The outliers are still present after transforming the target, so, based on what I have tried so far, transformation is not a feasible option either.



I cannot share the data for security reasons.



I have looked for suggestions in several discussion groups but could not reach any useful conclusion.


































  • This was voted down. Could you give the reason, so that I can improve the question?
    – Abhishek
    Nov 21 at 12:50






  • I didn't downvote, but it seems that you don't have a programming question here. Maybe ask in the Stats or Data Science SE.
    – Matias Valdenegro
    Nov 21 at 15:58










  • Thanks @Matias, I will keep that in mind.
    – Abhishek
    Nov 21 at 16:07






  • I think making separate models for different effects is a good idea, but my advice is to not think about "normal" effects versus "outliers". My advice is to think about the different ways that observable data may be generated; there may be any number of ways. Build a model that expresses what you know about each data generating mechanism, and then train the whole collection at the same time via EM or whatever, i.e. my advice is, don't filter out "outliers" and then train the "normal" model on the leftovers. Good luck, this is a good problem. Also stats.stackexchange.com will have more to say.
    – Robert Dodier
    Nov 21 at 17:48
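
A minimal sketch of the idea in the comment above, assuming exactly two data-generating mechanisms, each a straight line with Gaussian noise, and a single feature. The function names (`weighted_line`, `em_two_lines`), the initialisation and the synthetic data are illustrative assumptions, not something given in the thread, and there are no safeguards against a component ending up empty.

```python
import math
import random


def weighted_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b, variance)."""
    w = sum(ws)
    mx = sum(wi * x for wi, x in zip(ws, xs)) / w
    my = sum(wi * y for wi, y in zip(ws, ys)) / w
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(ws, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    var = sum(wi * (y - a - b * x) ** 2 for wi, x, y in zip(ws, xs, ys)) / w
    return a, b, max(var, 1e-6)  # floor keeps the density finite


def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_two_lines(xs, ys, n_iter=50):
    """EM for a two-component mixture of linear regressions."""
    # Initialise responsibilities from the residual sign of one overall fit.
    a, b, _ = weighted_line(xs, ys, [1.0] * len(xs))
    resp = [1.0 if y > a + b * x else 0.0 for x, y in zip(xs, ys)]
    for _ in range(n_iter):
        # M-step: refit each line using the current responsibilities as weights.
        comps = []
        for k in (0, 1):
            ws = [r if k == 1 else 1.0 - r for r in resp]
            comps.append((sum(ws) / len(xs),) + weighted_line(xs, ys, ws))
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x, y in zip(xs, ys):
            p = [pi * normal_pdf(y, a + b * x, var) for pi, a, b, var in comps]
            total = p[0] + p[1]
            resp.append(p[1] / total if total > 0 else 0.5)
    return comps, resp


# Synthetic data: 35% of the points come from a second, shifted mechanism.
random.seed(0)
xs, ys = [], []
for i in range(300):
    x = random.uniform(0.0, 10.0)
    shift = 40.0 if i % 20 < 7 else 1.0
    xs.append(x)
    ys.append(shift + 2.0 * x + random.gauss(0.0, 1.0))

comps, resp = em_two_lines(xs, ys)
for pi, a, b, var in comps:
    print(f"weight={pi:.2f} intercept={a:.2f} slope={b:.2f} var={var:.2f}")
```

Here all points are used to train both components at once, each point weighted by how likely it is to belong to that component, so no data is filtered out beforehand. The fitted mixing weights also give an estimate of how common each mechanism is.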










  • Thanks @RobertDodier. I did build a single model (linear regression specifically) on the whole data, but I have not reached any conclusion so far, so I thought I would try building separate models. I will go ahead with the separate-model approach, and if I get something useful out of it, I will share it here.
    – Abhishek
    Nov 22 at 4:25















machine-learning statistics linear-regression data-science outliers














edited Nov 22 at 4:20

























asked Nov 21 at 12:25









Abhishek

63




















