Are over-dispersion tests in GLMs actually *useful*?












The phenomenon of 'over-dispersion' in a GLM arises whenever we use a model that restricts the variance of the response variable, and the data exhibits greater variance than the model restriction allows. This occurs commonly when modelling count data using a Poisson GLM, and it can be diagnosed by well-known tests. If tests show that there is statistically significant evidence of over-dispersion then we usually generalise the model by using a broader family of distributions that free the variance parameter from the restriction occurring under the original model. In the case of a Poisson GLM it is common to generalise either to a negative-binomial or quasi-Poisson GLM.



This situation is pregnant with an obvious objection. Why start with a Poisson GLM at all? One can start directly with the broader distributional forms, which have a (relatively) free variance parameter that can be fitted to the data, ignoring over-dispersion tests completely. In other situations when we are doing data analysis we almost always use distributional forms that allow freedom of at least the first two moments, so why make an exception here?



My Question: Is there any good reason to start with a distribution that fixes the variance (e.g., the Poisson distribution) and then perform an over-dispersion test? How does this procedure compare with skipping this exercise completely and going straight to the more general models (e.g., negative-binomial, quasi-Poisson, etc.)? In other words, why not always use a distribution with a free variance parameter?
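To make the comparison concrete, here is a rough sketch of both workflows in Python's `statsmodels` (the simulated data and all parameter values below are invented purely for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulate over-dispersed counts: negative-binomial data, so Var[y] > E[y].
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = sm.add_constant(x)
mu = np.exp(0.5 + 0.8 * x)
size = 2.0                                    # NB size: Var = mu + mu**2 / size
y = rng.negative_binomial(size, size / (size + mu))

# Workflow 1: fit the restrictive Poisson GLM, then run a standard
# over-dispersion diagnostic (Pearson chi-square over residual df,
# which should be near 1 if the Poisson variance restriction holds).
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
dispersion = pois.pearson_chi2 / pois.df_resid
p_value = stats.chi2.sf(pois.pearson_chi2, pois.df_resid)
print(f"dispersion = {dispersion:.2f}, p = {p_value:.3g}")

# Workflow 2: skip the test and free the variance parameter from the start.
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")  # quasi-Poisson
negbin = sm.NegativeBinomial(y, X).fit(disp=0)                      # NB2 likelihood
```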










overdispersion

asked 2 hours ago by Ben












  • My guess is that, if the underlying distribution truly is Poisson but you fit the broader model, your GLM estimates will not exhibit the well-known good properties: they lose efficiency, in the sense that the variance of the estimates is greater than it would have been had the correct model been used, and they are probably not even unbiased or MLEs. But that's just my intuition and I could be wrong. I'd be curious what a good answer is.
    – mlofton, 2 hours ago








  • In my experience, testing for over-dispersion is (paradoxically) mainly of use when you know (from a knowledge of the data generation process) that over-dispersion can't be present. In this context, testing for over-dispersion tells you whether the linear model is picking up all the signal in the data. If it isn't, then adding more covariates to the model should be considered. If it is, then more covariates cannot help.
    – Gordon Smyth, 1 hour ago












  • @GordonSmyth: I think that's a good answer. If you don't want to turn that into its own answer, I'll fold it into mine.
    – Cliff AB, 1 hour ago










  • @CliffAB Feel free to incorporate my comment into your answer as I don't have time to compose a full answer myself.
    – Gordon Smyth, 27 mins ago
















1 Answer

In principle, I actually agree that 99% of the time, it's better to just use the more flexible model. With that said, here are two and a half arguments for why you might not.



(1) Less flexible means more efficient estimates. Given that variance parameters tend to be less stable than mean parameters, your assumption of a fixed mean-variance relation may stabilize your standard errors more.



(2) Model checking. I've worked with physicists who believe, on theoretical grounds, that various measurements can be described by Poisson distributions. If we reject the hypothesis that mean = variance, we have evidence against the Poisson distribution hypothesis. And as @GordonSmyth pointed out in a comment, if you have reason to believe that a given measurement should follow a Poisson distribution, then evidence of over-dispersion is evidence that you are missing important factors.
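To illustrate that last point with a toy simulation (a hypothetical Python sketch; the covariates and coefficients are made up), counts that are truly Poisson given the full covariate set look over-dispersed once a relevant covariate is dropped:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)     # x2 plays the "missing factor"
y = rng.poisson(np.exp(0.3 + 0.5 * x1 + 0.5 * x2))  # truly Poisson given (x1, x2)

for label, X in [("x1 only", sm.add_constant(x1)),
                 ("x1 + x2", sm.add_constant(np.column_stack([x1, x2])))]:
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    # Dispersion statistic is ~1 only once the missing covariate is included.
    print(label, fit.pearson_chi2 / fit.df_resid)
```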



(2.5) Proper distribution. While negative-binomial regression comes from a valid statistical distribution, it's my understanding that the quasi-Poisson does not. That means you can't really simulate count data if you believe $\mathrm{Var}[y] = \alpha E[y]$ for $\alpha \neq 1$. That might be annoying for some use cases. Likewise, you can't use probabilities to test for outliers, etc.
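By contrast, because the negative binomial is a proper distribution, simulating from a fitted model is straightforward. Here is a sketch using the NB2 parameterisation ($\mathrm{Var}[y] = \mu + \alpha \mu^2$) rather than the quasi-Poisson variance function above; the helper `simulate_nb2` is just an illustrative name:

```python
import numpy as np

def simulate_nb2(mu, alpha, rng):
    """Draw counts with E[y] = mu and Var[y] = mu + alpha * mu**2 (NB2)."""
    size = 1.0 / alpha                    # numpy's negative_binomial 'n' parameter
    return rng.negative_binomial(size, size / (size + np.asarray(mu)))

rng = np.random.default_rng(2)
draws = simulate_nb2(mu=np.full(100_000, 3.0), alpha=0.5, rng=rng)
print(draws.mean(), draws.var())          # ~3.0 and ~(3.0 + 0.5 * 9.0) = 7.5
```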






answered 1 hour ago (edited 1 min ago) – Cliff AB













  • On 2.5: there are of course the negative binomial and GLMMs with random effects, which don't have that limitation.
    – Björn, 8 mins ago










  • @Björn: that's why it's only half an argument; it only applies to quasi-likelihood methods. As far as I know, there are no likelihood-based methods for under-dispersion, even though it can be analyzed with a quasi-likelihood model.
    – Cliff AB, 4 mins ago












