What does “Scientists rise up against statistical significance” mean? (Comment in Nature)


























The Comment in Nature, "Scientists rise up against statistical significance", begins with:




Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.




and later contains statements like:




Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.




I think I can grasp that the image below is not saying the two studies disagree just because one "rules out" no effect while the other does not. But the article seems to go into much more depth than I can understand.



Towards the end there seems to be a summary in four points. Is it possible to summarize these in even simpler terms for those of us who read statistics rather than write it?




When talking about compatibility intervals, bear in mind four things.




  • First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible...


  • Second, not all values inside are equally compatible with the data, given the assumptions...


  • Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention...


  • Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval...







Nature: Scientists rise up against statistical significance










Tags: statistical-significance p-value bias

asked 3 hours ago by uhoh
3 Answers


















Answer by Harvey Motulsky (5 votes), answered 3 hours ago












I'll try.

1. The confidence interval (which they rename the compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean the values outside the interval are absolutely incompatible with the data.

2. Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.

3. 95% is just a convention. You can compute 90% or 99% or any other percentage interval (see the sketch below).

4. The confidence/compatibility intervals are only helpful if the experiment was done properly, the analysis was done according to a preset plan, and the data conform to the assumptions of the analysis method. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
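As a concrete companion to points 2 and 3, here is a minimal Python sketch (my own illustration, not from the article or this answer): the same data give wider or narrower intervals depending on the arbitrary coverage level you choose, and candidate parameter values near the point estimate are "more compatible" (larger p-value) than values near the ends of the interval.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=50)   # hypothetical measurements
    mean = data.mean()
    sem = data.std(ddof=1) / np.sqrt(len(data))

    # Point 3: 95% is just one convention among many.
    for level in (0.90, 0.95, 0.99):
        z = stats.norm.ppf(0.5 + level / 2)
        print(f"{int(level * 100)}% interval: [{mean - z * sem:.2f}, {mean + z * sem:.2f}]")

    # Point 2: values near the estimate are more compatible with the data
    # (larger p-value when tested as the "true" value) than values near the ends.
    for candidate in (mean, mean + 1 * sem, mean + 2 * sem):
        p = 2 * stats.norm.sf(abs(candidate - mean) / sem)
        print(f"candidate {candidate:.2f}: p = {p:.3f}")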



























Answer by Ingolifs (4 votes), answered 1 hour ago, edited 52 mins ago












The first three points, as far as I can tell, are variations on a single argument.

Scientists often treat uncertainty measurements ($12 \pm 1$, for instance) as probability distributions that look like this:

[figure: a uniform, flat-topped probability distribution]

When in reality they are much more likely to look like this:

[figure: a peaked, roughly Gaussian probability distribution]

As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics when they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $\pm$ interval. Only recently (and outside academia) did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.

So to break down the points:

1. Measurements outside the CI still have a chance of happening, because the real (likely Gaussian) uncertainty is non-zero there (or anywhere, for that matter, although the probability becomes vanishingly small far out). If the values after the $\pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them (see the sketch after this list).

2. The distribution is not uniform (flat-topped, as in the first graph); it is peaked. You are more likely to get a value in the middle than at the edges. It's like rolling a bunch of dice rather than a single die.

3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.

4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force; it is the cumulative effort of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't lead to publications.
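A quick numerical check of the point about the $\pm$ range (my own sketch, assuming the quoted uncertainty is one standard deviation of a Gaussian rather than an absolute limit):

    import numpy as np

    rng = np.random.default_rng(1)
    # Treat "12 +/- 1" as a Gaussian with mean 12 and standard deviation 1.
    draws = rng.normal(loc=12.0, scale=1.0, size=100_000)

    within_1sd = np.mean(np.abs(draws - 12.0) <= 1.0)
    within_2sd = np.mean(np.abs(draws - 12.0) <= 2.0)
    print(f"within 1 s.d.: {within_1sd:.1%}")   # ~68%, so ~32% of points fall outside the quoted range
    print(f"within 2 s.d.: {within_2sd:.1%}")   # ~95%, matching the usual 95% convention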





Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a PhD, grant or job is riding on the outcome, for these marginal results to be jiggled around until the desired $p = 0.0498$ shows up in the analysis.

Such practices can be detrimental to science as a whole, especially if done widely, all in the pursuit of a number which is, in the eyes of nature, meaningless. This part in effect is exhorting scientists to be honest about their data and work, even when that honesty is to their detriment.
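To make the "run it again until you get $p < 0.05$" problem concrete, here is a small simulation (my own illustration, not from the article): even with no true effect at all, each fresh attempt has roughly a 5% chance of clearing the threshold, so persistence alone eventually produces a significant-looking result.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    attempts_needed = []
    for _ in range(1000):                     # 1000 simulated researchers
        attempts = 0
        while True:
            attempts += 1
            a = rng.normal(size=20)           # both groups drawn from the SAME distribution
            b = rng.normal(size=20)
            if stats.ttest_ind(a, b).pvalue < 0.05:
                break
        attempts_needed.append(attempts)

    print("median attempts until p < 0.05 with no true effect:",
          int(np.median(attempts_needed)))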






Comments:

Ingolifs (49 mins ago): Ack. I can't figure out how to make the points 1.-4. align vertically. Any help?

uhoh (16 mins ago): I've made the edit, just needs accepting; single space in front of each number and blank line between each item seems to do it.



















Answer by usul (1 vote), answered 26 mins ago












Much of the article and the figure you include make a very simple point:

Lack of evidence for an effect is not evidence that it does not exist.

For example,

"In our study, mice given cyanide did not die at statistically significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths".

Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size was so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.

But this is the mistake the authors claim scientists routinely make.

For example, in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study, but on many mice.

The authors suggest that, instead of reporting a significant/non-significant verdict from effect sizes and p-values, scientists describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are compatible both with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results would be most compatible with the assumption that this dose kills 65% of mice, but also somewhat compatible with fatality rates as low as 60% or as high as 70%, and less compatible with a truth outside that range. (We should also describe the statistical assumptions we made to compute these numbers.)
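Here is a rough sketch of the arithmetic behind these examples (assumptions mine: the $[60\%, 70\%]$ range above is illustrative, while this sketch uses Fisher's exact test for the two-mouse table and a normal-approximation 95% interval for the 100-mouse case, which comes out somewhat wider):

    import numpy as np
    from scipy import stats

    # Two-mouse experiment: 1/2 deaths in the treated group vs 0/2 in the control group.
    table = [[1, 1],   # treated: died, survived
             [0, 2]]   # control: died, survived
    _, p = stats.fisher_exact(table)
    print("two-mouse experiment, p =", round(p, 3))   # far above 0.05, yet hardly proof of safety

    # Hypothetical 100-mouse experiment: 65 deaths out of 100.
    deaths, n = 65, 100
    phat = deaths / n
    z = stats.norm.ppf(0.975)
    half = z * np.sqrt(phat * (1 - phat) / n)
    print(f"100-mouse experiment: {phat:.0%} fatal, "
          f"95% interval [{phat - half:.0%}, {phat + half:.0%}]")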





