What does “Scientists rise up against statistical significance” mean? (Comment in Nature)

The Comment in Nature, "Scientists rise up against statistical significance", begins with:




Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.




and later contains statements like:




Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.




I think I can grasp that the image below does not say that the two studies disagree just because one "rules out" no effect while the other does not. But the article seems to go into much more depth than I can understand.



Towards the end there seems to be a summary in four points. Is it possible to summarize these in even simpler terms for those of us who read statistics rather than write it?




When talking about compatibility intervals, bear in mind four things.




  • First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible...


  • Second, not all values inside are equally compatible with the data, given the assumptions...


  • Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention...


  • Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval...







[figure from the Comment: interval estimates from two studies, one labelled 'statistically significant' and one not]

Nature: Scientists rise up against statistical significance










Tags: statistical-significance, p-value, bias






asked 3 hours ago by uhoh






















3 Answers






I'll try.

1. The confidence interval (which they rename the compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean the values outside the interval are absolutely incompatible with the data.

2. Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.

3. 95% is just a convention. You can compute 90% or 99% or any% intervals (see the sketch below).

4. The confidence/compatibility intervals are only helpful if the experiment was done properly, the analysis followed a preset plan, and the data conform to the assumptions of the analysis methods. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
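A minimal sketch of point 3 (illustration only, with made-up measurements and scipy): the confidence level is just a dial; between 90%, 95% and 99%, only the multiplier on the standard error changes.

```python
# Sketch: the confidence level is a convention, not a law of nature.
# Hypothetical data; only the t multiplier changes between levels.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=12.0, scale=1.0, size=20)   # made-up measurements

mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(len(data))       # standard error of the mean

for level in (0.90, 0.95, 0.99):
    t_crit = stats.t.ppf(0.5 + level / 2, df=len(data) - 1)  # two-sided critical value
    print(f"{level:.0%} interval: [{mean - t_crit * sem:.2f}, {mean + t_crit * sem:.2f}]")
```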






answered 3 hours ago by Harvey Motulsky

The first three points, as far as I can tell, are a variation on a single argument.

Scientists often treat uncertainty measurements ($12 \pm 1$, for instance) as probability distributions that look like this:

[image: a uniform, flat-topped probability distribution]

When actually, they are much more likely to look like this:

[image: a peaked, roughly Gaussian probability distribution]

As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use it, possibly even having to calculate a compound error through several different measurements, but they never really understood it. I too was guilty of this, and assumed all measurements had to fall within the $\pm$ interval. Only recently (and outside academia) did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.
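A quick numerical check of that, using scipy (illustration only): how much of a normal distribution lies within one and two standard deviations, and which multiple of the SD corresponds to 95% coverage.

```python
# Sketch: coverage of a normal distribution at 1 and 2 standard deviations,
# and the z value that gives exactly 95% two-sided coverage.
from scipy import stats

within_1sd = stats.norm.cdf(1) - stats.norm.cdf(-1)   # about 0.68
within_2sd = stats.norm.cdf(2) - stats.norm.cdf(-2)   # about 0.95
z_95 = stats.norm.ppf(0.975)                          # about 1.96

print(f"within 1 SD: {within_1sd:.1%} (so roughly 32% of points fall outside)")
print(f"within 2 SD: {within_2sd:.1%}")
print(f"95% coverage corresponds to +/- {z_95:.2f} SD")
```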



So to break down the points:

1. Measurements outside the CI still have a chance of happening, because the real (likely Gaussian) uncertainty is non-zero there (or anywhere, for that matter, although it becomes vanishingly small far out). If the values after the $\pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.

2. The distribution is not uniform (flat-topped, as in the first graph); it is peaked. You are more likely to get a value in the middle than at the edges. It's like rolling a bunch of dice rather than a single die.

3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.

4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force; it is the cumulative effort of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish-or-perish is at play. In reality, scientists' careers depend on discoveries that are both true and interesting, because uninteresting results don't lead to publications.





Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a PhD, grant or job is riding on the outcome, for marginal results to be jiggled around until the desired $p = 0.0498$ shows up in the analysis.
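A toy simulation of that temptation (illustration only; made-up data with no true effect, using scipy): peeking after every new batch and stopping at the first $p < 0.05$ yields "significant" results far more often than the nominal 5%.

```python
# Sketch: optional stopping ("test again until p < 0.05") under a true null effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, batch, max_batches = 2000, 10, 20
hits = 0
for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(max_batches):
        a = np.concatenate([a, rng.normal(size=batch)])  # group A, no real effect
        b = np.concatenate([b, rng.normal(size=batch)])  # group B, no real effect
        if stats.ttest_ind(a, b).pvalue < 0.05:          # peek after every batch
            hits += 1
            break
print(f"declared 'significant' in {hits / n_sims:.0%} of null experiments")
```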



Such practices can be detrimental to science as a whole, especially when done widely, all in the pursuit of a number which is, in the eyes of nature, meaningless. This part is, in effect, exhorting scientists to be honest about their data and work, even when that honesty is to their detriment.






answered 1 hour ago by Ingolifs, edited 52 mins ago









• Ack. I can't figure out how to make the points 1.-4. align vertically. Any help? – Ingolifs, 49 mins ago

• I've made the edit, just needs accepting; single space in front of each number and blank line between each item seems to do it. – uhoh, 16 mins ago



















Much of the article and the figure you include make a very simple point:

Lack of evidence for an effect is not evidence that it does not exist.

For example,

"In our study, mice given cyanide did not die at statistically significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths."

Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size is so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.
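A quick check of that claim with Fisher's exact test (illustration only, assuming scipy): the two-mouse table gives no hint of significance, while the same death rates in a much larger experiment are overwhelming.

```python
# Sketch: same death rates, wildly different p-values depending on sample size.
from scipy import stats

#                   died  survived
tiny  = [[1,   1],           # cyanide group, 2 mice
         [0,   2]]           # control group, 2 mice
large = [[100, 100],         # cyanide group, 200 mice
         [0,   200]]         # control group, 200 mice

for name, table in [("2 mice per group", tiny), ("200 mice per group", large)]:
    _, p = stats.fisher_exact(table)   # two-sided Fisher's exact test
    print(f"{name}: p = {p:.3g}")
```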



But this is the mistake the authors claim scientists routinely make.

For example, in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study on many mice.

The authors suggest that, instead of using effect sizes and p-values, scientists describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are compatible both with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results are most compatible with the assumption that this dose kills 65% of mice, but also somewhat compatible with fatality rates as low as 60% or as high as 70%, and less compatible with a truth outside that range. (We should also describe the statistical assumptions we made to compute these numbers.)
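One way to picture this graded notion of compatibility (illustration only; hypothetical counts, and assuming scipy >= 1.7 for `binomtest`): test the observed 65 deaths out of 100 against a range of hypothesised fatality rates and watch the p-value fall off gradually rather than flip from "in" to "out".

```python
# Sketch: compatibility as a gradient, not a dichotomy.
# p-value of "65 deaths in 100 mice" against various hypothesised true rates.
from scipy import stats

k, n = 65, 100                      # hypothetical outcome from the example above
for rate in (0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80):
    p = stats.binomtest(k, n, p=rate).pvalue   # exact two-sided binomial test
    print(f"hypothesised fatality rate {rate:.0%}: p = {p:.3f}")
```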






answered 26 mins ago by usul












