INTEL SIMD: why is inplace multiplication so slow?



























I have written some vector methods that do simple math either in place or by copying to a separate buffer, and they all share the same penalty for the in-place variant.



The simplest can be boiled down to something like this:



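#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, _mm_store_ps

// dst and src must be 16-byte aligned; count is assumed to be a multiple of 4.
void scale(float* dst, const float* src, int count, float factor)
{
    __m128 factorV = _mm_set1_ps(factor);  // broadcast the scalar factor to all 4 lanes

    for (int i = 0; i < count; i += 4)
    {
        __m128 in = _mm_load_ps(src);
        in = _mm_mul_ps(in, factorV);
        _mm_store_ps(dst, in);

        dst += 4;
        src += 4;
    }
}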


testing code:



for (int i = 0; i < 1000000; i++)
{
    scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}


When testing, i.e. repeatedly running this function on the SAME buffers, I found that the in-place case (dst and src are the same) is slow. If they are different, it's about a factor of 70 faster. Most of the cycles are burned on writing (i.e. _mm_store_ps).



Interestingly, the same behaviour does not hold for addition, i.e. += works nicely; only *= is a problem.



--



This has been answered in the comments. It's denormals during artificial testing.
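One possible way to keep such an in-place test from drifting into denormals is sketched below. It is not part of the original code: the helpers `any_subnormal` and `run_inplace_bounded` are hypothetical names, and the idea of alternating `factor` with `1/factor` is just one assumption about how to keep the magnitudes bounded (up to rounding). It assumes a non-zero factor, 16-byte aligned buffers and a count that is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cmath>        // std::fpclassify, FP_SUBNORMAL

// The routine under test (16-byte aligned pointers, count a multiple of 4).
void scale(float* dst, const float* src, int count, float factor)
{
    const __m128 factorV = _mm_set1_ps(factor);
    for (int i = 0; i < count; i += 4)
        _mm_store_ps(dst + i, _mm_mul_ps(_mm_load_ps(src + i), factorV));
}

// Hypothetical helper: true if any element of buf is subnormal (denormal).
bool any_subnormal(const float* buf, int count)
{
    for (int i = 0; i < count; ++i)
        if (std::fpclassify(buf[i]) == FP_SUBNORMAL)
            return true;
    return false;
}

// Hypothetical in-place test loop: alternates factor and 1/factor so the
// values stay close to where they started instead of shrinking every pass.
void run_inplace_bounded(float* buf, int count, int passes, float factor)
{
    const float inverse = 1.0f / factor;  // assumes factor != 0
    for (int pass = 0; pass < passes; ++pass)
        scale(buf, buf, count, (pass % 2 == 0) ? factor : inverse);
}
```

Calling `any_subnormal` on the buffer after a timing run is a cheap check for whether the measurement was affected by this artefact.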






























  • 3





    A factor of 70? Did you compile with optimization disabled or something? It smells like you're bottlenecking on store-forwarding latency somehow, instead of multiply throughput. That could explain a factor of ~7 or so, but not 70. addps and mulps are not very different in performance on Intel hardware (agner.org/optimize), so there's something weird going on. What compiler/options/hardware, and what does the resulting asm look like?

    – Peter Cordes
    Nov 27 '18 at 20:51








  • 2





    We need a Minimal, Complete, and Verifiable example. What is count? Are your arrays aligned?

    – Alan Birtles
    Nov 27 '18 at 21:19






  • 1





    @AlanBirtles: The code is using _mm_store_ps, not storeu, so it would fault on unaligned unless the compiler uses movups anyway. That is a possibility for ICC and MSVC though.

    – Peter Cordes
    Nov 27 '18 at 21:23






  • 1





    Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to cause FP assists. (It doesn't take extra time to produce a +Inf or NaN result, but it does for gradual underflow to subnormal. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.)

    – Peter Cordes
    Nov 27 '18 at 21:25






  • 1





    @PeterCordes While thinking about the minimal complete verifiable example (I did NOT expect people to actually run my code), I found that indeed this code produces denormals when repeatedly multiplying a buffer with something abs-smaller than 1.f. I'm used to finding denormals in states (e.g. IIR-filters), but did not think of this (which is a testing artefact). It all makes sense now (inplace, *= but not += ..) Thanks a bunch!! If you copy your comment to an answer I can accept it.

    – Eike
    Nov 28 '18 at 7:46





































c++ sse simd multiplication in-place














edited Dec 10 '18 at 10:19







Eike

















asked Nov 27 '18 at 20:46









Eike




















1 Answer
































Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.



(Turns out, yes that was the problem for the OP).



Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.
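To get a feel for how quickly this happens, here is a small sketch. The helper name `passes_until_not_normal` is made up for illustration, and it assumes the starting value is a normal float and that 0 < |factor| < 1. It counts how many multiplications it takes before a value leaves the normal float range.

```cpp
#include <cmath>  // std::fpclassify, FP_NORMAL

// Hypothetical helper (not from the original post): counts how many
// multiplications by `factor` it takes for `value` to stop being a normal
// float. Assumes `value` is normal and 0 < |factor| < 1.
int passes_until_not_normal(float value, float factor)
{
    volatile float x = value;  // volatile: round every product to float
    int passes = 0;
    while (std::fpclassify(x) == FP_NORMAL)
    {
        x = x * factor;
        ++passes;
    }
    return passes;
}
```

With a starting value of 1.0f and a factor of 0.5f this returns 127, because 2^-126 (FLT_MIN) is the smallest normal float. That is far fewer than the 1000000 passes in the question's test loop, so nearly all of the in-place run operates on subnormal (or eventually zero) data.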



It doesn't take extra time to produce a +-Inf or NaN result, but it does for gradual underflow to subnormal on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.
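If changing the numerical behaviour is acceptable, FTZ and DAZ can also be set by hand instead of relying on -ffast-math. The following is a minimal sketch using the MXCSR helper macros from the SSE headers; the function names `enable_ftz_daz` and `half_of_flt_min` are hypothetical and only serve the illustration. Note that MXCSR is per-thread state, so this only affects the calling thread.

```cpp
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON
#include <cfloat>       // FLT_MIN

// Hypothetical helper: sets FTZ (subnormal results are flushed to zero) and
// DAZ (subnormal inputs are treated as zero) in the current thread's MXCSR.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

// Hypothetical probe: FLT_MIN * 0.5f is subnormal under default IEEE
// behaviour and becomes zero once FTZ is enabled. The volatile variables
// keep the compiler from folding the product at compile time.
float half_of_flt_min()
{
    volatile float smallest_normal = FLT_MIN;
    volatile float half = 0.5f;
    return smallest_normal * half;
}
```

Calling `enable_ftz_daz()` once at the start of the processing thread and then checking that `half_of_flt_min()` returns 0.0f is a quick way to confirm the mode took effect on an SSE math build.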





I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.



There's a performance counter on Intel CPUs for fp_assist.any which counts when a sub-normal result requires extra microcode uops to handle the special case. (I think it's as intrusive as that for the front-end and OoO exec. It's definitely slow, though.)





Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?



Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)
















answered Nov 28 '18 at 8:37

Peter Cordes































