INTEL SIMD: why is inplace multiplication so slow?

I have written some vector methods that do simple math either in place or copying to a separate buffer, and they all share the same penalty for the in-place variant.

The simplest can be boiled down to something like this:

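#include <xmmintrin.h>

void scale(float* dst, const float* src, int count, float factor)
{
    // Broadcast the scalar factor into all four lanes.
    __m128 factorV = _mm_set1_ps(factor);

    // Assumes count is a multiple of 4 and both pointers are 16-byte aligned.
    for(int i = 0; i < count; i += 4)
    {
        __m128 in = _mm_load_ps(src);
        in = _mm_mul_ps(in, factorV);
        _mm_store_ps(dst, in);

        dst += 4;
        src += 4;
    }
}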

Testing code:

for(int i = 0; i < 1000000; i++)
{
    scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}

When testing, i.e. repeatedly running this function on the SAME buffers, I found that the in-place case (dst and src are the same) is slow. If they are different, it's about a factor of 70 faster. Most of the cycles are burned on writing (i.e. _mm_store_ps).

Interestingly, the same behaviour does not hold for addition, i.e. += works nicely, only *= is a problem.

This has been answered in the comments. It's denormals during artificial testing.

edited Dec 10 '18 at 10:19

asked Nov 27 '18 at 20:46

Eike

10318

3

A factor of 70? Did you compile with optimization disabled or something? It smells like you're bottlenecking on store-forwarding latency somehow, instead of multiply throughput. That could explain a factor of ~7 or so, but not 70. addps and mulps are not very different in performance on Intel hardware (agner.org/optimize), so there's something weird going on. What compiler/options/hardware, and what does the resulting asm look like?

– Peter Cordes
Nov 27 '18 at 20:51

2

We need a Minimal, Complete, and Verifiable example. What is count? Are your arrays aligned?

– Alan Birtles
Nov 27 '18 at 21:19

1

@AlanBirtles: The code is using _mm_store_ps, not storeu, so it would fault on unaligned unless the compiler uses movups anyway. That is a possibility for ICC and MSVC though.

– Peter Cordes
Nov 27 '18 at 21:23

1

Does your factor produce a subnormal result? non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to cause FP assists. (It doesn't take extra time to produce a +Inf or NaN result, but it does for gradual underflow to subnormal. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.)

– Peter Cordes
Nov 27 '18 at 21:25

1

@PeterCordes While thinking about the minimal complete verifiable example (I did NOT expect people to actually run my code), I found that indeed this code produces denormals when repeatedly multiplying a buffer with something abs-smaller than 1.f. I'm used to finding denormals in states (e.g. IIR-filters), but did not think of this (which is a testing artefact). It all makes sense now (inplace, *= but not += ..) Thanks a bunch!! If you copy your comment to an answer I can accept it.

– Eike
Nov 28 '18 at 7:46

c++ sse simd multiplication in-place


1 Answer
Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.

(Turns out, yes that was the problem for the OP).

Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.
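To get a feel for how quickly that happens, here is a minimal scalar sketch (not the OP's code; `stepsUntilSubnormal` is a hypothetical helper made up for illustration). It assumes default IEEE behaviour, i.e. no -ffast-math and FTZ/DAZ not set, and simply counts multiplications until the value turns subnormal.

```cpp
#include <cmath>   // std::fpclassify, FP_SUBNORMAL

// Hypothetical helper (not from the question): counts how many times `value`
// can be multiplied by `factor` before the result becomes subnormal.
// Returns -1 if that does not happen within `maxSteps` multiplications.
int stepsUntilSubnormal(float value, float factor, int maxSteps)
{
    for (int step = 1; step <= maxSteps; ++step)
    {
        value *= factor;  // same effect as one in-place pass over an element
        if (std::fpclassify(value) == FP_SUBNORMAL)
            return step;
    }
    return -1;
}
```

Starting from 1.0f with a factor of 0.5f this takes 127 steps, because 2^-127 is the first power of two below FLT_MIN (2^-126). The test loop in the question runs 1000000 passes over the same buffer, so with any factor noticeably below 1.0 almost all of those passes operate on subnormal data.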

It doesn't take extra time to produce a +-Inf or NaN result, but it does for gradual underflow to subnormal on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.
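If you don't want full -ffast-math, the same two MXCSR bits can be set by hand. The following is a sketch for x86 with SSE, using the standard `_MM_SET_FLUSH_ZERO_MODE` and `_MM_SET_DENORMALS_ZERO_MODE` macros; the function names `enableFlushToZero` and `halfOfSmallestNormal` are hypothetical, chosen only for this illustration. Note that MXCSR is per-thread state, so this only affects the calling thread.

```cpp
#include <xmmintrin.h>  // _mm_mul_ps, _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE
#include <cfloat>       // FLT_MIN
#include <cmath>        // std::fpclassify

// Hypothetical helper: set FTZ (subnormal results become zero) and
// DAZ (subnormal inputs are treated as zero) for the calling thread.
void enableFlushToZero()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

// Hypothetical probe: computes FLT_MIN * 0.5f with mulps at run time.
// The volatile read keeps the compiler from folding the product at
// compile time, where the MXCSR setting would have no effect.
float halfOfSmallestNormal()
{
    volatile float tiny = FLT_MIN;
    __m128 v = _mm_set1_ps(tiny);
    v = _mm_mul_ps(v, _mm_set1_ps(0.5f));
    return _mm_cvtss_f32(v);
}
```

Without the call to `enableFlushToZero()`, the probe returns a subnormal; after the call it returns zero, so a loop like the one in the question never has subnormal data to work on. Whether flushing to zero is numerically acceptable depends on the application.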

I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.

There's a performance counter on Intel CPUs for fp_assist.any which counts when a sub-normal result requires extra microcode uops to handle the special case. (I think it's as intrusive as that for the front-end and OoO exec. It's definitely slow, though.)

Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?

Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)

answered Nov 28 '18 at 8:37

Peter Cordes

130k18196334
