INTEL SIMD: why is inplace multiplication so slow?
I have written some vector-methods that do simple math inplace or copying and that share the same penalty for the inplace variant.
The simplest can be boiled down to something like this:
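#include <xmmintrin.h>  // SSE intrinsics

// Multiplies count floats from src by factor and writes them to dst.
// Both pointers must be 16-byte aligned and count a multiple of 4.
void scale(float* dst, const float* src, int count, float factor)
{
    __m128 factorV = _mm_set1_ps(factor);  // broadcast the scalar to all 4 lanes
    for (int i = 0; i < count; i += 4)
    {
        __m128 in = _mm_load_ps(src);
        in = _mm_mul_ps(in, factorV);
        _mm_store_ps(dst, in);
        dst += 4;
        src += 4;
    }
}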
testing code:
for (int i = 0; i < 1000000; i++)
{
    scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}
When testing, i.e. repeatedly running this function on the SAME buffers, I found that the in-place case (dst and src identical) is about a factor of 70 slower than the case where dst and src are different. Most of the cycles are spent on the write (i.e. _mm_store_ps).
Interestingly, the same behaviour does not hold for addition: += works nicely, only *= is a problem.
--
This has been answered in the comments. It's denormals during artificial testing.
c++ sse simd multiplication in-place
3
A factor of 70? Did you compile with optimization disabled or something? It smells like you're bottlenecking on store-forwarding latency somehow, instead of multiply throughput. That could explain a factor of ~7 or so, but not 70. addps and mulps are not very different in performance on Intel hardware (agner.org/optimize), so there's something weird going on. What compiler/options/hardware, and what does the resulting asm look like?
– Peter Cordes
Nov 27 '18 at 20:51
2
We need a Minimal, Complete, and Verifiable example. What is count? Are your arrays aligned?
– Alan Birtles
Nov 27 '18 at 21:19
1
@AlanBirtles: The code is using _mm_store_ps, not storeu, so it would fault on unaligned unless the compiler uses movups anyway. That is a possibility for ICC and MSVC though.
– Peter Cordes
Nov 27 '18 at 21:23
1
Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to cause FP assists. (It doesn't take extra time to produce a +Inf or NaN result, but it does for gradual underflow to subnormal. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.)
– Peter Cordes
Nov 27 '18 at 21:25
1
@PeterCordes While thinking about the minimal complete verifiable example (I did NOT expect people to actually run my code), I found that indeed this code produces denormals when repeatedly multiplying a buffer with something abs-smaller than 1.f. I'm used to finding denormals in states (e.g. IIR-filters), but did not think of this (which is a testing artefact). It all makes sense now (inplace, *= but not += ..) Thanks a bunch!! If you copy your comment to an answer I can accept it.
– Eike
Nov 28 '18 at 7:46
asked Nov 27 '18 at 20:46
edited Dec 10 '18 at 10:19
Eike
1 Answer
Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.
(It turns out that yes, that was the problem for the OP.)
Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.
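As a rough illustration of how quickly that happens, here is a minimal scalar sketch. The helper name passesUntilSubnormal is hypothetical (it is not from the question), and it assumes ordinary IEEE single precision with SSE scalar math, as on x86-64. It counts how many in-place multiplications it takes for a value to fall below FLT_MIN:

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical helper, for illustration only: counts how many in-place
// multiplications by `factor` it takes until |x| drops below FLT_MIN,
// i.e. until x becomes subnormal (or zero, if flush-to-zero is enabled).
// Returns -1 if that does not happen within maxPasses.
int passesUntilSubnormal(float start, float factor, int maxPasses)
{
    float x = start;
    for (int pass = 1; pass <= maxPasses; ++pass) {
        x *= factor;                    // same update the in-place scale() applies
        if (std::fabs(x) < FLT_MIN)     // smaller than the smallest normal float
            return pass;
    }
    return -1;                          // value stayed normal (or grew) the whole time
}
```

With a start value of 1.0f and a factor of 0.5f this returns 127, so a test loop of 1000000 passes over the same buffer spends nearly all of its time on subnormal values (or on zeros, once the values underflow completely).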
It doesn't take extra time to produce a ±Inf or NaN result, but it does for gradual underflow to subnormal, on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.
I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.
There's a performance counter on Intel CPUs for fp_assist.any which counts when a sub-normal result requires extra microcode uops to handle the special case. (I think it's as intrusive as that for the front-end and OoO exec. It's definitely slow, though.)
Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?
Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)
answered Nov 28 '18 at 8:37
Peter Cordes