Convert array of eight bytes to eight integers
I am working with the Xeon Phi Knights Landing. I need to do a gather operation from an array of doubles, where the list of indices comes from an array of chars. The gather intrinsics are either _mm512_i32gather_pd or _mm512_i64gather_pd, so as I understand it I need to convert the eight chars either to eight 32-bit integers or to eight 64-bit integers. I have gone with the first option, for _mm512_i32gather_pd.
I have created two functions, get_index and get_index2, that convert eight chars to a __m256i. The assembly for get_index is simpler than for get_index2 (see https://godbolt.org/z/lhg9fX), yet in my code get_index2 is significantly faster. Why is this? I am using ICC 18. Is there a better solution than either of these two functions?
#include <x86intrin.h>
#include <inttypes.h>

__m256i get_index(char *index) {
    // Load the eight index bytes as one 64-bit integer.
    int64_t x = *(int64_t *)&index[0];
    // Shuffle mask: put index byte i into the low byte of dword i and zero
    // the other three bytes (a mask byte with its high bit set writes 0).
    const __m256i t3 = _mm256_setr_epi8(
        0,0x80,0x80,0x80,
        1,0x80,0x80,0x80,
        2,0x80,0x80,0x80,
        3,0x80,0x80,0x80,
        4,0x80,0x80,0x80,
        5,0x80,0x80,0x80,
        6,0x80,0x80,0x80,
        7,0x80,0x80,0x80);
    // Broadcast the eight bytes into both 128-bit lanes, then shuffle.
    __m256i t2 = _mm256_set1_epi64x(x);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}

__m256i get_index2(char *index) {
    const __m256i t3 = _mm256_setr_epi8(
        0,0x80,0x80,0x80,
        1,0x80,0x80,0x80,
        2,0x80,0x80,0x80,
        3,0x80,0x80,0x80,
        4,0x80,0x80,0x80,
        5,0x80,0x80,0x80,
        6,0x80,0x80,0x80,
        7,0x80,0x80,0x80);
    // Load the eight bytes into an XMM register, then duplicate that 128-bit
    // lane into both halves of a YMM register before shuffling.
    __m128i t1 = _mm_loadl_epi64((__m128i*)index);
    __m256i t2 = _mm256_inserti128_si256(_mm256_castsi128_si256(t1), t1, 1);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}
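For reference, here is a minimal sketch of how I intend to use the index vector with the gather (src is a placeholder name for my actual array of doubles, not my real code):

// Sketch only: gather eight doubles from `src` using the eight byte indices
// converted by get_index. Scale 8 = sizeof(double).
static inline __m512d gather8(const double *src, char *index) {
    __m256i vindex = get_index(index);            // eight 32-bit indices
    return _mm512_i32gather_pd(vindex, src, 8);   // AVX-512F gather
}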
x86 avx2 xeon-phi avx512 knights-landing
asked Nov 24 '18 at 14:25 – Z boson
KNL has very slow 256-bit vpshufb ymm (12 uops, 23c latency, 12c throughput), and 128-bit XMM is slow, too. (MMX is fast :P). See Agner Fog's tables. Why can't you use vpmovzxbd or bq like a normal person? __m512i _mm512_cvtepu8_epi32(__m128i a) or _mm256_cvtepu8_epi32. Those are all single-uop with 2c throughput. – Peter Cordes, Nov 24 '18 at 18:28
That doesn't explain your results, though. What loop did these functions inline into? Are you sure they didn't optimize differently somehow given different surrounding code? Otherwise IDK why a load + insert would be faster than a qword broadcast-load. Maybe some kind of front-end effect? Again we'd need to see the whole loop to guess about the front-end. – Peter Cordes, Nov 24 '18 at 18:34
@PeterCordes, thank you for pointing out _mm256_cvtepu8_epi32; that's exactly what I want, though in my code the result is no faster than get_index2. Maybe ICC converts get_index2 to vpmovzxbd in my code anyway. I did not think of this because I'm a bit rusty with vectorization. But now I get about a 4x improvement with manual vectorization compared to ICC auto-vectorization (with #pragma ivdep). I'm vectorizing stencil code. – Z boson, Nov 26 '18 at 12:10
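For reference, a minimal sketch of the vpmovzxbd/vpmovzxbq approach suggested in the comments above (function names are placeholders; the point is the single load-and-zero-extend instruction instead of broadcast + vpshufb):

#include <x86intrin.h>

// Eight chars -> eight 32-bit indices, for _mm512_i32gather_pd.
__m256i get_index_zx32(const char *index) {
    __m128i bytes = _mm_loadl_epi64((const __m128i *)index); // load 8 bytes
    return _mm256_cvtepu8_epi32(bytes);                      // vpmovzxbd
}

// Eight chars -> eight 64-bit indices, for _mm512_i64gather_pd.
__m512i get_index_zx64(const char *index) {
    __m128i bytes = _mm_loadl_epi64((const __m128i *)index); // load 8 bytes
    return _mm512_cvtepu8_epi64(bytes);                      // vpmovzxbq
}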