Choose assembly implementation to use based on supported instructions
I am working on a C library which compiles/links to a `.a` file that users can statically link into their code. The library's performance is very important, so I am writing the performance-critical routines in x86-64 assembly.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So, I've written the optimized routines twice, once using BMI2 instructions and once without them. In my current setup, I would distribute two versions of the `.a` file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not.
I am asking if there's a way to simplify this by distributing a single `.a` file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, there are two peculiarities here:
- The technique to choose the function needs to have particularly low overhead in the critical path. The routines in question, after assembly-optimization, run in ~10 ns, so even a single `if` statement could be significant.
- The function that needs to be chosen "dynamically" is chosen once at the beginning, and then remains fixed for the duration of the program. I'm hoping that this will offer a faster solution than the one suggested in this question: "Choosing method implementation at runtime".
The fastest solution I've come up with so far is to do the following:
- Check whether the CPU supports BMI2 instructions using the `cpuid` instruction.
- Set a global variable to `true` or `false` depending on the result.
- Branch on the value of this global variable on every function invocation.
I'm not satisfied with this approach because it has two drawbacks:
- I'm not sure how I can automatically run `cpuid` and set a global variable at the beginning of the program, given that I'm distributing a `.a` file and don't have control over the `main` function in the final binary. I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program.
- This incurs overhead on every function call, when ideally the only overhead would be on program startup.
Are there any solutions that are more efficient than the one I've detailed above?
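For reference, the three-step approach described above might look roughly like this in C. It is only a sketch, assuming GCC or Clang on x86-64 (for `<cpuid.h>` and `unsigned __int128`). The names `mulhi`, `mulhi_generic`, `mulhi_bmi2`, `have_bmi2`, and `detect_bmi2` are hypothetical, and the two C helpers merely stand in for the hand-written assembly routines.

```c
#include <cpuid.h>    /* GCC/Clang wrapper around the cpuid instruction */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two hand-written assembly routines. */
static uint64_t mulhi_generic(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
static uint64_t mulhi_bmi2(uint64_t a, uint64_t b) {
    /* A real build would call the BMI2-based assembly here. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

static bool have_bmi2;  /* the global flag from step 2 */

/* Steps 1 and 2: BMI2 is reported in CPUID leaf 7, subleaf 0, EBX bit 8. */
void detect_bmi2(void) {
    unsigned int eax, ebx, ecx, edx;
    have_bmi2 = __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)
                && (ebx & (1u << 8));
}

/* Step 3: every call pays for one branch on the flag. */
uint64_t mulhi(uint64_t a, uint64_t b) {
    return have_bmi2 ? mulhi_bmi2(a, b) : mulhi_generic(a, b);
}
```

Something still has to call `detect_bmi2()` once before the first `mulhi()` call, which is the first drawback mentioned above.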
c++ c assembly static-libraries static-linking
asked Nov 28 '18 at 2:19 by Sam Kumar
Set the global variable, not upon entry to `main`, but on the first call to one of your functions, using `pthread_once` or equivalent to ensure thread safety. – zwol Nov 28 '18 at 2:51
Don't worry about the cost of the branch; it will be predicted reliably after the first call. – zwol Nov 28 '18 at 2:52
Well, in C++ itself (as a language) you simply distribute the source and let the user recompile it; that also allows more specific optimizations of the C++ code (driven by the user's machine) to be applied. (Also, code without source is a zombie: in a decade or two it will very likely be dead, so unless you intentionally want to waste part of your life creating something that will be lost in a few years...) – Ped7g Nov 28 '18 at 7:48
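The lazy initialization suggested in the first comment could be sketched as follows. This assumes a POSIX system and GCC or Clang (for `__builtin_cpu_supports` and `unsigned __int128`). The names `mulhi`, `mulhi_impl`, `mulhi_select`, and the two stand-in routines are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two hand-written assembly routines. */
static uint64_t mulhi_generic(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
static uint64_t mulhi_bmi2(uint64_t a, uint64_t b) {
    /* A real build would call the BMI2-based assembly here. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

static uint64_t (*mulhi_impl)(uint64_t, uint64_t);
static pthread_once_t mulhi_once = PTHREAD_ONCE_INIT;

/* Runs exactly once, even if several threads race on the first call. */
static void mulhi_select(void) {
    mulhi_impl = __builtin_cpu_supports("bmi2") ? mulhi_bmi2 : mulhi_generic;
}

uint64_t mulhi(uint64_t a, uint64_t b) {
    pthread_once(&mulhi_once, mulhi_select);
    return mulhi_impl(a, b);
}
```

No cooperation from `main` is needed with this variant, but each call still goes through the `pthread_once` check and an indirect call. On older glibc versions the program may need to be linked with `-pthread`.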
2 Answers
x264 uses an init function (which users of the library are required to call before calling anything else, or something like that) to set up a struct of function pointers based on CPUID results. This includes taking into account that `pshufb` is slow on some early CPUs that support it.
If your functions depend on `pdep` / `pext`, you probably want to detect AMD vs. Intel, because AMD's `pdep`/`pext` is very slow and probably not worth using on Ryzen, even though it is available. (See https://agner.org/optimize/ for instruction tables.)
Function pointers are fairly low overhead, about the same as calling a function in a shared library or DLL: `call [rel funcptr]` instead of `call func` (in the compiler-generated asm that calls your functions).
"CPU dependent code: how to avoid function pointers?" shows a very simple example of it in C, and is asking for ways to avoid it. With dynamic linking, you can do CPU detection at dynamic link time so the dynamic-linking indirection becomes your CPU-dispatch indirection as well (like glibc does for selecting an optimized `memcpy` implementation).
But with static linking for a `.a`, just make function pointers that are statically initialized to the baseline versions, and have your CPU init function (which hopefully runs before any of the function pointers are dereferenced) rewrite them to point at the best version for the current CPU.
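A minimal sketch of that last pattern, assuming GCC or Clang on x86-64 (for `__builtin_cpu_supports` and `unsigned __int128`). The names `mylib_init`, `mulhi`, `mulhi_generic`, and `mulhi_bmi2` are hypothetical, and the C bodies only stand in for the real assembly routines.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two hand-written assembly routines. */
static uint64_t mulhi_generic(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
static uint64_t mulhi_bmi2(uint64_t a, uint64_t b) {
    /* A real build would call the BMI2-based assembly here. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Public function pointer, statically initialized to the baseline version,
   so calling it is safe even if mylib_init() has not run yet. */
uint64_t (*mulhi)(uint64_t, uint64_t) = mulhi_generic;

/* Hypothetical init function: the user calls it once before the hot code. */
void mylib_init(void) {
    if (__builtin_cpu_supports("bmi2"))
        mulhi = mulhi_bmi2;
}
```

Callers write `mulhi(a, b)` as usual, and the compiler emits an indirect call through the pointer. Forgetting `mylib_init()` costs performance but not correctness, because the pointer already targets the baseline routine.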
answered Nov 28 '18 at 3:51 by Peter Cordes
Thanks for the extra information about STT_GNU_IFUNC, and for the idea to use function pointers. Since you asked, I'm not actually using `pdep` or `pext`; I need BMI2 for the `mulx` instruction. Rather than having a special init function, I ended up using `init_array` (more information here: stackoverflow.com/questions/31137260/…) to set up the function pointer before the `main` function executes. – Sam Kumar Nov 30 '18 at 0:00
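One way to get a function into `.init_array` from C is the GCC/Clang `constructor` attribute. The sketch below assumes an ELF target and those compilers. It is not necessarily the exact mechanism used in the comment above, and the names `mulhi`, `mulhi_dispatch_init`, and the stand-in routines are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two hand-written assembly routines. */
static uint64_t mulhi_generic(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
static uint64_t mulhi_bmi2(uint64_t a, uint64_t b) {
    /* A real build would call the BMI2-based assembly here. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Baseline by default, upgraded before main() runs. */
uint64_t (*mulhi)(uint64_t, uint64_t) = mulhi_generic;

/* The constructor attribute asks the toolchain to run this before main();
   on ELF targets that is typically done through an .init_array entry. */
__attribute__((constructor))
static void mulhi_dispatch_init(void) {
    __builtin_cpu_init();  /* harmless if the CPU info is already set up */
    if (__builtin_cpu_supports("bmi2"))
        mulhi = mulhi_bmi2;
}
```

With a static library, the linker only pulls in archive members that resolve an undefined symbol, so it seems safest to keep the constructor in the same object file as the `mulhi` pointer that callers reference.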
If you are using gcc, you can get the compiler to implement all the boilerplate code automatically; see the gcc manual page on function multiversioning.
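For C code, the closest GCC feature appears to be the `target_clones` attribute. The sketch below assumes GCC (or a recent Clang) on an x86-64 platform with ifunc support, such as Linux with glibc. It applies to functions the compiler generates, not to hand-written assembly, and `mulhi` is a hypothetical example function.

```c
#include <stdint.h>

/* The compiler emits one clone built with BMI2 enabled and one baseline
   clone, plus a resolver that picks between them once at load time. */
__attribute__((target_clones("bmi2", "default")))
uint64_t mulhi(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

Callers simply call `mulhi(a, b)`; the dispatch happens through the ifunc mechanism rather than through an explicit function pointer in the source.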
answered Dec 27 '18 at 14:19 by David