Question about OpenMP sections and critical
I am trying to make a fast parallel loop. In each iteration, I build an array, which is costly, so I want that work distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky: the matrix is shared by all threads, so only one thread at a time can modify a given part of it. However, the update itself can also be distributed, since different parts of the matrix can be modified at the same time.
Here is what I am currently doing:
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
        ... build array bi ...
        #pragma omp critical
        {
            update_matrix(A, bi);
        }
    }
    ...
    subroutine update_matrix(A, b)
    {
        printf("id0 = %d\n", omp_get_thread_num());
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                printf("id1 = %d\n", omp_get_thread_num());
                modify columns 1 to j of A using b
            }
            #pragma omp section
            {
                printf("id2 = %d\n", omp_get_thread_num());
                modify columns j+1 to k of A using b
            }
        }
    }
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
parallel-processing openmp
I'm not sure it will be of any use in terms of performance, but if you want nested parallelism (which is what you're trying to achieve here), you need to enable it explicitly. That can be done by setting the environment variable OMP_NESTED to 'true', or by calling the function omp_set_nested() inside the code.
– Gilles
Nov 29 '18 at 6:50
asked Nov 28 '18 at 19:49 by vibe, edited Nov 29 '18 at 1:38
1 Answer
#pragma omp parallel sections does not create new threads there because you are already inside a parallel region created by the #pragma omp parallel for directive. Unless you have enabled nested parallelism with omp_set_nested(1);, the inner parallel region is executed by a team of a single thread, so both sections run one after the other on thread 0 of that team.
Please note that nesting is not necessarily efficient: spawning new threads has an overhead that may not be worth paying if the update_matrix part is not very CPU-intensive.
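As a minimal sketch of what enabling nesting changes, the hypothetical helper below (inner_team_size is not from the question) opens a parallel region inside another one and reports the size of the inner team. It uses omp_set_max_active_levels, which OpenMP 5.0 prefers over the now-deprecated omp_set_nested; that substitution is an assumption about your runtime, and older runtimes may still need omp_set_nested(1). The num_threads(2) values are arbitrary choices for illustration.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not from the post): start an inner parallel region
   from inside an outer one and return the number of threads in the inner team. */
int inner_team_size(int allow_nesting)
{
    int size = 1;            /* value returned when built without OpenMP */
    (void)allow_nesting;     /* silences "unused" warnings in that case  */
#ifdef _OPENMP
    /* 1 active level = nesting off, 2 = one nested level allowed. */
    omp_set_max_active_levels(allow_nesting ? 2 : 1);

    #pragma omp parallel num_threads(2)
    {
        /* Only one outer thread opens the inner region,
           so `size` is written exactly once. */
        #pragma omp single
        {
            #pragma omp parallel num_threads(2)
            {
                #pragma omp single
                size = omp_get_num_threads();
            }
        }
    }
#endif
    assert(size >= 1);       /* a team always has at least one thread */
    return size;
}
```

With nesting disabled the helper returns 1, which matches the id1 = 0 / id2 = 0 output in the question. With nesting allowed it can return up to 2, depending on how many threads the runtime is able to provide.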
You have several options:
- Forget about it. If the non-critical part of the loop really accounts for most of the computation and you already have as many threads as CPUs, spawning extra threads for a simple operation will do no good. Just remove the parallel sections directive from the subroutine.
- Try enabling nesting with omp_set_nested(1);
- Use named critical sections, at the cost of doubled synchronization overhead. At most one thread can be in critical section ONE_TO_J and one in critical section J_TO_K, so up to two threads can update the matrix in parallel:
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
        ... build array bi ...
        update_matrix(A, bi); // not critical
    }
    ...
    subroutine update_matrix(A, b)
    {
        printf("id0 = %d\n", omp_get_thread_num());
        #pragma omp critical(ONE_TO_J)
        {
            printf("id1 = %d\n", omp_get_thread_num());
            modify columns 1 to j of A using b
        }
        #pragma omp critical(J_TO_K)
        {
            printf("id2 = %d\n", omp_get_thread_num());
            modify columns j+1 to k of A using b
        }
    }
Or use atomic operations to edit the matrix, if this is suitable.
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
        ... build array bi ...
        update_matrix(A, bi); // not critical
    }
    ...
    subroutine update_matrix(A, b)
    {
        float tmp; // local to the call, so each thread has its own copy
        printf("id0 = %d\n", omp_get_thread_num());
        for (int row = 0; row < max_row; row++) {
            for (int column = 0; column < k; column++) {
                tmp = some_function(b, row, column);
                // only the update of the shared element is serialized
                #pragma omp atomic
                A[column][row] += tmp;
            }
        }
    }
By the way, C stores arrays in row-major order, so you should update the matrix row by row rather than column by column. This reduces false sharing and improves the algorithm's memory-access performance.
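The row-by-row advice combined with the atomic update could look roughly like the sketch below. It assumes A is a flat row-major array of nrows * ncols doubles and that each array b has nrows entries. The function contribution is a hypothetical stand-in for some_function, and accumulate is an illustrative driver. None of these details come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the post's some_function(b, row, column). */
static double contribution(const double *b, size_t row, size_t col)
{
    return b[row] * (double)(col + 1);
}

/* Add the contribution of one array b to the shared matrix A.
   A is assumed to be a flat row-major array of nrows * ncols doubles.
   May be called concurrently by several threads. */
void update_matrix(double *A, const double *b, size_t nrows, size_t ncols)
{
    assert(A != NULL && b != NULL);
    for (size_t row = 0; row < nrows; ++row) {
        /* The inner loop walks one row, i.e. contiguous memory. */
        for (size_t col = 0; col < ncols; ++col) {
            double tmp = contribution(b, row, col);
            /* Only the update of the shared element is serialized. */
            #pragma omp atomic
            A[row * ncols + col] += tmp;
        }
    }
}

/* Illustrative driver mirroring the outer loop of the question:
   bs holds n arrays of nrows entries each, stored back to back. */
void accumulate(double *A, const double *bs, int n, size_t nrows, size_t ncols)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        update_matrix(A, bs + (size_t)i * nrows, nrows, ncols);
    }
}
```

Because the pragmas are ignored when OpenMP is not enabled, the same code also runs serially, which makes it easy to compare results and timings before deciding whether the atomic version is worth it.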
answered Nov 29 '18 at 9:12 by Brice