Question about OpenMP sections and critical



























I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.



Here is what I currently am doing:



#pragma omp parallel for
for (i = 0; i < n; ++i)
{
    ... build array bi ...
    #pragma omp critical
    {
        update_matrix(A, bi)
    }
}

...

subroutine update_matrix(A, b)
{
    printf("id0 = %d\n", omp_get_thread_num());
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            printf("id1 = %d\n", omp_get_thread_num());
            modify columns 1 to j of A using b
        }

        #pragma omp section
        {
            printf("id2 = %d\n", omp_get_thread_num());
            modify columns j+1 to k of A using b
        }
    }
}


The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:



id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...


So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?






























  • 1





    I'm not sure it will help in terms of performance, but if you want nested parallelism (which is what you're trying to achieve here), you need to enable it explicitly, either by setting the environment variable OMP_NESTED to 'true' or by calling omp_set_nested() in the code.

    – Gilles
    Nov 29 '18 at 6:50
















parallel-processing openmp














edited Nov 29 '18 at 1:38







vibe

















asked Nov 28 '18 at 19:49









vibe




















1 Answer














#pragma omp parallel sections does not start new threads there because you are already inside the parallel region created by the #pragma omp parallel for directive. Unless you have enabled nested parallelism with omp_set_nested(1);, the inner parallel region is executed by a team of one thread, so both sections run on the calling thread, which reports itself as thread 0 of its own team.
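To make this concrete, here is a minimal sketch in C (not the asker's code). The helper name inner_team_size is hypothetical. It opens a parallel region inside another one and reports how many threads the inner team has. It uses omp_set_max_active_levels(), the standard routine that limits how many nested parallel regions may be active (omp_set_nested() is deprecated as of OpenMP 5.0 in its favour). The serial fallbacks are an assumption added so the sketch also builds when OpenMP is not enabled.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still builds without OpenMP. */
static int omp_get_num_threads(void) { return 1; }
static void omp_set_max_active_levels(int levels) { (void)levels; }
#endif

/* Hypothetical helper: returns the size of the team that executes a
 * parallel region opened from inside another parallel region, when at
 * most `levels` nested parallel regions are allowed to be active. */
int inner_team_size(int levels)
{
    int size = 0;
    omp_set_max_active_levels(levels);

    #pragma omp parallel num_threads(2)
    {
        /* Only one outer thread opens the inner region. */
        #pragma omp single
        {
            #pragma omp parallel num_threads(2)
            {
                /* One thread of the inner team records the team size. */
                #pragma omp single
                size = omp_get_num_threads();
            }
        }
    }
    return size;
}
```

With one active level allowed, inner_team_size(1) returns 1: the inner region runs on a single thread, and omp_get_thread_num() inside it is 0, which matches the id1 = 0 and id2 = 0 lines in the question. With inner_team_size(2) the runtime is allowed, but not obliged, to give the inner team two threads.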



Please note that nesting is not necessarily efficient: spawning new threads has an overhead that may not be worth paying if the update_matrix part is not very CPU-intensive.



You have several options:




  • Forget about it. If the non-critical part of the loop is really what takes most of the computation and you already have as many threads as CPUs, spawning extra threads for a simple operation will do no good. Just remove the parallel sections directive from the subroutine.


  • Try enabling nesting with omp_set_nested(1);



  • Another option is to use named critical sections. Only one thread at a time may be in critical section ONE_TO_J and one in critical section J_TO_K, so up to two threads may update the matrix in parallel. This comes at the cost of doubled synchronization overhead.



    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
        ... build array bi ...
        update_matrix(A, bi); // not critical
    }

    ...

    subroutine update_matrix(A, b)
    {
        printf("id0 = %d\n", omp_get_thread_num());
        #pragma omp critical(ONE_TO_J)
        {
            printf("id1 = %d\n", omp_get_thread_num());
            modify columns 1 to j of A using b
        }

        #pragma omp critical(J_TO_K)
        {
            printf("id2 = %d\n", omp_get_thread_num());
            modify columns j+1 to k of A using b
        }
    }



  • Or use atomic operations to edit the matrix, if this is suitable.



    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
        ... build array bi ...
        update_matrix(A, bi); // not critical
    }

    ...

    subroutine update_matrix(A, b)
    {
        float tmp;
        printf("id0 = %d\n", omp_get_thread_num());
        for (int row = 0; row < max_row; row++) {
            for (int column = 0; column < k; column++) {
                tmp = some_function(b, row, column);
                // each element update is atomic, so concurrent calls do not lose updates
                #pragma omp atomic
                A[row][column] += tmp;
            }
        }
    }


    By the way, C stores arrays in row-major order, so you should update the matrix row by row rather than column by column. This reduces false sharing and improves the algorithm's memory-access performance.
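As an illustration of the atomic option, here is a small self-contained sketch in C. It is not the asker's algorithm: the matrix size, the build_array stand-in for the costly step, the "add b to every row" update, and the accumulate driver are all assumptions made up for the example. The point it shows is that update_matrix can be called from every thread of the parallel loop without a critical section, because each element update is atomic.

```c
#define ROWS 4
#define COLS 6

/* Hypothetical stand-in for the costly "build array bi" step. */
static void build_array(int i, double b[COLS])
{
    for (int c = 0; c < COLS; ++c)
        b[c] = (double)(i + c);
}

/* Adds b to every row of A. Each element update is atomic, so
 * concurrent callers cannot lose updates. The inner loop walks along
 * a row, which is contiguous in C's row-major layout. */
static void update_matrix(double A[ROWS][COLS], const double b[COLS])
{
    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c) {
            #pragma omp atomic
            A[r][c] += b[c];
        }
    }
}

/* Hypothetical driver: runs n iterations of build + update in parallel. */
void accumulate(double A[ROWS][COLS], int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double b[COLS];       /* declared in the loop body, so private */
        build_array(i, b);
        update_matrix(A, b);  /* no critical section needed */
    }
}
```

Starting from a zeroed matrix, accumulate(A, 100) leaves A[r][c] equal to the sum of (i + c) over i = 0..99 regardless of how iterations are scheduled. Whether atomics beat a single critical section depends on how much contention there is, so it is worth measuring both.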








answered Nov 29 '18 at 9:12

Brice































