OpenCL Buffer Creation



























I am fairly new to OpenCL. Although I have understood everything up until now, I am having trouble understanding how buffer objects work.



I don't understand where a buffer object is stored. In this StackOverflow question it is stated that:




If you have one device only, probably (99.99%) is going to be in the device. (In rare cases it may be in the host if the device does not have enough memory for the time being)




To me, this means that buffer objects are stored in device memory. However, as stated in this StackOverflow question, if the flag CL_MEM_ALLOC_HOST_PTR is used in clCreateBuffer, the memory used will most likely be pinned memory. My understanding is that pinned memory is never swapped out, which means it must be located in host RAM, not in device memory.



So what is actually happening?



What I would like to know is what the following flags:




  • CL_MEM_USE_HOST_PTR

  • CL_MEM_COPY_HOST_PTR

  • CL_MEM_ALLOC_HOST_PTR


imply about the location of the buffer.



Thank you















































      asked Nov 25 '18 at 21:35









      JasonPh

























          2 Answers






































          The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.



          First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation; it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.
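In code, that equivalence looks roughly like this. This is a sketch only: the ctx and queue handles are assumed to come from the usual platform/device/context setup, and error handling is abbreviated.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: two ways of ending up with an initialised buffer.
   ctx and queue must come from the usual OpenCL setup code. */
static cl_mem make_filled_buffer(cl_context ctx, cl_command_queue queue,
                                 const void *host_data, size_t size,
                                 int use_copy_flag, cl_int *err)
{
    if (use_copy_flag) {
        /* Variant A: copy-initialise at creation time. Note the cast:
           the host_ptr parameter is non-const in the C API. */
        return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              size, (void *)host_data, err);
    }
    /* Variant B: create uninitialised, then do a blocking write.
       The end state of the buffer contents is the same. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, err);
    if (*err == CL_SUCCESS)
        *err = clEnqueueWriteBuffer(queue, buf, CL_TRUE /* blocking */,
                                    0, size, host_data, 0, NULL, NULL);
    return buf;
}
```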



          Regarding allocation modes:





          • CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.


          • CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).

          • Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.
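The map/write/unmap pattern described for CL_MEM_ALLOC_HOST_PTR can be sketched as follows. Again, ctx and queue are assumed to exist from the usual setup, and error handling is abbreviated; whether this is actually zero-copy depends on the implementation.

```c
#include <CL/cl.h>
#include <string.h>

/* Sketch: fill a CL_MEM_ALLOC_HOST_PTR buffer via mapping.
   On zero-copy capable implementations, the mapped pointer refers
   to the buffer's actual storage, so no extra copy is needed. */
static cl_mem fill_alloc_host_ptr_buffer(cl_context ctx, cl_command_queue queue,
                                         const void *src, size_t size)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);
    if (err != CL_SUCCESS) return NULL;

    /* Map the buffer into host address space for writing. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking */,
                                 CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS) return NULL;

    memcpy(p, src, size);  /* or have host-side I/O write here directly */

    /* Unmap before using the buffer in a kernel. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```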


          Note that the implementation may also use any access flags provided (CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.



          Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.
































          • are you sure the alignment is sufficient to have the runtime pin user-provided memory? The specification only states buffer-type requirements here. Wouldn't pinning require page-size alignment, and the buffer size to be a multiple of the page size?

            – noma
            Nov 26 '18 at 13:36













          • @noma the system can still pin the whole page, even if only a part of it is technically part of the buffer object.

            – pmdj
            Nov 26 '18 at 14:56











          • Sure, but wouldn't a DMA copy, which to my best knowledge happens at page granularity, then copy around some random piece of application data from the same page, and - that's the bad part - overwrite the 'accidentally shared' host data when copying data back from the device?

            – noma
            Nov 26 '18 at 15:43











          • @noma Typical PCI(e) DMA has a granularity of 4 bytes. You can of course go right down to the byte level with certain tricks. (for example, the device could copy the partial dword to a driver-allocated dword, then the driver copies the bytes into the actual buffer, but there might be other ways) For the purposes of OpenCL, 4 byte alignment is fine. Other systems than PCIe might have different granularities, but 4K+ would be rather a lot - probably too awkward for any real devices to be limited like that.

            – pmdj
            Nov 26 '18 at 19:59













          • @noma For the other way around, mapping BARs into CPU address space, the base address is 16-byte-aligned, which is still perfectly good enough for OpenCL.

            – pmdj
            Nov 26 '18 at 20:04

































          Let's first have a look at the signature of clCreateBuffer:



          cl_mem clCreateBuffer(cl_context context,
                                cl_mem_flags flags,
                                size_t size,
                                void *host_ptr,
                                cl_int *errcode_ret)


          There is no argument here that tells the OpenCL runtime which device's memory the buffer should be placed in, as a context can have multiple devices. The runtime only knows once we actually use a buffer object, e.g. read/write from/to it, because those operations require a command queue that is bound to a specific device.



          Every memory object can reside in either host memory or one of the context's devices' memories, and the runtime may migrate it as needed. So in general, every memory object might have a piece of internal host memory within the OpenCL runtime. What the runtime actually does is implementation-dependent, so we cannot make too many assumptions and get no portable guarantees. That means everything about pinning etc. is implementation-dependent; you can only hope for the best and avoid patterns that will definitely prevent the use of pinned memory.



          Why do we want pinned memory?
          Pinned memory means that the virtual address of a memory page in our process' address space has a fixed translation to a physical RAM address. This enables DMA (Direct Memory Access) transfers, which operate on physical addresses, between the device memory of a GPU and CPU memory over PCIe. DMA lowers the CPU load and can increase copy speed. So we want the internal host storage of our OpenCL memory objects to be pinned, to increase the performance of data transfers between that internal host storage and the device memory of an OpenCL memory object.



          As a basic rule of thumb: if your runtime allocates the host memory, it might be pinned. If you allocate it in your application code, the runtime will pessimistically assume it is not pinned - which usually is a correct assumption.




          CL_MEM_USE_HOST_PTR




          Allows us to provide memory to the OpenCL implementation for the internal host storage of the object. It does not mean that the memory object will not be migrated into device memory when we call a kernel. Since that memory is user-provided, the runtime cannot assume it to be pinned. This might lead to an additional copy between the un-pinned internal host storage and a pinned staging buffer prior to the device transfer, to enable DMA for host-device transfers.




          CL_MEM_ALLOC_HOST_PTR




          We tell the runtime to allocate host memory for the object. It could be pinned.




          CL_MEM_COPY_HOST_PTR




          We provide host memory to copy-initialise our buffer from, not to use it internally. We can also combine it with CL_MEM_ALLOC_HOST_PTR. The runtime will allocate memory for internal host storage. It could be pinned.
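The CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR combination can be sketched like this (ctx and an initialised data array are assumed to exist; error handling is omitted):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: let the runtime allocate (possibly pinned) host-side storage
   AND copy-initialise it from our data in a single call. */
static cl_mem create_initialised(cl_context ctx, float *data, size_t n)
{
    cl_int err;
    return clCreateBuffer(ctx,
                          CL_MEM_READ_ONLY
                          | CL_MEM_ALLOC_HOST_PTR  /* runtime allocates...      */
                          | CL_MEM_COPY_HOST_PTR,  /* ...and copies from data   */
                          n * sizeof(float), data, &err);
}
```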



          Hope that helps.
































          • Ok, I got it now. Thank you!

            – JasonPh
            Nov 26 '18 at 13:21











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53472245%2fopencl-buffer-creation%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.



          First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.



          Regarding allocation modes:





          • CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.


          • CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).

          • Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.


          Note that the implementation may also use any access flags provided ( CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.



          Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.






          share|improve this answer


























          • are you sure the alignment is sufficient to have the runtime pin user-provided memory, as the specifications only states buffer-type requirements here. Wouldn't pinning require page-size alignment and the buffer size to be a multiple of the page size?

            – noma
            Nov 26 '18 at 13:36













          • @noma the system can still pin the whole page, even if only a part of it is technically part of the buffer object.

            – pmdj
            Nov 26 '18 at 14:56











          • Sure, but wouldn't a DMA copy, which to my best knowledge happens on page granularity, then copy around some random piece of application-data of the same page, and - that's the bad part - overwrite the 'accidentally shared' host data when copying data back data from the device?

            – noma
            Nov 26 '18 at 15:43











          • @noma Typical PCI(e) DMA has a granularity of 4 bytes. You can of course go right down to the byte level with certain tricks. (for example, the device could copy the partial dword to a driver-allocated dword, then the driver copies the bytes into the actual buffer, but there might be other ways) For the purposes of OpenCL, 4 byte alignment is fine. Other systems than PCIe might have different granularities, but 4K+ would be rather a lot - probably too awkward for any real devices to be limited like that.

            – pmdj
            Nov 26 '18 at 19:59













          • @noma For the other way around, mapping BARs into CPU address space, the base address is 16-byte-aligned, which is still perfectly good enough for OpenCL.

            – pmdj
            Nov 26 '18 at 20:04
















          0














          The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.



          First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.



          Regarding allocation modes:





          • CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.


          • CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).

          • Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.


          Note that the implementation may also use any access flags provided ( CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.



          Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.






          share|improve this answer


























          • are you sure the alignment is sufficient to have the runtime pin user-provided memory, as the specifications only states buffer-type requirements here. Wouldn't pinning require page-size alignment and the buffer size to be a multiple of the page size?

            – noma
            Nov 26 '18 at 13:36













          • @noma the system can still pin the whole page, even if only a part of it is technically part of the buffer object.

            – pmdj
            Nov 26 '18 at 14:56











          • Sure, but wouldn't a DMA copy, which to my best knowledge happens on page granularity, then copy around some random piece of application-data of the same page, and - that's the bad part - overwrite the 'accidentally shared' host data when copying data back data from the device?

            – noma
            Nov 26 '18 at 15:43











          • @noma Typical PCI(e) DMA has a granularity of 4 bytes. You can of course go right down to the byte level with certain tricks. (for example, the device could copy the partial dword to a driver-allocated dword, then the driver copies the bytes into the actual buffer, but there might be other ways) For the purposes of OpenCL, 4 byte alignment is fine. Other systems than PCIe might have different granularities, but 4K+ would be rather a lot - probably too awkward for any real devices to be limited like that.

            – pmdj
            Nov 26 '18 at 19:59













          • @noma For the other way around, mapping BARs into CPU address space, the base address is 16-byte-aligned, which is still perfectly good enough for OpenCL.

            – pmdj
            Nov 26 '18 at 20:04














          0












          0








          0







          The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.



          First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.



          Regarding allocation modes:





          • CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.


          • CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).

          • Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.


          Note that the implementation may also use any access flags provided ( CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.



          Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.






          share|improve this answer















          The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.



          First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.



          Regarding allocation modes:





          • CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.


          • CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).

          • Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.


          Note that the implementation may also use any access flags provided ( CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.



          Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.







          edited Nov 26 '18 at 13:12

























          answered Nov 26 '18 at 12:56









          pmdj

          • Are you sure the alignment is sufficient for the runtime to pin user-provided memory? The specification only states buffer-type requirements here. Wouldn't pinning require page-size alignment and the buffer size to be a multiple of the page size?

            – noma
            Nov 26 '18 at 13:36













          • @noma the system can still pin the whole page, even if only a part of it is technically part of the buffer object.

            – pmdj
            Nov 26 '18 at 14:56











          • Sure, but wouldn't a DMA copy, which to the best of my knowledge happens at page granularity, then copy around some random piece of application data on the same page, and - that's the bad part - overwrite the 'accidentally shared' host data when copying data back from the device?

            – noma
            Nov 26 '18 at 15:43











          • @noma Typical PCI(e) DMA has a granularity of 4 bytes. You can of course go right down to the byte level with certain tricks (for example, the device could copy the partial dword to a driver-allocated dword, and the driver then copies the bytes into the actual buffer, but there might be other ways). For the purposes of OpenCL, 4-byte alignment is fine. Other systems than PCIe might have different granularities, but 4K+ would be rather a lot - probably too awkward for any real devices to be limited like that.

            – pmdj
            Nov 26 '18 at 19:59













          • @noma For the other way around, mapping BARs into CPU address space, the base address is 16-byte-aligned, which is still perfectly good enough for OpenCL.

            – pmdj
            Nov 26 '18 at 20:04



















          Let's first have a look at the signature of clCreateBuffer:



          cl_mem clCreateBuffer(
              cl_context context,
              cl_mem_flags flags,
              size_t size,
              void *host_ptr,
              cl_int *errcode_ret)


          There is no argument here that tells the OpenCL runtime in which device's memory the buffer should be placed, as a context can contain multiple devices. The runtime only knows once we actually use the buffer object, e.g. read from or write to it, because those operations require a command queue that is tied to a specific device.



          Every memory object can reside in either host memory or one of the context's devices' memories, and the runtime may migrate it as needed. So in general, every memory object might have a piece of internal host memory within the OpenCL runtime. What the runtime actually does is implementation-dependent, so we cannot make many assumptions and get no portable guarantees. That means everything about pinning etc. is implementation-dependent, and you can only hope for the best, but avoid patterns that will definitely prevent the use of pinned memory.



          Why do we want pinned memory?
          Pinned memory means that the virtual address of a memory page in our process' address space has a fixed translation to a physical RAM address. This enables DMA (Direct Memory Access) transfers (which operate on physical addresses) between the device memory of a GPU and CPU memory over PCIe. DMA lowers CPU load and can increase copy speed. So we want the internal host storage of our OpenCL memory objects to be pinned, to increase the performance of data transfers between the internal host storage and the device memory of an OpenCL memory object.



          As a basic rule of thumb: if your runtime allocates the host memory, it might be pinned. If you allocate it in your application code, the runtime will pessimistically assume it is not pinned - which usually is a correct assumption.
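          If you do allocate the host memory yourself (for CL_MEM_USE_HOST_PTR), aligning it to the device's reported minimum helps. A hedged sketch, assuming a valid device `dev` and context `ctx` and omitting error handling:

```c
#include <stdlib.h>
#include <CL/cl.h>

/* Query the device's base-address alignment (reported in BITS) and
   allocate suitably aligned host memory to back a buffer. */
cl_mem wrap_host_memory(cl_device_id dev, cl_context ctx, size_t nbytes) {
    cl_uint align_bits = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                    sizeof(align_bits), &align_bits, NULL);
    size_t align = align_bits / 8;

    /* C11 aligned_alloc requires size to be a multiple of the alignment. */
    size_t padded = (nbytes + align - 1) / align * align;
    void *host = aligned_alloc(align, padded);

    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                          nbytes, host, &err);
}
```

          The host allocation must outlive the buffer object; releasing the cl_mem does not free it.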




          CL_MEM_USE_HOST_PTR




          Allows us to provide memory to the OpenCL implementation for internal host storage of the object. It does not mean that the memory object will not be migrated into device memory if we call a kernel. As that memory is user-provided, the runtime cannot assume it to be pinned. This might lead to an additional copy between the un-pinned internal host storage and a pinned buffer prior to device transfer, to enable DMA for host-device transfers.




          CL_MEM_ALLOC_HOST_PTR




          We tell the runtime to allocate host memory for the object. It could be pinned.




          CL_MEM_COPY_HOST_PTR




          We provide host memory to copy-initialise our buffer from, not to use it internally. We can also combine it with CL_MEM_ALLOC_HOST_PTR. The runtime will allocate memory for internal host storage. It could be pinned.



          Hope that helps.
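          To summarise the three flags, a hedged sketch of the corresponding clCreateBuffer calls (assumes a valid context `ctx` and an existing, suitably aligned host array `data`; error handling omitted):

```c
#include <CL/cl.h>

void create_examples(cl_context ctx, float *data, size_t n) {
    cl_int err;
    size_t bytes = n * sizeof(float);

    /* CL_MEM_USE_HOST_PTR: `data` itself becomes the backing store;
       the runtime may still keep a device-side shadow copy. */
    cl_mem a = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                              bytes, data, &err);

    /* CL_MEM_COPY_HOST_PTR: the buffer is copy-initialised from `data`;
       `data` can be reused or freed afterwards. */
    cl_mem b = clCreateBuffer(ctx, CL_MEM_COPY_HOST_PTR | CL_MEM_READ_WRITE,
                              bytes, data, &err);

    /* CL_MEM_ALLOC_HOST_PTR: the runtime allocates host-accessible
       (possibly pinned) memory itself; no host_ptr is passed. */
    cl_mem c = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                              bytes, NULL, &err);

    clReleaseMemObject(a);
    clReleaseMemObject(b);
    clReleaseMemObject(c);
}
```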






          • Ok, I got it now. Thank you!

            – JasonPh
            Nov 26 '18 at 13:21
















          edited Nov 26 '18 at 13:30

























          answered Nov 26 '18 at 12:45
          noma












