Dataset from generator that yields multiple elements at a time












I'm testing the waters to see whether it's time to migrate from the deprecated queue-based API to the Dataset API in TensorFlow.

One use case I can't seem to find an equivalent for is the enqueue_many parameter of tf.train.batch.

In particular, I would like to create a Python generator that yields "batched" arrays, where the "batch size" is not necessarily the same as the one used for SGD training updates, and then apply batching on that stream of data (i.e. as with enqueue_many in tf.train.batch).

Is there any workaround to achieve this in the new Dataset API?
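To make the desired semantics concrete, here is a plain-Python sketch (no TensorFlow; `chunked_source` and `rebatch` are made-up illustrative names, not any library's API): a generator yields pre-batched chunks of arbitrary size, and a rebatching step turns the flattened stream into fixed-size training batches.

```python
def chunked_source():
    # Yields "batches" of varying size, as an enqueue_many-style producer does.
    yield [0, 1, 2]
    yield [3, 4, 5, 6, 7]
    yield [8, 9]

def rebatch(stream, batch_size):
    buf = []
    for chunk in stream:
        buf.extend(chunk)           # flatten each incoming chunk...
        while len(buf) >= batch_size:
            yield buf[:batch_size]  # ...and emit fixed-size training batches
            buf = buf[batch_size:]
    if buf:
        yield buf                   # final partial batch

print(list(rebatch(chunked_source(), 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The question is how to express this rebatching inside the Dataset API rather than in plain Python.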










  • Of course, you can simply yield as many arrays at a time as you like in your generator function - this is independent of the batch size used in SGD. You set the SGD batch size by calling the batch method on the corresponding dataset object.

    – Rohan Saxena
    Nov 27 '18 at 13:27











  • Maybe I was not clear. I don't mean yielding different "fields", like an image and a label. I mean yielding already batched collections, i.e. yielding an array of shape [N, H, W, C] containing several images. I am referring specifically to something equivalent to the enqueue_many parameter of tf.train.batch. More specifically: I want to produce several images in one go, not in multiple yields. The reason for this is a bit long to explain, but it's mainly efficiency and existing infrastructure I've built using multiprocessing etc. It was really convenient to use enqueue_many.

    – isarandi
    Nov 27 '18 at 13:30













  • Yes, that is also what I was referring to. I'm not familiar with the queue API of TF, but what you are describing is possible in the dataset API (in fact, I use it extensively).

    – Rohan Saxena
    Nov 27 '18 at 13:34











  • @isarandi can you share more details on what your data stream is going to look like? Off the top of my head, the interleave method of tf.data seems similar to what you are looking for.

    – kvish
    Nov 27 '18 at 14:10











  • Btw, @isarandi you mention "efficiency and existing infrastructure... using multiprocessing...". As of now, the recommended method for this is still the tf.data API, as it does all of this for you. In fact, tf.data uses a fairly aggressive data-loading strategy (as can be observed by monitoring memory usage). So while I don't know your exact use case, in most cases tf.data will be a safer, cleaner, easier and more efficient choice.

    – Rohan Saxena
    Nov 27 '18 at 14:16
















python tensorflow tensorflow-datasets






asked Nov 27 '18 at 13:20 by isarandi




1 Answer
Try using flat_map:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

n_reads = 10
read_batch_size = 20
training_batch_size = 2

def mnist_gen():
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
    for i in range(n_reads):
        batch_x, batch_y = mnist.train.next_batch(read_batch_size)
        # Yield a whole batch instead of a single record
        yield batch_x, batch_y

data = tf.data.Dataset.from_generator(mnist_gen, output_types=(tf.float32, tf.float32))
# Unbatch each yielded pair into single records, then re-batch to the training batch size
data = data.flat_map(
    lambda *x: tf.data.Dataset.zip(tuple(map(tf.data.Dataset.from_tensor_slices, x)))
).batch(training_batch_size)
# If you yield only batch_x, change the lambda to:
# data = data.flat_map(tf.data.Dataset.from_tensor_slices)

iterator = data.make_one_shot_iterator()
next_item = iterator.get_next()
X, Y = next_item

with tf.Session() as sess:
    for i in range(n_reads * read_batch_size // training_batch_size):
        print(i, sess.run(X))
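Conceptually, the flat_map-then-batch pipeline above works like the following pure-Python model (`flat_map` and `batch` here are illustrative stand-ins for the tf.data operators, not the operators themselves): each generator element is itself a batch; flat_map flattens the per-element record streams into one stream, and batch regroups it at the training batch size.

```python
from itertools import chain

def flat_map(stream, fn):
    # Like tf.data's flat_map: map each element to an iterable, then flatten.
    return chain.from_iterable(fn(x) for x in stream)

def batch(records, size):
    # Like tf.data's batch: regroup a record stream into fixed-size groups.
    records = iter(records)
    while True:
        group = [r for _, r in zip(range(size), records)]
        if not group:
            return
        yield group

chunks = [[1, 2, 3], [4, 5], [6]]
print(list(batch(flat_map(chunks, iter), 2)))
# [[1, 2], [3, 4], [5, 6]]
```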





answered Jan 3 at 8:55 by Himaprasoon































