Dataset from generator that yields multiple elements at a time

I'm testing the waters to see whether it's time to migrate from the deprecated queue-based API to the Dataset API in TensorFlow.

One use case I can't seem to find the equivalent for is the enqueue_many parameter of tf.train.batch.

In particular, I would like to create a Python generator that yields "batched" arrays, where the "batch size" is not necessarily the same as the one used for SGD training updates, and then apply batching to that stream of data (i.e. as enqueue_many does in tf.train.batch).

Is there any workaround to achieve this in the new Dataset API?

asked Nov 27 '18 at 13:20

isarandi


Of course, you can simply yield multiple arrays (as many as you like) at one time in your generator function; this is independent of the batch size used in SGD. You set the SGD batch size by calling the batch method on the corresponding dataset object.

– Rohan Saxena
Nov 27 '18 at 13:27

Maybe I was not clear. I don't mean yielding different "fields", like an image and a label. I mean yielding already batched collections, i.e. yielding an array of shape [N, H, W, C] containing several images. I am referring specifically to something equivalent to the enqueue_many parameter of tf.train.batch. More specifically: I want to produce several images in one go, not in multiple yields. The reason for this is a bit long to explain, but it's mainly efficiency and existing infrastructure I've built using multiprocessing etc. It was really convenient to use enqueue_many.

– isarandi
Nov 27 '18 at 13:30

Yes, that is also what I was referring to. I'm not familiar with the queue API of TF, but what you are describing is possible in the dataset API (in fact, I use it extensively).

– Rohan Saxena
Nov 27 '18 at 13:34

@isarandi can you share more details on what your data stream will look like? Off the top of my head, the interleave method of tf.data seems similar to what you are looking for.

– kvish
Nov 27 '18 at 14:10

Btw, @isarandi you mention "efficiency and existing infrastructure... using multiprocessing...". As of now, the recommended method for this is still the tf.data API, as it does all of this for you. In fact, tf.data uses a slightly aggressive data loading strategy (this can be observed by monitoring memory usage). So while I don't know your exact use case, in most cases tf.data will be a safer, cleaner, easier and more efficient choice.

– Rohan Saxena
Nov 27 '18 at 14:16

python tensorflow tensorflow-datasets


1 Answer

Try using flat_map:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

n_reads = 10              # number of chunks the generator yields
read_batch_size = 20      # records per yielded chunk
training_batch_size = 2   # batch size used for training

def mnist_gen():
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
    for _ in range(n_reads):
        batch_x, batch_y = mnist.train.next_batch(read_batch_size)
        # Yield a whole chunk of records instead of a single record
        yield batch_x, batch_y

data = tf.data.Dataset.from_generator(mnist_gen, output_types=(tf.float32, tf.float32))

# Split each yielded chunk back into single records, then re-batch.
# If you yield only batch_x, change the lambda to:
#   data.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
data = data.flat_map(
    lambda *x: tf.data.Dataset.zip(tuple(map(tf.data.Dataset.from_tensor_slices, x)))
).batch(training_batch_size)

iterator = data.make_one_shot_iterator()
next_item = iterator.get_next()

X = next_item[0]
Y = next_item[1]

with tf.Session() as sess:
    for i in range(n_reads * read_batch_size // training_batch_size):
        print(i, sess.run(X))
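
To make the semantics concrete without needing TensorFlow installed, here is a minimal pure-Python sketch of what the flat_map followed by batch is doing: the stream of chunks is flattened into single records, and those records are regrouped into batches of a different size. This is only an illustration, not TensorFlow code; chunked_source and rebatch are hypothetical names, and the integer records stand in for images. In TensorFlow 2.x, Dataset.unbatch() should cover the same flattening step.

```python
from itertools import chain, islice


def chunked_source(n_reads, read_batch_size):
    """Hypothetical stand-in for the generator: yields lists of records."""
    for read in range(n_reads):
        start = read * read_batch_size
        yield list(range(start, start + read_batch_size))


def rebatch(chunks, training_batch_size):
    """Flatten a stream of chunks, then regroup it into training batches."""
    # Analogous to the flat_map step: one record at a time, in order.
    records = chain.from_iterable(chunks)
    while True:
        # Analogous to the batch step: take the next group of records.
        batch = list(islice(records, training_batch_size))
        if not batch:
            return
        yield batch


# Chunks of 4 records are regrouped into batches of 6.
batches = list(rebatch(chunked_source(n_reads=3, read_batch_size=4), 6))
print(batches)
```

As with batch in its default configuration, the last batch may be shorter when the total number of records is not a multiple of the training batch size.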

answered Jan 3 at 8:55

Himaprasoon

