Dataset from generator that yields multiple elements at a time
I'm testing the waters to see whether it's time to migrate to the Dataset API in TensorFlow from the deprecated queue-based API.
One use case I can't seem to find the equivalent for is the enqueue_many parameter of tf.train.batch.
In particular, I would like to create a Python generator that yields "batched" arrays, where the "batch size" is not necessarily the same as the one used for SGD training updates, and then apply batching to that stream of data (i.e., as enqueue_many does in tf.train.batch).
Is there any workaround to achieve this in the new Dataset API?
python tensorflow tensorflow-datasets
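To illustrate the semantics being asked for, here is a minimal sketch in plain Python (no TensorFlow; the names chunk_generator and rebatch are illustrative, not any TF API): a generator yields pre-batched chunks of one size, and a downstream stage re-slices the flattened stream into SGD-sized batches.

```python
import itertools

def chunk_generator():
    # Yields "chunks" of several elements at a time, the way a producer
    # feeding tf.train.batch(..., enqueue_many=True) used to.
    for start in range(0, 10, 4):
        yield list(range(start, min(start + 4, 10)))  # chunks of up to 4 elements

def rebatch(chunks, batch_size):
    # Flatten the stream of chunks, then regroup into fixed-size batches --
    # the same semantics the Dataset API needs to reproduce.
    flat = itertools.chain.from_iterable(chunks)
    while True:
        batch = list(itertools.islice(flat, batch_size))
        if not batch:
            return
        yield batch

batches = list(rebatch(chunk_generator(), batch_size=3))
# 10 elements arrive in chunks of 4 but come out in batches of 3
# (the last batch is shorter): [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```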
Of course, you can simply yield multiple arrays (as many as you like) at one time in your generator function – this is independent of the batch size used in SGD. You set the SGD batch size by calling the batch method on the corresponding dataset object.
– Rohan Saxena
Nov 27 '18 at 13:27
Maybe I was not clear. I don't mean yielding different "fields", like an image and a label. I mean yielding already batched collections, i.e. yielding an array of shape [N, H, W, C] containing several images. I am referring specifically to something equivalent to the enqueue_many parameter of tf.train.batch. More specifically: I want to produce several images in one go, not in multiple yields. The reason for this is a bit long to explain, but it's mainly efficiency and existing infrastructure I've built using multiprocessing etc. It was really convenient to use enqueue_many.
– isarandi
Nov 27 '18 at 13:30
Yes, that is also what I was referring to. I'm not familiar with the queue API of TF, but what you are describing is possible in the dataset API (in fact, I use it extensively).
– Rohan Saxena
Nov 27 '18 at 13:34
@isarandi can you share more details on what your data stream is going to look like? Off the top of my head, the interleave method of tf.data seems similar to what you are looking for.
– kvish
Nov 27 '18 at 14:10
Btw, @isarandi you mention "efficiency and existing infrastructure... using multiprocessing...". As of now, the recommended method for this is still the tf.data API, as it does all of this for you. In fact, tf.data uses a slightly aggressive data-loading strategy (which can be observed by monitoring memory usage). So while I don't know your exact use case, in most cases tf.data will be a safer, cleaner, easier and more efficient choice.
– Rohan Saxena
Nov 27 '18 at 14:16
asked Nov 27 '18 at 13:20 – isarandi
1 Answer
Try using flat_map:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

n_reads = 10
read_batch_size = 20
training_batch_size = 2

def mnist_gen():
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
    for i in range(n_reads):
        batch_x, batch_y = mnist.train.next_batch(read_batch_size)
        # Yield a whole batch instead of a single record
        yield batch_x, batch_y

data = tf.data.Dataset.from_generator(mnist_gen, output_types=(tf.float32, tf.float32))
# Split each generator "chunk" into individual records, then re-batch
# at the training batch size
data = data.flat_map(
    lambda *x: tf.data.Dataset.zip(tuple(map(tf.data.Dataset.from_tensor_slices, x)))
).batch(training_batch_size)
# If you yield only batch_x, change the lambda to:
# data = data.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))

iterator = data.make_one_shot_iterator()
next_item = iterator.get_next()
X = next_item[0]
Y = next_item[1]

with tf.Session() as sess:
    for i in range(n_reads * read_batch_size // training_batch_size):
        print(i, sess.run(X))
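As a side note, recent TensorFlow versions (2.x, and 1.13+ via tf.data.experimental) also provide a built-in unbatch transformation that does the splitting directly, without the explicit flat_map/zip construction. A minimal self-contained sketch of the same rebatching idea, assuming TF 2.x eager execution:

```python
import tensorflow as tf

# A small dataset whose elements are already "batches" of 4 scalars each,
# standing in for the chunks a generator would yield.
chunked = tf.data.Dataset.from_tensor_slices(
    [[0, 1, 2, 3], [4, 5, 6, 7]]  # two chunks of 4 elements
)

# unbatch() splits each chunk back into individual elements,
# then batch() regroups them at the SGD batch size.
rebatched = chunked.unbatch().batch(3)
result = [t.numpy().tolist() for t in rebatched]
# result: [[0, 1, 2], [3, 4, 5], [6, 7]]
```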
answered Jan 3 at 8:55 – Himaprasoon