Getting reproducible results using tensorflow-gpu
I am working on a project using TensorFlow. However, I can't seem to reproduce my results.
I have tried setting the graph-level seed, the NumPy random seed, and even operation-level seeds. However, the results are still not reproducible.
Searching Google, most people point to the reduce_sum function as the culprit, since it is non-deterministic on GPU even after setting the seeds. However, since I am working on a project for a paper, I need to reproduce the results. Is there another efficient function that can work around this?
Another suggestion was to use the CPU. However, I'm working with big data, so the CPU is not an option. How do people working on complex projects using TensorFlow work around this issue? Or is it acceptable to reviewers to load the saved model checkpoint file for result verification?
machine-learning tensorflow deep-learning
edited Nov 24 '18 at 7:31 by Martin Thoma
asked Aug 24 '17 at 15:32 by uchman21
4
I think it's because the sum operation for floats is not really associative, i.e. a+(b+c) does not always equal (a+b)+c. So in any parallel computation where the order of operations is not guaranteed, you can get different results each time. It can begin with small differences, but over time they grow.
– stop-cran
Aug 24 '17 at 15:36
2
If you are not sure what may or may not be acceptable to a reviewer, you can ask more senior members of your department, or your supervisor if you're a student; there is a related question on Academia SE. What I can tell you is that TensorFlow has been used in many accepted academic publications. It probably depends on the variability of the results, the size of the data, and your particular field/problem, among other things.
– jdehesa
Aug 24 '17 at 15:47
@stop-cran I noticed that... I noticed the more I increase the dataset size, the wider the gap between the results.
– uchman21
Aug 24 '17 at 16:18
Can you try the nightly version? A recent commit (d93a55b8) is supposed to make reduce_sum deterministic on GPU.
– Yaroslav Bulatov
Aug 24 '17 at 17:37
@YaroslavBulatov, never heard of that. Thanks, I will look at it to see how it works.
– uchman21
Aug 24 '17 at 20:20
1 Answer
It's great that you want to make your results reproducible! However, there are several things to note here:
I call a paper reproducible if one can obtain exactly the same
numbers as found in the paper by executing exactly the same
steps. This means if one had access to the same environment,
the same software, hardware and data, one would be able to
get the same results. In contrast, a paper is called replicatable
if one can achieve the same results if one only follows the
textual description in the paper. Hence replicability is harder to
achieve, but also a more powerful indicator of the quality of the
paper
You want training to produce a bit-wise identical model. The holy grail would be to write your paper in such a way that people who have ONLY the paper can still confirm your results.
Please also note that the results of many important papers are practically impossible to reproduce:
- Datasets are often not available, e.g. JFT-300M.
- Massive usage of computational power: For one of the AutoML/Architecture Search papers by Google, I asked the author how many GPU-hours they spent on one of the experiments. At the time, that many GPU-hours would have cost me around 250,000 USD.
Whether that is a problem depends very much on the context. As a comparison, think of CERN / LHC: it is impossible to have completely identical experiments, and only very few institutions on earth have the instruments to check the results. Still, it is not a problem. So ask your advisor or people who have already published in that journal / conference.
Achieving Replicability
This is super hard. I think the following is helpful:
- Make sure the quality metrics you report don't have too many digits
- As the training likely depends on random initialization, you might also want to report an interval rather than a single number
- Try minor variations
- Re-implement things from scratch (maybe with another library?)
- Ask colleagues to read your paper and then explain back to you what they think you did.
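The advice to report an interval rather than a single number could look roughly like this. It is only a minimal sketch: the accuracy values are made-up placeholders standing in for the same experiment repeated with different seeds, and `summarize_runs` is a hypothetical helper name, not something from a library.

```python
import statistics


def summarize_runs(scores, digits=2):
    """Return (mean, sample standard deviation), rounded to `digits` places."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    return round(mean, digits), round(std, digits)


# Placeholder values: test accuracy (%) from five runs with different seeds.
accuracies = [91.2, 90.8, 91.5, 90.9, 91.1]
mean, std = summarize_runs(accuracies)
print(f"accuracy: {mean} +/- {std} (n={len(accuracies)})")
```

Rounding to a small number of digits also covers the first point in the list: digits that change from run to run are not worth reporting.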
Getting a Bit-Wise Identical Model
It seems to me that you already do the important things:
- Setting all seeds: numpy, tensorflow, random, ...
- Making sure the Training-Test split is consistent
- Making sure the training data is loaded in the same order
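The three points above can be sketched as follows. Treat this as an illustration under assumptions rather than a guaranteed recipe: `set_all_seeds` and `deterministic_split` are hypothetical helper names, the seed value 42 is arbitrary, and the optional `tf.config.experimental.enable_op_determinism()` call is, as far as I know, only available in newer TensorFlow releases (2.8 and later), which is why it is looked up defensively. The NumPy and TensorFlow parts are skipped if those packages are not installed.

```python
import random


def set_all_seeds(seed=42, deterministic_ops=False):
    """Seed the RNGs of random, NumPy and TensorFlow (when installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import tensorflow as tf
    except ImportError:
        return  # TensorFlow is not installed; nothing more to do
    if hasattr(tf, "random") and hasattr(tf.random, "set_seed"):
        tf.random.set_seed(seed)   # TensorFlow 2.x
    else:
        tf.set_random_seed(seed)   # TensorFlow 1.x graph-level seed
    if deterministic_ops:
        # Assumption: only newer releases provide this switch, so look it
        # up step by step instead of assuming that it exists.
        config = getattr(tf, "config", None)
        experimental = getattr(config, "experimental", None)
        enable = getattr(experimental, "enable_op_determinism", None)
        if enable is not None:
            enable()


def deterministic_split(items, test_fraction=0.2, seed=42):
    """Split items into (train, test) the same way on every run."""
    ordered = sorted(items)                # fix the starting order
    random.Random(seed).shuffle(ordered)   # private RNG, ignores global state
    n_test = int(len(ordered) * test_fraction)
    return ordered[n_test:], ordered[:n_test]


set_all_seeds(42)
train_ids, test_ids = deterministic_split(range(10))
print("train:", train_ids)
print("test:", test_ids)
```

Splitting by sorted IDs with a private `random.Random(seed)` keeps the split independent of whatever else has consumed random numbers before it. Note that seeds alone do not remove the GPU non-determinism discussed in the question, and that deterministic kernels, where available, may be slower.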
Please note that there might be factors out of your control:
Bitflips: B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: a large-scale field study”
Inherent hardware/software reproducibility problems: Floating-point addition and multiplication are not associative, and different cores on a GPU might finish computations at different times. Thus each single run could lead to different results. (I'd be happy if somebody could give an authoritative reference here.)
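The floating-point point can be checked on the CPU with plain Python. The sketch below is my own illustration, not TensorFlow code: `naive_sum` is a hypothetical helper that adds values strictly left to right, and shuffling the list merely stands in for GPU threads finishing in a different order on each run.

```python
import random

# Three-number case: the grouping changes the rounded result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))


def naive_sum(values):
    """Add the values strictly left to right, like a sequential reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same numbers, different summation orders.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
distinct_sums = set()
for _ in range(20):
    rng.shuffle(values)
    distinct_sums.add(naive_sum(values))

print(len(distinct_sums), "distinct results from 20 summation orders")
```

Each individual difference is tiny, but during training such differences feed into every subsequent update, which matches the observation in the comments that the gap grows over time.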
answered Nov 24 '18 at 7:54 by Martin Thoma
Detailed Answer.
– uchman21
Jan 9 at 12:21