Dataflow unable to GET a reference to BigQuery tables when too many cores or more than one machine
My streaming Dataflow pipeline is supposed to read analytics hits from Pub/Sub and write them to BigQuery. If I use too many machines, or machines that are too big, it throws rate limit errors while getting a reference to the tables, more precisely while executing _get_or_create_table.
The rate limit being hit seems to be one of these BigQuery API quotas: 100 API requests per second per user, or 300 concurrent API requests per user.
It does not block the pipeline (rows do get written eventually), but I have the feeling it blocks some of the threads and prevents me from taking full advantage of the parallelization. Switching from one machine with 4 CPUs to five machines with 8 CPUs each brought no improvement in latency (it actually got worse).
How can I avoid this error and have a large number of machines write to BigQuery?
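For context, here is roughly how the worker configuration for the two setups would be expressed as Dataflow pipeline options. This is an illustrative sketch rather than a verbatim copy of my launch command, and the machine type names are assumed from the CPU counts above:

from apache_beam.options.pipeline_options import PipelineOptions

# Sketch of the second setup: 5 workers with 8 CPUs each.
# The first setup was a single 4-CPU worker (--num_workers=1, n1-standard-4).
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project={}'.format(args.project),
    '--streaming',
    '--num_workers=5',
    '--machine_type=n1-standard-8',
])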
Here's the log from the Dataflow monitoring interface. It appears regularly after I launch the pipeline:
...
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1087, in get_or_create_table
found_table = self._get_table(project_id, dataset_id, table_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 184, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 925, in _get_table
response = self.client.tables.Get(request)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py", line 611, in Get
config, request, global_params=global_params)
File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 722, in _RunMethod
return self.ProcessHttpResponse(method_config, http_response, request)
File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 728, in ProcessHttpResponse
self.__ProcessHttpResponse(method_config, http_response, request))
File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
http_response, method_config=method_config, request=request)
HttpForbiddenError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/<project_id>/datasets/<dataset_id>/tables/<table_id>?alt=json>: response: <{'status': '403', 'content-length': '577', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Sun, 25 Nov 2018 14:36:24 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Sun, 25 Nov 2018 14:36:24 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{
"error": {
"errors": [
{
"domain": "global",
"reason": "rateLimitExceeded",
"message": "Exceeded rate limits: Your user_method exceeded quota for api requests per user per method. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors",
"locationType": "other",
"location": "helix_api.method_request"
}
],
"code": 403,
"message": "Exceeded rate limits: Your user_method exceeded quota for api requests per user per method. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Here's the pipeline code. I stripped out almost everything from it to see whether the error still happens:
p = beam.Pipeline(options=options)

msgs = p | 'Read' >> beam.io.gcp.pubsub.ReadFromPubSub(
    topic='projects/{project}/topics/{topic}'.format(
        project=args.project, topic=args.hits_topic),
    id_label='hit_id',
    timestamp_attribute='time')

lines = msgs | beam.Map(lambda x: {'content': x})

(lines
 | 'WriteToBQ' >> beam.io.gcp.bigquery.WriteToBigQuery(
     args.table,
     dataset=args.dataset,
     project=args.project))
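The write step above relies on WriteToBigQuery's default dispositions (CREATE_IF_NEEDED / WRITE_APPEND). One variant I have in mind, but have not tested, makes them explicit; it assumes the destination table already exists, and I have not verified in the SDK source whether CREATE_NEVER actually lets workers skip the tables.get lookup:

# Untested sketch: the same write step with explicit dispositions.
# CREATE_NEVER assumes the destination table already exists; whether this
# avoids the per-worker tables.get call is exactly what I am unsure about.
(lines
 | 'WriteToBQ' >> beam.io.gcp.bigquery.WriteToBigQuery(
     args.table,
     dataset=args.dataset,
     project=args.project,
     create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))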
python google-bigquery google-cloud-dataflow apache-beam
asked Nov 25 '18 at 15:15 by T. Pilewicz, last edited Dec 4 '18 at 9:59
Can you share your Dataflow code? – Graham Polley, Nov 26 '18 at 1:26
Sure, just added this. – T. Pilewicz, Nov 26 '18 at 9:10