Dataflow unable to GET a reference to BigQuery tables when using too many cores or more than one machine



























My streaming Dataflow pipeline is supposed to read analytics hits from Pub/Sub and write them to BigQuery. If I use too many machines, or machines that are too big, it throws rate-limit errors when getting a reference to the tables, more precisely when executing _get_or_create_table.



The rate limit being hit seems to be one of these BigQuery quotas: 100 API requests per second per user, or 300 concurrent API requests per user.



It does not block the pipeline (rows do get written eventually), but I have the feeling it stalls some of the threads and prevents me from fully taking advantage of the parallelization. Switching from one machine with 4 CPUs to 5 machines with 8 CPUs each brought no improvement in latency (in fact it got worse).
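For context, here's roughly how the workers are configured. This is only a sketch: the flag names are the standard Dataflow pipeline options, but the values simply reflect the setup described above (staging/temp locations and other required options omitted).

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=' + args.project,
    '--streaming',
    # the setup that made latency worse: 5 workers with 8 vCPUs each
    '--num_workers=5',
    '--worker_machine_type=n1-standard-8',
])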



How can I avoid this error and have a large number of machines write to BigQuery?



Here's the log from the Dataflow monitoring interface; it appears regularly once I launch the pipeline:



...
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1087, in get_or_create_table
    found_table = self._get_table(project_id, dataset_id, table_id)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 184, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 925, in _get_table
    response = self.client.tables.Get(request)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py", line 611, in Get
    config, request, global_params=global_params)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 722, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 728, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)

HttpForbiddenError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/<project_id>/datasets/<dataset_id>/tables/<table_id>?alt=json>: response: <{'status': '403', 'content-length': '577', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Sun, 25 Nov 2018 14:36:24 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Sun, 25 Nov 2018 14:36:24 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "rateLimitExceeded",
        "message": "Exceeded rate limits: Your user_method exceeded quota for api requests per user per method. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors",
        "locationType": "other",
        "location": "helix_api.method_request"
      }
    ],
    "code": 403,
    "message": "Exceeded rate limits: Your user_method exceeded quota for api requests per user per method. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"


Here's the pipeline code. I stripped out almost everything to check whether the error still happens:



p = beam.Pipeline(options=options)

msgs = p | 'Read' >> beam.io.gcp.pubsub.ReadFromPubSub(
    topic='projects/{project}/topics/{topic}'.format(
        project=args.project, topic=args.hits_topic),
    id_label='hit_id',
    timestamp_attribute='time')

lines = msgs | beam.Map(lambda x: {'content': x})

(lines
 | 'WriteToBQ' >> beam.io.gcp.bigquery.WriteToBigQuery(args.table,
                                                       dataset=args.dataset,
                                                       project=args.project))
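One variant I've been considering is shown below (just a sketch; I'm not sure it actually avoids the tables.Get calls): creating the table ahead of time and passing an explicit schema plus a CREATE_NEVER create disposition, so the sink should never need to create the table itself.

(lines
 | 'WriteToBQ' >> beam.io.gcp.bigquery.WriteToBigQuery(
     args.table,
     dataset=args.dataset,
     project=args.project,
     # single field matching the dict produced by the Map above
     schema='content:STRING',
     # the table is created beforehand, outside the pipeline
     create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

If someone can confirm whether this changes how often the workers call tables.Get, that would already help.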









python google-bigquery google-cloud-dataflow apache-beam






asked Nov 25 '18 at 15:15 by T. Pilewicz
edited Dec 4 '18 at 9:59













  • Can you share your Dataflow code?

    – Graham Polley
    Nov 26 '18 at 1:26











  • Sure, just added this

    – T. Pilewicz
    Nov 26 '18 at 9:10


















