Can't convert Beam Python PCollection into list
TypeError: 'PCollection' object does not support indexing

The error above results from trying to convert a PCollection into a list:

filesList = (files | beam.combiners.ToList())
lines = (p | 'read' >> beam.Create(ReadSHP().ReadSHP(filesList))
           | 'map' >> beam.Map(_to_dictionary))

And:

def ReadSHP(self, filesList):
    """
    """
    sf = shp.Reader(shp=filesList[1], dbf=filesList[2])

How can I fix this problem? Any help is appreciated.
python google-cloud-dataflow apache-beam
asked Nov 27 '18 at 12:24 by samuq
2 Answers
In general you cannot convert a PCollection to a list.

A PCollection is a potentially unbounded, unordered collection of items. Beam allows you to apply transformations to a PCollection. Applying a PTransform to a PCollection yields another PCollection, and applying a transformation is potentially distributed over a fleet of machines. So in the general case it is impossible to convert such a thing into a collection of elements in local memory.

Combiners are just a special class of PTransform. They accumulate all the elements they see, apply some combining logic to them, and then output the result of the combining. For example, a combiner could look at the incoming elements, sum them up, and then output the sum as the result. Such a combiner transforms a PCollection of elements into a PCollection of sums of those elements.

beam.combiners.ToList is just another transformation applied to a PCollection, potentially over a fleet of worker machines, and it yields another PCollection. It doesn't do any complex combining before yielding the output elements; it only accumulates all of the elements it sees into a list and then outputs that list. So it takes the elements (on multiple machines), puts them into lists, and outputs those lists.

What is missing is the logic to take those lists from potentially multiple machines and load them into your local program if you need that. This problem cannot be easily (if at all) solved in a generic way across all runners, IOs, and pipeline structures.

One workaround is to add another step to the pipeline that writes the combined outputs (e.g. the sums, or the lists) to common storage, such as a database table or a file. When the pipeline finishes, your program can then load the results of the pipeline execution from that place.

See the documentation for details:
- Beam Execution Model: https://beam.apache.org/documentation/execution-model/
- Designing Your Pipeline: https://beam.apache.org/documentation/pipelines/design-your-pipeline/
- Beam Programming Guide: https://beam.apache.org/documentation/programming-guide/
answered Dec 5 '18 at 17:30 by Anton
Thanks for the answer. The workaround you mentioned probably wouldn't work in our case, since we need to open a shapefile from Cloud Storage and process it with Beam. A shapefile consists of multiple files (.shp, .prj, .dbf). At the start of pipeline execution, template parameters (e.g. the filename of the shapefile) need to be accessed, which in Python can only be done from within the pipeline.
– samuq Dec 5 '18 at 18:15

Once those files have been opened in the pipeline, they need to be passed in the same pipeline to shapefile.Reader, which takes file-like objects as parameters. But since a pipeline always returns a PCollection, simultaneous access to PCollection items (e.g. the .shp and .dbf files) is impossible. That's why we would have needed to convert the PCollection into a Python list. We have ditched Dataflow + Beam for now and started doing ETL on a Compute Engine VM, which seems to have its own problems: stackoverflow.com/questions/53631410/…
– samuq Dec 5 '18 at 18:15
An alternative option would be to use a GCE VM to convert the shapefiles to GeoJSON using a tool like ogr2ogr. The GeoJSON can then be loaded into BigQuery and queried using BigQuery GIS.
Here is a blog post with more details:
https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8
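That conversion step can be scripted; here is a hypothetical wrapper around the ogr2ogr CLI (it only runs the command when GDAL is installed and the input exists, and the file names are made up):

```python
import os
import shutil
import subprocess

def shapefile_to_geojson(shp_path, out_path):
    """Build (and run, if GDAL is available) an ogr2ogr conversion command."""
    cmd = ['ogr2ogr', '-f', 'GeoJSON', out_path, shp_path]
    if shutil.which('ogr2ogr') and os.path.exists(shp_path):
        subprocess.run(cmd, check=True)
    return cmd

# Hypothetical file names; prints the command that would be executed.
print(shapefile_to_geojson('zones.shp', 'zones.geojson'))
```

The resulting GeoJSON file could then be loaded into BigQuery (e.g. with the bq CLI) and queried with BigQuery GIS functions.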
answered Dec 18 '18 at 8:28 by Kannappan Sirchabesan