Can't convert Beam Python PCollection into a list

TypeError: 'PCollection' object does not support indexing

The error above results from trying to convert a PCollection into a list:

filesList = (files | beam.combiners.ToList())

lines = (p | 'read' >> beam.Create(ReadSHP().ReadSHP(filesList))
           | 'map' >> beam.Map(_to_dictionary))

And:

def ReadSHP(self, filesList):
    """
    """
    sf = shp.Reader(shp=filesList[1], dbf=filesList[2])

How can I fix this problem? Any help is appreciated.

asked Nov 27 '18 at 12:24 by samuq

python google-cloud-dataflow apache-beam


2 Answers

In general, you cannot convert a PCollection into a list.

A PCollection is a potentially unbounded, unordered collection of items. Beam lets you apply transformations to it: applying a PTransform to a PCollection yields another PCollection, and that work may be distributed over a fleet of machines. So in the general case it is impossible to convert such a thing into a collection of elements in local memory.

Combiners are just a special class of PTransforms. They accumulate all the elements they see, apply some combining logic to them, and output the result. For example, a combiner could sum the incoming elements and output the sum. Such a combiner transforms a PCollection of elements into a PCollection containing the sum of those elements.

beam.combiners.ToList is just another transformation that is applied to a PCollection, potentially over a fleet of worker machines, and yields another PCollection. It doesn't do any complex combining: it only accumulates all of the elements it sees into a list and outputs that list. The result is therefore still a PCollection, one whose element is the list, not a Python list you can index in your local program. That is why filesList[1] raises the TypeError.

What is missing is the logic to take those lists from potentially multiple machines and load them into your local program. That problem cannot easily (if at all) be solved in a generic way across all runners, IOs, and pipeline structures.

One workaround is to add another step to the pipeline that writes the combined output (e.g. the sums or the lists) to common storage, such as a database table or a file. When the pipeline finishes, your program can load the results from that place.

See the documentation for details:

Beam Execution model: https://beam.apache.org/documentation/execution-model/

Designing Your Pipeline: https://beam.apache.org/documentation/pipelines/design-your-pipeline/

Beam Programming Guide: https://beam.apache.org/documentation/programming-guide/

answered Dec 5 '18 at 17:30 by Anton

Thanks for the answer. The workaround you mentioned probably wouldn't work in our case, since we need to open a shapefile from Cloud Storage and process it with Beam. A shapefile consists of multiple files (.shp, .prj, .dbf). At the start of pipeline execution, template parameters (e.g. the shapefile's filename) need to be accessed, which (in Python) can be done only from within the pipeline.

– samuq
Dec 5 '18 at 18:15

Once those files have been opened in the pipeline, they need to be passed, in the same pipeline, to shapefile.Reader, which takes file-like objects as parameters. But since a pipeline step always returns a PCollection, simultaneous access to PCollection items (e.g. the .shp and .dbf files) is impossible. That's why we needed to convert the PCollection into a Python list. We have ditched Dataflow and Beam for now and have started doing ETL on a Compute Engine VM, which seems to have its own problems: stackoverflow.com/questions/53631410/…

– samuq
Dec 5 '18 at 18:15


An alternative is to use a GCE VM and convert the shapefiles to GeoJSON using a tool like ogr2ogr. The GeoJSON can then be loaded into BigQuery and queried using BigQuery GIS.

Here is a blog post with more details:
https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8
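As a rough sketch of the conversion step, the command can be assembled and run from Python. The function names and paths are hypothetical, and this is not necessarily the exact invocation the blog post uses. The flags are standard ogr2ogr options (-f for the output format, -t_srs to reproject), and EPSG:4326 is chosen on the assumption that the data should be in WGS 84 for BigQuery GIS. It assumes GDAL is installed on the VM.

```python
import subprocess


def shapefile_to_geojson_command(shp_path, geojson_path):
    # ogr2ogr takes the destination before the source.
    return [
        "ogr2ogr",
        "-f", "GeoJSON",        # output format
        "-t_srs", "EPSG:4326",  # reproject to WGS 84 (assumed target)
        geojson_path,
        shp_path,
    ]


def convert(shp_path, geojson_path):
    # Raises CalledProcessError if ogr2ogr exits with a non-zero status.
    subprocess.run(
        shapefile_to_geojson_command(shp_path, geojson_path), check=True)
```

Passing the .shp path is enough; ogr2ogr picks up the sibling .dbf and .prj files from the same directory.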

answered Dec 18 '18 at 8:28 by Kannappan Sirchabesan
