Can't convert Beam Python PCollection into list

    TypeError: 'PCollection' object does not support indexing


The error above results from trying to convert a PCollection into a list:

    filesList = (files | beam.combiners.ToList())

    lines = (p | 'read' >> beam.Create(ReadSHP().ReadSHP(filesList))
               | 'map' >> beam.Map(_to_dictionary))

And ReadSHP is defined as:

    def ReadSHP(self, filesList):
        """Read a shapefile from its .shp and .dbf member files."""
        sf = shp.Reader(shp=filesList[1], dbf=filesList[2])

How can I fix this problem? Any help is appreciated.

python google-cloud-dataflow apache-beam

asked Nov 27 '18 at 12:24 by samuq
2 Answers

In general, you cannot convert a PCollection to a list.



A PCollection is a potentially unbounded, unordered collection of items. Beam lets you apply transformations to a PCollection: applying a PTransform to a PCollection yields another PCollection, and the application of that transformation is potentially distributed over a fleet of machines. So in the general case it is impossible to convert such a thing into a collection of elements in local memory.
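A minimal sketch of that model (the file names and the default local runner are assumptions for illustration, not part of the question):

    import apache_beam as beam

    with beam.Pipeline() as p:
        pc1 = p | 'create' >> beam.Create(['a.shp', 'a.dbf', 'a.prj'])  # a PCollection
        pc2 = pc1 | 'upper' >> beam.Map(str.upper)                      # yields another PCollection
        # pc2 is a deferred handle to (potentially distributed) data, not a container:
        # pc2[1] raises the same TypeError as in the question.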



Combiners are just a special class of PTransform. They accumulate all the elements they see, apply some combining logic to those elements, and then output the result of the combining. For example, a combiner could look at the incoming elements, sum them up, and then output the sum as the result; such a combiner transforms a PCollection of elements into a PCollection holding the sum of those elements.
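For instance, a hedged sketch of that sum example (with made-up inline data):

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'numbers' >> beam.Create([1, 2, 3, 4])
         | 'sum' >> beam.CombineGlobally(sum)  # combine all elements into a single sum
         | 'show' >> beam.Map(print))          # prints 10 when the pipeline runs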



beam.combiners.ToList is just another transformation applied to a PCollection, potentially over a fleet of worker machines, and it yields another PCollection. It doesn't do any complex combining before yielding its output: it takes the incoming elements (on multiple machines), accumulates them into a list, and outputs that list as the element of a new PCollection.
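In other words, the output of ToList is still a PCollection, whose single element happens to be a Python list, so it can only be consumed by downstream transforms, never indexed directly. A small sketch with invented file names:

    import apache_beam as beam

    with beam.Pipeline() as p:
        files = p | beam.Create(['a.shp', 'a.dbf', 'a.prj'])
        files_list = files | beam.combiners.ToList()  # one element: a list

        # files_list[1] raises the TypeError from the question;
        # the list is only reachable inside a downstream transform:
        _ = files_list | beam.Map(lambda lst: print(sorted(lst)))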



What is missing is the logic to take those lists from potentially multiple machines and load them into your local program when you need them. That problem cannot be solved easily (if at all) in a generic way, across all the runners, all possible IOs, and all pipeline structures.



One workaround is to add another step to the pipeline that writes the combined outputs (e.g. the sums, or the lists) to common storage, such as a table in some database or a file. Then, when the pipeline finishes, your program can load the results of the pipeline execution from that place.
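A minimal sketch of that workaround, assuming a local text-file sink and a blocking local run (both assumptions, not part of the original answer):

    import glob
    import apache_beam as beam

    with beam.Pipeline() as p:  # the `with` block waits for the run to finish
        (p
         | beam.Create([1, 2, 3, 4])
         | beam.CombineGlobally(sum)
         | beam.io.WriteToText('/tmp/sum', file_name_suffix='.txt'))

    # Back in the local program: load the result from storage.
    for path in glob.glob('/tmp/sum*.txt'):
        with open(path) as f:
            print(f.read())  # "10"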



See the documentation for details:

• Beam Execution Model: https://beam.apache.org/documentation/execution-model/
• Designing Your Pipeline: https://beam.apache.org/documentation/pipelines/design-your-pipeline/
• Beam Programming Guide: https://beam.apache.org/documentation/programming-guide/






answered Dec 5 '18 at 17:30 by Anton

• Thanks for the answer. The workaround you mentioned probably wouldn't work in our case, since we need to open a shapefile from cloud storage and process it with Beam. A shapefile consists of multiple files (.shp, .prj, .dbf). At the start of pipeline execution, template parameters (e.g. the filename of the shapefile) need to be accessed, which can be done (in Python) only from within the pipeline.

  – samuq, Dec 5 '18 at 18:15











• Once those files have been opened in the pipeline, they need to be passed within the same pipeline to shapefile.Reader, which takes file-like objects as parameters. But since a pipeline always returns a PCollection, simultaneous access to PCollection items (e.g. the .shp & .dbf files) is impossible. That's why we would have needed to convert the PCollection into a Python list. We have ditched Dataflow and Beam for now and started doing the ETL on a Compute Engine VM, which seems to have problems of its own: stackoverflow.com/questions/53631410/…

  – samuq, Dec 5 '18 at 18:15



















An alternative option would be to use a GCE VM to convert the shapefiles to GeoJSON using a tool like ogr2ogr. The GeoJSON can then be loaded into BigQuery and queried using BigQuery GIS.



Here is a blog post with more details: https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8
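As a rough sketch of the conversion step (assuming GDAL's ogr2ogr is installed, with hypothetical file names), the call could be driven from Python like this; note that BigQuery's JSON loader expects newline-delimited records, which the linked post walks through:

    import subprocess

    # Hypothetical file names; ogr2ogr picks up the .dbf/.prj siblings of the .shp.
    subprocess.run(
        ['ogr2ogr', '-f', 'GeoJSON', 'boundaries.geojson', 'boundaries.shp'],
        check=True)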






answered Dec 18 '18 at 8:28 by Kannappan Sirchabesan