What does df.repartition with no column arguments partition on?
In PySpark, the repartition method has an optional columns argument, which will of course repartition your DataFrame by that key.



My question is: how does Spark repartition when there is no key? I couldn't dig far enough into the source code to find where this passes through to Spark itself.



def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
    resulting DataFrame is hash partitioned.

    :param numPartitions:
        can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column. If not specified,
        the default number of partitions is used.

    .. versionchanged:: 1.6
       Added optional arguments to specify the partitioning columns. Also made numPartitions
       optional if partitioning columns are specified.

    >>> df.repartition(10).rdd.getNumPartitions()
    10
    >>> data = df.union(df).repartition("age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  5|  Bob|
    |  5|  Bob|
    |  2|Alice|
    |  2|Alice|
    +---+-----+
    >>> data = data.repartition(7, "age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> data.rdd.getNumPartitions()
    7
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)
        else:
            return DataFrame(
                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)
    elif isinstance(numPartitions, (basestring, Column)):
        cols = (numPartitions, ) + cols
        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)
    else:
        raise TypeError("numPartitions should be an int or Column")


For example, it's totally fine to call the lines below, but I have no idea what they're actually doing. Is it a hash of the entire row? Perhaps of the first column in the DataFrame?



df_2 = (df_1
        .where(sf.col('some_column') == 1)
        .repartition(32)
        .alias('df_2'))
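
For what it's worth, one way to inspect where rows actually end up is spark_partition_id() (a sketch, assuming Spark 1.6+ where that function is available; the partition_id column name is just an illustrative alias):

import pyspark.sql.functions as sf

# Tag each row with the ID of the partition it landed in, then count per partition.
(df_1
 .where(sf.col('some_column') == 1)
 .repartition(32)
 .withColumn('partition_id', sf.spark_partition_id())
 .groupBy('partition_id')
 .count()
 .show())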


python apache-spark pyspark pyspark-sql

asked Nov 29 '18 at 0:04 by veronik, edited Nov 29 '18 at 0:06

  • Did my answer help you understand how repartition works without a key?

    – Stefan Repcek
    Nov 29 '18 at 9:31
1 Answer
By default, if no partitioner is specified, the partitioning is not based on any characteristic of the data; rows are simply distributed in a random but uniform way across the nodes.



The repartition algorithm behind df.repartition does a full data shuffle and distributes the data equally among the partitions. If you only need to reduce the number of partitions, it is better to use df.coalesce, which avoids a full shuffle.
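
As an illustrative sketch (reusing the df_1 DataFrame from the question), coalesce merges existing partitions through a narrow dependency instead of shuffling:

# Sketch: coalesce reduces the partition count without a full shuffle.
df32 = df_1.repartition(32)        # full shuffle into 32 partitions
df4 = df32.coalesce(4)             # merges partitions, no shuffle
print(df4.rdd.getNumPartitions())  # 4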



Here is a good explanation of how to repartition DataFrames:
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
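
To see the difference in behaviour, here is a minimal sketch (assuming a local SparkSession named spark): hash partitioning on a heavily skewed key piles all rows into one partition, while repartition with no columns spreads them evenly regardless of the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# A deliberately skewed dataset: every row has the same key.
df = spark.createDataFrame([(1,)] * 1000, ["key"])

# Hash partitioning on the key sends all rows to a single partition...
print(df.repartition(8, "key").rdd.glom().map(len).collect())
# e.g. [0, 0, 1000, 0, 0, 0, 0, 0]

# ...whereas repartition(8) with no columns distributes them evenly.
print(df.repartition(8).rdd.glom().map(len).collect())
# e.g. [125, 125, 125, 125, 125, 125, 125, 125]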

answered Nov 29 '18 at 0:31 by Stefan Repcek, edited Nov 29 '18 at 0:44

  • So it just uses the row number? It would be great if I could get a reference to it in the source code.

    – veronik
    Nov 30 '18 at 1:08

  • What I understand is that it does not use any information from your dataset, no hash key; it just repartitions the data so that it is uniformly distributed (every partition having the same size). It makes sense: even other frameworks like Apache Kafka do not need a key to partition data. Apache Kafka partitions data using round robin by default if no key is provided.

    – Stefan Repcek
    Nov 30 '18 at 1:34

  • @Stefan Repcek, is there any way I can partition the data based on the total size of the DataFrame? I.e., assume I need each partition to be 128m: if I have 1GB I have to repartition to 1024, and if I have 5GB I have to repartition to 5120, so the repartition number should be calculated dynamically.

    – Shyam
    Jan 9 at 9:56