What does df.repartition with no column arguments partition on?
In PySpark, the DataFrame repartition method has an optional columns argument, which repartitions the DataFrame by that key.
My question is: how does Spark repartition when there is no key? I couldn't dig far enough into the source code to find where this is handled on the Spark side.
def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
    resulting DataFrame is hash partitioned.

    :param numPartitions:
        can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column. If not specified,
        the default number of partitions is used.

    .. versionchanged:: 1.6
       Added optional arguments to specify the partitioning columns. Also made numPartitions
       optional if partitioning columns are specified.

    >>> df.repartition(10).rdd.getNumPartitions()
    10
    >>> data = df.union(df).repartition("age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  5|  Bob|
    |  5|  Bob|
    |  2|Alice|
    |  2|Alice|
    +---+-----+
    >>> data = data.repartition(7, "age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> data.rdd.getNumPartitions()
    7
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)
        else:
            return DataFrame(
                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)
    elif isinstance(numPartitions, (basestring, Column)):
        cols = (numPartitions, ) + cols
        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)
    else:
        raise TypeError("numPartitions should be an int or Column")
For example, it's perfectly fine to call the lines below, but I have no idea what the partitioning is actually based on. Is it a hash of the entire row? Perhaps the first column of the DataFrame?
df_2 = (df_1
        .where(sf.col('some_column') == 1)
        .repartition(32)
        .alias('df_2'))
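One way to at least see what Spark chooses here is to print the physical plan; the Exchange node in the output names the partitioning that was picked. A minimal sketch, assuming Spark 2.x (the exact plan text varies between versions):

# Sketch: inspect the physical plan of a key-less repartition.
# df_1 is the DataFrame from the snippet above.
df_1.repartition(32).explain()
# The "Exchange ..." line in the output shows the partitioning Spark chose.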
python apache-spark pyspark pyspark-sql
Did my answer help you understand how repartition works without a key?
– Stefan Repcek
Nov 29 '18 at 9:31
1 Answer
By default, if no partitioning columns are specified, the partitioning is not based on any characteristic of the data; rows are simply distributed in a random but uniform way across the nodes.
The algorithm behind df.repartition does a full data shuffle and distributes the data equally among the partitions. To reduce shuffling, it is better to use df.coalesce when you only need to decrease the number of partitions.
Here is a good explanation of how to repartition DataFrames:
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
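As a quick sanity check of the "uniform distribution" claim, you can count how many rows land in each partition. A minimal sketch with a toy DataFrame (the data and numbers here are only for illustration):

# Sketch: verify that repartition(n) without a key spreads rows evenly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100000)            # toy single-column DataFrame
sizes = (df.repartition(8)             # no partitioning column given
           .rdd.glom()                 # one list of rows per partition
           .map(len)                   # row count per partition
           .collect())
print(sizes)                           # expect eight roughly equal counts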
So it just uses the row number? It would be great if I could get a reference to it in the source code.
– veronik
Nov 30 '18 at 1:08
From what I understand, it does not use any information from your dataset, no hash key; it just repartitions the data so that it is uniformly distributed (every partition having roughly the same size). That makes sense: other frameworks such as Apache Kafka also don't need a key to partition data. Kafka partitions data round-robin by default if no key is provided.
– Stefan Repcek
Nov 30 '18 at 1:34
@Stefan Repcek, is there any way I can partition the data based on the total size of the DataFrame? I.e. assume I need each partition to be 128m: if I have 1 GB I have to repartition 1024, if I have 5 GB I have to repartition 5120... so the repartition number should be calculated dynamically.
– Shyam
Jan 9 at 9:56
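A minimal sketch of that idea, assuming you can estimate the DataFrame's size in bytes up front (for example from the size of the input files); the size estimate itself is the hard part and is left as an input here:

# Sketch: derive a partition count from an estimated size,
# targeting roughly 128 MB per partition.
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def dynamic_num_partitions(estimated_size_bytes):
    return max(1, int(math.ceil(estimated_size_bytes / float(TARGET_PARTITION_BYTES))))

# e.g. a ~1 GB estimate gives 8 partitions, ~5 GB gives 40
df_resized = df.repartition(dynamic_num_partitions(5 * 1024 ** 3))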