What does df.repartition with no column arguments partition on?

In PySpark the repartition module has an optional columns argument which will of course repartition your dataframe by that key.

My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):

    """

    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The

    resulting DataFrame is hash partitioned.



    :param numPartitions:

        can be an int to specify the target number of partitions or a Column.

        If it is a Column, it will be used as the first partitioning column. If not specified,

        the default number of partitions is used.



    .. versionchanged:: 1.6

       Added optional arguments to specify the partitioning columns. Also made numPartitions

       optional if partitioning columns are specified.



    >>> df.repartition(10).rdd.getNumPartitions()

    10

    >>> data = df.union(df).repartition("age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  5|  Bob|

    |  5|  Bob|

    |  2|Alice|

    |  2|Alice|

    +---+-----+

    >>> data = data.repartition(7, "age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  2|Alice|

    |  5|  Bob|

    |  2|Alice|

    |  5|  Bob|

    +---+-----+

    >>> data.rdd.getNumPartitions()

    7

    """

    if isinstance(numPartitions, int):

        if len(cols) == 0:

            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)

        else:

            return DataFrame(

                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)

    elif isinstance(numPartitions, (basestring, Column)):

        cols = (numPartitions, ) + cols

        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)

    else:

        raise TypeError("numPartitions should be an int or Column")

For example: it's totally fine to call these lines but I have no idea what it's actually doing. Is it a hash of the entire line? Perhaps the first column in the dataframe?

df_2 = df_1

       .where(sf.col('some_column') == 1)

       .repartition(32)

       .alias('df_2')

edited Nov 29 '18 at 0:06

asked Nov 29 '18 at 0:04

veronik

406

Did my answer helped you to understand how repartition works without key?

– Stefan Repcek
Nov 29 '18 at 9:31

add a comment |

In PySpark the repartition module has an optional columns argument which will of course repartition your dataframe by that key.

My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):

    """

    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The

    resulting DataFrame is hash partitioned.



    :param numPartitions:

        can be an int to specify the target number of partitions or a Column.

        If it is a Column, it will be used as the first partitioning column. If not specified,

        the default number of partitions is used.



    .. versionchanged:: 1.6

       Added optional arguments to specify the partitioning columns. Also made numPartitions

       optional if partitioning columns are specified.



    >>> df.repartition(10).rdd.getNumPartitions()

    10

    >>> data = df.union(df).repartition("age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  5|  Bob|

    |  5|  Bob|

    |  2|Alice|

    |  2|Alice|

    +---+-----+

    >>> data = data.repartition(7, "age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  2|Alice|

    |  5|  Bob|

    |  2|Alice|

    |  5|  Bob|

    +---+-----+

    >>> data.rdd.getNumPartitions()

    7

    """

    if isinstance(numPartitions, int):

        if len(cols) == 0:

            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)

        else:

            return DataFrame(

                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)

    elif isinstance(numPartitions, (basestring, Column)):

        cols = (numPartitions, ) + cols

        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)

    else:

        raise TypeError("numPartitions should be an int or Column")

For example: it's totally fine to call these lines but I have no idea what it's actually doing. Is it a hash of the entire line? Perhaps the first column in the dataframe?

df_2 = df_1

       .where(sf.col('some_column') == 1)

       .repartition(32)

       .alias('df_2')

edited Nov 29 '18 at 0:06

asked Nov 29 '18 at 0:04

veronik

406

Did my answer helped you to understand how repartition works without key?

– Stefan Repcek
Nov 29 '18 at 9:31

add a comment |

In PySpark the repartition module has an optional columns argument which will of course repartition your dataframe by that key.

My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):

    """

    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The

    resulting DataFrame is hash partitioned.



    :param numPartitions:

        can be an int to specify the target number of partitions or a Column.

        If it is a Column, it will be used as the first partitioning column. If not specified,

        the default number of partitions is used.



    .. versionchanged:: 1.6

       Added optional arguments to specify the partitioning columns. Also made numPartitions

       optional if partitioning columns are specified.



    >>> df.repartition(10).rdd.getNumPartitions()

    10

    >>> data = df.union(df).repartition("age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  5|  Bob|

    |  5|  Bob|

    |  2|Alice|

    |  2|Alice|

    +---+-----+

    >>> data = data.repartition(7, "age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  2|Alice|

    |  5|  Bob|

    |  2|Alice|

    |  5|  Bob|

    +---+-----+

    >>> data.rdd.getNumPartitions()

    7

    """

    if isinstance(numPartitions, int):

        if len(cols) == 0:

            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)

        else:

            return DataFrame(

                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)

    elif isinstance(numPartitions, (basestring, Column)):

        cols = (numPartitions, ) + cols

        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)

    else:

        raise TypeError("numPartitions should be an int or Column")

For example: it's totally fine to call these lines but I have no idea what it's actually doing. Is it a hash of the entire line? Perhaps the first column in the dataframe?

df_2 = df_1

       .where(sf.col('some_column') == 1)

       .repartition(32)

       .alias('df_2')

edited Nov 29 '18 at 0:06

asked Nov 29 '18 at 0:04

veronik

406

In PySpark the repartition module has an optional columns argument which will of course repartition your dataframe by that key.

My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):

    """

    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The

    resulting DataFrame is hash partitioned.



    :param numPartitions:

        can be an int to specify the target number of partitions or a Column.

        If it is a Column, it will be used as the first partitioning column. If not specified,

        the default number of partitions is used.



    .. versionchanged:: 1.6

       Added optional arguments to specify the partitioning columns. Also made numPartitions

       optional if partitioning columns are specified.



    >>> df.repartition(10).rdd.getNumPartitions()

    10

    >>> data = df.union(df).repartition("age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  5|  Bob|

    |  5|  Bob|

    |  2|Alice|

    |  2|Alice|

    +---+-----+

    >>> data = data.repartition(7, "age")

    >>> data.show()

    +---+-----+

    |age| name|

    +---+-----+

    |  2|Alice|

    |  5|  Bob|

    |  2|Alice|

    |  5|  Bob|

    +---+-----+

    >>> data.rdd.getNumPartitions()

    7

    """

    if isinstance(numPartitions, int):

        if len(cols) == 0:

            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)

        else:

            return DataFrame(

                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)

    elif isinstance(numPartitions, (basestring, Column)):

        cols = (numPartitions, ) + cols

        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)

    else:

        raise TypeError("numPartitions should be an int or Column")

For example: it's totally fine to call these lines but I have no idea what it's actually doing. Is it a hash of the entire line? Perhaps the first column in the dataframe?

df_2 = df_1

       .where(sf.col('some_column') == 1)

       .repartition(32)

       .alias('df_2')

python apache-spark pyspark pyspark-sql

edited Nov 29 '18 at 0:06

asked Nov 29 '18 at 0:04

veronik

406

edited Nov 29 '18 at 0:06

asked Nov 29 '18 at 0:04

veronik

406

edited Nov 29 '18 at 0:06

asked Nov 29 '18 at 0:04

veronik

406

asked Nov 29 '18 at 0:04

veronik

406

asked Nov 29 '18 at 0:04

veronik

406

Did my answer helped you to understand how repartition works without key?

– Stefan Repcek
Nov 29 '18 at 9:31

add a comment |

Did my answer helped you to understand how repartition works without key?

– Stefan Repcek
Nov 29 '18 at 9:31

Did my answer helped you to understand how repartition works without key?

– Stefan Repcek
Nov 29 '18 at 9:31

add a comment |

1 Answer
1

active

oldest

votes

By default, If there is no partitioner specified the partitioning is not based upon characteristic of data but it is distributed in random and uniform way across nodes.

The repartition algorithm behind df.repartition does a full data shuffle and equally distributes the data among the partitions. To reduce shuffling it is better to use df.coalesce

Here is some good explanation how to repartition with DataFrame
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

edited Nov 29 '18 at 0:44

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

So it just uses the row number? It would be great if I could get a reference to it in the source code.

– veronik
Nov 30 '18 at 1:08

What I understand, it does not use any information from your dataset, no hask key, it just repartion data in a way that they are uniformely distributed (every partition having same size) It make sense, even other frameworks like apache kafka does not need key to partition data. Apache Kafka partition data using Round Robin by default if no key provided

– Stefan Repcek
Nov 30 '18 at 1:34

1

@Stefan Repcek , is there any way I can partition the data based on the total size of dataframe ? i.e. assume i need each partition 128m , if i have 1GB , i have to repartition 1024 , if i have 5GB , i have to repartition 5120 ....so repartition number should be calculated dynamically.

– Shyam
Jan 9 at 9:56

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53529974%2fwhat-does-df-repartition-with-no-column-arguments-partition-on%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

By default, If there is no partitioner specified the partitioning is not based upon characteristic of data but it is distributed in random and uniform way across nodes.

The repartition algorithm behind df.repartition does a full data shuffle and equally distributes the data among the partitions. To reduce shuffling it is better to use df.coalesce

Here is some good explanation how to repartition with DataFrame
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

edited Nov 29 '18 at 0:44

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

So it just uses the row number? It would be great if I could get a reference to it in the source code.

– veronik
Nov 30 '18 at 1:08

What I understand, it does not use any information from your dataset, no hask key, it just repartion data in a way that they are uniformely distributed (every partition having same size) It make sense, even other frameworks like apache kafka does not need key to partition data. Apache Kafka partition data using Round Robin by default if no key provided

– Stefan Repcek
Nov 30 '18 at 1:34

1

@Stefan Repcek , is there any way I can partition the data based on the total size of dataframe ? i.e. assume i need each partition 128m , if i have 1GB , i have to repartition 1024 , if i have 5GB , i have to repartition 5120 ....so repartition number should be calculated dynamically.

– Shyam
Jan 9 at 9:56

add a comment |

By default, If there is no partitioner specified the partitioning is not based upon characteristic of data but it is distributed in random and uniform way across nodes.

The repartition algorithm behind df.repartition does a full data shuffle and equally distributes the data among the partitions. To reduce shuffling it is better to use df.coalesce

Here is some good explanation how to repartition with DataFrame
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

edited Nov 29 '18 at 0:44

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

So it just uses the row number? It would be great if I could get a reference to it in the source code.

– veronik
Nov 30 '18 at 1:08

What I understand, it does not use any information from your dataset, no hask key, it just repartion data in a way that they are uniformely distributed (every partition having same size) It make sense, even other frameworks like apache kafka does not need key to partition data. Apache Kafka partition data using Round Robin by default if no key provided

– Stefan Repcek
Nov 30 '18 at 1:34

1

@Stefan Repcek , is there any way I can partition the data based on the total size of dataframe ? i.e. assume i need each partition 128m , if i have 1GB , i have to repartition 1024 , if i have 5GB , i have to repartition 5120 ....so repartition number should be calculated dynamically.

– Shyam
Jan 9 at 9:56

add a comment |

By default, If there is no partitioner specified the partitioning is not based upon characteristic of data but it is distributed in random and uniform way across nodes.

The repartition algorithm behind df.repartition does a full data shuffle and equally distributes the data among the partitions. To reduce shuffling it is better to use df.coalesce

Here is some good explanation how to repartition with DataFrame
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

edited Nov 29 '18 at 0:44

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

By default, If there is no partitioner specified the partitioning is not based upon characteristic of data but it is distributed in random and uniform way across nodes.

The repartition algorithm behind df.repartition does a full data shuffle and equally distributes the data among the partitions. To reduce shuffling it is better to use df.coalesce

Here is some good explanation how to repartition with DataFrame
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

edited Nov 29 '18 at 0:44

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

edited Nov 29 '18 at 0:44

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

answered Nov 29 '18 at 0:31

Stefan Repcek

1,40711223

So it just uses the row number? It would be great if I could get a reference to it in the source code.

– veronik
Nov 30 '18 at 1:08

What I understand, it does not use any information from your dataset, no hask key, it just repartion data in a way that they are uniformely distributed (every partition having same size) It make sense, even other frameworks like apache kafka does not need key to partition data. Apache Kafka partition data using Round Robin by default if no key provided

– Stefan Repcek
Nov 30 '18 at 1:34

1

@Stefan Repcek , is there any way I can partition the data based on the total size of dataframe ? i.e. assume i need each partition 128m , if i have 1GB , i have to repartition 1024 , if i have 5GB , i have to repartition 5120 ....so repartition number should be calculated dynamically.

– Shyam
Jan 9 at 9:56

add a comment |

So it just uses the row number? It would be great if I could get a reference to it in the source code.

– veronik
Nov 30 '18 at 1:08

What I understand, it does not use any information from your dataset, no hask key, it just repartion data in a way that they are uniformely distributed (every partition having same size) It make sense, even other frameworks like apache kafka does not need key to partition data. Apache Kafka partition data using Round Robin by default if no key provided

– Stefan Repcek
Nov 30 '18 at 1:34

1

@Stefan Repcek , is there any way I can partition the data based on the total size of dataframe ? i.e. assume i need each partition 128m , if i have 1GB , i have to repartition 1024 , if i have 5GB , i have to repartition 5120 ....so repartition number should be calculated dynamically.

– Shyam
Jan 9 at 9:56

So it just uses the row number? It would be great if I could get a reference to it in the source code.

– veronik
Nov 30 '18 at 1:08

What I understand, it does not use any information from your dataset, no hask key, it just repartion data in a way that they are uniformely distributed (every partition having same size) It make sense, even other frameworks like apache kafka does not need key to partition data. Apache Kafka partition data using Round Robin by default if no key provided

– Stefan Repcek
Nov 30 '18 at 1:34

@Stefan Repcek , is there any way I can partition the data based on the total size of dataframe ? i.e. assume i need each partition 128m , if i have 1GB , i have to repartition 1024 , if i have 5GB , i have to repartition 5120 ....so repartition number should be calculated dynamically.

– Shyam
Jan 9 at 9:56

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Btukfyl