How do the logical and physical plans work when reading a Hive-partitioned ORC table into a PySpark DataFrame?
I have created a Spark DataFrame by reading a CSV file from an HDFS location:
emp_df = (spark.read.format("com.databricks.spark.csv")
          .option("mode", "DROPMALFORMED")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("delimiter", ",")
          .load(PATH_TO_FILE))
and saved this DataFrame as a Hive-partitioned ORC table using the partitionBy method:
emp_df.repartition(5, 'emp_id').write.format('orc').partitionBy("emp_id").saveAsTable("UDB.temptable")
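With partitionBy("emp_id"), Spark lays the table out as one Hive-style key=value subdirectory per partition value. The layout below is a sketch under that assumption (file names are illustrative, not taken from the actual cluster):

```
hdfs://pathlocation/database/udb.db/temptable/
├── emp_id=1/part-00000-....orc
├── emp_id=2/part-00000-....orc
└── emp_id=6/part-00000-....orc
```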
When I read this table as shown below and look at the logical and physical plans, it appears that the data is perfectly filtered using the partition key column:
emp_df_1 = spark.sql("select * from UDB.temptable where emp_id ='6'")
emp_df_1.explain(True)
***************************************************************************
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('emp_id = 6)
+- 'UnresolvedRelation `UDB`.`temptable`
== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int, emp_id: int
Project [emp_name#7399, emp_city#7400, emp_salary#7401, emp_id#7402]
+- Filter (emp_id#7402 = cast(6 as int))
+- SubqueryAlias temptable
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc
== Optimized Logical Plan ==
Filter (isnotnull(emp_id#7402) && (emp_id#7402 = 6))
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc
== Physical Plan ==
*(1) FileScan orc udb.temptable[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://pathlocation/database/udb....,
PartitionCount: 1, PartitionFilters: [isnotnull(emp_id#7402), (emp_id#7402 = 6)], PushedFilters: , ReadSchema: struct<emp_name:string,emp_city:string,emp_salary:int>
***************************************************************************
whereas if I read this DataFrame via the absolute HDFS path, it seems unable to filter the data using the partition key column:
emp_df_2 = spark.read.format("orc").load("hdfs://pathlocation/database/udb.db/temptable/emp_id=6")
emp_df_2.explain(True)
******************************************************************************
== Parsed Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Optimized Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Physical Plan ==
*(1) FileScan orc [emp_name#7411,emp_city#7412,emp_salary#7413] Batched: true, Format: ORC, Location: InMemoryFileIndex[hdfs://pathlocation/data/database/udb.db/tem...,
PartitionFilters: , PushedFilters: , ReadSchema: struct<emp_name:string,emp_city:string,emp_salary:int>
********************************************************************************
Could you please help me understand the logical and physical plans in both cases?
dataframe hive pyspark orc
asked Nov 29 '18 at 5:30 by vikrant rana · edited Nov 29 '18 at 9:26 by mayank agrawal
1 Answer
In your second example, the partition location is already included in the HDFS path. You can still point to the parent directory as the path and make use of partitioning with the following code:
full_dataset_df = (spark.read.format("orc")
                   .load("hdfs://pathlocation/database/udb.db/temptable"))
one_partition_df = full_dataset_df.where(full_dataset_df.emp_id == 6)
It's worth mentioning that whichever of these three methods you use, the data processing performance will be the same.
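The difference comes down to partition discovery: when Spark scans the table's parent directory, it parses Hive-style key=value path segments into partition columns, but when the load path already points inside emp_id=6, there is no segment left to parse, so emp_id vanishes from the schema and no PartitionFilters can be applied. A minimal pure-Python sketch of that path parsing (illustrative paths, not Spark's actual implementation):

```python
import re

def partition_values(file_path, base_path):
    """Parse Hive-style key=value segments between base_path and the file."""
    rel = file_path[len(base_path):].strip("/")
    return dict(re.findall(r"([^/=]+)=([^/]+)", rel))

base = "hdfs://pathlocation/database/udb.db/temptable"
# Loading the table root: the emp_id=6 segment becomes a partition column.
print(partition_values(base + "/emp_id=6/part-00000.orc", base))   # {'emp_id': '6'}
# Loading inside the partition directory: nothing left to parse.
print(partition_values(base + "/emp_id=6/part-00000.orc", base + "/emp_id=6"))  # {}
```

As an aside (not part of the original answer): if you do need to load a single partition path directly while keeping emp_id in the schema, Spark's `basePath` read option tells partition discovery where the table root is.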
Yeah, that way we can filter using the partition key. I thought that specifying the absolute path with the partition subdirectory structure would filter the data by partition. Thanks
– vikrant rana
Nov 29 '18 at 15:38
Was there any specific reason to downvote?
– vikrant rana
Nov 29 '18 at 20:39
I think not; there is nothing wrong with your question. I will upvote it, that is at least what I can do :-)
– Mariusz
Nov 30 '18 at 7:14
answered Nov 29 '18 at 11:21 by Mariusz · edited Nov 29 '18 at 14:07 by mayank agrawal