How do the logical and physical plans work when reading a Hive-partitioned ORC table into a PySpark DataFrame?
I have created a Spark DataFrame by reading a CSV file from an HDFS location:
emp_df = (spark.read.format("com.databricks.spark.csv")
          .option("mode", "DROPMALFORMED")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("delimiter", ",")
          .load(PATH_TO_FILE))
and saved this DataFrame as a Hive-partitioned ORC table using the partitionBy method:
emp_df.repartition(5, 'emp_id').write.format('orc').partitionBy("emp_id").saveAsTable("UDB.temptable")
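With partitionBy("emp_id"), Spark lays the table out as one Hive-style key=value subdirectory per partition value. The layout below is a sketch under that assumption (file names are illustrative, not taken from the actual cluster):

```
hdfs://pathlocation/database/udb.db/temptable/
├── emp_id=1/part-00000-....orc
├── emp_id=2/part-00000-....orc
└── emp_id=6/part-00000-....orc
```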
When I read this table as shown below and look at the logical and physical plans, it appears that the data is perfectly filtered using the partition key column:
emp_df_1 = spark.sql("select * from UDB.temptable where emp_id ='6'")
emp_df_1.explain(True)
***************************************************************************
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('emp_id = 6)
+- 'UnresolvedRelation `UDB`.`temptable`
== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int, emp_id: int
Project [emp_name#7399, emp_city#7400, emp_salary#7401, emp_id#7402]
+- Filter (emp_id#7402 = cast(6 as int))
+- SubqueryAlias temptable
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc
== Optimized Logical Plan ==
Filter (isnotnull(emp_id#7402) && (emp_id#7402 = 6))
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc
== Physical Plan ==
*(1) FileScan orc udb.temptable[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://pathlocation/database/udb....,
PartitionCount: 1, PartitionFilters: [isnotnull(emp_id#7402), (emp_id#7402 = 6)], PushedFilters: , ReadSchema: struct<emp_name:string,emp_city:string,emp_salary:int>
***************************************************************************
whereas if I read this DataFrame via the absolute HDFS path, it seems unable to filter the data using the partition key column:
emp_df_2 = spark.read.format("orc").load("hdfs://pathlocation/database/udb.db/temptable/emp_id=6")
emp_df_2.explain(True)
******************************************************************************
== Parsed Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Optimized Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Physical Plan ==
*(1) FileScan orc [emp_name#7411,emp_city#7412,emp_salary#7413] Batched: true, Format: ORC, Location: InMemoryFileIndex[hdfs://pathlocation/data/database/udb.db/tem...,
PartitionFilters: , PushedFilters: , ReadSchema: struct<emp_name:string,emp_city:string,emp_salary:int>
********************************************************************************
Could you please help me understand the logical and physical plans in both cases?
dataframe hive pyspark orc
asked Nov 29 '18 at 5:30 by vikrant rana · edited Nov 29 '18 at 9:26 by mayank agrawal
1 Answer
In your second example, the partition location is already included in the HDFS path. You can still point to the parent directory as the path and make use of partitioning with the following code:
full_dataset_df = (spark.read.format("orc")
                   .load("hdfs://pathlocation/database/udb.db/temptable"))
one_partition_df = full_dataset_df.where(full_dataset_df.emp_id == 6)
It's worth mentioning that whichever of these three methods you use, the data processing performance will be the same.
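The difference comes down to partition discovery: when Spark scans the table's parent directory, it parses Hive-style key=value path segments into partition columns, but when the load path already points inside emp_id=6, there is no segment left to parse, so emp_id vanishes from the schema and no PartitionFilters can be applied. A minimal pure-Python sketch of that path parsing (illustrative paths, not Spark's actual implementation):

```python
import re

def partition_values(file_path, base_path):
    """Parse Hive-style key=value segments between base_path and the file."""
    rel = file_path[len(base_path):].strip("/")
    return dict(re.findall(r"([^/=]+)=([^/]+)", rel))

base = "hdfs://pathlocation/database/udb.db/temptable"
# Loading the table root: the emp_id=6 segment becomes a partition column.
print(partition_values(base + "/emp_id=6/part-00000.orc", base))   # {'emp_id': '6'}
# Loading inside the partition directory: nothing left to parse.
print(partition_values(base + "/emp_id=6/part-00000.orc", base + "/emp_id=6"))  # {}
```

As an aside (not part of the original answer): if you do need to load a single partition path directly while keeping emp_id in the schema, Spark's `basePath` read option tells partition discovery where the table root is.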
Yeah, that way we can filter using the partition key. I thought that specifying the absolute path with the partition subdirectory structure would filter the data by partition. Thanks
– vikrant rana
Nov 29 '18 at 15:38
Was there any specific reason to downvote?
– vikrant rana
Nov 29 '18 at 20:39
I think not; there is nothing wrong with your question. I will upvote it, that is at least what I can do :-)
– Mariusz
Nov 30 '18 at 7:14
answered Nov 29 '18 at 11:21 by Mariusz · edited Nov 29 '18 at 14:07 by mayank agrawal