Why is count(distinct) slower than group by in Hive?











up vote
14
down vote

favorite
10












On Hive, I believe count(distinct) will be more likely than group-by to result in an unbalanced workload to reducers and end up with one sad reducer grinding away. Example query below.



Why?



Example query:



select count(distinct user)
from some_table


Version with group-by (proposed as faster):



select count(*) from
(select user
from some_table
group by user) q


Note: slide 26 of this presentation describes the problem.










share|improve this question
























  • I don't understand your question. Are you asking why the group by version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way?
    – Hari Menon
    Oct 11 '13 at 6:21






  • 1




    just use EXPLAIN
    – Bohdan
    Jan 15 '14 at 5:02















up vote
14
down vote

favorite
10












On Hive, I believe count(distinct) will be more likely than group-by to result in an unbalanced workload to reducers and end up with one sad reducer grinding away. Example query below.



Why?



Example query:



select count(distinct user)
from some_table


Version with group-by (proposed as faster):



select count(*) from
(select user
from some_table
group by user) q


Note: slide 26 of this presentation describes the problem.










share|improve this question
























  • I don't understand your question. Are you asking why the group by version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way?
    – Hari Menon
    Oct 11 '13 at 6:21






  • 1




    just use EXPLAIN
    – Bohdan
    Jan 15 '14 at 5:02













up vote
14
down vote

favorite
10









up vote
14
down vote

favorite
10






10





On Hive, I believe count(distinct) will be more likely than group-by to result in an unbalanced workload to reducers and end up with one sad reducer grinding away. Example query below.



Why?



Example query:



select count(distinct user)
from some_table


Version with group-by (proposed as faster):



select count(*) from
(select user
from some_table
group by user) q


Note: slide 26 of this presentation describes the problem.










share|improve this question















On Hive, I believe count(distinct) will be more likely than group-by to result in an unbalanced workload to reducers and end up with one sad reducer grinding away. Example query below.



Why?



Example query:



select count(distinct user)
from some_table


Version with group-by (proposed as faster):



select count(*) from
(select user
from some_table
group by user) q


Note: slide 26 of this presentation describes the problem.







performance hive aggregate-functions






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 22 at 13:50









mrsrinivas

14.9k76487




14.9k76487










asked Oct 11 '13 at 5:56









dfrankow

8,5093198149




8,5093198149












  • I don't understand your question. Are you asking why the group by version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way?
    – Hari Menon
    Oct 11 '13 at 6:21






  • 1




    just use EXPLAIN
    – Bohdan
    Jan 15 '14 at 5:02


















  • I don't understand your question. Are you asking why the group by version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way?
    – Hari Menon
    Oct 11 '13 at 6:21






  • 1




    just use EXPLAIN
    – Bohdan
    Jan 15 '14 at 5:02
















I don't understand your question. Are you asking why the group by version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way?
– Hari Menon
Oct 11 '13 at 6:21




I don't understand your question. Are you asking why the group by version is faster? If yes, then why do you believe it is faster? You read it somewhere or you saw it behave in that way?
– Hari Menon
Oct 11 '13 at 6:21




1




1




just use EXPLAIN
– Bohdan
Jan 15 '14 at 5:02




just use EXPLAIN
– Bohdan
Jan 15 '14 at 5:02












1 Answer
1






active

oldest

votes

















up vote
20
down vote













select count(distinct user)
from some_table;


This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.



select count(*) from
(select user
from some_table
group by user) q;


This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.



So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.



This would normally not be an optimization. You would have to have a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages and that alone would be slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning for the advice you got, I would be skeptical unless proper measurement is done and shows improvement.






share|improve this answer

















  • 1




    in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
    – Harper Koo
    Mar 28 '16 at 10:00










  • upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
    – Keith
    Apr 22 '17 at 4:00













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f19311193%2fwhy-is-countdistinct-slower-than-group-by-in-hive%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
20
down vote













select count(distinct user)
from some_table;


This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.



select count(*) from
(select user
from some_table
group by user) q;


This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.



So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.



This would normally not be an optimization. You would have to have a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages and that alone would be slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning for the advice you got, I would be skeptical unless proper measurement is done and shows improvement.






share|improve this answer

















  • 1




    in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
    – Harper Koo
    Mar 28 '16 at 10:00










  • upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
    – Keith
    Apr 22 '17 at 4:00

















up vote
20
down vote













select count(distinct user)
from some_table;


This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.



select count(*) from
(select user
from some_table
group by user) q;


This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.



So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.



This would normally not be an optimization. You would have to have a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages and that alone would be slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning for the advice you got, I would be skeptical unless proper measurement is done and shows improvement.






share|improve this answer

















  • 1




    in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
    – Harper Koo
    Mar 28 '16 at 10:00










  • upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
    – Keith
    Apr 22 '17 at 4:00















up vote
20
down vote










up vote
20
down vote









select count(distinct user)
from some_table;


This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.



select count(*) from
(select user
from some_table
group by user) q;


This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.



So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.



This would normally not be an optimization. You would have to have a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages and that alone would be slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning for the advice you got, I would be skeptical unless proper measurement is done and shows improvement.






share|improve this answer












select count(distinct user)
from some_table;


This query does the count on the map side. Each mapper emits one value, the count. Then all values have to be aggregated to produce the total count, and that is the job of one single reducer.



select count(*) from
(select user
from some_table
group by user) q;


This query has two stages. On stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. The output has to be aggregated then on the reduce side, but it can use many reducers. On stage 2 the the COUNT is performed, on the map side, and then the final result is aggregated using one single reducer.



So if you have a very large number of map side splits then the first query will have to aggregate a very large number of one value results. The second query can use many reducers at the reduce side of stage 1 and then, at stage 2, will have a smaller task for the lone reducer at the end.



This would normally not be an optimization. You would have to have a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages and that alone would be slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning for the advice you got, I would be skeptical unless proper measurement is done and shows improvement.







share|improve this answer












share|improve this answer



share|improve this answer










answered Oct 11 '13 at 9:20









Remus Rusanu

244k31349480




244k31349480








  • 1




    in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
    – Harper Koo
    Mar 28 '16 at 10:00










  • upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
    – Keith
    Apr 22 '17 at 4:00
















  • 1




    in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
    – Harper Koo
    Mar 28 '16 at 10:00










  • upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
    – Keith
    Apr 22 '17 at 4:00










1




1




in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
– Harper Koo
Mar 28 '16 at 10:00




in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE..
– Harper Koo
Mar 28 '16 at 10:00












upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
– Keith
Apr 22 '17 at 4:00






upvote your answer. An example for large table is : if there is very few distinct values , then the first option may run faster than the second since most group by is done in map side
– Keith
Apr 22 '17 at 4:00




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f19311193%2fwhy-is-countdistinct-slower-than-group-by-in-hive%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

Calculate evaluation metrics using cross_val_predict sklearn

Insert data from modal to MySQL (multiple modal on website)