Impala/Hive Filling in Missing Values Similar to LOCF (last observation carry forward)
I have a time series data in Impala that in this format.
One record get created when and only when there is a change, updated value represents the new data.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 20 | 80 |
|B | 2017 | 8 | 90 |
|... | ... | ... | ... |
---------------------------------------
Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.
This should be the ideal output.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | . | 5 |
|A | 2017 | 4 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 6 | 10 |
|A | 2017 | . | 10 |
|A | 2017 | 20 | 80 |
|A | 2017 | . | 80 |
|A | 2017 | 52 | 80 |
|B | 2017 | 1 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 8 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 52 | 90 |
|... | ... | ... | ... |
---------------------------------------
Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.
sql hive impala
add a comment |
I have a time series data in Impala that in this format.
One record get created when and only when there is a change, updated value represents the new data.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 20 | 80 |
|B | 2017 | 8 | 90 |
|... | ... | ... | ... |
---------------------------------------
Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.
This should be the ideal output.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | . | 5 |
|A | 2017 | 4 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 6 | 10 |
|A | 2017 | . | 10 |
|A | 2017 | 20 | 80 |
|A | 2017 | . | 80 |
|A | 2017 | 52 | 80 |
|B | 2017 | 1 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 8 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 52 | 90 |
|... | ... | ... | ... |
---------------------------------------
Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.
sql hive impala
add a comment |
I have a time series data in Impala that in this format.
One record get created when and only when there is a change, updated value represents the new data.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 20 | 80 |
|B | 2017 | 8 | 90 |
|... | ... | ... | ... |
---------------------------------------
Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.
This should be the ideal output.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | . | 5 |
|A | 2017 | 4 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 6 | 10 |
|A | 2017 | . | 10 |
|A | 2017 | 20 | 80 |
|A | 2017 | . | 80 |
|A | 2017 | 52 | 80 |
|B | 2017 | 1 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 8 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 52 | 90 |
|... | ... | ... | ... |
---------------------------------------
Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.
sql hive impala
I have a time series data in Impala that in this format.
One record get created when and only when there is a change, updated value represents the new data.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 20 | 80 |
|B | 2017 | 8 | 90 |
|... | ... | ... | ... |
---------------------------------------
Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.
This should be the ideal output.
---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | . | 5 |
|A | 2017 | 4 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 6 | 10 |
|A | 2017 | . | 10 |
|A | 2017 | 20 | 80 |
|A | 2017 | . | 80 |
|A | 2017 | 52 | 80 |
|B | 2017 | 1 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 8 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 52 | 90 |
|... | ... | ... | ... |
---------------------------------------
Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.
sql hive impala
sql hive impala
edited Nov 27 '18 at 17:33
Vamsi Prabhala
41.5k42039
41.5k42039
asked Nov 27 '18 at 17:33
B.Mr.W.B.Mr.W.
8,6041876130
8,6041876130
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Step 1:
Create a table with numbers.
create table if not exists tblNumbers
location 'hdfs_location' as
select 1 as num
union all
select 2 as num
....
Step 2:
Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max
to generate the values for missing rows from the last found value.
select product,year,week,max(val) over(partition by product,year,grp) as new_val
from (select py.product,py.year,n.week,t.val
,sum(case when t.val is not null then 1 else 0 end)
over(partition by py.product,py.year order by n.week) as grp
from tblNumbers n
cross join (select distinct product,year from tbl) py
left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
) t
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53505139%2fimpala-hive-filling-in-missing-values-similar-to-locf-last-observation-carry-fo%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Step 1:
Create a table with numbers.
create table if not exists tblNumbers
location 'hdfs_location' as
select 1 as num
union all
select 2 as num
....
Step 2:
Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max
to generate the values for missing rows from the last found value.
select product,year,week,max(val) over(partition by product,year,grp) as new_val
from (select py.product,py.year,n.week,t.val
,sum(case when t.val is not null then 1 else 0 end)
over(partition by py.product,py.year order by n.week) as grp
from tblNumbers n
cross join (select distinct product,year from tbl) py
left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
) t
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
add a comment |
Step 1:
Create a table with numbers.
create table if not exists tblNumbers
location 'hdfs_location' as
select 1 as num
union all
select 2 as num
....
Step 2:
Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max
to generate the values for missing rows from the last found value.
select product,year,week,max(val) over(partition by product,year,grp) as new_val
from (select py.product,py.year,n.week,t.val
,sum(case when t.val is not null then 1 else 0 end)
over(partition by py.product,py.year order by n.week) as grp
from tblNumbers n
cross join (select distinct product,year from tbl) py
left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
) t
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
add a comment |
Step 1:
Create a table with numbers.
create table if not exists tblNumbers
location 'hdfs_location' as
select 1 as num
union all
select 2 as num
....
Step 2:
Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max
to generate the values for missing rows from the last found value.
select product,year,week,max(val) over(partition by product,year,grp) as new_val
from (select py.product,py.year,n.week,t.val
,sum(case when t.val is not null then 1 else 0 end)
over(partition by py.product,py.year order by n.week) as grp
from tblNumbers n
cross join (select distinct product,year from tbl) py
left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
) t
Step 1:
Create a table with numbers.
create table if not exists tblNumbers
location 'hdfs_location' as
select 1 as num
union all
select 2 as num
....
Step 2:
Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max
to generate the values for missing rows from the last found value.
select product,year,week,max(val) over(partition by product,year,grp) as new_val
from (select py.product,py.year,n.week,t.val
,sum(case when t.val is not null then 1 else 0 end)
over(partition by py.product,py.year order by n.week) as grp
from tblNumbers n
cross join (select distinct product,year from tbl) py
left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
) t
edited Nov 27 '18 at 18:41
answered Nov 27 '18 at 17:43
Vamsi PrabhalaVamsi Prabhala
41.5k42039
41.5k42039
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
add a comment |
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
max(val) is referring to a column that doesn't exist, should the value also be included before the sum?
– B.Mr.W.
Nov 27 '18 at 18:40
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
yes please..edited the answer as well.
– Vamsi Prabhala
Nov 27 '18 at 18:41
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53505139%2fimpala-hive-filling-in-missing-values-similar-to-locf-last-observation-carry-fo%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown