Impala/Hive Filling in Missing Values Similar to LOCF (last observation carry forward)












0















I have a time series data in Impala that in this format.



One record get created when and only when there is a change, updated value represents the new data.



---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 20 | 80 |
|B | 2017 | 8 | 90 |
|... | ... | ... | ... |
---------------------------------------


Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.



This should be the ideal output.



---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A | 2017 | 1 | 5 |
|A | 2017 | . | 5 |
|A | 2017 | 4 | 5 |
|A | 2017 | 5 | 10 |
|A | 2017 | 6 | 10 |
|A | 2017 | . | 10 |
|A | 2017 | 20 | 80 |
|A | 2017 | . | 80 |
|A | 2017 | 52 | 80 |
|B | 2017 | 1 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 8 | 90 |
|B | 2017 | . | 90 |
|B | 2017 | 52 | 90 |
|... | ... | ... | ... |
---------------------------------------


Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.










share|improve this question





























    0















    I have a time series data in Impala that in this format.



    One record get created when and only when there is a change, updated value represents the new data.



    ---------------------------------------
    | Product | Year | Week | UpdatedValue |
    ---------------------------------------
    |A | 2017 | 1 | 5 |
    |A | 2017 | 5 | 10 |
    |A | 2017 | 20 | 80 |
    |B | 2017 | 8 | 90 |
    |... | ... | ... | ... |
    ---------------------------------------


    Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.



    This should be the ideal output.



    ---------------------------------------
    | Product | Year | Week | UpdatedValue |
    ---------------------------------------
    |A | 2017 | 1 | 5 |
    |A | 2017 | . | 5 |
    |A | 2017 | 4 | 5 |
    |A | 2017 | 5 | 10 |
    |A | 2017 | 6 | 10 |
    |A | 2017 | . | 10 |
    |A | 2017 | 20 | 80 |
    |A | 2017 | . | 80 |
    |A | 2017 | 52 | 80 |
    |B | 2017 | 1 | 90 |
    |B | 2017 | . | 90 |
    |B | 2017 | 8 | 90 |
    |B | 2017 | . | 90 |
    |B | 2017 | 52 | 90 |
    |... | ... | ... | ... |
    ---------------------------------------


    Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.










    share|improve this question



























      0












      0








      0








      I have a time series data in Impala that in this format.



      One record get created when and only when there is a change, updated value represents the new data.



      ---------------------------------------
      | Product | Year | Week | UpdatedValue |
      ---------------------------------------
      |A | 2017 | 1 | 5 |
      |A | 2017 | 5 | 10 |
      |A | 2017 | 20 | 80 |
      |B | 2017 | 8 | 90 |
      |... | ... | ... | ... |
      ---------------------------------------


      Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.



      This should be the ideal output.



      ---------------------------------------
      | Product | Year | Week | UpdatedValue |
      ---------------------------------------
      |A | 2017 | 1 | 5 |
      |A | 2017 | . | 5 |
      |A | 2017 | 4 | 5 |
      |A | 2017 | 5 | 10 |
      |A | 2017 | 6 | 10 |
      |A | 2017 | . | 10 |
      |A | 2017 | 20 | 80 |
      |A | 2017 | . | 80 |
      |A | 2017 | 52 | 80 |
      |B | 2017 | 1 | 90 |
      |B | 2017 | . | 90 |
      |B | 2017 | 8 | 90 |
      |B | 2017 | . | 90 |
      |B | 2017 | 52 | 90 |
      |... | ... | ... | ... |
      ---------------------------------------


      Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.










      share|improve this question
















      I have a time series data in Impala that in this format.



      One record get created when and only when there is a change, updated value represents the new data.



      ---------------------------------------
      | Product | Year | Week | UpdatedValue |
      ---------------------------------------
      |A | 2017 | 1 | 5 |
      |A | 2017 | 5 | 10 |
      |A | 2017 | 20 | 80 |
      |B | 2017 | 8 | 90 |
      |... | ... | ... | ... |
      ---------------------------------------


      Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.



      This should be the ideal output.



      ---------------------------------------
      | Product | Year | Week | UpdatedValue |
      ---------------------------------------
      |A | 2017 | 1 | 5 |
      |A | 2017 | . | 5 |
      |A | 2017 | 4 | 5 |
      |A | 2017 | 5 | 10 |
      |A | 2017 | 6 | 10 |
      |A | 2017 | . | 10 |
      |A | 2017 | 20 | 80 |
      |A | 2017 | . | 80 |
      |A | 2017 | 52 | 80 |
      |B | 2017 | 1 | 90 |
      |B | 2017 | . | 90 |
      |B | 2017 | 8 | 90 |
      |B | 2017 | . | 90 |
      |B | 2017 | 52 | 90 |
      |... | ... | ... | ... |
      ---------------------------------------


      Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.







      sql hive impala






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 27 '18 at 17:33









      Vamsi Prabhala

      41.5k42039




      41.5k42039










      asked Nov 27 '18 at 17:33









      B.Mr.W.B.Mr.W.

      8,6041876130




      8,6041876130
























          1 Answer
          1






          active

          oldest

          votes


















          1














          Step 1:
          Create a table with numbers.



          create table if not exists tblNumbers
          location 'hdfs_location' as
          select 1 as num
          union all
          select 2 as num
          ....


          Step 2:
          Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max to generate the values for missing rows from the last found value.



          select product,year,week,max(val) over(partition by product,year,grp) as new_val
          from (select py.product,py.year,n.week,t.val
          ,sum(case when t.val is not null then 1 else 0 end)
          over(partition by py.product,py.year order by n.week) as grp
          from tblNumbers n
          cross join (select distinct product,year from tbl) py
          left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
          ) t





          share|improve this answer


























          • max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

            – B.Mr.W.
            Nov 27 '18 at 18:40











          • yes please..edited the answer as well.

            – Vamsi Prabhala
            Nov 27 '18 at 18:41











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53505139%2fimpala-hive-filling-in-missing-values-similar-to-locf-last-observation-carry-fo%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          Step 1:
          Create a table with numbers.



          create table if not exists tblNumbers
          location 'hdfs_location' as
          select 1 as num
          union all
          select 2 as num
          ....


          Step 2:
          Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max to generate the values for missing rows from the last found value.



          select product,year,week,max(val) over(partition by product,year,grp) as new_val
          from (select py.product,py.year,n.week,t.val
          ,sum(case when t.val is not null then 1 else 0 end)
          over(partition by py.product,py.year order by n.week) as grp
          from tblNumbers n
          cross join (select distinct product,year from tbl) py
          left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
          ) t





          share|improve this answer


























          • max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

            – B.Mr.W.
            Nov 27 '18 at 18:40











          • yes please..edited the answer as well.

            – Vamsi Prabhala
            Nov 27 '18 at 18:41
















          1














          Step 1:
          Create a table with numbers.



          create table if not exists tblNumbers
          location 'hdfs_location' as
          select 1 as num
          union all
          select 2 as num
          ....


          Step 2:
          Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max to generate the values for missing rows from the last found value.



          select product,year,week,max(val) over(partition by product,year,grp) as new_val
          from (select py.product,py.year,n.week,t.val
          ,sum(case when t.val is not null then 1 else 0 end)
          over(partition by py.product,py.year order by n.week) as grp
          from tblNumbers n
          cross join (select distinct product,year from tbl) py
          left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
          ) t





          share|improve this answer


























          • max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

            – B.Mr.W.
            Nov 27 '18 at 18:40











          • yes please..edited the answer as well.

            – Vamsi Prabhala
            Nov 27 '18 at 18:41














          1












          1








          1







          Step 1:
          Create a table with numbers.



          create table if not exists tblNumbers
          location 'hdfs_location' as
          select 1 as num
          union all
          select 2 as num
          ....


          Step 2:
          Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max to generate the values for missing rows from the last found value.



          select product,year,week,max(val) over(partition by product,year,grp) as new_val
          from (select py.product,py.year,n.week,t.val
          ,sum(case when t.val is not null then 1 else 0 end)
          over(partition by py.product,py.year order by n.week) as grp
          from tblNumbers n
          cross join (select distinct product,year from tbl) py
          left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
          ) t





          share|improve this answer















          Step 1:
          Create a table with numbers.



          create table if not exists tblNumbers
          location 'hdfs_location' as
          select 1 as num
          union all
          select 2 as num
          ....


          Step 2:
          Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max to generate the values for missing rows from the last found value.



          select product,year,week,max(val) over(partition by product,year,grp) as new_val
          from (select py.product,py.year,n.week,t.val
          ,sum(case when t.val is not null then 1 else 0 end)
          over(partition by py.product,py.year order by n.week) as grp
          from tblNumbers n
          cross join (select distinct product,year from tbl) py
          left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
          ) t






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 27 '18 at 18:41

























          answered Nov 27 '18 at 17:43









          Vamsi PrabhalaVamsi Prabhala

          41.5k42039




          41.5k42039













          • max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

            – B.Mr.W.
            Nov 27 '18 at 18:40











          • yes please..edited the answer as well.

            – Vamsi Prabhala
            Nov 27 '18 at 18:41



















          • max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

            – B.Mr.W.
            Nov 27 '18 at 18:40











          • yes please..edited the answer as well.

            – Vamsi Prabhala
            Nov 27 '18 at 18:41

















          max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

          – B.Mr.W.
          Nov 27 '18 at 18:40





          max(val) is referring to a column that doesn't exist, should the value also be included before the sum?

          – B.Mr.W.
          Nov 27 '18 at 18:40













          yes please..edited the answer as well.

          – Vamsi Prabhala
          Nov 27 '18 at 18:41





          yes please..edited the answer as well.

          – Vamsi Prabhala
          Nov 27 '18 at 18:41




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53505139%2fimpala-hive-filling-in-missing-values-similar-to-locf-last-observation-carry-fo%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

          Calculate evaluation metrics using cross_val_predict sklearn

          Insert data from modal to MySQL (multiple modal on website)