Python Pandas data aggregation by time interval












0















I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)



This is an example of my data:



            desc        date_1      date_2      date_3      values
54287171 cc-cc 2018-03-14 2017-07-03 2018-05-21 55
49410141 other-dd 2012-01-18 2017-01-26 2011-12-30 17
37694577 other-dd 2018-07-05 2017-07-25 2018-06-19 9
54051782 other-cc 2014-10-23 2017-11-24 2014-10-31 37
7378464 dd-cc 2016-08-05 2018-05-15 2016-07-22 92
29665541 dd-cc 2011-12-14 2017-08-01 2012-05-01 40
2999878 dd-cc 2018-10-03 2018-04-13 2018-09-17 37
39453869 cc-cc 2015-11-24 2017-09-09 2015-11-21 81
7181109 dd-dd 2018-01-18 2017-11-24 2018-01-15 27
29580865 dd-cc 2017-04-24 2017-09-07 2017-05-04 38
14778957 other-cc 2017-11-02 2017-06-20 2018-06-26 49
32500886 cc-dd 2017-01-12 2017-05-26 2017-01-12 50
52146154 other-cc 2018-08-01 2017-03-27 2018-07-16 5
7208584 cc-dd 2018-03-13 2018-07-04 2018-04-26 8
35894666 cc-cc 2017-12-04 2018-06-13 2018-08-14 88
27565108 other-other 2015-10-19 2017-03-14 2016-01-22 88
50705834 other-cc 2018-01-08 2017-12-09 2018-01-11 62
45420360 dd-cc 2017-10-23 2017-09-02 2018-01-29 52
55933497 dd-cc 2017-04-14 2018-06-07 2017-09-27 36
46160680 dd-cc 2014-06-05 2018-01-16 2016-01-27 87


In brief, I'm trying to recreate is this functionality:



SUM(CASE 
WHEN date_1 <= date_2 - interval '11' month
AND date_3 > date_2 - interval '11' month
THEN values
end)


But then also group by the desc column.



So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:




  • "date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

  • I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.


An example output should essentially look like this:



    desc        interval         values_sum
cc-cc at_date 55
cc-dd at_date 17
other-dd at_date 9
cc-cc date_minus_1 37
cc-dd date_minus_1 92
other-dd date_minus_1 40
cc-cc date_minus_2 37


Any help would be greatly appreciated.










share|improve this question



























    0















    I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)



    This is an example of my data:



                desc        date_1      date_2      date_3      values
    54287171 cc-cc 2018-03-14 2017-07-03 2018-05-21 55
    49410141 other-dd 2012-01-18 2017-01-26 2011-12-30 17
    37694577 other-dd 2018-07-05 2017-07-25 2018-06-19 9
    54051782 other-cc 2014-10-23 2017-11-24 2014-10-31 37
    7378464 dd-cc 2016-08-05 2018-05-15 2016-07-22 92
    29665541 dd-cc 2011-12-14 2017-08-01 2012-05-01 40
    2999878 dd-cc 2018-10-03 2018-04-13 2018-09-17 37
    39453869 cc-cc 2015-11-24 2017-09-09 2015-11-21 81
    7181109 dd-dd 2018-01-18 2017-11-24 2018-01-15 27
    29580865 dd-cc 2017-04-24 2017-09-07 2017-05-04 38
    14778957 other-cc 2017-11-02 2017-06-20 2018-06-26 49
    32500886 cc-dd 2017-01-12 2017-05-26 2017-01-12 50
    52146154 other-cc 2018-08-01 2017-03-27 2018-07-16 5
    7208584 cc-dd 2018-03-13 2018-07-04 2018-04-26 8
    35894666 cc-cc 2017-12-04 2018-06-13 2018-08-14 88
    27565108 other-other 2015-10-19 2017-03-14 2016-01-22 88
    50705834 other-cc 2018-01-08 2017-12-09 2018-01-11 62
    45420360 dd-cc 2017-10-23 2017-09-02 2018-01-29 52
    55933497 dd-cc 2017-04-14 2018-06-07 2017-09-27 36
    46160680 dd-cc 2014-06-05 2018-01-16 2016-01-27 87


    In brief, I'm trying to recreate is this functionality:



    SUM(CASE 
    WHEN date_1 <= date_2 - interval '11' month
    AND date_3 > date_2 - interval '11' month
    THEN values
    end)


    But then also group by the desc column.



    So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:




    • "date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

    • I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.


    An example output should essentially look like this:



        desc        interval         values_sum
    cc-cc at_date 55
    cc-dd at_date 17
    other-dd at_date 9
    cc-cc date_minus_1 37
    cc-dd date_minus_1 92
    other-dd date_minus_1 40
    cc-cc date_minus_2 37


    Any help would be greatly appreciated.










    share|improve this question

























      0












      0








      0








      I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)



      This is an example of my data:



                  desc        date_1      date_2      date_3      values
      54287171 cc-cc 2018-03-14 2017-07-03 2018-05-21 55
      49410141 other-dd 2012-01-18 2017-01-26 2011-12-30 17
      37694577 other-dd 2018-07-05 2017-07-25 2018-06-19 9
      54051782 other-cc 2014-10-23 2017-11-24 2014-10-31 37
      7378464 dd-cc 2016-08-05 2018-05-15 2016-07-22 92
      29665541 dd-cc 2011-12-14 2017-08-01 2012-05-01 40
      2999878 dd-cc 2018-10-03 2018-04-13 2018-09-17 37
      39453869 cc-cc 2015-11-24 2017-09-09 2015-11-21 81
      7181109 dd-dd 2018-01-18 2017-11-24 2018-01-15 27
      29580865 dd-cc 2017-04-24 2017-09-07 2017-05-04 38
      14778957 other-cc 2017-11-02 2017-06-20 2018-06-26 49
      32500886 cc-dd 2017-01-12 2017-05-26 2017-01-12 50
      52146154 other-cc 2018-08-01 2017-03-27 2018-07-16 5
      7208584 cc-dd 2018-03-13 2018-07-04 2018-04-26 8
      35894666 cc-cc 2017-12-04 2018-06-13 2018-08-14 88
      27565108 other-other 2015-10-19 2017-03-14 2016-01-22 88
      50705834 other-cc 2018-01-08 2017-12-09 2018-01-11 62
      45420360 dd-cc 2017-10-23 2017-09-02 2018-01-29 52
      55933497 dd-cc 2017-04-14 2018-06-07 2017-09-27 36
      46160680 dd-cc 2014-06-05 2018-01-16 2016-01-27 87


      In brief, I'm trying to recreate is this functionality:



      SUM(CASE 
      WHEN date_1 <= date_2 - interval '11' month
      AND date_3 > date_2 - interval '11' month
      THEN values
      end)


      But then also group by the desc column.



      So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:




      • "date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

      • I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.


      An example output should essentially look like this:



          desc        interval         values_sum
      cc-cc at_date 55
      cc-dd at_date 17
      other-dd at_date 9
      cc-cc date_minus_1 37
      cc-dd date_minus_1 92
      other-dd date_minus_1 40
      cc-cc date_minus_2 37


      Any help would be greatly appreciated.










      share|improve this question














      I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)



      This is an example of my data:



                  desc        date_1      date_2      date_3      values
      54287171 cc-cc 2018-03-14 2017-07-03 2018-05-21 55
      49410141 other-dd 2012-01-18 2017-01-26 2011-12-30 17
      37694577 other-dd 2018-07-05 2017-07-25 2018-06-19 9
      54051782 other-cc 2014-10-23 2017-11-24 2014-10-31 37
      7378464 dd-cc 2016-08-05 2018-05-15 2016-07-22 92
      29665541 dd-cc 2011-12-14 2017-08-01 2012-05-01 40
      2999878 dd-cc 2018-10-03 2018-04-13 2018-09-17 37
      39453869 cc-cc 2015-11-24 2017-09-09 2015-11-21 81
      7181109 dd-dd 2018-01-18 2017-11-24 2018-01-15 27
      29580865 dd-cc 2017-04-24 2017-09-07 2017-05-04 38
      14778957 other-cc 2017-11-02 2017-06-20 2018-06-26 49
      32500886 cc-dd 2017-01-12 2017-05-26 2017-01-12 50
      52146154 other-cc 2018-08-01 2017-03-27 2018-07-16 5
      7208584 cc-dd 2018-03-13 2018-07-04 2018-04-26 8
      35894666 cc-cc 2017-12-04 2018-06-13 2018-08-14 88
      27565108 other-other 2015-10-19 2017-03-14 2016-01-22 88
      50705834 other-cc 2018-01-08 2017-12-09 2018-01-11 62
      45420360 dd-cc 2017-10-23 2017-09-02 2018-01-29 52
      55933497 dd-cc 2017-04-14 2018-06-07 2017-09-27 36
      46160680 dd-cc 2014-06-05 2018-01-16 2016-01-27 87


      In brief, I'm trying to recreate is this functionality:



      SUM(CASE 
      WHEN date_1 <= date_2 - interval '11' month
      AND date_3 > date_2 - interval '11' month
      THEN values
      end)


      But then also group by the desc column.



      So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:




      • "date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

      • I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.


      An example output should essentially look like this:



          desc        interval         values_sum
      cc-cc at_date 55
      cc-dd at_date 17
      other-dd at_date 9
      cc-cc date_minus_1 37
      cc-dd date_minus_1 92
      other-dd date_minus_1 40
      cc-cc date_minus_2 37


      Any help would be greatly appreciated.







      python pandas datetime intervals






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 25 '18 at 23:22









      roastbeeefroastbeeef

      1146




      1146
























          2 Answers
          2






          active

          oldest

          votes


















          1














          I hope I understand your question correctly.




          1. Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

          2. You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').


          A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.



          A script to provide an example:



          """ Create an example dataset """

          import numpy as np
          import pandas as pd
          df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
          df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
          df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
          df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
          df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
          df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
          df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
          df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
          df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
          df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
          df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
          df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
          df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
          df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
          df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
          df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
          df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
          df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
          df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
          df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]

          """Question 1: Yes, groupby() groups by properties for one or more columns"""
          df.groupby(["desc"]).sum()
          # values
          #desc
          #cc-cc 224
          #cc-dd 58
          #dd-cc 382
          #dd-dd 27
          #other-cc 153
          #other-dd 26

          """Question 2: You can define conditions, save them in the dataframe, then group by those too."""
          df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
          df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))

          df.groupby(["desc", "condition_1", "condition_2"]).sum()
          #
          #desc condition_1 condition_2
          #cc-cc False False 143
          # True False 81
          #cc-dd False False 58
          #dd-cc False False 127
          # True False 255
          #dd-dd False False 27
          #other-cc False False 62
          # True 54
          # True False 37
          #other-dd False False 9
          # True False 17





          share|improve this answer


























          • thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

            – roastbeeef
            Nov 26 '18 at 10:58











          • Yes, that will work; try it.

            – 0range
            Nov 26 '18 at 12:21



















          0














          Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.



          from dateutil.relativedelta import relativedelta
          for i in np.arange(-12,12,1):
          df['Month_' + str(i)] = df.apply(lambda x: x['values']
          if (x['date_2'] <= x['date_1'] + relativedelta(months=i))
          & (x['date_3'] > x['date_2'] + relativedelta(months=i))
          else 0, axis=1)


          Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()






          share|improve this answer























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53473017%2fpython-pandas-data-aggregation-by-time-interval%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            I hope I understand your question correctly.




            1. Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

            2. You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').


            A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.



            A script to provide an example:



            """ Create an example dataset """

            import numpy as np
            import pandas as pd
            df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
            df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
            df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
            df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
            df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
            df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
            df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
            df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
            df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
            df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
            df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
            df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
            df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
            df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
            df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
            df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
            df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
            df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
            df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
            df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]

            """Question 1: Yes, groupby() groups by properties for one or more columns"""
            df.groupby(["desc"]).sum()
            # values
            #desc
            #cc-cc 224
            #cc-dd 58
            #dd-cc 382
            #dd-dd 27
            #other-cc 153
            #other-dd 26

            """Question 2: You can define conditions, save them in the dataframe, then group by those too."""
            df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
            df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))

            df.groupby(["desc", "condition_1", "condition_2"]).sum()
            #
            #desc condition_1 condition_2
            #cc-cc False False 143
            # True False 81
            #cc-dd False False 58
            #dd-cc False False 127
            # True False 255
            #dd-dd False False 27
            #other-cc False False 62
            # True 54
            # True False 37
            #other-dd False False 9
            # True False 17





            share|improve this answer


























            • thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

              – roastbeeef
              Nov 26 '18 at 10:58











            • Yes, that will work; try it.

              – 0range
              Nov 26 '18 at 12:21
















            1














            I hope I understand your question correctly.




            1. Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

            2. You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').


            A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.



            A script to provide an example:



            """ Create an example dataset """

            import numpy as np
            import pandas as pd
            df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
            df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
            df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
            df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
            df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
            df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
            df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
            df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
            df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
            df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
            df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
            df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
            df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
            df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
            df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
            df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
            df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
            df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
            df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
            df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]

            """Question 1: Yes, groupby() groups by properties for one or more columns"""
            df.groupby(["desc"]).sum()
            # values
            #desc
            #cc-cc 224
            #cc-dd 58
            #dd-cc 382
            #dd-dd 27
            #other-cc 153
            #other-dd 26

            """Question 2: You can define conditions, save them in the dataframe, then group by those too."""
            df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
            df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))

            df.groupby(["desc", "condition_1", "condition_2"]).sum()
            #
            #desc condition_1 condition_2
            #cc-cc False False 143
            # True False 81
            #cc-dd False False 58
            #dd-cc False False 127
            # True False 255
            #dd-dd False False 27
            #other-cc False False 62
            # True 54
            # True False 37
            #other-dd False False 9
            # True False 17





            share|improve this answer


























            • thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

              – roastbeeef
              Nov 26 '18 at 10:58











            • Yes, that will work; try it.

              – 0range
              Nov 26 '18 at 12:21














            1












            1








            1







            I hope I understand your question correctly.




            1. Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

            2. You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').


            A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.



            A script to provide an example:



            """ Create an example dataset """

            import numpy as np
            import pandas as pd
            df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
            df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
            df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
            df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
            df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
            df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
            df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
            df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
            df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
            df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
            df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
            df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
            df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
            df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
            df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
            df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
            df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
            df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
            df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
            df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]

            """Question 1: Yes, groupby() groups by properties for one or more columns"""
            df.groupby(["desc"]).sum()
            # values
            #desc
            #cc-cc 224
            #cc-dd 58
            #dd-cc 382
            #dd-dd 27
            #other-cc 153
            #other-dd 26

            """Question 2: You can define conditions, save them in the dataframe, then group by those too."""
            df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
            df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))

            df.groupby(["desc", "condition_1", "condition_2"]).sum()
            #
            #desc condition_1 condition_2
            #cc-cc False False 143
            # True False 81
            #cc-dd False False 58
            #dd-cc False False 127
            # True False 255
            #dd-dd False False 27
            #other-cc False False 62
            # True 54
            # True False 37
            #other-dd False False 9
            # True False 17





            share|improve this answer















            I hope I understand your question correctly.




            1. Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

            2. You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').


            A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.



            A script to provide an example:



            """ Create an example dataset """

            import numpy as np
            import pandas as pd
            df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
            df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
            df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
            df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
            df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
            df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
            df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
            df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
            df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
            df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
            df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
            df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
            df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
            df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
            df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
            df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
            df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
            df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
            df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
            df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]

            """Question 1: Yes, groupby() groups by properties for one or more columns"""
            df.groupby(["desc"]).sum()
            # values
            #desc
            #cc-cc 224
            #cc-dd 58
            #dd-cc 382
            #dd-dd 27
            #other-cc 153
            #other-dd 26

            """Question 2: You can define conditions, save them in the dataframe, then group by those too."""
            df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
            df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))

            df.groupby(["desc", "condition_1", "condition_2"]).sum()
            #
            #desc condition_1 condition_2
            #cc-cc False False 143
            # True False 81
            #cc-dd False False 58
            #dd-cc False False 127
            # True False 255
            #dd-dd False False 27
            #other-cc False False 62
            # True 54
            # True False 37
            #other-dd False False 9
            # True False 17






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Nov 26 '18 at 2:44

























            answered Nov 26 '18 at 2:37









            0range0range

            9711820




            9711820













            • thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

              – roastbeeef
              Nov 26 '18 at 10:58











            • Yes, that will work; try it.

              – 0range
              Nov 26 '18 at 12:21



















            • thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

              – roastbeeef
              Nov 26 '18 at 10:58











            • Yes, that will work; try it.

              – 0range
              Nov 26 '18 at 12:21

















            thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

            – roastbeeef
            Nov 26 '18 at 10:58





            thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

            – roastbeeef
            Nov 26 '18 at 10:58













            Yes, that will work; try it.

            – 0range
            Nov 26 '18 at 12:21





            Yes, that will work; try it.

            – 0range
            Nov 26 '18 at 12:21













            0














            Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.



            from dateutil.relativedelta import relativedelta
            for i in np.arange(-12,12,1):
            df['Month_' + str(i)] = df.apply(lambda x: x['values']
            if (x['date_2'] <= x['date_1'] + relativedelta(months=i))
            & (x['date_3'] > x['date_2'] + relativedelta(months=i))
            else 0, axis=1)


            Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()






            share|improve this answer




























              0














              Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.



              from dateutil.relativedelta import relativedelta
              for i in np.arange(-12,12,1):
              df['Month_' + str(i)] = df.apply(lambda x: x['values']
              if (x['date_2'] <= x['date_1'] + relativedelta(months=i))
              & (x['date_3'] > x['date_2'] + relativedelta(months=i))
              else 0, axis=1)


              Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()






              share|improve this answer


























                0












                0








                0







                Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.



                from dateutil.relativedelta import relativedelta
                for i in np.arange(-12,12,1):
                df['Month_' + str(i)] = df.apply(lambda x: x['values']
                if (x['date_2'] <= x['date_1'] + relativedelta(months=i))
                & (x['date_3'] > x['date_2'] + relativedelta(months=i))
                else 0, axis=1)


                Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()






                share|improve this answer













                Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.



                from dateutil.relativedelta import relativedelta
                for i in np.arange(-12,12,1):
                df['Month_' + str(i)] = df.apply(lambda x: x['values']
                if (x['date_2'] <= x['date_1'] + relativedelta(months=i))
                & (x['date_3'] > x['date_2'] + relativedelta(months=i))
                else 0, axis=1)


                Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 27 '18 at 7:47









                roastbeeefroastbeeef

                1146




                1146






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53473017%2fpython-pandas-data-aggregation-by-time-interval%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

                    Calculate evaluation metrics using cross_val_predict sklearn

                    Insert data from modal to MySQL (multiple modal on website)