Python Pandas data aggregation by time interval

I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)

This is an example of my data:

            desc        date_1      date_2      date_3      values

54287171    cc-cc       2018-03-14  2017-07-03  2018-05-21  55

49410141    other-dd    2012-01-18  2017-01-26  2011-12-30  17

37694577    other-dd    2018-07-05  2017-07-25  2018-06-19  9

54051782    other-cc    2014-10-23  2017-11-24  2014-10-31  37

7378464     dd-cc       2016-08-05  2018-05-15  2016-07-22  92

29665541    dd-cc       2011-12-14  2017-08-01  2012-05-01  40

2999878     dd-cc       2018-10-03  2018-04-13  2018-09-17  37

39453869    cc-cc       2015-11-24  2017-09-09  2015-11-21  81

7181109     dd-dd       2018-01-18  2017-11-24  2018-01-15  27

29580865    dd-cc       2017-04-24  2017-09-07  2017-05-04  38

14778957    other-cc    2017-11-02  2017-06-20  2018-06-26  49

32500886    cc-dd       2017-01-12  2017-05-26  2017-01-12  50

52146154    other-cc    2018-08-01  2017-03-27  2018-07-16  5

7208584     cc-dd       2018-03-13  2018-07-04  2018-04-26  8

35894666    cc-cc       2017-12-04  2018-06-13  2018-08-14  88

27565108    other-other 2015-10-19  2017-03-14  2016-01-22  88

50705834    other-cc    2018-01-08  2017-12-09  2018-01-11  62

45420360    dd-cc       2017-10-23  2017-09-02  2018-01-29  52

55933497    dd-cc       2017-04-14  2018-06-07  2017-09-27  36

46160680    dd-cc       2014-06-05  2018-01-16  2016-01-27  87

In brief, I'm trying to recreate is this functionality:

SUM(CASE 

WHEN date_1 <= date_2 - interval '11' month

AND  date_3 > date_2 - interval '11' month

THEN values

end)

But then also group by the desc column.

So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:

"date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.

An example output should essentially look like this:

    desc        interval         values_sum

    cc-cc       at_date          55

    cc-dd       at_date          17

    other-dd    at_date          9

    cc-cc       date_minus_1     37

    cc-dd       date_minus_1     92

    other-dd    date_minus_1     40               

    cc-cc       date_minus_2     37

Any help would be greatly appreciated.

asked Nov 25 '18 at 23:22

roastbeeef

1146

add a comment |

I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)

This is an example of my data:

            desc        date_1      date_2      date_3      values

54287171    cc-cc       2018-03-14  2017-07-03  2018-05-21  55

49410141    other-dd    2012-01-18  2017-01-26  2011-12-30  17

37694577    other-dd    2018-07-05  2017-07-25  2018-06-19  9

54051782    other-cc    2014-10-23  2017-11-24  2014-10-31  37

7378464     dd-cc       2016-08-05  2018-05-15  2016-07-22  92

29665541    dd-cc       2011-12-14  2017-08-01  2012-05-01  40

2999878     dd-cc       2018-10-03  2018-04-13  2018-09-17  37

39453869    cc-cc       2015-11-24  2017-09-09  2015-11-21  81

7181109     dd-dd       2018-01-18  2017-11-24  2018-01-15  27

29580865    dd-cc       2017-04-24  2017-09-07  2017-05-04  38

14778957    other-cc    2017-11-02  2017-06-20  2018-06-26  49

32500886    cc-dd       2017-01-12  2017-05-26  2017-01-12  50

52146154    other-cc    2018-08-01  2017-03-27  2018-07-16  5

7208584     cc-dd       2018-03-13  2018-07-04  2018-04-26  8

35894666    cc-cc       2017-12-04  2018-06-13  2018-08-14  88

27565108    other-other 2015-10-19  2017-03-14  2016-01-22  88

50705834    other-cc    2018-01-08  2017-12-09  2018-01-11  62

45420360    dd-cc       2017-10-23  2017-09-02  2018-01-29  52

55933497    dd-cc       2017-04-14  2018-06-07  2017-09-27  36

46160680    dd-cc       2014-06-05  2018-01-16  2016-01-27  87

In brief, I'm trying to recreate is this functionality:

SUM(CASE 

WHEN date_1 <= date_2 - interval '11' month

AND  date_3 > date_2 - interval '11' month

THEN values

end)

But then also group by the desc column.

So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:

"date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.

An example output should essentially look like this:

    desc        interval         values_sum

    cc-cc       at_date          55

    cc-dd       at_date          17

    other-dd    at_date          9

    cc-cc       date_minus_1     37

    cc-dd       date_minus_1     92

    other-dd    date_minus_1     40               

    cc-cc       date_minus_2     37

Any help would be greatly appreciated.

asked Nov 25 '18 at 23:22

roastbeeef

1146

add a comment |

I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)

This is an example of my data:

            desc        date_1      date_2      date_3      values

54287171    cc-cc       2018-03-14  2017-07-03  2018-05-21  55

49410141    other-dd    2012-01-18  2017-01-26  2011-12-30  17

37694577    other-dd    2018-07-05  2017-07-25  2018-06-19  9

54051782    other-cc    2014-10-23  2017-11-24  2014-10-31  37

7378464     dd-cc       2016-08-05  2018-05-15  2016-07-22  92

29665541    dd-cc       2011-12-14  2017-08-01  2012-05-01  40

2999878     dd-cc       2018-10-03  2018-04-13  2018-09-17  37

39453869    cc-cc       2015-11-24  2017-09-09  2015-11-21  81

7181109     dd-dd       2018-01-18  2017-11-24  2018-01-15  27

29580865    dd-cc       2017-04-24  2017-09-07  2017-05-04  38

14778957    other-cc    2017-11-02  2017-06-20  2018-06-26  49

32500886    cc-dd       2017-01-12  2017-05-26  2017-01-12  50

52146154    other-cc    2018-08-01  2017-03-27  2018-07-16  5

7208584     cc-dd       2018-03-13  2018-07-04  2018-04-26  8

35894666    cc-cc       2017-12-04  2018-06-13  2018-08-14  88

27565108    other-other 2015-10-19  2017-03-14  2016-01-22  88

50705834    other-cc    2018-01-08  2017-12-09  2018-01-11  62

45420360    dd-cc       2017-10-23  2017-09-02  2018-01-29  52

55933497    dd-cc       2017-04-14  2018-06-07  2017-09-27  36

46160680    dd-cc       2014-06-05  2018-01-16  2016-01-27  87

In brief, I'm trying to recreate is this functionality:

SUM(CASE 

WHEN date_1 <= date_2 - interval '11' month

AND  date_3 > date_2 - interval '11' month

THEN values

end)

But then also group by the desc column.

So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:

"date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.

An example output should essentially look like this:

    desc        interval         values_sum

    cc-cc       at_date          55

    cc-dd       at_date          17

    other-dd    at_date          9

    cc-cc       date_minus_1     37

    cc-dd       date_minus_1     92

    other-dd    date_minus_1     40               

    cc-cc       date_minus_2     37

Any help would be greatly appreciated.

asked Nov 25 '18 at 23:22

roastbeeef

1146

I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)

This is an example of my data:

            desc        date_1      date_2      date_3      values

54287171    cc-cc       2018-03-14  2017-07-03  2018-05-21  55

49410141    other-dd    2012-01-18  2017-01-26  2011-12-30  17

37694577    other-dd    2018-07-05  2017-07-25  2018-06-19  9

54051782    other-cc    2014-10-23  2017-11-24  2014-10-31  37

7378464     dd-cc       2016-08-05  2018-05-15  2016-07-22  92

29665541    dd-cc       2011-12-14  2017-08-01  2012-05-01  40

2999878     dd-cc       2018-10-03  2018-04-13  2018-09-17  37

39453869    cc-cc       2015-11-24  2017-09-09  2015-11-21  81

7181109     dd-dd       2018-01-18  2017-11-24  2018-01-15  27

29580865    dd-cc       2017-04-24  2017-09-07  2017-05-04  38

14778957    other-cc    2017-11-02  2017-06-20  2018-06-26  49

32500886    cc-dd       2017-01-12  2017-05-26  2017-01-12  50

52146154    other-cc    2018-08-01  2017-03-27  2018-07-16  5

7208584     cc-dd       2018-03-13  2018-07-04  2018-04-26  8

35894666    cc-cc       2017-12-04  2018-06-13  2018-08-14  88

27565108    other-other 2015-10-19  2017-03-14  2016-01-22  88

50705834    other-cc    2018-01-08  2017-12-09  2018-01-11  62

45420360    dd-cc       2017-10-23  2017-09-02  2018-01-29  52

55933497    dd-cc       2017-04-14  2018-06-07  2017-09-27  36

46160680    dd-cc       2014-06-05  2018-01-16  2016-01-27  87

In brief, I'm trying to recreate is this functionality:

SUM(CASE 

WHEN date_1 <= date_2 - interval '11' month

AND  date_3 > date_2 - interval '11' month

THEN values

end)

But then also group by the desc column.

So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:

"date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?

I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.

An example output should essentially look like this:

    desc        interval         values_sum

    cc-cc       at_date          55

    cc-dd       at_date          17

    other-dd    at_date          9

    cc-cc       date_minus_1     37

    cc-dd       date_minus_1     92

    other-dd    date_minus_1     40               

    cc-cc       date_minus_2     37

Any help would be greatly appreciated.

python pandas datetime intervals

asked Nov 25 '18 at 23:22

roastbeeef

1146

asked Nov 25 '18 at 23:22

roastbeeef

1146

asked Nov 25 '18 at 23:22

roastbeeef

1146

asked Nov 25 '18 at 23:22

roastbeeef

1146

asked Nov 25 '18 at 23:22

roastbeeef

1146

add a comment |

2 Answers
2

active

oldest

votes

I hope I understand your question correctly.

Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').

A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.

A script to provide an example:

""" Create an example dataset """



import numpy as np

import pandas as pd

df = pd.DataFrame(columns=["desc",  "date_1", "date_2", "date_3", "values"])

df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]

df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]

df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]

df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]

df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]

df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]

df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]

df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]

df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]

df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]

df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]

df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]

df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]

df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]

df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]

df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]

df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]

df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]

df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]



"""Question 1: Yes, groupby() groups by properties for one or more columns"""

df.groupby(["desc"]).sum()

#          values

#desc            

#cc-cc        224

#cc-dd         58

#dd-cc        382

#dd-dd         27

#other-cc     153

#other-dd      26



"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""

df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))

df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))



df.groupby(["desc", "condition_1", "condition_2"]).sum()

#

#desc     condition_1 condition_2        

#cc-cc    False       False           143

#         True        False            81

#cc-dd    False       False            58

#dd-cc    False       False           127

#         True        False           255

#dd-dd    False       False            27

#other-cc False       False            62

#                     True             54

#         True        False            37

#other-dd False       False             9

#         True        False            17

edited Nov 26 '18 at 2:44

answered Nov 26 '18 at 2:37

0range

9711820

thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

Yes, that will work; try it.

– 0range
Nov 26 '18 at 12:21

add a comment |

Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.

from dateutil.relativedelta import relativedelta

for i in np.arange(-12,12,1):

      df['Month_' + str(i)] = df.apply(lambda x: x['values'] 

                                       if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) 

                                                 & (x['date_3'] > x['date_2'] + relativedelta(months=i)) 

                                       else 0, axis=1)

Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()

answered Nov 27 '18 at 7:47

roastbeeef

1146

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53473017%2fpython-pandas-data-aggregation-by-time-interval%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

I hope I understand your question correctly.

Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').

A script to provide an example:

""" Create an example dataset """



import numpy as np

import pandas as pd

df = pd.DataFrame(columns=["desc",  "date_1", "date_2", "date_3", "values"])

df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]

df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]

df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]

df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]

df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]

df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]

df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]

df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]

df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]

df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]

df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]

df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]

df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]

df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]

df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]

df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]

df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]

df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]

df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]



"""Question 1: Yes, groupby() groups by properties for one or more columns"""

df.groupby(["desc"]).sum()

#          values

#desc            

#cc-cc        224

#cc-dd         58

#dd-cc        382

#dd-dd         27

#other-cc     153

#other-dd      26



"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""

df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))

df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))



df.groupby(["desc", "condition_1", "condition_2"]).sum()

#

#desc     condition_1 condition_2        

#cc-cc    False       False           143

#         True        False            81

#cc-dd    False       False            58

#dd-cc    False       False           127

#         True        False           255

#dd-dd    False       False            27

#other-cc False       False            62

#                     True             54

#         True        False            37

#other-dd False       False             9

#         True        False            17

edited Nov 26 '18 at 2:44

answered Nov 26 '18 at 2:37

0range

9711820

thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

Yes, that will work; try it.

– 0range
Nov 26 '18 at 12:21

add a comment |

I hope I understand your question correctly.

Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').

A script to provide an example:

""" Create an example dataset """



import numpy as np

import pandas as pd

df = pd.DataFrame(columns=["desc",  "date_1", "date_2", "date_3", "values"])

df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]

df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]

df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]

df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]

df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]

df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]

df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]

df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]

df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]

df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]

df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]

df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]

df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]

df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]

df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]

df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]

df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]

df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]

df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]



"""Question 1: Yes, groupby() groups by properties for one or more columns"""

df.groupby(["desc"]).sum()

#          values

#desc            

#cc-cc        224

#cc-dd         58

#dd-cc        382

#dd-dd         27

#other-cc     153

#other-dd      26



"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""

df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))

df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))



df.groupby(["desc", "condition_1", "condition_2"]).sum()

#

#desc     condition_1 condition_2        

#cc-cc    False       False           143

#         True        False            81

#cc-dd    False       False            58

#dd-cc    False       False           127

#         True        False           255

#dd-dd    False       False            27

#other-cc False       False            62

#                     True             54

#         True        False            37

#other-dd False       False             9

#         True        False            17

edited Nov 26 '18 at 2:44

answered Nov 26 '18 at 2:37

0range

9711820

thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

Yes, that will work; try it.

– 0range
Nov 26 '18 at 12:21

add a comment |

I hope I understand your question correctly.

Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').

A script to provide an example:

""" Create an example dataset """



import numpy as np

import pandas as pd

df = pd.DataFrame(columns=["desc",  "date_1", "date_2", "date_3", "values"])

df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]

df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]

df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]

df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]

df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]

df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]

df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]

df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]

df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]

df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]

df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]

df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]

df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]

df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]

df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]

df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]

df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]

df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]

df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]



"""Question 1: Yes, groupby() groups by properties for one or more columns"""

df.groupby(["desc"]).sum()

#          values

#desc            

#cc-cc        224

#cc-dd         58

#dd-cc        382

#dd-dd         27

#other-cc     153

#other-dd      26



"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""

df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))

df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))



df.groupby(["desc", "condition_1", "condition_2"]).sum()

#

#desc     condition_1 condition_2        

#cc-cc    False       False           143

#         True        False            81

#cc-dd    False       False            58

#dd-cc    False       False           127

#         True        False           255

#dd-dd    False       False            27

#other-cc False       False            62

#                     True             54

#         True        False            37

#other-dd False       False             9

#         True        False            17

edited Nov 26 '18 at 2:44

answered Nov 26 '18 at 2:37

0range

9711820

I hope I understand your question correctly.

Yes, groupby() groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.

You can define conditions, save them in the dataframe, then group by those too. In your case, the conditions would query whether "date_1" predates "date_2" by at least 11 months. The most tricky part about this is the 11 months timedelta. An easy way to achieve this would be to use numpy.timedelta64(11, 'M').

A script to provide an example:

""" Create an example dataset """



import numpy as np

import pandas as pd

df = pd.DataFrame(columns=["desc",  "date_1", "date_2", "date_3", "values"])

df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]

df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]

df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]

df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]

df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]

df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]

df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]

df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]

df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]

df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]

df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]

df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]

df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]

df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]

df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]

df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]

df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]

df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]

df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]



"""Question 1: Yes, groupby() groups by properties for one or more columns"""

df.groupby(["desc"]).sum()

#          values

#desc            

#cc-cc        224

#cc-dd         58

#dd-cc        382

#dd-dd         27

#other-cc     153

#other-dd      26



"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""

df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))

df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))



df.groupby(["desc", "condition_1", "condition_2"]).sum()

#

#desc     condition_1 condition_2        

#cc-cc    False       False           143

#         True        False            81

#cc-dd    False       False            58

#dd-cc    False       False           127

#         True        False           255

#dd-dd    False       False            27

#other-cc False       False            62

#                     True             54

#         True        False            37

#other-dd False       False             9

#         True        False            17

edited Nov 26 '18 at 2:44

answered Nov 26 '18 at 2:37

0range

9711820

edited Nov 26 '18 at 2:44

answered Nov 26 '18 at 2:37

0range

9711820

answered Nov 26 '18 at 2:37

0range

9711820

answered Nov 26 '18 at 2:37

0range

9711820

thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

Yes, that will work; try it.

– 0range
Nov 26 '18 at 12:21

add a comment |

thanks - trying it out now. would this work if i wanted to use multiple conditions?df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

Yes, that will work; try it.

– 0range
Nov 26 '18 at 12:21

thanks - trying it out now. would this work if i wanted to use multiple conditions?

df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

thanks - trying it out now. would this work if i wanted to use multiple conditions?

df["condition_1"] = (df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))) & (df["date_3"] > df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M')))

– roastbeeef
Nov 26 '18 at 10:58

Yes, that will work; try it.

– 0range
Nov 26 '18 at 12:21

add a comment |

Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.

from dateutil.relativedelta import relativedelta

for i in np.arange(-12,12,1):

      df['Month_' + str(i)] = df.apply(lambda x: x['values'] 

                                       if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) 

                                                 & (x['date_3'] > x['date_2'] + relativedelta(months=i)) 

                                       else 0, axis=1)

Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()

answered Nov 27 '18 at 7:47

roastbeeef

1146

add a comment |

Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.

from dateutil.relativedelta import relativedelta

for i in np.arange(-12,12,1):

      df['Month_' + str(i)] = df.apply(lambda x: x['values'] 

                                       if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) 

                                                 & (x['date_3'] > x['date_2'] + relativedelta(months=i)) 

                                       else 0, axis=1)

Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()

answered Nov 27 '18 at 7:47

roastbeeef

1146

add a comment |

Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.

from dateutil.relativedelta import relativedelta

for i in np.arange(-12,12,1):

      df['Month_' + str(i)] = df.apply(lambda x: x['values'] 

                                       if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) 

                                                 & (x['date_3'] > x['date_2'] + relativedelta(months=i)) 

                                       else 0, axis=1)

Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()

answered Nov 27 '18 at 7:47

roastbeeef

1146

Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.

from dateutil.relativedelta import relativedelta

for i in np.arange(-12,12,1):

      df['Month_' + str(i)] = df.apply(lambda x: x['values'] 

                                       if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) 

                                                 & (x['date_3'] > x['date_2'] + relativedelta(months=i)) 

                                       else 0, axis=1)

Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()

answered Nov 27 '18 at 7:47

roastbeeef

1146

answered Nov 27 '18 at 7:47

roastbeeef

1146

answered Nov 27 '18 at 7:47

roastbeeef

1146

answered Nov 27 '18 at 7:47

roastbeeef

1146

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

doy2DdOrV8 8OH3G5,n cwkOJVguZNeMnN BJx,2lrIp

搜尋此網誌

Btukfyl