Python dataframe group by following values
I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :
date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None
I hope to be clear,
thanks a lot for your help,
Rémy
python datetime dataframe pandas-groupby
add a comment |
I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :
date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None
I hope to be clear,
thanks a lot for your help,
Rémy
python datetime dataframe pandas-groupby
Think he's meaning a 'rolling_sum'
– Eran Moshe
Nov 26 '18 at 8:49
add a comment |
I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :
date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None
I hope to be clear,
thanks a lot for your help,
Rémy
python datetime dataframe pandas-groupby
I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :
date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None
I hope to be clear,
thanks a lot for your help,
Rémy
python datetime dataframe pandas-groupby
python datetime dataframe pandas-groupby
asked Nov 26 '18 at 8:32
R. ClaverieR. Claverie
11
11
Think he's meaning a 'rolling_sum'
– Eran Moshe
Nov 26 '18 at 8:49
add a comment |
Think he's meaning a 'rolling_sum'
– Eran Moshe
Nov 26 '18 at 8:49
Think he's meaning a 'rolling_sum'
– Eran Moshe
Nov 26 '18 at 8:49
Think he's meaning a 'rolling_sum'
– Eran Moshe
Nov 26 '18 at 8:49
add a comment |
2 Answers
2
active
oldest
votes
If I understand correctly, what you want is a rolling sum.
The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods
df['rainfall'].rolling(2).sum()
Ofcourse, your DataFrame
must stay sorted as you show in your example.
adding it as a part of your DataFrame would be:
df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()
EDIT1:
If you mean you just want to accumulate the rainfall use:
df['rainfall_cumsum'] = df['rainfall'].cumsum()
as the function cumsum()
is accumulative sum
EDIT2:
rf_not_zero = df['rainfall'] != 0
df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
add a comment |
Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)
import pandas as pd
import numpy as np
# generate pseudo dataframe
rng = pd.date_range('1/1/2012', periods=20, freq='H')
rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])
df = pd.DataFrame()
df['data'] = rng
df['rain'] = rain
# convert rain to boolean
df['is_rain'] = df['rain'] > 0
# shift rain for one slot to recognice if the state has changed
data = list(df['is_rain'].values[0:-1])
shifted = [data[0]]
shifted.extend(data)
df['is_rain_next'] = shifted
# get start and endpoints of rain (the end has to be excluded!
# it is always the next period, when it isn't raining anymore)
df['rain_start'] = df['is_rain'] < df['is_rain_next']
df['rain_end'] = df['is_rain'] > df['is_rain_next']
# this are the starts and ends you can use them for getting the groups from dataframe
df[df['rain_start']].index
df[df['rain_end']].index
Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53477230%2fpython-dataframe-group-by-following-values%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
If I understand correctly, what you want is a rolling sum.
The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods
df['rainfall'].rolling(2).sum()
Ofcourse, your DataFrame
must stay sorted as you show in your example.
adding it as a part of your DataFrame would be:
df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()
EDIT1:
If you mean you just want to accumulate the rainfall use:
df['rainfall_cumsum'] = df['rainfall'].cumsum()
as the function cumsum()
is accumulative sum
EDIT2:
rf_not_zero = df['rainfall'] != 0
df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
add a comment |
If I understand correctly, what you want is a rolling sum.
The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods
df['rainfall'].rolling(2).sum()
Ofcourse, your DataFrame
must stay sorted as you show in your example.
adding it as a part of your DataFrame would be:
df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()
EDIT1:
If you mean you just want to accumulate the rainfall use:
df['rainfall_cumsum'] = df['rainfall'].cumsum()
as the function cumsum()
is accumulative sum
EDIT2:
rf_not_zero = df['rainfall'] != 0
df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
add a comment |
If I understand correctly, what you want is a rolling sum.
The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods
df['rainfall'].rolling(2).sum()
Ofcourse, your DataFrame
must stay sorted as you show in your example.
adding it as a part of your DataFrame would be:
df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()
EDIT1:
If you mean you just want to accumulate the rainfall use:
df['rainfall_cumsum'] = df['rainfall'].cumsum()
as the function cumsum()
is accumulative sum
EDIT2:
rf_not_zero = df['rainfall'] != 0
df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)
If I understand correctly, what you want is a rolling sum.
The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods
df['rainfall'].rolling(2).sum()
Ofcourse, your DataFrame
must stay sorted as you show in your example.
adding it as a part of your DataFrame would be:
df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()
EDIT1:
If you mean you just want to accumulate the rainfall use:
df['rainfall_cumsum'] = df['rainfall'].cumsum()
as the function cumsum()
is accumulative sum
EDIT2:
rf_not_zero = df['rainfall'] != 0
df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)
edited Nov 26 '18 at 10:51
answered Nov 26 '18 at 8:54
Eran MosheEran Moshe
1,371722
1,371722
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
add a comment |
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy
– R. Claverie
Nov 26 '18 at 9:11
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Do you mean just accumulate the rainfall ?
– Eran Moshe
Nov 26 '18 at 9:20
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy
– R. Claverie
Nov 26 '18 at 10:24
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
@R.Claverie Only took the whole day to understand :>
– Eran Moshe
Nov 26 '18 at 10:51
add a comment |
Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)
import pandas as pd
import numpy as np
# generate pseudo dataframe
rng = pd.date_range('1/1/2012', periods=20, freq='H')
rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])
df = pd.DataFrame()
df['data'] = rng
df['rain'] = rain
# convert rain to boolean
df['is_rain'] = df['rain'] > 0
# shift rain for one slot to recognice if the state has changed
data = list(df['is_rain'].values[0:-1])
shifted = [data[0]]
shifted.extend(data)
df['is_rain_next'] = shifted
# get start and endpoints of rain (the end has to be excluded!
# it is always the next period, when it isn't raining anymore)
df['rain_start'] = df['is_rain'] < df['is_rain_next']
df['rain_end'] = df['is_rain'] > df['is_rain_next']
# this are the starts and ends you can use them for getting the groups from dataframe
df[df['rain_start']].index
df[df['rain_end']].index
Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)
add a comment |
Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)
import pandas as pd
import numpy as np
# generate pseudo dataframe
rng = pd.date_range('1/1/2012', periods=20, freq='H')
rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])
df = pd.DataFrame()
df['data'] = rng
df['rain'] = rain
# convert rain to boolean
df['is_rain'] = df['rain'] > 0
# shift rain for one slot to recognice if the state has changed
data = list(df['is_rain'].values[0:-1])
shifted = [data[0]]
shifted.extend(data)
df['is_rain_next'] = shifted
# get start and endpoints of rain (the end has to be excluded!
# it is always the next period, when it isn't raining anymore)
df['rain_start'] = df['is_rain'] < df['is_rain_next']
df['rain_end'] = df['is_rain'] > df['is_rain_next']
# this are the starts and ends you can use them for getting the groups from dataframe
df[df['rain_start']].index
df[df['rain_end']].index
Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)
add a comment |
Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)
import pandas as pd
import numpy as np
# generate pseudo dataframe
rng = pd.date_range('1/1/2012', periods=20, freq='H')
rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])
df = pd.DataFrame()
df['data'] = rng
df['rain'] = rain
# convert rain to boolean
df['is_rain'] = df['rain'] > 0
# shift rain for one slot to recognice if the state has changed
data = list(df['is_rain'].values[0:-1])
shifted = [data[0]]
shifted.extend(data)
df['is_rain_next'] = shifted
# get start and endpoints of rain (the end has to be excluded!
# it is always the next period, when it isn't raining anymore)
df['rain_start'] = df['is_rain'] < df['is_rain_next']
df['rain_end'] = df['is_rain'] > df['is_rain_next']
# this are the starts and ends you can use them for getting the groups from dataframe
df[df['rain_start']].index
df[df['rain_end']].index
Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)
Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)
import pandas as pd
import numpy as np
# generate pseudo dataframe
rng = pd.date_range('1/1/2012', periods=20, freq='H')
rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])
df = pd.DataFrame()
df['data'] = rng
df['rain'] = rain
# convert rain to boolean
df['is_rain'] = df['rain'] > 0
# shift rain for one slot to recognice if the state has changed
data = list(df['is_rain'].values[0:-1])
shifted = [data[0]]
shifted.extend(data)
df['is_rain_next'] = shifted
# get start and endpoints of rain (the end has to be excluded!
# it is always the next period, when it isn't raining anymore)
df['rain_start'] = df['is_rain'] < df['is_rain_next']
df['rain_end'] = df['is_rain'] > df['is_rain_next']
# this are the starts and ends you can use them for getting the groups from dataframe
df[df['rain_start']].index
df[df['rain_end']].index
Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)
edited Nov 26 '18 at 11:38
answered Nov 26 '18 at 11:13
MisterMonkMisterMonk
1549
1549
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53477230%2fpython-dataframe-group-by-following-values%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Think he's meaning a 'rolling_sum'
– Eran Moshe
Nov 26 '18 at 8:49