Python dataframe group by following values












0















I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :




date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None


I hope to be clear,



thanks a lot for your help,



Rémy










share|improve this question























  • Think he's meaning a 'rolling_sum'

    – Eran Moshe
    Nov 26 '18 at 8:49
















0















I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :




date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None


I hope to be clear,



thanks a lot for your help,



Rémy










share|improve this question























  • Think he's meaning a 'rolling_sum'

    – Eran Moshe
    Nov 26 '18 at 8:49














0












0








0








I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :




date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None


I hope to be clear,



thanks a lot for your help,



Rémy










share|improve this question














I am looking for the best pythonic way to group the sum of values that follow each other from hour to hour. Example: I have the following Dataframe with the columns 'date' and'rainfall' and I have added an example of the desired result :




date rainfall RE
31/12/17 23:00 0.88 None
01/01/18 00:00 0.38 1.26
01/01/18 01:00 0 None
01/01/18 02:00 0.22 0.22
01/01/18 03:00 0 None
01/01/18 04:00 0 None
01/01/18 13:00 0 None
01/01/18 14:00 0 None
01/01/18 15:00 0.55 0.55
01/01/18 16:00 0 None
01/01/18 17:00 1.31 1.31
01/01/18 18:00 0 None
01/01/18 19:00 0.49 0.49
01/01/18 20:00 0 None
01/01/18 21:00 0 None
01/01/18 22:00 0 None
01/01/18 23:00 0 None
02/01/18 00:00 0.7 None
02/01/18 01:00 0.22 None
02/01/18 02:00 0.61 None
02/01/18 03:00 0.42 1.95
02/01/18 04:00 0 None
02/01/18 05:00 1.69 1.69
02/01/18 06:00 0 None
02/01/18 07:00 0 None
02/01/18 08:00 0 None


I hope to be clear,



thanks a lot for your help,



Rémy







python datetime dataframe pandas-groupby






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 26 '18 at 8:32









R. ClaverieR. Claverie

11




11













  • Think he's meaning a 'rolling_sum'

    – Eran Moshe
    Nov 26 '18 at 8:49



















  • Think he's meaning a 'rolling_sum'

    – Eran Moshe
    Nov 26 '18 at 8:49

















Think he's meaning a 'rolling_sum'

– Eran Moshe
Nov 26 '18 at 8:49





Think he's meaning a 'rolling_sum'

– Eran Moshe
Nov 26 '18 at 8:49












2 Answers
2






active

oldest

votes


















0














If I understand correctly, what you want is a rolling sum.
The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods



df['rainfall'].rolling(2).sum()


Ofcourse, your DataFrame must stay sorted as you show in your example.



adding it as a part of your DataFrame would be:



df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()


EDIT1:



If you mean you just want to accumulate the rainfall use:



df['rainfall_cumsum'] = df['rainfall'].cumsum()


as the function cumsum() is accumulative sum



EDIT2:



rf_not_zero = df['rainfall'] != 0
df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)





share|improve this answer


























  • Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

    – R. Claverie
    Nov 26 '18 at 9:11













  • Do you mean just accumulate the rainfall ?

    – Eran Moshe
    Nov 26 '18 at 9:20











  • Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

    – R. Claverie
    Nov 26 '18 at 10:24











  • @R.Claverie Only took the whole day to understand :>

    – Eran Moshe
    Nov 26 '18 at 10:51



















0














Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)



import pandas as pd
import numpy as np

# generate pseudo dataframe
rng = pd.date_range('1/1/2012', periods=20, freq='H')
rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])

df = pd.DataFrame()
df['data'] = rng
df['rain'] = rain

# convert rain to boolean
df['is_rain'] = df['rain'] > 0

# shift rain for one slot to recognice if the state has changed
data = list(df['is_rain'].values[0:-1])
shifted = [data[0]]
shifted.extend(data)
df['is_rain_next'] = shifted

# get start and endpoints of rain (the end has to be excluded!
# it is always the next period, when it isn't raining anymore)
df['rain_start'] = df['is_rain'] < df['is_rain_next']
df['rain_end'] = df['is_rain'] > df['is_rain_next']

# this are the starts and ends you can use them for getting the groups from dataframe
df[df['rain_start']].index
df[df['rain_end']].index


Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)






share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53477230%2fpython-dataframe-group-by-following-values%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    If I understand correctly, what you want is a rolling sum.
    The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods



    df['rainfall'].rolling(2).sum()


    Ofcourse, your DataFrame must stay sorted as you show in your example.



    adding it as a part of your DataFrame would be:



    df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()


    EDIT1:



    If you mean you just want to accumulate the rainfall use:



    df['rainfall_cumsum'] = df['rainfall'].cumsum()


    as the function cumsum() is accumulative sum



    EDIT2:



    rf_not_zero = df['rainfall'] != 0
    df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)





    share|improve this answer


























    • Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

      – R. Claverie
      Nov 26 '18 at 9:11













    • Do you mean just accumulate the rainfall ?

      – Eran Moshe
      Nov 26 '18 at 9:20











    • Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

      – R. Claverie
      Nov 26 '18 at 10:24











    • @R.Claverie Only took the whole day to understand :>

      – Eran Moshe
      Nov 26 '18 at 10:51
















    0














    If I understand correctly, what you want is a rolling sum.
    The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods



    df['rainfall'].rolling(2).sum()


    Ofcourse, your DataFrame must stay sorted as you show in your example.



    adding it as a part of your DataFrame would be:



    df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()


    EDIT1:



    If you mean you just want to accumulate the rainfall use:



    df['rainfall_cumsum'] = df['rainfall'].cumsum()


    as the function cumsum() is accumulative sum



    EDIT2:



    rf_not_zero = df['rainfall'] != 0
    df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)





    share|improve this answer


























    • Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

      – R. Claverie
      Nov 26 '18 at 9:11













    • Do you mean just accumulate the rainfall ?

      – Eran Moshe
      Nov 26 '18 at 9:20











    • Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

      – R. Claverie
      Nov 26 '18 at 10:24











    • @R.Claverie Only took the whole day to understand :>

      – Eran Moshe
      Nov 26 '18 at 10:51














    0












    0








    0







    If I understand correctly, what you want is a rolling sum.
    The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods



    df['rainfall'].rolling(2).sum()


    Ofcourse, your DataFrame must stay sorted as you show in your example.



    adding it as a part of your DataFrame would be:



    df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()


    EDIT1:



    If you mean you just want to accumulate the rainfall use:



    df['rainfall_cumsum'] = df['rainfall'].cumsum()


    as the function cumsum() is accumulative sum



    EDIT2:



    rf_not_zero = df['rainfall'] != 0
    df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)





    share|improve this answer















    If I understand correctly, what you want is a rolling sum.
    The following will output a Pandas Series of the rolling sums of 'rainfall' over 2 periods



    df['rainfall'].rolling(2).sum()


    Ofcourse, your DataFrame must stay sorted as you show in your example.



    adding it as a part of your DataFrame would be:



    df['rainfall_rolling_sum'] = df['rainfall'].rolling(2).sum()


    EDIT1:



    If you mean you just want to accumulate the rainfall use:



    df['rainfall_cumsum'] = df['rainfall'].cumsum()


    as the function cumsum() is accumulative sum



    EDIT2:



    rf_not_zero = df['rainfall'] != 0
    df['rainfall_accum'] = df['rainfall'].cumsum()-df['rainfall'].cumsum().where(~df_not_zero).ffill().fillna(0).astype(int)






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Nov 26 '18 at 10:51

























    answered Nov 26 '18 at 8:54









    Eran MosheEran Moshe

    1,371722




    1,371722













    • Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

      – R. Claverie
      Nov 26 '18 at 9:11













    • Do you mean just accumulate the rainfall ?

      – Eran Moshe
      Nov 26 '18 at 9:20











    • Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

      – R. Claverie
      Nov 26 '18 at 10:24











    • @R.Claverie Only took the whole day to understand :>

      – Eran Moshe
      Nov 26 '18 at 10:51



















    • Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

      – R. Claverie
      Nov 26 '18 at 9:11













    • Do you mean just accumulate the rainfall ?

      – Eran Moshe
      Nov 26 '18 at 9:20











    • Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

      – R. Claverie
      Nov 26 '18 at 10:24











    • @R.Claverie Only took the whole day to understand :>

      – Eran Moshe
      Nov 26 '18 at 10:51

















    Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

    – R. Claverie
    Nov 26 '18 at 9:11







    Thanks a lot Eran for this quick answer, but it does not work as expected. If you look at from the date "02/01/18 00:00", there is 4 values to add and not only 2. Thanks again, Rémy

    – R. Claverie
    Nov 26 '18 at 9:11















    Do you mean just accumulate the rainfall ?

    – Eran Moshe
    Nov 26 '18 at 9:20





    Do you mean just accumulate the rainfall ?

    – Eran Moshe
    Nov 26 '18 at 9:20













    Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

    – R. Claverie
    Nov 26 '18 at 10:24





    Yes, but only when I have a rainfall event over several consecutive hours. Thanks. Rémy

    – R. Claverie
    Nov 26 '18 at 10:24













    @R.Claverie Only took the whole day to understand :>

    – Eran Moshe
    Nov 26 '18 at 10:51





    @R.Claverie Only took the whole day to understand :>

    – Eran Moshe
    Nov 26 '18 at 10:51













    0














    Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)



    import pandas as pd
    import numpy as np

    # generate pseudo dataframe
    rng = pd.date_range('1/1/2012', periods=20, freq='H')
    rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])

    df = pd.DataFrame()
    df['data'] = rng
    df['rain'] = rain

    # convert rain to boolean
    df['is_rain'] = df['rain'] > 0

    # shift rain for one slot to recognice if the state has changed
    data = list(df['is_rain'].values[0:-1])
    shifted = [data[0]]
    shifted.extend(data)
    df['is_rain_next'] = shifted

    # get start and endpoints of rain (the end has to be excluded!
    # it is always the next period, when it isn't raining anymore)
    df['rain_start'] = df['is_rain'] < df['is_rain_next']
    df['rain_end'] = df['is_rain'] > df['is_rain_next']

    # this are the starts and ends you can use them for getting the groups from dataframe
    df[df['rain_start']].index
    df[df['rain_end']].index


    Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)






    share|improve this answer






























      0














      Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)



      import pandas as pd
      import numpy as np

      # generate pseudo dataframe
      rng = pd.date_range('1/1/2012', periods=20, freq='H')
      rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])

      df = pd.DataFrame()
      df['data'] = rng
      df['rain'] = rain

      # convert rain to boolean
      df['is_rain'] = df['rain'] > 0

      # shift rain for one slot to recognice if the state has changed
      data = list(df['is_rain'].values[0:-1])
      shifted = [data[0]]
      shifted.extend(data)
      df['is_rain_next'] = shifted

      # get start and endpoints of rain (the end has to be excluded!
      # it is always the next period, when it isn't raining anymore)
      df['rain_start'] = df['is_rain'] < df['is_rain_next']
      df['rain_end'] = df['is_rain'] > df['is_rain_next']

      # this are the starts and ends you can use them for getting the groups from dataframe
      df[df['rain_start']].index
      df[df['rain_end']].index


      Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)






      share|improve this answer




























        0












        0








        0







        Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)



        import pandas as pd
        import numpy as np

        # generate pseudo dataframe
        rng = pd.date_range('1/1/2012', periods=20, freq='H')
        rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])

        df = pd.DataFrame()
        df['data'] = rng
        df['rain'] = rain

        # convert rain to boolean
        df['is_rain'] = df['rain'] > 0

        # shift rain for one slot to recognice if the state has changed
        data = list(df['is_rain'].values[0:-1])
        shifted = [data[0]]
        shifted.extend(data)
        df['is_rain_next'] = shifted

        # get start and endpoints of rain (the end has to be excluded!
        # it is always the next period, when it isn't raining anymore)
        df['rain_start'] = df['is_rain'] < df['is_rain_next']
        df['rain_end'] = df['is_rain'] > df['is_rain_next']

        # this are the starts and ends you can use them for getting the groups from dataframe
        df[df['rain_start']].index
        df[df['rain_end']].index


        Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)






        share|improve this answer















        Okey it is not the best, but when you have big dataframes this could be a simple approach. (If it rains at the beginning or at the end, this will be ignored in this example. See Warning)



        import pandas as pd
        import numpy as np

        # generate pseudo dataframe
        rng = pd.date_range('1/1/2012', periods=20, freq='H')
        rain = np.random.choice([0,0.5,1,2], size=20, p=[0.4,0.2,0.2,0.2])

        df = pd.DataFrame()
        df['data'] = rng
        df['rain'] = rain

        # convert rain to boolean
        df['is_rain'] = df['rain'] > 0

        # shift rain for one slot to recognice if the state has changed
        data = list(df['is_rain'].values[0:-1])
        shifted = [data[0]]
        shifted.extend(data)
        df['is_rain_next'] = shifted

        # get start and endpoints of rain (the end has to be excluded!
        # it is always the next period, when it isn't raining anymore)
        df['rain_start'] = df['is_rain'] < df['is_rain_next']
        df['rain_end'] = df['is_rain'] > df['is_rain_next']

        # this are the starts and ends you can use them for getting the groups from dataframe
        df[df['rain_start']].index
        df[df['rain_end']].index


        Warning: The arrays can have different sizes, thats caused by the unknown state at the end and at the beginning. So check if the first element in rain_end smaller than the first element in rain_start than you can delete it. Also if the last element of rain_start is greater than the last element of rain_end (sound logical i think)







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 26 '18 at 11:38

























        answered Nov 26 '18 at 11:13









        MisterMonkMisterMonk

        1549




        1549






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53477230%2fpython-dataframe-group-by-following-values%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

            Calculate evaluation metrics using cross_val_predict sklearn

            Insert data from modal to MySQL (multiple modal on website)