How to clean up datetime strings in dataframe after export from excel sheet?












2















I have an excel spreadsheet with some datetime data in a column. I exported the data into a dataframe using pandas. This column however has chunks of dates that have the month and day swapped, while there are other chunks of dates in the same column that are correct. Here's an example -



enter image description here



Figure 1: Day and month have been swapped incorrectly



The above picture shows the day and month swapped. The date shows 2016-01-10 but should instead be 2016-10-01. Compare this with another chunk of datetime values in the same column -



enter image description here



Figure 2: Day and month are correctly represented



In the above case in Figure 2, the month correctly represented as 12 and the day is 31.



I used the solution from this question - How to swap months and days in a datetime object?



I also tried using this solution -
Python Pandas - Day and Month mix up



I also tried writing my own function to map to the entries but this was to no avail either -



def dm_swap(day, month):
if(month != 10 or month != 11 or month != 12):
temp = day
day = month
month = temp


t2016Q4.start.map(dmswap, t2016Q4.dt.day, t2016Q4.dt.month)



However, both solutions change all the datetime values in the column. So, when the incorrect values get corrected, the correct values become incorrect.



I've also linked the excel file for your convenience. It's an open data set.



https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#343faeaa-c920-57d6-6a75-969181b6cbde



Please choose the last dataset Bikeshare Ridership (2016 Q4). The "start" and "end" columns have the above mentioned issues.



Is there a more efficient way to clean the datetime data?










share|improve this question

























  • Ah, I see, of course, it's Q4 - carry on without me I would probably generate a datetime object for 01 october 2016 midnight and 31st december 2016 23:59:59 and compare whether the date objects are in this range using apply. From here, you ought to be able to make the change you need easily enough

    – Andrew
    Nov 27 '18 at 13:39


















2















I have an excel spreadsheet with some datetime data in a column. I exported the data into a dataframe using pandas. This column however has chunks of dates that have the month and day swapped, while there are other chunks of dates in the same column that are correct. Here's an example -



enter image description here



Figure 1: Day and month have been swapped incorrectly



The above picture shows the day and month swapped. The date shows 2016-01-10 but should instead be 2016-10-01. Compare this with another chunk of datetime values in the same column -



enter image description here



Figure 2: Day and month are correctly represented



In the above case in Figure 2, the month correctly represented as 12 and the day is 31.



I used the solution from this question - How to swap months and days in a datetime object?



I also tried using this solution -
Python Pandas - Day and Month mix up



I also tried writing my own function to map to the entries but this was to no avail either -



def dm_swap(day, month):
if(month != 10 or month != 11 or month != 12):
temp = day
day = month
month = temp


t2016Q4.start.map(dmswap, t2016Q4.dt.day, t2016Q4.dt.month)



However, both solutions change all the datetime values in the column. So, when the incorrect values get corrected, the correct values become incorrect.



I've also linked the excel file for your convenience. It's an open data set.



https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#343faeaa-c920-57d6-6a75-969181b6cbde



Please choose the last dataset Bikeshare Ridership (2016 Q4). The "start" and "end" columns have the above mentioned issues.



Is there a more efficient way to clean the datetime data?










share|improve this question

























  • Ah, I see, of course, it's Q4 - carry on without me I would probably generate a datetime object for 01 october 2016 midnight and 31st december 2016 23:59:59 and compare whether the date objects are in this range using apply. From here, you ought to be able to make the change you need easily enough

    – Andrew
    Nov 27 '18 at 13:39
















2












2








2


1






I have an excel spreadsheet with some datetime data in a column. I exported the data into a dataframe using pandas. This column however has chunks of dates that have the month and day swapped, while there are other chunks of dates in the same column that are correct. Here's an example -



enter image description here



Figure 1: Day and month have been swapped incorrectly



The above picture shows the day and month swapped. The date shows 2016-01-10 but should instead be 2016-10-01. Compare this with another chunk of datetime values in the same column -



enter image description here



Figure 2: Day and month are correctly represented



In the above case in Figure 2, the month correctly represented as 12 and the day is 31.



I used the solution from this question - How to swap months and days in a datetime object?



I also tried using this solution -
Python Pandas - Day and Month mix up



I also tried writing my own function to map to the entries but this was to no avail either -



def dm_swap(day, month):
if(month != 10 or month != 11 or month != 12):
temp = day
day = month
month = temp


t2016Q4.start.map(dmswap, t2016Q4.dt.day, t2016Q4.dt.month)



However, both solutions change all the datetime values in the column. So, when the incorrect values get corrected, the correct values become incorrect.



I've also linked the excel file for your convenience. It's an open data set.



https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#343faeaa-c920-57d6-6a75-969181b6cbde



Please choose the last dataset Bikeshare Ridership (2016 Q4). The "start" and "end" columns have the above mentioned issues.



Is there a more efficient way to clean the datetime data?










share|improve this question
















I have an excel spreadsheet with some datetime data in a column. I exported the data into a dataframe using pandas. This column however has chunks of dates that have the month and day swapped, while there are other chunks of dates in the same column that are correct. Here's an example -



enter image description here



Figure 1: Day and month have been swapped incorrectly



The above picture shows the day and month swapped. The date shows 2016-01-10 but should instead be 2016-10-01. Compare this with another chunk of datetime values in the same column -



enter image description here



Figure 2: Day and month are correctly represented



In the above case in Figure 2, the month correctly represented as 12 and the day is 31.



I used the solution from this question - How to swap months and days in a datetime object?



I also tried using this solution -
Python Pandas - Day and Month mix up



I also tried writing my own function to map to the entries but this was to no avail either -



def dm_swap(day, month):
if(month != 10 or month != 11 or month != 12):
temp = day
day = month
month = temp


t2016Q4.start.map(dmswap, t2016Q4.dt.day, t2016Q4.dt.month)



However, both solutions change all the datetime values in the column. So, when the incorrect values get corrected, the correct values become incorrect.



I've also linked the excel file for your convenience. It's an open data set.



https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#343faeaa-c920-57d6-6a75-969181b6cbde



Please choose the last dataset Bikeshare Ridership (2016 Q4). The "start" and "end" columns have the above mentioned issues.



Is there a more efficient way to clean the datetime data?







python pandas datetime






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 27 '18 at 16:00









Geof

315




315










asked Nov 27 '18 at 13:31









ragzputinragzputin

162312




162312













  • Ah, I see, of course, it's Q4 - carry on without me I would probably generate a datetime object for 01 october 2016 midnight and 31st december 2016 23:59:59 and compare whether the date objects are in this range using apply. From here, you ought to be able to make the change you need easily enough

    – Andrew
    Nov 27 '18 at 13:39





















  • Ah, I see, of course, it's Q4 - carry on without me I would probably generate a datetime object for 01 october 2016 midnight and 31st december 2016 23:59:59 and compare whether the date objects are in this range using apply. From here, you ought to be able to make the change you need easily enough

    – Andrew
    Nov 27 '18 at 13:39



















Ah, I see, of course, it's Q4 - carry on without me I would probably generate a datetime object for 01 october 2016 midnight and 31st december 2016 23:59:59 and compare whether the date objects are in this range using apply. From here, you ought to be able to make the change you need easily enough

– Andrew
Nov 27 '18 at 13:39







Ah, I see, of course, it's Q4 - carry on without me I would probably generate a datetime object for 01 october 2016 midnight and 31st december 2016 23:59:59 and compare whether the date objects are in this range using apply. From here, you ought to be able to make the change you need easily enough

– Andrew
Nov 27 '18 at 13:39














3 Answers
3






active

oldest

votes


















0














Andrew observed that the DataFrame can be fixed by flipping all months and days when doing so results in a valid date.



Here's a fast way to "flip" all the dates. Invalid dates are coerced to NaT (Not-A-Timestamp) values and then dropped. The remaining flipped dates can then be reassigned to df:



import pandas as pd

df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')

for col in ['trip_start_time', 'trip_stop_time']:
df[col] = pd.to_datetime(df[col])
swapped = pd.to_datetime({'year':df[col].dt.year,
'month':df[col].dt.day,
'day':df[col].dt.month,
'hour':df[col].dt.hour,
'minute':df[col].dt.minute,
'second':df[col].dt.second,}, errors='coerce')
swapped = swapped.dropna()
mask = swapped.index
df.loc[mask, col] = swapped

# check that now all dates are in 2016Q4
for col in ['trip_start_time', 'trip_stop_time']:
mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
assert mask.all()

# check that `trip_start_times` are in chronological order
assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()

# check that `trip_stop_times` are always greater than `trip_start_times`
assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()


The assert statements above verify that the resultant dates are all in 2016Q4, that trip_start_times are in chronological order, and that the trip_stop_times are always greater than their associated trip_start_times.






share|improve this answer


























  • After slight modification of this method, I was able to fit everything within Q4. Thank you!

    – ragzputin
    Nov 28 '18 at 13:18











  • Might this create an issue with 11/10 or 10/11 which are both in Q4?

    – Andrew
    Nov 28 '18 at 14:15











  • @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

    – unutbu
    Nov 28 '18 at 14:22











  • @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

    – unutbu
    Nov 28 '18 at 14:34











  • @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

    – unutbu
    Nov 28 '18 at 16:41



















1














OK.



EDIT -again. I ran the code below, and it took forever! I aborted in the end, but this definitely works in sensible time as well - good luck!:



import pandas as pd

f = "stringtofilehere.xlsx"
df = pd.read_excel(f)

def alter_date(timestamp):

try:
date_time = timestamp.to_datetime().strftime("%Y-%d-%m %H:%M:%S")
time_stamp = pd.Timestamp(date_time)
return time_stamp
except:
return timestamp

new_starts = df["trip_start_time"].apply(alter_date)
df["trip_start_time"] = new_starts
new_ends = df["trip_stop_time"].apply(alter_date)
df["trip_stop_time"] = new_ends


EDIT: I had a bit of a dig, and it looks possible based on what I've done previously, new code here:



import pandas as pd

f = "stringtofilehere.xlsx"
df = pd.read_excel(f)

for idx in df.index:
trip_start = df.loc[df.index[idx], "trip_start_time"]
trip_end = df.loc[df.index[idx], "trip_stop_time"]
start_dt = trip_start.to_datetime()
end_dt = trip_end.to_datetime()
try:
start_dt_string = start_dt.strftime("%Y-%d-%m %H:%M:%S")
end_dt_string = end_dt.strftime("%Y-%d-%m %H:%M:%S")
start_ts = pd.Timestamp(start_dt_string)
end_ts = pd.Timestamp(end_dt_string)
df.loc[idx, "trip_start_time"] = start_ts
df.loc[idx, "trip_stop_time"] = end_ts
except ValueError:
pass


It's a bit slow (there's a bunch of rows of data) but my computer seems to be working through it - will update again if it fails.



Old response:
So, what's happened is that every date/time where there is no possibility of ambiguity is in the original dataset in the format: DD/MM/YYYY HH:MM:SS.



If it's possible to co-erce to MM/DD/YY HH:MM:SS then it has



What I would do, is to iterate over each column



for row in df.index:
try:
new_dt = datetime.strptime(row, "%Y-%d-%m %H:%M:%S")
#write back to the df here
except ValueError:
pass#ignore anything that cannot be converted





share|improve this answer


























  • I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

    – ragzputin
    Nov 28 '18 at 13:17













  • This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

    – Andrew
    Nov 28 '18 at 14:13











  • Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

    – unutbu
    Nov 28 '18 at 16:55











  • From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

    – Andrew
    Nov 28 '18 at 21:46






  • 1





    My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

    – unutbu
    Nov 29 '18 at 1:57



















0














You can use the parameter format in pd.to_datetime:



>>> date= pd.Series(['2016-01-10', '2016-02-10'])
>>> pd.to_datetime(date, format='%Y-%d-%m')
Out:
0 2016-10-01
1 2016-10-02





share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53500872%2fhow-to-clean-up-datetime-strings-in-dataframe-after-export-from-excel-sheet%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    Andrew observed that the DataFrame can be fixed by flipping all months and days when doing so results in a valid date.



    Here's a fast way to "flip" all the dates. Invalid dates are coerced to NaT (Not-A-Timestamp) values and then dropped. The remaining flipped dates can then be reassigned to df:



    import pandas as pd

    df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')

    for col in ['trip_start_time', 'trip_stop_time']:
    df[col] = pd.to_datetime(df[col])
    swapped = pd.to_datetime({'year':df[col].dt.year,
    'month':df[col].dt.day,
    'day':df[col].dt.month,
    'hour':df[col].dt.hour,
    'minute':df[col].dt.minute,
    'second':df[col].dt.second,}, errors='coerce')
    swapped = swapped.dropna()
    mask = swapped.index
    df.loc[mask, col] = swapped

    # check that now all dates are in 2016Q4
    for col in ['trip_start_time', 'trip_stop_time']:
    mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
    assert mask.all()

    # check that `trip_start_times` are in chronological order
    assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()

    # check that `trip_stop_times` are always greater than `trip_start_times`
    assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()


    The assert statements above verify that the resultant dates are all in 2016Q4, that trip_start_times are in chronological order, and that the trip_stop_times are always greater than their associated trip_start_times.






    share|improve this answer


























    • After slight modification of this method, I was able to fit everything within Q4. Thank you!

      – ragzputin
      Nov 28 '18 at 13:18











    • Might this create an issue with 11/10 or 10/11 which are both in Q4?

      – Andrew
      Nov 28 '18 at 14:15











    • @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

      – unutbu
      Nov 28 '18 at 14:22











    • @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

      – unutbu
      Nov 28 '18 at 14:34











    • @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

      – unutbu
      Nov 28 '18 at 16:41
















    0














    Andrew observed that the DataFrame can be fixed by flipping all months and days when doing so results in a valid date.



    Here's a fast way to "flip" all the dates. Invalid dates are coerced to NaT (Not-A-Timestamp) values and then dropped. The remaining flipped dates can then be reassigned to df:



    import pandas as pd

    df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')

    for col in ['trip_start_time', 'trip_stop_time']:
    df[col] = pd.to_datetime(df[col])
    swapped = pd.to_datetime({'year':df[col].dt.year,
    'month':df[col].dt.day,
    'day':df[col].dt.month,
    'hour':df[col].dt.hour,
    'minute':df[col].dt.minute,
    'second':df[col].dt.second,}, errors='coerce')
    swapped = swapped.dropna()
    mask = swapped.index
    df.loc[mask, col] = swapped

    # check that now all dates are in 2016Q4
    for col in ['trip_start_time', 'trip_stop_time']:
    mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
    assert mask.all()

    # check that `trip_start_times` are in chronological order
    assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()

    # check that `trip_stop_times` are always greater than `trip_start_times`
    assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()


    The assert statements above verify that the resultant dates are all in 2016Q4, that trip_start_times are in chronological order, and that the trip_stop_times are always greater than their associated trip_start_times.






    share|improve this answer


























    • After slight modification of this method, I was able to fit everything within Q4. Thank you!

      – ragzputin
      Nov 28 '18 at 13:18











    • Might this create an issue with 11/10 or 10/11 which are both in Q4?

      – Andrew
      Nov 28 '18 at 14:15











    • @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

      – unutbu
      Nov 28 '18 at 14:22











    • @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

      – unutbu
      Nov 28 '18 at 14:34











    • @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

      – unutbu
      Nov 28 '18 at 16:41














    0












    0








    0







    Andrew observed that the DataFrame can be fixed by flipping all months and days when doing so results in a valid date.



    Here's a fast way to "flip" all the dates. Invalid dates are coerced to NaT (Not-A-Timestamp) values and then dropped. The remaining flipped dates can then be reassigned to df:



    import pandas as pd

    df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')

    for col in ['trip_start_time', 'trip_stop_time']:
    df[col] = pd.to_datetime(df[col])
    swapped = pd.to_datetime({'year':df[col].dt.year,
    'month':df[col].dt.day,
    'day':df[col].dt.month,
    'hour':df[col].dt.hour,
    'minute':df[col].dt.minute,
    'second':df[col].dt.second,}, errors='coerce')
    swapped = swapped.dropna()
    mask = swapped.index
    df.loc[mask, col] = swapped

    # check that now all dates are in 2016Q4
    for col in ['trip_start_time', 'trip_stop_time']:
    mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
    assert mask.all()

    # check that `trip_start_times` are in chronological order
    assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()

    # check that `trip_stop_times` are always greater than `trip_start_times`
    assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()


    The assert statements above verify that the resultant dates are all in 2016Q4, that trip_start_times are in chronological order, and that the trip_stop_times are always greater than their associated trip_start_times.






    share|improve this answer















    Andrew observed that the DataFrame can be fixed by flipping all months and days when doing so results in a valid date.



    Here's a fast way to "flip" all the dates. Invalid dates are coerced to NaT (Not-A-Timestamp) values and then dropped. The remaining flipped dates can then be reassigned to df:



    import pandas as pd

    df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')

    for col in ['trip_start_time', 'trip_stop_time']:
    df[col] = pd.to_datetime(df[col])
    swapped = pd.to_datetime({'year':df[col].dt.year,
    'month':df[col].dt.day,
    'day':df[col].dt.month,
    'hour':df[col].dt.hour,
    'minute':df[col].dt.minute,
    'second':df[col].dt.second,}, errors='coerce')
    swapped = swapped.dropna()
    mask = swapped.index
    df.loc[mask, col] = swapped

    # check that now all dates are in 2016Q4
    for col in ['trip_start_time', 'trip_stop_time']:
    mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
    assert mask.all()

    # check that `trip_start_times` are in chronological order
    assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()

    # check that `trip_stop_times` are always greater than `trip_start_times`
    assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()


    The assert statements above verify that the resultant dates are all in 2016Q4, that trip_start_times are in chronological order, and that the trip_stop_times are always greater than their associated trip_start_times.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Dec 1 '18 at 2:39

























    answered Nov 27 '18 at 14:35









    unutbuunutbu

    555k10311931246




    555k10311931246













    • After slight modification of this method, I was able to fit everything within Q4. Thank you!

      – ragzputin
      Nov 28 '18 at 13:18











    • Might this create an issue with 11/10 or 10/11 which are both in Q4?

      – Andrew
      Nov 28 '18 at 14:15











    • @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

      – unutbu
      Nov 28 '18 at 14:22











    • @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

      – unutbu
      Nov 28 '18 at 14:34











    • @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

      – unutbu
      Nov 28 '18 at 16:41



















    • After slight modification of this method, I was able to fit everything within Q4. Thank you!

      – ragzputin
      Nov 28 '18 at 13:18











    • Might this create an issue with 11/10 or 10/11 which are both in Q4?

      – Andrew
      Nov 28 '18 at 14:15











    • @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

      – unutbu
      Nov 28 '18 at 14:22











    • @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

      – unutbu
      Nov 28 '18 at 14:34











    • @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

      – unutbu
      Nov 28 '18 at 16:41

















    After slight modification of this method, I was able to fit everything within Q4. Thank you!

    – ragzputin
    Nov 28 '18 at 13:18





    After slight modification of this method, I was able to fit everything within Q4. Thank you!

    – ragzputin
    Nov 28 '18 at 13:18













    Might this create an issue with 11/10 or 10/11 which are both in Q4?

    – Andrew
    Nov 28 '18 at 14:15





    Might this create an issue with 11/10 or 10/11 which are both in Q4?

    – Andrew
    Nov 28 '18 at 14:15













    @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

    – unutbu
    Nov 28 '18 at 14:22





    @Andrew: That's an excellent point (that I had totally overlooked). Perhaps a different approach is needed to really solve this problem....

    – unutbu
    Nov 28 '18 at 14:22













    @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

    – unutbu
    Nov 28 '18 at 14:34





    @ragzputin: The current code is definitely wrong. Look at, for example, the row with trip_id equal to 608907. The start date is 2016-11-09 but the end date is 2016-10-11.

    – unutbu
    Nov 28 '18 at 14:34













    @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

    – unutbu
    Nov 28 '18 at 16:41





    @ragzputin: I've completely rewritten the code to address the problem Andrew points out. I've also added some more assert statements to test the reasonableness of the result. Please beware that since we're dealing with ambiguous data there's no guarantee that the result above is absolutely correct.

    – unutbu
    Nov 28 '18 at 16:41













    1














    OK.



    EDIT -again. I ran the code below, and it took forever! I aborted in the end, but this definitely works in sensible time as well - good luck!:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    def alter_date(timestamp):

    try:
    date_time = timestamp.to_datetime().strftime("%Y-%d-%m %H:%M:%S")
    time_stamp = pd.Timestamp(date_time)
    return time_stamp
    except:
    return timestamp

    new_starts = df["trip_start_time"].apply(alter_date)
    df["trip_start_time"] = new_starts
    new_ends = df["trip_stop_time"].apply(alter_date)
    df["trip_stop_time"] = new_ends


    EDIT: I had a bit of a dig, and it looks possible based on what I've done previously, new code here:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    for idx in df.index:
    trip_start = df.loc[df.index[idx], "trip_start_time"]
    trip_end = df.loc[df.index[idx], "trip_stop_time"]
    start_dt = trip_start.to_datetime()
    end_dt = trip_end.to_datetime()
    try:
    start_dt_string = start_dt.strftime("%Y-%d-%m %H:%M:%S")
    end_dt_string = end_dt.strftime("%Y-%d-%m %H:%M:%S")
    start_ts = pd.Timestamp(start_dt_string)
    end_ts = pd.Timestamp(end_dt_string)
    df.loc[idx, "trip_start_time"] = start_ts
    df.loc[idx, "trip_stop_time"] = end_ts
    except ValueError:
    pass


    It's a bit slow (there's a bunch of rows of data) but my computer seems to be working through it - will update again if it fails.



    Old response:
    So, what's happened is that every date/time where there is no possibility of ambiguity is in the original dataset in the format: DD/MM/YYYY HH:MM:SS.



    If it's possible to co-erce to MM/DD/YY HH:MM:SS then it has



    What I would do, is to iterate over each column



    for row in df.index:
    try:
    new_dt = datetime.strptime(row, "%Y-%d-%m %H:%M:%S")
    #write back to the df here
    except ValueError:
    pass#ignore anything that cannot be converted





    share|improve this answer


























    • I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

      – ragzputin
      Nov 28 '18 at 13:17













    • This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

      – Andrew
      Nov 28 '18 at 14:13











    • Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

      – unutbu
      Nov 28 '18 at 16:55











    • From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

      – Andrew
      Nov 28 '18 at 21:46






    • 1





      My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

      – unutbu
      Nov 29 '18 at 1:57
















    1














    OK.



    EDIT -again. I ran the code below, and it took forever! I aborted in the end, but this definitely works in sensible time as well - good luck!:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    def alter_date(timestamp):

    try:
    date_time = timestamp.to_datetime().strftime("%Y-%d-%m %H:%M:%S")
    time_stamp = pd.Timestamp(date_time)
    return time_stamp
    except:
    return timestamp

    new_starts = df["trip_start_time"].apply(alter_date)
    df["trip_start_time"] = new_starts
    new_ends = df["trip_stop_time"].apply(alter_date)
    df["trip_stop_time"] = new_ends


    EDIT: I had a bit of a dig, and it looks possible based on what I've done previously, new code here:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    for idx in df.index:
    trip_start = df.loc[df.index[idx], "trip_start_time"]
    trip_end = df.loc[df.index[idx], "trip_stop_time"]
    start_dt = trip_start.to_datetime()
    end_dt = trip_end.to_datetime()
    try:
    start_dt_string = start_dt.strftime("%Y-%d-%m %H:%M:%S")
    end_dt_string = end_dt.strftime("%Y-%d-%m %H:%M:%S")
    start_ts = pd.Timestamp(start_dt_string)
    end_ts = pd.Timestamp(end_dt_string)
    df.loc[idx, "trip_start_time"] = start_ts
    df.loc[idx, "trip_stop_time"] = end_ts
    except ValueError:
    pass


    It's a bit slow (there's a bunch of rows of data) but my computer seems to be working through it - will update again if it fails.



    Old response:
    So, what's happened is that every date/time where there is no possibility of ambiguity is in the original dataset in the format: DD/MM/YYYY HH:MM:SS.



    If it's possible to co-erce to MM/DD/YY HH:MM:SS then it has



    What I would do, is to iterate over each column



    for row in df.index:
    try:
    new_dt = datetime.strptime(row, "%Y-%d-%m %H:%M:%S")
    #write back to the df here
    except ValueError:
    pass#ignore anything that cannot be converted





    share|improve this answer


























    • I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

      – ragzputin
      Nov 28 '18 at 13:17













    • This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

      – Andrew
      Nov 28 '18 at 14:13











    • Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

      – unutbu
      Nov 28 '18 at 16:55











    • From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

      – Andrew
      Nov 28 '18 at 21:46






    • 1





      My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

      – unutbu
      Nov 29 '18 at 1:57














    1












    1








    1







    OK.



    EDIT -again. I ran the code below, and it took forever! I aborted in the end, but this definitely works in sensible time as well - good luck!:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    def alter_date(timestamp):

    try:
    date_time = timestamp.to_datetime().strftime("%Y-%d-%m %H:%M:%S")
    time_stamp = pd.Timestamp(date_time)
    return time_stamp
    except:
    return timestamp

    new_starts = df["trip_start_time"].apply(alter_date)
    df["trip_start_time"] = new_starts
    new_ends = df["trip_stop_time"].apply(alter_date)
    df["trip_stop_time"] = new_ends


    EDIT: I had a bit of a dig, and it looks possible based on what I've done previously, new code here:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    for idx in df.index:
    trip_start = df.loc[df.index[idx], "trip_start_time"]
    trip_end = df.loc[df.index[idx], "trip_stop_time"]
    start_dt = trip_start.to_datetime()
    end_dt = trip_end.to_datetime()
    try:
    start_dt_string = start_dt.strftime("%Y-%d-%m %H:%M:%S")
    end_dt_string = end_dt.strftime("%Y-%d-%m %H:%M:%S")
    start_ts = pd.Timestamp(start_dt_string)
    end_ts = pd.Timestamp(end_dt_string)
    df.loc[idx, "trip_start_time"] = start_ts
    df.loc[idx, "trip_stop_time"] = end_ts
    except ValueError:
    pass


    It's a bit slow (there's a bunch of rows of data) but my computer seems to be working through it - will update again if it fails.



    Old response:
    So, what's happened is that every date/time where there is no possibility of ambiguity is in the original dataset in the format: DD/MM/YYYY HH:MM:SS.



    If it's possible to co-erce to MM/DD/YY HH:MM:SS then it has



    What I would do, is to iterate over each column



    for row in df.index:
    try:
    new_dt = datetime.strptime(row, "%Y-%d-%m %H:%M:%S")
    #write back to the df here
    except ValueError:
    pass#ignore anything that cannot be converted





    share|improve this answer















    OK.



    EDIT -again. I ran the code below, and it took forever! I aborted in the end, but this definitely works in sensible time as well - good luck!:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    def alter_date(timestamp):

    try:
    date_time = timestamp.to_datetime().strftime("%Y-%d-%m %H:%M:%S")
    time_stamp = pd.Timestamp(date_time)
    return time_stamp
    except:
    return timestamp

    new_starts = df["trip_start_time"].apply(alter_date)
    df["trip_start_time"] = new_starts
    new_ends = df["trip_stop_time"].apply(alter_date)
    df["trip_stop_time"] = new_ends


    EDIT: I had a bit of a dig, and it looks possible based on what I've done previously, new code here:



    import pandas as pd

    f = "stringtofilehere.xlsx"
    df = pd.read_excel(f)

    for idx in df.index:
    trip_start = df.loc[df.index[idx], "trip_start_time"]
    trip_end = df.loc[df.index[idx], "trip_stop_time"]
    start_dt = trip_start.to_datetime()
    end_dt = trip_end.to_datetime()
    try:
    start_dt_string = start_dt.strftime("%Y-%d-%m %H:%M:%S")
    end_dt_string = end_dt.strftime("%Y-%d-%m %H:%M:%S")
    start_ts = pd.Timestamp(start_dt_string)
    end_ts = pd.Timestamp(end_dt_string)
    df.loc[idx, "trip_start_time"] = start_ts
    df.loc[idx, "trip_stop_time"] = end_ts
    except ValueError:
    pass


    It's a bit slow (there's a bunch of rows of data) but my computer seems to be working through it - will update again if it fails.



    Old response:
    So, what's happened is that every date/time where there is no possibility of ambiguity is in the original dataset in the format: DD/MM/YYYY HH:MM:SS.



    If it's possible to co-erce to MM/DD/YY HH:MM:SS then it has



    What I would do, is to iterate over each column



    for row in df.index:
    try:
    new_dt = datetime.strptime(row, "%Y-%d-%m %H:%M:%S")
    #write back to the df here
    except ValueError:
    pass#ignore anything that cannot be converted






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Nov 28 '18 at 16:15

























    answered Nov 27 '18 at 14:27









    AndrewAndrew

    634312




    634312













    • I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

      – ragzputin
      Nov 28 '18 at 13:17













    • This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

      – Andrew
      Nov 28 '18 at 14:13











    • Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

      – unutbu
      Nov 28 '18 at 16:55











    • From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

      – Andrew
      Nov 28 '18 at 21:46






    • 1





      My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

      – unutbu
      Nov 29 '18 at 1:57



















    • I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

      – ragzputin
      Nov 28 '18 at 13:17













    • This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

      – Andrew
      Nov 28 '18 at 14:13











    • Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

      – unutbu
      Nov 28 '18 at 16:55











    • From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

      – Andrew
      Nov 28 '18 at 21:46






    • 1





      My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

      – unutbu
      Nov 29 '18 at 1:57

















    I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

    – ragzputin
    Nov 28 '18 at 13:17







    I tried this but it didn't work for me. The trip_start_time column remained the same with incorrect values. Thank you for your solution! I will experiment with it further.

    – ragzputin
    Nov 28 '18 at 13:17















    This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

    – Andrew
    Nov 28 '18 at 14:13





    This code doesn't actually write back to the original dataframe, which is why yours is unchanged (as I've indicated in the commented line). I would suggest that you either keep track of the index and iterate over that to allow you to overwrite in place or to construct a parallel Series and then add something to the except to preserve correct values then broadcast over the top

    – Andrew
    Nov 28 '18 at 14:13













    Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

    – unutbu
    Nov 28 '18 at 16:55





    Once you parse the data with df = pd.read_excel(f) some of the rows will have the date correctly parsed, and some will have the month and day incorrectly flipped. The problem can't be corrected by calling alter_date on every date. We need some criterion to decide which dates are incorrect and only call alter_date on those.

    – unutbu
    Nov 28 '18 at 16:55













    From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

    – Andrew
    Nov 28 '18 at 21:46





    From what I can make out, the dates that are flipped are the ones that are possible to flip, hence my approach. All the instances where there's no ambiguity (i.e. the day is > 12) look to have one format, and all dates where the day is 12 or less seem to have another. alter_date tries to switch the day and month, and if it can't makes no change. Happy to be told that this doesn't work for the whole dataset though, if that's the case

    – Andrew
    Nov 28 '18 at 21:46




    1




    1





    My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

    – unutbu
    Nov 29 '18 at 1:57





    My goodness. Somehow, this possibility never even occurred to me. Well, here's a faster way to swap all months and days. Unswappable dates are coerced to NaT (not-a-timestamp) values, then dropped. It produces the same result as my answer, but feels cleaner. Since its based on your observation, you're welcome to use the code in your answer if you like.

    – unutbu
    Nov 29 '18 at 1:57











    0














    You can use the parameter format in pd.to_datetime:



    >>> date= pd.Series(['2016-01-10', '2016-02-10'])
    >>> pd.to_datetime(date, format='%Y-%d-%m')
    Out:
    0 2016-10-01
    1 2016-10-02





    share|improve this answer




























      0














      You can use the parameter format in pd.to_datetime:



      >>> date= pd.Series(['2016-01-10', '2016-02-10'])
      >>> pd.to_datetime(date, format='%Y-%d-%m')
      Out:
      0 2016-10-01
      1 2016-10-02





      share|improve this answer


























        0












        0








        0







        You can use the parameter format in pd.to_datetime:



        >>> date= pd.Series(['2016-01-10', '2016-02-10'])
        >>> pd.to_datetime(date, format='%Y-%d-%m')
        Out:
        0 2016-10-01
        1 2016-10-02





        share|improve this answer













        You can use the parameter format in pd.to_datetime:



        >>> date= pd.Series(['2016-01-10', '2016-02-10'])
        >>> pd.to_datetime(date, format='%Y-%d-%m')
        Out:
        0 2016-10-01
        1 2016-10-02






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 28 '18 at 1:03









        TimeSeamTimeSeam

        32125




        32125






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53500872%2fhow-to-clean-up-datetime-strings-in-dataframe-after-export-from-excel-sheet%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Lallio

            Futebolista

            Jornalista