Python/Pandas - confusion around ARIMA forecasting to get simple predictions

Trying to wrap my head around how to implement an ARIMA model to produce (arguably) simple forecasts. Essentially what I'm looking to do is forecast this year's bookings up until the end of the year and export as a csv. Looking something like this:



date           bookings
2017-01-01 438
2017-01-02 167
...
2017-12-31 45
2018-01-01 748
...
2018-11-29 223
2018-11-30 98
...
2018-12-30 73
2018-12-31 100


Where anything after today (28/11/18) is a forecast.



What I've tried to do:



This loads my dataset, which is basically two columns: the date (daily, for the whole of 2017) and the bookings:



import pandas as pd
import statsmodels.api as sm
# from statsmodels.tsa.arima_model import ARIMA
# from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

df = pd.read_csv('data.csv',names = ["date","bookings"],index_col=0)
df.index = pd.to_datetime(df.index)


This is the 'modelling' bit:



X = df.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
for t in range(len(test)):
    model = ARIMA(history, order=(1,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()

    yhat = output[0]
    predictions.append(yhat)

    obs = test[t]
    history.append(obs)

    # print('predicted=%f, expected=%f' % (yhat, obs))
#error = mean_squared_error(test, predictions)
#print(error)
#print('Test MSE: %.3f' % error)
# plot
plt.figure(num=None, figsize=(15, 8))
plt.plot(test)
plt.plot(predictions, color='red')
plt.show()


Exporting results to a csv:



df_forecast = pd.DataFrame(predictions)
df_test = pd.DataFrame(test)
result = pd.merge(df_test, df_forecast, left_index=True, right_index=True)
result.rename(columns = {'0_x': 'Test', '0_y': 'Forecast'}, inplace=True)


The trouble I'm having is:




  • Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?

  • 2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?


What I think I need to do:




  • Grab my bookings dataset of 2017 and 2018 data from my database

  • Split it by 2017 and 2018

  • Produce some forecasts on 2018

  • Append this 2018+forecast data to 2017 and export as csv


The how-to and the why are what I'm struggling with.
Any help would be much appreciated.

  • Hi AK91, what have you read and what is the problem? The title is somewhat misleading. The following does not use ARIMA, but it covers a few concepts you might want to read about: prophet

    – user32185
    Nov 28 '18 at 11:35

  • I had a read of prophet but I had some issues with installation or something? I'll have another go though. In terms of what I've read, here's the link: machinelearningmastery.com/…. Problem is how to perform the forecast on 2018 data and what would be my train/test subsets? All a bit new/confusing to me...

    – AK91
    Nov 28 '18 at 11:45

  • See nixon's answer. Also, though this is just a personal opinion, I don't think that blog is a good source of information.

    – user32185
    Nov 28 '18 at 11:56

python pandas forecasting arima






asked Nov 28 '18 at 11:23 by AK91
edited Nov 28 '18 at 11:46 by AK91

1 Answer

Here are some thoughts:




  • Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?


Yes, that is correct. The idea is the same as for any machine learning model: the data is split into train and test sets, a model is fit on the train data, and the test data is used to compare the model's predictions with the real values via some error metric. However, since you are dealing with time series data, the split must respect the time order, as yours already does.
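To make the split concrete, here is a minimal sketch of a chronological train/test split and an error metric. It assumes a date-indexed bookings series; the data is synthetic and made up for illustration, and the "model" is a naive repeat-the-last-value forecast standing in for ARIMA so the example needs only pandas and numpy.

```python
import numpy as np
import pandas as pd

# Synthetic daily bookings, made up purely for illustration.
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", "2017-12-31", freq="D")
bookings = pd.Series(rng.integers(50, 500, size=len(dates)),
                     index=dates, name="bookings")

# Chronological split: the first 66% is train, the remainder is test.
# Nothing is shuffled, so every test date is later than every train date.
size = int(len(bookings) * 0.66)
train, test = bookings.iloc[:size], bookings.iloc[size:]

# Stand-in "model": a naive forecast that repeats the last training value.
# The predictions of any fitted model for the test dates would go here instead.
predictions = pd.Series(train.iloc[-1], index=test.index)

# The test set is used only to score the predictions.
mse = float(((test - predictions) ** 2).mean())
print(f"train: {train.index[0].date()} to {train.index[-1].date()}")
print(f"test:  {test.index[0].date()} to {test.index[-1].date()}")
print(f"naive-forecast MSE: {mse:.1f}")
```

The same MSE line works unchanged once `predictions` comes from a real model, which makes the naive forecast a handy baseline to beat.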




  • 2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?


Do you actually have a csv with the 2018 data? To split it into train/test, do the same as you did for the 2017 data, i.e. keep everything up to some size as train and leave the rest to test your predictions: train, test = X[0:size], X[size:len(X)]. However, if what you want is a prediction from today's date onwards, why not use all the historical data as input to the model and forecast from that?
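If all the history goes into the model, the remaining question is which dates need a forecast. A small sketch, assuming the last observed day is 2018-11-28 and the target is the end of 2018 (both taken from the question); `future_dates` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def future_dates(last_observed, end):
    """Daily dates strictly after last_observed, up to and including end."""
    start = pd.Timestamp(last_observed) + pd.Timedelta(days=1)
    return pd.date_range(start=start, end=pd.Timestamp(end), freq="D")

horizon = future_dates("2018-11-28", "2018-12-31")
print(len(horizon))                           # → 33 days to forecast
print(horizon[0].date(), horizon[-1].date())  # → 2018-11-29 2018-12-31
```

The length of `horizon` is the number of steps to request from the model, and the index itself can label the forecast rows in the exported file.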



What I think I need to do




  • Split it by 2017 and 2018


Why would you want to split it? Simply feed your ARIMA model all your data as a single time series, i.e. append the two years together, and use the last size samples as test. Keep in mind that parameter estimates generally improve as the sample size grows. Once you've validated the performance of the model, use it to predict from today onwards.






  • Thanks for the answer and clarification. I guess the issue is with the last bit, "Once you've validated the performance of the model, use it to predict from today onwards." Does that mean, using the code I have, amending my for loop to something like for t in range(len(tomorrow up to end of the year))? So all the data I have will be my train set? And the test is basically the predictions? Apologies for the lame questions...

    – AK91
    Nov 28 '18 at 11:55

  • Yes, that's right. For the forecast you will have to extend the iterations up to the date you wish to forecast to. So it's the same loop you use to validate the model (where you check the mean squared error of the predictions), but running over len(sequence):last_forecast_date

    – yatu
    Nov 28 '18 at 12:03

  • Tried everything to get this to work and still not getting whatever the model is intended for. It doesn't help that the tutorial I used stops at the training/testing phase and doesn't explain the actual application/forecasting phase. In any case, I'm going to go and do some proper homework on this rather than trying to get a quick fix. Thanks for the help, much appreciated.

    – AK91
    Nov 28 '18 at 17:03

answered Nov 28 '18 at 11:41 by yatu
  • Thanks for the answer and clarification. I guess the issue is within the last bit "Once you've validated the performance of the model, use it to predict from today onwards." - Does that mean, using the code I have, amend my for loop to something like for t in range(len(tomorrow up to end of the year))? So all the data I have will be my train set? And the test is basically the predictions?...Apologies for the lame questions...

    – AK91
    Nov 28 '18 at 11:55













  • Yes that's right. For the forecast you will have to extend the iterations till the date you wish to forecast to. So the same as you do to validate the model, i.e check the mean squared error with the predictions, but from len(sequence):last_forecast_date

    – yatu
    Nov 28 '18 at 12:03











  • Tried everything to get this to work - still not getting whatever the model is intended for, doesn't help when the tutorial I used stops at the training/testing phase and doesn't explain the actual application/forecasting phase...in any case, gonna go do some proper homework on this, rather than trying to get a quick fix...thanks for the help, much appreciated

    – AK91
    Nov 28 '18 at 17:03






























































