📖 Background

You work for an energy company in Australia. Your company builds solar panel arrays and then sells the energy they produce to industrial customers. The company wants to expand to the city of Melbourne in the state of Victoria.

Prices and demand for electricity change every day. Customers pay for the energy received using a formula based on the local energy market’s daily price.

Your company’s pricing committee wants your team to estimate energy prices for the next 12-18 months to use those prices as the basis for contract negotiations.

In addition, the VP of strategy is researching investing in storage capacity (i.e., batteries) as a new source of revenue. The plan is to store some of the energy produced by the solar panels when pricing conditions are unfavorable and sell it by the next day on the open market if the prices are higher.

💾 The data

You have access to over five years of energy price and demand data (source):

Note: The price was negative during some intraday intervals, so energy producers were paying buyers rather than vice-versa.

💪 Competition challenge

Create a report that covers the following:

  1. How do energy prices change throughout the year? Are there any patterns by season or month of the year?
  2. Build a forecast of daily energy prices the company can use as the basis of its financial planning.
  3. Provide guidance on how much revenue the energy storage venture could generate per year using retail prices and a 70MWh storage system.

Exploratory Data Analysis (EDA)

import seaborn as sns
sns.set_style('darkgrid')
import pandas as pd
df = pd.read_csv('energy_demand.csv', 
                 parse_dates=['date'], 
                 index_col = 'date'
                )
df.tail()
demand price demand_pos_price price_positive demand_neg_price price_negative frac_neg_price min_temperature max_temperature solar_exposure rainfall school_day holiday
date
2020-10-02 99585.835 -6.076028 41988.240 26.980251 57597.595 -30.173823 0.625000 12.8 26.0 22.0 0.0 N N
2020-10-03 92277.025 -1.983471 44133.510 32.438156 48143.515 -33.538025 0.583333 17.4 29.4 19.8 0.0 N N
2020-10-04 94081.565 25.008614 88580.995 26.571687 5500.570 -0.163066 0.062500 13.5 29.5 8.4 0.0 N N
2020-10-05 113610.030 36.764701 106587.375 39.616015 7022.655 -6.511550 0.083333 9.1 12.7 7.3 12.8 N N
2020-10-06 122607.560 75.771059 122607.560 75.771059 0.000 0.000000 0.000000 8.9 12.6 5.8 1.0 N N

wrangling & cleaning

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2106 entries, 2015-01-01 to 2020-10-06
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   demand            2106 non-null   float64
 1   price             2106 non-null   float64
 2   demand_pos_price  2106 non-null   float64
 3   price_positive    2106 non-null   float64
 4   demand_neg_price  2106 non-null   float64
 5   price_negative    2106 non-null   float64
 6   frac_neg_price    2106 non-null   float64
 7   min_temperature   2106 non-null   float64
 8   max_temperature   2106 non-null   float64
 9   solar_exposure    2105 non-null   float64
 10  rainfall          2103 non-null   float64
 11  school_day        2106 non-null   object 
 12  holiday           2106 non-null   object 
dtypes: float64(11), object(2)
memory usage: 230.3+ KB
df.isna().sum()
demand              0
price               0
demand_pos_price    0
price_positive      0
demand_neg_price    0
price_negative      0
frac_neg_price      0
min_temperature     0
max_temperature     0
solar_exposure      1
rainfall            3
school_day          0
holiday             0
dtype: int64

Hardly any missing data, which is good: little imputation is required. The few gaps could be closed with a simple interpolation, sketched after the table below.

df[df.solar_exposure.isna() | df.rainfall.isna()]
demand price demand_pos_price price_positive demand_neg_price price_negative frac_neg_price min_temperature max_temperature solar_exposure rainfall school_day holiday
date
2015-06-11 143465.445 37.481829 143465.445 37.481829 0.0 0.0 0.0 5.7 14.0 8.3 NaN Y N
2017-11-26 108717.875 83.114514 108717.875 83.114514 0.0 0.0 0.0 19.4 28.3 NaN 3.4 Y N
2018-10-09 116449.310 99.000749 116449.310 99.000749 0.0 0.0 0.0 16.1 17.9 7.2 NaN Y N
2018-10-10 109551.080 73.539698 109551.080 73.539698 0.0 0.0 0.0 10.1 16.5 20.0 NaN Y N
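Only four rows have gaps, all in the weather columns. One simple option, if we prefer not to drop these rows, is a time-based interpolation along the date index; a minimal sketch:

# fill the few missing weather values by interpolating along the DatetimeIndex
df['solar_exposure'] = df['solar_exposure'].interpolate(method='time')
df['rainfall'] = df['rainfall'].interpolate(method='time')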

Label encoding of both holiday and school_day columns:

b = {'Y': 1, 'N': 0, 1: 1, 0: 0}
df.school_day = df.school_day.map(b)
df.holiday = df.holiday.map(b)
df.head()
demand price demand_pos_price price_positive demand_neg_price price_negative frac_neg_price min_temperature max_temperature solar_exposure rainfall school_day holiday
date
2015-01-01 99635.030 25.633696 97319.240 26.415953 2315.790 -7.240000 0.020833 13.3 26.9 23.6 0.0 0 1
2015-01-02 129606.010 33.138988 121082.015 38.837661 8523.995 -47.809777 0.062500 15.4 38.8 26.8 0.0 0 0
2015-01-03 142300.540 34.564855 142300.540 34.564855 0.000 0.000000 0.000000 20.0 38.2 26.5 0.0 0 0
2015-01-04 104330.715 25.005560 104330.715 25.005560 0.000 0.000000 0.000000 16.3 21.4 25.2 4.2 0 0
2015-01-05 118132.200 26.724176 118132.200 26.724176 0.000 0.000000 0.000000 15.0 22.0 30.7 0.0 0 0
df.corr()
demand price demand_pos_price price_positive demand_neg_price price_negative frac_neg_price min_temperature max_temperature solar_exposure rainfall school_day holiday
demand 1.000000 0.217538 0.971377 0.215038 -0.180638 0.057854 -0.189839 -0.156118 -0.073216 -0.257406 -0.064609 0.123030 -0.247683
price 0.217538 1.000000 0.220856 0.999821 -0.078815 0.038931 -0.077955 0.070619 0.165484 0.061808 -0.028642 -0.005014 -0.030963
demand_pos_price 0.971377 0.220856 1.000000 0.214628 -0.409102 0.120054 -0.416573 -0.147020 -0.068146 -0.229749 -0.069696 0.136983 -0.234809
price_positive 0.215038 0.999821 0.214628 1.000000 -0.062631 0.029455 -0.061968 0.071052 0.165663 0.061311 -0.027860 -0.006134 -0.030697
demand_neg_price -0.180638 -0.078815 -0.409102 -0.062631 1.000000 -0.274847 0.995590 0.009030 0.000914 -0.037467 0.040387 -0.094577 0.020787
price_negative 0.057854 0.038931 0.120054 0.029455 -0.274847 1.000000 -0.258065 -0.077248 -0.033255 0.001654 -0.020794 0.007775 -0.004092
frac_neg_price -0.189839 -0.077955 -0.416573 -0.061968 0.995590 -0.258065 1.000000 0.008859 -0.001174 -0.036701 0.038527 -0.096948 0.025606
min_temperature -0.156118 0.070619 -0.147020 0.071052 0.009030 -0.077248 0.008859 1.000000 0.705433 0.376261 -0.003050 -0.082904 0.066620
max_temperature -0.073216 0.165484 -0.068146 0.165663 0.000914 -0.033255 -0.001174 0.705433 1.000000 0.598995 -0.155392 -0.092532 0.042487
solar_exposure -0.257406 0.061808 -0.229749 0.061311 -0.037467 0.001654 -0.036701 0.376261 0.598995 1.000000 -0.123568 -0.095447 0.045224
rainfall -0.064609 -0.028642 -0.069696 -0.027860 0.040387 -0.020794 0.038527 -0.003050 -0.155392 -0.123568 1.000000 -0.013987 -0.015024
school_day 0.123030 -0.005014 0.136983 -0.006134 -0.094577 0.007775 -0.096948 -0.082904 -0.092532 -0.095447 -0.013987 1.000000 -0.170251
holiday -0.247683 -0.030963 -0.234809 -0.030697 0.020787 -0.004092 0.025606 0.066620 0.042487 0.045224 -0.015024 -0.170251 1.000000

Most variables are only weakly correlated with each other; the main exceptions are the price and demand sub-features, which are correlated among themselves by construction, and the two temperature variables.
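A heatmap makes the matrix above easier to scan than the raw table; a minimal sketch using seaborn:

# visualise the correlation matrix as a heatmap
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 9))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='vlag', center=0)
plt.title('Correlation matrix of daily features')
plt.show()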

import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1)
fig.suptitle('Demand and Price time series')
axes[0] = df.demand.plot(ax=axes[0])
axes[0].set_ylabel('demand')
axes[1] = df.price.plot(ax=axes[1])
axes[1].set_ylabel('price (log scaled)')
axes[1].set_yscale('log')
plt.show()

[figure: Demand and Price time series (price on log scale)]

# create a few extra columns
import numpy as np
df['month'] = df.index.month
df['quarter'] = df.index.quarter
df['week'] = df.index.isocalendar().week
df['weekdayname'] = df.index.day_name()
df['logprice_negative'] = df.price_negative.apply(lambda x: np.log(abs(x)) if x<0 else x)

1. How do energy prices change throughout the year? Are there any patterns by season or month of the year?

import numpy as np
sns.set(rc={"figure.figsize":(12, 20)})
fig, axes = plt.subplots(4, 1)
axes[0].set_title('Average Price distributions (log transformed) by quarter')
axes[0] = sns.boxplot(data=df, y=np.log(df.price), x=df.quarter, ax=axes[0])
axes[0].set_ylabel('Log Price')
axes[0].set_xlabel('quarter')
axes[1].set_title('Average Prices (log transformed) by month')
axes[1] = sns.boxplot(data=df, y=np.log(df.price), x=df.month, ax=axes[1])
axes[1].set_ylabel('Log Price')
axes[1].set_xlabel('month')
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
axes[2].set_title('Average Prices (log transformed) per weekday')
axes[2] = sns.boxplot(data=df, y=np.log(df.price), x=df.weekdayname, order=weekdays, ax=axes[2])
axes[2].set_ylabel('Prices (log transformed)')
axes[2].set_xlabel('weekday')
axes[3].set_title('Negative Prices only (log transformed & sign changed)')
axes[3] = sns.boxplot(data=df, y=df.logprice_negative[df.logprice_negative != 0], x=df.weekdayname, order=weekdays, ax=axes[3])
axes[3].set_ylabel('Negative Prices (log transformed & sign changed)')
axes[3].set_xlabel('weekday')
plt.show()
RuntimeWarning: invalid value encountered in log (raised several times: days with a negative daily average price become NaN under the log transform and are omitted from these plots)

[figure: Price distributions (log transformed) by quarter, month and weekday; negative prices by weekday]

df.logprice_negative[df.logprice_negative != 0 ].count()
181

The price turned negative on only 181 of the 2,106 days in the data. As the chart below shows, on weekend days the fraction of time spent at negative prices is on average more than twice that of weekdays.

# frac_neg_price

avg_frac_neg_price_weekday = df.frac_neg_price.groupby(df.weekdayname).mean()

sns.set(rc={"figure.figsize":(14, 8)})
fig = avg_frac_neg_price_weekday[weekdays].plot(kind='bar')
fig.set_ylabel('fraction of time with negative prices')
fig.set_xlabel('weekday')
fig.set_title('Fraction of time with negative prices per weekday')
#plt.yscale('log')
plt.show()

[figure: Fraction of time with negative prices per weekday]

Since we do not have values for the feature variables over the 12-18 month forecast horizon, we cannot use most of them in the price forecasting models. Variables such as holiday and school_day could be obtained for future dates fairly easily, but others are hard or impossible to obtain in advance; for the weather variables we would have to rely on forecasts ourselves, which only reach a few days ahead anyway.
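Calendar features, by contrast, are known in advance and can be assembled for any horizon; a minimal sketch, assuming a hand-maintained list of Victorian public holidays (the dates below are placeholders, not an authoritative calendar):

# build calendar features for a future horizon; the holiday dates are illustrative placeholders only
future_index = pd.date_range('2020-10-07', periods=365, freq='D')
future_calendar = pd.DataFrame(index=future_index)
vic_holidays = pd.to_datetime(['2020-12-25', '2020-12-26', '2021-01-01'])  # placeholder dates
future_calendar['holiday'] = future_calendar.index.isin(vic_holidays).astype(int)
future_calendar['dayofweek'] = future_calendar.index.dayofweek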

sns.set(rc={"figure.figsize":(13, 7)})
fig, axes = plt.subplots(1, 2)

fig = sns.scatterplot(data=df, y=np.log(df.price), x='demand', hue='holiday',
                 ax=axes[0]
                 )

fig = sns.scatterplot(data=df, y=np.log(df.price), x='demand', hue='school_day',
                 ax=axes[1]
                 )
plt.show()
RuntimeWarning: invalid value encountered in log (as above, caused by negative-price days)

[figure: log price vs demand scatterplots, coloured by holiday and by school_day]

Price and demand are naturally positively correlated and interact with each other, which the quick scatterplots confirm. We want to use demand in our price forecast as well, so we first model and forecast demand and then feed the result into the price model as an extra feature.

2. Build a forecast of daily energy prices the company can use as the basis of its financial planning.

Formal timeseries stationarity check

# Import augmented dicky-fuller test function
from statsmodels.tsa.stattools import adfuller

# Run test
adfullertest_price = adfuller(df.price)
adfullertest_demand = adfuller(df.demand)

print(f'Price TS Adfuller test statistic {adfullertest_price[0]}')
print(f'Price TS Adfuller p-value {adfullertest_price[1]}')

print('##################')

print(f'Demand TS Adfuller test statistic {adfullertest_demand[0]}')
print(f'Demand TS Adfuller p-value {adfullertest_demand[1]}')
Price TS Adfuller test statistic -11.039525891900833
Price TS Adfuller p-value 5.4099030605776795e-20
##################
Demand TS Adfuller test statistic -3.953447482126237
Demand TS Adfuller p-value 0.0016748806784637172

Augmented Dickey-Fuller test (ADF)

Null hypothesis: the series has a unit root, i.e. it is non-stationary.

We can clearly reject the null hypothesis for both the price and the demand time series, so we treat both as stationary.
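The ADF conclusion can be cross-checked with a KPSS test, whose null hypothesis is the opposite (stationarity); a minimal sketch with statsmodels, added as a sanity check rather than as part of the original analysis:

# KPSS test: the null hypothesis is stationarity, so a large p-value supports the ADF conclusion
from statsmodels.tsa.stattools import kpss

kpss_price = kpss(df.price, regression='c', nlags='auto')
kpss_demand = kpss(df.demand, regression='c', nlags='auto')
print(f'Price KPSS statistic {kpss_price[0]:.3f}, p-value {kpss_price[1]:.3f}')
print(f'Demand KPSS statistic {kpss_demand[0]:.3f}, p-value {kpss_demand[1]:.3f}')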

Forecasting demand with a Hybrid model: Linear Regression + XGBoost

We want to build a demand forecasting model because demand and prices strongly interact with each other, in electricity markets as elsewhere. Forecast demand may therefore be a useful feature for predicting the price.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from xgboost import XGBRegressor
y = df.demand

# Create trend features
dp = DeterministicProcess(index=y.index, constant=True, order=2, drop=True)
X = dp.in_sample()  # deliberately not using the somewhat awkward out_of_sample() here

# split the data
idx_train, idx_test = train_test_split(
    y.index, test_size=500, shuffle=False,
)
X_train, X_test = X.loc[idx_train, :], X.loc[idx_test, :]
y_train, y_test = y.loc[idx_train], y.loc[idx_test]

X_test
const trend trend_squared
date
2019-05-26 1.0 1607.0 2582449.0
2019-05-27 1.0 1608.0 2585664.0
2019-05-28 1.0 1609.0 2588881.0
2019-05-29 1.0 1610.0 2592100.0
2019-05-30 1.0 1611.0 2595321.0
... ... ... ...
2020-10-02 1.0 2102.0 4418404.0
2020-10-03 1.0 2103.0 4422609.0
2020-10-04 1.0 2104.0 4426816.0
2020-10-05 1.0 2105.0 4431025.0
2020-10-06 1.0 2106.0 4435236.0

500 rows × 3 columns

# Fit trend model
model = LinearRegression(fit_intercept=False)
model.fit(X_train, y_train)
LinearRegression(fit_intercept=False)
# Make predictions
y_fit = pd.DataFrame(model.predict(X_train),index=y_train.index, columns=['demand'])
y_pred = pd.DataFrame(model.predict(X_test), index=y_test.index, columns=['demand'])
# Plot
plt.figure(figsize=(10,6))
axs = y_train.plot(color='0.25', subplots=True, sharex=True)
axs = y_test.plot(color='0.25', subplots=True, ax=axs)
axs = y_fit.plot(color='C0', subplots=True, ax=axs)
axs = y_pred.plot(color='C3', subplots=True, ax=axs)
for ax in axs: ax.legend([])
_ = plt.suptitle("Demand on a downward trend")
plt.show()

[figure: Demand on a downward trend – quadratic trend fit (train) and prediction (test)]

import numpy as np

def feature_engineer_X(X):
    X['dayofweek'] = X.index.dayofweek
    X['quarter'] = X.index.quarter
    X['month'] = X.index.month
    X['year'] = X.index.year
    X['dayofyear'] = X.index.dayofyear
    X['dayofmonth'] = X.index.day
    X['weekofyear'] = X.index.isocalendar().week.astype(np.int64)
    return X

X = feature_engineer_X(X)

# remove the trend columns before fitting XGBoost on the residuals
X = X.drop(columns=[c for c in ['const', 'trend', 'trend_squared'] if c in X.columns])
        
# Create splits
X_train, X_test = X.loc[idx_train, :], X.loc[idx_test, :]
y_train, y_test = y.loc[idx_train], y.loc[idx_test]
# Create residuals (the collection of detrended series) from the training set
y_resid = y_train - y_fit.demand # y_fit is DF
y_resid.tail()
date
2019-05-21     6304.855728
2019-05-22     6560.784812
2019-05-23     3670.415060
2019-05-24     3484.181472
2019-05-25   -10483.520954
Name: demand, dtype: float64
# Train XGBoost on the residuals
xgb = XGBRegressor()
xgb.fit(X_train, y_resid)

# Add the predicted residuals onto the predicted trends
y_fit_boosted = xgb.predict(X_train) + y_fit.demand
y_pred_boosted = xgb.predict(X_test) + y_pred.demand

axs = y_train.plot(
    color='0.25', figsize=(15, 8), subplots=True, 
)
axs = y_test.plot(
    color='0.25', subplots=True, sharex=True, ax=axs,
)
axs = y_fit_boosted.plot(
    color='C0', subplots=True, sharex=True, ax=axs,
)
axs = y_pred_boosted.plot(
    color='C3', subplots=True, sharex=True, ax=axs,
)
plt.legend(['observed demand (training)', 'observed demand (test)', 'demand fitted (training)', 'demand forecast (test)'])
plt.ylabel('Demand in MWh')
plt.show()

[figure: observed vs fitted/forecast demand, hybrid linear trend + XGBoost model]


axs = y_train[-100:].plot(
    color='0.25', figsize=(15, 8), subplots=True, sharex=True,
)
axs = y_test.plot(
    color='0.25', subplots=True, ax=axs,
)
axs = y_fit_boosted[-100:].plot(
    color='C0', subplots=True,  ax=axs,
)
axs = y_pred_boosted.plot(
    color='C3', subplots=True,  ax=axs,
)
plt.legend(['observed demand (training)', 'observed demand (test)', 'demand fitted (training)', 'demand forecast (test)'])
plt.ylabel('Demand in MWh')
plt.show()

[figure: zoomed view of the last 100 training days plus the test period, observed vs modeled demand]

forecast the full demand time series as a new feature for price modelling

We train again, now on the full demand dataset, and predict demand over our forecast horizon of 12 to 18 months.

lr = LinearRegression(fit_intercept=False)
lr.fit(X, y)
y_lr = lr.predict(X)

y_resid = y - y_lr
y_resid

xgb = XGBRegressor()
xgb.fit(X, y_resid)
y_xgb = xgb.predict(X)

y_demand_fitted = pd.DataFrame(
    y_lr + y_xgb,
    index=y.index,
    columns=['demand_fitted']
)
from xgboost import plot_importance

plot_importance(xgb, title='Feature importance for demand forecasting')
plt.show()

[figure: XGBoost feature importance for demand forecasting]

horizon_index_12M = pd.date_range('2020-10-07', periods=365, freq='D')
horizon_index_18M = pd.date_range('2020-10-07', periods=548, freq='D')

X_12M = pd.DataFrame(horizon_index_12M.dayofweek, index=horizon_index_12M, columns=['dayofweek'])
X_18M = pd.DataFrame(horizon_index_18M.dayofweek, index=horizon_index_18M, columns=['dayofweek'])

X_12M = feature_engineer_X(X_12M)
X_18M = feature_engineer_X(X_18M)
y_demand_forecast_12M = pd.DataFrame(lr.predict(X_12M) + xgb.predict(X_12M), index=horizon_index_12M, columns=['demand_forecast'])
y_demand_forecast_18M = pd.DataFrame(lr.predict(X_18M) + xgb.predict(X_18M), index=horizon_index_18M, columns=['demand_forecast'])

axs = y_demand_forecast_12M.plot(
    color='red', figsize=(15, 8), subplots=True, sharex=True, alpha=0.8

)
axs = y_demand_forecast_18M.plot(
    color='red', linestyle='--', subplots=True, sharex=True, alpha=0.6, ax=axs,

)
axs = y_demand_fitted[-200:].plot(
    color='blue', subplots=True, sharex=True, ax=axs,
)
axs = df.demand[-200:].plot(
    color='black', marker = 'o', subplots=True, sharex=True, ax=axs, alpha=0.5
)
#axs = y_pred_boosted.plot(
#    color='C3', subplots=True, sharex=True, ax=axs,
#)
#axs.legend([])
plt.legend(['demand forecast 12 months','demand forecast 18 months', 'demand fitted', 'actual demand'])
plt.title('Demand forecast for 12 & 18 months')
plt.ylabel('Demand in MWh')
plt.show()

[figure: demand forecast for 12 & 18 months vs fitted and actual demand]

We now have a solid forecast of the demand time series and have created a forecast demand feature that will also help in forecasting the price.
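How solid can be quantified on the 500-day hold-out split from above; a minimal sketch using the y_test and y_pred_boosted objects already defined:

# evaluate the hybrid demand model on the hold-out set
from sklearn.metrics import mean_absolute_error, mean_squared_error

demand_rmse = mean_squared_error(y_test, y_pred_boosted, squared=False)
demand_mae = mean_absolute_error(y_test, y_pred_boosted)
print(f'Demand test RMSE: {demand_rmse:.0f} MWh')
print(f'Demand test MAE:  {demand_mae:.0f} MWh ({100 * demand_mae / y_test.mean():.1f}% of mean test demand)')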

Forecasting price

from statsmodels.tsa.seasonal import seasonal_decompose

def plot_periodogram(ts, detrend='linear', ax=None):
    # https://www.kaggle.com/code/ryanholbrook/seasonality/tutorial
    from scipy.signal import periodogram
    fs = 365.2425  # observations per year for daily data; avoids the deprecated 'Y' Timedelta unit
    frequencies, spectrum = periodogram(
        ts,
        fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots()
    ax.step(frequencies, spectrum, color="purple")
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
    ax.set_xticklabels(
        [
            "Annual (1)",
            "Semiannual (2)",
            "Quarterly (4)",
            "Bimonthly (6)",
            "Monthly (12)",
            "Biweekly (26)",
            "Weekly (52)",
            "Semiweekly (104)",
        ],
        rotation=30,
    )
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")
    ax.set_title("Price Periodogram")
    return ax

fig, axes = plt.subplots(3, 1, figsize=(10,10), gridspec_kw={'height_ratios': [1, 1,2]})
axes[1] = plot_periodogram(df.price, ax=axes[1])
ts_decomposed = seasonal_decompose(df['price'],  model='additive')
axes[0] = ts_decomposed.trend.plot( ax=axes[0])
axes[0].set_title("Price trend component")
axes[0].set_yscale("log")
axes[2] = sns.boxplot(data=df, y='price', showfliers=False)
axes[2].set_title("Price distribution")
plt.show()

[figure: price trend component, price periodogram, and overall price distribution]

The price time series hardly trends; instead it stays bound within certain price ranges for extended periods. Unlike the demand series modeled earlier, it shows no particularly strong seasonality at any period. The bulk of daily average prices lies between roughly 38 and 95 AUD, with considerable extremes above and below, visible as the long whiskers of the boxplot.

Given these characteristics, trend and seasonality are not major factors, so we are not tied to a classical decomposition approach; we choose an XGBRegressor model to forecast prices.

from warnings import simplefilter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

simplefilter("ignore")
def make_lags(ts, lags, lead_time=1):
    return pd.concat(
        {
            f'y_lag_{i}': ts.shift(i)
            for i in range(lead_time, lags + lead_time)
        },
        axis=1)


# One day of lag features for the (log) price
y = np.log(df.price.copy())  # note: days with a negative daily average price become NaN and are dropped further below
y_orig = y.copy()
lags = 1
X = make_lags(y, lags=lags).fillna(0.0)

# remove any calendar columns if present (they are re-created during feature engineering below)
calendar_cols = ['dayofweek', 'quarter', 'month', 'dayofyear', 'dayofmonth', 'weekofyear']
X = X.drop(columns=[c for c in calendar_cols if c in X.columns])
        

def make_multistep_target(ts, steps):
    return pd.concat(
        {f'y_step_{i + 1}': ts.shift(-i)
         for i in range(steps)},
        axis=1)


# Single-step (next-day) forecast target
y = make_multistep_target(y, steps=1).dropna()

# Shifting has created indexes that don't match. Only keep times for
# which we have both targets and features.
y, X = y.align(X, join='inner', axis=0)
## Feature Engineering

X = feature_engineer_X(X)
#X.drop("year", axis=1, inplace=True)
X['demand_modeled'] = y_demand_fitted
#X.drop('y_lag_1', axis=1,  inplace=True)

X.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2103 entries, 2015-01-01 to 2020-10-06
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   y_lag_1         2103 non-null   float64
 1   dayofweek       2103 non-null   int64  
 2   quarter         2103 non-null   int64  
 3   month           2103 non-null   int64  
 4   year            2103 non-null   int64  
 5   dayofyear       2103 non-null   int64  
 6   dayofmonth      2103 non-null   int64  
 7   weekofyear      2103 non-null   int64  
 8   demand_modeled  2103 non-null   float64
dtypes: float64(2), int64(7)
memory usage: 164.3 KB
y.head(3)
y_step_1
date
2015-01-01 3.243908
2015-01-02 3.500710
2015-01-03 3.542837
X.head(4)
y_lag_1 dayofweek quarter month year dayofyear dayofmonth weekofyear demand_modeled
date
2015-01-01 0.000000 3 1 1 2015 1 1 1 99973.035996
2015-01-02 3.243908 4 1 1 2015 2 2 1 128306.386960
2015-01-03 3.500710 5 1 1 2015 3 3 1 140932.607065
2015-01-04 3.542837 6 1 1 2015 4 4 1 105635.768575
# Create splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)
#from sklearn.multioutput import MultiOutputRegressor

pp = {'colsample_bytree': 0.6, 'gamma': 0.075, 'learning_rate': 0.04, 'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 200, 'subsample': 0.7}
price_model = XGBRegressor(**pp)
#price_model2 = MultiOutputRegressor(XGBRegressor())
price_model.fit(X_train, y_train)

y_fit = pd.DataFrame(price_model.predict(X_train), index=X_train.index, columns=y.columns)
y_pred = pd.DataFrame(price_model.predict(X_test), index=X_test.index, columns=y.columns)
train_rmse = mean_squared_error(y_train, y_fit, squared=False)
test_rmse = mean_squared_error(y_test, y_pred, squared=False)
print((f"Tuned model: Train RMSE: {train_rmse:.3f}\n" f"Test RMSE: {test_rmse:.3f}"))
Tuned model: Train RMSE: 0.215
Test RMSE: 0.374
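The pp dictionary above holds hyperparameters that were already tuned. One way such values could be obtained is a randomized search with time-ordered cross-validation; a minimal sketch (the parameter ranges are illustrative, not the search that produced pp):

# illustrative hyperparameter search with time-series cross-validation; ranges are placeholders
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

param_distributions = {
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.02, 0.04, 0.08],
    'n_estimators': [100, 200, 400],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.6, 0.8, 1.0],
}
search = RandomizedSearchCV(
    XGBRegressor(), param_distributions, n_iter=20,
    cv=TimeSeriesSplit(n_splits=4),
    scoring='neg_root_mean_squared_error', random_state=0,
)
search.fit(X_train, y_train.y_step_1)
print(search.best_params_)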
# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 4))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
)
%config InlineBackend.figure_format = 'retina'


def plot_multistep(y, every=1, ax=None, palette_kwargs=None):
    palette_kwargs_ = dict(palette='husl', n_colors=16, desat=None)
    if palette_kwargs is not None:
        palette_kwargs_.update(palette_kwargs)
    palette = sns.color_palette(**palette_kwargs_)
    if ax is None:
        fig, ax = plt.subplots()
    ax.set_prop_cycle(plt.cycler('color', palette))
    #ax.set_yscale('log')
    for date, preds in y[::every].iterrows():
        preds.index = pd.period_range(start=date, periods=len(preds))
        preds.plot(ax=ax)
    return ax
train_rmse = mean_squared_error(y_train, y_fit, squared=False)
test_rmse = mean_squared_error(y_test, y_pred, squared=False)
print((f"Train RMSE: {train_rmse:.3f}\n" f"Test RMSE: {test_rmse:.3f}"))
MAE = np.mean(np.absolute(y_test.y_step_1 - y_pred.y_step_1))
MAE_Per = 100* MAE/np.mean(y_test.y_step_1)
print(MAE)
print(MAE_Per)
Train RMSE: 0.215
Test RMSE: 0.374
0.2767522987811786
6.698853940866162
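Because the model works on log prices, these error figures are in log units. A short conversion gives a rough feel for what they mean on the original AUD scale (an approximation, assuming roughly symmetric log errors):

# translate log-scale errors into approximate multiplicative errors on the AUD price scale
test_rmse_log = 0.374   # from the output above
test_mae_log = 0.2768   # from the output above
print(f'RMSE of {test_rmse_log} in log space ~ a typical error factor of {np.exp(test_rmse_log):.2f}x')
print(f'MAE of {test_mae_log} in log space ~ a factor of {np.exp(test_mae_log):.2f}x, i.e. roughly {(np.exp(test_mae_log) - 1) * 100:.0f}% on the AUD scale')

So the 6.7% figure printed above is relative to the log price and understates the typical relative error on the AUD scale, which is closer to 30%.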
axs = y_train.plot(
    color='0.25', figsize=(13, 8), marker = 'o',
    sharex=True,
)
axs = y_test.plot(
    color='0.5', #subplots=True, 
    sharex=True, ax=axs, marker = 'o',
)
axs = y_fit.plot(
    color='C0', #subplots=True, 
    sharex=True, ax=axs,
)
axs = y_pred.plot(
    color='C3', #subplots=True, 
    sharex=True, ax=axs, label="modeled price"
)
axs.legend(["observed prices (train data)", "observed prices (test data)", "modeled price (train data)", "modeled price (test data)"])
plt.ylabel('prices (log transformed)')
plt.title('Observed and modeled prices')
plt.show()

[figure: observed and modeled (log) prices, train and test]

axs = y_test[250:600].plot(
    color='0.25', figsize=(13, 9),
    marker = 'o',
    sharex=True,
)

axs = y_pred[250:600].plot(
    color='C3',
    linewidth=2, #linestyle = '--',
    sharex=True, ax=axs,
    label = 'Forecast'
)

"""axs = y_pred2[250:600].plot(
    color='green',
    linewidth=2, #linestyle = '--',
    sharex=True, ax=axs,
    label = 'Forecast'
)"""

axs.legend(['Actual', 'Forecast'])
plt.ylabel('prices (log transformed)')
plt.title('Zoomed-in detail view of the individual stepwise forecast prices (test data)')
plt.show()

[figure: zoomed-in view of stepwise forecast vs actual prices (test data)]

axs = y_test[350:600].plot(
    color='0.25', figsize=(13, 9),
    marker = 'o',
    sharex=True,
)

axs = y_pred[350:600].plot(
    color='C3',
    linewidth=2, #linestyle = '--',
    sharex=True, ax=axs
)

"""axs = y_pred2[350:600].plot(
    color='green',
    linewidth=2, linestyle = '--',
    sharex=True, ax=axs
)
"""
axs.legend(['Actual', 'Forecast', 'Forecast (unoptimized model)'])
plt.ylabel('prices (log transformed)')
plt.title('Zoomed in detail view of the indiviual stepwise forecast prices (test data)')
plt.show()

[figure: further zoomed-in view of stepwise forecast vs actual prices (test data)]

from xgboost import plot_importance
plot_importance(price_model, title='Feature importance in our price model')
plt.show()

[figure: feature importance in the price model]

Forecast Price for 12 & 18 months

X_12M = pd.DataFrame(horizon_index_12M.dayofweek, index=horizon_index_12M, columns=['dayofweek'])
X_18M = pd.DataFrame(horizon_index_18M.dayofweek, index=horizon_index_18M, columns=['dayofweek'])

X_12M = make_lags(X_12M, lags=lags).fillna(0.0)
X_18M = make_lags(X_18M, lags=lags).fillna(0.0)

X_12M = feature_engineer_X(X_12M)
#X_12M.drop("year", axis=1, inplace=True)
X_12M['demand_modeled'] = y_demand_forecast_12M  # column name aligned with the training feature 'demand_modeled'
X_18M = feature_engineer_X(X_18M)
#X_18M.drop("year", axis=1, inplace=True)
X_18M['demand_modeled'] = y_demand_forecast_18M  # column name aligned with the training feature 'demand_modeled'
#X_12M.drop('dayofweek', axis=1, inplace=True)
#X_18M.drop('dayofweek', axis=1, inplace=True)

For this forecast we introduce a manual price-trend adjustment, because we believe prices will revert to the range of 2015-2016 and of 2020 (the latter could not be used for training the model, as it served as test data). Structural effects in the market probably pushed wholesale electricity prices in 2017-2019 into a log-price band between 4 and 5 instead of below 4; possible reasons include tighter supply (decommissioned power plants) and the specific costs of transforming Australia's energy mix towards sustainability. These hypotheses should be confirmed with local domain experts. Absent such input, I do not expect a major change in the price trend except a reversion to the earlier price range, which has re-established itself in 2020 year to date. Only a qualitative assessment can give guidance on the overall price level; the model itself can only predict seasonal and cyclical spikes at the daily level, which are of less interest here.
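The adjustment applied in the next cell, pricetrend_adj = -0.5, acts in log space, so it corresponds to a multiplicative change of the price level; a quick check of its size:

# a log-space shift of -0.5 is multiplicative on the AUD scale
adjustment = -0.5
print(f'exp({adjustment}) = {np.exp(adjustment):.2f}, i.e. roughly a {100 * (1 - np.exp(adjustment)):.0f}% lower price level')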

#y_pred = pd.DataFrame(model.predict(X_12M), index=horizon_index_12M, columns=['price_forecast'])
pricetrend_adj = -0.5

y_price_forecast_12M = pd.DataFrame(price_model.predict(X_12M)+pricetrend_adj, index=horizon_index_12M, columns=['price_forecast'])
y_price_forecast_18M = pd.DataFrame(price_model.predict(X_18M)+pricetrend_adj, index=horizon_index_18M, columns=['price_forecast'])

axs = y_price_forecast_12M.plot(
    color='red', figsize=(15, 8), subplots=True, sharex=True, alpha=0.8

)
axs = y_price_forecast_18M.plot(
    color='red', linestyle='--', subplots=True, sharex=True, alpha=0.6, ax=axs,

)
#axs = y_pred[-700:].plot(
#    color='blue', subplots=True, sharex=True, ax=axs,
#)
axs = np.log(df.price[-700:]).plot(
    color='black', marker = 'o', subplots=True, sharex=True, ax=axs, alpha=0.5
)
#axs = y_pred_boosted.plot(
#    color='C3', subplots=True, sharex=True, ax=axs,
#)
#axs.legend([])
plt.legend(['Price forecast 12 months','Price forecast 18 months', 
            #'price fitted', 
            'actual price'])
plt.title('Price forecast for 12 & 18 months')
plt.ylabel('prices (log transformed)')
plt.show()

[figure: price forecast for 12 & 18 months vs recent actual prices (log transformed)]

fig, axes = plt.subplots(1, 10, sharey=True, figsize=(14,8))
fig.suptitle('Price distributions comparison between actual prices (blue) and modeled prices (orange) ex outliers')
pal = sns.color_palette("PuBu", 12); pal2=sns.color_palette("OrRd", 10)
sns.boxplot(data=df['2015'],  y='price', showfliers = False, ax=axes[0], color=pal[4])
axes[0].set_title('year 2015')
sns.boxplot(data=df['2016'],  y='price', showfliers = False, ax=axes[1], color=pal[5])
axes[1].set_title('year 2016')
sns.boxplot(data=df['2017'],  y='price', showfliers = False, ax=axes[2], color=pal[6])
axes[2].set_title('year 2017')
sns.boxplot(data=df['2018'],  y='price', showfliers = False, ax=axes[3], color=pal[7])
axes[3].set_title('year 2018')
sns.boxplot(data=df['2019'],  y='price', showfliers = False, ax=axes[4], color=pal[8])
axes[4].set_title('year 2019')
sns.boxplot(data=df['2020'],  y='price', showfliers = False, ax=axes[5], color=pal[9])
axes[5].set_title('year 2020 YTD')
sns.boxplot(data=df,  y='price', showfliers = False, ax=axes[6], color=pal[10])
axes[6].set_title("years 2015-2020")
sns.boxplot(y=np.exp(y_fit.y_step_1), showfliers = False, ax=axes[7], color=pal2[3])
axes[7].set_title("modeled prices\n training")
sns.boxplot(y=np.exp(y_price_forecast_12M['price_forecast']), showfliers = False, ax=axes[8], color=pal2[5])
axes[8].set_title("forecast prices\n 12 months")
sns.boxplot(y=np.exp(y_price_forecast_18M['price_forecast']), showfliers = False,  ax=axes[9], color=pal2[5])
axes[9].set_title("forecast prices\n 18 months")
plt.show()

[figure: yearly price distributions of actual prices vs modeled and forecast prices]

The forecast prices lack external shocks and the price spikes they induce, and are therefore confined to relatively narrow ranges compared with the observed prices. Such shocks cannot be modeled from the given data. Price spikes and volatility are generally expected to increase in this electricity market (see figure): it is a market in which base load from less desirable conventional power plants can no longer be taken for granted. For our purpose, where the pricing committee needs guidance for contract price negotiations, I assume we want to plan the financial numbers conservatively and do not really need daily resolution of the price development, so we could also consider a coarser SARIMA model on half-monthly average prices.
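Such a coarser model is not developed in this report, but a minimal sketch of what it could look like, assuming half-monthly mean prices and illustrative, untuned order parameters:

# sketch of a SARIMA model on half-monthly mean prices; the orders are illustrative, not tuned
from statsmodels.tsa.statespace.sarimax import SARIMAX

price_halfmonthly = df.price.resample('SM').mean()
sarima = SARIMAX(price_halfmonthly, order=(1, 0, 1), seasonal_order=(1, 0, 1, 24))
sarima_fit = sarima.fit(disp=False)
sarima_forecast = sarima_fit.forecast(steps=24)  # roughly 12 months of half-monthly steps
print(sarima_fit.summary())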

3. Provide guidance on how much revenue the energy storage venture could generate per year using retail prices and a 70MWh storage system.

modeling storage business

import random
# set up required cols
df['stored_energy'] = 0
df['revenue_negprices'] = 0
df['revenue_regular'] = 0
df['year'] = df.index.year

# function to determine daily if and how much energy is favourable to be stored and how much revenue could have been made (ex post)
def energy_storage(df, index, row, MWh):
        # sell stored energy as soon as price is somehow favourable at avg spot price
        loc = df.index.get_loc(index) # integer index
        if df.iloc[loc-1]['stored_energy'] > 0:
            df.loc[index, 'revenue_regular'] = df.iloc[loc-1]['stored_energy'] * row.price_positive
            df.loc[index, 'stored_energy'] = 0
        
        # unfavourable market prices for selling electricity
        if row.price_negative < 0:
            # A. we store our own produced PV energy free of cost or 
            # B. we sell our storage capacity for market price and earn revenue by 'buying' electricity at negative prices
            # the chance is simplified and assumed to 50%/50% suntime/nighttime
            suntime = random.choice([0, 1])
            if row.demand_neg_price > MWh * 2 and suntime == 1: # daytime: the negative-price window is large enough to fill the 70MWh storage from our own PV
                df.loc[index, 'stored_energy'] = MWh
                df.loc[index, 'revenue_negprices'] = 0 # not yet sold, but we may have avoided buying spot energy for storage
            elif row.demand_neg_price > MWh * 2 and suntime == 0: # night: we are effectively paid to charge from the grid at negative prices
                df.loc[index, 'stored_energy'] = MWh
                df.loc[index, 'revenue_negprices'] = MWh * row.price_negative * -1

for index, row in df.iterrows():        
    energy_storage(df, index, row, MWh=70)  
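To make the mechanics concrete, here is a single hypothetical arbitrage cycle with the 70MWh system; the prices are made-up example values, not taken from the data:

# illustrative single arbitrage cycle; the prices are example values only
storage_mwh = 70
night_negative_price = -30.0    # AUD/MWh received for charging during a negative-price interval
next_day_positive_price = 40.0  # AUD/MWh average spot price when discharging the next day

charge_revenue = storage_mwh * abs(night_negative_price)   # 2,100 AUD earned while charging
discharge_revenue = storage_mwh * next_day_positive_price  # 2,800 AUD earned when selling
print(f'Revenue for this cycle: {charge_revenue + discharge_revenue:,.0f} AUD')

In the daytime case of the simulation only the discharge leg earns revenue, since the stored PV energy is free but no negative-price revenue is booked.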
sns.set(rc={"figure.figsize":(13, 7)})

df['total_storage_revenue'] = df.revenue_negprices + df.revenue_regular
fig, axes = plt.subplots(3, 1, figsize=(13,15), gridspec_kw={'height_ratios': [1, 1,1]})

axes[0] = df.groupby(['year', 'quarter'])[['revenue_negprices','revenue_regular']].sum().plot(kind='bar', stacked=True, ax=axes[0])
axes[0].legend(['Selling storage capacity (revenue from negative prices)', 'Re-selling stored electricity at favourable prices'])
axes[0].set_title('Revenues by Selling storage capacity & Re-(selling) stored electricity at favourable prices')
axes[0].set_ylabel('Revenue in AUD')

df['total_storage_revenue'] = df.revenue_negprices + df.revenue_regular
axes[1] = df.groupby(['year'])[['revenue_negprices','revenue_regular']].sum().plot(kind='bar', stacked=True, ax=axes[1])
axes[1].legend(['Selling storage capacity (revenue from negative prices)', 'Re-selling stored electricity at favourable prices'])
axes[1].set_title('Revenues by Selling storage capacity & Re-(selling) stored electricity at favourable prices')
#ax = plt.gca() # get axis
axes[1].set_xticks([0,1,2,3,4,5])
axes[1].set_xticklabels(['2015', '2016', '2017', '2018', '2019', '2020 YTD'], rotation=45) 
axes[1].set_ylabel('Revenue in AUD')

rev_yearly = df.groupby(['year'])[['revenue_negprices','revenue_regular']].sum()
rev_yearly['total'] = rev_yearly.revenue_negprices + rev_yearly.revenue_regular
axes[2] = sns.boxplot(data=rev_yearly, y='total')
axes[2].set_title('Yearly Total Electricity Storage Revenue distribution 2015-2020 YTD')
axes[2].set_ylabel('Total revenue per year in AUD')
plt.show()

[figure: storage revenues by quarter and by year; distribution of yearly totals]

print((f"Yearly 25% Percentile: {rev_yearly['total'].describe()[4]:.0f}"))
print((f"Yearly 75% Percentile: {rev_yearly['total'].describe()[6]:.0f}"))
print((f"Yearly Average: {rev_yearly['total'].describe()[1]:.0f}"))

Yearly 25% Percentile: 108542
Yearly 75% Percentile: 240657
Yearly Average: 174024
rev_yearly['total'].describe()
count         6.000000
mean     174024.206780
std       74740.389903
min       92247.097306
25%      108542.000211
50%      178795.248394
75%      240656.883464
max      247905.967280
Name: total, dtype: float64
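Scaling the yearly figures by the 70MWh capacity gives a rough revenue per MWh of storage, which may be useful when comparing against per-MWh battery costs; a back-of-the-envelope calculation from the simulated average above:

# rough yearly revenue per MWh of storage capacity, based on the simulated average
avg_yearly_revenue = 174024  # AUD, mean of the simulated yearly totals above
capacity_mwh = 70
print(f'~{avg_yearly_revenue / capacity_mwh:,.0f} AUD of revenue per MWh of capacity per year')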
from statsmodels.tsa.seasonal import seasonal_decompose
fig2, axes = plt.subplots(2, 1, figsize=(13, 10), constrained_layout=False)
monthly = df.resample('M').mean()
sub1 = sns.barplot(y=monthly.frac_neg_price, x=monthly.index, palette="Blues_d", ax = axes[0])
ticklabels = [monthly.index[int(tick)].strftime('%Y-%m') for tick in sub1.get_xticks()]
sub1.set_xticklabels(ticklabels,rotation=90)
sub1.set_title('monthly mean')
decomp = seasonal_decompose(monthly.frac_neg_price,period=12)
sub2 = decomp.trend.plot(ax = axes[1])
sub2.set_title('isolated monthly trend component pointing upwards')
plt.suptitle('Monthly average of daily fractions, in which transactions traded at negative prices')
plt.subplots_adjust(hspace = 0.5)
plt.show()

[figure: monthly average fraction of negative-price intervals and its upward trend component]

As guidance on how much revenue the storage venture could produce per year, the charts above give a range: depending on how conservatively you want to plan, use the first quartile, the third quartile, or simply the yearly average of about 174K AUD, and weigh it against the costs involved. The trend towards negative prices is strong, which can be explained by the growing share of solar and wind generation: their output is more volatile, and so are prices, with spikes both up and down.