Oil Series in R


“Charts are great for predicting the past.” -Peter Lynch

I have not dealt with time series in practice, but I have certainly read about them (mostly at school) and have some idea of how the analysis is carried out. But it is well known that what is told in textbooks on statistics and machine learning does not always reflect the real situation.

I guess a lot of people follow the pirouettes performed by the oil price curve. The chart looks either chaotic or too regular, so making any predictions from it is quite a thankless job. Of course, we could unleash the full power of statistical, economic, mathematical and expert methods on the time series, but let’s stick to technical analysis, done in R, of course.

When working with regular time series, we can use a standard approach:

  1. Analyze the series visually.
  2. Decompose the series and analyze its components: seasonality, cyclicity and trend.
  3. Build the mathematical model and make predictions.

There is a handy source of data, Quandl, which provides interfaces for Matlab, Python and R. For R, it is enough to install a single package: install.packages("Quandl"). I am interested in the Europe Brent Crude Oil Spot Price series, that is, the spot price of Brent oil.

(Below are three series at different levels of detail: daily, weekly and monthly.)

library(Quandl)    # data access
library(zoo)       # index() for zoo series
library(forecast)  # ndiffs(), BoxCox.lambda(), nnetar(), auto.arima(), tbats()
library(tseries)   # adf.test()
oil.ts <- Quandl("DOE/RBRTE", trim_start="1987-11-10", trim_end="2015-01-01", type="zoo")
oil.tsw <- Quandl("DOE/RBRTE", trim_start="1987-11-10", trim_end="2015-01-01", type="zoo", collapse="weekly")
oil.tsm <- Quandl("DOE/RBRTE", trim_start="1987-11-10", trim_end="2015-01-01", type="ts", collapse="monthly")
plot(oil.tsm, xlab="Year", ylab="Price, $", type="l")
lines(lowess(oil.tsm), col="red", lty="dashed")

Oil Price

Considering the prices at a scale of decades, we can see several spikes and falls, as well as the direction of the trend. But it’s hard to make any significant estimates, so we’ll examine the series components.

plot(decompose(oil.tsm, type="multiplicative"))
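As a side note on what type="multiplicative" means, here is a minimal sketch on the AirPassengers series that ships with R. It is only a stand-in for oil.tsm (an assumption made so the example runs without Quandl), and the variable names are mine. In the multiplicative model the observed value is the product of the trend, seasonal and random components:

```r
# AirPassengers is a built-in monthly series, used here only as a stand-in
parts <- decompose(AirPassengers, type = "multiplicative")

# Multiplicative model: observed = trend * seasonal * random
rebuilt <- parts$trend * parts$seasonal * parts$random

# The trend is a centered moving average, so 6 months at each end are NA
ok <- !is.na(rebuilt)
max.error <- max(abs(rebuilt[ok] - AirPassengers[ok]))

# Seasonal factors: one multiplier per month, normalized to average 1
seasonal.factors <- round(parts$figure, 3)
```

A seasonal factor above 1 marks a month that tends to sit above the trend, and a factor below 1 marks a month that tends to sit below it.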

As for the trend, the situation is quite clear: the 21st century brought an upward trend that was steady until recently (except for some interesting years), so the series is non-stationary. The Augmented Dickey–Fuller test agrees: with a p-value of 0.2574, it cannot reject the null hypothesis of a unit root.

> adf.test(oil.tsm, alternative=c('stationary'))
    Augmented Dickey-Fuller Test
data:  oil.tsm
Dickey-Fuller = -2.7568, Lag order = 6, p-value = 0.2574
alternative hypothesis: stationary

On the other hand, we can say with a relatively high degree of confidence that the first differences of the series are stationary, so the series is integrated of order one (a difference-stationary series). This allows us to apply the autoregressive integrated moving average (ARIMA) model.

> adf.test(diff(oil.tsm), alternative=c('stationary'))
    Augmented Dickey-Fuller Test
data:  diff(oil.tsm)
Dickey-Fuller = -8.0377, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary
> ndiffs(oil.tsm)
[1] 1
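The effect of differencing is easy to reproduce on a simulated series. The sketch below uses a random walk, which is integrated of order one by construction. It is not the oil data, and PP.test() (the Phillips-Perron test from base R) is used here only so that the example runs without the tseries package. It is a stand-in for adf.test(), not the test used above.

```r
set.seed(42)
n <- 1000
walk  <- cumsum(rnorm(n))   # random walk: integrated of order 1 by construction
steps <- diff(walk)         # first differences recover the white-noise shocks

# Lag-1 autocorrelation: close to 1 for the walk, close to 0 for its differences
acf.walk  <- acf(walk,  lag.max = 1, plot = FALSE)$acf[2]
acf.steps <- acf(steps, lag.max = 1, plot = FALSE)$acf[2]

# Phillips-Perron unit-root test (null hypothesis: the series has a unit root)
pp.steps <- PP.test(steps)$p.value
```

A small p-value for the differenced series means the unit root is rejected after a single difference, which is the same picture that ndiffs() reports for oil.tsm.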

Besides, it turns out that there is a seasonal component, which is hard to see in the overall chart. Taking a closer look at it, in addition to quite high volatility, we can see two price jumps during the year (which may be associated with increased oil consumption in winter and during the holiday season). There is also a random component, whose weight increases especially in crisis years (for example, the recession of 2008).

Sometimes it is preferable to work with the data after a one-parameter Box-Cox transformation, which stabilizes the variance and brings the data to a more standard form:

L <- BoxCox.lambda(ts(oil.ts, frequency=260), method="loglik")
Lw <- BoxCox.lambda(ts(oil.tsw, frequency=52), method="loglik")
Lm <- BoxCox.lambda(oil.tsm, method="loglik")
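For reference, the one-parameter Box-Cox transformation itself is short enough to write by hand. The sketch below is a plain base R version with hypothetical helper names (box.cox, inv.box.cox) and made-up prices. The forecast package provides its own BoxCox() and InvBoxCox(), which should be preferred in practice.

```r
# One-parameter Box-Cox transform for positive data;
# lambda = 0 is defined as the log transform
box.cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform, used to bring forecasts back to the original scale
inv.box.cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

prices <- c(20, 45, 110, 60)        # made-up positive values
z    <- box.cox(prices, 0.5)        # transformed scale
back <- inv.box.cox(z, 0.5)         # round trip to the original scale
```

With lambda = 1 the data are only shifted by one, and as lambda moves towards 0 large values are compressed more and more strongly, which is what evens out the variance of a series whose swings grow with its level.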

As for the most delicate topic, namely, extrapolation, the authors of the article titled “Crude Oil Price Forecasting Techniques: a Comprehensive Review of Literature” note that, depending on the length of the time period, the applicability of models is as follows:

  1. Nonlinear models, such as neural networks and support vector machines, are the most suitable for the mid-term and long-term periods.
  2. ARIMA often outperforms neural networks over the short term.

With the formalities out of the way, we will use the nnetar() function from the forecast package, which builds a neural network model of the series. We’ll do this for three series, from the most detailed (daily) to the least detailed (monthly), and see what happens in the mid-term, for example over 2 years (the forecast is displayed in blue in the charts).
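To make the idea of a neural network autoregression more concrete, here is a heavily simplified sketch: lagged values of the series are fed into a small single-hidden-layer network, and multi-step forecasts are produced by feeding each prediction back in as the newest lag. This is not the nnetar() implementation (which, among other things, averages several networks). The series, the number of lags p and all variable names are assumptions, and the nnet package is assumed to be installed.

```r
library(nnet)  # recommended package, shipped with most R installations

set.seed(1)
y <- as.numeric(scale(AirPassengers))  # toy series, standardized (stand-in data)
p <- 12                                # number of lagged inputs (assumption)

# embed() returns rows (y[t], y[t-1], ..., y[t-p]); column 1 is the target
lagged <- embed(y, p + 1)
target <- lagged[, 1]
inputs <- lagged[, -1, drop = FALSE]

# Single hidden layer with 3 units and a linear output for regression
fit <- nnet(inputs, target, size = 3, linout = TRUE, trace = FALSE)

# Iterated forecast: each prediction becomes the newest lag for the next step
h <- 24
history <- y
for (i in seq_len(h)) {
  newest  <- rev(tail(history, p))     # order: y[t], y[t-1], ..., y[t-p+1]
  pred    <- predict(fit, matrix(newest, nrow = 1))
  history <- c(history, as.numeric(pred))
}
nn.forecast <- tail(history, h)
```

Because every forecast step is built from earlier forecasts, a network that has memorized the most recent pattern will tend to keep replaying it over a long horizon.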

# Fit NN for long-run
fit.nn <- nnetar(ts(oil.ts, frequency=260), lambda=L, size=3)
fcast.nn <- forecast(fit.nn, h=520, lambda=L)
fit.nnw <- nnetar(ts(oil.tsw, frequency=52), lambda=Lw, size=3)
fcast.nnw <- forecast(fit.nnw, h=104, lambda=Lw)
fit.nnm <- nnetar(oil.tsm, lambda=Lm, size=3)
fcast.nnm <- forecast(fit.nnm, h=24, lambda=Lm)
par(mfrow=c(3, 1))
plot(fcast.nn, include=1040)
plot(fcast.nnw, include=208)
plot(fcast.nnm, include=48)

Overfitting is what shows up clearly in the upper chart: the neural network has caught the last pattern in the series and begun to copy it. In the middle chart, the network not only copies the last pattern but also combines it well with the trend, which adds some realism to the prediction. The lower chart displays… some strange curve. The charts illustrate well how the predictions change depending on the smoothing of the data. In any case, we cannot trust predictions this far ahead for goods with high (for various reasons) volatility. Therefore, let’s move on to the short-term period and compare several different models: ARIMA, TBATS and the neural network. We will use the data for the second half of 2014: July through November for fitting, with December set aside in the short.test series for testing.

# Fit ARIMA, NN and TBATS for short-run
short <- ts(oil.ts[index(oil.ts) > "2014-06-30" & index(oil.ts) < "2014-12-01"], frequency=20)
short.test <- as.numeric(oil.ts[index(oil.ts) >= "2014-12-01",])
h <- length(short.test)
fit.arima <- auto.arima(short, lambda=L)
fcast.arima <- forecast(fit.arima, h, lambda=L)
fit.nn <- nnetar(short, size=7, lambda=L)
fcast.nn <- forecast(fit.nn, h, lambda=L)
fit.tbats <- tbats(short, lambda=L)
fcast.tbats <- forecast(fit.tbats, h, lambda=L)
par(mfrow=c(3, 1))
plot(fcast.arima, include=3*h)
plot(fcast.nn, include=3*h)
plot(fcast.tbats, include=3*h)

Oil Prices

After overfitting, the neural network went a bit crazy, while ARIMA produced quite an interesting curve, interesting in how close it is to the real picture. Below is a comparison of each model’s predictions with the real December data, along with the mean absolute percentage error:

par(mfrow=c(1, 1))
plot(short.test, type="l", col="red", lwd=5, xlab="Day", ylab="Price, $", main="December prices",
     ylim=c(min(short.test, fcast.arima$mean, fcast.tbats$mean, fcast.nn$mean),
            max(short.test, fcast.arima$mean, fcast.tbats$mean, fcast.nn$mean)))
lines(as.numeric(fcast.nn$mean), col="green", lwd=3,lty=2)
lines(as.numeric(fcast.tbats$mean), col="magenta", lwd=3,lty=2)
lines(as.numeric(fcast.arima$mean), col="blue", lwd=3, lty=2)
legend("topright", legend=c("Real Data","NeuralNet","TBATS", "ARIMA"), 
       col=c("red","green", "magenta","blue"), lty=c(1,2,2,2), lwd=c(5,3,3,3))

# Mean absolute percentage error of forecast f against the real values r
mape <- function(r, f){
  len <- length(r)
  return(sum( abs(r - f$mean[1:len]) / r) / len * 100)
}
mape(short.test, fcast.arima)
mape(short.test, fcast.nn)
mape(short.test, fcast.tbats)

  • ARIMA: 1.99%
  • NNet: 18.26%
  • TBATS: 4.00%
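To see how the error measure behaves, here is a small worked example on made-up numbers. The function mape.vec is a hypothetical variant of the function above that takes two plain numeric vectors instead of a forecast object.

```r
# MAPE on plain vectors: mean of |actual - predicted| / |actual|, in percent
mape.vec <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  mean(abs(actual - predicted) / abs(actual)) * 100
}

actual    <- c(100, 50)   # made-up real prices
predicted <- c(90, 55)    # made-up forecasts

# Relative errors are 10/100 and 5/50, both 0.1, so the MAPE is 10%
err <- mape.vec(actual, predicted)
```

Note that the same absolute miss weighs more when the real price is low, which matters for a series that fell as sharply as oil did in late 2014.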

In Place of a Summary

I am not going to comment on the long-term forecasts: they are obviously wrong and inappropriate in this situation. ARIMA, however, showed quite good results over the short term. We should also pay attention to the following facts. Oil prices dropped:

  1. by 5% in September;
  2. by 10% in October;
  3. by 15% in November;
  4. December?

These figures hint that the process driving changes in oil prices is far from one governed by random parameters.


