15 Feb Electricity consumption Outlier detection
In today's world, electric energy consumption is at the center of attention because of environmental pollution, and more and more effort is devoted to reducing the amount of energy used. In this regard, Italy ranks 12th among countries for electricity consumption, as shown in the following link:
With 291 billion kWh per year and an average of roughly 4,600 kWh per person per year, we are one of the main countries responsible for excessive energy production (weighted by population size).
In this scenario, it is interesting to automatically detect ‘outlier’ energy consumption on a particular day. The ability to spot anomalies in day-to-day consumption is useful for the firms responsible for producing and distributing energy, since the energy cannot otherwise be stored. On the other hand, excessive demand for energy at a specific time of day might become a problem as well.
In this article, we propose a combination of two unsupervised models to detect anomalous energy consumption during the year 2011. The point of the analysis, however, is not simply to detect globally anomalous consumption but to identify ‘irregularities’ in the shape of the time series.
This article is structured as follows. First, we describe how the dataset is created and offer a graphical analysis to understand the connection between different household characteristics and the electricity consumption series. Second, we train a deep learning autoencoder to encode the series and visualize it. At this point, a Local Outlier Factor (LOF) model is applied to the reduced data to detect anomalies with respect to the ordinary global consumption trend (another possibility might be the pylof model).
Finally, we sum up the results and explain the possible connection with household characteristics as well as the time features describing the outlier group.
Electricity consumption Dataset creation
To apply dimensionality reduction via a deep learning autoencoder, we need to reshape the dataset of household electricity consumption. We start with a dataset of shape (35040, 377). In other words, we have 377 households and, for each one, a year-long time series sampled every 15 minutes (35040 observations per household). In this article, however, the interest lies in detecting anomalous daily consumption profiles (over the entire year) rather than anomalous households. For this reason, the data has been reshaped to (137605, 96), where each row corresponds to one day of the year for one household and each column reports the consumption in one quarter hour. Each household is thus repeated 365 times (377 × 365 = 137605), since 96 quarter hours make up one day (24 hours).
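The reshaping step can be sketched with NumPy as follows. This is a minimal sketch and not the code used for the article: it assumes the raw readings sit in an array with one row per 15-minute step and one column per household, and the function name `to_daily_profiles` and the two index arrays it returns are hypothetical.

```python
import numpy as np

def to_daily_profiles(readings, steps_per_day=96):
    """Turn a (time, household) array into one row per household-day."""
    n_steps, n_households = readings.shape
    if n_steps % steps_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    n_days = n_steps // steps_per_day
    # Transpose to (household, time), then cut each household's series
    # into consecutive blocks of `steps_per_day` readings.
    profiles = readings.T.reshape(n_households * n_days, steps_per_day)
    # Remember which household and which day each row came from.
    household_idx = np.repeat(np.arange(n_households), n_days)
    day_idx = np.tile(np.arange(n_days), n_households)
    return profiles, household_idx, day_idx

# Toy example: 2 days of 4 readings each for 3 households.
toy = np.arange(8 * 3).reshape(8, 3)
toy_profiles, toy_house, toy_day = to_daily_profiles(toy, steps_per_day=4)
print(toy_profiles.shape)  # (6, 4)
```

With the real dimensions, an input of shape (35040, 377) yields profiles of shape (137605, 96), and the two index arrays make it possible to join each row back to its household characteristics and calendar date.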
Apart from the consumption data, we added information about the date, such as holidays, weekends and seasons. Moreover, we have at our disposal information describing the house each observation pertains to, such as ‘Number of appliances’ (both white and brown), ‘Annual salary’ (in classes), ‘Family components’ and ‘Climate zone’. This knowledge is used primarily to understand the quality and the type of the outliers we are able to detect rather than for the detection itself.
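As an illustration, the calendar features could be derived with the standard library alone. The sketch below makes two assumptions that are not in the article: seasons are assigned by calendar month (December to February as winter, and so on), and the holiday set is only a small illustrative subset of fixed-date Italian public holidays, not the calendar actually used. The names `calendar_features`, `HOLIDAYS_2011` and `SEASON_BY_MONTH` are hypothetical.

```python
from datetime import date, timedelta

# Illustrative subset only (assumption): the article's holiday list is not given.
HOLIDAYS_2011 = {date(2011, 1, 1), date(2011, 8, 15), date(2011, 12, 25)}

# Assumption: seasons defined by calendar month.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def calendar_features(day, holidays=HOLIDAYS_2011):
    """Return the weekend, holiday and season flags for one date."""
    return {
        "weekend": day.weekday() >= 5,  # Monday is 0, so 5 and 6 are Sat/Sun
        "holiday": day in holidays,
        "season": SEASON_BY_MONTH[day.month],
    }

# One feature record per day of 2011 (365 days, not a leap year).
days_2011 = [date(2011, 1, 1) + timedelta(days=i) for i in range(365)]
features_2011 = [calendar_features(d) for d in days_2011]
```

Each of the 365 records can then be attached to every household's row for that day.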
Graphical descriptive statistical analysis
As always in my articles, I propose a descriptive statistical approach to better understand the structure underlying the data.
To build a multigraph representation of household electricity consumption, we summed all the readings over the 24 hours of each day (obtaining the total consumption per day) and then calculated the mean for each region in each season over the entire year. The following image reports four thematic maps of the total seasonal electricity consumption by region:
From the graph above, it is possible to notice an interesting pattern. Households in the north of Italy appear to consume less energy than households in the south of the country. This relationship seems to hold in all seasons and might be due to the lower availability of new, energy-efficient appliances in the south (though I’m just speculating).
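The aggregation behind the maps (a daily total per household, then a mean per region and season) might be sketched as follows. The function name and the idea of passing region and season labels as parallel sequences are assumptions made for illustration; the numbers in the toy call are made up and are not taken from the dataset.

```python
from collections import defaultdict

import numpy as np

def seasonal_mean_by_region(profiles, regions, seasons):
    """Mean daily total consumption for each (region, season) pair.

    profiles: array of shape (n_rows, 96), one household-day per row
    regions, seasons: one label per row
    """
    daily_total = np.asarray(profiles, dtype=float).sum(axis=1)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for total, region, season in zip(daily_total, regions, seasons):
        sums[(region, season)] += float(total)
        counts[(region, season)] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy call with made-up values: three household-days of two readings each.
toy_means = seasonal_mean_by_region(
    [[1, 1], [3, 3], [5, 5]],
    ["Lombardia", "Lombardia", "Sardegna"],
    ["winter", "winter", "winter"],
)
```

The resulting dictionary has one value per region and season, which is the quantity a thematic map would color.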
In addition, we propose a histogram of the number of observations per region, to better understand how well the thematic map above represents the regional data:
Notice that the most represented region is Lombardia, with more than 60 observations, while the least represented is Friuli-Venezia-Giulia, with just 3 subjects.
Lastly, we show a boxplot describing the distribution of household consumption by region, computed as the total over the entire year for each subject:
In this plot, consistent with the thematic map above, Sardegna appears to be undoubtedly the region with the highest consumption, with a moderate standard deviation. Calabria, on the other hand, reports a median in line with all the other regions, though its standard deviation is very pronounced.
Apart from the electricity consumption dataset, we have at our disposal data describing household characteristics. These data will not be used to detect outliers, but they might come in handy when we try to make sense of the outliers.
In the picture proposed below, we graphically illustrate four different categorical variables (Family components, Annual salary, Climate zone, Supply power [kW]):
The bar chart describing the supply power (bottom right) is not very interesting, since practically all the subjects have a 3 kW power supply. Moreover, the most common household type in our dataset is the couple (107 out of 377), and the most common salary class is medium (166 out of 377). Lastly, in terms of climate zone, Italy is essentially classified in class E, where each class is defined based on degree days:
“Heating degree day (HDD) is a measurement designed to quantify the demand for energy needed to heat a building. HDD is derived from measurements of outside air temperature. The heating requirements for a given building at a specific location are considered to be directly proportional to the number of HDD at that location. A similar measurement, cooling degree day (CDD), reflects the amount of energy used to cool a home or business.”
Here is the map that represents the classes:
Last but not least, we present four joint plots: ‘House surface’, ‘Annual salary’ and ‘Total appliances’ against total energy consumption, plus ‘Total appliances’ against ‘Household surface’:
The joint plots above show the bivariate plot as well as the univariate marginal distribution of each variable. Moreover, in the upper right corner of each plot, the Pearson correlation coefficient and the associated p-value are reported. In the three plots against electricity consumption, there is no significant correlation: the p-values are well above the usual threshold. The bottom right plot, reported out of curiosity, shows the correlation between house surface and the number of appliances, with a Pearson coefficient of 0.32.
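For readers who want to see what lies behind the numbers in the corner of each joint plot, here is a small sketch of the Pearson coefficient and of the t statistic commonly used to test whether it differs from zero. It uses only the standard library; the function names are hypothetical, and turning the t value into a p-value would additionally require the Student t distribution (available in statistical libraries).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t value for the null hypothesis r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r ** 2))
```

As a rough check, and assuming all 377 households enter the bottom right plot, `t_statistic(0.32, 377)` is about 6.5, far beyond the two-sided 5% critical value of roughly 1.97, which is consistent with that correlation being the only significant one.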
Deep outlier detection analysis
After graphically exploring the data, we move to a more technical approach in which we encode the electricity consumption time series in a latent space with 3 dimensions. First, we propose a simple autoencoder model with the following structure:
The representation above is obtained with the ‘Graphviz’ package, which offers an easy way to depict the underlying structure of our neural network.
As we can see from the structure of the network above, this is a regression problem in which the goal is to encode the input data (Input_Layer) in a latent space with 3 dimensions (Encoded_Layer). The output of this layer is what we will use to detect anomalous behavior in energy consumption. However, to be sure that the model encodes the data properly, we should be able to reconstruct the input vector (with 96 dimensions) starting from the encoded vector. Hence, the performance of the model can be evaluated by comparing the original 96-dimensional vector with its reconstruction from the latent space.
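To make the encode-then-reconstruct idea concrete, here is a deliberately small sketch written in plain NumPy. It is not the network used in the article: it assumes a single tanh bottleneck layer (96 to 3 to 96), a mean squared error loss, and plain mini-batch gradient descent with a fixed learning rate, whereas the article's model is deeper and was presumably trained with a deep learning framework and a more capable optimizer. All function and parameter names are hypothetical.

```python
import numpy as np

def encode(X, params):
    """Map rows of X to the low-dimensional latent space."""
    return np.tanh(X @ params["W_enc"] + params["b_enc"])

def decode(Z, params):
    """Map latent vectors back to the original dimension."""
    return Z @ params["W_dec"] + params["b_dec"]

def rmse(X, X_hat):
    """Root mean squared reconstruction error over all entries."""
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

def train_autoencoder(X, latent_dim=3, epochs=50, batch_size=200, lr=0.05, seed=0):
    """Fit the bottleneck autoencoder; return its weights and the RMSE per epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = {
        "W_enc": rng.normal(0.0, 0.1, (d, latent_dim)),
        "b_enc": np.zeros(latent_dim),
        "W_dec": rng.normal(0.0, 0.1, (latent_dim, d)),
        "b_dec": np.zeros(d),
    }
    history = []
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the rows at every epoch
        for start in range(0, n, batch_size):
            xb = X[order[start:start + batch_size]]
            z = encode(xb, params)
            err = decode(z, params) - xb
            g_out = 2.0 * err / err.size  # gradient of the MSE w.r.t. the output
            g_pre = (g_out @ params["W_dec"].T) * (1.0 - z ** 2)  # back through tanh
            grads = {
                "W_dec": z.T @ g_out,
                "b_dec": g_out.sum(axis=0),
                "W_enc": xb.T @ g_pre,
                "b_enc": g_pre.sum(axis=0),
            }
            for name in params:
                params[name] -= lr * grads[name]
        history.append(rmse(X, decode(encode(X, params), params)))
    return params, history
```

After training, `encode(X, params)` gives the 3-dimensional representation that would be passed on to the outlier detector, and `history` plays the role of the training loss curve. With a fixed learning rate this sketch converges slowly, so it is meant to show the mechanics rather than to reproduce the reported results.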
The following graph depicts the performance of our model:
Apart from some annoying fluctuations (due to the low batch size, fixed at 200), the proposed model behaves reasonably: both the training and the validation loss decrease consistently over the epochs, reaching an RMSE of 0.74.
Note that a larger minibatch size means more “accurate” gradients, in the sense that they are closer to the “true” gradient one would get from full-batch training (processing the entire training set at every step). There is a performance trade-off, though, inherent in batch size selection: a larger batch size is often more efficient computationally (up to a point), so more samples are processed per unit of time and each epoch may complete faster, but it also means that the number of SGD iterations per epoch decreases. Studies, however, seem to suggest that with deep networks it is preferable to take many small steps rather than fewer larger ones.
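The effect of the batch size on the number of updates is simple arithmetic, sketched below. The calculation assumes, purely for illustration, that all 137605 rows are used for training; the article also holds out a validation set whose size is not stated, so the real counts would be somewhat lower.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

N_ROWS = 137605  # assumption: every household-day row is used for training

for batch_size in (200, 2000, N_ROWS):
    print(batch_size, updates_per_epoch(N_ROWS, batch_size))
```

Under that assumption, a batch size of 200 gives 689 updates per epoch, a batch size of 2000 gives 69, and full-batch training gives a single update per epoch.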
The next step is the application of the LOF model, with which we detect outliers defined as the 1% most extreme values, considering a neighborhood of 570 points. It turns out that the number of neighboring points is quite important for appropriately detecting the observations that reflect our definition of an outlier.
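To show what the LOF score measures, here is a from-scratch sketch in NumPy. It is not the implementation used in the article (in practice one would normally rely on a library such as scikit-learn's `LocalOutlierFactor`, with the neighborhood size and a 1% contamination rate as parameters). The brute-force distance matrix below needs memory quadratic in the number of points, so it suits a sample of about a thousand encoded points rather than the full dataset. Function names are hypothetical.

```python
import numpy as np

def local_outlier_factor(X, n_neighbors):
    """LOF score for each row of X; values well above 1 suggest outliers."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (brute force).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :n_neighbors]  # indices of the k nearest
    nn_dist = np.take_along_axis(dist, nn, axis=1)
    k_dist = nn_dist[:, -1]  # distance to the k-th neighbor
    # Reachability distance from each point to each of its neighbors.
    reach = np.maximum(nn_dist, k_dist[nn])
    # Local reachability density (small constant guards against duplicates).
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)
    # Ratio of the neighbors' density to the point's own density.
    return lrd[nn].mean(axis=1) / lrd

def flag_outliers(scores, contamination=0.01):
    """Mark the points whose score lies in the top `contamination` share."""
    threshold = np.quantile(scores, 1.0 - contamination)
    return scores > threshold
```

A point in a sparse area surrounded by neighbors from a dense area gets a score far above 1, which is why the size of the neighborhood matters so much: it decides what counts as "local" when the densities are compared.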
In the following image, we report a dynamic 3D graph of a sample of 1000 observations extracted from the encoded dataset.
Notice that our model is quite good at detecting outliers (orange points). The purpose of combining two models, an autoencoder and a LOF model, lies in the fact that the former allows us to encode the data in a lower-dimensional space. This lower-dimensional representation enables us to inspect the data graphically and thus tune the outlier model appropriately. The LOF model, in fact, needs to consider a local neighborhood for each point. Without the possibility of plotting the data, it is quite difficult to choose appropriate parameters. This is a typical problem with unsupervised models, where we have no direct measure of the goodness of fit.
If we perform this analysis on the entire dataset, we can retrieve which days are more prone to anomalies and what the characteristics of the corresponding households are:
From the results shown above, there is no statistically significant difference between the mean characteristics of the outlier group and the overall mean values. However, we can still draw some interesting insights by simply comparing the two groups. Though the effect is not very pronounced, the outlier group seems to consist of high-salary households, whereas for the climate zone a visible difference can be noticed for the B and E classes. Moreover, outlier families mainly have 4 members. Finally, in terms of white appliances, the outlier group tends to own 3 to 4 devices, while the control group typically has 2 or 6.
Finally, we provide the same graphical analysis applied to the date-related features:
From the chart above, we may notice that the outlier group is more present on weekends and in autumn, though the difference is minimal.
In this article, we analyzed the household energy consumption of 377 families for the entire year 2011. The goal of the analysis was to detect anomalies in the consumption of energy on a specific day of the year. To accomplish the task, a combination of two unsupervised models was proposed.
In the first part of the article, we proposed a descriptive graphical analysis in which we explored the data. The maps suggest that energy consumption might be linked to the economic development of the region, since we see a lower energy demand in the north of Italy. In the second part of the analysis, we applied a deep neural network model with the aim of encoding the data in a lower-dimensional space (we passed from a 96-dimensional vector to a vector with 3 dimensions). We then used the encoded dataset to detect anomalies by means of a Local Outlier Factor model, which we tuned by graphical inspection.
Finally, we proposed a descriptive analysis to better understand the characteristics of the households identified in the outlier group. Apart from minor, though interesting, differences, there seems to be no variation with respect to the control group.
Hope you enjoyed the analysis… see you in the next article…
As always have fun with ML…