Avoiding Overconfidence in Predictions of Residential Energy Demand Through Identification of the Persistence Forecast Effect

Forecasting domestic electricity consumption is important for a wide range of modern power system solutions and smart applications that support network operation, grid stability, and demand-side management, most of which depend on robust and accurate predictions. The methods producing these predictions infer future load from statistical regularity in historical data. If such regularity is lacking, predictions then regress towards the most recently observed consumption value used in the input set. Predictions then follow the actual load data one step behind in time, potentially affecting the robustness of predictions and functionality of applications. Current evaluation methods do not detect this behaviour which may result in overconfidence in prediction results. In this study, we I) define and systematically analyse this behaviour, which we label the Persistence Forecast Effect and illustrate its impacts, II) propose a novel method, called 1-Step-Shifting, to detect its presence, and III) analyse and establish the relationship between irregularity in data and the effect. Further, we provide a case study applying state-of-the-art forecasting techniques to a real-world dataset of electricity consumption data from 69 households in order to demonstrate the Persistence Forecast Effect, its implications, and its relationship to statistical regularity in historical data.

households bring into the power systems [4]. To this end, households are increasingly equipped with smart sensors, storage devices, and distributed and renewable energy generators to optimise consumption and minimise waste through intelligent and automatic energy management systems [5], [6]. Such systems, which enable household energy users to better manage their consumption and system operators to maintain grid reliability and the supply/demand balance, inevitably rely on accurate and robust consumption forecasts [1], [7]. However, achieving accurate and robust consumption forecasts at the household level is considerably difficult owing to the high volatility and the lack of regularity in load demand [8], [9], which arise from a number of complex factors, such as dwellers' lifestyle, income, cultural background, occupancy, location, weather conditions and more [10], [11]. This volatility can adversely affect the robustness and reliability of forecasts [1], [9], [12], as time-series prediction methods infer future electricity consumption from historical load patterns [13], [14]. Note that we use the terms prediction and forecast interchangeably in this text.
When such regularity is lacking, we observe that state-of-the-art forecasting methods produce forecasts that approximate the most recently observed value. This results in a time-series of predicted values that is nearly identical to the time-series of observed values but systematically delayed by one step in time, similar to a persistence model [15]. We have labelled this the "Persistence Forecast Effect" (PFE) and explain it in detail in the Problem Statement section. To the best of our knowledge, this phenomenon has so far not been described or analysed in the literature. The impact of the PFE varies between applications, depending on their tolerance for temporal displacement of predictions. The PFE can jeopardise the functionality of smart applications that require precise timing of predictions, since an unnoticed PFE can result in overconfidence in predictions. For instance, battery charging/discharging scheduling and load shifting are important strategies for peak shaving [7], [12], and such strategies require precise forecasts with no temporal flexibility. However, in the presence of the PFE, forecasts naïvely reproduce the current load, causing batteries to be charged/discharged at a suboptimal time or tasks to be incorrectly scheduled, ultimately resulting in potentially exacerbated peaks rather than shaved ones; a behaviour also observed by [16] for other kinds of delays in predictions of peaks.
The PFE is a risk to all prediction contexts that exhibit volatile and uncertain load patterns, including small grids (also called "weak grids") [12]. Hence, the PFE is quite likely to be observed in predictions for such grids, which can negatively affect energy suppliers' decision-making for strategies with limited temporal flexibility, such as dynamic pricing, tariff adjustment, supply-demand balancing, and network-level peak reduction. However, the state-of-the-art evaluation metrics used in the demand forecasting literature are not able to detect the PFE. To avoid overconfidence in prediction results, the PFE should be taken into consideration regardless of the aggregation level or building type, so that steps to mitigate it can be taken, including reviewing the input feature set. Otherwise, it can undermine the robustness of the methodological decisions made and negatively affect the final applications and intelligent energy management systems, which may in turn cause larger, unintended problems in smart grid concepts.
The main contributions of our work are: i) we define and describe the PFE; ii) we review current evaluation metrics with respect to the PFE; iii) we propose a novel method for detecting the PFE; iv) we highlight the underlying causes of the PFE; and v) we investigate the PFE for both single-step and multi-step forecasts.
The remainder of the paper is structured as follows. In the next section, we define and explain the PFE, followed by a review of related literature. We then propose a method for detecting the PFE and introduce the experimental setup. After that, we present experimental results that illustrate the PFE and its implications. This is followed by a discussion of how the PFE manifests itself in multi-step forecasts. Then, we evaluate the association between irregular load patterns and the PFE, along with the importance of having historical energy data in the input set. Finally, we present our conclusions.

II. PROBLEM STATEMENT
In this section, we describe the PFE in single-step forecasts and motivate its identification and evaluation. Further, we explain the implications of the effect.
Electric load forecasting is a time-series forecasting problem. Time-series forecasting can be performed single-step or multi-step ahead. Single-step forecasting consists of predicting the value of the next time step ($y_{t+1}$) only, while multi-step forecasting is the task of predicting a range of sequential future values ($y_{t+i}$, $i = 1, 2, \ldots, H$, where $H$ is the absolute forecasting horizon). At present, various prediction methods exploiting correlations and similarities [13] are applied to solve time-series electrical load forecasting problems. These methods utilise a variety of "features" on which the values in the output domain are assumed to depend. Most of the time, these features include a certain number of historical data points (past observations of electricity consumption), for example in [9], [11], [14]. More formally, we can describe historical electric data as follows: given a time $t$, $E_t$ denotes the most recently observed electricity consumption value and $E_{t+1}$ refers to the electricity consumption value to be predicted. Hence, the historical load data can be expressed by $E_{t-i}$, $i = 0, 1, \ldots, K$, where $K$ is the number of data points.
Fig. 1. A comparison of PFE-affected predictions produced by a machine learning method and predictions produced by a naïve persistence model throughout a day. The error of the predictions affected by the PFE is shaded in grey. The naïve persistence model performs better than the machine learning method: the absolute error of the machine learning method is 6.723 kWh, while the error of the persistence model is lower at 6.526 kWh.
Even though historical data from the output domain can be a very powerful predictor of future values, they may lead to the PFE. In this study, we describe the PFE as a phenomenon that can be observed when regularity in the historical load data among the input features is sufficiently weak. In such a case, predictions ($E_{t+1}$) approximate the most recently observed electricity demand value ($E_t$). An example from the dataset in our case study is given in Fig. 1, showing observed values in blue and time-series predictions suffering from the PFE in orange. The main characteristic is that the shape of the predictions is almost identical to that of the observed values, except that the output domain values are displaced by one time step to the right (the future). In other words, the method returns almost the same value as the most recent observation as its prediction ($E_{t+1} \approx E_t$). Our succinct explanation is that, due to the volatility and pattern irregularity in the historical data, methods cannot learn enough and instead extrapolate from the most recent electricity demand value, exploiting the high correlation between consecutive data points in electricity load data. Also shown in Fig. 1 is the output of a persistence model. This is a well-known naïve method that is mostly used as a baseline for testing the prediction ability of machine learning algorithms [15], [17]. It simply returns the value of the most recent observation ($E_t$) as the prediction for the next time step ($E_{t+1}$), resulting in $E_{t+1} = E_t$. Based on the similarity between the predictions of the persistence model and the predictions suffering from the bias we investigate in this text, we have chosen the term "Persistence Forecast Effect" to describe the effect. In this randomly chosen example, the naïve persistence model has a lower absolute error than the LSTM RNN model.
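The persistence baseline described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy load values are assumptions introduced here, not data from the case study.

```python
def persistence_forecast(load):
    """Predict each next value as the most recent observation (E_{t+1} = E_t).

    The returned list is aligned with load[1:], i.e. it holds one forecast
    for every time step that has a predecessor.
    """
    return list(load[:-1])


def absolute_error(actual, predicted):
    """Sum of absolute differences between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


# Toy half-hourly load values in kWh (made up for illustration).
load = [0.2, 0.2, 0.9, 0.3, 0.3, 1.1]
preds = persistence_forecast(load)
targets = load[1:]
error = absolute_error(targets, preds)
```

A learned model is only worth deploying if its error on the same targets is clearly below that of this baseline.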
Evaluation metrics assess the prediction error from the discrepancy between the predicted and observed values. However, the PFE might result in overconfidence in evaluation metric results because the most popular evaluation metrics, e.g., Mean Squared Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), are not capable of identifying this effect by themselves. Smart systems or applications built on top of forecasts affected by the PFE will not perform as effectively as they would with data that enable the development of models that surpass the naïve persistence model. According to [18], a 1% rise in the prediction error translated to a roughly £10 million increase in annual operating costs for the United Kingdom in 1984. Given the increased level of automation in all parts of the economy, building smart and automated systems on top of predictions with an unnoticed PFE will likely cost considerably more today.
Evaluation metrics can also be used to compare alternative prediction methods. However, besides being a sign of poor time-series prediction in general, the PFE is also a direct threat to the validity of comparative studies that rank methods against each other. As we will demonstrate in the Results section, the PFE can result in inconsistent rankings or even rank reversal between prediction methods. This in turn threatens the transferability of findings from studies aiming to optimise forecasting methods or compare forecasting methods for a certain type of problem.

III. RELATED WORKS
Time-series forecasts are defined by both the timings and the amplitudes of series of events. Therefore, it is critical to predict the timing of an event correctly along with its amplitude. An error resulting from an event predicted to happen too early or too late is called phase error [19], [20]. In [21], [22], the authors illustrated that the phase error, together with bias and amplitude error, is one of the three main components forming RMSE and MSE. As a consequence, the phase error must also be handled properly in order to achieve better metric values. In one line of work, studies have focused on forecasting ramp events, which are sudden, significant fluctuations in time-series data within a short period of time [20]. These studies aim to predict the time of ramp events correctly alongside their amplitude. More recently, a growing interest has emerged in this topic in the solar (see [23], [24], [25]) and wind (see [26], [27], [28]) power production domains, to guard against the negative impacts of ramp events and thereby improve the safety, stability, and economics of power systems and energy storage devices. Further, [17] proposes a novel evaluation metric, Ramp Score, to evaluate the ability to predict significant ramp events.
Furthermore, considering the phase error, point-wise metrics such as MAPE, RMSE, and MSE are claimed to be inappropriate for time-series prediction evaluation [5], [16], [17], [29]. Point-wise metrics simply compare the observed and predicted values at each time step, and hence they lead to the double penalty effect (DPE). The DPE refers to the case in which point-wise metrics penalise temporally displaced predictions twice: first, where the event actually should be, and second, where the event is predicted to happen, even if the size and the amplitude of the event are correctly predicted in principle. In order to avoid this effect, the general recommendation is to tolerate and not penalise small and discontinuous displacements of predictions in time. The authors of [16] introduce an adjusted p-norm error measure that allows for such displacements. The main idea behind this method is to partly drop the time dimension and provide temporal flexibility to predictions during evaluation. Based on the tests they run, they find that their proposed metric is suitable and useful for volatile and irregular data, while the standard point-wise metrics are adequate for smooth and regular data. Similarly, alignment-based metrics such as Dynamic Time Warping, Longest Common Sequence, Parameterised Forecast Error Metric, and Move Split Merge have also been proposed as evaluation metrics for time-series predictions [5], [16]. Alignment-based metrics align the predictions with the actual values in order to find the optimal match between them, and hence they do prevent the DPE. However, they do not take the time dimension into account. That is to say, they do not preserve the time order among data points, and each data point is handled as an independent prediction. As a result, they are not suitable for the assessment of electricity consumption prediction and other applications where the time dimension and time order of the data points are inflexible [16].
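The double penalty effect can be illustrated with a small numerical sketch. The series below are invented for illustration, and MAE stands in for any point-wise metric: a forecast that predicts a peak with the right amplitude but one step late is penalised at two time steps, so it scores worse than a flat forecast that misses the peak entirely.

```python
def mae(actual, predicted):
    """Mean absolute error, a point-wise metric."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy series (illustrative values only): one peak at index 3.
actual = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0]
displaced = [1.0, 1.0, 1.0, 1.0, 5.0, 1.0]  # correct amplitude, one step late
flat = [1.0] * 6                            # no peak predicted at all

mae_displaced = mae(actual, displaced)  # penalised at index 3 and at index 4
mae_flat = mae(actual, flat)            # penalised at index 3 only
```

In this toy case the displaced forecast receives exactly twice the error of the flat one, although it carries more information about the event.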
Due to the uncertainty and volatility in household electricity load and resulting challenges for point forecasts, probabilistic forecasts are considered as a way of getting robust and reliable predictions [29]. A comprehensive review of probabilistic electric load forecasting for various types of buildings and aggregation levels is provided in [30], [31]. Besides, more recently, some studies including [32], [33], [34] provide new results for probabilistic load forecasting specifically in the residential domain.
Nevertheless, point forecasting methods and point-wise metrics are still predominantly used in household-level load forecasting. A number of these works are reviewed from two perspectives below. First, we list some recent studies using point forecasting methods, together with their applied evaluation metrics. We then present certain studies whose results display some level of PFE based on our visual investigation. Table I lists a number of recent studies, each of which has a different strategy and approach to different variants of household demand forecasting problem, as well as the point-wise evaluation metrics they utilise to evaluate their prediction accuracy. From Table I, it is evident that the most popular metrics in household load forecasting are RMSE and MAPE, each of which is applied by 13 different studies. Furthermore, as seen in Table I, it is a common practice to use multiple evaluation metrics. However, it is of note that justification for the choice of evaluation metrics is rarely given.
Regarding the presence of the PFE in recently published works, we carried out a visual inspection of plots of predictions and actual observations in several peer-reviewed studies and found substantial evidence for the PFE; for example in [10, Fig. 2], [9, Fig. 8], [41, Figs. 9 and 10], [42, Fig. 8], [44, Fig. 10], [45, Figs. 8 and 11], [49, Figs. 3 and 4], and [50, Figs. 4, 5, 6, and 7]. However, no comment is made on the systematic one-step delay in predictions in these studies. Furthermore, many studies, including [37], [43], [47], do not provide plots comparing predicted and actual load. Therefore, it is not possible to claim anything definite about the existence or absence of the PFE in such studies.

IV. METHODOLOGY
In this section, we describe the novel method we propose for detecting the presence of the PFE in single-step forecasts. We then describe an empirical setup used to illustrate and evaluate the PFE in a large-scale real-world dataset.

A. 1-Step-Shifting Method for PFE Detection
As previously mentioned, the PFE occurs mainly as a result of a lack of regular patterns in the data combined with a high correlation between consecutive time-series data points. When the effect arises, forecasts approximate the value of the previous data point and hence follow the actual energy demand one time step behind, as illustrated in Fig. 1. However, visual inspection alone is not always sufficiently objective for practical and repeatable PFE detection, so a computational approach is required. We therefore propose the "1-Step-Shifting" (1-SS) method. Its central idea is to recalculate standard evaluation metrics after shifting the predictions one step back in time (towards the past) or, equivalently, shifting the time-series of actual observations one step forward in time (towards the future). Note that in this text, all visualisations and formulations use the former strategy. If shifting the predictions one step back in time yields better evaluation metric results, this indicates a systematic one-step delay in the predictions.
The proposed 1-SS method contains four steps as follows:
Step 1 Calculate evaluation metrics for predictions/actual data as usual.
Step 2 Apply shift of predictions one time step to the past (or shift actual load data one step to the future).
Step 3 Recalculate the evaluation metrics for the shifted predictions/actual load data.
Step 4 Compare the evaluation metric results of Step 1 and Step 3. If 1-SS results in considerable improvements in accuracy, then it can be claimed that the predictions exhibit the PFE.
For instance, Fig. 2 illustrates the 1-SS method for predictions affected by the PFE (Fig. 2(a)) and predictions not affected by the PFE (Fig. 2(b)). In Fig. 2(a), the curve of the shifted predictions matches the curve of the actual values much better than the curve of the original predictions does. The opposite is the case in Fig. 2(b), where the difference between the original predictions and the actual load data is smaller than that of the shifted predictions.
This four-step 1-SS method is independent of any specific evaluation metric. It is common practice (see Table I) and recommended [37] to apply multiple evaluation metrics as part of a prediction evaluation, as each metric has individual advantages and disadvantages. In this work, therefore, we apply the most popular evaluation metrics in the household load forecasting literature: MAPE, RMSE, and the Correlation coefficient. We advocate MAPE and RMSE as they complement each other in many aspects, and Correlation as it measures the linear correlation (similarity) between two sets of variables. As a result, these three metrics, whose properties are summarised in Table II, help us to evaluate the PFE from as many different angles as possible.
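These three metrics are defined as:

$$\mathrm{MAPE} = \frac{100}{N} \sum_{t=1}^{N} \left| \frac{Act_t - Pred_t}{Act_t} \right| \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( Act_t - Pred_t \right)^2} \tag{2}$$

$$\mathrm{Corr} = \frac{\sum_{t=1}^{N} \left( Act_t - \overline{Act} \right) \left( Pred_t - \overline{Pred} \right)}{\sqrt{\sum_{t=1}^{N} \left( Act_t - \overline{Act} \right)^2 \sum_{t=1}^{N} \left( Pred_t - \overline{Pred} \right)^2}} \tag{3}$$

where $N$ is the number of time steps in the evaluation period, $Act_t$ is the actual value and $Pred_t$ is the forecast value at time $t$, and $\overline{Act}$ and $\overline{Pred}$ refer to the mean of the actual loads and the mean of the predictions, respectively. The 1-SS method modifies these formulas by pairing each actual value $Act_t$ with the prediction shifted one step back in time, $Pred_{t+1}$:

$$\mathrm{MAPE}^{*} = \frac{100}{N-1} \sum_{t=1}^{N-1} \left| \frac{Act_t - Pred_{t+1}}{Act_t} \right| \tag{4}$$

$$\mathrm{RMSE}^{*} = \sqrt{\frac{1}{N-1} \sum_{t=1}^{N-1} \left( Act_t - Pred_{t+1} \right)^2} \tag{5}$$

$$\mathrm{Corr}^{*} = \frac{\sum_{t=1}^{N-1} \left( Act_t - \overline{Act} \right) \left( Pred_{t+1} - \overline{Pred} \right)}{\sqrt{\sum_{t=1}^{N-1} \left( Act_t - \overline{Act} \right)^2 \sum_{t=1}^{N-1} \left( Pred_{t+1} - \overline{Pred} \right)^2}} \tag{6}$$

where the means are taken over the $N-1$ paired values.

Definition 1: Given formulas 1-6, we define that for a given household, predictions exhibit the PFE if $\mathrm{MAPE}^{*} < \mathrm{MAPE}$, $\mathrm{RMSE}^{*} < \mathrm{RMSE}$, and $\mathrm{Corr}^{*} > \mathrm{Corr}$. Conversely, predictions do not suffer from the PFE if $\mathrm{MAPE}^{*} > \mathrm{MAPE}$, $\mathrm{RMSE}^{*} > \mathrm{RMSE}$, and $\mathrm{Corr}^{*} < \mathrm{Corr}$. We consider any other combination inconclusive and advise further investigation via manual visual inspection or in-depth regularity analysis.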
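The four 1-SS steps and Definition 1 can be sketched as follows. This is a minimal, dependency-free illustration rather than the implementation used in the experiments: the function names and toy series are assumptions, the metrics follow their standard textbook forms, and the shifted metrics are computed over the overlapping N - 1 points with means recomputed on that range.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def corr(actual, predicted):
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(va * vp)


def one_step_shifting(actual, predicted):
    """Apply the four 1-SS steps and classify according to Definition 1."""
    # Step 1: metrics on the original alignment.
    base = (mape(actual, predicted), rmse(actual, predicted),
            corr(actual, predicted))
    # Step 2: shift predictions one step into the past, i.e. pair
    # actual[t] with predicted[t + 1]; the last actual value has no partner.
    a_shift, p_shift = actual[:-1], predicted[1:]
    # Step 3: metrics on the shifted alignment.
    shifted = (mape(a_shift, p_shift), rmse(a_shift, p_shift),
               corr(a_shift, p_shift))
    # Step 4: compare. Lower is better for MAPE/RMSE, higher for Corr.
    improved = [shifted[0] < base[0], shifted[1] < base[1], shifted[2] > base[2]]
    worsened = [shifted[0] > base[0], shifted[1] > base[1], shifted[2] < base[2]]
    if all(improved):
        label = "PFE-affected"
    elif all(worsened):
        label = "PFE-free"
    else:
        label = "inconclusive"
    return label, base, shifted


# Toy data (illustrative only): a forecast that lags the truth by one step,
# and a forecast that matches the truth exactly.
actual = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 7.0]
lagged = [actual[0]] + actual[:-1]
accurate = list(actual)

label_lagged = one_step_shifting(actual, lagged)[0]
label_accurate = one_step_shifting(actual, accurate)[0]
```

The strict inequalities mirror Definition 1. In practice, one might additionally require a minimum improvement before labelling predictions as PFE-affected, since the text speaks of "considerable" improvements; that threshold is left out here because the source does not specify one.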

B. Experimental Setup
In order to illustrate the PFE and the use of 1-SS for PFE detection, we carry out an experiment with a large-scale dataset and state-of-the-art machine learning methods. We replicate relevant results from [8], who published prediction results on the publicly available Smart Grid Smart City project dataset (SGSC) [51] and used methods from the Keras library together with the Theano back-end. The dataset they utilise provides electricity consumption data for several houses, and they deploy multiple machine learning methods; both properties are critically important for the purpose of this work. This experiment also allows us to demonstrate that the PFE already exists in the literature.

1) Dataset:
The SGSC dataset provides energy consumption data for a large number of households in New South Wales, Australia, and captures the variability of residential schedules and activities. Following [8], we utilise a subset of 69 households from the SGSC with 92 consecutive daily load profiles from a 3-month time span (01.06.2013 to 31.08.2013). This time period was initially chosen as it includes complete half-hourly electric load data for the 69 individual households. We apply the same train-validation-test split of 67 days of data for training, 16 days for validation, and the remaining 9 days for testing.
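As a quick sanity check on the sizes involved, the split can be sketched as below. The chronological ordering (training first, then validation, then testing) is an assumption made here for illustration, as are the names; the day counts and half-hourly resolution come from the text.

```python
STEPS_PER_DAY = 48  # half-hourly readings
TRAIN_DAYS, VAL_DAYS, TEST_DAYS = 67, 16, 9


def chronological_split(series):
    """Split a half-hourly series into train/validation/test by whole days.

    Assumes the series is ordered in time and covers exactly
    TRAIN_DAYS + VAL_DAYS + TEST_DAYS days.
    """
    n_train = TRAIN_DAYS * STEPS_PER_DAY
    n_val = VAL_DAYS * STEPS_PER_DAY
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


# Placeholder series standing in for one household's 92 days of readings.
series = list(range(92 * STEPS_PER_DAY))
train, val, test = chronological_split(series)
```

With 92 days of half-hourly data this yields 3216 training, 768 validation, and 432 test points per household.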
The input features are identical across all 69 households and include:
• Historical electricity load data for the past 2 time steps ($E_t$, $E_{t-1}$).
• Day of week indicator (ranges from 0 to 6).
• Time of day indicator (ranges from 0 to 47).
• Weekend indicator (takes the value 0 or 1).
Identical to [8], we also carry out data preparation: all input features are independently transformed to a standard scale (0, 1), which is crucial to reduce the impact of marginal outliers. To do this, one-hot encoding is applied to the time-of-day and day-of-week inputs, and min-max normalisation is performed for the historical electricity data columns.
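The feature preparation described above can be sketched as follows. The helper names, the convention that day 0 is Monday, and the choice to pass the scaling bounds in explicitly (for example, taken from the training data) are assumptions for illustration; the source does not state them.

```python
def one_hot(index, size):
    """Return a list of zeros of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def min_max(value, lo, hi):
    """Scale `value` linearly so that lo maps to 0 and hi maps to 1."""
    return (value - lo) / (hi - lo)


def build_features(load, t, day_of_week, time_of_day, lo, hi):
    """Feature vector for predicting load[t + 1] from load[t] and load[t - 1].

    day_of_week is in 0..6 (assumed 0 = Monday), time_of_day is in 0..47.
    """
    weekend = 1.0 if day_of_week >= 5 else 0.0
    return (
        [min_max(load[t], lo, hi), min_max(load[t - 1], lo, hi)]
        + one_hot(day_of_week, 7)
        + one_hot(time_of_day, 48)
        + [weekend]
    )


# Toy example (illustrative values): Saturday, 05:00-05:30 slot.
load = [0.0, 0.5, 1.0, 2.0]
x = build_features(load, t=2, day_of_week=5, time_of_day=10, lo=0.0, hi=2.0)
```

Under these assumptions each input vector has 2 + 7 + 48 + 1 = 58 entries, all within [0, 1] as long as the load stays within the scaling bounds.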
However, it is of note that for one of the 69 houses (household ID: 8568209) from the dataset, electricity consumption is approximately zero for a substantial part of the dataset (potentially because the property was vacant), including the entire time span that defines the test set (see Fig. 3). The test set, thus, is utterly dissimilar to the training set. In order to evaluate the PFE for this household, a different train-test split would be needed, which would be a deviation from [8]. For this reason, we have excluded this house from the PFE evaluation.
2) Prediction Methods: In order to illustrate the PFE and its implications, we apply two state-of-the-art machine learning techniques, Long Short-Term Memory Recurrent Neural Network (LSTM RNN) and Back-Propagation Neural Network (BPNN), as applied in [8]. In this section, we introduce their implemented architecture as well as critical hyper-parameters. However, a detailed explanation of their working principles is out of the scope of this study. Interested readers can find an introduction to LSTM RNNs in [4], [8], [39] and to BPNNs in [52], [53]. Both of these methods are built with the same architecture and hyper-parameters, replicating [8]. The common architecture and hyper-parameter settings of the methods are summarised in Table III. The specific values for batch size, drop-out rate, and loss function were not stated in [8]. Therefore, we apply the default batch size of 32 and no dropout. However, the Keras framework does not provide a default choice of loss function; hence, we carried out a grid search and selected the function that resulted in the best reproduction of the results from [8], which is MAE.
3) Clustering Method: We carry out a clustering analysis of daily load profiles of households in order to explain the association between the presence of PFE and irregularity in load patterns. The clustering provides a visual and numeric comparison of the regularity level of households whose predictions are PFE-free and PFE-affected.
In an electricity demand prediction context, clustering algorithms are generally employed to improve the prediction accuracy [39], [40], [54]. In this text, however, we use clustering for data analysis rather than prediction. The clustering groups the 92 daily load profiles based on their resemblance to one another. The number of clusters provides a measure for the regularity level of data from a household on a day-by-day basis based on hierarchical clustering in line with [8], [9].
Hierarchical clustering (see [55], [56] for details) is chosen, as it I) requires the selection of only a few hyper-parameters, II) does not need a predetermined number of clusters, and III) identifies outliers explicitly.
Clustering is an unsupervised learning method that groups objects based on their relative similarity given some distance metric [54]. In the current context, since the objects are not single data points but sequential data arrays of 48 data points for each day, correlation was chosen as the distance metric. To calculate the distance between individual objects or clusters, the average method is used. The threshold for assignment to the same cluster is set to corr = 0.75. In other words, two daily electricity demand profiles are assigned to the same cluster if their correlation coefficient is equal to or greater than 0.75.
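A minimal sketch of this kind of clustering is given below in plain Python. It assumes that "correlation as the distance metric" means the distance 1 - r for Pearson correlation r, that the "average method" is average linkage, and that merging stops once the smallest average distance exceeds 1 - 0.75 = 0.25. These are interpretations, not confirmed details of the original setup, and the function names and toy profiles are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cluster_profiles(profiles, min_corr=0.75):
    """Agglomerative clustering with average linkage on correlation distance.

    Returns a list of clusters, each a list of profile indices. Profiles
    that are not similar enough to any other end up as singletons.
    """
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    threshold = 1.0 - min_corr

    def avg_dist(c1, c2):
        # Average linkage: mean pairwise distance between members.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, a, b = min(
            (avg_dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break  # no remaining pair is similar enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy "daily profiles" with 6 points each (real profiles have 48).
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],  # same shape as the first profile
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],    # mirrored shape
    [1.0, 5.0, 1.0, 5.0, 1.0, 5.0],    # unrelated shape
]
clusters = cluster_profiles(profiles)
n_clusters = len(clusters)
```

The number of clusters per household then serves as the regularity measure: the fewer clusters, the more self-similar the daily profiles. For real data, a library routine such as SciPy's hierarchical clustering would normally replace this hand-written loop.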

V. RESULTS
We first deploy LSTM RNN on the data of the 68 residences independently, with identical architecture and hyper-parameters, and then apply the 1-SS method to the resulting predictions. The differences between the default evaluation metrics (formulas 1 to 3) and their shifted equivalents (formulas 4 to 6) are shown in Fig. 4.
Considering the difference between default and shifted evaluation metrics as given by the 1-SS method and Definition 1, the 68 households can be split into three groups:
• PFE-free: All three metrics worsen after 1-SS. This is the case for household IDs 8273230, 8342852, and 8482121.
• Inconclusive: One or two metrics improve while the rest worsen after 1-SS. This is the case only for household ID 8478501.
• PFE-affected: All three metrics improve after 1-SS. This is the case for the remaining 64 houses, which is by far the majority of households.
For MAPE and RMSE, a larger value corresponds to a worse result, while the opposite is true for Correlation. For instance, the 1-SS method indicates that the predictions of household 8342852 are not affected by the PFE, as the values for MAPE and RMSE, initially 38.704 and 0.227, worsen (error increases) after shifting.
Even though the 1-SS method is conceptually very simple, it can detect the presence or absence of the PFE in all but one of the cases (the inconclusive one). The default evaluation metrics by themselves are not able to identify such behaviour at all, which can result in misplaced confidence in predictions. Fig. 5 juxtaposes the MAPE and MAPE* values of six houses arranged in three pairs, each consisting of one house with PFE-free and one with PFE-affected predictions. In each pair, the MAPE values are very similar to one another (e.g., 25.357 and 24.815 for houses 8273230 and 8487285, respectively). According to the metric results, the prediction method performs similarly within the pairs 8273230 and 8487285, 8342852 and 8184653, and 8482121 and 8661542. However, when the 1-SS method is applied, the evaluation metric values improve for the households with PFE-affected predictions and degrade for those not subject to the PFE, as shown by MAPE* in Fig. 5. Even though the absolute overall prediction error in each pair of households during the test interval is very similar, for predictions exhibiting the PFE this error is almost exclusively determined by the cumulative difference of the observed energy consumption between two consecutive time steps.
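This statement can be made explicit under an idealised assumption, namely that PFE-affected predictions reproduce the previous observation as a persistence model would. Substituting $Pred_t \approx Act_{t-1}$ into the summed absolute error gives:

```latex
\[
\sum_{t} \left| Act_t - Pred_t \right|
\;\approx\;
\sum_{t} \left| Act_t - Act_{t-1} \right|
\qquad \text{if } Pred_t \approx Act_{t-1}.
\]
```

The right-hand side contains no model output at all, which is why the error of PFE-affected predictions mainly reflects how much the load itself changes from one time step to the next.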
Another important risk resulting from the PFE concerns the interpretation of evaluation metrics when comparing prediction methods. When the standard evaluation metrics are used to compare alternative methods, the PFE can result in evaluation metrics that separate methods only poorly, or that even reverse the rank order between them. We illustrate this with a comparison of the LSTM RNN and BPNN methods for six residential buildings, three of which have PFE-free predictions and three of which have PFE-affected predictions. These methods are partly stochastic and result in slightly varying forecasts. We train both methods four times for each of these buildings. The MAPE results are listed in Table IV. For buildings with PFE-free predictions, LSTM RNN performs consistently and significantly better than BPNN. However, for households whose predictions are PFE-affected, the difference between the MAPE results is not significant and the ranks of LSTM RNN and BPNN are not stable.

A. The PFE in Multi-Step Forecasts
So far, we have only considered single-step forecasts. To investigate the PFE in the context of multi-step forecasts, we deploy LSTM RNN, with the same architecture and hyper-parameters as above, for the same dataset. We perform one-day-ahead prediction, yielding 48 predicted values for the electricity consumption of the following 24 hours. To deal with this task, we use a recursive mechanism that allows us to use our single-step method, LSTM RNN, iteratively. The main idea of the recursive mechanism is to deploy the same pre-trained single-step method for each time point to be predicted. Basically, the single-step model is used to predict the one time point ahead first, and then, the output is fed into the same single-step model to predict the subsequent time point. This procedure is iteratively applied until the last value of the desired multi-step forecasting sequence is predicted.
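The recursive mechanism can be sketched as follows. This is a simplified illustration: the single-step model is represented by an arbitrary callable that only sees a window of past loads (calendar features are omitted), and `toy_model` is a hypothetical stand-in, not the trained LSTM RNN.

```python
def recursive_forecast(single_step_model, history, horizon=48):
    """Roll a single-step model forward, feeding predictions back as inputs.

    single_step_model: callable mapping a window of past values to the
        next value.
    history: most recent observed values, oldest first; its length defines
        the window size.
    horizon: number of future steps to predict (48 = one day, half-hourly).
    """
    window = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = single_step_model(window)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return forecasts


def toy_model(window):
    """Hypothetical single-step model: mean of the last two values."""
    return 0.5 * (window[-1] + window[-2])


def persistence_model(window):
    """Single-step persistence: repeat the most recent value."""
    return window[-1]


short = recursive_forecast(toy_model, [1.0, 3.0], horizon=3)
day_ahead = recursive_forecast(toy_model, [1.0, 3.0])
flat = recursive_forecast(persistence_model, [1.0, 3.0], horizon=4)
```

Because every predicted value becomes an input for the next step, errors can accumulate over the horizon. Note also that a single-step model behaving exactly like persistence returns the last observed value for every step under this scheme.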
Day-ahead multi-step forecasts for three households are presented in Fig. 6. For the building (8482121) whose predictions are PFE-free in the single-step case, the prediction method produces prediction curves that show no temporal displacement, and the predicted load pattern follows the actual load pattern closely (Fig. 6(a)). On the other hand, for the buildings (8487285 and 8661542) whose predictions are PFE-affected in the single-step case, the method provides forecasts of very poor alignment. The multi-step predictions are either i) more or less flat lines with minimal fluctuations (Fig. 6(b)), or ii) curves full of arbitrary peaks and troughs (Fig. 6(c)). In both cases, the forecasts are uncorrelated with the actual load. The correlation coefficients between the actual load and the multi-step forecasts are −0.123 and 0.026 for these two forecasts, respectively. Therefore, considering the given definition of the PFE and the dissimilarity between the actual load and the multi-step predictions, it does not seem possible to talk about the temporal accuracy or the existence/absence of the PFE for such predictions.
Consequently, the 1-SS method as presented here is not sufficient to detect the PFE in multi-step forecasts, which remains an area for further work. However, 1-SS can still be used to show the absence of the PFE in multi-step forecasts, and can thus give stakeholders confidence in evaluation outcomes.

B. Clustering Results
Results from hierarchical clustering are shown in Table V. Here, houses are sorted by the number of clusters, with fewer clusters indicating greater similarity between the daily energy consumption profiles of a specific household over the 92 days. It is important to note that outlier daily demand profiles, which have no similar profiles to share a cluster with according to the correlation distance metric, are each placed in a separate cluster.
The clustering results show significant differences in the regularity of demand profiles across households. The most regular household has only 10 clusters for its 92 daily load profiles, while the household with the most variable load profiles has 91 clusters; in other words, only two of its daily load profiles were similar to each other.
Among the results in Table V, the three houses that have PFE-free predictions (8273230, 8342852, and 8482121) rank first, showing that they have the most self-similar and regular demand profiles over the 92 days among the 68 households. The considerable difference in the regularity of the daily load profiles of households 8273230 and 8487285, whose MAPE and MAPE* values are compared above in Fig. 5, can be seen in Fig. 7, which shows all 92 daily load profiles of those buildings in Fig. 7(a) and Fig. 7(b), respectively. These are complemented by the dendrograms in Fig. 8, which visualise the clustering results for the same houses.
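Counting clusters of daily profiles under a correlation distance can be sketched as below. This is an illustration under stated assumptions rather than the study's procedure: the linkage method (`"average"`) and the function name `count_profile_clusters` are hypothetical choices, the correlation threshold is converted to a distance cut-off of 1 − corr, and the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def count_profile_clusters(daily_profiles, corr_threshold=0.75, method="average"):
    """Cluster daily load profiles and return (number of clusters, labels).

    daily_profiles: array of shape (n_days, n_points_per_day).
    corr_threshold: profiles are merged while their correlation distance
        (1 - Pearson r) stays within 1 - corr_threshold.
    method: linkage method; "average" is an assumption made for this sketch.
    """
    distances = pdist(daily_profiles, metric="correlation")
    tree = linkage(distances, method=method)
    labels = fcluster(tree, t=1.0 - corr_threshold, criterion="distance")
    return len(np.unique(labels)), labels


# Synthetic example: two groups of five half-hourly profiles each.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 48, endpoint=False)
group_a = np.sin(t) + 0.01 * rng.standard_normal((5, 48))
group_b = np.cos(t) + 0.01 * rng.standard_normal((5, 48))
profiles = np.vstack([group_a, group_b])

n_clusters, labels = count_profile_clusters(profiles)
```

In this synthetic case the ten profiles fall into two clusters. A profile that correlates with no other profile above the threshold ends up as a singleton cluster, which matches the treatment of outlier days described above.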

VI. DISCUSSION
Given the strong within-household variability of the demand profiles, where as many as 91 different clusters are found for 92 days, it is not surprising that prediction methods fail to provide robust predictions. Such a failure should be detected during prediction evaluation. Nevertheless, the evaluation methods currently available to the community are not able to do this.
The 1-SS method provides a conceptually straightforward method to detect the PFE in predictions. We note that although the ranking in Table V explicitly reveals the relation between the irregular pattern in data and the PFE, clustering cannot be used as a tool to identify the PFE. This is because, first, hierarchical clustering requires a choice of hyper-parameters such as the threshold value (here: corr = 0.75); and second, clustering results rely on dataset features, such as dataset length, granularity and the like. While the ranking of houses is supposed to be independent of these hyper-parameters, the absolute number of clusters is not.
Our empirical regularity analysis focused on the overall variability of demand profiles. A detailed analysis of the underlying reasons why predictions "fall back" on "the most recent observation" is out of the scope of this text. However, auto-correlation analysis offers a way to make progress in this direction. It evaluates the similarity between observations of the same time-series variable as a function of the delay between them. This delay is called the lag. For example, in a half-hourly dataset, lag1 refers to the observation 30 minutes prior to the most recent one, and lag48 to the observation 24 hours earlier. Auto-correlation thus offers a way to measure and understand similarity within a day and between consecutive days. For instance, Fig. 9(a) shows significant auto-correlation values at various lags throughout the day for a household whose predictions are PFE-free. In particular, the correlation values at lag1 (0.693) and lag48 (0.687) are almost equal, illustrating the similarity between consecutive days. In contrast, Fig. 9(b) shows the auto-correlation results for a household with PFE-affected predictions. Here, the correlation value at lag1 is far higher than the values at all other lags. In other words, in such a volatile dataset, the current value is most strongly correlated with the previous time step. Prediction methods are thus behaving consistently when they forecast that the future will be "similar to the current".
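The lag1 and lag48 comparison can be reproduced on synthetic data with a short sketch. The estimator below is the standard sample auto-correlation; the function name and both example series are assumptions made for illustration and are not taken from the dataset used in the study.

```python
import numpy as np


def autocorrelation(series, lag):
    """Sample auto-correlation of a 1-D series at a positive integer lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Regular household (synthetic): the same half-hourly pattern every day.
steps = np.arange(48 * 10)
regular = np.sin(2.0 * np.pi * steps / 48)

# Volatile household (synthetic): an AR(1) process with no daily pattern.
rng = np.random.default_rng(1)
noise = rng.standard_normal(4800)
volatile = np.zeros(4800)
for i in range(1, 4800):
    volatile[i] = 0.7 * volatile[i - 1] + noise[i]

regular_lag1 = autocorrelation(regular, 1)
regular_lag48 = autocorrelation(regular, 48)
volatile_lag1 = autocorrelation(volatile, 1)
volatile_lag48 = autocorrelation(volatile, 48)
```

For the periodic series, both lag1 and lag48 are high. For the AR(1) series, only lag1 is high and lag48 is close to zero, which is the pattern the text associates with PFE-affected households.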
The objective of our analysis was not to demonstrate that the forecasting methods are "broken" in a previously unknown way. Instead, we aim to guard against overconfidence in prediction outcomes and to mitigate the risk of developing forecast models that, on certain datasets, do not work as might be expected. With the proposed 1-SS method, model developers have a tool to detect the PFE in their predictions. When they do detect it, we recommend a review of the dataset they use and its features. Developers should explore the availability of additional features that the predicted output is contingent on. In the domain of residential energy demand, these might include the features that determine why, when, and how electrical energy is consumed in a building, including the lifestyle and daily routines of occupants, the power consumption of appliances, weather data, and the like. Such augmented datasets hold the potential to include the regularity that forecasting methods rely on and to contribute to the reduction of the prediction error. Nonetheless, historical load data is still an important predictor in the domain of residential load forecasting. In Table VI, we compare the accuracy of the LSTM RNN for five randomly selected houses with and without historical loads in the input set. Similar to [14], we find that the use of historical data brings a substantial improvement in prediction accuracy. Given that historical data is of such high value to prediction accuracy, we propose the 1-SS evaluation method in order to identify when a historical dataset provides insufficient regularity for prediction methods to provide robust output.
We would like to close with a review of how the PFE relates to the phase error and the DPE. The DPE arises from the use of point-wise metrics during the evaluation of discontinuously displaced predictions, a displacement also known as phase error. The phase error that the DPE is concerned with can occur at any point in time and with varying time steps, backwards or forwards, for all sorts of reasons. The solutions proposed for the DPE work by dropping the time dimension during evaluation, and they are applicable to application areas that are tolerant to a displacement of events in time. However, not all applications provide this temporal flexibility (such as peak shaving, storage scheduling, and dynamic pricing). In contrast, the PFE refers to predictions systematically trailing the actual loads one step behind in time, since they are extrapolated from the most recent load value in the input set owing to irregularity in the data. Dropping the time dimension cannot be a solution to the PFE: if it were applicable to systematic and continuous prediction displacement, the simplest method, the persistence model introduced in the Problem Statement, would always be the best-performing method.
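The closing argument about the persistence model can be checked numerically. The sketch below is an illustration only: the load values are made up, MAPE is used as the point-wise metric, and the one-step realignment is a simplified device for this example that is not claimed to be the exact 1-SS procedure.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


# Made-up half-hourly load values (assumption, for illustration only).
load = np.array([0.30, 0.55, 0.20, 0.90, 0.40, 0.35, 0.75])

actual = load[1:]
persistence = load[:-1]  # the forecast for step t is the observation at t - 1

# Point-wise evaluation with the time axis kept intact.
aligned_error = mape(actual, persistence)

# Evaluation that ignores the systematic one-step lag: the forecast issued
# for step t + 1 is compared with the actual value at step t.
lag_ignored_error = mape(actual[:-1], persistence[1:])
```

Here `aligned_error` is large, while `lag_ignored_error` is exactly zero, because the persistence forecast is the actual series delayed by one step. An evaluation that tolerates a systematic one-step displacement would therefore rate the persistence model as perfect.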

VII. CONCLUSION
In this text, we start with the observation that evaluation methods applied to time-series predictions, commonly used for electric load forecasting, fail to detect model degradation when models are trained with highly irregular data. We also introduce the PFE, referring to the effect of predictions systematically following the actual load one step behind, which may be detrimental to final applications that have no tolerance for temporal displacement. We then illustrate the associated risks and, to mitigate them, propose the 1-SS as a PFE detection method. We investigate the use of the 1-SS for single- and multi-step predictions. Finally, we investigate how irregularity in data causes predictions to be affected by the PFE.
We illustrate the PFE on a real-world dataset of 68 houses by deploying advanced machine learning techniques. We replicate the results of a recent, peer-reviewed study and show that standard evaluation metrics are insufficient to detect the PFE. According to the 1-SS results, only 3 of the houses have PFE-free predictions, whereas the predictions of 64 of them are PFE-affected and the remaining household is inconclusive. Finally, through analysis of between-day and within-day similarity using hierarchical clustering and auto-correlation, we make steps towards a more formal description of the PFE.
As a consequence, the PFE has a strong potential to endanger network security and resilience, as well as the domestic economy. We recommend that model developers apply the 1-SS method to examine the presence of the PFE before deploying models in smart applications. Additionally, in order to overcome the PFE and increase the robustness of residential demand forecasts, we recommend augmenting the input feature set with features that the prediction outputs might be contingent on. As future work, we see, first, the evaluation of the PFE in other aspects of power systems and electricity market studies, e.g., price forecasting and solar/wind power generation forecasting, and second, an investigation of whether the PFE can be used to improve prediction accuracy.

ACKNOWLEDGMENT
The data used in this study is openly available at "https://data.gov.au/data/dataset/smart-grid-smart-city-customer-trial-data", which is also cited in the References section of this paper.