A new approach to time series clustering by combination of sub-series

Document Type: Article

Authors

1 Department of Industrial Engineering, Faculty of Engineering, Ferdowsi University of Mashhad (FUM)

2 Department of Industrial Engineering, Faculty of Engineering, Ferdowsi University of Mashhad (FUM)

Abstract

Time series clustering, defined as deriving trends and archetypes from sequential data, divides time series into groups according to their characteristics. Previous work has focused mainly on the distance criterion and the clustering algorithm, and few researchers have investigated the similarities between the segments of a time series. To address this research gap, we propose a new two-step approach based on sub-series and combination clustering. In the first step, the time series data set is segmented using a fixed window size, and each segment is clustered using a hierarchical clustering algorithm with Euclidean distance. We use a logarithmic relation based on the length of the time series data set to determine the number of components, and we select the best outcomes using several internal criteria, including intergroup variance, the Calinski-Harabasz index, and the Dunn index. In the second step, the results of the first step are combined using ensemble clustering to obtain the final cluster labels. We develop two novel algorithms that differ in the internal criteria used to select the best segmentations: the first considers a single internal criterion, and the second considers three internal criteria simultaneously. We run both algorithms under various settings on 82 data sets with 10 replications each, measuring the final accuracy with the external Rand index. To identify the best settings for the proposed algorithms, we apply the Wilcoxon statistical test. A statistical comparison of the two new algorithms with several algorithms from the related literature on the 82 data sets indicates a significant improvement in terms of error rate and execution time. Finally, with the best settings, the proposed method achieves the best Rand index among the compared algorithms on 32% of the data sets.
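The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, average linkage, the use of the Calinski-Harabasz index as the single internal criterion, and the co-association matrix as the ensemble step are all assumptions made for illustration. The logarithmic rule for the number of components is not given in the abstract, so `window` and `keep` are plain parameters here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion (higher is better)."""
    n, ks = len(X), np.unique(labels)
    if len(ks) < 2 or len(ks) >= n:
        return 0.0
    mean = X.mean(axis=0)
    between = within = 0.0
    for k in ks:
        members = X[labels == k]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - mean) ** 2)
        within += np.sum((members - center) ** 2)
    if within == 0.0:
        return float("inf")
    return (between / (len(ks) - 1)) / (within / (n - len(ks)))


def cluster_by_segments(series, window, n_clusters, keep):
    """Hypothetical helper: cluster each fixed-width segment, then combine."""
    n, length = series.shape
    candidates = []
    # Step 1: hierarchical clustering (Euclidean distance) on every segment.
    for start in range(0, length - window + 1, window):
        segment = series[:, start:start + window]
        tree = linkage(segment, method="average", metric="euclidean")
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        candidates.append((calinski_harabasz(segment, labels), labels))

    # Keep the `keep` segment clusterings with the highest internal score.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    selected = [labels for _, labels in candidates[:keep]]

    # Step 2 (assumed ensemble scheme): co-association matrix, i.e. the
    # fraction of selected clusterings that place series i and j together.
    co = np.zeros((n, n))
    for labels in selected:
        co += labels[:, None] == labels[None, :]
    co /= len(selected)

    # Consensus labels: hierarchical clustering on 1 - co-association.
    dist = 1.0 - co
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


if __name__ == "__main__":
    # Synthetic data: two groups of three series each, separated by an offset.
    rng = np.random.default_rng(0)
    base = np.sin(np.linspace(0, 6, 40))
    data = np.vstack([base + rng.normal(0, 0.1, 40) for _ in range(3)]
                     + [base + 5 + rng.normal(0, 0.1, 40) for _ in range(3)])
    print(cluster_by_segments(data, window=10, n_clusters=2, keep=3))
```

The variant with three internal criteria would replace the single score with a combined ranking over intergroup variance, the Calinski-Harabasz index, and the Dunn index. How the criteria are combined is not stated in the abstract, so it is left out of the sketch.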

Keywords

Main Subjects