Streaming Hierarchical Clustering Based on Point-Set Kernel


Hierarchical clustering produces a cluster tree that describes a dataset at different granularities, and therefore provides richer information and insight than partitioning clustering. However, hierarchical clustering algorithms often have two weaknesses: poor scalability and a limited capacity to handle clusters of varying densities. This is because they rely on pairwise point-based similarity calculations and use a similarity measure that is independent of the data distribution. In this paper, we aim to overcome these weaknesses and propose a novel, efficient hierarchical clustering algorithm called StreaKHC that enables massive streaming data to be mined. The enabling factor is a scalable point-set kernel that measures the similarity between an existing cluster in the cluster tree and a new point in the data stream. StreaKHC also has an efficient mechanism for updating the hierarchical structure so that a high-quality cluster tree can be maintained in real time. Our extensive empirical evaluation shows that StreaKHC is more accurate and more efficient than existing hierarchical clustering algorithms.
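To illustrate why a point-to-set similarity can avoid pairwise point comparisons, the following is a minimal sketch and not the StreaKHC implementation. It assumes the similarity between a point x and a cluster S can be written as the inner product of a feature vector of x with the mean feature vector of S, so that each cluster only needs a running summary. The abstract does not name the kernel; the Voronoi-style, data-dependent feature map used here is an assumption, and the names `VoronoiFeatureMap` and `ClusterSummary` are hypothetical.

```python
import random
from math import dist


class VoronoiFeatureMap:
    """Hypothetical data-dependent feature map (an assumption, not from the paper).

    Builds t partitionings, each defined by psi reference points sampled
    from an initial batch of data.
    """

    def __init__(self, sample, t=50, psi=4, seed=0):
        rng = random.Random(seed)
        self.t, self.psi = t, psi
        self.partitions = [rng.sample(sample, psi) for _ in range(t)]

    def __call__(self, x):
        # For each partitioning, return the index of the nearest reference point.
        return [min(range(self.psi), key=lambda j: dist(x, refs[j]))
                for refs in self.partitions]


class ClusterSummary:
    """Running sum of the one-hot features of a cluster's points."""

    def __init__(self, t, psi):
        self.counts = [[0] * psi for _ in range(t)]
        self.n = 0

    def add(self, cells):
        # Cost is O(t), independent of how many points the cluster holds.
        for i, j in enumerate(cells):
            self.counts[i][j] += 1
        self.n += 1

    def similarity(self, cells):
        # Inner product of the point's features with the cluster's mean features.
        if self.n == 0:
            return 0.0
        hits = sum(self.counts[i][j] for i, j in enumerate(cells))
        return hits / (self.n * len(cells))


# Toy data: two well-separated 2-D blobs (synthetic, for illustration only).
rng = random.Random(1)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]

fmap = VoronoiFeatureMap(blob_a + blob_b, t=50, psi=4)
cluster_a = ClusterSummary(fmap.t, fmap.psi)
cluster_b = ClusterSummary(fmap.t, fmap.psi)
for p in blob_a:
    cluster_a.add(fmap(p))
for p in blob_b:
    cluster_b.add(fmap(p))

# A new point arriving in the stream is compared with each cluster summary.
new_point = (0.2, -0.1)
sim_a = cluster_a.similarity(fmap(new_point))
sim_b = cluster_b.similarity(fmap(new_point))
print("similarity to cluster A:", sim_a)
print("similarity to cluster B:", sim_b)
```

In this sketch, comparing a new point with a cluster costs the same regardless of the cluster's size, which is the property the abstract attributes to the point-set kernel. How StreaKHC routes a point through the tree and restructures it is not covered here.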

28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-22)