Data-dependent Similarity

Last updated on Dec 9, 2020

Data-dependent similarity measures have been proposed to overcome key weaknesses of existing data mining algorithms that rely on distance measures. They are motivated by psychologists, who advocate that a similarity measure should have the following characteristic: two points in a dense region are less similar to each other than two points with the same inter-point distance in a sparse region. We have shown that these new measures can significantly improve the task-specific performance of existing data mining techniques for clustering, classification and anomaly detection on a large number of real-world datasets.

The source code of the latest data-dependent similarity measure, aNNE (AAAI-19), can be obtained from here. The first generic data-dependent similarity measure, me (KDD-16), can be obtained from here.
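The characteristic described above (equal distance, lower similarity in the denser region) can be illustrated with a small nearest-neighbour-style isolation similarity. This is a minimal sketch under stated assumptions, not the released aNNE code: the function name `isolation_similarity`, the parameters `psi`, `t` and `seed`, and the toy one-dimensional data are all hypothetical choices made for illustration. The sketch estimates similarity as the fraction of random partitionings in which two points share the same nearest sampled centre.

```python
import numpy as np


def isolation_similarity(X, a, b, psi=8, t=200, seed=0):
    """Estimate how often points a and b share a nearest sampled centre.

    X is an (n, d) reference dataset; a and b are length-d query points.
    psi (centres per partitioning), t (number of partitionings) and seed
    are illustrative parameters, not values taken from the papers.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    queries = np.stack([np.asarray(a, dtype=float), np.asarray(b, dtype=float)])
    same_cell = 0
    for _ in range(t):
        # Sample psi data points; their Voronoi cells form one partitioning.
        centres = X[rng.choice(len(X), size=psi, replace=False)]
        # Squared distances from each query to each centre, shape (2, psi).
        d2 = ((queries[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        same_cell += int(nearest[0] == nearest[1])
    return same_cell / t


# Toy data (an assumption): a dense cluster near 0 and a sparse one near 5.
data_rng = np.random.default_rng(42)
dense = data_rng.normal(loc=0.0, scale=0.1, size=(900, 1))
sparse = data_rng.normal(loc=5.0, scale=1.0, size=(100, 1))
X = np.vstack([dense, sparse])

# Both pairs are exactly 0.2 apart, but sit in regions of different density.
sim_dense = isolation_similarity(X, [-0.1], [0.1])
sim_sparse = isolation_similarity(X, [4.9], [5.1])
print("dense pair:", sim_dense, "sparse pair:", sim_sparse)
```

Because more centres are sampled from the dense region, its cells are smaller, so the dense pair is separated more often and receives the lower similarity even though both pairs have the same inter-point distance.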

Publications

Towards a Persistence Diagram that is Robust to Noise and Varied Densities

We propose a new filter function for Topological Data Analysis (TDA) based on a new data-dependent kernel.

Hang Zhang, Kaifeng Zhang, Kai Ming Ting, Ye Zhu


Kernel-based clustering via Isolation Distributional Kernel

We propose the first clustering algorithm that employs an adaptive distributional kernel without any optimization, while achieving a similar optimization objective function.

Ye Zhu, Kai Ming Ting


A new distributional treatment for time series and an anomaly detection investigation

We propose a distributional treatment for anomalous subsequence detection with a linear runtime.

Ye Zhu, Kai Ming Ting


Streaming Hierarchical Clustering Based on Point-Set Kernel

We propose a novel, efficient hierarchical clustering algorithm called StreaKHC that enables massive streaming data to be mined.

Xin Han, Ye Zhu, Kai Ming Ting, De-Chuan Zhan, Gang Li


Improving the Effectiveness and Efficiency of Stochastic Neighbour Embedding with Isolation Kernel

We present a new insight into improving the performance of Stochastic Neighbour Embedding (t-SNE) by using Isolation Kernel instead of the Gaussian kernel.

Ye Zhu, Kai Ming Ting


Improving the Effectiveness and Efficiency of Stochastic Neighbour Embedding with Isolation Kernel

Replacing the Gaussian kernel with Isolation Kernel in t-SNE significantly improves the quality of the final visualisation output.

Ye Zhu, Kai Ming Ting


Nearest-Neighbour-Induced Isolation Similarity and Its Impact on Density-Based Clustering

We identify shortcomings of using a tree method to implement Isolation Similarity and propose a nearest-neighbour method instead.

Xiaoyu Qin, Kai Ming Ting, Ye Zhu, Vincent CS Lee


Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms

We propose to use mass-based dissimilarity, which employs estimates of the probability mass to measure dissimilarity, to replace the distance metric.

Kai Ming Ting, Ye Zhu, Mark Carman, Yue Zhu, Takashi Washio, Zhi-Hua Zhou


Overcoming key weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilarity Measure

A generic data-dependent dissimilarity, named mass-based dissimilarity, is proposed to allow for different implementations.

Kai Ming Ting, Ye Zhu, Mark Carman, Yue Zhu, Zhi-Hua Zhou
