Use Cases of k-Means Clustering in the domain of Security

Anjali Singh
3 min read · Aug 12, 2021

Clustering

Clustering is an exploratory data analysis technique used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.

K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

K-means Algorithm

K-means is a centroid-based (or distance-based) algorithm: we calculate distances to assign each point to a cluster. In K-means, each cluster is associated with a centroid. The ‘means’ in K-means refers to averaging the data, that is, finding the centroid.

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as far apart as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within a cluster, the more homogeneous (similar) its data points are.
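Written out, the quantity being minimized can be expressed as below. The notation is chosen here for illustration and is not from any particular library: C_k is the set of points in cluster k, x_i is a data point, and mu_k is the centroid of cluster k.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```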

The k-means algorithm works as follows:

· Specify the number of clusters K.

· Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.

· Repeat the following steps until the centroids stop changing, i.e. the assignment of data points to clusters no longer changes:

    · Compute the squared distance between each data point and every centroid.

    · Assign each data point to the closest cluster (centroid).

    · Recompute the centroid of each cluster by taking the average of all the data points that belong to it.
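The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, and the synthetic two-blob dataset are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Squared Euclidean distance from every point to every centroid: shape (n, k).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic data (an assumption for the demo): two well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

Because the starting centroids are chosen at random, different seeds can converge to different solutions on harder datasets, so it is common to run the algorithm several times and keep the result with the lowest sum of squared distances.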

Use-Cases in the Security Domain

1. Identifying crime localities

Given crime data for specific localities in a city, clustering on the category of crime, the area where it occurred, and the association between the two can give useful insight into crime-prone areas within a city or locality.

2. Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
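As a rough sketch of the proximity idea, the snippet below measures how far each new claim is from the nearest centroid of previously fitted fraud clusters. Everything concrete here is hypothetical: the centroid values, the two scaled features, and the `threshold` cutoff are made up for illustration and would have to come from real historical data.

```python
import numpy as np

def nearest_centroid_distance(claims, centroids):
    """Euclidean distance from each claim to its closest centroid."""
    diffs = claims[:, None, :] - centroids[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Hypothetical centroids of clusters found in past fraudulent claims,
# in an already-scaled two-feature space (features are assumptions).
fraud_centroids = np.array([[0.9, 0.1],
                            [0.8, 0.9]])

# Hypothetical new claims in the same scaled feature space.
new_claims = np.array([[0.85, 0.15],
                       [0.20, 0.50]])

threshold = 0.2  # assumed cutoff; in practice this would be tuned on labeled data
distances = nearest_centroid_distance(new_claims, fraud_centroids)
flagged = distances < threshold  # True where a claim sits close to a fraud cluster
print(flagged)
```

A claim flagged this way is only a candidate for review; closeness to a cluster of past fraud is a signal, not proof.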

3. Call detail record analysis

A call detail record (CDR) is the information captured by telecom companies during a customer’s call, SMS, and internet activity. Combined with customer demographics, this information provides greater insight into the customer’s needs. We can cluster customer activity over 24 hours using the unsupervised k-means clustering algorithm to understand customer segments with respect to their hourly usage.
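Before clustering, each customer’s activity has to be turned into a fixed-length vector. One possible way, sketched below, is a 24-slot profile holding the share of activity in each hour of the day. The record layout `(customer_id, hour_of_day, activity_count)`, the sample rows, and the choice to normalize each profile are assumptions for illustration, not a real CDR schema.

```python
import numpy as np

# Hypothetical CDR rows: (customer_id, hour_of_day, activity_count).
records = [
    ("alice", 9, 4), ("alice", 10, 6), ("alice", 21, 1),
    ("bob", 22, 7), ("bob", 23, 5), ("bob", 9, 1),
]

def hourly_profiles(records):
    """Build one 24-dimensional usage profile per customer."""
    customers = sorted({cid for cid, _, _ in records})
    index = {cid: i for i, cid in enumerate(customers)}
    profiles = np.zeros((len(customers), 24))
    for cid, hour, count in records:
        profiles[index[cid], hour] += count
    # Normalize each row to sum to 1 so profiles reflect *when* a customer
    # is active rather than how much (a design choice, not a requirement).
    totals = profiles.sum(axis=1, keepdims=True)
    return customers, profiles / np.where(totals == 0, 1, totals)

customers, profiles = hourly_profiles(records)
print(customers)
print(profiles.shape)
```

The rows of `profiles` can then be passed to k-means, so that customers with similar daily usage patterns, such as daytime versus late-evening users, end up in the same segment.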
