K-Means Clustering and its Use-Cases
Understanding K-means Clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.
“The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.”
A cluster refers to a collection of data points aggregated together because of certain similarities.
You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
Each data point is allocated to exactly one cluster, the one whose centroid is nearest, in a way that reduces the within-cluster sum of squares.
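The assignment of points to centroids and the within-cluster sum of squares can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy points, and the two centroid positions are made up for the example, and squared Euclidean distance is assumed as the similarity measure.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def within_cluster_sum_of_squares(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        squared_distance(p, centroids[label]) for p, label in zip(points, labels)
    )


# Toy data, invented for illustration: two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
# Hypothetical centroid positions, one per group.
centroids = [(1.0, 1.5), (8.5, 8.0)]

labels = assign_to_nearest(points, centroids)
wcss = within_cluster_sum_of_squares(points, centroids, labels)
print(labels, wcss)
```

The smaller the sum, the more tightly each point sits around its own centroid, which is the quantity K-means tries to drive down.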
How the K-means algorithm works
To process the training data, the K-means algorithm in data mining starts with an initial set of randomly selected centroids, which serve as the starting points for the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
- The centroids have stabilized: their values no longer change from one iteration to the next, which means the clustering has converged.
- The defined number of iterations has been reached.
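The loop described above, with both stopping conditions, might look roughly like the following. It is a minimal sketch rather than a production implementation: the function name `kmeans`, the `max_iterations` and `seed` parameters, and the toy points are assumptions made for the example, and initial centroids are simply sampled from the data.

```python
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic K-means on a list of equal-length tuples of floats."""
    rng = random.Random(seed)
    # Start from k randomly selected data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iterations):  # stop 2: iteration budget used up
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # stop 1: centroids have stabilized
            break
        centroids = new_centroids
    # labels are the assignments made in the last iteration that ran.
    return centroids, labels


# Toy data, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the starting centroids are random, different seeds can lead to different final clusters on harder data; a common remedy is to run the algorithm several times and keep the result with the lowest within-cluster sum of squares.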
Applications of K-Means Clustering :
- Academic Performance :
Based on their scores, students are grouped into clusters that correspond to grades like A, B, or C.
- Diagnostic systems :
The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
- Search engines :
Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this.
- Wireless sensor networks :
The clustering algorithm plays the role of finding the cluster heads, each of which collects all the data in its respective cluster.
Use-Cases in Security Domain
1. Identifying crime localities :
When crime data is available for specific localities in a city, clustering it by the category of crime and the area of the crime, and examining the association between the two, can give quality insight into crime-prone areas within a city or a locality.
2. Cyber-profiling criminals :
Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiles, which provide information to the investigation division to classify the types of criminals who were at the crime scene.
Criminal profiling is both an art and a science: it relies on knowing what types of people most often commit specific types of criminal offenses. Law enforcement and information technology professionals need to work together to control this growing problem: cybercrime and the individuals behind it.
3. Crime document classification :
Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard document-grouping problem, and k-means is well suited to it because no labelled examples are required. The documents are first processed so that each one is represented as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to identify groups of similar documents.
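As a rough illustration of that pipeline, the sketch below turns a few short texts into term-frequency vectors and clusters them with a compact K-means. Everything specific here is an assumption made for the example: the sample sentences are invented, tokenization is a plain whitespace split, and the helper names `term_frequency_vectors` and `cluster_vectors` are hypothetical.

```python
import random
from collections import Counter


def term_frequency_vectors(documents):
    """Turn each document into a vector of relative term frequencies."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = max(len(tokens), 1)  # avoid dividing by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cluster_vectors(vectors, k, max_iterations=50, seed=0):
    """Plain K-means on lists of floats; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # leave the centroid in place if its cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented sample texts, for illustration only.
documents = [
    "burglary reported downtown burglary",
    "downtown burglary suspect",
    "phishing email fraud",
    "fraud email phishing scam",
]
vocabulary, vectors = term_frequency_vectors(documents)
labels = cluster_vectors(vectors, k=2)
print(vocabulary)
print(labels)
```

Documents that share many terms end up with nearby vectors, which is what lets the clustering step group them. Real document collections usually need more preprocessing than this, such as removing very common words and weighting terms by how rare they are across the collection.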