Blog on K-NN Algorithm Using Python

 


With the business world increasingly revolving around data science, it has become one of the most sought-after fields. In this article on the KNN algorithm, you will understand how the KNN algorithm works and how it can be implemented using Python.

What is KNN Algorithm?

“K-nearest neighbors, or the KNN algorithm, is a simple algorithm that uses the entire dataset in its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for the k most similar instances, and the most common outcome among those instances is returned as the prediction.”

KNN is often used in search applications where you are looking for similar items, such as “find items similar to this one.”

The algorithm suggests that if you’re similar to your neighbors, then you are one of them. For example, if an apple looks more similar to a peach, pear, or cherry (fruits) than to a monkey, cat, or rat (animals), then the apple is most likely a fruit.

How does a KNN Algorithm work?

The k-nearest neighbor algorithm uses a very simple approach to perform classification. When tested with a new example, it looks through the training data and finds the k training examples that are closest to the new example. It then assigns the most common class label (among those k training examples) to the test example.

What does ‘k’ in the KNN Algorithm represent?


k in the KNN algorithm represents the number of nearest neighbors that vote on the class of the new test data point.

If k=1, then test examples are given the same label as the closest example in the training set.

If k=3, the labels of the three closest training examples are checked and the most common (i.e., occurring at least twice) label is assigned, and so on for larger values of k.
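As a quick illustration of this voting rule, here is a minimal Python sketch; the neighbor labels are made up purely for the example:

from collections import Counter

# Hypothetical labels of the nearest neighbors, ordered from closest to farthest.
neighbor_labels = ["fruit", "fruit", "animal"]

# k = 1: take the label of the single closest training example.
prediction_k1 = neighbor_labels[0]                                   # "fruit"

# k = 3: take the most common label among the three closest examples.
prediction_k3 = Counter(neighbor_labels[:3]).most_common(1)[0][0]    # "fruit"

print(prediction_k1, prediction_k3)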

KNN Algorithm Manual Implementation

Let’s consider this example,

Suppose we have the height, weight, and corresponding T-shirt size of several customers. Your task is to predict the T-shirt size of Anna, whose height is 161 cm and weight is 61 kg.




Step 1: Calculate the Euclidean distance between the new point and the existing points.
For example, the Euclidean distance between point P1(1, 1) and P2(5, 4) is:

sqrt((5 - 1)^2 + (4 - 1)^2) = sqrt(16 + 9) = sqrt(25) = 5

Step 2: Choose the value of K and select the K neighbors closest to the new point.
In this case, with K = 5, select the five records with the smallest Euclidean distance.


Step 3: Count the votes of all the K neighbors and predict the value.
Since for K = 5 we have four T-shirts of size M among the nearest neighbors, according to the KNN algorithm Anna (height 161 cm, weight 61 kg) will fit into a T-shirt of size M.
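The three steps above can be written out directly in Python. The customer table is not reproduced in this article, so the heights, weights, and sizes below are placeholder values used only to show the mechanics; Anna's measurements (161 cm, 61 kg) come from the example:

import math
from collections import Counter

# Placeholder training data: (height in cm, weight in kg, T-shirt size).
# Replace these rows with the real customer table.
customers = [
    (158, 58, "M"), (160, 59, "M"), (163, 61, "M"),
    (165, 62, "L"), (160, 60, "M"), (168, 66, "L"),
]

def euclidean(p, q):
    # Step 1: Euclidean distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(query, data, k=5):
    # Step 1: distance from the query to every training point.
    distances = [(euclidean(query, (h, w)), size) for h, w, size in data]
    # Step 2: keep the k neighbors with the smallest distances.
    nearest = sorted(distances)[:k]
    # Step 3: majority vote among those k neighbors.
    votes = Counter(size for _, size in nearest)
    return votes.most_common(1)[0][0]

# Anna: height 161 cm, weight 61 kg.
print(knn_predict((161, 61), customers, k=5))

With this placeholder table, four of Anna's five nearest neighbors wear size M, so the vote returns M, mirroring the manual calculation above.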



Mathematical explanation of the algorithm

Distance Calculation:

The Euclidean distance between two instances, A and B, in a feature space with d dimensions can be calculated using the following formula:
Euclidean distance = sqrt((A1 - B1)^2 + (A2 - B2)^2 + ... + (Ad - Bd)^2)
Similarly, other distance metrics like the Manhattan distance or cosine similarity can be used depending on the problem and data characteristics.

Selecting the Nearest Neighbors:

Once the distances are calculated, KNN selects the K nearest neighbors based on the shortest distances. The value of K is a parameter chosen by the user.

Classification:

For classification tasks, KNN assigns the majority class among the K nearest neighbors to the new instance. This is done through a voting mechanism. Let's assume there are C classes in total, and for the K neighbors, the counts of each class are stored. The class with the highest count is assigned to the new instance.

Regression:

For regression tasks, KNN takes the average value of the target variable from the K nearest neighbors and assigns it to the new instance as the predicted value. If the target variable is continuous, the predicted value is:
Predicted value = (Sum of target values of K nearest neighbors) / K

These formulas provide a mathematical foundation for the KNN algorithm, enabling it to classify or predict values based on the distances and relationships between instances in the feature space.
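To see how these formulas map onto working code, here is a rough sketch using scikit-learn (assuming it is installed); the tiny arrays are invented purely for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy feature matrix and targets, invented for illustration only.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
y_class = np.array([0, 0, 1, 1, 0])            # class labels
y_reg = np.array([1.1, 1.3, 7.9, 8.4, 0.9])    # continuous targets

# Classification: majority vote among the K nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X, y_class)
print(clf.predict([[1.4, 1.5]]))

# Regression: average of the K nearest neighbors' target values.
# The metric parameter swaps in other distances, e.g. Manhattan.
reg = KNeighborsRegressor(n_neighbors=3, metric="manhattan")
reg.fit(X, y_reg)
print(reg.predict([[1.4, 1.5]]))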

Advantages and disadvantages of the algorithm

Advantages of the KNN algorithm:
Simplicity: KNN is relatively easy to understand and implement, making it accessible even to those new to machine learning.
No Training Phase: KNN does not require an explicit training phase as it stores the labeled instances of the training data for predictions based on similarity.
Versatility: KNN can be applied to both classification and regression tasks, making it a versatile algorithm for a wide range of problem domains.
Non-parametric: KNN is a non-parametric algorithm, meaning it does not assume any specific underlying data distribution. This makes it effective in handling complex and non-linear decision boundaries.
Multi-class Classification: KNN naturally extends to handle multi-class classification problems without the need for additional modifications.
Disadvantages of the KNN algorithm:
Computational Complexity: Predicting the class or value for a new instance requires calculating the distances between that instance and all labeled instances in the training data. This can be computationally expensive, especially for large datasets.
Sensitivity to Feature Scaling: KNN is sensitive to the scale of features. If features have different scales, the ones with larger magnitudes can dominate the distance calculation, leading to biased results. Feature scaling or normalization is often necessary before applying KNN (see the sketch after this list).
Impact of Choosing K: The choice of K, the number of nearest neighbors, significantly affects the performance of KNN. A small K value may lead to high variance and overfitting, while a large K value may lead to high bias and underfitting. Selecting the optimal K value requires careful consideration and experimentation.
Imbalanced Data: KNN can be biased towards the majority class in imbalanced datasets. If one class dominates the training data, the predictions may favor that class even if the minority class is equally important.
Storage of Training Data: KNN stores the entire training dataset, which can consume substantial memory, especially for large datasets with many features.
It's important to consider these advantages and disadvantages when applying KNN, ensuring it is suitable for the specific problem at hand.
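Two of the disadvantages above, sensitivity to feature scaling and the choice of K, are usually handled together. Here is a sketch of one common approach, using a scikit-learn pipeline and cross-validation; the data is synthetic, generated only so the example runs end to end:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so none dominates the distance calculation,
# then search over K with cross-validation instead of guessing it.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)            # the K value chosen by cross-validation
print(grid.score(X_test, y_test))   # accuracy on held-out data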

Best case scenario where the algorithm should be used

 K-NN is particularly well-suited for certain scenarios, including:

  1. Recommendation Systems: K-NN can be employed to recommend items or products based on the preferences of similar users or customers.
  2. Anomaly Detection: By considering the proximity of instances, K-NN can help detect outliers or anomalies in a dataset (a brief sketch follows this list).
  3. Image Recognition: K-NN can be utilized for image classification tasks by comparing the features of images and their nearest neighbors.
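For the anomaly detection case, one simple approach (among several) is to flag points whose distance to their k-th nearest neighbor is unusually large. A minimal sketch with scikit-learn on synthetic data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic data: a tight cluster plus one obvious outlier at (8, 8).
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])

# Distance to the k-th nearest neighbor serves as a simple anomaly score.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]

# The points with the largest scores are the most isolated.
print(np.argsort(scores)[-3:])   # the planted outlier (index 50) should appear here
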
Conclusion

In conclusion, the K-Nearest Neighbors algorithm is a simple yet powerful tool in the machine learning arsenal. Its intuitive nature, flexibility, and ability to handle various problem domains make it a popular choice. By understanding its principles and weighing its advantages and disadvantages, we can effectively harness the potential of K-NN for diverse applications.

This was all about the K-NN algorithm using Python.
