k-Nearest Neighbors (kNN) is one of the simplest yet most effective machine learning algorithms used for classification and regression tasks. As a non-parametric, instance-based learning algorithm, kNN makes predictions based on the similarity (or distance) between data points rather than constructing a generalized internal model.
Key Concepts of kNN
At its core, kNN operates on a fundamental principle: similar things exist in close proximity. When classifying a new data point, the algorithm performs the following steps (a from-scratch sketch follows the list):
Calculates the distance between the new point and all points in the training dataset
Identifies the k nearest points (neighbors)
For classification: assigns the most common class among these k neighbors
For regression: assigns the average value of the k neighbors
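Putting these steps together, here is a minimal from-scratch sketch of a kNN classifier. It is illustrative only; the function and variable names are not from any library, and the toy data is made up:

from collections import Counter
import math

def knn_classify(new_point, training_points, training_labels, k=3):
    # 1. Compute the distance from the new point to every training point
    distances = []
    for point, label in zip(training_points, training_labels):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_point, point)))
        distances.append((dist, label))
    # 2. Keep the k nearest neighbors
    distances.sort(key=lambda pair: pair[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    # 3. Majority vote among the k neighbors
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Toy usage: two clusters of 2-D points
X_train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_classify((2, 2), X_train, y_train, k=3))  # -> 'A'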
Distance Metrics
The choice of distance metric significantly impacts kNN’s performance. The most commonly used metrics include:
Euclidean Distance: The straight-line distance between two points in Euclidean space
import math

def euclidean_distance(point1, point2):
    sum_squared_distance = 0
    for i in range(len(point1)):
        sum_squared_distance += (point1[i] - point2[i]) ** 2
    return math.sqrt(sum_squared_distance)
Manhattan Distance: The sum of absolute differences between points across all dimensions
def manhattan_distance(point1, point2):
    return sum(abs(a - b) for a, b in zip(point1, point2))
Minkowski Distance: A generalization that includes both Euclidean and Manhattan as special cases
def minkowski_distance(point1, point2, p):
    # p=1: Manhattan, p=2: Euclidean
    return sum(abs(a - b) ** p for a, b in zip(point1, point2)) ** (1 / p)
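A quick sanity check of the three metrics on an illustrative pair of points (the values are made up for demonstration):

p1, p2 = [1, 2, 3], [4, 6, 8]
print(euclidean_distance(p1, p2))       # ~7.07
print(manhattan_distance(p1, p2))       # 12
print(minkowski_distance(p1, p2, p=2))  # ~7.07, same as Euclidean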
Choosing the Value of k
The value of k significantly affects the model's performance:
Small k: High variance, sensitive to noise
Large k: Low variance but high bias; smoother decision boundaries that can miss important local patterns
A common approach is to use cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y: the feature matrix and labels of your dataset

# Test different k values
k_values = range(1, 31)
k_accuracies = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_accuracies[k] = scores.mean()

# Find the best k
best_k = max(k_accuracies, key=k_accuracies.get)
print(f"Best k: {best_k}, Accuracy: {k_accuracies[best_k]:.4f}")
2. Example: Improving matches from a dating site with kNN
Dating sites and recommendation systems often need to match users based on their preferences and attributes. kNN provides an intuitive approach to finding compatible matches by identifying users with similar preferences.
Problem Statement
Let’s consider a dating site with the following user attributes:
Age
Height
Income
Education level
Location
Interests/hobbies
Lifestyle preferences
Our goal is to recommend compatible matches for each user based on similarity.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
# Sample dataset (in a real application, this would be much larger)
data = {
    'user_id': range(1, 8),
    'age': [25, 30, 35, 28, 32, 24, 29],
    'height': [165, 180, 175, 172, 168, 170, 178],
    'income_level': [3, 5, 6, 4, 7, 2, 5],        # On a scale of 1-10
    'education_level': [4, 5, 5, 3, 6, 4, 5],     # On a scale of 1-6
    'outdoor_interest': [7, 3, 5, 8, 4, 6, 7],    # On a scale of 1-10
    'social_interest': [6, 8, 4, 5, 7, 9, 5]      # On a scale of 1-10
}
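The code below assumes a user-indexed DataFrame df and a standardized feature matrix X_scaled; their construction is missing from the excerpt. A minimal way to build them, assuming all features are scaled equally with StandardScaler:

# Build a DataFrame indexed by user_id and standardize the feature columns
df = pd.DataFrame(data).set_index('user_id')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.values)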
# Create the kNN model
n_neighbors = 3  # Number of matches to recommend
knn_model = NearestNeighbors(n_neighbors=n_neighbors + 1, algorithm='auto', metric='euclidean')
knn_model.fit(X_scaled)
def recommend_matches(user_id, n_recommendations=3):
    """Recommend matches for a given user based on similarity."""
    # Get the index position of the user
    user_index = df.index.get_loc(user_id)

    # Find k nearest neighbors
    distances, indices = knn_model.kneighbors([X_scaled[user_index]])

    # Ignore the first result (it's the user itself)
    similar_user_indices = indices[0][1:n_recommendations + 1]

    # Return the user IDs of the recommended matches
    recommended_user_ids = [df.index[i] for i in similar_user_indices]

    return recommended_user_ids
# Example: Recommend matches for user_id=3
matches = recommend_matches(user_id=3)
print(f"Recommended matches for user 3: {matches}")
Importance Weighting
Not all features are equally important in match-making. We can enhance our recommendation system by adding importance weights to different attributes:
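The excerpt below picks up inside the weighted recommendation function; its opening lines are missing. The following setup is a sketch under assumed values: the feature_weights array and the weighted_euclidean_distance helper are illustrative, not from the original, and the function body continues in the next block.

# Hypothetical per-feature importance weights (order matches df's columns; values assumed)
feature_weights = np.array([0.5, 0.3, 1.0, 0.8, 1.5, 1.2])

def weighted_euclidean_distance(a, b, weights=feature_weights):
    # Weighted straight-line distance: larger weights make a feature matter more
    return np.sqrt(np.sum(weights * (a - b) ** 2))

def recommend_weighted_matches(user_id, n_recommendations=3):
    """Recommend matches using importance-weighted distances."""
    user_index = df.index.get_loc(user_id)
    user_features = X_scaled[user_index]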
    # Calculate weighted distances to all other users
    distances = []
    for i, features in enumerate(X_scaled):
        if i != user_index:  # Skip the user itself
            dist = weighted_euclidean_distance(user_features, features)
            distances.append((df.index[i], dist))

    # Sort by distance and take top n recommendations
    distances.sort(key=lambda x: x[1])
    recommended_user_ids = [uid for uid, _ in distances[:n_recommendations]]

    return recommended_user_ids
# Example: Recommend matches for user_id=3 with weighted features
weighted_matches = recommend_weighted_matches(user_id=3)
print(f"Weighted recommended matches for user 3: {weighted_matches}")
This approach can significantly improve the quality of recommendations by focusing on the attributes that matter most to users.
3. Example: A handwriting recognition system
Handwriting recognition is a classical problem in machine learning and computer vision. The k-Nearest Neighbors algorithm can be effectively applied to recognize handwritten characters, particularly digits.
The MNIST Dataset
The MNIST (Modified National Institute of Standards and Technology) dataset is the most widely used benchmark for handwritten digit recognition. It consists of 70,000 grayscale images of handwritten digits (0-9), with 60,000 for training and 10,000 for testing.
Each image is 28x28 pixels, which can be flattened into a 784-dimensional vector for processing.
Loading and Exploring the Dataset
Let’s load the MNIST dataset and explore its structure:
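The loading code itself is missing from the excerpt. A common way to fetch MNIST and set up the sample grid, assuming scikit-learn's fetch_openml mirror of the dataset (the figure layout is an assumption):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Download MNIST: 70,000 images of 28x28 pixels, flattened to 784 features
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)

print(f"Data shape: {X.shape}")    # (70000, 784)
print(f"Labels shape: {y.shape}")  # (70000,)

# One subplot per digit class
fig, axes = plt.subplots(1, 10, figsize=(15, 2))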
for i in range(10):
    sample_idx = np.where(y == i)[0][0]  # First instance of each digit
    axes[i].imshow(X[sample_idx].reshape(28, 28), cmap='gray')
    axes[i].set_title(f"Digit: {y[sample_idx]}")
    axes[i].axis('off')

plt.tight_layout()
plt.show()
Data Preprocessing
Before applying kNN, we need to preprocess the data:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into training and testing sets
# Using a smaller subset for demonstration (faster computation)
X_train, X_test, y_train, y_test = train_test_split(
    X[:10000], y[:10000], test_size=0.2, random_state=42
)
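Later cells reference X_train_scaled and X_test_scaled, but the scaling lines are missing from the excerpt. A straightforward completion using the StandardScaler imported above (the scaling choice is an assumption):

# Standardize pixel features: fit on the training set only, then apply to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)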
Finding the Optimal k
As before, we use cross-validation to choose k:

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Test different k values
k_values = range(1, 20, 2)  # Odd values to avoid ties
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    accuracies.append(scores.mean())

# Plot accuracies
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o', linestyle='-')
plt.title('kNN Accuracy vs. k Value')
plt.xlabel('k Value')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

# Find the best k
best_k = k_values[np.argmax(accuracies)]
print(f"Best k value: {best_k}")

# Create the model with the best k and evaluate it on the test set
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_scaled, y_train)
best_pred = best_knn.predict(X_test_scaled)
best_accuracy = accuracy_score(y_test, best_pred)
print(f"Best accuracy: {best_accuracy:.4f}")
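As a quick qualitative check, one can display a test digit next to the model's prediction. This is a small illustrative snippet; the index is arbitrary:

# Show one test image together with its predicted and true labels
idx = 0
plt.imshow(X_test[idx].reshape(28, 28), cmap='gray')
plt.title(f"Predicted: {best_pred[idx]}  |  Actual: {y_test[idx]}")
plt.axis('off')
plt.show()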