Classifying with k-Nearest Neighbors

1. Classifying with distance measurements

k-Nearest Neighbors (kNN) is one of the simplest yet most effective machine learning algorithms used for classification and regression tasks. As a non-parametric, instance-based learning algorithm, kNN makes predictions based on the similarity (or distance) between data points rather than constructing a generalized internal model.

Key Concepts of kNN

At its core, kNN operates on a fundamental principle: similar things exist in close proximity. When classifying a new data point, the algorithm:

  1. Calculates the distance between the new point and all points in the training dataset
  2. Identifies the k nearest points (neighbors)
  3. For classification: assigns the most common class among these k neighbors
  4. For regression: assigns the average value of the k neighbors (a minimal sketch of this case follows the list)
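Here is that minimal regression sketch, assuming NumPy arrays and plain Euclidean distance (defined in the next section); the classification case is implemented from scratch later in this post:

import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    # Distances from the query point to every training point (Euclidean)
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k nearest neighbors
    k_indices = np.argsort(distances)[:k]
    # Regression: average the neighbors' target values
    return y_train[k_indices].mean()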

Distance Metrics

The choice of distance metric significantly impacts kNN’s performance. The most commonly used metrics include:

  • Euclidean Distance: The straight-line distance between two points in Euclidean space

    import math

    def euclidean_distance(point1, point2):
        sum_squared_distance = 0
        for i in range(len(point1)):
            sum_squared_distance += (point1[i] - point2[i]) ** 2
        return math.sqrt(sum_squared_distance)
  • Manhattan Distance: The sum of absolute differences between points across all dimensions

    def manhattan_distance(point1, point2):
        return sum(abs(a - b) for a, b in zip(point1, point2))
  • Minkowski Distance: A generalization that includes both Euclidean and Manhattan as special cases

    def minkowski_distance(point1, point2, p):
        # p=1: Manhattan, p=2: Euclidean
        return sum(abs(a - b) ** p for a, b in zip(point1, point2)) ** (1 / p)
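
A quick sanity check shows how the three functions relate on a pair of 2-D points (Minkowski reduces to Euclidean at p=2 and Manhattan at p=1):

p1, p2 = [0, 0], [3, 4]
print(euclidean_distance(p1, p2))     # 5.0
print(manhattan_distance(p1, p2))     # 7
print(minkowski_distance(p1, p2, 2))  # 5.0
print(minkowski_distance(p1, p2, 1))  # 7.0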

Implementing kNN from Scratch

Let’s implement a basic kNN classifier in Python:

import numpy as np
from collections import Counter

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        # Compute distances
        distances = [np.sqrt(np.sum((x - x_train) ** 2)) for x_train in self.X_train]

        # Get the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]

        # Get the labels of the k nearest neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]

        # Return the most common class label
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]
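
A quick usage sketch on a made-up two-class dataset (the data here is purely for illustration):

# Toy example: two well-separated clusters, two classes
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_toy = np.array([0, 0, 1, 1])

clf = KNNClassifier(k=3)
clf.fit(X_toy, y_toy)
print(clf.predict(np.array([[1.2, 1.9], [5.5, 8.5]])))  # expected: [0 1]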

Using scikit-learn for kNN

For practical applications, scikit-learn provides an optimized implementation:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data (assumes a feature matrix X and labels y are already loaded)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Choosing the Optimal k Value

The value of k significantly affects the model’s performance:

  • Small k: Low bias but high variance; the model becomes sensitive to noise and outliers
  • Large k: Smoother decision boundaries with higher bias; the model may miss important local patterns

A common approach is to use cross-validation:

from sklearn.model_selection import cross_val_score

# Test different k values
k_values = range(1, 31)
k_accuracies = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_accuracies[k] = scores.mean()

# Find the best k
best_k = max(k_accuracies, key=k_accuracies.get)
print(f"Best k: {best_k}, Accuracy: {k_accuracies[best_k]:.4f}")

2. Example: Improving matches from a dating site with kNN

Dating sites and recommendation systems often need to match users based on their preferences and attributes. kNN provides an intuitive approach to finding compatible matches by identifying users with similar preferences.

Problem Statement

Let’s consider a dating site with the following user attributes:

  • Age
  • Height
  • Income
  • Education level
  • Location
  • Interests/hobbies
  • Lifestyle preferences

Our goal is to recommend compatible matches for each user based on similarity.

Data Preparation

First, we need to prepare our data:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Sample dataset (in a real application, this would be much larger)
data = {
    'user_id': range(1, 8),
    'age': [25, 30, 35, 28, 32, 24, 29],
    'height': [165, 180, 175, 172, 168, 170, 178],
    'income_level': [3, 5, 6, 4, 7, 2, 5],       # On a scale of 1-10
    'education_level': [4, 5, 5, 3, 6, 4, 5],    # On a scale of 1-6
    'outdoor_interest': [7, 3, 5, 8, 4, 6, 7],   # On a scale of 1-10
    'social_interest': [6, 8, 4, 5, 7, 9, 5]     # On a scale of 1-10
}

# Create DataFrame
df = pd.DataFrame(data)
df.set_index('user_id', inplace=True)

# Extract features
X = df.values

# Normalize data (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
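
A quick check confirms what the scaler does: every column now has roughly zero mean and unit variance, so no single attribute dominates the distance calculation.

# Each feature now has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))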

Implementing the Match Recommendation System

Now, let’s implement the kNN-based recommendation system:

# Create the kNN model
n_neighbors = 3  # Number of matches to recommend
knn_model = NearestNeighbors(n_neighbors=n_neighbors + 1, algorithm='auto', metric='euclidean')
knn_model.fit(X_scaled)

def recommend_matches(user_id, n_recommendations=3):
    """
    Recommend matches for a given user based on similarity
    """
    # Get the index position of the user
    user_index = df.index.get_loc(user_id)

    # Find the k nearest neighbors
    distances, indices = knn_model.kneighbors([X_scaled[user_index]])

    # Ignore the first result (it's the user itself)
    similar_user_indices = indices[0][1:n_recommendations + 1]

    # Return the user IDs of the recommended matches
    recommended_user_ids = [df.index[i] for i in similar_user_indices]

    return recommended_user_ids

# Example: Recommend matches for user_id=3
matches = recommend_matches(user_id=3)
print(f"Recommended matches for user 3: {matches}")

Importance Weighting

Not all features are equally important in match-making. We can enhance our recommendation system by adding importance weights to different attributes:

# Weights for each feature (must sum to 1)
feature_weights = np.array([
    0.15,  # age
    0.10,  # height
    0.10,  # income_level
    0.10,  # education_level
    0.25,  # outdoor_interest
    0.30   # social_interest
])

def weighted_euclidean_distance(x, y):
    """
    Compute the weighted Euclidean distance between two points
    """
    return np.sqrt(np.sum(feature_weights * (x - y) ** 2))

def recommend_weighted_matches(user_id, n_recommendations=3):
    user_index = df.index.get_loc(user_id)
    user_features = X_scaled[user_index]

    # Calculate weighted distances to all other users
    distances = []
    for i, features in enumerate(X_scaled):
        if i != user_index:  # Skip the user itself
            dist = weighted_euclidean_distance(user_features, features)
            distances.append((df.index[i], dist))

    # Sort by distance and take the top n recommendations
    distances.sort(key=lambda pair: pair[1])
    recommended_user_ids = [uid for uid, _ in distances[:n_recommendations]]

    return recommended_user_ids

# Example: Recommend matches for user_id=3 with weighted features
weighted_matches = recommend_weighted_matches(user_id=3)
print(f"Weighted recommended matches for user 3: {weighted_matches}")

This approach can significantly improve the quality of recommendations by focusing on the attributes that matter most to users.
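
As a side note, the same weighting can be folded back into the NearestNeighbors model used earlier: scaling each column by the square root of its weight makes plain Euclidean distance on the rescaled features equal to the weighted distance above. A minimal sketch, reusing the variables defined earlier:

# Pre-scale columns by sqrt(weight); ordinary Euclidean distance on the result
# equals the weighted Euclidean distance defined above
X_weighted = X_scaled * np.sqrt(feature_weights)

weighted_knn = NearestNeighbors(n_neighbors=n_neighbors + 1, metric='euclidean')
weighted_knn.fit(X_weighted)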

3. Example: A handwriting recognition system

Handwriting recognition is a classical problem in machine learning and computer vision. The k-Nearest Neighbors algorithm can be effectively applied to recognize handwritten characters, particularly digits.

The MNIST Dataset

The MNIST (Modified National Institute of Standards and Technology) dataset is the most widely used benchmark for handwritten digit recognition. It consists of 70,000 grayscale images of handwritten digits (0-9), with 60,000 for training and 10,000 for testing.

Each image is 28x28 pixels, which can be flattened into a 784-dimensional vector for processing.
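
As a quick illustration of that flattening step (a trivial NumPy sketch):

import numpy as np

# A 28x28 image becomes a 784-dimensional feature vector, and can be restored
image = np.zeros((28, 28))
vector = image.reshape(-1)         # shape: (784,)
restored = vector.reshape(28, 28)  # shape: (28, 28)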

Loading and Exploring the Dataset

Let’s load the MNIST dataset and explore its structure:

from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
import numpy as np

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']

# Convert to numpy arrays
X = X.values
y = y.astype(int).values

# Inspect data shape
print(f"X shape: {X.shape}, y shape: {y.shape}")

# Visualize some examples
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
axes = axes.flatten()

for i in range(10):
    sample_idx = np.where(y == i)[0][0]  # First instance of each digit
    axes[i].imshow(X[sample_idx].reshape(28, 28), cmap='gray')
    axes[i].set_title(f"Digit: {y[sample_idx]}")
    axes[i].axis('off')

plt.tight_layout()
plt.show()

Data Preprocessing

Before applying kNN, we need to preprocess the data:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into training and testing sets
# Using a smaller subset for demonstration (faster computation)
X_train, X_test, y_train, y_test = train_test_split(
    X[:10000], y[:10000], test_size=0.2, random_state=42
)

# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Implementing kNN for Digit Recognition

Now we can apply the kNN algorithm:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Create the kNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
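
Beyond the per-class report, a confusion matrix makes it easy to see which digits are mistaken for which:

from sklearn.metrics import confusion_matrix

# Rows are true digits (0-9), columns are predicted digits
print(confusion_matrix(y_test, y_pred))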

Optimizing the Model

We can optimize our model by finding the best value of k:

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

# Test different k values
k_values = range(1, 20, 2)  # Odd values to avoid ties
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    accuracies.append(scores.mean())

# Plot accuracies
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o', linestyle='-')
plt.title('kNN Accuracy vs. k Value')
plt.xlabel('k Value')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

# Find the best k
best_k = k_values[np.argmax(accuracies)]
print(f"Best k value: {best_k}")

# Create a model with the best k
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_scaled, y_train)
best_pred = best_knn.predict(X_test_scaled)
best_accuracy = accuracy_score(y_test, best_pred)
print(f"Best accuracy: {best_accuracy:.4f}")

Visualizing the Predictions

Let’s visualize some of the predictions:

def plot_predictions(X, true_labels, predicted_labels, n_samples=10):
    fig, axes = plt.subplots(2, n_samples, figsize=(n_samples * 1.5, 4))

    for i in range(n_samples):
        # Randomly select a sample
        idx = np.random.randint(0, len(X))

        # Plot the image with its true label
        axes[0, i].imshow(X[idx].reshape(28, 28), cmap='gray')
        axes[0, i].set_title(f"True: {true_labels[idx]}")
        axes[0, i].axis('off')

        # Highlight if the prediction is wrong
        color = 'green' if predicted_labels[idx] == true_labels[idx] else 'red'
        axes[1, i].imshow(X[idx].reshape(28, 28), cmap='gray')
        axes[1, i].set_title(f"Pred: {predicted_labels[idx]}", color=color)
        axes[1, i].axis('off')

    plt.tight_layout()
    plt.show()

# Visualize some predictions
plot_predictions(X_test, y_test, best_pred)

Challenges and Optimizations

While kNN works well for digit recognition, it faces several challenges:

  1. Computational Complexity: kNN requires calculating distances between the test point and all training points, which can be slow for large datasets.

  2. Curse of Dimensionality: In high-dimensional data (like images), distances between points become less informative, which reduces the effectiveness of distance-based methods.

To address these challenges, we can:

  • Use dimensionality reduction techniques (PCA, t-SNE)
  • Implement efficient nearest-neighbor search algorithms (KD-trees, Ball trees)
  • Apply feature engineering to extract relevant attributes

The snippet below applies the first option, reducing the images with PCA before running kNN:

from sklearn.decomposition import PCA

# Apply PCA to reduce dimensions
pca = PCA(n_components=50) # Reduce from 784 to 50 dimensions
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train kNN on reduced dimensions
knn_pca = KNeighborsClassifier(n_neighbors=best_k)
knn_pca.fit(X_train_pca, y_train)
pca_pred = knn_pca.predict(X_test_pca)
pca_accuracy = accuracy_score(y_test, pca_pred)

print(f"Accuracy with PCA: {pca_accuracy:.4f}")
print(f"Explained variance: {sum(pca.explained_variance_ratio_):.4f}")
