In this week's seminar we demonstrate applications of Dimensionality Reduction as used in Unsupervised Learning. We will mostly follow along with the demonstrations to embed the 8x8 pixel images of scikit-learn's handwritten digits dataset into the 2-dimensional plane.
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn.decomposition import SparsePCA
from sklearn import manifold, datasets, decomposition, random_projection
digits = datasets.load_digits(n_class=9)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
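A quick sanity check (added here, not part of the original demo) confirms what we just loaded: flattened 8x8 images, hence 64 features, covering the digit classes 0 through 8.
# Each row of X is a flattened 8x8 grayscale image; y holds the digit labels.
print("%d samples, %d features" % (n_samples, n_features))
print("classes:", np.unique(y))
print("image shape:", digits.images[0].shape)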
Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix : ix + 8, iy : iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor="w", edgecolor="k")
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title("A selection from the 64-dimensional digits dataset")
We begin with the simplest embedding possible: we simply project the high-dimensional data onto a random 2-dimensional plane. The resulting embedding is already quite good. Why do random projections work so well? This is due to the Johnson-Lindenstrauss lemma, which we will cover in class: roughly, any n points can be mapped into O(log(n) / ε²) dimensions while distorting all pairwise distances by a factor of at most 1 ± ε.
print("Computing random projection")
rp = random_projection.GaussianRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
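As a small added sanity check (not part of the original demo), scikit-learn's johnson_lindenstrauss_min_dim reports how many dimensions the lemma requires for a given distortion ε. Two dimensions are far below that bound, so we should expect, and indeed observe, sizeable distortions of the pairwise distances; the plot is still informative, but the lemma's guarantee does not apply at this extreme.
from sklearn.random_projection import johnson_lindenstrauss_min_dim
from sklearn.metrics import pairwise_distances

# Dimensions the JL lemma needs to keep all pairwise distances of
# n_samples points within a factor of (1 +/- eps); eps=0.3 is an
# arbitrary illustrative tolerance.
print(johnson_lindenstrauss_min_dim(n_samples, eps=0.3))

# Empirical distortion of our 2-D projection on a small subsample.
idx = np.random.RandomState(0).choice(n_samples, size=100, replace=False)
d_orig = pairwise_distances(X[idx]).ravel()
d_proj = pairwise_distances(X_projected[idx]).ravel()
mask = d_orig > 0  # ignore self-distances
ratios = d_proj[mask] / d_orig[mask]
print("distance ratios: min %.3f, max %.3f" % (ratios.min(), ratios.max()))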
Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    # Rescale the embedding to the unit square so that plots are comparable.
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor="w", edgecolor="k")
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(
            X[i, 0],
            X[i, 1],
            str(y[i]),
            # tab10 gives one distinct color per digit class
            # (Dark2 has only 8 colors for our 9 classes).
            color=plt.cm.tab10(y[i]),
            fontdict={"weight": "bold", "size": 9},
        )
    if hasattr(offsetbox, "AnnotationBbox"):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1.0, 1.0]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), X[i]
            )
            ax.add_artist(imagebox)
    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title)
plot_embedding(X_projected, "Random Gaussian Projection of the digits")
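As a closing sketch (an addition beyond the original random-projection demo, using the decomposition module already imported above), the same helper can visualize a PCA embedding for comparison.
print("Computing PCA projection")
t0 = time()
# Project onto the two directions of maximal variance.
X_pca = decomposition.PCA(n_components=2).fit_transform(X)
plot_embedding(X_pca, "PCA projection of the digits (time %.2fs)" % (time() - t0))
Unlike the random projection, PCA chooses the two directions along which the data varies most, so the digit classes typically separate somewhat more cleanly.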