Classifier
a machine learning classifier learns from a training set of examples and their labels.
nearest neighbor classifier
let \(x^{n}\) be the \(n\)-th training example, \(y^{n}\) be the label of \(x^{n}\).
for example, \(x^{1}\) might be an image of the digit 1, with label \(y^{1} = 1\).
suppose image \(x\) is \(N\times N\) pixels; then \(x\) can be represented as a vector in the data space \(\mathbb{R}^{N^{2}}\).
for example, if \(N=28\), then \(x\) is a vector in \(\mathbb{R}^{784}\).
in summary, data space is \(\mathbb{X} = \mathbb{R}^{N^{2}}\), label space is \(\mathbb{Y} = \{0,1,2,3,4,5,6,7,8,9\}\).
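A minimal sketch of this representation in NumPy (the pixel values here are random stand-ins, not real digit data):

```python
import numpy as np

# Hypothetical 28x28 grayscale image; a real pipeline would load an
# actual digit image here.
N = 28
image = np.random.randint(0, 256, size=(N, N))

# Flatten the N x N pixel grid into a single vector in R^(N^2).
x = image.reshape(-1)
print(x.shape)  # (784,)
```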
distance metric
a good distance metric should have:
- non-negativity: \(d(x,y) \geq 0\)
- identity: \(d(x,y) = 0\) iff \(x = y\)
- symmetry: \(d(x,y) = d(y,x)\)
- triangle inequality: \(d(x,z) \leq d(x,y) + d(y,z)\)
common distance metrics:
- Euclidean distance
- edit distance (for strings)
- Hamming distance (for binary vectors)
- cosine distance (for text data)
- Jaccard distance (for sets)
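Minimal sketches of a few of these distances (helper names are my own; NumPy for the vector-based ones):

```python
import numpy as np

def euclidean(x, y):
    # L2 norm of the difference vector.
    return float(np.linalg.norm(x - y))

def hamming(x, y):
    # Number of positions where two binary vectors differ.
    return int(np.sum(x != y))

def cosine_distance(x, y):
    # 1 - cosine similarity; near 0 when vectors point the same way.
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| for finite sets.
    return 1.0 - len(a & b) / len(a | b)

print(euclidean(np.array([0., 0.]), np.array([3., 4.])))  # 5.0
print(hamming(np.array([0, 1, 1]), np.array([1, 1, 0])))  # 2
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))             # 0.5
```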
non-metric distances:
relative entropy (KL divergence):
\[D_{KL}(P||Q) = \sum_{i} P(i) \log_2 \frac{P(i)}{Q(i)}\]
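A quick numerical check that KL divergence is not symmetric, which is one reason it fails to be a distance metric (toy distributions, base-2 log as in the formula above):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(i) * log2(P(i) / Q(i));
    # assumes Q(i) > 0 wherever P(i) > 0, and skips terms with P(i) = 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
# D_KL(P || Q) != D_KL(Q || P), so symmetry fails.
print(kl_divergence(p, q), kl_divergence(q, p))
```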
how to classify a test image (i.e., identify the digit it represents) using Euclidean distance?
- compute the Euclidean distance between the test image \(x\) and every training image \(x^{n}\), i.e. \(d(x,x^{n}) = \|x - x^{n}\|_{2}\).
- find the nearest neighbor \(x^{m}\) such that \(d(x,x^{m}) = \min_{n} d(x,x^{n})\).
- predict the label \(y^{m}\) of that nearest neighbor.
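The procedure above can be sketched as follows (toy 2-D points standing in for flattened images):

```python
import numpy as np

def nearest_neighbor_classify(x, X_train, y_train):
    # d(x, x^n) = ||x - x^n||_2 for every training example, then argmin.
    dists = np.linalg.norm(X_train - x, axis=1)
    m = int(np.argmin(dists))
    return int(y_train[m])  # predicted label y^m

# Toy training set (hypothetical, not real image data).
X_train = np.array([[0., 0.], [10., 10.], [0., 1.]])
y_train = np.array([0, 1, 0])
print(nearest_neighbor_classify(np.array([0.2, 0.2]), X_train, y_train))  # 0
```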
test error: fraction of test points incorrectly classified.
- a random-guess classifier (10 equally likely classes) has a test error of 90%.
- the nearest neighbor classifier achieves a test error of ~3.09%.
- on the training points themselves, the training error is 0% (each training point is its own nearest neighbor).
sources of error that plain Euclidean distance does not account for:
- translations and rotations of images
- natural deformations (shape context methods address these)
k-nearest neighbors (k-NN): instead of using only the single nearest neighbor, find the \(k\) nearest neighbors and take a majority vote over their labels.
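A minimal k-NN sketch with majority voting (toy data; `k=3` is an arbitrary choice here):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    # Sort training points by distance, keep the k nearest,
    # then take a majority vote over their labels.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.1, 0.1]), X_train, y_train, k=3))  # 0
```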
discrete / continuous labels
- discrete \(y\): classification
- binary classification
- multi-class classification
- continuous \(y\): regression