HW2

alt text

since we use 4 fold cross validation, need to divide the dataset into 4 parts, with each size of 2500. therefore, the size of training set is 2500 *3 = 7500

alt text

1-NN:

for the first +, the closest neighbor is +, so it is classified as +
for the second +, the closest neighbor is + , so it is classified as +
for the third - , the closest neighbor is +, so it is classified as +
for the fourth +, the closest neighbor is - , so it is classified as -
therefore, the test error is 2/4 = 0.5

3-NN:

for the first +, the closest 3 neighbors are +, +, - , so it is classified as +
for the second +, the closest 3 neighbors are +, +, - , so it is classified as +
for the third - , the closest 3 neighbors are +, +, + , so it is classified as +
for the fourth +, the closest 3 neighbors are +, -, + , so it is classified as +
therefore, the test error is 1/4 = 0.25

alt text

so there are n points, and each point is a vector of d components, and if we use the LOOCV method to evaluate, then we need to evaluate n times, and each time we need to calculate the distance between the test point and all other n-1 points, and each distance calculation needs to calculate d components, so the total time complexity is \(O(n*n*d) = O(n^2 d)\)

alt text

a. \(||x||_1 = |1| + |2| + |3| + |4| = 10\)

b. \(||x||_2 = \sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{30}\)

c. \(||x||_{\infty} = \max(|1|, |2|, |3|, |4|) = 4\)

alt text

a. if the \(||x||_\infty = 1\), that means the max component of x is 1. suppose \(x \in R^d\), then the largest value of l1 form is when all components are 1, so the largest value is \(l1 = |1| + |1| + ... + |1| = d\). the largest value of l2 form is when all components are 1, so the largest value is \(l2 = \sqrt{1^2 + 1^2 + ... + 1^2} = \sqrt{d}\)

b. if the \(||x||_2 = 1\) for all points, that means \(\sum x_i^2 = 1\). suppose \(x \in R^d\), and given that the formula for \(l1 = |x_1| + |x_2| + ... + |x_d|\), \(l2 = \sqrt{x_1^2 + x_2^2 + ... + x_d^2} = 1\), and \(l_\infty = \max(|x_1|, |x_2|, ..., |x_d|)\), then the largest value of l1 form is when all components are equal. suppose there are d components, and each component is c, we have \(\sum c^2 = dc^2 = 1\), so \(c = 1/\sqrt{d}\), therefore, the largest value of l1 form is \(l1 = d * (1/\sqrt{d}) = \sqrt{d}\). the largest value of \(l_\infty\) form is when all one component is 1 and other components are 0, so the largest value is \(l_\infty = 1\)

alt text

this is not metric:

non-negativity: all distance values are non-negative
identity: distance is 0 iff the two points are the same
symmetry: distance from A to B is the same as distance from B to A
triangle inequality: one exception here: the distance from A -> B = 2, C -> A = 1, but B -> C = 4, so 4 > 2 + 1, therefore, it is not a metric

alt text

a. the distance function \(d(x,y) = x-y\) is not metric, because it does not satisfy non-negativity (for example, if x=1, y=2, then d(1,2) = 1-2 = -1)

b. the distance function d(x,y) = # positions on which x and y are different based on hamming distance is metric. this is metric distance

c. the squared euclidean distance \(d(x,y) = \sum_{i=1}^m (x_i - y_i)^2\) is not metric. when m = 1, then, d(x,y) = (x-y)^2. suppose when x=1, y=2, z=4, then d(1,2) = 1, d(2,4) = 4, but d(1,4) = 9, so 9 > 1 + 4. therefore, it does not satisfy triangle inequality

alt text

\[ D_{KL}(P||Q) = \sum_i P(i) \log_2 \frac{P(i)}{Q(i)} \]

\[ = \frac{1}{2} \log_2 \frac{\frac{1}{2}}{\frac{1}{4}} + \frac{1}{4} \log_2 \frac{\frac{1}{4}}{\frac{1}{4}} + \frac{1}{8} \log_2 \frac{\frac{1}{8}}{\frac{1}{6}} + \frac{1}{16} \log_2 \frac{\frac{1}{16}}{\frac{1}{6}} + \frac{1}{16} \log_2 \frac{\frac{1}{16}}{\frac{1}{6}} \]

\[ = \frac{1}{2} \log_2 2 + 0 + \frac{1}{8} \log_2 \frac{3}{4} + \frac{1}{16} \log_2 \frac{3}{8} + \frac{1}{16} \log_2 \frac{3}{8} \]

\[ = \frac{1}{2} + \frac{1}{8} \log_2 2 + \frac{1}{8} \log_2 \frac{3}{8} + \frac{1}{16} \log_2 \frac{3}{8} + \frac{1}{16} \log_2 \frac{3}{8}\]

\[ = \frac{5}{8} + \frac{1}{4} \log_2 \frac{3}{8} \]

\[ \approx .2712 \]

alt text

a. it should be classification problem, because output is discrete states

b. it should be regression problem, because output is continuous speed

c. it should be a regression problem, because output is continuous GPA

d. it should be classification problem, because output is discrete states

alt text

a. \(\sigma = \sqrt{\100} = 10\)

b. with the random variable X = \({0, 10, 20}\), mean = 10, variance = 100 we know that:

\[p_0 + p_1 + p_2 = 1\]

\[0*p_0 + 10*p_1 + 20*p_2 = 10\]

\[0^2*p_0 + 10^2*p_1 + 20^2*p_2 = 100\]

solving the equations, we get:

\[p_0 = 0, p_1 = 1, p_2 = 0\]

alt text

a.
\(\(Cov(X, Y) = E[XY] - E[X]E[Y]\)\)

\[E[X] = \sum x_i p(x_i) = -1/3 + 0 + 1/3 = 0\]