
Exercise 18.5
Verify that the SVD of the matrix in Equation (18.12) is

U = \begin{pmatrix} -0.816 & 0.000 \\ -0.408 & -0.707 \\ -0.408 & 0.707 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 1.732 & 0.000 \\ 0.000 & 1.000 \end{pmatrix} \quad \text{and} \quad
V^T = \begin{pmatrix} -0.707 & -0.707 \\ 0.707 & -0.707 \end{pmatrix}, \qquad (18.13)

by verifying all of the properties in the statement of Theorem 18.3.

Exercise 18.6
Suppose that C is a term-document incidence matrix. What do the entries of C^T C represent?

Exercise 18.7
Let

C = \begin{pmatrix} 0 & 2 & 1 \\ 0 & 3 & 0 \\ 2 & 1 & 0 \end{pmatrix} \qquad (18.14)

be a term-document matrix whose entries are term frequencies; thus term 1 occurs twice in document 2 and once in document 3. Compute CC^T; observe that its entries are largest where two terms have their most frequent occurrences together in the same document.
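These computations are easy to check numerically. The following sketch (our addition, using NumPy; the matrix C below is reconstructed from the product U Σ V^T given in Equation (18.13)) verifies the SVD of Exercise 18.5 and computes the CC^T of Exercise 18.7:

```python
import numpy as np

# The 3 x 2 matrix of Equation (18.12), recovered from U * Sigma * V^T in (18.13).
C = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Reduced SVD: U is 3 x 2, s holds the singular values, Vt is 2 x 2.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
print(U, s, Vt)   # should match Equation (18.13) up to the signs of singular vectors

# One of the properties of Theorem 18.3: the factors recompose to C.
assert np.allclose(U @ np.diag(s) @ Vt, C)

# Exercise 18.7: entries of C C^T are largest for terms that co-occur heavily.
C7 = np.array([[0, 2, 1],
               [0, 3, 0],
               [2, 1, 0]])
print(C7 @ C7.T)  # entry (i, j) sums, over documents, the products of term frequencies
```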

18.3 Low-rank approximations

We next state a matrix approximation problem that at first seems to have little to do with information retrieval. We describe a solution to this matrix problem using singular-value decompositions, then develop its application to information retrieval.

Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix C_k of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − C_k, defined to be

\| X \|_F = \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} X_{ij}^2 }. \qquad (18.15)

Thus, the Frobenius norm of X measures the discrepancy between C_k and C; our goal is to find a matrix C_k that minimizes this discrepancy, while constraining C_k to have rank at most k. If r is the rank of C, clearly C_r = C and the Frobenius norm of the discrepancy is zero in this case. When k is far smaller than r, we refer to C_k as a low-rank approximation.
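As a concrete illustration (our addition, not part of the original text), here is a minimal NumPy sketch of Equation (18.15), computing the Frobenius norm directly from the definition and checking it against numpy.linalg.norm:

```python
import numpy as np

def frobenius_norm(X: np.ndarray) -> float:
    """Equation (18.15): the square root of the sum of squared entries of X."""
    return float(np.sqrt(np.sum(X ** 2)))

C = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Ck = np.zeros_like(C)   # the trivial rank-0 approximation, purely for illustration

X = C - Ck
print(frobenius_norm(X))   # 2.0 for this C
assert np.isclose(frobenius_norm(X), np.linalg.norm(X, "fro"))
```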


The singular value decomposition can be used to solve the low-rank matrix approximation problem. We then derive from it an application to approximating term-document matrices. We invoke the following three-step procedure to this end:

1. Given C, construct its SVD in the form shown in (18.9); thus, C = U Σ V^T.

2. Derive from Σ the matrix Σ_k formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.

3. Compute and output C_k = U Σ_k V^T as the rank-k approximation to C.

[Figure 18.2: Illustration of low-rank approximation using the singular-value decomposition, C_k = U Σ_k V^T. The dashed boxes indicate the matrix entries affected by "zeroing out" the smallest singular values.]

The rank of C_k is at most k: this follows from the fact that Σ_k has at most k non-zero values. Next, we recall the intuition of Example 18.1: the effect of small eigenvalues on matrix products is small. Thus, it seems plausible that replacing these small eigenvalues by zeros will not substantially alter the product, leaving it "close" to C. The following theorem, due to Eckart and Young, tells us that, in fact, this procedure yields the matrix of rank k with the lowest possible Frobenius error.

Theorem 18.4.

\min_{Z \,:\, \operatorname{rank}(Z) = k} \| C - Z \|_F = \| C - C_k \|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}. \qquad (18.16)

Recalling that the singular values are in decreasing order σ_1 ≥ σ_2 ≥ · · ·, we learn from Theorem 18.4 that C_k is the best rank-k approximation to C, incurring an error, measured by the Frobenius norm of C − C_k, that is determined entirely by the discarded singular values σ_{k+1}, . . . , σ_r. Thus the larger k is, the smaller this error; in particular, for k = r, the error is zero, since Σ_r = Σ (and, provided r < min(M, N), σ_{r+1} = 0 as well), so that C_r = C.

To derive further insight into why the process of truncating the smallest r − k singular values in Σ helps generate a rank-k approximation of low error, we examine the form of C_k:

C_k = U \Sigma_k V^T \qquad (18.17)
    = U \operatorname{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0) V^T \qquad (18.18)
    = \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^T, \qquad (18.19)

where \vec{u}_i and \vec{v}_i are the ith columns of U and V, respectively. Thus, \vec{u}_i \vec{v}_i^T is a rank-1 matrix, so that we have just expressed C_k as the sum of k rank-1 matrices, each weighted by a singular value. As i increases, the contribution of the rank-1 matrix \vec{u}_i \vec{v}_i^T shrinks along with the singular value σ_i.
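The three-step procedure and the bound of Theorem 18.4 are easy to demonstrate numerically. Below is a minimal NumPy sketch (our addition; the function name is ours) that zeroes out all but the k largest singular values and checks that the Frobenius error is exactly that incurred by the discarded singular values:

```python
import numpy as np

def rank_k_approximation(C: np.ndarray, k: int) -> np.ndarray:
    """Steps 1-3: take the SVD, zero all but the k largest singular values, recompose."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # step 1: C = U diag(s) V^T
    s_k = s.copy()
    s_k[k:] = 0.0                                     # step 2: keep sigma_1 .. sigma_k
    return U @ np.diag(s_k) @ Vt                      # step 3: C_k = U Sigma_k V^T

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 4))   # an arbitrary small example matrix
k = 2
Ck = rank_k_approximation(C, k)

# Theorem 18.4: the Frobenius error equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2),
# and no rank-k matrix does better.
s = np.linalg.svd(C, compute_uv=False)
assert np.isclose(np.linalg.norm(C - Ck, "fro"), np.sqrt(np.sum(s[k:] ** 2)))
assert np.linalg.matrix_rank(Ck) <= k
```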
Exercise 18.8
Compute a rank-1 approximation C_1 to the matrix C in Equation (18.12), using the SVD as in Equation (18.13). What is the Frobenius norm of the error of this approximation?

Exercise 18.9
Consider now the computation in Exercise 18.8. Following the schematic in Figure 18.2, notice that for a rank-1 approximation σ_1 is a scalar. Denote by U_1 the first column of U and by V_1 the first column of V. Show that the rank-1 approximation to C can then be written as U_1 σ_1 V_1^T = σ_1 U_1 V_1^T.

Exercise 18.10
Exercise 18.9 can be generalized to rank-k approximations: we let U'_k and V'_k denote the "reduced" matrices formed by retaining only the first k columns of U and V, respectively. Thus U'_k is an M × k matrix while V'^T_k is a k × N matrix. Then, we have

C_k = U_k' \, \Sigma_k' \, V_k'^T, \qquad (18.20)

where Σ'_k is the square k × k submatrix of Σ_k with the singular values σ_1, . . . , σ_k on the diagonal. The primary advantage of using (18.20) is to eliminate the redundant columns of zeros in U and V, thereby explicitly avoiding multiplication by columns that do not affect the low-rank approximation; this version of the SVD is sometimes known as the reduced SVD or truncated SVD, and it is a computationally simpler representation from which to compute the low-rank approximation. For the matrix C in Example 18.3, write down both Σ_2 and Σ'_2.
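Equation (18.20) is likewise simple to check in code. This sketch (ours; it uses the term-document matrix of Equation (18.14) as a convenient example) confirms that the reduced SVD reproduces exactly the C_k obtained from the full factors:

```python
import numpy as np

C = np.array([[0.0, 2.0, 1.0],
              [0.0, 3.0, 0.0],
              [2.0, 1.0, 0.0]])   # the term-document matrix of Equation (18.14)
k = 2

U, s, Vt = np.linalg.svd(C)       # full SVD: U and Vt are both 3 x 3 here

# Full form: Sigma_k zeroes the r - k smallest singular values.
s_k = s.copy()
s_k[k:] = 0.0
Ck_full = U @ np.diag(s_k) @ Vt

# Reduced form, Equation (18.20): keep only the first k columns of U and V.
U_k = U[:, :k]                    # M x k
Sigma_k = np.diag(s[:k])          # k x k, with sigma_1 .. sigma_k on the diagonal
Vt_k = Vt[:k, :]                  # k x N
Ck_reduced = U_k @ Sigma_k @ Vt_k

assert np.allclose(Ck_full, Ck_reduced)
print(np.round(Ck_reduced, 3))
```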

18.4 Latent semantic indexing