next up previous
Next: Predictions Based on a Up: Integrating Semantic Similarity with Previous: Integrating Semantic Similarity with

Using Latent Semantic Analysis on Semantic Attributes.

Latent Semantic Indexing (LSI) [4] is a dimensionality reduction technique which is widely used in information retrieval (IR). Many IR applications have shown that performing latent semantic analysis, including in document indexing, can improve the accuracy of information retrieval. Given a term-document frequency matrix, LSI is used to decompose it into two matrices of reduced dimensions and a diagonal matrix of singular values. Each dimension in the reduced space is a latent variable (or factor) representing groups of highly correlated index terms. Reducing the dimensionality of the original matrix reduces the amount of noise in the data as well as its sparsity, thereby, improving retrieval based on the computation of similarities between the indexed documents and user queries. Here we apply this idea to create a reduced dimension space for the semantic attributes associated with items.

Singular Value Decomposition (SVD) is a well known technique used in LSI to perform matrix decomposition. In our case, we perform SVD on the semantic attribute matrix $S_{n \times d}$ by decomposing it into three matrices:

\begin{displaymath}
S_{n \times d} = U_{n \times r} \bullet \Sigma _{r \times r} \bullet V_{r \times d}
\end{displaymath}

where $U$ and $V$ are two orthogonal matrices; $r$ is the rank of matrix $S$, and $\Sigma$ is a diagonal matrix of size $r \times r$, where its diagonal entries contain all singular values of matrix $S$ and are stored in decreasing order. One advantage of SVD is that it provides the best lower rank approximation of the original matrix $S$ [4]. We can reduce the diagonal matrix $\Sigma$ into a lower-rank diagonal matrix $\Sigma_{k \times k}$ by only keeping $k$ ($k < r$) largest values. Accordingly, we reduce $U$ to $U'$ and $V$ to $V'$. Then the matrix $S' = U' \bullet \Sigma' \bullet V'$ is the rank-$k$ approximation of the original matrix $S$. In the above process, $U'$ consists of the first $k$ columns of the matrix $U$ corresponding to the $k$ highest order singular values. In the resulting semantic attribute matrix, $S'$, each item is, thus, represented by a set of $k$ latent variables, instead of the original $d$ attributes. This results in a much less sparse matrix, improving the results of similarity computations, as well as the computational cost associated with the process. Furthermore, the generated latent variables represent groups of highly correlated attributes in the original data, thus potentially reducing the amount of noise associated with the semantic information. As we will illustrate in the next section, performing latent semantic analysis on the semantic space, generally leads to substantial gains in prediction accuracy based on the semantic attributes.


next up previous
Next: Predictions Based on a Up: Integrating Semantic Similarity with Previous: Integrating Semantic Similarity with
Bamshad Mobasher 2004-03-09