Latent Semantic Indexing (LSI) [4] is a dimensionality reduction technique that is widely used in information retrieval (IR). Many IR applications, including document indexing, have shown that latent semantic analysis can improve retrieval accuracy. Given a term-document frequency matrix, LSI decomposes it into two matrices of reduced dimensions and a diagonal matrix of singular values. Each dimension in the reduced space is a latent variable (or factor) representing a group of highly correlated index terms. Reducing the dimensionality of the original matrix reduces both the noise in the data and its sparsity, thereby improving retrieval based on similarities computed between the indexed documents and user queries. Here we apply this idea to create a reduced-dimension space for the semantic attributes associated with items.
Singular Value Decomposition (SVD) is a well-known technique used in LSI to perform matrix decomposition. In our case, we perform SVD on the semantic attribute matrix $S_{n \times d}$ by decomposing it into three matrices:

$$S_{n \times d} = U_{n \times r} \, \Sigma_{r \times r} \, V_{d \times r}^{T}$$

where $U$ and $V$ are two orthogonal matrices; $r$ is the rank of matrix $S$, and $\Sigma$ is a diagonal matrix of size $r \times r$, where its diagonal entries contain all singular values of matrix $S$ and are stored in decreasing order. One advantage of SVD is that it provides the best lower rank approximation of the original matrix $S$ [4]. We can reduce the diagonal matrix $\Sigma$ into a lower-rank diagonal matrix $\Sigma'_{k \times k}$ by only keeping the $k$ ($k < r$) largest values. Accordingly, we reduce $U$ to $U'$ and $V$ to $V'$. Then the matrix $S' = U' \, \Sigma' \, (V')^{T}$ is the rank-$k$ approximation of the original matrix $S$.
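The truncation described above can be sketched with NumPy as follows. This is only an illustrative sketch: the toy matrix, the helper name `truncated_svd`, and the variable names (`S`, `U_k`, `sigma_k`, `Vt_k`, `k`) are assumptions made for the example, not part of the original method description. It also checks numerically the standard property that the Frobenius error of the rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(S, k):
    """Return (U_k, sigma_k, Vt_k) for the k largest singular values of S.

    Hypothetical helper for illustration only.
    """
    # full_matrices=False gives the compact SVD; NumPy returns the
    # singular values already sorted in decreasing order.
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :k], sigma[:k], Vt[:k, :]

# Toy item-by-attribute matrix (made-up values): 5 items, 4 attributes.
S = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
])

k = 2  # number of latent variables to keep (assumed value)
U_k, sigma_k, Vt_k = truncated_svd(S, k)

# Rank-k approximation of S; it has the same shape as S.
S_k = U_k @ np.diag(sigma_k) @ Vt_k

# Compare the approximation error with the discarded singular values.
all_sigma = np.linalg.svd(S, compute_uv=False)
err = np.linalg.norm(S - S_k, "fro")
expected_err = np.sqrt(np.sum(all_sigma[k:] ** 2))
print(np.isclose(err, expected_err))
```

In practice the value of `k` is a tuning parameter; the value used here is arbitrary.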
In the above process, $U'$ consists of the first $k$ columns of the matrix $U$, corresponding to the $k$ highest-order singular values. In the resulting semantic attribute matrix, $S'$, each item is thus represented by a set of $k$ latent variables instead of the original $d$ attributes. This results in a much less sparse matrix, which improves the results of similarity computations and reduces their computational cost. Furthermore, the generated latent variables represent groups of highly correlated attributes in the original data, thus potentially reducing the amount of noise associated with the semantic information. As we will illustrate in the next section, performing latent semantic analysis on the semantic space generally leads to substantial gains in prediction accuracy based on the semantic attributes.
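As a rough illustration of similarity computation in the reduced space, the sketch below represents each item by its coordinates along the top k latent dimensions and computes pairwise cosine similarities. Taking the item coordinates to be the rows of the truncated left singular vectors scaled by the singular values is an assumption made here (one common reading of "each item is represented by k latent variables"); cosine similarity, the function names, and the toy data are likewise illustrative choices rather than details confirmed by the text.

```python
import numpy as np

def latent_item_vectors(S, k):
    """Return an (items x k) matrix of latent coordinates (illustrative helper)."""
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Broadcasting scales column j of U[:, :k] by the j-th singular value.
    return U[:, :k] * sigma[:k]

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave all-zero rows as zero vectors
    X_unit = X / norms
    return X_unit @ X_unit.T

# Toy item-by-attribute matrix (made-up values): 5 items, 4 attributes.
S = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
])

item_vecs = latent_item_vectors(S, k=2)    # 5 items, 2 latent variables each
sim = cosine_similarity_matrix(item_vecs)  # 5 x 5 item-item similarities
print(item_vecs.shape, sim.shape)
```

Because the retained right singular vectors are orthonormal, inner products between these k-dimensional item vectors equal inner products between the corresponding rows of the rank-k approximation, so similarities can be computed on the smaller dense matrix.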