NeuroCOLT
Technical Report NC-TR-02-119
2002-119
Finding Language-Independent Semantic Representation of Text using Kernel Canonical Correlation Analysis
Alexei Vinokourov
John Shawe-Taylor
Nello Cristianini
ABSTRACT
The problem of learning a semantic representation of a text document
from data is addressed, in the situation where a corpus of unlabeled
paired documents is available, each pair being formed by a short English
document and its French translation. This representation can be used
either for cross-linguistic retrieval or, more generally, as part of a
mono-linguistic categorisation or clustering system.
By using kernel functions, in this case simple bag-of-words inner
products, each part of the corpus is mapped to a high-dimensional space.
The correlations between the two spaces are then learnt using kernel
Canonical Correlation Analysis.
A set of directions is found in the first and in the second space that
are maximally correlated, hence forming a semantic representation of the
data.
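
The pipeline described in the abstract (bag-of-words kernels on each language, followed by kernel CCA) can be illustrated with a minimal NumPy sketch. This is not the report's implementation: the toy corpus, the function names (bow_kernel, center, kcca), the regularisation parameter kappa and the particular regularised formulation (an SVD of a whitened cross-kernel product) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def bow_kernel(docs):
    """Linear bag-of-words kernel: K[i, j] = inner product of term counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X @ X.T

def center(K):
    """Centre a kernel matrix in feature space: H K H with H = I - 11^T/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=1.0, n_components=2):
    """Regularised kernel CCA (one common formulation, assumed here).

    Maximises a^T Kx Ky b subject to a^T (Kx + kappa I)^2 a = 1 and
    b^T (Ky + kappa I)^2 b = 1, solved via an SVD after substituting
    u = (Kx + kappa I) a and v = (Ky + kappa I) b.
    """
    n = Kx.shape[0]
    Kx, Ky = center(Kx), center(Ky)
    Rx = Kx + kappa * np.eye(n)
    Ry = Ky + kappa * np.eye(n)
    Z = np.linalg.solve(Rx, Kx @ Ky)      # Rx^{-1} Kx Ky
    M = np.linalg.solve(Ry, Z.T).T        # Z Ry^{-1} (Ry is symmetric)
    U, s, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Rx, U[:, :n_components])     # dual weights, language 1
    B = np.linalg.solve(Ry, Vt[:n_components].T)     # dual weights, language 2
    return A, B, s[:n_components], Kx, Ky

# Hypothetical paired corpus: each English document with its French translation.
english = ["the cat sleeps", "the dog runs", "the cat eats fish",
           "the dog eats meat", "the house is big"]
french = ["le chat dort", "le chien court", "le chat mange du poisson",
          "le chien mange de la viande", "la maison est grande"]

A, B, rho, Kx_c, Ky_c = kcca(bow_kernel(english), bow_kernel(french))

# Language-independent coordinates of the training documents.
px = Kx_c @ A   # English documents projected onto the learnt directions
py = Ky_c @ B   # French documents projected onto the learnt directions
print("regularised correlations:", rho)
```

In this sketch the rows of px and py play the role of the semantic representation: a document and its translation should land near each other along the leading directions. On a corpus this small the result is only a smoke test, and the value of kappa would need to be tuned on real data.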