NeuroCOLT

Neural Networks and Computational Learning Theory

 


NeuroCOLT Technical Report NC-TR-02-119


Finding Language-Independent Semantic Representation of Text using Kernel Canonical Correlation Analysis

Alexei Vinokourov
John Shawe-Taylor
Nello Cristianini

ABSTRACT
This report addresses the problem of learning a semantic representation of a text document from data, in the setting where a corpus of unlabeled paired documents is available, each pair consisting of a short English document and its French translation. The representation can be used for cross-linguistic retrieval or, more generally, as part of a mono-linguistic categorisation or clustering system. Using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt using kernel Canonical Correlation Analysis: a set of directions is found in each space such that the projections of paired documents onto them are maximally correlated, and these directions form a semantic representation of the data.
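
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the four-document toy corpus, the function names (`bow_matrix`, `centered_linear_kernel`, `kcca`), the regularisation parameter `kappa`, and the particular regularised form of kernel CCA used here are all assumptions made for illustration. Each language is mapped to bag-of-words count vectors, the kernel is the plain inner product between them, and the paired directions are obtained from a singular value decomposition.

```python
import numpy as np

def bow_matrix(docs):
    """Map documents to term-count vectors over their own vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X

def centered_linear_kernel(X):
    """Bag-of-words inner-product kernel, centred in feature space."""
    K = X @ X.T
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=1.0, n_components=2):
    """Regularised kernel CCA (one common variant, assumed here).

    Maximises a^T Kx Ky b subject to
    a^T (Kx + kappa I)^2 a = b^T (Ky + kappa I)^2 b = 1.
    Substituting u = (Kx + kappa I) a and v = (Ky + kappa I) b turns
    this into an SVD of M = Rx^{-1} Kx Ky Ry^{-1}.
    """
    n = Kx.shape[0]
    Rx = Kx + kappa * np.eye(n)
    Ry = Ky + kappa * np.eye(n)
    # Ky and Ry commute, so Ry^{-1} Ky equals Ky Ry^{-1}.
    M = np.linalg.solve(Rx, Kx) @ np.linalg.solve(Ry, Ky)
    U, s, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Rx, U[:, :n_components])     # dual weights, first space
    B = np.linalg.solve(Ry, Vt[:n_components].T)     # dual weights, second space
    return s[:n_components], A, B

# Made-up paired corpus: document i in one list translates document i in the other.
english = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "parliament passed the budget",
    "the minister presented the budget",
]
french = [
    "le chat est assis sur le tapis",
    "le chien a poursuivi le chat",
    "le parlement a adopté le budget",
    "le ministre a présenté le budget",
]

Kx = centered_linear_kernel(bow_matrix(english))
Ky = centered_linear_kernel(bow_matrix(french))
rho, A, B = kcca(Kx, Ky, kappa=1.0, n_components=2)

# Language-independent coordinates of the training documents along the
# first pair of directions: one value per document from each language.
px = Kx @ A[:, 0]
py = Ky @ B[:, 0]
```

In this sketch `rho` holds the regularised correlations in decreasing order, and the columns of `Kx @ A` and `Ky @ B` give the coordinates of each training document in the shared representation. On a corpus this small the correlations say little about generalisation; the regularisation term `kappa` is what prevents the trivially perfect correlations that unregularised kernel CCA would produce when the kernel matrices are invertible.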
