At our last meeting, Behrang suggested we might want to switch to more classic IR vectors. The idea was that such vectors could be generated via a function, bypassing the need to store a large database.
The idea of generating vectors on-the-fly is of course terribly attractive, but (at this point in time) this can only be done in a way where the vector, in the end, no longer encapsulates the meaning of the word. I said at the meeting I would write something 'in defence' of meaning-carrying distributional vectors, so here we go. I think this is also a good exercise in summarising what we are trying to do now, and what we might want to do in the future.
To me, the representations we use for PeARS should have at least the following features:
carry meaning in a way that similarity can be calculated, both at the global level of the network and for each node separately. This is essential to get a system which a) will return pages about kittens when the query is 'cat' (i.e. a different word); b) can account for meaning differences between speakers (a cook and a fisherman will associate different things with 'fish', and the system should know this in order to return the best possible results).
carry meaning in a way that is not modality-dependent. If we want to claim that our representations really carry meaning (and therefore will be suitable for a variety of tasks, including more AI-oriented problems), the vector for 'cat' should not only encapsulate its linguistic properties but also visual, sound and smell aspects, you name it. There is a vibrant strand of research on producing 'multimodal' vectors which can simulate people's brain activity when they hear a particular word. Besides the sheer coolness factor, this would also allow us to search for images, sounds, etc. with a single vector representation.
be ready for 'global sharing'. Behrang has suggested in the past the idea of 'open vectors', i.e. vectors built on a standard which would allow interoperability of representations. This would mean that my 'cat' vector and your 'cat' vector don't have to be the same to be recognised as referring to the same thing. This is important from the point of view of letting people play with representations, and create their own for particular purposes. We can take this idea further and say that we should provide a way to create a distributed, open-access repository of vectors built under the same standard, which would contain meanings for 'everything': meanings of words, documents, images, music, sound bites, things (as in, Internet of Things), etc. There could be a standard, for instance, where any Web page contains an HTML tag with the vector for that page. (Even better, make the vector the URI of the page.)
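To make the first requirement concrete, here is a toy sketch of similarity over meaning-carrying vectors. The context words and co-occurrence counts below are invented purely for illustration; real distributional vectors would of course have thousands of dimensions:

```python
import math

# Toy distributional vectors: co-occurrence counts with a handful of
# context words (the counts are made up for this example).
contexts = ["purr", "pet", "fur", "swim", "cook"]
vectors = {
    "cat":    [10, 8, 9, 0, 0],
    "kitten": [9, 7, 8, 0, 0],
    "fish":   [0, 1, 0, 8, 7],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(vectors["cat"], vectors["kitten"]))  # close to 1.0
print(cosine(vectors["cat"], vectors["fish"]))    # much lower
```

Because 'cat' and 'kitten' occur in similar contexts, their vectors point in nearly the same direction, which is exactly what lets a query for 'cat' retrieve pages about kittens.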
In my mind, the above implies that we have rich, meaning-carrying representations, as opposed to randomly-generated ones which only uniquely identify a word. The price to pay is that these vectors must be stored somewhere; but then, one of the points of being distributed is that no single user needs the entire repository. This is after all what happens in real life: I have a set of concepts stored 'in my head' and I encounter others every day, by reading, hearing, seeing, thinking new things. Some of those will become permanent ('that amazing food I had at the Swedish-Ethiopian restaurant in Uppsala'), others will be forgotten (by me) and picked up by others.
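For contrast, a minimal sketch of what 'generating vectors via a function' could look like. The seeded PRNG here is my stand-in for whatever generator would actually be used; the point is only that the same word always yields the same vector, so it works as a unique identifier, while the geometry of the space reflects nothing about meaning:

```python
import random

def generated_vector(word, dim=50):
    """Generate a vector on-the-fly by seeding a PRNG with the word.
    Nothing needs to be stored, but the direction of the vector is
    arbitrary: it identifies the word without encapsulating its meaning."""
    rng = random.Random(word)  # string seeds are deterministic in Python
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# The same word always maps to the same vector (good for identification)...
assert generated_vector("cat") == generated_vector("cat")
# ...but 'cat' and 'kitten' land in unrelated directions: no similarity
# structure survives, so returning kittens for 'cat' becomes impossible.
assert generated_vector("cat") != generated_vector("kitten")
```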
To the above, more technical points, I would add one more political argument. There is at the moment a lot of fuss being made about meaning representations in AI, in particular in the context of neural networks. Big companies are pouring billions into this, and it is not surprising: processing meaning is one of the most fundamental human abilities -- one without which we couldn't talk about the world, build distributed search engines, or ask for the salt at the dinner table. As we move towards a world where more and more things are automated and decided by virtual agents, it seems to me that 'owning meaning' ('quality meaning', i.e. rich, accurate, flexible vectors) will be one core aspect of controlling the net and, more broadly speaking, our digital environments. Meaning should be in the public domain, and we can help towards this goal by giving PeARS powerful, shared semantic representations.