
This kind of representation is famously produced by word2vec (also from Google) and by the Python package Gensim. Many NLP projects start directly from word embeddings precomputed on very large corpora of text.
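For concreteness, a minimal sketch of that workflow with Gensim (4.x API assumed; the toy corpus and the file path are placeholders):

  from gensim.models import KeyedVectors, Word2Vec

  # Train embeddings on a (toy) tokenized corpus...
  sentences = [["the", "dog", "barks"], ["the", "cat", "meows"]]
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
  print(model.wv.most_similar("dog", topn=3))

  # ...or start from vectors precomputed on a very large corpus
  # ("vectors.bin" is a placeholder path, e.g. the GoogleNews vectors).
  # vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)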



No, this is not the type of representation produced by word2vec. These older bilexical preference models work as follows:

- Take a huge corpus.

- Parse the corpus using your parser.

- Extract head-dependent association strengths according to some metric (e.g. pointwise mutual information) from the machine-annotated data. These association strengths are used as an auxiliary distribution. The machine-annotated data may contain erroneous analyses, but these are typically outnumbered by correct analyses[1].

- In parse disambiguation, the head-dependent strength is added as an extra feature (the strength is the feature value); a small sketch follows this list.
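Here is a minimal sketch of the last two steps, assuming the parsed corpus has already been reduced to (head, dependent) pairs (the function and variable names are illustrative, not taken from any particular toolkit):

  import math
  from collections import Counter

  def pmi_strengths(pairs):
      # Pointwise mutual information for each (head, dependent) pair
      # extracted from the machine-annotated (parsed) corpus.
      pair_counts = Counter(pairs)
      head_counts = Counter(h for h, _ in pairs)
      dep_counts = Counter(d for _, d in pairs)
      n = len(pairs)
      return {
          (h, d): math.log((c / n) / ((head_counts[h] / n) * (dep_counts[d] / n)))
          for (h, d), c in pair_counts.items()
      }

  # Toy machine-annotated data: (head, dependent) pairs read off the parses.
  pairs = [("eat", "pizza"), ("eat", "pizza"), ("eat", "idea"),
           ("drink", "coffee"), ("drink", "coffee")]
  strengths = pmi_strengths(pairs)

  # During parse disambiguation, the strength of a candidate attachment is
  # used directly as a feature value (0.0 if the pair was never observed).
  feature_value = strengths.get(("eat", "pizza"), 0.0)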

This idea predates word2vec by quite some time and is also different, both in the training procedure (word2vec uses raw unannotated data, bilexical preference models use machine-annotated data) and in the result of training (distributed word representations vs. head-dependent association strengths).

Although the method is different, these models can capture the same thing as word embeddings plus a non-linear classifier: which heads typically take which dependents. In contrast to word embeddings, however, they are effective in a linear classifier, because the auxiliary distribution itself consists of combinatory (head-dependent) features.
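To illustrate that contrast with a hypothetical scoring setup (nothing here is taken from an actual parser): the PMI value is already a joint feature of the (head, dependent) pair, so a single linear weight suffices, whereas separate head and dependent embeddings need a non-linear model to capture their interaction.

  import numpy as np

  # Linear model over the auxiliary distribution: one weight on the
  # combinatory feature works, because pmi(h, d) already encodes the
  # head-dependent interaction.
  def linear_score(h, d, strengths, w=1.0):
      return w * strengths.get((h, d), 0.0)

  # With word embeddings, head and dependent are represented separately; a
  # linear function of their concatenation cannot model the interaction, so
  # in practice a non-linear classifier (e.g. a small MLP) is used instead.
  def mlp_score(emb_h, emb_d, W1, b1, w2):
      hidden = np.maximum(0.0, W1 @ np.concatenate([emb_h, emb_d]) + b1)
      return float(w2 @ hidden)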

[1] E.g. consider freer word-order languages. In Dutch, SVO is the preferred word order, but OVS is also permitted. A parser without such an auxiliary distribution might analyse the object as a subject in an OVS sentence. However, since SVO is much more frequent than OVS, the model will still learn the correct associations from a large machine-annotated corpus.



