A word-embedding model typically learns vector representations from a large, general-purpose corpus like Google News. But to make the resulting vectors useful in a specialized domain, such as veterinary medicine, they must be fine-tuned on a smaller, domain-specific dataset. Researchers from Facebook AI offer a more accurate method.

What’s new: Rather than fine-tuning, Piotr Bojanowski and colleagues developed a model that aligns word vectors learned from general and specialized corpora.

Key insight: The authors drew inspiration from the way multilingual word vectors are learned. They treated general-purpose and domain-specific corpora as separate languages and used a word-embedding model to learn independent vectors from each. Then they aligned the vectors from one corpus with those from another.

How it works: To align word vectors from two corpora, common words are used to find a consistent way to represent all words. For example, if one corpus is {human,cat} and the other is {cat,dog}, the model applies a transformation that unifies the dog word vectors while retaining the relative positions of the word vectors between cats, dogs, and humans.

  • A word-embedding model learns independent word vectors from both corpora.
  • For words that appear in both corpora, the alignment model learns a linear mapping from general-purpose vectors to domain-specific vectors. The mapping solves a linear equation that minimizes the distance between the general-purpose vectors and the domain-specific vectors.
  • The authors use a loss function called RCSLS for training. RCSLS balances two objectives: General-purpose vectors that are close together remain close together, while general-purpose vectors that are far apart remain far apart.
  • Common words in the two corpora now have duplicate vectors. Averaging them produces a single vector representation.

Results: The authors tested this approach to learning word vectors on tasks that include predicting analogies and text classification in a dataset where the test set has a slightly different word usage than the training set. Models that use word vectors learned via alignment outperformed those that use word vectors fine-tuned in the usual way. The new method’s advantage was more pronounced when the domain-specific dataset was relatively small.

Why it matters: Machine learning engineers need tools that enable existing word representations to capture specialized knowledge. The alignment technique could be a boon in any situation where general-purpose word vectors don’t capture the meanings at play.

We’re thinking: Open-source, pretrained word embeddings have been a boon to NLP systems. It would be great to have freely available word embeddings that captured knowledge from diverse fields like biology, law, and architecture.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox