nlp-recipes

Natural Language Processing Best Practices & Examples

Word Embedding

This folder contains examples and best practices, written in Jupyter notebooks, for training word embeddings on custom data from scratch.
There are three widely used methods for training word embeddings: Word2Vec, GloVe, and fastText. All three methods provide pretrained models (pretrained model with Word2Vec, pretrained model with GloVe, pretrained model with fastText).
These pretrained models are trained on general corpora such as Wikipedia and Common Crawl, and may not serve well when you have a domain-specific language problem or when no pretrained model exists for the language you need to work with. In this folder, we provide examples of how to apply each of the three methods to train your own word embeddings.
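As a minimal sketch of what training from scratch can look like (assuming the gensim library and a tiny in-memory toy corpus, neither of which is prescribed by the notebooks here):

```python
# Minimal sketch: training word embeddings from scratch with gensim
# (gensim >= 4.0 and the toy corpus below are assumptions for illustration).
from gensim.models import Word2Vec, FastText

# Tokenized, domain-specific corpus: one list of tokens per sentence.
corpus = [
    ["the", "patient", "was", "given", "aspirin"],
    ["aspirin", "reduces", "fever", "and", "pain"],
    ["the", "patient", "reported", "less", "pain"],
]

# Skip-gram Word2Vec: vector_size controls the embedding dimension.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# fastText additionally learns character n-grams, so it can embed unseen words.
ft = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(w2v.wv.most_similar("aspirin", topn=3))   # nearest neighbors in the learned space
print(ft.wv["painkiller"].shape)                # out-of-vocabulary word still gets a vector
```

GloVe follows a different training pipeline (building a word co-occurrence matrix and then factorizing it), which the notebook covers separately.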

What is Word Embedding?

Word embedding is a technique for mapping words or phrases from a vocabulary to vectors of real numbers. The learned vector representations capture syntactic and semantic relationships between words and can therefore be very useful for tasks such as sentence similarity and text classification.
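For instance (a toy illustration with made-up vectors, not output from the notebooks), cosine similarity between learned vectors is a common way to quantify how related two words are:

```python
# Toy illustration of measuring word relatedness with cosine similarity.
# The 4-dimensional vectors below are invented values, not real embeddings.
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

king  = np.array([0.8, 0.6, 0.1, 0.2])
queen = np.array([0.7, 0.7, 0.2, 0.2])
apple = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(king, queen))  # high: semantically related words
print(cosine_similarity(king, apple))  # low: unrelated words
```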

Summary

| Notebook | Environment | Description | Dataset | Language |
|---|---|---|---|---|
| Developing Word Embeddings | Local | Shows how to learn word representations with Word2Vec, fastText, and GloVe | STS Benchmark dataset | en |