{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Developing Word Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than use pre-trained embeddings (as we did in the sentence similarity baseline_deep_dive [notebook](../sentence_similarity/baseline_deep_dive.ipynb)), we can train word embeddings using our own dataset. In this notebook, we demonstrate the training process for producing word embeddings using the word2vec, GloVe, and fastText models. We'll utilize the STS Benchmark dataset for this task. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Data Loading and Preprocessing](#Load-and-Preprocess-Data)\n", "* [Word2Vec](#Word2Vec)\n", "* [fastText](#fastText)\n", "* [GloVe](#GloVe)\n", "* [Concluding Remarks](#Concluding-Remarks)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "import sys\n", "import os\n", "\n", "# Set the environment path\n", "sys.path.append(\"../..\")\n", "\n", "import numpy as np\n", "from utils_nlp.dataset.preprocess import (\n", " to_lowercase,\n", " to_spacy_tokens,\n", " rm_spacy_stopwords,\n", ")\n", "from utils_nlp.dataset import stsbenchmark\n", "from utils_nlp.common.timer import Timer\n", "from gensim.models import Word2Vec\n", "from gensim.models.fasttext import FastText" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Set the path for where your repo is located\n", "NLP_REPO_PATH = os.path.join('..','..')\n", "\n", "# Set the path for where your datasets are located\n", "BASE_DATA_PATH = os.path.join(NLP_REPO_PATH, \"data\")\n", "\n", "# Set the path for location to save embeddings\n", "SAVE_FILES_PATH = os.path.join(BASE_DATA_PATH, \"trained_word_embeddings\")\n", "if not os.path.exists(SAVE_FILES_PATH):\n", " os.makedirs(SAVE_FILES_PATH)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Preprocess Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 401/401 [00:02<00:00, 182KB/s] " ] }, { "name": "stdout", "output_type": "stream", "text": [ "Data downloaded to ../../data/raw/stsbenchmark\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Produce a pandas dataframe for the training set\n", "train_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"train\")\n", "\n", "# Clean the sts dataset\n", "sts_train = stsbenchmark.clean_sts(train_raw)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>score</th>\n",
"      <th>sentence1</th>\n",
"      <th>sentence2</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>5.00</td>\n",
"      <td>A plane is taking off.</td>\n",
"      <td>An air plane is taking off.</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>3.80</td>\n",
"      <td>A man is playing a large flute.</td>\n",
"      <td>A man is playing a flute.</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>3.80</td>\n",
"      <td>A man is spreading shreded cheese on a pizza.</td>\n",
"      <td>A man is spreading shredded cheese on an uncoo...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>2.60</td>\n",
"      <td>Three men are playing chess.</td>\n",
"      <td>Two men are playing chess.</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>4.25</td>\n",
"      <td>A man is playing the cello.</td>\n",
"      <td>A man seated is playing the cello.</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " score sentence1 \\\n", "0 5.00 A plane is taking off. \n", "1 3.80 A man is playing a large flute. \n", "2 3.80 A man is spreading shreded cheese on a pizza. \n", "3 2.60 Three men are playing chess. \n", "4 4.25 A man is playing the cello. \n", "\n", " sentence2 \n", "0 An air plane is taking off. \n", "1 A man is playing a flute. \n", "2 A man is spreading shredded cheese on an uncoo... \n", "3 Two men are playing chess. \n", "4 A man seated is playing the cello. " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sts_train.head(5)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5749, 3)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the size of our dataframe\n", "sts_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training set preprocessing" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Convert all text to lowercase\n", "df_low = to_lowercase(sts_train) \n", "# Tokenize text\n", "sts_tokenize = to_spacy_tokens(df_low) \n", "# Tokenize with removal of stopwords\n", "sts_train_stop = rm_spacy_stopwords(sts_tokenize) " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Append together the two sentence columns to get a list of all tokenized sentences.\n", "all_sentences = sts_train_stop[[\"sentence1_tokens_rm_stopwords\", \"sentence2_tokens_rm_stopwords\"]]\n", "# Flatten two columns into one list and remove all sentences that are size 0 after tokenization and stop word removal.\n", "sentences = [i for i in all_sentences.values.flatten().tolist() if len(i) > 0]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11498" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(sentences)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum sentence length is 1 tokens\n", "Maximum sentence length is 43 tokens\n", "Median sentence length is 6.0 tokens\n" ] } ], "source": [ "sentence_lengths = [len(i) for i in sentences]\n", "print(\"Minimum sentence length is {} tokens\".format(min(sentence_lengths)))\n", "print(\"Maximum sentence length is {} tokens\".format(max(sentence_lengths)))\n", "print(\"Median sentence length is {} tokens\".format(np.median(sentence_lengths)))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['plane', 'taking', '.'],\n", " ['air', 'plane', 'taking', '.'],\n", " ['man', 'playing', 'large', 'flute', '.'],\n", " ['man', 'playing', 'flute', '.'],\n", " ['man', 'spreading', 'shreded', 'cheese', 'pizza', '.'],\n", " ['man', 'spreading', 'shredded', 'cheese', 'uncooked', 'pizza', '.'],\n", " ['men', 'playing', 'chess', '.'],\n", " ['men', 'playing', 'chess', '.'],\n", " ['man', 'playing', 'cello', '.'],\n", " ['man', 'seated', 'playing', 'cello', '.']]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word2Vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Word2vec is a predictive model for learning word embeddings from text (see [original research 
paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)). Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the \"context\") to predict the current word, and the latter uses the current word to predict the surrounding context words. See this [tutorial](https://www.guru99.com/word-embedding-word2vec.html#3) on word2vec for more detailed background on the model." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The gensim Word2Vec model has many different parameters (see [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)), but the most useful ones to know are:\n", "- size: length of the word embedding/vector (defaults to 100)\n", "- window: maximum distance between the word being predicted and the current word (defaults to 5)\n", "- min_count: ignores all words that have a frequency lower than this value (defaults to 5)\n", "- workers: number of worker threads used to train the model (defaults to 3)\n", "- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Set up a Timer to see how long the model takes to train\n", "t = Timer()" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "t.start()\n", "\n", "# Train the Word2vec model\n", "word2vec_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=3, sg=0)\n", "\n", "t.stop()" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 0.3194\n" ] } ], "source": [ "print(\"Time elapsed: {}\".format(t))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that the model is trained, we can:\n", "\n", "1. Query for the word embedding of a given word.\n", "2. Inspect the model vocabulary.\n", "3. Save the word embeddings." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embedding for apple: [-0.13886626 -0.04330257  0.12527628  0.08564945  0.02040523 -0.10037457\n", " -0.1182736   0.05916803 -0.09810918  0.11094606 -0.00045659 -0.07130833\n", " -0.07526248  0.01439941 -0.01924936 -0.04267681  0.05364342  0.01334886\n", "  0.09927388  0.04298429  0.07616432 -0.09218667  0.13563654  0.13954957\n", "  0.17032589  0.13070972  0.04971378  0.05326121  0.1633883   0.0867981\n", "  0.01025774  0.19571003 -0.11564688  0.00285543 -0.02306972 -0.07086422\n", " -0.03311775  0.16642122  0.10450041  0.11148815 -0.11674852 -0.10021858\n", " -0.00149789 -0.10769422  0.1467818  -0.00330875  0.09308671 -0.12129212\n", "  0.07261119  0.07583102  0.00192156  0.23766024 -0.0063716  -0.10565527\n", " -0.06545153  0.04053855  0.24339062  0.15191206 -0.04718588 -0.05213067\n", "  0.00187512 -0.08648538 -0.05337012  0.15507293 -0.09485061  0.03063929\n", "  0.00369516 -0.20911641  0.09312427  0.03583751  0.07270095  0.18968543\n", "  0.08637197 -0.03679648  0.12222783 -0.11879333 -0.1462169   0.02210324\n", "  0.18023533  0.03193852 -0.02540419  0.01615141  0.12228711 -0.03577682\n", "  0.05543301  0.15039788 -0.01812798  0.10888109 -0.08378831 -0.10893872\n", "  0.04931932  0.03412211  0.05080304 -0.16159546  0.02976557  0.08955383\n", " -0.02231676  0.06976417  0.2003142   0.04647517]\n", "\n", "First 20 vocabulary words: ['plane', 'taking', '.', 'air', 'man', 'playing', 'large', 'flute', 'spreading', 'cheese', 'pizza', 'men', 'seated', 'fighting', 'smoking', 'piano', 'guitar', 'singing', 'woman', 'person']\n" ] } ], "source": [ "# 1. Let's see the word embedding for \"apple\" by accessing the \"wv\" attribute and passing in \"apple\" as the key.\n", "print(\"Embedding for apple:\", word2vec_model.wv[\"apple\"])\n", "\n", "# 2. Inspect the model vocabulary by accessing keys of the \"wv.vocab\" attribute. We'll print the first 20 words.\n", "print(\"\\nFirst 20 vocabulary words:\", list(word2vec_model.wv.vocab)[:20])\n", "\n", "# 3. Save the word embeddings. We can save in binary format (to save space) or ASCII format.\n", "# Use os.path.join (not string concatenation) and distinct file names so that the\n", "# second save does not overwrite the first.\n", "word2vec_model.wv.save_word2vec_format(os.path.join(SAVE_FILES_PATH, \"word2vec_model.bin\"), binary=True)  # binary format\n", "word2vec_model.wv.save_word2vec_format(os.path.join(SAVE_FILES_PATH, \"word2vec_model.txt\"), binary=False)  # ASCII format" ] },
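{ "cell_type": "markdown", "metadata": {}, "source": [ "Because words that share contexts should end up close together in the vector space, a quick sanity check is to reload the vectors we just saved and query for nearest neighbors. A minimal sketch using gensim's `KeyedVectors` (on a corpus this small the neighbors are noisy, and results will vary between runs; `reloaded_wv` is just a local name for this sketch):\n", "\n", "```python\n", "from gensim.models import KeyedVectors\n", "\n", "# Reload the binary-format vectors saved above.\n", "reloaded_wv = KeyedVectors.load_word2vec_format(\n", "    os.path.join(SAVE_FILES_PATH, \"word2vec_model.bin\"), binary=True\n", ")\n", "\n", "# The five words whose embeddings are closest (by cosine similarity) to \"man\".\n", "print(reloaded_wv.most_similar(\"man\", topn=5))\n", "```" ] },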
{ "cell_type": "markdown", "metadata": {}, "source": [ "## fastText" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "fastText is an unsupervised algorithm created by Facebook Research for efficiently learning word embeddings (see [original research paper](https://arxiv.org/pdf/1607.04606.pdf)). fastText is significantly different from word2vec and GloVe, which both treat each word as the smallest possible unit to find an embedding for. Conversely, fastText assumes that words are formed by n-grams of characters (e.g., the 2-grams of the word \"language\" are {la, an, ng, gu, ua, ag, ge}). The embedding for a word is then the sum of the embeddings of its character n-grams. This has advantages when finding word embeddings for rare words and words not present in the dictionary, as these words can still be broken down into character n-grams. Typically, for smaller datasets, fastText performs better than word2vec or GloVe. See this [tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html) on fastText for more detail." ] },
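{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the character n-gram idea concrete, here is a minimal, purely illustrative sketch that enumerates the 2-grams of a word (fastText itself defaults to n-grams of length 3 to 6 and also learns a vector for the word as a whole):\n", "\n", "```python\n", "def char_ngrams(word, n=2):\n", "    # All substrings of length n; fastText sums the embeddings of a word's n-grams.\n", "    return {word[i:i + n] for i in range(len(word) - n + 1)}\n", "\n", "print(char_ngrams(\"language\"))  # {'la', 'an', 'ng', 'gu', 'ua', 'ag', 'ge'}\n", "```" ] },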
{ "cell_type": "markdown", "metadata": {}, "source": [ "The gensim fastText model has many different parameters (see [here](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText)), but the most useful ones to know are:\n", "- size: length of the word embedding/vector (defaults to 100)\n", "- window: maximum distance between the word being predicted and the current word (defaults to 5)\n", "- min_count: ignores all words that have a frequency lower than this value (defaults to 5)\n", "- workers: number of worker threads used to train the model (defaults to 3)\n", "- sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)\n", "- iter: number of epochs (defaults to 5)\n" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Set up a Timer to see how long the model takes to train\n", "t = Timer()" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "t.start()\n", "\n", "# Train the FastText model\n", "fastText_model = FastText(size=100, window=5, min_count=5, sentences=sentences, iter=5)\n", "\n", "t.stop()" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 9.3665\n" ] } ], "source": [ "print(\"Time elapsed: {}\".format(t))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can use the same attributes as we did above for word2vec, since both models come from the gensim package." ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embedding for apple: [-0.2255679  -0.15831569  0.03804937  0.47731966  0.47977886 -0.27653983\n", " -0.27343377 -0.4507852  -0.05649747  0.01470412  0.27904618 -0.02155268\n", " -0.02492249 -0.07855172  0.18532543  0.25709668  0.05939932  0.10333744\n", " -0.09892524 -0.61932683 -0.15273307 -0.02246136 -0.06295346 -0.5022594\n", " -0.13407618 -0.10411069  0.13370538  0.11902415 -0.44436237  0.27073038\n", "  0.06540621 -0.02650584 -0.0179158   0.08797703  0.18899101  0.12898529\n", "  0.05865225 -0.18658654 -0.40497953 -0.23991017  0.30457255  0.39893195\n", "  0.2913193  -0.18734889  0.10662938 -0.1165131  -0.42884877  0.31400812\n", "  0.04840293  0.10146416 -0.10285722 -0.21854313 -0.69022155 -0.48051542\n", " -0.17416449  0.12879132  0.12302257 -0.32911557 -0.48828328  0.22531843\n", " -0.35535514 -0.34300882  0.07264371  0.262703   -0.10182904  0.03486007\n", " -0.09019874  0.12621203  0.35632437 -0.10350075  0.3397234  -0.04080832\n", " -0.17116521 -0.20685913  0.18177888  0.19674565  0.00776504 -0.22853185\n", "  0.01387324 -0.33452377  0.1017314  -0.06989139  0.15893722  0.02910445\n", " -0.18428223  0.30011976 -0.05394572 -0.18550391  0.09144824  0.2203982\n", "  0.3605487  -0.0106479   0.729859    0.516405   -0.44636923 -0.4128766\n", " -0.523939   -0.20086594 -0.38725898  0.0440867 ]\n", "\n", "First 20 vocabulary words: ['plane', 'taking', '.', 'air', 'man', 'playing', 'large', 'flute', 'spreading', 'cheese', 'pizza', 'men', 'seated', 'fighting', 'smoking', 'piano', 'guitar', 'singing', 'woman', 'person']\n" ] } ], "source": [ "# 1. Let's see the word embedding for \"apple\" by accessing the \"wv\" attribute and passing in \"apple\" as the key.\n", "print(\"Embedding for apple:\", fastText_model.wv[\"apple\"])\n", "\n", "# 2. Inspect the model vocabulary by accessing keys of the \"wv.vocab\" attribute. We'll print the first 20 words.\n", "print(\"\\nFirst 20 vocabulary words:\", list(fastText_model.wv.vocab)[:20])\n", "\n", "# 3. Save the word embeddings. We can save in binary format (to save space) or ASCII format.\n", "# As with word2vec, use os.path.join and distinct file names to avoid overwriting.\n", "fastText_model.wv.save_word2vec_format(os.path.join(SAVE_FILES_PATH, \"fastText_model.bin\"), binary=True)  # binary format\n", "fastText_model.wv.save_word2vec_format(os.path.join(SAVE_FILES_PATH, \"fastText_model.txt\"), binary=False)  # ASCII format" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## GloVe" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "GloVe is an unsupervised algorithm, created by the Stanford NLP group, for obtaining word embeddings (see [original research paper](https://nlp.stanford.edu/pubs/glove.pdf)). Training occurs on word-word co-occurrence statistics, with the objective of learning word embeddings such that the dot product of two words' embeddings (plus bias terms) equals the logarithm of the words' co-occurrence count. See the GloVe [project page](https://nlp.stanford.edu/projects/glove/) for more detailed background on the model. \n", "\n", "Gensim doesn't have an implementation of the GloVe model, and the other Python packages that implement GloVe are unstable, so we leverage the code directly from the Stanford NLP [repo](https://github.com/stanfordnlp/GloVe). " ] },
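{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the weighted least-squares objective that GloVe minimizes is\n", "\n", "$$J = \\sum_{i,j=1}^{V} f(X_{ij})\\left(w_i^T \\tilde{w}_j + b_i + \\tilde{b}_j - \\log X_{ij}\\right)^2$$\n", "\n", "where $V$ is the vocabulary size, $X_{ij}$ counts how often word $j$ occurs in the context of word $i$, $w_i$ and $\\tilde{w}_j$ are word and context vectors with biases $b_i$ and $\\tilde{b}_j$, and $f$ is a weighting function that caps the influence of very frequent co-occurrences (controlled by the x-max parameter we pass in Step 4 below)." ] },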
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir -p build\n", "gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wall -Wextra -Wpedantic\n", "\u001b[01m\u001b[Ksrc/glove.c:\u001b[m\u001b[K In function ‘\u001b[01m\u001b[Kglove_thread\u001b[m\u001b[K’:\n", "\u001b[01m\u001b[Ksrc/glove.c:117:9:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kignoring return value of ‘\u001b[01m\u001b[Kfread\u001b[m\u001b[K’, declared with attribute warn_unused_result [-Wunused-result]\n", "         fread(&cr, sizeof(CREC), 1, fin);\n", "\u001b[01;32m\u001b[K         ^\u001b[m\u001b[K\n", "gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wall -Wextra -Wpedantic\n", "\u001b[01m\u001b[Ksrc/shuffle.c:\u001b[m\u001b[K In function ‘\u001b[01m\u001b[Kshuffle_merge\u001b[m\u001b[K’:\n", "\u001b[01m\u001b[Ksrc/shuffle.c:106:17:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kignoring return value of ‘\u001b[01m\u001b[Kfread\u001b[m\u001b[K’, declared with attribute warn_unused_result [-Wunused-result]\n", "                 fread(&array[i], sizeof(CREC), 1, fid[j]);\n", "\u001b[01;32m\u001b[K                 ^\u001b[m\u001b[K\n", "\u001b[01m\u001b[Ksrc/shuffle.c:\u001b[m\u001b[K In function ‘\u001b[01m\u001b[Kshuffle_by_chunks\u001b[m\u001b[K’:\n", "\u001b[01m\u001b[Ksrc/shuffle.c:163:9:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kignoring return value of ‘\u001b[01m\u001b[Kfread\u001b[m\u001b[K’, declared with attribute warn_unused_result [-Wunused-result]\n", "         fread(&array[i], sizeof(CREC), 1, fin);\n", "\u001b[01;32m\u001b[K         ^\u001b[m\u001b[K\n", "gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wall -Wextra -Wpedantic\n", "\u001b[01m\u001b[Ksrc/cooccur.c:\u001b[m\u001b[K In function ‘\u001b[01m\u001b[Kmerge_files\u001b[m\u001b[K’:\n", "\u001b[01m\u001b[Ksrc/cooccur.c:267:9:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kignoring return value of ‘\u001b[01m\u001b[Kfread\u001b[m\u001b[K’, declared with attribute warn_unused_result [-Wunused-result]\n", "         fread(&new, sizeof(CREC), 1, fid[i]);\n", "\u001b[01;32m\u001b[K         ^\u001b[m\u001b[K\n", "\u001b[01m\u001b[Ksrc/cooccur.c:277:5:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kignoring return value of ‘\u001b[01m\u001b[Kfread\u001b[m\u001b[K’, declared with attribute warn_unused_result [-Wunused-result]\n", "     fread(&new, sizeof(CREC), 1, fid[i]);\n", "\u001b[01;32m\u001b[K     ^\u001b[m\u001b[K\n", "\u001b[01m\u001b[Ksrc/cooccur.c:290:9:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kignoring return value of ‘\u001b[01m\u001b[Kfread\u001b[m\u001b[K’, declared with attribute warn_unused_result [-Wunused-result]\n", "         fread(&new, sizeof(CREC), 1, fid[i]);\n", "\u001b[01;32m\u001b[K         ^\u001b[m\u001b[K\n", "gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wall -Wextra -Wpedantic\n" ] } ], "source": [ "# Define path\n", "glove_model_path = os.path.join(NLP_REPO_PATH, \"utils_nlp\", \"models\", \"glove\")\n", "# Execute shell commands\n", "!cd $glove_model_path && make" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Train GloVe vectors" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Training GloVe embeddings requires some data preparation followed by four steps (also documented in the original Stanford NLP repo [here](https://github.com/stanfordnlp/GloVe/tree/master/src))." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Step 0: Prepare Data**\n", " \n", "In order to train our GloVe vectors, we first need to save our corpus as a text file with all words separated by one or more spaces or tabs, and with each document/sentence separated by a newline character." ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Save our corpus as tokens delimited by spaces, with newline characters between sentences.\n", "training_corpus_file_path = os.path.join(SAVE_FILES_PATH, \"training-corpus-cleaned.txt\")\n", "with open(training_corpus_file_path, 'w', encoding='utf8') as file:\n", "    for sent in sentences:\n", "        file.write(\" \".join(sent) + \"\\n\")" ] },
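{ "cell_type": "markdown", "metadata": {}, "source": [ "Before building the vocabulary, it's worth confirming that the corpus file looks the way the GloVe tools expect (space-separated tokens, one sentence per line). A minimal sketch that reads back the first few lines:\n", "\n", "```python\n", "# Print the first three sentences of the corpus file written above.\n", "with open(training_corpus_file_path, encoding='utf8') as f:\n", "    for _ in range(3):\n", "        print(f.readline().rstrip())\n", "```" ] },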
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Set up a Timer to see how long the model takes to train\n", "t = Timer()\n", "t.start()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Step 1: Build Vocabulary**\n", "\n", "Run the vocab_count executable. There are 3 optional parameters:\n", "1. min-count: lower limit on how many times a word must appear in the dataset; otherwise the word is discarded from our vocabulary.\n", "2. max-vocab: upper bound on the number of vocabulary words to keep.\n", "3. verbose: 0, 1, or 2 (default).\n", "\n", "Then provide the path to the text file we created in Step 0, followed by a file path that we'll save the vocabulary to." ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BUILDING VOCABULARY\r\n", "Processed 0 tokens.\u001b[0GProcessed 85334 tokens.\r\n", "Counted 11716 unique words.\r\n", "Truncating vocabulary at min count 5.\r\n", "Using vocabulary of size 2943.\r\n", "\r\n" ] } ], "source": [ "# Define path\n", "vocab_count_exe_path = os.path.join(glove_model_path, \"build\", \"vocab_count\")\n", "vocab_file_path = os.path.join(SAVE_FILES_PATH, \"vocab.txt\")\n", "# Execute shell commands\n", "!$vocab_count_exe_path -min-count 5 -verbose 2 <$training_corpus_file_path> $vocab_file_path" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Step 2: Construct Word Co-occurrence Statistics**\n", "\n", "Run the cooccur executable. There are many optional parameters, but we list the top ones here:\n", "1. symmetric: 0 for only looking at the left context, 1 (default) for looking at both the left and right context.\n", "2. window-size: number of context words to use (default 15).\n", "3. verbose: 0, 1, or 2 (default).\n", "4. vocab-file: path/name of the vocabulary file created in Step 1.\n", "5. memory: soft limit for memory consumption, in GB (default 4).\n", "6. max-product: limits the size of the dense co-occurrence array by specifying the max product (integer) of the frequency counts of the two co-occurring words.\n", "\n", "Then provide the path to the text file we created in Step 0, followed by a file path that we'll save the co-occurrences to." ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "COUNTING COOCCURRENCES\n", "window size: 15\n", "context: symmetric\n", "max product: 13752509\n", "overflow length: 38028356\n", "Reading vocab from file \"../../data/trained_word_embeddings/vocab.txt\"...loaded 2943 words.\n", "Building lookup table...table contains 8661250 elements.\n", "Processing token: 0\u001b[0GProcessed 85334 tokens.\n", "Writing cooccurrences to disk......2 files in total.\n", "Merging cooccurrence files: processed 0 lines.\u001b[39G0 lines.\u001b[39G100000 lines.\u001b[0GMerging cooccurrence files: processed 188154 lines.\n", "\n" ] } ], "source": [ "# Define path\n", "cooccur_exe_path = os.path.join(glove_model_path, \"build\", \"cooccur\")\n", "cooccurrence_file_path = os.path.join(SAVE_FILES_PATH, \"cooccurrence.bin\")\n", "# Execute shell commands\n", "!$cooccur_exe_path -memory 4 -vocab-file $vocab_file_path -verbose 2 -window-size 15 <$training_corpus_file_path> $cooccurrence_file_path" ] },
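{ "cell_type": "markdown", "metadata": {}, "source": [ "The resulting cooccurrence.bin is a stream of fixed-size binary records: in the Stanford GloVe source, each record is a C struct holding two int word indices (1-based line numbers in vocab.txt) and a double co-occurrence value. A small sketch for peeking at the first record, assuming the reference implementation's 4-byte ints and 8-byte doubles:\n", "\n", "```python\n", "import struct\n", "\n", "# Each record is (word1: int, word2: int, value: double).\n", "with open(cooccurrence_file_path, 'rb') as f:\n", "    word1, word2, value = struct.unpack('iid', f.read(struct.calcsize('iid')))\n", "print(word1, word2, value)\n", "```" ] },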
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Step 3: Shuffle the Co-occurrences**\n", "\n", "Run the shuffle executable. The parameters are as follows:\n", "1. verbose: 0, 1, or 2 (default).\n", "2. memory: soft limit for memory consumption, in GB (default 4).\n", "3. array-size: limit on the length of the buffer that stores chunks of data to shuffle before writing to disk.\n", "\n", "Then provide the path to the co-occurrence file we created in Step 2, followed by a file path that we'll save the shuffled co-occurrences to." ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SHUFFLING COOCCURRENCES\r\n", "array size: 255013683\r\n", "Shuffling by chunks: processed 0 lines.\u001b[22Gprocessed 188154 lines.\r\n", "Wrote 1 temporary file(s).\r\n", "Merging temp files: processed 0 lines.\u001b[31G188154 lines.\u001b[0GMerging temp files: processed 188154 lines.\r\n", "\r\n" ] } ], "source": [ "# Define path\n", "shuffle_exe_path = os.path.join(glove_model_path, \"build\", \"shuffle\")\n", "cooccurrence_shuf_file_path = os.path.join(SAVE_FILES_PATH, \"cooccurrence.shuf.bin\")\n", "# Execute shell commands\n", "!$shuffle_exe_path -memory 4 -verbose 2 <$cooccurrence_file_path> $cooccurrence_shuf_file_path" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Step 4: Train GloVe model**\n", "\n", "Run the glove executable. There are many parameter options, but the top ones are listed below:\n", "1. verbose: 0, 1, or 2 (default).\n", "2. vector-size: dimension of the word embeddings (default 50).\n", "3. threads: number of threads (default 8).\n", "4. iter: number of iterations (default 25).\n", "5. eta: learning rate (default 0.05).\n", "6. binary: whether to save in binary format (0: text = default, 1: binary, 2: both).\n", "7. x-max: cutoff for the weighting function (default 100).\n", "8. vocab-file: file containing the vocabulary as produced in Step 1.\n", "9. save-file: filename to save the vectors to.\n", "10. input-file: filename with the co-occurrences as returned from Step 3." ] },
{ "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TRAINING MODEL\n", "Read 188154 lines.\n", "Initializing parameters...done.\n", "vector size: 50\n", "vocab size: 2943\n", "x_max: 10.000000\n", "alpha: 0.750000\n", "08/13/19 - 05:39.53PM, iter: 001, cost: 0.078545\n", "08/13/19 - 05:39.53PM, iter: 002, cost: 0.072337\n", "08/13/19 - 05:39.53PM, iter: 003, cost: 0.070195\n", "08/13/19 - 05:39.53PM, iter: 004, cost: 0.066766\n", "08/13/19 - 05:39.53PM, iter: 005, cost: 0.063480\n", "08/13/19 - 05:39.53PM, iter: 006, cost: 0.060623\n", "08/13/19 - 05:39.53PM, iter: 007, cost: 0.058089\n", "08/13/19 - 05:39.53PM, iter: 008, cost: 0.056030\n", "08/13/19 - 05:39.53PM, iter: 009, cost: 0.053907\n", "08/13/19 - 05:39.53PM, iter: 010, cost: 0.051774\n", "08/13/19 - 05:39.53PM, iter: 011, cost: 0.049576\n", "08/13/19 - 05:39.53PM, iter: 012, cost: 0.047385\n", "08/13/19 - 05:39.53PM, iter: 013, cost: 0.045207\n", "08/13/19 - 05:39.53PM, iter: 014, cost: 0.043098\n", "08/13/19 - 05:39.53PM, iter: 015, cost: 0.041065\n" ] } ], "source": [ "# Define path\n", "glove_exe_path = os.path.join(glove_model_path, \"build\", \"glove\")\n", "glove_vector_file_path = os.path.join(SAVE_FILES_PATH, \"GloVe_vectors\")\n", "# Execute shell commands\n", "!$glove_exe_path -save-file $glove_vector_file_path -threads 8 -input-file \\\n", "$cooccurrence_shuf_file_path -x-max 10 -iter 15 -vector-size 50 -binary 2 \\\n", "-vocab-file $vocab_file_path -verbose 2" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "t.stop()" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 3.4293\n" ] } ], "source": [ "print(\"Time elapsed: {}\".format(t))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect Word Vectors" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As we did above for the word2vec and fastText models, let's now inspect our word embeddings. (The saved text file could also be converted for gensim's KeyedVectors interface via its glove2word2vec script; here we simply parse it directly.)" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Load the saved GloVe word vectors into a dictionary keyed by word.\n", "glove_wv = {}\n", "glove_vector_txt_file_path = os.path.join(SAVE_FILES_PATH, \"GloVe_vectors.txt\")\n", "with open(glove_vector_txt_file_path, encoding='utf-8') as f:\n", "    for line in f:\n", "        split_line = line.split(\" \")\n", "        glove_wv[split_line[0]] = [float(i) for i in split_line[1:]]" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embedding for apple: [-0.037004, -0.000665, -0.028638, 0.025758, -0.050187, 0.038694, 0.016966, -0.042032, -0.033963, 0.143667, -0.068749, -0.005046, 0.180022, 0.088593, -0.04615, -0.013351, 0.064172, 0.051637, -0.000885, 0.009899, -0.092548, -0.026595, 0.036515, -0.09158, -0.027992, 0.016924, -0.024003, -0.029879, 0.252747, 0.093754, -0.034897, 0.079439, -0.073516, -0.110923, 0.095652, 0.072123, -0.047069, -0.17929, -0.068377, -0.224694, -0.016158, 0.236704, 0.010695, -0.133073, 0.084929, 0.102969, 0.040056, -0.009444, -0.051333, 0.130339]\n", "\n", "First 20 vocabulary words: ['.', ',', 'man', '-', '\"', 'woman', \"'\", 'said', 'dog', 'playing', ':', 'white', 'black', '$', 'killed', 'percent', 'new', 'syria', 'people', 'china']\n" ] } ], "source": [ "# 1. Let's see the word embedding for \"apple\" by passing in \"apple\" as the key.\n", "print(\"Embedding for apple:\", glove_wv[\"apple\"])\n", "\n", "# 2. Inspect the vocabulary by listing the dictionary keys. We'll print the first 20 words.\n", "print(\"\\nFirst 20 vocabulary words:\", list(glove_wv.keys())[:20])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Concluding Remarks\n", "\n", "In this notebook we have shown how to train word2vec, GloVe, and fastText word embeddings on the STS Benchmark dataset. We also inspected how long each model took to train on our dataset: as recorded above, word2vec took about 0.32 seconds, fastText took about 9.37 seconds, and GloVe took about 3.43 seconds (excluding the one-time compilation of the GloVe tools).\n", "\n", "FastText is typically regarded as the best baseline for word embeddings (see this [blog](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)) and is a good place to start when generating word embeddings. Now that we have generated word embeddings on our dataset, we could also repeat the baseline_deep_dive notebook using these embeddings (instead of the pre-trained ones downloaded from the internet)." ] } ], "metadata": { "kernelspec": { "display_name": "Python (nlp_gpu)", "language": "python", "name": "nlp_gpu" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }