{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Developing Word Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than use pre-trained embeddings (as we did in the sentence similarity baseline_deep_dive [notebook](../sentence_similarity/baseline_deep_dive.ipynb)), we can train word embeddings using our own dataset. In this notebook, we demonstrate the training process for producing word embeddings using the word2vec, GloVe, and fastText models. We'll utilize the STS Benchmark dataset for this task. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Data Loading and Preprocessing](#Load-and-Preprocess-Data)\n", "* [Word2Vec](#Word2Vec)\n", "* [fastText](#fastText)\n", "* [GloVe](#GloVe)\n", "* [Concluding Remarks](#Concluding-Remarks)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "import sys\n", "import os\n", "\n", "# Set the environment path\n", "sys.path.append(\"../..\")\n", "\n", "import numpy as np\n", "from utils_nlp.dataset.preprocess import (\n", " to_lowercase,\n", " to_spacy_tokens,\n", " rm_spacy_stopwords,\n", ")\n", "from utils_nlp.dataset import stsbenchmark\n", "from utils_nlp.common.timer import Timer\n", "from gensim.models import Word2Vec\n", "from gensim.models.fasttext import FastText" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Set the path for where your repo is located\n", "NLP_REPO_PATH = os.path.join('..','..')\n", "\n", "# Set the path for where your datasets are located\n", "BASE_DATA_PATH = os.path.join(NLP_REPO_PATH, \"data\")\n", "\n", "# Set the path for location to save embeddings\n", "SAVE_FILES_PATH = os.path.join(BASE_DATA_PATH, \"trained_word_embeddings\")\n", "if not os.path.exists(SAVE_FILES_PATH):\n", " os.makedirs(SAVE_FILES_PATH)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Preprocess Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 401/401 [00:02<00:00, 182KB/s] " ] }, { "name": "stdout", "output_type": "stream", "text": [ "Data downloaded to ../../data/raw/stsbenchmark\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Produce a pandas dataframe for the training set\n", "train_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"train\")\n", "\n", "# Clean the sts dataset\n", "sts_train = stsbenchmark.clean_sts(train_raw)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | score | \n", "sentence1 | \n", "sentence2 | \n", "
---|---|---|---|
0 | \n", "5.00 | \n", "A plane is taking off. | \n", "An air plane is taking off. | \n", "
1 | \n", "3.80 | \n", "A man is playing a large flute. | \n", "A man is playing a flute. | \n", "
2 | \n", "3.80 | \n", "A man is spreading shreded cheese on a pizza. | \n", "A man is spreading shredded cheese on an uncoo... | \n", "
3 | \n", "2.60 | \n", "Three men are playing chess. | \n", "Two men are playing chess. | \n", "
4 | \n", "4.25 | \n", "A man is playing the cello. | \n", "A man seated is playing the cello. | \n", "
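The preprocessing helpers imported above (`to_lowercase`, `to_spacy_tokens`, `rm_spacy_stopwords`) turn these sentence pairs into the list-of-token-lists format that gensim's `Word2Vec` and `FastText` trainers consume. Below is a minimal sketch of that shape, assuming the `sts_train` dataframe shown above and substituting gensim's built-in `simple_preprocess` for the repo's spaCy-based helpers, whose exact signatures are not shown here:

```python
from gensim.utils import simple_preprocess

# Pool both sides of each sentence pair into one corpus
# (column names taken from the dataframe head shown above).
raw_sentences = sts_train["sentence1"].tolist() + sts_train["sentence2"].tolist()

# simple_preprocess lowercases and tokenizes each sentence; it stands in
# here for the to_lowercase / to_spacy_tokens / rm_spacy_stopwords pipeline.
tokenized_corpus = [simple_preprocess(s) for s in raw_sentences]

print(tokenized_corpus[0])
# e.g. ['plane', 'is', 'taking', 'off']
```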