Tutorial - Create an ANN Index¶
Once you have your data in the supported format, and a spaCy model with vectors you can use the spacy_ann
CLI to compute the nearest neighbors index for your Aliases and tran an Encoder for disambiguating entity spans to their canonical Id
Run the create_index
help command to understand the required arguments.
$ spacy_ann create_index --help spacy_ann create_index --help Usage: spacy_ann create_index [OPTIONS] MODEL KB_DIR OUTPUT_DIR Create an AnnLinker based on the Character N-Gram TF- IDF vectors for aliases in a KnowledgeBase model (str): spaCy language model directory or name to load kb_dir (Path): path to the directory with kb entities.jsonl and aliases.jsonl files output_dir (Path): path to output_dir for spaCy model with ann_linker pipe kb File Formats e.g. entities.jsonl {"id": "a1", "description": "Machine learning (ML) is the scientific study of algorithms and statistical models..."} {"id": "a2", "description": "ML ("Meta Language") is a general-purpose functional programming language. It has roots in Lisp, and has been characterized as "Lisp with types"."} e.g. aliases.jsonl {"alias": "ML", "entities": ["a1", "a2"], "probabilities": [0.5, 0.5]} Options: --new-model-name TEXT --cg-threshold FLOAT --n-iter INTEGER --verbose / --no-verbose --install-completion Install completion for the current shell. --show-completion Show completion for the current shell, to copy it or customize the installation. --help Show this message and exit.
Now provide the required arguments. I'm using the example data but at this step use your own.
the create_index
command will run a few steps and you should see an output like the one below.
spacy_ann create_index en_core_web_md examples/tutorial/data examples/tutorial/models // The create_index command runs a few steps // Load the model passed as the first positional argument (en_core_web_md) ===================== Load Model ====================== ⠹ Loading model en_core_web_md✔ Done. ℹ 0 entities without a description // Train an EntityEncoder on the descriptions of each Entity ================= Train EntityEncoder ================= ⠸ Starting training EntityEncoder✔ Done Training // Apply the EntityEncoder to get the final vectors for each entity ================= Apply EntityEncoder ================= ⠙ Applying EntityEncoder to descriptions✔ Finished, embeddings created ✔ Done adding entities and aliases to kb // Create Nearest Neighbors index from the Aliases in kb_dir/aliases.jsonl ================== Create ANN Index =================== Fitting tfidf vectorizer on 6 aliases Fitting and saving vectorizer took 0.012949 seconds Finding empty (all zeros) tfidf vectors Deleting 2/6 aliases because their tfidf is empty Fitting ann index on 4 aliases 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| *************************************************** Fitting ann index took 0.030826 seconds