Tutorial - Local Entity Linking

In the previous step, you ran the spacy_ann create_index CLI command. The output of this command is a loadable spaCy model with an ann_linker component capable of Entity Linking against your KnowledgeBase data. You can load the saved model from the output_dir you used in the previous step just like you would any normal spaCy model.

Load ann_linker model

First, load the model created by spacy_ann create_index:

import spacy

# Load the spaCy model from the output_dir you used
# in the create_index command
model_dir = "examples/tutorial/models/ann_linker"
nlp = spacy.load(model_dir)
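At this point, nlp is an ordinary spaCy pipeline with the ann_linker component attached. As a quick sanity check, you can list the pipeline components (a minimal sketch; the exact component names depend on the base model you passed to create_index):

print(nlp.pipe_names)
# Expect 'ann_linker' at the end, e.g. (assuming an en_core_web_md base):
# ['tagger', 'parser', 'ner', 'ann_linker']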

Load Extraction Model

This heading is a bit of a misnomer for the provided example code. In practice you'd likely want a trained NER model, but for the purposes of this example we'll just arbitrarily extract entities using the spaCy EntityRuler component, adding a few terms to it that are close to those in our KnowledgeBase. Continuing with the nlp pipeline loaded above:

# The NER component of the en_core_web_md model doesn't actually
# recognize the aliases as entities, so we'll add a
# spaCy EntityRuler component for now to extract them.
ruler = nlp.create_pipe('entity_ruler')
patterns = [
    {"label": "SKILL", "pattern": alias}
    for alias in nlp.get_pipe('ann_linker').kb.get_alias_strings() + ['machine learn']
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ann_linker")
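If you're unsure which terms the ruler will pick up, you can peek at the alias strings stored in the KnowledgeBase, using the same kb accessor as above (the exact list depends on the aliases you indexed):

kb = nlp.get_pipe('ann_linker').kb
print(kb.get_alias_strings())
# e.g. ['NLP', 'Natural Language Processing', 'Machine learning', ...]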

Test the trained ann_linker component

Run the pipeline on some sample text and ensure that e.kb_id_ is set properly for each entity. You should get id a3 for "NLP" and id a1 for "machine learn". Putting it all together:

import spacy

if __name__ == "__main__":

    # Load the spaCy model from the output_dir you used
    # in the create_index command
    model_dir = "examples/tutorial/models/ann_linker"
    nlp = spacy.load(model_dir)

    # The NER component of the en_core_web_md model doesn't actually
    # recognize the aliases as entities, so we'll add a
    # spaCy EntityRuler component for now to extract them.
    ruler = nlp.create_pipe('entity_ruler')
    patterns = [
        {"label": "SKILL", "pattern": alias}
        for alias in nlp.get_pipe('ann_linker').kb.get_alias_strings() + ['machine learn']
    ]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler, before="ann_linker")

    doc = nlp("NLP is a subset of machine learn.")

    print([(e.text, e.label_, e.kb_id_) for e in doc.ents])

    # Outputs:
    # [('NLP', 'SKILL', 'a3'), ('machine learn', 'SKILL', 'a1')]
    #
    # In our entities.jsonl file:
    # a3 => Natural Language Processing
    # a1 => Machine learning
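The ann_linker component also records the candidate aliases it retrieved from the ANN Index on each entity span via the alias_candidates extension. You can inspect those candidates separately (the printed representation of the candidate objects may vary):

print([(e.text, e._.alias_candidates) for e in doc.ents])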

Next Steps

This works great when you can afford to fit your KnowledgeBase in memory and have direct access to it. In the next step of this tutorial, we'll talk about hosting the KnowledgeBase and ANN Index remotely and making batch calls to that endpoint, so you can keep the KnowledgeBase and model code separate.
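As a preview, the remote setup swaps the local ann_linker for a component that calls out to a linking service over HTTP. A hypothetical sketch, assuming the remote_ann_linker component and base_url config covered in the next step (the URL below is a placeholder):

import spacy

nlp = spacy.blank("en")
# remote_ann_linker queries a hosted KnowledgeBase/ANN Index
# over HTTP instead of loading them into memory
remote_linker = nlp.create_pipe('remote_ann_linker', {
    'base_url': "http://localhost:8080/link"  # placeholder endpoint
})
nlp.add_pipe(remote_linker)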