LexiconFeaturizer¶
In PyIS, an operator is either a stateless global function or a member function of a class that holds external state (data members). In either case, an operator is thread-safe and reentrant.
In most cases, we prefer a class member function. The WordDict operator, used in the example below, demonstrates how an operator is implemented in the form of a class and how it behaves.
APIs¶
- class pyis.python.ops.NGramFeaturizer(self: ops.NGramFeaturizer, n: int, boundaries: bool) → None ¶
NGramFeaturizer extracts ngram features given a string token list.
Create a NGramFeaturizer object.
- Parameters
n (int) – The ngram length; valid values are in [1, 8].
boundaries (bool) – Whether to capture query boundaries. If True, ngrams at the start or end of the sentence are treated as additional features.
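To make the boundaries behavior concrete, here is a small pure-Python sketch of ngram enumeration. This is not the PyIS implementation; the sentinel tokens and the up-to-n enumeration are assumptions for illustration only.

```python
from typing import List


def extract_ngrams(tokens: List[str], n: int, boundaries: bool) -> List[str]:
    """Enumerate ngrams of length 1..n over the token list.

    With boundaries=True, pad the sentence with sentinel tokens so that
    ngrams touching the start or end become distinct features.
    (Sentinel names '<s>'/'</s>' are illustrative, not the PyIS internals.)
    """
    if boundaries:
        tokens = ["<s>"] + tokens + ["</s>"]
    ngrams = []
    for length in range(1, n + 1):
        for i in range(len(tokens) - length + 1):
            ngrams.append(" ".join(tokens[i:i + length]))
    return ngrams


print(extract_ngrams(["the", "answer"], 2, False))
# -> ['the', 'answer', 'the answer']
```

With boundaries=True the same call additionally yields features such as "&lt;s&gt; the" and "answer &lt;/s&gt;", which let a downstream model distinguish sentence-initial and sentence-final ngrams.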
- dump_ngram(self: ops.NGramFeaturizer, arg0: str) → None ¶
Save ngram words to file.
- Parameters
ngram_file (str) – The target ngram file.
- fit(self: ops.NGramFeaturizer, tokens: List[str]) → None ¶
Build ngrams from the token list and add any not already seen to the ngram list.
- Parameters
tokens (List[str]) – The token list for collecting new ngrams.
- load_ngram(self: ops.NGramFeaturizer, ngram_file: str) → None ¶
Load ngram list from file.
The ngram file should contain two columns, separated by whitespace or tab characters. The first column is the ngram words and the second is the corresponding assigned id. For example,

the answer	0
answer is	1
Ensure the file is encoded in utf-8.
- Parameters
ngram_file (str) – The source ngram file.
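The two-column format above can be parsed in a few lines. The sketch below illustrates the file format only; it is not the PyIS loader, and the helper name is hypothetical.

```python
from typing import Dict


def parse_ngram_text(text: str) -> Dict[str, int]:
    """Parse an ngram->id mapping from two-column text.

    Columns are separated by whitespace or tabs; the last field on each
    line is the id, and the remaining fields form the ngram words.
    """
    mapping: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        ngram, idx = " ".join(parts[:-1]), int(parts[-1])
        mapping[ngram] = idx
    return mapping


content = "the answer\t0\nanswer is\t1\n"
print(parse_ngram_text(content))
# -> {'the answer': 0, 'answer is': 1}
```

When reading from disk, open the file with encoding='utf-8', matching the requirement stated above.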
- transform(self: ops.NGramFeaturizer, tokens: List[str]) → List[ops.TextFeature] ¶
Extract ngrams given the token list based on known ngrams.
- Parameters
tokens (List[str]) – The token list.
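Taken together, fit and transform behave roughly like the following toy model. The incremental ids, the (id, ngram) tuples standing in for ops.TextFeature, and the up-to-n enumeration are all assumptions for illustration, not the PyIS implementation.

```python
from typing import Dict, List, Tuple


class NGramFeaturizerSketch:
    """Toy stand-in for fit/transform semantics: fit registers unseen
    ngrams with incremental ids; transform emits features only for
    ngrams that are already known."""

    def __init__(self, n: int):
        self.n = n
        self.ngrams: Dict[str, int] = {}

    def _enumerate(self, tokens: List[str]) -> List[str]:
        out = []
        for length in range(1, self.n + 1):
            for i in range(len(tokens) - length + 1):
                out.append(" ".join(tokens[i:i + length]))
        return out

    def fit(self, tokens: List[str]) -> None:
        # Add ngrams not already seen, assigning incremental ids.
        for g in self._enumerate(tokens):
            if g not in self.ngrams:
                self.ngrams[g] = len(self.ngrams)

    def transform(self, tokens: List[str]) -> List[Tuple[int, str]]:
        # Emit only ngrams learned during fit (or loaded from file).
        return [(self.ngrams[g], g)
                for g in self._enumerate(tokens) if g in self.ngrams]


f = NGramFeaturizerSketch(2)
f.fit(["the", "answer"])
print(f.transform(["the", "question"]))
# -> [(0, 'the')]
```

Note that "question" and "the question" produce no features because they were never seen during fit; only known ngrams are extracted.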
Example¶
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.
import os
from typing import List

from pyis.python import ops


class Model:
    def __init__(self, dict_data_file: str):
        super().__init__()
        self.dictionary = ops.WordDict(dict_data_file)

    def run(self, tokens: List[str]) -> List[str]:
        res = self.dictionary.translate(tokens)
        return res


dict_data_file = os.path.join(os.path.dirname(__file__), 'word_dict.data.txt')
m = Model(dict_data_file)
res = m.run(["life", "in", "suzhou"])
print(res)