NGramFeaturizer

APIs

class pyis.python.ops.NGramFeaturizer(self: ops.NGramFeaturizer, n: int, boundaries: bool) -> None

NGramFeaturizer extracts ngram features from a list of string tokens.

Create an NGramFeaturizer object.

Parameters
  • n (int) – The ngram length in tokens. Valid values are 1 through 8, inclusive.

  • boundaries (bool) – Whether to capture query boundaries. If True, the ngrams at the start or end of the sentence are treated as additional features.

dump_ngram(self: ops.NGramFeaturizer, arg0: str) -> None

Save the ngrams to a file.

Parameters

ngram_file (str) – The target ngram file.

fit(self: ops.NGramFeaturizer, tokens: List[str]) -> None

Build ngrams from the token list and add any that have not been seen before to the ngram list.

Parameters

tokens (List[str]) – The token list for collecting new ngrams.
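
The collection step can be pictured with a small pure-Python sketch. It does not call pyis: `build_ngram_vocab` is a hypothetical helper, and both the first-seen id assignment and the omission of boundary handling are assumptions made for illustration, not confirmed library behavior.

```python
from typing import Dict, List


def build_ngram_vocab(tokens: List[str], n: int, vocab: Dict[str, int]) -> Dict[str, int]:
    """Add every unseen n-token window in `tokens` to `vocab` (hypothetical helper)."""
    if not 1 <= n <= 8:
        raise ValueError("n must be in [1, 8]")
    # Slide a window of n tokens across the list.
    for start in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[start:start + n])
        # Assumption: a new ngram receives the next unused id.
        if ngram not in vocab:
            vocab[ngram] = len(vocab)
    return vocab


vocab = build_ngram_vocab(["the", "answer", "is", "42"], 2, {})
print(vocab)
```

Calling the helper again with tokens it has already seen leaves the mapping unchanged, which mirrors the "if not already seen" wording above.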

load_ngram(self: ops.NGramFeaturizer, ngram_file: str) -> None

Load ngram list from file.

The ngram file should contain two columns, separated by whitespace or tab characters. The first column holds the ngram words and the second holds the assigned id. For example,

the answer 0
answer is 1

Ensure the file is encoded in UTF-8.

Parameters

ngram_file (str) – The source ngram file.
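
To make the file layout concrete, the sketch below parses lines in the format described above using plain Python. `parse_ngram_lines` is a hypothetical helper, not part of pyis. Treating the last whitespace-separated field as the id and skipping blank lines are assumptions.

```python
from typing import Dict, Iterable


def parse_ngram_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'ngram words<whitespace>id' lines into a dict (hypothetical helper)."""
    ngrams: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Assumption: blank lines are ignored.
        # The id is the last whitespace-separated field; the rest is the ngram.
        words, ngram_id = line.rsplit(None, 1)
        # Normalize the separators between ngram words to single spaces.
        ngrams[" ".join(words.split())] = int(ngram_id)
    return ngrams


sample = ["the answer 0", "answer is\t1"]
print(parse_ngram_lines(sample))
```

To read a real file with this helper, pass it a handle opened with `open(path, encoding="utf-8")`, which matches the UTF-8 requirement above.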

transform(self: ops.NGramFeaturizer, tokens: List[str]) -> List[ops.TextFeature]

Extract the known ngrams that occur in the token list.

Parameters

tokens (List[str]) – The token list.
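
The lookup against known ngrams can be sketched in plain Python as follows. `extract_known_ngrams` is a hypothetical helper that does not call pyis. The tuple layout (id, value, start, end), the constant value 1.0, the inclusive end index, and the skipping of unknown ngrams are assumptions inferred from the output shown in the Example section. Boundary features are left out.

```python
from typing import Dict, List, Tuple


def extract_known_ngrams(
    tokens: List[str], n: int, vocab: Dict[str, int]
) -> List[Tuple[int, float, int, int]]:
    """Return (id, value, start, end) for each known ngram (hypothetical helper)."""
    features = []
    for start in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[start:start + n])
        ngram_id = vocab.get(ngram)
        # Assumption: ngrams missing from the vocabulary produce no feature.
        if ngram_id is not None:
            features.append((ngram_id, 1.0, start, start + n - 1))
    return features


known = {"the answer": 0, "answer is": 1}
print(extract_known_ngrams(["the", "answer", "is", "42"], 2, known))
```

Here "is 42" is not in the vocabulary, so only two features come back.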

Example

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from pyis.python import ops

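# Bigram featurizer (n=2) that also captures sentence boundaries.
featurizer = ops.NGramFeaturizer(2, True)

# Collect the ngrams that occur in the token list.
featurizer.fit(['the', 'answer', 'is', '42'])

# Extract features for the known ngrams found in the same tokens.
features = featurizer.transform(['the', 'answer', 'is', '42'])

for f in features:
    print(f.to_tuple())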

'''Output:
# (0, 1.0, 0, 1)
# (1, 1.0, 0, 1)
# (2, 1.0, 1, 2)
# (3, 1.0, 2, 3)
# (4, 1.0, 2, 3)
'''