NGramFeaturizer
APIs
- class pyis.python.ops.NGramFeaturizer(self: ops.NGramFeaturizer, n: int, boundaries: bool) -> None
NGramFeaturizer extracts ngram features given a string token list.
Create an NGramFeaturizer object.
- Parameters
n (int) – The ngram length in tokens. Valid values are 1 through 8, inclusive.
boundaries (bool) – Whether to capture query boundaries. If True, the ngrams at the start and end of the sentence are treated as additional features.
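To make the two constructor parameters concrete, the following pure-Python sketch shows one way sliding ngram windows and boundary capture could work. It does not call pyis. The helper name `extract_ngrams` and the boundary markers `<s>` and `</s>` are assumptions made for illustration, and the library's internal representation may differ.

```python
from typing import List


def extract_ngrams(tokens: List[str], n: int, boundaries: bool) -> List[str]:
    """Return the ngrams of `tokens` as space-joined strings (illustrative only)."""
    if not 1 <= n <= 8:
        raise ValueError("n must be in [1, 8]")
    # Hypothetical boundary markers; the real library may encode boundaries differently.
    seq = ["<s>"] + list(tokens) + ["</s>"] if boundaries else list(tokens)
    # Slide a window of width n over the (optionally padded) sequence.
    return [" ".join(seq[i:i + n]) for i in range(len(seq) - n + 1)]


without_boundaries = extract_ngrams(["the", "answer", "is", "42"], 2, False)
with_boundaries = extract_ngrams(["the", "answer", "is", "42"], 2, True)
print(without_boundaries)  # ['the answer', 'answer is', 'is 42']
print(with_boundaries)     # ['<s> the', 'the answer', 'answer is', 'is 42', '42 </s>']
```

Under these assumptions, enabling boundaries adds two extra bigrams for a four-token input, one at each end of the sentence.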
- dump_ngram(self: ops.NGramFeaturizer, arg0: str) -> None
Save ngram words to file.
- Parameters
ngram_file (str) – The target ngram file. The signature above exposes this argument positionally, as arg0.
- fit(self: ops.NGramFeaturizer, tokens: List[str]) -> None
Build ngrams from the token list and add any that have not been seen before to the ngram list.
- Parameters
tokens (List[str]) – The token list for collecting new ngrams.
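The "add if not already seen" behaviour of fit can be sketched in pure Python as follows. This is an illustration rather than the library's implementation: the helper `fit_ngrams`, the dictionary-based ngram list, and the sequential id assignment are all assumptions, and boundary handling is omitted for brevity.

```python
from typing import Dict, List


def fit_ngrams(vocab: Dict[str, int], tokens: List[str], n: int) -> None:
    """Add each unseen ngram of `tokens` to `vocab` under the next free id."""
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        if ngram not in vocab:
            # Assumption: ids are assigned sequentially in order of first appearance.
            vocab[ngram] = len(vocab)


vocab: Dict[str, int] = {}
fit_ngrams(vocab, ["the", "answer", "is", "42"], 2)
fit_ngrams(vocab, ["the", "answer", "is", "no"], 2)  # only "is no" is new
print(vocab)  # {'the answer': 0, 'answer is': 1, 'is 42': 2, 'is no': 3}
```

Calling the helper repeatedly accumulates ngrams across calls, which mirrors how fit is described as extending the existing ngram list.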
- load_ngram(self: ops.NGramFeaturizer, ngram_file: str) -> None
Load ngram list from file.
The ngram file should contain two columns, separated by whitespace or tab characters. The first column holds the ngram words and the second column holds the assigned id. For example,
    the answer 0
    answer is 1
Ensure the file is encoded in utf-8.
- Parameters
ngram_file (str) – The source ngram file.
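The file format described above could be read with a few lines of Python. The sketch below is independent of pyis. The helper name `parse_ngram_file` is hypothetical, and it assumes that the last whitespace-separated field on each line is the id and that everything before it is the ngram, which is one plausible reading of the format since ngrams themselves contain spaces.

```python
import os
import tempfile
from typing import Dict


def parse_ngram_file(path: str) -> Dict[str, int]:
    """Parse lines of the form '<ngram words> <id>' from a UTF-8 file."""
    vocab: Dict[str, int] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()  # splits on spaces and tabs alike
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # Assumption: the last field is the id, the rest is the ngram.
            vocab[" ".join(fields[:-1])] = int(fields[-1])
    return vocab


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ngrams.txt")
    # Write a small file in the documented layout, explicitly encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("the answer\t0\nanswer is\t1\n")
    parsed = parse_ngram_file(path)

print(parsed)  # {'the answer': 0, 'answer is': 1}
```

Passing `encoding="utf-8"` explicitly when writing the file avoids depending on the platform's default encoding.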
- transform(self: ops.NGramFeaturizer, tokens: List[str]) -> List[ops.TextFeature]
Extract features for the known ngrams that occur in the token list.
- Parameters
tokens (List[str]) – The token list.
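The lookup against known ngrams can be sketched in pure Python as below. This is a simplified illustration, not the library's code. The helper `transform_ngrams` is hypothetical, boundary features are left out, and the tuple layout (id, value, start token index, end token index) is an assumption inferred from the shape of the tuples printed in the example on this page.

```python
from typing import Dict, List, Tuple

Feature = Tuple[int, float, int, int]  # assumed layout: (id, value, start, end)


def transform_ngrams(vocab: Dict[str, int], tokens: List[str], n: int) -> List[Feature]:
    """Emit one feature per known ngram found in `tokens`; unknown ngrams are skipped."""
    features: List[Feature] = []
    for start in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[start:start + n])
        if ngram in vocab:
            # The end index is inclusive, so a bigram at position 2 spans tokens 2..3.
            features.append((vocab[ngram], 1.0, start, start + n - 1))
    return features


known = {"the answer": 0, "answer is": 1}
found = transform_ngrams(known, ["what", "is", "the", "answer"], 2)
print(found)  # [(0, 1.0, 2, 3)]
```

In this sketch "what is" and "is the" produce no features because they are not in the known ngram list.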
Example
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.
from pyis.python import ops
featurizer = ops.NGramFeaturizer(2, True)
featurizer.fit(['the', 'answer', 'is', '42'])
features = featurizer.transform(['the', 'answer', 'is', '42'])
for f in features:
    print(f.to_tuple())
'''Output:
# (0, 1.0, 0, 1)
# (1, 1.0, 0, 1)
# (2, 1.0, 1, 2)
# (3, 1.0, 2, 3)
# (4, 1.0, 2, 3)
'''