tokenizers

Module with various LLM tokenizers wrapped in a common interface.

class mfai.tokenizers.GPT2Tokenizer[source]

Bases: Tokenizer

add_special_tokens(new_special_tokens)[source]

Method to add some special tokens to the tokenizer.

For more details about extending a tiktoken.Encoding, see https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken

Return type:

None

Parameters:

new_special_tokens (list[str])
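
Example

A minimal sketch of how this might be called. It assumes GPT2Tokenizer can be built without constructor arguments (not documented here); the marker names are purely illustrative:

from mfai.tokenizers import GPT2Tokenizer

tokenizer = GPT2Tokenizer()  # constructor signature assumed
# Register two hypothetical markers as special tokens of the extended encoding
tokenizer.add_special_tokens(["<|obs|>", "<|forecast|>"])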

decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

property eot_token: int
name()[source]
Return type:

str

property vocab_size: int
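
Example

An encode/decode round trip through the documented interface; the no-argument constructor is an assumption:

from mfai.tokenizers import GPT2Tokenizer

tokenizer = GPT2Tokenizer()  # constructor signature assumed
ids = tokenizer.encode("2 m temperature anomaly")  # -> list[int]
text = tokenizer.decode(ids)                       # -> original string
print(tokenizer.name(), tokenizer.vocab_size, tokenizer.eot_token)
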
class mfai.tokenizers.LlamaTokenizer[source]

Bases: Tokenizer

decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

property eot_token: int
name()[source]
Return type:

str

property vocab_size: int
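
Example

Both wrappers derive from Tokenizer, so code written against the abstract interface accepts either one. A sketch assuming both constructors take no arguments (in practice a Llama tokenizer may need model files):

from mfai.tokenizers import GPT2Tokenizer, LlamaTokenizer, Tokenizer

def token_count(tokenizer: Tokenizer, text: str) -> int:
    # Any Tokenizer implementation works here.
    return len(tokenizer.encode(text))

for tok in (GPT2Tokenizer(), LlamaTokenizer()):  # constructor signatures assumed
    print(tok.name(), token_count(tok, "surface pressure field"))
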
class mfai.tokenizers.MiniGPT2Tokenizer[source]

Bases: Tokenizer, ABC

A Tokenizer using a reduced set of tokens from a base GPT2Tokenizer. The typical use case is a narrow-vocabulary problem that needs only 1000 tokens out of a vocabulary of 50000. To use this class, you only have to implement the method tokens().

add_special_tokens(special_tokens)[source]

Method to add some special tokens to the tokenizer.

For more details about extending a tiktoken.Encoding, see https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken

Return type:

None

Parameters:

special_tokens (list[str])

decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

property eot_token: int
name()[source]
Return type:

str

abstract tokens()[source]

Method that returns the set of tokens (token ids from the base GPT2 tokenizer) to keep in the reduced vocabulary.

Return type:

set

Example

def tokens(self) -> set:
    unique_tokens = set()
    texts: list[str] = ...  # Load all texts you want to encode
    for text in texts:
        tokens = self.gpt2_tokenizer.encode(text)
        unique_tokens.update(tokens)
    return unique_tokens

property vocab_size: int
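
Example

A sketch of a concrete subclass that keeps only the tokens found in a small corpus. It reuses the gpt2_tokenizer attribute shown in the tokens() example above; the corpus and class name are illustrative:

from mfai.tokenizers import MiniGPT2Tokenizer

class WeatherTokenizer(MiniGPT2Tokenizer):  # hypothetical subclass
    def tokens(self) -> set:
        corpus = ["light rain", "heavy rain", "clear sky"]  # illustrative texts
        unique_tokens: set = set()
        for text in corpus:
            unique_tokens.update(self.gpt2_tokenizer.encode(text))
        return unique_tokens
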
class mfai.tokenizers.Tokenizer[source]

Bases: ABC

abstract decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

abstract encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

abstract property eot_token: int
abstract name()[source]
Return type:

str

abstract property vocab_size: int
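
Example

To plug in a new backend, subclass Tokenizer and implement every abstract member listed above. A self-contained character-level sketch, purely illustrative and not part of the library:

from mfai.tokenizers import Tokenizer

class CharTokenizer(Tokenizer):  # hypothetical example class
    def encode(self, text, *args, **kwargs) -> list[int]:
        return [ord(c) for c in text]  # one id per character

    def decode(self, tokens, *args, **kwargs) -> str:
        return "".join(chr(t) for t in tokens)

    @property
    def eot_token(self) -> int:
        return 0  # arbitrary end-of-text id for this toy scheme

    def name(self) -> str:
        return "char"

    @property
    def vocab_size(self) -> int:
        return 0x110000  # number of Unicode code points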