tokenizers

Module with various LLM tokenizers wrapped in a common interface.

class mfai.tokenizers.GPT2Tokenizer[source]

Bases: Tokenizer

add_special_tokens(new_special_tokens)[source]

Method to add some special tokens to the tokenizer.

For more details about extending a tiktoken.Encoding, see https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken

Return type:

None

Parameters:

new_special_tokens (list[str])
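
Example

A minimal sketch of how this might be called. It assumes GPT2Tokenizer can be built without constructor arguments (not documented here); the marker names are purely illustrative:

from mfai.tokenizers import GPT2Tokenizer

tokenizer = GPT2Tokenizer()  # constructor signature assumed
# Register two hypothetical markers as special tokens of the extended encoding
tokenizer.add_special_tokens(["<|obs|>", "<|forecast|>"])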

decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

property eot_token: int
name()[source]
Return type:

str

property vocab_size: int
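
Example

An encode/decode round trip through the documented interface; the no-argument constructor is an assumption:

from mfai.tokenizers import GPT2Tokenizer

tokenizer = GPT2Tokenizer()  # constructor signature assumed
ids = tokenizer.encode("2 m temperature anomaly")  # -> list[int]
text = tokenizer.decode(ids)                       # -> original string
print(tokenizer.name(), tokenizer.vocab_size, tokenizer.eot_token)
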
class mfai.tokenizers.LlamaTokenizer[source]

Bases: Tokenizer

decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

property eot_token: int
name()[source]
Return type:

str

property vocab_size: int
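
Example

Both wrappers derive from Tokenizer, so code written against the abstract interface accepts either one. A sketch assuming both constructors take no arguments (in practice a Llama tokenizer may need model files):

from mfai.tokenizers import GPT2Tokenizer, LlamaTokenizer, Tokenizer

def token_count(tokenizer: Tokenizer, text: str) -> int:
    # Any Tokenizer implementation works here.
    return len(tokenizer.encode(text))

for tok in (GPT2Tokenizer(), LlamaTokenizer()):  # constructor signatures assumed
    print(tok.name(), token_count(tok, "surface pressure field"))
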
class mfai.tokenizers.MiniGPT2Tokenizer[source]

Bases: Tokenizer, ABC

A Tokenizer using a reduced set of tokens from a base GPT2Tokenizer. The typical use case is a narrow-vocabulary problem that needs only 1000 tokens out of a vocabulary of 50000. To use this class, you only have to implement the method tokens().

add_special_tokens(special_tokens)[source]

Method to add some special tokens to the tokenizer.

For more details about extending a tiktoken.Encoding, see https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken

Return type:

None

Parameters:

special_tokens (list[str])

decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

property eot_token: int
name()[source]
Return type:

str

abstract tokens()[source]

Method that returns the set of tokens (token ids from the base GPT2 tokenizer) to keep in the reduced vocabulary.

Return type:

set

Example

def tokens(self) -> set:
    unique_tokens = set()
    texts: list[str] = ...  # Load all texts you want to encode
    for text in texts:
        tokens = self.gpt2_tokenizer.encode(text)
        unique_tokens.update(tokens)
    return unique_tokens

property vocab_size: int
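
Example

A sketch of a concrete subclass that keeps only the tokens found in a small corpus. It reuses the gpt2_tokenizer attribute shown in the tokens() example above; the corpus and class name are illustrative:

from mfai.tokenizers import MiniGPT2Tokenizer

class WeatherTokenizer(MiniGPT2Tokenizer):  # hypothetical subclass
    def tokens(self) -> set:
        corpus = ["light rain", "heavy rain", "clear sky"]  # illustrative texts
        unique_tokens: set = set()
        for text in corpus:
            unique_tokens.update(self.gpt2_tokenizer.encode(text))
        return unique_tokens
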
class mfai.tokenizers.Tokenizer[source]

Bases: ABC

abstract decode(tokens, *args, **kwargs)[source]
Return type:

str

Parameters:

tokens (list[int])

abstract encode(text, *args, **kwargs)[source]
Return type:

List[int]

Parameters:

text (str)

abstract property eot_token: int
abstract name()[source]
Return type:

str

abstract property vocab_size: int
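
Example

To plug in a new backend, subclass Tokenizer and implement every abstract member listed above. A self-contained character-level sketch, purely illustrative and not part of the library:

from mfai.tokenizers import Tokenizer

class CharTokenizer(Tokenizer):  # hypothetical example class
    def encode(self, text, *args, **kwargs) -> list[int]:
        return [ord(c) for c in text]  # one id per character

    def decode(self, tokens, *args, **kwargs) -> str:
        return "".join(chr(t) for t in tokens)

    @property
    def eot_token(self) -> int:
        return 0  # arbitrary end-of-text id for this toy scheme

    def name(self) -> str:
        return "char"

    @property
    def vocab_size(self) -> int:
        return 0x110000  # number of Unicode code points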