tokenizers
Module with various LLM tokenizers wrapped in a common interface.
- class mfai.tokenizers.GPT2Tokenizer
Bases: Tokenizer
- add_special_tokens(new_special_tokens)
Method to add special tokens to the tokenizer.
For more details about extending a tiktoken.Encoding, see https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken
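A minimal usage sketch, assuming new_special_tokens accepts a set of token strings; the exact container type and the token-id assignment are defined by the wrapper (see the tiktoken guide above), and the marker strings are illustrative:

    from mfai.tokenizers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer()
    # Register custom markers so they encode as single, atomic tokens.
    tokenizer.add_special_tokens({"<|start_of_report|>", "<|end_of_report|>"})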
- class mfai.tokenizers.LlamaTokenizer
Bases: Tokenizer
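Both classes share the Tokenizer base, so they can be swapped behind the same call site. A hedged sketch, assuming an encode method on the common interface (as used in the Example at the end of this page) and that both classes can be constructed without arguments:

    from mfai.tokenizers import GPT2Tokenizer, LlamaTokenizer

    for tokenizer in (GPT2Tokenizer(), LlamaTokenizer()):
        # Same call site, different underlying vocabulary and encoding.
        token_ids = tokenizer.encode("2 m temperature anomaly over France")
        print(type(tokenizer).__name__, token_ids)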
- class mfai.tokenizers.MiniGPT2Tokenizer
Bases: Tokenizer
A Tokenizer using a reduced set of tokens from a base GPT2Tokenizer. The typical use case is a narrow-vocabulary problem that needs only about 1,000 tokens out of GPT-2's vocabulary of roughly 50,000. To use this class, you only have to implement the method 'tokens'. A complete subclass sketch is given after the Example at the end of this page.
- add_special_tokens(special_tokens)
Method to add special tokens to the tokenizer.
For more details about extending a tiktoken.Encoding, see https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken
- abstract tokens()
Method that returns the set of tokens (token ids) to keep from the base GPT-2 vocabulary.
- Return type:
set
Example

    def tokens(self) -> set:
        unique_tokens = set()
        texts: list[str] = ...  # Load all texts you want to encode
        for text in texts:
            tokens = self.gpt2_tokenizer.encode(text)
            unique_tokens.update(tokens)
        return unique_tokens
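A complete subclass is sketched below. This is a hedged example: the class name, the corpus, and the no-argument construction are illustrative assumptions, while gpt2_tokenizer and the abstract tokens method come from the documentation above.

    from mfai.tokenizers import MiniGPT2Tokenizer

    class WeatherGPT2Tokenizer(MiniGPT2Tokenizer):
        """Reduced GPT-2 tokenizer built from a small weather corpus."""

        def tokens(self) -> set:
            # Illustrative corpus; in practice, load every text you intend
            # to encode so the reduced vocabulary covers all of them.
            texts: list[str] = [
                "light rain expected in the north",
                "clear skies and a 2 m temperature of 21 C",
            ]
            unique_tokens: set = set()
            for text in texts:
                unique_tokens.update(self.gpt2_tokenizer.encode(text))
            return unique_tokens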