gpt2

PyTorch implementation of GPT-2, largely inspired by Sebastian Raschka's book and work: https://github.com/rasbt/LLMs-from-scratch/.

class mfai.pytorch.models.llms.gpt2.CrossAttentionGPT2(settings, vocab_size=50257)[source]

Bases: Module

A GPT-2 with cross attention, allowing vision/weather data to be injected as keys/values into some of the transformer blocks. Freely inspired by Llama 3.2 as described here: https://magazine.sebastianraschka.com/i/151078631/the-llama-herd-of-models.
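To make the key/value injection concrete, here is a plain-Python sketch of single-head cross attention, where text positions (queries) attend over vision tokens (keys/values). Shapes, the absence of projections, and the lack of masking are illustrative assumptions, not the module's actual implementation.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    # Each text query attends over ALL vision key/value pairs:
    # scores = q . k / sqrt(d), weights = softmax(scores), out = weights @ values
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# two text positions attending over three vision tokens (toy numbers)
text_q = [[1.0, 0.0], [0.0, 1.0]]
vision_kv = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
mixed = cross_attention(text_q, vision_kv, vision_kv)
```

Note that, unlike causal self-attention over text, every text position here sees every vision token.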

Parameters:
  • settings (CrossAttentionGPT2Settings)

  • vocab_size (int)

embed_tokens(tok_ids)[source]

Embeds and positionally encodes tokens.

Return type:

Tensor

Parameters:

tok_ids (Tensor)
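What "embeds and positionally encodes" means can be sketched in plain Python, with lookup tables standing in for the learned embedding matrices (illustrative only; the real method works on Tensors):

```python
def embed_tokens(tok_ids, tok_emb, pos_emb):
    # output[i] = learned embedding of token tok_ids[i]
    #           + learned embedding of position i
    return [[t + p for t, p in zip(tok_emb[tid], pos_emb[i])]
            for i, tid in enumerate(tok_ids)]

# toy vocabulary of 3 tokens, context of 2 positions, emb_dim = 2
tok_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pos_emb = [[1.0, 0.0], [0.0, 1.0]]
embedded = embed_tokens([2, 0], tok_emb, pos_emb)
```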

forward(token_ids, vision_inputs)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:
  • token_ids (Tensor)

  • vision_inputs (Tensor)

Return type:

Tensor

model_type = 4
settings_kls

alias of CrossAttentionGPT2Settings

class mfai.pytorch.models.llms.gpt2.CrossAttentionGPT2Settings(emb_dim=768, context_length=1024, n_heads=12, n_layers=12, drop_rate=0.1, qkv_bias=False, model_size='124M', attn_tf_compat=False, x_att_ratio=4)[source]

Bases: GPT2Settings

Parameters:
  • emb_dim (int)

  • context_length (int)

  • n_heads (int)

  • n_layers (int)

  • drop_rate (float)

  • qkv_bias (bool)

  • model_size (Literal['124M', '355M', '774M', '1558M'])

  • attn_tf_compat (bool)

  • x_att_ratio (int)

attn_tf_compat: bool
context_length: int
drop_rate: float
emb_dim: int
classmethod from_dict(kvs, *, infer_missing=False)
Return type:

TypeVar(A, bound= DataClassJsonMixin)

Parameters:

kvs (dict | list | str | int | float | bool | None)

classmethod from_json(s, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw)
Return type:

TypeVar(A, bound= DataClassJsonMixin)

Parameters:

s (str | bytes | bytearray)

model_size: Literal['124M', '355M', '774M', '1558M']
n_heads: int
n_layers: int
qkv_bias: bool
classmethod schema(*, infer_missing=False, only=None, exclude=(), many=False, context=None, load_only=(), dump_only=(), partial=False, unknown=None)
Return type:

SchemaF[TypeVar(A, bound= DataClassJsonMixin)]

to_dict(encode_json=False)
Return type:

Dict[str, Union[dict, list, str, int, float, bool, None]]

to_json(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, indent=None, separators=None, default=None, sort_keys=False, **kw)
Return type:

str

x_att_ratio: int
class mfai.pytorch.models.llms.gpt2.CrossAttentionTransformerBlock(settings)[source]

Bases: Module

A cross attention transformer block.

Parameters:

settings (CrossAttentionGPT2Settings)

forward(x_q, x_kv)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:
  • x_q (Tensor)

  • x_kv (Tensor)

Return type:

Tensor

class mfai.pytorch.models.llms.gpt2.FeedForward(emb_dim)[source]

Bases: Module

Parameters:

emb_dim (int)

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor
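GPT-2's feed-forward block conventionally expands emb_dim to 4 × emb_dim and projects back. Assuming that standard expansion factor (the signature above only exposes emb_dim, so this is an assumption), the parameter bookkeeping looks like:

```python
def feedforward_param_count(emb_dim, expansion=4):
    # assumed GPT-2-style MLP: linear up-projection to expansion * emb_dim,
    # then linear down-projection back; each layer has weights + a bias
    hidden = expansion * emb_dim
    up = emb_dim * hidden + hidden
    down = hidden * emb_dim + emb_dim
    return up + down

count = feedforward_param_count(768)  # 768 is GPT-2 small's emb_dim
```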

class mfai.pytorch.models.llms.gpt2.GELU[source]

Bases: Module

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor
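The original GPT-2 uses the tanh approximation of GELU rather than the exact erf form; a plain-Python sketch of that approximation (whether this module implements exactly this variant is an assumption):

```python
import math

def gelu(x):
    # tanh approximation of GELU, as used in the original GPT-2
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```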

class mfai.pytorch.models.llms.gpt2.GPT2(settings, vocab_size=50257)[source]

Bases: Module

GPT implementation, based on Sebastian Raschka's book and GitHub repo: https://github.com/rasbt/LLMs-from-scratch/.

Parameters:
  • settings (GPT2Settings)

  • vocab_size (int)

dowload_weights_from_tf_ckpt(model_dir)[source]

Downloads a TensorFlow checkpoint into model_dir and sets the weights of self.

Return type:

None

Parameters:

model_dir (str | Path)

embed_tokens(tok_ids)[source]

Embeds and positionally encodes tokens.

Return type:

Tensor

Parameters:

tok_ids (Tensor)

forward(tok_ids)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

tok_ids (Tensor)

Return type:

Tensor

forward_vectors(embeddings, first_embedding=None)[source]

Process a batch of embeddings through the model. If first_embedding is supplied, the first tokens of each block are replaced by the corresponding embeddings. Useful for multimodal models with injection of vision data at each stage.

Return type:

Tensor

Parameters:
  • embeddings (Tensor)

  • first_embedding (Tensor | None)

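The first_embedding substitution described above can be sketched as follows (illustrative; the real method operates on Tensors, not nested lists):

```python
def inject_first_embedding(embeddings, first_embedding=None):
    # replace position 0 of each sequence in the batch with the supplied vector
    if first_embedding is None:
        return embeddings
    return [[vec] + seq[1:] for seq, vec in zip(embeddings, first_embedding)]

batch = [[[0.0, 0.0], [1.0, 1.0]],   # two sequences of two 2-d embeddings
         [[2.0, 2.0], [3.0, 3.0]]]
vision = [[9.0, 9.0], [8.0, 8.0]]    # one injected vector per batch element
injected = inject_first_embedding(batch, vision)
```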
load_weights_from_dict(params)[source]

Loads weights into self from a dict, typically produced by TensorFlow or another framework's training. Use this to fine-tune from the official weights.

Parameters:

params (dict)

model_type = 4
settings_kls

alias of GPT2Settings

class mfai.pytorch.models.llms.gpt2.GPT2Settings(emb_dim=768, context_length=1024, n_heads=12, n_layers=12, drop_rate=0.1, qkv_bias=False, model_size='124M', attn_tf_compat=False)[source]

Bases: object

Default settings correspond to a GPT-2 small ('124M').

Parameters:
  • emb_dim (int)

  • context_length (int)

  • n_heads (int)

  • n_layers (int)

  • drop_rate (float)

  • qkv_bias (bool)

  • model_size (Literal['124M', '355M', '774M', '1558M'])

  • attn_tf_compat (bool)

attn_tf_compat: bool
context_length: int
drop_rate: float
emb_dim: int
classmethod from_dict(kvs, *, infer_missing=False)
Return type:

TypeVar(A, bound= DataClassJsonMixin)

Parameters:

kvs (dict | list | str | int | float | bool | None)

classmethod from_json(s, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw)
Return type:

TypeVar(A, bound= DataClassJsonMixin)

Parameters:

s (str | bytes | bytearray)

model_size: Literal['124M', '355M', '774M', '1558M']
n_heads: int
n_layers: int
qkv_bias: bool
classmethod schema(*, infer_missing=False, only=None, exclude=(), many=False, context=None, load_only=(), dump_only=(), partial=False, unknown=None)
Return type:

SchemaF[TypeVar(A, bound= DataClassJsonMixin)]

to_dict(encode_json=False)
Return type:

Dict[str, Union[dict, list, str, int, float, bool, None]]

to_json(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, indent=None, separators=None, default=None, sort_keys=False, **kw)
Return type:

str

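The from_dict / to_dict / to_json methods follow the dataclasses-json convention; a stdlib-only approximation of the serialization roundtrip (the real class uses the dataclasses_json mixin, and ToySettings below is a hypothetical stand-in, not GPT2Settings itself):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToySettings:
    # mirrors a few GPT2Settings fields and their defaults
    emb_dim: int = 768
    context_length: int = 1024
    n_heads: int = 12
    drop_rate: float = 0.1

settings = ToySettings(emb_dim=1024)
payload = json.dumps(asdict(settings))          # ~ settings.to_json()
restored = ToySettings(**json.loads(payload))   # ~ from_dict(...)
```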
class mfai.pytorch.models.llms.gpt2.LayerNorm(emb_dim)[source]

Bases: Module

Parameters:

emb_dim (int)

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor
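A plain-Python sketch of what a layer norm over the last (emb_dim) axis computes; the learnable scale/shift parameters are omitted and the epsilon value is an assumption:

```python
import math

def layer_norm(x, eps=1e-5):
    # normalize one embedding vector to zero mean and unit variance
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```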

class mfai.pytorch.models.llms.gpt2.MultiHeadAttention(d_in, d_out, num_heads, context_length, dropout=0.0, qkv_bias=False)[source]

Bases: Module

Multi-head attention compatible with the original TensorFlow implementation and weights.

Parameters:
  • d_in (int)

  • d_out (int)

  • num_heads (int)

  • context_length (int)

  • dropout (float)

  • qkv_bias (bool)

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor
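For contrast with the cross-attention variant, causal self-attention restricts each position to keys at or before it. A single-head, plain-Python sketch (head splitting, QKV projections, and dropout omitted; not the module's actual code):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_self_attention(x):
    # position i may only attend to positions 0..i (the causal mask)
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        visible = x[:i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in visible]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, visible))
                    for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(seq)
```

Position 0 can only see itself, so its output is exactly its own value vector.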

class mfai.pytorch.models.llms.gpt2.MultiHeadAttentionPySDPA(d_in, d_out, num_heads, context_length, dropout=0.0, qkv_bias=False)[source]

Bases: Module

Multi-head attention using PyTorch's scaled_dot_product_attention.

Parameters:
  • d_in (int)

  • d_out (int)

  • num_heads (int)

  • context_length (int)

  • dropout (float)

  • qkv_bias (bool)

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor

class mfai.pytorch.models.llms.gpt2.MultiHeadCrossAttentionPySDPA(d_in_q, d_in_kv, d_out, num_heads, context_length, dropout=0.0, qkv_bias=False)[source]

Bases: Module

Multi-head cross attention using PyTorch's scaled_dot_product_attention. The queries and the keys/values come from different sources.

Parameters:
  • d_in_q (int)

  • d_in_kv (int)

  • d_out (int)

  • num_heads (int)

  • context_length (int)

  • dropout (float)

  • qkv_bias (bool)

forward(x_q, x_kv)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:
  • x_q (Tensor)

  • x_kv (Tensor)

Return type:

Tensor

class mfai.pytorch.models.llms.gpt2.TransformerBlock(settings)[source]

Bases: Module

A transformer block, based on Sebastian Raschka's book and GitHub repo: https://github.com/rasbt/LLMs-from-scratch/.

  • The attention used is based on PyTorch's scaled_dot_product_attention (the most efficient MultiHeadAttention module according to S. Raschka's benchmark: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch03/02_bonus_efficient-multihead-attention).

Parameters:

settings (GPT2Settings)

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor