fuyu

class mfai.pytorch.models.llms.fuyu.Fuyu(settings=FuyuSettings(backend='gpt2', emb_dim=768, context_length=1024, n_heads=12, n_layers=12, drop_rate=0.1, qkv_bias=False, hidden_dim=768, num_kv_groups=2, rope_base=500000.0, model_size='124M', attn_tf_compat=False, vision_input_shape=(3, 256, 256), inject_vision_each_stage=False, vision_encoder='linear', resnet_num_tokens=32, resnet_pos_embedding=False, resnet_mlp_output=False, layer_norm_viz_txt=False, layer_norm_viz=True, patch_size=None), vocab_size=50257)[source]

Bases: FreezeMLMMixin, Module

A multimodal LLM combining vision/weather and text inputs, inspired by Fuyu. Can use GPT2, Llama2 or Llama3 as its LLM backend.

Parameters:
  • settings (FuyuSettings) – model configuration; defaults to FuyuSettings().

  • vocab_size (int) – size of the token vocabulary; defaults to 50257.
property context_length: int
forward(txt_token_ids, vision_inputs)[source]

Forward pass of the Fuyu multimodal language model.

Parameters:
  • txt_token_ids (Tensor) – tensor of shape (B, n_tok).

  • vision_inputs (Tensor | list[Tensor]) – tensor or list of tensors, each of shape (B, channels, lat, lon).

Returns:

tensor of shape (B, n_tok, vocab_size)

Return type:

Tensor
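For orientation, the shape contract of forward can be sketched with a pure-Python stand-in (hypothetical helper for illustration only, not part of mfai and not the real model):

```python
# Sketch of Fuyu.forward's shape contract (hypothetical stand-in):
# token ids of shape (B, n_tok) plus vision inputs of shape
# (B, channels, lat, lon) produce logits of shape (B, n_tok, vocab_size).

def forward_shape(txt_shape: tuple, vision_shapes: list, vocab_size: int) -> tuple:
    B, n_tok = txt_shape
    for vs in vision_shapes:
        # each vision input is a 4-D (B, channels, lat, lon) tensor
        assert len(vs) == 4 and vs[0] == B, "vision batch must match text batch"
    return (B, n_tok, vocab_size)

print(forward_shape((2, 16), [(2, 3, 256, 256)], 50257))  # (2, 16, 50257)
```

The vocabulary dimension of the output matches the vocab_size passed to the Fuyu constructor, so the logits can be fed directly to a cross-entropy loss over tokens.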

model_type = 5
settings_kls

alias of FuyuSettings

class mfai.pytorch.models.llms.fuyu.FuyuSettings(backend='gpt2', emb_dim=768, context_length=1024, n_heads=12, n_layers=12, drop_rate=0.1, qkv_bias=False, hidden_dim=768, num_kv_groups=2, rope_base=500000.0, model_size='124M', attn_tf_compat=False, vision_input_shape=(3, 256, 256), inject_vision_each_stage=False, vision_encoder='linear', resnet_num_tokens=32, resnet_pos_embedding=False, resnet_mlp_output=False, layer_norm_viz_txt=False, layer_norm_viz=True, patch_size=None)[source]

Bases: object

Settings for a multimodal language model.

Parameters:
  • backend (Literal['gpt2', 'llama2', 'llama3'])

  • emb_dim (int)

  • context_length (int)

  • n_heads (int)

  • n_layers (int)

  • drop_rate (float)

  • qkv_bias (bool)

  • hidden_dim (int)

  • num_kv_groups (int)

  • rope_base (float)

  • model_size (Literal['124M', '355M', '774M', '1558M'])

  • attn_tf_compat (bool)

  • vision_input_shape (tuple[int, int, int])

  • inject_vision_each_stage (bool)

  • vision_encoder (Literal['resnet50', 'linear', 'vit'])

  • resnet_num_tokens (int)

  • resnet_pos_embedding (bool)

  • resnet_mlp_output (bool)

  • layer_norm_viz_txt (bool)

  • layer_norm_viz (bool)

  • patch_size (None | int | tuple[int, int])

attn_tf_compat: bool = False
backend: Literal['gpt2', 'llama2', 'llama3'] = 'gpt2'
context_length: int = 1024
drop_rate: float = 0.1
emb_dim: int = 768
classmethod from_dict(kvs, *, infer_missing=False)
Return type:

TypeVar(A, bound=DataClassJsonMixin)

Parameters:

kvs (dict | list | str | int | float | bool | None)

classmethod from_json(s, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw)
Return type:

TypeVar(A, bound=DataClassJsonMixin)

Parameters:

s (str | bytes | bytearray)
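from_dict, from_json, to_dict and to_json are inherited from dataclasses-json. The roundtrip they provide can be sketched with a plain stdlib dataclass (a minimal stand-in with a hypothetical reduced field set, not FuyuSettings itself):

```python
import json
from dataclasses import dataclass, asdict

# Minimal stand-in for the JSON roundtrip that FuyuSettings exposes
# via from_json / to_json (stdlib only; field set is illustrative).

@dataclass
class MiniSettings:
    backend: str = "gpt2"
    emb_dim: int = 768
    context_length: int = 1024

    def to_json(self) -> str:
        # serialize the dataclass fields to a JSON string
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "MiniSettings":
        # rebuild the dataclass from a JSON string
        return cls(**json.loads(s))

settings = MiniSettings(backend="llama2", emb_dim=512)
restored = MiniSettings.from_json(settings.to_json())
print(restored == settings)  # True
```

In practice this lets a FuyuSettings configuration be persisted alongside checkpoints and reloaded without re-specifying every field.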

hidden_dim: int = 768
inject_vision_each_stage: bool = False
layer_norm_viz: bool = True
layer_norm_viz_txt: bool = False
model_size: Literal['124M', '355M', '774M', '1558M'] = '124M'
n_heads: int = 12
n_layers: int = 12
num_kv_groups: int = 2
patch_size: None | int | tuple[int, int] = None
qkv_bias: bool = False
resnet_mlp_output: bool = False
resnet_num_tokens: int = 32
resnet_pos_embedding: bool = False
rope_base: float = 500000.0
classmethod schema(*, infer_missing=False, only=None, exclude=(), many=False, context=None, load_only=(), dump_only=(), partial=False, unknown=None)
Return type:

SchemaF[TypeVar(A, bound=DataClassJsonMixin)]

to_dict(encode_json=False)
Return type:

Dict[str, Union[dict, list, str, int, float, bool, None]]

to_json(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, indent=None, separators=None, default=None, sort_keys=False, **kw)
Return type:

str

vision_encoder: Literal['resnet50', 'linear', 'vit'] = 'linear'
vision_input_shape: tuple[int, int, int] = (3, 256, 256)