Transformer++#

Backbones#

CodeGen#

class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.block.Mlp(in_features, hidden_features=None, out_features=None, activation=<built-in function gelu>, return_residual=False, device=None, dtype=None)[source]#
forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
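
For orientation, a minimal usage sketch of Mlp. The sizes below (and the choice of hidden_features) are illustrative assumptions, not documented defaults.

```python
import torch
from archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.block import Mlp

# Illustrative sizes: a 768-wide model with a 4x inner expansion (assumed, not a documented default).
mlp = Mlp(in_features=768, hidden_features=3072)
x = torch.randn(2, 16, 768)   # (batch, seq_len, in_features)
y = mlp(x)                    # call the module itself (not .forward) so registered hooks run
```
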
class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.block.CodeGenBlock(arch_config: ArchConfig, hf_config: PretrainedConfig, hidden_size: int, layer_idx: int | None = None)[source]#
forward(hidden_states: Tensor, mixer_subset=None, mixer_kwargs=None, **kwargs)[source]#

Pass the input through the encoder layer.

Parameters:
  • hidden_states – the sequence to the encoder layer (required).

  • mixer_subset – for cross-attention only. If not None, will take a subset of x before applying the query projection. Useful, e.g., for ViT where we only care about the CLS token in the last layer.

training: bool#

PyTorch CodeGen model.

class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.model.CodeGenModel(arch_config: ArchConfig, hf_config)[source]#
get_input_embeddings()[source]#

Returns the model’s input embeddings.

Returns:

A torch module mapping vocabulary to hidden states.

Return type:

nn.Module

set_input_embeddings(new_embeddings)[source]#

Set model’s input embeddings.

Parameters:

new_embeddings (nn.Module) – A module mapping vocabulary to hidden states.
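
As a hedged illustration of get_input_embeddings / set_input_embeddings, assuming model is an already-constructed CodeGenModel and that the returned embedding module is a standard nn.Embedding:

```python
import torch.nn as nn

old_emb = model.get_input_embeddings()   # module mapping vocabulary -> hidden states
new_emb = nn.Embedding(old_emb.num_embeddings, old_emb.embedding_dim)  # assumes an nn.Embedding-like module
model.set_input_embeddings(new_emb)
```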

forward(input_ids: LongTensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, attention_mask: FloatTensor | None = None, token_type_ids: LongTensor | None = None, position_ids: LongTensor | None = None, head_mask: FloatTensor | None = None, inputs_embeds: FloatTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) Tuple | BaseModelOutputWithPast[source]#
Returns:

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CodeGenConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type:

transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

Example:

```python
>>> from transformers import AutoTokenizer, CodeGenModel
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
>>> model = CodeGenModel.from_pretrained("Salesforce/codegen-2B-mono")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.model.CodeGenForCausalLM(arch_config: ArchConfig, hf_config)[source]#
get_output_embeddings()[source]#

Returns the model’s output embeddings.

Returns:

A torch module mapping hidden states to vocabulary.

Return type:

nn.Module

set_output_embeddings(new_embeddings)[source]#
prepare_inputs_for_generation(input_ids, past=None, **kwargs)[source]#
forward(input_ids: LongTensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, attention_mask: FloatTensor | None = None, token_type_ids: LongTensor | None = None, position_ids: LongTensor | None = None, head_mask: FloatTensor | None = None, inputs_embeds: FloatTensor | None = None, labels: LongTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) Tuple | CausalLMOutputWithPast[source]#
labels (torch.LongTensor of shape (batch_size, sequence_length), optional):

Labels for language modeling. Note that the labels are shifted inside the model, i.e., you can set labels = input_ids. Indices are selected in [-100, 0, …, config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, …, config.vocab_size].
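
For example, a minimal sketch of computing the language-modeling loss by reusing the inputs as labels, assuming model is an already-constructed CodeGenForCausalLM and tokenizer is a compatible tokenizer:

```python
inputs = tokenizer("def hello():", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # labels are shifted inside the model
loss = outputs.loss
```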

training: bool#

Operators#

Causal Self-Attention#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.causal_self_attn.CausalSelfAttention(arch_config: ArchConfig, hf_config: CodeGenConfig, hidden_size: int, total_heads: int, op_heads: int, **kwargs)[source]#
forward(hidden_states: FloatTensor | None, attention_mask: FloatTensor | None = None, layer_past: Tuple[Tensor] | None = None, head_mask: FloatTensor | None = None, use_cache: bool | None = False, output_attentions: bool | None = False, **kwargs) Tuple[Tensor, Tuple[Tensor]] | Tuple[Tensor, Tuple[Tensor], Tuple[Tensor, ...]] | None[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Fast Fourier Transform Convolution#

Local Attention#

Adapted from lucidrains/local-attention.

class archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.SinusoidalEmbeddings(dim)[source]#
forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.rotate_half(x)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.apply_rotary_pos_emb(q, k, freqs)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.max_neg_value(tensor)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.pad_to_multiple(tensor, multiple, dim=-1, value=0)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.look_around(x, backward=1, forward=0, pad_value=-1, dim=2)[source]#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.LocalAttention(window_size, causal=False, look_backward=1, look_forward=None, dropout=0.0, autopad=False, exact_windowsize=False, pad_value: int = -1, rel_pos_emb_dim: int | None = None, **kwargs)[source]#
forward(q, k, v, bin_attention_mask: FloatTensor | None = None)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
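
A minimal usage sketch of LocalAttention. The (batch, heads, seq_len, head_dim) layout for q, k, and v is an assumption borrowed from the upstream lucidrains/local-attention interface and may differ here.

```python
import torch
from archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention import LocalAttention

attn = LocalAttention(window_size=256, causal=True, autopad=True)
q = torch.randn(2, 8, 1024, 64)   # assumed (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)
out = attn(q, k, v)               # attention restricted to local windows of size 256
```
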
class archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.LocalMHA(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, att_dropout=0.0, prenorm=False, use_rotary: bool = True, **kwargs)[source]#
forward(hidden_states, bin_attention_mask: LongTensor | None = None, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Locality Sensitive Hashing Attention#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.lsh_attn.LSHAttention(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, auto_pick_num_buckets: bool = True, autopad: bool = True, **kwargs)[source]#
forward(hidden_states, bin_attention_mask: FloatTensor | None = None, past_buckets_states: Tensor | None = None, use_cache: bool = False, *args, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Multi-Head Attention#

Modified from HazyResearch/flash-attention

class archai.discrete_search.search_spaces.nlp.tfpp.ops.mha.BaseRotaryEmbedding(dim: int, base=10000, scale_base=0, device=None)[source]#
apply_rotary_emb_qkv(qkv: FloatTensor, sin: FloatTensor, cos: FloatTensor, sin_k: FloatTensor | None = None, cos_k: FloatTensor | None = None) FloatTensor[source]#
forward(qkv: Tensor, seqlen_offset: int = 0) Tuple[Tensor, Tensor][source]#

seqlen_offset: can be used in generation where the qkv being passed in is only the last token in the batch.

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.mha.SelfAttention(causal=False, softmax_scale=None, attention_dropout=0.0)[source]#

Implements the scaled dot product attention with softmax.

Parameters:
  • softmax_scale – the scale factor applied before the softmax (default: 1/sqrt(d_keys), where d_keys is computed at runtime).

  • attention_dropout – the dropout rate to apply to the attention (default: 0.0).

forward(qkv, causal=None, key_padding_mask=None)[source]#

Implements the multihead softmax attention.

Parameters:
  • qkv – the tensor containing the query, key, and value, of shape (B, S, 3, H, D).

  • causal – if passed, will override self.causal.

  • key_padding_mask – boolean mask to apply to the attention weights, of shape (B, S); True means to keep, False means to mask out.

training: bool#
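
A minimal sketch of SelfAttention using the packed-QKV layout documented above; the concrete sizes are arbitrary.

```python
import torch
from archai.discrete_search.search_spaces.nlp.tfpp.ops.mha import SelfAttention

attn = SelfAttention(causal=True, attention_dropout=0.1)
B, S, H, D = 2, 128, 12, 64
qkv = torch.randn(B, S, 3, H, D)                       # packed query/key/value
key_padding_mask = torch.ones(B, S, dtype=torch.bool)  # True = keep, False = mask out
out = attn(qkv, key_padding_mask=key_padding_mask)
```
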
class archai.discrete_search.search_spaces.nlp.tfpp.ops.mha.MHA(hf_config: PretrainedConfig, hidden_size: int, total_heads: int, op_heads: int, bias=True, dropout=0.0, softmax_scale=None, causal=True, layer_idx=None, rotary_emb_scale_base=0, return_residual=False, checkpointing=False, device=None, dtype=None, **kwargs)[source]#
training: bool#
forward(x, x_kv=None, key_padding_mask=None, cu_seqlens=None, max_seqlen=None, mixer_subset=None, inference_params=None, **kwargs)[source]#
Parameters:
  • x – (batch, seqlen, hidden_dim) (where hidden_dim = num heads * head dim) if cu_seqlens is None and max_seqlen is None, else (total, hidden_dim) where total is the sum of the sequence lengths in the batch.

  • x_kv – (batch, seqlen, hidden_dim), only applicable for cross-attention. If None, use x.

  • cu_seqlens – (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths of the sequences in the batch, used to index into x. Only applicable when using FlashAttention.

  • max_seqlen – int. Maximum sequence length in the batch.

  • key_padding_mask – boolean mask, True means to keep, False means to mask out. (batch, seqlen). Only applicable when not using FlashAttention.

  • mixer_subset – for cross-attention only. If not None, will take a subset of x before applying the query projection. Useful, e.g., for ViT where we only care about the CLS token in the last layer.

  • inference_params – for generation. Adapted from Megatron-LM (and Apex): https://github.com/NVIDIA/apex/blob/3ff1a10f72ec07067c4e44759442329804ac5162/apex/transformer/testing/standalone_transformer_lm.py#L470

Separable 1D-Convolution#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.sep_conv1d.SeparableConv1d(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, **kwargs)[source]#
forward(hidden_states, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Structured Global Convolution#

Adapted from ctlllll/SGConv

archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.get_initializer(name, activation=None)[source]#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.modrelu(features)[source]#
reset_parameters()[source]#
forward(inputs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.Modrelu(features)[source]#
reset_parameters()[source]#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.TransposedLinear(d_input, d_output, bias=True)[source]#

Linear module on the second-to-last dimension. Assumes shape (B, D, L), where L can be one or more axes.

forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
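
A minimal sketch of TransposedLinear under the (B, D, L) layout stated above; sizes are arbitrary.

```python
import torch
from archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv import TransposedLinear

layer = TransposedLinear(d_input=64, d_output=128)
x = torch.randn(8, 64, 256)   # (B, D, L)
y = layer(x)                  # acts on the D dimension; expected shape (8, 128, 256)
```
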
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.TransposedLN(d, scalar=True)[source]#

LayerNorm module over the second dimension. Assumes shape (B, D, L), where L can be one or more axes.

This is slow; a dedicated CUDA/Triton implementation should provide a substantial end-to-end speedup.

forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.Activation(activation=None, size=None, dim=-1)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.LinearActivation(d_input, d_output, bias=True, zero_bias_init=False, transposed=False, initializer=None, activation=None, activate=False, weight_norm=False, **kwargs)[source]#

Returns a linear nn.Module with control over axes order, initialization, and activation
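
A hedged sketch of LinearActivation based on the documented keyword arguments; the 'gelu' activation name is assumed to be accepted because it is GConv's default activation below.

```python
from archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv import LinearActivation

# A linear layer for (B, D, L) inputs (transposed=True) with a GELU activation appended.
layer = LinearActivation(
    d_input=64,
    d_output=128,
    transposed=True,
    activation='gelu',
    activate=True,
)
```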

class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.Normalization(d, transposed=False, _name_='layer', **kwargs)[source]#
forward(x)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

step(x, **kwargs)[source]#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.GConv(d_model, d_state=64, l_max=1, channels=1, bidirectional=False, activation='gelu', ln=False, postact=None, initializer=None, weight_norm=False, hyper_act=None, use_fast_fftconv=False, dropout=0.0, transposed=True, verbose=False, shift=False, linear=False, mode='cat_randn', **kernel_args)[source]#
requires_length = True#
fft_conv(u, k, L)[source]#
forward(u, return_kernel=False)[source]#

u: (B H L) if self.transposed, else (B L H). state: (H N), never needed unless you know what you’re doing.

Returns: same shape as u

property d_state#
property d_output#
property state_to_tensor#
training: bool#
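
A usage sketch of GConv, assuming the transposed (B H L) layout from the forward docstring. The exact return structure (for example, whether auxiliary state is also returned) is not documented above, so the output is left unpacked.

```python
import torch
from archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv import GConv

conv = GConv(d_model=64, l_max=512, transposed=True)
u = torch.randn(4, 64, 512)   # (B H L) with H == d_model
out = conv(u)                 # the main output is documented to have the same shape as u
```
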
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.SGConv(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, hf_config: PretrainedConfig, **kwargs)[source]#
training: bool#
forward(x: Tensor, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Structured Global Convolution 3#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv3.GConv3(d_model, d_state=64, l_max=1, head_dim=1, channels=1, bidirectional=False, activation='gelu', ln=False, postact=None, initializer=None, weight_norm=False, hyper_act=None, use_fast_fftconv=False, dropout=0.0, transposed=True, verbose=False, shift=False, linear=False, mode='cat_randn', **kernel_args)[source]#
requires_length = True#
init_kernels(h, **kernel_args)[source]#
get_kernels_forward(multiplier, kernel_list_init)[source]#
forward(u, return_kernel=False)[source]#

u: (B H L) if self.transposed, else (B L H). state: (H N), never needed unless you know what you’re doing.

Returns: same shape as u

property d_state#
property d_output#
property state_to_tensor#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv3.SGConv3(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, hf_config: PretrainedConfig, **kwargs)[source]#
forward(x: Tensor, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Mixed Attention#

class archai.discrete_search.search_spaces.nlp.tfpp.mixed_attention.MixedAttentionBlock(arch_config: ArchConfig, hf_config: PretrainedConfig, hidden_size: int, layer_idx: int | None = None)[source]#
forward(hidden_states: Tensor, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Mixed Operators#

class archai.discrete_search.search_spaces.nlp.tfpp.mixed_op.MixedAttentionBlock(arch_config: ArchConfig, hf_config: GPT2Config, hidden_size: int, layer_idx: int | None = None)[source]#
forward(hidden_states, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

Model#

class archai.discrete_search.search_spaces.nlp.tfpp.model.LanguageModel(arch_config: ArchConfig, **hf_config_kwargs)[source]#
forward(*args, **kwargs) Any[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

static get_hf_config_cls(arch_config: ArchConfig) PretrainedConfig[source]#
training: bool#

Search Space#

archai.discrete_search.search_spaces.nlp.tfpp.search_space.to_tuple(x: Tuple[int] | int) Tuple[int][source]#
class archai.discrete_search.search_spaces.nlp.tfpp.search_space.TfppSearchSpace(backbone: str = 'codegen', embed_dims: Tuple[int] | int = (768,), inner_dims: Tuple[int] | int = (3072,), total_heads: Tuple[int] | int = (12,), total_layers: Tuple[int] | int = (8, 10, 12, 16, 18), local_attn_window_sizes: Tuple[int] | int = (256,), sgconv_kernel_sizes: Tuple[int] | int = (256,), sconv1d_kernel_sizes: Tuple[int] | int = (256,), lsh_attn_num_hashes: Tuple[int] | int = (4, 8), lsh_attn_bucket_size: Tuple[int] | int = (64,), op_subset: Tuple[str] | None = None, mixed_ops: bool = True, homogeneous: bool = False, seed: int | None = None, disable_cache: bool = True, **hf_config_kwargs)[source]#
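
A hedged instantiation sketch based on the signature above; the random_sample() call assumes the standard Archai DiscreteSearchSpace interface.

```python
from archai.discrete_search.search_spaces.nlp.tfpp.search_space import TfppSearchSpace

space = TfppSearchSpace(
    backbone='codegen',
    embed_dims=(768,),
    inner_dims=(3072,),
    total_heads=(12,),
    total_layers=(8, 12),
    mixed_ops=True,
    seed=42,
)
arch = space.random_sample()   # assumed: standard Archai discrete search space API
```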

Utilities#

archai.discrete_search.search_spaces.nlp.tfpp.utils.get_optim_flag(config: PretrainedConfig, flag_name: str)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.from_json_file(json_file: str | PathLike) Dict[str, Any][source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.from_yaml_file(yaml_file: str | PathLike) Dict[str, Any][source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.group_texts(examples, tokenizer, **kwargs)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.split_heads(tensor, num_heads, attn_head_size)[source]#

Splits hidden_size dim into attn_head_size and num_heads

archai.discrete_search.search_spaces.nlp.tfpp.utils.merge_heads(tensor, num_heads, attn_head_size)[source]#

Merges attn_head_size dim and num_attn_heads dim into hidden_size
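
The two helpers above describe inverse reshapes. The self-contained snippet below illustrates the documented semantics with plain tensor views; it is not the library implementation.

```python
import torch

B, S, num_heads, attn_head_size = 2, 16, 12, 64
hidden = torch.randn(B, S, num_heads * attn_head_size)

split = hidden.view(B, S, num_heads, attn_head_size)       # hidden_size -> (num_heads, attn_head_size)
merged = split.reshape(B, S, num_heads * attn_head_size)   # back to hidden_size
assert torch.equal(hidden, merged)
```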

archai.discrete_search.search_spaces.nlp.tfpp.utils.make_asso_map(input_ids, mask)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.make_broadcast_map(input_ids, mask, eos_id=103)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.get_attn_head_simplex(total_attn_heads: int | List[int], ops_list: List[str], grid_scale: int = 3) List[Tuple][source]#