


class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.block.Mlp(in_features, hidden_features=None, out_features=None, activation=<built-in function gelu>, return_residual=False, device=None, dtype=None)[source]#

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.block.CodeGenBlock(arch_config: ArchConfig, hf_config: PretrainedConfig, hidden_size: int, layer_idx: int | None = None)[source]#
forward(hidden_states: Tensor, mixer_subset=None, mixer_kwargs=None, **kwargs)[source]#

Pass the input through the encoder layer.

Parameters:

  • hidden_states – the sequence to the encoder layer (required).

  • mixer_subset – for cross-attention only. If not None, will take a subset of x before applying the query projection. Useful for e.g., ViT where we only care about the CLS token in the last layer.

training: bool#

PyTorch CodeGen model.

class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.model.CodeGenModel(arch_config: ArchConfig, hf_config)[source]#

Returns the model’s input embeddings.

Returns:

A torch module mapping vocabulary to hidden states.

Return type:

nn.Module

Set model’s input embeddings.

Parameters:

value (nn.Module) – A module mapping vocabulary to hidden states.

forward(input_ids: LongTensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, attention_mask: FloatTensor | None = None, token_type_ids: LongTensor | None = None, position_ids: LongTensor | None = None, head_mask: FloatTensor | None = None, inputs_embeds: FloatTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) Tuple | BaseModelOutputWithPast[source]#

Returns:

A [transformers.modeling_outputs.BaseModelOutputWithPast] or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration ([CodeGenConfig]) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally if config.is_encoder_decoder=True 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type:

[transformers.modeling_outputs.BaseModelOutputWithPast] or tuple(torch.FloatTensor)


Example:

```python
>>> from transformers import AutoTokenizer, CodeGenModel
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
>>> model = CodeGenModel.from_pretrained("Salesforce/codegen-2B-mono")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.backbones.codegen.model.CodeGenForCausalLM(arch_config: ArchConfig, hf_config)[source]#

Returns the model’s output embeddings.

Returns:

A torch module mapping hidden states to vocabulary.

Return type:

nn.Module

prepare_inputs_for_generation(input_ids, past=None, **kwargs)[source]#
forward(input_ids: LongTensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, attention_mask: FloatTensor | None = None, token_type_ids: LongTensor | None = None, position_ids: LongTensor | None = None, head_mask: FloatTensor | None = None, inputs_embeds: FloatTensor | None = None, labels: LongTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) Tuple | CausalLMOutputWithPast[source]#
labels (torch.LongTensor of shape (batch_size, sequence_length), optional):

Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, …, config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, …, config.vocab_size].
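
A minimal, framework-free sketch of the label convention described above, for a single sequence: the logits at position t are scored against the label at position t + 1, and positions labelled -100 are skipped. The helper names (log_softmax, shifted_lm_loss) are hypothetical and are not part of archai or transformers. Returning 0.0 when every label is masked is a choice made for this sketch only.

```python
import math

IGNORE_INDEX = -100  # labels with this value contribute nothing to the loss


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_norm for v in row]


def shifted_lm_loss(logits, labels):
    """Mean next-token cross-entropy for one sequence.

    logits: list of per-position score lists, shape (seq_len, vocab_size)
    labels: list of token ids, shape (seq_len,), may contain IGNORE_INDEX
    """
    # Position t predicts token t + 1: drop the last logit row and the first label.
    pairs = zip(logits[:-1], labels[1:])
    losses = [-log_softmax(row)[target] for row, target in pairs if target != IGNORE_INDEX]
    return sum(losses) / len(losses) if losses else 0.0


# Uniform logits over a 4-token vocabulary: every kept position costs log(4).
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
input_ids = [2, 0, 3, 1]
full_loss = shifted_lm_loss(logits, input_ids)          # labels = input_ids
masked_loss = shifted_lm_loss(logits, [2, -100, 3, 1])  # second token is ignored
```

Because the shift happens inside the loss, the caller passes labels aligned with input_ids and never shifts them manually.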

training: bool#


Causal Self-Attention#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.causal_self_attn.CausalSelfAttention(arch_config: ArchConfig, hf_config: CodeGenConfig, hidden_size: int, total_heads: int, op_heads: int, **kwargs)[source]#
forward(hidden_states: FloatTensor | None, attention_mask: FloatTensor | None = None, layer_past: Tuple[Tensor] | None = None, head_mask: FloatTensor | None = None, use_cache: bool | None = False, output_attentions: bool | None = False, **kwargs) Tuple[Tensor, Tuple[Tensor]] | Tuple[Tensor, Tuple[Tensor], Tuple[Tensor, ...]] | None[source]#

training: bool#

Fast Fourier Transform Convolution#

Local Attention#

Adapted from lucidrains/local-attention.

class archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.SinusoidalEmbeddings(dim)[source]#

training: bool#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.apply_rotary_pos_emb(q, k, freqs)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.pad_to_multiple(tensor, multiple, dim=-1, value=0)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.look_around(x, backward=1, forward=0, pad_value=-1, dim=2)[source]#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.LocalAttention(window_size, causal=False, look_backward=1, look_forward=None, dropout=0.0, autopad=False, exact_windowsize=False, pad_value: int = -1, rel_pos_emb_dim: int | None = None, **kwargs)[source]#
forward(q, k, v, bin_attention_mask: FloatTensor | None = None)[source]#

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.local_attention.LocalMHA(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, att_dropout=0.0, prenorm=False, use_rotary: bool = True, **kwargs)[source]#
forward(hidden_states, bin_attention_mask: LongTensor | None = None, **kwargs)[source]#

training: bool#

Locality Sensitive Hashing Attention#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.lsh_attn.LSHAttention(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, auto_pick_num_buckets: bool = True, autopad: bool = True, **kwargs)[source]#
forward(hidden_states, bin_attention_mask: FloatTensor | None = None, past_buckets_states: Tensor | None = None, use_cache: bool = False, *args, **kwargs)[source]#

training: bool#

Multi-Head Attention#

Modified from HazyResearch/flash-attention

class archai.discrete_search.search_spaces.nlp.tfpp.ops.mha.BaseRotaryEmbedding(dim: int, base=10000, scale_base=0, device=None)[source]#
apply_rotary_emb_qkv(qkv: FloatTensor, sin: FloatTensor, cos: FloatTensor, sin_k: FloatTensor | None = None, cos_k: FloatTensor | None = None) FloatTensor[source]#
forward(qkv: Tensor, seqlen_offset: int = 0) Tuple[Tensor, Tensor][source]#

seqlen_offset: can be used in generation where the qkv being passed in is only the last token in the batch.
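
As an illustration of how seqlen_offset could be used, the sketch below computes rotary angles for the absolute positions starting at seqlen_offset, so a single decoded token receives the same angles it would have had inside the full sequence. It assumes the standard rotary frequency formula with base=10000 (the default in the signature above). rotary_angles and rotate_pair are hypothetical helpers, and scale_base is not modeled.

```python
import math


def rotary_angles(seqlen, dim, base=10000.0, seqlen_offset=0):
    """Angles position * inv_freq for positions seqlen_offset .. seqlen_offset + seqlen - 1."""
    inv_freq = [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]
    positions = range(seqlen_offset, seqlen_offset + seqlen)
    return [[pos * f for f in inv_freq] for pos in positions]


def rotate_pair(x0, x1, angle):
    """Rotate one (x0, x1) feature pair by the given angle."""
    cos, sin = math.cos(angle), math.sin(angle)
    return x0 * cos - x1 * sin, x0 * sin + x1 * cos


# Prefill: angles for a 5-token prompt (positions 0..4).
full = rotary_angles(seqlen=5, dim=8)
# Decoding step: only the newest token (position 4) is passed in, so seqlen is 1
# and the offset carries its absolute position.
step = rotary_angles(seqlen=1, dim=8, seqlen_offset=4)
```

Without the offset, the lone token would be rotated as if it sat at position 0, which would not match the keys cached from earlier steps.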

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.mha.SelfAttention(causal=False, softmax_scale=None, attention_dropout=0.0)[source]#

Implement the scaled dot product attention with softmax.

Parameters:

  • softmax_scale – (default: 1/sqrt(d_keys), where d_keys is computed at runtime)

  • attention_dropout – The dropout rate to apply to the attention (default: 0.0)

forward(qkv, causal=None, key_padding_mask=None)[source]#

Implements the multihead softmax attention.

Parameters:

  • qkv – The tensor containing the query, key, and value. (B, S, 3, H, D)

  • causal – if passed, will override self.causal

  • key_padding_mask – boolean mask to apply to the attention weights. True means to keep, False means to mask out. (B, S)
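
The conventions above (default scale of 1/sqrt(d_keys), an optional causal mask, and a key_padding_mask where True keeps a key and False masks it out) can be sketched for a single head in plain Python. This is an illustration under those stated conventions, not the library's implementation. masked_attention is a hypothetical name, dropout is omitted, and a query row whose keys are all masked is not handled.

```python
import math


def masked_attention(q, k, v, key_padding_mask=None, causal=False, softmax_scale=None):
    """Single-head softmax attention over lists of vectors.

    q, k, v: lists of S vectors of length D.
    key_padding_mask: list of S booleans, True = keep the key, False = mask it out.
    """
    seqlen, head_dim = len(q), len(q[0])
    scale = softmax_scale if softmax_scale is not None else 1.0 / math.sqrt(head_dim)
    output = []
    for i in range(seqlen):
        scores = []
        for j in range(seqlen):
            padded = key_padding_mask is not None and not key_padding_mask[j]
            future = causal and j > i
            dot = sum(a * b for a, b in zip(q[i], k[j]))
            # Masked keys get -inf, so their softmax weight is exactly zero.
            scores.append(-math.inf if (padded or future) else dot * scale)
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        output.append(
            [sum(w * v[j][d] for j, w in enumerate(weights)) / total for d in range(len(v[0]))]
        )
    return output


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
# The third key is padding: its value vector must not influence any output row.
padded_out = masked_attention(q, k, v, key_padding_mask=[True, True, False])
# With a causal mask, the first query can only attend to the first key.
causal_out = masked_attention(q, k, v, causal=True)
```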

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.mha.MHA(hf_config: PretrainedConfig, hidden_size: int, total_heads: int, op_heads: int, bias=True, dropout=0.0, softmax_scale=None, causal=True, layer_idx=None, rotary_emb_scale_base=0, return_residual=False, checkpointing=False, device=None, dtype=None, **kwargs)[source]#
training: bool#
forward(x, x_kv=None, key_padding_mask=None, cu_seqlens=None, max_seqlen=None, mixer_subset=None, inference_params=None, **kwargs)[source]#
Parameters:

  • x – (batch, seqlen, hidden_dim) (where hidden_dim = num heads * head dim) if cu_seqlens is None and max_seqlen is None, else (total, hidden_dim) where total is the sum of the sequence lengths in the batch.

  • x_kv – (batch, seqlen, hidden_dim), only applicable for cross-attention. If None, use x.

  • cu_seqlens – (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths of the sequences in the batch, used to index into x. Only applicable when using FlashAttention.

  • max_seqlen – int. Maximum sequence length in the batch.

  • key_padding_mask – boolean mask, True means to keep, False means to mask out. (batch, seqlen). Only applicable when not using FlashAttention.

  • mixer_subset – for cross-attention only. If not None, will take a subset of x before applying the query projection. Useful for e.g., ViT where we only care about the CLS token in the last layer.

  • inference_params – for generation. Adapted from Megatron-LM (and Apex).
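
To make the cu_seqlens and max_seqlen arguments concrete, the sketch below packs a batch of variable-length sequences into one flat list of "total" rows and builds the cumulative offsets used to index back into it. pack_batch is a hypothetical helper written with plain Python lists; the real arguments are tensors (cu_seqlens has dtype torch.int32).

```python
from itertools import accumulate


def pack_batch(sequences):
    """Concatenate variable-length sequences and build cumulative offsets.

    Returns (packed, cu_seqlens, max_seqlen), where sequence i occupies
    packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    lengths = [len(seq) for seq in sequences]
    cu_seqlens = [0] + list(accumulate(lengths))  # batch_size + 1 entries
    packed = [token for seq in sequences for token in seq]  # "total" rows
    return packed, cu_seqlens, max(lengths)


batch = [["a", "b", "c"], ["d"], ["e", "f"]]
packed, cu_seqlens, max_seqlen = pack_batch(batch)
# cu_seqlens == [0, 3, 4, 6] and max_seqlen == 3
second = packed[cu_seqlens[1]:cu_seqlens[2]]  # recovers ["d"]
```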

Separable 1D-Convolution#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.sep_conv1d.SeparableConv1d(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, **kwargs)[source]#
forward(hidden_states, **kwargs)[source]#

training: bool#

Structured Global Convolution#

Adapted from ctlllll/SGConv

archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.get_initializer(name, activation=None)[source]#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.modrelu(features)[source]#

training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.Modrelu(features)[source]#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.TransposedLinear(d_input, d_output, bias=True)[source]#

Linear module on the second-to-last dimension. Assumes shape (B, D, L), where L can be one or more axes.
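
A rough sketch of what a linear map over the second-to-last dimension means for an input of shape (B, D, L): every position along L is transformed by the same (E, D) weight matrix. transposed_linear is a hypothetical pure-Python stand-in, handles a single L axis only, and says nothing about how the actual module initializes or stores its weights.

```python
def transposed_linear(x, weight, bias=None):
    """Compute y[b][e][l] = sum_d weight[e][d] * x[b][d][l] (+ bias[e]).

    x: nested lists of shape (B, D, L); weight: (E, D); bias: (E,) or None.
    """
    out = []
    for sample in x:                      # sample has shape (D, L)
        columns = list(zip(*sample))      # L columns, each of length D
        rows = []
        for e, w_row in enumerate(weight):
            shift = bias[e] if bias is not None else 0.0
            rows.append([sum(w * c for w, c in zip(w_row, col)) + shift for col in columns])
        out.append(rows)                  # shape (E, L)
    return out


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]          # B=1, D=2, L=3
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # E=3, D=2
y = transposed_linear(x, weight)
# y[0] has shape (3, 3): the two input channels, then their sum
```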


training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.TransposedLN(d, scalar=True)[source]#

LayerNorm module over second dimension. Assumes shape (B, D, L), where L can be one or more axes.

This is slow, and a dedicated CUDA/Triton implementation should provide substantial end-to-end speedup.


training: bool#
archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.Activation(activation=None, size=None, dim=-1)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.LinearActivation(d_input, d_output, bias=True, zero_bias_init=False, transposed=False, initializer=None, activation=None, activate=False, weight_norm=False, **kwargs)[source]#

Returns a linear nn.Module with control over axes order, initialization, and activation

class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.Normalization(d, transposed=False, _name_='layer', **kwargs)[source]#

step(x, **kwargs)[source]#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.GConv(d_model, d_state=64, l_max=1, channels=1, bidirectional=False, activation='gelu', ln=False, postact=None, initializer=None, weight_norm=False, hyper_act=None, use_fast_fftconv=False, dropout=0.0, transposed=True, verbose=False, shift=False, linear=False, mode='cat_randn', **kernel_args)[source]#
requires_length = True#
fft_conv(u, k, L)[source]#
forward(u, return_kernel=False)[source]#

u: (B H L) if self.transposed else (B L H)

state: (H N) never needed unless you know what you’re doing

Returns: same shape as u
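
fft_conv is listed above without a description. The general technique behind Fourier-domain long convolution can be sketched as follows for one channel of length L. This is an assumption about the approach, not the library's code. A naive DFT stands in for a real FFT, and dft, fft_long_conv and direct_conv are hypothetical names.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * freq * t / n) for t, v in enumerate(values))
        for freq in range(n)
    ]
    return [v / n for v in out] if inverse else out


def fft_long_conv(u, k):
    """Causal convolution of signal u with kernel k, both of length L."""
    L = len(u)
    # Zero-pad to 2L so the circular convolution does not wrap around.
    u_f = dft(list(u) + [0.0] * L)
    k_f = dft(list(k) + [0.0] * L)
    y = dft([a * b for a, b in zip(u_f, k_f)], inverse=True)
    return [value.real for value in y[:L]]  # keep the first L outputs


def direct_conv(u, k):
    """Reference: y[t] = sum over s <= t of k[s] * u[t - s]."""
    return [sum(k[s] * u[t - s] for s in range(t + 1)) for t in range(len(u))]


u = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 0.5, 0.25, 0.125]
y_fft = fft_long_conv(u, k)
y_ref = direct_conv(u, k)
```

The output has the same length as the input, matching the "same shape as u" contract of forward.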

property d_state#
property d_output#
property state_to_tensor#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv.SGConv(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, hf_config: PretrainedConfig, **kwargs)[source]#
training: bool#
forward(x: Tensor, **kwargs)[source]#

Structured Global Convolution 3#

class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv3.GConv3(d_model, d_state=64, l_max=1, head_dim=1, channels=1, bidirectional=False, activation='gelu', ln=False, postact=None, initializer=None, weight_norm=False, hyper_act=None, use_fast_fftconv=False, dropout=0.0, transposed=True, verbose=False, shift=False, linear=False, mode='cat_randn', **kernel_args)[source]#
requires_length = True#
init_kernels(h, **kernel_args)[source]#
get_kernels_forward(multiplier, kernel_list_init)[source]#
forward(u, return_kernel=False)[source]#

u: (B H L) if self.transposed else (B L H)

state: (H N) never needed unless you know what you’re doing

Returns: same shape as u

property d_state#
property d_output#
property state_to_tensor#
training: bool#
class archai.discrete_search.search_spaces.nlp.tfpp.ops.sgconv3.SGConv3(arch_config: ArchConfig, hidden_size: int, total_heads: int, op_heads: int, hf_config: PretrainedConfig, **kwargs)[source]#
forward(x: Tensor, **kwargs)[source]#

training: bool#

Mixed Attention#

class archai.discrete_search.search_spaces.nlp.tfpp.mixed_attention.MixedAttentionBlock(arch_config: ArchConfig, hf_config: PretrainedConfig, hidden_size: int, layer_idx: int | None = None)[source]#
forward(hidden_states: Tensor, **kwargs)[source]#

training: bool#

Mixed Operators#

class archai.discrete_search.search_spaces.nlp.tfpp.mixed_op.MixedAttentionBlock(arch_config: ArchConfig, hf_config: GPT2Config, hidden_size: int, layer_idx: int | None = None)[source]#
forward(hidden_states, **kwargs)[source]#

training: bool#


class archai.discrete_search.search_spaces.nlp.tfpp.model.LanguageModel(arch_config: ArchConfig, **hf_config_kwargs)[source]#
forward(*args, **kwargs) Any[source]#

static get_hf_config_cls(arch_config: ArchConfig) PretrainedConfig[source]#
training: bool#

Search Space#

archai.discrete_search.search_spaces.nlp.tfpp.search_space.to_tuple(x: Tuple[int] | int) Tuple[int][source]#
class archai.discrete_search.search_spaces.nlp.tfpp.search_space.TfppSearchSpace(backbone: str = 'codegen', embed_dims: Tuple[int] | int = (768,), inner_dims: Tuple[int] | int = (3072,), total_heads: Tuple[int] | int = (12,), total_layers: Tuple[int] | int = (8, 10, 12, 16, 18), local_attn_window_sizes: Tuple[int] | int = (256,), sgconv_kernel_sizes: Tuple[int] | int = (256,), sconv1d_kernel_sizes: Tuple[int] | int = (256,), lsh_attn_num_hashes: Tuple[int] | int = (4, 8), lsh_attn_bucket_size: Tuple[int] | int = (64,), op_subset: Tuple[str] | None = None, mixed_ops: bool = True, homogeneous: bool = False, seed: int | None = None, disable_cache: bool = True, **hf_config_kwargs)[source]#
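
Most arguments above accept either a single int or a tuple of candidate values. The sketch below shows one plausible reading of that signature: each argument is normalized to a tuple, and the candidate values form a grid of choices. The to_tuple body here is a hypothetical stand-in inferred from the signature alone, and enumerating a full Cartesian product is for illustration only; how TfppSearchSpace actually samples architectures is not described in this reference.

```python
from itertools import product


def to_tuple(x):
    """Hypothetical stand-in: wrap a bare int so every option is a tuple."""
    return (x,) if isinstance(x, int) else tuple(x)


# Values copied from the defaults in the signature above, except embed_dims,
# which is passed as a bare int to exercise the normalization.
options = {
    "embed_dims": to_tuple(768),
    "total_layers": to_tuple((8, 10, 12, 16, 18)),
    "lsh_attn_num_hashes": to_tuple((4, 8)),
}

# Cartesian product of the per-argument choices (illustration only).
grid = [dict(zip(options, combo)) for combo in product(*options.values())]
```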


archai.discrete_search.search_spaces.nlp.tfpp.utils.get_optim_flag(config: PretrainedConfig, flag_name: str)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.from_json_file(json_file: str | PathLike) Dict[str, Any][source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.from_yaml_file(yaml_file: str | PathLike) Dict[str, Any][source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.group_texts(examples, tokenizer, **kwargs)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.split_heads(tensor, num_heads, attn_head_size)[source]#

Splits hidden_size dim into attn_head_size and num_heads

archai.discrete_search.search_spaces.nlp.tfpp.utils.merge_heads(tensor, num_heads, attn_head_size)[source]#

Merges attn_head_size dim and num_attn_heads dim into hidden_size
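
The two helpers above are inverses of each other. A small list-based sketch for a single sequence of shape (seq_len, hidden_size), where hidden_size = num_heads * attn_head_size, is shown below. The heads-first output layout and the names split_heads_2d and merge_heads_2d are assumptions made for illustration; the real functions operate on tensors and may order dimensions differently.

```python
def split_heads_2d(rows, num_heads, attn_head_size):
    """(seq_len, hidden_size) -> (num_heads, seq_len, attn_head_size)."""
    return [
        [row[h * attn_head_size:(h + 1) * attn_head_size] for row in rows]
        for h in range(num_heads)
    ]


def merge_heads_2d(heads, num_heads, attn_head_size):
    """(num_heads, seq_len, attn_head_size) -> (seq_len, hidden_size)."""
    seq_len = len(heads[0])
    return [
        [value for h in range(num_heads) for value in heads[h][t][:attn_head_size]]
        for t in range(seq_len)
    ]


hidden = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]   # seq_len=2, hidden_size=6
heads = split_heads_2d(hidden, num_heads=3, attn_head_size=2)
restored = merge_heads_2d(heads, num_heads=3, attn_head_size=2)
```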

archai.discrete_search.search_spaces.nlp.tfpp.utils.make_asso_map(input_ids, mask)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.make_broadcast_map(input_ids, mask, eos_id=103)[source]#
archai.discrete_search.search_spaces.nlp.tfpp.utils.get_attn_head_simplex(total_attn_heads: int | List[int], ops_list: List[str], grid_scale: int = 3) List[Tuple][source]#