Transformer-Flex#

Search Space#

class archai.discrete_search.search_spaces.nlp.transformer_flex.search_space.TransformerFlexSearchSpace(arch_type: str, min_layers: int | None = 1, max_layers: int | None = 10, d_inner_options: List[int] | None = None, d_model_options: List[int] | None = None, n_head_options: List[int] | None = None, share_d_inner: bool | None = True, mutation_prob: float | None = 0.3, vocab_size: int | None = 10000, max_sequence_length: int | None = 1024, att_dropout_rate: float | None = 0.0, disable_weights_init: bool | None = False, random_seed: int | None = 1)[source]#

Search space for Transformer models with flexible architecture.

This class defines a search space for Transformer models with flexible architectures that can be explored with evolutionary or Bayesian optimization algorithms.

The search space can be customized to include different values for hyperparameters, such as the number of layers, embedding dimensions, and number of attention heads.

It also supports different Transformer variants, such as CodeGen, GPT-2, and Transformer-XL.
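A minimal sketch of instantiating the search space is shown below; the "gpt2" arch_type string and the hyperparameter option lists are illustrative assumptions rather than the only supported values:

```python
from archai.discrete_search.search_spaces.nlp.transformer_flex.search_space import (
    TransformerFlexSearchSpace,
)

# Illustrative values only: "gpt2" is assumed to be one of the supported
# arch_type strings (the class also covers CodeGen and Transformer-XL
# variants), and the option lists are arbitrary examples.
space = TransformerFlexSearchSpace(
    arch_type="gpt2",
    min_layers=2,
    max_layers=8,
    d_model_options=[256, 512, 768],
    d_inner_options=[1024, 2048, 3072],
    n_head_options=[4, 8, 12],
    vocab_size=50257,
    max_sequence_length=1024,
    random_seed=42,
)
```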

get_archid(config: Dict[str, Any]) str[source]#

Returns a unique identifier for a given configuration.

Parameters:

config – Configuration dictionary.

Returns:

A unique identifier for the configuration.
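A minimal sketch of recomputing an identifier, reusing the space instance from the sketch above; it assumes that a sampled ArchaiModel stores its configuration dictionary under metadata["config"], which is an assumption about Archai's ArchaiModel convention rather than something stated on this page:

```python
# Assumption: the sampled ArchaiModel keeps its configuration dictionary in
# metadata["config"]; if not, the config dict has to be assembled manually.
sampled = space.random_sample()
archid = space.get_archid(sampled.metadata["config"])
assert archid == sampled.archid
```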

random_sample() ArchaiModel[source]#

Randomly sample an architecture from the search space.

Returns:

Sampled architecture.
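A short sketch of drawing samples, assuming the space instance from above and that ArchaiModel exposes archid and arch attributes:

```python
# Draw a few architectures; each ArchaiModel carries a unique `archid` and
# wraps the instantiated PyTorch module in `arch`.
models = [space.random_sample() for _ in range(5)]
for m in models:
    print(m.archid)
```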

save_arch(model: ArchaiModel, path: str) None[source]#

Save an architecture to a file without saving the weights.

Parameters:
  • model – Model’s architecture to save.

  • path – File path to save the architecture.

load_arch(path: str) ArchaiModel[source]#

Load an architecture that was saved to a file using SearchSpace.save_arch().

Parameters:

path – File path to load the architecture from.

Returns:

Loaded model.
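A hedged sketch of a save/load round trip, reusing the space instance from above; the file name arch.json is illustrative only:

```python
# Round trip: only the architecture (configuration) is persisted, not weights.
model = space.random_sample()
space.save_arch(model, "arch.json")        # file name is illustrative
restored = space.load_arch("arch.json")
assert restored.archid == model.archid
```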

save_model_weights(model: ArchaiModel, path: str) None[source]#

Save the weights of a model.

Parameters:
  • model – Model whose weights should be saved.

  • path – File path to save the weights.

load_model_weights(model: ArchaiModel, path: str) None[source]#

Load the weights (created with SearchSpace.save_model_weights()) into a model of the same architecture.

Parameters:
  • model – Model to load the weights into.

  • path – File path to load the weights from.
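A hedged sketch of persisting and restoring weights separately from the architecture; the file names are illustrative and the space instance is assumed from above:

```python
# Weights are stored separately from the architecture, so a model rebuilt
# with load_arch() can be re-populated with load_model_weights().
model = space.random_sample()
space.save_arch(model, "arch.json")             # file names are illustrative
space.save_model_weights(model, "weights.pt")

clone = space.load_arch("arch.json")
space.load_model_weights(clone, "weights.pt")   # architecture must match
```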

mutate(arch: ArchaiModel) ArchaiModel[source]#

Mutate an architecture from the search space.

This method should not alter the base model architecture directly; it should only generate a new one.

Parameters:

arch – Base model.

Returns:

Mutated model.
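A minimal sketch of mutation, assuming the space instance from above; the parent model is expected to remain unchanged:

```python
# mutate() returns a new ArchaiModel; the parent architecture is not altered.
parent = space.random_sample()
child = space.mutate(parent)
print(parent.archid, "->", child.archid)
```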

crossover(arch_list: List[ArchaiModel]) ArchaiModel[source]#

Combine a list of architectures into a new one.

Parameters:

arch_list – List of architectures.

Returns:

Resulting model.
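A minimal sketch of crossover between two sampled parents, assuming the space instance from above:

```python
# crossover() combines hyperparameters from the parent architectures into a
# single offspring architecture.
parents = [space.random_sample() for _ in range(2)]
offspring = space.crossover(parents)
print(offspring.archid)
```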

encode(model: ArchaiModel) List[float][source]#

Encode an architecture into a fixed-length vector representation.

Parameters:

model – Model from the search space.

Returns:

Fixed-length vector representation of model.
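A minimal sketch of encoding, assuming the space instance from above; the resulting vector is what Bayesian optimization searchers would typically consume:

```python
# encode() produces a fixed-length feature vector for an architecture.
model = space.random_sample()
features = space.encode(model)
print(len(features), features[:3])
```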

GPT-2 Flex#

class archai.discrete_search.search_spaces.nlp.transformer_flex.models.configuration_gpt2_flex.GPT2FlexConfig(*args, primer_square: bool | None = False, **kwargs)[source]#
model_type: str = 'gpt2-flex'#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_gpt2_flex.GPT2FlexAttention(config: GPT2FlexConfig, is_cross_attention: bool | None = False, layer_idx: int | None = None)[source]#
training: bool#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_gpt2_flex.GPT2FlexMLP(intermediate_size: int, config: GPT2FlexConfig)[source]#
forward(hidden_states: FloatTensor) FloatTensor[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_gpt2_flex.GPT2FlexBlock(config: GPT2FlexConfig, layer_idx: int | None = None)[source]#
training: bool#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_gpt2_flex.GPT2FlexModel(config: GPT2FlexConfig)[source]#
config_class#

alias of GPT2FlexConfig

training: bool#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_gpt2_flex.GPT2FlexLMHeadModel(config: GPT2FlexConfig)[source]#
config_class#

alias of GPT2FlexConfig

training: bool#
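A hedged sketch of building the flexible GPT-2 variant directly. It assumes GPT2FlexConfig forwards standard transformers GPT-2 keyword arguments (vocab_size, n_positions, n_layer, n_embd) to its base configuration and that the flexible dimensions n_head and n_inner accept per-layer lists; the values below are illustrative only:

```python
from archai.discrete_search.search_spaces.nlp.transformer_flex.models.configuration_gpt2_flex import (
    GPT2FlexConfig,
)
from archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_gpt2_flex import (
    GPT2FlexLMHeadModel,
)

# Assumptions: standard GPT-2 kwargs are forwarded to the transformers base
# config, and n_head / n_inner may be given as per-layer lists so that each
# block can use a different width.
config = GPT2FlexConfig(
    vocab_size=10000,
    n_positions=1024,
    n_layer=4,
    n_embd=512,
    n_head=[4, 8, 8, 4],
    n_inner=[1024, 2048, 2048, 1024],
    primer_square=False,
)
model = GPT2FlexLMHeadModel(config)
```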

NVIDIA Memory Transformer#

class archai.discrete_search.search_spaces.nlp.transformer_flex.models.configuration_mem_transformer.MemTransformerConfig(*args, **kwargs)[source]#
model_type: str = 'mem-transformer'#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_mem_transformer.MemTransformerBaseOutput[source]#
last_hidden_state: FloatTensor#
past_key_values: Tuple[FloatTensor] | None = None#
mems: List[FloatTensor] | None = None#
hidden_states: Tuple[FloatTensor] | None = None#
attentions: Tuple[FloatTensor] | None = None#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_mem_transformer.MemTransformerOutput[source]#
loss: FloatTensor | None = None#
prediction_scores: FloatTensor | None = None#
past_key_values: Tuple[FloatTensor] | None = None#
mems: List[FloatTensor] | None = None#
hidden_states: Tuple[FloatTensor] | None = None#
attentions: Tuple[FloatTensor] | None = None#
property logits: FloatTensor#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_mem_transformer.MemTransformerModel(config: MemTransformerConfig)[source]#
config_class#

alias of MemTransformerConfig

forward(input_ids: LongTensor | None = None, past_key_values: Tuple[FloatTensor] | None = None, mems: List[FloatTensor] | None = None, head_mask: FloatTensor | None = None, inputs_embeds: FloatTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) MemTransformerBaseOutput[source]#

The [TransfoXLModel] forward method overrides the __call__ special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters:
  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

    [What are input IDs?](../glossary#input-ids)

  • mems (List[torch.FloatTensor] of length config.n_layers) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see mems output below). Can be used to speed up sequential decoding. The token ids which have their mems given to this model should not be passed as input_ids as they have already been computed.

  • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) – Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

Returns:

A [transformers.models.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput] or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration ([TransfoXLConfig]) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • mems (List[torch.FloatTensor] of length config.n_layers) – Contains pre-computed hidden-states (key and values in the attention blocks). Can be used (see mems input) to speed up sequential decoding. The token ids which have their past given to this model should not be passed as input ids as they have already been computed.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type:

[transformers.models.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput] or tuple(torch.FloatTensor)

Example:

```python
>>> from transformers import AutoTokenizer, TransfoXLModel
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("transfo-xl-wt103")
>>> model = TransfoXLModel.from_pretrained("transfo-xl-wt103")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```
training: bool#
class archai.discrete_search.search_spaces.nlp.transformer_flex.models.modeling_mem_transformer.MemTransformerLMHeadModel(config: MemTransformerConfig)[source]#
config_class#

alias of MemTransformerConfig

tie_weights() None[source]#

Tie the weights between the input embeddings and the output embeddings.

If the torchscript flag is set in the configuration, parameter sharing cannot be handled, so the weights are cloned instead.

forward(input_ids: LongTensor | None = None, past_key_values: Tuple[FloatTensor] | None = None, mems: List[FloatTensor] | None = None, head_mask: FloatTensor | None = None, inputs_embeds: FloatTensor | None = None, labels: LongTensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) MemTransformerOutput[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#