Multilayer Offload (MLO)¶
Multilayer Offload (MLO) is a powerful feature recently added to FINN that enables the implementation of much larger neural networks: a single repeating slice of the model (such as one transformer encoder layer) is implemented in hardware, and the per-layer weights are cycled through external memory (DRAM/HBM). This technique makes it possible to map models to the FPGA that would otherwise be too large to fit.
MLO is currently an experimental feature and is not yet available on the main branch.
Overview¶
Large deep learning models such as transformers and SLMs (and LLMs, for that matter) often have millions or billions of parameters processed across many identical repeating layers. One solution would be to map these layers to multiple FPGAs, but the sheer number of layers (e.g., 32 in Phi-4-mini) makes it impractical to spread the design across so many devices. MLO overcomes this limitation by:
- Implementing a single repeating layer (e.g., one transformer encoder) in hardware
- Storing weights off-chip in high-bandwidth memory (HBM/DRAM)
- Streaming weights into the accelerator as needed for each layer
- Reusing the same hardware to process multiple layers sequentially
This approach trades some throughput for the ability to handle much larger models, making it ideal for larger transformer models such as SLMs, vision transformers, and other deep architectures.
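The control flow behind this reuse can be sketched in plain Python. Everything below is illustrative - run_layer_hw stands in for the fixed hardware loop body and is not a FINN or Brainsmith API:

```python
# Conceptual sketch of MLO execution: one hardware loop body, N weight sets.
# All names are illustrative stand-ins, not FINN or Brainsmith APIs.

def run_layer_hw(activations, weights):
    """Stand-in for the single hardware loop body (e.g., one encoder layer)."""
    # In hardware this is the fixed dataflow accelerator; here, a toy op.
    return [a + w for a, w in zip(activations, weights)]

def mlo_forward(activations, layer_weights):
    """Cycle every layer's weight set through the same hardware body."""
    for weights in layer_weights:  # streamed in from DRAM/HBM per iteration
        activations = run_layer_hw(activations, weights)
    return activations

# Two "layers" share one body; each iteration applies a different weight set.
print(mlo_forward([1, 2], [[10, 10], [100, 100]]))  # [111, 112]
```

The throughput trade-off is visible even in this sketch: the layers run sequentially through one body rather than as a deeper pipeline, so throughput drops while model capacity grows.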
How It Works¶
Loop Body Hierarchy¶
MLO works by identifying a repeating structure in the neural network and implementing only that structure in hardware. Currently, loop body discovery is not automated - users must manually identify one iteration of the repeating pattern and specify it using the loop_body_hierarchy parameter:
Manual Loop Body Identification:
The loop_body_hierarchy configuration must match the hierarchical naming structure in your ONNX model, which corresponds to the pkg.torch.onnx.name_scopes field used during model export. The loop rolling transformation uses these name scopes to determine which levels of hierarchy to include in the loop body.
⚠️ Important: You must use dynamo=True when exporting your PyTorch model to ONNX. Exporting with dynamo=True generates the metadata (name scopes) that MLO requires to identify repeating structures. Without this flag, the ONNX model will lack the hierarchical metadata needed for loop body discovery, and the MLO transformation will fail to locate the repeating patterns.
Technical Implementation: The node extraction mechanism is implemented in FINN's loop rolling transformations:
- Step Location: deps/finn/src/finn/builder/build_dataflow_steps.py
- Extraction Process: deps/finn/src/finn/transformation/fpgadataflow/loop_rolling.py (LoopExtraction class)
- Hierarchy Matching: deps/finn/src/finn/util/onnxscript_helpers.py (PytorchHierarchyNode class)
The extraction works by:
- Creating a hierarchy parser from PyTorch metadata (pkg.torch.onnx.name_scopes)
- Adding each ONNX node to the parser based on its hierarchy path
- Using prefix matching to find all nodes under the specified hierarchy paths
- Extracting matching nodes to create loop templates and removing originals from the main graph
This process requires the PyTorch exporter metadata generated by dynamo=True, which contains the module instance hierarchies that map ONNX nodes back to their originating PyTorch modules.
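As a rough illustration of the prefix matching, consider the sketch below. The scope lists mimic pkg.torch.onnx.name_scopes; the matching logic is a toy simplification of what PytorchHierarchyNode does, not the actual FINN class:

```python
# Simplified sketch of hierarchy prefix matching over name-scope metadata.
# The scope lists mimic pkg.torch.onnx.name_scopes; this is a toy version of
# what PytorchHierarchyNode does, not the actual FINN class.

def nodes_under(hierarchy_path, node_scopes):
    """Return node names whose scopes fall under hierarchy_path."""
    matched = []
    for name, scopes in node_scopes.items():
        # A node belongs to the loop body if one of its scopes equals the
        # target path or is nested beneath it.
        if any(s == hierarchy_path or s.startswith(hierarchy_path + ".")
               for s in scopes):
            matched.append(name)
    return matched

node_scopes = {
    "MatMul_0": ["encoder", "encoder.layer.0", "encoder.layer.0.attention"],
    "MatMul_9": ["encoder", "encoder.layer.1", "encoder.layer.1.attention"],
}
print(nodes_under("encoder.layer.0", node_scopes))  # ['MatMul_0']
```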
A configuration such as loop_body_hierarchy: [['encoder', 'encoder.layer.0']] tells Brainsmith:
- Look for a repeating pattern called 'encoder' (top-level hierarchy)
- The repeating unit is 'encoder.layer.0' (one complete encoder layer)
- All encoder layers (layer.0, layer.1, layer.2, etc.) will be processed using the same hardware
- The name scopes must exactly match the ONNX node names for proper identification
Multiple Hierarchy Groups¶
For models with multiple independent repeating structures, you can specify multiple hierarchy groups in the loop_body_hierarchy configuration:
finn_config:
  loop_body_hierarchy: [
    ['encoder', 'encoder.layer.0'],
    ['encoder', 'encoder.layer.1']
  ]
This advanced configuration enables the following:
- Multiple Loop Iterations in a Single Body - Include nodes from consecutive layers (e.g., layer.0 and layer.1) to unroll multiple iterations into the hardware implementation
- Fine-tuning Node Selection - Adjust which nodes are included in the loop body when metadata is lost or inexact during ONNX export
Multiple Group Behavior:
- The loop body will include all nodes belonging to every listed hierarchy region.
Hierarchy Level Specification¶
The loop_body_hierarchy can specify multiple levels of hierarchy to precisely control what gets included in the loop body:
Two-level hierarchy (simple case):
- Includes all nodes under encoder.layer.0.*
- Good for simple transformer architectures
Three-level hierarchy (precise control):
- Specifies the full path: model → encoder stack → specific layer
- Provides more precise control over node selection
- Useful for complex models with nested structures
The FINN loop rolling step will find all ONNX nodes whose names start with the final hierarchy level (e.g., bert.encoder.layer.0) and extract them as the loop body.
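As an illustration, the two cases might appear in a blueprint as follows (using the bert.encoder.layer.0 naming from this page):

```yaml
# Two-level hierarchy
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]

# Three-level hierarchy
finn_config:
  loop_body_hierarchy: [['bert', 'bert.encoder', 'bert.encoder.layer.0']]
```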
Loop Rolling Process¶
The loop rolling transformation (step_loop_rolling in FINN) performs these key operations:
- Parses the loop_body_hierarchy to identify which nodes belong to the repeating structure
- Extracts nodes by name scope matching - finds all ONNX nodes whose names match the specified hierarchy pattern (e.g., nodes starting with 'bert.encoder.layer.0')
- Generates loop iteration logic - creates control structures to iterate through all layers using the same hardware
- Sets up weight streaming infrastructure - configures memory interfaces to stream different weights for each iteration
- Updates folding configuration - modifies parallelization parameters to account for the loop structure
Loop Body Extraction Details¶
The extraction logic itself is implemented in the FINN library (finn.builder.build_dataflow_steps.step_loop_rolling). Conceptually, it performs the following operations:
Node Selection Process:
# Conceptual extraction logic (actual implementation in FINN)
def extract_loop_body_nodes(model, loop_body_hierarchy):
    """Extract nodes matching the loop body hierarchy pattern."""
    extracted_nodes = []
    # Get the target pattern from hierarchy (e.g., 'bert.encoder.layer.0')
    target_pattern = loop_body_hierarchy[0][-1]  # Final level
    # Find all nodes whose names start with the target pattern
    for node in model.graph.node:
        if node.name.startswith(target_pattern):
            extracted_nodes.append(node)
    return extracted_nodes
The metadata fields exported by PyTorch Dynamo are not always reliable and in some cases can be removed by optimization passes. When encountered, these issues are reported to the onnxscript team and are often resolved. However, we have tried to make the Loop Body Extraction process as robust as possible in the presence of missing metadata.
In some cases, the Loop Body Extraction process can recover nodes with missing metadata fields: it attempts to infer the missing hierarchy information by checking the metadata of the node's input and output nodes.
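A heavily simplified sketch of this neighbor-based inference is shown below. The dict-based node representation is purely illustrative; the real logic operates on ONNX IR nodes inside FINN's loop extraction code:

```python
# Heavily simplified sketch of neighbor-based metadata inference.
# Nodes are plain dicts here for illustration only; the real logic lives in
# FINN's loop extraction code and operates on ONNX IR nodes.

def infer_scopes(node, producers, consumers):
    """If a node has no scopes, borrow the common scopes of its neighbors."""
    if node.get("scopes"):
        return node["scopes"]
    neighbor_scopes = [n["scopes"] for n in producers + consumers
                       if n.get("scopes")]
    if not neighbor_scopes:
        return None  # nothing to infer from
    common = neighbor_scopes[0]
    for scopes in neighbor_scopes[1:]:
        if scopes != common:
            return None  # neighbors disagree; leave unresolved
    return common

a = {"name": "Mul_3", "scopes": ["encoder", "encoder.layer.0"]}
b = {"name": "Add_7", "scopes": None}  # metadata lost during export
c = {"name": "Relu_2", "scopes": ["encoder", "encoder.layer.0"]}
print(infer_scopes(b, [a], [c]))  # ['encoder', 'encoder.layer.0']
```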
Configuration¶
Basic MLO Setup¶
To enable MLO in your blueprint, add the loop_body_hierarchy configuration:
name: "BERT with MLO"
description: "BERT model with Multilayer Offload"
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]
  split_large_fifos: true
  fifosim_n_inferences: 2  # Speed up FIFO simulation
design_space:
  steps:
    - "qonnx_to_finn"
    - "bert_streamlining"
    - "infer_kernels"
    - "create_dataflow_partition"
    - "specialize_layers"
    - "loop_rolling"  # This step implements MLO
    - "target_fps_parallelization"
    - "apply_folding_config"
    # ... rest of pipeline
The easiest way to identify the proper loop body hierarchy is to open the model in Netron and check the values of the node metadata that you'd like to include in the loop body.
BERT MLO Example¶
For BERT models, a typical MLO configuration looks like:
# bert_mlo_demo.yaml
name: "BERT Demo"
description: "Hugging Face BERT model with MLO"
extends: "../../brainsmith/blueprints/bert.yaml"
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]
  split_large_fifos: true
  fifosim_n_inferences: 2
  verify_steps: ['folded_hls_cppsim', 'stitched_ip_rtlsim']
design_space:
  steps:
    - at_start:
        insert:
          - "bert_cleanup"
          - "remove_head"
          - "remove_tail"
          - "generate_reference_io"
    - at_end:
        insert: "shell_metadata_handover"
Example: BERT MLO Demo¶
The examples/bert/bert_mlo_demo.sh script demonstrates a complete MLO workflow:
#!/bin/bash
# BERT MLO Demo
# Generate folding configuration
python gen_folding_config.py \
--simd 4 \
--pe 4 \
--num_layers 2 \
-t 1 \
-o ./configs/bert_mlo_demo.json
# Run BERT demo with MLO
# Flags: -n attention heads, -l total layers, -z hidden size,
#        -i intermediate size, -b quantization bits, -q sequence length
python bert_demo.py \
    -o bert_mlo_demo \
    -n 4 \
    -l 2 \
    -z 64 \
    -i 256 \
    -b 8 \
    -q 32 \
    --blueprint ./bert_mlo_demo.yaml
This creates a BERT model with 2 encoder layers where only the first layer is implemented in hardware, and the second layer reuses the same hardware with different weights.
CRITICAL: ONNX Export Requirements
# When exporting your model to ONNX, you MUST use dynamo=True
# This generates the metadata (name scopes) that MLO requires for loop body discovery
import brevitas.onnx as bo
bo.export_qonnx(
    model,
    inputs,
    output_path,
    dynamo=True,  # Generates name scope metadata for MLO
    input_names=['input_ids'],
    opset_version=18,
    do_constant_folding=True
)
Alternative: Custom Loop Rolling for Non-Dynamo Export
If you cannot use dynamo=True (due to compatibility issues, model complexity, or other constraints), you can either add the metadata manually or implement a custom loop rolling step.
Adding Metadata Manually
If your ONNX model was exported without dynamo=True or the metadata was lost during optimization, you can manually add the required pkg.torch.onnx.name_scopes metadata to enable MLO. This approach requires modifying the ONNX model's metadata properties directly.
Step 1: Understanding the Metadata Structure
The pkg.torch.onnx.name_scopes metadata field contains hierarchical naming information that maps each ONNX node back to its originating PyTorch module. The metadata is stored as a list of strings representing the hierarchy path from the root module to the specific operation.
For example, in a BERT model:
# Layer 0 attention query node
['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.query']
# Layer 0 attention key node
['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.key']
# Layer 1 attention query node
['bert', 'bert.encoder', 'bert.encoder.layer.1', 'bert.encoder.layer.1.attention.self.query']
Step 2: Identify Your Model's Hierarchy
First, determine the hierarchical structure of your model:
import torch
# Example: Print your PyTorch model structure
model = YourModel()
for name, module in model.named_modules():
    print(name)
# Output might look like:
# encoder
# encoder.layer.0
# encoder.layer.0.attention
# encoder.layer.0.attention.self
# encoder.layer.1.attention
# encoder.layer.1.attention.self
Step 3: Add Metadata to ONNX Nodes
Use the following script to add metadata to your ONNX model:
import onnx
def add_name_scope_metadata(model_path, output_path, node_hierarchy_map):
    """
    Add pkg.torch.onnx.name_scopes metadata to ONNX nodes.

    Args:
        model_path: Path to input ONNX model
        output_path: Path to save modified ONNX model
        node_hierarchy_map: Dict mapping node names to hierarchy paths (as list of strings)
            e.g., {'MatMul_0': ['encoder', 'encoder.layer.0', 'encoder.layer.0.attention']}
    """
    model = onnx.load(model_path)
    for node in model.graph.node:
        if node.name in node_hierarchy_map:
            hierarchy_list = node_hierarchy_map[node.name]
            # Serialize the list of strings as the Dynamo exporter does
            hierarchy_str = str(hierarchy_list)
            # The Dynamo exporter stores name scopes in the node's
            # metadata_props (available in ONNX >= 1.16), not in node
            # attributes, so add or update the entry there
            for entry in node.metadata_props:
                if entry.key == "pkg.torch.onnx.name_scopes":
                    entry.value = hierarchy_str
                    break
            else:
                node.metadata_props.add(key="pkg.torch.onnx.name_scopes",
                                        value=hierarchy_str)
    onnx.save(model, output_path)
    print(f"Model with metadata saved to {output_path}")

# Example usage for a BERT model
node_hierarchy_map = {
    # Attention layer nodes
    'MatMul_0': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.query'],
    'MatMul_1': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.key'],
    'MatMul_2': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.value'],
    'MatMul_3': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.output.dense'],
    # Intermediate layer nodes
    'MatMul_4': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.intermediate.dense'],
    'MatMul_5': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.output.dense'],
    # LayerNorm nodes
    'LayerNormalization_0': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.output.LayerNorm'],
    'LayerNormalization_1': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.output.LayerNorm'],
    # You only need to add metadata for the nodes used in the loop body template
}

add_name_scope_metadata(
    'model_without_metadata.onnx',
    'model_with_metadata.onnx',
    node_hierarchy_map
)
Step 4: Verify Metadata with Netron
After adding metadata, open the modified model in Netron and inspect node properties to verify the pkg.torch.onnx.name_scopes field appears correctly.
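You can also verify the metadata programmatically. The sketch below assumes the scopes live in each node's metadata_props entries (where the Dynamo exporter writes them); it duck-types the nodes, so it works on onnx NodeProto objects as well as the stand-ins used in the demo:

```python
from types import SimpleNamespace

def missing_name_scopes(nodes):
    """Return the names of nodes lacking the name-scope metadata entry."""
    missing = []
    for node in nodes:
        keys = {entry.key for entry in node.metadata_props}
        if "pkg.torch.onnx.name_scopes" not in keys:
            missing.append(node.name)
    return missing

# Tiny stand-in demo; for a real model, pass onnx.load(path).graph.node
ok = SimpleNamespace(
    name="MatMul_0",
    metadata_props=[SimpleNamespace(key="pkg.torch.onnx.name_scopes")],
)
bad = SimpleNamespace(name="Add_1", metadata_props=[])
print(missing_name_scopes([ok, bad]))  # ['Add_1']
```

An empty list means every node you passed in carries the metadata MLO needs.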
Step 5: Use in MLO Configuration
Once metadata is added, configure your blueprint with the appropriate loop_body_hierarchy:
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]  # Must match your hierarchy paths
Important Notes:
- Metadata must accurately reflect the repeating structure of your model
- All nodes within a layer should have consistent hierarchy prefixes
- Test with a small model (2-3 layers) before applying to larger models
- Incorrect metadata will cause loop body extraction to fail or extract wrong nodes
Custom Loop Rolling Step
If you cannot export via PyTorch Dynamo, you can write your own Loop Extraction transform and then leverage the existing Loop Rolling transform to create the FINNLoop ONNX node. At present, you'll need to copy the Loop Rolling step in FINN and replace the Loop Extraction functionality. In the future, we plan to update the Loop Rolling step to accept a custom Loop Extraction function.
The standard Loop Rolling build step consists of two transformations: Loop Body Extraction and Loop Rolling. Loop Body Extraction returns a LoopBodyTemplate object, which the LoopRolling transformation uses as a pattern to identify individual instances of each loop body. The LoopBodyTemplate object is created from an ONNX file that contains one copy of the loop body you'd like to create.
If you have a graph of the loop body or can easily create one, then you can simply create a custom Loop Rolling step in BrainSmith that creates the LoopBodyTemplate object from the ONNX file and passes it to the LoopRolling transformation as shown in the example code below.
Example: Custom Loop Rolling Step with Pre-built Loop Body Template
from brainsmith.core.plugins import step
from finn.transformation.fpgadataflow.loop_rolling import LoopBodyTemplate, LoopRolling

@step(name="custom_loop_rolling_with_template")
def custom_loop_rolling_with_template(model, cfg):
    """
    Custom loop rolling step that uses a pre-created loop body ONNX file.

    Use this approach when you have manually created or extracted the loop body
    graph and saved it to an ONNX file.
    """
    # Load the loop body template from a pre-created ONNX file.
    # This file should contain one complete iteration of your loop body.
    loop_body_template_path = "path/to/your/loop_body_template.onnx"
    loop_body_template = LoopBodyTemplate(loop_body_template_path)

    # Apply the loop rolling transformation using your custom template
    model = model.transform(LoopRolling(loop_body_template))
    return model
In this approach, you need to manually create loop_body_template.onnx containing one instance of your repeating layer structure. You can create this file by:
1. Extracting a subgraph from your full model using ONNX tools
2. Building it programmatically using ONNX IR or onnxscript
3. Exporting a single layer model from PyTorch
Otherwise, you can create a custom LoopBodyExtraction transform. One approach is to build a Python list of the ONNX nodes in the model that fully comprise one iteration of the loop body. You can then use that list to create a SubGraphView object, which in turn can be saved to an ONNX file and used to create the LoopBodyTemplate, as shown in the example code below.
Example: Custom Loop Extraction and Rolling
from brainsmith.core.plugins import step
from finn.transformation.fpgadataflow.loop_rolling import LoopBodyTemplate, LoopRolling
from finn.util import onnxscript_helpers as osh
import onnxscript
from onnxscript import ir
import onnx

class CustomLoopExtraction:
    """
    Custom loop body extraction that identifies loop body nodes
    without relying on PyTorch metadata.
    """

    def __init__(self, loop_body_hierarchy):
        self.loop_body_hierarchy = loop_body_hierarchy
        self.loop_body_template = None

    def extract_loop_body_nodes(self, graph, target_pattern):
        """
        Identify nodes that belong to the loop body.

        This is where you implement your custom logic to find the nodes.
        You can use pattern matching, graph analysis, or any other method.
        """
        extracted_nodes = []
        # Strategy 1: Simple name prefix matching
        for node in graph._nodes:
            if node.name.startswith(target_pattern):
                extracted_nodes.append(node)
        # Strategy 2: If prefix matching fails, try pattern in node name
        if not extracted_nodes:
            layer_id = target_pattern.split('.')[-1]
            for node in graph._nodes:
                if f".{layer_id}." in node.name or f"_{layer_id}_" in node.name:
                    extracted_nodes.append(node)
        return extracted_nodes

    def apply(self, model):
        """Extract loop body and create template file."""
        # Deserialize the model to ONNX IR
        model_ir = onnxscript.ir.serde.deserialize_model(model.model)
        graph = model_ir.graph

        # Get the target pattern from hierarchy
        target_pattern = self.loop_body_hierarchy[0][-1]

        # Extract nodes belonging to the loop body
        nodes = self.extract_loop_body_nodes(graph, target_pattern)
        if not nodes:
            raise ValueError(f"No nodes found matching pattern: {target_pattern}")
        print(f"Extracted {len(nodes)} nodes for loop body")

        # Create a SubGraphView containing only the loop body nodes
        loop_body_graph_view = osh.SubGraphView(graph, "loop-body", nodes)

        # Create an ONNX model from the subgraph
        loop_body_model = onnxscript.ir.Model(
            loop_body_graph_view,
            ir_version=model.model.ir_version
        )

        # Serialize and save the loop body template
        proto = onnxscript.ir.serde.serialize_model(loop_body_model)
        template_path = "loop-body-template.onnx"
        onnx.save(proto, template_path)
        print(f"Loop body template saved to: {template_path}")

        # Create the LoopBodyTemplate object
        self.loop_body_template = LoopBodyTemplate(template_path)
        return model

@step(name="custom_loop_rolling_full")
def custom_loop_rolling_full(model, cfg):
    """
    Complete custom loop rolling step with custom extraction.

    This approach:
    1. Uses custom logic to identify loop body nodes
    2. Creates a loop body template from those nodes
    3. Applies FINN's LoopRolling transformation
    """
    # Get loop body hierarchy from config
    hierarchy = cfg.loop_body_hierarchy if hasattr(cfg, 'loop_body_hierarchy') \
        else [['encoder', 'encoder.layer.0']]

    # Step 1: Custom extraction to create loop body template
    extractor = CustomLoopExtraction(hierarchy)
    model = extractor.apply(model)

    # Step 2: Apply FINN's loop rolling with the custom template
    if extractor.loop_body_template is None:
        raise ValueError("Loop body extraction failed - no template created")
    model = model.transform(LoopRolling(extractor.loop_body_template))

    print("Custom loop rolling completed successfully")
    return model
Key Points:
- CustomLoopExtraction.extract_loop_body_nodes(): This is where you implement your custom logic to identify which nodes belong to the loop body. The example shows simple name matching, but you can implement more sophisticated graph analysis.
- SubGraphView: This FINN utility class creates a view of a subgraph given a list of nodes. It automatically handles:
  - Finding all necessary inputs/outputs
  - Maintaining graph connectivity
  - Preserving node attributes and metadata
- LoopBodyTemplate: This class (from FINN) wraps the loop body ONNX file and provides the pattern matching infrastructure that LoopRolling needs.
- LoopRolling transformation: This is FINN's standard transformation that:
  - Finds all instances of the loop body pattern in your model
  - Replaces them with a single FINNLoop node
  - Sets up weight streaming infrastructure
  - Handles I/O normalization and type checking
Usage in Blueprint:
design_space:
  steps:
    - "qonnx_to_finn"
    - "bert_streamlining"
    - "infer_kernels"
    - "create_dataflow_partition"
    - "specialize_layers"
    - "custom_loop_rolling_full"  # Your custom step
    - "target_fps_parallelization"
    - "apply_folding_config"
Debugging MLO Issues¶
Common Problems¶
Missing or incorrect metadata (most common):
- Ensure ONNX export used dynamo=True to generate name scope metadata
- Verify the ONNX model contains proper hierarchical node names
- If unable to use dynamo export, implement a custom loop rolling step (see the Custom Loop Rolling Step section)
Missing Loop Body Nodes
If a node that should be in the loop body is not included during Loop Extraction, this can appear in loopbody_template.onnx as unexpected inputs and outputs to the loop body graph. Further, this can result in loop rolling failure or errors in subsequent build steps like step_create_dataflow_partition.
Sometimes a node in the middle of the loop body will be excluded from the loop body. This can result in a self-referencing loop error in step_create_dataflow_partition, where the partitioning process detects invalid circular dependencies.
Debugging Steps:
1. Open loopbody_template.onnx in your build directory using Netron
2. Check for unexpected graph inputs/outputs that should be internal to the loop body
3. Identify which nodes are missing by comparing against your expected layer structure
4. Adjust the loop_body_hierarchy configuration to include missing nodes:
- Try adding an additional hierarchy group for the missing node's namespace
- Use a broader hierarchy prefix to capture more nodes
- If using custom loop extraction, verify your node matching patterns
5. Verify metadata on the missing nodes (check pkg.torch.onnx.name_scopes field in Netron)
6. Rebuild and verify the loopbody_template.onnx contains all expected nodes
Incorrect loop body identification:
- Check loop_body_hierarchy matches your model structure
- Verify layer naming conventions in ONNX graph
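A quick way to check the match before a full build is to list the node names under a candidate prefix. This is a plain-Python sketch; feed it the node names from your ONNX graph:

```python
# Plain-Python sanity check for a candidate hierarchy prefix: which node
# names would be captured as the loop body, and which would be left out?
# For a real model, pass [n.name for n in onnx.load("model.onnx").graph.node].

def split_by_prefix(node_names, prefix):
    """Partition node names into (inside loop body, outside loop body)."""
    inside = [n for n in node_names if n.startswith(prefix)]
    outside = [n for n in node_names if not n.startswith(prefix)]
    return inside, outside

names = [
    "bert.encoder.layer.0.attention.MatMul",
    "bert.encoder.layer.0.output.Add",
    "bert.encoder.layer.1.attention.MatMul",
]
inside, outside = split_by_prefix(names, "bert.encoder.layer.0")
print(len(inside), len(outside))  # 2 1
```

If nodes you expect inside the loop body land in the outside list, the prefix (and thus loop_body_hierarchy) does not match your model's naming.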
Debug Tools¶
- Save intermediate models - Use save_intermediate_models: true
- Enable verification - Use RTL simulation to check correctness
- Memory tracing - Monitor weight loading patterns
- Performance counters - Track cycles, bandwidth utilization