Multilayer Offload (MLO)¶
Multilayer Offload (MLO) is a powerful feature recently added to FINN that enables the implementation of much larger neural networks: a single repeating slice of the model (such as one transformer encoder layer) is implemented in hardware, and the per-layer weights are cycled through external memory (DRAM/HBM). This technique makes it possible to map models to the FPGA that would otherwise be too large to fit.
MLO is currently an experimental feature and is not yet available on the main branch.
Overview¶
Large deep learning models such as transformers and SLMs (and LLMs, for that matter) often have millions or billions of parameters processed across many identical repeating layers. One solution would be to map these layers to multiple FPGAs, but the sheer number of layers (e.g., 32 in Phi-4-mini) makes it impractical to spread the design across so many devices. MLO overcomes this limitation by:
- Implementing a single repeating layer (e.g., one transformer encoder) in hardware
- Storing weights off-chip in high-bandwidth memory (HBM/DRAM)
- Streaming weights into the accelerator as needed for each layer
- Reusing the same hardware to process multiple layers sequentially
This approach trades some throughput for the ability to handle much larger models, making it ideal for larger transformer models such as SLMs, vision transformers, and other deep architectures.
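The control flow behind this reuse can be sketched in plain Python. Everything below is illustrative - run_layer_hw stands in for the fixed hardware loop body and is not a FINN or Brainsmith API:

```python
# Conceptual sketch of MLO execution: one hardware loop body, N weight sets.
# All names are illustrative stand-ins, not FINN or Brainsmith APIs.

def run_layer_hw(activations, weights):
    """Stand-in for the single hardware loop body (e.g., one encoder layer)."""
    # In hardware this is the fixed dataflow accelerator; here, a toy op.
    return [a + w for a, w in zip(activations, weights)]

def mlo_forward(activations, layer_weights):
    """Cycle every layer's weight set through the same hardware body."""
    for weights in layer_weights:  # streamed in from DRAM/HBM per iteration
        activations = run_layer_hw(activations, weights)
    return activations

# Two "layers" share one body; each iteration applies a different weight set.
print(mlo_forward([1, 2], [[10, 10], [100, 100]]))  # [111, 112]
```

The throughput trade-off is visible even in this sketch: the layers run sequentially through one body rather than as a deeper pipeline, so throughput drops while model capacity grows.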
How It Works¶
Loop Body Hierarchy¶
MLO works by identifying a repeating structure in the neural network and implementing only that structure in hardware. Currently, loop body discovery is not automated - users must manually identify one iteration of the repeating pattern and specify it using the loop_body_hierarchy parameter:
Manual Loop Body Identification:
The loop_body_hierarchy configuration must match the hierarchical naming structure in your ONNX model, which corresponds to the pkg.torch.onnx.name_scopes field used during model export. The loop rolling transformation uses these name scopes to determine which levels of hierarchy to include in the loop body.
⚠️ Important: You must use dynamo=True when exporting your PyTorch model to ONNX. Exporting with dynamo=True generates the metadata (name scopes) that MLO requires to identify repeating structures. Without this flag, the ONNX model will lack the hierarchical metadata needed for loop body discovery, and the MLO transformation will fail to locate the repeating patterns.
Technical Implementation: The node extraction mechanism is implemented in FINN's loop rolling transformations:
- Step Location: deps/finn/src/finn/builder/build_dataflow_steps.py
- Extraction Process: deps/finn/src/finn/transformation/fpgadataflow/loop_rolling.py (LoopExtraction class)
- Hierarchy Matching: deps/finn/src/finn/util/onnxscript_helpers.py (PytorchHierarchyNode class)
The extraction works by:
- Creating a hierarchy parser from PyTorch metadata (pkg.torch.onnx.name_scopes)
- Adding each ONNX node to the parser based on its hierarchy path
- Using prefix matching to find all nodes under the specified hierarchy paths
- Extracting matching nodes to create loop templates and removing originals from the main graph
This process requires the PyTorch exporter metadata generated by dynamo=True, which contains the module instance hierarchies that map ONNX nodes back to their originating PyTorch modules.
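As a rough illustration of the prefix matching, consider the sketch below. The scope lists mimic pkg.torch.onnx.name_scopes; the matching logic is a toy simplification of what PytorchHierarchyNode does, not the actual FINN class:

```python
# Simplified sketch of hierarchy prefix matching over name-scope metadata.
# The scope lists mimic pkg.torch.onnx.name_scopes; this is a toy version of
# what PytorchHierarchyNode does, not the actual FINN class.

def nodes_under(hierarchy_path, node_scopes):
    """Return node names whose scopes fall under hierarchy_path."""
    matched = []
    for name, scopes in node_scopes.items():
        # A node belongs to the loop body if one of its scopes equals the
        # target path or is nested beneath it.
        if any(s == hierarchy_path or s.startswith(hierarchy_path + ".")
               for s in scopes):
            matched.append(name)
    return matched

node_scopes = {
    "MatMul_0": ["encoder", "encoder.layer.0", "encoder.layer.0.attention"],
    "MatMul_9": ["encoder", "encoder.layer.1", "encoder.layer.1.attention"],
}
print(nodes_under("encoder.layer.0", node_scopes))  # ['MatMul_0']
```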
A configuration such as loop_body_hierarchy: [['encoder', 'encoder.layer.0']] tells Brainsmith:
- Look for a repeating pattern called 'encoder' (top-level hierarchy)
- The repeating unit is 'encoder.layer.0' (one complete encoder layer)
- All encoder layers (layer.0, layer.1, layer.2, etc.) will be processed using the same hardware
- The name scopes must exactly match the ONNX node names for proper identification
Multiple Hierarchy Groups¶
For models with multiple independent repeating structures, you can specify multiple hierarchy groups in the loop_body_hierarchy configuration:
finn_config:
  loop_body_hierarchy: [
    ['encoder', 'encoder.layer.0'],
    ['encoder', 'encoder.layer.1']
  ]
This advanced configuration enables the following:
- Multiple Loop Iterations in a Single Body - Include nodes from consecutive layers (e.g., layer.0 and layer.1) to unroll multiple iterations into the hardware implementation
- Fine-tuning Node Selection - Adjust which nodes are included in the loop body when metadata is lost or inexact during ONNX export
Multiple Group Behavior:
- The loop body will include all nodes belonging to every listed hierarchy region.
Hierarchy Level Specification¶
The loop_body_hierarchy can specify multiple levels of hierarchy to precisely control what gets included in the loop body:
Two-level hierarchy (simple case):
- Includes all nodes under encoder.layer.0.*
- Good for simple transformer architectures
Three-level hierarchy (precise control):
- Specifies the full path: model → encoder stack → specific layer
- Provides more precise control over node selection
- Useful for complex models with nested structures
The FINN loop rolling step will find all ONNX nodes whose names start with the final hierarchy level (e.g., bert.encoder.layer.0) and extract them as the loop body.
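As an illustration, the two cases might appear in a blueprint as follows (using the bert.encoder.layer.0 naming from this page):

```yaml
# Two-level hierarchy
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]

# Three-level hierarchy
finn_config:
  loop_body_hierarchy: [['bert', 'bert.encoder', 'bert.encoder.layer.0']]
```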
Loop Rolling Process¶
The loop rolling transformation (step_loop_rolling in FINN) performs these key operations:
- Parses the loop_body_hierarchy to identify which nodes belong to the repeating structure
- Extracts nodes by name scope matching - finds all ONNX nodes whose names match the specified hierarchy pattern (e.g., nodes starting with 'bert.encoder.layer.0')
- Generates loop iteration logic - creates control structures to iterate through all layers using the same hardware
- Sets up weight streaming infrastructure - configures memory interfaces to stream different weights for each iteration
- Updates folding configuration - modifies parallelization parameters to account for the loop structure
Loop Body Extraction Details¶
The extraction logic itself is implemented in the FINN library (finn.builder.build_dataflow_steps.step_loop_rolling). Conceptually, it performs the following operations:
Node Selection Process:
# Conceptual extraction logic (actual implementation in FINN)
def extract_loop_body_nodes(model, loop_body_hierarchy):
    """Extract nodes matching the loop body hierarchy pattern."""
    extracted_nodes = []
    # Get the target pattern from hierarchy (e.g., 'bert.encoder.layer.0')
    target_pattern = loop_body_hierarchy[0][-1]  # Final level
    # Find all nodes whose names start with the target pattern
    for node in model.graph.node:
        if node.name.startswith(target_pattern):
            extracted_nodes.append(node)
    return extracted_nodes
The metadata fields exported by PyTorch Dynamo are not always reliable and in some cases can be removed by optimization passes. When encountered, these issues are reported to the onnxscript team and are often resolved. However, we have tried to make the Loop Body Extraction process as robust as possible in the presence of missing metadata.
In some cases, the Loop Body Extraction process can recover nodes with missing metadata fields: it attempts to infer the missing hierarchy information by checking the metadata of the node's input and output nodes.
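A heavily simplified sketch of this neighbor-based inference is shown below. The dict-based node representation is purely illustrative; the real logic operates on ONNX IR nodes inside FINN's loop extraction code:

```python
# Heavily simplified sketch of neighbor-based metadata inference.
# Nodes are plain dicts here for illustration only; the real logic lives in
# FINN's loop extraction code and operates on ONNX IR nodes.

def infer_scopes(node, producers, consumers):
    """If a node has no scopes, borrow the common scopes of its neighbors."""
    if node.get("scopes"):
        return node["scopes"]
    neighbor_scopes = [n["scopes"] for n in producers + consumers
                       if n.get("scopes")]
    if not neighbor_scopes:
        return None  # nothing to infer from
    common = neighbor_scopes[0]
    for scopes in neighbor_scopes[1:]:
        if scopes != common:
            return None  # neighbors disagree; leave unresolved
    return common

a = {"name": "Mul_3", "scopes": ["encoder", "encoder.layer.0"]}
b = {"name": "Add_7", "scopes": None}  # metadata lost during export
c = {"name": "Relu_2", "scopes": ["encoder", "encoder.layer.0"]}
print(infer_scopes(b, [a], [c]))  # ['encoder', 'encoder.layer.0']
```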
Configuration¶
Basic MLO Setup¶
To enable MLO in your blueprint, add the loop_body_hierarchy configuration:
name: "BERT with MLO"
description: "BERT model with Multilayer Offload"
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]
  split_large_fifos: true
  fifosim_n_inferences: 2  # Speed up FIFO simulation
design_space:
  steps:
    - "qonnx_to_finn"
    - "bert_streamlining"
    - "infer_kernels"
    - "create_dataflow_partition"
    - "specialize_layers"
    - "loop_rolling"  # This step implements MLO
    - "target_fps_parallelization"
    - "apply_folding_config"
    # ... rest of pipeline
The easiest way to identify the proper loop body hierarchy is to open the model in Netron and check the values of the node metadata that you'd like to include in the loop body.
BERT MLO Example¶
For BERT models, a typical MLO configuration looks like:
# bert_mlo_demo.yaml
name: "BERT Demo"
description: "Hugging Face BERT model with MLO"
extends: "../../brainsmith/blueprints/bert.yaml"
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]
  split_large_fifos: true
  fifosim_n_inferences: 2
  verify_steps: ['folded_hls_cppsim', 'stitched_ip_rtlsim']
design_space:
  steps:
    - at_start:
        insert:
          - "bert_cleanup"
          - "remove_head"
          - "remove_tail"
          - "generate_reference_io"
    - at_end:
        insert: "shell_metadata_handover"
Example: BERT MLO Demo¶
The examples/bert/bert_mlo_demo.sh script demonstrates a complete MLO workflow:
#!/bin/bash
# BERT MLO Demo
# Generate folding configuration
python gen_folding_config.py \
--simd 4 \
--pe 4 \
--num_layers 2 \
-t 1 \
-o ./configs/bert_mlo_demo.json
# Run BERT demo with MLO
# Flags: -n attention heads, -l total layers, -z hidden size,
#        -i intermediate size, -b quantization bits, -q sequence length
python bert_demo.py \
    -o bert_mlo_demo \
    -n 4 \
    -l 2 \
    -z 64 \
    -i 256 \
    -b 8 \
    -q 32 \
    --blueprint ./bert_mlo_demo.yaml
This creates a BERT model with 2 encoder layers where only the first layer is implemented in hardware, and the second layer reuses the same hardware with different weights.
CRITICAL: ONNX Export Requirements
# When exporting your model to ONNX, you MUST use dynamo=True
# This generates the metadata (name scopes) that MLO requires for loop body discovery
import brevitas.onnx as bo
bo.export_qonnx(
    model,
    inputs,
    output_path,
    dynamo=True,  # Generates name scope metadata for MLO
    input_names=['input_ids'],
    opset_version=18,
    do_constant_folding=True
)
Alternative: Custom Loop Rolling for Non-Dynamo Export
If you cannot use dynamo=True (due to compatibility issues, model complexity, or other constraints), you can either add the metadata manually or implement a custom loop rolling step.
Adding Metadata Manually
If your ONNX model was exported without dynamo=True or the metadata was lost during optimization, you can manually add the required pkg.torch.onnx.name_scopes metadata to enable MLO. This approach requires modifying the ONNX model's metadata properties directly.
Step 1: Understanding the Metadata Structure
The pkg.torch.onnx.name_scopes metadata field contains hierarchical naming information that maps each ONNX node back to its originating PyTorch module. The metadata is stored as a list of strings representing the hierarchy path from the root module to the specific operation.
For example, in a BERT model:
# Layer 0 attention query node
['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.query']
# Layer 0 attention key node
['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.key']
# Layer 1 attention query node
['bert', 'bert.encoder', 'bert.encoder.layer.1', 'bert.encoder.layer.1.attention.self.query']
Step 2: Identify Your Model's Hierarchy
First, determine the hierarchical structure of your model:
import torch
# Example: Print your PyTorch model structure
model = YourModel()
for name, module in model.named_modules():
    print(name)
# Output might look like:
# encoder
# encoder.layer.0
# encoder.layer.0.attention
# encoder.layer.0.attention.self
# encoder.layer.1.attention
# encoder.layer.1.attention.self
Step 3: Add Metadata to ONNX Nodes
Use the following script to add metadata to your ONNX model:
import onnx
def add_name_scope_metadata(model_path, output_path, node_hierarchy_map):
    """
    Add pkg.torch.onnx.name_scopes metadata to ONNX nodes.

    Args:
        model_path: Path to input ONNX model
        output_path: Path to save modified ONNX model
        node_hierarchy_map: Dict mapping node names to hierarchy paths (as list of strings)
            e.g., {'MatMul_0': ['encoder', 'encoder.layer.0', 'encoder.layer.0.attention']}
    """
    model = onnx.load(model_path)
    for node in model.graph.node:
        if node.name in node_hierarchy_map:
            hierarchy_list = node_hierarchy_map[node.name]
            # Serialize the list of strings as the Dynamo exporter does
            hierarchy_str = str(hierarchy_list)
            # The Dynamo exporter stores name scopes in the node's
            # metadata_props (available in ONNX >= 1.16), not in node
            # attributes, so add or update the entry there
            for entry in node.metadata_props:
                if entry.key == "pkg.torch.onnx.name_scopes":
                    entry.value = hierarchy_str
                    break
            else:
                node.metadata_props.add(key="pkg.torch.onnx.name_scopes",
                                        value=hierarchy_str)
    onnx.save(model, output_path)
    print(f"Model with metadata saved to {output_path}")

# Example usage for a BERT model
node_hierarchy_map = {
    # Attention layer nodes
    'MatMul_0': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.query'],
    'MatMul_1': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.key'],
    'MatMul_2': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.self.value'],
    'MatMul_3': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.output.dense'],
    # Intermediate layer nodes
    'MatMul_4': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.intermediate.dense'],
    'MatMul_5': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.output.dense'],
    # LayerNorm nodes
    'LayerNormalization_0': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.attention.output.LayerNorm'],
    'LayerNormalization_1': ['bert', 'bert.encoder', 'bert.encoder.layer.0', 'bert.encoder.layer.0.output.LayerNorm'],
    # You only need to add metadata for the nodes used in the loop body template
}

add_name_scope_metadata(
    'model_without_metadata.onnx',
    'model_with_metadata.onnx',
    node_hierarchy_map
)
Step 4: Verify Metadata with Netron
After adding metadata, open the modified model in Netron and inspect node properties to verify the pkg.torch.onnx.name_scopes field appears correctly.
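You can also verify the metadata programmatically. The sketch below assumes the scopes live in each node's metadata_props entries (where the Dynamo exporter writes them); it duck-types the nodes, so it works on onnx NodeProto objects as well as the stand-ins used in the demo:

```python
from types import SimpleNamespace

def missing_name_scopes(nodes):
    """Return the names of nodes lacking the name-scope metadata entry."""
    missing = []
    for node in nodes:
        keys = {entry.key for entry in node.metadata_props}
        if "pkg.torch.onnx.name_scopes" not in keys:
            missing.append(node.name)
    return missing

# Tiny stand-in demo; for a real model, pass onnx.load(path).graph.node
ok = SimpleNamespace(
    name="MatMul_0",
    metadata_props=[SimpleNamespace(key="pkg.torch.onnx.name_scopes")],
)
bad = SimpleNamespace(name="Add_1", metadata_props=[])
print(missing_name_scopes([ok, bad]))  # ['Add_1']
```

An empty list means every node you passed in carries the metadata MLO needs.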
Step 5: Use in MLO Configuration
Once metadata is added, configure your blueprint with the appropriate loop_body_hierarchy:
finn_config:
  loop_body_hierarchy: [['encoder', 'encoder.layer.0']]  # Must match your hierarchy paths
Important Notes:
- Metadata must accurately reflect the repeating structure of your model
- All nodes within a layer should have consistent hierarchy prefixes
- Test with a small model (2-3 layers) before applying to larger models
- Incorrect metadata will cause loop body extraction to fail or extract wrong nodes
Custom Loop Rolling Step
If you cannot export via PyTorch Dynamo, you can write your own Loop Extraction transform and then leverage the existing Loop Rolling transform to create the FINNLoop ONNX node. At present, you'll need to copy the Loop Rolling step in FINN and replace the Loop Extraction functionality. In the future, we plan to update the Loop Rolling step to accept a custom Loop Extraction function.
The standard Loop Rolling build step consists of two transformations: Loop Body Extraction and Loop Rolling. Loop Body Extraction returns a LoopBodyTemplate object, which the LoopRolling transformation uses as a pattern to identify individual instances of each loop body. The LoopBodyTemplate object is created from an ONNX file that contains one copy of the loop body you'd like to create.
If you have a graph of the loop body or can easily create one, then you can simply create a custom Loop Rolling step in BrainSmith that creates the LoopBodyTemplate object from the ONNX file and passes it to the LoopRolling transformation as shown in the example code below.
Example: Custom Loop Rolling Step with Pre-built Loop Body Template
from brainsmith.core.plugins import step
from finn.transformation.fpgadataflow.loop_rolling import LoopBodyTemplate, LoopRolling

@step(name="custom_loop_rolling_with_template")
def custom_loop_rolling_with_template(model, cfg):
    """
    Custom loop rolling step that uses a pre-created loop body ONNX file.

    Use this approach when you have manually created or extracted the loop body
    graph and saved it to an ONNX file.
    """
    # Load the loop body template from a pre-created ONNX file.
    # This file should contain one complete iteration of your loop body.
    loop_body_template_path = "path/to/your/loop_body_template.onnx"
    loop_body_template = LoopBodyTemplate(loop_body_template_path)

    # Apply the loop rolling transformation using your custom template
    model = model.transform(LoopRolling(loop_body_template))
    return model
In this approach, you need to manually create loop_body_template.onnx containing one instance of your repeating layer structure. You can create this file by:
1. Extracting a subgraph from your full model using ONNX tools
2. Building it programmatically using ONNX IR or onnxscript
3. Exporting a single layer model from PyTorch
Otherwise, you can create a custom LoopBodyExtraction transform. One approach is to build a Python list of the ONNX nodes in the model that fully comprise one iteration of the loop body. You can then use that list to create a SubGraphView object, which in turn can be saved to an ONNX file and used to create the LoopBodyTemplate, as shown in the example code below.
Example: Custom Loop Extraction and Rolling
from brainsmith.core.plugins import step
from finn.transformation.fpgadataflow.loop_rolling import LoopBodyTemplate, LoopRolling
from finn.util import onnxscript_helpers as osh
import onnxscript
from onnxscript import ir
import onnx

class CustomLoopExtraction:
    """
    Custom loop body extraction that identifies loop body nodes
    without relying on PyTorch metadata.
    """

    def __init__(self, loop_body_hierarchy):
        self.loop_body_hierarchy = loop_body_hierarchy
        self.loop_body_template = None

    def extract_loop_body_nodes(self, graph, target_pattern):
        """
        Identify nodes that belong to the loop body.

        This is where you implement your custom logic to find the nodes.
        You can use pattern matching, graph analysis, or any other method.
        """
        extracted_nodes = []
        # Strategy 1: Simple name prefix matching
        for node in graph._nodes:
            if node.name.startswith(target_pattern):
                extracted_nodes.append(node)
        # Strategy 2: If prefix matching fails, try pattern in node name
        if not extracted_nodes:
            layer_id = target_pattern.split('.')[-1]
            for node in graph._nodes:
                if f".{layer_id}." in node.name or f"_{layer_id}_" in node.name:
                    extracted_nodes.append(node)
        return extracted_nodes

    def apply(self, model):
        """Extract loop body and create template file."""
        # Deserialize the model to ONNX IR
        model_ir = onnxscript.ir.serde.deserialize_model(model.model)
        graph = model_ir.graph

        # Get the target pattern from hierarchy
        target_pattern = self.loop_body_hierarchy[0][-1]

        # Extract nodes belonging to the loop body
        nodes = self.extract_loop_body_nodes(graph, target_pattern)
        if not nodes:
            raise ValueError(f"No nodes found matching pattern: {target_pattern}")
        print(f"Extracted {len(nodes)} nodes for loop body")

        # Create a SubGraphView containing only the loop body nodes
        loop_body_graph_view = osh.SubGraphView(graph, "loop-body", nodes)

        # Create an ONNX model from the subgraph
        loop_body_model = onnxscript.ir.Model(
            loop_body_graph_view,
            ir_version=model.model.ir_version
        )

        # Serialize and save the loop body template
        proto = onnxscript.ir.serde.serialize_model(loop_body_model)
        template_path = "loop-body-template.onnx"
        onnx.save(proto, template_path)
        print(f"Loop body template saved to: {template_path}")

        # Create the LoopBodyTemplate object
        self.loop_body_template = LoopBodyTemplate(template_path)
        return model

@step(name="custom_loop_rolling_full")
def custom_loop_rolling_full(model, cfg):
    """
    Complete custom loop rolling step with custom extraction.

    This approach:
    1. Uses custom logic to identify loop body nodes
    2. Creates a loop body template from those nodes
    3. Applies FINN's LoopRolling transformation
    """
    # Get loop body hierarchy from config
    hierarchy = cfg.loop_body_hierarchy if hasattr(cfg, 'loop_body_hierarchy') \
        else [['encoder', 'encoder.layer.0']]

    # Step 1: Custom extraction to create loop body template
    extractor = CustomLoopExtraction(hierarchy)
    model = extractor.apply(model)

    # Step 2: Apply FINN's loop rolling with the custom template
    if extractor.loop_body_template is None:
        raise ValueError("Loop body extraction failed - no template created")
    model = model.transform(LoopRolling(extractor.loop_body_template))

    print("Custom loop rolling completed successfully")
    return model
Key Points:
- CustomLoopExtraction.extract_loop_body_nodes(): This is where you implement your custom logic to identify which nodes belong to the loop body. The example shows simple name matching, but you can implement more sophisticated graph analysis.
- SubGraphView: This FINN utility class creates a view of a subgraph given a list of nodes. It automatically handles:
  - Finding all necessary inputs/outputs
  - Maintaining graph connectivity
  - Preserving node attributes and metadata
- LoopBodyTemplate: This class (from FINN) wraps the loop body ONNX file and provides the pattern matching infrastructure that LoopRolling needs.
- LoopRolling transformation: This is FINN's standard transformation that:
  - Finds all instances of the loop body pattern in your model
  - Replaces them with a single FINNLoop node
  - Sets up weight streaming infrastructure
  - Handles I/O normalization and type checking
Usage in Blueprint:
design_space:
  steps:
    - "qonnx_to_finn"
    - "bert_streamlining"
    - "infer_kernels"
    - "create_dataflow_partition"
    - "specialize_layers"
    - "custom_loop_rolling_full"  # Your custom step
    - "target_fps_parallelization"
    - "apply_folding_config"
Debugging MLO Issues¶
Common Problems¶
Missing or incorrect metadata (most common):
- Ensure ONNX export used dynamo=True to generate name scope metadata
- Verify the ONNX model contains proper hierarchical node names
- If unable to use dynamo export, implement a custom loop rolling step (see the Custom Loop Rolling Step section)
Missing Loop Body Nodes
If a node that should be in the loop body is not included during Loop Extraction, this can appear in loopbody_template.onnx as unexpected inputs and outputs to the loop body graph. Further, this can result in loop rolling failure or errors in subsequent build steps like step_create_dataflow_partition.
Sometimes a node in the middle of the loop body will be excluded from the loop body. This can result in a self-referencing loop error in step_create_dataflow_partition, where the partitioning process detects invalid circular dependencies.
Debugging Steps:
1. Open loopbody_template.onnx in your build directory using Netron
2. Check for unexpected graph inputs/outputs that should be internal to the loop body
3. Identify which nodes are missing by comparing against your expected layer structure
4. Adjust the loop_body_hierarchy configuration to include missing nodes:
- Try adding an additional hierarchy group for the missing node's namespace
- Use a broader hierarchy prefix to capture more nodes
- If using custom loop extraction, verify your node matching patterns
5. Verify metadata on the missing nodes (check pkg.torch.onnx.name_scopes field in Netron)
6. Rebuild and verify the loopbody_template.onnx contains all expected nodes
Incorrect loop body identification:
- Check loop_body_hierarchy matches your model structure
- Verify layer naming conventions in ONNX graph
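A quick way to check the match before a full build is to list the node names under a candidate prefix. This is a plain-Python sketch; feed it the node names from your ONNX graph:

```python
# Plain-Python sanity check for a candidate hierarchy prefix: which node
# names would be captured as the loop body, and which would be left out?
# For a real model, pass [n.name for n in onnx.load("model.onnx").graph.node].

def split_by_prefix(node_names, prefix):
    """Partition node names into (inside loop body, outside loop body)."""
    inside = [n for n in node_names if n.startswith(prefix)]
    outside = [n for n in node_names if not n.startswith(prefix)]
    return inside, outside

names = [
    "bert.encoder.layer.0.attention.MatMul",
    "bert.encoder.layer.0.output.Add",
    "bert.encoder.layer.1.attention.MatMul",
]
inside, outside = split_by_prefix(names, "bert.encoder.layer.0")
print(len(inside), len(outside))  # 2 1
```

If nodes you expect inside the loop body land in the outside list, the prefix (and thus loop_body_hierarchy) does not match your model's naming.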
Debug Tools¶
- Save intermediate models - Use save_intermediate_models: true
- Enable verification - Use RTL simulation to check correctness
- Memory tracing - Monitor weight loading patterns
- Performance counters - Track cycles, bandwidth utilization