1 Introduction[Intro]

HLSL is the GPU programming language provided in conjunction with the DirectX runtime. Over many years its use has expanded to cover every major rendering API across all major development platforms. Despite its popularity and long history, HLSL has never had a formal language specification. This document seeks to change that.

HLSL draws heavy inspiration originally from ISO C and later from ISO C++, with additions specific to graphics and parallel computation programming. The language is also influenced to a lesser degree by other popular graphics and parallel programming languages.

HLSL has two reference implementations which this specification draws heavily from. The original reference implementation, FXC, has been in use since DirectX 9. The more recent reference implementation, DXC, has been the primary shader compiler since DirectX 12.

In writing this specification, bias leans toward the language behavior of DXC rather than the behavior of FXC, although that can vary by context.

In very rare instances this specification is aspirational and may diverge from both reference implementation behaviors. This is done only where there is an intent to alter implementation behavior in the future. Since this document and the implementations are living sources, one or the other may be ahead in different regards at any point in time.

1.1 Scope[Intro.Scope]

This document specifies the requirements for implementations of HLSL. The HLSL specification is based on and highly influenced by the specifications for the C and C++ programming languages.

This document covers both the language grammar and semantics of HLSL, and (in later sections) the standard library of data types used in shader programming.

1.2 Normative References[Intro.Refs]

The following referenced documents provide significant influence on this document and should be used in conjunction with interpreting this standard.

  • ISO/IEC 9899, Programming languages - C

  • ISO/IEC 14882, Programming languages - C++

  • DirectX Specifications, https://microsoft.github.io/DirectX-Specs/

1.3 Terms and definitions[Intro.Terms]

This document aims to use terms consistent with their definitions in ISO C and ISO C++. In cases where the definitions are unclear, or where this document diverges from ISO C and ISO C++, the definitions in this section, the remaining sections in this chapter, and the attached glossary ([main]) supersede other sources.

1.4 Common Definitions[Intro.Defs]

The following definitions are consistent between HLSL and the ISO C and ISO C++ specifications; however, they are included here for reader convenience.

1.4.1 Correct Data[Intro.Defs.CorrectData]

Data is correct if it represents values that have specified or unspecified but not undefined behavior for all the operations in which it is used. Data that is the result of undefined behavior is not correct, and may be treated as undefined.

1.4.2 Diagnostic Message[Intro.Defs.Diags]

An implementation-defined message belonging to a subset of the implementation’s output messages which communicates diagnostic information to the user.

1.4.3 Ill-formed Program[Intro.Defs.IllFormed]

A program that is not well-formed, for which the implementation is expected to return unsuccessfully and produce one or more diagnostic messages.

1.4.4 Implementation-defined Behavior[Intro.Defs.ImpDef]

Behavior of a well-formed program and correct data which may vary by implementation; the implementation is expected to document the behavior.

1.4.5 Implementation Limits[Intro.Defs.ImpLimits]

Restrictions imposed upon programs by the implementation of either the compiler or runtime environment. The compiler may seek to surface runtime-imposed limits to the user for improved user experience.

1.4.6 Undefined Behavior[Intro.Defs.Undefined]

Behavior of invalid program constructs or incorrect data for which this standard either imposes no requirements or does not provide sufficient detail.

1.4.7 Unspecified Behavior[Intro.Defs.Unspecified]

Behavior of a well-formed program and correct data which may vary by implementation; the implementation is not expected to document the behavior.

1.4.8 Well-formed Program[Intro.Defs.WellFormed]

An HLSL program constructed according to the syntax rules, diagnosable semantic rules, and the One Definition Rule.

1.4.9 Runtime Implementation[Intro.Defs.Runtime]

A runtime implementation refers to a full-stack implementation of a software runtime that can facilitate the execution of HLSL programs. This broad definition includes libraries and device driver implementations. The HLSL specification does not distinguish between the user-facing programming interfaces and the vendor-specific backing implementation.

1.5 Runtime Targeting[Intro.Runtime]

HLSL emerged from the evolution of DirectX to grant greater control over GPU geometry and color processing. It gained popularity because it targeted a common hardware description which all conforming drivers were required to support. This common hardware description, called a Shader Model, is an integral part of the description of HLSL. Some HLSL features require specific Shader Model features, and are only supported by compilers when targeting those Shader Model versions or later.

1.6 SPMD Programming Model[Intro.Model]

HLSL uses a single program multiple data (SPMD) programming model where a program describes operations on a single element of data, but when the program executes it executes across more than one element at a time. This programming model is useful because GPUs are largely single instruction multiple data (SIMD) hardware architectures, where each instruction natively executes across multiple data elements at the same time.
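The SPMD model can be illustrated with a minimal compute shader sketch. The buffer name and entry point below are hypothetical; the point is that the program is written in terms of a single data element, and the runtime executes it across many lanes at once.

```hlsl
// A hypothetical device buffer bound by the host runtime.
RWStructuredBuffer<float> Values;

// The program body describes work on one element; the runtime launches
// it across many lanes, one element per lane.
[numthreads(64, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID) {
  // Each lane scales a single element of the data set.
  Values[DTid.x] = Values[DTid.x] * 2.0;
}
```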

There are many different terms of art for describing the elements of a GPU architecture and the way they relate to the SPMD programming model. In this document we use the terms as defined in the following subsections.

1.6.1 SPMD Terminology[Intro.Model.Terms]

1.6.1.1 Host and Device[Intro.Model.Terms.HostDevice]

HLSL is a data-parallel programming language designed for programming auxiliary processors in a larger system. In this context the host refers to the primary processing unit that runs the application, which in turn uses a runtime to execute HLSL programs on a supported device. There is no strict requirement that the host and device be different physical hardware, although they commonly are. The separation of host and device in this specification is useful for defining the execution and memory models as well as the specific semantics of language constructs.

1.6.1.2 Lane[Intro.Model.Terms.Lane]

A lane represents a single computed element in an SPMD program. In a traditional programming model it would be analogous to a thread of execution; however, it differs in one key way. In multi-threaded programming, threads advance independently of each other. In SPMD programs, a group of lanes may execute instructions in lockstep because each instruction may be a SIMD instruction computing the results for multiple lanes simultaneously, or synchronizing execution across multiple lanes or waves. A lane has an associated lane state which denotes the execution status of the lane (1.6.1.7).

1.6.1.3 Wave[Intro.Model.Terms.Wave]

A grouping of lanes for execution is called a wave. The size of a wave is defined as the maximum number of active lanes the wave supports. Wave sizes vary by hardware architecture and are required to be powers of two. The number of active lanes in a wave can be any value between one and the wave size.

Some hardware implementations support multiple wave sizes. There is no overall minimum wave size requirement, although some language features do have minimum wave size requirements.

HLSL is explicitly designed to run on hardware with arbitrary wave sizes. Hardware architectures may implement waves as single instruction multiple thread (SIMT), where each thread executes instructions in lockstep; this is not a requirement of the model. Some constructs in HLSL require synchronized execution. Such constructs will explicitly specify that requirement.
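A sketch of wave-cooperative computation, using the wave intrinsics introduced with Shader Model 6.0. The helper function name is hypothetical; the intrinsics are part of the HLSL standard library.

```hlsl
// Average a value across the active lanes of the calling wave.
// WaveActiveSum reduces a value across all active lanes;
// WaveActiveCountBits(true) counts the active lanes, which may be
// fewer than the hardware wave size.
float WaveAverage(float Value) {
  return WaveActiveSum(Value) / WaveActiveCountBits(true);
}
```

Because the wave size is a hardware property, code like this must not assume any particular lane count; it works for any power-of-two wave size.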

1.6.1.4 Quad[Intro.Model.Terms.Quad]

A quad is a subdivision of four lanes in a wave which are computing adjacent values. In pixel shaders, a quad may represent four adjacent pixels, and quad operations allow passing data between adjacent lanes. In compute shaders, quads may be one- or two-dimensional depending on the workload dimensionality. Quad operations require four active lanes.
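A sketch of a quad operation, assuming the Shader Model 6.0 quad intrinsics. The function name is hypothetical.

```hlsl
// Compute the difference between this lane's value and the value held
// by the horizontally adjacent lane in the same quad. In a pixel
// shader this approximates a screen-space horizontal derivative.
// All four lanes of the quad must be active for this to be valid.
float HorizontalDelta(float Value) {
  return QuadReadAcrossX(Value) - Value;
}
```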

1.6.1.5 Threadgroup[Intro.Model.Terms.Group]

A grouping of lanes executing the same shader to produce a combined result is called a threadgroup. Threadgroups are independent of SIMD hardware specifications. The dimensions of a threadgroup are defined in three dimensions. The maximum extent along each dimension of a threadgroup, and the total size of a threadgroup, are implementation limits defined by the runtime and enforced by the compiler.

If a threadgroup’s size is smaller than the wave size, or is not an even multiple of the wave size, the remaining lanes are inactive.
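A threadgroup's dimensions are declared on the shader entry point. This hypothetical compute entry declares an 8 x 8 x 1 threadgroup of 64 lanes:

```hlsl
[numthreads(8, 8, 1)]  // 8 * 8 * 1 = 64 lanes per threadgroup
void main(uint3 GTid : SV_GroupThreadID,  // lane's position within the group
          uint3 Gid  : SV_GroupID) {      // group's position within the dispatch
  // On hardware with a wave size of 32, this group fills two waves.
  // On hardware with a wave size of 128, the remaining 64 hardware
  // lanes of the wave are inactive.
}
```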

1.6.1.6 Dispatch[Intro.Model.Terms.Dispatch]

A grouping of threadgroups which represents the full execution of an HLSL program and produces a completed result for all input data elements.

1.6.1.7 Lane States[Intro.Model.Terms.LaneState]

Lanes may be in one of four primary states: active, helper, inactive, and predicated off.

An active lane is enabled to perform computations and produce output results based on the initial launch conditions and program control flow.

A helper lane is a lane which would not be executed by the initial launch conditions, except that its computations are required for adjacent-pixel operations in pixel fragment shaders. A helper lane executes all computations but does not perform writes to buffers, and any outputs it produces are discarded. Helper lanes may be required for lane-cooperative operations to execute correctly.
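Helper lanes exist so that derivative operations always have quad neighbors to read from. A sketch, with hypothetical input and output semantics:

```hlsl
// In a pixel shader, ddx differences a value against the horizontally
// adjacent lane in the quad. When a primitive only partially covers a
// quad, helper lanes compute the neighboring values; their own outputs
// are discarded.
float4 main(float2 UV : TEXCOORD0) : SV_Target {
  float Rate = ddx(UV.x);  // screen-space rate of change of UV.x
  return float4(Rate, 0.0, 0.0, 1.0);
}
```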

An inactive lane is a lane that is not executed by the initial launch conditions. This can occur when there are insufficient inputs to fill all lanes in the wave, or to reduce per-thread memory requirements or register pressure.

A predicated off lane is a lane that is not being executed due to program control flow. A lane may be predicated off when control flow for the lanes in a wave diverges and one or more lanes are temporarily not executing.

Lanes transition between these states during execution: a lane launches as active, helper, or inactive, and an executing lane may be temporarily predicated off by divergent control flow, returning to its prior state when control flow reconverges.

1.6.2 spmd Execution Model[Intro.Model.Exec]

A runtime implementation shall provide an implementation-defined mechanism for defining a dispatch. A runtime shall manage hardware resources and schedule execution to conform to the behaviors defined in this specification in an implementation-defined way. A runtime implementation may sort the threadgroups of a dispatch into waves in an implementation-defined way. During execution, no guarantees are made that all lanes in a wave are actively executing.

Wave, quad, and threadgroup operations require execution synchronization of applicable active and helper lanes, as defined by the individual operation.

1.6.3 Optimization Restrictions[Intro.Model.Restrictions]

An optimizing compiler may not transform code in a way that changes the behavior of a well-formed program, except in the presence of implementation-defined or unspecified behavior.

The presence of wave, quad, or threadgroup operations may further limit the valid transformations of a program. Specifically, control flow transformations which change which lanes, quads, or waves are actively executing are illegal in the presence of cooperative operations if they alter the behavior of the program.

1.7 HLSL Memory Models[Intro.Memory]

Memory accesses for Shader Model 5.0 and earlier operate on 128-bit slots aligned on 128-bit boundaries. This design optimized for the common case in early shaders, where the data being processed on the GPU was usually 4-element vectors of 32-bit data types.

On modern hardware, memory access restrictions are loosened: reads of 32-bit multiples are supported starting with Shader Model 5.1, and reads of 16-bit multiples are supported starting with Shader Model 6.0. Shader Model features are fully documented in the DirectX Specifications, and this document will not attempt to elaborate further.

1.7.1 Memory Spaces[Intro.Memory.Spaces]

HLSL programs manipulate data stored in four distinct memory spaces: thread, threadgroup, device, and constant.

1.7.1.1 Thread Memory[Intro.Memory.Spaces.Thread]

Thread memory is local to the lane. It is the default memory space used to store local variables. Thread memory cannot be directly read by other lanes without the use of intrinsics to synchronize execution and memory.

1.7.1.2 Threadgroup Memory[Intro.Memory.Spaces.Group]

Threadgroup memory is denoted in HLSL with the groupshared keyword. The underlying memory for any declaration annotated with groupshared is shared across an entire threadgroup. Reads and writes to threadgroup memory may occur in any order except as restricted by synchronization intrinsics or other memory annotations.
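A sketch of threadgroup memory usage; the buffer and array names are hypothetical, while groupshared and GroupMemoryBarrierWithGroupSync are standard HLSL constructs.

```hlsl
RWStructuredBuffer<float> Output;  // hypothetical device buffer

// Shared by every lane in the threadgroup.
groupshared float Tile[64];

[numthreads(64, 1, 1)]
void main(uint GI : SV_GroupIndex) {
  Tile[GI] = (float)GI;
  // Without a barrier, reads of Tile written by other lanes may occur
  // in any order. The barrier makes all prior writes to threadgroup
  // memory visible group-wide and synchronizes execution.
  GroupMemoryBarrierWithGroupSync();
  // Each lane may now safely read a value written by another lane.
  Output[GI] = Tile[63 - GI];
}
```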

1.7.1.3 Device Memory[Intro.Memory.Spaces.Device]

Device memory is memory available to all lanes executing on the device. This memory may be read or written by multiple threadgroups that are executing concurrently. Reads and writes to device memory may occur in any order except as restricted by synchronization intrinsics or other memory annotations. Some device memory may be visible to the host. Device memory that is visible to the host may have additional synchronization concerns for host visibility.

1.7.1.4 Constant Memory[Intro.Memory.Spaces.Constant]

Constant memory is similar to device memory in that it is available to all lanes executing on the device. Constant memory is read-only, and an implementation may assume that constant memory is immutable and does not change during execution.
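In practice, constant memory is commonly declared with a cbuffer. A sketch with hypothetical member names:

```hlsl
// The contents are bound by the host and are immutable for the
// duration of execution, so an implementation may freely cache them.
cbuffer SceneConstants {
  float4x4 ViewProjection;  // hypothetical transform matrix
  float Time;               // hypothetical scalar constant
};
```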

1.7.2 Memory Alignment[Intro.Memory.Alignment]

TODO

The alignment requirement of an offset into the device memory space is the size in bytes of the largest scalar type contained in the given aggregate type.
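A worked example of that rule, using a hypothetical aggregate (16-bit types such as half require a Shader Model 6.2 target with native 16-bit types enabled):

```hlsl
// The largest scalar contained anywhere in this aggregate is a 32-bit
// float (a float3 is three 32-bit scalars, not one 12-byte scalar), so
// under the rule above an offset to a Particle in device memory must
// be aligned to 4 bytes.
struct Particle {
  float3 Position;  // three 32-bit scalars
  half Weight;      // one 16-bit scalar
};
```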