NVIDIA DLSS 4
I. Introduction
The foundational principle of DLSS is to decouple the internal rendering resolution from the final display resolution. It allows the graphics pipeline to render the primary 3D scene at a lower internal resolution, thereby significantly reducing the computational workload on the GPU. Subsequently, sophisticated AI algorithms, executed on specialized hardware units known as Tensor Cores within NVIDIA’s RTX series of GPUs, are employed to intelligently reconstruct a high-quality image at the target display resolution. Over its iterations, DLSS has expanded its capabilities significantly beyond mere super-resolution upscaling.
The advent and evolution of DLSS signify a paradigm shift in how rendering efficiency is approached. Rather than solely relying on brute-force computation for every pixel at the native display resolution, AI facilitates an intelligent reconstruction from a reduced workload. This strategic reallocation of rendering budget frees up GPU resources, which can then be utilized to achieve higher frame rates, enable more demanding graphical settings (such as higher levels of ray tracing), or drive higher display resolutions, all within the same performance envelope.
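To make the resolution decoupling concrete, the sketch below computes the internal render resolution for a 4K target using the per-axis scale factors commonly cited for the standard DLSS quality modes. Treat the factors and the mode names as illustrative defaults rather than a normative specification:

```python
# Sketch: decoupling internal render resolution from display resolution.
# Per-axis scale factors below are the commonly cited defaults for the
# standard DLSS quality modes (illustrative, not normative).

DLSS_MODE_SCALE = {
    "Quality":           0.667,  # ~1/1.5 per axis
    "Balanced":          0.580,
    "Performance":       0.500,  # 1/2 per axis -> 1/4 of the pixels shaded
    "Ultra Performance": 0.333,  # ~1/3 per axis -> ~1/9 of the pixels shaded
}

def internal_resolution(display_w: int, display_h: int, mode: str) -> tuple:
    """Internal resolution the engine shades before AI reconstruction."""
    s = DLSS_MODE_SCALE[mode]
    return round(display_w * s), round(display_h * s)

for mode in DLSS_MODE_SCALE:
    w, h = internal_resolution(3840, 2160, mode)
    saved = 1.0 - (w * h) / (3840 * 2160)
    print(f"{mode:>17}: {w}x{h}  (~{saved:.0%} fewer pixels shaded)")
```

The freed shading budget is exactly what the text describes: it can be spent on higher frame rates, heavier ray tracing, or a higher output resolution.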
Furthermore, the trajectory of DLSS highlights a critical co-evolutionary relationship between AI algorithms, GPU architectures, and game engine integration. The capabilities and performance of each DLSS iteration are not merely a product of more sophisticated AI models but are intrinsically linked to advancements in NVIDIA’s GPU hardware—particularly the evolution of Tensor Core capabilities and the introduction of specialized hardware units like the Optical Flow Accelerator in the Ada Lovelace architecture for DLSS 3.0 Frame Generation, and new features in the Blackwell architecture for DLSS 4.
II. Evolution of NVIDIA DLSS
NVIDIA’s Deep Learning Super Sampling has undergone a significant evolutionary journey since its inception, with each major iteration addressing limitations of its predecessor and introducing new AI-driven capabilities. This progression reflects a continuous refinement of AI models, a deeper integration with the graphics pipeline, and an increasing synergy with NVIDIA’s GPU hardware advancements.
- DLSS 1.x (2019): The Spatial Pioneer. DLSS 1.0 was primarily a spatial image upscaler. It employed convolutional autoencoder neural networks to achieve its upscaling. A defining characteristic of this initial version was its per-game training requirement: NVIDIA had to train a unique AI model for each supported game. This training process involved feeding the neural network vast numbers of aliased low-resolution frames from the specific game, alongside corresponding “ground truth” high-resolution reference images. These reference images were typically generated using computationally intensive methods like extreme super-sampling (e.g., 64 samples per pixel) on NVIDIA’s supercomputer infrastructure. Some analyses suggest DLSS 1.0 operated via a two-stage process: an initial image enhancement network that utilized the current frame and motion vectors for tasks like edge enhancement and rudimentary spatial anti-aliasing, followed by a separate upscaling network that operated primarily on the single raw low-resolution frame to produce the final output resolution.
- DLSS 2.x (2020): The Temporal Revolution and Generalization. DLSS 2.0 represented a major architectural overhaul and a paradigm shift from its predecessor. It introduced a generalized AI model and transitioned to a temporal anti-aliasing upsampling (TAAU) technique, extensively utilizing data from previously rendered frames, in addition to the current frame, to inform the reconstruction process. The AI model was a single, generalized convolutional autoencoder network. Crucially, this network was trained on a diverse dataset of non-game-specific content (e.g., thousands of high-resolution images rendered offline with high sample counts), allowing it to work effectively across a wide variety of games and visual styles without per-game retraining. DLSS 2.0 employed temporal feedback: the AI network takes the low-resolution current frame and the high-resolution output from the previous frame (reprojected using motion vectors) and determines, on a pixel-by-pixel basis, how to generate a higher quality current frame (a simplified reprojection-and-blend sketch follows this list).
- DLSS 3.0 (2022): Optical Multi Frame Generation (AI Frame Generation). DLSS 3.0 builds upon the Super Resolution capabilities of DLSS 2.x and introduces a novel AI-powered Frame Generation technique. This technology synthesizes entirely new frames, inserting them between traditionally rendered (and super-resolved) frames, rather than just upscaling pixels within existing frames. This approach can lead to dramatic increases in displayed FPS, particularly beneficial in CPU-bound scenarios where traditional rendering is limited by the game engine’s processing speed. The frame generation component of DLSS 3.0 employs a convolutional autoencoder. This neural network takes current and prior game frames, an optical flow field, and game engine data (such as motion vectors and depth) as inputs to predict and generate the intermediate frame. The Optical Flow Accelerator (OFA) is another crucial hardware innovation, introduced with the NVIDIA Ada Lovelace architecture (found in GeForce RTX 40 Series GPUs). The OFA is a dedicated unit designed to analyze two sequential in-game frames and calculate a dense optical flow field. This field captures pixel-level motion for elements that traditional game engine motion vectors might not accurately model, such as particles, reflections, shadows, and lighting effects. Because the OFA was a hardware-exclusive feature, DLSS 3.0 Frame Generation was initially available only on RTX 40 Series GPUs.
- DLSS 3.5 (2023): Ray Reconstruction (RR). DLSS 3.5 introduced Ray Reconstruction, an AI-powered neural rendering technique specifically designed to improve the image quality of ray-traced effects. It achieves this by replacing the multiple, often hand-tuned denoisers traditionally used in ray tracing pipelines with a single, more advanced, unified AI network. The Ray Reconstruction AI model was trained on significantly more data—reportedly 5x more than the models used in DLSS 3.0. This extensive training enables the AI to recognize various ray-traced effects (such as global illumination, ambient occlusion, reflections, and shadows) with greater accuracy. It makes more intelligent decisions about how to utilize temporal and spatial data from the noisy ray-traced input, and it is better at retaining high-frequency information, which is crucial for subsequent upscaling stages.
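The temporal feedback that DLSS 2.x introduced generalizes the classical TAAU skeleton sketched below. This is a minimal illustration assuming same-resolution buffers and nearest-neighbor resampling; the real pipeline warps a high-resolution history against a low-resolution current frame and, critically, replaces the fixed blend factor with learned, per-pixel decisions:

```python
import numpy as np

def reproject_history(history: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Fetch last frame's output at the location each pixel 'came from'.

    history: (H, W, 3) previous output
    motion:  (H, W, 2) per-pixel motion vectors in pixels (dx, dy),
             pointing from the current frame back to the previous one
    """
    h, w = history.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip((xs + motion[..., 0]).round().astype(int), 0, w - 1)
    src_y = np.clip((ys + motion[..., 1]).round().astype(int), 0, h - 1)
    return history[src_y, src_x]

def temporal_accumulate(current: np.ndarray, history: np.ndarray,
                        motion: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Blend the new frame into the reprojected history (exponential average).
    DLSS 2.x's network effectively replaces this hand-tuned blend with a
    learned decision about how much history to trust at each pixel."""
    warped = reproject_history(history, motion)
    return alpha * current + (1.0 - alpha) * warped
```

The hard part, and the part the AI model handles, is rejecting stale history where motion vectors are wrong or disocclusions occur; a naive fixed blend like the one above is what produces the ghosting classic TAA is known for.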
The evolution from DLSS 1.0 to 3.5 showcases a clear pattern of iterative problem-solving and increasing specialization. Each version targeted specific limitations of its predecessor—DLSS 1.0’s spatial-only upscaling and per-game training were addressed by DLSS 2.0’s temporal accumulation and generalized model. DLSS 2.0’s residual temporal artifacts and the desire for higher raw FPS paved the way for DLSS 3.0’s Frame Generation, which, in turn, introduced new considerations around latency and UI handling. DLSS 3.5 then focused on a very specific and challenging aspect of modern graphics: the quality of ray tracing denoising. This progression has resulted in a suite of specialized AI models (Super Resolution, Frame Generation, Ray Reconstruction) rather than a single, monolithic AI attempting to solve all rendering challenges. This modularity appears to be a key strategy in tackling the immense complexity of real-time graphics.
Concurrently, each major DLSS iteration has become more deeply intertwined with specific NVIDIA GPU hardware capabilities, moving beyond general compute. While Tensor Cores have been foundational to all DLSS versions, DLSS 3.0’s Frame Generation was made practical and performant on the RTX 40 Series due to the dedicated Optical Flow Accelerator. This hardware unit provided crucial optical flow data at high speed, a task that might have been too slow or of lower quality if attempted purely in software on previous-generation hardware. This trend indicates that NVIDIA is not just developing AI algorithms in isolation but is co-designing them with hardware accelerators to achieve performance targets that would otherwise be unattainable. This creates a powerful hardware-software ecosystem but also tends to tie the most advanced features to the latest hardware generations.
As DLSS has evolved to boost GPU rendering throughput (via Super Resolution) and apparent frame rates (via Frame Generation), other system bottlenecks, such as CPU performance, memory bandwidth, and input latency, have become more prominent. NVIDIA’s introduction of Reflex as an integral part of DLSS 3.0, and its continued emphasis in the DLSS 4 framework, demonstrates an awareness that true perceived performance is a holistic characteristic of the entire system. Super Resolution eases the load on the GPU’s shading units. Frame Generation can even provide benefits in CPU-bound scenarios by synthesizing frames largely independently of the main game simulation loop. However, merely increasing FPS numbers is insufficient if system latency increases to a detrimental degree or if critical elements like the UI become visually compromised. The mandatory inclusion of Reflex with DLSS 3.0 Frame Generation was a direct acknowledgment that the latency introduced by frame interpolation required active mitigation. This holistic view, considering the entire end-to-end pipeline from player input to final display, is crucial for delivering a genuinely enhanced user experience.
III. Deep Dive: NVIDIA DLSS 4
NVIDIA DLSS 4 represents the latest iteration in the company’s suite of AI-driven rendering technologies, introducing significant architectural advancements and new features aimed at further enhancing image quality, frame rates, and responsiveness in real-time graphics applications. This version builds upon the foundations laid by its predecessors, particularly DLSS 3.x, while incorporating novel AI models and deeper hardware integration with the NVIDIA Blackwell GPU architecture.
A hallmark of DLSS 4 is the strategic shift in its AI model architecture for key components and the introduction of more sophisticated frame generation capabilities.
- Transformer-based AI Models for Super Resolution (SR) and Ray Reconstruction (RR): DLSS 4 marks a pivotal transition from the predominantly Convolutional Neural Network (CNN) based architectures used in previous DLSS versions for Super Resolution and Ray Reconstruction to transformer-based models. Transformers, which have demonstrated remarkable success in fields like natural language processing and offline image generation, employ attention mechanisms. These mechanisms allow the model to dynamically weigh the importance of different parts of the input data, enabling it to capture global dependencies and long-range relationships within an image or across frames more effectively than traditional CNNs, which are inherently more focused on local features (a minimal attention sketch follows this list). This architectural change is anticipated to lead to improved image stability, more effective reduction of ghosting artifacts, better preservation of detail in motion, and smoother edges. The new transformer model for SR is reported to involve four times the number of compute operations compared to its predecessor, but it has been co-designed with the enhanced Tensor Cores of the Blackwell architecture to maintain high efficiency. The adoption of transformers also suggests a path towards models that can scale more effectively with larger and more diverse training datasets, potentially leading to continuous improvements in generalization and fidelity.
- Multi Frame Generation (MFG): DLSS 4 evolves the Frame Generation capabilities of DLSS 3.0 with a new technique termed Multi Frame Generation (MFG). While DLSS 3.0 generated one additional AI frame for every rendered frame, DLSS 4’s MFG aims to generate up to three additional frames for every one traditionally rendered frame. When combined with DLSS Super Resolution (which itself might be reconstructing a significant portion of the rendered frame’s pixels from a lower internal resolution), MFG has the potential to achieve up to an 8x increase in rendering efficiency compared to brute-force native resolution rendering. The neural architecture for MFG has been redesigned for efficiency and improved quality. It splits the neural component of DLSS 3.0’s Frame Generation into two parts: a larger, more computationally intensive part that runs once per pair of input (rendered) frames, with its output being reusable for the generation of multiple intermediate frames; and a much smaller, lighter part that runs once for every generated output frame. This split architecture is a key optimization that allows for reduced latency and improved efficiency, making the generation of multiple frames feasible within tight real-time budgets (the split schedule is illustrated in a sketch after this subsection).
- AI-based Optical Flow: A significant architectural change in DLSS 4’s Frame Generation pipeline is the replacement of the hardware-based Optical Flow Accelerator (OFA)—a feature of the Ada Lovelace architecture used by DLSS 3.0—with a highly efficient AI model dedicated to calculating optical flow. This new AI-driven approach to optical flow is reported to be 40% faster and use 30% less VRAM for the frame generation model compared to the previous OFA-dependent model. Furthermore, it is designed to offer improved image quality, particularly in challenging scenarios like the rendering of particle effects, due to more refined flow estimation capabilities. This AI optical flow model only needs to execute once per rendered frame to support the generation of multiple intermediate frames. This shift from a fixed-function hardware unit to a continuously improvable AI model for optical flow represents a notable trend in AI graphics.
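The locality contrast between CNNs and transformers can be seen in a minimal scaled dot-product attention sketch. This is illustrative only; NVIDIA has not published the DLSS 4 model architecture beyond the description above, and a real model would use learned projection matrices rather than the identity projections used here for brevity:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention over a sequence of
    pixel/patch feature vectors x of shape (N, d). Unlike a convolution,
    every output position can draw on every input position, which is the
    property credited for better global stability and detail retention."""
    d = x.shape[-1]
    q, k, v = x, x, x                         # real models learn W_q, W_k, W_v
    scores = q @ k.T / np.sqrt(d)             # (N, N) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                        # each output mixes ALL inputs

out = self_attention(np.random.rand(16, 8))  # 16 patches, 8-dim features
```

A convolution with a 3x3 kernel, by contrast, can only mix a pixel with its immediate neighbors per layer; global context must be built up slowly through depth, which is one reason CNN-based upscalers struggle with long-range temporal stability.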
These architectural innovations—the adoption of transformers, the introduction of multi-stage frame generation, and the use of AI for optical flow—collectively aim to deliver higher frame rates and superior image quality by leveraging more sophisticated AI paradigms and co-designing these algorithms with new GPU hardware capabilities.
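The split MFG schedule described above can be illustrated with the toy sketch below. All function and type names are placeholders invented for this illustration, not NVIDIA API; the point is purely the cost structure, with the heavy pass paid once per rendered pair and only a light pass per generated frame:

```python
from dataclasses import dataclass

@dataclass
class SharedFeatures:
    """Stand-in for the reusable output of the heavy per-pair pass
    (AI optical flow field plus intermediate features)."""
    flow: object
    features: object

def heavy_pass(prev_frame, curr_frame) -> SharedFeatures:
    # Placeholder: the larger network (plus the AI optical flow model),
    # executed ONCE per pair of rendered frames.
    return SharedFeatures(flow=None, features=(prev_frame, curr_frame))

def light_pass(shared: SharedFeatures, t: float):
    # Placeholder: the much smaller network, executed once PER generated frame.
    return f"generated frame at t={t:.2f}"

def multi_frame_generation(prev_frame, curr_frame, num_generated: int = 3):
    shared = heavy_pass(prev_frame, curr_frame)          # cost paid once
    return [light_pass(shared, i / (num_generated + 1))  # cheap per frame
            for i in range(1, num_generated + 1)]

print(multi_frame_generation("F0", "F1"))  # 3 intermediates between F0 and F1
```

Amortizing the expensive analysis across up to three generated frames is what keeps the per-displayed-frame cost low enough for real-time budgets.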
Reflex Integration and Frame Warp
Minimizing latency remains a critical concern, especially when employing frame generation techniques. NVIDIA Reflex, which optimizes the rendering pipeline to reduce system latency, continues to be an integral part of the DLSS framework. DLSS 4 introduces a new enhancement to this system: Reflex Frame Warp.
Reflex Frame Warp is described as a late-stage reprojection technique. Its core function is to update the most recently rendered frame based on the very latest player input (e.g., mouse movement, camera adjustments) immediately before that frame is sent to the display. The CPU calculates the new camera position from the latest input; Frame Warp then samples this new position and “warps” the frame just rendered by the GPU to align with the updated perspective. This ensures that the image viewed by the player reflects their most recent actions as closely as possible, counteracting the latency that multi-stage rendering and frame generation pipelines can introduce.
A key challenge with such warping techniques is handling disocclusions—areas of the scene that become newly visible due to the shift in camera perspective and for which no pixel data exists in the original rendered frame. Reflex Frame Warp addresses this through a combination of strategies: minimizing their occurrence by rendering a guard band around the screen border and using layered rendering, and employing predictive rendering. In predictive rendering, camera movement is extrapolated from user input, and the frame is initially rendered at this predicted future position. This predicted frame is then warped to the true, most current viewpoint before display, correcting any deviations and significantly reducing the average size of disocclusions with minimal performance impact. For any remaining holes created by the reprojection, Frame Warp utilizes a latency-optimized AI inpainting approach. This inpainting algorithm incorporates historical frame data, G-buffer information from the predictive rendering pass, and information about the upcoming camera position to reconstruct the missing areas, striving for visual consistency while dynamically adjusting algorithm fidelity to maximize latency savings.
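A toy version of the core idea is sketched below. It approximates a small late camera rotation as a 2D image shift and exposes the disocclusion holes a real system must fill; actual Frame Warp performs a proper depth-aware reprojection and uses the guard-band, predictive-rendering, and inpainting strategies described above:

```python
import numpy as np

def late_warp_pan(frame: np.ndarray, yaw_px: int, pitch_px: int):
    """Toy late-stage warp: shift the rendered frame by (yaw_px, pitch_px)
    pixels to match the latest input, returning the warped frame and a
    boolean mask of disoccluded pixels that inpainting would have to fill."""
    h, w = frame.shape[:2]
    warped = np.zeros_like(frame)
    hole = np.ones((h, w), dtype=bool)
    # The region of the rendered frame that stays on screen after the shift.
    src = frame[max(0, -pitch_px): h - max(0, pitch_px),
                max(0, -yaw_px):   w - max(0, yaw_px)]
    warped[max(0, pitch_px): max(0, pitch_px) + src.shape[0],
           max(0, yaw_px):   max(0, yaw_px) + src.shape[1]] = src
    hole[max(0, pitch_px): max(0, pitch_px) + src.shape[0],
         max(0, yaw_px):   max(0, yaw_px) + src.shape[1]] = False
    return warped, hole

img = np.arange(12, dtype=float).reshape(3, 4)
warped, hole = late_warp_pan(img, yaw_px=1, pitch_px=0)  # holes on the left edge
```

Even this toy makes the trade-off visible: the larger the gap between render-time and scan-out camera poses, the larger the hole region, which is exactly why predictive rendering aims to shrink the average disocclusion size.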
Reflex Frame Warp is thus positioned as a critical technology for maintaining a responsive gaming experience, especially when used in conjunction with Multi Frame Generation. It aims to make the significantly increased frame rates delivered by MFG feel immediate and connected to player input.
Pipeline Interface: Inputs, Processing, and Outputs
The effective operation of DLSS 4 relies on a sophisticated interplay of data inputs from the game engine and graphics pipeline, specific processing injection points, and carefully constructed outputs. (A hypothetical frame-loop sketch follows these lists.)
- Inputs:
- Super Resolution (Transformer-based): Likely requires similar inputs to previous DLSS SR versions: low-resolution current frames, motion vectors, depth information, exposure data, and a history buffer containing previous high-resolution outputs for temporal feedback. The transformer model is noted to be better at handling scenarios like animated textures, where it can intelligently decide to ignore motion vectors if they are deemed unreliable for that content.
- Ray Reconstruction (Transformer-based): Takes the noisy, un-denoised output from ray tracing passes as a primary input. It also likely utilizes other G-buffer data (normals, roughness, etc.) and temporal information (motion vectors, history frames) to provide context for the AI to accurately denoise and reconstruct ray-traced effects like global illumination, reflections, and shadows.
- Multi Frame Generation (AI Optical Flow): Requires two sequential rendered (and typically super-resolved) frames as a basis. Instead of OFA output, it uses an internally generated optical flow field from its AI optical flow model. It also utilizes game engine motion vectors and depth data, similar to DLSS 3.0 Frame Generation. The split architecture allows for the output of the first, larger network part (run once per input pair) to be reused for generating multiple intermediate frames.
- Reflex Frame Warp: Operates on the final rendered frame from the GPU. It also takes the latest mouse/controller input from the system, the new camera position calculated by the CPU based on this input, and leverages historical frame data and G-buffers from its predictive rendering stage for inpainting.
- Processing Injection Points:
- Super Resolution and Ray Reconstruction: These processes likely occur after the main geometry processing and initial lighting/shading passes (which produce the low-resolution frame and noisy ray-traced data) but before final screen-space post-processing effects (like tone mapping or film grain) and UI rendering. The DLSS 4 research mentions that ray-traced shading (reflections, GI, shadows) is processed at a reduced resolution and then passed through a super resolution model.
- Multi Frame Generation: This is fundamentally a post-processing step. It takes fully rendered (and super-resolved) frames as input and generates additional frames that are then inserted into the display sequence.
- Reflex Frame Warp: This is designed to be a very late-stage process, occurring “just before the rendered frame is sent to the display”. This timing is critical for it to incorporate the absolute latest user input.
- Outputs:
- The Super Resolution and Ray Reconstruction components output high-resolution, high-quality, and temporally stable frames with enhanced ray-traced effects.
- The Multi Frame Generation component outputs multiple AI-synthesized intermediate frames.
- The final frames presented to the display are those that have potentially been modified by Reflex Frame Warp to reflect the latest player input.
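The ordering described in these lists can be summarized in a hypothetical frame-loop skeleton. Every function here is a named placeholder that just prints its stage; none of these are SDK calls (real integration goes through NVIDIA’s SDKs), and the ordering, not the implementation, is the point:

```python
# Hypothetical frame-loop skeleton for the DLSS 4 injection points above.

def stage(name):
    def run(*args, **kwargs):
        print(name)
        return name
    return run

render_lowres       = stage("engine: low-res render + noisy RT passes")
ray_reconstruction  = stage("DLSS RR: AI denoise of ray-traced effects")
super_resolution    = stage("DLSS SR: AI upscale to display resolution")
post_process_and_ui = stage("engine: tone mapping, film grain, UI")
multi_frame_gen     = stage("DLSS MFG: heavy pass once per rendered pair")
warp_and_present    = stage("Reflex Frame Warp: warp to latest input, present")

prev = None
for _ in range(2):                         # two displayed-frame cycles
    g  = render_lowres()
    rt = ray_reconstruction(g)             # before post-processing
    hi = post_process_and_ui(super_resolution(g, rt, prev))
    # MFG is a post-process over finished frames (a real first cycle,
    # with no previous frame yet, would skip generation):
    for gen in [multi_frame_gen(prev, hi)] * 3:  # up to 3 intermediates
        warp_and_present(gen)              # warp is the very last step
    warp_and_present(hi)
    prev = hi
```

Note how Super Resolution and Ray Reconstruction sit before post-processing and UI, Multi Frame Generation operates on finished frames, and Frame Warp touches every frame last, just before presentation.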
Understanding this data flow and the specific points at which DLSS 4 integrates its various processing stages is crucial for game developers seeking to implement the technology effectively and for engineers analyzing its performance characteristics and potential sources of visual artifacts. The shift to an AI-based optical flow model for MFG simplifies one aspect of the input pipeline by removing the dependency on a specific hardware OFA signal, potentially offering more flexibility.
The replacement of a specialized hardware unit (the Optical Flow Accelerator in DLSS 3.0) with an AI-based model for optical flow calculation in DLSS 4’s Multi Frame Generation is another noteworthy trend. This indicates that AI models, when executed on sufficiently powerful and generalized AI hardware like modern Tensor Cores, are becoming capable and efficient enough to supplant fixed-function or more specialized hardware blocks for certain complex tasks. While the OFA in the Ada Lovelace architecture was a dedicated silicon block optimized for its specific purpose, its functionality was inherently fixed by its hardware design. In contrast, an AI model for optical flow, as described for DLSS 4, offers the potential for continuous improvement through new training data, architectural refinements, and algorithmic advancements, without necessitating new hardware revisions for that specific function. This approach offers greater flexibility and adaptability. It also potentially frees up die space or power that would have been consumed by a dedicated OFA, which could then be allocated to other GPU resources. This is a strong indicator of AI’s expanding capability to take over tasks that were traditionally the domain of hardware acceleration or complex, hand-tuned algorithms.