Optimizing PSO Hitches in UE5: FileCache, PreCache, and Async PSO

This article discusses several approaches to optimizing PSO hitches in UE5. For each one it briefly explains how it works and its pros and cons, and it then compares their performance.

1. Background

1.1. What is PSO?

A PSO (Pipeline State Object) bundles the shaders and rendering states used by a draw call into a single object, and it is a core component of every draw call. It simplifies rendering state changes and allows drivers to perform state-based optimizations.

PSOs are used on essentially all current rendering platforms, although what a PSO contains varies slightly between platforms. For example, OpenGL's PSO covers only the shaders.
Before a draw call can be submitted, its PSO must be compiled from the shaders and rendering states. This step is very time-consuming and can take on the order of 100 ms on Android devices.
PSOs are hardware-dependent and must be compiled on the target device, so they cannot be fully prepared offline. On Metal, Vulkan, and DX12, shaders can be pre-compiled into an intermediate representation (for example SPIR-V on Vulkan or DXIL on DX12), but the final compilation still happens on the target hardware.
PSOs can be cached in memory or on disk, so the same PSO only needs to be compiled once. Loading a PSO from the cache is very fast.
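The "compile once, then reuse" behavior can be sketched as a cache keyed by the full PSO description. This is a minimal model written in Python for brevity. It is not UE's C++ API. `PsoKey`, `PsoCache`, the shader names, and the simulated compile cost are all hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PsoKey:
    """Hypothetical, simplified PSO description (real PSOs carry far more state)."""
    vertex_shader: str
    pixel_shader: str
    render_state: int  # packed blend/depth/raster flags


class PsoCache:
    def __init__(self, compile_cost_s: float = 0.01):
        self._cache: dict[PsoKey, str] = {}
        self.compile_count = 0
        self._compile_cost_s = compile_cost_s

    def get_or_compile(self, key: PsoKey) -> str:
        """Return a compiled PSO, paying the compile cost only on first use."""
        if key in self._cache:             # cache hit: a cheap dictionary lookup
            return self._cache[key]
        time.sleep(self._compile_cost_s)   # stand-in for the slow driver compile
        self.compile_count += 1
        pso = f"pso#{self.compile_count}"
        self._cache[key] = pso
        return pso


cache = PsoCache()
key = PsoKey("BasePassVS", "BasePassPS", 0b101)
first = cache.get_or_compile(key)   # compiles (this is where a hitch would occur)
second = cache.get_or_compile(key)  # served from the cache
print(cache.compile_count)  # 1
```

In a real engine the cache would also be persisted to disk, so the same lookup succeeds on the next launch.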

PS: D3D11 has no explicit PSO object that can be compiled ahead of time. Instead, the driver compiles the pipeline state automatically when a draw call first uses it. For this reason, we generally consider D3D11 PSOs uncacheable. They can be warmed up through some hacks, but this would be difficult to implement in UE, so the discussion in this article does not apply to D3D11.

1.2. Why Optimize PSO Hitches?

In a well-designed level, streaming and asynchronous loading can eliminate most loading hitches. However, even if loading is fully asynchronous or done in advance, PSO compilation is still triggered synchronously when objects are finally submitted for rendering. As a result, in many mature projects PSO compilation is likely to become the main remaining source of stuttering.

The core problem is that the engine doesn't know which PSO an object needs until the object is rendered, and draw call submission must wait synchronously for that PSO.

1.3. Solutions?

Experienced client programmers may spot these opportunities:

Since PSOs can be cached, can we prepare them in advance?
Is it really impossible to know which PSO will be used before rendering?
Do we have to wait synchronously for a PSO? Could we render only once it is ready?

Based on these observations, we can propose the following solutions:

Pre-recording: collect all PSOs that might be used in advance, compile them on the target device, and only then load the map. This was the default solution before UE5.5 and is referred to as FileCache below.
Streaming: instead of collecting PSOs in advance, predict the PSOs a mesh may need when the mesh loads. This essentially treats PSO compilation as part of the streaming process. Starting with UE5.5 it is the default solution, and it is called PreCache below.
On-demand: solve the problem where the PSO is compiled. Make compilation asynchronous and skip draw calls whose PSO is not ready yet. UE has not officially implemented this solution, and we discuss why below. It is referred to as Async PSO.

2. FileCache

2.1. Introduction

The basic idea of this solution is to collect every PSO encountered during game testing and save them as a list. After the game is installed on a player's phone, the PSOs in this list are compiled locally. As long as testing covers all player workflows, the vast majority of hitches can be eliminated.
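The record-then-replay flow can be modeled in a few lines. This is only a sketch under stated assumptions. UE's actual FileCache uses its own binary format and cvars, and `PsoRecorder`, `precompile_on_device`, and the tuple-shaped PSO keys are hypothetical names used for illustration.

```python
import json


class PsoRecorder:
    """Hypothetical recorder used during test runs: logs every PSO encountered."""

    def __init__(self):
        self._seen: set[tuple] = set()

    def on_pso_used(self, key: tuple) -> None:
        self._seen.add(key)  # the set deduplicates repeated uses of the same PSO

    def save(self) -> str:
        # Sorted so the list is stable across runs and diffs cleanly.
        return json.dumps(sorted(self._seen))


def precompile_on_device(saved: str, compile_fn) -> int:
    """On the player's device (behind a loading screen), compile every recorded PSO."""
    keys = [tuple(k) for k in json.loads(saved)]
    for key in keys:
        compile_fn(key)
    return len(keys)


rec = PsoRecorder()
for key in [("VS_Default", "PS_Grass", 1),
            ("VS_Default", "PS_Grass", 1),
            ("VS_Skinned", "PS_Hero", 0)]:
    rec.on_pso_used(key)

compiled = []
count = precompile_on_device(rec.save(), compiled.append)
print(count)  # 2
```

The on-device step is the "Compilation Time" that the player waits through. Any PSO the tests never hit is still compiled synchronously at draw time.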

For details on FileCache usage, implementation mechanisms, and performance optimization, see my other article: Unreal Engine PSO Cache Mechanism, Usage and Optimization.

2.2. Main Drawbacks

Because FileCache is essentially an offline solution, its biggest problem is that it requires changes to the development process. For example:

Testing workflows need to be designed carefully to reach high PSO coverage. All quality settings and mesh/material combinations should be covered as far as possible.
Whenever assets change, tests need to be rerun. If the project has modified the engine, a complete retest may even be required.
If you plan a separate "PSO collection" test pass, when should it run, and how many resources will it consume?
If PSO collection runs automatically in all daily tests, how do you merge PSO lists from different devices and exclude outdated PSOs?
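One way to approach the last question is to take the union of all device lists and drop any entry that references a shader which no longer exists in the current build. The sketch below makes two assumptions: each entry is a hypothetical `(vertex_shader_hash, pixel_shader_hash, state)` tuple, and the build can export the set of valid shader hashes. Real UE PSO caches are keyed differently.

```python
def merge_pso_lists(device_lists, valid_shader_hashes):
    """Union PSO lists from several devices, dropping entries with outdated shaders.

    device_lists: iterable of lists of hypothetical (vs_hash, ps_hash, state) tuples.
    valid_shader_hashes: set of shader hashes present in the current build.
    """
    merged = set()
    for entries in device_lists:
        for vs, ps, state in entries:
            # An entry is only useful if both of its shaders still exist.
            if vs in valid_shader_hashes and ps in valid_shader_hashes:
                merged.add((vs, ps, state))
    return sorted(merged)


device_a = [("vs1", "ps1", 0), ("vs1", "ps_old", 0)]  # ps_old was deleted since
device_b = [("vs1", "ps1", 0), ("vs2", "ps2", 1)]
valid = {"vs1", "ps1", "vs2", "ps2"}
print(merge_pso_lists([device_a, device_b], valid))
# [('vs1', 'ps1', 0), ('vs2', 'ps2', 1)]
```

Filtering by shader hash only catches removed shaders. Stale render-state combinations would need a separate rule, such as an age or build-version cutoff.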

If your project can handle the above issues, this method is actually quite reliable. Otherwise, consider the newer default solution in UE5.5, described below.

3. PreCache

3.1. How to Enable

Core switch: r.PSOPrecaching. Enabled by default in UE5.5.

3.2. Mechanism

During mesh loading, the engine selects shader variants from the mesh's materials based on the current vertex factory types, quality switches, mesh passes, shading models, and so on. It combines these with the other rendering parameters into a set of PSOs that might be used, and it compiles them ahead of time on asynchronous threads.
There are two options for the asynchronous threads:
1. Engine thread pool: Supports all platforms
2. Standalone compilation service: Only supports Android
Until all PSOs for a mesh are compiled, there are two options:
1. Skip rendering the mesh. To players, this looks like asynchronous resource loading
2. Render with a fallback material. To players, this looks like LOD streaming
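The mechanism above can be sketched as follows: enumerate candidate PSOs from the cross product of load-time properties, submit them to a thread pool, and treat the mesh as drawable only when every future has completed. The model is in Python for brevity. `MeshPrecache`, `compile_pso`, and the vertex factory, pass, and quality names are hypothetical, and this is not UE's implementation.

```python
import itertools
import time
from concurrent.futures import Future, ThreadPoolExecutor


def compile_pso(key):
    time.sleep(0.005)  # stand-in for the driver compile
    return key


class MeshPrecache:
    """Hypothetical per-mesh precache request, created when the mesh loads."""

    def __init__(self, pool, material, vertex_factories, mesh_passes, quality_flags):
        # Predict every combination that *might* be used; this is where redundancy comes from.
        self.keys = list(itertools.product(
            [material], vertex_factories, mesh_passes, quality_flags))
        self.futures: list[Future] = [pool.submit(compile_pso, k) for k in self.keys]

    def is_ready(self) -> bool:
        return all(f.done() for f in self.futures)


with ThreadPoolExecutor(max_workers=4) as pool:
    mesh = MeshPrecache(pool, "M_Rock", ["LocalVF"],
                        ["DepthPass", "BasePass", "ShadowDepth"], ["HighQ", "LowQ"])
    # Option 1 above: the renderer skips the mesh while this is False.
    while not mesh.is_ready():
        time.sleep(0.001)

print(len(mesh.keys))  # 6
```

Even if this mesh only ever renders with one of these six PSOs, it stays hidden until all six are done.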

3.3. Usage Strategy

In practice, designing a universal fallback material is not simple, so the "don't render until the PSOs are compiled" approach is generally recommended. Building on this, the recommended usage strategy is:

When entering the map, show a loading screen and wait until the PSOs of all initially visible meshes have finished compiling
After entering the map, let each newly loaded mesh render only after its PSOs have finished compiling asynchronously
For sub-levels loaded via level streaming, budget for PSO compilation time on top of the existing level streaming strategy
Load meshes through level streaming wherever possible, rather than dynamically during gameplay, to reduce the chance of visibly late meshes
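The first two points of this strategy come down to two checks: a blocking wait, with a timeout, behind the loading screen, and a per-frame filter for meshes that stream in later. The sketch below assumes a hypothetical interface where each mesh exposes `is_ready()`. `FakeMesh` exists only to make the example runnable.

```python
import time


def wait_on_loading_screen(meshes, timeout_s=30.0, poll_s=0.01):
    """Keep the loading screen up until every initially visible mesh is ready.

    Returns True if all meshes finished before the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(m.is_ready() for m in meshes):
            return True
        time.sleep(poll_s)
    return False


def visible_meshes_to_draw(meshes):
    """After entering the map, meshes whose PSOs are still compiling are skipped."""
    return [m for m in meshes if m.is_ready()]


class FakeMesh:
    """Test double: becomes ready a fixed time after creation."""

    def __init__(self, ready_after_s):
        self._ready_at = time.monotonic() + ready_after_s

    def is_ready(self):
        return time.monotonic() >= self._ready_at


initial = [FakeMesh(0.02), FakeMesh(0.05)]
print(wait_on_loading_screen(initial, timeout_s=1.0))  # True
streamed = [FakeMesh(0.0), FakeMesh(10.0)]
print(len(visible_meshes_to_draw(streamed)))  # 1
```

The timeout is a design choice. It stops a mispredicted or stuck precache request from holding the loading screen forever, at the cost of possible late-appearing meshes.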

3.4. Advantages and Disadvantages

PreCache's biggest advantage is that it requires no preprocessing and is plug-and-play.

On the other hand, because PreCache cannot predict exactly which PSOs will be used, it has to compile redundant ones. This causes several issues:

Even if the PSO that will actually be used is already compiled, the mesh still waits for the material's other predicted PSOs before rendering
Meshes loaded when entering the map take relatively long to become ready
Meshes loaded dynamically after entering the map appear with a noticeable delay
When many meshes are waiting to load at the same time, compilation consumes additional memory and CPU
Not all mesh types support precaching. Official support currently focuses on static meshes and skeletal meshes
If the team has modified the rendering pipeline, PSO prediction may miss. You then pay for both redundant precompilation and on-demand compilation of the PSOs actually used, and get both delays and hitches
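A toy cost model makes the first issue concrete. It assumes that every PSO costs the same to compile and that worker threads run perfectly in parallel. Neither assumption holds exactly on a phone, and the 12-PSO and 10 ms figures are hypothetical.

```python
import math


def display_delay_ms(predicted_psos: int, compile_ms: float, workers: int) -> float:
    """Toy model: a precached mesh appears only after *all* predicted PSOs compile."""
    batches = math.ceil(predicted_psos / workers)
    return batches * compile_ms


# A mesh that really needs 2 PSOs, but 12 were predicted across pass/quality variants:
print(display_delay_ms(2, 10.0, 4))   # 10.0  (if only the needed PSOs were compiled)
print(display_delay_ms(12, 10.0, 4))  # 30.0  (what PreCache actually waits for)
```

When many meshes are queued at the same time, they share the same workers, so these delays add up across meshes.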

4. Async PSO

4.1. Mechanism and Advantages

So far we have not covered the most intuitive solution. When a draw call is submitted and its PSO is not compiled yet, hand the PSO to an asynchronous compilation thread and skip the draw call for now. Its advantages are:

No preprocessing required
PSO compilation is fully on demand, with no prediction or redundancy
No extra memory or CPU overhead from redundant compilation
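The draw-time logic can be sketched as a non-blocking cache lookup. On the first miss it starts a background compile. While that compile is running, the caller skips the draw. Once the future completes, the PSO is promoted to the ready set. This is a Python model with hypothetical names (`AsyncPsoCache`, `submit_draw`), not an existing UE API. A real engine would do this inside its RHI/PSO cache layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AsyncPsoCache:
    """Hypothetical async PSO cache: a draw call never blocks on compilation."""

    def __init__(self, pool):
        self._pool = pool
        self._pending = {}  # key -> Future of an in-flight compile
        self._ready = {}    # key -> compiled PSO

    def try_get(self, key):
        if key in self._ready:
            return self._ready[key]
        fut = self._pending.get(key)
        if fut is None:
            # First request: kick off compilation in the background.
            self._pending[key] = self._pool.submit(self._compile, key)
        elif fut.done():
            self._ready[key] = fut.result()
            del self._pending[key]
            return self._ready[key]
        return None  # not ready yet

    @staticmethod
    def _compile(key):
        time.sleep(0.01)  # stand-in for the driver compile
        return f"pso:{key}"


def submit_draw(cache, key, draw_log):
    pso = cache.try_get(key)
    if pso is None:
        return False  # draw skipped this frame: no hitch, object briefly missing
    draw_log.append(pso)
    return True


with ThreadPoolExecutor(max_workers=2) as pool:
    cache = AsyncPsoCache(pool)
    log = []
    frames_skipped = 0
    while not submit_draw(cache, "M_Rock/BasePass", log):
        frames_skipped += 1
        time.sleep(0.005)  # simulate the next frame

print(log)  # ['pso:M_Rock/BasePass']
```

Nothing in this sketch remembers that a draw was skipped. That gap is exactly what causes the flickering cases discussed next.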

From an infrastructure perspective, this solution can be implemented with minimal effort. So why hasn't it been officially adopted? This brings us to async PSO's biggest flaw: flickering.

4.2. Drawbacks

For players, an object that occasionally appears late is usually not a big problem, because players already expect some delayed loading. However, an already rendered object disappearing severely hurts the experience. Naively using async PSO causes objects to flicker in situations such as:

1. Different LODs of the same mesh use different materials, so the mesh briefly disappears when switching LODs
2. Objects that were drawn separately are suddenly batched together, which requires a different set of PSOs, so the original meshes briefly disappear
3. Lazily updated rendering, such as the shadow cache: once the first draw is skipped, it is never retried, because the engine doesn't know the draw call was skipped
4. Quality setting changes invalidate all full-screen PSOs, for example enabling or disabling point lights
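The shadow-cache case can be reproduced with a minimal model: a cache that only re-renders when marked dirty. In the naive version the skipped caster's shadow never appears. If the render layer can report that a draw was skipped, keeping the cache dirty fixes it. `CachedShadowMap` and its `report_skips` flag are hypothetical and are not UE's shadow cache.

```python
class CachedShadowMap:
    """Hypothetical shadow cache that only re-renders when marked dirty."""

    def __init__(self, report_skips: bool):
        self.dirty = True
        self.contents = set()
        self._report_skips = report_skips

    def update(self, casters, pso_ready):
        if not self.dirty:
            return  # cached result reused; nothing is redrawn
        self.contents = {c for c in casters if pso_ready(c)}
        skipped = len(self.contents) < len(casters)
        # Naive: the cache counts as clean even though some draws were skipped.
        # Fixed: the render layer knows a draw was skipped and stays dirty.
        self.dirty = skipped and self._report_skips


ready = {"Tree"}  # the "Rock" PSO is still compiling on the first frame
casters = ["Tree", "Rock"]
naive, fixed = CachedShadowMap(False), CachedShadowMap(True)
for shadow in (naive, fixed):
    shadow.update(casters, lambda c: c in ready)
ready.add("Rock")  # the PSO finishes compiling
for shadow in (naive, fixed):
    shadow.update(casters, lambda c: c in ready)

print(sorted(naive.contents))  # ['Tree']  (Rock's shadow never appears)
print(sorted(fixed.contents))  # ['Rock', 'Tree']
```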

4.3. Solutions

So, can these problems be solved? Based on our project's implementation, problems 1-3 can be solved at the engine level and problem 4 at the gameplay level. Specifically:

To avoid flickering at the engine level, the mesh draw pipeline must be modified so that the rendering layer can query PSO state and adjust LOD selection, batching, and shadow caching accordingly. This work also ends up partially reimplementing PreCache functionality
To avoid flickering at the gameplay level, you need to know which quality switches invalidate PSOs, and use UI or game logic to mask the resulting delays or hitches when those switches change
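For the LOD case, the engine-level fix described above amounts to one rule: switch to the desired LOD only once its PSOs are ready, and keep drawing the current LOD until then. A minimal sketch follows. `choose_lod` and the LOD-to-PSO mapping are hypothetical. In practice `pso_ready` would query something like an async PSO cache and would also trigger compilation of the missing PSOs.

```python
def choose_lod(desired_lod, current_lod, lod_psos, pso_ready):
    """Switch to desired_lod only when all of its PSOs are compiled.

    lod_psos: hypothetical mapping LOD index -> list of PSO keys for that LOD.
    pso_ready: callable(key) -> bool.
    Otherwise keep the current LOD, so the mesh never disappears.
    """
    if all(pso_ready(k) for k in lod_psos[desired_lod]):
        return desired_lod
    return current_lod


lods = {0: ["M_Rock_High/BasePass"], 1: ["M_Rock_Low/BasePass"]}
ready = {"M_Rock_High/BasePass"}
print(choose_lod(1, 0, lods, ready.__contains__))  # 0 (LOD1 not ready, stay on LOD0)
ready.add("M_Rock_Low/BasePass")
print(choose_lod(1, 0, lods, ready.__contains__))  # 1
```

The batching case can follow the same pattern: keep drawing the unbatched meshes until the batched draw's PSOs are ready.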

If these problems can be solved, why hasn't UE officially adopted this approach? I suspect several reasons:

The fix is intrusive. It requires modifying a lot of mesh rendering logic and increases maintenance cost, and unlike PreCache it cannot live in a largely self-contained module
Gameplay-layer cooperation is required. As a general-purpose engine, UE aims for out-of-the-box usability, and teams without graphics or engine programmers would find this hard to handle
Many of UE's official technical decisions are made with future hardware in mind. As hardware improves, the cost of PreCache's redundant compilation will shrink

5. Testing and Recommendations

I tested the performance and time cost of the three solutions on UE5.5.

5.1. Test Environment

Scene: 512 simple StaticMeshes, each using different simple materials
Hardware: Snapdragon 855 (mid-to-low-end device)
Platform: Android + Vulkan
Configuration: LRU disabled, EnablePSOFileCacheWhenPrecachingActive enabled, UseChunkedPSOCache enabled

Configuration 1: FileCache

r.PSOPrecaching=0
r.ShaderPipelineCache.Enabled=1
r.AsyncPSO=0

Configuration 2: PreCache

r.PSOPrecaching=1
r.ShaderPipelineCache.Enabled=0
r.AsyncPSO=0

Configuration 3: Async PSO

r.PSOPrecaching=0
r.ShaderPipelineCache.Enabled=0
r.AsyncPSO=1

5.2. Test Results

Time (s)	FileCache	PreCache	Async PSO
Compilation Time	4.8	0	0
Display Time	0.3	22.3	4.7

Compilation Time: the time taken to compile all PSOs in the FileCache list. In a real project, a loading screen is needed during this time.
Display Time: the time from the start of scene loading until draw calls have been submitted for all 512 PSOs. During this time the scene renders and responds to input normally, and meshes whose PSOs are not ready are simply not drawn.
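A quick back-of-the-envelope reading of the table follows. It uses only the numbers above and assumes that the three runs used comparable compile threading. That assumption is not stated in the test setup. The first figure is effective wall-clock throughput, not the cost of a single compile. The ratio hints that PreCache did several times more compile work for the same 512 visible PSOs, which is consistent with redundant prediction. Scheduling differences could also contribute.

```latex
\frac{4.8\ \text{s}}{512\ \text{PSOs}} \approx 9.4\ \text{ms per PSO (FileCache, wall-clock)}
\qquad
\frac{22.3\ \text{s}}{4.8\ \text{s}} \approx 4.6
```

By the same measure, Async PSO's 4.7 s display time is close to FileCache's 4.8 s compile time. This fits compiling roughly the same set of PSOs, only on demand.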

5.3. Recommendations

For small projects, small teams, or projects that are not sensitive to display delays, PreCache is recommended
If you have a reasonably large testing team, or can build an automated testing / automated PSO collection system, FileCache is recommended
If the team has strong engine expertise and pursues maximum performance, consider implementing async PSO, while still using a basic FileCache to cover static scenes
Set up hitch monitoring that continuously tracks how many PSOs are compiled synchronously and how long the resulting hitches last
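A hitch monitor can be as simple as wrapping every synchronous PSO compile in a timer. The sketch records the count, the worst duration, and every compile above a frame-budget threshold. `PsoHitchMonitor` and the 16.7 ms default (one frame at 60 FPS) are hypothetical choices. In UE, these numbers would more likely come from the engine's own stats or trace instrumentation.

```python
import time
from contextlib import contextmanager


class PsoHitchMonitor:
    """Hypothetical monitor: wrap every synchronous PSO compile to track hitches."""

    def __init__(self, hitch_threshold_ms=16.7):
        self.threshold_ms = hitch_threshold_ms
        self.sync_compiles = 0
        self.hitches = []  # durations (ms) at or above the threshold
        self.worst_ms = 0.0

    @contextmanager
    def sync_compile(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            ms = (time.perf_counter() - start) * 1000.0
            self.sync_compiles += 1
            self.worst_ms = max(self.worst_ms, ms)
            if ms >= self.threshold_ms:
                self.hitches.append(ms)


monitor = PsoHitchMonitor()
with monitor.sync_compile():
    time.sleep(0.001)  # fast compile: counted, normally below the threshold
with monitor.sync_compile():
    time.sleep(0.03)   # slow compile: recorded as a hitch
print(monitor.sync_compiles, len(monitor.hitches), round(monitor.worst_ms))
```

Reporting these counters per build and per device helps show whether PSO coverage is getting worse over time, whichever of the three solutions is in use.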

PS: Since UE5.5, official maintenance of FileCache has declined, so project teams may need to maintain it themselves going forward.