Optimising our shadows in Unity

Background

Shadows Still

We have a projected shadows system that we use in a few of our games. Very much like a shadow map, it involves rendering objects from the perspective of a light and then projecting the shadows from that light onto the scene.

On some of our games, the fully fledged Unity shadow mapping solution is overkill – we don’t want to render dynamic shadows for everything, only smaller objects in the scene. We also want more control over how we filter our shadows – how we blur them to make them softer.

During a recent profiling session on one of our games, we noticed that generating one of these shadow maps was taking up approximately 12% of the total frame time. So I went about investigating it and looking into what we could do to reduce this cost and at the same time reduce the amount of memory the system was consuming.

Optimisations

My first step was to fire up my preferred profiling tools for both Android (RenderDoc) and iOS (XCode). RenderDoc is a free to use profiler and debugger that can connect to a host Android device and capture frame traces.

RenderDoc

RenderDoc

XCode is to the go-to development app on MacOS, you can capture a GPU frame at any time by selecting the option from the debug menu.

XCode.png

XCode GPU Frame Debugger

Making the most of the space we have

Using the render target viewer on both platforms I spotted that the contents of the shadow map render target was only occupying a small section of the entire texture. I would estimate that over 50% of the pixels in the render target were unoccupied – what a waste of space!

We use our projected shadows with directional lights, orthographic projection modes tend to be easier to control and tweak. You will lose any perspective but for us, this isn’t an issue. Swapping the projection mode over to orthographic as well as better positioning of the light source allowed us to make better use of the available render target space.

In the end, we were able to reduce the resolution of our shadow map texture from 128×128 to 64×64 – that’s 1/4 of the original size. One of the biggest bottlenecks in mobile devices is bandwidth – mobile devices have small busses. Moving 75% less data down the bus is a big saving. This along with shading 75% fewer fragments is a huge win (ignore the colour for the minute – I changed the way we store the shadow map in the render texture).

Shadows-OldProjectionShadows-NewProjection

 

MSAA

As we are using such a small render target when objects start moving within the render target you will notice a lot of aliasing. Due to the way in which mobile GPUs work, MSAA is very cheap. Mobile GPU’s use a tile-based architecture. All of the anti-aliasing work is done on chip and all additional memory is on tile memory. Enabling 4x MSAA with a smaller render texture gave us much better results with only a tiny increase in processing cost.

Render Target Formats

I spotted that our shadow map render texture was using an R8G8B8A8 format. Only two of the channels were being used. The first (R) was being used to store the shadow itself, and the second channel (G) was being used to store a linear fall off. Our artists requested that the intensity of our shadows fall off with distance.

Looking further into it, we didn’t actually need to store both pieces of information here. We only needed the shadow value or the falloff value depending on what was enabled for this shadow projector. I changed the render target format to a single 8-bit channel format (R8). This cut our texture size down by another 1/4. Again this reduces our bandwidth massively,

Blur Method

After we populate our shadow map render texture we blur it. This reduces any artefacts we see from using smaller textures as well as giving the impression of soft shadows. We were using a 3×3 box blur – that’s 9 texture samples per pixel. What’s more we weren’t taking advantage of bilinear filtering with a half pixel offset. I quickly added the option to only sample the surrounding corner pixels along with a half pixel offset, this reduced our sample count from 9 to 5 (we still sample the centre pixel).

You sample a texel from a texture using a texture coordinate. With point filtering enabled, sampling between two texels will result in only one of the texels being sampled. With bilinear filtering enabled the GPU will actually linearly blend between the two texels and return you the average of the two texels. So if we add an additional half pixel offset, we can essentially sample two pixels for the price of one.

Reducing ALU Instructions

Unity doesn’t support the common border texture wrapping mode. Therefore we had to add a bit of logic to our blur shader that checks to see if the current texel is a border texel, and if so keep it clear. This prevents shadows from smearing across the receiving surface. The shader code was using the step intrinsic to calculate if the current texel was a border texel. The step intrinsic is kind of like an if statement, I managed to rework this bit of code to use floor instead, this alone reduced the ALU count from 13-9. This doesn’t sound like much, but when you are doing this for each pixel in a render texture, it all adds up. This wasn’t the only saving we made here, but its an example of what we did.

When writing your shader, head to the inspector in Unity. From here, with a shader selected, you can select the “Compile and show code” button. This will compile your shader code and open it in a text editor. At the top of you compiled shader code you can see how many ALU and Tex instructions your shader uses.

// Stats: 5 math, 5 textures
Shader Disassembly:
#version 100

#ifdef VERTEX

For even more information you can download and use the Mali Offline Shader Compiler. Simply copy the compiled shader code – the bit in between #if VERTEX or #if FRAGMENT – and save it to a .vert or .frag file. From here you can run it through the compiler and it will show you the shader statistics

malisc --vertex myshader.vert malisc --fragment myshader.frag
Mali Shader Compiler Output

Above you can see that the 5 tap blur shader uses

  • 2 ALU (arithmetic logic unit) instructions
  • 3 Load/Store instructions
  • 5 Texture instructions

OnRenderImage

After the end of the blur pass, I noticed that there was an additional Blit – copying the blurred texture into another render target! I started digging into this and noticed that, even though we specified that our blurred render texture is of R8 format, it was R8G8B8A8! It turns out that this is a bug with Unity. OnRenderImage is passed a 32-bit destination texture, then the value of this is copied out to the final target format. This wasn’t acceptable so I changed our pipeline. We now allocate our render textures manually and perform the blur in OnPostRender.

private void OnPostRender()
{
    if (shadowQuality != ShadowQuality.Hard)
    {
        SetupBlurMaterial();
        blurTex.DiscardContents();
        Graphics.Blit(shadowTex, blurTex, blurMat, 0);
    }
}

Depth Buffer

This one is a bit weird – I’m not sure if I like it or not, I’m pretty sure I don’t. If you’re desperate to save memory you can disable the depth buffer. But this means that you are going to get a tonne more overdraw. However if you know what’s going into the shadow map render target and you know there isn’t a lot of overdraw, then this might be an option for you. The only way you’re going to tell is if you profile it, and before you even do this, make sure you’re really desperate for that additional few kilobytes.

Performance Metrics

Here we can see the cost of rendering a single shadow map (the example above). These readings were taken using XCode GPU Frame Debugger on an iPhone 6s. As you can see, the cost of rendering this shadow map is less than 50% of the original cost.

Cost

Thanks to reducing the size of our render targets, using a smaller texture format, eliminating the unnecessary Blit and (optionally) not using a depth buffer our memory consumption went from 320kb down to 8kb! Using a 16-bit depth buffer doubles that to 16kb. Either way, that’s A LOT of saved bandwidth.

Conclusion

Shadows Gif

In the best case scenario, we were able to reduce our memory consumption (and bandwidth usage) by over 40x. We were also able to reduce the overall cost of our shadow system by just over 50%! To any of the art team that may be reading this – it doesn’t mean we can have twice as many shadows 😀 All in all, I spent about 2-3 days profiling, optimizing and changing things up and it was definitely worth it.

Leave a comment