Why is a single depth buffer sufficient for this Vulkan swapchain render loop?

I was following the vulkan tutorial at https://vulkan-tutorial.com/ and at the depth buffering chapter, the author Alexander Overvoorde mentions that "We only need a single depth image, because only one draw operation is running at once." This is where my issue comes in.

I've read many SO questions and articles/blog posts on Vulkan synchronization over the past few days, but I can't seem to reach a conclusion. The information I've gathered so far is the following:

Draw calls in the same subpass execute on the GPU as if they were in order, but only as far as their framebuffer output is concerned (I can't recall exactly where I read this; it might have been a tech talk on YouTube, so I am not 100% sure about it). As far as I understood, this is GPU hardware behavior more than Vulkan behavior, so it would essentially hold in general (including across subpasses and even render passes), which would answer my question, but I can't find any clear information on this.

The closest I've gotten to an answer is this reddit comment that the OP seemed to accept, but the justification rests on two claims:

  • "there is a queue flush at the high level that ensures previously submitted render passes are finished"

  • "the render passes themselves describe what attachments they read from and write to as external dependencies"

I see neither any high-level queue flush (unless there is some explicit one that I cannot find for the life of me in the specification), nor where a render pass describes dependencies on its attachments: it describes the attachments, but not the dependencies (at least not explicitly). I have read the relevant chapters of the specification multiple times, but I feel the language is not clear enough for a beginner to fully grasp.

I would also really appreciate Vulkan specification quotes where possible.

Edit: to clarify, the final question is: what synchronization mechanism guarantees that a draw call in the next frame's command buffer does not execute until the current frame's draw call is finished?



Solution 1:[1]

I'm afraid I have to say that the Vulkan Tutorial is wrong. In its current state, it cannot be guaranteed that there are no memory hazards when using only a single depth buffer. However, only a very small change would be required to make a single depth buffer sufficient.


Let's analyze the relevant steps of the code that are performed within drawFrame.

We have two different queues, presentQueue and graphicsQueue, and MAX_FRAMES_IN_FLIGHT concurrent frames. I refer to the "in flight index" as cf (which stands for currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT). I use sem1 and sem2 to represent the two arrays of semaphores, and fence to represent the array of fences.

The relevant steps in pseudocode are the following:

vkWaitForFences(..., fence[cf], ...);
vkAcquireNextImageKHR(..., /* signal when done: */ sem1[cf], ...);
vkResetFences(..., fence[cf]);
vkQueueSubmit(graphicsQueue, ...
    /* wait for: */ sem1[cf], /* wait stage: */ COLOR_ATTACHMENT_OUTPUT ...
    vkCmdBeginRenderPass(cb[cf], ...);
      Subpass Dependency between EXTERNAL -> 0:
          srcStages = COLOR_ATTACHMENT_OUTPUT,
          srcAccess = 0, 
          dstStages = COLOR_ATTACHMENT_OUTPUT,
          dstAccess = COLOR_ATTACHMENT_WRITE
      ...
      vkCmdDrawIndexed(cb[cf], ...);
      (Implicit!) Subpass Dependency between 0 -> EXTERNAL:
          srcStages = ALL_COMMANDS,
          srcAccess = COLOR_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_WRITE, 
          dstStages = BOTTOM_OF_PIPE,
          dstAccess = 0
    vkCmdEndRenderPass(cb[cf]);
    /* signal when done: */ sem2[cf], ...
    /* signal when done: */ fence[cf]
);
vkQueuePresent(presentQueue, ... /* wait for: */ sem2[cf], ...);

The draw calls are performed on one single queue: the graphicsQueue. We must check if commands on that graphicsQueue could theoretically overlap.

Let us consider the events that are happening on the graphicsQueue in chronological order for the first two frames:

img[0] -> sem1[0] signal -> t|...|ef|fs|lf|co|b -> sem2[0] signal, fence[0] signal
img[1] -> sem1[1] signal -> t|...|ef|fs|lf|co|b -> sem2[1] signal, fence[1] signal

where t|...|ef|fs|lf|co|b stands for the different pipeline stages, a draw call passes through:

  • t ... TOP_OF_PIPE
  • ef ... EARLY_FRAGMENT_TESTS
  • fs ... FRAGMENT_SHADER
  • lf ... LATE_FRAGMENT_TESTS
  • co ... COLOR_ATTACHMENT_OUTPUT
  • b ... BOTTOM_OF_PIPE

While there might be an implicit dependency between sem2[i] signal -> present and sem1[i+1], this only applies when the swapchain provides only one image (or if it always returned the same image). In the general case, this cannot be assumed. That means there is nothing which would delay the immediate progression of the subsequent frame after the first frame is handed over to present. The fences also do not help, because after fence[i] signals, the code waits on fence[i+1], so they do not prevent progression of subsequent frames in the general case either.

What I mean by all of that: the second frame starts rendering concurrently with the first frame, and as far as I can tell, nothing prevents it from accessing the depth buffer concurrently.


The Fix:

If we want to use only a single depth buffer, though, we can fix the tutorial's code. What we want to achieve is that the ef and lf stages wait for the previous draw call to complete before proceeding, i.e. we want to create the following scenario:

img[0] -> sem1[0] signal -> t|...|ef|fs|lf|co|b -> sem2[0] signal, fence[0] signal
img[1] -> sem1[1] signal -> t|...|________|ef|fs|lf|co|b -> sem2[1] signal, fence[1] signal

where _ indicates a wait operation.

To achieve this, we have to add a barrier that prevents subsequent frames from executing the EARLY_FRAGMENT_TESTS and LATE_FRAGMENT_TESTS stages at the same time. There is only one queue where the draw calls are performed, so only the commands in the graphicsQueue require a barrier. The "barrier" can be established using a subpass dependency:

vkWaitForFences(..., fence[cf], ...);
vkAcquireNextImageKHR(..., /* signal when done: */ sem1[cf], ...);
vkResetFences(..., fence[cf]);
vkQueueSubmit(graphicsQueue, ...
    /* wait for: */ sem1[cf], /* wait stage: */ EARLY_FRAGMENT_TESTS ...
    vkCmdBeginRenderPass(cb[cf], ...);
      Subpass Dependency between EXTERNAL -> 0:
          srcStages = EARLY_FRAGMENT_TESTS|LATE_FRAGMENT_TESTS,
          srcAccess = DEPTH_STENCIL_ATTACHMENT_WRITE, 
          dstStages = EARLY_FRAGMENT_TESTS|LATE_FRAGMENT_TESTS,
          dstAccess = DEPTH_STENCIL_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_READ
      ...
      vkCmdDrawIndexed(cb[cf], ...);
      (Implicit!) Subpass Dependency between 0 -> EXTERNAL:
          srcStages = ALL_COMMANDS,
          srcAccess = COLOR_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_WRITE, 
          dstStages = BOTTOM_OF_PIPE,
          dstAccess = 0
    vkCmdEndRenderPass(cb[cf]);
    /* signal when done: */ sem2[cf], ...
    /* signal when done: */ fence[cf]
);
vkQueuePresent(presentQueue, ... /* wait for: */ sem2[cf], ...);

This should establish a proper barrier on the graphicsQueue between the draw calls of the different frames. Because it is an EXTERNAL -> 0-type subpass dependency, we can be sure that renderpass-external commands are synchronized (i.e. sync with the previous frame).

Update: the wait stage for sem1[cf] also has to be changed from COLOR_ATTACHMENT_OUTPUT to EARLY_FRAGMENT_TESTS. This is because layout transitions happen at vkCmdBeginRenderPass time: after the first synchronization scope (srcStages and srcAccess) and before the second synchronization scope (dstStages and dstAccess). Therefore, the swapchain image must already be available at that point so that the layout transition happens at the right time.
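Translated into actual API structures, the fixed EXTERNAL -> 0 dependency from the pseudocode above could be set up roughly as follows. This is a sketch using the real Vulkan enum names; makeDepthDependency is an illustrative helper of my own, not part of the tutorial code:

```cpp
#include <vulkan/vulkan.h>

// EXTERNAL -> 0 dependency guarding the single shared depth buffer:
// the next frame's depth tests wait until the previous frame's
// depth writes have finished and are visible.
VkSubpassDependency makeDepthDependency() {
    VkSubpassDependency dep{};
    dep.srcSubpass    = VK_SUBPASS_EXTERNAL;
    dep.dstSubpass    = 0;
    dep.srcStageMask  = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                        VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
    dep.srcAccessMask = VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
    dep.dstStageMask  = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                        VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
    dep.dstAccessMask = VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT |
                        VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT;
    dep.dependencyFlags = 0;
    return dep;
}
```

The returned struct would go into VkRenderPassCreateInfo::pDependencies, replacing the tutorial's COLOR_ATTACHMENT_OUTPUT-only dependency (or accompanying it, if the color attachment still needs its own dependency).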

Solution 2:[2]

No, rasterization order does not (per specification) extend outside a single subpass. If multiple subpasses write to the same depth buffer, then there should be a VkSubpassDependency between them. If something outside a render pass writes to the depth buffer, then there should also be explicit synchronization (via barriers, semaphores, or fences).
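For the multi-subpass case mentioned above, a dependency between two hypothetical subpasses 0 and 1 that both write the same depth attachment might look like the following sketch (the subpass indices and the variable name depthDep are assumptions for illustration):

```cpp
#include <vulkan/vulkan.h>

// Subpass 1's depth tests must wait for subpass 0's depth writes.
VkSubpassDependency depthDep{};
depthDep.srcSubpass    = 0;
depthDep.dstSubpass    = 1;
depthDep.srcStageMask  = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                         VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
depthDep.srcAccessMask = VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
depthDep.dstStageMask  = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                         VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
depthDep.dstAccessMask = VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT |
                         VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
// Passed via VkRenderPassCreateInfo::pDependencies alongside
// the render pass's other dependencies.
```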

FWIW, I think the vulkan-tutorial sample is non-conformant. At least, I do not see anything that would prevent a memory hazard on the depth buffer. It seems the depth buffer should either be duplicated MAX_FRAMES_IN_FLIGHT times, or explicitly synchronized.

The sneaky part about undefined behavior is that wrong code often works correctly. Unfortunately, making synchronization proofs in the validation layers is a little bit tricky, so for now the only thing that remains is simply to be careful.

Futureproofing the answer:
What I do see is the conventional WSI semaphore chain (used with vkAcquireNextImageKHR and vkQueuePresentKHR) with imageAvailable and renderFinished semaphores. There is only one subpass dependency, with VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, and it is chained to the imageAvailable semaphore. Then there are fences with MAX_FRAMES_IN_FLIGHT == 2, plus fences guarding the individual swapchain images. That means two subsequent frames should run unimpeded with respect to each other (except in the rare case that they acquire the same swapchain image). So the depth buffer appears to be unprotected between two frames.

Solution 3:[3]

Yes, I also spent some time trying to figure out what was meant by the statement "We only need a single depth image, because only one draw operation is running at once."

That didn't make sense to me for a triple-buffered rendering setup where work is submitted to the queues until MAX_FRAMES_IN_FLIGHT is reached; there's no guarantee that all three frames aren't running at once!

While the single depth image worked OK, triplicating everything so that each frame uses a fully independent set of resources (memory blocks and all) seems to be the safest design, and it yielded identical performance under test.
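The duplicated-resources layout can be sketched as follows. The names FrameDepthResources and depthPerFrame are my own; the tutorial itself uses a single depthImage/depthImageView pair created by createDepthResources:

```cpp
#include <vulkan/vulkan.h>
#include <array>

constexpr int MAX_FRAMES_IN_FLIGHT = 3; // triple buffering

// One fully independent depth attachment per in-flight frame,
// so no inter-frame synchronization is needed for depth at all.
struct FrameDepthResources {
    VkImage        image  = VK_NULL_HANDLE;
    VkDeviceMemory memory = VK_NULL_HANDLE; // dedicated allocation per frame
    VkImageView    view   = VK_NULL_HANDLE;
};

std::array<FrameDepthResources, MAX_FRAMES_IN_FLIGHT> depthPerFrame;

// At framebuffer creation time, frame cf would then attach
// depthPerFrame[cf].view instead of the single shared depthImageView.
```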

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1:
Solution 2:
Solution 3: MarvinWS