
Conversation

@Przemog1 (Contributor)

No description provided.

};

template<class RandGen, class RayGen, class Intersector, class MaterialSystem, /* class PathGuider, */ class NextEventEstimator>
// TODO: maybe implement a concept to ensure that OutputTypeVec is a vector?


yes please concept needed!
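For illustration, a minimal C++20-style sketch of such a constraint; the trait is hypothetical and would have to map onto whatever vector machinery Nabla actually exposes (the HLSL side has its own concept-emulation macros, so this is just the shape of the idea):

#include <type_traits>

// hypothetical trait: specialize it for the library's vector types (e.g. float32_t3)
template<typename T>
struct is_output_vector : std::false_type {};

template<typename T>
concept OutputVector = is_output_vector<T>::value;

// the TODO above could then become something like:
// template<class RandGen, /*...*/, OutputVector OutputTypeVec>
// struct Unidirectional;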

Comment on lines -4 to +15
struct SPushConstants
#ifndef __HLSL_VERSION
#include "matrix4SIMD.h"
#endif

struct RenderPushConstants
{
#ifdef __HLSL_VERSION
float32_t4x4 invMVP;
#else
nbl::core::matrix4SIMD invMVP;
#endif


no, we use hlsl::float32_t4x4 in C++ and float32_t4x4 in HLSL only!
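That is, the only thing the #else branch should change is the namespace qualification, roughly:

struct RenderPushConstants
{
#ifdef __HLSL_VERSION
    float32_t4x4 invMVP;
#else
    nbl::hlsl::float32_t4x4 invMVP; // same type on the C++ side, no core::matrix4SIMD
#endif
};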

Comment on lines +60 to +63
#ifdef RWMC_ENABLED
[[vk::image_format("rgba16f")]] [[vk::binding(0, 1)]] RWTexture2DArray<float32_t4> cascade;
#endif
[[vk::image_format("rgba16f")]] [[vk::binding(0, 0)]] RWTexture2D<float32_t4> outImage;


always use an array view (without RWMC you can just make the array 1 layer), and use the same binding


if you're only writing the image (not loading) you can skip declaring the format, because we use the Unformatted Storage extension
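Putting both comments together, the declaration could collapse to a single sketch like the following (binding index kept from the current code; how exactly cascade and outImage end up sharing the binding is up to the final layout):

// always an array view: 1 layer without RWMC, CascadeSize layers with it,
// and no image_format attribute since the shader only writes the image
[[vk::binding(0, 0)]] RWTexture2DArray<float32_t4> outImage;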

@Przemog1 (Contributor Author):

After taking another look at the code, I figured it's best to keep cascade and outImage in two separate descriptor sets. This way descriptor set 0 (the one with outImage only) can be reused across different shaders.

using pathtracer_type = ext::PathTracer::Unidirectional<randgen_type, raygen_type, intersector_type, material_system_type, nee_type>;

#ifdef RWMC_ENABLED
using accumulator_type = rwmc::CascadeAccumulator<float32_t3, CascadeSize>;


naming nitpick: CascadeCount

Comment on lines +9 to +17
struct RenderRWMCPushConstants
{
#ifdef __HLSL_VERSION
float32_t4x4 invMVP;
#else
nbl::core::matrix4SIMD invMVP;
#endif
int sampleCount;
int depth;


make one push constant struct in terms of the other, preferably through composition, so:

struct RenderRWMCPushConstants
{
   RenderPushConstants base;
   rwmc::SplattingParameters rwmc;
};

Comment on lines +18 to +20
float start;
float base;
float kappa;


don't use the same push constants for resolve and for rendering & splatting

I really want you to pack start and base into their own rwmc::SplattingParameters struct
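Something along these lines would cover it (field layout is just a sketch; kappa and the other resolve-only parameters would live in a separate ResolvePushConstants instead):

// shared HLSL/C++ header, e.g. under the rwmc namespace
struct SplattingParameters
{
    float start;
    float base;
};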

Comment on lines +41 to +44
#ifdef RWMC_ENABLED
#include <nbl/builtin/hlsl/rwmc/CascadeAccumulator.hlsl>
#include "render_rwmc_common.hlsl"
#endif


you should be able to always include stuff like this without a penalty

Comment on lines +29 to +47
#ifdef PERSISTENT_WORKGROUPS
uint32_t virtualThreadIndex;
[loop]
for (uint32_t virtualThreadBase = glsl::gl_WorkGroupID().x * WorkgroupSize; virtualThreadBase < 1920 * 1080; virtualThreadBase += glsl::gl_NumWorkGroups().x * WorkgroupSize)
{
virtualThreadIndex = virtualThreadBase + glsl::gl_LocalInvocationIndex().x;
const int32_t2 coords = (int32_t2)math::Morton<uint32_t>::decode2d(virtualThreadIndex);
#else
const int32_t2 coords = getCoordinates();
#endif

rwmc::ReweightingParameters reweightingParameters = rwmc::computeReweightingParameters(pc.base, pc.sampleCount, pc.minReliableLuma, pc.kappa, CascadeSize);
float32_t3 color = rwmc::reweight(reweightingParameters, cascade, coords);

outImage[coords] = float32_t4(color, 1.0f);

#ifdef PERSISTENT_WORKGROUPS
}
#endif


Resolve doesn't need to use persistent WG, you can use a Regular 2D dispatch
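A sketch of what the resolve entry point could look like with a plain 2D dispatch (the workgroup dimensions and the edge-tile bounds check are assumptions):

[numthreads(16, 16, 1)]
void main()
{
    uint32_t width, height, layers;
    cascade.GetDimensions(width, height, layers);

    const int32_t2 coords = int32_t2(glsl::gl_GlobalInvocationID().xy);
    if (any(coords >= int32_t2(width, height))) // guard partially filled edge tiles
        return;

    rwmc::ReweightingParameters reweightingParameters = rwmc::computeReweightingParameters(pc.base, pc.sampleCount, pc.minReliableLuma, pc.kappa, CascadeSize);
    const float32_t3 color = rwmc::reweight(reweightingParameters, cascade, coords);
    outImage[coords] = float32_t4(color, 1.0f);
}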

#ifdef PERSISTENT_WORKGROUPS
uint32_t virtualThreadIndex;
[loop]
for (uint32_t virtualThreadBase = glsl::gl_WorkGroupID().x * WorkgroupSize; virtualThreadBase < 1920 * 1080; virtualThreadBase += glsl::gl_NumWorkGroups().x * WorkgroupSize)


don't hardcode the render target resolution


you can query it from the cascade image; HLSL has intrinsics for that, and you can add SPIR-V intrinsics for the GLSL functions textureSize and imageSize
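e.g. the hardcoded bound could be replaced by a GetDimensions query on the cascade, mirroring what getCoordinates already does with outImage:

uint32_t width, height, layers;
cascade.GetDimensions(width, height, layers);
const uint32_t pixelCount = width * height; // use this as the loop bound instead of 1920 * 1080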

const int32_t2 coords = getCoordinates();
#endif

rwmc::ReweightingParameters reweightingParameters = rwmc::computeReweightingParameters(pc.base, pc.sampleCount, pc.minReliableLuma, pc.kappa, CascadeSize);
@devshgraphicsprogramming (Member), Oct 29, 2025:

this should be done on the CPU and the reweightingParameters should be in your ResolvePushConstants here.
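A hedged sketch of the host side, assuming the rwmc header compiles as C++ as well (if it doesn't, the same math would just be mirrored on the CPU); argument order taken from the existing shader call:

struct ResolvePushConstants
{
    nbl::hlsl::rwmc::ReweightingParameters reweightingParameters; // precomputed once per frame
};

// CPU side, when filling the resolve push constants:
resolvePushConstants.reweightingParameters = nbl::hlsl::rwmc::computeReweightingParameters(
    rwmcCascadeBase, spp, rwmcMinReliableLuma, rwmcKappa, CascadeSize);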

Comment on lines +16 to +17
NBL_CONSTEXPR uint32_t MAX_DEPTH_LOG2 = 4;
NBL_CONSTEXPR uint32_t MAX_SAMPLES_LOG2 = 10;


you don't need those here, I think

Comment on lines +19 to +24
int32_t2 getCoordinates()
{
uint32_t width, height;
outImage.GetDimensions(width, height);
return int32_t2(glsl::gl_GlobalInvocationID().x % width, glsl::gl_GlobalInvocationID().x / width);
}


use a 2D dispatch
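i.e. on the host side something along these lines (the tile size is an assumption); the shader then just uses gl_GlobalInvocationID().xy as the pixel coordinates and getCoordinates goes away:

constexpr uint32_t TileDim = 16u; // hypothetical workgroup dimension
const uint32_t groupsX = (WindowDimensions.x + TileDim - 1u) / TileDim;
const uint32_t groupsY = (WindowDimensions.y + TileDim - 1u) / TileDim;
cmdbuf->dispatch(groupsX, groupsY, 1u);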

Comment on lines +296 to +298
Accumulator accumulator;
accumulator.initialize(accumulatorInitData);
//scalar_type meanLumaSq = 0.0;


take the accumulator by reference from the outside, otherwise static polymorphism gets harder with stateful accumulators

Consider this scenario: I already have an accumulator, and I just want to add a few samples


// Li
measure_type getMeasure(uint32_t numSamples, uint32_t depth, NBL_CONST_REF_ARG(scene_type) scene)
output_storage_type getMeasure(uint32_t numSamples, uint32_t depth, NBL_CONST_REF_ARG(scene_type) scene, NBL_REF_ARG(typename Accumulator::initialization_data) accumulatorInitData)


one suggestion to refactor: take sampleIndex (the i in the loop) and do the loop outside the path tracer (initialize the accumulator outside as well)

Also, I know it's not your code, but better to rename depth to maxDepth

Comment on lines -302 to +326
return Li;
return accumulator.accumulation;


take the accumulator by reference and make the function void
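Putting the last few comments together, the refactored interface could look roughly like this (the accumulate call is illustrative, not the actual accumulator API):

// accumulator is owned and initialized by the caller; one call adds one sample
void getMeasure(uint32_t sampleIndex, uint32_t maxDepth,
                NBL_CONST_REF_ARG(scene_type) scene,
                NBL_REF_ARG(Accumulator) accumulator)
{
    // trace a single path for sampleIndex, then fold its contribution in, e.g.:
    // accumulator.accumulate(sampleIndex, contribution); // hypothetical method
}

// caller side:
// Accumulator accumulator;
// accumulator.initialize(accumulatorInitData);
// for (uint32_t i = 0; i < numSamples; i++)
//     pathtracer.getMeasure(i, maxDepth, scene, accumulator);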

};

std::array<ICPUDescriptorSetLayout::SBinding, 1> descriptorSet0Bindings = {};
std::array<ICPUDescriptorSetLayout::SBinding, 1> descriptorSet1Bindings = {};


don't add a new set; reuse the output image's binding for the cascades, with layers == CascadeCount as I wrote in other comments

params.shader.entryPoint = "main";
params.shader.entries = nullptr;
params.shader.requireFullSubgroups = true;
params.shader.requiredSubgroupSize = static_cast<IGPUShader::SSpecInfo::SUBGROUP_SIZE>(5);


this is bad and breaks on GPUs other than NV and RDNA. I know it's @keptsecret's "fault", but let's fix it: make the subgroup size the log2 of the minimum the device reports (for best divergence handling)
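A hedged sketch of that; the exact limit name is an assumption, so check what SPhysicalDeviceLimits actually calls it:

#include <bit>

// subgroup sizes are powers of two, so log2 == countr_zero
const uint32_t minSubgroupSize = m_physicalDevice->getLimits().minSubgroupSize; // limit name assumed
params.shader.requiredSubgroupSize = static_cast<IGPUShader::SSpecInfo::SUBGROUP_SIZE>(std::countr_zero(minSubgroupSize));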


nice code deduplication btw :P

Comment on lines +799 to +805
imgViewInfo.viewType = IGPUImageView::ET_2D;
}
else
{
imgViewInfo.subresourceRange.layerCount = CascadeSize;
imgViewInfo.viewType = IGPUImageView::ET_2D_ARRAY;
}


simplify code, always make it a 2D Array (you can just have 1 layer, no problem)
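i.e. the whole if/else could collapse to the following (useRWMC stands in for whatever condition the surrounding branch already tests):

imgViewInfo.viewType = IGPUImageView::ET_2D_ARRAY;                    // always an array view
imgViewInfo.subresourceRange.layerCount = useRWMC ? CascadeSize : 1u; // 1 layer is fine without RWMC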

Comment on lines +1409 to +1657
cmdbuf->reset(IGPUCommandBuffer::RESET_FLAGS::NONE);
// disregard surface/swapchain transformation for now
const auto viewProjectionMatrix = m_camera.getConcatenatedMatrix();
viewProjectionMatrix.getInverseTransform(pc.invMVP);
pc.sampleCount = spp;
pc.depth = depth;

// safe to proceed
// upload buffer data
cmdbuf->beginDebugMarker("ComputeShaderPathtracer IMGUI Frame");
cmdbuf->begin(IGPUCommandBuffer::USAGE::ONE_TIME_SUBMIT_BIT);

// TRANSITION m_outImgView to GENERAL (because of descriptorSets0 -> ComputeShader Writes into the image)
{
const IGPUCommandBuffer::SImageMemoryBarrier<IGPUCommandBuffer::SOwnershipTransferBarrier> imgBarriers[] = {
{
.barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::ALL_TRANSFER_BITS,
.srcAccessMask = ACCESS_FLAGS::TRANSFER_WRITE_BIT,
.dstStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS
}
},
.image = m_outImgView->getCreationParameters().image.get(),
.subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
},
.oldLayout = IImage::LAYOUT::UNDEFINED,
.newLayout = IImage::LAYOUT::GENERAL
}
};
cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imgBarriers });
}

// cube envmap handle
{
IGPUComputePipeline* pipeline;
if (usePersistentWorkGroups)
pipeline = renderMode == E_RENDER_MODE::ERM_HLSL ? m_PTHLSLPersistentWGPipelines[PTPipeline].get() : m_PTGLSLPersistentWGPipelines[PTPipeline].get();
else
pipeline = renderMode == E_RENDER_MODE::ERM_HLSL ? m_PTHLSLPipelines[PTPipeline].get() : m_PTGLSLPipelines[PTPipeline].get();
cmdbuf->bindComputePipeline(pipeline);
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 0u, 1u, &m_descriptorSet0.get());
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 2u, 1u, &m_descriptorSet2.get());
cmdbuf->pushConstants(pipeline->getLayout(), IShader::E_SHADER_STAGE::ESS_COMPUTE, 0, sizeof(RenderPushConstants), &pc);

// TODO: shouldn't it be computed only at initialization stage and on window resize?
const uint32_t dispatchSize = usePersistentWorkGroups ?
m_physicalDevice->getLimits().computeOptimalPersistentWorkgroupDispatchSize(WindowDimensions.x * WindowDimensions.y, DefaultWorkGroupSize) :
1 + (WindowDimensions.x * WindowDimensions.y - 1) / DefaultWorkGroupSize;

cmdbuf->dispatch(dispatchSize, 1u, 1u);
}

// TRANSITION m_outImgView to READ (because of descriptorSets0 -> ComputeShader Writes into the image)
{
const IGPUCommandBuffer::SImageMemoryBarrier<IGPUCommandBuffer::SOwnershipTransferBarrier> imgBarriers[] = {
{
.barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.srcAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS,
.dstStageMask = PIPELINE_STAGE_FLAGS::FRAGMENT_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_READ_BITS
}
},
.image = m_outImgView->getCreationParameters().image.get(),
.subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
},
.oldLayout = IImage::LAYOUT::GENERAL,
.newLayout = IImage::LAYOUT::READ_ONLY_OPTIMAL
}
};
cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imgBarriers });
}


}

void beginCommandBufferAndDispatchPathracerPipelineUseRWMC(IGPUCommandBuffer* cmdbuf)
{
if (renderMode != E_RENDER_MODE::ERM_HLSL)
{
m_logger->log("Only HLSL render mode is supported.", ILogger::ELL_ERROR);
std::exit(-1);
}

cmdbuf->reset(IGPUCommandBuffer::RESET_FLAGS::NONE);
// disregard surface/swapchain transformation for now
const auto viewProjectionMatrix = m_camera.getConcatenatedMatrix();
viewProjectionMatrix.getInverseTransform(rwmcPushConstants.invMVP);

rwmcPushConstants.start = rwmcCascadeStart;
rwmcPushConstants.depth = depth;
rwmcPushConstants.sampleCount = resolvePushConstants.sampleCount = spp;
rwmcPushConstants.base = resolvePushConstants.base = rwmcCascadeBase;
resolvePushConstants.minReliableLuma = rwmcMinReliableLuma;
rwmcPushConstants.kappa = resolvePushConstants.kappa = rwmcKappa;

// safe to proceed
// upload buffer data
cmdbuf->beginDebugMarker("ComputeShaderPathtracer IMGUI Frame");
cmdbuf->begin(IGPUCommandBuffer::USAGE::ONE_TIME_SUBMIT_BIT);

// TRANSITION m_outImgView to GENERAL (because of descriptorSets0 -> ComputeShader Writes into the image)
{
const IGPUCommandBuffer::SImageMemoryBarrier<IGPUCommandBuffer::SOwnershipTransferBarrier> imgBarriers[] = {
{
.barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::ALL_TRANSFER_BITS,
.srcAccessMask = ACCESS_FLAGS::TRANSFER_WRITE_BIT,
.dstStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS
}
},
.image = m_outImgView->getCreationParameters().image.get(),
.subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
},
.oldLayout = IImage::LAYOUT::UNDEFINED,
.newLayout = IImage::LAYOUT::GENERAL
}
};
cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imgBarriers });
}

// transit m_cascadeView layout to GENERAL, block until previous shader is done with reading from cascade
{
const IGPUCommandBuffer::SImageMemoryBarrier<IGPUCommandBuffer::SOwnershipTransferBarrier> cascadeBarrier[] = {
{
.barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.srcAccessMask = ACCESS_FLAGS::NONE,
.dstStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::NONE
}
},
.image = m_cascadeView->getCreationParameters().image.get(),
.subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = CascadeSize
},
.oldLayout = IImage::LAYOUT::UNDEFINED,
.newLayout = IImage::LAYOUT::GENERAL
}
};
cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = cascadeBarrier });
}

// TODO: shouldn't it be computed only at initialization stage and on window resize?
const uint32_t dispatchSize = usePersistentWorkGroups ?
m_physicalDevice->getLimits().computeOptimalPersistentWorkgroupDispatchSize(WindowDimensions.x * WindowDimensions.y, DefaultWorkGroupSize) :
1 + (WindowDimensions.x * WindowDimensions.y - 1) / DefaultWorkGroupSize;

{
IGPUComputePipeline* pipeline = usePersistentWorkGroups ? m_PTHLSLPersistentWGPipelinesRWMC[PTPipeline].get() : m_PTHLSLPipelinesRWMC[PTPipeline].get();

cmdbuf->bindComputePipeline(pipeline);
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 0u, 1u, &m_descriptorSet0.get());
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 1u, 1u, &m_descriptorSet1.get());
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 2u, 1u, &m_descriptorSet2.get());
cmdbuf->pushConstants(pipeline->getLayout(), IShader::E_SHADER_STAGE::ESS_COMPUTE, 0, sizeof(RenderRWMCPushConstants), &rwmcPushConstants);

cmdbuf->dispatch(dispatchSize, 1u, 1u);
}

// m_cascadeView synchronization - wait for previous compute shader to write into the cascade
// TODO: create this and every other barrier once outside of the loop?
{
const IGPUCommandBuffer::SImageMemoryBarrier<IGPUCommandBuffer::SOwnershipTransferBarrier> cascadeBarrier[] = {
{
.barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.srcAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS,
.dstStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_READ_BITS
}
},
.image = m_cascadeView->getCreationParameters().image.get(),
.subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = CascadeSize
}
}
};
cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = cascadeBarrier });
}

// reweighting
{
IGPUComputePipeline* pipeline = usePersistentWorkGroups ? m_resolvePersistentWGPipeline.get() : m_resolvePipeline.get();

cmdbuf->bindComputePipeline(pipeline);
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 0u, 1u, &m_descriptorSet0.get());
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, pipeline->getLayout(), 1u, 1u, &m_descriptorSet1.get());
cmdbuf->pushConstants(pipeline->getLayout(), IShader::E_SHADER_STAGE::ESS_COMPUTE, 0, sizeof(ResolvePushConstants), &resolvePushConstants);

cmdbuf->dispatch(dispatchSize, 1u, 1u);
}

// TRANSITION m_outImgView to READ (because of descriptorSets0 -> ComputeShader Writes into the image)
{
const IGPUCommandBuffer::SImageMemoryBarrier<IGPUCommandBuffer::SOwnershipTransferBarrier> imgBarriers[] = {
{
.barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.srcAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS,
.dstStageMask = PIPELINE_STAGE_FLAGS::FRAGMENT_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_READ_BITS
}
},
.image = m_outImgView->getCreationParameters().image.get(),
.subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
},
.oldLayout = IImage::LAYOUT::GENERAL,
.newLayout = IImage::LAYOUT::READ_ONLY_OPTIMAL
}
};
cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imgBarriers });
}


deduplicate this code a bit; I think everything up to the resolve shader can stay as-is, and the only diff between RWMC and no RWMC should be the push constants you set on the main PT shader
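One possible shape for that deduplication (the helper name and the exact split are illustrative; pipeline and dispatchSize are selected exactly as in the existing code):

// shared recording path; RWMC only changes which push constants feed the PT pipeline
// and whether the cascade barrier + resolve dispatch get appended
void recordPathtraceFrame(IGPUCommandBuffer* cmdbuf, bool useRWMC)
{
    // ...reset/begin, transition m_outImgView (and the cascade when useRWMC) to GENERAL...
    // ...bind the PT pipeline + descriptor sets as today...
    const auto* layout = pipeline->getLayout();
    if (useRWMC)
        cmdbuf->pushConstants(layout, IShader::E_SHADER_STAGE::ESS_COMPUTE, 0, sizeof(RenderRWMCPushConstants), &rwmcPushConstants);
    else
        cmdbuf->pushConstants(layout, IShader::E_SHADER_STAGE::ESS_COMPUTE, 0, sizeof(RenderPushConstants), &pc);
    cmdbuf->dispatch(dispatchSize, 1u, 1u);
    // ...only when useRWMC: cascade read barrier + resolve pipeline dispatch...
    // ...shared final transition of m_outImgView to READ_ONLY_OPTIMAL...
}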

Comment on lines +1726 to +1729
float rwmcCascadeStart;
float rwmcCascadeBase;
float rwmcMinReliableLuma;
float rwmcKappa;


is the reason you're not using shared HLSL/C++ RWMC structs that the ImGUI parameters are different from what we actually carry around?

Still, ImGUI can take pointers to struct members, so if you can wrap it up in some HLSL struct which then gets translated into the actual optimized one with log2 members, that would be better

Comment on lines +1732 to +1734
RenderRWMCPushConstants rwmcPushConstants;
RenderPushConstants pc;
ResolvePushConstants resolvePushConstants;


don't you rebuild them every frame from the ImGUI state + camera etc.?

Maybe let's not store them around as global state and keep them transient instead
