We have a conversion from LDG on the compute shader to a special constant buffer binding that's used to exceed hardware limits on compute, but it was only running if the byte offset could be identified. The fallback that checks all of the bindings at runtime only checks the storage buffers.
This PR adds checking ube ranges to the LoadGlobal fallback. This extends the changes in #4011 to only check ube entries which are accessed by the shader.
Fixes particles affected by the wind in The Legend of Zelda: Breath of the Wild. May fix other weird issues with compute shaders in some games.
Try a bunch of games and drivers to make sure they don't blow up loading constants willynilly from searchable buffers.
* Make all structs readonly when applicable. It should reduce amount of needless defensive copies
* Make structs with trivial boilerplate equality code record structs
* Remove unnecessary readonly modifiers from TextureCreateInfo
* Make BitMap structs readonly too
* GPU: Use lazy checks for specialization state
This PR adds a new class, the SpecializationStateUpdater, that allows elements of specialization state to be updated individually, and signal the state is checked when it changes between draws, instead of building and checking it on every draw. This also avoids building spec state when
Most state updates have been moved behind the shader state update, so that their specialization state updates make it in before shaders are fetched.
Downside: Fields in GpuChannelGraphicsState are no longer readonly. To counteract copies that might be caused this I pass it as `ref` when possible, though maybe `in` would be better? Not really sure about the quirks of `in` and the difference probably won't show on a benchmark.
The result is around 2 extra FPS on SMO in the usual spot. Not much right now, but it will remove costs when we're doing more expensive specialization checks, such as fragment output type specialization for macos. It may also help more on other games with more draws.
* Address Feedback
* Oops
* GPU: Swap bindings array instead of copying
Reduces work on UpdateShaderState. Now the cost is a few reference moves for arrays, rather than copying data.
Downside: bindings arrays are no longer readonly.
* Micro optimisation
* Add missing docs
* Address Feedback
* Track buffer migrations and flush source on incomplete copy
Makes sure that the modified range list is always from the latest iteration of the buffer, and flushes earlier iterations of a buffer if the data has not been migrated yet.
* Cleanup 1
* Reduce cost for redundant signal checks on Vulkan
* Only inherit the range list if there are pending ranges.
* Fix OpenGL
* Address Feedback
* Whoops
* Ensure that vertex attribute buffer index is valid on GPU
* Remove vertex buffer validation code from OpenGL
* Remove some fields that are no longer necessary
This replacement is meant to be done with the original identified byteOffset, not the one assigned later on by the below conditionals (that already has the constant offset added, for instance).
This fixes videos being pixelated in Xenoblade 3, and other regressions that might have happened since #3847.
* GAL: Send all buffer assignments at once rather than individually
The `(int first, BufferRange[] ranges)` method call has very significant performance implications when the bindings are spread out, which they generally always are in Vulkan. This change makes it so that these methods are only called a maximum of one time per draw.
Significantly improves GPU thread performance in Pokemon Scarlet/Violet.
* Address Feedback
Removed SetUniformBuffers(int first, ReadOnlySpan<BufferRange> buffers)
* GPU: Access non-prefetch command buffers directly
Saves allocating new arrays for them constantly - they can be quite small so it can be very wasteful. About 0.4% of GPU thread in SMO, but was a bit higher in S/V when I checked.
Assumes that non-prefetch command buffers won't be randomly clobbered before they finish executing, though that's probably a safe bet.
* Small change while I'm here
* Address feedback
I did this on ncbuffer2 when we were using it for LDN 3, but I noticed that it can apply to the current buffer manager too, and it's an easy performance win.
The only buffer access that can come from another thread is the overlap search for buffers that have been unmapped. Everything else, including modifications, come from the main GPU thread. That means we only need to lock the range list when it's being modified, as that's the only time where we'll cause a race with the unmapped handler.
This has a significant performance improvements in situations where FIFO is high, like the other two PRs. Joined together they give a nice boost (73.6 master -> 79 -> 83 fps in SMO).
A quick fix to prevent reading the wrong value of Count when reregistering ranges for a new target buffer. Buffer flushes from another thread can modify the range list when the lock isn't active, which can change the count.
This prevents some crashes in Pokemon Scarlet/Violet. It's probably likely that buffer migration during flush is causing some other issues in this game, but this at least prevents the crashing.
* Prune ForceDirty and CheckModified caches on unmap
Since we're now using this for modified checks on the HLE indirect draw method, I'm worried that leaving these to forever gather cache entries isn't the best idea for performance in the long term, and it could keep old buffer objects alive for longer than they should be.
This PR adds the ability to prune invalid entries before checking these caches, and queues it whenever gpu memory is unmapped. It also aligns modified checks to the page size, as I figured it would be possible for a huge number of overlapping over a game's runtime.
This prevents Super Mario Odyssey from having 10s of thousands of entries in the modified cache in Metro Kingdom, and them duplicating when entering and leaving a building (should be cleared, as they were unmapped).
* Address Feedback
The type in the `texOp` in the textureSize instruction doesn't have the exact type on SPIR-V (for example, it is missing the Array flag). This PR gives it the proper type before giving it to the unscaling helper.
This fixes the ground textures being broken on Pokemon Scarlet/Violet when scaling. It wasn't finding the texture, so the descriptor index it provided was -1...
* Eliminate CB0 accesses
Still some work to do, decouple from hle?
* Forgot the important part somehow
* Fix and improve alignment test
* Address Feedback
* Remove some complexity when checking storage buffer alignment
* Update Ryujinx.Graphics.Shader/Translation/Optimizations/GlobalToStorage.cs
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Implement HLE macro for DrawElementsIndirect
* Shader cache version bump
* Use GL_ARB_shader_draw_parameters extension on OpenGL
* Fix DrawIndexedIndirectCount on Vulkan when extension is not supported
* Implement DrawIndex
* Alignment
* Fix some validation errors
* Rename BaseIds to DrawParameters
* Fix incorrect index buffer and vertex buffer size in some cases
* Add HLE macros for DrawArraysInstanced and DrawElementsInstanced
* Perform a regular draw when indirect data is not modified
* Use non-indirect draw methods if indirect buffer was not GPU modified
* Only check if draw parameters match if the shader actually uses them
* Expose Macro HLE setting on GUI
* Reset FirstVertex and FirstInstance after draw
* Update shader cache version again since some people already tested this
* PR feedback
Co-authored-by: riperiperi <rhy3756547@hotmail.com>
* Update readme to mention .NET 7
* infra: Migrate to .NET 7
.NET 7 is still in preview but this prepare for the release coming up
next month.
* Use Random.Shared in CreateRandom
* Move UInt128Utils.cs to Ryujinx.Common project
* Fix inverted parameters in System.UInt128 constructor
* Fix Visual Studio complains on Ryujinx.Graphics.Vic
* time: Fix missing alignment enforcement in SystemClockContext
Fixes at least Smash
* time: Fix missing alignment enforcement in SteadyClockContext
Fix games (like recent version of Smash) using time shared memory
* Switch to .NET 7.0.100 release
* Enable Tiered PGO
* Ensure CreateId validity requirements are meet when doing random generation
Also enforce correct packing layout for other Mii structures.
This fix a Mario Kart 8 crashes related to the default Miis.
* Vulkan: Implement multisample <-> non-multisample copies and depth-stencil resolve
* FramebufferParams is no longer required there
* Implement Specialization Constants and merge CopyMS Shaders (#15)
* Vulkan: Initial Specialization Constants
* Replace with specialized helper shader
* Reimplement everything
Fix nonexistant interaction with Ryu pipeline caching
Decouple specialization info from data and relocate them
Generalize mapping and add type enum to better match spv types
Use local fixed scopes instead of global unmanaged allocs
* Fix misses in initial implementation
Use correct info variable in Create2DLayerView
Add ShaderStorageImageMultisample to required feature set
* Use texture for source image
* No point in using ReadOnlyMemory
* Apply formatting feedback
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Apply formatting suggestions on shader source
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Support conversion with samples count that does not match the requested count, other minor changes
Co-authored-by: mageven <62494521+mageven@users.noreply.github.com>
* GPU: Pass SpanOrArray for Texture SetData to avoid copy
Texture data is often converted before upload, meaning that an array was allocated to perform the conversion into. However, the backend SetData methods were being passed a Span of that data, and the Multithreaded layer does `ToArray()` on it so that it can be stored for later! This method can't extract the original array, so it creates a copy.
This PR changes the type passed for textures to a new ref struct called SpanOrArray, which is backed by either a ReadOnlySpan or an array. The benefit here is that we can have a ToArray method that doesn't copy if it is originally backed by an array.
This will also avoid a copy when running the ASTC decoder.
On NieR this was taking 38% of texture upload time, which it does a _lot_ of when you move between areas, so there should be a 1.6x performance boost when strictly uploading textures. No doubt this will also improve texture streaming performance in UE4 games, and maybe a small reduction with video playback.
From the numbers, it's probably possible to improve the upload rate by a further 1.6x by performing layout conversion on GPU. I'm not sure if we could improve it further than that - multithreading conversion on CPU would probably result in memory bottleneck.
This doesn't extend to buffers, since we don't convert their data on the GPU emulator side.
* Remove implicit cast to array.
* Scale SamplesPassed counter by RT scale on report
Adds a scale factor for samples passed counter report based on the render target scale at the time. This ensures that when a game reads this counter, it appears similar to the result at 1x.
This doesn't cover cases where the the render target scale changes during the queried draws, though that might be better to handle along with other scope related issues in a future rework of counters. Games generally don't count for occlusion queries over render target changes anyways.
Fixes an issue in the Splatoon games where the special charge would scale too quickly at high res, points at the end of the game would be broken (but still provide a correct winner), and playing at a low res would make it impossible to swim in ink.
May also affect LOD scaling in The Witcher 3.
* Update Ryujinx.Graphics.Gpu/Engine/Threed/SemaphoreUpdater.cs
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: gdkchan <gab.dark.100@gmail.com>