Making the stream buffer resident increases GPU usage significantly on
some games. This seems to be addressed invalidating the stream buffer
with InvalidateBufferData instead of using a Unmap + Map (with
invalidation flags).
Switch games are allowed to bind less data than what they use in a
vertex buffer, the expected behavior here is that these values are read
as zero. At the moment of writing this only D3D12, OpenGL and NVN through
NV_vertex_buffer_unified_memory support vertex buffer with a size limit.
In theory this could be emulated on Vulkan creating a new VkBuffer for
each (handle, offset, length) tuple and binding the expected data to it.
This is likely going to be slow and memory expensive when used on the
vertex buffer and we have to do it on all draws because we can't know
without analyzing indices when a game is going to read vertex data out
of bounds.
This is not a problem on OpenGL's BufferAddressRangeNV because it takes
a length parameter, unlike Vulkan's CmdBindVertexBuffers that only takes
buffers and offsets (the length is implicit in VkBuffer). It isn't a
problem on D3D12 either, because D3D12_VERTEX_BUFFER_VIEW on
IASetVertexBuffers takes SizeInBytes as a parameter (although I am not
familiar with robustness on D3D12).
Currently this only implements buffer ranges for vertex buffers,
although indices can also be affected. A KHR_robustness profile is not
created, but Nvidia's driver reads out of bound vertex data as zero
anyway, this might have to be changed in the future.
- Fixes SMO random triangles when capturing an enemy, getting hit, or
looking at the environment on certain maps.
Make stream buffer and cached buffers as resident and query their
address. This allows us to use GPU addresses for several proprietary
Nvidia extensions.
Expose NV_vertex_buffer_unified_memory when the driver supports it.
This commit adds a function the determine if a GL_RENDERER is a Turing
GPU. This is required because on Turing GPUs Nvidia's driver crashes
when the buffer is marked as resident or on DeleteBuffers. Without a
synchronous debug output (single threaded driver), it's likely that
the driver will crash in the first blocking call.
Add HSET2_IMM. Due to the complexity of the encoding avoid using
BitField unions and read the relevant bits from the code itself.
This is less error prone.
Update validation layer string to VK_LAYER_KHRONOS_validation.
While we are at it, properly check for available validation layers
before enabling them.
Enable GL_EXT_texture_shadow_lod if available. If this extension is not available, such as on Intel/AMD proprietary drivers, use textureGrad as a workaround.
Variables that are marked as const cannot have the move constructor
invoked when returning from a function (the move constructor requires a
non-const variable so it can "steal" the resources from it.
Check() can throw an exception if the Vulkan result isn't successful.
We remove the check so that std::terminate isn't outright called and
allows for better debugging (should it ever actually fail).
Renames some variables to prevent ones in inner scopes from shadowing
outer-scoped variables.
The Copy* functions have no shadowing, but we rename them anyways to
remain consistent with the other functions.
We can reduce the capture scope so that it's not possible for both "reg"
variables to clash with one another.
While we're at it, we can prevent unnecessary copies while we're at it.
There's no need to load contents from the CPU when a clear resets all
the contents of the underlying memory. This is already implemented on
OpenGL and the texture cache.
maxwell_to_vk: Reorder filtering modes to start with None, then Nearest, then Linear.
maxwell_to_vk: Logs filter modes under UNREACHABLE_MSG instead of UNIMPLEMENTED_MSG, since any unknown filter modes are invalid and not unimplemented.
maxwell_to_vk: Return VK_SAMPLER_MIPMAP_MODE_NEAREST instead of VK_SAMPLER_MIPMAP_MODE_LINEAR when mipmap_filter is None with the description from the VkSamplerCreateInfo(3) man page.
maxwell_to_gl: Log unimplemented features under UNIMPLEMENTED_MSG instead of LOG_ERROR to bring into parity with maxwell_to_vk
maxwell_to_gl: Deduplicate logging in VertexType(), merging them into one.
maxwell_to_gl: Return GL_NEAREST instead of GL_LINEAR if an unknown texture filter mode is encountered.
maxwell_to_gl: Log the mipmap filter mode if an unknown value is passed in.
maxwell_to_gl: Reorder filtering modes to start with None, then Nearest, then Linear.
Due to the limitation of GL_MAX_IMAGE_UNITS being low (8) on Intel's and Nvidia's proprietary drivers, we have to reserve an appropriate amount of image bindings for each of the stages. So far games have been observed to use 4 image bindings on the fragment stage (Kirby Star Allies) and 1 on the vertex stage (TWD series).
No games thus far in my limited testing used more than 4 images concurrently and across all currently active programs.
This fixes shader compilation errors on Kirby Star Allies on OpenGL (GLSL/GLASM)
All registers are now callee-save registers.
RBX and RBP selected for STATE and RESULT because these are most commonly accessed; this is to avoid the REX prefix.
RBP not used for STATE because there are some SIB restrictions, RBX emits smaller code.
Emit code compatible with NV_gpu_program5.
This should emit code compatible with Fermi, but it wasn't tested on
that architecture. Pascal has some issues not present on Turing GPUs.
Instead of using as template argument a shared pointer, use the
underlying type and manage shared pointers explicitly. This can make
removing shared pointers from the cache more easy.
While we are at it, make some misc style changes and general
improvements (like insert_or_assign instead of operator[] + operator=).
Vertex buffers bindings become invalid after the stream buffer is
invalidated. We were originally doing this, but it got lost at some
point.
- Fixes Animal Crossing: New Horizons, but it affects everything.
This allows rendering to 3D textures with more than one slice.
Applications are allowed to render to more than one slice of a texture
using gl_Layer from a VTG shader.
This also requires reworking how 3D texture collisions are handled, for
now, this commit allows rendering to slices but not to miplevels. When a
render target attempts to write to a mipmap, we fallback to the previous
implementation (copying or flushing as needed).
- Fixes color correction 3D textures on UE4 games (rainbow effects).
- Allows Xenoblade games to render to 3D textures directly.
Implement a generic shader cache for fast lookups and invalidations.
Invalidations are cheap but expensive when a shader is invalidated.
Use two mutexes instead of one to avoid locking invalidations for
lookups and vice versa. When a shader has to be removed, lookups are
locked as expected.
Skip fast buffer uploads on Nvidia 443.24 Vulkan beta driver on OpenGL.
This driver throws the following error when calling BufferSubData or
BufferData on buffers that are candidates for fast constant buffer
uploads. This is the equivalens to push constants on Vulkan, except that
they can access the full buffer. The error:
Unknown internal debug message. The NVIDIA OpenGL driver has encountered
an out of memory error. This application might
behave inconsistently and fail.
If this error persists on future drivers, we might have to look deeper
into this issue. For now, we can black list it and log it as a temporary
solution.
Games using D3D idioms can join images and samplers when a shader
executes, instead of baking them into a combined sampler image. This is
also possible on Vulkan.
One approach to this solution would be to use separate samplers on
Vulkan and leave this unimplemented on OpenGL, but we can't do this
because there's no consistent way of determining which constant buffer
holds a sampler and which one an image. We could in theory find the
first bit and if it's in the TIC area, it's an image; but this falls
apart when an image or sampler handle use an index of zero.
The used approach is to track for a LOP.OR operation (this is done at an
IR level, not at an ISA level), track again the constant buffers used as
source and store this pair. Then, outside of shader execution, join
the sample and image pair with a bitwise or operation.
This approach won't work on games that truly use separate samplers in a
meaningful way. For example, pooling textures in a 2D array and
determining at runtime what sampler to use.
This invalidates OpenGL's disk shader cache :)
- Used mostly by D3D ports to Switch
NV_transform_feedback, NV_transform_feedback2 and
ARB_transform_feedback3 with NV_transform_feedback interactions allows
implementing transform feedbacks as dynamic state.
Maxwell implements transform feedbacks as dynamic state, so using these
extensions with TransformFeedbackStreamAttribsNV allows us to properly
emulate transform feedbacks without having to recompile shaders when the
state changes.
On Intel's proprietary drivers, gl_Layer and gl_ViewportIndex are not allowed members of gl_PerVertex block, causing the shader to fail to compile. Fix this by declaring these variables outside of gl_PerVertex.
This avoids using Nvidia's ASTC decoder on OpenGL.
The last time it was profiled, it was slower than yuzu's decoder.
While we are at it, fix a bug in the texture cache when native ASTC is
not supported.
Previously we were disabling compute shaders on Intel's proprietary driver due to broken compute. This has been fixed in the latest Intel drivers. Re-enable compute for Intel proprietary drivers and remove the check for broken compute.
Geometry shaders built from Nvidia's compiler check for bits[16:23] to
be less than or equal to 0 with VSETP to default to a "safe" value of
0x8000'0000 (safe from hardware's perspective). To avoid hitting this
path in the shader, return 0x00ff'0000 from S2R INVOCATION_INFO.
This seems to be the maximum number of vertices a geometry shader can
emit in a primitive.
Implement more surface reconstruct cases. Allow overlaps with more than
one layer and mipmap and copies all of them to the new texture.
- Fixes textures moving around objects on Xenoblade games
Avoid copying to a staging buffer on non-granular memory addresses.
Add a callable argument to StreamBufferUpload to be able to copy to the
staging buffer directly from ReadBlockUnsafe.
Stop ignoring image swizzles on depth and stencil images.
This doesn't fix a known issue on Xenoblade Chronicles 2 where an OpenGL
texture changes swizzles twice before being used. A proper fix would be
having a small texture view cache for this like we do on Vulkan.
While Vulkan was assuming we had no negative viewports, OpenGL code
was assuming we had them. Port the old code from Vulkan to OpenGL,
checking if the first viewport is negative before flipping faces.
This is not a complete implementation since we only check for the first
viewport to be negative. That said, unless a game is using Vulkan,
OpenGL and NVN games should be fine here, and we can always compare with
our Vulkan backend to see if there's a difference.
The check to flip faces when viewports are negative were a left over
from the old OpenGL code. This is not required on Vulkan where we have
negative viewports.
Hardware S2R special registers match gl_Thread*MaskNV. We can trivially
implement these using Nvidia's extension on OpenGL or naively stubbing
them with the ARB instructions to match. This might cause issues if the
host device warp size doesn't match Nvidia's. That said, this is
unlikely on proper shaders.
Refer to the attached url for more documentation about these flags.
https://www.khronos.org/registry/OpenGL/extensions/NV/NV_shader_thread_group.txt
Some operations like atomicMin were ignored because they returned were
being stored to RZ. This operations have a side effect and it was being
ignored.
Drop the std::list hack to allocate memory indefinitely.
Instead use a custom allocator that keeps references valid until
destruction. This allocates fixed chunks of memory and puts pointers in
a free list. When an allocation is no longer used put it back to the
free list, this doesn't heap allocate because std::vector doesn't change
the capacity. If the free list is empty, allocate a new chunk.
Most overlaps in the buffer cache only contain one mapped address.
We can avoid close to all heap allocations once the buffer cache is
warmed up by using a small_vector with a stack size of one.
Instead of using boost::icl::interval_map for caching, use
boost::intrusive::set. interval_map is intended as a container where the
keys can overlap with one another; we don't need this for caching
buffers and a std::set-like data structure that allows us to search with
lower_bound is enough.
Add code required to use OpenGL assembly programs based on
NV_gpu_program5. Decompilation for ARB programs is intended to be added
in a follow up commit. This does **not** include ARB decompilation and
it's not in an usable state.
The intention behind assembly programs is to reduce shader stutter
significantly on drivers supporting NV_gpu_program5 (and other required
extensions). Currently only Nvidia's proprietary driver supports these
extensions.
Add a UI option hidden for now to avoid people enabling this option
accidentally.
This code path has some limitations that OpenGL compatibility doesn't
have:
- NV_shader_storage_buffer_object is limited to 16 entries for a single
OpenGL context state (I don't know if this is an intended limitation, an
specification issue or I am missing something). Currently causes issues
on The Legend of Zelda: Link's Awakening.
- NV_parameter_buffer_object can't bind buffers using an offset
different to zero. The used workaround is to copy to a temporary buffer
(this doesn't happen often so it's not an issue).
On the other hand, it has the following advantages:
- Shaders build a lot faster.
- We have control over how floating point rounding is done over
individual instructions (SPIR-V on Vulkan can't do this).
- Operations on shared memory can be unsigned and signed.
- Transform feedbacks are dynamic state (not yet implemented).
- Parameter buffers (uniform buffers) are per stage, matching NVN and
hardware's behavior.
- The API to bind and create assembly programs makes sense, unlike
ARB_separate_shader_objects.
Constant attributes (in OpenGL known disabled attributes) are not
supported on Vulkan, even with extensions. To emulate this behavior we
return zero on reads from disabled vertex attributes in shader code.
This has no caching cost because attribute formats are not dynamic state
on Vulkan and we have to store it in the pipeline cache anyway.
- Fixes Animal Crossing: New Horizons terrain borders
This was a left over from OpenGL when disabled buffers where not properly
emulated. We no longer have to assert this as it is checked in vertex
buffer initialization.
"Not equal" operators on GLSL seem to behave as unordered when we expect
an ordered comparison.
Manually emulate this checking for LGE values (numbers, not-NaNs).
This should fix grass interactions on Breath of the Wild on Vulkan.
It is currently untested against validation layers.
Nvidia's Windows 443.09 beta driver or Linux 440.66.12 is required for
now.
In file included from src/video_core/renderer_opengl/renderer_opengl.cpp:25:
In file included from src/./video_core/renderer_opengl/gl_rasterizer.h:26:
In file included from src/./video_core/renderer_opengl/gl_fence_manager.h:11:
src/./video_core/fence_manager.h:91:32: error: use 'template' keyword
to treat 'Write' as a dependent template name
memory_manager.Write<u32>(current_fence->GetAddress(), current_fence->GetPayload());
^
template
src/./video_core/fence_manager.h:137:32: error: use 'template'
keyword to treat 'Write' as a dependent template name
memory_manager.Write<u32>(current_fence->GetAddress(), current_fence->GetPayload());
^
template
Reduces some header churn and reduces rebuilds when some header
internals change.
While we're at it we can also resolve a missing include in buffer_cache.
Xenoblade 2 invokes a draw call with zero vertices.
This is likely due to indirect drawing (glDrawArraysIndirect).
This causes a crash in the staging buffer pool when trying to create a
buffer with a size of zero. To workaround this, skip index buffer setup
entirely when the number of indices is zero.
Drop MemoryBarrier from the buffer cache and use Maxwell3D's register
WaitForIdle.
To implement this on OpenGL we just call glMemoryBarrier with the
necessary bits.
Vulkan lacks this synchronization primitive, so we set an event and
immediately wait for it. This is not a pretty solution, but it's what
Vulkan can do without submitting the current command buffer to the queue
(which ends up being more expensive on the CPU).
Using deko3d as reference:
4e47ba0013/source/maxwell/gpu_3d_state.cpp (L42)
We were using bits 3 and 4 to determine depth clamping, but these are
the same both enabled and disabled:
state->depthClampEnable ? 0x101A : 0x181D
The same happens on Nvidia's OpenGL driver, where they do something like
this (default capabilities, GL 4.5 compatibility):
(state & DEPTH_CLAMP) != 0 ? 0x201a : 0x281c
There's always a difference between the first bits in this register, but
bit 11 is consistently disabled on both deko3d/NVN and OpenGL. This
commit changes yuzu's behaviour to use bit 11 to determine depth
clamping.
- Fixes depth issues on Super Mario Odyssey's intro.
This reverts commit 94b0e2e5da.
preserve_contents proved to be a meaningful optimization. This commit
reintroduces it but properly implemented on OpenGL.
We have to make sure the clear removes all the previous contents of the
image.
It's not currently implemented on Vulkan because we can do smart things
there that's preferred to be introduced in a separate commit.
Deduplicate code shared between vk_pipeline_cache and gl_shader_cache as
well as shader decoder code.
While we are at it, fix a bug in gl_shader_cache where compute shaders
had an start offset of a stage shader.
Signed integer addition overflow might be undefined behavior. It's free
to change operations to UAdd and use unsigned integers to avoid
potential bugs.
P2R CC takes the state of condition codes and puts them into a register.
We already have this implemented for PR (predicates). This commit
implements CC over that.
Sometimes for unknown reasons NVN games can bind a render target format
of 0. This may be a yuzu bug.
With the commits before this the formats were specified without being
"packed", assuming all formats and texceptions will be written like in
the color_attachments vector.
To address this issue, iterate all render targets and pack them as they
are valid. This way they will match color_attachments.
- Fixes validation errors and graphical issues on Breath of the Wild.
The encoding for negation and absolute value was wrong.
Extracting is now done manually. Similar instructions having different
encodings is the rule, not the exception. To keep sanity and readability
I preferred to extract the desired bit manually.
This is implemented against nxas:
8dbc389957/table.h (L68)
That is itself tested against nvdisasm (Nvidia's official disassembler).
This allows deducing some properties from the texture instruction before
asking the runtime. By doing this we can handle type mismatches in some
instructions from the renderer instead of the shader decoder.
Fixes texelFetch issues with games using 2D texture instructions on a 1D
sampler.
The intention behind this was to assign a float to from an uint32_t, but
it was unintentionally being copied directly into the std::optional.
Copy to a temporary and assign that temporary to std::optional. This can
be replaced with std::bit_cast<float> once we are in C++20.
All drivers (even Intel) seem to have a device local memory type that is
not host visible. Remove this flag so all devices follow the same path.
This fixes a crash when trying to map to host device local memory on
integrated devices.
Introduce a default buffer getter that lazily constructs an empty
buffer. This is intended to match OpenGL's buffer 0.
Use this for disabled vertex and uniform buffers.
While we are at it, include vertex buffer usages for staging buffers to
silence validation errors.
On NVN buffers can be enabled but have no size. According to deko3d and
the behavior we see in Animal Crossing: New Horizons these buffers get
the special address of 0x1000 and limit themselves to 0xfff.
Implement buffers without a size by binding a null buffer to OpenGL
without a side.
1d1930beea/source/maxwell/gpu_3d_vbo.cpp (L62-L63)
Render.Vulkan <Error> video_core/renderer_vulkan/renderer_vulkan.cpp:CreateInstance:131: Presentation not supported on this platform
Render.Vulkan <Error> video_core/renderer_vulkan/renderer_vulkan.cpp:CreateSurface:378: Presentation not supported on this platform
Core <Critical> core/core.cpp:Load:199: Failed to initialize system (Error 5)!
Sort discrete GPUs over the rest, Nvidia over AMD, AMD over Intel, Intel
over the rest. This gives us a somewhat consistent order when Optimus
is removed (renderdoc does this when it's attached).
This can break the configuration of users with an Intel GPU that
manually remove Optimus on yuzu. That said, it's a very unlikely to
happen.
Pad FixedPipelineState's size to 384 bytes to be a multiple of 16.
Compare the whole struct with std::memcmp and hash with CityHash. Using
CityHash instead of a naive hash should reduce the number of collisions.
Improve used type traits to ensure this operation is safe.
With these changes the improvements to the hashable pipeline state are:
Optimized structure
Hash: 89 ns
Comparison: 103 ns
Construction*: 164 ns
Struct size: 384 bytes
Original structure
Hash: 148 ns
Equal: 174 ns
Construction*: 281 ns
Size: 1384 bytes
* Attribute state initialization is not measured
These measures are averages taken with std::chrono::high_accuracy_clock
on MSVC shipped on Visual Studio 16.6.0 Preview 2.1.
Avoids unnecessary reference count increments where applicable and also
avoids reallocating a vector.
Unlikely to make a huge difference, but given how trivial of an
amendment it is, why not?
Nvidia recently introduced a new memory type for data streaming
(awesome!), but yuzu was assuming that all heaps had enough memory
for the assumed stream buffer size (256 MiB).
This worked fine on AMD but Nvidia's new memory heap was smaller than
256 MiB. This commit changes this assumption and allocates a bit less
than the size of the preferred heap, with a maximum of 256 MiB (to avoid
allocating all system memory on integrated devices).
- Fixes a crash on NVIDIA 450.82.0.0
Some variables aren't used, so we can remove these.
Unfortunately, diagnostics are still reported on structured bindings
even when annotated with [[maybe_unused]], so we need to unpack the
elements that we want to use manually.
Implement indexed quads (GL_QUADS used with glDrawElements*) with a
compute pass conversion.
The compute shader converts from uint8/uint16/uint32 indices to uint32.
The format is passed through push constants to avoid having different
variants of the same shader.
- Used by Fast RMX
- Used by Xenoblade Chronicles 2 (it still has graphical due to
synchronization issues on Vulkan)
The original idea of returning pointers is that handles can be moved.
The problem is that the implementation didn't take that in mind and made
everything harder to work with. This commit drops pointer to handles and
returns the handles themselves. While it is still true that handles can
be invalidated, this way we get an old handle instead of a dangling
pointer.
This problem can be solved in the future with sparse buffers.
Allows reporting more cases where logic errors may exist, such as
implicit fallthrough cases, etc.
We currently ignore unused parameters, since we currently have many
cases where this is intentional (virtual interfaces).
While we're at it, we can also tidy up any existing code that causes
warnings. This also uncovered a few bugs as well.
This can result in silent logic bugs within code, and given the amount
of times these kind of warnings are caused, they should be flagged at
compile-time so no new code is submitted with them.
When the dynamic state is specified, pViewports and pScissors are
ignored, quoting the specification:
pViewports is a pointer to an array of VkViewport structures, defining
the viewport transforms. If the viewport state is dynamic, this member
is ignored.
That said, AMD's proprietary driver itself seem to read it regardless of
what the specification says.
This is a simple optimization as Buffer Copies are mostly used for texture recycling. They are, however, useful when games abuse undefined behavior but most 3D APIs forbid it.
This reverts commit 05cf270836.
Apparently the first approach using floats instead of bitfieldInert
worked better for Fire Emblem: Three Houses. Reverting to get that
behavior back.
From my testing on a Splatoon 2 shader that takes 3800ms on average to
compile changing to FullDecompile reduces it to 900ms on average.
The shader decoder will automatically fallback to a more naive method if
it can't use full decompile.
Adds optional support for Nsight Aftermath. It is enabled through
ENABLE_NSIGHT_AFTERMATH in cmake. A path to the SDK has to be provided
by the environment variable NSIGHT_AFTERMATH_SDK.
Nsight Aftermath allows an application to generate "minidumps" of the
GPU state when a device loss happens. By analysing these on Nsight we
can know what a game was doing and why it triggered a device loss.
The dump is generated inside %APPDATA%\yuzu\log\gpucrash and this
directory is deleted every time a new instance is initialized with
Nsight enabled.
To enable it on yuzu there has a to be a driver and device capable of
running Nsight Aftermath on Vulkan. That means only Turing based GPUs
on the latest stable driver, beta drivers won't work for now.
It is manually enabled in Configuration>Debug>Enable Graphics Debugging
because when using all debugging capabilities there is a runtime cost.
The base level is already included in the texture view. If we specify
the base level in the texture again, this will end up in the incorrect
level and potentially out of bounds.
This also fixes Turing issues but it avoids doing more bitcasts. This
should improve the generated code while also avoiding more points where
compilers can flush floats.
Implements the common usages for VMNMX. Inputs with a different size
than 32 bits are not supported and sign mismatches aren't supported
either.
VMNMX works as follows:
It grabs Ra and Rb and applies a maximum/minimum on them (this is
defined by .MX), having in mind the input sign. This result can then be
saturated. After the intermediate result is calculated, it applies
another operation on it using Rc. These operations are merges,
accumulations or another min/max pass.
This instruction allows to implement with a more flexible approach GCN's
min3 and max3 instructions (for instance).
preserve_contents was always true. We can't assume we don't have to
preserve clears because scissored and color masked clears exist.
This removes preserve_contents and assumes it as true at all times.
Since commit e22816a5bb we handle type mismatches from the CPU.
We don't need to hack our shader decoder due to game bugs anymore.
Removed in this commit.
This is a reversed look up table extracted from
https://gist.github.com/rygorous/2203834#file-gistfile1-cpp-L41-L62
that is used in
04d4e9e587/source/maxwell/tsc_generate.cpp (L38)
Games usually bind 0xFD expecting a float texture border of 1.0f.
The conversion previous to this commit was multiplying the uint8 sRGB
texture border color by 255. This is close to 1.0f but when that
difference matters, some graphical glitches appear.
This look up table is manually changed in the edges, clamping towards
0.0f and 1.0f.
While we are at it, move this logic to its own translation unit.