* Flush in the middle of long command buffers.
* Vulkan: add situational "Fast Flush" mode
The AutoFlushCounter class was added to periodically flush Vulkan command buffers throughout a frame, which reduces latency to the GPU as commands are submitted and processed much sooner. This was done by allowing command buffers to flush when framebuffer attachments changed.
However, some games have incredibly long render passes with a large number of draws, and really aggressive data access that forces GPU sync.
The Vulkan backend could potentially end up building a single command buffer for 4-5ms if a pass has enough draws, such as in BOTW. In the scenario where sync is waited on immediately after submission, this would have to wait for the completion of a much longer command buffer than usual.
The solution is to force command buffer submission periodically in a "fast flush" mode. This will end up splitting render passes, but it will only enable if sync is aggressive enough.
This should improve performance in GPU limited scenarios, or in games that aggressively wait on synchronization. In some games, it may only kick in when res scaling. It won't trigger in games like SMO where sync is not an issue.
Improves performance in Pokemon Scarlet/Violet (res scaled) and BOTW (in general).
* Add conversions in milliseconds next to flush timers.
* ARMeilleure: Move TPIDR_EL0 and TPIDRRO_EL0 to NativeContext
Some games access these system registers several tens of thousands of times in a second from many different threads. While this isn't really crippling, it is a lot of wasted time spent in a reverse pinvoke transition.
Example games are Pokemon Scarlet/Violet and BOTW. These games have a lot of different potential bottlenecks so it's unlikely you will see a consistent improvement, but it definitely disappears from the cpu profile.
* Remove unreachable code.
* Add ulong conversion for offsets
* Nit
This seems to have been removed by the Post-Processing PR, but it is required for the display in OBS to be the right way up and properly scaled.
I've tested this with AA and FSR on MK8D and it seems to behave properly. Testing is welcome.
* ARMeilleure: Respect Fz flag for all floating point operations.
This is a change in strategy for emulating the Fz FPCR flag. Before, it was set before instructions that "needed it" and reset after. However, this missed a few hot instructions like the multiplication instruction, and the entirety of A32.
The new strategy is to set the Fz flag only in the following circumstances:
- Set to match FPCR before translated functions/loop are executed.
- Reset when calling SoftFloat methods, set when returning.
- Reset when exiting execution.
This allows us to remove the code around the existing Fz aware instructions, and get the accuracy benefits on all floating point instructions executed while in translated code.
Single step executions now need to be called with a context wrapper - right now it just contains the Fz flag initialization, and won't actually do anything on ARM.
This fixes a bug in Breath of the Wild where some physics interactions could randomly crash the game due to subnormal values not flushing to zero.
This is draft right now because I need to answer the questions:
- Does dotnet avoid changing the value of Mxcsr?
- Is it a good idea to assume that? Or should the flag set/restore be done on every managed method call, not just softfloat?
- If we assume that, do we want a unit test to verify the behaviour?
I recommend testing a bunch of games, especially games affected when this was originally added, such as #1611.
* Remove unused method
* Use FMA for Fmadd, Fmsub, Fnmadd, Fnmsub, Fmla, Fmls
...when available.
Similar implementation to A32
* Use FMA for Frecps, Frsqrts
* Don't set DAZ.
* Add round mode to ARM FP mode
* Fix mistakes
* Add test for FP state when calling managed methods
* Add explanatory comment to test.
* Cleanup
* Add A64 FPCR flags
* Vrintx_S A32 fast path on A64 backend
* Address feedback 1, re-enable DAZ
* Fix FMA instructions By Elem
* Address feedback
* Redesign use of ISampledData for accessing the SamplingNumber value on input data structs.
* Always read SamplingNumber as little-endian
* Restored field order for SixAxisSensorState. Rework to allow possibility of non-zero offsets for the SamplingNumber field. Set StructLayout Pack=8 - the KeyboardState struct is 4 bytes shorter with any other value.
* fix spelling
Co-authored-by: riperiperi <rhy3756547@hotmail.com>
* set Pack = 1 for ISampledDataStruct types, added Unknown field to KeyboardState
* extend size of KeyboardModifier
---------
Co-authored-by: riperiperi <rhy3756547@hotmail.com>
* vulkan: Move most of the properties enumeration to VulkanPhysicalDevice
That clean up a bit of duplicate logic.
Also move to use an hashset for device extensions.
* vulkan: Move instance querying to VulkanInstance
Also cleanup code to use span when possible instead of unsafe pointers.
* Address gdkchan's comments
* Use index fragment shader output when dual source blend is enabled
* Shader cache version bump
* Actually set DualSourceBlendEnabled to true
* Fix XML doc
---------
Co-authored-by: Ac_K <Acoustik666@gmail.com>
* Fix missing string enum converters for the config
* Revert changing KeyboardHotkeys to struct
This needs to be done because
Avalonia's TwoWay Binding breaks otherwise.
* Use source generated json serializers in order to improve code trimming
* Use strongly typed github releases model to fetch updates instead of raw Newtonsoft.Json parsing
* Use separate model for LogEventArgs serialization
* Make dynamic object formatter static. Fix string builder pooling.
* Do not inherit json version of LogEventArgs from EventArgs
* Fix extra space in object formatting
* Write log json directly to stream instead of using buffer writer
* Rebase fixes
* Rebase fixes
* Rebase fixes
* Enforce block-scoped namespaces in the solution. Convert style for existing code
* Apply suggestions from code review
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* Rebase indent fix
* Fix indent
* Delete unnecessary json properties
* Rebase fix
* Remove overridden json property names as they are handled in the options
* Apply suggestions from code review
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* Use default json options in github api calls
* Indentation and spacing fixes
* Fix json serialization
* Fix missing JsonConverter for config enums
* Add double \n\n after the whole string, not inside join
---------
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* vulkan: Separate debug utils logic from VulkanInitialization
Also checks for VK_EXT_debug_utils existence instead of force enabling it and allow possible error during messenger init
* Address gdkchan's comment
* Use CreateDebugUtilsMessenger Span variant
* HLE: Refactoring of ApplicationLoader
* Fix SDL2 Headless
* Addresses gdkchan feedback
* Fixes LoadUnpackedNca RomFS loading
* remove useless casting
* Cleanup and fixe empty application name
* Remove ProcessInfo
* Fixes typo
* ActiveProcess to ActiveApplication
* Update check
* Clean using.
* Use the correct filepath when loading Homebrew.npdm
* Fix NRE in ProcessResult if MetaLoader is null
* Add more checks for valid processId & return success
* Add missing logging statement for npdm error
* Return result for LoadKip()
* Move error logging out of PFS load extension method
This avoids logging "Could not find Main NCA"
followed by "Loading main..." when trying to start hbl.
* Fix GUIs not checking load results
* Fix style and formatting issues
* Fix formatting and wording
* gtk: Refactor LoadApplication()
---------
Co-authored-by: TSR Berry <20988865+TSRBerry@users.noreply.github.com>
* Rework StdErr-to-log redirection to use built-in FileStream, and do reads asynchronously to avoid hanging the process shutdown.
* set _disposable to false ASAP
* Simplify return statements by using ternary expressions
* Remove a redundant type conversion
* Reduce nesting by inverting "if" statements
* Try to improve code readability by using LINQ and inverting "if" statements
* Try to improve code readability by using LINQ, using ternary expressions, and inverting "if" statements
* Add line breaks to long LINQ
* Add line breaks to long LINQ
* Vulkan: Insert barriers before clears
Newer NVIDIA GPUs seem to be able to start clearing render targets before the last rasterization task is completed, which can cause it to clear a texture while it is being sampled.
This change adds a barrier from read to write when doing a clear, assuming it has been sampled in the past. It could be possible for this to be needed for sample into draw by some GPU, but it's not right now afaik.
This should fix visual artifacts on newer NVIDIA GPUs and driver combos. Contrary to popular belief, Tetris® Effect: Connected is not affected. Testing welcome, hopefully should fix most cases of this and not cost too much performance.
* Visual Studio Moment
* Address feedback
* Address Feedback 2
Protection for the `xgetbv` instruction for systems that do not support
`xcr0` such as nehalem processors.
The `XSAVE` cpuid indicates support for `XSAVE`, `XRESTOR`, `XSETBV`,
`XGETBV` while `OSXSAVE` indicates if the operating system itself has
`XSAVE` turned on. Both must be checked at the same time.
* Use source generated json serializers in order to improve code trimming
* Use strongly typed github releases model to fetch updates instead of raw Newtonsoft.Json parsing
* Use separate model for LogEventArgs serialization
* Make dynamic object formatter static. Fix string builder pooling.
* Do not inherit json version of LogEventArgs from EventArgs
* Fix extra space in object formatting
* Write log json directly to stream instead of using buffer writer
* Rebase fixes
* Rebase fixes
* Rebase fixes
* Enforce block-scoped namespaces in the solution. Convert style for existing code
* Apply suggestions from code review
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* Rebase indent fix
* Fix indent
* Delete unnecessary json properties
* Rebase fix
* Remove overridden json property names as they are handled in the options
* Apply suggestions from code review
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* Use default json options in github api calls
* Indentation and spacing fixes
---------
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* ARMeilleure: Add AVX512{F,VL,DQ,BW} detection
Add `UseAvx512Ortho` and `UseAvx512OrthoFloat` optimization flags as
short-hands for `F+VL` and `F+VL+DQ`.
* ARMeilleure: Add initial support for EVEX instruction encoding
Does not implement rounding, or exception controls.
* ARMeilleure: Add `X86Vpternlogd`
Accelerates the vector-`Not` instruction.
* ARMeilleure: Add check for `OSXSAVE` for AVX{2,512}
* ARMeilleure: Add check for `XCR0` flags
Add XCR0 register checks for AVX and AVX512F, following the guidelines
from section 14.3 and 15.2 from the Intel Architecture Software
Developer's Manual.
* ARMeilleure: Remove redundant `ReProtect` and `Dispose`, formatting
* ARMeilleure: Move XCR0 procedure to GetXcr0Eax
* ARMeilleure: Add `XCR0` to `FeatureInfo` structure
* ARMeilleure: Utilize `ReadOnlySpan` for Xcr0 assembly
Avoids an additional allocation
* ARMeilleure: Formatting fixes
* ARMeilleure: Fix EVEX encoding src2 register index
> Just like in VEX prefix, vvvv is provided in inverted form.
* ARMeilleure: Add `X86Vpternlogd` acceleration to `Vmvn_I`
Passes unit tests, verified instruction utilization
* ARMeilleure: Fix EVEX register operand designations
Operand 2 was being sourced improperly.
EVEX encoded instructions source their operands like so:
Operand 1: ModRM:reg
Operand 2: EVEX.vvvvv
Operand 3: ModRM:r/m
Operand 4: Imm
This fixes the improper register designations when emitting vpternlog.
Now "dest", "src1", "src2" arguments emit in the proper order in EVEX instructions.
* ARMeilleure: Add `X86Vpternlogd` acceleration to `Orn_V`
* ARMeilleure: PTC version bump
* ARMeilleure: Update EVEX encoding Debug.Assert to Debug.Fail
* ARMeilleure: Update EVEX encoding comment capitalization
* Initial implementation of migration between memory heaps
- Missing OOM handling
- Missing `_map` data safety when remapping
- Copy may not have completed yet (needs some kind of fence)
- Map may be unmapped before it is done being used. (needs scoped access)
- SSBO accesses are all "writes" - maybe pass info in another way.
- Missing keeping map type when resizing buffers (should this be done?)
* Ensure migrated data is in place before flushing.
* Fix issue where old waitable would be signalled.
- There is a real issue where existing Auto<> references need to be replaced.
* Swap bound Auto<> instances when swapping buffer backing
* Fix conversion buffers
* Don't try move buffers if the host has shared memory.
* Make GPU methods return PinnedSpan with scope
* Storage Hint
* Fix stupidity
* Fix rebase
* Tweak rules
Attempt to sidestep BOTW slowdown
* Remove line
* Migrate only when command buffers flush
* Change backing swap log to debug
* Address some feedback
* Disallow backing swap when the flush lock is held by the current thread
* Make PinnedSpan from ReadOnlySpan explicitly unsafe
* Fix some small issues
- Index buffer swap fixed
- Allocate DeviceLocal buffers using a separate block list to images.
* Remove alternative flags
* Address feedback
* Avoid copying more handles than we have space for
* Use locks instead
* Reduce nesting by combining the lock statements
* Add locks for other uses of _sessionHandles and _portHandles
* Use one object to lock instead of locking twice
* Release the lock as soon as possible
* add RecyclableMemoryStream dependency and MemoryStreamManager
* organize BinaryReader/BinaryWriter extensions
* add StreamExtensions to reduce need for BinaryWriter
* simple replacments of MemoryStream with RecyclableMemoryStream
* add write ReadOnlySequence<byte> support to IVirtualMemoryManager
* avoid 0-length array creation
* rework IpcMessage and related types to greatly reduce memory allocation by using RecylableMemoryStream, keeping streams around longer, avoiding their creation when possible, and avoiding creation of BinaryReader and BinaryWriter when possible
* reduce LINQ-induced memory allocations with custom methods to query KPriorityQueue
* use RecyclableMemoryStream in StreamUtils, and use StreamUtils in EmbeddedResources
* add constants for nanosecond/millisecond conversions
* code formatting
* XML doc adjustments
* fix: StreamExtension.WriteByte not writing non-zero values for lengths <= 16
* XML Doc improvements. Implement StreamExtensions.WriteByte() block writes for large-enough count values.
* add copyless path for StreamExtension.Write(ReadOnlySpan<int>)
* add default implementation of IVirtualMemoryManager.Write(ulong, ReadOnlySequence<byte>); remove previous explicit implementations
* code style fixes
* remove LINQ completely from KScheduler/KPriorityQueue by implementing a custom struct-based enumerator
* GPU: Fast path for adding one texture view to a group
Texture group handles must store a list of their overlapping views, so they can be properly notified when a write is detected, and a few other things relating to texture readback. This is generally created when the group is established, with each handle looping over all views to find its overlaps. This whole process was also done when only a single view was added (and no handles were changed), however...
Sonic Frontiers had a huge cubemap array with 7350 faces (175 cubemaps * 6 faces * 7 levels), so iterating over both handles and existing views added up very fast. Since we are only adding a single view, we only need to _add_ that view to the existing overlaps, rather than recalculate them all.
This greatly improves performance during loading screens and a few seconds into gameplay on the "open zone" sections of Sonic Frontiers. May improve loading times or stutters on some other games.
Note that the current texture cache rules will cause these views to fall out of the cache, as there are more than the hard cap, so the cost will be repaid when reloading the open zone.
I also added some code to properly remove overlaps when texture views are removed, since it seems that was missing.
This can be improved further by only iterating handles that overlap the view (filter by range), but so can a few places in TextureGroup, so better to do all at once. The full generation of overlaps could probably be improved in a similar way.
I recommend testing a few games to make sure nothing breaks.
* Address feedback
* Update sparsely mapped texture ranges without recreating
Important TODO in TexturePool. Smaller TODO: should I look into making textures with views also do this? It needs to be able to detect if the views can be instantly deleted without issue if they're now remapped.
* Actually do partial updates
* Signal group dirty after mappings changed
* Fix various issues (should work now)
* Further optimisation
Should load a lot less data (16x) when partial updating 3d textures.
* Improve stability
* Allow granular uploads on large textures, improve rules
* Actually avoid updating slices that aren't modified.
* Address some feedback, minor optimisation
* Small tweak
* Refactor DereferenceRequest
More specific initialization methods.
* Improve code for resetting handles
* Explain data loading a bit more
* Add some safety for setting null from different threads.
All texture sets come from the one thread, but null sets can come from multiple. Only decrement ref count if we succeeded the null set first.
* Address feedback 1
* Make a bit safer
* GPU: Scale counter results before addition
Counter results were being scaled on ReportCounter, which meant that the _total_ value of the counter was being scaled. Not only could this result in very large numbers and weird overflows if the game doesn't clear the counter, but it also caused the result to change drastically.
This PR changes scaling to be done when the value is added to the counter on the backend. This should evaluate the scale at the same time as before, on report counter, but avoiding the issue with scaling the total.
Fixes scaling in Warioware, at least in the demo, where it seems to compare old/new counters and broke down when scaling was enabled.
* Fix issues when result is partially uploaded.
Drivers tend to write the low half first, then the high half. Retry if the high half is FFFFFFFF.
* .
* Apply suggestions from code review
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* wildcard(*) needs to be outside of quotes(") for cp to work
---------
Co-authored-by: TSRBerry <20988865+TSRBerry@users.noreply.github.com>
* use Array.Empty() where instead of allocating new zero-length arrays
* structure for loops in a way that the JIT will elide array/Span bounds checking
* avoiding function calls in for loop condition tests
* avoid LINQ in a hot path
* conform with code style
* fix mistake in GetNextWaitingObject()
* fix GetNextWaitingObject() possibility of returning null if all list items have TimePoint == long.MaxValue
* make GetNextWaitingObject() behave FIFO behavior for multiple items with the same TimePoint