* Add AddressTable<T>
* Use AddressTable<T> for dispatch
* Remove JumpTable & co.
* Add fallback for out of range addresses
* Add PPTC support
* Add documentation to `AddressTable<T>`
* Make AddressTable<T> configurable
* Fix table walk
* Fix IsMapped check
* Remove CountTableCapacity
* Add PPTC support for fast path
* Rename IsMapped to IsValid
* Remove stale comment
* Change format of address in exception message
* Add TranslatorStubs
* Split DispatchStub
Avoids recompilation of stubs during tests.
* Add hint for 64bit or 32bit
* Add documentation to `Symbol`
* Add documentation to `TranslatorStubs`
Make `TranslatorStubs` disposable as well.
* Add documentation to `SymbolType`
* Add `AddressTableEventSource` to monitor function table size
Add an EventSource which measures the amount of unmanaged bytes
allocated by AddressTable<T> instances.
dotnet-counters monitor -n Ryujinx --counters ARMeilleure
* Add `AllowLcqInFunctionTable` optimization toggle
This is to reduce the impact this change has on the test duration.
Before everytime a test was ran, the FunctionTable would be initialized
and populated so that the newly compiled test would get registered to
it.
* Implement unmanaged dispatcher
Uses the DispatchStub to dispatch into the next translation, which
allows execution to stay in unmanaged for longer and skips a
ConcurrentDictionary look up when the target translation has been
registered to the FunctionTable.
* Remove redundant null check
* Tune levels of FunctionTable
Uses 5 levels instead of 4 and change unit of AddressTableEventSource
from KB to MB.
* Use 64-bit function table
Improves codegen for direct branches:
mov qword [rax+0x408],0x10603560
- mov rcx,sub_10603560_OFFSET
- mov ecx,[rcx]
- mov ecx,ecx
- mov rdx,JIT_CACHE_BASE
- add rdx,rcx
+ mov rcx,sub_10603560
+ mov rdx,[rcx]
mov rcx,rax
Improves codegen for dispatch stub:
and rax,byte +0x1f
- mov eax,[rcx+rax*4]
- mov eax,eax
- mov rcx,JIT_CACHE_BASE
- lea rax,[rcx+rax]
+ mov rax,[rcx+rax*8]
mov rcx,rbx
* Remove `JitCacheSymbol` & `JitCache.Offset`
* Turn `Translator.Translate` into an instance method
We do not have to add more parameter to this method and related ones as
new structures are added & needed for translation.
* Add symbol only when PTC is enabled
Address LDj3SNuD's feedback
* Change `NativeContext.Running` to a 32-bit integer
* Fix PageTable symbol for host mapped
* Refactoring of KMemoryManager class
* Replace some trivial uses of DRAM address with VA
* Get rid of GetDramAddressFromVa
* Abstracting more operations on derived page table class
* Run auto-format on KPageTableBase
* Managed to make TryConvertVaToPa private, few uses remains now
* Implement guest physical pages ref counting, remove manual freeing
* Make DoMmuOperation private and call new abstract methods only from the base class
* Pass pages count rather than size on Map/UnmapMemory
* Change memory managers to take host pointers
* Fix a guest memory leak and simplify KPageTable
* Expose new methods for host range query and mapping
* Some refactoring of MapPagesFromClientProcess to allow proper page ref counting and mapping without KPageLists
* Remove more uses of AddVaRangeToPageList, now only one remains (shared memory page checking)
* Add a SharedMemoryStorage class, will be useful for host mapping
* Sayonara AddVaRangeToPageList, you served us well
* Start to implement host memory mapping (WIP)
* Support memory tracking through host exception handling
* Fix some access violations from HLE service guest memory access and CPU
* Fix memory tracking
* Fix mapping list bugs, including a race and a error adding mapping ranges
* Simple page table for memory tracking
* Simple "volatile" region handle mode
* Update UBOs directly (experimental, rough)
* Fix the overlap check
* Only set non-modified buffers as volatile
* Fix some memory tracking issues
* Fix possible race in MapBufferFromClientProcess (block list updates were not locked)
* Write uniform update to memory immediately, only defer the buffer set.
* Fix some memory tracking issues
* Pass correct pages count on shared memory unmap
* Armeilleure Signal Handler v1 + Unix changes
Unix currently behaves like windows, rather than remapping physical
* Actually check if the host platform is unix
* Fix decommit on linux.
* Implement windows 10 placeholder shared memory, fix a buffer issue.
* Make PTC version something that will never match with master
* Remove testing variable for block count
* Add reference count for memory manager, fix dispose
Can still deadlock with OpenAL
* Add address validation, use page table for mapped check, add docs
Might clean up the page table traversing routines.
* Implement batched mapping/tracking.
* Move documentation, fix tests.
* Cleanup uniform buffer update stuff.
* Remove unnecessary assignment.
* Add unsafe host mapped memory switch
On by default. Would be good to turn this off for untrusted code (homebrew, exefs mods) and give the user the option to turn it on manually, though that requires some UI work.
* Remove C# exception handlers
They have issues due to current .NET limitations, so the meilleure one fully replaces them for now.
* Fix MapPhysicalMemory on the software MemoryManager.
* Null check for GetHostAddress, docs
* Add configuration for setting memory manager mode (not in UI yet)
* Add config to UI
* Fix type mismatch on Unix signal handler code emit
* Fix 6GB DRAM mode.
The size can be greater than `uint.MaxValue` when the DRAM is >4GB.
* Address some feedback.
* More detailed error if backing memory cannot be mapped.
* SetLastError on all OS functions for consistency
* Force pages dirty with UBO update instead of setting them directly.
Seems to be much faster across a few games. Need retesting.
* Rebase, configuration rework, fix mem tracking regression
* Fix race in FreePages
* Set memory managers null after decrementing ref count
* Remove readonly keyword, as this is now modified.
* Use a local variable for the signal handler rather than a register.
* Fix bug with buffer resize, and index/uniform buffer binding.
Should fix flickering in games.
* Add InvalidAccessHandler to MemoryTracking
Doesn't do anything yet
* Call invalid access handler on unmapped read/write.
Same rules as the regular memory manager.
* Make unsafe mapped memory its own MemoryManagerType
* Move FlushUboDirty into UpdateState.
* Buffer dirty cache, rather than ubo cache
Much cleaner, may be reusable for Inline2Memory updates.
* This doesn't return anything anymore.
* Add sigaction remove methods, correct a few function signatures.
* Return empty list of physical regions for size 0.
* Also on AddressSpaceManager
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Use branch instead of tailcall for recursive calls
Use a branch instead of doing a tailcall for recursive calls. This
avoids having to store the dispatch address, setting up the epilogue and
keeps guest registers in host registers for longer.
The rejit check is moved down into the entry block so that the rejit
behaviour remains the same as before.
* Set PTC version
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Add EntryTable<TEntry>
* Add on translation call counting
* Add Counter
* Add PPTC support
* Make Counter a generic & use a 32-bit counter instead
* Return false on overflow
* Set PPTC version
* Print more information about the rejit queue
* Make Counter<T> disposable
* Remove Block.TailCall since it is not used anymore
* Apply suggestions from code review
Address gdkchan's feedback
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Fix more stale docs
* Remove rejit requests queue logging
* Make Counter<T> finalizable
Most certainly quite an odd use case.
* Make EntryTable<T>.TryAllocate set entry to default
* Re-trigger CI
* Dispose Counters before they hit the finalizer queue
* Re-trigger CI
Just for good measure...
* Make EntryTable<T> expandable
* EntryTable is now expandable instead of being a fixed slab.
* Remove EntryTable<T>.TryAllocate
* Remove Counter<T>.TryCreate
Address LDj3SNuD's feedback
* Apply suggestions from code review
Address LDj3SNuD's feedback
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Remove useless return
* POH approach, but the sequel
* Revert "POH approach, but the sequel"
This reverts commit 5f5abaa247.
The sequel got shelved
* Add extra documentation
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Improve StoreToContext emission
Hoist StoreToContext in dynamic branch fast & slow paths out into
their predecessor.
Reduces register pressure, code size and compile time because we're
throwing less stuff down the pipeline.
* Set PTC internal version
* Turn EmitDynamicTableCall private
* Re-trigger CI
* Implement VCNT based on AArch64 CNT
Add tests
* Update PTC version
* Address LDj's comments
* Explicit size in encoding
* Tighter tests
* Replace SoftFallback with IR helper
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Reduce one BitwiseAnd from IR fallback
Based on popcount64b from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
* Rename parameter and add assert
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Enable PTE null checks again
* Do address validation on EmitPtPointerLoad, and make it branchless
* PTC version increment
* Mask of pointer tag for exclusive access
* Move mask to the correct place
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Add VCLZ fast path
* Add VCLZ.8B/16B SSSE3 fast path
* Add VCLZ.4H/8H SSSE3 fast path
* Add VCLZ.2S/4S SSE2 fast path
* Improve CLZ.4H/8H fast path
* Improve CLZ.2S/4S fast path
* Set PPTC version
* Add Pmull_V Sse fast path only, both "8/16B -> 8H" and "1/2D -> 1Q" variants; with Test.
* Add Clmul fast path for the 128 bits variant.
* Small optimisation (save 60 instructions) for the Sse fast path about the 128 bits variant.
* Add slow path, both variants. Fix V128 Shl/Shr when shift = 0.
* A32: Add Vmull_I P64 variant (slow path); not tested.
* A32: Add Vmull_I_P8_P64 Test and fix P64 variant.
* Added support for offline invalidation, via PPTC, of low cq translations replaced by high cq translations; both on a single run and between runs.
Added invalidation of .cache files in the event of reuse on a different user operating system.
Added .info and .cache files invalidation in case of a failed stream decompression.
Nits.
* InternalVersion = 1712;
* Nits.
* Address comment.
* Get rid of BinaryFormatter.
Nits.
* Move Ptc.LoadTranslations().
Nits.
* Nits.
* Fixed corner cases (in case backup copies have to be used). Added save logs.
* Not core fixes.
* Complement to the previous commit. Added load logs. Removed BinaryFormatter leftovers.
* Add LoadTranslations log.
* Nits.
* Removed the search and management of LowCq overlapping functions.
* Final increment of .info and .cache flags.
* Nit.
* GetIndirectFunctionAddress(): Validate that writing actually takes place in dynamic table memory range (and not elsewhere).
* Fix Ptc.UpdateInfo() due to rebase.
* Nit for retrigger Checks.
* Nit for retrigger Checks.
* Initial cache memory allocator implementation
* Get rid of CallFlag
* Perform cache cleanup on exit
* Basic cache invalidation
* Thats not how conditionals works in C# it seems
* Set PTC version to PR number
* Address PR feedback
* Update InstEmitFlowHelper.cs
* Flag clear on address is no longer needed
* Do not include exit block in function size calculation
* Dispose jump table
* For future use
* InternalVersion = 1519 (force retest).
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Implement VFNMA.F<32/64>
* Update PTC Version
* Update Implementation & Renames & Correct Order
* Fix alignment
* Update implementation to not trigger assert
* Actually use the intrinsic that makes sense :)
* WIP Range Tracking
- Texture invalidation seems to have large problems
- Buffer/Pool invalidation may have problems
- Mirror memory tracking puts an additional `add` in compiled code, we likely just want to make HLE access slower if this is the final solution.
- Native project is in the messiest possible location.
- [HACK] JIT memory access always uses native "fast" path
- [HACK] Trying some things with texture invalidation and views.
It works :)
Still a few hacks, messy things, slow things
More work in progress stuff (also move to memory project)
Quite a bit faster now.
- Unmapping GPU VA and CPU VA will now correctly update write tracking regions, and invalidate textures for the former.
- The Virtual range list is now non-overlapping like the physical one.
- Fixed some bugs where regions could leak.
- Introduced a weird bug that I still need to track down (consistent invalid buffer in MK8 ribbon road)
Move some stuff.
I think we'll eventually just put the dll and so for this in a nuget package.
Fix rebase.
[WIP] MultiRegionHandle variable size ranges
- Avoid reprotecting regions that change often (needs some tweaking)
- There's still a bug in buffers, somehow.
- Might want different api for minimum granularity
Fix rebase issue
Commit everything needed for software only tracking.
Remove native components.
Remove more native stuff.
Cleanup
Use a separate window for the background context, update opentk. (fixes linux)
Some experimental changes
Should get things working up to scratch - still need to try some things with flush/modification and res scale.
Include address with the region action.
Initial work to make range tracking work
Still a ton of bugs
Fix some issues with the new stuff.
* Fix texture flush instability
There's still some weird behaviour, but it's much improved without this. (textures with cpu modified data were flushing over it)
* Find the destination texture for Buffer->Texture full copy
Greatly improves performance for nvdec videos (with range tracking)
* Further improve texture tracking
* Disable Memory Tracking for view parents
This is a temporary approach to better match behaviour on master (where invalidations would be soaked up by views, rather than trigger twice)
The assumption is that when views are created to a texture, they will cover all of its data anyways. Of course, this can easily be improved in future.
* Introduce some tracking tests.
WIP
* Complete base tests.
* Add more tests for multiregion, fix existing test.
* Cleanup Part 1
* Remove unnecessary code from memory tracking
* Fix some inconsistencies with 3D texture rule.
* Add dispose tests.
* Use a background thread for the background context.
Rather than setting and unsetting a context as current, doing the work on a dedicated thread with signals seems to be a bit faster.
Also nerf the multithreading test a bit.
* Copy to texture with matching alignment
This extends the copy to work for some videos with unusual size, such as tutorial videos in SMO. It will only occur if the destination texture already exists at XCount size.
* Track reads for buffer copies. Synchronize new buffers before copying overlaps.
* Remove old texture flushing mechanisms.
Range tracking all the way, baby.
* Wake the background thread when disposing.
Avoids a deadlock when games are closed.
* Address Feedback 1
* Separate TextureCopy instance for background thread
Also `BackgroundContextWorker.InBackground` for a more sensible idenfifier for if we're in a background thread.
* Add missing XML docs.
* Address Feedback
* Maybe I should start drinking coffee.
* Some more feedback.
* Remove flush warning, Refocus window after making background context
* Changes to allow explicit management of service threads
* Remove now unused code
* Remove ThreadCounter, its no longer needed
* Allow and use separate server per service, also fix exit issues
* New policy change: PTC version now uses PR number
* Implement block placement
Implement a simple pass which re-orders cold blocks at the end of the
list of blocks in the CFG.
* Set PPTC version
* Use Array.Resize
Address gdkchan's feedback
* Add Compare instruction
* Add BranchIf instruction
* Use test when BranchIf & Compare against 0
* Propagate Compare into BranchIfTrue/False use
- Propagate Compare operations into their BranchIfTrue/False use and
turn these into a BranchIf.
- Clean up Comparison enum.
* Replace BranchIfTrue/False with BranchIf
* Use BranchIf in EmitPtPointerLoad
- Using BranchIf early instead of BranchIfTrue/False improves LCQ and
reduces the amount of work needed by the Optimizer.
EmitPtPointerLoader was a/the big producer of BranchIfTrue/False.
- Fix asserts firing when assembling BitwiseAnd because of type
mismatch in EmitStoreExclusive. This is harmless and should not
cause any diffs.
* Increment PPTC interval version
* Improve IRDumper for BranchIf & Compare
* Use BranchIf in EmitNativeCall
* Clean up
* Do not emit test when immediately preceded by and
* Add Fmax/minv_V & S/Ushl_S Inst.s with Tests. Fix Maxps/d & Minps/d double zero sign handling. Allows better handling of NaNs.
* Optimized EmitSse2VectorIsNaNOpF() for multiple uses per opF.
* Add CRC32 A32 instructions.
* Fix CRC32 instructions.
* Add CRC intrinsic and fast path.
Loop is currently unrolled, will look into adding temp vars after tests are added.
* Begin work on Crc tests
* Fix SSE4.2 path for CRC32C, finialize tests.
* Remove unused IR path.
* Fix spacing between prefix checks.
* This should be Src.
* PTC Version
* OpCodeTable Order
* Integer check improvement. Value and Crc can be either 32 or 64 size.
* This wasn't necessary...
* If size is 3, value type must be I64.
* Fix same src+dest handling for non crc intrinsics.
* Pre-fix (ha) issue with vex encodings
* Add Vmvn (register), tests for both Vmvn variants.
* Add Vpmin, Vpmax, improve Non-FastFp accuracy for Vpadd
* Rebase on top of PTC.
* Add Nopcode
* Increment PTC version.
* Fix nits.
* Generalize tail continues
* Fix DecodeBasicBlock
`Next` and `Branch` would be null, which is not the state expected by
the branch instructions. They end up branching or falling into a block
which is never populated by the `Translator`. This causes an assert to
be fired when building the CFG.
* Clean up Decode overloads
* Do not synchronize when branching into exit block
If we're branching into an exit block, that exit block will tail
continue into another translation which already has a synchronization.
* Remove A32 predicate tail continue
If `block` is not an exit block then the `block.Next` must exist (as
per the last instruction of `block`).
* Throw if decoded 0 blocks
Address gdkchan's feedback
* Rebuild block list instead of setting to null
Address gdkchan's feedback
* Delete DelegateTypes.cs
* Delete DelegateCache.cs
* Add files via upload
* Update Horizon.cs
* Update Program.cs
* Update MainWindow.cs
* Update Aot.cs
* Update RelocEntry.cs
* Update Translator.cs
* Update MemoryManager.cs
* Update InstEmitMemoryHelper.cs
* Update Delegates.cs
* Nit.
* Nit.
* Nit.
* 10 fewer MSIL bytes for us
* Add comment. Nits.
* Update Translator.cs
* Update Aot.cs
* Nits.
* Opt..
* Opt..
* Opt..
* Opt..
* Allow to change compression level.
* Update MemoryManager.cs
* Update Translator.cs
* Manage corner cases during the save phase. Nits.
* Update Aot.cs
* Translator response tweak for Aot disabled. Nit.
* Nit.
* Nits.
* Create DelegateHelpers.cs
* Update Delegates.cs
* Nit.
* Nit.
* Nits.
* Fix due to #784.
* Fixes due to #757 & #841.
* Fix due to #846.
* Fix due to #847.
* Use MethodInfo for managed method calls.
Use IR methods instead of managed methods about Max/Min (S/U).
Follow-ups & Nits.
* Add missing exception messages.
Reintroduce slow path for Fmov_Vi.
Implement slow path for Fmov_Si.
* Switch to the new folder structure.
Nits.
* Impl. index-based relocation information. Impl. cache file version field.
* Nit.
* Address gdkchan comments.
Mainly:
- fixed cache file corruption issue on exit; - exposed a way to disable AOT on the GUI.
* Address AcK77 comment.
* Address Thealexbarney, jduncanator & emmauss comments.
Header magic, CpuId (FI) & Aot -> Ptc.
* Adaptation to the new application reloading system.
Improvements to the call system of managed methods.
Follow-ups.
Nits.
* Get the same boot times as on master when PTC is disabled.
* Profiled Aot.
* A32 support (#897).
* #975 support (1 of 2).
* #975 support (2 of 2).
* Rebase fix & nits.
* Some fixes and nits (still one bug left).
* One fix & nits.
* Tests fix (by gdk) & nits.
* Support translations not only in high quality and rejit.
Nits.
* Added possibility to skip translations and continue execution, using `ESC` key.
* Update SettingsWindow.cs
* Update GLRenderer.cs
* Update Ptc.cs
* Disabled Profiled PTC by default as requested in the past by gdk.
* Fix rejit bug. Increased number of parallel translations. Add stack unwinding stuffs support (1 of 2).
Nits.
* Add stack unwinding stuffs support (2 of 2). Tuned number of parallel translations.
* Restored the ability to assemble jumps with 8-bit offset when Profiled PTC is disabled or during profiling.
Modifications due to rebase.
Nits.
* Limited profiling of the functions to be translated to the addresses belonging to the range of static objects only.
* Nits.
* Nits.
* Update Delegates.cs
* Nit.
* Update InstEmitSimdArithmetic.cs
* Address riperiperi comments.
* Fixed the issue of unjustifiably longer boot times at the second boot than at the first boot, measured at the same time or reference point and with the same number of translated functions.
* Implemented a simple redundant load/save mechanism.
Halved the value of Decoder.MaxInstsPerFunction more appropriate for the current performance of the Translator.
Replaced by Logger.PrintError to Logger.PrintDebug in TexturePool.cs about the supposed invalid texture format to avoid the spawn of the log.
Nits.
* Nit.
Improved Logger.PrintError in TexturePool.cs to avoid log spawn.
Added missing code for FZ handling (in output) for fp max/min instructions (slow paths).
* Add configuration migration for PTC
Co-authored-by: Thog <me@thog.eu>