* Implement JIT Arm64 backend
* PPTC version bump
* Address some feedback from Arm64 JIT PR
* Address even more PR feedback
* Remove unused IsPageAligned function
* Sync Qc flag before calls
* Fix comment and remove unused enum
* Address riperiperi PR feedback
* Delete Breakpoint IR instruction that was only implemented for Arm64
* ARMeilleure: Add AVX512{F,VL,DQ,BW} detection
Add `UseAvx512Ortho` and `UseAvx512OrthoFloat` optimization flags as
short-hands for `F+VL` and `F+VL+DQ`.
* ARMeilleure: Add initial support for EVEX instruction encoding
Does not implement rounding, or exception controls.
* ARMeilleure: Add `X86Vpternlogd`
Accelerates the vector-`Not` instruction.
* ARMeilleure: Add check for `OSXSAVE` for AVX{2,512}
* ARMeilleure: Add check for `XCR0` flags
Add XCR0 register checks for AVX and AVX512F, following the guidelines
from section 14.3 and 15.2 from the Intel Architecture Software
Developer's Manual.
* ARMeilleure: Increment InternalVersion
* ARMeilleure: Remove redundant `ReProtect` and `Dispose`, formatting
* ARMeilleure: Move XCR0 procedure to GetXcr0Eax
* ARMeilleure: Add `XCR0` to `FeatureInfo` structure
* ARMeilleure: Utilize `ReadOnlySpan` for Xcr0 assembly
Avoids an additional allocation
* ARMeilleure: Formatting fixes
* Make all structs readonly when applicable. It should reduce amount of needless defensive copies
* Make structs with trivial boilerplate equality code record structs
* Remove unnecessary readonly modifiers from TextureCreateInfo
* Make BitMap structs readonly too
* ARMeilleure: Add `GFNI` detection
This is intended for utilizing the `gf2p8affineqb` instruction
* ARMeilleure: Add `gf2p8affineqb`
Not using the VEX or EVEX-form of this instruction is intentional. There
are `GFNI`-chips that do not support AVX(so no VEX encoding) such as
Tremont(Lakefield) chips as well as Jasper Lake.
13df339fe7/GenuineIntel/GenuineIntel00806A1_Lakefield_LC_InstLatX64.txt (L1297-L1299)13df339fe7/GenuineIntel/GenuineIntel00906C0_JasperLake_InstLatX64.txt (L1252-L1254)
* ARMeilleure: Add `gfni` acceleration of `Rbit_V`
Passes all `Rbit_V*` unit tests on my `i9-11900k`
* ARMeilleure: Add `gfni` acceleration of `S{l,r}i_V`
Also added a fast-path for when the shift amount is greater than the
size of the element.
* ARMeilleure: Add `gfni` acceleration of `Shl_V` and `Sshr_V`
* ARMeilleure: Increment InternalVersion
* ARMeilleure: Fix Intrinsic and Assembler Table alignment
`gf2p8affineqb` is the longest instruction name I know of. It shouldn't
get any wider than this.
* ARMeilleure: Remove SSE2+SHA requirement for GFNI
* ARMeilleure Add `X86GetGf2p8LogicalShiftLeft`
Used to generate GF(2^8) 8x8 bit-matrices for bit-shifting for the `gf2p8affineqb` instruction.
* ARMeilleure: Append `FeatureInfo7Ecx` to `FeatureInfo`
* Add an early `TailMerge` pass
Some translations can have a lot of guest calls and since for each guest
call there is a call guard which may return. This can produce a lot of
epilogue code for returns. This pass merges the epilogue into a single
block.
```
Using filter 'hcq'.
Using metric 'code size'.
Total diff: -1648111 (-7.19 %) (bytes):
Base: 22913847
Diff: 21265736
Improved: 4567, regressed: 14, unchanged: 144
```
* Set PTC version
* Address feedback
* Handle `void` returning functions
* Actually handle `void` returning functions
* Fix `RegisterToLocal` logging
* Optimize `TryAllocateRegWithtoutSpill` a bit
* Add a fast path for when all registers are live.
* Do not query `GetOverlapPosition` if the register is already in use
(i.e: free position is 0).
* Do not allocate child split list if not parent
* Turn `LiveRange` into a reference struct
`LiveRange` is now a reference wrapping struct like `Operand` and
`Operation`.
It has also been changed into a singly linked-list. In micro-benchmarks
traversing the linked-list was faster than binary search on `List<T>`.
Even for quite large input sizes (e.g: 1,000,000), surprisingly.
Could be because the code gen for traversing the linked-list is much
much cleaner and there is no virtual dispatch happening when checking if
intervals overlaps.
* Turn `LiveInterval` into an iterator
The LSRA allocates in forward order and never inspect previous
`LiveInterval` once they are expired. Something similar can be done for
the `LiveRange`s within the `LiveInterval`s themselves.
The `LiveInterval` is turned into a iterator which expires `LiveRange`
within it. The iterator is moved forward along with interval walking
code, i.e: AllocateInterval(context, interval, cIndex).
* Remove `LinearScanAllocator.Sources`
Local methods are less susceptible to do allocations than lambdas.
* Optimize `GetOverlapPosition(interval)` a bit
Time complexity should be in O(n+m) instead of O(nm) now.
* Optimize `NumberLocals` a bit
Use the same idea as in `HybridAllocator` to store the visited state
in the MSB of the Operand's value instead of using a `HashSet<T>`.
* Optimize `InsertSplitCopies` a bit
Avoid allocating a redundant `CopyResolver`.
* Optimize `InsertSplitCopiesAtEdges` a bit
Avoid redundant allocations of `CopyResolver`.
* Use stack allocation for `freePositions`
Avoid redundant computations.
* Add `UseList`
Replace `SortedIntegerList` with an even more specialized data
structure. It allocates memory on the arena allocators and does not
require copying use positions when splitting it.
* Turn `LiveInterval` into a reference struct
`LiveInterval` is now a reference wrapping struct like `Operand` and
`Operation`.
The rationale behind turning this in a reference wrapping struct is
because a `LiveInterval` is associated with each local variable, and
these intervals may themselves be split further. I've seen translations
having up to 8000 local variables.
To make the `LiveInterval` unmanaged, a new data structure called
`LiveIntervalList` was added to store child splits. This differs from
`SortedList<,>` because it can contain intervals with the same start
position.
Really wished we got some more of C++ template in C#. :^(
* Optimize `GetChildSplit` a bit
No need to inspect the remaining ranges if we've reached a range which
starts after position, since the split list is ordered.
* Optimize `CopyResolver` a bit
Lazily allocate the fill, spill and parallel copy structures since most
of the time only one of them is needed.
* Optimize `BitMap.Enumerator` a bit
Marking `MoveNext` as `AggressiveInlining` allows RyuJIT to promote the
`Enumerator` struct into registers completely, reducing load/store code
a lot since it does not have to store the struct on the stack for ABI
purposes.
* Use stack allocation for `use/blockedPositions`
* Optimize `AllocateWithSpill` a bit
* Address feedback
* Make `LiveInterval.AddRange(,)` more conservative
Produces no diff against master, but just for good measure.
* Add `Operand.Label` support to `Assembler`
This adds label support to `Assembler` and enables branch tightening
when compiling with relocatables. Jump management and patching has been
moved to the `Assembler`.
* Move instruction table to `Assembler.Table`
* Set PTC internal version
* Rename `Assembler.Table` to `AssemblerTable`
* Store constant `Operand`s in the `LocalInfo`
Since the spill slot and register assigned is fixed, we can just store
the `Operand` reference in the `LocalInfo` struct. This allows skipping
hitting the intern-table for a look up.
* Skip `Uses`/`Assignments` management
Since the `HybridAllocator` is the last pass and we do not care about
uses/assignments we can skip managing that when setting destinations or
sources.
* Make `GetLocalInfo` inlineable
Also fix a possible issue where with numbered locals. See or-assignment
operator in `SetVisited(local)` before patch.
* Do not run `BlockPlacement` in LCQ
With the host mapped memory manager, there is a lot less cold code to
split from hot code. So disabling this in LCQ gives some extra
throughput - where we need it.
* Address Mou-Ikkai's feedback
* Apply suggestions from code review
Co-authored-by: VocalFan <45863583+Mou-Ikkai@users.noreply.github.com>
* Move check to an assert
Co-authored-by: VocalFan <45863583+Mou-Ikkai@users.noreply.github.com>
* Refactor `PtcInfo`
This change reduces the coupling of `PtcInfo` by moving relocation
tracking to the backend. `RelocEntry`s remains as `RelocEntry`s through
out the pipeline until it actually needs to be written to the PTC
streams. Keeping this representation makes inspecting and manipulating
relocations after compilations less painful. This is something I needed
to do to patch relocations to 0 to diff dumps.
Contributes to #1125.
* Turn `Symbol` & `RelocInfo` into readonly structs
* Add documentation to `CompiledFunction`
* Remove `Compiler.Compile<T>`
Remove `Compiler.Compile<T>` and replace it by `Map<T>` of the
`CompiledFunction` returned.
* Fix type mismatch in `BitwiseAnd` simplification
`TryEliminateBitwiseAnd` would turn the `BitwiseAnd` operation into a
copy of the wrong type. E.g:
Before `Simplification`:
```llvm
i64 %0 = BitwiseAnd i64 0x0, %1
```
After `Simplication`:
```llvm
i64 %0 = Copy i32 0x0
```
Since the with the changes in #2515, we iterate in reverse order and
`Simplication`, `ConstantFolding` does not indicate if it modified
the CFG, the second pass to "retype" the copy into the proper
destination type does not happen.
This also blocked copy propagation since its destination type did not
match with its source type. But in the cases I've seen, the
`PreAllocator` would insert a copy for the propagated constant, which
results in no diffs.
Since the copy remained as is, asserts are fired when generating it.
* Set PPTC version
* Turn `MemoryOperand` into a struct
* Remove `IntrinsicOperation`
* Remove `PhiNode`
* Remove `Node`
* Turn `Operand` into a struct
* Turn `Operation` into a struct
* Clean up pool management methods
* Add `Arena` allocator
* Move `OperationHelper` to `Operation.Factory`
* Move `OperandHelper` to `Operand.Factory`
* Optimize `Operation` a bit
* Fix `Arena` initialization
* Rename `NativeList<T>` to `ArenaList<T>`
* Reduce `Operand` size from 88 to 56 bytes
* Reduce `Operation` size from 56 to 40 bytes
* Add optimistic interning of Register & Constant operands
* Optimize `RegisterUsage` pass a bit
* Optimize `RemoveUnusedNodes` pass a bit
Iterating in reverse-order allows killing dependency chains in a single
pass.
* Fix PPTC symbols
* Optimize `BasicBlock` a bit
Reduce allocations from `_successor` & `DominanceFrontiers`
* Fix `Operation` resize
* Make `Arena` expandable
Change the arena allocator to be expandable by allocating in pages, with
some of them being pooled. Currently 32 pages are pooled. An LRU removal
mechanism should probably be added to it.
Apparently MHR can allocate bitmaps large enough to exceed the 16MB
limit for the type.
* Move `Arena` & `ArenaList` to `Common`
* Remove `ThreadStaticPool` & co
* Add `PhiOperation`
* Reduce `Operand` size from 56 from 48 bytes
* Add linear-probing to `Operand` intern table
* Optimize `HybridAllocator` a bit
* Add `Allocators` class
* Tune `ArenaAllocator` sizes
* Add page removal mechanism to `ArenaAllocator`
Remove pages which have not been used for more than 5s after each reset.
I am on fence if this would be better using a Gen2 callback object like
the one in System.Buffers.ArrayPool<T>, to trim the pool. Because right
now if a large translation happens, the pages will be freed only after a
reset. This reset may not happen for a while because no new translation
is hit, but the arena base sizes are rather small.
* Fix `OOM` when allocating larger than page size in `ArenaAllocator`
Tweak resizing mechanism for Operand.Uses and Assignemnts.
* Optimize `Optimizer` a bit
* Optimize `Operand.Add<T>/Remove<T>` a bit
* Clean up `PreAllocator`
* Fix phi insertion order
Reduce codegen diffs.
* Fix code alignment
* Use new heuristics for degree of parallelism
* Suppress warnings
* Address gdkchan's feedback
Renamed `GetValue()` to `GetValueUnsafe()` to make it more clear that
`Operand.Value` should usually not be modified directly.
* Add fast path to `ArenaAllocator`
* Assembly for `ArenaAllocator.Allocate(ulong)`:
.L0:
mov rax, [rcx+0x18]
lea r8, [rax+rdx]
cmp r8, [rcx+0x10]
ja short .L2
.L1:
mov rdx, [rcx+8]
add rax, [rdx+8]
mov [rcx+0x18], r8
ret
.L2:
jmp ArenaAllocator.AllocateSlow(UInt64)
A few variable/field had to be changed to ulong so that RyuJIT avoids
emitting zero-extends.
* Implement a new heuristic to free pooled pages.
If an arena is used often, it is more likely that its pages will be
needed, so the pages are kept for longer (e.g: during PPTC rebuild or
burst sof compilations). If is not used often, then it is more likely
that its pages will not be needed (e.g: after PPTC rebuild or bursts
of compilations).
* Address riperiperi's feedback
* Use `EqualityComparer<T>` in `IntrusiveList<T>`
Avoids a potential GC hole in `Equals(T, T)`.
* Add AddressTable<T>
* Use AddressTable<T> for dispatch
* Remove JumpTable & co.
* Add fallback for out of range addresses
* Add PPTC support
* Add documentation to `AddressTable<T>`
* Make AddressTable<T> configurable
* Fix table walk
* Fix IsMapped check
* Remove CountTableCapacity
* Add PPTC support for fast path
* Rename IsMapped to IsValid
* Remove stale comment
* Change format of address in exception message
* Add TranslatorStubs
* Split DispatchStub
Avoids recompilation of stubs during tests.
* Add hint for 64bit or 32bit
* Add documentation to `Symbol`
* Add documentation to `TranslatorStubs`
Make `TranslatorStubs` disposable as well.
* Add documentation to `SymbolType`
* Add `AddressTableEventSource` to monitor function table size
Add an EventSource which measures the amount of unmanaged bytes
allocated by AddressTable<T> instances.
dotnet-counters monitor -n Ryujinx --counters ARMeilleure
* Add `AllowLcqInFunctionTable` optimization toggle
This is to reduce the impact this change has on the test duration.
Before everytime a test was ran, the FunctionTable would be initialized
and populated so that the newly compiled test would get registered to
it.
* Implement unmanaged dispatcher
Uses the DispatchStub to dispatch into the next translation, which
allows execution to stay in unmanaged for longer and skips a
ConcurrentDictionary look up when the target translation has been
registered to the FunctionTable.
* Remove redundant null check
* Tune levels of FunctionTable
Uses 5 levels instead of 4 and change unit of AddressTableEventSource
from KB to MB.
* Use 64-bit function table
Improves codegen for direct branches:
mov qword [rax+0x408],0x10603560
- mov rcx,sub_10603560_OFFSET
- mov ecx,[rcx]
- mov ecx,ecx
- mov rdx,JIT_CACHE_BASE
- add rdx,rcx
+ mov rcx,sub_10603560
+ mov rdx,[rcx]
mov rcx,rax
Improves codegen for dispatch stub:
and rax,byte +0x1f
- mov eax,[rcx+rax*4]
- mov eax,eax
- mov rcx,JIT_CACHE_BASE
- lea rax,[rcx+rax]
+ mov rax,[rcx+rax*8]
mov rcx,rbx
* Remove `JitCacheSymbol` & `JitCache.Offset`
* Turn `Translator.Translate` into an instance method
We do not have to add more parameter to this method and related ones as
new structures are added & needed for translation.
* Add symbol only when PTC is enabled
Address LDj3SNuD's feedback
* Change `NativeContext.Running` to a 32-bit integer
* Fix PageTable symbol for host mapped
* Allow `LocalVariable` to be assigned more than once
This allows us to write flow controls like loops and if-elses with
LocalVariables participating in phi nodes.
* Add `GetLocalNumber` to operand
* PPTC & Pool Enhancements.
* Avoid buffer allocations in CodeGenContext.GetCode(). Avoid stream allocations in PTC.PtcInfo.
Refactoring/nits.
* Use XXHash128, for Ptc.Load & Ptc.Save, x10 faster than Md5.
* Why not a nice Span.
* Added a simple PtcFormatter library for deserialization/serialization, which does not require reflection, in use at PtcJumpTable and PtcProfiler; improves maintainability and simplicity/readability of affected code.
* Nits.
* Revert #1987.
* Revert "Revert #1987."
This reverts commit 998be765cf.
* Optimization | Modify Add Instruction to use LEA instead.
Currently, the add instruction requires 4 registers to take place. By using LEA, we can effectively perform the same working using 3 registers, reducing memory spills and improving translation efficiency.
* Fix IsSameOperandDestSrc1 Check for Add
* Use LEA if Dest != SRC1
* Update IsSameOperandDestSrc1 to account for Cases where Dest and Src1 can be same for add
* Fix error in logic
* Typo
* Add paranthesis for clarity
* Compare registers as requested.
* Cleanup if statement, use same comparison method as generateCopy
* Make change as recommended by gdk
* Perform check only when Add calls are made
* use ensure sametype for lea, fix else
* Update comment
* Update version #
* Added support for offline invalidation, via PPTC, of low cq translations replaced by high cq translations; both on a single run and between runs.
Added invalidation of .cache files in the event of reuse on a different user operating system.
Added .info and .cache files invalidation in case of a failed stream decompression.
Nits.
* InternalVersion = 1712;
* Nits.
* Address comment.
* Get rid of BinaryFormatter.
Nits.
* Move Ptc.LoadTranslations().
Nits.
* Nits.
* Fixed corner cases (in case backup copies have to be used). Added save logs.
* Not core fixes.
* Complement to the previous commit. Added load logs. Removed BinaryFormatter leftovers.
* Add LoadTranslations log.
* Nits.
* Removed the search and management of LowCq overlapping functions.
* Final increment of .info and .cache flags.
* Nit.
* GetIndirectFunctionAddress(): Validate that writing actually takes place in dynamic table memory range (and not elsewhere).
* Fix Ptc.UpdateInfo() due to rebase.
* Nit for retrigger Checks.
* Nit for retrigger Checks.
* Implement VFNMA.F<32/64>
* Update PTC Version
* Update Implementation & Renames & Correct Order
* Fix alignment
* Update implementation to not trigger assert
* Actually use the intrinsic that makes sense :)
Before when splitting intervals, the end of the range would be included
in the split check, this can produce empty ranges in the child split.
This in turn can affect spilling decisions since the child split will
have a different start position and this empty range will get a register
and move to the active set for a brief moment.
For example:
A = [153, 172[; [1899, 1916[; [1991, 2010[; [2397, 2414[; ...
Split(A, 1916)
A0 = [153, 172[; [1899, 1916[
A1 = [1916, 1916[; [1991, 2010[; [2397, 2414[; ...
* Implement block placement
Implement a simple pass which re-orders cold blocks at the end of the
list of blocks in the CFG.
* Set PPTC version
* Use Array.Resize
Address gdkchan's feedback
* Relax block ordering constraints
Before `block.Next` had to follow `block.ListNext`, now it does not.
Instead `CodeGenerator` will now emit the necessary jump instructions
to ensure control flow.
This makes control flow and block order modifications easier. It also
eliminates some simple cases of redundant branches.
* Set PPTC version
* Add Compare instruction
* Add BranchIf instruction
* Use test when BranchIf & Compare against 0
* Propagate Compare into BranchIfTrue/False use
- Propagate Compare operations into their BranchIfTrue/False use and
turn these into a BranchIf.
- Clean up Comparison enum.
* Replace BranchIfTrue/False with BranchIf
* Use BranchIf in EmitPtPointerLoad
- Using BranchIf early instead of BranchIfTrue/False improves LCQ and
reduces the amount of work needed by the Optimizer.
EmitPtPointerLoader was a/the big producer of BranchIfTrue/False.
- Fix asserts firing when assembling BitwiseAnd because of type
mismatch in EmitStoreExclusive. This is harmless and should not
cause any diffs.
* Increment PPTC interval version
* Improve IRDumper for BranchIf & Compare
* Use BranchIf in EmitNativeCall
* Clean up
* Do not emit test when immediately preceded by and
* Use movd,movq for i32/64 VectorExtract %x, 0x0
* Increment PPTC interval version
* Use else-if instead
- Address gdkchan's feedback.
- Clean up Debug.Assert calls
* Inline `count` expression into Debug.Assert
Apparently the CoreCLR JIT will not eliminate this. :(
* Add CRC32 A32 instructions.
* Fix CRC32 instructions.
* Add CRC intrinsic and fast path.
Loop is currently unrolled, will look into adding temp vars after tests are added.
* Begin work on Crc tests
* Fix SSE4.2 path for CRC32C, finialize tests.
* Remove unused IR path.
* Fix spacing between prefix checks.
* This should be Src.
* PTC Version
* OpCodeTable Order
* Integer check improvement. Value and Crc can be either 32 or 64 size.
* This wasn't necessary...
* If size is 3, value type must be I64.
* Fix same src+dest handling for non crc intrinsics.
* Pre-fix (ha) issue with vex encodings