Beyond 39-bit instruction virtual address extension

Peter says:

I'd like to propose a spec change and don't know who to contact. My suggestion is that the instruction virtual address remain at 39-bits (or lower) while moving the data virtual address to 48-bits. These 2 spaces do not need to be the same size, and the instruction space will naturally be a very small subset. The reason we expand is to access more data, but the HW cost is primarily from the instruction virtual address. I don't believe there are any applications that require nearly this much instruction space, so it's possible compilers already abide by this restriction. However we would need to formalize it to take advantage in HW.

I've participated in many feasibilities to expand the virtual address through the years, and the costs (frequency, area, and power) are prohibitive and get worse with each process. The main reason it is so expensive is that the virtual address is used within the core to track each instruction, so it exists in almost every functional block. We try to implement address compression where possible, but it is still perhaps the costliest group of signals we have. This false dependency between instruction and data address space is the reason x86 processors have been stuck at 48 bits for more than a decade despite a strong demand for expansion from server customers.

This seems like the type of HW/SW collaboration that RISC-V was meant to address. Any suggestions how to proceed?

Discussion with Peter and lkcl

i believe that would have implications that only a 32/36/39 bit total application execution space could be fitted into the TLB at any one time, i.e. that if there were two applications approaching those limits, that the TLBs would need to be entirely swapped out to make room for one (and only one) of those insanely-large programs to execute at any one time.

Yes, one solution would be to restrict the instruction TLB to one (or a few) segments. Our interface to SW is on page misses and when reading from registers (e.g. indirect branches), so we can translate to the different address size at these points. It would be preferable if the corner cases were disallowed by SW.

ok so just to be clear:

  • application instruction space addressing is restricted to 32/36/39-bit (whatever)
  • virtual address space for applications is restricted to 48-bit (on rv64: rv128 has higher?)
  • TLBs for application instruction space can then be restricted to 32+N/36+N/39+N where 0 <= N <= a small number.
  • the smaller application space results in less virtual instruction address routing hardware (the primary goal)
  • an indirect branch, which will always be to an address within the 32/36/39-bit range, will result in a virtual TLB table miss
  • the miss will be in: -> the 32+N/36+N/39+N space that will be -> redirected to a virtual 48-bit address that will be -> redirected to real RAM through the TLB.

assuming i have that right, in this way:

  • you still have up to 48-bit actual virtual addressing (and potentially even higher, even on RV64)
  • but any one application is limited in instruction addressing range to 32/36/39-bit
  • BUT you CAN actually have multiple such applications running simultaneously (depending on whether N is greater than zero or not).

is that about right?

if so, what are the disadvantages? what is lost (vs what is gained)?


reply:

ok so just to be clear:

  • application instruction space addressing is restricted to 32/36/39-bit (whatever)

The address space of a process would ideally be restricted to a range such as this. If not, SW would preferably help with corner cases (e.g. instruction overlaps segment boundary).

  • virtual address space for applications is restricted to 48-bit (on rv64: rv128 has higher?)

Anything 64-bits or less would be fine (more of an ISA issue).

  • TLBs for application instruction space can then be restricted to 32+N/36+N/39+N where 0 <= N <= a small number.

Yes

  • the smaller application space results in less virtual instruction address routing hardware (the primary goal)

The primary goal is frequency, but routing in key areas is a major component of this (and is increasingly important on each new silicon process). Area and power are secondary goals.

  • an indirect branch, which will always be to an address within the 32/36/39-bit range, will result in a virtual TLB table miss

Indirect branches would ideally always map to the range, but HW would always check.

  • the miss will be in: -> the 32+N/36+N/39+N space that will be -> redirected to a virtual 48-bit address that will be -> redirected to real RAM through the TLB.

Actually a page walk through the page miss handler, but the concept is correct.

if so, what are the disadvantages? what is lost (vs what is gained)?

I think the disadvantages are mainly SW implementation costs. The advantages are frequency, power, and area. Also a mechanism for expanded addressability and security.

[hypothetically, the same scheme could equally be applied to 48-bit executables (so 32/36/39/48).)]

Jacob and Albert discussion

Albert Cahalan wrote:

The solution is likely to limit the addresses that can be living in the pipeline at any one moment. If that would be exceeded, you wait.

For example, split a 64-bit address into a 40-bit upper part and a 24-bit lower part. Assign 3-bit codes in place of the 40-bit portion, first-come-first-served. Track just 27 bits (3+24) through the processor. You can do a reference count on the 3-bit codes or just wait for the whole pipeline to clear and then recycle all of the 3-bit codes.

Adjust all those numbers as determined by benchmarking.

I must say, this bears a strong resemblance to the TLB. Maybe you could use a TLB entry index for the tracking.

I had thought of a similar solution.

The key is that the pipeline can only care about some subset of the virtual address space at any one time. All that is needed is some way to distinguish the instructions that are currently in the pipeline, rather than every instruction in the process, as virtual addresses do.

I suggest using cache or TLB coordinates as instruction tags. This would require that the L1 I-cache or ITLB "pin" each cacheline or slot that holds a currently-pending instruction until that instruction is retired. The L1 I-cache is probably an ideal reference, since the cache tag array has the current base virtual address for each cacheline and the rest of the pipeline would only need {cacheline number, offset} tuples. Evicting the cacheline containing the most-recently-fetched instruction would be insane in general, so this should have minimal impact on L1 I-cache management. If the virtual address of the instruction is needed for any reason, it can be read from the I-cache tag array.

This approach can be trivially extended to multi-ASID or even multi-VMID systems by simply adding VMID and ASID fields to the tag tuples.

The L1 I-cache provides an easy solution for assigning "short codes" to replace the upper portion of an instruction's virtual address. As an example, consider an 8KiB L1 I-cache with 128-byte cachelines. Such a cache has 64 cachelines (6 bits) and each cacheline has 64 or 32 possible instructions (depending on implementation of RVC or other odd-alignment ISA extensions). For an RVC-capable system (the worst case), each 128-byte cacheline has 64 possible instruction locations, for another 6 bits. So now the rest of the pipeline need only track 12-bit tags that reference the L1 I-cache. A similar approach could also use the ITLB, but the ITLB variant results in larger tags, due both to the need to track page offsets (11 bits) and the larger number of slots the ITLB is likely to have.

Conceivably, even the program counter could be internally implemented in this way.


Jacob replies

The idea is that the internal encoding for (example) sepc could be the cache coordinates, and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array. In other words, cache coordinates do not need to be resolved back to virtual addresses until software does something that requires the virtual address.

Branch target addresses get "interesting" since the implementation must either be able to carry a virtual address for a branch target into the pipeline (JALR needs the ability to transfer to a virtual address anyway) or prefetch all branch targets so the branch address can be written as a cache coordinate. An implementation could also simply have both "branch to VA" and "branch to CC" macro-ops and probe the cache when a branch is decoded: if the branch target is already in the cache, decode as "branch to CC", otherwise decode as "branch to VA". This requires tracking both forms of the program counter, however, and adds a performance-optimization rule: branch targets should be in the same or next cacheline when feasible. (I expect most implementations that implement I-cache prefetch at all to automatically prefetch the next cacheline of the instruction stream. That is very cheap to implement and the prefetch will hit whenever execution proceeds sequentially, which should be fairly common.)

Limiting which instructions can take traps helps with this model, and interrupts (which can otherwise introduce interrupt traps anywhere) would need to be handled by inserting a "take interrupt trap" macro-op into the decoded instruction stream.

Also, this approach can use coordinates into either the L1 I-cache or the ITLB. I have been describing the cache version because I find it more interesting and it can use smaller tags than the TLB version. You mention evaluating TLB pointers and finding them insufficient; do cache pointers reduce or solve those issues? What were the problems with using TLB coordinates instead of virtual addresses?

More directly addressing lkcl's question, I expect that use of cache coordinates to be completely transparent to software, requiring no change to the ISA spec. As a purely microarchitectural solution, it also meets Dr. Waterman's goal.

Microarchitecture design preference

andrew expressed a preference that the spec not require changes, instead that implementors design microarchitectures that solve the problem transparently.

so jacob (and peter, and albert, and others), what are your thoughts on whether these proposals would require a specification change. are they entirely transparent or are they guaranteed to have ramifications that propagate through the hardware and on to the toolchains and OSes, requiring formal platform-level specification and ratification?

I had hoped for software proposals, but these HW proposals would not require a specification change. I found that TLB ptrs didn't address our primary design issues (about 10 years ago), but it does simplify areas of the design. At least a partial TLB would be needed at other points in the pipeline when reading the VA from registers or checking branch addresses.

I still think the spec should recognize that the instruction space has very different requirements and costs.


" sepc could be the cache coordinates [set,way?], and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array" This makes no sense to me. First, reading the CSR move the CSR into a GPR, it doesn't look up anything in the cache.

In an implementation using cache coordinates for epc, reading epc does perform a cache tag lookup.

In case you instead meant that it is then used to index into the cache, then either: - Reading the CSR into a GPR resolves to a VA, or

This is correct.

[...] Neither of those explanations makes sense- could you explain better?

In this case, where sepc stores a (cache row, offset) tuple, reading sepc requires resolving that tuple into a virtual address, which is done by reading the high bits from the cache tag array and carrying over the offset within the cacheline. CPU-internal "magic cookie" cache coordinates are not software-visible. In this specific case, at entry to the trap handler, the relevant cacheline must be present -- it holds the most-recently executed instruction before the trap.

In general, the cacheline can be guaranteed to remain present using interlock logic that prevents its eviction unless no part of the processor is "looking at" it. Reference counting is a solved problem and should be sufficient for this. This gets a bit more complex with speculative execution and multiple privilege levels, although a cache-per-privilege-level model (needed to avoid side channels) also solves the problem of the cacheline being evicted -- the user cache is frozen while the supervisor runs and vice versa. I have an outline for a solution to this problem involving shadow cachelines (enabling speculative prefetch/eviction in a VIPT cache) and a "trace scoreboard" (multi-column reference counter array -- each column tracks references from pending execution traces: issuing an instruction increments a cell, retiring an instruction decrements a cell, dropping a speculative trace (resolving predicate as false) zeros an entire column, and a cacheline may be selected for eviction iff its entire row is zero).

CSR reads are allowed to have software-visible side effects in RISC-V, although none of the current standard CSRs have side-effects on read. Looking at it this way, resolving cache coordinates to a virtual address upon reading sepc is simply a side effect that is not visible to software.