6600-style Scoreboards

Images reproduced with kind permission from Mitch Alsup

Modifications needed to Computation Unit and Group Picker

The scoreboard uses two large NOR gates to determine, respectively, when there are no read hazards and when there are no write hazards. These two NOR gates are permanently active (per Function Unit) even if the Function Unit is idle.

In the case of the Write path, these "permanently-on" signals are gated by a Write-Release-Request signal; without that gating, the Priority Picker would permanently select one of the Function Units (the highest-priority one). The same thing has to be done for the Read path as well.

Below are the modifications required to add a read-release path that prevents a Function Unit from requesting a GoRead signal when it has no need to read registers. Note that once Busy and GoRead are both dropped, ReadRelease is dropped.
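The gating described above can be sketched in plain Python. This is an illustrative model only (the signal and parameter names `busy`, `read_release` and `rd_pend_vector` are assumptions, not taken from the source): the "big NOR" over the Read-Pending vector is combined with Busy and ReadRelease so that an idle Function Unit never competes in the Priority Picker.

```python
def go_read_request(busy, read_release, rd_pend_vector):
    """Request a GoRead only when this FU is busy, still needs to read
    registers (ReadRelease), and the big NOR over all Read-Pending
    bits reports that no read hazard exists."""
    no_read_hazard = not any(rd_pend_vector)  # the "permanently-on" NOR
    # gate the hazard-free signal so idle FUs drop out of the picker
    return busy and read_release and no_read_hazard
```

An idle unit (`busy=False`) or one that has released its reads can then never be the "highest priority" input that the picker permanently selects.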

Note that this is a loop: GoRead (ANDed with Busy) feeds the priority picker, which in turn generates GoRead, so it is critical (in a modern design) to place a clock-synchronised latch in this path.

Source:

Multi-in cascading Priority Picker

Using the Group Picker as a fundamental unit, a cascading chain is created, with each output "masking" an output from being selected in all down-chain Pickers. Whilst the input is a single unary array of bits, the output is multiple unary arrays where only one bit in each is set.
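The cascading chain described above can be modelled directly: each stage runs an ordinary priority pick, then masks its chosen bit out of the requests seen by all down-chain pickers. This is a behavioural sketch (function names are illustrative), not the gate-level implementation.

```python
def priority_pick(reqs):
    """Plain priority picker: one-hot output selecting the
    lowest-index (highest-priority) set bit of a unary input."""
    for i, r in enumerate(reqs):
        if r:
            return [1 if j == i else 0 for j in range(len(reqs))]
    return [0] * len(reqs)

def cascading_pick(reqs, num_ports):
    """Chain of pickers: each picked bit is masked out of the
    requests seen by all down-chain pickers, so the single unary
    input yields num_ports one-hot outputs, all mutually disjoint."""
    outputs = []
    remaining = list(reqs)
    for _ in range(num_ports):
        picked = priority_pick(remaining)
        outputs.append(picked)
        # mask the granted bit from every subsequent picker
        remaining = [r & ~p for r, p in zip(remaining, picked)]
    return outputs
```

With five requesters and two register file ports, `cascading_pick([0,1,1,0,1], 2)` grants the two highest-priority requesters, one per port, with no requester granted twice.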

This can be used for "port selection", for example when there are multiple Register File ports or multiple LOAD/STORE cache "ways", and there are many more devices seeking access to those "ports" than there are actual ports. (If the number of devices seeking access to ports were equal to the number of ports, each device could be allocated its own dedicated port).

Click on image to see full-sized version:

Links:

Modifications to Dependency Cell

Note: this version still requires CLK to operate on a HI-LO cycle. Further modifications are needed to create an ISSUE-GORD-PAUSE ISSUE-GORD-PAUSE sequence. For now, however, it is easier to stick with the original diagrams produced by Mitch Alsup.

The dependency cell is responsible for recording that a Function Unit requires the use of a dest or src register, which is given in UNARY. It is also responsible for "defending" that unary register bit for read and write hazards, and for also, on request (GoRead/GoWrite) generating a "Register File Select" signal.

The sequence of operations for determining hazards is as follows:

  • Issue goes HI when CLK is HI. If any of Dest / Oper1 / Oper2 are also HI, the relevant SRLatch will go HI to indicate that this Function Unit requires the use of this dest/src register
  • Bear in mind that this cell works in conjunction with the FU-FU cells
  • When Issue is LOW and CLK is HI, the "defending" comes into play. There will be another Function Unit somewhere that has had its Issue line raised. This cell needs to know if there is a conflict (Read Hazard or Write Hazard).
  • Therefore, this cell must, if either of the Oper1/Oper2 signals are HI, output a "Read after Write" (RaW) hazard if its Dest Latch (Dest-Q) is HI. This is the Read_Pending signal.
  • Likewise, if either of the two SRC Latches (Oper1-Q or Oper2-Q) are HI, this cell must output a "Write after Read" (WaR) hazard if the (other) instruction has raised the unary Dest line.
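The hazard-determination steps above can be sketched as a small behavioural model of one cell. The class and method names are illustrative assumptions; the real cell is built from SR latches, not Python state.

```python
class DependencyCell:
    """One cell: one Function Unit crossed with one (unary) register.
    Latches are set at Issue time; afterwards the cell 'defends' the
    register against conflicting reads and writes."""
    def __init__(self):
        self.dest_q = False   # SR latch: this FU will WRITE this register
        self.oper1_q = False  # SR latch: this FU will READ this register (src1)
        self.oper2_q = False  # SR latch: this FU will READ this register (src2)

    def issue(self, dest, oper1, oper2):
        # Issue HI (with CLK HI): record this FU's use of the register
        self.dest_q = self.dest_q or dest
        self.oper1_q = self.oper1_q or oper1
        self.oper2_q = self.oper2_q or oper2

    def raw_hazard(self, oper1, oper2):
        # another instruction raises its src lines: Read-after-Write
        # hazard if our Dest latch (a pending write) is still HI
        return (oper1 or oper2) and self.dest_q

    def war_hazard(self, dest):
        # another instruction raises its dest line: Write-after-Read
        # hazard if either of our SRC latches is still HI
        return dest and (self.oper1_q or self.oper2_q)
```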

The sequence for determining register select is as follows:

  • After Issue+CLK-HI has resulted in the relevant (unary) dest and src latches being set, at some point a GoRead (or GoWrite) signal needs to be asserted
  • The GoRead (or GoWrite) is asserted when CLK is LOW. The AND gate on Reset ensures that the SRLatch remains ENABLED.
  • This gives an opportunity for the Latch Q to be ANDed with the GoRead (or GoWrite), raising an indicator flag that the register is being "selected" by this Function Unit.
  • The "select" outputs from the entire column (all Function Units for this unary Register) are ORed together. Given that only one GoRead (or GoWrite) is guaranteed to be ASSERTed (because that is the Priority Picker's job), the ORing is acceptable.
  • Whilst the GoRead (or GoWrite) signal is still asserted HI, the CLK line goes LOW. With the Reset-AND-gate now being HI, this clears the latch. This is the desired outcome because in the previous cycle (which happened to be when CLK was LOW), the register file was read (or written).

The release of the latch has the by-product of releasing the "reservation": future instructions that test for Read/Write hazards will find that this Cell no longer responds. The hazard has already passed, because this Cell has already indicated that it was safe to read (or write) the register file, freeing future instructions from hazards in the process.
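The column-wise select described in the steps above reduces to an AND per cell followed by an OR down the column. A minimal sketch (names assumed, not from the source):

```python
def regfile_select(latch_q_column, go_column):
    """One entry per Function Unit for this (unary) register column.
    Each cell's select is its Latch-Q ANDed with that FU's GoRead (or
    GoWrite); the whole column is then ORed together. The OR is safe
    because the Priority Picker guarantees that at most one Go signal
    in the column is asserted at any time."""
    return any(q and go for q, go in zip(latch_q_column, go_column))
```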

Shadowing

Shadowing is important as it is the fundamental basis of:

  • Precise exceptions
  • Write-after-write hazard avoidance
  • Correct multi-issue instruction sequencing
  • Branch speculation

Modifications to the shadow circuit below allow the shadow flip-flops to be automatically reset after a Function Unit "dies". Without these modifications, the shadow unit may spuriously fire on subsequent re-use due to some of the latches being left in a previous state.

Note that only "success" will cause the latch to reset. Note also that the introduction of the NOT gate causes the latch to be more like a DFF (register).
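The DFF-like behaviour can be sketched as a next-state function for one shadow flip-flop. Signal names here are illustrative assumptions:

```python
def shadow_next(q, set_shadow, success):
    """Next state of one shadow flip-flop with synchronous clear:
    only 'success' resets it, so when a Function Unit 'dies' and is
    later re-used, stale latch state cannot spuriously fire."""
    if success:
        return False          # only "success" releases the shadow
    return q or set_shadow    # otherwise set, or hold previous state
```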

LD/ST Computation Unit

The Load/Store Computation Unit is a little more complex, involving three functions: LOAD, STORE, and INT Addition. The SR Latches create a cyclic chain (just as with the ALU Computation Unit); however, here there are three possible chains.

  • INT Addition mode will activate Issue, GoRead, GoWrite
  • LD Mode will activate Issue, GoRead, GoAddr then finally GoWrite
  • ST Mode will activate Issue, GoRead, GoAddr then GoStore

These signals will be allowed to activate when the correct "Req" lines are active. Cyclically respecting these request-response signals results in the SR Latches never going into "unstable / unknown" states.

  • Issue will close the opcode latch and OPEN the operand latch AND trigger "Request-Read" (and set "Busy")
  • Go-Read will close the operand latch and OPEN the address latch AND trigger "Request Address".
  • Go-Address will close the address latch and OPEN the result latch AND trigger "Request Write"
  • Go-Write will close the result latch and OPEN the opcode latch, and reset BUSY back to OFF, ready for a new cycle.
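The three cyclic chains and their latch-advance rule can be sketched as follows. This is a behavioural model only: the real unit is built from SR latches gated by the "Req" lines, and the dictionary/function names are assumptions for illustration.

```python
# the three cyclic request/response chains named in the text
CHAINS = {
    "ADD": ["Issue", "GoRead", "GoWrite"],
    "LD":  ["Issue", "GoRead", "GoAddr", "GoWrite"],
    "ST":  ["Issue", "GoRead", "GoAddr", "GoStore"],
}

def next_signal(mode, current):
    """Each Go signal closes one latch, opens the next, and triggers
    the next Request; the chain is cyclic, wrapping back to Issue
    (at which point Busy is reset, ready for a new cycle)."""
    chain = CHAINS[mode]
    return chain[(chain.index(current) + 1) % len(chain)]
```

Respecting this strict cyclic ordering is what keeps the SR Latches out of "unstable / unknown" states.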

Note: there is an error in the diagram, compared to the source code. It was necessary to capture src2 (op2) separately from src1 (op1), so that for ST, op2 goes into the STORE as the data, not op1.

Source:

Memory-Memory Dependency Matrix

Due to the possibility of more than one LD/ST being in flight, it is necessary to determine which memory operations are conflicting, and to preserve a semblance of order. It turns out that as long as there is no possibility of overlaps (note this wording carefully), and LOADs are done separately from STOREs, this is sufficient.

The first step then is to ensure that only a mutually-exclusive batch of LDs or STs (not both) is detected, with the order between such batches being preserved. This is what the memory-memory dependency matrix does.

"WAR" stands for "Write After Read" and is an SR Latch. "RAW" stands for "Read After Write" and likewise is an SR Latch. Any LD which comes in when a ST is pending will result in the relevant RAW SR Latch going active. Likewise, any ST which comes in when a LD is pending results in the relevant WAR SR Latch going active.

A LD can thus be prevented from proceeding while it has any dependent RAW hazards active, and likewise a ST can be prevented while any dependent WAR hazards are active. The matrix also ensures that ordering is preserved.
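The WAR/RAW latch behaviour above can be sketched as an NxN matrix model. Class and method names are illustrative assumptions; the hardware is a matrix of SR latches, one pair per FU-FU crossing.

```python
class MemDepMatrix:
    """NxN matrix of RAW/WAR SR latches between in-flight LD/ST units."""
    def __init__(self, n):
        self.raw = [[False] * n for _ in range(n)]  # LD blocked by pending ST
        self.war = [[False] * n for _ in range(n)]  # ST blocked by pending LD

    def issue_ld(self, fu, st_pending):
        # a LD arriving while STs are pending sets the relevant RAW latches
        for j, st in enumerate(st_pending):
            if st:
                self.raw[fu][j] = True

    def issue_st(self, fu, ld_pending):
        # a ST arriving while LDs are pending sets the relevant WAR latches
        for j, ld in enumerate(ld_pending):
            if ld:
                self.war[fu][j] = True

    def ld_blocked(self, fu):
        return any(self.raw[fu])   # LD held back by any dependent RAW hazard

    def st_blocked(self, fu):
        return any(self.war[fu])   # ST held back by any dependent WAR hazard
```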

Note however that this is the equivalent of an ALU "FU-FU" Matrix. A separate Register-Mem Dependency Matrix is still needed in order to preserve the register read/write dependencies that occur between instructions, where the Mem-Mem Matrix simply protects against memory hazards.

Note also that it does not detect address clashes: that is the responsibility of the Address Match Matrix.

Source:

Address Match Matrix

This is an important adjunct to the Memory Dependency Matrices: it ensures that no LDs or STs overlap, because if they did it could result in memory corruption. Example: a 64-bit ST at address 0x0001 comes in at the same time as a 64-bit ST to address 0x0002: the second write would overwrite bytes 0x0002 thru 0x0008 of the first write, and consequently the order of these two writes absolutely has to be preserved.

The suggestion from Mitch Alsup was to use a match system based on bits 4 thru 10/11 of the address. The idea being: we don't care if the matching is "too inclusive", i.e. we don't care if it includes addresses that don't actually overlap, because this just means "oh dear some LD/STs do not happen concurrently, they happen a few cycles later" (translation: Big Deal)

What we care about is if it were to miss some addresses that do actually overlap. Therefore it is perfectly acceptable to use only a few bits of the address. This is fortunate because the matching has to be done in a huge NxN Pascal's Triangle, and if we were to compare against the entirety of the address it would consume vast amounts of power and gates.
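A minimal sketch of the partial-address match, assuming bits 4 thru 11 are compared (the exact upper bit is a design choice per the text):

```python
def addr_match(addr1, addr2, lo=4, hi=11):
    """Conservative match on address bits lo..hi only. May be 'too
    inclusive' (false positives merely serialise a few LD/STs), but
    never misses a genuine overlap, since overlapping accesses share
    the same cache-line index bits."""
    mask = ((1 << (hi - lo + 1)) - 1) << lo  # e.g. 0xFF0 for bits 4..11
    return (addr1 & mask) == (addr2 & mask)
```

Comparing only 8 bits instead of the full address keeps the NxN triangle of comparators small, which matters given every in-flight pair must be matched.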

An enhancement of this idea is to turn the length of the operation (LD/ST 1 byte, 2 bytes, 4 or 8 bytes) into a byte-map "mask", using the bottom 4 bits of the address to offset this mask and "line up" with the Memory byte read/write enable wires on the underlying Memory used in the L1 Cache.

Then, the bottom 4 bits and the LD/ST length, now turned into a 16-bit unary mask, can be "matched" using simple AND gate logic (instead of XOR for binary address matching), with the advantage that it is both trivial to use these masks as L1 Cache byte read/write enable lines, and furthermore it is straightforward to detect misaligned LD/STs crossing cache line boundaries.
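The mask construction and AND-based matching can be sketched as follows (function names are illustrative):

```python
def byte_mask(addr, length):
    """Turn the LD/ST length (1/2/4/8 bytes) plus the bottom 4 bits
    of the address into a byte-level unary mask, lined up with the
    byte read/write enable wires of the underlying L1 Cache memory."""
    return ((1 << length) - 1) << (addr & 0xF)

def masks_overlap(m1, m2):
    """A simple AND replaces binary (XOR-style) address comparison."""
    return (m1 & m2) != 0
```

For example, an 8-byte access at offset 1 and another at offset 2 produce overlapping masks, while two 4-byte accesses at offsets 0 and 4 do not.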

Crossing over cache line boundaries is trivial in that the creation of the byte-map mask is permitted to be 24 bits in length (actually, only 23 needed). When the bottom 4 bits of the address are 0b1111 and the LD/ST is an 8-byte operation, 0b1111 1111 (representing the 64-bit LD/ST) will be shifted up by 15 bits. This can then be chopped into two segments:

  • First segment is 0b1000 0000 0000 0000 and indicates that the first byte of the LD/ST is to go into byte 15 of the cache line
  • Second segment is 0b0111 1111 and indicates that bytes 2 through 8 of the LD/ST must go into bytes 0 thru 6 of the second cache line at an address offset by 16 bytes from the first.
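The worked example above can be reproduced with a short sketch (the function name is an assumption; 16-byte cache lines are taken from the text):

```python
def split_mask(addr, length, line_bytes=16):
    """Build the (up to 23-bit) byte mask and chop it into the segments
    falling in the first and second cache lines respectively."""
    mask = ((1 << length) - 1) << (addr & (line_bytes - 1))
    first = mask & ((1 << line_bytes) - 1)  # bytes within the first line
    second = mask >> line_bytes             # bytes spilling into the next
    return first, second
```

For an 8-byte LD/ST with bottom address bits 0b1111, this yields 0b1000 0000 0000 0000 for the first line and 0b0111 1111 for the second, matching the two segments above.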

Thus we have actually split the LD/ST operation into two. The AddrSplit class takes care of synchronising the two, by issuing two separate sets of LD/ST requests, waiting for both of them to complete (or indicate an error), and (in the case of a LD) merging the two.

The big advantage of this approach is that at no time does the L1 Cache need to know anything about the offsets from which the LD/ST came. All it needs to know is: which bytes to read/write into which positions in the cache line(s).

Source: