# MV.X and MV.swizzle

Swizzle needs a MV. There are 2 of them: swizzle and swizzle2. See below for a potential way to use funct7 to mark rs2 as a swizzle argument.

| Encoding | 31:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
|---|---|---|---|---|---|---|
| RV32-I-type | imm[11:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11 |
| RV32-I-type | fn4[3:0], swizzle[7:0] | rs1[4:0] | 0b000 | rd[4:0] | OP-V | 0b11 |

- funct3 = MV: 0b000 for FP, 0b001 for INT
- OP-V = 0b1010111
- fn4 = 4 bit function.
- fn4 = 0b0000 - MV-SWIZZLE
- fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
- fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)
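The fn4 assignments listed above can be sketched as a small decoder. This is only an illustration: `decode_fn4` and `ELWIDTH_NAMES` are hypothetical names, and values of fn4 not listed above are treated as unassigned, which is an assumption rather than something the proposal states.

```python
# Hypothetical decoder for the 4-bit fn4 field (names are illustrative only).
ELWIDTH_NAMES = ["default", "8", "16", "32"]  # NN = 0b00, 0b01, 0b10, 0b11


def decode_fn4(fn4):
    """Return (operation, elwidth) for a 4-bit fn4 value, or None if unassigned."""
    if fn4 == 0b0000:
        return ("MV-SWIZZLE", None)
    elwidth = ELWIDTH_NAMES[(fn4 >> 2) & 0b11]  # top two bits are NN
    low = fn4 & 0b11                            # bottom two bits pick the mode
    if low == 0b01:
        return ("MV-X", elwidth)
    if low == 0b11:
        return ("MV-X.SUBVL", elwidth)
    return None  # assumption: everything else is left unassigned
```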

swizzle (only active on SV or P48/P64 when SUBVL!=0):

| 7:6 | 5:4 | 3:2 | 1:0 |
|---|---|---|---|
| w | z | y | x |
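As a sketch of how the 8-bit swizzle field might be interpreted, the following splits it into four 2-bit selectors. It assumes (this is not stated above) that each selector names the source element copied into the corresponding destination lane; `decode_swizzle` and `apply_swizzle` are hypothetical helper names.

```python
def decode_swizzle(imm8):
    """Split an 8-bit swizzle immediate into per-lane selectors [x, y, z, w]."""
    return [(imm8 >> (2 * lane)) & 0b11 for lane in range(4)]


def apply_swizzle(src, imm8):
    """Assumed semantics: destination lane i receives src[selector_i]."""
    return [src[sel] for sel in decode_swizzle(imm8)]


vec = ["x", "y", "z", "w"]
identity = apply_swizzle(vec, 0b11100100)      # w=3, z=2, y=1, x=0
reversed_vec = apply_swizzle(vec, 0b00011011)  # w=0, z=1, y=2, x=3
```

With these assumptions `0b11100100` is the identity swizzle and `0b00011011` reverses the four elements.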

MV.X has two modes: SUBVL mode applies the element offsets only within a SUBVL inner loop. This can be used for transposition.

    for i in range(VL):
        for j in range(SUBVL):
            regs[rd] = regs[rd+regs[rs+j]]

Normal mode will apply the element offsets incrementally:

    k = 0
    for i in range(VL):
        for j in range(SUBVL):
            regs[rd] = regs[rd+regs[rs+k]]
            k += 1

Pseudocode for element width part of MV.X:

    def mv_x(rd, rs1, funct4):
        elwidth = (funct4>>2) & 0x3
        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
        bytewidth = bitwidth / 8 # get bytes per el
        for i in range(VL):
            addr = (unsigned char *)&regs[rs1]
            offset = addr + bytewidth # get offset within regfile as SRAM
            # TODO, actually, needs to respect rd and rs1 element width,
            # here, as well. this pseudocode just illustrates that the
            # MV.X operation contains a way to compact the indices into
            # less space.
            regs[rd] = (unsigned char*)(regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized registers, such that rather than doing this:

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

Thus compacting four indices into the one register. x3 and x8's element
widths are *independent* of the MV.X elwidth, thus allowing both the source
and destination element widths of the *elements* being moved to be overridden,
whilst *at the same time* allowing the *indices* to be compacted as well.
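To illustrate the compaction, here is a minimal sketch that unpacks 8-bit indices from a single register value. It assumes the lowest byte holds the first index, which matches the example (x8=1, x9=3, x10=2, x11=0 becoming 0x00020301); `unpack_indices` and `gather` are hypothetical names, and the gather itself ignores the separate rd/rs1 element-width question raised in the TODO.

```python
def unpack_indices(reg_value, elwidth_bits, count):
    """Extract `count` indices of `elwidth_bits` each, lowest bits first."""
    mask = (1 << elwidth_bits) - 1
    return [(reg_value >> (elwidth_bits * i)) & mask for i in range(count)]


def gather(src, indices):
    """Element i of the result is src[indices[i]]."""
    return [src[i] for i in indices]


indices = unpack_indices(0x00020301, 8, 4)
moved = gather(["a", "b", "c", "d"], indices)
```

Here `indices` comes out as the same four values that the uncompacted version loads into x8 to x11.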

potential MV.X? register-version of MV-swizzle?

| Encoding | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
|---|---|---|---|---|---|---|---|
| RV32-R-type | funct7 | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11 |
| RV32-R-type | 0b0000000 | rs2[4:0] | rs1[4:0] | 0b001 | rd[4:0] | OP-V | 0b11 |

- funct3 = MV.X
- OP-V = 0b1010111
- funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
- funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
- funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
- funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?

question: do we need a swizzle MV.X as well?

# MV.X with 3 operands

    regs[rd] = regs[rs1 + regs[rs2]]

Similar to LD/ST with the same twin predication rules
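A minimal sketch of the three-operand form, modelling the register file as a flat Python list. The vectorised loop over VL and the choice to read all indices before writing any results are assumptions made for illustration; twin predication is deliberately left out. `mv_x3` is a hypothetical name.

```python
def mv_x3(regs, rd, rs1, rs2, vl):
    """regs[rd+i] = regs[rs1 + regs[rs2+i]] for i in 0..vl-1 (no predication)."""
    # Assumption: sources are read before any destination is written, so that
    # overlapping rd/rs1/rs2 ranges do not affect the outcome of this sketch.
    indices = [regs[rs2 + i] for i in range(vl)]
    values = [regs[rs1 + idx] for idx in indices]
    for i, value in enumerate(values):
        regs[rd + i] = value


regs = [0] * 32
regs[8:12] = [1, 3, 2, 0]         # indices held in x8..x11
regs[16:20] = [10, 11, 12, 13]    # data held in x16..x19
mv_x3(regs, rd=3, rs1=16, rs2=8, vl=4)
```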

# macro-op fusion

there is the potential for macro-op fusion of mv-swizzle with the following instruction and/or preceding instruction. <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

# VBLOCK context?

additional idea: a VBLOCK context that says that if a given register is used, it indicates that the register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.

# mm_shuffle_ps?

    __m128 _mm_shuffle_ps(__m128 lo, __m128 hi,
                          _MM_SHUFFLE(hi3, hi2, lo1, lo0))

Interleave inputs into low 2 floats and high 2 floats of output. Basically:

    out[0]=lo[lo0];
    out[1]=lo[lo1];
    out[2]=hi[hi2];
    out[3]=hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats.
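The behaviour described above can be modelled in a few lines, using lists of four values in place of `__m128`. `mm_shuffle` mirrors the `_MM_SHUFFLE` macro's bit packing; `shuffle_ps` is a hypothetical stand-in for the intrinsic, not a binding to it.

```python
def mm_shuffle(hi3, hi2, lo1, lo0):
    """Pack four 2-bit selectors the way _MM_SHUFFLE does."""
    return (hi3 << 6) | (hi2 << 4) | (lo1 << 2) | lo0


def shuffle_ps(lo, hi, imm8):
    """Low two outputs come from `lo`, high two outputs come from `hi`."""
    return [
        lo[imm8 & 3],
        lo[(imm8 >> 2) & 3],
        hi[(imm8 >> 4) & 3],
        hi[(imm8 >> 6) & 3],
    ]


a = [1.0, 2.0, 3.0, 4.0]
broadcast = shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2))  # copies a[2] everywhere
```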

# Transpose

Assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using): using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total:

input:

    | m00 m10 m20 m30 |
    | m01 m11 m21 m31 |
    | m02 m12 m22 m32 |
    | m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:

    | m00 m01 m20 m21 |
    | m10 m11 m30 m31 |
    | m02 m03 m22 m23 |
    | m12 m13 m32 m33 |

finish transpose

output:

    | m00 m01 m02 m03 |
    | m10 m11 m12 m13 |
    | m20 m21 m22 m23 |
    | m30 m31 m32 m33 |

    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);

    /* Assigning transposed values back into I[0-3] */
    I0 = _mm_unpacklo_epi64(T0, T1);
    I1 = _mm_unpackhi_epi64(T0, T1);
    I2 = _mm_unpacklo_epi64(T2, T3);
    I3 = _mm_unpackhi_epi64(T2, T3);
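The two-step scheme from the input/intermediate/output diagrams (as opposed to the unpack-based SSE sequence) can be checked with a small model. `swizzle2` here is a hypothetical (4+4) -> 4 helper taking explicit (source, index) pairs rather than an encoded selector, and it handles a single matrix rather than a vector of them.

```python
def swizzle2(a, b, picks):
    """(4+4) -> 4 swizzle: each pick is (source, index), source 0 = a, 1 = b."""
    return [(a, b)[src][idx] for src, idx in picks]


def transpose4x4(r0, r1, r2, r3):
    # Step 1: transpose the four corner 2x2 blocks (4 swizzles).
    even = [(0, 0), (1, 0), (0, 2), (1, 2)]
    odd = [(0, 1), (1, 1), (0, 3), (1, 3)]
    t0 = swizzle2(r0, r1, even)
    t1 = swizzle2(r0, r1, odd)
    t2 = swizzle2(r2, r3, even)
    t3 = swizzle2(r2, r3, odd)
    # Step 2: exchange the off-diagonal 2x2 blocks (4 more swizzles).
    low = [(0, 0), (0, 1), (1, 0), (1, 1)]
    high = [(0, 2), (0, 3), (1, 2), (1, 3)]
    return [
        swizzle2(t0, t2, low),
        swizzle2(t1, t3, low),
        swizzle2(t0, t2, high),
        swizzle2(t1, t3, high),
    ]


# rows[c] = [m0c, m1c, m2c, m3c], matching the "input" layout.
rows = [[f"m{r}{c}" for r in range(4)] for c in range(4)]
result = transpose4x4(*rows)
```

Eight two-input swizzles are used in total, four per step, which agrees with the instruction count given above.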

# Transforms for DCT

# Table to evaluate

swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector)

| | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|---|
| swizzle2 | rs3 | 00 | rs2 | rs1 | 000 | rd |
| fswizzle2 | rs3 | 01 | rs2 | rs1 | 000 | rd |
| swizzle | 0 | 10 | rs2 | rs1 | 000 | rd |
| fswizzle | 0 | 11 | rs2 | rs1 | 000 | rd |
| swizzlei | imm (spans 31:20) | | | rs1 | 001 | rd |
| fswizzlei | imm (spans 31:20) | | | rs1 | 010 | rd |

More:

swizzlei would still need the 12-bit format due to not having enough immediate bits. we can get away with only 3 i-type funct3s used for [f]swizzlei by having one funct3 for destsubvl 1 through 3 for int and fp versions and a separate one for destsubvl = 4 that's shared between int/fp:

| int/fp | DESTSUBVL | 31 | 30:29 | 28:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|---|---|
| int | 1 to 3 | 0 | DESTSUBVL | selector | rs | 000 | rd |
| fp | 1 to 3 | 1 | DESTSUBVL | selector | rs | 000 | rd |
| int | 4 | selector[11:0] (spans 31:20) | | | rs | 001 | rd |
| fp | 4 | selector[11:0] (spans 31:20) | | | rs | 010 | rd |
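A rough sketch of how the 12-bit immediate and funct3 could be derived from this table. It assumes 3-bit selector fields packed lowest-first (3 x 3 = 9 bits for DESTSUBVL 1 to 3, 4 x 3 = 12 bits for DESTSUBVL 4), as in the pseudocode at the end of this page; `encode_swizzlei` is a hypothetical name and only the imm/funct3 pair is produced, not a full instruction word.

```python
def encode_swizzlei(fields, is_fp):
    """Return (imm12, funct3) for a list of 1 to 4 three-bit selector fields."""
    destsubvl = len(fields)
    if not 1 <= destsubvl <= 4:
        raise ValueError("DESTSUBVL must be 1 to 4")
    if any(not 0 <= f <= 0b111 for f in fields):
        raise ValueError("each selector field is 3 bits")
    selector = 0
    for i, field in enumerate(fields):
        selector |= field << (3 * i)      # field i occupies bits 3i+2:3i
    if destsubvl == 4:
        # All 12 immediate bits are selector; int/fp is told apart by funct3.
        return selector, (0b010 if is_fp else 0b001)
    # Bit 11 = int/fp, bits 10:9 = DESTSUBVL, bits 8:0 = selector.
    imm12 = (int(is_fp) << 11) | (destsubvl << 9) | selector
    return imm12, 0b000
```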

the rest could be encoded as follows:

| | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|---|
| swizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 100 | rd |
| swizzle | rs1 | DESTSUBVL | rs2 | rs1 | 100 | rd |
| fswizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 101 | rd |
| fswizzle | rs1 | DESTSUBVL | rs2 | rs1 | 101 | rd |

note how for [f]swizzle, rs3 == rs1

so it uses 5 funct3 values overall, which is appropriate, since swizzle is probably right after muladd in usage in graphics shaders.

Alternative immed encoding

| int/fp | 31:28 | 27:20 | 19:15 | 14:12 | 11:7 |
|---|---|---|---|---|---|
| int | DESTMASK | selector | rs | 000 | rd |
| fp | DESTMASK | selector | rs | 001 | rd |
| int | DESTMASK | constsel | rs | 010 | rd |
| fp | DESTMASK | constsel | rs | 011 | rd |

Allows setting of arbitrary dest (xz, yw) without needing register-versions. Saves on instruction count. Needs 4 funct3 to express.
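One possible reading of the DESTMASK form, as a sketch only: bit i of DESTMASK enables writing destination lane i, the 8-bit selector holds four 2-bit source indices (x in bits 1:0), and unmasked lanes keep their old value. All three points are assumptions, and `masked_swizzle` is a hypothetical name.

```python
def masked_swizzle(dest, src, destmask, selector):
    """Write src[selector_i] into lane i of dest only where destmask bit i is set."""
    out = list(dest)
    for lane in range(4):
        if (destmask >> lane) & 1:
            out[lane] = src[(selector >> (2 * lane)) & 0b11]
    return out


old = [0, 0, 0, 0]
src = [10, 11, 12, 13]
xz_only = masked_swizzle(old, src, destmask=0b0101, selector=0b11100100)
```

With these assumptions an "xz" destination is a single instruction: lanes y and w are simply left untouched.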

# Matrix 4x4 Vector mul

    pfscale,3 F2, F1, F10
    pfscaleadd,2 F2, F1, F11, F2
    pfscaleadd,1 F2, F1, F12, F2
    pfscaleadd,0 F2, F1, F13, F2

pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.

In effect what this is doing is:

    fmul f2, f1.xxxx, f10
    fmac f2, f1.yyyy, f11, f2
    fmac f2, f1.zzzz, f12, f2
    fmac f2, f1.wwww, f13, f2

Where all of f2, f1, and f10-13 are vec4, and f1.x-w are copied (fixed index) where the other vec4 indices progress.
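The fmul/fmac sequence can be modelled directly. This sketch assumes f10 to f13 hold the four columns of the matrix and f1 holds the input vector; `splat`, `fmul`, `fmac` and `mat4_mul_vec4` are hypothetical helper names operating on plain lists.

```python
def splat(vec, lane):
    """Model of f1.xxxx / .yyyy / .zzzz / .wwww: copy one lane into all four."""
    return [vec[lane]] * 4


def fmul(a, b):
    return [x * y for x, y in zip(a, b)]


def fmac(a, b, acc):
    return [x * y + z for x, y, z in zip(a, b, acc)]


def mat4_mul_vec4(cols, v):
    """cols[0..3] play the role of f10..f13, v plays the role of f1."""
    acc = fmul(splat(v, 0), cols[0])
    for lane in range(1, 4):
        acc = fmac(splat(v, lane), cols[lane], acc)
    return acc


cols = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 3.0, 0.0],
        [1.0, 1.0, 1.0, 1.0]]
product = mat4_mul_vec4(cols, [1.0, 2.0, 3.0, 4.0])
```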

# Pseudocode

Swizzle:

    pub trait SwizzleConstants: Copy + 'static {
        const CONSTANTS: &'static [Self; 4];
    }

    impl SwizzleConstants for u8 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
    }

    impl SwizzleConstants for u16 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
    }

    impl SwizzleConstants for f32 {
        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
    }

    // impl for other types too...

    pub fn swizzle<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        Elm: SwizzleConstants,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 clear: take from the source vector; set: take a constant
                let src: &[Elm] = if (sel_field & 0b100) == 0 {
                    &rs1[(vindex * srcsubvl)..]
                } else {
                    &Elm::CONSTANTS[..]
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[sel_field as usize];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }
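As a worked example of the 3-bit selector fields, here is a simplified Python model of one sub-vector of the swizzle above, using the f32 constant table. It omits the illegal-instruction checks and the outer VL loop; `swizzle_group` is a hypothetical name.

```python
F32_CONSTANTS = [0.0, 1.0, -1.0, 0.5]  # same table as the f32 impl above


def swizzle_group(src, selector, destsubvl, constants=F32_CONSTANTS):
    """Apply one selector to one source sub-vector (no trap checks)."""
    out = []
    for i in range(destsubvl):
        field = (selector >> (3 * i)) & 0b111
        # Bit 2 clear selects a source element, bit 2 set selects a constant.
        table = constants if field & 0b100 else src
        out.append(table[field & 0b11])
    return out


# Fields, lowest first: x, y, constant 0.0, constant 1.0 (roughly ".xy01").
selector = 0b000 | (0b001 << 3) | (0b100 << 6) | (0b101 << 9)
result = swizzle_group([7.0, 8.0, 9.0, 10.0], selector, destsubvl=4)
```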

Swizzle2:

    fn swizzle2<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        rs3: &[Elm],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        // Elm is a copyable type
        Elm: Copy,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 picks which of the two source vectors to read from
                let src = if (sel_field & 0b100) != 0 {
                    rs1
                } else {
                    rs3
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[vindex * srcsubvl + (sel_field as usize)];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }