Zftrans - transcendental operations

See:

Extension subsets:

  • Zftrans: standard transcendentals (best suited to 3D)
  • ZftransExt: extra functions (useful, not generally needed for 3D, can be synthesised using Ztrans)
  • Ztrigpi: trig. xxx-pi sinpi cospi tanpi
  • Ztrignpi: trig non-xxx-pi sin cos tan
  • Zarctrigpi: arc-trig. a-xxx-pi: atan2pi asinpi acospi
  • Zarctrignpi: arc-trig. non-a-xxx-pi: atan2, asin, acos
  • Zfhyp: hyperbolic/inverse-hyperbolic. sinh, cosh, tanh, asinh, acosh, atanh (can be synthesised - see below)
  • ZftransAdv: much more complex to implement in hardware
  • Zfrsqrt: Reciprocal square-root.

Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi, Zarctrignpi

TODO:

Proposed Opcodes vs Khronos OpenCL Opcodes

This list shows the (direct) equivalence between proposed opcodes and their Khronos OpenCL equivalents.

See https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

Special FP16 opcodes are not being proposed, except by indirect / inherent use of the "fmt" field that is already present in the RISC-V Specification.

"Native" opcodes are not being proposed: implementors will be expected to use the (equivalent) proposed opcode covering the same function.

"Fast" opcodes are not being proposed, because the Khronos Specification fast_length, fast_normalise and fast_distance OpenCL opcodes require vectors (or can be done as scalar operations using other RISC-V instructions).

The OpenCL FP32 opcodes are direct equivalents to the proposed opcodes. Deviation from conformance with the Khronos Specification - including the Khronos Specification accuracy requirements - is not an option.

Proposed opcode OpenCL FP32 OpenCL FP16 OpenCL native OpenCL fast
FSIN sin half_sin native_sin NONE
FCOS cos half_cos native_cos NONE
FTAN tan half_tan native_tan NONE
NONE (1) sincos NONE NONE NONE
FASIN asin NONE NONE NONE
FACOS acos NONE NONE NONE
FATAN atan NONE NONE NONE
FSINPI sinpi NONE NONE NONE
FCOSPI cospi NONE NONE NONE
FTANPI tanpi NONE NONE NONE
FASINPI asinpi NONE NONE NONE
FACOSPI acospi NONE NONE NONE
FATANPI atanpi NONE NONE NONE
FSINH sinh NONE NONE NONE
FCOSH cosh NONE NONE NONE
FTANH tanh NONE NONE NONE
FASINH asinh NONE NONE NONE
FACOSH acosh NONE NONE NONE
FATANH atanh NONE NONE NONE
FRSQRT rsqrt half_rsqrt native_rsqrt NONE
FCBRT cbrt NONE NONE NONE
FEXP2 exp2 half_exp2 native_exp2 NONE
FLOG2 log2 half_log2 native_log2 NONE
FEXPM1 expm1 NONE NONE NONE
FLOG1P log1p NONE NONE NONE
FEXP exp half_exp native_exp NONE
FLOG log half_log native_log NONE
FEXP10 exp10 half_exp10 native_exp10 NONE
FLOG10 log10 half_log10 native_log10 NONE
FATAN2 atan2 NONE NONE NONE
FATAN2PI atan2pi NONE NONE NONE
FPOW pow NONE NONE NONE
FROOT rootn NONE NONE NONE
FHYPOT hypot NONE NONE NONE
FRECIP NONE half_recip native_recip NONE

Note (1) FSINCOS is macro-op fused (see below).

List of 2-arg opcodes

opcode Description pseudo-code Extension
FATAN2 atan2 arc tangent rd = atan2(rs2, rs1) Zarctrignpi
FATAN2PI atan2 arc tangent / pi rd = atan2(rs2, rs1) / pi Zarctrigpi
FPOW x power of y rd = pow(rs1, rs2) ZftransAdv
FROOT x power 1/y rd = pow(rs1, 1/rs2) ZftransAdv
FHYPOT hypotenuse rd = sqrt(rs12 + rs22) Zftrans

List of 1-arg transcendental opcodes

opcode Description pseudo-code Extension
FRSQRT Reciprocal Square-root rd = sqrt(rs1) Zfrsqrt
FCBRT Cube Root rd = pow(rs1, 3) Zftrans
FRECIP Reciprocal rd = 1.0 / rs1 Zftrans
FEXP2 power-of-2 rd = pow(2, rs1) Zftrans
FLOG2 log2 rd = log2(rs1) Zftrans
FEXPM1 exponential minus 1 rd = pow(e, rs1) - 1.0 Zftrans
FLOG1P log plus 1 rd = log(e, 1 + rs1) Zftrans
FEXP exponential rd = pow(e, rs1) ZftransExt
FLOG natural log (base e) rd = log(e, rs1) ZftransExt
FEXP10 power-of-10 rd = pow(10, rs1) ZftransExt
FLOG10 log base 10 rd = log10(rs1) ZftransExt

List of 1-arg trigonometric opcodes

opcode Description pseudo-code Extension
FSIN sin (radians) rd = sin(rs1) Ztrignpi
FCOS cos (radians) rd = cos(rs1) Ztrignpi
FTAN tan (radians) rd = tan(rs1) Ztrignpi
FASIN arcsin (radians) rd = asin(rs1) Zarctrignpi
FACOS arccos (radians) rd = acos(rs1) Zarctrignpi
FATAN arctan (radians) rd = atan(rs1) Zarctrignpi
FSINPI sin times pi rd = sin(pi * rs1) Ztrigpi
FCOSPI cos times pi rd = cos(pi * rs1) Ztrigpi
FTANPI tan times pi rd = tan(pi * rs1) Ztrigpi
FASINPI arcsin / pi rd = asin(rs1) / pi Zarctrigpi
FACOSPI arccos / pi rd = acos(rs1) / pi Zarctrigpi
FATANPI arctan / pi rd = atan(rs1) / pi Zarctrigpi
FSINH hyperbolic sin (radians) rd = sinh(rs1) Zfhyp
FCOSH hyperbolic cos (radians) rd = cosh(rs1) Zfhyp
FTANH hyperbolic tan (radians) rd = tanh(rs1) Zfhyp
FASINH inverse hyperbolic sin rd = asinh(rs1) Zfhyp
FACOSH inverse hyperbolic cos rd = acosh(rs1) Zfhyp
FATANH inverse hyperbolic tan rd = atanh(rs1) Zfhyp

Synthesis, Pseudo-code ops and macro-ops

The pseudo-ops are best left up to the compiler rather than being actual pseudo-ops, by allocating one scalar FP register for use as a constant (loop invariant) set to "1.0" at the beginning of a function or other suitable code block.

  • FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
  • FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).

FATANPI example pseudo-code:

lui t0, 0x3F800 // upper bits of f32 1.0
fmv.x.s ft0, t0
fatan2pi.s rd, rs1, ft0

Hyperbolic function example (obviates need for Zfhyp except for high-performance or correctly-rounding):

ASINH( x ) = ln( x + SQRT(x**2+1))

Reciprocal

Used to be an alias. Some imolementors may wish to implement divide as y times recip(x)

To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?

RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD". Research needed to ensure that implementors are not compromised by such a decision http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html

correctly-rounded LOG will return different results than LOGP1 and ADD. Likewise for EXP and EXPM1

ok, they stay in as real opcodes, then.

ATAN / ATAN2 commentary

Discussion starts here: http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html

from Mitch Alsup:

would like to point out that the general implementations of ATAN2 do a bunch of special case checks and then simply call ATAN.

double ATAN2( double y, double x )
{   // IEEE 754-2008 quality ATAN2

    // deal with NANs
    if( ISNAN( x )             ) return x;
    if( ISNAN( y )             ) return y;

    // deal with infinities
    if( x == +∞    && |y|== +∞  ) return copysign(  π/4, y );
    if( x == +∞                 ) return copysign(  0.0, y );
    if( x == -∞    && |y|== +∞  ) return copysign( 3π/4, y );
    if( x == -∞                 ) return copysign(    π, y );
    if(               |y|== +∞  ) return copysign(  π/2, y );

    // deal with signed zeros
    if( x == 0.0  &&  y != 0.0 ) return copysign(  π/2, y );
    if( x >=+0.0  &&  y == 0.0 ) return copysign(  0.0, y );
    if( x <=-0.0  &&  y == 0.0 ) return copysign(    π, y );

    // calculate ATAN2 textbook style
    if( x  > 0.0               ) return     ATAN( |y / x| );
    if( x  < 0.0               ) return π - ATAN( |y / x| );
}

Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent a constant and then call/use ATAN2.

When one considers an implementation of ATAN, one must consider several ranges of evaluation::

 x  [  -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
 x  (-1.0, +1.0]:: ATAN( x ) =      + ATAN(   x );
 x  [ 1.0,   +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );

I should point out that the add/sub of π/2 can not lose significance since the result of ATAN(1/x) is bounded 0..π/2

The bottom line is that I think you are choosing to make too many of these into OpCodes, making the hardware function/calculation unit (and sequencer) more complicated that necessary.


I might suggest that if there were a way for a calculation to be performed and the result of that calculation

chained to a subsequent calculation such that the precision of the result-becomes-operand is wider than

what will fit in a register, then you can dramatically reduce the count of instructions in this category while retaining

acceptable accuracy:

 z = x / y

can be calculated as::

 z = x * (1/y)

Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008 accurate, but GPUs want speed and

1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It is also not "that inaccurate" displaying 0.625-to-0.52 ULP.

Given that one has the ability to carry (and process) more fraction bits, one can then do high precision multiplies of π or other transcendental radixes.

And GPUs have been doing this almost since the dawn of 3D.

// calculate ATAN2 high performance style
// Note: at this point x != y
//
if( x  > 0.0             )
{
    if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
    if( y < 0.0 && |y| > |x| ) return       + ATAN( y / x );
    if( y > 0.0 && |y| < |x| ) return       + ATAN( y / x );
    if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
}
if( x  < 0.0             )
{
    if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
    if( y < 0.0 && |y| > |x| ) return + π   - ATAN( y / x );
    if( y > 0.0 && |y| < |x| ) return + π   - ATAN( y / x );
    if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
}

This way the adds and subtracts from the constant are not in a precision precarious position.