It's probably good to have this documented somewhere:
Before I walk through it, I should mention that the method alexandros mentions is very common and efficient: store the waveform in a table, and read from the table with an interpolation scheme. (Lagrange interpolation, used in tabread4, is a bit less expensive than the other scheme commonly used in audio, the Hermite spline.) The way to accomplish this is to keep a float or a double as an accumulator, increment it every sample by an amount based on the frequency of the oscillator, and wrap it so it stays within the table it is reading from. You use the fractional part of this index to interpolate. Check out katja's pd-double file for an example:
https://github.com/pd-projects/pd-double/blob/master/src/d_osc.c
osc~ in normal pd uses the same thing, but with some bit trickery and only linear interpolation.
I think the reason for the Hoeldrich method is to avoid branch prediction/pipelining stalls when the index wraps, but really I have no clue...
I spent a long time (multiple days) figuring out d_osc.c about half a year ago:
The thing to do in order to understand the guts is to constantly think of the binary representation of the numbers used. This page may be helpful in this endeavor:
http://www.binaryconvert.com/convert_double.html
The first thing to understand is the properties of UNITBIT32. The file defines it as 1572864.0, i.e. 3 * 2^19, chosen so that "bit 32 has place value 1". In binary, 3 is 1.1 × 2^1, because the point shifts one place to the right with each multiplication by 2 (and remember that binary 1.1 is 1 and 1/2). This number is therefore 1.1 × 2^20.
In double-precision floating point there are 52 significand bits, 11 exponent bits, and 1 sign bit. Binary numbers in exponential notation are always of the form 1.(some series of 0s and 1s) × 2^exponent. Because the 1 to the left of the binary point is always there, it is not stored in the floating-point representation; it is assumed to exist. The only case where it does not is when the number is 0 (or another special case like a denormal), which floating point handles specially.
With floating-point numbers, the exponent normally changes whenever the magnitude calls for another power of 2. For instance, if you take the number 3 and multiply it by, say, 4, the significand does not change, only the exponent.
What UNITBIT32 does is pin down the precision when numbers are added or subtracted: there are always exactly 32 bits representing the fractional part of the number. Why? Because the significand is 52 bits, and the number 1.1 × 2^20 takes exactly 20 of those bits for the non-fractional part, leaving 32 bits for the fraction.
tabfudge is a union of a double and an array of two 32-bit unsigned ints; its purpose is to set specific bits of a double using ints.
d_osc.c exploits this in the code for phasor~ by setting the upper bits of the number (tf.tf_i[HIOFFSET]) to the constant normhipart every sample; normhipart is just the 32 bits on the "left" of UNITBIT32. So every sample those bits are reset to a known value. Then the resulting double-precision number is incremented or decremented by the appropriate amount according to the frequency, and afterwards UNITBIT32 is subtracted back out. The effect of this is to always keep only the fractional part, the 32 bits to the right of the binary point (because everything else is wiped clean every sample).
As for why 1.1 × 2^20 was used instead of 1 × 2^20: starting from exactly 2^20, subtracting any fraction would drop the value below 2^20 and change the exponent, shifting all the bits out of alignment. From 1.1 × 2^20 you can add or subtract up to 2^19 without crossing a power-of-two boundary, so the exponent (and with it the bit layout) never changes.
So now that the properties of UNITBIT32 and the union have been established, on to the explanation of osc_perform. I'm going to go through the non-unrolled version for simplicity (it's just like the unrolled version but in a different order, for what I assume are optimization reasons):
static t_int *osc_perform(t_int *w)
{
t_osc *x = (t_osc *)(w[1]);
t_float *in = (t_float *)(w[2]);
t_float *out = (t_float *)(w[3]);
int n = (int)(w[4]);
float *tab = cos_table, *addr, f1, f2, frac;
double dphase = x->x_phase + UNITBIT32;
int normhipart;
union tabfudge tf;
float conv = x->x_conv;
tf.tf_d = UNITBIT32;
normhipart = tf.tf_i[HIOFFSET];
Most of this is explained above or self-explanatory. HIOFFSET accounts for the difference in endianness between machines; tf.tf_i[HIOFFSET] is the high 32 bits. dphase is set to the current phase plus UNITBIT32, and then tf.tf_d, the "manipulated" double number, is set to UNITBIT32. The last two lines just set normhipart.
#if 0
while (n--)
{
tf.tf_d = dphase;
dphase += *in++ * conv;
dphase is advanced by the appropriate number of table samples for the input frequency (conv converts from Hz to table samples per audio sample). Note that dphase holds UNITBIT32 plus x_phase, which pins the precision so that the fractional part sits in the low 32 bits.
addr = tab + (tf.tf_i[HIOFFSET] & (COSTABSIZE-1));
The integer part of the index is taken modulo COSTABSIZE by masking, and the table address is obtained. This works because log2(COSTABSIZE) is less than 20, the number of integer bits available in the high 32 bits of the manipulated double.
tf.tf_i[HIOFFSET] = normhipart;
The integer part is erased by setting the high word back to the high bits of UNITBIT32.
frac = tf.tf_d - UNITBIT32;
f1 = addr[0];
f2 = addr[1];
*out++ = f1 + frac * (f2 - f1);
}
#endif
The fractional part is obtained by subtracting UNITBIT32, and the table read is linearly interpolated with it. The next section gets back to the stored phase x_phase, which is the table location to save.
tf.tf_d = UNITBIT32 * COSTABSIZE;
normhipart = tf.tf_i[HIOFFSET];
Because COSTABSIZE is a power of 2, this is just like the other normhipart but with a different exponent value. (The low 32 bits are no longer the fractional part of the corresponding double; the fraction has moved over to the right by log2(COSTABSIZE) places.) Now everything that won't fit within the range [0, COSTABSIZE) ends up in the HIOFFSET part of this number.
tf.tf_d = dphase + (UNITBIT32 * COSTABSIZE - UNITBIT32);
Remember that when dphase was created it was assigned x_phase plus UNITBIT32, so with that included this line simplifies to x_phase + UNITBIT32 * COSTABSIZE, where x_phase has been incremented every sample. That's my best guess of why the "- UNITBIT32" part is there: it cancels the UNITBIT32 that was added at the start.
tf.tf_i[HIOFFSET] = normhipart;
x->x_phase = tf.tf_d - UNITBIT32 * COSTABSIZE;
return (w+5);
}
Everything that didn't fit within COSTABSIZE was put into the HIOFFSET word, and now that word is set to normhipart in exactly the same way the non-fractional part of the read point was erased. Then UNITBIT32 * COSTABSIZE is subtracted to recover the saved read point, the same way the fractional part was extracted earlier.
I still don't understand a lot in this, like what the loop unrolling buys...