20200110, 02:52  #12 
Tribal Bullet
Oct 2004
3·1,181 Posts 
Is Abramowitz and Stegun available anywhere online? This sounds like something they would have many possibilities available for.
In addition to the rational approximations, if you want simple code with specified precision maybe the CORDIC algorithm can be used. Minimax series can provide uniformly good approximations with minimal work. With modern GPUs small readonly lookup tables should not be performance bottlenecks. Last fiddled with by jasonp on 20200110 at 03:50 
20200110, 03:10  #13 
Jun 2003
3·17·101 Posts 

20200110, 04:15  #14 
P90 years forever!
Aug 2002
Yeehaw, FL
16726_{8} Posts 
Here is my code. Any improvements are welcome!
Code:
// This version of slowTrig assumes k is positive and k/n <= 0.5 which means we want cos and sin values in the range [0, pi/2] // We found free Sun Microsystems code that is short and efficient in the range [pi/4, pi/4]. /* ==================================================== * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved. * * Developed at SunSoft, a Sun Microsystems, Inc. business. * Permission to use, copy, modify, and distribute this * software is freely granted, provided that this notice * is preserved. * ==================================================== */ /* __kernel_sin(x) * kernel sin function on [pi/4, pi/4], pi/4 ~ 0.7854 * Input x is assumed to be bounded by ~pi/4 in magnitude. * * Algorithm * 1. Since sin(x) = sin(x), we need only to consider positive x. * 2. sin(x) is approximated by a polynomial of degree 13 on [0,pi/4] * 3 13 * sin(x) ~ x + S1*x + ... + S6*x * where * * sin(x) 2 4 6 8 10 12  58 *   (1+S1*x +S2*x +S3*x +S4*x +S5*x +S6*x ) <= 2 *  x  * */ double __kernel_sin(double x) { const double S1 = 1.66666666666666324348e01, /* 0xBFC55555, 0x55555549 */ S2 = 8.33333333332248946124e03, /* 0x3F811111, 0x1110F8A6 */ S3 = 1.98412698298579493134e04, /* 0xBF2A01A0, 0x19C161D5 */ S4 = 2.75573137070700676789e06, /* 0x3EC71DE3, 0x57B1FE7D */ S5 = 2.50507602534068634195e08, /* 0xBE5AE5E6, 0x8A2B9CEB */ S6 = 1.58969099521155010221e10; /* 0x3DE5D93A, 0x5ACFD57C */ double z,r,v; z = x*x; v = z*x; r = S2+z*(S3+z*(S4+z*(S5+z*S6))); return x+v*(S1+z*r); } /* * __kernel_cos( x ) * kernel cos function on [pi/4, pi/4], pi/4 ~ 0.785398164 * Input x is assumed to be bounded by ~pi/4 in magnitude. * * Algorithm * 1. Since cos(x) = cos(x), we need only to consider positive x. * 2. cos(x) is approximated by a polynomial of degree 14 on [0,pi/4] * 4 14 * cos(x) ~ 1  x*x/2 + C1*x + ... + C6*x * where the remez error is * *  2 4 6 8 10 12 14  58 * cos(x)(1.5*x +C1*x +C2*x +C3*x +C4*x +C5*x +C6*x ) <= 2 *   * * 4 6 8 10 12 14 * 4. let r = C1*x +C2*x +C3*x +C4*x +C5*x +C6*x , then * cos(x) = 1  x*x/2 + r * since cos(x+y) ~ cos(x)  sin(x)*y * ~ cos(x)  x*y, * a correction term is necessary in cos(x) and hence * cos(x+y) = 1  (x*x/2  (r  x*y)) * For better accuracy when x > 0.3, let qx = x/4 with * the last 32 bits mask off, and if x > 0.78125, let qx = 0.28125. * Then * cos(x+y) = (1qx)  ((x*x/2qx)  (rx*y)). * Note that 1qx and (x*x/2qx) is EXACT here, and the * magnitude of the latter is at least a quarter of x*x/2, * thus, reducing the rounding error in the subtraction. */ double __kernel_cos(double x) { const double C1 = 4.16666666666666019037e02, /* 0x3FA55555, 0x5555554C */ C2 = 1.38888888888741095749e03, /* 0xBF56C16C, 0x16C15177 */ C3 = 2.48015872894767294178e05, /* 0x3EFA01A0, 0x19CB1590 */ C4 = 2.75573143513906633035e07, /* 0xBE927E4F, 0x809C52AD */ C5 = 2.08757232129817482790e09, /* 0x3E21EE9E, 0xBDB4B1C4 */ C6 = 1.13596475577881948265e11; /* 0xBDA8FAE9, 0xBE8838D4 */ double z,r; z = x*x; r = z*(C1+z*(C2+z*(C3+z*(C4+z*(C5+z*C6))))); #ifndef MORE_ACCURATE return 1.0  (0.5*z  (z*r)); #else double a,hz,qx; int ix; ix = __HI(x)&0x7fffffff; /* ix = x's high word*/ if (ix < 0x3FD33333) /* if x < 0.3 */ return 1.0  (0.5*z  (z*r)); else { if(ix > 0x3fe90000) { /* x > 0.78125 */ qx = 0.28125; } else { __HI(qx) = ix0x00200000; /* x/4 */ __LO(qx) = 0; } hz = 0.5*zqx; a = 1.0qx; return a  (hz  (z*r)); } #endif } // We use the following trig identities to convert our [0, pi/2] range to [pi/4, pi/4] range: // cos(A + B) = cos A cos B  sin A sin B // sin(A + B) = sin A cos B + cos A sin B // We want to compute sin(pi*k/n) and cos(pi*k/n). Let x = pi*k/n  pi/4 // cos(pi*k/n) = cos(x + pi/4) = cos(x) * SQRTHALF  sin(x) * SQRTHALF // sin(pi*k/n) = sin(x + pi/4) = sin(x) * SQRTHALF + cos(x) * SQRTHALF double2 slowTrig(i32 k, i32 n) { double angle = M_PI / n * k  M_PI / 4; double c = __kernel_cos(angle); double s = __kernel_sin(angle); #if DEBUG if (k * 2 > n) printf ("slowTrig fail: k=%d, n=%d\n", k, n); #endif return M_SQRT1_2 * U2(c  s, (c + s)); } 
20200110, 04:27  #15 
Jun 2003
3·17·101 Posts 
Code:
r = z*(C1+z*(C2+z*(C3+z*(C4+z*(C5+z*C6))))); 
20200110, 09:59  #16  
Dec 2012
The Netherlands
1751_{10} Posts 
Quote:
https://pdfs.semanticscholar.org/486...2462c2bb6f.pdf 

20200110, 22:44  #17 
∂^{2}ω=0
Sep 2002
República de California
11660_{10} Posts 
George, presumably you need this functionality for onthefly computation of FFT twiddles, i.e. for complex roots of unity of form exp(I*2*Pi/n), where n is some highly composite integer, likely of form n = [small odd]*2^k. did you actually GPUbenchmark the use of a standard 2smallprecomputedtables approach here, each table consisting of O(sqrt(n)) precomputed roots of unity? I'm wondering precisely what kind of tablelookups are behind your "tables bad" statement, and ISTR that your x86 FFT code uses O(n)length tables, not O(sqrt(n))sized ones.

20200110, 23:25  #18  
P90 years forever!
Aug 2002
Yeehaw, FL
2·3·19·67 Posts 
Quote:
Take the awesome Radeon VII. Using the memory bandwidth from the spec sheets, reading a sin/cos twiddle from memory is 53 clocks latency if memory is overclocked, 78 if not overclocked. This latency can be hidden if the GPU has some compute it can do. Looking at the Taylor series code I posted, we're looking at about 7 muls and 7 adds for both cos and sin. In theory an ADD and MUL can be dispatched each cycle. Latency is 8 or 16 clocks, which again can be hidden if there is other compute to do. In practice, it is easier to hide sin/cos calculation latency than memory read latency. In fact, while waiting for FFT data to be read in the GPU can be independently calculating twiddle values. BTW, the code I posted above led to a 20us improvement over the compiler's sin/cos implementation. A 5M FFT is now at 0.7 ms per iteration, or about 18 hours for a primality test. 

20200111, 08:05  #19 
Jun 2003
3×17×101 Posts 
Using Knuth to optimize the mul count:
Code:
double __kernel_sin(double x) { const double A0 = 1.30608015549532788940e+02, A1 = 1.23312777810503624981e+04, A2 = 5.13167274033946697914e+01, A3 = 8.61464096450666776475e+06, A4 = 1.55798886190603733380e+07, A5 = 6.07511251529788036330e+13, A6 = 1.58969099521155010221e10; double x2,r,z,w; x2 = x*x; z = (x2+A0)*x2+A1; w = (x2+A2)*z +A3; r = ((w+z+A4)*w+A5)*A6; return x*r; } double __kernel_cos(double x) { const double A0 = 2.86788466915238314072e+02, A1 = 6.38040759672674956541e+04, A2 = 1.94403031619891871554e+02, A3 = 2.01131786823940935180e+07, A4 = 1.46249769755550521701e+07, A5 = 5.38508025161968705769e+13, A6 = 1.13596475577881948265e11; double x2,r,z,w; x2 = x*x; z = (x2+A0)*x2+A1; w = (x2+A2)*z +A3; r = ((w+z+A4)*w+A5)*A6; return 1.0+x2*r; } 
20200111, 19:36  #20 
∂^{2}ω=0
Sep 2002
República de California
2^{2}×5×11×53 Posts 
That is more suitable for nonFMA architectures ... for FMAsupporting ones, George's code needs 8 FMA (2 of which are pureMUL), compared to 10 separate operations (mix of ADD, MUL, and FMA) for the code you posted. (Assume one has structured things so as to do as many such computations in parallel so as to hide the latencies of the various instructions.)

20200111, 20:20  #21 
P90 years forever!
Aug 2002
Yeehaw, FL
2·3·19·67 Posts 

20200112, 04:20  #22  
Jun 2003
3×17×101 Posts 
Quote:
Quote:
How does one go about doing error analysis? In theory, the code is computing the same polynomial as the previous one, with just fewer multiplications. In practice, I don't know how much additional error (if any) that the new code introduces. I suppose, one could empirically compare all the interesting sin/cos values of all three (native / poly / Knuth). Also, is it worth combining both sin & cos into a single routine (for either version)? Would that give more options for compiler / scheduler to do a better job? 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Karatsuba algorithm relation to Matrix Multiplication (Strassen Algorithm)?  neomacdev  Computer Science & Computational Number Theory  1  20190730 01:40 
A Study of Primes Through NonResonant Sine Waves  EdH  Math  56  20100411 19:47 
Signs of Remainders of Cosine and Sine Series  jinydu  Homework Help  2  20080429 01:22 
Calculating the number of intersections between sine and cosine curves  ShiningArcanine  Miscellaneous Math  2  20060806 21:55 
Hard proplem for finding sine function  tinhnho  Miscellaneous Math  6  20050117 05:42 