c - How to improve performance of following loop -
i have simple loop in c
convert magnitude
, angle
real
, imaginary
parts. have 2 versions of loop as. version 1
simple loop perform conversion using following code
for(k = 0; k < n; k++){ xreal[k] = mag[k] * cos(angle[k]); ximag[k] = mag[k] * sin(angle[k]); }
an version 2
intrinsics
used vectorize loop.
__m256d cosvec, sinvec; __m256d resultreal, resultimag; __m256d angvec, voltvec; for(k = 0; k < sysdata->totnumofbus; k+=4){ voltvec = _mm256_loadu_pd(volt + k); angvec = _mm256_loadu_pd(theta + k); sinvec = _mm256_sincos_pd(&cosvec, angvec); resultimag = _mm256_mul_pd(voltvec, sinvec); resultreal = _mm256_mul_pd(voltvec, cosvec); _mm256_store_pd(xreal+k, resultreal); _mm256_store_pd(ximag+k, resultimag); }
on core i7 2600k @3.4ghz
processor, these loops give following results:
version 1: n = 18562320, time: 0.2sec version 2: n = 18562320, time: 0.16sec
a simple calculations these values show in version 1
, each iteration takes 36
cycles completed whereas takes 117
cycles version 2
completed. considering fact calculation of sine
, cosine
functions naturally expensive, these number seems not terrible. however, loop serious bottleneck of function profiling shows 1/3
of time spent inside loop. so, wondering if there way expedite loop (e.g. calculating sine
, cosine
functions differently). appreciated if me work around problem , let me know whether there room improve performance of loop.
thanks in advance help
ps: using icc
compile code. also, should mention data not aligned (and cannot be). however, aligning data leads minor performance improvement (less 1 percent).
i recommend make tayler series based sin/cos function , _mm256_stream_pd() store data. here base sample code.
__m256d sin_req[10]; __m256d cos_req[10]; __m256d one_pd = _mm256_set1_pd(1.0); for(int i=0; i<10; ++i) { sin_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/factorial((i+1)*2+1) ) : _mm256_set1_pd(+1.0/factorial((i+1)*2+1) ); cos_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/factorial((i+1)*2+0) ) : _mm256_set1_pd(+1.0/factorial((i+1)*2+0) ); } for(int i=0; i<count; i+=4) { __m256d voltvec = _mm256_load_pd(volt + i); __m256d angvec = _mm256_load_pd(theta + i); // sin/cos taylor series __m256d anglesq = angvec * angvec; __m256d sinvec = angvec; __m256d cosvec = one_pd; __m256d sin_serise = sinvec; __m256d cos_serise = one_pd; for(int j=0; j<10; ++j) { sin_serise = sin_serise * anglesq; // [1] cos_serise = cos_serise * anglesq; sinvec = sinvec + sin_serise * sin_req[j]; cosvec = cosvec + cos_serise * cos_req[j]; } __m256d resultreal = voltvec * sinvec; __m256d resultimag = voltvec * cosvec; _mm256_store_pd(xreal + i, resultreal); _mm256_store_pd(ximag + i, resultimag ); }
i 57~58 cpu cycles 4 components calculation.
i searched google , ran tests accuracy of sin/cos. articles 10 iteration double-precision accurate while -m_pi/2 < angle < +m_pi/2. , test result show it's more accurate math.h's sin/cos @ -m_pi < angle < +m_pi range. can increase iteration more accuracy large angle valued if needed.
however i'll go deeper optimizing code. code have latency problem calculation tayor series. avx's multiply latency 5 cpu cycle, means can't run 1 iteration faster 5 cycle because [1] uses result previous iteration's result.
we can unroll this.
for(int i=0; i<count; i+=8) { __m256d voltvec0 = _mm256_load_pd(volt + + 0); __m256d voltvec1 = _mm256_load_pd(volt + + 4); __m256d angvec0 = _mm256_load_pd(theta + + 0); __m256d angvec1 = _mm256_load_pd(theta + + 4); __m256d sinvec0; __m256d sinvec1; __m256d cosvec0; __m256d cosvec1; __m256d anglesq0 = angvec0 * angvec0; __m256d anglesq1 = angvec1 * angvec1; sinvec0 = angvec0; sinvec1 = angvec1; cosvec0 = one_pd; cosvec1 = one_pd; __m256d sin_serise0 = sinvec0; __m256d sin_serise1 = sinvec1; __m256d cos_serise0 = one_pd; __m256d cos_serise1 = one_pd; for(int j=0; j<10; ++j) { sin_serise0 = sin_serise0 * anglesq0; cos_serise0 = cos_serise0 * anglesq0; sin_serise1 = sin_serise1 * anglesq1; cos_serise1 = cos_serise1 * anglesq1; sinvec0 = sinvec0 + sin_serise0 * sin_req[j]; cosvec0 = cosvec0 + cos_serise0 * cos_req[j]; sinvec1 = sinvec1 + sin_serise1 * sin_req[j]; cosvec1 = cosvec1 + cos_serise1 * cos_req[j]; } __m256d realresult0 = voltvec0 * sinvec0; __m256d imagresult0 = voltvec0 * cosvec0; __m256d realresult1 = voltvec1 * sinvec1; __m256d imagresult1 = voltvec1 * cosvec1; _mm256_store_pd(xreal + + 0, realresult0); _mm256_store_pd(ximag + + 0, imagresult0); _mm256_store_pd(xreal + + 4, realresult1); _mm256_store_pd(ximag + + 4, imagresult1); }
this result 51~51.5 cycles 4 component calculation. (102~103 cycle 8 component)
it eliminated mutiply latency in taylor calculation loop , uses 85% of avx multiply unit. unrolling solve lots of latency problems while not swap registers memory. generate asm file while compiling , see how compiler handle code. tried unroll more resulted bad because not fit in 16 avx registers.
now go memory optmize. replace _mm256_store_ps() _mm256_stream_ps().
_mm256_stream_pd(xreal + + 0, realresult0); _mm256_stream_pd(ximag + + 0, imagresult0); _mm256_stream_pd(xreal + + 4, realresult1); _mm256_stream_pd(ximag + + 4, imagresult1);
replacing memory write code result 48 cycles 4 component calculation.
_mm256_stream_pd() faster if not going read back. skip cache system , send data direct memory controller , not pollute cache. more databus/cache space reading datas using _mm256_stream_pd().
let's try prefetch.
for(int i=0; i<count; i+=8) { _mm_prefetch((const char *)(volt + + 5 * 8), _mm_hint_t0); _mm_prefetch((const char *)(theta + + 5 * 8), _mm_hint_t0); // calculations here. }
now got 45.6~45.8 cpu cycles per calculation. 94% busy of avx multiply unit.
prefech hints cache faster read. recommend prefech before 400~500 cpu cycles based on physical memory's ras-cas latency. physical memory latency can take 300 cycles on worst case. may vary hardware configuration, not smaller 200 cycle if use expensive low ras-cas delay memory.
0.064 sec (count = 18562320)
end of sin/cos optimization. :-)
Comments
Post a Comment