c - How to improve performance of following loop -


i have simple loop in c convert magnitude , angle real , imaginary parts. have 2 versions of loop as. version 1 simple loop perform conversion using following code

for(k = 0; k < n; k++){     xreal[k] = mag[k] * cos(angle[k]);     ximag[k] = mag[k] * sin(angle[k]); } 

an version 2 intrinsics used vectorize loop.

__m256d cosvec, sinvec; __m256d resultreal, resultimag; __m256d angvec, voltvec; for(k = 0; k < sysdata->totnumofbus; k+=4){      voltvec = _mm256_loadu_pd(volt + k);     angvec = _mm256_loadu_pd(theta + k);      sinvec = _mm256_sincos_pd(&cosvec, angvec);      resultimag = _mm256_mul_pd(voltvec, sinvec);     resultreal = _mm256_mul_pd(voltvec, cosvec);      _mm256_store_pd(xreal+k, resultreal);     _mm256_store_pd(ximag+k, resultimag);  } 

on core i7 2600k @3.4ghz processor, these loops give following results:

version 1: n = 18562320, time: 0.2sec version 2: n = 18562320, time: 0.16sec 

a simple calculations these values show in version 1, each iteration takes 36 cycles completed whereas takes 117 cycles version 2 completed. considering fact calculation of sine , cosine functions naturally expensive, these number seems not terrible. however, loop serious bottleneck of function profiling shows 1/3 of time spent inside loop. so, wondering if there way expedite loop (e.g. calculating sine , cosine functions differently). appreciated if me work around problem , let me know whether there room improve performance of loop.

thanks in advance help

ps: using icc compile code. also, should mention data not aligned (and cannot be). however, aligning data leads minor performance improvement (less 1 percent).

i recommend make tayler series based sin/cos function , _mm256_stream_pd() store data. here base sample code.

    __m256d sin_req[10];     __m256d cos_req[10];     __m256d one_pd =  _mm256_set1_pd(1.0);      for(int i=0; i<10; ++i)     {         sin_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/factorial((i+1)*2+1) ) : _mm256_set1_pd(+1.0/factorial((i+1)*2+1) );         cos_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/factorial((i+1)*2+0) ) : _mm256_set1_pd(+1.0/factorial((i+1)*2+0) );     }      for(int i=0; i<count; i+=4)     {             __m256d voltvec = _mm256_load_pd(volt + i);             __m256d angvec = _mm256_load_pd(theta + i);              // sin/cos taylor series             __m256d anglesq = angvec * angvec;             __m256d sinvec = angvec;             __m256d cosvec = one_pd;             __m256d sin_serise = sinvec;             __m256d cos_serise = one_pd;             for(int j=0; j<10; ++j)             {                 sin_serise = sin_serise * anglesq; // [1]                 cos_serise = cos_serise * anglesq;                 sinvec = sinvec + sin_serise * sin_req[j];                 cosvec = cosvec + cos_serise * cos_req[j];             }              __m256d resultreal = voltvec * sinvec;             __m256d resultimag = voltvec * cosvec;              _mm256_store_pd(xreal + i, resultreal);             _mm256_store_pd(ximag + i, resultimag );     } 

i 57~58 cpu cycles 4 components calculation.

i searched google , ran tests accuracy of sin/cos. articles 10 iteration double-precision accurate while -m_pi/2 < angle < +m_pi/2. , test result show it's more accurate math.h's sin/cos @ -m_pi < angle < +m_pi range. can increase iteration more accuracy large angle valued if needed.

however i'll go deeper optimizing code. code have latency problem calculation tayor series. avx's multiply latency 5 cpu cycle, means can't run 1 iteration faster 5 cycle because [1] uses result previous iteration's result.

we can unroll this.

    for(int i=0; i<count; i+=8)     {         __m256d voltvec0 = _mm256_load_pd(volt + + 0);         __m256d voltvec1 = _mm256_load_pd(volt + + 4);         __m256d angvec0  = _mm256_load_pd(theta + + 0);         __m256d angvec1  = _mm256_load_pd(theta + + 4);         __m256d sinvec0;         __m256d sinvec1;         __m256d cosvec0;         __m256d cosvec1;          __m256d anglesq0 = angvec0 * angvec0;         __m256d anglesq1 = angvec1 * angvec1;         sinvec0 = angvec0;         sinvec1 = angvec1;         cosvec0 = one_pd;         cosvec1 = one_pd;         __m256d sin_serise0 = sinvec0;         __m256d sin_serise1 = sinvec1;         __m256d cos_serise0 = one_pd;         __m256d cos_serise1 = one_pd;          for(int j=0; j<10; ++j)         {             sin_serise0 = sin_serise0 * anglesq0;             cos_serise0 = cos_serise0 * anglesq0;             sin_serise1 = sin_serise1 * anglesq1;             cos_serise1 = cos_serise1 * anglesq1;             sinvec0 = sinvec0 + sin_serise0 * sin_req[j];             cosvec0 = cosvec0 + cos_serise0 * cos_req[j];             sinvec1 = sinvec1 + sin_serise1 * sin_req[j];             cosvec1 = cosvec1 + cos_serise1 * cos_req[j];         }          __m256d realresult0 = voltvec0 * sinvec0;         __m256d imagresult0 = voltvec0 * cosvec0;         __m256d realresult1 = voltvec1 * sinvec1;         __m256d imagresult1 = voltvec1 * cosvec1;          _mm256_store_pd(xreal + + 0, realresult0);         _mm256_store_pd(ximag + + 0, imagresult0);         _mm256_store_pd(xreal + + 4, realresult1);         _mm256_store_pd(ximag + + 4, imagresult1);     } 

this result 51~51.5 cycles 4 component calculation. (102~103 cycle 8 component)

it eliminated mutiply latency in taylor calculation loop , uses 85% of avx multiply unit. unrolling solve lots of latency problems while not swap registers memory. generate asm file while compiling , see how compiler handle code. tried unroll more resulted bad because not fit in 16 avx registers.

now go memory optmize. replace _mm256_store_ps() _mm256_stream_ps().

    _mm256_stream_pd(xreal + + 0, realresult0);     _mm256_stream_pd(ximag + + 0, imagresult0);     _mm256_stream_pd(xreal + + 4, realresult1);     _mm256_stream_pd(ximag + + 4, imagresult1); 

replacing memory write code result 48 cycles 4 component calculation.

_mm256_stream_pd() faster if not going read back. skip cache system , send data direct memory controller , not pollute cache. more databus/cache space reading datas using _mm256_stream_pd().

let's try prefetch.

    for(int i=0; i<count; i+=8)     {     _mm_prefetch((const char *)(volt + + 5 * 8), _mm_hint_t0);     _mm_prefetch((const char *)(theta + + 5 * 8), _mm_hint_t0);              // calculations here.     } 

now got 45.6~45.8 cpu cycles per calculation. 94% busy of avx multiply unit.

prefech hints cache faster read. recommend prefech before 400~500 cpu cycles based on physical memory's ras-cas latency. physical memory latency can take 300 cycles on worst case. may vary hardware configuration, not smaller 200 cycle if use expensive low ras-cas delay memory.

0.064 sec (count = 18562320)

end of sin/cos optimization. :-)


Comments

Popular posts from this blog

ios - UICollectionView Self Sizing Cells with Auto Layout -

DOM Manipulation in Wordpress (and elsewhere) using php -

asp.net - Passing parameter to telerik popup -