CodeExplorer
Posted November 10, 2019

CUDA: how to force the use of mad.lo.u32 for ROTATE_LEFT?

Compute Capability 1.2

    __global__ void fun(unsigned int *mem)
    {
        int a = 3;
        int b = 5;
        int c = 6;
        int d;
        asm("mad.lo.u32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c) : ); // d = a*b+c
        *mem = d;
    }

This produces the correct result. However, when I try defining the rotate as a multiply-add:

    #define ROTATE_LEFT2(x, n) (int)x*(1>>(32-n))+(x<<(int)n)

the compiler does not emit a mad instruction.

References:
https://www.openwall.com/lists/john-dev/2012/03/22/7
https://devtalk.nvidia.com/default/topic/489750/ptx-assembly-help-33-/
https://devtalk.nvidia.com/default/topic/478578/integer-mad-instruction/
https://www.blackhat.com/presentations/bh-usa-09/BEVAND/BHUSA09-Bevand-MD5-SLIDES.pdf
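A note on the ROTATE_LEFT2 macro: in C, `1>>(32-n)` evaluates to 0 for any n from 1 to 31, so the multiply term vanishes and the expression reduces to a plain left shift, which may be one reason no multiply-add appears. For a rotate to take the a*b+c shape that mad.lo.u32 computes, the multiplier would presumably need to be `1<<n` with `x>>(32-n)` as the addend. The sketch below is plain host-side C, not the poster's code, and the names `rotl_ref`, `rotl_mad`, `rotl_mismatches` and `rotl_self_check` are made up for illustration. It only checks that the multiply-add form agrees with the usual shift/OR rotate for 32-bit unsigned values; it says nothing about which instructions nvcc will actually pick.

```c
#include <assert.h>
#include <stdint.h>

/* Reference rotate-left using two shifts and an OR; valid for 1 <= n <= 31. */
static uint32_t rotl_ref(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

/* Multiply-add form: x * 2^n + (x >> (32 - n)).
   The product wraps modulo 2^32, so it equals x << n, and the addend
   only occupies the low n bits that the product leaves as zero. */
static uint32_t rotl_mad(uint32_t x, unsigned n) {
    return x * (UINT32_C(1) << n) + (x >> (32u - n));
}

/* Counts the shift amounts in 1..31 for which the two forms disagree. */
static int rotl_mismatches(uint32_t x) {
    int bad = 0;
    for (unsigned n = 1; n < 32; ++n) {
        if (rotl_ref(x, n) != rotl_mad(x, n)) {
            ++bad;
        }
    }
    return bad;
}

/* Small self-check on one sample value (hypothetical helper). */
static void rotl_self_check(void) {
    assert(rotl_mismatches(UINT32_C(0x80000001)) == 0);
}
```

Calling `rotl_self_check()` from a test program should pass silently. On the device side, the same multiply-add could be tried with the inline `mad.lo.u32` asm from the post, passing `x`, `1u << n` and `x >> (32 - n)` as the three inputs; that part is untested here.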
CodeExplorer (Author)
Posted November 10, 2019

This optimization won't help. According to
https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievediops.htm

ADD: Weighted sum of all executed integer additions (IADD). The default weight is 1.
MUL: Weighted sum of all executed integer multiplications (IMUL). The default weight is 1.
MAD: Weighted sum of all executed integer multiply-add (IMAD) instructions. The default weight is 2.

1 (ADD) + 1 (MUL) = 2 (MAD), so there is no speed improvement.