Jump to content
Tuts 4 You

Speed optimization question comparing


CodeExplorer

Recommended Posts

Speed optimization question:
Having this code:

    F1_bak = F1_bak+An+Mn[0]; // add the already computed Mn[0] value
    unsigned int lr_val1_bak = leftrotate(F1_bak, s[0]);
    unsigned int Ban1Test = Ban[0]+lr_val1_bak;

    if (Ban[1]!=Ban1Test)
    {
    //printf("Invalid Ban[1] value!!!!\n");
    goto StartOfSearch;
    }

versus second one:
 

	unsigned int Ban1Test = Ban[0]+leftrotate((F1_bak+An+Mn[0]), s[0]);

	if (Ban[1]!=Ban1Test)
	{
	//printf("Invalid Ban[1] value!!!!\n");
	goto StartOfSearch;
	}

It happens that the first one is faster, there is more code to it and the difference is of 10 seconds!
Can someone explain why?
 

Link to comment

Nobody can explain why, because you only gave a very small part of C code. It's like asking "why this green 2-door car is faster than that red 4-door?"

Everything that is actually important, is missing from your question:
1) full code of the loop/method;
2) input data you used for testing;
3) used compiler/linker with all the switches;
4) actual assembly code produced by compiler/linker;
5) method how you measured the speed;
and I could go on and on.

My only suggestion is - look at the produced assembly code. Use a bloody profiler on both files. See which part of code is the slow one. Analyze that, and repeat. https://softwareengineering.stackexchange.com/questions/212111/how-to-find-bottlenecks-in-an-application

 

From the first look, these methods don't even produce the same results (first one changes F1_bak value, second one doesn't).

 

Link to comment

Here is the loop code: removed, old code!

I find that percentage calculation take a lot of time:
152 seconds or 2.5 minutes
without progress show:
94.00 seconds

 

Edited by CodeExplorer
Link to comment

2) input data you used for testing;
Mn[7] testing value starts with 2147483647 and end with 0; on each loop Mn[7] is decremented by 1
3). I've used the compiler "Pelles C for Windows" x86 version 8.00.60
4.) Used for passed seconds measurement:

On start:
time_t _tm = time(NULL);
struct tm * curtime = localtime (&_tm);
printf("start time %s", asctime(curtime));

When finished:
printf("Finished!!!\n");
time_t _tm_end = time(NULL);
curtime = localtime (&_tm_end);
printf("end time %s", asctime(curtime));
double dif = difftime (_tm_end,_tm);
printf ("Elapsed time is %.2lf seconds.\n", dif );

Already removed progress from main loop and I've used instead:
mRes = timeSetEvent(10, 0, &TimerProc, 0, TIME_PERIODIC);
and the result was great: 50 seconds less.

And once again same strange thing, the fallowing code:

	dif1 = Ban[1]-Ban[0];
	dif1_div = rightrotate(dif1, s[0]);
	F1 = (Cn xor (Bn or (not Dn)))+An;
	Mn[0] = dif1_div-F1;

	F1_bak = F1+Mn[0]; // add the already computed Mn[0] value
	lr_val1_bak = leftrotate(F1_bak, s[0]);
	Ban1Test = Ban[0]+lr_val1_bak;

	if (Ban[1]!=Ban1Test)
	{
	//printf("Invalid Ban[1] value!!!!\n");
	goto StartOfSearch;
	}

is slower then:
 

	dif1 = Ban[1]-Ban[0];
	dif1_div = rightrotate(dif1, s[0]);
	F1 = (Cn xor (Bn or (not Dn)))+An;
	Mn[0] = dif1_div-F1;

	lr_val1_bak = leftrotate(F1+Mn[0], s[0]);
	Ban1Test = Ban[0]+lr_val1_bak;

	if (Ban[1]!=Ban1Test)
	{
	//printf("Invalid Ban[1] value!!!!\n");
	goto StartOfSearch;
	}

So mainly get riding of local variable F1_bak - "F1+Mn[0]" calculated directly (without local variable) is slower then "F1_bak = F1+Mn[0]";.This doesn't make any sense! And the difference is ~10 seconds, to much to be a measurement problem.

These methods produce the same results: F1_bak value is used only there for computing the Ban1Test value, and F1_bak value is used only once for computing Ban1Test value, Ban1Test value is used just once few instruction after (on showed code) for testing the value.

Edited by CodeExplorer
Link to comment

I want an explanation on why the local variable version is faster then with the value calculated directly.

Here is an optimization that makes sense:
instead of "if (lr_values_4%s[4]!=0)" use "if (lr_values_4&(s[4]-1)!=0)"
That makes sense since modulo/divide operations takes lot of time.

The local variable version being faster then directly computed value make no sense!!!
 

Link to comment

Let's put this simple, less number of lines (optimized number) of the same code is surely faster.

But this was when we had VC++ 2005, now that the compilers are super optimized, I think both will have the same performance (or maybe an infinitesimal performance difference).

 

Aside from all that, you use goto a lot, which is bad !

Link to comment
Quote

Aside from all that, you use goto a lot, which is bad !

Thanks. I somehow forget about "continue;" and "break;" statements! :sweat:
 

Quote

Let's put this simple, less number of lines (optimized number) of the same code is surely faster.

That's sure true, but here the version without local variable should be faster. I didn't yet looked at compiled code; I didn't looked at disassembled code. But first I should know the theory.
 

Link to comment
2 hours ago, CodeExplorer said:

 Thanks. I somehow forget about "continue;" and "break;" statements! :sweat:

Hmmm... they're bad too :D

Your code flow shouldn't be controlled explicitly with continue, break, goto, or even switches... you should strive for making easy to follow code like using loops (and putting all your conditions for the loop to keep going there), and don't break the flow with any other statement (so anyone would look only at the while/for condition to know when it's gonna stop).

 

2 hours ago, CodeExplorer said:

but here the version without local variable should be faster

As I said, there is no guarantee that the compiler optimization would get the same result from the one-liner code vs the multi-line code. Most advanced compilers get to the same point.

But do you know where the local vars go ?

To the stack, hmm what then ?

The compiler then should generate a way to cleanup the stack to keep it balanced (at the end of your piece of code).

So, think about your code as a black box... all the stack / heap manipulations happen, but are they gonna stay there ? of course not, there must be extra codes to cleanup those dirty areas in the memory, and that's why the other version MIGHT be slower in case the compiler generated the local vars code in the output executable.

Link to comment
Quote

only at the while/for condition

In other words just one ugly loop!

Here is the new code:

mRes = timeSetEvent(10, 0, &TimerProc, 0, TIME_PERIODIC);

	// Just initial values of A, B, C, D:
	unsigned int Al = 1;
	unsigned int Bl = 2;
	unsigned int Cl = 3;
	unsigned int Dl = 4;
	int IsFound = 0;  

	for (Mn[7]=total;Mn[7]!=0;Mn[7]--)
	{

	Ban[4] = know_sum1-Mn[7];
	lr_values_4 = Ban[5]-Ban[4];  // i = 4

	F_2 = rightrotate(lr_values_2_Clean, s[size-2]);  // i = 6; 
	F_x2 = A xor (Ban6_Clean or (not Ban[4]));
	Ban[3] = F_2-F_x2;  // since Mn[6]=0

	F_3 = rightrotate(lr_values_3_Clean, s[size-3]);  // i = 5
	F_x3 = Ban[4] xor (Ban5_Clean or (not Ban[3]));
	Ban[2] = F_3-F_x3;  // since Mn[5]=0

	F_4 = rightrotate(lr_values_4, s[size-4]);
	F_x4 = Ban[3] xor (Ban[4] or (not Ban[2]));
	Ban[1] = F_4-F_x4-Mn[4];  // Mn[4] is known

	// Calculation of Mn[0]-Mn[3] (messages)
	dif1 = Ban[1]-Ban[0];
	dif1_div = rightrotate(dif1, s[0]);
	F1 = (Cn xor (Bn or (not Dn)))+An;
	Mn[0] = dif1_div-F1;

	F1 += Mn[0]; // add the already computed Mn[0] value
	lr_val1_bak = leftrotate(F1, s[0]);
	Ban1Test = Ban[0]+lr_val1_bak;

	if (Ban[1]!=Ban1Test) continue;

	dif2 = Ban[2]-Ban[1];
	dif2_div = rightrotate(dif2, s[1]);
	F2 = (Bn xor (Ban[1] or (not Cn)))+Dn;  // 	F = C xor (B or (not D));
	Mn[1] = dif2_div-F2;

	F2 += Mn[1]; // add the already computed Mn[1] value
	lr_val2_bak = leftrotate(F2, s[1]);
	Ban2Test = Ban[1]+lr_val2_bak;

	if (Ban[2]!=Ban2Test) continue;

	dif3 = Ban[3]-Ban[2];
	dif3_div = rightrotate(dif3, s[2]);
	F3 = (Ban[1] xor (Ban[2] or (not Bn)))+Cn;
	Mn[2] = dif3_div-F3;

	F3 += Mn[2]; // add the already computed Mn[2] value
	lr_val3_bak = leftrotate(F3, s[2]);
	Ban3Test = Ban[2]+lr_val3_bak;

	if (Ban[3]!=Ban3Test) continue;

	// Jus initial values:
	Al = 1;
	Bl = 2;
	Cl = 3;
	Dl = 4;

	for (int j=0;j<8;j++)
	{
	F = Cl xor (Bl or (not Dl));
	F = F+Al+Mn[j];
	Al = Dl;
	Dl = Cl;
	Cl = Bl;
	Bl = Bl + leftrotate(F, s[j]);

	}

	if ((Al!=A)||(Bl!=B)||(Cl!=C)||(Dl!=D))
	{
	continue;
	}
	else
	{
	IsFound = 1;
	break;
	}

	}


	if (mRes!=0) mRes = timeKillEvent(mRes);
	printf("Finished!!!\n");

 

  • Like 1
Link to comment

#define leftrotate(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))

mov eax, dword ptr [s[0]]
- eax is initialized before

    F1 += Mn[0]; // add the already computed Mn[2] value
    lr_val1_bak = leftrotate(F1, s[0]);
    Ban1Test = Ban[0]+lr_val1_bak;

    if (Ban[1]!=Ban1Test) continue;

00401C6B  |.  8B15 68C34000     |MOV EDX,DWORD PTR DS:[40C368]
00401C71  |.  83C2 F9           |ADD EDX,-7
00401C74  |.  89C1              |MOV ECX,EAX
00401C76  |.  D3C2              |ROL EDX,CL
00401C78  |.  8B45 90           |MOV EAX,DWORD PTR SS:[EBP-70]
00401C7B  |.  01D0              |ADD EAX,EDX
00401C7D  |.  3B45 94           |CMP EAX,DWORD PTR SS:[EBP-6C]
00401C80  |.  0F85 18010000     |JNZ 00401D9E

versus:
    if (Ban[1]!=(Ban[0]+leftrotate(F1+Mn[0], s[0]))) continue;

00401C6B  |.  8B15 68C34000    |MOV EDX,DWORD PTR DS:[40C368]
00401C71  |.  83C2 F9          |ADD EDX,-7
00401C74  |.  89C1             |MOV ECX,EAX
00401C76  |.  89D6             |MOV ESI,EDX
00401C78  |.  D3E6             |SHL ESI,CL
00401C7A  |.  B9 20000000      |MOV ECX,20
00401C7F  |.  29C1             |SUB ECX,EAX
00401C81  |.  D3EA             |SHR EDX,CL
00401C83  |.  09D6             |OR ESI,EDX
00401C85  |.  0375 90          |ADD ESI,DWORD PTR SS:[EBP-70]
00401C88  |.  3B75 94          |CMP ESI,DWORD PTR SS:[EBP-6C]
00401C8B  |.  0F85 18010000    |JNZ 00401DA9

So there is no coincidence, the generated code in second part sucks,
also the ROL instruction is missing!

I think I know what's going on:
    // Just initial values of A, B, C, D:
    An = 1;
    Bn = 2;
    Cn = 3;
    Dn = 4;

So here we have:
F1 = (Cn xor (Bn or (not Dn)))+An;
the compiler is smart enough to replace F1 with -7
 

Link to comment
5 hours ago, CodeExplorer said:

In other words just one ugly loop!

Ah... that's really a long function, how about you refactor the small similar parts into smaller functions as this one

 

	F_2 = rightrotate(lr_values_2_Clean, s[size-2]);  // i = 6; 
	F_x2 = A xor (Ban6_Clean or (not Ban[4]));
	Ban[3] = F_2-F_x2;  // since Mn[6]=0

	F_3 = rightrotate(lr_values_3_Clean, s[size-3]);  // i = 5
	F_x3 = Ban[4] xor (Ban5_Clean or (not Ban[3]));
	Ban[2] = F_3-F_x3;  // since Mn[5]=0

 

And can you avoid all the break continue statements, and use one variable to control the flow ?

As I told you, many modern compilers do better job than humans in optimizing the code... so, you don't need to worry about one-liner code vs multi code.

You gotta focus more on the clarity / readability of your code

 

Link to comment

I still don't know why "rol"/"ror" instructions was replaced with many other instructions,
if I use local variables for calculation rol/ror are there and everything is ok.

Also I added two more "continue" conditions and the speed is now 31 seconds.
Fast enough I would say.
 

Link to comment
3 hours ago, CodeExplorer said:

I still don't know why "rol"/"ror" instructions was replaced with many other instructions,

Because you yourself defined rotate as:

#define leftrotate(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))

In case of a simple assignment compiler is smart enough to figure out it's "rol" and emit one instruction instead. In case of complex assignment, it uses the exact code you wrote.

 

If you want to optimize that part of code, check out the documentation of your compiler. Most of them provide intrinsics for rotates.

Edited by kao
Link to comment
On 9/10/2018 at 12:20 AM, CodeExplorer said:

Can someone explain why?

Your compiler and settings will determine the various optimizations that will be done automatically for you. If you are using Visual Studio, there are various optimization settings you can alter and adjust, as well as setting things to favor different things such as size of code or speed. If you are relying on the compiler for optimizations, then you are going to want to be using the newest version of the compiler/linker that is available for the best optimizations. (I know a lot of people in the reversing scene still use extremely old versions of VS like 2005/2008, so you will want to get 2017 for the best optimizations from the compiler/linker.)

In most cases, in your first post example, the compiler is going to optimize the function in the same way and produce the same output. Your best bet is to compare the output assembly of the code to see which styles produce the output that you prefer / favor as well as play with the various settings. Be sure to compile in release mode as well, disable as much debug code as possible, remove any unwanted/unneeded additional things that you do not need/want as well such as stack checks etc.

Compilers/linkers have tons of settings to optimize and tweak all kinds of things so be sure to review all the settings available, and even undocumented ones in the main settings windows etc. that can help tweak things further. If you are using more in depth things like the STL, you may also want to force the compiler to use newer standards (C++17, C++20, etc.) to get even more performance increases with newer standards.

Link to comment

When I use local version rol/ror instruction are placed like it should,
when I don't use local variable and calculate it directly it place these:

#define leftrotate(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))

 

Here I've used the compiler "Pelles C for Windows" x86 version 8.00.60,
after all this is just a console program.

For native GUI developer I've stuck with Visual C++ 2008.
Visual C++ 2008 is way better then latter, later version adds a strange PE section to generated exes.
So won't ever upgrade to any latter version.
I've also used "Visual C++ 6.0" a lot and I think to still use it.

 

Quote

so you will want to get 2017 for the best optimizations from the compiler/linker.
you may also want to force the compiler to use newer standards (C++17, C++20, etc.) to get even more performance increases with newer standards.

100% doubt these statements!!! I think it doesn't matter that much what compiler/linker I use as long as the generated assembly code is ok, If you really want optimisations use assembler like MASM and use CPU registers: but unforunalety they are only 8 general purpose registers, from which you can exclude SP and BP - so you left with only 6 registers, in the end you would still have to use memory variables.
 

Edited by CodeExplorer
Link to comment
48 minutes ago, CodeExplorer said:

Here I've used the compiler "Pelles C for Windows" x86 version 8.00.60,
after all this is just a console program.

For native GUI developer I've stuck with Visual C++ 2008.
Visual C++ 2008 is way better then latter, later version adds a strange PE section to generated exes.
So won't ever upgrade to any latter version.
I've also used "Visual C++ 6.0" a lot and I think to still use it.

This is why you are getting less then desired optimizations. If you want to keep using old/outdated compilers, then you are going to get old/outdated performance from them. Pelles is still a fairly newer compiler with less established development and optimizations in it so you are not going to get the results you may expect from it.

50 minutes ago, CodeExplorer said:

Visual C++ 2008 is way better then latter, later version adds a strange PE section to generated exes.

There are no extra/strange sections added to generated exe's outside of what you tell the compiler to do. The output depends on compiler settings, code that you use, compiler settings, how you link to the CRT and so on. There was one instance where Microsoft included some telemetry code into compiled exe's that was used for debugging purposes when VS2017 was still in its release candidate state that was later removed. Whatever misconceptions you have of VS are completely unfounded and are nothing more than fear mongering garbage from other non-developers in the RE community. 

 

53 minutes ago, CodeExplorer said:

100% doubt these statements!!! I think it doesn't matter that much what compiler/linker I use as long as the generated assembly code is ok,

Seriously..? The compiler and linker are what handle the automated optimizations. Hence why they have tons of configuration settings to change how those optimizations work, what it will/wont touch, etc. 

You are never going to get better-optimized code from the compiler you are using if you continue to use outdated tech., it's pretty much as simple as that. Unless you hand-write all the ASM yourself to handle every little optimization you want, it's not going to change unless you use newer stuff. 

Link to comment
  • 3 weeks later...

And which part of the SO explanation is confusing to you?

Quote

A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated ... —end note ]

 

Link to comment
CodeExplorer
41 minutes ago, kao said:

And which part of the SO explanation is confusing to you?

None. Is just a problem I can't override.
I don't take the address of variable.
When I use any optimization flags register keyword is ignored,
when I set Optimzations to None everything works like it should: variable is stored in CPU register.

 

 

Link to comment

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...