Advanced C++ Optimization and Assembly
This article aims to provide a complete but brief approach to mixing C++ and assembly for greater code optimization. I also added an extra section (in the advanced part) that shows how you can execute two instructions at once.
This section is for newcomers to assembly programming. It covers simple optimization methods, nothing really advanced. You should still be familiar with at least some assembly.
I decided to go with the basics first, since they are very easy to deal with. I also decided to provide a small assembly tutorial for those who are not too familiar with assembly and its concepts.
The first thing we need to learn about is registers: eax, ebx, ecx and edx.
Those are the four general-purpose registers; think of registers as variables inside your CPU. Using these registers is very simple. You will mostly be accessing them with inline assembly.
To do any kind of assembly programming in C++, you must learn about the 'asm' statement. In Visual C++ this statement is written __asm, and the technique is called inline assembly. So you must always include the __asm statement before your assembly code.
Now, to play around with registers, let's make a simple program that adds 5 to a DWORD using registers.
Please note that a DWORD is a 32-bit variable. If you don't include windows.h in your program you can't use DWORD, but DWORDs are just unsigned long variables. So let's get on with the code.
unsigned long function(unsigned
This is just some simple code with no real speed gain; it only shows how easy it is to mix C++ and assembly code.
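The listing above is truncated, so here is a minimal sketch of what such a function could look like. The names add_five and check_add_five are assumptions, not the author's code. The __asm path only applies to 32-bit Visual C++; every other compiler takes the plain C++ fallback.

```cpp
#include <cassert>

// Hypothetical helper (name assumed): returns value + 5.
unsigned long add_five(unsigned long value)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    // 32-bit Visual C++ only: inline assembly introduced by __asm.
    __asm {
        mov eax, value   ; load the argument into eax
        add eax, 5       ; add the constant 5
        mov value, eax   ; store the result back into the variable
    }
    return value;
#else
    // Other compilers do not accept __asm blocks, so use plain C++.
    return value + 5;
#endif
}

// Small usage check.
void check_add_five()
{
    assert(add_five(10) == 15);
}
```

Both paths should give the same result, so the assembly version is only an illustration of the syntax.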
C/C++, being the high-level language that it is, has strict protocols (calling conventions) that you have to follow if you don't want your C++ program to crash. Learning these protocols provides great methods for optimizing code. The first thing we need to learn is the proper use of the eax register, since this is the register C++ returns values in.
This simple statement, return 4;, compiles into code that moves 4 into eax and then returns.
As you can see, the eax register has some special uses. To apply what we learned, let's create a simple function that returns 4 using nothing but assembly. Please notice that I am going to use the naked specifier with my code; to learn about naked functions, read my article about the naked function.
Very simple. If you did something like this
if(return4() == 4)
it would actually exit the program, since eax holds the return value for functions.
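A minimal sketch of a return4 function along these lines is shown below. The original listing is missing, so the body is an assumption based on the description: on 32-bit Visual C++ a naked function loads eax and returns by itself, and on any other compiler a plain C++ fallback is used. check_return4 is a hypothetical usage helper.

```cpp
#include <cassert>

#if defined(_MSC_VER) && defined(_M_IX86)
// 32-bit Visual C++: a naked function gets no compiler-generated
// prologue or epilogue, so the body must set eax and return itself.
__declspec(naked) int return4()
{
    __asm {
        mov eax, 4   ; the return value travels in eax
        ret          ; return to the caller
    }
}
#else
// Portable fallback: the compiler puts the return value in place for us.
int return4()
{
    return 4;
}
#endif

// Small usage check: the caller reads the result the usual way.
void check_return4()
{
    assert(return4() == 4);
}
```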
Now that we know a little bit more about C++ protocols, we can move on to some more advanced optimization.
Time to Optimize:
Most of your optimizations with assembly code will be short and brief, nothing too tricky or complex. The first thing I decided to optimize was a byte swap function that turns little-endian byte order into big-endian byte order, which is very important in socket programming. The traditional way to do it is htons(), so I decided to write an optimized version.
int fast_ton (unsigned short v)
This function uses the x86 instruction bswap, which reverses the byte order of a 32-bit register. The first thing we do is clear eax with xor eax, eax, which makes eax zero. You might be wondering why not just mov eax, 0. Well, xor eax, eax is the more optimized way, since its encoding is shorter. Second, we mov the argument into ax, which is the lower 16 bits of eax. Then we swap the bytes with bswap eax.
After all that we have one problem: eax is a 32-bit register, so the value in eax is too high. Imagine we called fast_ton(1);. When we swap, the 1 ends up in the highest byte of the 32-bit register, giving the value 16777216 instead of the 256 that htons will return. The simple solution is to shift the result down by 16 bits, which we can accomplish with the shr (shift right) instruction, which shifts all the bits right.
After that we simply return with the ret instruction.
Of course this function can be optimized more than this, but for the sake of simplicity I decided to code it like this.
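The steps described above can be traced in plain C++. This is a portable sketch, not the author's assembly: bswap32 is a hypothetical stand-in for the bswap instruction, the local variable eax only models the register, and check_fast_ton is an assumed usage helper.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the bswap instruction: reverse the four bytes.
std::uint32_t bswap32(std::uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

// Sketch of fast_ton following the steps in the text.
int fast_ton(unsigned short v)
{
    std::uint32_t eax = 0;          // xor eax, eax
    eax |= v;                       // mov ax, v  (low 16 bits of eax)
    eax = bswap32(eax);             // bswap eax  (bytes now in the top half)
    eax >>= 16;                     // shr eax, 16
    return static_cast<int>(eax);   // ret        (result is in eax)
}

// Small usage check with the example from the text.
void check_fast_ton()
{
    assert(bswap32(1u) == 16777216u);  // 1 lands in the top byte
    assert(fast_ton(1) == 256);        // shifted back down by 16 bits
}
```

Without the final shift the function would return 16777216 for an input of 1, which is the problem the shr instruction solves.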
Some Basic Macros:
Macros are a great way to optimize code, and they make it very simple to reuse code. They also make your code more portable. Let's write the classic variable swap:
temp = var1;
var1 = var2;
var2 = temp;
Not only is this very commonly used, it is also very poor when the x86 provides a simple mechanism for this with its xchg instruction and the stack. I decided to use the stack for simplicity's sake.
Exchanging two values with the stack is simple:
int var1 = 3;
int var2 = 5;
This is a little faster. It works because the stack is FILO (first in, last out), meaning the first value you push onto the stack will be the last one out. Since we push var2 onto the stack last, the next thing we pop will get var2's value.
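The push and pop order can be modeled in portable C++. In this sketch std::stack stands in for the CPU stack, which is an assumption made so the example runs on any compiler; stack_swap and check_stack_swap are hypothetical names.

```cpp
#include <cassert>
#include <stack>

// Swap two ints by pushing both and popping them in reverse order.
// std::stack models the CPU stack here.
void stack_swap(int& var1, int& var2)
{
    std::stack<int> cpu_stack;
    cpu_stack.push(var1);     // push var1
    cpu_stack.push(var2);     // push var2 (now on top)
    var1 = cpu_stack.top();   // pop var1: receives the old var2
    cpu_stack.pop();
    var2 = cpu_stack.top();   // pop var2: receives the old var1
    cpu_stack.pop();
}

// Small usage check with the values from the text.
void check_stack_swap()
{
    int var1 = 3;
    int var2 = 5;
    stack_swap(var1, var2);
    assert(var1 == 5 && var2 == 3);
}
```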
I created a sample project with two macros that can be used to save values on the stack and get them back. Since they are macros, this code can be ported.
#define m_save(reg) __asm push (reg)
#define m_get(reg) __asm pop (reg)
unsigned long value = 3;
int main(int argc, char
Value = 66
Value = 3
Press any key to continue
You can use the two macros, m_save and m_get, in your own code; just copy and paste them. Although I didn't show you how to swap variables, I showed you how you could temporarily save a variable for later use.
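Since the sample's main function is truncated, the following is only one plausible reading that is consistent with the output shown: save the value, overwrite it, print it, restore it, print it again. model_save and model_get are hypothetical portable models of m_save and m_get, with a std::vector standing in for the CPU stack.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <vector>

// Stand-in for the CPU stack so the sketch runs on any compiler.
static std::vector<unsigned long> g_stack;

// Portable models of m_save (push) and m_get (pop).
void model_save(unsigned long v)  { g_stack.push_back(v); }
void model_get(unsigned long& v)  { v = g_stack.back(); g_stack.pop_back(); }

// Assumed shape of the sample: save, overwrite, print, restore, print.
void run_sample(std::ostream& out)
{
    unsigned long value = 3;
    model_save(value);                    // like m_save(value)
    value = 66;                           // temporarily use another value
    out << "Value = " << value << '\n';
    model_get(value);                     // like m_get(value)
    out << "Value = " << value << '\n';
}

// Small usage check against the output listed in the text.
void check_run_sample()
{
    std::ostringstream os;
    run_sample(os);
    assert(os.str() == "Value = 66\nValue = 3\n");
}
```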
This is the last section on the basic optimizations. It is almost like a reference, since I will go through many of the C++ optimization commands.
The first one is __fastcall. When a function is declared with it, the first two parameters of that function go into the registers ecx and edx instead of onto the stack. Example:
int __fastcall superfast_ton(int v1)
Very nice, eh?
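Here is a minimal sketch of declaring a function with __fastcall while still compiling elsewhere. The FASTCALL macro and the add_pair function are assumptions for illustration; the keyword itself is only applied on 32-bit Visual C++.

```cpp
#include <cassert>

// FASTCALL is a hypothetical helper macro: it expands to __fastcall on
// 32-bit Visual C++ and to nothing on other compilers.
#if defined(_MSC_VER) && defined(_M_IX86)
#define FASTCALL __fastcall
#else
#define FASTCALL
#endif

// With __fastcall the two arguments arrive in ecx and edx instead of
// being pushed on the stack.
int FASTCALL add_pair(int a, int b)
{
    return a + b;
}

// Small usage check: callers need no changes.
void check_add_pair()
{
    assert(add_pair(2, 3) == 5);
}
```

The calling convention changes how arguments are passed, not what the function computes, so the call site looks the same either way.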
The second set of optimizations uses #pragma. You can use it to turn off stack probes. Example:
#pragma check_stack(off)
It is also a good idea to turn off runtime checks (most of the time):
#pragma runtime_checks( "s", off )
Next we should turn the optimizations on. An empty list with on enables the optimizations selected with the compiler's /O options:
#pragma optimize( "", on )
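The three pragmas can be combined around a single function. This is a sketch under the assumption that only Visual C++ should see them, so they are guarded by _MSC_VER; sum_to and check_sum_to are hypothetical names used only to have something to compile.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#pragma check_stack(off)            // no stack probes for the code below
#pragma runtime_checks("s", off)    // no stack-frame runtime checks
#pragma optimize("", on)            // use the optimizations chosen with /O
#endif

// Hypothetical hot function compiled under the settings above.
unsigned long sum_to(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; ++i)
        total += i;
    return total;
}

#if defined(_MSC_VER)
#pragma runtime_checks("s", restore)  // put the runtime checks back
#endif

// Small usage check: the pragmas do not change the result.
void check_sum_to()
{
    assert(sum_to(10) == 55);
}
```

Keeping the pragmas close to the function they affect makes it easier to restore the checks for the rest of the file.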
Up until now we have been programming assembly code for newbies; now it's time for something advanced.
Executing two instructions at once
How it works is simple: the Pentium has two pipelines, the U pipe and the V pipe. Under certain circumstances (not all) it is possible to pair up two instructions and execute them at the same time.
Not all instructions can take part in this, but some pairs can when certain conditions are met. The first thing to learn is which instructions cannot be paired at all:
1. Shift or rotate instructions with the shift count in CL
2. Long arithmetic instructions for example, MUL, DIV
3. Extended instructions for example, RET, ENTER, PUSHA, MOVS, REP STOS
4. Some floating point instructions for example FSCALE, FLDCW, FST
5. Inter-segment instructions for example, PUSH sreg, CALL far
I got these lists from somewhere, but anyway, these instructions cannot be executed at the same time as another instruction. So which ones can? That is simple: some instructions can only be paired in the U pipe or in the V pipe, and some can be paired in both.
U/V Pipe Instructions
Pairable instructions issued to U or V pipes (UV)
1. Most 8/32 bit ALU operations for example, ADD, INC, XOR
2. All 8/32 bit compare instructions for example, CMP, TEST
3. All 8/32 bit stack operations using registers: PUSH reg, POP reg
These instructions can execute in both pipes, the U pipe and the V pipe.
U Pipe Instructions
Pairable instructions issued to U pipe (PU)
These instructions must be executed in the U pipe and can be paired with a
suitable instruction in the V pipe.
1. Carry instructions for example, ADC, SBB
2. Prefixed instructions (see later on)
3. Shift with immediate
4. Some floating point instructions for example, FADD, FMUL, FLD
These instructions can only be executed in the U pipe.
V Pipe Instructions
Pairable instructions issued to V pipe (PV)
These instructions can be executed in the U pipe or in the V pipe but they
will only be paired when executed in the V pipe.
1. Simple control transfer instructions for example; CALL near, JMP near, Jcc.
This includes both the Jcc short and Jcc near (0F prefixed)
2. The floating point instruction FXCH
These instructions can only be paired when they are executed in the V pipe.
Here is a table containing all pairable and non-pairable instructions.
Special notice: not all instructions can be paired. No pairing can be done when any of the following is true:
1. The next two instructions cannot be paired. (At the end of the doc you'll find a pairing table.) In general most arithmetic instructions can be paired.
2. The next two instructions have some register contention. In other words
they update/use the same registers (implicit or explicit).
3. Both the instructions are not in the instruction cache. An exception to
this is when the first instruction is a one byte instruction.
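The class rules and the register contention rule above can be expressed as a small checker. This is a simplified sketch, not a full model of the Pentium: the PairClass and Instr types are hypothetical, registers are plain strings (so al and eax are not treated as overlapping), and the instruction cache condition is not modeled at all.

```cpp
#include <cassert>
#include <set>
#include <string>

// Pairing classes taken from the lists above.
enum class PairClass { NP, UV, PU, PV };

// Minimal description of one instruction (hypothetical type).
struct Instr {
    PairClass cls;
    std::set<std::string> writes;  // registers the instruction updates
    std::set<std::string> uses;    // registers the instruction reads
};

// first would be issued to the U pipe, second to the V pipe.
bool can_pair(const Instr& first, const Instr& second)
{
    const bool first_ok  = first.cls == PairClass::UV || first.cls == PairClass::PU;
    const bool second_ok = second.cls == PairClass::UV || second.cls == PairClass::PV;
    if (!first_ok || !second_ok)
        return false;
    // Register contention: second touches a register that first updates.
    for (const auto& reg : first.writes)
        if (second.writes.count(reg) || second.uses.count(reg))
            return false;
    return true;
}

// Small usage check with a few instruction pairs.
void check_can_pair()
{
    const Instr add_eax_ebx{PairClass::UV, {"eax"}, {"eax", "ebx"}};
    const Instr inc_ecx{PairClass::UV, {"ecx"}, {"ecx"}};
    const Instr mov_edx_eax{PairClass::UV, {"edx"}, {"eax"}};
    const Instr shl_eax_cl{PairClass::NP, {"eax"}, {"eax", "ecx"}};
    assert(can_pair(add_eax_ebx, inc_ecx));       // independent registers
    assert(!can_pair(add_eax_ebx, mov_edx_eax));  // second reads eax
    assert(!can_pair(shl_eax_cl, inc_ecx));       // shift by CL never pairs
}
```

A checker like this is only useful for reasoning about instruction order by hand; reordering independent instructions next to each other is what makes pairing possible.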
This is a short article written by Opcode Void / firstname.lastname@example.org. Questions and comments are welcome.