Hi,
I just did not feel like chasing bugs the other day, so I decided to have some fun with something that I have been wondering about for a long time: the usefulness of inline x86 assembly in string functions.
This is the test program as.c:
---------------------------------8<-------------------------------------
#include <malloc.h>
#include <string.h>   /* memset */

typedef unsigned short WCHAR, *PWCHAR;

static inline WCHAR *strcpyW( WCHAR *dst, const WCHAR *src )
{
#ifdef ASM
    int dummy1, dummy2, dummy3;
    __asm__ __volatile__( "cld\n"
                          "1:\tlodsw\n\t"
                          "stosw\n\t"
                          "testw %%ax,%%ax\n\t"
                          "jne 1b"
                          : "=&S" (dummy1), "=&D" (dummy2), "=&a" (dummy3)
                          : "0" (src), "1" (dst)
                          : "memory" );
#else
    WCHAR *p = dst;
    while ((*p++ = *src++));
#endif
    return dst;
}

#define SZ 3000

int main()
{
    int i;
    PWCHAR s,d;

    s=malloc(SZ*sizeof(WCHAR));
    d=malloc(SZ*sizeof(WCHAR));
    memset(s,'x',SZ);   /* note: sets SZ bytes, i.e. the first SZ/2 WCHARs */
    s[SZ-1]=0;
    for(i=0;i<1000000;i++)
        strcpyW(d,s);
}
---------------------------------8<-------------------------------------
The function strcpyW is a copy from Wine with the #ifdef modified.
I used the following commands:
gcc-3.3 -O2 as.c -o as -DASM ; time ./as;time ./as; time ./as
and
gcc-3.3 -O2 as.c -o as ; time ./as;time ./as; time ./as
The resulting times, in seconds (all user time):
test#     asm       C
-----------------------
1       15.970   15.899
2       15.966   15.943
3       15.959   15.941
        ------   ------
ave     15.964   15.928
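For anyone who wants to repeat the measurement without the shell's time command, here is a minimal sketch that times the copy loop from inside the program with clock(). It is not part of as.c: the names copy_wstr and time_copies are made up for this example, and clock() reports processor time for the whole process, which for a loop like this should be close to, but is not the same thing as, the user time shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef unsigned short WCHAR;

/* Plain C copy loop, same shape as the non-ASM branch of strcpyW. */
static WCHAR *copy_wstr(WCHAR *dst, const WCHAR *src)
{
    WCHAR *p = dst;
    while ((*p++ = *src++))
        ;
    return dst;
}

/* Hypothetical helper: run `iterations` copies of src into dst and
 * return the processor time spent, in seconds. */
static double time_copies(WCHAR *dst, const WCHAR *src, long iterations)
{
    /* Reading the result through a volatile makes it less likely that
     * the optimizer throws the copies away as dead code. */
    volatile WCHAR sink = 0;
    clock_t start, end;
    long i;

    assert(dst != NULL && src != NULL && iterations >= 0);

    start = clock();
    for (i = 0; i < iterations; i++)
        sink = copy_wstr(dst, src)[0];
    end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copies(d, s, 1000000) with the two buffers from main() would give one number per run, which makes it easier to collect more than three samples.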
Notes:
- tested on a PII 450 MHz;
- I tested with gcc 2.95 and 3.4.2 as well; the results are essentially the same;
- size of main() is 0x7a (assembly) vs 0x82 (C code) bytes;
- I experimented with longer strings to see if there were any memory cache hit/miss effects and found none.
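When trying other string lengths it helps to know exactly how long the source string is. Since memset counts bytes rather than WCHARs, a small helper that fills the buffer element by element removes any doubt. The sketch below is only an illustration: make_wstr and wstr_len are hypothetical names, not Wine functions, and the lengths actually used in the experiment above are not recorded here.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned short WCHAR;

/* Hypothetical helper: allocate a string of exactly `len` characters,
 * every one equal to `fill`, followed by a terminating zero.
 * Returns NULL if the allocation fails; the caller frees the result. */
static WCHAR *make_wstr(size_t len, WCHAR fill)
{
    WCHAR *s;
    size_t i;

    assert(fill != 0);   /* a zero fill would terminate the string early */

    s = malloc((len + 1) * sizeof(WCHAR));
    if (!s)
        return NULL;
    for (i = 0; i < len; i++)
        s[i] = fill;     /* one WCHAR at a time, not one byte */
    s[len] = 0;
    return s;
}

/* Count the characters before the terminating zero. */
static size_t wstr_len(const WCHAR *s)
{
    const WCHAR *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

With this, a source string of a given length could be built with make_wstr(n, 'x') and checked with wstr_len() before timing the copy.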
Conclusions:
1. these routines are so fast that it is hard to imagine them ever becoming a bottleneck that would justify such an optimization;
2. nothing here shows that inline assembly brings any advantage.
Rein.