Recently I came across a constrained C programming puzzle in a tweet: https://twitter.com/reiya_2200/status/1130761526959267841.
The first answer that comes to mind is recursion on main, but that is not very interesting, so I tried embedding machine code instead and am writing it up here.
As a prerequisite, everything here assumes x86_64 Linux + gcc, and the environment used for reproduction is Ubuntu 18 (WSL on Windows 10). **Even on Linux, behavior can change significantly from one environment to another.** Please keep that in mind.
What I tweeted was **the following code, with the machine code embedded in the most straightforward way**.
emb1.c
#include <stdio.h>
int main(int argc,char *argv[]) {
static char __attribute__((section(".text"))) s[]="1\xc0H\x83\xc7\bH\x8b\27H\x85\xd2t\t\xf\xbe\22\215D\20\xd0\xeb\xeb\xc3";
return !printf("%d\n",((int(*)(void*))s)(argv));
}
When executed, it correctly prints the sum of the one-digit numbers given as arguments, as shown below.
Compile / execute
$ gcc emb1.c
/tmp/ccfyvQPW.s: Assembler messages:
/tmp/ccfyvQPW.s:3: Warning: ignoring changed section attributes for .text
$ ./a.out 4 6 4 9
23
Those who are used to this kind of thing will not need it, but I will explain anyway.
The approach is to implement the part that computes the sum as a function and turn that function into machine code.
A straightforward implementation looks like this:
sum.c
int sum(void *pv) {
char **pc=pv;
int s=0;
while ( *++pc ) {
s+=**pc-'0';
}
return s;
}
The argument `pv` is assumed to receive `argv`. The `char *` array that `argv` points to is terminated by a null pointer, so it can in fact be processed without `argc`.
Compiling this and inspecting the machine code gives the following:
Compile / disassemble
$ gcc -c -Os -fno-asynchronous-unwind-tables -fno-stack-protector sum.c && objdump -SCr sum.o
sum.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <sum>:
0: 31 c0 xor %eax,%eax
2: 48 83 c7 08 add $0x8,%rdi
6: 48 8b 17 mov (%rdi),%rdx
9: 48 85 d2 test %rdx,%rdx
c: 74 09 je 17 <sum+0x17>
e: 0f be 12 movsbl (%rdx),%edx
11: 8d 44 10 d0 lea -0x30(%rax,%rdx,1),%eax
15: eb eb jmp 2 <sum+0x2>
17: c3 retq
This means that the machine code of the function is the byte sequence 31,c0,48,83,... in hexadecimal. Any byte that falls in the ASCII range can be written as a printable character, so as a string literal it becomes "1\xc0H\x83\xc7\bH\x8b\27H\x85\xd2t\t\xf\xbe\22\215D\20\xd0\xeb\xeb\xc3".
Note that, to keep the code short, gcc is given `-Os` (size optimization) along with options that stop extra processing from being inserted into the code. What I really wanted was an option that **makes the code as short as possible when written as ASCII characters**, but no such option exists, so it cannot be helped.
So now we have the machine code for the function.
It can be embedded in the source as an ordinary string.
All that remains is to treat the address of the string as the address of a function and call through it. In the `printf` call this is `((int(*)(void*))s)(argv)`: `int(*)(void*)` is the type of this function's address (a pointer to a function that takes a `void *` argument and returns an `int`), and `s` is cast to it.
emb1.c(Repost)
#include <stdio.h>
int main(int argc,char *argv[]) {
static char __attribute__((section(".text"))) s[]="1\xc0H\x83\xc7\bH\x8b\27H\x85\xd2t\t\xf\xbe\22\215D\20\xd0\xeb\xeb\xc3";
return !printf("%d\n",((int(*)(void*))s)(argv));
}
However, there are two points to note, both concerning where the string `s` is placed.
- Specify `static` to prevent it from being placed in the stack area.
- Specify the attribute so that it is placed in an executable region of the ELF file (the `.text` section).
Nowadays there is a mechanism that prevents unexpected memory areas from being executed as code. Without the two specifications above, trying to execute the machine code results in a SEGV. This is necessary knowledge for cases like this one, where the compiler cannot determine by itself that the data is executable code.
Still, I am a little dissatisfied with the code above, because the attribute exposes the ELF section structure.
Is there a smarter way to embed machine code? With that in mind, I wrote the following code.
emb2.c
#include <stdio.h>
int main(int argc,char *argv[]) {
asm volatile(".string \"H\\x83\\xc6\\bH\\x8b\\x6H\\x85\\xc0t\\xf\\xf\\xbe\\0\\x8d|\\x7\\xcf\\x89|$\\f\\xeb\\xe7\\xb0\"");
return !printf("%d\n",argc-1);
}
You can see that it still works.
Compile / execute
$ gcc emb2.c
$ ./a.out 4 6 4 9
23
The trick is simple, of course. If you give a `.string` directive to an `asm volatile` statement, which embeds assembly code in C code, **the machine code corresponding to the string is embedded at that location**.
Let's disassemble the `a.out` produced by compilation and check.
Disassemble
$ objdump -SCr a.out | sed -ne '/<main>:$/,+20p'
000000000000064a <main>:
64a: 55 push %rbp
64b: 48 89 e5 mov %rsp,%rbp
64e: 48 83 ec 10 sub $0x10,%rsp
652: 89 7d fc mov %edi,-0x4(%rbp)
655: 48 89 75 f0 mov %rsi,-0x10(%rbp)
659: 48 83 c6 08 add $0x8,%rsi
65d: 48 8b 06 mov (%rsi),%rax
660: 48 85 c0 test %rax,%rax
663: 74 0f je 674 <main+0x2a>
665: 0f be 00 movsbl (%rax),%eax
668: 8d 7c 07 cf lea -0x31(%rdi,%rax,1),%edi
66c: 89 7c 24 0c mov %edi,0xc(%rsp)
670: eb e7 jmp 659 <main+0xf>
672: b0 00 mov $0x0,%al
674: 8b 45 fc mov -0x4(%rbp),%eax
677: 83 e8 01 sub $0x1,%eax
67a: 89 c6 mov %eax,%esi
67c: 48 8d 3d a1 00 00 00 lea 0xa1(%rip),%rdi # 724 <_IO_stdin_used+0x4>
683: b8 00 00 00 00 mov $0x0,%eax
688: e8 93 fe ff ff callq 520 <printf@plt>
Here, the embedded part runs from 48,83,c6,08 at address 659 through b0,00 at address 672. The instruction at address 672 is never actually executed; it is there as an instruction ending in 00, so that it absorbs the NUL byte (00) that `.string` appends to the embedded string.
Now, about this embedded code.
It simply implements the processing shown below. In other words, the idea is to use `argc` itself as the running total. Because `argc` starts at the number of arguments plus one, the loop subtracts 1 for each argument, and the final `printf` subtracts 1 more to compensate.
Original code
#include <stdio.h>
int main(int argc,char *argv[]) {
while ( *++argv ) { argc+=**argv-'0'-1; }
return !printf("%d\n",argc-1);
}
By the way, under the x86_64 Linux calling convention, the first argument `argc` is passed in the rdi register (its 32-bit part is edi) and the second argument `argv` is passed in the rsi register. So the loop can operate on those registers directly.
Applicable assembly
659: add $0x8,%rsi #Advance argv by one element
65d: mov (%rsi),%rax #Load argv elements into rax
660: test %rax,%rax #Null pointer judgment
663: je 674 #Jump to immediately after the embedded part when a NULL pointer is detected
665: movsbl (%rax),%eax #Read characters into eax
668: lea -0x31(%rdi,%rax,1),%edi #Add to argc (edi)
66c: mov %edi,0xc(%rsp) #Save edi to stack area
670: jmp 659 #Jump to the beginning of the embedded part
672: mov $0x0,%al #Dummy instruction
However, without optimization, the `argc` used by `printf` turned out to be the value previously saved on the stack rather than the register itself. The instruction at address 66c therefore writes the updated edi register back to the stack.
Conversely, with optimization enabled, no stack slot is reserved for saving `argc`, so the instruction at address 66c corrupts the stack. The output is still as intended, but be aware that the program then segfaults on leaving `main`.
Specify optimization options
$ gcc -O3 emb2.c
$ ./a.out 4 6 4 9
23
Segmentation fault (core dumped)
What did you think? I figured this code would give a taste of what it feels like to be a corrupted C programmer at around level -10 to level -9, so I wrote it up.
Finally, I will also leave the recursion on main here, which is probably the solution the person who posed the puzzle expected (since I did manage to put one together!).
Assumed solution? code
#include <stdio.h>
int main(int argc,char *argv[]) {
return argc>0 ? !printf("%d\n",main(0,argv+1)) : *argv ? **argv-'0'+main(0,argv+1) : 0;
}