Getting Started with Go Assembly

This article is a translation of Chapter I: A Primer on Go Assembly, with some additions.

This article assumes that readers:

- understand the syntax of the Go language
- understand, in general terms, how the stack behaves when a subroutine is called

Environment

$ go version
go version go1.10 linux/amd64

Pseudo assembly

The assembly that the Go compiler outputs is an abstraction; it does not map directly onto the actual hardware. The Go assembler then translates this pseudo-assembly into machine code for the target hardware.

It might be helpful to imagine something like Java bytecode.

The biggest advantage of having such an intermediate layer is that it makes it easier to port Go to new architectures. For more information, see [*The Design of the Go Assembler*](https://talks.golang.org/2016/asm.slide#1) by Rob Pike.

The most important thing to know about Go assembly is that it does not correspond directly to the target hardware. Some details map precisely onto the machine, but others do not. This is because the compiler toolchain needs no separate assembler pass in its usual pipeline. Instead, the compiler works on a semi-abstract instruction set, and instruction selection (the translation from Go assembly to real machine instructions) happens partly after code generation (the compiler's emission of Go assembly). For example, a MOV instruction in Go assembly may end up as a clear or a load instruction, or it may correspond exactly to the machine instruction of that name, depending on the architecture. In general, hardware-specific instructions tend to appear as themselves, while common architectural concepts such as memory moves and subroutine calls and returns are more abstract.

The Go assembler is the program that parses this pseudo-assembly and turns it into instructions that are fed to the linker.

Example using a simple program

Consider the following code.

package main

//go:noinline
func add(a, b int32) (int32, bool) { return a + b, true }

func main() { add(10, 32) }

(The //go:noinline directive prevents the compiler from inlining add.)

Let's compile this code to assembly.

$ GOOS=linux GOARCH=amd64 go tool compile -S direct_topfunc_call.go

0x0000 TEXT		"".add(SB), NOSPLIT, $0-16
  0x0000 FUNCDATA	$0, gclocals·f207267fbf96a0178e8758c6e3e0ce28(SB)
  0x0000 FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
  0x0000 MOVL		"".b+12(SP), AX
  0x0004 MOVL		"".a+8(SP), CX
  0x0008 ADDL		CX, AX
  0x000a MOVL		AX, "".~r2+16(SP)
  0x000e MOVB		$1, "".~r3+20(SP)
  0x0013 RET

0x0000 TEXT		"".main(SB), $24-0
  ;; ...omitted stack-split prologue...
  0x000f SUBQ		$24, SP
  0x0013 MOVQ		BP, 16(SP)
  0x0018 LEAQ		16(SP), BP
  0x001d FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
  0x001d FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
  0x001d MOVQ		$137438953482, AX
  0x0027 MOVQ		AX, (SP)
  0x002b PCDATA		$0, $0
  0x002b CALL		"".add(SB)
  0x0030 MOVQ		16(SP), BP
  0x0035 ADDQ		$24, SP
  0x0039 RET
  ;; ...omitted stack-split epilogue...

Dissecting add

0x0000 TEXT "".add(SB), NOSPLIT, $0-16

- 0x0000: The offset of the instruction, relative to the start of the function.

- TEXT "".add: The TEXT directive declares that the "".add symbol lives in the .text section and that the instructions that follow form the body of this function. The empty string "" is replaced with the name of the current package at link time, so here the symbol becomes main.add.

- (SB): SB is a virtual register defined by Go assembly that holds the "static-base" pointer, which represents the beginning of the program's address space. "".add(SB) says that the "".add symbol lives at a constant offset, computed by the linker, from the start of the address space. In other words, it is a global function with a fixed address. You can see this clearly with `objdump`.

$ objdump -j .text -t direct_topfunc_call | grep 'main.add'
000000000044d980 g     F .text	000000000000000f main.add

objdump supplement

- -j .text: display only the .text section
- -t: display the symbol table
- 000000000044d980 g F .text 000000000000000f main.add: a global function symbol named main.add is located at address 0x44d980

All user-defined symbols are written as offsets from the pseudo-registers FP (arguments and locals) and SB (globals). Since the pseudo-register SB can be thought of as the origin of memory, the symbol foo(SB) can be read as the name foo used as an address in memory.
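The link-time renaming of "".add to main.add can also be observed from inside a running program. The following is a minimal sketch, not part of the original article: symbolName is a hypothetical helper, and it relies on reflect and runtime.FuncForPC to map the function's entry address back to its symbol name.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

//go:noinline
func add(a, b int32) (int32, bool) { return a + b, true }

// symbolName (a hypothetical helper) returns the linked symbol name of the
// function value f by looking up its entry address.
func symbolName(f interface{}) string {
	pc := reflect.ValueOf(f).Pointer() // entry address of the function
	return runtime.FuncForPC(pc).Name()
}

func main() {
	pc := reflect.ValueOf(add).Pointer()
	// Prints the package-qualified symbol name and its address.
	fmt.Printf("%s is at %#x\n", symbolName(add), pc)
}
```

When built as package main, the name printed is main.add; the address varies from build to build.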

- NOSPLIT: Tells the compiler not to insert the *stack-split* preamble, which checks whether the current stack needs to be grown. The `add` function has no local variables and needs no stack frame of its own, so it cannot outgrow the current stack, and running the check on every call would waste CPU cycles. The compiler works this out by itself and sets the `NOSPLIT` flag automatically. Stack growth is covered later, in the section on goroutine stack management.

- $0-16: $0 is the size in bytes of the stack frame allocated for this function, and 16 is the size in bytes of the arguments and return values passed in by the caller. (Three int32 values take 12 bytes, and the bool is padded from 1 byte to 4 for alignment, giving 16 bytes in total.)

In the general case, the frame size is followed by the argument size, separated by a minus sign. (The minus sign is only a separator; it does not denote subtraction.) For example, $24-8 states that the function has a 24-byte stack frame and is called with 8 bytes of arguments, which live in the caller's stack frame. If NOSPLIT is not specified for the TEXT, the argument size must be provided. For assembly functions that have a Go prototype, go vet checks that the argument size is correct.
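The 16-byte argument size can be reproduced with a small model. This is only a sketch: addArgs is a hypothetical struct that stands in for the argument area of add, and it happens to match the compiler's layout for this particular signature; the compiler itself does not define such a type.

```go
package main

import (
	"fmt"
	"unsafe"
)

// addArgs is a hypothetical stand-in for the argument area of
// func add(a, b int32) (int32, bool).
type addArgs struct {
	a, b int32 // arguments
	r2   int32 // first return value
	r3   bool  // second return value, followed by 3 bytes of padding
}

// argSize reports the size in bytes of the modeled argument area.
func argSize() uintptr { return unsafe.Sizeof(addArgs{}) }

func main() {
	// Frame size 0, argument size taken from the model.
	fmt.Printf("$0-%d\n", argSize())
}
```

The struct is padded to a multiple of its 4-byte alignment, which is where the 3 extra bytes after the bool come from.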

0x0000 FUNCDATA $0, gclocals·f207267fbf96a0178e8758c6e3e0ce28(SB)
0x0000 FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)

The FUNCDATA and PCDATA directives contain information for use by the GC.

0x0000 MOVL "".b+12(SP), AX
0x0004 MOVL "".a+8(SP), CX

Go's calling convention (as of Go 1.10, the version used here) requires every argument to be passed on the stack, in space reserved in advance in the caller's stack frame. It is therefore the caller's responsibility to grow and shrink the stack appropriately, so that arguments can be passed to the callee and return values can be passed back to the caller.

The Go compiler never generates PUSH or POP instructions. Instead, the stack is grown or shrunk by subtracting from or adding to SP, the register that points to the top of the stack.

[UPDATE: This matter is discussed in issue #21: about SP register.]

The pseudo-register SP is a virtual stack pointer used to refer to local variables and to the arguments being prepared for function calls. Since it points to the highest address of the local stack frame, references use negative offsets in the range [−framesize, 0), e.g. x-8(SP), y-4(SP).

The official documentation states that user-defined symbols are written as offsets from the FP register, but this is not true of compiler-generated code. Modern Go compilers always refer to arguments and local variables as offsets from the stack pointer. This frees FP for use as an additional general-purpose register on platforms with few registers, such as x86. See *Stack frame layout on x86-64* for more information. [UPDATE: This matter is discussed in issue #2: Frame pointer.]

"".b+12(SP) and "".a+8(SP) refer to the addresses 12 and 8 bytes above the top of the stack, respectively. (Remember that the stack grows downward, from higher addresses toward lower ones.)

.a and .b are arbitrary aliases given to the referenced locations. The names have no effect on what the code does, but they are mandatory when using relative addressing on virtual registers.

The documentation for FP, which is a pseudo-frame pointer, says:

FP is a virtual frame pointer used to refer to function arguments. The compilers maintain this virtual frame pointer and refer to the arguments on the stack as offsets from it. On a 64-bit architecture, 0(FP) is therefore the first argument of the function and 8(FP) is the second. However, when referring to an argument this way, you must put a name in front, as in first_arg+0(FP) and second_arg+8(FP). (Here the offset is an offset from the frame pointer, unlike with SB, where it is an offset from the symbol.) The assembler enforces this convention and rejects unnamed forms such as 0(FP) and 8(FP). The actual name is semantically irrelevant, but it should be used to document the argument's name.

Finally, there are two important points.

- The first argument `a` is placed at `8(SP)` rather than `0(SP)`. This is because the `CALL` instruction executed by the caller stores the return address at `0(SP)`.
- Arguments are laid out in reverse order; that is, the first argument is the closest to the top of the stack.

0x0008 ADDL CX, AX
0x000a MOVL AX, "".~r2+16(SP)
0x000e MOVB $1, "".~r3+20(SP)

`ADDL` adds two long-words (4-byte values) and stores the result in the second operand. Here, `AX` and `CX` are added and the sum is stored in `AX`. The sum is then moved to "".~r2+16(SP), the slot that the caller reserved on the stack to receive the return value. Again, the name "".~r2 has no semantic meaning.

Since Go supports multiple return values, this example also returns the constant true. The mechanism is the same as for the first return value; only the offset differs, and the value is stored at "".~r3+20(SP).

0x0013 RET

The final pseudo-instruction, RET, tells the Go assembler to insert whatever instruction the target hardware needs to return from a subroutine. In most cases, this pops the return address stored at 0(SP) and jumps to it.

The last instruction in a TEXT block must be some kind of jump, usually a RET (pseudo-)instruction. If it is not, the linker appends a jump-to-itself instruction, so that execution never falls through past the end of the TEXT block.

We have covered a lot of syntax and semantics, so here is a brief summary.

;; Declare the global function symbol "".add (main.add once linked)
;; Do not insert the stack-split preamble
;; 0-byte stack frame, 16 bytes of arguments passed in
;; func add(a, b int32) (int32, bool)
0x0000 TEXT	"".add(SB), NOSPLIT, $0-16
  ;; ...omitted FUNCDATA stuff...
  0x0000 MOVL	"".b+12(SP), AX	    ;; move the second argument (b) from the caller's stack frame to AX
  0x0004 MOVL	"".a+8(SP), CX	    ;; move the first argument (a) from the caller's stack frame to CX
  0x0008 ADDL	CX, AX		          ;; AX = CX + AX
  0x000a MOVL	AX, "".~r2+16(SP)   ;; move the sum held in AX to the caller's stack frame
  0x000e MOVB	$1, "".~r3+20(SP)   ;; move the constant `true` to the caller's stack frame
  0x0013 RET			                ;; jump to the return address stored at 0(SP)

The stack looks like this once main.add has finished executing:

   |    +-------------------------+ <-- 32(SP)              
   |    |                         |                         
 G |    |                         |                         
 R |    |                         |                         
 O |    | main.main's saved       |                         
 W |    |     frame-pointer (BP)  |                         
 S |    |-------------------------| <-- 24(SP)              
   |    |      [alignment]        |                         
 D |    | "".~r3 (bool) = 1/true  | <-- 21(SP)              
 O |    |-------------------------| <-- 20(SP)              
 W |    |                         |                         
 N |    | "".~r2 (int32) = 42     |                         
 W |    |-------------------------| <-- 16(SP)              
 A |    |                         |                         
 R |    | "".b (int32) = 32       |                         
 D |    |-------------------------| <-- 12(SP)              
 S |    |                         |                         
   |    | "".a (int32) = 10       |                         
   |    |-------------------------| <-- 8(SP)               
   |    |                         |                         
   |    |                         |                         
   |    |                         |                         
 \ | /  | return address to       |                         
  \|/   |     main.main + 0x30    |                         
   -    +-------------------------+ <-- 0(SP) (TOP OF STACK)

(diagram made with https://textik.com)

Dissecting main

Let's review the contents of the main function again.

func main() { add(10, 32) }

0x0000 TEXT		"".main(SB), $24-0
  ;; ...omitted stack-split prologue...
  0x000f SUBQ		$24, SP
  0x0013 MOVQ		BP, 16(SP)
  0x0018 LEAQ		16(SP), BP
  ;; ...omitted FUNCDATA stuff...
  0x001d MOVQ		$137438953482, AX
  0x0027 MOVQ		AX, (SP)
  ;; ...omitted PCDATA stuff...
  0x002b CALL		"".add(SB)
  0x0030 MOVQ		16(SP), BP
  0x0035 ADDQ		$24, SP
  0x0039 RET
  ;; ...omitted stack-split epilogue...

0x0000 TEXT "".main(SB), $24-0

Nothing new compared with the `add` function: main has a 24-byte stack frame, takes no arguments, and returns no values.

0x000f SUBQ     $24, SP
0x0013 MOVQ     BP, 16(SP)
0x0018 LEAQ     16(SP), BP

Once again, Go's calling convention passes all function arguments through the stack.

By subtracting 24 from SP, main reserves a 24-byte stack frame for itself. (Remember that the stack grows downward.)

These 24 reserved bytes are used as follows.

- 8 bytes (16(SP)-24(SP)) store the current value of the frame pointer BP. This makes it possible to unwind the stack (walk back up the chain of callers), which is useful when debugging. (MOVQ BP, 16(SP))
- 1+3 bytes (12(SP)-16(SP)) are reserved to receive the second return value of the `add` function (the bool takes 1 byte, plus 3 bytes of alignment padding on `amd64`).
- 4 bytes (8(SP)-12(SP)) are reserved to receive the first return value of the `add` function (`int32`).
- 4 bytes (4(SP)-8(SP)) are reserved for the value of `add`'s argument b (`int32`).
- 4 bytes (0(SP)-4(SP)) are reserved for the value of `add`'s argument a (`int32`).
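The layout just described can be written down as data and checked for consistency. The sketch below is an illustration only: slot, mainFrame and frameSize are hypothetical names, and the offsets are copied from the list above rather than obtained from the compiler.

```go
package main

import "fmt"

// slot describes one region of main's stack frame, as listed in the text.
type slot struct {
	name   string
	offset int // offset from SP after SUBQ $24, SP
	size   int // bytes, including any alignment padding
}

// mainFrame returns the regions from the lowest address upward.
func mainFrame() []slot {
	return []slot{
		{"a (int32)", 0, 4},
		{"b (int32)", 4, 4},
		{"~r2 (int32)", 8, 4},
		{"~r3 (bool) + padding", 12, 4},
		{"saved BP", 16, 8},
	}
}

// frameSize checks that the slots are contiguous and returns their total size.
func frameSize(slots []slot) (int, error) {
	next := 0
	for _, s := range slots {
		if s.offset != next {
			return 0, fmt.Errorf("gap or overlap at %s", s.name)
		}
		next += s.size
	}
	return next, nil
}

func main() {
	size, err := frameSize(mainFrame())
	fmt.Println(size, err)
}
```

The slots add up to 24 bytes with no gaps, which matches the $24 in the TEXT directive of main.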

Finally, following the stack allocation, LEAQ computes the new address of the frame pointer and stores it in BP. (That is, BP = SP + 16; LEAQ behaves like the x86 lea instruction, which computes an address without reading memory.)

0x001d MOVQ     $137438953482, AX
0x0027 MOVQ     AX, (SP)

The caller places the arguments for the callee at the top of the stack as a single 8-byte quad-word. The value may look meaningless at first glance, but 137438953482 is simply the 4-byte values 10 and 32 concatenated into one 8-byte value.

$ echo 'obase=2;137438953482' | bc
10000000000000000000000000000000001010
\____/\______________________________/
   32                              10

Bits 32-63 of 137438953482 hold 100000 (32), and bits 0-31 hold 00000000000000000000000000001010 (10).
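The same packing can be reproduced in Go. In this sketch, pack and unpack are hypothetical helpers written for illustration; the compiler performs the equivalent computation at compile time and emits only the resulting constant.

```go
package main

import "fmt"

// pack places a in bits 0-31 and b in bits 32-63 of a single quad-word.
func pack(a, b int32) uint64 {
	return uint64(uint32(b))<<32 | uint64(uint32(a))
}

// unpack recovers the two 32-bit values from the quad-word.
func unpack(q uint64) (a, b int32) {
	return int32(uint32(q)), int32(uint32(q >> 32))
}

func main() {
	q := pack(10, 32)
	a, b := unpack(q)
	fmt.Println(q, a, b)
}
```

Running it prints 137438953482 10 32. Because amd64 is little-endian, storing this quad-word at (SP) puts 10 at 0(SP) and 32 at 4(SP).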

0x002b CALL     "".add(SB)

We call the `add` function with the `CALL` instruction, addressing it as an offset relative to SB; in other words, this is a plain jump to a global function symbol.

Note that CALL pushes the 8-byte return address onto the top of the stack, so every SP-relative reference inside the add function is shifted by 8 bytes. For example, "".a is found at 8(SP) rather than 0(SP).

0x0030 MOVQ     16(SP), BP
0x0035 ADDQ     $24, SP
0x0039 RET

Finally, the main function finishes by doing the following:

- restoring the frame pointer to its previous value (MOVQ 16(SP), BP)
- releasing the 24 bytes of stack that it reserved (`ADDQ $24, SP`)
- returning (RET)

Taken together, what `add` and `main` do is an ordinary subroutine call.

Goroutine stack management

Whenever you look at assembly generated for Go programs, you will come across instructions for stack management.

To help you recognize these patterns quickly, let's look at what they do and why.

Stacks

The number of goroutines in a Go program varies with the situation; real-world programs can have millions of them. To avoid running out of memory, Go's runtime takes a conservative approach when allocating goroutine stacks: every goroutine initially gets 2KB of stack space from the runtime. (Behind the scenes, this stack is actually allocated on the heap.)

As a goroutine runs, it may need more memory than the initial 2KB. If nothing were done, it would run past the end of its stack and corrupt other memory regions. To prevent such a stack overflow, when a goroutine is about to outgrow its stack, the runtime allocates a new stack twice as large as the old one and copies the contents of the old stack into it. This process is called *stack-split*, and it lets goroutine stacks be sized efficiently and dynamically.

Splits

For *stack-split* to work, the compiler inserts a few instructions at the beginning and end of every function that could overflow the stack, in order to check for overflow. As we saw earlier, these checks are wasted work for functions that cannot outgrow the stack, so NOSPLIT tells the compiler not to insert them.

I omitted the *stack-split* code from the main function above, so let's look at it now.

0x0000 TEXT	"".main(SB), $24-0
  ;; stack-split prologue
  0x0000 MOVQ	(TLS), CX
  0x0009 CMPQ	SP, 16(CX)
  0x000d JLS	58

  0x000f SUBQ	$24, SP
  0x0013 MOVQ	BP, 16(SP)
  0x0018 LEAQ	16(SP), BP
  ;; ...omitted FUNCDATA stuff...
  0x001d MOVQ	$137438953482, AX
  0x0027 MOVQ	AX, (SP)
  ;; ...omitted PCDATA stuff...
  0x002b CALL	"".add(SB)
  0x0030 MOVQ	16(SP), BP
  0x0035 ADDQ	$24, SP
  0x0039 RET

  ;; stack-split epilogue
  0x003a NOP
  ;; ...omitted PCDATA stuff...
  0x003a CALL	runtime.morestack_noctxt(SB)
  0x003f JMP	0

- At the beginning of the function (prologue), the code checks whether the goroutine has used up its stack; if so, it jumps to the end of the function (epilogue).
- At the end of the function (epilogue), the code triggers the stack-growth process and then jumps back to the beginning of the function (prologue).

Note that this prologue and epilogue will continue to loop until the stack size is large enough.
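The loop between prologue and epilogue can be sketched as a toy model in Go. Everything here is an assumption made for illustration: toyG, guard, morestack and call are hypothetical, the stack is doubled in place instead of being copied, and the frame size is folded into the comparison so that several rounds of growth can be observed. It is not the runtime's implementation.

```go
package main

import "fmt"

// toyG is a hypothetical, heavily simplified goroutine stack. It is not the
// runtime's g structure.
type toyG struct {
	lo, hi uintptr // stack bounds; the stack occupies [lo, hi)
	sp     uintptr // current stack pointer, moving from hi toward lo
}

// guard is an arbitrary safety margin above lo, chosen for illustration.
const guard = 128

// morestack doubles the stack size by lowering lo. The real runtime copies
// the stack into a new allocation; keeping hi fixed keeps this toy short.
func (g *toyG) morestack() {
	g.lo = g.hi - 2*(g.hi-g.lo)
}

// call models entering a function that needs frame bytes of stack. It
// returns how many times the epilogue had to grow the stack first.
func (g *toyG) call(frame uintptr) (grows int) {
	for g.sp-frame <= g.lo+guard { // prologue: is there enough room?
		g.morestack() // epilogue: grow the stack ...
		grows++       // ... and jump back to the prologue
	}
	g.sp -= frame // room is available: allocate the frame
	return grows
}

func main() {
	// Start with a 2KB stack whose highest address is 1<<20.
	g := &toyG{lo: 1<<20 - 2048, hi: 1 << 20, sp: 1 << 20}
	grows := g.call(10000)
	fmt.Println(grows, g.hi-g.lo)
}
```

With a 2KB starting stack and a 10000-byte frame, the model grows the stack three times, to 16384 bytes, before the call proceeds.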

Prologue

0x0000 MOVQ	(TLS), CX   ;; store current *g in CX
0x0009 CMPQ	SP, 16(CX)  ;; compare SP and g.stackguard0
0x000d JLS	58	        ;; jumps to 0x3a if SP <= g.stackguard0

TLS is a virtual register maintained by the runtime that holds a pointer to the current g, the data structure that tracks all the state of a goroutine.

Let's check the definition of g from the runtime source code.

type g struct {
	stack       stack   // 16 bytes
	// stackguard0 is the stack pointer compared against in the prologue.
	// It is normally stack.lo+StackGuard, but can be StackPreempt to trigger a preemption.
	// (Preemption: in a multitasking system, temporarily suspending a running task.)
	stackguard0 uintptr
	stackguard1 uintptr

	// ...omitted dozens of fields...
}

Since g.stack is 16 bytes, 16(CX) is g.stackguard0. This is the stack threshold maintained by the runtime; comparing it with the stack pointer shows whether the goroutine has run out of stack space.

The stack grows toward lower addresses, so SP <= stackguard0 means that the stack space is used up. In that case, the prologue jumps to the epilogue.
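The offset 16 and the comparison can be checked with a small model. This sketch mirrors only the leading fields of g quoted above; stackguard0Offset and needsMoreStack are hypothetical helpers, and the offset is 16 only on a 64-bit platform, where uintptr is 8 bytes.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stack and g mirror only the leading fields quoted above; the real
// runtime.g has many more fields after these.
type stack struct {
	lo, hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// stackguard0Offset is the byte offset of stackguard0 within the model,
// which the prologue encodes as 16(CX) on amd64.
func stackguard0Offset() uintptr { return unsafe.Offsetof(g{}.stackguard0) }

// needsMoreStack is the comparison performed by CMPQ SP, 16(CX) and JLS.
func needsMoreStack(sp, stackguard0 uintptr) bool { return sp <= stackguard0 }

func main() {
	fmt.Println(stackguard0Offset())
	fmt.Println(needsMoreStack(0x1000, 0x1000)) // SP has reached the guard
	fmt.Println(needsMoreStack(0x2000, 0x1000)) // SP is still above the guard
}
```

The first call to needsMoreStack reports true because SP equals the guard, and the second reports false because SP is still above it.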

Epilogue

0x003a NOP
0x003a CALL	runtime.morestack_noctxt(SB)
0x003f JMP	0

The epilogue is simple: it calls the runtime's stack-growth function to grow the stack and then jumps back to the prologue.

The NOP just before the CALL exists so that the prologue does not jump directly onto a CALL instruction. On some platforms, doing so can cause problems that would take a fairly deep explanation, so I will skip the details here; placing a NOP right before the CALL and jumping to that NOP instead is a common practice. [UPDATE: This matter is discussed in issue #4: Clarify "nop before call" paragraph.]

This has covered only the tip of the iceberg.

The stack-growth mechanism is too detailed and complex to explain here, so I would like to give it a dedicated chapter if I get the chance.

Summary

In this chapter, I tried to explain Go assembly using a simple example.

We'll dig deeper into Go's internal implementation in the remaining chapters.