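Go Plan 9 Assembly

To understand the internals of Go in depth, you need at least a reading knowledge of Go assembly.

Go assembly uses the syntax of the Plan 9 assemblers, which come from Plan 9, the operating system developed at Bell Labs as a successor to Unix. The compiler first emits this semi-abstract (pseudo) assembly, and the toolchain then translates it into instructions for the specific hardware platform. This is part of what makes Go portable across platforms.

Getting started

Compile the following code with GOOS=linux GOARCH=amd64 go tool compile -S main.go > r.S: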

package main
 
func add(a, b int32) (int32, bool) {
	return a + b, true
}
 
func main() {
	add(10, 32)
}

Dissecting the add function

"".add STEXT nosplit size=20 args=0x10 locals=0x0
	0x0000 00000 (x.go:4)	TEXT	"".add(SB), NOSPLIT|ABIInternal, $0-16
	0x0000 00000 (x.go:4)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (x.go:4)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (x.go:4)	FUNCDATA	$3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (x.go:5)	PCDATA	$2, $0
	0x0000 00000 (x.go:5)	PCDATA	$0, $0
	0x0000 00000 (x.go:5)	MOVL	"".b+12(SP), AX
	0x0004 00004 (x.go:5)	MOVL	"".a+8(SP), CX
	0x0008 00008 (x.go:5)	ADDL	CX, AX
	0x000a 00010 (x.go:5)	MOVL	AX, "".~r2+16(SP) // store the result in the caller's stack frame
	0x000e 00014 (x.go:5)	MOVB	$1, "".~r3+20(SP)
	0x0013 00019 (x.go:5)	RET

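Hard to read? No rush, let's go through it line by line:

0x0000 (x.go:4)	TEXT	"".add(SB), NOSPLIT|ABIInternal, $0-16

0x0000: the offset of the current instruction relative to the start of the function.
TEXT "".add: the TEXT directive declares the "".add symbol as part of the .text section (a process image has several sections) and indicates that the instructions that follow are the body of the function.
"".add(SB): SB is the static base pseudo-register. This declares that our function symbol add sits at some constant offset from the start of the process address space (the address is computed by the linker). In other words, a global function symbol has an absolute, direct address.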

You can confirm the address of the function with objdump:

GOOS=linux GOARCH=amd64 go build -gcflags -S x.go
objdump -j .text -t x | grep 'main'
 
000000000044f970 g     F .text		 00000018 main.add

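NOSPLIT: tells the compiler not to insert the stack-split preamble, which checks whether the current stack needs to grow. For our add function the compiler set the flag by itself: the function has no local variables and no stack frame of its own, so it cannot outgrow the current stack, and running that check on every call would waste CPU cycles.
$0-16: $0 is the size in bytes of the stack frame to allocate, and 16 is the size in bytes of the arguments and results passed through the caller's frame (two int32 arguments, one int32 result, and one bool result padded to 4 bytes). As another example, $24-8 means the function has a 24-byte frame and 8 bytes of arguments that live in the caller's frame. If NOSPLIT is not set, the argument size must be provided.
FUNCDATA: metadata for the garbage collector.
PCDATA: metadata for the garbage collector.
MOVL "".b+12(SP), AX: in the calling convention shown in this listing, every argument is passed on the stack, in space reserved by the caller. The caller is responsible for allocating that space, and any return values are written back into it. (This is the stack-based convention of toolchains older than Go 1.17; since Go 1.17 the compiler passes arguments in registers on amd64, so newer output looks different.) The Go compiler never emits PUSH/POP instructions; the stack grows and shrinks only through arithmetic on the SP stack pointer. "".b+12(SP) refers to the location 12 bytes above the top of the stack (the stack grows downward). The name b is an arbitrary alias with no semantic meaning; the assembler simply requires a name when an argument is addressed this way, as the official documentation explains: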

The FP pseudo-register is a virtual frame pointer used to refer to function arguments. The compilers maintain a virtual frame pointer and refer to the arguments on the stack as offsets from that pseudo-register. Thus 0(FP) is the first argument to the function, 8(FP) is the second (on a 64-bit machine), and so on. However, when referring to a function argument this way, it is necessary to place a name at the beginning, as in first_arg+0(FP) and second_arg+8(FP). (The meaning of the offset —offset from the frame pointer— distinct from its use with SB, where it is an offset from the symbol.) The assembler enforces this convention, rejecting plain 0(FP) and 8(FP). The actual name is semantically irrelevant but should be used to document the argument's name.

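MOVL "".a+8(SP), CX: same as above, this time for the argument a.

Two important conclusions:

- The first argument a is at 8(SP), not 0(SP), because the CALL pseudo-instruction issued by the caller stores the return address at 0(SP).
- Arguments are laid out in reverse order: later arguments are stored first, so the first argument ends up closest to the top of the stack.

ADDL CX, AX: the addition itself. It adds the values in CX and AX and stores the result in AX.
MOVL AX, "".~r2+16(SP): moves the result into the caller's stack space at 16(SP), the slot the caller reserved for the first return value. The name ~r2 has no semantic meaning; it is only a naming convention.
MOVB $1, "".~r3+20(SP): there are two return values, so there are two stores. $1 stands for the return value true in the source.
RET: the function is done and its return values have been written into the caller's stack, so control has to go back to the caller. RET tells the assembler to emit whatever the target platform needs in order to return; most likely this pops the return address stored at 0(SP) and jumps back to it.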

   |    +-------------------------+ <-- 32(SP)              
   |    |                         |                         
 G |    |                         |                         
 R |    |                         |                         
 O |    | main.main's saved       |                         
 W |    |     frame-pointer (BP)  |                         
 S |    |-------------------------| <-- 24(SP)              
   |    |      [alignment]        |                         
 D |    | "".~r3 (bool) = 1/true  | <-- 21(SP)              
 O |    |-------------------------| <-- 20(SP)              
 W |    |                         |                         
 N |    | "".~r2 (int32) = 42     |                         
 W |    |-------------------------| <-- 16(SP)              
 A |    |                         |                         
 R |    | "".b (int32) = 32       |                         
 D |    |-------------------------| <-- 12(SP)              
 S |    |                         |                         
   |    | "".a (int32) = 10       |                         
   |    |-------------------------| <-- 8(SP)               
   |    |                         |                         
   |    |                         |                         
   |    |                         |                         
 \ | /  | return address to       |                         
  \|/   |     main.main + 0x30    |                         
   -    +-------------------------+ <-- 0(SP) (TOP OF STACK)
 
(diagram made with https://textik.com)

Dissecting the main function

0x0000 TEXT		"".main(SB), $24-0
;; ...omitted stack-split prologue...
0x000f SUBQ		$24, SP
0x0013 MOVQ		BP, 16(SP)
0x0018 LEAQ		16(SP), BP
;; ...omitted FUNCDATA stuff...
0x001d MOVQ		$137438953482, AX // the two arguments, packed
0x0027 MOVQ		AX, (SP)
;; ...omitted PCDATA stuff...
0x002b CALL		"".add(SB)
0x0030 MOVQ		16(SP), BP
0x0035 ADDQ		$24, SP
0x0039 RET
;; ...omitted stack-split epilogue...

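0x0000 TEXT "".main(SB), $24-0: nothing new here; the frame is 24 bytes. Where do those bytes go? One by one:
- First, the space for the arguments passed to add: 4 + 4 = 8 bytes, with argument a at 0(SP) - 4(SP) and argument b at 4(SP) - 8(SP).
- Then the space for the return values. The bool takes 1 byte and is padded to 4 bytes for alignment, so this is also 4 + 4 = 8 bytes: the first return value at 8(SP) - 12(SP) and the second at 12(SP) - 16(SP).
- What about the remaining 8 bytes? 16(SP) - 24(SP) stores the current value of BP (a single 8-byte slot), mainly for stack unwinding and to facilitate debugging.
SUBQ $24, SP: grows the stack by 24 bytes.
LEAQ 16(SP), BP: once the stack has grown and the old BP has been saved at 16(SP), LEAQ computes the address of the new frame pointer (SP + 16) and stores it in BP.
MOVQ $137438953482, AX and MOVQ AX, (SP): with the frame set up, the call can be prepared. The caller places the arguments at the top of the grown stack. The odd-looking number 137438953482 is simply the two 4-byte values 10 and 32 combined into one 8-byte value: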

$ echo 'obase=2;137438953482' | bc
10000000000000000000000000000000001010
\____/\______________________________/
   32                              10
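
The same decomposition can be checked in Go. This is a minimal sketch; pack and unpack are helper names invented here. It assumes a little-endian target such as amd64, where the first argument ends up in the low 32 bits of the 8-byte store.

```go
package main

import "fmt"

// pack puts a in the low 32 bits and b in the high 32 bits, which is how
// one little-endian 8-byte store lays out two adjacent int32 arguments.
func pack(a, b int32) uint64 {
	return uint64(uint32(b))<<32 | uint64(uint32(a))
}

// unpack splits the 8-byte value back into its two int32 halves.
func unpack(v uint64) (a, b int32) {
	return int32(uint32(v)), int32(uint32(v >> 32))
}

func main() {
	v := pack(10, 32)
	a, b := unpack(v)
	fmt.Println(v, a, b) // 137438953482 10 32
}
```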

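0x002b CALL "".add(SB): the arguments are on the stack and the space for the return values is reserved, so the function can be called. The call uses an address relative to SB, which in practice is a jump straight to the absolute address of add.
CALL always pushes the 8-byte return address on top of the stack (growing it), so inside add (which does not grow the stack itself) the first argument is at 8(SP), not 0(SP).
MOVQ 16(SP), BP and ADDQ $24, SP: once add has returned, main restores the saved BP and shrinks the stack by 24 bytes, releasing its frame.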

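Function calling convention

This is the usual (C-style) calling convention on x86_64:

rbp points to the base of the callee's stack frame.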

                         +--------------+
                         |              |
                    +    |              |
                    |    +--------------+
                    |    |              |
                    |    |   arg(N-1)   |  starts from 7'th argument for x86_64
                    |    |              |
                    |    +--------------+
                    |    |              |
                    |    |     argN     |
                    |    |              |
                    |    +--------------+
                    |    |              |
                    |    |Return address|  %rbp + 8
Stack grows down    |    |              |
                    |    +--------------+
                    |    |              |
                    |    |     %rbp     |  Frame base pointer
                    |    |              |
                    |    +--------------+
                    |    |              |
                    |    |  local var1  |  %rbp - 8
                    |    |              |
                    |    +--------------+
                    |    |              |
                    |    | local var 2  | <-- %rsp
                    |    |              |
                    v    +--------------+
                         |              |
                         |              |
                         +--------------+

Go's calling convention differs slightly but is logically similar; compare it with the stack of the add function above.

calling-convention

                                                                                                                              
                                       caller                                                                                 
                                 +------------------+                                                                         
                                 |                  |                                                                         
       +---------------------->  --------------------                                                                         
       |                         |                  |                                                                         
       |                         | caller parent BP |                                                                         
       |           BP(pseudo SP) --------------------                                                                         
       |                         |                  |                                                                         
       |                         |   Local Var0     |                                                                         
       |                         --------------------                                                                         
       |                         |                  |                                                                         
       |                         |   .......        |                                                                         
       |                         --------------------                                                                         
       |                         |                  |                                                                         
       |                         |   Local VarN     |                                                                         
                                 --------------------                                                                         
 caller stack frame              |                  |                                                                         
                                 |   callee arg2    |                                                                         
       |                         |------------------|                                                                         
       |                         |                  |                                                                         
       |                         |   callee arg1    |                                                                         
       |                         |------------------|                                                                         
       |                         |                  |                                                                         
       |                         |   callee arg0    |                                                                         
       |                         ----------------------------------------------+   FP(virtual register)                       
       |                         |                  |                          |                                              
       |                         |   return addr    |  parent return address   |                                              
       +---------------------->  +------------------+---------------------------    <-------------------------------+         
                                                    |  caller BP               |                                    |         
                                                    |  (caller frame pointer)  |                                    |         
                                     BP(pseudo SP)  ----------------------------                                    |         
                                                    |                          |                                    |         
                                                    |     Local Var0           |                                    |         
                                                    ----------------------------                                    |         
                                                    |                          |                                              
                                                    |     Local Var1           |                                              
                                                    ----------------------------                            callee stack frame
                                                    |                          |                                              
                                                    |       .....              |                                              
                                                    ----------------------------                                    |         
                                                    |                          |                                    |         
                                                    |     Local VarN           |                                    |         
                                  SP(Real Register) ----------------------------                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    +--------------------------+    <-------------------------------+         
                                                                                                                              
                                                              callee

The saved BP slot holds the caller's frame pointer, that is, the base of the previous function's stack frame.

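Instruction sets

Reduced instruction set: RISC
Complex instruction set: CISC
Instruction set extensions: AVX-512 and similar. They speed up memory moves and arithmetic. The basic idea is to batch work in hardware with dedicated logic units, so that one instruction does the work of several, which lowers the instruction count and speeds up computation.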

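Addition and subtraction

The stack is grown and shrunk by arithmetic on the SP register:

SUBQ

SUBQ $0x18, SP // decrease SP: allocate stack space

ADDQ

ADDQ $0x18, SP // increase SP: release stack space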

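Data operations

Plan 9 assembly writes constants as $num. They are decimal by default; $0x18 is the hexadecimal form.

Data is copied with the MOV family of instructions, and the suffix determines how many bytes are moved.

The MOV family

MOVB: moves 1 byte (8 bits).
MOVW: moves a word, 2 bytes (16 bits).
MOVL: moves 4 bytes (32 bits); this is the form used in the add listing above.
MOVQ: moves 8 bytes (64 bits), which is also the size of a pointer on a 64-bit system.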
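
To make the four widths concrete, here is a small Go sketch that keeps only the low 1, 2, 4, or 8 bytes of a 64-bit value. lowBytes is a helper invented for this illustration; it only models how many bytes a move of each width copies, not how the hardware treats the rest of the destination register.

```go
package main

import "fmt"

// lowBytes returns the low n bytes of v, for n = 1, 2, 4, or 8.
// It models the number of bytes a move with the matching suffix copies.
func lowBytes(v uint64, n uint) uint64 {
	if n >= 8 {
		return v
	}
	return v & (uint64(1)<<(8*n) - 1)
}

func main() {
	v := uint64(0x1122334455667788)
	fmt.Printf("%#x\n", lowBytes(v, 1)) // 0x88
	fmt.Printf("%#x\n", lowBytes(v, 2)) // 0x7788
	fmt.Printf("%#x\n", lowBytes(v, 4)) // 0x55667788
	fmt.Printf("%#x\n", lowBytes(v, 8)) // 0x1122334455667788
}
```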

IMULQ

Multiplication:

IMULQ    AX, BX // BX = BX * AX

Jumps

JEQ: jump if equal

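Program data sections

Code and variables in an assembled program are generally stored in the following sections:

.text: code
.data: global variables with an initial value
.bss: global variables without an initial value
.rodata: read-only data

Variables are defined with DATA combined with GLOBL.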

DATA

DATA    symbol+offset(SB)/width, value

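offset is relative to the symbol itself, not to some global address.

GLOBL

GLOBL declares a symbol as global. It takes two extra arguments: flags and the size of the variable. Flags such as RODATA are defined in textflag.h.

GLOBL    divtab(SB), RODATA, $64

GLOBL must follow the corresponding DATA directives. Here are examples of declaring global variables:

DATA    age+0x00(SB)/4, $18
GLOBL    age(SB), RODATA, $4
 
DATA    pi+0(SB)/8, $3.1415
GLOBL    pi(SB), RODATA, $8

When a symbol is declared, its offset is usually 0.

To define a global array, use non-zero offsets: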

DATA bio+0(SB)/8, $"oh yes i"
DATA bio+8(SB)/8, $"am here "
GLOBL bio(SB), RODATA, $16
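
The effect of the two DATA lines can be pictured in Go as two 8-byte copies into a 16-byte array. This is only an analogy for how the offsets work; buildBio is a made-up helper and has nothing to do with how the linker actually lays out the symbol.

```go
package main

import "fmt"

// buildBio fills a 16-byte array the way the two DATA directives do:
// 8 bytes at offset 0, then 8 bytes at offset 8.
func buildBio() [16]byte {
	var bio [16]byte
	copy(bio[0:8], "oh yes i")
	copy(bio[8:16], "am here ")
	return bio
}

func main() {
	bio := buildBio()
	fmt.Printf("%q\n", string(bio[:])) // "oh yes iam here "
}
```

Each chunk is exactly 8 bytes wide, which is why the second DATA line starts at offset 8 and the GLOBL size is 16.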

Function declaration

// func add(a, b int) int
//   => the Go declaration lives in any .go file of the same package
//   => it is only a signature, with no body
TEXT pkgname·add(SB), NOSPLIT, $0-24
    MOVQ a+0(FP), AX
    MOVQ b+8(FP), BX
    ADDQ AX, BX
    MOVQ BX, ret+16(FP)
    RET

TEXT

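TEXT declares a function (it is probably called TEXT because code is stored in the .text section of the binary). In Plan 9 assembly, the TEXT directive is how a function is defined.

pkgname can be omitted, and usually is.

The · in the middle is special: it is a Unicode middle dot, and when the program is linked every middle dot is replaced with a period (.).

                              argument and return value size
                                  | 
 TEXT pkgname·add(SB),NOSPLIT,$32-32
       |        |               |
    package  function       frame size (local variables plus any space needed for the arguments of functions it calls, excluding the return address pushed when calling them)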
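
As a reading aid, the size annotation can be split mechanically into its two numbers. The following Go sketch does that for the two-part form shown above; parseFrameSpec is a name invented here, and the sketch deliberately rejects anything that is not of the form $frame-args.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFrameSpec splits a TEXT size annotation such as "$24-8" into the
// frame size and the argument size, both in bytes.
func parseFrameSpec(spec string) (frame, args int, err error) {
	if !strings.HasPrefix(spec, "$") {
		return 0, 0, fmt.Errorf("missing $ in %q", spec)
	}
	parts := strings.SplitN(spec[1:], "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("expected $frame-args, got %q", spec)
	}
	if frame, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if args, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return frame, args, nil
}

func main() {
	frame, args, err := parseFrameSpec("$24-8")
	fmt.Println(frame, args, err) // 24 8 <nil>
}
```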

CALL

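Function call:

CALL    "".add(SB)

CALL => push PC; jmp to callee addr;

CALL pushes the return address (the current value of PC, that is, the address of the next instruction) onto the stack and jumps to the address of the callee.

RET

RET is the counterpart: it pops the return address from the stack into PC, so execution resumes in the caller.

RET => pop PC;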

NOSPLIT

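Tells the compiler that the stack-split preamble does not need to be inserted. The preamble checks whether the current stack needs to grow; leaving it out skips that check and saves CPU time.

The check exists because each goroutine starts with a stack of only 2 KB, so a function call has to verify whether more space is needed. When the stack grows, its size is doubled.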
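
The doubling rule can be sketched as a few lines of Go. This is a simplified model based only on the description above (start at 2 KB, double until the request fits); grownSize and initialStack are names invented here, and the real runtime takes more into account.

```go
package main

import "fmt"

// initialStack is the starting goroutine stack size described in the text.
const initialStack = 2048

// grownSize doubles the stack size until it can hold need bytes.
func grownSize(need int) int {
	size := initialStack
	for size < need {
		size *= 2
	}
	return size
}

func main() {
	fmt.Println(grownSize(100))  // 2048
	fmt.Println(grownSize(2049)) // 4096
	fmt.Println(grownSize(5000)) // 8192
}
```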

FUNCDATA and PCDATA

These two directives carry information used by the garbage collector; they are inserted by the compiler.

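LEAQ: address computation

LEA stands for load effective address.

Address computation uses the lea instruction. On amd64 every address is 8 bytes, so LEAQ is the form to use.

It computes the address of a location and stores that address somewhere, without reading the memory it points to.

LEAQ 16(SP), BP

LEAQ computes the new address of the frame-pointer and stores it in BP

That is, it computes SP + 16, the new frame pointer address, and stores it in register BP.

LEAQ type."".T(SB), AX // load the pointer to the type descriptor of T into AX

LEAQ go.string."point"(SB), AX // load the address of the string "point" into AX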
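
A rough Go analogy for "compute an address without reading the memory": taking &arr[i] produces an address from a base and an offset, just as LEAQ 16(SP), BP produces SP + 16. The sketch below is an illustration only; elemOffset is an invented helper, and it uses unsafe purely to print the byte distance.

```go
package main

import (
	"fmt"
	"unsafe"
)

// elemOffset returns the distance in bytes between the start of the array
// and element i. Only addresses are computed; no element is read.
func elemOffset(arr *[4]uint64, i int) uintptr {
	return uintptr(unsafe.Pointer(&arr[i])) - uintptr(unsafe.Pointer(&arr[0]))
}

func main() {
	var arr [4]uint64
	fmt.Println(elemOffset(&arr, 2)) // 16: base + 2 * 8 bytes
}
```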

XORPS

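XORPS X0, X0 // XOR X0 with itself, which zeroes the 16-byte register

MOVUPS

Moves 16 bytes of packed data between an X register and memory without requiring the memory operand to be aligned (the U stands for unaligned).

MOVUPS X0, ""..autotmp_1+48(SP)

JMP

Jumps to an address.

autotmp_

The name of a compiler-generated temporary storage slot.

My guess is that MOVQ AX, ""..autotmp_1+48(SP) works the same way as MOVQ AX, "".~r2+16(SP): ""..autotmp_1 is the name of the slot at 48(SP), just as "".~r2 is the name of the slot at 16(SP), which holds the first return value in the add example.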

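Registers

AX, BX, CX, DX...

These are general-purpose registers; each name maps to the corresponding hardware register of the target architecture.

AX BX CX DX DI SI BP SP R8 R9 R10 R11 R12 R13 R14 PC

By default SP and BP track the top and the base of the current stack frame; the other registers can take part in computation.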

Pseudo-registers

Go assembly introduces 4 pseudo-registers:

FP：Frame pointer: arguments and locals.
PC：Program counter: jumps and branches.
SB：Static base pointer: global symbols.
SP：Stack pointer: top of stack.

FP

All user-defined symbols are written as offsets to the pseudo-registers FP (arguments and locals)

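With FP, function arguments are referenced as symbol+offset(FP), for example arg0+0(FP) and arg1+8(FP); leaving out the symbol does not assemble. At the assembly level the symbol does nothing; it is there mainly to make the code readable.

Although the official documentation calls the FP pseudo-register a frame pointer, it is not one in the traditional sense: on x86 the frame pointer is the BP register, which points to the base of the whole stack frame. If the current callee is add and its code refers to FP, the location FP points to is not inside the callee's stack frame but in the caller's.

In Go assembly, the FP pseudo-register is used to refer to function arguments.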

PC

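Equivalent to the conventional program counter: it holds the address of the next instruction. To execute an instruction, the CPU fetches it from memory at the address held in PC into the instruction register.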

SB

the address of the beginning of the address-space of our program.

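The static base pointer is a global symbol that stands for the start of the program's address space. It is used to refer to functions and global variables.

SB can be thought of as the origin of memory, so foo(SB) is the name foo as an address in memory. This form names global functions and data. Adding <> to the name, as in foo<>(SB), makes the name visible only in the current source file. An offset can be added to the name: foo+4(SB) is 4 bytes past the start of foo.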

SP

By convention SP holds the stack pointer and points to the top of the stack.

Variables are referenced in the form "".b+12(SP).

The SP pseudo-register is a virtual stack pointer used to refer to frame-local variables and the arguments being prepared for function calls. It points to the top of the local stack frame, so references should use negative offsets in the range [−framesize, 0): x-8(SP), y-4(SP), and so on.

Although the official documentation states that "All user-defined symbols are written as offsets to the pseudo-register FP (arguments and locals)", this is only ever true for hand-written code.

Like most recent compilers, the Go tool suite always references argument and locals using offsets from the stack-pointer directly in the code it generates. This allows for the frame-pointer to be used as an extra general-purpose register on platform with fewer registers (e.g. x86).

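The pseudo-SP and the hardware SP are not the same thing. In hand-written code you tell them apart by whether a symbol precedes SP: with a symbol, as in x-8(SP), it is the pseudo-register; without one, as in 8(SP), it is the hardware SP register.

The assembly printed by go tool compile -S does not use the pseudo-SP and FP registers described above; it only uses the real SP register.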
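
The rule for hand-written code can be written down as a tiny classifier. This Go sketch only restates the rule above (a symbol in front of the offset means pseudo-SP); isPseudoSP is an invented helper, it does not parse real assembler syntax, and it does not apply to compiler output, where named operands still refer to the real SP.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isPseudoSP reports whether a hand-written operand such as "x-8(SP)"
// refers to the pseudo-SP, using the rule "a symbol precedes the offset".
func isPseudoSP(operand string) bool {
	if !strings.HasSuffix(operand, "(SP)") {
		return false
	}
	prefix := strings.TrimSuffix(operand, "(SP)")
	if prefix == "" {
		return false // plain (SP): hardware register
	}
	// A symbol starts with a letter or underscore; a bare offset starts
	// with a digit or a sign.
	r, _ := utf8.DecodeRuneInString(prefix)
	return unicode.IsLetter(r) || r == '_'
}

func main() {
	fmt.Println(isPseudoSP("x-8(SP)")) // true
	fmt.Println(isPseudoSP("8(SP)"))   // false
	fmt.Println(isPseudoSP("(SP)"))    // false
}
```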

Which runtime functions application code is translated into

defer

package main
 
func f() int {
	var res = 0
	defer func() {
		res++
	}()
	return res
}
 
func main() {}
 
runtime.deferproc(SB)
runtime.deferreturn(SB)

map

package main
 
import "fmt"
 
func main() {
	var a = map[int]int{}
	a[1] = 1
	fmt.Println(a)
}
 
go tool compile -S map.go | grep 'map.go:6'
 
runtime.makemap_small(SB)

new

runtime.newobject(SB)

Hand-written assembly

Implementing an a + b function

package main
 
import "fmt"
 
func add(a, b int) int // declaration of the function implemented in assembly
 
func main() {
    fmt.Println(add(10, 11))
}
 
---
 
#include "textflag.h"
 
// func add(a, b int) int
TEXT ·add(SB), NOSPLIT, $0-24
    MOVQ a+0(FP), AX // argument a
    MOVQ b+8(FP), BX // argument b
    ADDQ BX, AX    // AX += BX
    MOVQ AX, ret+16(FP) // store the return value
    RET