Introduction to the Golang compiler

  • 2020-06-19 10:31:07
  • OfStack

cmd/compile contains the main packages that make up the Go compiler. The compiler can be logically divided into four phases, which we will briefly describe along with a list of packages containing the corresponding code.
When talking about compilers, you will sometimes hear the terms front-end and back-end. Roughly speaking, these correspond to the first two and the last two phases listed here. A third term, middle-end, is also used; it usually refers to much of the work performed in phase 2.
Note that the go/* family of packages, such as go/parser and go/types, is unrelated to the compiler. Because the compiler was originally written in C, the go/* packages were developed so that tools that work with Go code, such as gofmt and vet, could be written.
To be clear, the name "gc" stands for "Go compiler" and has nothing to do with uppercase GC, which stands for garbage collection.

1. Parsing

cmd/compile/internal/syntax (lexer, parser, syntax tree)

In the first phase of compilation, the source code is tokenized (lexical analysis), parsed (syntax analysis), and a syntax tree is constructed for each source file. (A token here is a string with an assigned meaning, usually a name and a value, where the name is a lexical category such as identifier, keyword, separator, operator, literal, or comment. Syntax trees, including the abstract syntax tree (AST) mentioned below, are trees that express the syntactic structure of a program, usually with operands as leaf nodes and operators as interior nodes.)
Each syntax tree is an exact representation of the corresponding source file, where the nodes correspond to various elements of the source file, such as expressions, declarations, and statements. The syntax tree also includes location information for error reporting and for creating debug information.
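The two steps of this phase can be sketched with the standard library. This is only an illustration: cmd/compile/internal/syntax is an internal package and cannot be imported, so the sketch assumes the compiler-independent go/scanner and go/parser packages as stand-ins. The source string and the names tokens and funcNames are made up for the example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/scanner"
	"go/token"
)

// src is a tiny, hypothetical source file used only for illustration.
const src = `package demo

func add(a, b int) int { return a + b }
`

// tokens runs the lexer over src and returns the kind of each token it finds.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("demo.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // no error handler, default mode (comments skipped)

	var out []string
	for {
		_, tok, _ := s.Scan()
		if tok == token.EOF {
			break
		}
		out = append(out, tok.String())
	}
	return out
}

// funcNames parses src into a syntax tree and collects the declared function names.
func funcNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(f, func(n ast.Node) bool {
		if fd, ok := n.(*ast.FuncDecl); ok {
			names = append(names, fd.Name.Name)
		}
		return true // keep walking the tree
	})
	return names, nil
}

func main() {
	fmt.Println(tokens(src))
	fmt.Println(funcNames(src))
}
```

The token.FileSet carries the position information that is later used for error messages, which mirrors the role of location data in the compiler's own syntax tree.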

2. Type checking and AST transformation

cmd/compile/internal/gc (create compiler AST, type checking, AST transformations)

The gc package includes an AST definition carried over from the time when the compiler was written in C. All of its code is written in terms of that AST, so the first thing the gc package must do is convert the syntax package's syntax tree to the compiler's AST representation. This extra step may be refactored away in the future.
The AST is then type-checked. The first steps are name resolution and type inference, which determine which object belongs to which identifier and what type each expression has. Type checking also includes certain extra checks, such as "declared and not used", as well as determining whether or not a function terminates.
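The unused-variable check can be observed with go/types, the compiler-independent type checker. This is a sketch under that assumption and does not show the code path inside cmd/compile. The sources and the helper names typeErrors and hasUnusedVar are hypothetical, and the exact wording of the error message differs between Go versions, so the helper only looks for the phrase "not used".

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
	"strings"
)

// badSrc declares x but never reads it.
const badSrc = `package demo

func f() int {
	x := 1
	return 2
}
`

// goodSrc uses every variable it declares.
const goodSrc = `package demo

func f() int {
	x := 1
	return x + 1
}
`

// typeErrors parses and type-checks src and returns every reported error message.
func typeErrors(src string) []string {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return []string{err.Error()}
	}
	var msgs []string
	conf := types.Config{
		// Collect all errors instead of stopping at the first one.
		Error: func(err error) { msgs = append(msgs, err.Error()) },
	}
	conf.Check("demo", fset, []*ast.File{f}, nil)
	return msgs
}

// hasUnusedVar reports whether any message mentions an unused declaration.
func hasUnusedVar(msgs []string) bool {
	for _, m := range msgs {
		if strings.Contains(m, "not used") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(typeErrors(badSrc))
	fmt.Println(typeErrors(goodSrc))
}
```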
Certain transformations are also done on the AST. Some nodes are refined based on type information, such as string additions being split from the arithmetic addition node type. Other examples are dead code elimination, function call inlining, and escape analysis.
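Escape analysis is easiest to see on a pair of small functions. The sketch below uses made-up names (point, sumLocal, newPoint). Building it with go build -gcflags=-m asks the compiler to print its inlining and escape decisions. The exact output depends on the Go version, so none is reproduced here.

```go
package main

import "fmt"

type point struct{ x, y int }

// sumLocal only reads p inside the function, so p is expected to stay on the stack.
func sumLocal() int {
	p := point{1, 2}
	return p.x + p.y
}

// newPoint returns the address of a local variable. The value must outlive
// the call, so escape analysis is expected to move p to the heap.
func newPoint() *point {
	p := point{3, 4}
	return &p
}

func main() {
	fmt.Println(sumLocal())
	fmt.Println(newPoint().x)
}
```

Both functions are small enough to be inlining candidates, so the same diagnostic output may also report inlining decisions for them.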

3. Generic SSA

cmd/compile/internal/gc (conversion to SSA)
cmd/compile/internal/ssa (SSA passes and rules)

Compilers for many high-level languages cannot do all of their work in a single scan of the source code or AST. Instead they scan multiple times, doing part of the work each time and using the output of one scan as the input of the next, until the final target code is produced. Each such scan is called a pass, and the result of every pass before the last one can be called an intermediate representation. In this article, the AST and SSA are both intermediate representations. SSA, or static single assignment form, is a property of an intermediate representation that requires each variable to be assigned exactly once and to be defined before it is used.
In this phase, the AST is converted into static single assignment (SSA) form, a lower-level intermediate representation with specific properties that make it easier to implement optimizations and to eventually generate machine code from it.
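The single-assignment property can be illustrated with a toy renamer for straight-line code. This is not how cmd/compile builds SSA. The assign type and the toSSA function are hypothetical, and the sketch ignores control flow, which in real SSA requires phi functions at merge points.

```go
package main

import "fmt"

// assign is a hypothetical three-address statement: dst = a op b.
type assign struct {
	dst, a, op, b string
}

// toSSA renames variables so that every name is assigned exactly once.
// Each assignment to v creates a fresh version v1, v2, ...; each use refers
// to the latest version. Names that are never assigned (inputs) are kept.
func toSSA(prog []assign) []assign {
	version := map[string]int{}
	use := func(name string) string {
		if n, ok := version[name]; ok {
			return fmt.Sprintf("%s%d", name, n)
		}
		return name
	}
	out := make([]assign, 0, len(prog))
	for _, s := range prog {
		a, b := use(s.a), use(s.b) // read operands before defining the new version
		version[s.dst]++
		dst := fmt.Sprintf("%s%d", s.dst, version[s.dst])
		out = append(out, assign{dst, a, s.op, b})
	}
	return out
}

func main() {
	prog := []assign{
		{"x", "a", "+", "b"},
		{"x", "x", "*", "c"},
		{"y", "x", "-", "a"},
	}
	for _, s := range toSSA(prog) {
		fmt.Printf("%s = %s %s %s\n", s.dst, s.a, s.op, s.b)
	}
	// Prints:
	// x1 = a + b
	// x2 = x1 * c
	// y1 = x2 - a
}
```

Because every name now has exactly one definition, a later pass can find where a value comes from without tracking reassignments, which is what makes optimizations simpler to write on this form.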
During this conversion, function intrinsics are applied. These are special functions that the compiler has been taught to replace with heavily optimized code on a case-by-case basis. (An intrinsic is a function that the compiler knows specially: rather than emitting a call instruction, it typically emits an instruction sequence that implements the function directly, much like an inlined function.)
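A common example of candidates for intrinsics is the math/bits package. The sketch below compares a hand-written bit count with bits.OnesCount64. On some architectures the compiler may replace the library call with a dedicated instruction, but whether that happens depends on GOARCH and the compiler version, so treat it as an assumption and not a guarantee. The names popcountLoop and popcount are made up.

```go
package main

import (
	"fmt"
	"math/bits"
)

// popcountLoop counts set bits by clearing the lowest set bit until none remain.
func popcountLoop(x uint64) int {
	n := 0
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		n++
	}
	return n
}

// popcount uses the library function, which the compiler may treat as an intrinsic.
func popcount(x uint64) int {
	return bits.OnesCount64(x)
}

func main() {
	fmt.Println(popcountLoop(0xF0F0), popcount(0xF0F0))
}
```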
Certain nodes are also lowered into simpler components during the AST to SSA conversion, so that the rest of the compiler can work with them. For instance, the copy builtin is replaced by memory moves, and range loops are rewritten into for loops. For historical reasons, some of these currently happen before the conversion to SSA, but the long-term plan is to move all of them here.
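The range rewrite can be pictured at the source level. The second function below is a hand-written approximation of what a range loop over a slice is reduced to. It is not the compiler's actual output, and the names sumRange and sumLowered are made up.

```go
package main

import "fmt"

// sumRange uses a range loop, as a programmer would write it.
func sumRange(xs []int) int {
	total := 0
	for _, v := range xs {
		total += v
	}
	return total
}

// sumLowered spells out roughly the same loop with an explicit index.
func sumLowered(xs []int) int {
	total := 0
	n := len(xs) // the length is evaluated once, before the loop starts
	for i := 0; i < n; i++ {
		v := xs[i]
		total += v
	}
	return total
}

func main() {
	xs := []int{1, 2, 3, 4}
	fmt.Println(sumRange(xs), sumLowered(xs))
}
```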
Then, a series of machine-independent passes and rules are applied. These do not concern any single computer architecture, and thus run on all GOARCH variants.
Some examples of these generic passes include dead code elimination, removal of unneeded nil checks, and removal of unused branches. The generic rewrite rules mainly concern expressions, such as replacing some expressions with constant values, and optimizing multiplications and floating-point operations.
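Two rewrite rules of this kind, constant folding and turning a multiplication by a power of two into a shift, can be sketched on a toy expression tree. The expr type and the rules below are hypothetical and much simpler than the compiler's real rule set.

```go
package main

import "fmt"

// expr is a hypothetical expression node: a constant, a variable, or a binary op.
type expr struct {
	op   string // "const", "var", "+", "*", "<<"
	val  int64  // used when op == "const"
	name string // used when op == "var"
	l, r *expr
}

func konst(v int64) *expr             { return &expr{op: "const", val: v} }
func variable(n string) *expr         { return &expr{op: "var", name: n} }
func bin(op string, l, r *expr) *expr { return &expr{op: op, l: l, r: r} }

// log2 returns k and true when v == 1<<k.
func log2(v int64) (int64, bool) {
	if v <= 0 || v&(v-1) != 0 {
		return 0, false
	}
	var k int64
	for v > 1 {
		v >>= 1
		k++
	}
	return k, true
}

// rewrite applies two rules bottom-up:
//
//	(op (const a) (const b)) -> (const a op b)   constant folding
//	(* x (const 2^k))        -> (<< x (const k)) multiplication to shift
func rewrite(e *expr) *expr {
	if e.l == nil { // leaf: nothing to rewrite
		return e
	}
	l, r := rewrite(e.l), rewrite(e.r)
	if l.op == "const" && r.op == "const" {
		switch e.op {
		case "+":
			return konst(l.val + r.val)
		case "*":
			return konst(l.val * r.val)
		case "<<":
			return konst(l.val << uint(r.val))
		}
	}
	if e.op == "*" && r.op == "const" {
		if k, ok := log2(r.val); ok {
			return bin("<<", l, konst(k))
		}
	}
	return bin(e.op, l, r)
}

func (e *expr) String() string {
	switch e.op {
	case "const":
		return fmt.Sprint(e.val)
	case "var":
		return e.name
	}
	return fmt.Sprintf("(%s %s %s)", e.l, e.op, e.r)
}

func main() {
	e := bin("*", variable("x"), bin("+", konst(2), konst(6)))
	fmt.Println(e, "=>", rewrite(e)) // prints "(x * (2 + 6)) => (x << 3)"
}
```

Applying the rules bottom-up lets one rewrite expose another: folding 2 + 6 into 8 is what makes the multiplication eligible for the shift rule.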

4. Generate machine code

cmd/compile/internal/ssa (SSA lowering and architecture-specific passes)
cmd/internal/obj (machine code generation)

The machine-dependent phase of the compiler begins with the "lower" pass, which rewrites generic values into their machine-specific variants. For example, on amd64 memory operands are possible, so many load-store operations may be combined.
Note that the lower pass runs all machine-specific rewrite rules, and thus it currently applies lots of optimizations too.
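The idea of lowering and of folding a load into the instruction that consumes it can be sketched on a toy tree. The node type and the op names (Add64, Load, ADDQ, MOVQload, ADDQload) are used here purely as illustrative labels. A real lowering pass also has to check, among other things, that the load has no other users and that merging it is safe, which this sketch ignores.

```go
package main

import "fmt"

// node is a hypothetical value with an operation name and operands.
type node struct {
	op   string
	args []*node
}

func leaf(name string) *node              { return &node{op: name} }
func op(name string, args ...*node) *node { return &node{op: name, args: args} }

// lower maps generic ops to machine-flavored ops, then applies one fusion rule:
//
//	(ADDQ x (MOVQload ptr)) -> (ADDQload x ptr)
func lower(n *node) *node {
	args := make([]*node, len(n.args))
	for i, a := range n.args {
		args[i] = lower(a)
	}
	name := n.op
	switch name {
	case "Add64":
		name = "ADDQ"
	case "Load":
		name = "MOVQload"
	}
	if name == "ADDQ" && len(args) == 2 &&
		args[1].op == "MOVQload" && len(args[1].args) == 1 {
		// Use the memory operand directly instead of a separate load.
		return op("ADDQload", args[0], args[1].args[0])
	}
	return &node{op: name, args: args}
}

func (n *node) String() string {
	if len(n.args) == 0 {
		return n.op
	}
	s := "(" + n.op
	for _, a := range n.args {
		s += " " + a.String()
	}
	return s + ")"
}

func main() {
	e := op("Add64", leaf("x"), op("Load", leaf("p")))
	fmt.Println(e, "=>", lower(e)) // prints "(Add64 x (Load p)) => (ADDQload x p)"
}
```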
Once the SSA has been "lowered" and is more specific to the target architecture, the final code optimization passes are run. This includes yet another dead code elimination pass, moving values closer to their uses, the removal of local variables that are never read from, and register allocation.
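Dead code elimination on an SSA-like list of values can be sketched as a reachability walk: keep every value with a side effect, keep everything those values read, and drop the rest. The value type and the eliminateDead function are hypothetical and only illustrate the idea.

```go
package main

import "fmt"

// value is a hypothetical SSA value.
type value struct {
	id     int
	op     string
	args   []int // ids of the values this one reads
	effect bool  // true for values that must be kept, such as stores or returns
}

// eliminateDead keeps values that have an effect or are (transitively) read
// by a value that has one, and drops everything else.
func eliminateDead(vals []value) []value {
	byID := make(map[int]value, len(vals))
	live := make(map[int]bool)
	var work []int
	for _, v := range vals {
		byID[v.id] = v
		if v.effect {
			live[v.id] = true
			work = append(work, v.id)
		}
	}
	// Walk backwards from the live roots through their operands.
	for len(work) > 0 {
		id := work[len(work)-1]
		work = work[:len(work)-1]
		for _, a := range byID[id].args {
			if !live[a] {
				live[a] = true
				work = append(work, a)
			}
		}
	}
	var kept []value
	for _, v := range vals {
		if live[v.id] {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	vals := []value{
		{id: 1, op: "Arg a"},
		{id: 2, op: "Arg b"},
		{id: 3, op: "Add", args: []int{1, 2}},
		{id: 4, op: "Mul", args: []int{1, 1}}, // never read, so it is dead
		{id: 5, op: "Return", args: []int{3}, effect: true},
	}
	for _, v := range eliminateDead(vals) {
		fmt.Println(v.id, v.op)
	}
}
```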
Other important pieces of work done as part of this step include stack frame layout, which assigns stack offsets to local variables, and pointer liveness analysis, which computes which on-stack pointers are live at each GC safe point.
At the end of the SSA generation phase, Go functions have been transformed into a series of obj.Prog instructions. These are passed to the assembler (cmd/internal/obj), which turns them into machine code and writes out the final object file. The object file will also contain reflect data, export data, and debugging information.
