C language to implement a simple virtual machine

2020-04-02 03:11:04
OfStack

Necessary preparations and precautions:

Before you start, you need to do the following:

A C compiler - I used clang 3.4 and can use other c99/c11 compilers. Text editor - I recommend an IDE based text editor, I use Emacs; Basic programming knowledge -- basic variables, process control, functions, data structures, etc. Make scripts - Make your program faster.

Why write a virtual machine?

For the following reasons:

Want to know more about how computers work. This article will help you understand how the underlying computer works. Virtual machines provide a clean layer of abstraction. Isn't that the best way to learn how they work? Learn more about how some programming languages work. For example, there are currently many virtual machines that frequently use those languages. This includes the JVM, Lua VM, and FaceBook's Hip - Hop VM (PHP/Hack). Just because I was interested in learning about virtual machines.

Instruction set

We're going to implement a very simple set of custom instructions. I'm not going to talk about some of the more advanced ones like the displacement register, but hopefully I'll get a handle on them after reading this article.

Our virtual machine has A set of registers, A,B,C,D,E, and F. These are general purpose registers, that is, they can be used to store anything. A program will be a sequence of read-only instructions. The virtual machine is a stack-based virtual machine, which means it has a stack that lets us push in and out of values, as well as a small number of registers available. This is much simpler than implementing a register-based virtual machine.

Anyway, here's the instruction set we're going to implement:


PSH 5    ; pushes 5 to the stack
PSH 10   ; pushes 10 to the stack
ADD     ; pops two values on top of the stack, adds them pushes to stack
POP     ; pops the value on the stack, will also print it for debugging
SET A 0   ; sets register A to 0
HLT     ; stop the program

So this is our instruction set, and notice that the POP instruction is going to print the instruction that we POP up, so we can see the ADD instruction working. I also added a SET instruction, which is basically to let you understand that registers are accessible and written to. You can also implement your own instructions like MOV A B (move the value of A to B). The HTL instruction is to tell us that the program has finished running.

How does the virtual machine work?

Now that we're at the heart of this article, virtual machines are simpler than you might think, and they follow a simple pattern: read; Decoding; The execution. First, we read the next instruction from the instruction set or code, then decode the instruction and execute the decoded instruction. For simplicity, we ignore the coding part of the virtual machine, which typically packages an instruction (the opcode and its operands) into a number and decodes the instruction.
The project structure

Before we start programming, we need to set up our project. First, you need a C compiler (I use clang 3.4). We also need a folder to put our project in. I like to put my project in ~/Dev:


$cd ~/Dev/
mkdir mac
cd mac
mkdir src

As above, we start with the CD in the ~/Dev directory, or wherever you want to put it, and then create a new directory (I call this virtual machine "MAC"). Then put the CD into this directory and create our SRC directory, which is used to put the code.

The Makefile

Makefiles are relatively straightforward, we don't need to split anything into multiple files, and we don't need to include anything, so we just need some flags to compile the file:


SRC_FILES = main.c
CC_FLAGS = -Wall -Wextra -g -std=c11
CC = clang
 
all:
  ${CC} ${SRC_FILES} ${CC_FLAGS} -o mac

This is enough for now, you can improve it later, but as long as it does the job, we should be satisfied.
Instruction programming (code)

Now it's time to write the code for the virtual machine. First, we need instructions to define the program. For this, we can use an enumeration type enum, because our instructions are basically Numbers from 0 to X. In fact, you could say you're assembling an assembly file that USES words like mov and translates them into declared instructions.
We can write only one instruction file, such as PSH, 5 is 0, 5, but this is not easy to read, so we use the enumerator!


typedef enum {
  PSH,
  ADD,
  POP,
  SET,
  HLT
} InstructionSet;

Now we can store a test program as an array. We wrote a simple program to test it: add the 5 and 6 and print them out (with POP instructions). If you wish, you can define an instruction to print the value at the top of the stack.

The instructions should be stored as an array, which I will define at the top of the document; But you might want to put it in a header file, and here's our test program:


const int program[] = {
  PSH, 5,
  PSH, 6,
  ADD,
  POP,
  HLT
};

The program above is going to push 5 and 6 on the stack, call the ADD directive, which is going to POP up the top two values on the stack, ADD them up and push the result back on the stack, and then we're going to POP up the result, because the POP command is going to print the value, but you don't have to do it yourself, I've done it and tested it. Finally, the HLT instruction ends the program.

Good, so we have our own program. Now we have implemented the mode of reading, decoding and evaluating the virtual machine. But remember, we're not decoding anything, because we're giving the original instructions. That means we just need to focus on reading and evaluating! We can reduce them to two functions fetch and evaluate.

Get the current instruction

Because we've saved our program as an array, it's easy to get the current instruction. A virtual machine has a counter, generally called a program counter, instruction pointer, etc. These names mean the same thing depending on your personal preference. Shorthand forms such as IP or PC are also common in virtual machine libraries.

If you remember, I said we were going to store the program counter in a register... We'll do that -- in the future. Now, we just create a variable at the top of our code called IP and set it to 0.


int ip = 0;

The IP variable represents the instruction pointer. Since we have stored the program as an array, we use the IP variable to indicate the current index in the program array. For example, if we create a variable x that is assigned to the program's IP index, it stores our program's first instruction.

[suppose IP is 0]


int ip = 0;
 
int main() {
  int instr = program[ip];
  return 0;

If we were to print the variable instr, which should have been PSH, it would be 0 because it's the first value in our enumeration. We could also write a fetch function like this:


int fetch() {
  return program[ip];
}

This function will return the current called instruction. Great. So what if we want the next instruction? Easy, we just need to add the instruction pointer:


int main() {
  int x = fetch(); // PSH
  ip++; // increment instruction pointer
  int y = fetch(); // 5
}

So how do you get it to move? We know that a program doesn't stop until it executes an HLT instruction. So we use an infinite loop that continues until the current instruction is HLT.


// INCLUDE <stdbool.h>!
bool running = true;
 
int main() {
  while (running) {
    int x = fetch();
    if (x == HLT) running = false;
    ip++;
  }
}

It works well, but it's a bit messy. We are looping through each instruction, checking whether it is HLT or not, and if it is, stop the loop, otherwise the "eat" instruction continues the loop.

Judge an instruction

So this is the body of our virtual machine, but we want to actually evaluate each instruction and make it a little bit cleaner. Well, with this simple virtual machine, you can write a "huge" switch declaration. Let each case in switch correspond to an instruction that we define in the enumeration. The eval function will use the arguments of a simple directive to determine this. We don't use any instruction pointer increments in the function unless we want the operands to be wasted.


void eval(int instr) {
  switch (instr) {
    case HLT:
      running = false;
      break;
  }
}

So if we go back to the main function, we can use our eval function like this:


bool running = true;
int ip = 0;
 
// instruction enum here
 
// eval function here
 
// fetch function here
 
int main() {
  while (running) {
    eval(fetch());
    ip++; // increment the ip every iteration
  }
}

The stack!

Good. That will do the job perfectly. Now, before we add any more instructions, we need a stack. Fortunately, the stack is easy to implement, we only need to use an array. The array is set to the appropriate size so that it can contain 256 values. We also need a stack pointer (often abbreviated as sp). This pointer will point to the stack array.

To give us a more visual picture, let's take a look at the stack implemented with arrays:


[] // empty
 
PSH 5 // put 5 on **top** of the stack
[5]
 
PSH 6
[5, 6]
 
POP
[5]
 
POP
[] // empty
 
PSH 6
[6]
 
PSH 5
[6, 5]

So, what happens in our program?


PSH, 5,
PSH, 6,
ADD,
POP,
HLT

We first pushed the 5 on the stack

[5]

Then press in 6:


[5, 6]

Then add instructions, fetch the values, add them together, and push the results on the stack:


[5, 6]
 
// pop the top value, store it in a variable called a
a = pop; // a contains 6
[5] // stack contents
 
// pop the top value, store it in a variable called b
b = pop; // b contains 5
[] // stack contents
 
// now we add b and a. Note we do it backwards, in addition
// this doesn't matter, but in other potential instructions
// for instance divide 5 / 6 is not the same as 6 / 5
result = b + a;
push result // push the result to the stack
[11] // stack contents

So where does our stack pointer work? The stack pointer (or sp) is generally set to -1, which means that the pointer is null. Remember that an array starts at 0, and if the sp value is not initialized, it will be set to a random value placed there by the C compiler.

If we push three values, sp will become 2. So this array holds three values:

Sp points here (sp = 2)
|
V
[1, 5, 9]
0 1 2 < - array index

Now we push the stack once, we just need to reduce the top pointer. For example, if we push 9, the top of the stack will be 5:

Sp points here (sp = 1)
|
V
[1, 5]
0 1 < - array index

So, when we want to know what's at the top of the stack, we just need to look at the current value of sp. OK, you may be wondering how the stack works, but let's implement it in C. Very simple, just like IP, we should also define an sp variable, remember to assign it to -1! Define another array named stack, the code is as follows:


int ip = 0;
int sp = -1;
int stack[256]; //Use arrays or other structures that fit here
 
//Other C codes

Now if we want to push a value, we first increment the top of the stack pointer and then set the value at the current sp (which we just added). Note: the order of the two steps is important!


//Pop down 5
 
// sp = -1
sp++; // sp = 0
stack[sp] = 5; //  The top of the stack now becomes 5

So, in our execution function eval(), we can implement a push instruction like this:


void eval(int instr) {
  switch (instr) {
    case HLT: {
      running = false;
      break;
    }
    case PSH: {
      sp++;
      stack[sp] = program[++ip];
      break;
    }
  }
}

Now, as you can see, it's a little different from the eval() function that we implemented before. First, we put each case statement block in braces. This usage, which you may not be familiar with, allows you to define variables in the scope of each case. You don't need to define variables now, but you will. And it's easy to keep all the case blocks in a consistent style.

The second is the magic expression program[++ IP]. What did it do? Well, our program is stored in an array, and the PSH instruction needs to get an operand. An operand is essentially an argument, just like when you call a function, you can pass an argument to it. This case is called the stack value 5. We can get operands by increasing the instruction pointer IP. When IP is 0, this means that the PSH instruction has been executed, and we want to get the next instruction -- the value of the stack. This can be done by IP autoincrement (note: increasing the IP position is important, we want to autoincrement before we get the instruction, otherwise we just get the PSH instruction), and then we need to jump to the next instruction or it will cause a strange error. Of course, we can also simplify sp++ to stack[++sp].

For the POP instruction, the implementation is very simple. I only need to reduce the stack top pointer, but I generally want to be able to print the stack value on the push.

I omitted the code to implement other instructions and the swtich statement, and only listed the implementation of the POP instruction:


//Remember # include <Stdio. H> !
 
case POP: {
  int val_popped = stack[sp--];
  printf("popped %dn", val_popped);
  break;
}

Now the POP command works! All we did was put the top of the stack into the variable val_popped, and then the top pointer goes down one. If we first subtract one from the top of the stack, we'll get some invalid values, because sp might be 0, so we might assign the stack[-1] to val_popped, which is usually a bad idea.

Finally, the ADD directive. This instruction may cost you some brain cells, which is why we need braces {} to implement the case statement scope.


case ADD: {
  //First, we push the stack and store the value in a
  int a = stack[sp--];
 
  //And then we push the stack, and we store the values in b
 
  //Then add the two variables and push the result on the stack
  int result = a + b;
  sp++; //Add 1 ** to the top of the stack before assignment **
  stack[sp] = result; //Set the top value of the stack
 
  //Done!
  break;
}

register

Registers are an optional part of a virtual machine and are easy to implement. We mentioned earlier that we might need six registers: A, B, C, D, E, and F. As with instruction sets, we implement them with an enumeration.


typedef enum {
  A, B, C, D, E, F,
  NUM_OF_REGISTERS
} Registers;

Tip: enumeration NUM_OF_REGISTERS finally placed a number. This number gets the number of registers, even if you add other registers. Now we need an array to hold values for registers:


int registers[NUM_OF_REGISTERS];

Next you can read the value in the register:


printf("%dn", registers[A]); //Print the value of register A

The revision

I didn't put much thought into the register, but you should be able to write out some instructions for manipulating the register. For example, if you want to do any branch jump, you can do it by storing instruction Pointers and/or stack top Pointers in registers, or by implementing branch instructions.

The former is relatively quick and simple to implement. We can do this by adding registers representing IP and SP:


typedef enum {
  A, B, C, D, E, F, PC, SP,
  NUM_OF_REGISTERS
} Registers;

Now we need the implementation code to use the instruction pointer and the top pointer. A simple way to do this is to delete the sp and IP variables defined above and implement them using macro definitions:


#define sp (registers[SP])
#define ip (registers[IP])

Translator's note: this should be consistent with Registers enumeration, IP should be changed to the PC

This is a nice change, you don't have to rewrite a lot of code, and it works just fine.