Long time ago I started this blog by describing my experiments with programming microcontroller in Forth. I like this language, because it's very simple, yet it gives much of flexibility.
And sometimes it's really practical. James Bowman decided to create special softcore CPU with instruction set that matches closely basic Forth words. He named it J1. He used this CPU for high speed transmission of digital camera data trough gigabit Ethernet. I was impressed to see how simple J1 CPU is. You can read more about it here.
Everybody who had something to do with computer science knows what Turing Machine is. But not everybody knows brainfuck language or P'', which are inspired by Turing Machine. Nevertheless, brainfuck is perhaps the most famous esoteric programming language and there are lots of programs for it.
Recently, I came across Forth implementation of brainfuck compiler. I was struck by simplicity of it:
1024 constant size : init ( -- p *p ) here size erase here 0 ; : right ( p *p -- p+1 *p ) over c! 1+ dup c@ ; : left ( p *p -- p-1 *p ) over c! 1- dup c@ ; \ range check? : compile-bf-char ( c -- ) case [char] [ of postpone begin postpone dup postpone while endof [char] ] of postpone repeat endof [char] + of postpone 1+ endof [char] - of postpone 1- endof [char] > of postpone right endof [char] < of postpone left endof [char] , of postpone drop postpone key endof [char] . of postpone dup postpone emit endof \ ignore all other characters endcase ; : compile-bf-string ( addr len -- ) postpone init bounds do i c@ compile-bf-char loop postpone swap postpone c! postpone ; ;
It almost looks like brainfuck is a subset of Forth language!
That's when I decided I will write my first CPU softcore for FPGA and it will be dedicated for executing brainfuck programs. I will start from J1, remove what's unnecessary and implement only instructions needed for brainfuck.
When we look at sample brainfuck program...
++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.
...not knowing the language (please forget it for a moment if you do know it), we can see that:
++++++
or <<<<
;Impression number 1 is not important, but other two are easy to understand once we know the language.
-+<>[].,
[
opens the loop and ]
closes the loop<
decrements by 1 and >
increments by one data pointer-
decrements by 1 and +
increments by one data under the pointer,
and .
are used for input and output operationsNot surprisingly:
Because of that, I decided to support incrementation/decrementation by a constant in my CPU.
Instructions are generally 1 byte long (there is long jump, which is two bytes long). We have program counter (PC) and data address (A). We also have stack (S) to store program address of beginning of a loop.
Instruction | What it does? | Used by |
---|---|---|
00XXXXXX |
A += X (signed); PC++ | <> |
01XXXXXX |
(*A) += X (signed); PC++ | -+ |
10000000 |
if (*A) { PC = top(S) } else { pop(S); PC++ } | ] |
100XXXXX (X ≠ 0) |
if (*A) { push(S,PC+1); PC++ } else { PC += X } | [ |
101XXXXX XXXXXXXX |
if (*A) { push(S,PC+2); PC+=2 } else { PC += X } | [ |
11000000 |
store input to (*A) | , |
11100000 |
send (*A) to the output | . |
Translating brainfuck to bytecode is straightforward. There is only one difficulty: sometimes we need to use long jump and that makes jump addresses calculation more complicated.
Anyone interested can find compiler written in Python here.
This is Harvard architecture CPU. It has separate instruction memory and data memory. It also has I/O lines and uses (very simple) ALU.
I implemented it in Verilog and used Verilator to test it on compiled brainfuck programs.
Let's go trough Verilog code. No worries, it's really simple!
As anyone could expect, we have clock and reset signals.
input wire clk, input wire resetq,
We need to interface data memory. We have memory address, data input, data output and memory write signal for that.
output wire [`DADDR_WIDTH-1:0] mem_addr, output reg mem_wr, output reg [`DATA_WIDTH-1:0] mem_dout, input wire [`DATA_WIDTH-1:0] mem_din,
For I/O instructions (.,
) we have:
output reg io_wr, input wire [`DATA_WIDTH-1:0] io_din, output wire [`DATA_WIDTH-1:0] io_dout,
There is interface to program memory too:
output wire [`CADDR_WIDTH-1:0] code_addr, input wire [7:0] insn,
Most of the CPU definition is combinatorial logic. There is only one block which synchronizes with positive edge of clock signal.
always @(negedge resetq or posedge clk) begin if (!resetq) begin { pc, rsp, maddr, lj, lj_offset } <= 0; end else begin { pc, rsp, maddr, lj, lj_offset } <= { pcN, rspN, maddrN, ljN, lj_offsetN }; end end
As you can see, we have a bunch of registers
to store CPU state (pc, rsp, maddr, lj, lj_offset
)
and for each of them we have variable that can
be used to set value in the next clock cycle
(pcN, rspN, maddrN, ljN, lj_offsetN
).
I use stack module taken from J1 CPU:
reg [`DEPTH-1:0] rsp, rspN; reg rstkW = 0; // R stack write wire [`CADDR_WIDTH-1:0] rstkD; // R stack write value wire [`CADDR_WIDTH-1:0] rst0; stack #(.DEPTH(`DEPTH),.WIDTH(`CADDR_WIDTH)) rstack ( .clk(clk), .ra(rsp), .rd(rst0), .we(rstkW), .wa(rspN), .wd(rstkD) );
ALU is very simple. The only operation is signed (complement of two) addition.
reg signed [`DADDR_WIDTH-1:0] alu_a; reg signed [`CADDR_WIDTH-1:0] alu_b; reg [`DADDR_WIDTH-1:0] alu_c; always @(alu_a, alu_b) begin alu_c = alu_a + ($signed({alu_b,2'b0}) >>> 2); end
ALU is used by <>-=[
instructions. Depending on instruction
different values of alu_a
and alu_b
are set.
always @(maddr, insn, mem_din, lj, lj_offset, pc) begin alu_a = 15'bX; // let synthesis decide what takes least resources alu_b = 13'bX; casez ({lj,insn[7:6]}) 3'b0_00: begin alu_a = maddr; alu_b = $signed({insn[5:0],7'b0}) >>> 7; end // < > 3'b0_01: begin alu_a = {7'b0,mem_din}; alu_b = $signed({insn[5:0],7'b0}) >>> 7; end // - + 3'b0_10: begin alu_a = {2'b0,pc}; alu_b = $signed({insn[5:0],7'b0}) >>> 7; end // [ 3'b1_??: begin alu_a = {2'b0,pc}; alu_b = {lj_offset,insn}; end // long jump 3'b0_11: ; // ALU not used endcase end
This is the most important part. Here we have ALU output calculated and we can use it to change CPU state.
For clarity, calculation of program
counter (pcN
) is in a separate block.
In this block we only decide whether we enter the loop
([
) or leave it (]
).
assign io_dout = mem_din; // nothing else can go as IO output always @(pc, maddr, insn, alu_c, io_din, lj) begin // defaults mem_wr = 0; io_wr = 0; ljN = 0; maddrN = maddr; mem_dout = io_din; do_jump_or_ret = 0; do_jump = 0; casez ({lj,insn[7:5]}) 4'b0_00?: begin maddrN = alu_c; end // < or > 4'b0_01?: begin mem_dout = alu_c[7:0]; mem_wr = 1; end // - or + 4'b0_100: begin do_jump_or_ret = 1; do_jump = |insn[4:0]; end // [ or ] 4'b1_???: begin do_jump_or_ret = 1; do_jump = 1; end // do long jump 4'b0_101: begin ljN = 1; end // begin long jump 4'b0_110: begin mem_wr = 1; end // , 4'b0_111: begin io_wr = 1; end // . endcase end
PC is usually set to PC+1. Except when do_jump_or_ret
is high.
assign rstkD = pcN; // if we put anything on stack, it's pcN assign lj_offsetN = insn[4:0]; // remember offset from previous instruction always @ (do_jump_or_ret, do_jump, pc, mem_din, rsp, rst0, alu_c) begin // default: go to the next instruction pcN = pc + 1'b1; rspN = rsp; rstkW = 0; if (do_jump_or_ret) begin if (do_jump) begin // [ if (mem_din != 0) begin rspN = rsp + 1'b1; // into the loop rstkW = 1; end else begin pcN = alu_c[12:0]; // skip the loop end end else begin // ] if (mem_din != 0) pcN = rst0; // loop again else rspN = rsp - 1'b1; // leave the loop end end end
Of course I'm not the first one. There are other brainfuck CPUs. There is even a paper about implementing 256-core brainfuck CPU on FPGA from MIT. But it's my first CPU and I think it's quite elegant.
I plan to run it on my FPGA and communicate with it trough UART.