16-Bit Single-Cycle CPU in Verilog
What this actually is
A CPU is just a tiny machine that reads an instruction, does what it says, and moves on to the next one. The CPU in your laptop has billions of transistors and dozens of optimizations stacked on top — but underneath, it's still doing that one loop.
I wanted to build the simplest possible version of that loop, in hardware, with no shortcuts. The result is a 16-bit single-cycle processor written in Verilog, simulated in Vivado, and finally synthesized onto a Basys 3 FPGA so I could see it run on a real board.
"Single-cycle" means every instruction finishes in one clock tick. "16-bit" means the registers, the ALU, and the memory words are all 16 bits wide — small enough to reason about, big enough to run real programs.
What it can do
The CPU understands ten instructions, grouped by shape:
- Math and logic: ADD, SUB, AND, SLL (shift left)
- Memory: LW (load a word from RAM), SW (store a word to RAM)
- Constants: ADDI (add a small constant)
- Control flow: BEQ (branch if equal), BNE (branch if not equal), JMP (unconditional jump)
That's enough to write loops, do arithmetic, walk through memory, and make decisions. Not enough to run Doom — but enough to demonstrate every fundamental concept a real CPU is built on.
The pieces
The CPU is eight modules, each doing one job:
    ┌─────────────┐
    │   Program   │  "where am I in the program?"
    │   Counter   │
    └──────┬──────┘
           ▼
    ┌─────────────┐
    │ Instruction │  "fetch the next 16 bits of code"
    │   Memory    │
    └──────┬──────┘
           ▼
    ┌─────────────┐  ┌─────────────┐
    │   Control   │──│  Register   │  "decode it; read its operands"
    │    Unit     │  │    File     │
    └──────┬──────┘  └──────┬──────┘
           │                ▼
           │         ┌─────────────┐
           └────────▶│     ALU     │  "do the math"
                     └──────┬──────┘
                            ▼
                     ┌─────────────┐
                     │ Data Memory │  "load or store, if needed"
                     └──────┬──────┘
                            ▼
                     ┌─────────────┐
                     │ Write-back  │  "save the result back to a register"
                     └─────────────┘
Each piece is a few dozen lines of Verilog. The Program Counter, for instance, is just a 16-bit register that ticks forward by 2 every clock — unless a branch or jump tells it to go somewhere else:
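    always @(posedge clk or posedge rst) begin
        if (rst)                      pc_out <= 16'b0;           // asynchronous reset to address 0
        else if (Jmp)                 pc_out <= new_address;     // unconditional jump
        else if (Branch && zero_flag) pc_out <= branch_address;  // taken branch
        else                          pc_out <= pc_out + 16'd2;  // next 16-bit instruction
    end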
The ALU is similarly minimal — a case statement on the 4-bit op code that picks between add, subtract, AND, and shift. The control unit is the brain that decodes the incoming instruction and flips the right switches on every other module so that everyone knows whether this cycle is a memory read, a register write, a branch test, or something else.
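The case-statement ALU described above might look roughly like the following. This is a minimal sketch, not the project's source: the module name `alu_sketch`, the port names, and the `alu_op` encodings are assumptions made for illustration. Only the four operations, the 4-bit op width, and the `zero_flag` name come from the text.

```verilog
// Hypothetical 16-bit ALU sketch. The op encodings below are assumptions
// for illustration, not the project's actual values.
module alu_sketch (
    input  wire [15:0] a,         // first operand
    input  wire [15:0] b,         // second operand (register or immediate)
    input  wire [3:0]  alu_op,    // 4-bit operation select
    output reg  [15:0] result,
    output wire        zero_flag  // high when result is all zeros
);
    localparam OP_ADD = 4'd0,
               OP_SUB = 4'd1,
               OP_AND = 4'd2,
               OP_SLL = 4'd3;

    // Purely combinational: the result follows the inputs within the cycle.
    always @(*) begin
        case (alu_op)
            OP_ADD:  result = a + b;
            OP_SUB:  result = a - b;
            OP_AND:  result = a & b;
            OP_SLL:  result = a << b[3:0];  // shift amount limited to 0..15
            default: result = 16'b0;        // unused encodings produce 0
        endcase
    end

    assign zero_flag = (result == 16'b0);
endmodule
```

A branch-if-equal can reuse this block by subtracting the two registers and checking `zero_flag`, which is consistent with how the Program Counter snippet consumes that flag.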
How I tested it
Two layers of testing:
- Simulation. A Verilog testbench drives the clock at 50 MHz, holds reset high for two cycles, then watches every signal in the design while a small program runs. Vivado's waveform viewer made it obvious when something was wrong (the PC didn't update on a branch, the ALU produced the wrong result, the register file wrote to the wrong slot).
- Real hardware. Once simulation passed, I synthesized the design, generated a bitstream, and flashed it onto a Basys 3 board. A tiny test program ("set R1 = 1, then double it forever") drove the lower 8 bits of R1 out to the board's LEDs. The LEDs counted up in powers of two (0000_0001 → 0000_0010 → 0000_0100 → …), exactly like simulation said they would.
Watching that happen the first time was the moment the whole thing stopped being theoretical.
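The simulation setup described above (50 MHz clock, reset held for two cycles, LED pattern observed) can be sketched as a small testbench. Since the CPU source is not shown here, the design under test is a stand-in: a hypothetical `doubler_standin` module that only mimics what the "double R1 forever" program makes visible on the LEDs. All module and signal names are assumptions.

```verilog
`timescale 1ns / 1ps

// Stand-in for the CPU running the "double R1 forever" program: one register
// that resets to 1 and doubles each clock. This is NOT the project's CPU.
module doubler_standin (
    input  wire       clk,
    input  wire       rst,
    output wire [7:0] led
);
    reg [15:0] r1;
    always @(posedge clk or posedge rst) begin
        if (rst) r1 <= 16'd1;
        else     r1 <= r1 << 1;
    end
    assign led = r1[7:0];  // lower 8 bits drive the LEDs
endmodule

module tb_sketch;
    reg        clk = 1'b0;
    reg        rst = 1'b1;
    wire [7:0] led;

    doubler_standin dut (.clk(clk), .rst(rst), .led(led));

    // 50 MHz clock: 20 ns period, so toggle every 10 ns.
    always #10 clk = ~clk;

    initial begin
        // Hold reset high across two rising edges, then release it
        // on a falling edge so it does not race the next rising edge.
        repeat (2) @(posedge clk);
        @(negedge clk) rst = 1'b0;

        // Sample the LEDs once per cycle, away from the active edge.
        repeat (8) begin
            $display("t=%0t led=%b", $time, led);
            @(negedge clk);
        end
        $finish;
    end
endmodule
```

Run in a simulator, the printed `led` value should walk a single set bit from the lowest position to the highest, one step per clock. Swapping the stand-in for the real top-level module would leave the clock and reset logic unchanged.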
What I learned
- Hardware is software with the abstraction stripped away. Every "if statement" is a multiplexer. Every "variable" is a register. Every "while loop" is a clock and a controller. Once you see it that way, computer architecture stops being mysterious.
- The control unit is where all the complexity lives. The ALU is easy. Memory is easy. The control unit — the thing that translates instruction bits into the right combination of "write here, read there, route this, ignore that" — is where bugs hide.
- Single-cycle is a great place to start, and a great argument for pipelining. The clock period has to accommodate the worst case, so every instruction, even a simple ADD, takes as long as the slowest one (a memory access). That's the whole motivation for pipelining, which would be the next version of this project.
- Simulation is not enough. The design simulated perfectly, but it still took two rounds of pin-mapping and timing-constraint fixes to get it running on the board. The gap between "works in Vivado" and "works on Artix-7 silicon" is its own discipline.
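To make the control-unit point concrete, here is a minimal sketch of a decoder for the ten instructions. The opcode values, the assumption that the opcode sits in the top four instruction bits, and every signal name other than `Branch` and `Jmp` (which appear in the Program Counter snippet) are hypothetical. The real design may split the work differently.

```verilog
// Hypothetical control-unit sketch. Opcode values and most signal names
// are assumptions for illustration, not the project's actual encoding.
module control_sketch (
    input  wire [3:0] opcode,    // assumed to be instr[15:12]
    output reg        RegWrite,  // write the result to the register file
    output reg        MemRead,   // LW: read data memory
    output reg        MemWrite,  // SW: write data memory
    output reg        ALUSrc,    // 1 = second ALU operand is the immediate
    output reg        Branch,    // BEQ/BNE: PC may take the branch target
    output reg        BranchNe,  // 1 = branch on "not equal" (BNE)
    output reg        Jmp,       // JMP: unconditional PC load
    output reg  [3:0] alu_op     // operation select for the ALU
);
    localparam OP_ADD = 4'd0, OP_SUB  = 4'd1, OP_AND = 4'd2, OP_SLL = 4'd3,
               OP_LW  = 4'd4, OP_SW   = 4'd5, OP_ADDI = 4'd6,
               OP_BEQ = 4'd7, OP_BNE  = 4'd8, OP_JMP = 4'd9;

    always @(*) begin
        // Safe defaults: an unknown opcode changes no architectural state.
        RegWrite = 1'b0; MemRead  = 1'b0; MemWrite = 1'b0;
        ALUSrc   = 1'b0; Branch   = 1'b0; BranchNe = 1'b0;
        Jmp      = 1'b0; alu_op   = OP_ADD;

        case (opcode)
            OP_ADD, OP_SUB, OP_AND, OP_SLL: begin
                RegWrite = 1'b1;
                alu_op   = opcode;  // in this sketch R-type opcodes double as ALU ops
            end
            OP_ADDI: begin RegWrite = 1'b1; ALUSrc = 1'b1; end
            OP_LW:   begin RegWrite = 1'b1; ALUSrc = 1'b1; MemRead = 1'b1; end
            OP_SW:   begin ALUSrc = 1'b1; MemWrite = 1'b1; end
            OP_BEQ:  begin Branch = 1'b1; alu_op = OP_SUB; end
            OP_BNE:  begin Branch = 1'b1; BranchNe = 1'b1; alu_op = OP_SUB; end
            OP_JMP:  Jmp = 1'b1;
            default: ;  // keep the defaults
        endcase
    end
endmodule
```

Assigning defaults at the top of the block keeps every output driven on every path, which avoids inferred latches. With a `BranchNe` bit like this, one option is for the PC to take the branch when `Branch && (zero_flag ^ BranchNe)`, so BEQ and BNE share the same compare. Whether the project handles BNE this way is not stated.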
What I'd build next
- Pipelining. Splitting the single cycle into five stages (IF, ID, EX, MEM, WB) — and dealing with the hazards (data dependencies, branch mispredictions) that come with it.
- A bigger instruction set. Multiply, divide, more addressing modes.
- An assembler. Right now I hand-encode test programs as hex into the instruction memory's initial block. A small Python assembler that translates assembly text into 16-bit machine words would make the CPU much easier to actually program.
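For a sense of what hand-encoding looks like, here is a minimal sketch of an instruction memory whose initial block holds the "set R1 = 1, then double it forever" program. The instruction format (4-bit opcode, 3-bit register fields, 6-bit immediate, 12-bit jump address), the opcode values, the 256-word depth, and the module name are all assumptions. The project's real encoding is not given.

```verilog
// Hypothetical instruction memory sketch. The encoding is assumed:
//   R-type: [15:12] opcode, [11:9] rs, [8:6] rt, [5:3] rd, [2:0] unused
//   I-type: [15:12] opcode, [11:9] rs, [8:6] rt, [5:0] immediate
//   J-type: [15:12] opcode, [11:0] target byte address
module instr_mem_sketch (
    input  wire [15:0] addr,   // byte address from the PC (steps by 2)
    output wire [15:0] instr
);
    localparam OP_ADD = 4'd0, OP_ADDI = 4'd6, OP_JMP = 4'd9;  // assumed values

    reg [15:0] mem [0:255];
    integer i;

    initial begin
        for (i = 0; i < 256; i = i + 1)
            mem[i] = 16'h0000;                      // clear unused words

        mem[0] = {OP_ADDI, 3'd0, 3'd1, 6'd1};       // ADDI R1, R0, 1   = 16'h6041
        mem[1] = {OP_ADD,  3'd1, 3'd1, 3'd1, 3'd0}; // ADD  R1, R1, R1  = 16'h0248
        mem[2] = {OP_JMP,  12'd2};                  // JMP  2 (loop)    = 16'h9002
    end

    // Drop the low address bit: each 16-bit word occupies two byte addresses.
    assign instr = mem[addr[8:1]];
endmodule
```

Writing the words as field concatenations with the hex value in a comment is one way to keep hand-encoded programs checkable until an assembler exists.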
Stack
Verilog HDL, Xilinx Vivado (simulation + synthesis + bitstream), Basys 3 FPGA (Xilinx Artix-7), inline SystemVerilog testbench.