Understanding VortexGPGPU: RTL Insights and vecadd Simulation
Let's talk about what RTL actually is
RTL (register-transfer level) is the abstraction at which we describe the functionality and behaviour of a digital circuit, using some high-level representation. One such representation is a schematic of the circuit, drawn using a standard logic cell library. So let's represent a fairly simple digital device, say a 4-bit adder with 1-cycle latency (q[3:0] = a[3:0] + b[3:0]):
The schematic was created with the Yosys synthesis tool
Here we can see that even such a simple device looks complicated, even in a representation as high-level as a schematic.
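To get a feel for where that complexity comes from, here is a minimal Python sketch of a 4-bit adder built only from gate-level operations. It assumes a textbook ripple-carry structure, which is not necessarily the netlist Yosys produced for the picture, and it leaves out the four flip-flops that give the 1-cycle latency. The names `full_adder` and `ripple_add4` are made up for this illustration.

```python
# Gate-level sketch of a 4-bit adder (textbook ripple-carry structure).
# Assumption: this is NOT the netlist Yosys produced, only an illustration.

def full_adder(a, b, cin):
    """One-bit full adder built from XOR, AND and OR gates."""
    p = a ^ b                    # propagate signal: XOR gate
    s = p ^ cin                  # sum bit: second XOR gate
    cout = (a & b) | (p & cin)   # carry out: two AND gates and one OR gate
    return s, cout

def ripple_add4(a, b):
    """Add two 4-bit values bit by bit; the final carry is discarded."""
    carry = 0
    result = 0
    for i in range(4):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result

# Exhaustive check against ordinary integer addition modulo 16.
all_match = all(ripple_add4(a, b) == (a + b) & 0xF
                for a in range(16) for b in range(16))
print(all_match)
```

Even in this simplified form, each bit costs about five gates, so the whole adder already needs around twenty gates plus the output registers.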
Luckily, today we have an even higher level of digital circuit abstraction: HDLs (hardware description languages) such as Verilog and VHDL.
So let's represent our adder in Verilog:
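module adder4b(
    input            clk,
    input      [3:0] a,
    input      [3:0] b,
    output reg [3:0] q    // declared as reg in the port list, not redeclared in the body
);
    // Register the sum on every rising clock edge (1-cycle latency).
    // The carry out of bit 3 is dropped because q is only 4 bits wide.
    always @(posedge clk)
    begin
        q <= a + b;
    end
endmodule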
Looks amazingly laconic, doesn't it? It definitely is.
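The behaviour this module describes can be mimicked in a few lines of Python. This is only a behavioural sketch: the class `Adder4b` and its `posedge` method are names invented here, and the register is assumed to start at 0, whereas in real hardware or a simulator it would start undefined.

```python
class Adder4b:
    """Behavioural stand-in for adder4b: q changes only on a clock edge."""

    def __init__(self):
        self.q = 0  # assumption: starts at 0 (real hardware starts undefined)

    def posedge(self, a, b):
        """Model one rising clock edge: latch the 4-bit sum of a and b."""
        self.q = (a + b) & 0xF  # keep 4 bits only, as q[3:0] does

dut = Adder4b()
before = dut.q        # before any clock edge q still holds the old value
dut.posedge(3, 4)
after = dut.q         # 3 + 4 = 7
dut.posedge(9, 8)
wrapped = dut.q       # 17 does not fit into 4 bits, so q becomes 1
print(before, after, wrapped)
```

The point of the sketch is that the inputs have no visible effect until the clock edge arrives, which is exactly the 1-cycle latency of the Verilog version.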
Circuit developers create quite high-level descriptions of digital devices. Today's synthesis tools are so advanced that developers usually do not have to think about how their design will be implemented in a given basis of logic elements (ASIC or FPGA).
These days designers do not need to explicitly help the synthesis tool optimize the logic, except in some subtle and rare cases. They simply use the high-level language's operators such as addition and subtraction (q = a+b), and sometimes even multiplication and division (q = a*b, q = a/b).
And developers these days don't necessarily need to know all the details of how their design will be implemented at the low (transistor) level. There are exceptions of course: for example, developers have to avoid heavy logic in register-to-register stages, so that the logic between two registers fits into the cycle time. That is why processor instructions such as multiplication and division execute in several cycles: the developers intentionally split such heavy logic into several stages.
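One way to picture this splitting is a sequential shift-and-add multiplier, which does only one small addition per clock cycle instead of the whole multiplication at once. The Python sketch below is an illustration under that assumption; it is not how Vortex or any particular processor implements multiplication, and the names `SeqMultiplier4b` and `multiply` are hypothetical.

```python
class SeqMultiplier4b:
    """Sequential 4x4-bit multiplier: one shift-and-add step per clock cycle."""

    def __init__(self):
        self.busy = False
        self.product = 0

    def start(self, a, b):
        self.a = a & 0xF       # multiplicand, shifted left every cycle
        self.b = b & 0xF       # multiplier, consumed from the lowest bit up
        self.product = 0
        self.steps_left = 4    # one step per multiplier bit
        self.busy = True

    def posedge(self):
        """One clock cycle: a single conditional add followed by shifts."""
        if not self.busy:
            return
        if self.b & 1:
            self.product += self.a
        self.a <<= 1
        self.b >>= 1
        self.steps_left -= 1
        self.busy = self.steps_left > 0

def multiply(a, b):
    """Clock the multiplier until it is done; return (product, cycles used)."""
    m = SeqMultiplier4b()
    m.start(a, b)
    cycles = 0
    while m.busy:
        m.posedge()
        cycles += 1
    return m.product, cycles

print(multiply(13, 11))  # (143, 4)
```

Each cycle now contains only one adder's worth of logic between registers, at the cost of taking four cycles to produce the result.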
More about RTL can be found on the internet, of course, for example here.
Running the quick vecadd demo with the rtlsim driver
Now let's go back to Vortex GPGPU. Last time we omitted the --driver option when running the test, so the default simx driver was used, which is a pure software emulation of the VortexGPGPU.
Now we'd like to use the rtlsim driver. It is of course also a software-emulated version of the device, but one that was compiled into a C++ model from VortexGPGPU's RTL by means of Verilator.
We have to rebuild the project with the DEBUG variable defined, in order to enable more verbose logging and to turn on waveform tracing of the RTL's internal signals.
$ make clean
$ export DEBUG=1
$ make
Now we can also see that running the test gives us more verbose output, especially regarding its RISC-V processor:
$ ./ci/blackbox.sh --cores=2 --app=vecadd --driver=rtlsim
CONFIGS=-DNUM_CLUSTERS=1 -DNUM_CORES=2 -DNUM_WARPS=4 -DNUM_THREADS=4
running: CONFIGS=-DNUM_CLUSTERS=1 -DNUM_CORES=2 -DNUM_WARPS=4 -DNUM_THREADS=4 make -C ./ci/../runtime/rtlsim
running: make -C ./ci/../tests/opencl/vecadd run-rtlsim
Workload size=64
1: core0-commit: wid=2, PC=0x62eb8438, ex=ALU, tmask=1011, wb=0, rd=55, sop=0, eop=0, data={0x67e406d5, 0x4972e9a4, 0xae1f8acb, 0x421c7a9} (#14457195637732)
1: D$0 Wr Req: wid=1, PC=0x5caa6d9c, tmask=0100, addr={0x62acbad0, 0x28d24a52, 0x956a6399, 0xd318b944}, atype={00, 00, 00, 00}, byteen=0x1421, data={0xb3b2442f, 0x13d213d2, 0x8bc7b3b3, 0x80a1d8be}, tag=0x473e2cadb2d5caa6d9c32092 (#1223943173835)
...
...
31729: l2cache mem-wr-req: addr=0xff004100, tag=0xb3000000037, byteen=1111000000000000000000000000000000000000000000000000000000000000, data=0xf7abaa76a66fae6d496e758a4329867353fe85866d2a1bf439c26e1b870880286721af5038e6b618ca42ec3cf40e0b0aed5662cd5874db5e5e1aad2e (#192199786496)
...
...
[VXDRV] MEM_FREE: dev_addr=0x100380
[VXDRV] COPY_FROM_DEV: dev_addr=0x100040, host_addr=0x0x7ffe6ba02b9c, size=4
Elapsed time: 2134 ms
Download destination buffer
[VXDRV] COPY_FROM_DEV: dev_addr=0x100280, host_addr=0x0x55b5ba8220d0, size=256
Verify result
PASSED!
[VXDRV] MEM_FREE: dev_addr=0x100080
[VXDRV] MEM_FREE: dev_addr=0x100180
[VXDRV] MEM_FREE: dev_addr=0x100280
[VXDRV] COPY_FROM_DEV: dev_addr=0xff004040, host_addr=0x0x55b5ba8220d0, size=256
PERF: instrs=7931, cycles=14250, IPC=0.556561
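The IPC value in the last line is simply the number of executed instructions divided by the number of cycles. As a quick sanity check, the following Python sketch parses that PERF line and recomputes the ratio. The regular expression is an assumption based only on the single line shown above, and `parse_perf` is a helper name invented for this example.

```python
import re

# The PERF line exactly as printed at the end of the run above.
perf_line = "PERF: instrs=7931, cycles=14250, IPC=0.556561"

def parse_perf(line):
    """Extract instruction count, cycle count and reported IPC from a PERF line."""
    m = re.match(r"PERF: instrs=(\d+), cycles=(\d+), IPC=([0-9.]+)", line)
    if m is None:
        raise ValueError("not a PERF line: " + line)
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

instrs, cycles, reported_ipc = parse_perf(perf_line)
computed_ipc = instrs / cycles   # instructions per cycle
print(f"{computed_ipc:.6f}")     # 0.556561
```

So in this run the two cores together committed a little more than one instruction every two cycles.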
The test also created a waveform file, tests/opencl/vecadd/trace.vcd, containing all internal signals of the processor:
GTKWave was used to visualize the waveform
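A VCD file is plain text, so besides opening it in GTKWave we can also inspect it with a short script. The sketch below lists the signals declared in a VCD header. It runs on a tiny hand-written sample rather than on the real trace, the signal names in the sample are made up, and it assumes every `$scope` and `$var` declaration sits on its own line.

```python
# Hand-written sample; the real trace.vcd is far larger and has other signals.
SAMPLE_VCD = """\
$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 4 % q [3:0] $end
$upscope $end
$enddefinitions $end
#0
0!
b0000 %
#5
1!
b0111 %
"""

def list_vcd_signals(vcd_text):
    """Return (hierarchical name, bit width) for each $var in the VCD header."""
    scope = []     # stack of enclosing scope names
    signals = []
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "$scope":            # $scope <kind> <name> $end
            scope.append(tok[2])
        elif tok[0] == "$upscope":        # leave the current scope
            scope.pop()
        elif tok[0] == "$var":            # $var <type> <width> <id> <name> ... $end
            signals.append((".".join(scope + [tok[4]]), int(tok[2])))
        elif tok[0] == "$enddefinitions": # header is over, value changes follow
            break
    return signals

signals = list_vcd_signals(SAMPLE_VCD)
print(signals)  # [('top.clk', 1), ('top.q', 4)]
```

The same function could in principle be fed the contents of trace.vcd, although for a file of that size it would be better to read it line by line instead of loading it whole.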
Next
Next, I will explore how the vector addition test actually runs on VortexGPGPU's RISC-V processor.