How to count?

After a while since using Verilator I wanted to see what advice will give it's lint on my latest project.

It didn't find anything really bad. There was WIDTH mismatch somewhere, but also warning related to the use of modulo % operator.

Internet says that modulo operator should be avoided in Verilog. Xilinx forum says it won't synthesize unless it's \(\mod 2^N\).

In my design of binary numeric screen mode I use modulo 9 and it works! Is it very bad?

Test problem

Should I change my design? Should I really avoid using modulo in synthesizable Verilog?

To answer this I decided to compare resources usage of different solutions to the following simple problem:

Input: clock signal

Output: Counter that goes from 0 to 99 cyclically.

Harder version:

Input: clock signal

Output 1: Counter that goes from 0 to 99 cyclically.

Output 2: As above, but modulo 9.

Solutions

Simple problem solution using modulo

module c100_mod (
    input clk,
    output reg [6:0] c100,
    output [3:0] c9
);
assign c9 = 4'd0;

initial c100 = 7'd0;

always @(posedge clk)
    c100 <= (c100 + 1'd1) % 100;

endmodule

Simple problem solution using if

module c100_if (
    input clk,
    output reg [6:0] c100,
    output [3:0] c9
);
assign c9 = 4'd0;

initial c100 = 7'd0;

always @(posedge clk)
    if (c100 == 7'd99)
        c100 <= 7'd0;
    else
        c100 <= c100 + 1'd1;

endmodule

Is there a difference in FPGA resources usage?

Utilization	c100_mod	c100_if
Number of Slice Registers	7	7
* Number used as Flip Flops	7	7
* Number used as Latches	0	0
* Number used as Latch-thrus	0	0
* Number used as AND/OR logics	0	0
Number of Slice LUTs	9	9
* Number used as logic	9	9
* * Number using O6 output only	8	7
* * Number using O5 output only	0	0
* * Number using O5 and O6	1	2
* * Number used as ROM	0	0
* Number used as Memory	0	0
Number of occupied Slices	3	5
Number of MUXCYs used	0	0
Number of LUT Flip Flop pairs used	9	9
* Number with an unused Flip Flop	3	2
* Number with an unused LUT	0	0
* Number of fully used LUT-FF pairs	6	7
* Number of unique control sets	1	2
* Number of slice register sites lost to control set restrictions	1	9

In both designs we have 7 slices used as flip-flops and 9 LUTs used as logic. Surprisingly, FFs and LUTs can be packed better in the design using modulo operator, giving difference of 3 slices used vs 5 slices used.

Harder problem solutions with modulo

In these solutions instead of:

assign c9 = 4'd0;

We have:

assign c9 = c100 % 9;

Verilator lint doesn't like this. Is says that c100 % 9 has different width than c9. We can fix this warning by writing:

wire [6:0] c100mod9;
assign c100mod9 = c100 % 9;
assign c9 = c100mod9[3:0];

reg [6:0] c100mod9;
always @*
begin
    c100mod9 = c100 % 9;
    c9 = c100mod9[3:0];
end

All three versions are equivalent and use the same resources.

Solution combining this code with c100_mod will be referenced as c100_mod_mod9, while solution combining this code with c100_if will be referenced as c100_if_mod9. Let's compare them:

Utilization	c100_mod_mod9	c100_if_mod9
Number of Slice Registers	7	7
* Number used as Flip Flops	7	7
* Number used as Latches	0	0
* Number used as Latch-thrus	0	0
* Number used as AND/OR logics	0	0
Number of Slice LUTs	15	16
* Number used as logic	15	16
* * Number using O6 output only	11	11
* * Number using O5 output only	0	0
* * Number using O5 and O6	4	5
* * Number used as ROM	0	0
* Number used as Memory	0	0
Number of occupied Slices	5	5
Number of MUXCYs used	0	0
Number of LUT Flip Flop pairs used	15	16
* Number with an unused Flip Flop	9	11
* Number with an unused LUT	0	0
* Number of fully used LUT-FF pairs	6	5
* Number of unique control sets	1	1
* Number of slice register sites lost to control set restrictions	1	1

Both versions occupy 5 slices, but version using modulo requires one LUT less.

Harder problem solution with if

Finally, I implemented counter modulo 9 with its own registers. This version is called c100_if_if.

initial
begin
    c100 = 7'd0;
    c9   = 4'd0;
end

always @(posedge clk)
    if (c100 == 7'd99)
    begin
        c100 <= 7'd0;
        c9   <= 4'd0;
    end
    else
    begin
        c100 <= c100 + 1'd1;
        if (c9 == 4'd8)
            c9 <= 0'd0;
        else
            c9 <= c9 + 1'd1;
    end

I didn't expect too much of it, but here are results:

Utilization	c100_mod_mod9	c100_if_mod9	c100_if_if
Number of Slice Registers	7	7	11
* Number used as Flip Flops	7	7	11
* Number used as Latches	0	0	0
* Number used as Latch-thrus	0	0	0
* Number used as AND/OR logics	0	0	0
Number of Slice LUTs	15	16	14
* Number used as logic	15	16	14
* * Number using O6 output only	11	11	13
* * Number using O5 output only	0	0	0
* * Number using O5 and O6	4	5	1
* * Number used as ROM	0	0	0
* Number used as Memory	0	0	0
Number of occupied Slices	5	5	4
Number of MUXCYs used	0	0	0
Number of LUT Flip Flop pairs used	15	16	14
* Number with an unused Flip Flop	9	11	4
* Number with an unused LUT	0	0	0
* Number of fully used LUT-FF pairs	6	5	10
* Number of unique control sets	1	1	1
* Number of slice register sites lost to control set restrictions	1	1	5

Surprisingly, this version uses less LUTs than c100_mod_mod9. What's more interesting, it occupies less Slices than version c100_if! (which is basically the same, but without c9 related logic)

Better Slice usage can be partially explained by higher number of fully used LUT-FF pairs. Previous designs are more combinatorial, while this design uses additional registers. Slice contains both LUTs (that are used for combinatorial logic) and registers (that are used as flip-flops or latches). Design can be better packed into slices when number of registers used is close to the number of LUTs used.

Summary

Modulo (by constant) operator is not as bad as Internet says (at least when using Xilinx FPGAs). There is no simple rule to say which design is better. Balancing combinatorial logic and registers usage may be important.

Other thoughts

Often when I think about some state machine I don't care how states are represented. I usually use numbered states \(0 ... N\) in that case. But is it optimal?

Incrementing state from \(n\) to \(n+1\) represented as binary numbers means dealing with carry propagation delay.

Are there more efficient ways to change states? I'm not the first one to ask this question. There is some research about it.