Verilog Flip-Flop Macros

In this article, we will explore how to properly code flip-flops in Verilog. This may seem like a very basic topic, but Verilog is a language with pitfalls everywhere, and a minor difference in coding style can lead to a huge difference in design scalability and verification performance.

First, let's look at a basic flip-flop with a positive edge trigger and an asynchronous, negative-edge reset:

always_ff @(posedge clk, negedge rst_n) begin
    if (!rst_n)
        q <= '0;
    else
        q <= d;
end

Let's break it down piece by piece. First, the always_ff statement is relatively new to Verilog. It was added in the IEEE 1800-2005 SystemVerilog standard as a way for the designer to declare the intent to instantiate a flip-flop, which tools can then check. The always_comb and always_latch keywords declare combinational logic and latches, respectively. The @(posedge clk, negedge rst_n) part tells a Verilog simulator that this process should execute whenever the clk signal transitions from 0 to 1 (its positive edge), or the rst_n signal transitions from 1 to 0 (its negative edge). The value of q is either set to '0 (all zeros) or to the value of d, depending on whether the reset is active.

The <= is known as the non-blocking assignment operator. It tells the simulator that the assignment happens simultaneously with all other non-blocking assignments at the end of the current time slot. If you are scratching your head at that last sentence, join the crowd. This is a perfect illustration of how Verilog's semantics are confusing. In order to just instantiate a simple D-Q flip-flop, the language forces you to understand how an event simulator works. As a result, most digital designers will simply memorize a few idioms, and consult a style guide whenever they can't remember whether to use a blocking assignment =, or a non-blocking assignment <=.
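To see why the distinction matters, here is a sketch of a two-stage shift register written both ways (the signal names q1 and q2 are illustrative):

    // Non-blocking (<=): both flops update from the values sampled at the
    // clock edge, so this correctly infers two stages of delay.
    always_ff @(posedge clk) begin
        q1 <= d;
        q2 <= q1;   // q2 gets the old q1
    end

    // Blocking (=): q1 is updated first, then q2 reads the *new* q1, so
    // this describes only one stage of delay. (Plain always is used here
    // because tools typically flag blocking assignments in always_ff.)
    always @(posedge clk) begin
        q1 = d;
        q2 = q1;    // q2 gets the new q1
    end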

The essential components of a flip-flop are well understood by someone with an education in digital design. So let's try raising the abstraction level a bit, and do away with all the non-essential simulator boilerplate code.

One way for design projects to enforce a common style for sequential elements is to create a library of preprocessor macros. For example, here is the macro definition for a D flip-flop with an asynchronous, negative-edge reset:

// D flip-flop w/ async reset_n
`define DFF_ARN(q, d, clk, rst_n) \
always_ff @(posedge clk, negedge rst_n) begin \
    if (!rst_n) q <= '0; \
    else        q <= (d); \
end

Now to instantiate a flip-flop, all you have to type is `DFF_ARN(q, d, clk, rst_n), and the preprocessor will expand the rest.
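For instance, a two-stage pipeline register collapses to two one-line macro calls (the signal names here are illustrative):

    logic clk, rst_n;
    logic [7:0] in_data, stage1, stage2;

    `DFF_ARN(stage1, in_data, clk, rst_n)
    `DFF_ARN(stage2, stage1,  clk, rst_n)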

See my AES flops header for suggestions on a few other sequential elements.

There are several benefits to this approach. It gets rid of all the unimportant details and lets the designer focus on the structure of the logic. It also places the flip-flop code in a common header file, so you can perform updates globally across the project.

Let's play devil's advocate for a bit and discuss some problems.

First, what are the types of q, d, clk, and rst_n? The macro has no type annotations at all. As a result, the code is not type-safe and can be hard to debug. Because the macro is pure text substitution, it has no way of knowing the type of q, so there is no way to statically cast the '0 to the same type as q. For simple logic vectors this won't be a problem, but things can get complicated when dealing with struct and union types.

Second, preprocessor usage is problematic (some say it is evil). On a philosophical level, introducing preprocessor macros is really a crutch: it indicates that your language is incapable of expressing a concept, and you must invent your own syntax for it. More importantly, the preprocessor is simply not strict. Macros can legally be redefined at will (though compilers generally warn about this). Also, because macro definitions are not bound by file scope in Verilog, you can introduce compilation-order dependencies that create portability issues.
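A common mitigation, sketched here, is to wrap the macro header in an include guard so that repeated `include directives across a compilation unit do not trigger redefinition warnings (the file name flops.svh is hypothetical):

    // flops.svh -- hypothetical project-wide flop macro header
    `ifndef FLOPS_SVH
    `define FLOPS_SVH

    `define DFF_ARN(q, d, clk, rst_n) \
    always_ff @(posedge clk, negedge rst_n) begin \
        if (!rst_n) q <= '0; \
        else        q <= (d); \
    end

    `endif // FLOPS_SVH

This does not fix the underlying looseness, but it does make accidental double-inclusion harmless.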

Okay, so we have a working solution that has some limitations. Can we do better? Let's try accomplishing the same task, but this time we will use a module to abstract the flip-flop's functionality:

// D flip-flop w/ async reset_n
module dff_arn #(parameter type T = logic [7:0]) (
    input logic clk,
    input logic rst_n,

    output T q,
    input  T d
);

    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) q <= T'(0);
        else        q <= d;
    end

endmodule: dff_arn

It looks like we have solved all of our problems. By using a type parameter, we are enforcing strict typing of the input d and output q. We can even do a static cast of 0 for the reset value. And, hallelujah, this solution does not require any preprocessor at all!
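To illustrate the typing benefit, here is a hypothetical instantiation with a packed struct type. The parameter guarantees that d and q have matching types, and T'(0) resets every field:

    typedef struct packed {
        logic        valid;
        logic [31:0] data;
    } pipe_t;

    pipe_t stage_d, stage_q;

    dff_arn #(.T(pipe_t)) u_stage (
        .clk   (clk),
        .rst_n (rst_n),
        .d     (stage_d),
        .q     (stage_q)
    );

Connecting a signal of any other type to d or q would now be a compile-time error.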

Okay, so what's the catch?

Let's create a testbench to test the performance of our flip-flop constructs. The following code goes through a reset sequence, and then injects ten thousand random vectors into a 32-bit pipeline with ten thousand stages. There is no logic in between stages—this is just a giant shift register.

module test_flops();

    parameter int N = 10000;

    logic clk = 1'b0;
    logic rst_n = 1'b1;
    logic [31:0] flops [N];

    // *** Substitute flip-flop instances here ***

    always #10 clk = ~clk;

    initial begin
        // Reset sequence: hold rst_n low past the first clock edge
        rst_n = 1'b0;
        #25 rst_n = 1'b1;

        for (int i = 0; i < N; ++i) begin
            flops[0] = $random();
            @(posedge clk) #1;
        end

        $finish();
    end
endmodule: test_flops

To test the flop macros, substitute the following:

    // Method #1: Use flop macros
    for (genvar gi = 1; gi < N; ++gi) begin : g_flops
        `DFF_ARN(flops[gi], flops[gi-1], clk, rst_n)
    end

And to test the flop modules, substitute the following:

    // Method #2: Use flop module instances
    for (genvar gi = 1; gi < N; ++gi) begin : g_flops
        dff_arn #(.T(logic [31:0])) INST (.*, .q(flops[gi]), .d(flops[gi-1]));
    end

I ran that simulation using a well-known, modern simulator from a commercial EDA company. I apologize for being vague on the specifics, but I would encourage readers to do their own experiments.

Here are the results. Using flop macros, the simulation took approximately 5 seconds. Using flop module instances, it took approximately 55 minutes. That is 660 times slower than the flop macros!

This might seem like a contrived example, but what a difference! It appears that module hierarchy has a devastating effect on simulation performance, and it is simply not economical to add an additional level of hierarchy for primitive components such as flip-flops.

So what have we learned?

  1. For large projects, it is a good idea to use flip-flop macros to standardize the coding style of sequential elements.
  2. What might seem like a good idea in Verilog might be a major pitfall. Always profile your simulations :).
