## C2RTL: A High-level Synthesis System for IP Lookup and Packet Classification MD Iftakharul Islam, Javed I Khan Department of Computer Science Kent State University Kent, OH, USA. IEEE HPSR, 2021 Existing HLS tools Programming convention in C2RTL C2RTL design flow Evaluation #### IP lookup and packet classification - Two of the most fundamental operations of a router. - IP lookup ⇒ longest prefix match Table: An example forwarding table | Prefix | Next-hop | |-------------------|----------| | 131.123.252.42/32 | 1 | | 169.254.0.0/16 | 2 | | 169.254.192.0/18 | 3 | Packet classification ⇒ exact/prefix match to find action Table: An example packet classifier | Rule | Src. IP | Dest. IP | Src. Port | Dest. Port | Protocol | Action | |------|---------|----------|-----------|------------|----------|---------| | r1 | 01100* | 01100* | 111 | 111 | 80 | enqueue | | r2 | 11010* | * | 10* | 11* | 22 | drop | | r2 | 11* | 11011* | 10* | 11* | 22 | modify | | r4 | * | * | * | * | * | forward | - Implemented in TCAM or Pipelined ASIC. - ASIC generally executes **Trie** based IP lookup and packet classification algorithms in hardware. #### ASIC Design - ASIC is developed register-transfer (RTL) level Verilog or VHDL ⇒ extremely complex, requires huge effort - Need to use circuit level abstractions - Need to keep track of path latency to schedule operations at right time. - This calls for designing ASIC in a higher level. - High-level Synthesis (HLS) is a design methodology where pipelined ASIC is developed using high-level languages such as C or SystemC. - HLS generates corresponding Verilog or VHDL RTL from C or SystemC code. - Higher design productivity and lower complexity - Shorter simulation cycle. - HLS has not been adopted in routing/switching chips, to the best of our knowledge. #### C2RTL High-level Synthesis - Primarily designed for IP lookup or packet classification, but we would like to extend it for other data plane functions as well. - C2RTL takes an IP lookup or packet classification algorithm in C as input and generates corresponding Verilog RTL. - Implemented an a GCC plugin - We made the source code publicly available https://tamimcse.github.io/c2rtl. # Current HLS tools focus on C or SystemC based system design. - Mentor's Catapult - Cadence's Stratus - NEC's CyberWorkBench - Synopsys' Synphony - SWSL [ANCS 2013] (designed for IP lookup) - Switch compiler for programmable ASIC. - LEAP [ANCS 2012], Domino [SIGCOMM 2016] - HLS tools for FPGA. - Xilinx's Vivado - Bambu [FPL 2013] - LegUp [FPGA 2011] ### Programming convention in C2RTL An input program in C2RTL is implemented in C language, but with several restrictions. #### Table: programming restrictions in C2RTL | No loop (while, for, do-while) | |------------------------------------------------------| | No unstructured control flow (goto, break, continue) | | No ternary operation | | No dynamic memory allocation | | No global variables | | No structure | | No switch | | No function call | | Each branch has to have a separate return statement | bb 2 #### Intermediate code generation by GCC ``` idx 22 = ip 21(D) >> 16; 1 = (long unsigned int) idx 22; 2 = 1 * 2; 3 = C16 \ 23(D) + 2; 4 = * 3: cidx 25 = (unsigned int) 4: if (cidx 25 != 0) #include <stdint.h> uint8 t sail(uint32 t ip, uint8 t N16[100 bb 3 uint16 t C16[100], uint8 t N24[100], 5 = cidx 25 << 8; uint16 t C24[100], uint8 t N32[100]) { _6 = ip_21(D) >> 8; 7 = 6 \& 255; bb 6 idx 28 = 5 + 7; unsigned int idx, cidx; 18 = (sizetype) idx 22; 8 = (long unsigned int) idx 28: 19 = N16 26(D) + 18; 9 = 8 * 2; 27 = * 19: 10 = C24 \ 29(D) + 9; idx = ip \gg 16; 11 = * 10: cidx = C16[idx]: cidx 30 = (unsigned int) 11; if (cidx) { if (cidx 30 != 0) idx = (cidx << 8) + ((ip >> 8) & 255) cidx = C24[idx]; if (cidx) { bb 4 12 = cidx 30 << 8; idx = (cidx << 8) + (ip & 255); bb 5 13 = ip \ 21(D) \& 255; return N32[idx]: 16 = (sizetype) idx 28: idx 33 = 12 + 13; 17 = N24 31(D) + 16: } else { 14 = (sizetype) idx 33: 32 = * 17; 15 = N32 34(D) + 14; return N24[idx]; _35 = *_15; } else { return N16[idx]; bb 7 <L4> [0.00%]: return 20: 10/17 ``` #### Control and Data flow graph (CDFG) - As Late as Possible (ALAP) scheduling [TCAD 1991]. - We obtained the latency of operations from Bambu's [FPL 2013] 45 nm Nandgate Open Cell library characterization. - We insert register when the result of an operation crosses cycle boundary. - We implement each operation using a Verilog module from a component library obtained from Bambu. - We implement each arrays using register file and SRAM. #### MUX tree generation Table: An example predicates for a SSA $\phi$ operation | | Selectors | | | | |-----------------------|-----------|-------|-------|-------| | Inputs | $S_1$ | $S_2$ | $S_3$ | $S_4$ | | <i>I</i> <sub>1</sub> | * | * | * | 0 | | <i>I</i> <sub>2</sub> | * | 1 | 1 | 1 | | <i>I</i> <sub>3</sub> | * | 0 | 1 | 1 | | 14 | 0 | * | 0 | 1 | | <b>I</b> 5 | 1 | * | 0 | 1 | (a) Predicates splitting tree (b) Corresponding MUX tree #### **Evaluation** - We evaluate C2RTL by implementing SAIL [SIGCOMM 2014], Poptrie [SIGCOMM 2015] and CP-Trie [HPSR 2021] based IPv6 lookup and TabTree [ANCS 2019] based packet classification. - We evaluate the resulting Verilog using Icarus Verilog simulator. It shows that the generated Verilog is functionally correct. Figure: C code to physical chip layout generation We use OpenROAD to generate physical chip layout from the Verilog. #### **Evaluation in ASIC** | | Poptrie | SAIL | CP-Trie | TabTree | |-----------------|-------------------------------|--------------------------------|-------------------------------|---------------------------------| | Clock speed | 1 GHz | 1 GHz | 1 GHz | 1 GHz | | Internal Power | 76.5 mW | 0.722 mW | 64.6 mW | 0.033 mW | | Switching Power | 24.4 mW | 0.229 mW | 22.2 mW | 0.0054 mW | | Leakage Power | 1.15 mW | 0.0108 mW | 0.926 mW | 0.00061 mW | | Total Power | 102.05 mW | 0.961 mW | 87.726 mW | 0.0391 mW | | Area | 0.0658 <i>mm</i> <sup>2</sup> | 0.00061 <i>mm</i> <sup>2</sup> | 0.0523 <i>mm</i> <sup>2</sup> | 0.000034 <i>mm</i> <sup>2</sup> | C2RTL design flow ## Thank You