Welcome to Tensil

What if you could just run this to get a custom ML accelerator specialized to your needs?

$ tensil rtl --arch <my_architecture>

What if compiling your ML model for that accelerator target was as easy as running this?

$ tensil compile --arch <my_architecture> --model <my_model>

Wonder no more: with Tensil you can!

What is Tensil?

Tensil is a set of tools for running machine learning models on custom accelerator architectures. It includes an RTL generator, a model compiler, and a set of drivers. It enables you to create a custom accelerator, compile an ML model targeted at it, and then deploy and run that compiled model.

The primary goal of Tensil is to allow anyone to accelerate their ML workloads. Currently, we are focused on supporting convolutional neural network inference on edge FPGA (field programmable gate array) platforms, but we aim to support all model architectures on a wide variety of fabrics for both training and inference.

You should use Tensil if:

  • you have a convolutional neural network based ML workload
  • you need to run it at the edge (i.e. not in a data-center)
  • you want to avoid changing your model to make it work on a GPU/CPU
  • you want to offload heavy computation from your host CPU or microcontroller

Unique benefits of Tensil

With Tensil you can:

  • run your model as-is, without quantization or other degradation
  • achieve significantly better performance per watt
  • make use of a huge variety of FPGA platforms

Limitations of Tensil (for now)

At present, these are Tensil’s limitations:

  • only supports convolutional neural networks
  • driver support for FPGAs only

Join us on Discord or on Github to help us plan our roadmap!

Where should I go next?

Select a section below to dive in. We recommend beginning at Getting Started.

1 - Getting Started

The essentials for getting started with Tensil

Prerequisites

The easiest way to get started with Tensil is through our Docker containers. Therefore, we recommend installing Docker before continuing.

Installation

To install from Docker, run:

$ docker pull tensilai/tensil:latest
$ docker run -v $(pwd):/work -w /work -it tensilai/tensil:latest bash

You will be dropped into a shell inside the Tensil container. Run

$ tensil compile --help

to verify that it is working correctly.

Try it out!

Try compiling an example ML model:

$ tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/resnet20v2_cifar.onnx -o "Identity:0" -s true

Next up, try a tutorial to learn how to use Tensil.

For Contributors

Installation from source

See the project README for instructions on how to build from source.

2 - How To

Recipes for common tasks

2.1 - Compile an ML model

How to compile your ML model for an accelerator architecture

Things you’ll need

  • your ML model. If you don’t have one handy, continue on and use one of the demo models.
  • an architecture file in .tarch format. If you don’t know what this is yet, continue on and we’ll supply one for you.

1. Convert your ML model to ONNX

The first thing you need to do is convert your ML model to the ONNX format. ONNX stands for Open Neural Network Exchange, and conversion to ONNX is supported by all the major frameworks; see your framework’s documentation for specific instructions.
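
For example, if your model is in PyTorch, a minimal export might look like the sketch below. The model, input shape, file name, and node names here are placeholders; for TensorFlow/Keras models the tf2onnx package provides similar functionality.

import torch
import torchvision

# Placeholder model and input shape -- substitute your own network.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "my_model.onnx",
    input_names=["x"],
    output_names=["output"],
    opset_version=11,
)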

2. Run the Tensil compiler

First, ensure you have Tensil installed by pulling and running the Tensil Docker container:

$ docker pull tensilai/tensil:latest
$ docker run -v $(pwd):/work -w /work -it tensilai/tensil:latest bash

Then from the container shell, run:

$ tensil compile -a <tarch_file> -m <onnx_file> -o output_node -s true

To compile with an example model and architecture file, the command is

$ tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/resnet20v2_cifar.onnx -o "Identity:0" -s true

You should see some output like this:

$ tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/resnet20v2_cifar.onnx -o "Identity:0" -s true
NCHW[1,3,32,32]=NHWC[1,32,32,1]=1024*16
List(-1, 256)
----------------------------------------------------------------------------------------------
COMPILER SUMMARY
----------------------------------------------------------------------------------------------
Model:                                           resnet20v2_cifar_onnx_ultra96v2 
Data type:                                       FP16BP8                         
Array size:                                      16                              
Consts memory size (vectors/scalars/bits):       2,097,152                       33,554,432 21
Vars memory size (vectors/scalars/bits):         2,097,152                       33,554,432 21
Local memory size (vectors/scalars/bits):        20,480                          327,680    15
Accumulator memory size (vectors/scalars/bits):  4,096                           65,536     12
Stride #0 size (bits):                           3                               
Stride #1 size (bits):                           3                               
Operand #0 size (bits):                          24                              
Operand #1 size (bits):                          24                              
Operand #2 size (bits):                          16                              
Instruction size (bytes):                        9                               
Consts memory maximum usage (vectors/scalars):   35,743                          571,888    
Vars memory maximum usage (vectors/scalars):     13,312                          212,992    
Consts memory aggregate usage (vectors/scalars): 35,743                          571,888    
Vars memory aggregate usage (vectors/scalars):   46,097                          737,552    
Number of layers:                                23                              
Total number of instructions:                    102,741                         
Compilation time (seconds):                      71.562                          
True consts scalar size:                         568,474                         
Consts utilization (%):                          97.210                          
True MACs (M):                                   61.476                          
MAC efficiency (%):                              0.000                           
----------------------------------------------------------------------------------------------
---------------------------------------------
ARTIFACTS
---------------------------------------------
Manifest:  /work/resnet20v2_cifar_onnx.tmodel
Constants: /work/resnet20v2_cifar_onnx.tdata
Program:   /work/resnet20v2_cifar_onnx.tprog
---------------------------------------------

Next Steps

Congrats! You’ve compiled your model and generated three important artifacts: a .tmodel, a .tdata, and a .tprog file. All three are needed to run your compiled model, so keep them handy. Assuming you have an accelerator built, you’re now ready to run your model. If not, it’s time to generate an accelerator.
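
If you’re curious, the .tmodel manifest is plain-text JSON, so you can inspect it directly. A minimal sketch (the exact fields depend on the compiler version, so we only look at the top-level keys):

import json

with open('resnet20v2_cifar_onnx.tmodel') as f:
    manifest = json.load(f)

# Print the top-level structure; exact fields depend on the compiler version.
print(list(manifest.keys()))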

Troubleshooting

If you got an error or saw something you didn’t expect, please let us know! You can either join our Discord to ask a question, open an issue on Github or email us at support@tensil.ai.

Converting to ONNX didn’t work?

If you’re using TensorFlow and the ONNX converter failed, don’t despair! We also support compiling from a frozen graph in PB format. To freeze a TensorFlow model, use the freeze_graph tool from the TensorFlow repository.

If you have TensorFlow installed, you can use it in a script:

from tensorflow.python.tools.freeze_graph import freeze_graph

graph_def = "some_graph_def.pb"
ckpt = "model.ckpt-1234567"
output_graph = "frozen_graph.pb"
output_nodes = ["softmax"]
input_binary = graph_def.split(".")[-1] == "pb"

freeze_graph(
    graph_def,                 # input graph
    "",                        # input saver (unused)
    input_binary,
    ckpt,                      # input checkpoint
    ",".join(output_nodes),    # output node names
    "save/restore_all",        # restore op name
    "save/Const:0",            # filename tensor name
    output_graph,              # path for the frozen graph
    True,                      # clear devices
    "",                        # initializer nodes (none)
)

or you can use it directly from the command line by running

python -m tensorflow.python.tools.freeze_graph \
 --input_graph=some_graph_def.pb --input_binary \
 --input_checkpoint=model.ckpt-1234567 \
 --output_graph=frozen_graph.pb --output_node_names=softmax
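
As a quick sanity check that freezing worked, you can verify that the frozen graph contains the output node you asked for. A small sketch, assuming TensorFlow 2.x (using the compat.v1 GraphDef API) and the file and node names from the example above:

import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

node_names = [n.name for n in graph_def.node]
print("softmax" in node_names)  # should print True if freezing succeeded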

2.2 - Generate an accelerator

How to generate an accelerator with a given architecture

Things you’ll need

  • an architecture file in .tarch format. If you don’t know what this is yet, continue on and we’ll supply one for you.
  • an AXI data width in bits (check your FPGA product page)
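
If you don’t yet have an architecture file of your own, the demo file shipped in the Tensil Docker image at /demo/arch/ultra96v2.tarch looks like this (each parameter is explained in detail in the tutorials below):

{
    "data_type": "FP16BP8",
    "array_size": 16,
    "dram0_depth": 2097152,
    "dram1_depth": 2097152,
    "local_depth": 20480,
    "accumulator_depth": 4096,
    "simd_registers_depth": 1,
    "stride0_depth": 8,
    "stride1_depth": 8
}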

1. Run the Tensil RTL generator

First, ensure you have Tensil installed by pulling and running the Tensil Docker container:

$ docker pull tensilai/tensil:latest
$ docker run -v $(pwd):/work -w /work -it tensilai/tensil:latest bash

Then from the container shell, run:

$ tensil rtl -a <tarch_file> -d <axi_port_width>

To generate RTL using the example architecture file, the command is

$ tensil rtl -a /demo/arch/ultra96v2.tarch -d 128

You should see some output like this:

$ tensil rtl -a /demo/arch/ultra96v2.tarch -d 128
Elaborating design...
Done elaborating.
-------------------------------------------------------
ARTIFACTS
-------------------------------------------------------
Verilog bram_dp_256x4096:   /work/bram_dp_256x4096.v
Verilog bram_dp_256x20480:  /work/bram_dp_256x20480.v
Verilog top_ultra96v2:      /work/top_ultra96v2.v
Driver parameters C header: /work/architecture_params.h
-------------------------------------------------------

Next Steps

You’ve generated several RTL artifacts (the files ending in .v) - now it’s time to integrate them into your system.

Troubleshooting

I can’t figure out what AXI width to use

Here’s a table with some known values:

FPGA family        AXI data width   Tensil flag
Zynq-7000          64 bits          -d 64
Zynq UltraScale+   128 bits         -d 128

If your FPGA family isn’t listed and you need help, ask a question on Discord or email us at support@tensil.ai.

2.3 - Integrate the Tensil RTL

How to integrate the generated Tensil RTL into your system

Things you’ll need

  • an FPGA board (e.g. the Ultra96-V2)
  • an EDA tool that can target your FPGA (e.g. if you purchased an Ultra96-V2, it should have come with a free license to Xilinx Vivado)
  • the set of RTL (*.v) files that were emitted by the RTL generator. If you don’t have those, see how to generate RTL

This guide will assume you are using the Xilinx Vivado block design interface, but the methodology should be broadly the same for any EDA tool.

1. Instantiate the IP block

Create a new project, choose the appropriate board constraints file and add a block design. Instantiate the host processor: in the case of the Ultra96-V2, this will be the Zynq UltraScale+ processing system. Be sure to run any block automation required.

Move the generated RTL files into your project sources. In Vivado this can be achieved by hitting Add sources and selecting the files. Make sure to add all generated files. If you generated them using the guide, the files will be called top_ultra96v2.v, bram_dp_256x20480.v and bram_dp_256x4096.v.

Then, drag and drop the Top block (named top_<arch>.v, e.g. top_ultra96v2.v) into the block design. We’ll refer to this block as the top block from here on.

2. Connect the AXI interfaces

There are three AXI interfaces needed for basic operation, one for receiving instructions and two for interacting with host memory.

The instruction interface is an AXI stream slave that needs to be driven by the host processor. The easiest way to achieve this is to instantiate an AXI DMA block with its read (MM2S) channel enabled. Connect the DMA’s AXI stream master (M_AXIS_MM2S) to the instruction interface on the top block. You may need to use an AXI data width converter to ensure the widths match.

Next, connect the memory interfaces. The host processor should have AXI slave ports that provide access to host memory, although these may need to be enabled in the configuration settings. For Ultra96-V2, go to the PL Interfaces section and enable S_AXI_HP0_FPD and S_AXI_HP2_FPD. On the top block, connect m_axi_dram0 -> S_AXI_HP0_FPD and connect m_axi_dram1 -> S_AXI_HP2_FPD.

3. Generate bitstream

The block design should now be complete. See below for an example of what a complete design looks like (you can ignore the sample and status interfaces: they are for performance testing and debugging respectively).

Save your design and then create an HDL wrapper if necessary. Finally, start the implementation by hitting “Generate bitstream”. This may take around 10 minutes. If all goes well, you should end up with a .bit file, which is the bitstream itself, and possibly a hardware hand-off file with an extension like .hwh. In Vivado, the bitstream can be found at <project_name>.runs/impl_1/design_1_wrapper.bit and the hardware handoff file at <project_name>.srcs/sources_1/bd/design_1/hw_handoff/design_1.hwh.

Next Steps

Now that you have a hardware implementation, you are ready to run your compiled ML model.

Troubleshooting

How to integrate the RTL block will vary from system to system, and there are many quirks and gotchas that could get in the way. If you get stuck, don’t despair! We’re here to help: ask a question on Discord or email us at support@tensil.ai.

2.4 - Run a compiled model

How to run your compiled model on a system with a Tensil accelerator

Things you’ll need

  • an FPGA board (e.g. the Ultra96-V2)
  • a compiled model (e.g. the set of three files: resnet20v2_cifar_onnx.tmodel, resnet20v2_cifar_onnx.tdata, resnet20v2_cifar_onnx.tprog)
  • a fully implemented bitstream (.bit) and a hardware handoff file (.hwh): if you don’t have these, learn how to integrate the RTL

In this guide we’ll assume you are using the PYNQ execution environment, but we also support bare metal execution with our embedded C driver.

1. Move files onto the FPGA

With PYNQ, you can achieve this by running

$ scp <my_model>.t* xilinx@192.168.2.99:~/

and then doing the same for the .bit and .hwh files. For example:

$ scp resnet20v2_cifar_onnx.t* xilinx@192.168.2.99:~/
$ scp design_1_wrapper.bit xilinx@192.168.2.99:~/ultra96-tcu.bit
$ scp design_1.hwh xilinx@192.168.2.99:~/ultra96-tcu.hwh

Note that with PYNQ, the .bit and .hwh files must have the same name up to the extension.

2. Copy the Python driver onto the FPGA

If you haven’t already cloned the repository, get the Tensil source code from Github, e.g.

curl -L https://github.com/tensil-ai/tensil/archive/refs/tags/v1.0.0.tar.gz | tar xvz

Now copy the Python driver over:

$ scp -r tensil-1.0.0/drivers/tcu_pynq xilinx@192.168.2.99:~/

3. Execute

Now it’s time to hand everything over to the driver and tell it to execute the model. This guide covers only the bare necessities; see the tutorials below for a more complete example.

Import the Tensil driver

from pynq import Overlay
import sys
sys.path.append('/home/xilinx')
from tcu_pynq.driver import Driver
from tcu_pynq.architecture import ultra96

Flash the bitstream onto the FPGA

bitstream = '/home/xilinx/ultra96-tcu.bit'
overlay = Overlay(bitstream)
tcu = Driver(ultra96, overlay.axi_dma_0)

Load the compiled model

resnet = '/home/xilinx/resnet20v2_cifar_onnx.tmodel'
tcu.load_model(resnet)

Run

Pass your input data to the driver in the form of a dictionary. You can see which inputs the driver expects by printing tcu.model.inputs.

img = ...
inputs = {'x:0': img}
outputs = tcu.run(inputs)

If all went well, outputs should contain the results of running your model.
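
As a concrete illustration for the demo ResNet-20 CIFAR model compiled earlier, input preparation and output handling might look roughly like the sketch below. The input name x:0, output name Identity:0, and the channel padding apply to that ONNX demo model; your own model will differ (the tutorials cover preprocessing in full).

import numpy as np

# Stand-in for a real 32x32 RGB CIFAR image, already normalized.
img = np.random.rand(32, 32, 3).astype('float32')

# Pad the channel dimension up to the architecture's array size and flatten
# into (pixels, array_size) vectors, as the driver expects.
img = np.pad(img, [(0, 0), (0, 0), (0, tcu.arch.array_size - 3)],
             'constant', constant_values=0)
img = img.reshape((-1, tcu.arch.array_size))

outputs = tcu.run({'x:0': img})
classes = outputs['Identity:0'][:10]  # first 10 scalars are the CIFAR class scores
print('Predicted class index:', np.argmax(classes))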

Next Steps

You’ve successfully run your compiled model on Tensil’s accelerator implemented on your FPGA. You’re ready to use this capability in your application. Reach out to us if you need help taking it from here.

Troubleshooting

As always, if you run into trouble please ask a question on Discord or email us at support@tensil.ai.

3 - Tutorials

Complete worked examples to help you learn about Tensil

3.1 - Learn how to combine Tensil and TF-Lite to run YOLO on Ultra96

In this tutorial you’ll learn how to use Tensil in combination with TF-Lite to run the YOLO v4 Tiny ML model on the Ultra96 development board

Originally posted here.

Introduction

This tutorial will use the Avnet Ultra96 V2 development board and the Tensil open-source inference accelerator to show how to run YOLO v4 Tiny, the state-of-the-art ML model for object detection, on an FPGA. The YOLO model contains some operations that Tensil does not support. These operations are in the final stage of processing and are not compute-intensive, so we will use TensorFlow Lite (TF-Lite) to run them on the CPU. We will use the PYNQ framework to receive real-time video from a USB webcam and show detected objects on a screen connected to Display Port. This tutorial refers to the previous Ultra96 tutorial for step-by-step instructions on generating Tensil RTL and getting Xilinx Vivado to synthesize the bitstream.

If you get stuck or find an error, you can ask a question on our Discord or send an email to support@tensil.ai.

[Image: real-time object detection output]

Overview

Before we start, let’s get a bird’s eye view of what we want to accomplish. We’ll follow these steps:

  1. Generate and synthesize Tensil RTL
  2. Compile YOLO v4 Tiny model for Tensil
  3. Prepare PYNQ and TF-Lite
  4. Execute with PYNQ

1. Generate and synthesize Tensil RTL

Back to top

In the first step, we’ll be getting Tensil tools to generate the RTL code and then using Xilinx Vivado to synthesize the bitstream for the Ultra96 board. Since this process is identical to other Ultra96 tutorials, we refer you to sections 1 through 4 in the ResNet20 tutorial.

Alternatively, you can skip this step and download the ready-made bitstream; instructions for doing so are included in the next section.

2. Compile YOLO v4 Tiny model for Tensil

Back to top

Now, we need to compile the ML model to a Tensil binary consisting of TCU instructions executed by the TCU hardware directly. The YOLO v4 Tiny model is included in two resolutions, 192 and 416, in the Tensil docker image at /demo/models/yolov4_tiny_192.onnx and /demo/models/yolov4_tiny_416.onnx. The higher resolution will detect smaller objects using more computation and thus have fewer frames per second. Note that below we will be using 192 resolution, but simply replacing it with 416 should work as well.

As we mentioned in the introduction, we will be using the TF-Lite framework to run the postprocessing of YOLO v4 Tiny. Specifically, this postprocessing includes Sigmoid and Exp operations not supported by the Tensil hardware. (We plan to implement them using table lookup based on Taylor expansion.) This means that for Tensil we need to compile the model up to and including the last convolution layers; the layers below them will be handled by the TF-Lite model. To identify the output nodes for the Tensil compiler, take a look at the model in Netron.

[Image: YOLO v4 Tiny output heads viewed in Netron]

The last two convolution operations have outputs named model/conv2d_17/BiasAdd:0 and model/conv2d_20/BiasAdd:0.
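
If you prefer to find these names programmatically rather than in Netron, here is a rough sketch using the onnx Python package (assuming it is installed, and that the converted graph represents these layers as ONNX Conv nodes):

import onnx

model = onnx.load('/demo/models/yolov4_tiny_192.onnx')
conv_outputs = [out for node in model.graph.node
                if node.op_type == 'Conv'
                for out in node.output]
# Look for the two BiasAdd outputs mentioned above in this list.
print(conv_outputs)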

From within the Tensil docker container, run the following command.

tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/yolov4_tiny_192.onnx -o "model/conv2d_17/BiasAdd:0,model/conv2d_20/BiasAdd:0" -s true

The resulting compiled files are listed in the ARTIFACTS table. The manifest (tmodel) is a plain text JSON description of the compiled model. The Tensil program (tprog) and weights data (tdata) are both binaries to be used by the TCU during execution. The Tensil compiler also prints a COMPILER SUMMARY table with interesting stats for both the TCU architecture and the model.

---------------------------------------------------------------------------------------------
COMPILER SUMMARY
---------------------------------------------------------------------------------------------
Model:                                           yolov4_tiny_192_onnx_ultra96v2 
Data type:                                       FP16BP8                        
Array size:                                      16                             
Consts memory size (vectors/scalars/bits):       2,097,152                      33,554,432 21
Vars memory size (vectors/scalars/bits):         2,097,152                      33,554,432 21
Local memory size (vectors/scalars/bits):        20,480                         327,680    15
Accumulator memory size (vectors/scalars/bits):  4,096                          65,536     12
Stride #0 size (bits):                           3                              
Stride #1 size (bits):                           3                              
Operand #0 size (bits):                          24                             
Operand #1 size (bits):                          24                             
Operand #2 size (bits):                          16                             
Instruction size (bytes):                        9                              
Consts memory maximum usage (vectors/scalars):   378,669                        6,058,704  
Vars memory maximum usage (vectors/scalars):     55,296                         884,736    
Consts memory aggregate usage (vectors/scalars): 378,669                        6,058,704  
Vars memory aggregate usage (vectors/scalars):   130,464                        2,087,424  
Number of layers:                                25                             
Total number of instructions:                    691,681                        
Compilation time (seconds):                      92.225                         
True consts scalar size:                         6,054,190                      
Consts utilization (%):                          98.706                         
True MACs (M):                                   670.349                        
MAC efficiency (%):                              0.000                          
---------------------------------------------------------------------------------------------

3. Prepare PYNQ and TF-Lite

Back to top

Now it’s time to put everything together on our development board. For this, we first need to set up the PYNQ environment. This process starts with downloading the SD card image for our development board. Detailed instructions for setting up board connectivity are on the PYNQ documentation website. You should be able to open Jupyter notebooks and run some examples. Note that you’ll need wireless internet connectivity for your Ultra96 board in order to run some of the commands in this section.

There is one caveat that needs addressing once PYNQ is installed. On the default PYNQ image, the setting for the Linux kernel CMA (Contiguous Memory Allocator) area size is 128MB. Given our Tensil architecture, the default CMA size is too small. To address this, you’ll need to download our patched kernel, copy it to /boot, and reboot your board. Note that the patched kernel is built for PYNQ 2.7 and will not work with other versions. To patch the kernel, run these commands on the development board:

wget https://s3.us-west-1.amazonaws.com/downloads.tensil.ai/pynq/2.7/ultra96v2/image.ub
sudo cp /boot/image.ub /boot/image.ub.backup
sudo cp image.ub /boot/
rm image.ub
sudo reboot
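
After the board comes back up, you can sanity-check the CMA size from Python (or a notebook cell); the CmaTotal entry only appears on kernels built with CMA enabled:

# Run on the board after rebooting to confirm the new CMA area size.
with open('/proc/meminfo') as f:
    for line in f:
        if line.startswith('Cma'):
            print(line.strip())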

Now that PYNQ is up and running, the next step is to scp the Tensil driver for PYNQ. Start by cloning the Tensil GitHub repository to your workstation and then copy drivers/tcu_pynq to /home/xilinx/tcu_pynq on your board.

git clone git@github.com:tensil-ai/tensil.git
scp -r tensil/drivers/tcu_pynq xilinx@192.168.3.1:

Next, we’ll download the bitstream created for the Ultra96 architecture definition we used with the compiler. The bitstream contains the FPGA configuration resulting from Vivado synthesis and implementation. PYNQ also needs a hardware handoff file that describes FPGA components accessible to the host, such as DMA. Download and un-tar both files into /home/xilinx by running these commands on the development board.

wget https://s3.us-west-1.amazonaws.com/downloads.tensil.ai/hardware/1.0.4/tensil_ultra96v2.tar.gz
tar -xvf tensil_ultra96v2.tar.gz

If you’d like to explore using Tensil RTL tool and Xilinx Vivado to synthesize the bitstream yourself, we refer you to sections 1 through 4 in the ResNet20 tutorial. Section 6 in the same tutorial includes instructions for copying the bitstream and hardware handoff file from Vivado project onto your board.

Now, copy the .tmodel, .tprog and .tdata artifacts produced by the compiler on your work station to /home/xilinx on the board.

scp yolov4_tiny_192_onnx_ultra96v2.t* xilinx@192.168.3.1:

Next, we need to set up TF-Lite. We prepared a TF-Lite build compatible with the Ultra96 board. Run the following commands on the development board to download and install it.

wget https://s3.us-west-1.amazonaws.com/downloads.tensil.ai/tflite_runtime-2.8.0-cp38-cp38-linux_aarch64.whl
sudo pip install tflite_runtime-2.8.0-cp38-cp38-linux_aarch64.whl

Finally, we will need the TF-Lite model to run the postprocessing in YOLO v4 Tiny. We prepared this model for you as well. We’ll also need text labels for the COCO dataset used for training the YOLO model. Download these files into /home/xilinx by running these commands on the development board.

wget https://github.com/tensil-ai/tensil-models/raw/main/yolov4_tiny_192_post.tflite
wget https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-2014_2017.txt

4. Execute with PYNQ

Now, we will be tying everything together in a PYNQ Jupyter notebook. Let’s take a closer look at our processing pipeline.

  • Capture the frame image from the webcam;
  • Adjust the image size, color scheme, floating-point channel representation, and Tensil vector alignment to match YOLO v4 Tiny input;
  • Run it through Tensil to get the results of the two final convolution layers;
  • Subsequently run these results through the TF-Lite interpreter to get the model output for bounding boxes and classification scores;
  • Filter bounding boxes based on the score threshold and suppress overlapping boxes for the same detected object;
  • Use the frame originally captured from the camera to plot bounding boxes, class names, scores (red), the current value for frames per second (green), and the detection area (blue);
  • Send this annotated frame to Display Port to show on the screen.

At the beginning of the notebook, we define global parameters: frame dimensions for both camera and screen and YOLO v4 Tiny resolution we will be using.

model_hw = 192
frame_w = 1280
frame_h = 720

Next, we import the Tensil PYNQ driver and other required utilities.

import sys
sys.path.append('/home/xilinx/')

import time
import math
import numpy as np
import tflite_runtime.interpreter as tflite
import cv2
import matplotlib.pyplot as plt
import pynq

from pynq import Overlay
from pynq.lib.video import *

from tcu_pynq.driver import Driver
from tcu_pynq.util import div_ceil
from tcu_pynq.architecture import ultra96

Now, initialize the PYNQ overlay from the bitstream and instantiate the Tensil driver using the TCU architecture and the overlay’s DMA configuration. Note that we are passing the axi_dma_0 object from the overlay; the name matches the DMA block in the Vivado design.

overlay = Overlay('/home/xilinx/tensil_ultra96v2.bit')
tcu = Driver(ultra96, overlay.axi_dma_0)

Next, we need to initialize the capture from the webcam using the OpenCV library.

cap = cv2.VideoCapture(0)

cap.set(cv2.CAP_PROP_FRAME_WIDTH, frame_w)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, frame_h)

And initialize the Display Port.

displayport = DisplayPort()
displayport.configure(VideoMode(frame_w, frame_h, 24), PIXEL_RGB)

If you are connecting the board to an HDMI screen, make sure to use an active DP-to-HDMI cable.

Next, load the tmodel manifest for the model into the driver. The manifest tells the driver where to find the other two binary files (program and weights data).

tcu.load_model('/home/xilinx/yolov4_tiny_{0}_onnx_ultra96v2.tmodel'.format(model_hw))

Then, instantiate the TF-Lite interpreter based on the YOLO postprocessing model.

interpreter = tflite.Interpreter(model_path='/home/xilinx/yolov4_tiny_{0}_post.tflite'.format(model_hw))
interpreter.allocate_tensors()

Now we load the COCO labels and define several utility functions.

with open('/home/xilinx/coco-labels-2014_2017.txt') as f:
    labels_coco = f.read().split('\n')
    
def set_tensor(driver, interpreter, hw_size, data):
    # Find the TF-Lite input whose spatial size matches this YOLO head.
    input_details = interpreter.get_input_details()
    input_idxs = [i for i in range(len(input_details))
                  if input_details[i]['shape'][1] == hw_size and input_details[i]['shape'][2] == hw_size]
    inp = input_details[input_idxs[0]]
    data = data.astype(inp['dtype'])
    # Tensil pads the innermost dimension up to a multiple of the array size;
    # strip that padding before handing the tensor to TF-Lite.
    inner_dim = inp['shape'][-1]
    inner_size = div_ceil(inner_dim, driver.arch.array_size) * driver.arch.array_size
    if inner_size != inner_dim:
        data = data.reshape((-1, inner_size))[:, :inner_dim]
    data = data.reshape(inp['shape'])
    interpreter.set_tensor(inp['index'], data)
    
def filter_and_reshape(boxes, scores, score_threshold=0.4):
    scores_max = np.max(scores, axis=-1)
    mask = scores_max > score_threshold
    
    filtered_boxes = boxes[mask]
    filtered_scores = scores[mask]
    
    filtered_boxes = np.reshape(filtered_boxes, [scores.shape[0], -1, filtered_boxes.shape[-1]])    
    filtered_scores = np.reshape(filtered_scores, [scores.shape[0], -1, filtered_scores.shape[-1]])

    return filtered_boxes, filtered_scores


def non_maximum_suppression(boxes, iou_threshold=0.4):
    if len(boxes) == 0:
        return boxes
    
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    ll_x = np.maximum.outer(boxes[:, 0], boxes[:, 0])
    ll_y = np.maximum.outer(boxes[:, 1], boxes[:, 1])
    ur_x = np.minimum.outer(boxes[:, 2], boxes[:, 2])
    ur_y = np.minimum.outer(boxes[:, 3], boxes[:, 3])
    intersection_x = np.maximum(0, ur_x - ll_x)
    intersection_y = np.maximum(0, ur_y - ll_y)
    intersection = intersection_x * intersection_y
    
    iou = intersection / area - np.identity(area.shape[-1])
    p = iou >= iou_threshold
    p = p & p.T
    n =  p.shape[-1]
    
    no_needs_merge = set()
    for i in range(n):
        if not p[i].any():
            no_needs_merge.add(i)
    
    needs_merge = set()
    for i in range(n):
        for j in range(n):
            if p[i, j]:
                needs_merge.add(tuple(sorted((i, j))))

    def merge(needs_merge):
        result = set()
        discarded = set()
        for indices in needs_merge:
            idx = indices[0]
            if idx not in discarded:
                result.add(indices[0])
            discarded.add(indices[1])
            if indices[1] in result:
                result.remove(indices[1])
        return result

    return sorted(list(no_needs_merge) + list(merge(needs_merge)))
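
As a quick sanity check of the helper above, here is what it returns for three made-up boxes in (x0, y0, x1, y1) form: the two heavily overlapping boxes collapse into one, while the distant box is kept.

# Toy example (not part of the pipeline): two overlapping boxes and one far away.
example_boxes = np.array([
    [0.0, 0.0, 10.0, 10.0],
    [1.0, 1.0, 11.0, 11.0],    # overlaps the first box
    [50.0, 50.0, 60.0, 60.0],  # far from the others
])
print(non_maximum_suppression(example_boxes))  # [0, 2] -- box 1 is suppressed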

Finally, we tie the pipeline together in a loop to process a fixed number of frames. (You may replace it with while True: to run the pipeline indefinitely.)

for _ in range(600):
    start = time.time()
    
    cap_frame = displayport.newframe()
    cap.read(cap_frame)
    
    crop_h = int(max(0, (frame_h - frame_w) / 2))
    crop_w = int(max(0, (frame_w - frame_h) / 2))
    ratio_h = (frame_h - crop_h * 2)/model_hw
    ratio_w = (frame_w - crop_w * 2)/model_hw

    x_frame = cap_frame    
    x_frame=x_frame[crop_h:frame_h - crop_h, crop_w:frame_w - crop_w]
    x_frame=cv2.resize(x_frame, (model_hw, model_hw), interpolation=cv2.INTER_LINEAR)
    x_frame=cv2.cvtColor(x_frame, cv2.COLOR_BGR2RGB)    
    x_frame = x_frame.astype('float32') / 255
    x_frame = np.pad(x_frame, [(0, 0), (0, 0), (0, tcu.arch.array_size - 3)], 'constant', constant_values=0)
    
    inputs = {'x:0': x_frame}    
    outputs = tcu.run(inputs)
    
    set_tensor(tcu, interpreter, model_hw / 32, np.array(outputs['model/conv2d_17/BiasAdd:0']))
    set_tensor(tcu, interpreter, model_hw / 16, np.array(outputs['model/conv2d_20/BiasAdd:0']))

    interpreter.invoke()

    output_details = interpreter.get_output_details()
    scores, boxes_xywh = [interpreter.get_tensor(output_details[i]['index']) for i in range(len(output_details))]

    boxes_xywh, scores = filter_and_reshape(boxes_xywh, scores)
    
    boxes_xy, boxes_wh = np.split(boxes_xywh, (2,), axis=-1)
    boxes_x0y0x1y1 = np.concatenate([boxes_xy - boxes_wh/2, boxes_xy + boxes_wh/2], axis=-1)
    
    box_indices = non_maximum_suppression(boxes_x0y0x1y1[0])

    latency = (time.time() - start)
    fps = 1/latency
    
    for i in box_indices:
        category_idx = np.argmax(scores, axis=-1)[0, i]
        category_conf = np.max(scores, axis=-1)[0, i]
        text = f'{labels_coco[category_idx]} = {category_conf:.2}'
        
        box_x0y0x1y1 = boxes_x0y0x1y1[0, i]        
        box_x0y0x1y1[0] *= ratio_w
        box_x0y0x1y1[1] *= ratio_h
        box_x0y0x1y1[2] *= ratio_w
        box_x0y0x1y1[3] *= ratio_h
        box_x0y0x1y1[0] += crop_w
        box_x0y0x1y1[1] += crop_h
        box_x0y0x1y1[2] += crop_w
        box_x0y0x1y1[3] += crop_h
        box_x0y0x1y1 = box_x0y0x1y1.astype('int')
        
        cap_frame = cv2.rectangle(cap_frame, (crop_w, crop_h), (frame_w - crop_w, frame_h - crop_h), (255, 0, 0), 1)
        cap_frame = cv2.rectangle(cap_frame, (box_x0y0x1y1[0], box_x0y0x1y1[1]), (box_x0y0x1y1[2], box_x0y0x1y1[3]), (0, 0, 255), 1)
        cap_frame = cv2.putText(cap_frame, text, (box_x0y0x1y1[0] + 2, box_x0y0x1y1[1] - 2), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255))
            
    
    cap_frame = cv2.putText(cap_frame, f'{fps:.2}fps', (2, frame_h - 2), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0))
    displayport.writeframe(cap_frame)  

After running the pipeline, we clean up the camera capture and Display Port resources.

displayport.close()
cap.release()
tcu.close()

Congratulations! You ran a state-of-the-art object detection ML model on a custom accelerator hooked to a webcam and a screen for real-time object detection! Just imagine the things you could do with it…

Wrap-up

Back to top

In this tutorial we used Tensil to show how to run the YOLO v4 Tiny ML model on an FPGA with a postprocessing step handled by TF-Lite. We showed how to analyze the model to determine the layers at which to split the processing between TF-Lite and Tensil. We also gave a step-by-step explanation of how to build a real-time video processing pipeline using PYNQ.

If you made it all the way through, big congrats! You’re ready to take things to the next level by trying out your own model and architecture. Join us on Discord to say hello and ask questions, or send an email to support@tensil.ai.

3.2 - Learn Tensil with ResNet and PYNQ Z1

In this tutorial you’ll learn the concepts behind Tensil through a worked example using the PYNQ Z1 development board

Originally posted here.

Introduction

This tutorial will use the PYNQ Z1 development board and Tensil’s open-source inference accelerator to show how to run machine learning (ML) models on FPGA. We will be using ResNet-20 trained on the CIFAR dataset. These steps should work for any supported ML model – currently all the common state-of-the-art convolutional neural networks are supported. Try it with your model!

We’ll give detailed end-to-end coverage that is easy to follow. In addition, we include in-depth explanations to get a good understanding of the technology behind it all, including the Tensil and Xilinx Vivado toolchains and PYNQ framework.

If you get stuck or find an error, you can ask a question on our Discord or send an email to support@tensil.ai.

[Image: PYNQ Z1 development board]

Overview

Before we start, let’s look at the Tensil toolchain flow to get a bird’s eye view of what we want to accomplish. We’ll follow these steps:

  1. Get Tensil
  2. Choose architecture
  3. Generate TCU accelerator design (RTL code)
  4. Synthesize for PYNQ Z1
  5. Compile ML model for TCU
  6. Execute using PYNQ

[Image: Tensil toolchain flow]

1. Get Tensil

Back to top

First, we need to get the Tensil toolchain. The easiest way is to pull the Tensil docker container from Docker Hub. The following command will pull the image and then run the container.

docker pull tensilai/tensil
docker run -v $(pwd):/work -w /work -it tensilai/tensil bash

2. Choose architecture

Back to top

Tensil’s strength is customizability, making it suitable for a very wide range of applications. The Tensil architecture definition file (.tarch) specifies the parameters of the architecture to be implemented. These parameters are what make Tensil flexible enough to work for small embedded FPGAs as well as large data-center FPGAs. Our example will select parameters that provide the highest utilization of resources on the XC7Z020 FPGA part at the core of the PYNQ Z1 board. The container image conveniently includes the architecture file for the PYNQ Z1 development board at /demo/arch/pynqz1.tarch. Let’s take a look at what’s inside.

{
    "data_type": "FP16BP8",
    "array_size": 8,
    "dram0_depth": 1048576,
    "dram1_depth": 1048576,
    "local_depth": 8192,
    "accumulator_depth": 2048,
    "simd_registers_depth": 1,
    "stride0_depth": 8,
    "stride1_depth": 8
}

The file contains a JSON object with several parameters. The first, data_type, defines the data type used throughout the Tensor Compute Unit (TCU), including in the systolic array, SIMD ALUs, accumulators, and local memory. We will use 16-bit fixed-point with an 8-bit base point (FP16BP8), which in most cases allows simple rounding of 32-bit floating-point models without the need for quantization. Next, array_size defines a systolic array size of 8x8, which results in 64 parallel multiply-accumulate (MAC) units. This number was chosen to balance the utilization of DSP units available on XC7Z020 in case you needed to use some DSPs for another application in parallel, but you could increase it for higher performance of the TCU.
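
To build intuition for what FP16BP8 means, here is a toy round-trip in Python. This is only an illustration of 16-bit fixed point with an 8-bit base point (i.e. 8 fractional bits), not the driver’s actual conversion code.

import numpy as np

def to_fp16bp8(x):
    # Scale by 2**8, round, and store in a 16-bit integer.
    return np.int16(np.round(x * 256))

def from_fp16bp8(q):
    return q / 256.0

w = 0.7231                 # a 32-bit floating-point weight
q = to_fp16bp8(w)
print(q, from_fp16bp8(q))  # 185 0.72265625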

With dram0_depth and dram1_depth, we define the size of DRAM0 and DRAM1 memory buffers on the host side. These buffers feed the TCU with the model’s weights and inputs, and also store intermediate results and outputs. Note that these memory sizes are in number of vectors, which means array size (8) multiplied by data type size (16-bits) for a total of 128 bits per vector.

Next, we define the size of the local and accumulator memories which will be implemented on the FPGA fabric itself. The difference between the accumulators and the local memory is that accumulators can perform a write-accumulate operation in which the input is added to the data already stored, as opposed to simply overwriting it. The total size of accumulators plus local memory is again selected to balance the utilization of BRAM resources on XC7Z020 in case resources are needed elsewhere.

With simd_registers_depth, we specify the number of registers included in each SIMD ALU, which can perform SIMD operations on stored vectors used for ML operations like ReLU activation. Increasing this number is only needed rarely, to help compute special activation functions. Finally, stride0_depth and stride1_depth specify the number of bits to use for enabling “strided” memory reads and writes. It’s unlikely you’ll ever need to change this parameter.
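
To see how these parameters translate into memory sizes, here is a small sketch that derives scalar counts and address widths from the pynqz1.tarch values. The vectors/scalars columns should line up with the RTL SUMMARY table printed in the next step (the last column of that table appears to be the address width in bits).

import math

arch = {
    "data_type_bits": 16,  # FP16BP8
    "array_size": 8,
    "dram0_depth": 1048576,
    "dram1_depth": 1048576,
    "local_depth": 8192,
    "accumulator_depth": 2048,
}

vector_bits = arch["array_size"] * arch["data_type_bits"]  # 128 bits per vector

for name in ("dram0_depth", "dram1_depth", "local_depth", "accumulator_depth"):
    depth = arch[name]
    print(name, depth, "vectors =", depth * arch["array_size"], "scalars,",
          int(math.log2(depth)), "address bits")

# On-fabric memory (local + accumulators), in bits of BRAM:
print((arch["local_depth"] + arch["accumulator_depth"]) * vector_bits)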

3. Generate TCU accelerator design (RTL code)

Back to top

Now that we’ve selected our architecture, it’s time to run the Tensil RTL generator. RTL stands for “Register Transfer Level” – it’s a type of code that describes digital logic in terms of wires, registers, and low-level operations. Special tools like Xilinx Vivado or yosys can synthesize RTL for FPGAs and even ASICs.

To generate a design using our chosen architecture, run the following command inside the Tensil toolchain docker container:

tensil rtl -a /demo/arch/pynqz1.tarch -s true

This command will produce several Verilog files listed in the ARTIFACTS table printed out at the end. It also prints the RTL SUMMARY table with some of the essential parameters of the resulting RTL.

----------------------------------------------------------------------
RTL SUMMARY
----------------------------------------------------------------------
Data type:                                      FP16BP8   
Array size:                                     8         
Consts memory size (vectors/scalars/bits):      1,048,576 8,388,608 20
Vars memory size (vectors/scalars/bits):        1,048,576 8,388,608 20
Local memory size (vectors/scalars/bits):       8,192     65,536    13
Accumulator memory size (vectors/scalars/bits): 2,048     16,384    11
Stride #0 size (bits):                          3         
Stride #1 size (bits):                          3         
Operand #0 size (bits):                         16        
Operand #1 size (bits):                         24        
Operand #2 size (bits):                         16        
Instruction size (bytes):                       8         
----------------------------------------------------------------------

4. Synthesize for PYNQ Z1

Back to top

It is now time to start Xilinx Vivado. I will be using version 2021.2, which you can download free of charge (for prototyping) at the Xilinx website.

Before you create a new Vivado project, you will need to download the PYNQ Z1 board definition files. Unpack and place them in /tools/Xilinx/Vivado/2021.2/data/boards/board_files/. (Note that this path includes the Vivado version.) Once unpacked, you’ll need to add the board files path in Tools -> Settings -> Board Repository.


First, create a new RTL project named tensil-pynqz1 and add Verilog files generated by the Tensil RTL tool.

[Image: new RTL project with the generated Verilog sources]

Choose boards and search for PYNQ. Select PYNQ-Z1 with file version 1.0.

[Image: board selection showing PYNQ-Z1]

Under IP INTEGRATOR, click Create Block Design.

[Image: Create Block Design]

Drag top_pynqz1 from the Sources tab onto the block design diagram. You should see the Tensil RTL block with its interfaces.

[Image: Tensil RTL block in the block design]

Next, click the plus + button in the Block Diagram toolbar (upper left) and select “ZYNQ7 Processing System” (you may need to use the search box). Do the same for “Processor System Reset”. The Zynq block represents the “hard” part of the Xilinx platform, which includes ARM processors, DDR interfaces, and much more. The Processor System Reset is a utility box that provides the design with correctly synchronized reset signals.

Click “Run Block Automation” and “Run Connection Automation”. Check “All Automation”.

Double-click ZYNQ7 Processing System. First, go to Clock Configuration and ensure PL Fabric Clocks have FCLK_CLK0 checked and set to 50MHz.

[Image: Zynq clock configuration]

Then, go to PS-PL Configuration. Check S AXI HP0, S AXI HP1, and S AXI HP2. These changes will configure all the interfaces between the Processing System (PS) and Programmable Logic (PL) necessary for our design.

[Image: Zynq PS-PL configuration]

Again, click the plus + button in the Block Diagram toolbar and select “AXI SmartConnect”. We’ll need 4 instances of SmartConnect. The first 3 instances (smartconnect_0 to smartconnect_2) are necessary to convert the AXI version 4 interfaces of the TCU and the instruction DMA block to AXI version 3 on the PS. smartconnect_3 is necessary to expose the DMA control registers to the Zynq CPU, which will enable software to control the DMA transactions. Double-click each one and set “Number of Slave and Master Interfaces” to 1.

[Image: AXI SmartConnect configuration]

Now, connect the m_axi_dram0 and m_axi_dram1 ports on the Tensil block to S00_AXI on smartconnect_0 and smartconnect_1 respectively. Then connect the SmartConnect M00_AXI ports to S_AXI_HP0 and S_AXI_HP2 on the Zynq block respectively. The TCU has two DRAM banks to enable their parallel operation by utilizing PS ports with dedicated connectivity to the memory.

Next, click the plus + button in the Block Diagram toolbar and select “AXI Direct Memory Access” (DMA). The DMA block is used to organize the feeding of the Tensil program to the TCU without keeping the PS ARM processor busy.

Double-click it. Disable “Scatter Gather Engine” and “Write Channel”. Change “Width of Buffer Length Register” to 26 bits. Set “Memory Map Data Width” and “Stream Data Width” to 64 bits. Change “Max Burst Size” to 256.

[Image: AXI DMA configuration]

Connect the instruction port on the Tensil top block to the M_AXIS_MM2S on the AXI DMA block. Then, connect M_AXI_MM2S on the AXI DMA block to S00_AXI on smartconnect_2 and, finally, connect smartconnect_2 M00_AXI port to S_AXI_HP1 on Zynq.

Connect M00_AXI on smartconnect_3 to S_AXI_LITE on the AXI DMA block. Connect S00_AXI on smartconnect_3 to M_AXI_GP0 on the Zynq block.

Finally, click “Run Connection Automation” and check “All Automation”. By doing this, we connect all the clocks and resets. Click the “Regenerate Layout” button in the Block Diagram toolbar to make the diagram look nice.

[Image: completed block design]

Next, switch to the “Address Editor” tab. Click the “Assign All” button in the toolbar. By doing this, we assign address spaces to various AXI interfaces. For example, the instruction DMA (axi_dma_0) and Tensil (m_axi_dram0 and m_axi_dram1) gain access to the entire address space on the PYNQ Z1 board. The PS gains access to the control registers for the instruction DMA.

[Image: Address Editor with assigned address spaces]

Back in the Block Diagram tab, click the “Validate Design” (or F6) button. You should see the message informing you of successful validation! You can now close the Block Design by clicking x in the right upper corner.

The final step is to create the HDL wrapper for our design, which will tie everything together and enable synthesis and implementation. Right-click the tensil_pynqz1 item in the Sources tab and choose “Create HDL Wrapper”. Keep “Let Vivado manage wrapper and auto-update” selected. Wait for the Sources tree to be fully updated and right-click on tensil_pynqz1_wrapper. Choose Set as Top.

Now it’s time to let Vivado perform synthesis and implementation and write the resulting bitstream. In the Flow Navigator sidebar, click on “Generate Bitstream” and hit OK. Vivado will start synthesizing our Tensil design – this may take around 15 minutes. When done, you can observe some vital stats in the Project Summary. First, look at utilization, which shows what percentage of each FPGA resource our design is using. Note how we kept BRAM and DSP utilization reasonably low.

[Image: resource utilization summary]

The second is timing, which tells us about how long it takes for signals to propagate in our programmable logic (PL). The “Worst Negative Slack” being a positive number is good news – our design meets propagation constraints for all nets at the specified clock speed!

[Image: timing summary]

5. Compile ML model for TCU

Back to top

The second branch of the Tensil toolchain flow is to compile the ML model to a Tensil binary consisting of TCU instructions, which are executed by the TCU hardware directly. For this tutorial, we will use ResNet20 trained on the CIFAR dataset. The model is included in the Tensil docker image at /demo/models/resnet20v2_cifar.onnx. From within the Tensil docker container, run the following command.

tensil compile -a /demo/arch/pynqz1.tarch -m /demo/models/resnet20v2_cifar.onnx -o "Identity:0" -s true

We’re using the ONNX version of the model, but the Tensil compiler also supports TensorFlow, which you can try by compiling the same model in TensorFlow frozen graph form at /demo/models/resnet20v2_cifar.pb.

tensil compile -a /demo/arch/pynqz1.tarch -m /demo/models/resnet20v2_cifar.pb -o "Identity" -s true

The resulting compiled files are listed in the ARTIFACTS table. The manifest (tmodel) is a plain text JSON description of the compiled model. The Tensil program (tprog) and weights data (tdata) are both binaries to be used by the TCU during execution. The Tensil compiler also prints a COMPILER SUMMARY table with interesting stats for both the TCU architecture and the model.

------------------------------------------------------------------------------------------
COMPILER SUMMARY
------------------------------------------------------------------------------------------
Model:                                           resnet20v2_cifar_onnx_pynqz1 
Data type:                                       FP16BP8                      
Array size:                                      8                            
Consts memory size (vectors/scalars/bits):       1,048,576                    8,388,608 20
Vars memory size (vectors/scalars/bits):         1,048,576                    8,388,608 20
Local memory size (vectors/scalars/bits):        8,192                        65,536    13
Accumulator memory size (vectors/scalars/bits):  2,048                        16,384    11
Stride #0 size (bits):                           3                            
Stride #1 size (bits):                           3                            
Operand #0 size (bits):                          16                           
Operand #1 size (bits):                          24                           
Operand #2 size (bits):                          16                           
Instruction size (bytes):                        8                            
Consts memory maximum usage (vectors/scalars):   71,341                       570,728   
Vars memory maximum usage (vectors/scalars):     26,624                       212,992   
Consts memory aggregate usage (vectors/scalars): 71,341                       570,728   
Vars memory aggregate usage (vectors/scalars):   91,170                       729,360   
Number of layers:                                23                           
Total number of instructions:                    258,037                      
Compilation time (seconds):                      25.487                       
True consts scalar size:                         568,466                      
Consts utilization (%):                          97.545                       
True MACs (M):                                   61.476                       
MAC efficiency (%):                              0.000                        
------------------------------------------------------------------------------------------

6. Execute using PYNQ

Back to top

Now it’s time to put everything together on our development board. For this, we first need to set up the PYNQ environment. This process starts with downloading the SD card image for our development board. Detailed instructions for setting up board connectivity are on the PYNQ documentation website. You should be able to open Jupyter notebooks and run some examples.

Now that PYNQ is up and running, the next step is to scp the Tensil driver for PYNQ. Start by cloning the Tensil GitHub repository to your workstation and then copy drivers/tcu_pynq to /home/xilinx/tcu_pynq on your board.

git clone git@github.com:tensil-ai/tensil.git
scp -r tensil/drivers/tcu_pynq xilinx@192.168.2.99:

We also need to scp the bitstream and compiler artifacts.

Next we’ll copy over the bitstream, which contains the FPGA configuration resulting from Vivado synthesis and implementation. PYNQ also needs a hardware handoff file that describes FPGA components accessible to the host, such as DMA. Place both files in /home/xilinx on the development board. Assuming you are in the Vivado project directory, run the following commands to copy files over.

scp tensil-pynqz1.runs/impl_1/tensil_pynqz1_wrapper.bit xilinx@192.168.2.99:tensil_pynqz1.bit
scp tensil-pynqz1.gen/sources_1/bd/tensil_pynqz1/hw_handoff/tensil_pynqz1.hwh xilinx@192.168.2.99:

Note that we renamed the bitstream to match the hardware handoff file name.

Now, copy the .tmodel, .tprog and .tdata artifacts produced by the compiler to /home/xilinx on the board.

scp resnet20v2_cifar_onnx_pynqz1.t* xilinx@192.168.2.99:

The last thing needed to run our ResNet model is the CIFAR dataset. You can get it from Kaggle or run the commands below (since we only need the test batch, we remove the training batches to reduce the file size). Put these files in /home/xilinx/cifar-10-batches-py/ on your development board.

wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xfvz cifar-10-python.tar.gz
rm cifar-10-batches-py/data_batch_*
scp -r cifar-10-batches-py xilinx@192.168.2.99:

We are finally ready to fire up the PYNQ Jupyter notebook and run the ResNet model on the TCU.

Jupyter notebook

First, we import the Tensil PYNQ driver and other required utilities.

import sys
sys.path.append('/home/xilinx')

# Needed to run inference on TCU
import time
import numpy as np
import pynq
from pynq import Overlay
from tcu_pynq.driver import Driver
from tcu_pynq.architecture import pynqz1

# Needed for unpacking and displaying image data
%matplotlib inline
import matplotlib.pyplot as plt
import pickle

Now, initialize the PYNQ overlay from the bitstream and instantiate the Tensil driver using the TCU architecture and the overlay’s DMA configuration. Note that we are passing the axi_dma_0 object from the overlay; the name matches the DMA block in the Vivado design.

overlay = Overlay('/home/xilinx/tensil_pynqz1.bit')
tcu = Driver(pynqz1, overlay.axi_dma_0)

The Tensil PYNQ driver includes the PYNQ Z1 architecture definition. Here it is in an excerpt from architecture.py: you can see that it matches the architecture we used previously.

pynqz1 = Architecture(
    data_type=DataType.FP16BP8,
    array_size=8,
    dram0_depth=1048576,
    dram1_depth=1048576,
    local_depth=8192,
    accumulator_depth=2048,
    simd_registers_depth=1,
    stride0_depth=8,
    stride1_depth=8,
)

Next, let’s load CIFAR images from the test_batch.

def unpickle(file):
    with open(file, 'rb') as fo:
        d = pickle.load(fo, encoding='bytes')
    return d

cifar = unpickle('/home/xilinx/cifar-10-batches-py/test_batch')
data = cifar[b'data']
labels = cifar[b'labels']

data = data[10:20]
labels = labels[10:20]

data_norm = data.astype('float32') / 255
data_mean = np.mean(data_norm, axis=0)
data_norm -= data_mean

cifar_meta = unpickle('/home/xilinx/cifar-10-batches-py/batches.meta')
label_names = [b.decode() for b in cifar_meta[b'label_names']]

def show_img(data, n):
    plt.imshow(np.transpose(data[n].reshape((3, 32, 32)), axes=[1, 2, 0]))

def get_img(data, n):
    img = np.transpose(data_norm[n].reshape((3, 32, 32)), axes=[1, 2, 0])
    img = np.pad(img, [(0, 0), (0, 0), (0, tcu.arch.array_size - 3)], 'constant', constant_values=0)
    return img.reshape((-1, tcu.arch.array_size))

def get_label(labels, label_names, n):
    label_idx = labels[n]
    name = label_names[label_idx]
    return (label_idx, name)

To test, extract one of the images.

n = 7
img = get_img(data, n)
label_idx, label = get_label(labels, label_names, n)
show_img(data, n)

You should see the image.

[Image: the extracted CIFAR test image (a horse)]

Next, load the tmodel manifest for the model into the driver. The manifest tells the driver where to find the other two binary files (program and weights data).

tcu.load_model('/home/xilinx/resnet20v2_cifar_onnx_pynqz1.tmodel')

Finally, run the model and print the results! The call to tcu.run(inputs) is where the magic happens. We’ll convert the ResNet classification result vector into CIFAR labels. Note that if you are using the ONNX model, the input and output are named x:0 and Identity:0 respectively. For the TensorFlow model they are named x and Identity.

inputs = {'x:0': img}

start = time.time()
outputs = tcu.run(inputs)
end = time.time()
print("Ran inference in {:.4}s".format(end - start))
print()

classes = outputs['Identity:0'][:10]
result_idx = np.argmax(classes)
result = label_names[result_idx]
print("Output activations:")
print(classes)
print()
print("Result: {} (idx = {})".format(result, result_idx))
print("Actual: {} (idx = {})".format(label, label_idx))

Here is the expected result:

Ran inference in 0.1513s

Output activations:
[-19.49609375 -12.37890625  -8.01953125  -6.01953125  -6.609375
  -4.921875    -7.71875      2.0859375   -9.640625    -7.85546875]

Result: horse (idx = 7)
Actual: horse (idx = 7)

Congratulations! You ran a machine learning model on a custom ML accelerator that you built on your own workstation! Just imagine the things you could do with it…

Wrap-up

Back to top

In this tutorial we used Tensil to show how to run machine learning (ML) models on FPGA. We went through a number of steps to get here, including installing Tensil, choosing an architecture, generating an RTL design, synthesizing the design, compiling the ML model and finally executing the model using PYNQ.

If you made it all the way through, big congrats! You’re ready to take things to the next level by trying out your own model and architecture. Join us on Discord to say hello and ask questions, or send an email to support@tensil.ai.

3.3 - Learn Tensil with ResNet and Ultra96

In this tutorial you’ll learn the concepts behind Tensil through a worked example using the Ultra96 development board

Originally posted here.

Introduction

This tutorial will use the Avnet Ultra96 V2 development board and Tensil’s open-source inference accelerator to show how to run machine learning (ML) models on FPGA. We will be using ResNet-20 trained on the CIFAR dataset. These steps should work for any supported ML model – currently all the common state-of-the-art convolutional neural networks are supported. Try it with your model!

We’ll give detailed end-to-end coverage that is easy to follow. In addition, we include in-depth explanations to get a good understanding of the technology behind it all, including the Tensil and Xilinx Vivado toolchains and PYNQ framework.

If you get stuck or find an error, you can ask a question on our Discord or send an email to support@tensil.ai.

board

Overview

Before we start, let’s look at the Tensil toolchain flow to get a bird’s eye view of what we want to accomplish. We’ll follow these steps:

  1. Get Tensil
  2. Choose architecture
  3. Generate TCU accelerator design (RTL code)
  4. Synthesize for Ultra96
  5. Compile ML model for TCU
  6. Execute using PYNQ

flow

1. Get Tensil


First, we need to get the Tensil toolchain. The easiest way is to pull the Tensil docker container from Docker Hub. The following command will pull the image and then run the container.

docker pull tensilai/tensil
docker run -v $(pwd):/work -w /work -it tensilai/tensil bash

2. Choose architecture


Tensil’s strength is customizability, making it suitable for a very wide range of applications. The Tensil architecture definition file (.tarch) specifies the parameters of the architecture to be implemented. These parameters are what make Tensil flexible enough to work for small embedded FPGAs as well as large data-center FPGAs. Our example will select parameters that provide the highest utilization of resources on the ZU3EG FPGA part at the core of the Ultra96 board. The container image conveniently includes the architecture file for the Ultra96 development board at /demo/arch/ultra96v2.tarch. Let’s take a look at what’s inside.

{
    "data_type": "FP16BP8",
    "array_size": 16,
    "dram0_depth": 2097152,
    "dram1_depth": 2097152,
    "local_depth": 20480,
    "accumulator_depth": 4096,
    "simd_registers_depth": 1,
    "stride0_depth": 8,
    "stride1_depth": 8
}

The file contains a JSON object with several parameters. The first, data_type, defines the data type used throughout the Tensor Compute Unit (TCU), including in the systolic array, SIMD ALUs, accumulators, and local memory. We will use 16-bit fixed-point with an 8-bit base point (FP16BP8), which in most cases allows simple rounding of 32-bit floating-point models without the need for quantization. Next, array_size defines a systolic array size of 16x16, which results in 256 parallel multiply-accumulate (MAC) units. This number was chosen to maximize the utilization of DSP units available on ZU3EG, but if you needed to use some DSPs for another application in parallel, you could decrease it to free some up.

With dram0_depth and dram1_depth, we define the size of DRAM0 and DRAM1 memory buffers on the host side. These buffers feed the TCU with the model’s weights and inputs, and also store intermediate results and outputs. Note that these memory sizes are in number of vectors, which means array size (16) multiplied by data type size (16-bits) for a total of 256 bits per vector.
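
To make the vector sizing concrete, here is a quick back-of-the-envelope calculation in Python (a sketch assuming the FP16BP8 data type and 16x16 array defined above):

array_size = 16                             # scalars per vector
scalar_bits = 16                            # FP16BP8 is 16 bits wide
vector_bits = array_size * scalar_bits      # 256 bits (32 bytes) per vector

dram0_bytes = 2097152 * (vector_bits // 8)  # 67,108,864 bytes = 64 MiB per DRAM buffer
print(vector_bits, dram0_bytes)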

Next, we define the size of the local and accumulator memories which will be implemented on the FPGA fabric itself. The difference between the accumulators and the local memory is that accumulators can perform a write-accumulate operation in which the input is added to the data already stored, as opposed to simply overwriting it. The total size of accumulators plus local memory is again selected to maximize the utilization of BRAM resources on ZU3EG, but if necessary you could reduce these to free up resources needed elsewhere.

With simd_registers_depth, we specify the number of registers included in each SIMD ALU, which can perform SIMD operations on stored vectors used for ML operations like ReLU activation. Increasing this number is only needed rarely, to help compute special activation functions. Finally, stride0_depth and stride1_depth specify the number of bits to use for enabling “strided” memory reads and writes. It’s unlikely you’ll ever need to change this parameter.

3. Generate TCU accelerator design (RTL code)


Now that we’ve selected our architecture, it’s time to run the Tensil RTL generator. RTL stands for “Register Transfer Level” – it’s a type of code that specifies digital logic stuff like wires, registers and low-level logic. Special tools like Xilinx Vivado or yosys can synthesize RTL for FPGAs and even ASICs.

To generate a design using our chosen architecture, run the following command inside the Tensil toolchain docker container:

tensil rtl -a /demo/arch/ultra96v2.tarch -s true -d 128

Note the -d 128 parameter, which specifies that the generated RTL will be compatible with 128-bit AXI interfaces supported by the ZU3EG part. This command will produce several Verilog files listed in the ARTIFACTS table printed out at the end. It also prints the RTL SUMMARY table with some of the essential parameters of the resulting RTL.

-----------------------------------------------------------------------
RTL SUMMARY
-----------------------------------------------------------------------
Data type:                                      FP16BP8   
Array size:                                     16        
Consts memory size (vectors/scalars/bits):      2,097,152 33,554,432 21
Vars memory size (vectors/scalars/bits):        2,097,152 33,554,432 21
Local memory size (vectors/scalars/bits):       20,480    327,680    15
Accumulator memory size (vectors/scalars/bits): 4,096     65,536     12
Stride #0 size (bits):                          3         
Stride #1 size (bits):                          3         
Operand #0 size (bits):                         24        
Operand #1 size (bits):                         24        
Operand #2 size (bits):                         16        
Instruction size (bytes):                       9         
-----------------------------------------------------------------------

4. Synthesize for Ultra96


It is now time to start Xilinx Vivado. I will be using version 2021.2, which you can download free of charge (for prototyping) at the Xilinx website.

First, create a new RTL project named tensil-ultra96v2 and add Verilog files generated by the Tensil RTL tool.

new_project_rtl

Choose boards and search for Ultra96. Select Ultra96-V2 Single Board Computer with file version 1.2. You may need to click the Install icon in the Status column. (If you don’t find the board, click on the Refresh button below.)

new_project_board

Under IP INTEGRATOR, click Create Block Design.

create_design

Drag top_ultra96v2 from the Sources tab onto the block design diagram. You should see the Tensil RTL block with its interfaces.

design_tensil_rtl

Next, click the plus + button in the Block Diagram toolbar (upper left) and select “Zynq UltraScale+ MPSoC” (you may need to use the search box). Do the same for “Processor System Reset”. The Zynq block represents the “hard” part of the Xilinx platform, which includes ARM processors, DDR interfaces, and much more. The Processor System Reset is a utility box that provides the design with correctly synchronized reset signals.

Click “Run Block Automation” and “Run Connection Automation”. Check “All Automation”.

Double-click Zynq UltraScale+ MPSoC. First, go to Clock Configuration and ensure PL Fabric Clocks have PL0 checked and set to 100MHz.

zynq_clocks

Then, go to PS-PL Configuration. Uncheck AXI HPM1 FPD and check AXI HP1 FPD, AXI HP2 FPD, and AXI HP3 FPD. These changes configure all the interfaces between the Processing System (PS) and Programmable Logic (PL) needed for our design.

zynq_ps_pl

Now, connect the m_axi_dram0 and m_axi_dram1 ports on the Tensil block to S_AXI_HP1_FPD and S_AXI_HP2_FPD on the Zynq block, respectively. The TCU has two DRAM banks to enable their parallel operation by utilizing separate PS ports.

Next, click the plus + button in the Block Diagram toolbar and select “AXI Direct Memory Access” (DMA). The DMA block is used to organize the feeding of the Tensil program to the TCU without keeping the PS ARM processor busy.

Double-click it. Disable “Scatter Gather Engine” and “Write Channel”. Change “Width of Buffer Length Register” to 26 bits. Set “Memory Map Data Width” and “Stream Data Width” to 128 bits. Change “Max Burst Size” to 256.

dma

Connect the instruction port on the Tensil top block to M_AXIS_MM2S on the AXI DMA block. Then, connect M_AXI_MM2S on the AXI DMA block to S_AXI_HP1_FPD on Zynq.

Once again, click the plus + button in the Block Diagram toolbar and select “AXI SmartConnect”. The SmartConnect is necessary to expose DMA control registers to the Zynq CPU, which will enable software to control the DMA transactions. Double-click it and set “Number of Slave and Master Interfaces” to 1.

smartconnect

Connect M00_AXI on the AXI SmartConnect block to S_AXI_LITE on the AXI DMA block. Connect S00_AXI on the AXI SmartConnect to M_AXI_HPM0_FPD on the Zynq block.

Finally, click “Run Connection Automation” and check “All Automation”. By doing this, we connect all the clocks and resets. Click the “Regenerate Layout” button in the Block Diagram toolbar to make the diagram look nice.

design_final

Next, switch to the “Address Editor” tab. Click the “Assign All” button in the toolbar. By doing this, we assign address spaces to various AXI interfaces. For example, m_axi_dram0 and m_axi_dram1 gain access to the entire address space on the Ultra96 board, including DDR memory and control register spaces. We only need access to DDR, so you can manually exclude the register address space if you know what you’re doing.

design_address

Back in the Block Diagram tab, click the “Validate Design” (or F6) button. You should see a message informing you that validation was successful! You can now close the Block Design by clicking x in the upper right corner.

The final step is to create the HDL wrapper for our design, which will tie everything together and enable synthesis and implementation. Right-click the tensil_ultra96v2 item in the Sources tab and choose “Create HDL Wrapper”. Keep “Let Vivado manage wrapper and auto-update” selected. Wait for the Sources tree to be fully updated and right-click on tensil_ultra96v2_wrapper. Choose Set as Top.

Now it’s time to let Vivado perform synthesis and implementation and write the resulting bitstream. In the Flow Navigator sidebar, click on “Generate Bitstream” and hit OK. Vivado will start synthesizing our Tensil design – this may take around 15 minutes. When done, you can observe some vital stats in the Project Summary. First, look at utilization, which shows what percentage of each FPGA resource our design is using. Note how we pushed BRAM and DSP resources to high utilization.

utilization

The second is timing, which tells us how long it takes for signals to propagate in our programmable logic (PL). The “Worst Negative Slack” being a positive number is good news – our design meets propagation constraints for all nets at the specified clock speed!

timing

5. Compile ML model for TCU


The second branch of the Tensil toolchain flow is to compile the ML model to a Tensil binary consisting of TCU instructions, which are executed by the TCU hardware directly. For this tutorial, we will use ResNet20 trained on the CIFAR dataset. The model is included in the Tensil docker image at /demo/models/resnet20v2_cifar.onnx. From within the Tensil docker container, run the following command.

tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/resnet20v2_cifar.onnx -o "Identity:0" -s true

We’re using the ONNX version of the model, but the Tensil compiler also supports TensorFlow, which you can try by compiling the same model in TensorFlow frozen graph form at /demo/models/resnet20v2_cifar.pb.

tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/resnet20v2_cifar.pb -o "Identity" -s true

The resulting compiled files are listed in the ARTIFACTS table. The manifest (tmodel) is a plain text JSON description of the compiled model. The Tensil program (tprog) and weights data (tdata) are both binaries to be used by the TCU during execution. The Tensil compiler also prints a COMPILER SUMMARY table with interesting stats for both the TCU architecture and the model.

----------------------------------------------------------------------------------------------
COMPILER SUMMARY
----------------------------------------------------------------------------------------------
Model:                                           resnet20v2_cifar_onnx_ultra96v2 
Data type:                                       FP16BP8                         
Array size:                                      16                              
Consts memory size (vectors/scalars/bits):       2,097,152                       33,554,432 21
Vars memory size (vectors/scalars/bits):         2,097,152                       33,554,432 21
Local memory size (vectors/scalars/bits):        20,480                          327,680    15
Accumulator memory size (vectors/scalars/bits):  4,096                           65,536     12
Stride #0 size (bits):                           3                               
Stride #1 size (bits):                           3                               
Operand #0 size (bits):                          24                              
Operand #1 size (bits):                          24                              
Operand #2 size (bits):                          16                              
Instruction size (bytes):                        9                               
Consts memory maximum usage (vectors/scalars):   35,743                          571,888    
Vars memory maximum usage (vectors/scalars):     13,312                          212,992    
Consts memory aggregate usage (vectors/scalars): 35,743                          571,888    
Vars memory aggregate usage (vectors/scalars):   46,097                          737,552    
Number of layers:                                23                              
Total number of instructions:                    102,741                         
Compilation time (seconds):                      30.066                          
True consts scalar size:                         568,474                         
Consts utilization (%):                          97.210                          
True MACs (M):                                   61.476                          
MAC efficiency (%):                              0.000                           
----------------------------------------------------------------------------------------------

6. Execute using PYNQ


Now it’s time to put everything together on our development board. For this, we first need to set up the PYNQ environment. This process starts with downloading the SD card image for our development board. Detailed instructions for setting up board connectivity are on the PYNQ documentation website. You should be able to open Jupyter notebooks and run some examples.

There is one caveat that needs addressing once PYNQ is installed. On the default PYNQ image, the setting for the Linux kernel CMA (Contiguous Memory Allocator) area size is 128MB. Given our Tensil architecture, the default CMA size is too small. To address this, you’ll need to download our patched kernel, copy it to /boot, and reboot your board. Note that the patched kernel is built for PYNQ 2.7 and will not work with other versions. To patch the kernel, run these commands:

wget https://s3.us-west-1.amazonaws.com/downloads.tensil.ai/pynq/2.7/ultra96v2/image.ub
scp image.ub xilinx@192.168.3.1:
ssh xilinx@192.168.3.1
sudo cp /boot/image.ub /boot/image.ub.backup
sudo cp image.ub /boot/
rm image.ub
sudo reboot

Now that PYNQ is up and running, the next step is to scp the Tensil driver for PYNQ. Start by cloning the Tensil GitHub repository to your workstation and then copy drivers/tcu_pynq to /home/xilinx/tcu_pynq on your board.

git clone git@github.com:tensil-ai/tensil.git
scp -r tensil/drivers/tcu_pynq xilinx@192.168.3.1:

We also need to scp the bitstream and compiler artifacts.

Next we’ll copy over the bitstream, which contains the FPGA configuration resulting from Vivado synthesis and implementation. PYNQ also needs a hardware handoff file that describes FPGA components accessible to the host, such as DMA. Place both files in /home/xilinx on the development board. Assuming you are in the Vivado project directory, run the following commands to copy files over.

scp tensil-ultra96v2.runs/impl_1/tensil_ultra96v2_wrapper.bit xilinx@192.168.3.1:tensil_ultra96v2.bit
scp tensil-ultra96v2.gen/sources_1/bd/tensil_ultra96v2/hw_handoff/tensil_ultra96v2.hwh xilinx@192.168.3.1:

Note that we renamed the bitstream to match the hardware handoff file name.

Now, copy the .tmodel, .tprog and .tdata artifacts produced by the compiler to /home/xilinx on the board.

scp resnet20v2_cifar_onnx_ultra96v2.t* xilinx@192.168.3.1:

The last thing needed to run our ResNet model is the CIFAR dataset. You can get it from Kaggle or run the commands below (since we only need the test batch, we remove the training batches to reduce the file size). Put these files in /home/xilinx/cifar-10-batches-py/ on your development board.

wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xfvz cifar-10-python.tar.gz
rm cifar-10-batches-py/data_batch_*
scp -r cifar-10-batches-py xilinx@192.168.3.1:

We are finally ready to fire up the PYNQ Jupyter notebook and run the ResNet model on TCU.

Jupyter notebook

First, we import the Tensil PYNQ driver and other required utilities.

import sys
sys.path.append('/home/xilinx')

# Needed to run inference on TCU
import time
import numpy as np
import pynq
from pynq import Overlay
from tcu_pynq.driver import Driver
from tcu_pynq.architecture import ultra96

# Needed for unpacking and displaying image data
%matplotlib inline
import matplotlib.pyplot as plt
import pickle

Now, initialize the PYNQ overlay from the bitstream and instantiate the Tensil driver using the TCU architecture and the overlay’s DMA configuration. Note that we are passing axi_dma_0 object from the overlay – the name matches the DMA block in the Vivado design.

overlay = Overlay('/home/xilinx/tensil_ultra96v2.bit')
tcu = Driver(ultra96, overlay.axi_dma_0)

The Tensil PYNQ driver includes the Ultra96 architecture definition. Here is an excerpt from architecture.py; you can see that it matches the architecture we used previously.

ultra96 = Architecture(
    data_type=DataType.FP16BP8,
    array_size=16,
    dram0_depth=2097152,
    dram1_depth=2097152,
    local_depth=20480,
    accumulator_depth=4096,
    simd_registers_depth=1,
    stride0_depth=8,
    stride1_depth=8,
)
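
If you build a different .tarch, you can mirror it in the driver in the same way. The following is only an illustrative sketch: the field values are hypothetical and the DataType import path is assumed, so adjust it to your driver version.

from tcu_pynq.architecture import Architecture
from tcu_pynq.data_type import DataType  # assumed import path; check your driver version

my_arch = Architecture(
    data_type=DataType.FP16BP8,
    array_size=8,              # hypothetical 8x8 systolic array
    dram0_depth=1048576,
    dram1_depth=1048576,
    local_depth=8192,
    accumulator_depth=2048,
    simd_registers_depth=1,
    stride0_depth=8,
    stride1_depth=8,
)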

Next, let’s load CIFAR images from the test_batch.

def unpickle(file):
    with open(file, 'rb') as fo:
        d = pickle.load(fo, encoding='bytes')
    return d

cifar = unpickle('/home/xilinx/cifar-10-batches-py/test_batch')
data = cifar[b'data']
labels = cifar[b'labels']

data = data[10:20]
labels = labels[10:20]

data_norm = data.astype('float32') / 255
data_mean = np.mean(data_norm, axis=0)
data_norm -= data_mean

cifar_meta = unpickle('/home/xilinx/cifar-10-batches-py/batches.meta')
label_names = [b.decode() for b in cifar_meta[b'label_names']]

def show_img(data, n):
    plt.imshow(np.transpose(data[n].reshape((3, 32, 32)), axes=[1, 2, 0]))

def get_img(data, n):
    img = np.transpose(data_norm[n].reshape((3, 32, 32)), axes=[1, 2, 0])
    img = np.pad(img, [(0, 0), (0, 0), (0, tcu.arch.array_size - 3)], 'constant', constant_values=0)
    return img.reshape((-1, tcu.arch.array_size))

def get_label(labels, label_names, n):
    label_idx = labels[n]
    name = label_names[label_idx]
    return (label_idx, name)

To test, extract one of the images.

n = 9
img = get_img(data, n)
label_idx, label = get_label(labels, label_names, n)
show_img(data, n)

You should see the image.

frog

Next, load the tmodel manifest for the model into the driver. The manifest tells the driver where to find the other two binary files (program and weights data).

tcu.load_model('/home/xilinx/resnet20v2_cifar_onnx_ultra96v2.tmodel')

Finally, run the model and print the results! The call to tcu.run(inputs) is where the magic happens. We’ll convert the ResNet classification result vector into CIFAR labels. Note that if you are using the ONNX model, the input and output are named x:0 and Identity:0 respectively. For the TensorFlow model they are named x and Identity.

inputs = {'x:0': img}

start = time.time()
outputs = tcu.run(inputs)
end = time.time()
print("Ran inference in {:.4}s".format(end - start))
print()

classes = outputs['Identity:0'][:10]
result_idx = np.argmax(classes)
result = label_names[result_idx]
print("Output activations:")
print(classes)
print()
print("Result: {} (idx = {})".format(result, result_idx))
print("Actual: {} (idx = {})".format(label, label_idx))

Here is the expected result:

Ran inference in 0.03043s

Output activations:
[-13.59375    -12.25        -7.90625     -6.21484375  -8.25
 -12.24609375  15.0390625  -15.10546875 -10.71875     -9.1796875 ]

Result: frog (idx = 6)
Actual: frog (idx = 6)

Congratulations! You ran a machine learning model on a custom ML accelerator that you built on your own workstation! Just imagine the things you could do with it…

Wrap-up


In this tutorial we used Tensil to show how to run machine learning (ML) models on FPGA. We went through a number of steps to get here, including installing Tensil, choosing an architecture, generating an RTL design, synthesizing the design, compiling the ML model and finally executing the model using PYNQ.

If you made it all the way through, big congrats! You’re ready to take things to the next level by trying out your own model and architecture. Join us on Discord to say hello and ask questions, or send an email to support@tensil.ai.

4 - Reference

Handy reference material

4.1 - Benchmarks

Performance benchmarks and information

Methodology

Benchmarks are generated using the Tensil compiler. Each instruction is evaluated against a latency model to compute expected execution time. Actual results may therefore differ somewhat from the numbers listed here. Help us improve the latency model!

ResNet-20v2

Trained for CIFAR.

FPGA Board | Tensil Array Size | Clock (MHz) | Latency (ms) | Frames per second
Arty A7-35 | 8x8 | 150 | 21 | 48
Pynq Z1 | 12x12 | 150 | 14 | 71
Ultra96-V2 | 16x16 | 300 | 4 | 250

YoloV4-tiny

Trained for ImageNet.

FPGA Board | Tensil Array Size | Clock (MHz) | Latency (ms) | Frames per second
Arty A7-35 | 8x8 | 150 | 175 | 5.7
Pynq Z1 | 12x12 | 150 | 112 | 8.9
Ultra96-V2 | 16x16 | 300 | 36 | 28

ResNet-50v2

Trained for ImageNet.

FPGA Board | Tensil Array Size | Clock (MHz) | Latency (ms) | Frames per second
Arty A7-35 | 8x8 | 150 | 1969 | 0.5
Pynq Z1 | 12x12 | 150 | 833 | 1.2
Ultra96-V2 | 16x16 | 300 | 260 | 3.8

4.2 - Compiler

Compiler concepts, components and their interaction

4.2.1 - Structure diagram

Tensil compiler structure block diagram

architecture

4.2.2 - Frontend

Description of compiler frontend

The frontend is responsible for handling the compiler’s primary input – an ML model. With many ML frameworks in existence, the compiler isolates framework-specific support in the frontend. In other words, we envision multiple dedicated frontends able to handle models created by each ML framework. Currently, there are two frontends supporting TensorFlow and ONNX, with input in the form of model.pb and model.onnx files respectively. The frontend parses the model, represented in the form of a graph. It uses one or more output nodes to linearize the graph into a series of nodes respecting dataflow dependencies.

The frontend then processes this linearized series. During this processing, the frontend groups model nodes to form layers. Each layer represents one complete cycle that starts with matrix multiplication, continues with a series of accumulator operations, and finishes by moving the result out of the accumulators. In essence, the content of the accumulators and systolic array weights is never shared between layers.

The frontend interacts with the memory manager to obtain necessary memory objects. There are two banks of memory directly accessible to the host: DRAM0 and DRAM1. The compiler dedicates DRAM0 to store variable data objects (Vars) such as inputs, outputs, and the data passed between layers. Next, it dedicates DRAM1 to various constants (Consts), such as matrix multiplication weights and bias, constants used in accumulator operations, and constants used to blend with variable data objects (like zero-padding). The frontend creates a new instance of the scheduler for each layer and submits a series of high-level intermediate representation (HIR) operations based on model nodes present in the layer. The frontend allocates special temporary (Temp) memory objects to pass the data between HIR operations within a single layer. The scheduler is later responsible for mapping this temporary memory to available accumulators.
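
As a rough illustration of the linearization step (a conceptual Python sketch, not the actual Scala implementation), the graph can be walked backwards from the output nodes so that every node appears after all of its producers:

def linearize(output_nodes, producers_of):
    # producers_of(node) returns the nodes whose outputs this node consumes
    ordered, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for producer in producers_of(node):
            visit(producer)
        ordered.append(node)

    for output in output_nodes:
        visit(output)
    return ordered  # a series of nodes respecting dataflow dependencies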

4.2.3 - Opsets

Supported operations

Tensorflow

Operation Comments
MatMul
Conv2D Only SAME and VALID paddings are supported.
BiasAdd
ResizeBilinear Resize image with align corners is not supported.
FusedBatchNormV3
MaxPool Only SAME and VALID paddings are supported.
AvgPool Only SAME and VALID paddings are supported.
Mean Only channel mean is supported.
Relu
LeakyRelu
AddV2
ConcatV2 Only last dimension concat is supported.
Split Only last dimension split is supported.
SplitV Only last dimension split is supported.
Pad Only 4D padding is supported. Only height/width padding is supported.
Reshape
Cast* Only DT_INT32 to DT_FLOAT cast is supported.
Tile*
Pack* Only first axis pack is supported.
StridedSlice* Only 1D strided slice is supported. Only strided slice with shrink axis is supported.
Shape*
  • Operations marked with an asterisk are supported only as compile-time constant folding

Onnx

We support a subset of ONNX v8.

4.2.4 - Memory manager

Description of compiler memory manager

The memory manager is responsible for allocating and, when necessary, freeing memory objects. A memory object represents a series of memory addresses (a memory span) with associated tensor dimensions. The scheduler uses the dimensions to ensure the correctness of the dataflow. In addition, the memory manager tracks pending constants found in model nodes. Pending means that when the frontend processes the constant, it is not yet known whether it will become a memory object or be used as a parameter to one of the HIR operations. When a pending constant becomes a Const memory object, it gets emitted as part of the model.tdata file later used by the driver to place into host memory. The memory manager also emits a memory map for the Consts and Vars memories. Such a map is included in the model.tmodel file to inform the driver of the memory layout in which to place the content of the model.tdata file, as well as the model’s inputs and outputs.
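
To make the terminology concrete, here is a minimal conceptual sketch of a memory object and allocation (illustrative Python only; the actual compiler is written in Scala and its classes may differ):

from dataclasses import dataclass
from typing import Tuple

@dataclass
class MemoryObject:
    tag: str                # "Consts", "Vars" or "Temp"
    base: int               # first address of the span, in vectors
    size: int               # number of vectors in the span
    dims: Tuple[int, ...]   # tensor dimensions, used by the scheduler

class MemoryManager:
    def __init__(self, tag, depth):
        self.tag, self.depth, self.next_free = tag, depth, 0

    def allocate(self, size, dims):
        assert self.next_free + size <= self.depth, "out of memory"
        obj = MemoryObject(self.tag, self.next_free, size, dims)
        self.next_free += size
        return obj

vars_mem = MemoryManager("Vars", depth=2097152)
input_obj = vars_mem.allocate(size=64, dims=(32, 32, 16))  # hypothetical input tensor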

4.2.5 - High-level intermediate representation

Explanation of HIR

High-level intermediate representation (HIR) is an interface offered by the scheduler. It expresses common ML operations abstracted from specific ML frameworks, such as ONNX. It also operates in terms of memory objects.

Following are a few examples of HIR.

def emitMatMul(
      weightsObjs: Seq[MemoryObject],
      biasObj: Option[MemoryObject],
      inputOutputPairs: Seq[MemoryOptionalInputOutputObjects]
  ): Unit

The emitMatMul function takes the weights and bias memory objects and a sequence of input-output object pairs. It performs matrix multiplication for each input memory object and places the result in the corresponding output memory object. The input is optional, in which case it is assumed to be all zeroes. The weights and bias must be Consts objects. The input must be a Vars object, and the output must be a Temp object.

def emitRelu(
      inputObj: MemoryObject,
      outputObj: MemoryObject
  ): Unit

The emitRelu function performs ReLU activation on the input object and places the result in the output object. Both input and output must be Temp objects.

def emitSave(
      inputObj: MemoryObject,
      outputObj: MemoryObject
  ): Unit

The emitSave function moves data from the Temp input object to the Vars output object, usually at the end of a layer.

4.2.6 - Scheduler

Description of compiler execution scheduler

scheduler

The scheduler is responsible for transforming the high-level intermediate representation (HIR) produced by the frontend into the low-level intermediate representation (LIR) consumed by the backend. The main objective of this transformation is to schedule HIR operations expressed in terms of the relatively large Vars, Consts, and unlimited Temp memories onto the limited SRAM local memory and accumulators available in a specific configuration of the processing unit. Internally, it achieves this by building a dataflow graph based on memory addresses and finding its maximum partitioning that fits the local memory and accumulators. Such a partition is called a stage. The scheduler then produces LIR for every stage independently. As with the frontend layers, stages don’t share weights in the systolic array or the content of the accumulators. At the moment, they don’t share data in the local memory either, which we expect to change once the compiler has to work efficiently with larger-sized local memory.
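
Conceptually, partitioning can be pictured as growing a stage until its Temp footprint no longer fits the accumulators (a heavily simplified Python sketch, not the actual algorithm):

def partition_into_stages(ops, temp_addresses_of, accumulator_depth):
    # Split a list of HIR operations into stages whose Temp address
    # footprint fits within the accumulator memory.
    stages, current, footprint = [], [], set()
    for op in ops:
        addresses = set(temp_addresses_of(op))
        if current and len(footprint | addresses) > accumulator_depth:
            stages.append(current)          # close the stage; nothing is shared with the next one
            current, footprint = [], set()
        current.append(op)
        footprint |= addresses
    if current:
        stages.append(current)
    return stages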

4.2.7 - Low-level intermediate representation

Explanation of LIR

Low-level intermediate representation (LIR) is an interface offered by the backend. It expresses instructions supported by the processing unit. Unlike HIR, it operates in terms of memory addresses. Each memory address is tagged with its memory type. While HIR memory objects are expected to contain Vars, Consts and Temp tagged addresses, LIR only accepts Vars, Consts, Local memory, and Accumulator tagged addresses. One of the scheduler’s key roles is to perform this translation.

Following are a few examples of LIR. Each produces the corresponding processing unit instruction. Note that LIR does not use instruction flags directly. The backend’s role is to infer these flags from LIR arguments, such as the accumulate and toLocal booleans and the memory address tags.

def emitMatMul(
      accumulate: Boolean,
      localStride: Int,
      localAddress: MemoryAddress,
      accumulatorStride: Int,
      accumulatorAddress: MemoryAddress,
      size: Long,
      comments: List[String] = List()
  ): Unit

def emitSIMD(
      accumulate: Boolean,
      simdOp: Int,
      simdSourceLeft: Int,
      simdSourceRight: Int,
      simdDestination: Int,
      writeAccumulatorAddress: MemoryAddress,
      readAccumulatorAddress: MemoryAddress,
      comments: List[String] = List()
  ): Unit

def emitDataMove(
      toLocal: Boolean,
      accumulate: Boolean,
      localAddress: MemoryAddress,
      address: MemoryAddress,
      size: Long,
      comments: List[String] = List()
  ): Unit

4.2.8 - Backend

Description of compiler backend

The backend is responsible for translating LIR into the model.tprog and model.tmodel files containing the binary representation of the processing unit program and the information required by the driver to feed the program into the processing unit. It computes the instruction layout based on compiler options such as memory and SIMD registers depth. To produce instruction binary form, the backend infers instruction flags based on LIR arguments.
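
As an example of this layout computation, the 9-byte instruction size reported for the Ultra96 architecture in the RTL and compiler summaries earlier in this document can be reproduced with a small calculation (a sketch that only restates the sizing rules from the instruction set reference below):

import math

opcode_bits = 4
flags_bits = 4

# Operand sizes for the Ultra96 architecture, as printed in the summaries:
# stride bits plus the address bits of the largest memory allowed in the operand.
operand0_bits = 24
operand1_bits = 24
operand2_bits = 16

total_bits = opcode_bits + flags_bits + operand0_bits + operand1_bits + operand2_bits
print(math.ceil(total_bits / 8))  # 9, matching "Instruction size (bytes): 9"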

4.3 - Hardware

Hardware architecture and implementation details

4.3.1 - Architectural parameters

A list of architectural parameters and their descriptions
Parameter | Description | Allowable values | Example value
Data type | The numerical format used to perform calculations in hardware | FP16BP8, FP32B16 | FP16BP8, which means “Fixed point format with width 16 bits and with the binary point at 8 bits”
Array size | The size of the systolic array and also the number of scalars in each vector | 2-256 | 8
DRAM0 depth | The number of vectors allocated in DRAM bank 0 | 2^1 to 2^32 | 1048576 (= 2^20)
DRAM1 depth | The number of vectors allocated in DRAM bank 1 | 2^1 to 2^32 | 1048576 (= 2^20)
Local depth | The number of vectors allocated in on-fabric main memory | 2^1 to 2^16 | 16384 (= 2^14)
Accumulator depth | The number of vectors allocated in on-fabric accumulator memory | 2^1 to 2^16 | 4096 (= 2^12)
SIMD registers depth | The number of registers to instantiate for each ALU in the SIMD module | 0-16 | 1

4.3.2 - Architecture diagram

Tensor Compute Unit architecture block diagram

Tensor Compute Unit architecture diagram

4.3.3 - Configuration registers

A description of the Tensor Compute Unit configuration registers
Name | Meaning | Width (bits) | Number | Default value
DRAM0 address offset* | The offset in DRAM of the memory space allocated for use by the DRAM0 interface | 32 | 0x00 | 0x0000
DRAM0 cache behaviour | This register is passed without modification to the AXI converter for the DRAM0 interface, where it is used as the value of the AxCACHE field in both reads and writes. The default value of 0b0000 indicates no caching allowed. See the AXI4 protocol spec for more details. | 4 | 0x01 | 0b0000
<unused> | - | - | 0x02-0x03 | -
DRAM1 address offset* | The offset in DRAM of the memory space allocated for use by the DRAM1 interface | 32 | 0x04 | 0x0000
DRAM1 cache behaviour | This register is passed without modification to the AXI converter for the DRAM1 interface, where it is used as the value of the AxCACHE field in both reads and writes. The default value of 0b0000 indicates no caching allowed. See the AXI4 protocol spec for more details. | 4 | 0x05 | 0b0000
<unused> | - | - | 0x06-0x07 | -
Timeout | The number of cycles the decoder will remain in the same state before raising the timeout flag. This usually indicates something has stalled and is useful for triggering debug ILAs. | 16 | 0x08 | 0x0064
Tracepoint | The value of the program counter at which the tracepoint flag will be raised. Useful for triggering debug ILAs to inspect hardware state during execution. | 32 | 0x09 | 0xFFFFFFFF
Program counter | Increments by 1 every time an instruction is completely processed. | 32 | 0x0A | 0x00000000
Sample interval | The period in cycles at which to sample the program counter and decoder control bus handshake signals. The default value of 0x0000 disables sampling. | 16 | 0x0B | 0x0000

*DRAM address offsets are specified in 64K blocks. The real address offset is 2^16 times the address offset register value configured. That is, if the DRAMx address offset register is configured with value 0x0001, the actual address offset that will appear in requests on the AXI bus will be 0x00010000.

4.3.4 - Instruction set

A description of the Tensor Compute Unit instruction set
Name | Description | Opcode | Flags | Operand #0 | Operand #1 | Operand #2
NoOp | Do nothing | 0x0 | - | - | - | -
MatMul | Load input at memory address into systolic array and store result at accumulator address | 0x1 | Accumulate?, Zeroes? | Local Memory stride/address | Accumulator stride/address | Size
DataMove | Move data between the main memory and either the accumulators or one of two off-chip DRAMs | 0x2 | Data flow control enum (see below) | Local Memory stride/address | Accumulator or DRAM stride/address | Size
LoadWeight | Load weight from memory address into systolic array | 0x3 | Zeroes? (ignores operand #0) | Local Memory stride/address | Size | -
SIMD | Perform computations on data in the accumulator | 0x4 | Read?, Write?, Accumulate? | Accumulator write address | Accumulator read address | SIMD sub-instruction
LoadLUT | Load lookup tables from memory address | 0x5 | - | Local Memory stride/address | Lookup table number | -
<unused> | - | 0x6-0xE | - | - | - | -
Configure | Set configuration registers | 0xF | - | Register number | Value | -

Notes

  • Weights should be loaded in reverse order

  • Since Size = 0 doesn’t make sense, the size argument is interpreted as 1 less than the size of data movement requested, i.e.

    • size = 0 means transfer 1 vector
    • size = 1 means transfer 2 vectors
    • size = 255 means transfer 256 vectors etc.
  • Instruction width is a parameter supplied to the RTL generator

    • Opcode field is always 4 bits
    • Flags field is always 4 bits
    • Instruction must be large enough to fit the maximum values of all operands in the longest instruction (MatMul, DataMove, SIMD)
  • Flags are represented in the following order: [3 2 1 0]

    • i.e. the first flag listed is at bit 0 (the 4th bit), second flag is at bit 1 (the 3rd bit) etc.
  • Arguments are in the following order: [2 1 0]

    • e.g. in MatMul the bits of the instruction will be, from most significant bit to least: opcode, optional zero padding, accumulate?, size, accumulator stride/address, memory stride/address

    • Address unit for all memories is one array vector

    • Stride has a fixed number of bits followed by the number of bits for the address of the largest memory that may appear in the operand. The address for smaller memories gets padded by zeros. Stride is encoded as a power of 2. For example, the 3-bit stride is as follows: 000=1, 001=2, 010=4, 011=8, …, 111=128

      • e.g. in a 2-byte argument with an 11-bit address and a 3-bit stride, the bit layout would be (a small encoding sketch follows these notes)

        • 15:14 = padding, to be set to zeroes
        • 13:11 = stride
        • 10:0 = address
    • Size unit is one array vector

  • Data flow control enum flag values are:

    • 0b0000 = 0 = 0x0 = DRAM0 to memory
    • 0b0001 = 1 = 0x1 = memory to DRAM0
    • 0b0010 = 2 = 0x2 = DRAM1 to memory
    • 0b0011 = 3 = 0x3 = memory to DRAM1
    • 0b1100 = 12 = 0xc = accumulator to memory
    • 0b1101 = 13 = 0xd = memory to accumulator
    • 0b1110 = 14 = 0xe = <reserved>
    • 0b1111 = 15 = 0xf = memory to accumulator (accumulate)
  • SIMD instructions have some subtleties

    • can read or write (+accumulate) in the same instruction

    • when the read flag is set, data is read from the accumulators into the ALU array

    • when the write flag is set, the ALU array output is written into the accumulators

      • the accumulate flag determines whether this is an accumulate or an overwrite

      • the output to be written is computed from the input that was read in on the same instruction

        • i.e. if `x` is read from the accumulators at the specified read address, and the ALU computes `f(_)` then `f(x)` will be written to the accumulators at the specified write address from the same instruction
      • data is output from the ALU array on every instruction i.e. even if the destination is specified as register 1, you can still write into the accumulators from the output

    • before reading out from the accumulators with a DataMove, you should wait at least 2 instructions since the last SIMD instruction in which the write flag was high. This is because the data takes about 2 instructions to propagate into the accumulators from the ALUs. The easiest way to achieve this is just to insert 2 no-ops before the DataMove instruction.

      • 2 instructions is an empirical estimate. The number may need to be higher in certain cases. If you see data being dropped/corrupted/repeated, talk to tom@tensil.ai about it
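
Here is a small sketch of the 2-byte operand encoding from the example above (illustrative only, assuming an 11-bit address field and a 3-bit power-of-two stride):

def encode_operand(address, stride):
    # Bits 10:0 = address, bits 13:11 = stride encoded as a power of 2, bits 15:14 = zero padding.
    assert 0 <= address < (1 << 11), "address must fit in 11 bits"
    stride_code = stride.bit_length() - 1
    assert (1 << stride_code) == stride and 0 <= stride_code < 8, "stride must be 1, 2, 4, ..., 128"
    return (stride_code << 11) | address

print(hex(encode_operand(address=0x2A, stride=4)))  # 0x102a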

SIMD sub-instructions

All SIMD instructions are composed of 4 parts: opcode, source left, source right and destination. The widths are as follows:

  • opcode = ceil(log2(numOps))

    • numOps is currently fixed at 16, so opcode is 4 bits
  • source left = ceil(log2(numRegisters+1))

    • numRegisters is currently fixed at 1, so source left is 1 bit
  • source right = source left

  • dest = source left

Source left is the left argument for binary operations, and the single argument for unary operations. Source right is the right argument for binary operations, and is ignored for unary operations.

The Move opcode allows you to move data from one register to another, or to read the data in a register to the output. The NoOp opcode is only a true no-op when both the read and write flags are set to false in the SIMD instruction. Otherwise, NoOp has an overloaded meaning: it is used to trigger an external read or write. That is, to write into the accumulators from the PE array, or to read out from the accumulators into on-chip main memory.

Opcode:
  0x00 = NoOp**
  0x01 = Zero
  0x02 = Move*
  0x03 = Not*
  0x04 = And
  0x05 = Or
  0x06 = Increment*
  0x07 = Decrement*
  0x08 = Add
  0x09 = Subtract
  0x0A = Multiply
  0x0B = Abs*
  0x0C = GreaterThan
  0x0D = GreaterThanEqual
  0x0E = Min
  0x0F = Max
  0x10 = Lookup*

Source left:
  0 = input
  1 = register 1
  2 = register 2…

Source right:
  0 = input
  1 = register 1
  2 = register 2…

Destination:
  0 = output
  1 = output & register 1
  2 = output & register 2…

*unary operation

**arguments are ignored

Lookup sub-instruction

The lookup sub-instruction returns N+1 result values, where N is the number of lookup tables in the architecture. The results are, in order:

  • the difference between the argument and the closest key found in the lookup table index
  • the value found in the first lookup table
  • the value found in the second lookup table
  • etc.

The destination register d specifies the register which will receive the first result, i.e. the difference. The remaining results will be populated in the registers numbered ascending from d, so that the result from lookup table i will be written to register d+i+1. This assumes that the architecture is configured with sufficient registers to store all the results, and that d+N <= total number of registers. The behaviour of the lookup sub-instruction is undefined if these requirements aren’t met.
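
For example, with two lookup tables (N = 2) and destination register d = 1, the rule above assigns results to registers as follows (a small sketch):

N = 2  # number of lookup tables in the architecture
d = 1  # destination register given in the sub-instruction

print("register", d, "<- difference between the argument and the closest key")
for i in range(N):
    print("register", d + i + 1, "<- value from lookup table", i)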

4.3.5 - Performance samples

A description of the Tensor Compute Unit performance samples

Performance sampling

The program counter and decoder control bus handshake signals can be sampled at a fixed interval of L cycles in order to measure system performance. The samples are written out to the sample IO bus in blocks of N sample words. The block is terminated by asserting the AXI stream TLAST signal. Each sample word is a 64-bit word, with the following meaning:

Bus name | Signal | Bit field(s) | Comments
Program counter | - | 0:31 | Contains all 1s if the sample is invalid. Invalid samples are produced when the sampling interval is set to 0.
Array | Valid | 32 | Contains all 0s if the sample is invalid.
Array | Ready | 33 | -
Acc | Valid | 34 | -
Acc | Ready | 35 | -
Dataflow | Valid | 36 | -
Dataflow | Ready | 37 | -
DRAM1 | Valid | 38 | -
DRAM1 | Ready | 39 | -
DRAM0 | Valid | 40 | -
DRAM0 | Ready | 41 | -
MemPortB | Valid | 42 | -
MemPortB | Ready | 43 | -
MemPortA | Valid | 44 | -
MemPortA | Ready | 45 | -
Instruction | Valid | 46 | -
Instruction | Ready | 47 | -
<unused> | - | 48:63 | -

The value of L can be changed by setting the sample interval configuration register. The value of N is defined by the architecture.
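
A host-side helper might unpack one 64-bit sample word as follows (a minimal sketch based only on the bit fields listed above):

def decode_sample(word):
    pc = word & 0xFFFFFFFF  # bits 0:31; all 1s if the sample is invalid
    buses = ["Array", "Acc", "Dataflow", "DRAM1", "DRAM0",
             "MemPortB", "MemPortA", "Instruction"]
    handshakes = {}
    for i, bus in enumerate(buses):
        valid = (word >> (32 + 2 * i)) & 1
        ready = (word >> (33 + 2 * i)) & 1
        handshakes[bus] = (valid, ready)
    return pc, handshakes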

5 - Concepts

Key concepts that will help you understand what Tensil does

Register Transfer Level (RTL) code

RTL is code that describes the behaviour of computational hardware. It contains constructs like modules, input and output ports, signals, registers and low-level operations. Typical RTL languages are Verilog and VHDL. An example of Verilog is shown below. Electronic design automation (EDA) tools can turn RTL into descriptions of physically realizable circuits, which can be flashed onto an FPGA or taped out as an ASIC.

module foo(
  input a,
  input b,
  output c
);
  assign c = a || b;
endmodule

RTL Generator

An RTL generator produces a blob of RTL given some high level architectural parameters. This allows you to easily create customized RTL that is specialized for a given application or use case without having to redesign the whole system. Tensil contains an RTL generator for ML accelerators.

Chisel

Tensil’s RTL generator is built using Chisel, a next generation hardware design language developed out of UC Berkeley. From the Chisel website:

Chisel is a hardware design language that facilitates advanced circuit generation and design reuse for both ASIC and FPGA digital logic designs. Chisel adds hardware construction primitives to the Scala programming language, providing designers with the power of a modern programming language to write complex, parameterizable circuit generators that produce synthesizable Verilog. This generator methodology enables the creation of re-usable components and libraries, such as the FIFO queue and arbiters in the Chisel Standard Library, raising the level of abstraction in design while retaining fine-grained control.    – Source: https://www.chisel-lang.org/, retrieved 2022/03/04

Model compiler

A model compiler takes an ML model and a target architecture and produces binary artifacts that can be executed by that architecture. In Tensil’s case, the model compiler produces three artifacts. The .tprog file is the executable containing instructions to be interpreted by the accelerator, the .tdata file contains the model’s parameters in the appropriate format, and the .tmodel file tells the driver how to set up inputs and outputs.

Driver

A driver takes an architecture description and a compiled model and interacts with the abstractions in the operating system or execution environment (i.e. low-level libraries) to feed the compiled model into the hardware. It is also responsible for setting up inputs and outputs, and managing any other resources that might be relevant on a given hardware platform.

6 - Roadmap

Our plans for continuing development

Coming soon!

7 - Scaladoc

Automatically generated Scaladoc reference materials

Packages