Computer Vision with FPGA

WORK IN PROGRESS

This project page is very much a WIP, but I'm publishing it first to pressure myself into finishing it ASAP!

TLDR Summary

Developed an object tracking system with a GUI for a fully customisable computer vision pipeline, all implemented on a single Basys3 FPGA board.

Introduction

“eh why every EE project end up tuning PID ah?”

This was the final project for EE2026: Digital Design. This module is an introductory digital logic course, with a strong emphasis on the practical aspects of implementing digital systems on FPGAs using Verilog.

The final project was completely open-ended, with the only restriction being that a single bitstream had to be used across up to 4 FPGA boards (Basys3 FPGA Board).

The typical projects implemented are recreations of popular games or enhanced versions of the suggested project, which was a graphing calculator. However, given that this was essentially the only module in our course that allowed for a fully open-ended project, we decided to push the limits of what could be done with this simple FPGA board. There were two key considerations:

  1. The project should exploit the features of the FPGA e.g. high level of parallelism, low and deterministic latency, etc. In other words, it should not be something that could be easily done with a standard microcontroller or CPU.
  2. The final result should be a product, i.e. a cohesively designed system that has a clear function, rather than a haphazard collection of boards and wires.

Ultimately, we landed on an educational tool for teaching basic computer vision (CV) concepts. This was done by implementing a GUI, displayed on a VGA monitor, that allowed users to customise an image processing pipeline for colour-based object tracking using simple drag-and-drop blocks. The camera was also mounted on a pan-tilt assembly with tunable PD controllers to physically track the object in real-time.

This project met all of our requirements: image processing operations ran in parallel with fast, predictable memory access times to allow for real-time performance, while the GUI and pan-tilt assembly made for a fun and interactive product.

The sections below detail the various components of the project, while also explaining the design choices we had to make along the way, in a (somewhat) chronological order.

CV Pipeline

The overall architecture of the CV pipeline is shown below. Broadly speaking, the pipeline can be divided into 5 main stages:

  1. Camera Interfacing (Grey): Configuring the camera’s settings, and reading the pixel data from the camera.
  2. Pre-Processing (Green): Blurring the image to reduce noise.
  3. Thresholding and Morphology (Orange): Thresholding the image based on colour ranges to create a binary bitmask, which then undergoes erode/dilate operations to clean up noise.
  4. Blob Detection (Yellow): Finding contiguous regions in the bitmask i.e. finding objects.
  5. Buffering and Display (Purple): Saving the processed image into BRAM buffers and rendering them on the VGA display.

In addition, user settings in the GUI (Blue) are decoded into control signals that modify the datapath accordingly, as summarised in the tables at the bottom.

CV Pipeline Architecture

Camera Interfacing

The camera used was the OV7670 Camera Module. This is a low-cost camera module that has been used in many existing FPGA projects, which were what inspired us to embark on CV in the first place, e.g. Camera Display to VGA, Edge Filter, Stereo Vision.

To interface with the camera, the SCCB protocol is used to set up the camera’s registers, which configure the camera’s operating parameters, e.g. resolution, colour format, sensor gain, etc. This protocol can be regarded as a proprietary variant of I2C, since data is written using a “3-Phase Write Transmission Cycle”, which is essentially the same as I2C’s structure of sequentially sending the (1) device address, (2) sub-address, and (3) data byte, all at a 100 kHz clock.

This was implemented as a simple state machine that sequentially writes to a list of pre-defined configuration registers upon power-up, as shown in the diagram below.

<TODO: make SCCB state machine diagram>
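As a rough illustration, the sequencer could be sketched as below. The `sccb_writer` submodule (not shown) is a hypothetical block assumed to handle the actual 3-phase transmission of a single {sub-address, data} pair and pulse `done` once finished; the register values listed are purely illustrative, not our full configuration.

```verilog
// Sketch of the configuration sequencer: walks a small ROM of
// {sub-address, data} pairs and hands each pair to an SCCB byte writer
// (a hypothetical sccb_writer submodule performs the 3-phase write and
// pulses `done` when the transfer completes).
module ov7670_config (
    input  wire       clk,          // system clock
    input  wire       rst,
    output reg  [7:0] reg_addr,     // camera sub-address
    output reg  [7:0] reg_data,     // data byte to write
    output reg        start,        // request a 3-phase write
    input  wire       done,         // high for one cycle when a write finishes
    output reg        config_done   // all registers written
);
    reg [3:0] index;

    // Illustrative register list only; the real list is much longer.
    always @(*) begin
        case (index)
            4'd0:    {reg_addr, reg_data} = 16'h12_80; // COM7: reset all registers
            4'd1:    {reg_addr, reg_data} = 16'h11_01; // dummy "pause" write (see quirk below)
            4'd2:    {reg_addr, reg_data} = 16'h12_04; // COM7: RGB output
            4'd3:    {reg_addr, reg_data} = 16'h40_D0; // COM15: RGB565
            default: {reg_addr, reg_data} = 16'hFF_FF; // sentinel: end of list
        endcase
    end

    always @(posedge clk) begin
        if (rst) begin
            index       <= 0;
            start       <= 0;
            config_done <= 0;
        end else if (!config_done) begin
            if ({reg_addr, reg_data} == 16'hFF_FF) begin
                config_done <= 1;       // reached end of register list
                start       <= 0;
            end else if (done) begin
                index <= index + 1;     // move on to the next register
                start <= 0;
            end else begin
                start <= 1;             // request (or keep requesting) a write
            end
        end
    end
endmodule
```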

Odd quirks: A stupid issue encountered was that after writing the first byte (0x80) to the COM7 register (0x12), which is supposed to reset all other registers, a “pause” had to be inserted by writing a default value to some register, before proceeding to write the rest of the configuration registers.

Not doing so resulted in our next immediate register write (which was meant to set the image size and pixel format) being ignored, causing the pixel capture module to output garbage pixels.

Once the camera is configured, pixel data is streamed in the chosen format (RGB565) over an 8-bit parallel data bus synced to a pixel clock (PCLK) generated by the camera. Synchronisation signals similar to VGA are also provided to indicate frame ends (VSYNC) and line ends (HREF).

Each RGB565 pixel is sent over 2 bytes, with the first byte and second byte containing the upper and lower 8 bits respectively. (Image Source: Omnivision Technologies)

RGB565 output timing diagram

The Capture module is responsible for reconstructing the pixels from this byte stream by buffering the first and second bytes into a 16-bit register. Due to memory limits, the image is downsampled to RGB444 format by truncating the least significant bits of each colour channel, and this 12-bit pixel is output on every 2nd PCLK rising edge.

Simultaneously, row and column counters are maintained to keep track of the pixel’s position in the frame, and are reset upon the VSYNC/HREF signals to stay synchronised. Due to the same memory limits, the image is also downscaled from the camera’s 640x480 resolution to 320x240 by only capturing every 2nd pixel in both dimensions. Hence, the least significant bit of both the row and column counters is dropped before calculating the pixel’s address, which is simply given by address = row × 320 + col (using the downscaled row and column values). This address directly represents the BRAM address where the pixel will be stored.

Actually, the OV7670 camera has a built-in downsampling feature (QVGA) that can be enabled via register settings. For some reason, I could not get it to work (likely just some timing issues), but since credit was not given for interfacing with external peripherals, I did not bother debugging for too long as manual downsampling was already implemented.
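Putting the capture logic together, a rough sketch is shown below. Module and signal names here are illustrative rather than the exact ones used; the byte ordering follows the RGB565 timing described above.

```verilog
// Sketch of the capture logic: pairs up the two RGB565 bytes from the
// camera, truncates to RGB444, and keeps only every 2nd pixel and row
// so the stored frame is 320x240. Runs entirely in the PCLK domain.
module ov7670_capture (
    input  wire        pclk,     // pixel clock from camera
    input  wire        vsync,    // high between frames
    input  wire        href,     // high while a row is being streamed
    input  wire [7:0]  d,        // 8-bit parallel pixel bus
    output reg  [16:0] addr,     // BRAM write address (0 .. 320*240-1)
    output reg  [11:0] pixel,    // reconstructed RGB444 pixel
    output reg         we        // write enable into the frame BRAM
);
    reg       byte_toggle = 0;   // 0: expecting first byte, 1: second byte
    reg [7:0] first_byte;
    reg [9:0] col = 0;           // full-resolution column counter (0..639)
    reg [8:0] row = 0;           // full-resolution row counter (0..479)

    always @(posedge pclk) begin
        we <= 0;
        if (vsync) begin
            // frame boundary: reset everything
            row         <= 0;
            col         <= 0;
            byte_toggle <= 0;
        end else if (!href) begin
            // horizontal blanking: advance the row once, restart the column count
            if (col != 0) row <= row + 1;
            col         <= 0;
            byte_toggle <= 0;
        end else if (!byte_toggle) begin
            first_byte  <= d;    // upper byte: R[4:0], G[5:3]
            byte_toggle <= 1;
        end else begin
            // second byte: G[2:0], B[4:0] -- the pixel is now complete
            if (!col[0] && !row[0]) begin
                // keep every 2nd pixel/row; truncate RGB565 down to RGB444
                pixel <= {first_byte[7:4],          // R[4:1]
                          first_byte[2:0], d[7],    // G[5:2]
                          d[4:1]};                  // B[4:1]
                addr  <= row[8:1] * 320 + col[9:1];
                we    <= 1;
            end
            col         <= col + 1;
            byte_toggle <= 0;
        end
    end
endmodule
```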

Sliding Window

All the image processing operations used (in both the pre-processing and morphology stages) fundamentally rely on a sliding window, just with different kernel operations. This is implemented most efficiently with shift registers that hold the k×k values currently in the kernel, while FIFO line buffers store the remaining k−1 rows of pixels, where k is the height of the kernel. Due to both LUT and BRAM constraints, all operations were limited to a 3x3 kernel.

<TODO: image zoomed into the sliding window module on the datapath diagram>
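A simplified sketch of the 3x3 window is shown below. The line buffers are written as plain arrays for readability (the actual design uses the FIFO/BRAM buffers described above), border padding is omitted, and all names are illustrative.

```verilog
// Sketch of a 3x3 sliding window: two line buffers (each one image row
// long) delay the incoming stream so that three vertically-aligned
// pixels are available at once; shift registers per row then form the
// full 3x3 neighbourhood. Border padding logic is not shown.
module window_3x3 #(
    parameter WIDTH = 320,               // image width in pixels
    parameter BITS  = 12                 // bits per pixel
) (
    input  wire            clk,
    input  wire            valid_in,     // a new pixel arrives this cycle
    input  wire [BITS-1:0] pixel_in,
    output reg  [BITS-1:0] w00, w01, w02, // top row of window
    output reg  [BITS-1:0] w10, w11, w12, // middle row
    output reg  [BITS-1:0] w20, w21, w22  // bottom row (newest)
);
    // Line buffers as simple RAM-based delay lines.
    reg [BITS-1:0] line1 [0:WIDTH-1];    // pixels from one row above
    reg [BITS-1:0] line2 [0:WIDTH-1];    // pixels from two rows above
    reg [$clog2(WIDTH)-1:0] wr_ptr = 0;

    wire [BITS-1:0] row1_out = line1[wr_ptr];
    wire [BITS-1:0] row2_out = line2[wr_ptr];

    always @(posedge clk) begin
        if (valid_in) begin
            // shift each window row left by one pixel
            {w20, w21, w22} <= {w21, w22, pixel_in};
            {w10, w11, w12} <= {w11, w12, row1_out};
            {w00, w01, w02} <= {w01, w02, row2_out};

            // push the current column into the delay lines
            line2[wr_ptr] <= row1_out;   // 1-row-old pixel becomes 2 rows old
            line1[wr_ptr] <= pixel_in;
            wr_ptr        <= (wr_ptr == WIDTH-1) ? 0 : wr_ptr + 1;
        end
    end
endmodule
```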

However, there are two issues to address for sliding windows, namely border handling and latency. At the borders of the image, there are not enough pixels to fill the kernel. To handle this, we can either ignore the border pixels, which shrinks the output image size by one pixel on each side, or use padding methods to fill in the missing values.

Padding was chosen to keep the image size consistent throughout the pipeline. Replication was used for pre-processing operations (i.e. RGB images), while min-/max-padding was used for morphology operations (i.e. binary bitmasks).

<TODO: diagram of border handling methods>

The other issue is latency from stacking multiple sliding window operations. At the start of a frame, the operation needs to wait for the first row of the image to fill the top 2 rows of the kernel and the FIFO buffers (using either replicate or min/max padding for the top row), and then for the first 3 pixels of the second row to fill the last row of the kernel. Only then is the kernel fully filled and the first kernel operation can be performed. The same issue applies to each subsequent sliding window stage, so every new stage introduces a latency of roughly one row plus one pixel.

<TODO: diagram of kernels lagging behind each other>

This becomes an issue at the end of the frame, where each convolution stage is still processing the last few rows of the image while the next frame has already started streaming in. The proper way to fix this would be to implement an additional one-line buffer for each stage that reads the next frame’s pixels until all the previous frame’s pixels have been processed with the existing sliding window.

However, this would have increased BRAM usage (which eventually reached 100% utilisation). Hence, a simpler solution was used instead: simply ignore the last few rows of the image that were not processed. This was acceptable since the GUI blocked the bottom part of the frame anyway, and a few rows of pixels made little difference to the overall user experience.

<TODO: diagram of end of frame handling>

Kernel Operations

With the sliding window in place, implementing the various kernel operations was rather straightforward. The main restriction was that all operations had to be completed within a single clock cycle to ensure no further latency was introduced.

For RGB images, each individual colour channel was processed separately in parallel, then recombined back into an RGB pixel. This is valid for linear filters like Gaussian blur. For non-linear filters like the median filter, the proper way would be to pick the pixel in the kernel with the smallest sum of squared differences to every other pixel, computed across all 3 colour channels. Alternatively, the pixels could be converted to another colour space like HSV, where performing median filtering on each separate channel makes more sense.

However, both methods added computationally expensive operations that were deemed infeasible to run in a single clock cycle. Hence, the median filter was applied to each colour channel independently instead. The resulting colour inaccuracies were deemed acceptable, since each colour channel is only a 4-bit value to begin with.

Explanation of gaussian blur and median filter hardware goes here.

<TODO: cropped in image of gaussian blur and median filters from datapath diagram>

  • RGB images: convolution on each channel separately, then recombine (see the sketch after this list)

    • Gaussian blur: integer kernel weights, so all multiplications reduce to shifts
    • Median filter: efficient comparison network inspired by that paper
  • Binary images: simple bitwise AND/OR operations

    • Erode: bitwise AND of all pixels in the kernel
    • Dilate: bitwise OR of all pixels in the kernel
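The sketch below illustrates the blur and morphology datapaths (the median filter's comparison network is omitted). The module boundaries and names are illustrative; the Gaussian kernel shown is the common [1 2 1; 2 4 2; 1 2 1] / 16.

```verilog
// Sketch of single-cycle kernel operations on a 3x3 window.
// Gaussian blur uses the integer kernel [1 2 1; 2 4 2; 1 2 1] / 16,
// so every multiply is a shift; shown here for one 4-bit channel.
module gaussian_blur_channel (
    input  wire [3:0] w00, w01, w02,
    input  wire [3:0] w10, w11, w12,
    input  wire [3:0] w20, w21, w22,
    output wire [3:0] blurred
);
    wire [7:0] sum =
          w00 + (w01 << 1) + w02
        + (w10 << 1) + (w11 << 2) + (w12 << 1)
        + w20 + (w21 << 1) + w22;
    assign blurred = sum >> 4;       // divide by the kernel sum of 16
endmodule

// Erode/dilate on a binary window reduce to AND/OR of the nine mask bits.
module morphology_3x3 (
    input  wire [8:0] window,        // nine mask bits of the 3x3 window
    input  wire       mode,          // 0: erode, 1: dilate
    output wire       out
);
    assign out = mode ? |window : &window;
endmodule
```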

Colour Thresholding

  • RGB range thresholding (rough sketch below)
  • maybe mention HSV here if it gets added in
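A minimal sketch of the RGB thresholding, assuming the per-channel min/max values come straight from the GUI settings (module and signal names are illustrative):

```verilog
// Sketch of colour thresholding on an RGB444 pixel: each channel is
// compared against a user-set [min, max] range (driven by the GUI),
// and the mask bit is 1 only if all three channels fall inside.
module colour_threshold (
    input  wire [11:0] pixel,        // {R, G, B}, 4 bits each
    input  wire [3:0]  r_min, r_max,
    input  wire [3:0]  g_min, g_max,
    input  wire [3:0]  b_min, b_max,
    output wire        mask
);
    wire [3:0] r = pixel[11:8];
    wire [3:0] g = pixel[7:4];
    wire [3:0] b = pixel[3:0];

    assign mask = (r >= r_min) && (r <= r_max) &&
                  (g >= g_min) && (g <= g_max) &&
                  (b >= b_min) && (b <= b_max);
endmodule
```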

Memory Management

  • BRAM —> double buffer into single buffer, clock domain crossing (rough sketch of the CDC buffer below)
  • Double boards??
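As a rough sketch of the clock domain crossing, the frame buffer itself can do the crossing: a dual-port BRAM written in the camera's PCLK domain and read in the VGA pixel-clock domain. Names and widths below are illustrative (sized for a 320x240 RGB444 frame).

```verilog
// Sketch of a frame buffer that doubles as the clock-domain crossing:
// one port written on the camera's PCLK, the other read on the VGA
// pixel clock.
module frame_buffer (
    // write port (camera / PCLK domain)
    input  wire        wclk,
    input  wire        we,
    input  wire [16:0] waddr,
    input  wire [11:0] wdata,
    // read port (VGA pixel-clock domain)
    input  wire        rclk,
    input  wire [16:0] raddr,
    output reg  [11:0] rdata
);
    (* ram_style = "block" *)
    reg [11:0] mem [0:320*240-1];

    always @(posedge wclk)
        if (we) mem[waddr] <= wdata;

    always @(posedge rclk)
        rdata <= mem[raddr];         // synchronous read infers BRAM
endmodule
```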

Blob Detection (UFDS)

bruh

GUI Design

  • Screenshots of GUI states
  • GUI state machine

Overall Datapath and Control

  • Block diagram of overall datapath
  • Control signals table

Servo PD Controller

  • PWM signal generation (rough sketch below)
  • PD control loop + tuning
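A minimal sketch of the servo PWM generator, assuming the Basys3's 100 MHz system clock and a standard 50 Hz hobby-servo signal; the pulse width input would come from the PD controller, and all names are illustrative.

```verilog
// Sketch of servo PWM generation: a 50 Hz period with a pulse width of
// roughly 1-2 ms set by the PD controller output. Counts are in 10 ns
// ticks of a 100 MHz clock.
module servo_pwm (
    input  wire        clk,          // 100 MHz
    input  wire [17:0] pulse_ticks,  // desired high time, e.g. 150_000 = 1.5 ms
    output reg         pwm
);
    localparam PERIOD = 2_000_000;   // 20 ms period at 100 MHz

    reg [20:0] counter = 0;

    always @(posedge clk) begin
        counter <= (counter == PERIOD - 1) ? 0 : counter + 1;
        pwm     <= (counter < pulse_ticks);
    end
endmodule
```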

Mechanical Design

  • 3D printed mounts
  • Pan tilt assembly

Video Demo

  • link

Areas For Improvement

  • “the Laters list”

Conclusion

For all the differences that programming with an HDL brings, I surprisingly found

Perhaps drawing parallels to the microcontroller module I took last semester, I unironically enjoyed programming in Assembly over C due to the fine-grained, low-level control it provided. Similarly, many aspects of the system design were simply easier when working at the hardware level, e.g. timers <TODO: continue>

