High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

Hiroki Nakahara, Zhiqiang Que, Wayne Luk

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

34 Citations (Scopus)

Abstract

The growing interest in using FPGAs to accelerate convolutional neural network (CNN) workloads is driving the deployment of FPGAs on cloud services such as Amazon AWS and Microsoft Azure. Current cloud-based FPGAs, however, are seriously limited by data transfer bandwidth. In this paper, we compress each transferred image using customized JPEG coding and implement a customized image decoder architecture. We analyze the trade-off between data transfer speed-up and the drop in recognition accuracy. Based on this compression scheme, we design a high-throughput CNN inference engine. Almost all existing FPGA-based CNN accelerators are based on the same idea as their GPU counterparts, where operations from different network layers are mapped onto the same hardware units working in a multiplexed way. Our fully pipelined architecture maps all the network layers on-chip and assigns the computation of each layer to its own unit, which is optimized independently. We apply two CNN optimization techniques to a residual network: one is a channel shift and point-wise approximation, and the other is binary weight quantization. We implement the proposed CNN inference accelerator on the Xilinx Virtex UltraScale+ XCVU9P FPGA. Our system achieves a peak performance of 2.41 TOPS. Our compressed JPEG image transfer consumes only 4% of the system resources, costs 0.3 points of accuracy, and achieves 81,120 FPS, which is 65.27 times faster than conventional straightforward RGB data transfer. Thus, our proposed data transfer architecture is sufficient to increase system performance. The system throughput is 3.84 to 34.41 times higher than that of existing FPGA implementations. Compared with the Xeon CPU, it achieves 138.38 times higher throughput and dissipates 1.2 times less power, so its efficiency is 177.12 times better. Compared with the Tesla V100 GPU, it achieves 9.48 times higher throughput, dissipates 3.9 times less power, and its efficiency is 37.52 times better. Thus, our parallel architecture on an FPGA provides superior throughput for the acceleration of a CNN.
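The ratios quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not from the paper: it assumes that "efficiency" means throughput per unit of power (the abstract does not define it), and the function names are hypothetical. It derives the power ratio implied by each throughput/efficiency pair and the RGB baseline frame rate implied by the 65.27 times speed-up.

```python
# Ratios reported in the abstract (FPGA relative to each baseline).
REPORTED = {
    "Xeon CPU": {"throughput": 138.38, "power": 1.2, "efficiency": 177.12},
    "Tesla V100 GPU": {"throughput": 9.48, "power": 3.9, "efficiency": 37.52},
}


def implied_power_ratio(throughput_ratio: float, efficiency_ratio: float) -> float:
    """Power ratio implied by efficiency_ratio = throughput_ratio * power_ratio.

    Assumption: efficiency is throughput per watt, so a system that is
    T times faster and draws P times less power is T * P times more efficient.
    """
    return efficiency_ratio / throughput_ratio


def baseline_fps(compressed_fps: float, speedup: float) -> float:
    """Frame rate of the uncompressed baseline implied by a reported speed-up."""
    return compressed_fps / speedup


for name, r in REPORTED.items():
    implied = implied_power_ratio(r["throughput"], r["efficiency"])
    print(f"{name}: reported power ratio {r['power']}, implied {implied:.3f}")

# 81,120 FPS and 65.27x are the figures quoted for the JPEG transfer path.
print(f"Implied RGB transfer rate: {baseline_fps(81_120, 65.27):.1f} FPS")
```

Under that assumption, the implied power ratios come out slightly above the quoted 1.2 and 3.9, which would be consistent with the abstract's power figures having been rounded or truncated.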
Original language: English
Title of host publication: Proceedings - 28th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-9
Number of pages: 9
ISBN (Electronic): 9781728158037
Publication status: Published - 11 Jun 2020
Event: 28th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020 - Fayetteville, United States
Duration: 3 May 2020 to 6 May 2020

Publication series

Name: Proceedings - IEEE International Symposium on Field-Programmable Custom Computing Machines
Publisher: IEEE
ISSN (Print): 2576-2613
ISSN (Electronic): 2576-2621

Conference

Conference: 28th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020
Country/Territory: United States
City: Fayetteville
Period: 3/05/20 to 6/05/20

Bibliographical note

Publisher Copyright:
© 2020 IEEE.
