
Optimizing LLM inference for FPGAs

Jorge R. De Freitas*, Jose G.F. Coutinho, Ce Guo, Suleyman Demirsoy, Wayne Luk, Zhiqiang Que

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference Contribution (Conference Proceeding)

1 Citation (Scopus)

Abstract

Large Language Models (LLMs) deliver state-of-the-art performance but demand substantial computation and memory, which makes deployment in resource-limited settings challenging. Field-Programmable Gate Arrays (FPGAs) offer parallelism and efficiency, yet most prior FPGA accelerators rely on low-level, platform-specific design flows that hinder portability. This work presents oneLLM, to our knowledge the first FPGA-based LLM inference design built with Intel's oneAPI, which provides a unified high-level programming model across CPUs, GPUs, and FPGAs. Our deeply pipelined, multi-kernel hardware architecture connects specialized kernels through oneAPI pipes for on-chip streaming, reducing host-device communication. Implemented on an Intel Agilex 7 FPGA, the design runs 3 times faster than a CPU implementation and 8.8 times faster than a non-pipelined baseline while meeting resource constraints, demonstrating the potential of portable FPGA development for LLM acceleration. Code is available at https://github.com/custom-computing-ic/llm-oneapi-fpga.
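The streaming structure described in the abstract, in which specialized kernels run concurrently and hand data to each other through on-chip pipes instead of round-tripping through the host, can be illustrated with a small host-side model. This is a conceptual sketch under stated assumptions, not code from the oneLLM repository and not oneAPI code: Python threads stand in for kernels, a bounded `queue.Queue` stands in for a oneAPI pipe (in oneAPI itself, FPGA pipes are declared with the `sycl::ext::intel::pipe` template), and the helpers `run_pipeline` and `_kernel` and the stage functions `scale` and `relu` are hypothetical. It does not model FPGA timing or resource usage.

```python
import queue
import threading
from typing import Any, Callable, List, Sequence

# Sentinel marking end-of-stream; forwarded through every pipe so each stage can exit.
_END = object()


def _kernel(fn: Callable[[Any], Any], inp: queue.Queue, out: queue.Queue) -> None:
    """One pipeline stage (hypothetical): read an item, transform it, stream it on."""
    while True:
        item = inp.get()
        if item is _END:
            out.put(_END)  # propagate shutdown downstream
            return
        out.put(fn(item))  # blocks while the downstream pipe is full (backpressure)


def run_pipeline(inputs: Sequence[Any],
                 stages: Sequence[Callable[[Any], Any]],
                 depth: int = 4) -> List[Any]:
    """Stream `inputs` through `stages`, each running concurrently and linked by
    bounded FIFOs of capacity `depth` (a stand-in for on-chip pipes)."""
    pipes = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_kernel, args=(fn, pipes[i], pipes[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def feed() -> None:
        # Producer: writes every input, then the end marker, into the first pipe.
        for x in inputs:
            pipes[0].put(x)
        pipes[0].put(_END)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:  # consumer: drain the last pipe until the end marker arrives
        item = pipes[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for specialized kernels (not taken from the paper).
def scale(v: List[float]) -> List[float]:
    return [2.0 * x for x in v]  # placeholder for a matrix-vector product


def relu(v: List[float]) -> List[float]:
    return [max(0.0, x) for x in v]  # placeholder for an activation kernel


if __name__ == "__main__":
    print(run_pipeline([[1.0, -1.0], [0.5, -2.0]], [scale, relu]))
    # prints [[2.0, 0.0], [1.0, 0.0]]
```

The bounded `depth` mimics the finite capacity of a hardware FIFO: when a downstream stage falls behind, `put` blocks and the upstream stage stalls, while at other times all stages work on different items at once. Because each stage is the single reader of its input FIFO, output order matches input order.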
Original language: English
Title of host publication: 2025 IEEE 16th International Conference on ASIC (ASICON)
Publisher: IEEE Computer Society
Number of pages: 4
ISBN (Electronic): 9798331539177
ISBN (Print): 9798331539184
Publication status: Published, 19 Jan 2026
Event: 2025 IEEE 16th International Conference on ASIC, ASICON 2025, Kunming, China
Duration: 21 Oct 2025 to 24 Oct 2025

Publication series

Name: Proceedings of International Conference on ASIC
ISSN (Print): 2162-7541
ISSN (Electronic): 2162-755X

Conference

Conference: 2025 IEEE 16th International Conference on ASIC, ASICON 2025
Country/Territory: China
City: Kunming
Period: 21/10/25 to 24/10/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.
