Benchmarking the NVIDIA V100 GPU and Tensor Cores

Matt Martineau*, Patrick Atkinson, Simon McIntosh-Smith

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)

12 Citations (Scopus)


The V100 GPU is the newest server-grade GPU produced by NVIDIA and introduces a number of new hardware and API features. This paper details the results of benchmarking the V100 GPU and demonstrates that it is a significant generational improvement, increasing memory and cache bandwidth and reducing latency. A major new addition is the Tensor cores, units marketed as deep learning accelerators that enable the computation of a 4 × 4 × 4 half precision matrix-multiply-accumulate operation in a single clock cycle. This paper confirms that the Tensor cores offer considerable performance gains for half precision general matrix multiplication; however, programming them requires fine control of the memory hierarchy that is typically unnecessary for other applications.
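As context for the abstract's point about programming the Tensor cores, the sketch below shows the warp-level WMMA API (`nvcuda::wmma`, introduced in CUDA 9) through which they are exposed. The 16 × 16 × 16 fragment shape and the single-tile kernel here are illustrative choices, not taken from the paper; a real GEMM would tile over larger matrices and stage data through shared memory, which is the "fine control of the memory hierarchy" the abstract refers to.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A * B + C on the Tensor cores.
// Fragments are opaque, warp-distributed registers: every thread in the warp
// must call the load/mma/store intrinsics together.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);        // start accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);      // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // half in, float out
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g. wmma_tile_gemm<<<1, 32>>>(a, b, d);
```

Note that although the hardware operation is a 4 × 4 × 4 multiply-accumulate, the API exposes it only at larger fragment granularities (16 × 16 × 16 here), with the mapping of matrix elements to threads left unspecified.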

Original language: English
Title of host publication: Euro-Par 2018
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Revised Selected Papers
Editors: Gabriele Mencagli, Dora B. Heras
Publisher: Springer Verlag
Number of pages: 12
ISBN (Print): 9783030105488
Publication status: Published - 1 Jan 2019
Event: 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018 - Turin, Italy
Duration: 27 Aug 2018 - 28 Aug 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11339 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018
