Phytoplankton identification and abundance data are now routinely fed into plankton distribution databases worldwide. This study is a first attempt to compile the largest possible body of data on diatom distribution in the world ocean, drawing on different databases as well as on individual published or unpublished datasets. The data originate from both time-series and spatial studies. This effort is supported by the Marine Ecosystem Model Inter-Comparison Project (MAREMIP), which aims to build consistent datasets for the main plankton functional types (PFTs) in order to help validate biogeochemical ocean models by using carbon (C) biomass derived from abundance data. In this study we collected over 293 000 individual geo-referenced data points with diatom abundances from bottle and net sampling. Sampling site distribution was not homogeneous, with 58% of data in the Atlantic, 20% in the Arctic, 12% in the Pacific, 8% in the Indian and 1% in the Southern Ocean. A total of 136 different genera and 607 different species were identified after spell checking and name correction. Only a small fraction of these data also included biovolumes, and an even smaller fraction had been converted to C biomass. Because the method used for biovolume calculation is usually not indicated in the datasets and is virtually impossible to reconstruct, we documented the minimum and maximum cell dimensions for every distinct species and converted all the available abundance data into biovolumes and C biomass using a single standardized method. A statistical correction was also applied to the database to exclude potential outliers and suspicious data points. The final database contains 90 648 data points with converted C biomass. Diatom C biomass calculated from cell sizes spans over eight orders of magnitude.
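The conversion chain described above (cell dimensions to biovolume, biovolume to C per cell, C per cell times abundance to C biomass) can be illustrated with a minimal sketch. It rests on two assumptions that are not taken from this abstract: the cell is treated as a simple cylinder, and the volume-to-carbon step uses the diatom regression of Menden-Deuer and Lessard (2000), pg C cell−1 = 0.288 × V^0.811, as a stand-in. The standardized method actually used for the database is not specified here and may differ. All function names are hypothetical.

```python
import math


def cylinder_volume_um3(diameter_um, height_um):
    """Biovolume (um^3) of a cell approximated as a cylinder (assumed shape)."""
    return math.pi * (diameter_um / 2.0) ** 2 * height_um


def diatom_carbon_pg(volume_um3):
    """Carbon per cell (pg C) from biovolume.

    Assumption: Menden-Deuer & Lessard (2000) diatom regression,
    pg C cell^-1 = 0.288 * V^0.811. This is an illustrative stand-in,
    not necessarily the relationship used to build the database.
    """
    return 0.288 * volume_um3 ** 0.811


def biomass_ug_c_per_l(abundance_cells_per_l, diameter_um, height_um):
    """C biomass (ug C l^-1) for one abundance record."""
    volume = cylinder_volume_um3(diameter_um, height_um)
    carbon_pg = diatom_carbon_pg(volume)
    return abundance_cells_per_l * carbon_pg * 1e-6  # pg -> ug


# Hypothetical record: 1e5 cells l^-1, with minimum and maximum
# cell dimensions bracketing the biomass estimate.
low = biomass_ug_c_per_l(1e5, diameter_um=10.0, height_um=20.0)
high = biomass_ug_c_per_l(1e5, diameter_um=30.0, height_um=60.0)
print(f"biomass range: {low:.2f} to {high:.2f} ug C l^-1")
```

Evaluating the record at both the minimum and the maximum documented dimensions gives a bracket for the biomass estimate rather than a single value, which shows how strongly cell size controls the result.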
The mean diatom biomass for individual locations, dates and depths is 141.19 μg C l−1, while the median value is 11.16 μg C l−1. Regarding biomass distribution, 19% of data are in the range 0–1 μg C l−1, 29% in the range 1–10 μg C l−1, 31% in the range 10–100 μg C l−1, 18% in the range 100–1000 μg C l−1, and only 3% exceed 1000 μg C l−1. Interestingly, fewer than 50 species contributed more than 90% of global biomass, and centric species were dominant among them. Thus, focusing effort on cell size measurements, process studies and C quota calculations for these species should considerably improve biomass estimates in the coming years. A first-order estimate of the diatom biomass for the global ocean ranges from 444 to 582 Tg C, which converts to 3 to 4 Tmol Si and to an average Si biomass turnover rate of 0.15 to 0.19 d−1. Link to the dataset: doi:10.1594/PANGAEA.777384.
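As a rough check on the global figures above, the C estimate can be converted from mass to moles. The sketch below assumes only the molar mass of carbon (12.011 g mol−1). The Si:C ratio it prints is back-calculated from the rounded values quoted in this abstract (444 to 582 Tg C, 3 to 4 Tmol Si) and should not be read as the ratio used in the study, which is not stated here.

```python
C_MOLAR_MASS_G_PER_MOL = 12.011


def tg_c_to_tmol_c(tg_c):
    """Convert Tg C to Tmol C.

    1 Tg = 1e12 g and 1 Tmol = 1e12 mol, so the tera prefixes cancel
    and only the molar mass remains.
    """
    return tg_c / C_MOLAR_MASS_G_PER_MOL


low_tmol_c = tg_c_to_tmol_c(444.0)
high_tmol_c = tg_c_to_tmol_c(582.0)

# Si:C mole ratios implied by the rounded abstract values (not a stated input).
implied_ratio_low = 3.0 / low_tmol_c
implied_ratio_high = 4.0 / high_tmol_c

print(f"C stock: {low_tmol_c:.1f} to {high_tmol_c:.1f} Tmol C")
print(f"implied Si:C: {implied_ratio_low:.3f} to {implied_ratio_high:.3f} mol/mol")
```

The C stock works out to roughly 37 to 48 Tmol C, so the quoted 3 to 4 Tmol Si corresponds to an implied Si:C mole ratio near 0.08, subject to the rounding of the Si figures.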