Face-to-face communication is multimodal at its core: it consists of a combination of vocal and visual signalling. However, current evidence suggests that, in the absence of an established communication system, visual signalling, especially in the form of visible gesture, is a more powerful form of communication than vocalization and therefore likely to have played a primary role in the emergence of human language. This argument is based on experimental evidence of how vocal and visual modalities (i.e. gesture) are employed to communicate about familiar concepts when participants cannot use their existing languages. To investigate this further, we introduce an experiment where pairs of participants performed a referential communication task in which they described unfamiliar stimuli in order to reduce reliance on conventional signals. Visual and auditory stimuli were described in three conditions: using visible gestures only, using non-linguistic vocalizations only and given the option to use both (multimodal communication). The results suggest that even in the absence of conventional signals, gesture is a more powerful mode of communication compared with vocalization, but that there are also advantages to multimodality compared to using gesture alone. Participants with an option to produce multimodal signals had comparable accuracy to those using only gesture, but gained an efficiency advantage. The analysis of the interactions between participants showed that interactants developed novel communication systems for unfamiliar stimuli by deploying different modalities flexibly to suit their needs and by taking advantage of multimodality when required.