IoT Device Authentication Using Self-Organizing Feature Map Data Sets

Sensors and actuators connected via the Internet of Things (IoT) have now become embedded within our critical infrastructure, offering improved observation and control as well as reduced costs. Given that software defined radios (SDRs) can be readily programmed to imitate IoT devices, there is now a greater risk that assets can be spoofed or compromised. This creates an urgent need for IoT device authentication that avoids upgrading the many thousands of individual devices. However, the lack of publicly available data sets severely hampers the development of effective authentication algorithms and mechanisms. In this regard, this article introduces a technique for facilitating IoT device authentication when the radio frequency (RF) characteristics are highly correlated using self-organizing feature maps (SOFMs), thus aiming to promote state-of-the-art research in this field. The associated techniques demonstrated in this article exploit a novel data set of RF fingerprints and are, in particular, suitable for low-cost and long-range wireless application scenarios of the IoT, for example, LoRa. Here, a well-trained convolutional neural network (CNN) based on the SOFM data set can rapidly profile apparently correlated RF fingerprint patterns and thereby ascertain the nature of a specific device (friend or foe). In this way, a reliable and efficient IoT device authentication strategy for LoRa devices can be established. The experimental results presented in this article substantiate the effectiveness and efficiency of the SOFM-based approach, and the data sets are introduced in detail and shared with the research community.


I. INTRODUCTION
Multi-layer encryption and authentication are embedded within our cellular standards, making connections in general secure and robust, whereas non-cellular connectivity solutions do not offer the same degree of protection. However, the latter can offer low-price and long-range connectivity for low-rate applications as well as a long battery life, which has made non-cellular solutions popular for sensors and actuators as part of the Internet of Things (IoT) [1]. This growing use of non-cellular IoT is becoming commonplace and imperative in society, including applications within critical infrastructure. Cyber crime against critical infrastructure manifests in several ways [2]: every day, thousands of cyber attacks that exploit vulnerabilities of communication networks are detected and prevented at the network layer and above. Most cyber attacks are transparent to the radio frequency (RF) interfaces in the network, albeit in various forms and with various attacking patterns. Attacks of this nature are launched remotely by radio signals through wireless channels, e.g., side-channel attacks; to this end, increasing effort and resources are being invested in network security to thwart such attacks, e.g., through the more widespread adoption of sophisticated encryption. However, much less attention has been paid to attacks that directly target the RF interfaces per se.
By virtue of the propagation of RF signals, geographically speaking, cyber attacks targeting wireless systems are likely to be launched on a localized basis; however, when the aim is to degrade network management functions, cyber attacks could also be set off remotely within a wireless network connecting victims, e.g., an attempt to accomplish a widespread denial-of-service (DoS) attack against critical infrastructure.
Accordingly, a variety of novel air-interface technologies and protocols have been proposed and investigated over the last decade, addressing the heterogeneous transmission needs of different long-range, low-rate IoT applications, e.g., LoRa [3], and with them has emerged the need for RF cyber-physical security.
All radio systems are, to some extent, vulnerable at the RF interface level because of the open nature of these interfaces. Conventionally, the high cost and complexity of employing an RF transceiver to launch such wireless attacks posed a barrier to this becoming a widespread problem, so it was neither widely investigated nor did it drive the development of relevant wireless standards. Nowadays, however, low-cost programmable RF and baseband modules, a.k.a. software defined radios (SDRs), are commonly available, for example, the HackRF and NESDR device families, and have lowered the barrier to launching a set of malicious operations targeting RF interfaces. What is worse, as it is a legal requirement to publicize the technical details of most commercial and civil communication standards, these details are also known to adversaries and can be exploited to adapt attacking strategies to the RF interfaces. Therefore, research on security mechanisms for RF interfaces and on effective methods for identifying RF cyber attacks and vulnerabilities at RF interfaces is urgently needed.
Accordingly, the U.K. Government has taken a positive stance toward the initiative of Secure by Design and jointly funded the Prosperity Partnership in Secure Wireless Agile Networks (SWAN) through UKRI/EPSRC in 2019 [4]. The research objectives and remits of the five-year SWAN project are demonstrated in Fig. 1. Particular attention of the SWAN project has been directed to a range of resource-constrained and non-cellular IoT communication technologies that are popular with increasing market share in recent years. These connectivity solutions are not subjected to the same degree of rigor as the cellular standards in terms of cyber resilience.
One of the key research achievements of the SWAN project so far is our experimental testbed with a flexible configuration, thus facilitating the emulation and testing of cyber attacks against IoT devices. Further, we proposed the use of self-organizing feature maps (SOFMs) for the RF fingerprinting of apparently correlated IoT devices, thus aiding friend or foe authentication. Based on experiments conducted along with the SWAN project, we have also collected data sets that can be used to promote future research activities related to security mechanisms and attack prevention for resource-constrained IoT networks.
In this article, we introduce one of our data sets and report our latest findings by employing SOFMs for security enhancement. In particular, we demonstrate the technical details of the constructed testbed and how the SOFM data can be used to enable accurate IoT device authentication. Most setups and know-how from our study can be easily extended to a wide range of wireless applications for dealing with similar detection and authentication problems.

II. RF FINGERPRINTING
Machine learning (ML) enabled RF fingerprinting has emerged as one of the most effective approaches for device authentication and RF cyber-physical security enhancement [5], [6]. Specifically, raw in-phase/quadrature (I/Q) samples derived after the down-conversion/heterodyning of an RF waveform can be applied to a convolutional neural network (CNN) for the purposes of classifying modulation schemes and radio identification. Table I summarizes the state-of-the-art ML enabled RF fingerprinting methods. However, RF fingerprinting for IoT devices has received only scant attention and is still in its infancy. Meanwhile, identification based solely on traditional device-dependent radio-metrics is unlikely to yield a robust means of authentication for securing an open RF attack surface woefully exposed to spoofing attacks from ubiquitous and cheap imitators, such as SDRs. It should be noted that a radio-metric reflects a unique feature of an RF waveform, such as amplitude, a.k.a. received signal strength indicator (RSSI), frequency, phase, and any other features derived from these basics. A device creates a unique set of radio-metrics associated with its emitted signal due to RF impairments, such as I/Q imbalance, phase offset, phase noise, total harmonic distortion, and power amplifier (PA) non-linearity.
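To make the notion of a radio-metric more concrete, the following minimal Python sketch applies two of the impairments listed above, I/Q gain/phase imbalance and a constant phase offset, to an ideal complex baseband tone. The impairment model, the function name `apply_iq_impairments`, and all parameter values are illustrative assumptions and are not taken from the testbed.

```python
import numpy as np

def apply_iq_impairments(x, gain_imbalance_db=0.5, phase_imbalance_deg=2.0,
                         phase_offset_deg=5.0):
    """Apply an assumed, simplified transmitter impairment model to baseband x.

    The Q branch is scaled by a gain mismatch and skewed by a phase mismatch
    relative to the I branch; a constant phase offset then rotates the result.
    """
    g = 10.0 ** (gain_imbalance_db / 20.0)      # linear gain mismatch
    phi = np.deg2rad(phase_imbalance_deg)       # quadrature phase skew
    theta = np.deg2rad(phase_offset_deg)        # constant phase offset
    i_branch = x.real
    q_branch = g * (x.imag * np.cos(phi) + x.real * np.sin(phi))
    return (i_branch + 1j * q_branch) * np.exp(1j * theta)

# Ideal constant-envelope tone (illustrative test signal).
n = np.arange(1000)
ideal = np.exp(2j * np.pi * 0.01 * n)
impaired = apply_iq_impairments(ideal)

# The imbalance leaves a small, device-specific ripple on the envelope.
envelope = np.abs(impaired)
envelope_ripple = envelope.max() - envelope.min()
print(f"envelope ripple: {envelope_ripple:.4f}")
```

With all three impairments set to zero the function returns its input unchanged; any non-zero imbalance turns the ideal circle in the I/Q plane into a slightly tilted ellipse, which is the kind of subtle signature a fingerprinting classifier has to pick up.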
Erroneous identification leads to compromised authentication mechanisms for detecting cyber intrusion vectored through RF interfaces. For instance, LoRa modems generate modulated chirps with a long symbol duration, which increases the vulnerability to spoofing by rogue nodes. These chirps have constant envelopes of power, resulting in highly correlated I/Q samples. This gives an advantage to those intent on penetrating LoRa networks, some of which could be within critical infrastructure. In this regard, the focal challenge now is how to develop a generic and easy-to-implement technique that can facilitate the orthogonalization of highly correlated samples of raw LoRa I/Q data sets arising from electronic components produced through the same or a very similar silicon fabrication process.
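The constant-envelope property mentioned above can be checked with a short Python sketch. It uses a common simplified discrete-time formulation of a LoRa-style chirp, sampled at one sample per chip; the function names and the chosen spreading factors and bandwidth are assumptions for illustration and are not parameters reported for the testbed.

```python
import numpy as np

def lora_symbol_duration(sf, bandwidth_hz):
    """Symbol duration in seconds: 2**SF chips sent at the bandwidth rate."""
    return (2 ** sf) / bandwidth_hz

def lora_chirp(sf=7, symbol=0):
    """Simplified baseband chirp for one symbol, one sample per chip.

    The instantaneous frequency ramps linearly across the band, starting at an
    offset set by `symbol`; wrap-around happens implicitly through aliasing.
    """
    n_chips = 2 ** sf
    n = np.arange(n_chips)
    phase = 2.0 * np.pi * (n * n / (2.0 * n_chips) + (symbol / n_chips - 0.5) * n)
    return np.exp(1j * phase)

chirp = lora_chirp(sf=7, symbol=10)
print("samples per symbol:", chirp.size)
print("envelope min/max:", np.abs(chirp).min(), np.abs(chirp).max())
print("SF12 @ 125 kHz symbol duration (s):", lora_symbol_duration(12, 125e3))
```

Every sample has unit magnitude regardless of the transmitted symbol, so the amplitude alone carries no device-specific information; only the small hardware impairments distinguish one transmitter from another.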
This article has been accepted for inclusion in a future issue of this magazine.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

III. TESTBED CONFIGURATION AND DATA SET COLLECTION
Above all, the effectiveness of ML enabled RF fingerprinting based authenticators is intrinsically upper bounded by the quality of data sets, while their worst-case performance is determined by the scope, scale, and availability of those data sets. Although researchers and practitioners have devoted considerable effort to improving the quality of ML algorithms from the perspectives of neural architectures and automated feature selection, efforts toward improving data set quality or enhancing accessibility are quite rare.
Therefore, in an effort to create a reference data set standard for achieving self-reliance in the physical layer for research on IoT device authentication, in particular LoRa, we established an in-house experimental testbed as a milestone of the SWAN project. The testbed is focused on the physical layer of LoRa systems. Our implementation requires five SODAQ Explorer modules as well as an instrumentation-grade Rohde & Schwarz SMATE200A arbitrary waveform generator (ARB), as shown in Fig. 2. With our methodology, any of the devices could be a potential rogue, the characteristics of which do not need to be [...]
Based on the above hardware setup and following the instructive video tutorial of the MATLAB Instrument Control Toolbox [13], a host PC running MATLAB R2020a is used to perform the following functions:
• General purpose interface bus (GPIB) control of the testbed.
• Event transmission at all of the six devices and the corresponding initialization of the receiver buffer for every transmit event.
• Generation of the baseband LoRa waveform, which is upconverted to the RF chain by the ARB.
• Periodic querying of the vector signal analyzer (VSA) for I/Q sampling and capturing.
One of the crucial requirements before applying data sets to RF fingerprinting assisted IoT device authentication is to understand the attributes of the data set at hand, especially the correlation among I/Q samples. Failure to do so could result in inaccurate analytics [5], [6]. For example, the high correlation between the raw data sets of LoRa I/Q vectors is one of the most compelling factors facilitating effective cyber intrusion by rogue devices that spoof LoRa waveforms. Such correlation occurs because of a constant envelope of power and is illustrated through a plot of 1,000 samples of an I/Q waveform in Fig. 3. These samples were collected from one of the six LoRa devices and captured by the VSA.
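As an illustration of how strongly two captures can be correlated, the sketch below builds two synthetic "devices" that transmit the same constant-envelope waveform with slightly different I/Q imbalance and additive noise, then computes their normalized correlation. The signal model, the impairment values, and the 30 dB signal-to-noise ratio are all assumptions; the snippet does not use the published data set.

```python
import numpy as np

def normalized_correlation(a, b):
    """Magnitude of the normalized inner product of two complex vectors."""
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
n = np.arange(1000)
ideal = np.exp(1j * np.pi * n * n / 1000.0)   # constant-envelope chirp-like reference

def device_capture(gain_db, phase_deg, snr_db=30.0):
    """Synthetic capture: reference waveform + assumed I/Q imbalance + noise."""
    g = 10.0 ** (gain_db / 20.0)
    phi = np.deg2rad(phase_deg)
    y = ideal.real + 1j * g * (ideal.imag * np.cos(phi) + ideal.real * np.sin(phi))
    noise_std = 10.0 ** (-snr_db / 20.0) / np.sqrt(2.0)
    noise = noise_std * (rng.standard_normal(n.size) + 1j * rng.standard_normal(n.size))
    return y + noise

dev_a = device_capture(gain_db=0.2, phase_deg=1.0)   # hypothetical device A
dev_b = device_capture(gain_db=0.4, phase_deg=2.0)   # hypothetical device B
rho = normalized_correlation(dev_a, dev_b)
print(f"normalized correlation between devices: {rho:.4f}")
```

Under these assumptions the correlation stays very close to one, which mirrors the difficulty described above: the raw waveforms of distinct devices are almost indistinguishable sample by sample.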
As a direct consequence of high correlation, the raw data set of LoRa I/Q samples, provided as data set 1 [6], is orthogonally inseparable. With an orthogonally inseparable data set, the non-convex training problem in ML classifiers becomes NP-hard, i.e., non-deterministic polynomial-time hard. This is because no algorithm exists in the problem dimensions that can derive optimized neural weights, even over several epochs, in order to construct a polynomial which produces a response close enough to match the original label vector of the devices.
As a result, an NP-hard training problem might delay the back-propagation process used to update the weights of a CNN through gradient descent and might even lead to erroneous estimates of the labels in the data set. This matters because an authenticator maintains a table of the device labels and their RF fingerprint features in a database; during the classification stage, on receiving a new message from an unknown device that needs to be classified, the authenticator assigns a device label based on the previously stored data. For detecting cyber intrusion through device authentication, the raw data set of LoRa I/Q samples is therefore subject to stalling or mis-identification, which in turn leads to inaccurate device authentication that is hard to circumvent without improving the quality of the data per se. Therefore, it becomes imperative to pre-process the raw data set of LoRa I/Q samples for achieving fool-proof device authentication.
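The table-based authenticator described above can be pictured with a deliberately simple Python sketch. It swaps the CNN classifier for a nearest-neighbour lookup with a distance threshold, purely to show the enrol-then-classify flow; the class name, the threshold rule, and the two-element feature vectors are hypothetical and not part of the article's method.

```python
import numpy as np

class FingerprintAuthenticator:
    """Toy authenticator: stores one feature vector per enrolled device label."""

    def __init__(self, threshold):
        self.threshold = threshold   # maximum distance accepted as a "friend"
        self.table = {}              # device label -> stored fingerprint features

    def enroll(self, label, features):
        self.table[label] = np.asarray(features, dtype=float)

    def authenticate(self, features):
        """Return (label, distance); label is None when no entry is close enough."""
        if not self.table:
            return None, float("inf")
        features = np.asarray(features, dtype=float)
        label, dist = min(
            ((lbl, float(np.linalg.norm(features - ref))) for lbl, ref in self.table.items()),
            key=lambda pair: pair[1],
        )
        return (label, dist) if dist <= self.threshold else (None, dist)

auth = FingerprintAuthenticator(threshold=0.5)
auth.enroll("device-1", [0.0, 0.0])     # hypothetical fingerprint features
auth.enroll("device-2", [3.0, 4.0])
friend, friend_dist = auth.authenticate([0.1, 0.0])
foe, foe_dist = auth.authenticate([10.0, 10.0])
print(friend, foe)
```

When the stored features of different devices nearly coincide, as with highly correlated raw I/Q data, any such lookup becomes unreliable, which is the motivation for the pre-processing step.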

IV. LOCALIZATION AND AUTHENTICATION USING PRE-PROCESSED DATA SETS
Improving the quality of raw data sets by employing pre-processing operations addresses the data quality gap from the following perspectives:
• Enhancing ML model performance and convergence across designed metrics, e.g., the mean squared error (MSE) and accuracy (ACC) of CNNs [6]. The CNN is architecturally similar to AlexNet, commonly employed in device authentication [14], which utilizes the adaptive stochastic gradient descent (ASGD) solver to derive the set of optimal neural weights through a back-propagation mechanism that maximizes or minimizes a chosen metric.
• Accurately profiling the shift-invariant RF fingerprint patterns in the pre-processed data sets that are uniquely class-attributed [6], or localized, to particular IoT devices. This can reinforce the RF cyber intrusion detection capability of the adopted ML model.
• [...] authentication requests within the order of the transmission frame duration, which is typically in the range of tens of milliseconds for low-power IoT devices, e.g., LoRa [6].
Taking the characteristics of resource-constrained IoT devices and data samples into account, we implemented a tailored unsupervised batch-SOFM competitive learning algorithm to pre-process the raw data set of LoRa I/Q samples into SOFMs. We utilized an artificial neural network (ANN) initialized with 100 neurons, thereby attaining the objective of enhanced ML model performance, resulting in accurate localization. Dimensionality reduction in the order of $O(10^4)$, computed as the ratio of the number of raw LoRa I/Q samples per LoRa device (i.e., $2 \times 10^6$) to the number of neurons in the ANN (i.e., 100), makes device authentication fast enough to be measured in the order of the LoRa frame duration.
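As a quick check of the reduction factor quoted above, the ratio follows directly from the two numbers given in the text:

```latex
\[
\frac{\text{raw I/Q samples per device}}{\text{neurons in the ANN}}
  = \frac{2 \times 10^{6}}{100}
  = 2 \times 10^{4},
\]
```

that is, a reduction of order $10^4$ per device.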
The holistic procedure of the SOFM-based algorithm is illustrated in Fig. 4. At the initial epoch, every I/Q sample in the raw data set matrix of LoRa I/Q samples is first indexed serially in the input layer. Then, a hidden layer of weights is initialized with random values and batched into a group of six matrices, corresponding to the six LoRa devices; here, each batch weight matrix, associated with the LoRa I/Q samples of the corresponding LoRa device, connects to an ANN matrix in the output layer that is initialized with 100 neurons, as also illustrated in Fig. 4. The restricted size of the ANN compared with the number of I/Q samples for each LoRa device, i.e., $2 \times 10^6$, ensures dimensionality reduction for the pre-processed SOFM data set. Using each batch weight matrix, which links the I/Q samples of a LoRa device to each of the 100 neurons in the ANN, six batches of Euclidean norms are computed, from which the batch of six minimum Euclidean norms is selected. The neurons in the ANN linked to the element within each weight matrix batch that minimizes the Euclidean norm (using the I/Q samples of the corresponding LoRa device) are known as the winning neurons, colloquially referred to as the best matching units (BMUs). In this way, a batch of six BMUs can be extracted for updating the six batch matrices by a preset learning rate [6]. The first epoch therefore yields six batch update matrices, which are then applied to the original ANN to generate six corresponding offspring ANNs, as illustrated in Fig. 4.
Since the magnitude of the six batch update matrices depends upon the learning rate, the extent of neuron clustering in the batch of six offspring ANNs, as compared to the original ANN, is also determined by this hyperparameter. This procedure is repeated from the second epoch onward until the stipulated convergence and/or termination conditions have been satisfied. Note that the six batch update matrices obtained at the end of any epoch are applied to the six offspring ANNs generated from the previous epoch, instead of the original ANN prior to training. Lastly, the batch of six offspring ANN matrices generated at the last epoch characterizes the pre-processed SOFM data set of the original LoRa I/Q samples, with each SOFM in the data set being associated with a specific LoRa device.
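The epoch loop described above (find the BMUs, form an update, apply it scaled by a learning rate, repeat) can be sketched in Python with a generic textbook batch self-organizing map. This is not the authors' tailored algorithm: the Gaussian neighbourhood, its shrinking schedule, the 10 x 10 grid layout of the 100 neurons, the synthetic per-device data, and every parameter value are assumptions made for illustration.

```python
import numpy as np

def train_batch_som(samples, grid=(10, 10), epochs=20, learning_rate=0.5,
                    sigma=2.0, seed=0):
    """Train one map on `samples` of shape [n_samples, n_features]."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_neurons = rows * cols
    # Grid position of each neuron, used by the neighbourhood function.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    ).reshape(n_neurons, 2)
    grid_d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    weights = rng.standard_normal((n_neurons, samples.shape[1]))  # random init
    for epoch in range(epochs):
        # Neighbourhood width shrinks as training proceeds (assumed schedule).
        width = sigma * (1.0 - epoch / epochs) + 0.5
        neighbourhood = np.exp(-grid_d2 / (2.0 * width * width))
        # Best matching unit: neuron with minimum Euclidean distance per sample.
        d2 = ((samples[:, None, :] - weights[None, :, :]) ** 2).sum(axis=-1)
        bmu = d2.argmin(axis=1)
        # Batch target: neighbourhood-weighted mean of the samples per neuron.
        influence = neighbourhood[bmu]                  # [n_samples, n_neurons]
        target = influence.T @ samples / influence.sum(axis=0)[:, None]
        # Move the previous epoch's weights toward the target by the learning rate.
        weights += learning_rate * (target - weights)
    return weights.reshape(rows, cols, -1)

# Six synthetic "devices": same constant-envelope signal, slightly different
# (hypothetical) gain imbalance on the Q branch.
rng = np.random.default_rng(1)
phase = rng.uniform(0.0, 2.0 * np.pi, 2000)
device_maps = {}
for dev in range(6):
    g = 1.0 + 0.01 * dev
    iq = np.column_stack([np.cos(phase), g * np.sin(phase)])
    device_maps[dev] = train_batch_som(iq, seed=dev)

print({dev: som.shape for dev, som in device_maps.items()})
```

Each device ends up with its own 10 x 10 map of two-dimensional weight vectors, so 2,000 input points are summarized by 100 neurons. For the full 2 x 10^6 samples per device, the pairwise distance computation in this sketch would need to be processed in chunks to keep memory use reasonable.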
Through the aforementioned procedure, the SOFMs illustrated in Fig. 4 clearly exhibit RF fingerprint patterns, each unique to a particular LoRa device. These patterns, composed of feature clusters, are generated by the dissimilar extent of neuron clustering in the batch of offspring ANNs at the last epoch and are color marked as shown in Fig. 4. As such, these feature clusters are imprints on the LoRa I/Q data set 1 of the I/Q imbalance, a radio-metric quantifying the gain and phase mismatches between the parallel sections of the RF front end.
Such distinct RF fingerprint patterns transform an otherwise NP-hard training problem of MSE minimization in CNN classifiers into an unconstrained minimization problem. As a result, ASGD solvers can be employed to optimize the neural weights, thereby constructing a polynomial that is capable of producing a response to match the label vector of the LoRa devices. In this way, the pre-processed SOFM data set facilitates the class-attribution of each of the RF fingerprint patterns to particular LoRa devices with high accuracy. This phenomenon is colloquially referred to as localization. However, there is a trade-off between localization accuracy and training time, dominated by the similarity among the RF fingerprint patterns in the SOFM data set across the six LoRa devices. While more pooling layers in a CNN improve the localization accuracy through a unique assembly of feature clusters, a larger sized ANN allows the network to "see" little context by the construction of dissimilar features into the overall patterns, albeit with a much faster speed of pre-processing.
Our device authentication approach feeds SOFMs to selected CNN architectures that are similar to AlexNet [14], with or without batch normalization (BN), whose convolutional layers are activated by specifically chosen activation functions. Acting upon an engine of SOFM data sets, the proposed authentication approach is able to achieve almost 100% localization accuracy in the RF fingerprint patterns by precisely class-attributing them to particular LoRa devices. It should be noted that, compared to the state-of-the-art ML assisted device authentication strategies in Table I that utilize GPUs, we employ merely a standard PC with an Intel Core i7 CPU clocked at 3.6 GHz. This computing advantage reduces the cost and complexity of conducting experiments on device authentication and greatly facilitates research on cyber-physical security.

V. NOTES ON DATA SETS' FILE STRUCTURES
All data sets discussed in this article can be accessed via IEEE DataPort and the University of Bristol data repository. To ease the use of the data sets introduced in this article for follow-up research activities, we provide a couple of important notes in this section.
First, using our experimental testbed, the original data set 1 of raw LoRa I/Q samples from each of the six LoRa devices was captured by querying the R&S FSQ26 VSA utilizing the MATLAB R2020a Instrument Control Toolbox. Therefore, the file format of data set 1 of raw LoRa I/Q samples is a MATLAB compatible data format, i.e., ".mat". Moreover, because the unsupervised batch-SOFM competitive learning algorithm employed to pre-process the raw I/Q samples into SOFMs was also implemented in MATLAB, data set 2 of SOFM images was initially generated in a MATLAB compatible figure format, i.e., ".fig". Later, these images were stored in the Portable Network Graphics (PNG) file format and then were randomly split into training and testing sets for classification and device authentication purposes.
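For readers who want to reproduce a random training/testing split of the PNG images, a minimal Python sketch is given below. The directory layout, the file names, the 80/20 ratio, and the fixed seed are all hypothetical; the article does not specify how the published split was produced.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into (train, test) lists."""
    items = sorted(items)            # stable starting order regardless of input order
    rng = random.Random(seed)        # fixed seed so the split can be reproduced
    rng.shuffle(items)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# Hypothetical file names: 50 SOFM images for each of six devices.
paths = [f"device{d}/sofm_{k:03d}.png" for d in range(1, 7) for k in range(50)]
train, test = split_train_test(paths)
print(len(train), "training images,", len(test), "testing images")
```

The raw ".mat" files of data set 1 can typically be read outside MATLAB as well, for example with `scipy.io.loadmat`, although the variable names stored inside the files would have to be checked against the published data set.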
The data sets provided have been obtained via a wired or conductive testing method and thus do not include antenna and wireless channel artifacts. These can be most readily added through post-processing using Rayleigh, Rician, or similar channel models, to add the fading statistics of an operational environment of choice.
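The post-processing step mentioned above can be sketched as follows. The snippet applies a single flat (one-tap) block-fading coefficient, Rayleigh when the K-factor is zero and Rician otherwise, plus optional additive white Gaussian noise to a vector of complex I/Q samples. The function name, the block-fading simplification, and the parameter values are assumptions; time-varying or frequency-selective channels would need a fuller model.

```python
import numpy as np

def apply_flat_fading(iq, k_factor=0.0, snr_db=None, seed=0):
    """Return (faded_iq, h) for one flat block-fading realisation.

    k_factor is the linear Rician K-factor (0 gives Rayleigh fading); the
    channel coefficient h is normalised so that its mean power is one.
    """
    rng = np.random.default_rng(seed)
    los = np.sqrt(k_factor / (k_factor + 1.0))             # line-of-sight part
    scatter_std = np.sqrt(1.0 / (2.0 * (k_factor + 1.0)))  # per real dimension
    h = los + scatter_std * (rng.standard_normal() + 1j * rng.standard_normal())
    faded = h * iq
    if snr_db is not None:
        signal_power = np.mean(np.abs(iq) ** 2)
        noise_power = signal_power / 10.0 ** (snr_db / 10.0)
        noise = np.sqrt(noise_power / 2.0) * (
            rng.standard_normal(iq.shape) + 1j * rng.standard_normal(iq.shape)
        )
        faded = faded + noise
    return faded, h

# Stand-in for conducted I/Q samples loaded from the data set.
iq = np.exp(2j * np.pi * 0.01 * np.arange(1000))
rayleigh_iq, h_rayleigh = apply_flat_fading(iq, k_factor=0.0)
rician_iq, h_rician = apply_flat_fading(iq, k_factor=5.0, snr_db=20.0, seed=1)
print("Rayleigh |h|:", abs(h_rayleigh), " Rician |h|:", abs(h_rician))
```

Drawing a fresh seed per capture yields an ensemble of channel realisations, so a classifier can be evaluated against the fading statistics of the chosen environment rather than a single channel state.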

VI. CONCLUDING REMARKS AND FUTURE WORK
In this article, we first introduced an in-house experimental testbed, established for studying RF cyber-physical security for low-cost IoT devices. Then, we also introduced the data sets generated and collected from the experimental testbed and showed how they can be used to conduct in-depth research on wireless security enhancement. In order to overcome the correlation across raw LoRa I/Q samples in the original data set, an expeditious pre-processing approach to generate SOFM images by applying the unsupervised batch-SOFM competitive learning algorithm was introduced. Furthermore, we demonstrated that SOFM-enabled ML classifiers, such as CNNs, are able to accurately localize RF fingerprint patterns, providing a robust device authentication strategy. The techniques utilized in this article can be extended to similar low-power and/or low-rate IoT air-interfaces, e.g., Sigfox, LTE Cat-M, NB-IoT, Zigbee, and Bluetooth. We aim to ultimately achieve the conception of Secure by Design for all these wireless systems using the developed experimental testbed and the data sets.
As a preliminary study aiming to raise the awareness of RF cyber-physical security and promote further research activities, we have made all data sets discussed in this article openly accessible. Using the testbed plus the published data sets, there exists a multitude of research directions that are worth investigating as future work. For example, a variety of diverse wireless environments, characterized by different antenna setups, fading conditions, and mobility, could be programmed and integrated into the testbed as independent functional modules. As a result, the testbed would be capable of generating composite SOFMs integrating RF front-end and environmental factors. This can be achieved by utilizing the developed testbed in conjunction with RF channel emulators, e.g., Keysight F8. In addition, it is also worth updating the testbed to enable it to take heterogeneous user demands on data rate and security into account.

Fig. 2. Experimental testbed and its architecture for raw data set generation.

Fig. 3. Plot of the first 1,000 samples of the I/Q waveform of a LoRa device captured by the VSA, demonstrating a constant envelope of signal power.

Fig. 4. Illustration of the entire device authentication process using SOFMs (Dataset 2) generated from the data set of LoRa I/Q vectors (Dataset 1).

TABLE I. STATE-OF-THE-ART ML ENABLED RF FINGERPRINTING AND DEVICE AUTHENTICATION APPROACHES