
Peer reviewed version

Link to published version (if available): 10.1109/VETEC.1999.778065

University of Bristol - Explore Bristol Research

General rights

This document is made available in accordance with publisher policies. Please cite only the published version using the reference above. Full terms of use are available: http://www.bristol.ac.uk/red/research-policy/pure/user-guides/ebr-terms/
Design of a Novel Delayed LMS Decision Feedback Equaliser for HIPERLAN/1 FPGA Implementation

Y. Sun, A.R. Nix,
D.R. Bull & D. Milford
University of Bristol, Centre for Communications Research, MVB Room 2.19,
Woodland Road, Bristol BS8 1UB, UK
Fax: +44-117-954-5206;
E-mail: Yong.Sun@bristol.ac.uk

Abstract — This paper presents the investigation of a new equaliser algorithm and architecture optimised for low cost FPGA implementation. The design was performed as part of the ESPRIT WINHOME project and is fully compliant with the European third generation HIPERLAN/1 wireless LAN standard. The equaliser supports GMSK modulation at an instantaneous transmission data-rate of just under 24 Mbit/s.

In this paper the equaliser algorithm and the pipelined DLMS DFE architecture are presented. Issues such as signal quantisation, bit and frame synchronisation and frequency offset correction are discussed in detail. The final structure is shown to achieve considerable hardware simplification together with improved performance when compared to a standard implementation of the complex LMS equaliser.

I. Introduction

HIPERLAN represents a family of new European high-speed wireless LAN standards. The first of these standards, HIPERLAN/1, was completed in 1996 and products are expected early in the new millennium. Standardisation bodies in Europe (ETSI BRAN), North America (IEEE 802.11) and Japan (MMAC-PC) are continuing to develop similar standards for high-speed wireless LANs.

With the convergence of computing, broadcasting and telecommunications, multimedia computers have started to enter the home in volume. The desire for high bit-rate wireless communications is not only emerging from industry and education, but also from private individuals in the home. The ESPRIT WINHOME (Wireless INnovation in the HOME) project [1] aims to provide domestic environments with high quality interactive digital television, internet, satellite access (via Eutelsat), videophone communications and other multi-media services via the HIPERLAN/1 wireless interface.

The main focus of the WINHOME project is to achieve high quality wireless video links throughout the home environment using HIPERLAN/1 technology. This will be achieved through a combination of advanced equaliser design and optimised video coding techniques. The WINHOME system incorporates an adaptive Decision Feedback Equaliser (DFE) filter to eliminate harmful Inter-Symbol Interference (ISI), realised using FPGA (Field Programmable Gate Array) technology. However, even the simplest standard LMS algorithm is too complicated for such technology. Hence, the design of a simple, high bit rate, low power adaptive equaliser is a major challenge within the WINHOME project.

When designing the equaliser there were four main factors that directly affected its complexity: (1) the training algorithm; (2) the equaliser structure; (3) the quantisation level; and (4) the internal data representation. In addition, for ad-hoc networks, the receiver must operate without control from a central node. As a consequence, synchronisation and frequency offset correction have to be carefully designed. In this paper we present a low complexity, high performance DLMS DFE equaliser algorithm and describe the architecture chosen for its implementation. In section II, the developed Delayed LMS (DLMS) DFE architecture is presented. Section III describes and demonstrates the particular form of the LMS update algorithm proposed. The signal quantisation study is presented in section IV while section V explains our recommended synchronisation and frequency offset correction techniques.

II. DLMS DFE Architecture

A general standard LMS update algorithm, without considering the Feedback Filter (FBF), can be written as:

\[ C_{k+1} = C_k + \Delta \cdot e_k \cdot V_k \quad (1) \]

where \( C \), \( e \) and \( V \) represent coefficients, error and input data, respectively, and \( \Delta \) is the adaptation step size.
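As a minimal sketch of the update above, the following Python fragment performs LMS steps on complex data. The function names and the two-tap test system are illustrative assumptions, not part of the WINHOME design. The sketch also conjugates the input data, which is the usual convention for complex signals when the filter output is formed as the sum of the \( C \cdot V \) products.

```python
import random


def filter_output(coeffs, inputs):
    """Filter output: sum of coefficient * input products."""
    return sum(c * v for c, v in zip(coeffs, inputs))


def lms_step(coeffs, inputs, desired, step):
    """One LMS update: C_{k+1} = C_k + step * e_k * conj(V_k)."""
    error = desired - filter_output(coeffs, inputs)
    new_coeffs = [c + step * error * v.conjugate()
                  for c, v in zip(coeffs, inputs)]
    return new_coeffs, error


def identify(target, num_steps=500, step=0.1, seed=1):
    """Adapt towards a known (hypothetical) coefficient vector, noise free."""
    rng = random.Random(seed)
    symbols = [1, -1, 1j, -1j]
    coeffs = [0j] * len(target)
    for _ in range(num_steps):
        inputs = [rng.choice(symbols) for _tap in target]
        desired = filter_output(target, inputs)
        coeffs, error = lms_step(coeffs, inputs, desired, step)
    return coeffs
```

With noise-free data the coefficients settle on the target values; the step size of 0.1 is an arbitrary choice for the toy example rather than a recommended setting.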

A standard LMS DFE critical path can be obtained as shown in figure 1, where the multipliers operate with complex inputs. The compare device determines the difference (error) between the received signal and the training sequence. The gradient estimate is weighted by the step size, which is selected as an exact power-of-two (POT) term. Assuming a fixed step size, a simple hard-wired bit shift can be used (as shown in figure 1). The hard detection process is not shown in the critical path since this can be performed in parallel with the coefficient update process. Furthermore, the hard decision process is very simply implemented by taking the sign of the real or imaginary part of the DFE output data according to even or odd bit slots in the GMSK modem.
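The hard decision described above can be sketched in a few lines. The mapping of even bit slots to the real axis and odd bit slots to the imaginary axis is an assumption made for illustration; the text only states that the axis alternates with the bit slot.

```python
def hard_decision(z, slot):
    """Decide a GMSK symbol from the DFE output z.

    Assumed mapping: even slots carry +/-1 (real axis),
    odd slots carry +/-j (imaginary axis).
    """
    if slot % 2 == 0:
        return complex(1 if z.real >= 0 else -1, 0)
    return complex(0, 1 if z.imag >= 0 else -1)
```

Only a sign bit is inspected in each slot, which is why this step needs no multiplier and can run alongside the coefficient update.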

The latency of the LMS DFE is mainly determined by two complex multipliers, the adder trees and the compare device. To perform these tasks sequentially at the HIPERLAN bit rate is not possible using current FPGA technology (the critical path in figure 1 must be performed within 42ns). Hence, to enable a practical implementation the Delayed LMS DFE must be considered.

The conventional complex form of the DLMS algorithm can be written as

\[
C_{k+1} = C_k + \Delta \cdot e_{k-D} \cdot V_{k-D} \quad (2)
\]

\[
e_k = d_k - C_k \cdot V_k \quad (3)
\]

where \(C_k\) represents the vector of filter coefficients, \(V_k\) the vector of filter input data, \(\Delta\) the step size and \(e_k\) the error between the equaliser output, \(Z_k = C_k \cdot V_k\), and the desired response \(d_k\) (here \(d_k\) is obtained from the 450 bit HIPERLAN/1 training sequence). \(D\) represents the delay relaxation measured in bit periods.
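A minimal sketch of a delayed update is given below, assuming the usual DLMS form in which both the error and the input data used for the update are \(D\) bit periods old. The function names, the two-tap test channel and the conjugation of the input data are illustrative assumptions, not details taken from the WINHOME design.

```python
import random
from collections import deque


def dlms_train(samples, desired, num_taps, step, delay):
    """Delayed LMS: the update at time k uses the error and data from k - delay."""
    coeffs = [0j] * num_taps
    taps = deque([0j] * num_taps, maxlen=num_taps)  # V_k, newest sample first
    pending = deque()                               # (e, V) pairs awaiting use
    errors = []
    for x, d in zip(samples, desired):
        taps.appendleft(x)
        v = list(taps)
        e = d - sum(c * vi for c, vi in zip(coeffs, v))
        errors.append(e)
        pending.append((e, v))
        if len(pending) > delay:
            e_old, v_old = pending.popleft()
            coeffs = [c + step * e_old * vi.conjugate()
                      for c, vi in zip(coeffs, v_old)]
    return coeffs, errors


def demo(delay=2, num_samples=2000, step=0.05, seed=7):
    """Identify a hypothetical two-tap system from noise-free data."""
    rng = random.Random(seed)
    x = [rng.choice([1, -1, 1j, -1j]) for _ in range(num_samples)]
    d = [0.8 * x[k] + 0.3j * (x[k - 1] if k else 0) for k in range(len(x))]
    return dlms_train(x, d, num_taps=2, step=step, delay=delay)
```

Setting `delay=0` reduces the loop to the standard LMS update. Larger delays still converge in this toy case, but in general they tighten the range of step sizes for which the adaptation remains stable.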

Naturally, by applying the above algorithm, the FeedForward Filter (FFF) and the Coefficient Update (CU) can be computed in parallel. This allows one complex multiplier plus the error production and CU parts to be removed from the critical path, reducing the latency to half that of the standard sequential LMS. Furthermore, since the detection device and the error production processes are significantly less complex in the feedback part of the DFE (compared to the feedforward section), it is possible to move the detection device into the feedback section to balance the latency of the FFF and the FBF. By carefully considering the connection between the FFF and the FBF, a sub-summation and a \(T\) delay are introduced after the FFF so that the FFF and FBF processes run fully in parallel.

Further pipelining of the DLMS DFE structure can be implemented on the FFF and the CU to isolate the complex multipliers, assuming these processes dominate the latency. A further delayed DLMS DFE is proposed in order to guarantee that the computation time available for a complex multiply is at least one symbol period (42ns). The resulting DFE architecture was developed into a software demonstrator as shown in figure 2.

It is now shown that by saving multipliers in the FFF structure, it is possible to dramatically improve the resulting gate count. In our design, in order to cope with the home environment radio channel, we have proposed a DFE (6, 5) structure. The resulting multiplier count could rise to \(2 \times 6 + 5 = 17\) complex multipliers, thus resulting in 68 real multipliers in a non-optimised scheme.

The following simplification was developed for the WINHOME implementation and relies on performing the signal and coefficient calculations in series in the feedforward loop. This is made possible by the latest generation of FPGA families (e.g. the newest 1 million gate VIRTEX family from XILINX), which allow real multipliers to run at double frequency (47 MHz in our case). Instead of four real multipliers in each stage, we use only two, as shown in figure 3, resulting in a complete DFE (6, 5) performed with just 34 real multipliers.
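The idea of sharing two real multipliers over two clock phases can be sketched as follows. The schedule shown here (which products are formed in which phase) is an assumption for illustration and is not claimed to match figure 3; the point is only that four real products are obtained from two multipliers running at twice the symbol rate.

```python
def complex_multiply_two_phase(x, y):
    """Complex product using two real multipliers over two clock phases."""
    a, b = x.real, x.imag
    c, d = y.real, y.imag

    # Phase 0: both multipliers take the real part of x as one operand.
    p0, p1 = a * c, a * d
    real_acc, imag_acc = p0, p1

    # Phase 1: the same two multipliers are reused with the imaginary part of x.
    p0, p1 = b * d, b * c
    real_acc -= p0
    imag_acc += p1

    return complex(real_acc, imag_acc)
```

For example, `complex_multiply_two_phase(3 + 2j, 1 - 4j)` gives the same value as the built-in product `(3 + 2j) * (1 - 4j)`.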

With the proposed complex multiplier structure as shown in figure 3, the combined complex multiplier and adder tree stages of the pipeline can be implemented with a latency of just two symbol periods. The proposed architecture nevertheless uses only half the FPGA area, since the number of 8x8 real multipliers decreases from 68 to 34 in this scheme.
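The multiplier budget quoted above can be checked with a few lines of arithmetic. The helper name is hypothetical, and the formula simply mirrors the \(2 \times 6 + 5\) count given in the text for a DFE (6, 5).

```python
def multiplier_budget(fff_taps, fbf_taps, real_per_complex):
    """Return (complex multipliers, real multipliers) for a DFE(fff, fbf)."""
    complex_multipliers = 2 * fff_taps + fbf_taps
    return complex_multipliers, complex_multipliers * real_per_complex


# Four real multipliers per complex multiply (non-optimised scheme).
baseline = multiplier_budget(6, 5, real_per_complex=4)
# Two time-shared real multipliers per complex multiply (proposed scheme).
proposed = multiplier_budget(6, 5, real_per_complex=2)
```

Here `baseline` evaluates to `(17, 68)` and `proposed` to `(17, 34)`, matching the figures in the text.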

With this improved architecture, the critical path of the standard LMS DFE structure (as shown in figure 1) can thus be modified to produce a pipelined DLMS architecture as shown in figure 4.

![Figure 4: Critical path for the Pipelined DLMS DFE](image)

III. LMS Algorithm Development

There are many ways to simplify the LMS update algorithm. The purpose of these simplifications is to reduce the complexity of the target hardware. However, the simplification process normally degrades the final system performance. In this paper a simplification scheme is proposed which actually improves the resulting equaliser performance for GMSK (known as the real-error scheme).

The real-error scheme is proposed for use with GMSK modulation. The GMSK signal takes alternate (I, Q) values of \((\pm 1, 0), (0, \pm j)\) according to the sampling axis. This also means that the corresponding symbols are either pure real or pure imaginary. Therefore, a series real-error scheme for DFE coefficient update is proposed.

In the general LMS equation (equation 1), \(e_k\) represents the complex error and can be written as

\[
e_k = I_k - C_k \cdot V_k \quad (4)
\]

where \(I_k\) is the information symbol transmitted in the \(k\)-th signalling interval. Here, the symbol sequence for \(I_k\) is fixed using a pre-set training sequence.

According to the GMSK modem design in our implementation, the criterion can consist of minimising \(E[Re\{e_k\}^2]\) or \(E[Im\{e_k\}^2]\) rather than \(E[|e_k|^2]\). The justification can be obtained from the corresponding gradient estimates:

\[
\text{grad}_{Re\{e_k\}} = -Re\{e_k\} \cdot V_k^*, \quad (5)
\]

\[
\text{grad}_{Im\{e_k\}} = -j \, Im\{e_k\} \cdot V_k^*, \quad (6)
\]

where \(V_k^*\) denotes the complex conjugate of the input data vector. This leads to the following equations for the first real-error scheme (real-error scheme-1):

\[
C_{k+1} = C_k + \Delta \cdot Re\{e_k\} \cdot V_k^* \quad (7)
\]

when the decision is to be taken on the real axis, or

\[
C_{k+1} = C_k + \Delta \cdot j \cdot Im\{e_k\} \cdot V_k^* \quad (8)
\]

when the decision is to be taken on the imaginary axis.

\(Re\{e_k\}\) and \(Im\{e_k\}\) represent the values of the real and imaginary parts of the error signal. Real-error scheme-1 can be further simplified to produce the following two schemes:

**Real-error scheme-2:** This method takes just the imaginary part of the error, resulting in equation (9),

\[
C_{k+1} = C_k + \Delta \cdot Im\{e_k\} \cdot V_k^* \quad (9)
\]

**Real-error scheme-3:** This method combines scheme-1 with a power-of-two scheme by taking the base-2 logarithm of \(Re\{e_k\}\) or \(Im\{e_k\}\). This can be expressed as:

\[
C_{k+1} = C_k + \Delta \cdot 2^{\text{round}(\log_2 Re\{e_k\})} \cdot V_k^* \quad (10)
\]

\[
C_{k+1} = C_k + \Delta \cdot 2^{\text{round}(\log_2 Im\{e_k\})} \cdot V_k^* \quad (11)
\]

where \(\text{round}(\cdot)\) rounds the logarithm to an integer, with \(\lceil \cdot \rceil\) indicating rounding towards plus infinity (round up) and \(\lfloor \cdot \rfloor\) rounding towards minus infinity (round down).

These real-error schemes have the advantage of using a real multiplier instead of a complex multiplier in the hardware implementation of the CU. For scheme-3, the real multiplier can be implemented as a barrel shifter followed by an adder. Furthermore, scheme-1 has been shown to significantly improve system performance. The performance study of these schemes is shown in figure 5.
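The three real-error schemes can be compared side by side in a short sketch. Several details are assumptions made for illustration: even bit slots are taken to decide on the real axis, the input data are conjugated in the update, the power-of-two quantiser keeps the sign of the error and uses Python's built-in `round` on the base-2 logarithm of its magnitude, and all function names are hypothetical.

```python
import math


def pot_quantise(value):
    """Replace a real value by a signed power of two, sign * 2**round(log2|value|)."""
    if value == 0.0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(value))), value)


def error_term(error, slot, scheme):
    """Error factor that multiplies step * conj(V) in the coefficient update."""
    on_real_axis = (slot % 2 == 0)  # assumed slot-to-axis mapping
    if scheme == 1:
        # Re{e} on the real axis, j*Im{e} on the imaginary axis.
        return complex(error.real, 0) if on_real_axis else complex(0, error.imag)
    if scheme == 2:
        # Imaginary part of the error only, used as a real factor.
        return complex(error.imag, 0)
    if scheme == 3:
        # Scheme-1 component quantised to a power of two (shift in hardware).
        part = error.real if on_real_axis else error.imag
        return complex(pot_quantise(part), 0)
    raise ValueError("scheme must be 1, 2 or 3")


def real_error_update(coeffs, inputs, error, slot, step, scheme=1):
    """One coefficient update using the selected real-error scheme."""
    factor = error_term(error, slot, scheme)
    return [c + step * factor * v.conjugate() for c, v in zip(coeffs, inputs)]
```

In every scheme the error factor has a single non-zero component, so each coefficient update needs real rather than complex multiplications; in scheme-3 the factor is a signed power of two, so the multiplication reduces to a shift.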

![Figure 5: FEBR performance of real-error schemes](image)

IV. Quantisation Study

It is shown in [2] that the accumulated quantisation noise of the LMS equaliser's coefficients results in an output quantisation error whose mean squared value is, approximately, inversely proportional to the adaptation step size $\Delta$. Taking $\Delta$ to be very small (in order to reduce the excess mean square error) can result in a considerable quantisation error. Here, two critical parts are identified (the FFF and the FFF-CU) and two different quantisation loops (internal filter coefficients and external filter coefficients) are defined to balance the need for high efficiency and low complexity.

For the FFF, the quantisation resolution can be defined as $M_1$ bits and the structure of the FFF can be redrawn as shown in figure 6.

![Figure 6: Resolution setup for feedforward filter](image)

Figure 6 shows that all branches in the FFF employ the same level of quantisation resolution, $M_1$.

From equation 2, it can be seen that the quantisation level of the update section should be different from that of the adder and the preceding FFF multipliers. This arises because, if the updating value $\Delta \cdot e \cdot V$ drops below the quantisation level, the update is lost. Thus, the quantisation level is set at $M_1$ for the multiplier, while a higher quantisation level of $M_2$ is used for the coefficient update section. The resulting structure is shown in figure 7.

![Figure 7: Resolution setup for coefficient update](image)

Naturally, the value of $M_2$ should equal $2M_1$ after the complex multiplier. The top branch, $C_k$ output, is used for the DFE filter, while the bottom branch, $C_k$ stored, is used in the calculation of the next coefficient update (where greater accuracy is required). By employing the above scheme, both the excess mean square error and the quantisation error can be controlled and minimised to an optimal value.
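The effect of keeping the stored coefficient at a higher resolution than the filter copy can be illustrated with a simple fixed-point model. This is a sketch under stated assumptions: values are real and lie in $[-1, 1)$, quantisation is rounding to a signed grid, and the function names and the update size of 0.004 are hypothetical. A complex coefficient would apply the same treatment to its real and imaginary parts.

```python
def quantise(value, bits):
    """Round a real value in [-1, 1) to a signed fixed-point grid of `bits` bits."""
    scale = 1 << (bits - 1)
    level = max(-scale, min(scale - 1, round(value * scale)))
    return level / scale


def accumulate(updates, stored_bits, output_bits, start=0.0):
    """Apply small updates to a stored coefficient and return (stored, output copy)."""
    stored = quantise(start, stored_bits)
    for u in updates:
        stored = quantise(stored + u, stored_bits)
    return stored, quantise(stored, output_bits)


small_updates = [0.004] * 24

# Update loop kept at M1 = 6 bits: every update falls below the grid and is lost.
coarse = accumulate(small_updates, stored_bits=6, output_bits=6)

# Update loop kept at M2 = 12 bits, filter copy at M1 = 6 bits: updates accumulate.
fine = accumulate(small_updates, stored_bits=12, output_bits=6)
```

In this toy case `coarse` stays at `(0.0, 0.0)`, while `fine` reaches `(0.09375, 0.09375)`: the small updates survive in the high-resolution store and eventually move the lower-resolution copy used by the filter.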

A simulation study of the quantisation performance in a typical indoor channel has been performed and results are shown in figure 8. The graph shows the FEBR performance with quantisation $(M_1, M_2)$ based on a DFE $(4, 3)$. $M_2$ is assumed to equal $2M_1$ and $Q(\text{None})$ means no quantisation is applied, i.e. full floating point precision is used. The worst case channel conditions described in section III were assumed for this study.

![Figure 8: FEBR performance with Q(m, 2m)](image)

From the results above, a quantisation level of $Q(5, 10)$ achieves a reasonable performance level. However, compared with $Q(6, 12)$, a difference of about 10% FEBR appears. Thus, $Q(6, 12)$ appears to offer an acceptable level of performance. This study shows that the value of $M_1$ should be greater than or equal to 6. A further study was performed to identify a minimum value for $M_2$ (by fixing the value of $M_1$ and varying $M_2$). The results of this study indicate that $M_2$ should have a minimum value of 10 bits.

V. Synch & Frequency Offset

A method for bit synchronisation combined with coarse frequency offset correction for HIPERLAN/1 was proposed in [3]. In this study, the method has been updated to support the final version of the HIPERLAN/1 training sequence [4]. The sequence is shown in figure 9.

![Figure 9: Finalised training sequence in HIPERLAN/1](image)

The first m-sequence, $m_1$, can be used for the purpose of bit synchronisation and the other m-sequences can be employed to perform coarse frequency offset correction. The simulated performance of the coarse frequency offset method is shown in figure 10. The results indicate that for errors up to 104 kHz, the mean detection error is less than 1 kHz.
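The general principle of synchronising on an m-sequence can be sketched with a sliding correlation. This is a generic illustration only: the length-31 sequence generated from the recurrence $a_n = a_{n-3} \oplus a_{n-5}$ is an assumption chosen for brevity and is not the HIPERLAN/1 training sequence, the symbols are modelled as real $\pm 1$ values without noise or frequency offset, and the function names are hypothetical.

```python
def m_sequence(length=31):
    """Generate a binary m-sequence from the recurrence a[n] = a[n-3] XOR a[n-5]."""
    bits = [1, 0, 0, 0, 0]
    while len(bits) < length:
        bits.append(bits[-3] ^ bits[-5])
    return bits[:length]


def find_sync(received, reference):
    """Return (offset, score) of the strongest correlation with the reference."""
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(received) - len(reference) + 1):
        score = sum(r * s for r, s in zip(received[offset:], reference))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Map bits to +/-1 and embed the sequence after eight alternating symbols.
reference = [1 - 2 * b for b in m_sequence()]
received = [1, -1] * 4 + reference + [1] * 5
sync_offset, peak = find_sync(received, reference)
```

The correlation peaks at the full sequence length (31) when the window is aligned with the embedded sequence, here at offset 8. The sharp autocorrelation peak of an m-sequence is what makes it suitable as a synchronisation marker.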

VI. Conclusions

In this paper, we have concentrated on the design and implementation of a high-speed equaliser for home based HIPERLAN/1 applications. The quantisation study found that there were two main quantisation levels needed in the design. The internal processing in the update loop needed high bit precision to guarantee the accuracy of the step weights and to ensure convergence of the DFE.

The development of a real-error (rather than the traditional complex error) scheme for the HIPERLAN/1 GMSK equaliser resulted in reduced complexity and improved performance. Results have shown that in the worst case home environment, the designed DFE (6,5) achieved a 98% error free block transmission rate.

The proposed pipelined DLMS DFE architecture guarantees FPGA implementation by reducing the latency of key critical paths. The use of two real dual frequency multipliers results in a significant reduction in the number of required real multipliers, and hence the FPGA size.

Coarse frequency offset correction was proposed in order to properly run the DFE engine and to minimise the effects of delayed frequency offset within the delayed LMS update algorithm. The transposed transversal filter structure combined with fixed coefficients can be used to implement synchronisation and coarse frequency offset detection.

A full FPGA implementation is now underway using XILINX VIRTEX technology as part of the ESPRIT WINHOME project.

Acknowledgements

The authors (UoB and Thomson-CSF Detexis) would like to thank the other members of the WINHOME consortium: Grundig (UK), SCT (UK) and Eutelsat (Fr) as well as the European Commission for their continual support and encouragement. This work was performed as part of the ESPRIT WINHOME project (25048).

References