
Peer reviewed version

Link to published version (if available):
10.1109/82.700934

Link to publication record in Explore Bristol Research
PDF-document

University of Bristol - Explore Bristol Research

General rights

This document is made available in accordance with publisher policies. Please cite only the published version using the reference above. Full terms of use are available:
http://www.bristol.ac.uk/red/research-policy/pure/user-guides/ebr-terms/
Pipelined DFE Architectures Using Delayed Coefficient Adaptation

R. Perry, David R. Bull, and A. Nix

Abstract—In this paper the delayed least-mean-square (DLMS) algorithm is proposed for training a transversal filter-based decision feedback equalizer (DFE). Delays in the filter coefficient update process are used to pipeline the DFE, thereby increasing the throughput rate, for a given speed of hardware. The filter structures selected for the feedforward and feedback section of the DFE facilitate the use of a shared error signal, thereby reducing communication costs. The new resulting structure is highly modular and is very suitable for very large scale integration (VLSI) implementation. A pipelined form for the normalized least-mean-square algorithm (NMLS) is also obtained which removes the dependency of the convergence speed on the input signal power. The convergence and residual mean-square-error characteristics of the different pipelined filters are compared.

Index Terms—Adaptive filters, delayed least-mean-squares, equalization, pipelined.

I. INTRODUCTION

Although numerous adaptive equalizers have been reported in the literature, decision feedback equalization (DFE) remains a popular choice of equalization technique, especially for high-data-rate (e.g., HIPERLAN) applications, where alternatives, such as Viterbi equalizers, are precluded due to their complexity [1]. The use of long training sequences enables the selection of relatively simple gradient-based adaptive algorithms for equalizer training. Despite this concession, the realization of a high-sampling-rate DFE still remains challenging. This is due to the inherent sampling rate limitation of any adaptive filter associated with the generation and feedback of the error signal required for the filter coefficient updates. Exploitation of the natural parallelism in the least-mean-square (LMS) algorithm can help to reduce the algorithm iteration period. However, the regularity of the filtering structure is limited [2] and global broadcasting of the error signal to all coefficient update modules is still required. This has motivated the use of approximations to the LMS algorithm which sacrifice performance to increase the level of pipelining in the adaptation feedback loop. The delayed LMS (DLMS) algorithm is an example of an approximate form of the LMS algorithm, where the coefficient updates are delayed by an arbitrary number of sample periods [3].

In this brief a DFE architecture is derived, comprising a cascade of identical processing modules (PM’s). The order recursive filter (ORF), described in [4], is used for the feedforward section but cannot be used for the feedback filter (FBF) because of the latency in the output. A transposed direct-form transversal filter (TF) together with a modified coefficient update is used to realize a pipelined FBF. By exploiting shared input parameters between the feedback filter (FF) and FBF stages, a new modular DFE structure is obtained (Section II-A), requiring minimal global communication.

Despite the attraction of its relative simplicity, a particular problem of implementing the LMS algorithm is the dependency of the algo-
rithm stability and convergence speed on the power and correlation properties of the input data [5]. If a fixed step size is used, this must be kept sufficiently small to ensure stability under all expected operating conditions. A potential solution is to provide some degree of automatic gain control (AGC) by scaling the input data prior to equalization such that the step size × power product is maintained within some predetermined design limits [6]. A practical means of implementing this is to use a data-dependent step size which is set to be inversely proportional to the sum of the square values of the current filter inputs. This approach results in the normalized LMS (NLMS) algorithm. Using the pipelining techniques described above, a pipelined form of the NLMS algorithm is also obtained in Section II. In Section III the performance of the pipelined DFE’s are compared with a conventional realization of the LMS algorithm.

II. PIPELINED ADAPTIVE DFE’S USING DELAYED COEFFICIENT UPDATES

In this section it is shown how the introduction of a delay in the coefficient update process can be used to pipeline a transversal DFE using the basic LMS algorithm for training (Section II-A). In Section II-B a pipelined DFE structure adapted using the NLMS algorithm is developed. A comparison with other existing adaptive filter structures is presented in Section II-C.

A. DLMS Pipelined DFE

The complex form of the DLMS algorithm is given by (1) and (2)

\[ W(n) = W(n-1) + \beta e^*(n-D)X(n-D) \]

\[ e(n) = d(n) - WH(n-1)X(n) \]

(1)\hspace{1cm} (2)

where \( W(n) \) is the vector of filter coefficients, \( X(n) \) is the vector of input data fed into the filter, \( \beta \) is the step size, and \( e(n) \) is the error between the desired response \( d(n) \) and the estimate of the desired response, produced by the equalizer at time \( n \) [7]. \( D \) is an arbitrary delay introduced in the gradient term \( e^*(n-D)X(n-D) \). For the case of a DFE, \( W(n) \) and \( X(n) \) are first partitioned into feedforward and feedback vectors and the delay \( D \) is set to \( L \), the length of both the FFF’s and FBF’s (equal length filters are assumed without loss of generality). Equation (1) can be written in terms of these partitioned vectors as in (3)

\[ [W_f(n), W_b(n)] = [W_f(n-1), W_b(n-1)] + \beta e^*(n-L)[X_f(n-L), X_b(n-L)] \]

(3)

where

\[ X_f(n) = [x_f(n) \hspace{0.2cm} x_f(n-1) \hspace{0.2cm} \cdots \hspace{0.2cm} x_f(n-L+2) \]

\[ \cdot x_f(n-L+1)]^T \]

\[ X_b(n) = [x_b(n) \hspace{0.2cm} x_b(n-1) \hspace{0.2cm} \cdots \hspace{0.2cm} x_b(n-L+2) \]

\[ \cdot x_b(n-L+1)]^T \]

\[ W_f(n) = [w_{f0}^0(n) \hspace{0.2cm} w_{f1}^0(n) \hspace{0.2cm} \cdots \hspace{0.2cm} w_{fL-1}^0(n) \hspace{0.2cm} w_{fL}^0(n)]^T \]

\[ W_b(n) = [w_{b0}^0(n) \hspace{0.2cm} w_{b1}^0(n) \hspace{0.2cm} \cdots \hspace{0.2cm} w_{bL-1}^0(n) \hspace{0.2cm} w_{bL}^0(n)]^T \]

Note that \( X_b(n) \) is the vector of previous training symbols, thus \( x_b(n) = d(n-1) \). During decision-directed mode, \( X_b(n) \) will contain the previously detected symbols. \( X_f(n) \) is the vector of input samples to the FFF. The pipelining approach adopted here is to decompose the inner product computation in (2) into a cascaded series of pipelined stages, the output of the final stage being the value of the complete inner products of the FFF and FBF sections. The DFE output at time \( L+1 \), \( y_{L+1}(n-L+1) \) is composed of two parts: the FFF output \( y_{L+1}(n-L+1) \) and the FBF output \( y_{L+1}(n-L+1) \)

\[ y_{L+1}(n-L+1) = y_{L+1}(n-L+1) + y_{L+1}(n-L+1) \]

From (6a), it is clear that the time indexes of the filter coefficients are skewed in time with respect to one another and thus this structure does not represent a strict realization of the DLMS algorithm. However, the time skew can be removed by delaying the coefficient updates as shown in the filter denoted TF2 in Fig. 2. The \( \frac{L-1}{k+1} \) delay in this figure ensures the equivalence to the original ORF. This structure is not, however, used in its given form for the FBF, as the delays introduced in the coefficient updates can be distributed to yield a filter structure with improved pipelining.

For the TF2 structure, the filter output is described by a modified form of (6a), where an additional delay of \( (L-1-k) \) sample periods is included in the \( k \)th coefficient time index. In this case \( y_{L+1}(n-L+1) \) is given by

\[ y_{L+1}(n-L+1) = \sum_{k=0}^{L-1} x_b(n-i-k)w_{b}^0(n-1-k-(L-1-k)) \]
Fig. 2. Introduction of delays in the transposed structure (TF2).

and with time delay \( i = L - 1 \)

\[
y_{6, L-1}(n - L + 1) = \sum_{k=0}^{L} x_k(n - L + 1 - k)w_{k}^{m}(n - L) = \mathbf{W}_{L}^{m}(n - L)\mathbf{X}_{k}(n - L + 1). \tag{6b}
\]

In environments where the change in the channel ISI is small over \( D \) symbol periods, then \( w_{k}^{m}(n - D) \equiv w_{k}^{m}(n - k - 1)k \leq D \). The output of the filter in Fig. 1 (TF1), following convergence, will then be similar to that from the filter in Fig. 2 (TF2). Consequently, both structures can be considered for use as the FBF. The DFE structure using TF1 will be referred to as the transposed direct-form DFE (TFDFE) and the DFE using a retimed version of TF2 will be referred to as the DLMS DFE. Using the form of (6b) for \( y_{6, L-1}(n - L + 1) \), a diagram showing the data flow for a (3,3) DLMS DFE is given in Fig. 3. The input to the FFF section enters from the left while the previous decision is input to all of the FFB sections simultaneously. The index for the FFF coefficients increases left to right, but for the FBF coefficients it decreases left to right (because of the transposition). The latency of the filter is \( 2L - 1 \) sample periods. This is the time for the FFF to fill and for the estimate of the desired response to propagate along the filter structure.

Updates for the filter coefficients are now derived. For the FFF, the \( i \)th coefficient update is obtained from (3) as

\[
w_{i}^{u}(n - i) = w_{i}^{u}(n - 1 - i) + \beta e^{*}(n - L - i)x_{j}(n - L - 2i). \tag{7}
\]

For the two alternative forms of \( y_{6, L-1}(n - L + 1) \), there are different coefficient updates. Using (6a) to compute \( y_{6, L-1}(n - L + 1) \) requires that \( w_{i}^{u}(n) \) is obtained at each iteration. Using (3), this is given by

\[
w_{i}^{u}(n) = w_{i}^{u}(n - 1) + \beta e^{*}(n - L)x_{i}(n - L - i) \tag{8a}
\]

Using (6b) to compute \( y_{6, L-1}(n - L + 1) \) requires the updated coefficients \( w_{i}^{u}(n - (L - 1 - i)) \). Using (3) gives

\[
w_{i}^{u}(n - (L - 1 - i)) = w_{i}^{u}(n - 1 - (L - 1 - i)) + \beta e^{*} \cdot (n - 2L + 1 + i)x_{i}(n - 2L + 1). \tag{8b}
\]

In (8a) the previous error term is broadcast to all of the FFB sections. In (8b) the same feedback data sample \( x_{i}(n - 2L + 1) \) is broadcast to all of the FFB sections. Since the feedback data can be represented with a very short wordlength [1], broadcasting the feedback data sample is advantageous to reduce the global interconnection. It should also be noted that the reversed ordering of the filter coefficients in the FBB results in an identical error term in the update of the \( i \)th FBB coefficient in (8b) to that of the \((L - 1 - i)\)th FFF coefficient in (7). Thus, the error term can be shared between the update modules for the FFF and FFB coefficients. This is an attractive feature of the TF2 structure compared to the TF1 structure since it reduces the communication costs in the DFE. This leads to the adoption of a modular processing section for the DLMS DFE structure as shown in Fig. 4, using the update in (8b). The step size \( \beta \) is shown multiplying the gradient estimates for the FFF’s and FBF’s in multipliers M2 and M5. In practice it is likely that \( \beta \) will be constrained to be a power-of-two value. Thus, the multipliers M2 and M5 can be replaced with hardwired or programmable shifts.

The estimate of the desired response, output from the final PM stage, is quantized to the nearest constellation point by a decision device, determined by the modulation format. This quantized output is used as the reference sequence during decision-directed mode and must therefore be computed within the same clock period as the computation of the equalizer output. Assuming the computation of the error is also completed within the same clock cycle, then the clock period will be determined as \( t_{m} + 2t_{a} \), where \( t_{m} \) and \( t_{a} \) are the times required to perform a multiplication and addition, respectively. This assumes that multiplication by the step size \( \beta \) is reduced to a shift and that \( t_{a} \leq t_{s} \), where \( t_{s} \) is the time required to perform a shift. By delaying the coefficient update by a further iteration period, the error computation can be eliminated from the critical path. The critical path will then be the longer of either the time to compute the DFE output and decision, in the final stage, or the time required to perform the coefficient update, i.e., \( t_{m} + t_{a} + t_{s} \).

It is noted that since the DFE is trained using the DLMS algorithm, the convergence and stability analysis, previously developed in [3], is applicable here.

B. Pipelined Delayed NLMS DFE

To reduce the sensitivity of the LMS convergence speed on the input signal power, gain control has been suggested [6]. This can be realized by dividing the input data samples by an amount proportional to the estimated mean signal power, thereby adjusting the input signal power to within some predetermined range. However, the estimation of the received signal power, from the input data samples, imposes a delay before equalization may take place and can be unresponsive to rapid variations in the input signal statistics. For these reasons, the NLMS algorithm has been adopted which is able to achieve the same effect as AGC. In this algorithm the step size is adjusted in proportion to the inverse of the sum of the squares of the inputs to the adaptive filter [9]. By incorporating a delay in the coefficient update as before, the delayed NLMS algorithm (DNLMS) is obtained. The coefficient update for the DNLMS algorithm proposed here is given by

\[
\mathbf{W}(n) = \mathbf{W}(n - 1) + \beta(n - D)e^{*}(n - D)\mathbf{X}(n - D) \tag{9}
\]

where \( \beta(n) \) is a time-varying step size defined as

\[
\beta(n) = \frac{\beta}{\mathbf{X}^{H}(n)\mathbf{X}(n)}. \tag{10}
\]

The scaling factor \( \beta \) controls the speed of adjustment and is chosen in proportion to the length of the FFF’s and FBF’s. It will be shown below that \( \beta(n) \) can be computed order-recursively in the same manner as the error signal. Note that during startup, when the FFF is not completely full of input data samples, the variable step size will be undetermined and, hence, must be initialized to some fixed value during this period. Since the computation of the error as used in (9) is the same as in (2), it can be performed in the same way as was described in Section II-A. The coefficient updates must, however, be modified to account for the time-varying step size. For the ORF, the feedforward coefficient updates are given by

\[
w_{j}(n - i) = w_{j}(n - 1 - i) + \beta(n - L - i) \cdot e^{*}(n - L - i)x_{j}(n - L - 2i). \tag{12}
\]
It should be noted that when the FBF is full, the case of a nonamplitude modulated training sequence, $Z_{L-1}^f(n)$ is a constant equal to $L[d(n)]^2$. In such a case, the contribution to the inner product in (12) from the FBF inputs does not explicitly require computation. It thus remains to compute an update for $w_f(n-i)$. From (11), the update of $w_f(n-i)$ requires $Z_{L-1}^f(n-L-i)$. Now, from (12)

$$Z_{L-1}^f(n-i) = \sum_{k=0}^{i} |x_f(n-i-k)|^2$$  \hspace{1cm} (14)

which, in the same manner as for the data estimate calculation in (5), can be computed order recursively as

$$Z_{L-1}^f(n-i) = Z_{L-1}^f(n-i) + |x_f(n-2i)|^2. \hspace{1cm} (15)$$

This computation can thus be carried out in parallel with the other computations in the existing PM. The computation of $Z_{L-1}^f(n-i)$ can be integrated in each PM, with the value of $Z_{L-1}^f(n-L+1)$ becoming available in the last PM, from which the modified step size $\beta(n-L+1)$ is computed. It has been implicitly assumed in the above derivation that a variable step size is used in the coefficient updates of both the FFF and FBF coefficients. It is readily shown, however, that only a variable step size for the FFF is necessary to remove the dependency of the convergence speed on the received signal power. In [6] the training sequence (and hence the values of the feedback data stream) was also scaled by the same factor as the input data. However, since the feedback data inputs are normally selected from a limited alphabet of symbols, this can be readily exploited to reduce the complexity of the inner product computation. Scaling the training sequence by a variable factor would preclude this. If a fixed step size is used for the FBB updates, the term $Z_{L-1}^f(n-i)$ may be omitted from (12).

Implementing a variable step size for the FFF will severely reduce the throughput rate because of the need for a division operation. In addition, there is an increase in the interconnect cost associated with the transmission of an additional parameter (the variable step size) between each of the PM’s. To avoid this, the product of the step size and estimation error should be computed once at the output of the final DFE stage. The DFE throughput can be maintained by including an additional $M$ delays in the coefficient updates, sufficient to pipeline the division and multiplication operations. The equalizer output is then approximated as

$$y_{L-1}(n-L+1) \approx W^f(n-L-M)X(n-L). \hspace{1cm} (16)$$

Provided $M$ is reasonably small, the effect on convergence will not be significant and will eliminate the error and step-size computations from the critical path. The critical path will now be in the coefficient update with associated computational delay of $2t_m + t_a$. 

---

**Fig. 3.** Data flow in the pipelined (3,3) DLMS DFE.

**Fig. 4.** A single PM for the pipelined DLMS DFE.
C. Comparison with Other Pipelined Architectures

In a previous publication by the authors [8] and in a recent publication [9] the transposed form of a transversal filter was used in a realization of the DLMS algorithm. In [9], a factor two slowdown was employed to introduce extra delays for pipelining, in addition to delaying the coefficient adaptation. Hardware utilization efficiency was maintained by folding two operations onto each processing element. This reduced the number of multiplier elements, which dominate the total chip area. It was argued that the structure was more area efficient, for a given sampling rate, than the architectures in [4] and [10] (the structure in [10] was the first systolic array proposed for the DLMS algorithm) and, thus, also the structures used in this paper. However, the comparison made in [9] did not take into consideration some straightforward simplifications of the structure in [4] which enable a significant reduction in the sampling period. Firstly, the critical path in the filter proposed in [4] was wrongly assessed in [9], since the error computation time was combined with the time required to perform the coefficient update. As described in Section II-A, the error computation should be included in the critical path associated with the formation of the DFE output. Secondly, the multiplication of the step size with the error term can be easily factored out of the critical path, as noted in Section II-B. Finally, the provision of a multiplication for the step-size operand is excessive and, in practice, a power-of-two size with the error term can be easily factored out of the critical path, as noted in Section II-B. Finally, the provision of a multiplication for the step-size operand is excessive and, in practice, a power-of-two value is generally sufficient [10]. Using these simple modifications, the critical path delay in the adaptive filters used to construct the DFE in Section II-A is \( t_m + t_a \). The adaptive filters proposed in [9] have minimum sampling periods of \( 2(t_m + t_a) \) and, therefore, do not provide the same performance as the filters described here. This is an obvious consequence of using fewer arithmetic elements. Circuit folding [12] may be applied to the DFE structure in Fig. 4, allowing an approximate factor of two reduction in the number of arithmetic elements, at the expense of a similar reduction in the throughput.

The resulting filters provide virtually identical performance to those described in [9], i.e., the same hardware requirements for the same throughput rate. It is noted that an efficient DFE structure using a fractionally spaced FFF can be realized by the application of circuit slowdown and folding to the DFE structure in Fig. 4.

Other techniques to increase the throughput of adaptive filters have been proposed. In [13], very high throughput rate architectures were proposed, utilizing fine-grain pipelining of the arithmetic elements. Pipelining has been used here to restructure the DFE in order to ease implementation as well as increase throughput. In addition, for mobile applications (e.g., [1]), the throughput rate of the proposed pipelined DFE is adequate, while avoiding excessive increases in the convergence time (Section III) which would reduce the bandwidth efficiency. It is also noted that fine-grain pipelining of a DFE is not possible, since the DFE output must be available at the end of each iteration in order to cancel the effects of postcursor ISI. Although in [13] look-ahead was applied as a means of overcoming this inherent throughput limitation, the resulting algorithm was nonlinear and has to be considerably approximated in order to simplify implementation.

In [14], multiple DFE’s were proposed in order to increase the throughput rate by processing different sections of a transmitted frame of data. By the insertion of known symbols to correctly initialize the DFE’s, a speedup factor equal to the number of DFE’s is achievable. While this technique can, in principle, provide unlimited speedup, the hardware overhead is likely to limit the number of DFE’s actually used. The method can, however, be used in conjunction with pipelining as proposed in this paper.

III. EQUALIZER PERFORMANCE COMPARISON

The simulated performance of the pipelined DFE’s are compared using the channel models proposed in [5] and [15]. Channels 1 and 2 are given by channels “a” and “b” from [5], respectively; channel 3 is from [15]. The performance of the pipelined DFE realizing the DLMS algorithm for training is compared with the conventional LMS algorithm and the TFDFE structure discussed in Section II-A. In the convergence curves, the log of the mean-square-error, normalized with respect to the squared value of the training data, is plotted. For clarity, not all points on the convergence curves are shown. In the simulation a quadrature phase-shift keying (QPSK) modulated signal is used with root-raised-cosine filtering at the receiver and transmitter. Additive noise was added with the \( E_b/N_0 \) figure for the system fixed at 20 dB. The convergence curves were generated by averaging over 150 independent experiments.

In Fig. 5 the convergence of the (12,12) DFE structures are compared over channel 1. The DFE has been overspecified in this experiment in order to exaggerate the differences in the convergence behavior. The behavior of the DLMS and TFDFE algorithms are noticeably different. The convergence speed of the LMS algorithm is superior for the given step size 0.015. As expected, the convergence speed of the TFDFE lies between that of the DLMS and LMS algorithms. This is due to the additional delay in the filter coefficient updates used in the DLMS algorithm.

Channel 3 is a much more severe test than channel 1 since it introduces a null in the frequency response. It was found for equalization of channel 3 that a reasonable compromise between performance and convergence speed could be obtained using a (10,10) DFE structure with \( \beta = 0.01 \). The convergence behavior of the DLMS and LMS algorithms is compared in Fig. 6; the TFDFE
performance was similar to the DLMS algorithm. In these cases the reduced step size and more severe channel distortion resulted in very little performance variation between the three algorithms.

In Fig. 7, the performance of the LMS and DLMS algorithms are compared in terms of the bit-error rate (BER) over channel 2 for a (6,6) DFE (500 training symbols were allocated). Perfect decision feedback and imperfect decision feedback were compared. The performance loss in this case is only marginal, since the delay in adaptation is not very large. This is expected since, for both algorithms, convergence was reliably achieved within the 500-symbol training period.

IV. CONCLUSION

In this paper pipelined transversal filter-based DFE’s employing the DLMS training algorithm have been described. An order-recursive DFE structure was developed which allows a DFE of arbitrary length to be constructed by cascading a series of identical processing modules. Alternative filtering structures were chosen for the FFF’s and FBF’s in order to minimize the global communication. The performance of the new DFE’s were compared using simulated channels to introduce ISI and were found to be only marginally inferior to those for the conventional DFE. However, the pipelined DFE’s more than double the throughput rate of conventional structures and are very suitable for VLSI implementation. A pipelined version of the DNLMS algorithm was also proposed for a DFE, which removes the dependency of the convergence speed on the input signal power.

ACKNOWLEDGMENT

The authors would like to thank the other members at the Centre for Communications, University of Bristol, and the reviewers of this paper for their helpful suggestions and for pointing out some of the references.

REFERENCES


Intermodulation Noise Related to THD in Dynamic Nonlinear Wide-Band Amplifiers

Henrik Sjöland and Sven Mattisson

Abstract—In this brief it is shown that the power of the intermodulation noise of a wide-band amplifier with dynamic nonlinearities can be estimated by the total harmonic distortion (THD) with a sinusoid input signal of appropriate amplitude and frequency. The THD is, as opposed to the intermodulation noise, easy to measure and use as a design parameter. This brief is an extension of our paper [1], which treats static nonlinearities.

Index Terms—Distortion, intermodulation, wide-band amplifiers.

I. INTRODUCTION

In [1] it was shown that the intermodulation noise power due to a static nonlinearity can be estimated by a total harmonic distortion

Manuscript received September 5, 1996; revised July 7, 1997. This paper was recommended by Associate Editor V. Porra.

The authors are with the Department of Applied Electronics, Lund University, S-22100 Lund, Sweden (e-mail: hsd@ide.lth.se).

Publisher Item Identifier S 1057-7130(98)05062-9.