Copyright © 1997, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

## FUNCTIONAL MEASUREMENTS OF THE FIRST ANALOG INPUT/OUTPUT CNN UNIVERSAL CHIP

by

A. Zarándy, J. M. Cruz, P. Szolgay, P. Földesy, L. O. Chua, and T. Roska

Memorandum No. UCB/ERL M97/95

15 December 1997

## **ELECTRONICS RESEARCH LABORATORY**

College of Engineering University of California, Berkeley 94720

# Functional Measurements of the First Analog Input/Output CNN Universal Chip

Á. Zarándy, J. M. Cruz, P. Szolgay, P. Földesy, L.O. Chua, and T. Roska

Abstract: The measurements and functional tests of the first operational analog-input, analog-output CNN Universal chip are presented in this paper. The chip design issues are recalled. Some image processing template applications are introduced, as well as the solution of a 2D partial differential equation. Some nonideal properties of the analog VLSI realization and a possible compensation method are shown.

# 1 Introduction

Cellular Neural/Nonlinear Networks (CNN) [1,2], as mainly locally connected nonlinear dynamic processor arrays, play an increasingly important role in understanding, generating, and applying spatiotemporal phenomena [16]. The CNN Universal Machine architecture uses the CNN dynamics as the core of the first analogic (analog and logic) cellular array computer. This computer is fully stored-programmable. The first two VLSI implementations, the CNN Universal Chips designed in Seville in 1995 [11,12] and in Berkeley in 1996 [5,6], mark a breakthrough in CNN technology. With the CNN Chip Prototyping System designed in Budapest [7], we have an experimental system demonstrating close to Tera (trillion, $10^{12}$) equivalent operations per second per cm<sup>2</sup> in a stored-programmable environment. Meanwhile, the second design run in Seville resulted in a fairly robust optical-input chip with binary input/output; its measurement results are published in [12].

In this paper we report on the measurements of the first analog input/output experimental CNN Universal Chip with electrical input designed in Berkeley. We show useful completely operational templates and template sequences as well as the parametric limits of the operation. An important new design issue is presented: how to design a sequence of robust templates to implement the equivalent of a template which is beyond the robust parameter range.

Section 2 summarizes the CNN Universal Chip under consideration. In Sections 3 and 4 the measurements of some useful templates are shown for mathematical morphology and PDE simulations, respectively. Section 5 contains measurement results related to some basic circuit elements of the chip, while Section 6 discusses limits, errors, and error corrections. Conclusions are summarized in Section 7.

# 2 The CNN Universal Chip architecture and design issues

The main features of the chip are as follows:

- continuous-time analog dynamics according to the original CNN equations;
- two local analog memories;
- two local logic memories;
- full 3x3 neighborhood of local interaction with 19 analog template elements;
- column-by-column analog input;
- column-by-column logic output;
- serial analog output;
- capability of stopping the whole cell array dynamics at a specified time instant;
- capability of externally observing the internal dynamic waveforms of any selected cell.

The chip contains a 16x16 array of analog processors, which permits simultaneous processing of image blocks of 14x14 pixels. Each of these blocks is typically processed in less than 250 ns, which allows real-time manipulation of high-resolution images.



Figure 1. Diagram of the implemented transconductance-mode CNN cell using ideal circuit elements.

The high processing speed is achieved by high parallelism. The chip contains close to 5,000 multipliers (19 in each of the cells), which operate fully in parallel and perform the feedforward and feedback multiplications simultaneously. Excluding I/O operations, each cell has a peak computing power of 4 million pixels per second, and the entire chip has a peak computing capability of about one billion pixels per second when operating from a 5 V supply and consuming 0.3 W. Since one cell computes about $20+10 \times 9 \times 2=200$ equivalent digital operations, this chip has an equivalent computing power of 200 billion operations per second (OPS) on about $0.2 \text{ cm}^2$ of array area. This is about $10^{12}$ OPS per cm<sup>2</sup> using a 0.8 µm technology.
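The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the original measurements. It assumes that each cell finishes one pixel per 250 ns transient and that the 200 equivalent operations are counted per processed pixel; all names are illustrative.

```python
# Consistency check of the throughput figures quoted in the text.
# Assumptions (not stated explicitly in the source): one pixel per cell per
# 250 ns transient, and 200 equivalent operations per processed pixel.

CELLS = 16 * 16                      # 16x16 processor array
MULTIPLIERS_PER_CELL = 19            # analog template elements per cell
SETTLING_TIME_S = 250e-9             # typical block processing time
OPS_PER_PIXEL = 20 + 10 * 9 * 2      # estimate given in the text (= 200)
ARRAY_AREA_CM2 = 0.2                 # approximate array area


def throughput_summary():
    """Return the derived throughput figures as a dictionary."""
    pixels_per_s_cell = 1.0 / SETTLING_TIME_S
    pixels_per_s_chip = CELLS * pixels_per_s_cell
    ops_per_s = OPS_PER_PIXEL * pixels_per_s_chip
    return {
        "multipliers": CELLS * MULTIPLIERS_PER_CELL,
        "pixels_per_s_cell": pixels_per_s_cell,
        "pixels_per_s_chip": pixels_per_s_chip,
        "ops_per_s": ops_per_s,
        "ops_per_s_per_cm2": ops_per_s / ARRAY_AREA_CM2,
    }


if __name__ == "__main__":
    for name, value in throughput_summary().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions the multiplier count is 4864 and the chip-level figures come out slightly above the rounded values in the text (about 1.02 billion pixels per second and about $1.02 \times 10^{12}$ OPS per cm<sup>2</sup>).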

Figure 1 shows a diagram of the basic CNN chip cell. To optimize the amount of inter-cell wiring we used a transconductance approach in which the cell VCCSs are controlled by their local input or state variables, $u_{ij}$ and $x_{ij}$. The current outputs of these current sources are merged in pairs and delivered to the state nodes of the neighboring cells.

Figure 2 shows a more detailed diagram of the analog components of one of the cells. The analog memories are implemented by field-oxide capacitors; the VCCSs are implemented by a combination of nonlinear transconductance amplifiers (OTA A and B) and two arrays of current multipliers; the state resistor is implemented by a linear transconductance amplifier (OTA R), which is connected in parallel to the state capacitor during the CNN processing. The values of this state resistor and of the state capacitor determine the speed of the CNN.



Figure 2. Detailed diagram of the analog components of a cell.

Figure 3 shows a diagram of the logic components of the cell. These components are useful when running CNN algorithms that require the storage and retrieval of intermediate binary images for further processing. This local circuitry allows the output of the analog processing to be stored locally and then fed into the state or input variable of the cell. It also permits Boolean operations on stored images.



Figure 3. Diagram of the cell logic components.

# 3 Implementation of mathematical morphology: erosion, dilation, and reconstruction [8]

In this section we show the implementation of the basic binary mathematical morphological operations: erosion, dilation, and reconstruction. Each of them is implemented with a single template. For the erosion and the dilation an 'L' shaped structuring element set was used; for the reconstruction, a '+' shaped one. Erosion and dilation are local, non-propagating operators, while reconstruction is a propagating operator. The latter has a longer transient settling time. The template design methodology follows [8], and the measurements were made with the CNN chip prototyping system [7].

## 3.1 Erosion and Dilation

The erosion operator with an 'L' shaped structuring element set drives white every black pixel in an image that does not have both a black upper and a black right neighbor. The operational template was the following:

$$\mathbf{A} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 0.9 & 0 \\ 0 & 1 & 0.9 \\ 0 & 0 & 0 \end{bmatrix}, \quad Z = -2.7 \tag{1}$$

The transient settled in 250 ns. The result is correct (Figure 4).



Figure 4. Erosion with an 'L' shaped structuring element set. (a): input; (b): output.

A dilation operator with an 'L' shaped structuring element set drives black every white pixel in an image that has a black lower or a black left neighbor. The operational template was the following:

$$\mathbf{A} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 0 & 0 \\ 0.9 & 1 & 0 \\ 0 & 0.9 & 0 \end{bmatrix}, \quad Z = 2.7 \tag{2}$$

The transient settled in 250 ns. The result is correct (Figure 5).



Figure 5. Dilation with 'L' shaped structuring element set. (a): input; (b): output.

The binary opening and closing operators can be implemented by applying the erosion and the dilation operators one after the other. Figure 5b is the result of applying the opening operator to Figure 4a (an erosion followed by a dilation with the same structuring element set).
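The expected (ideal) outputs of templates (1) and (2) can be reproduced in software. The sketch below is not a model of the chip; it assumes the ideal CNN cell with a piecewise-linear output, for which a template whose only feedback element is a center value of 1 settles to the sign of $\sum B \cdot u + Z$ whenever that sum is nonzero. Black is coded as +1 and white as -1, and the cells outside the image are assumed to be white. The function names are illustrative.

```python
# Ideal steady-state output of an uncoupled template (A = center element 1).
# Assumptions: ideal cell model, black = +1, white = -1, white boundary cells.

B_EROSION = [[0, 0.9, 0], [0, 1, 0.9], [0, 0, 0]]    # template (1), Z = -2.7
B_DILATION = [[0, 0, 0], [0.9, 1, 0], [0, 0.9, 0]]   # template (2), Z = 2.7


def apply_uncoupled_template(image, b, z, boundary=-1):
    """Return sign(sum(B * u) + z) for every pixel of a +1/-1 image."""
    rows, cols = len(image), len(image[0])
    result = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            w = z
            for di in (-1, 0, 1):          # di = -1 is the upper neighbor row
                for dj in (-1, 0, 1):      # dj = -1 is the left neighbor column
                    ii, jj = i + di, j + dj
                    inside = 0 <= ii < rows and 0 <= jj < cols
                    u = image[ii][jj] if inside else boundary
                    w += b[di + 1][dj + 1] * u
            out_row.append(1 if w > 0 else -1)
        result.append(out_row)
    return result


def erode(image):
    return apply_uncoupled_template(image, B_EROSION, -2.7)


def dilate(image):
    return apply_uncoupled_template(image, B_DILATION, 2.7)


def opening(image):
    """Erosion followed by dilation with the same structuring element set."""
    return dilate(erode(image))


if __name__ == "__main__":
    square = [
        [-1, -1, -1, -1],
        [-1, 1, 1, -1],
        [-1, 1, 1, -1],
        [-1, -1, -1, -1],
    ]
    for row in opening(square):
        print("".join("#" if p == 1 else "." for p in row))
```

In this ideal model the margin between the black and white decisions is only 0.1 in template units for the worst-case neighborhoods, which gives a feel for the accuracy the analog multipliers have to provide.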

## 3.2 Reconstruction operator

The reconstruction is a propagating type operator. It is defined for a mask and a marker image. Those objects in the mask image which are marked (hit) by the marker are fully extracted. The template is the following:

$$\mathbf{A} = \begin{bmatrix} 0 & 0.7 & 0 \\ 0.7 & 3 & 0.7 \\ 0 & 0.7 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad Z = 1 \tag{3}$$

The transient settled in 2.7 µs. The result is correct (Figure 6).



Figure 6. The implementation of the reconstruction operator. The result is correct. (a): input (mask); (b): initial state (marker); (c): result.
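A digital reference for the reconstruction result can be computed as follows. This sketch does not simulate template (3); it swaps in the usual iterative definition of binary reconstruction, in which the marker is repeatedly dilated with a '+' shaped neighborhood and restricted to the mask until nothing changes. Restricting the initial marker to the mask and coding black as +1 and white as -1 are assumptions of this sketch.

```python
# Binary morphological reconstruction by iterated conditional dilation.
# Assumptions: black = +1, white = -1, '+' shaped (4-connected) neighborhood,
# and the marker is first restricted to the mask.

PLUS_NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def reconstruct(mask, marker):
    """Return the mask objects that are hit by at least one marker pixel."""
    rows, cols = len(mask), len(mask[0])
    state = [
        [1 if mask[i][j] == 1 and marker[i][j] == 1 else -1 for j in range(cols)]
        for i in range(rows)
    ]
    changed = True
    while changed:                      # propagate until a fixed point
        changed = False
        for i in range(rows):
            for j in range(cols):
                if state[i][j] == 1 or mask[i][j] != 1:
                    continue            # already black, or not part of an object
                for di, dj in PLUS_NEIGHBORS:
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols and state[ii][jj] == 1:
                        state[i][j] = 1
                        changed = True
                        break
    return state


if __name__ == "__main__":
    mask = [[1, 1, -1, 1], [1, -1, -1, 1], [-1, -1, -1, 1]]
    marker = [[1, -1, -1, -1], [-1, -1, -1, -1], [-1, -1, -1, -1]]
    for row in reconstruct(mask, marker):
        print("".join("#" if p == 1 else "." for p in row))
```

The number of passes grows with the size of the marked objects, which mirrors the longer settling time of the propagating template compared with erosion and dilation.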

# 4 One dimensional diffusion: transient response computation of a simple mechanical vibrating system [10]

The dynamics of a one-dimensional homogeneous chain system having 14 discrete components (masses) with unit damping factor were analyzed on the chip. The elements of the system can move only in the x direction. The motion of the system, without external force, can be described by fourteen variables ($x_i(t)$, $i=1,2,\ldots,14$) according to:

$$\ddot{x}_{i} = \frac{c}{m} x_{i-1} - \frac{2c}{m} x_{i} + \frac{c}{m} x_{i+1} - \dot{x}_{i} \tag{4}$$

Let us suppose that m=1, c=1, the initial displacements of the masses are given, and the initial velocities of the masses are zero.
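A numerical reference solution of (4) is useful for judging the chip output. The following is a minimal sketch, not the calculation used in the original work: it integrates (4) with a semi-implicit Euler step, and it assumes that both ends of the chain are fixed ($x_0 = x_{15} = 0$), which the text does not state. The step size, the helper names, and the energy function (valid for m=1, c=1) are illustrative choices.

```python
# Reference integration of the 14-mass chain of equation (4).
# Assumptions: fixed ends (x_0 = x_15 = 0), semi-implicit Euler time stepping,
# zero initial velocity, m = c = 1 unless c_over_m is changed.


def accelerations(x, v, c_over_m=1.0, damping=1.0):
    """Right-hand side of (4) for every mass."""
    n = len(x)
    acc = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0          # fixed left end
        right = x[i + 1] if i < n - 1 else 0.0     # fixed right end
        acc.append(c_over_m * (left - 2.0 * x[i] + right) - damping * v[i])
    return acc


def simulate_chain(x0, t_end, dt=1e-3, damping=1.0):
    """Return displacements and velocities at time t_end."""
    x = [float(value) for value in x0]
    v = [0.0] * len(x)
    for _ in range(int(round(t_end / dt))):
        acc = accelerations(x, v, damping=damping)
        v = [vi + dt * ai for vi, ai in zip(v, acc)]   # update velocity first
        x = [xi + dt * vi for xi, vi in zip(x, v)]     # then displacement
    return x, v


def energy(x, v):
    """Total mechanical energy for m = c = 1 with fixed ends."""
    padded = [0.0] + list(x) + [0.0]
    potential = 0.5 * sum((b - a) ** 2 for a, b in zip(padded, padded[1:]))
    kinetic = 0.5 * sum(vi * vi for vi in v)
    return potential + kinetic


if __name__ == "__main__":
    start = [0.0] * 14
    start[6] = 1.0                                     # displace one mass
    for d in (0.0, 1.0):
        x, v = simulate_chain(start, t_end=10.0, damping=d)
        print(f"damping={d}: energy at t=10 is {energy(x, v):.4f}")
```

With zero damping the energy stays close to its initial value, while with unit damping it decays, which is the qualitative behavior the chip measurements are compared against.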

The dynamics of continuous mechanical vibrating systems are described by a partial differential equation (the Lamé equation). Such a system can be modeled by a 1D cellular neural network; the limits of this approach are discussed in [9,10]. The second-order ODEs of the discretized mechanical system can be implemented on a CNN by two coupled 1D layers with bias terms $z_1 = z_2 = 0$.

In the case of a linear mechanical vibrating system with one degree of freedom, the following two-layer coupled CNN is used:

$$\frac{dv_{xij}^{1}(t)}{dt} = -v_{xij}^{1}(t) + \sum_{kl \in N_{r}(ij)} \mathbf{A}_{ij;kl}^{21} v_{ykl}^{2} + \sum_{kl \in N_{r}(ij)} \mathbf{B}_{ij;kl}^{11} v_{ukl}^{1}; \quad \frac{dv_{xij}^{2}(t)}{dt} = \sum_{kl \in N_{r}(ij)} \mathbf{A}_{ij;kl}^{12} v_{ykl}^{1} \tag{5}$$

It is supposed that  $v_{yij} = f(v_{xij}) = v_{xij}$ .

The following templates were used:

$$\mathbf{A}_{12} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{A}_{21} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & -2 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{B}_{11} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \tag{6}$$
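The link between (4) and the two-layer CNN of (5) and (6) can be written out explicitly. The following is a sketch under our reading of the templates, not a statement from the source: layer 2 is taken to hold the displacement and layer 1 the velocity, with $c/m = 1$, unit damping, linear output, and only the index $i$ along the chain written.

$$
\begin{aligned}
v_{x,i}^{2} &= x_i, \qquad v_{x,i}^{1} = \dot{x}_i, \\
\frac{dv_{x,i}^{2}}{dt} &= v_{x,i}^{1}, \\
\frac{dv_{x,i}^{1}}{dt} &= -v_{x,i}^{1} + \left(v_{x,i-1}^{2} - 2 v_{x,i}^{2} + v_{x,i+1}^{2}\right) + v_{u,i}^{1}.
\end{aligned}
$$

With zero input ($v_{u,i}^{1} = 0$), substituting the first two lines into the third reproduces (4) for $m = c = 1$: $\ddot{x}_i = x_{i-1} - 2x_i + x_{i+1} - \dot{x}_i$. The input term of layer 1 plays the role of the external force.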

The two one-dimensional layers were implemented sequentially on the two-dimensional chip, using the chip's ability to stop the analog transient. The outputs at different time instants were measured in the CNN Chip Prototyping System (CCPS) [7] by running the transient repeatedly and measuring the output at a different time instant in each run.

The displacement responses to a given initial displacement of the masses were computed. If the initial displacement was not large enough, the signals were not distinguishable from the space-variant bias, and the result was even qualitatively incorrect (Figure 7a and Figure 7c). Because the exciting force (the input) was zero in this application, the bias compensation could be downloaded to the input. Using this method, the displacement fields of the chain system were determined for the zero damping and unit damping cases (Figure 7b and Figure 7d). The outputs were within 20% of the calculated values.




Figure 7. The transient response of the chain system with zero damping component (a) without bias compensation and (b) with bias compensation. The transient response of a chain system with unit damping component (c) without bias compensation and (d) with bias compensation.

Without bias compensation, no damping effect can be seen in Figure 7c even though unit damping was used. With bias compensation, the displacements of the masses tend to zero in Figure 7d.

# 5 Measurements of some basic functional circuit elements of the CNN Universal Chip

In this section, we describe the results of four characteristic measurements: the transfer characteristics, the analog storage time, the characteristics of the template elements (multipliers), and the time constant of the cells.

## 5.1 Measurement of the input-output transfer characteristics

An image was loaded into the STATE local analog memory, and the OUTPUT was read out without performing any analog operation. Each line of the image contained a ramp (a linear series of pixel values from -1 to 1) with a resolution of 0.1. Figure 8 shows the transfer characteristics of all the cells in the same graph. As the graph shows, the curve is sigmoid rather than linear.



Figure 8. The input-output characteristic of some cells.

## 5.2 Measurement of the storage time of the local analog memories

Here the storage time of the local analog memories was measured. The same image was loaded into the STATE memory and read out at successive time instants, every 20 ms. The image contained a range of different values. Figure 9 shows the measured results. From the measured data we determined that the chip can store an image with 5% accuracy for at least 1 s.



Figure 9. The signal degradation in the analog memories. The memories can store the values with 5 % accuracy at least for 1 second.

## 5.3 Measurement of the accuracy distribution of the template elements

The accuracy distribution of the template coefficients was measured. Of the many results, we show here only the accuracy distribution of the central element of the **B** template ($b_{00}$). In the experiment we fixed all the template values except the bias current ($z$) and $b_{00}$; these two values were the free parameters of the measurement. We varied them independently over their total dynamic range in 32 steps each, which gives 1024 measurement points in the parameter space. The input and the initial state were set to +1. After the transient settled, the number of +1 pixels was counted and divided by the total number of pixels. We call this result the Ratio Of Black pixels (ROB). Its value varies from 0% (all pixels white) to 100% (all pixels black). It describes the accuracy distribution of the bias current ($z$) and $b_{00}$. Ideally, the ROB surface should consist of two equal trapezoids (one at the 100% level and the other at the 0% level) with a vertical drop between them (Figure 10), because if $b_{00}$ is larger than $z$ all the pixels should be in the black region and hence the ROB equals 100%.

The ROB at the 1024 measured points can be seen in Figure 11a. The graph resembles a waterfall. Figure 11b shows the cross-section of the waterfall at line a.
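The ROB measure and the ideal surface can be sketched in a few lines. The code below only illustrates the definitions; it takes the condition stated in the text ($b_{00}$ larger than $z$ gives 100%) at face value, and the sweep range of -3 to 3 for both parameters is an assumption, since the text gives this limit only for template elements. The function names are illustrative.

```python
# Ratio Of Black pixels (ROB) and the ideal step-shaped ROB surface.
# Assumptions: black = +1, both parameters swept from -3 to 3 in 32 steps,
# ideal condition "b00 larger than z" taken directly from the text.


def ratio_of_black(output):
    """Percentage of +1 (black) pixels in a settled output image."""
    pixels = [p for row in output for p in row]
    return 100.0 * sum(1 for p in pixels if p == 1) / len(pixels)


def ideal_rob(b00, z):
    """Ideal ROB: every pixel black when b00 > z, every pixel white otherwise."""
    return 100.0 if b00 > z else 0.0


def sweep(lo=-3.0, hi=3.0, steps=32):
    """Return (b00, z, ideal ROB) for every point of the parameter grid."""
    values = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    return [(b00, z, ideal_rob(b00, z)) for b00 in values for z in values]


if __name__ == "__main__":
    grid = sweep()
    black_points = sum(1 for _, _, rob in grid if rob == 100.0)
    print(f"{len(grid)} grid points, {black_points} with ideal ROB = 100%")
```

On the real chip the ROB at a grid point near the transition takes intermediate values because the cells switch at slightly different parameter values; the width of that transition is what the measurement characterizes.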



Figure 10. The ideal ROB graph.



Figure 11. The measurement results of the accuracy distribution of the central B template element ($b_{00}$). (a): the ratio of black pixels as a function of $b_{00}$ and $z$; (b): the cross-section of (a) at the $b_{00}=0$ plane, indicated with a bold line in (a).

## 5.4 Measurement of the time constant of a cell ( $\tau_{CNN}$ )

We used an indirect measurement. A CCD template was applied and its propagation time was measured. From this measurement the time constant $\tau_{CNN}$ was calculated to be 97 ns.

# 6 Limits, errors and error corrections by using template sequences

The limited accuracy of the multipliers of this chip confines the range of implementable templates. Our experiments showed that the smaller the values in a given template, the more difficult (or even impossible) it is to implement on the chip. Linear scaling of the template values is possible only within the template range of the chip. (In this chip the absolute values of the template elements should not exceed 3.)

The experiments showed that templates worked correctly when all their nonzero values were larger or smaller than a certain value, that is, sufficiently far from zero (see templates (1), (2), and (3)). Several operations, however, require templates that do not satisfy this requirement. Such templates can be decomposed into a few operational ones.

Here we consider the Local Concave Places Finder template [15]. The template drives a pixel black if it finds a neighbor configuration according to Figure 12(a), (b), or (c). In other words, the template seeks those pixels which have a black eastern <u>and</u> a black western <u>and</u> a white southern neighbor, <u>and</u> a black south-eastern <u>or</u> a black south-western neighbor.



Figure 12. The Local Concave Places Finder template drives a pixel black if it finds either (a) or (b) or (c) binary pattern around it.

The template was beyond the accuracy limit of the chip. The best version, which still produced some faults, was the following:

$$\mathbf{A} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 0 & 0 \\ 0.9 & 0 & 0.9 \\ 0.5 & -0.9 & 0.5 \end{bmatrix}, \quad Z = -3 \tag{7}$$





The template can be decomposed into two templates and a local logic OR operation. The two templates are as follows:

$$\mathbf{A} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 0 & 0 \\ 0.9 & 0 & 0.9 \\ 0 & -0.9 & 0.9 \end{bmatrix}, \quad Z = -3 \tag{8}$$

$$\mathbf{A} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 0 & 0 \\ 0.9 & 0 & 0.9 \\ 0.9 & -0.9 & 0 \end{bmatrix}, \quad Z = -3 \tag{9}$$

The first extracts those pixels which are surrounded by a configuration shown in Figure 12(a) or (c); the second does the same for Figure 12(b) or (c). The pixel-by-pixel logic OR combination of the two subresults gives the correct final result (Figure 14). Hence a simple piece of CNN software implements the original faulty template in a fault-tolerant way.



Figure 14. The Local Concave Places Finder template decomposed into a template sequence. (a): the result of the first template; (b) the result of the second template; (c) the pixel-by-pixel logic OR combination of the previous two results, this gives the correct final result.
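The logic of the decomposition can be checked exhaustively in software. The sketch below is not a chip simulation: it assumes the ideal cell model, in which a template with a single center feedback value of 1 settles to black exactly when $\sum B \cdot u + Z > 0$, and it reads the verbal rule as "west and east black, south white, and (south-east or south-west black)". Black is coded as +1 and white as -1; the names are illustrative.

```python
# Exhaustive check that templates (8) and (9), combined with a logic OR,
# implement the Local Concave Places Finder rule in the ideal cell model.
# Assumptions: black = +1, white = -1, ideal threshold behavior.
from itertools import product

B8 = [[0, 0, 0], [0.9, 0, 0.9], [0, -0.9, 0.9]]    # template (8)
B9 = [[0, 0, 0], [0.9, 0, 0.9], [0.9, -0.9, 0]]    # template (9)
Z = -3


def template_output(b, z, nb):
    """True (black) if the ideal cell settles to +1 for neighborhood nb."""
    w = z + sum(b[r][c] * nb[r][c] for r in range(3) for c in range(3))
    return w > 0


def concave_rule(nb):
    """Verbal rule from the text, with row 2 of nb as the southern row."""
    west, east = nb[1][0] == 1, nb[1][2] == 1
    south = nb[2][1] == 1
    south_west, south_east = nb[2][0] == 1, nb[2][2] == 1
    return west and east and not south and (south_east or south_west)


def decomposition_matches_rule():
    """Compare OR of (8) and (9) with the rule on all 512 neighborhoods."""
    for values in product((-1, 1), repeat=9):
        nb = [list(values[0:3]), list(values[3:6]), list(values[6:9])]
        combined = template_output(B8, Z, nb) or template_output(B9, Z, nb)
        if combined != concave_rule(nb):
            return False
    return True


if __name__ == "__main__":
    print("decomposition matches rule:", decomposition_matches_rule())
```

Each of the two sub-templates has four nonzero **B** entries of equal magnitude 0.9, so in the ideal model a template fires only when all four of its conditions hold. This equal-magnitude form is the kind of template that Section 6 reports as working reliably on the chip.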

# 7 Conclusion

A quite aggressive VLSI design, in speed and area, of the CNN Universal Machine architecture was fabricated via the MOSIS service. It has analog input-output and programmed transient timing. The first measurements summarized in the paper clearly show the feasibility of the technology. By determining the parametric limitations and using a new design technique we can overcome the problems coming from the parameter deviations.

As to the latter issue, the stored programmability of the architecture and the chips provides new features. In addition to solving complex image analysis, processing, and recognition tasks at ultra-high speed, (i) fault-tolerant programming and (ii) internal functional testing can be achieved by correct software control.

# References

- [1] L.O. Chua and L. Yang, "Cellular Neural Networks: Theory and Applications", *IEEE Transactions on Circuits and Systems*, vol. 35, no. 10, October 1988, pp. 1257-1290.
- [2] L.O. Chua and T. Roska, "The CNN Paradigm", *IEEE Transactions on Circuits and Systems I*, vol. 40, no. 3, March 1993, pp. 147-156.
- [3] T. Roska and L.O. Chua, "The CNN Universal Machine: An Analogic Array Computer", *IEEE Transactions on Circuits and Systems II*, vol. 40, March 1993, pp. 163-173.
- [4] L.O. Chua, T. Roska, T. Kozek, and Á. Zarándy, "CNN Universal Chips Crank Up the Computing Power" IEEE Circuits and Devices, pp.18-28, July, 1996.
- [5] J.M. Cruz, L.O. Chua, and T. Roska, "A Fast, Complex and Efficient Test Implementation of the CNN Universal Machine", Proc. of the third IEEE Int. Workshop on Cellular Neural Networks and their Application (CNNA-94), pp. 61-66, Rome Dec. 1994.
- [6] J.M. Cruz, "VLSI implementation of Cellular Neural Networks", Thesis (Ph.D. in Electrical Engineering and Computer Science), University of California, Berkeley, December 1996.
- [7] T. Roska, P. Szolgay, Á. Zarándy, P. L. Venetianer, A. Radványi, T. Szirányi "On a CNN Chip Prototyping System" Proc. CNNA-94, Rome, pp.375-380. 1994.
- [8] Á. Zarándy, A. Stoffels, T. Roska, F. Werblin, and L.O. Chua, "Implementations of Binary and Grayscale Mathematical Morphology on the CNN Universal Machine", Univ. of Cal. Berkeley, memo No. UCB-ERL 96/19, 1996; accepted to Int. J. CTA.
- [9] P. Szolgay, G. Vörös, and Gy. Erőss, "On the applications of the Cellular Neural Network (CNN) paradigm in mechanical vibrating systems", IEEE Transactions on Circuits and Systems, vol. 40, pp. 222-227, 1993.
- [10] P. Szolgay and T. Szabó, "On the detection of critical frequencies of a tool machine", Proc. of IEEE CNNA'96, pp. 357-362, Seville, 1996.
- [11] R. Dominguez-Castro, S. Espejo, A. Rodriguez-Vazquez, and R. Carmona, "A CNN Universal Chip in CMOS Technology", Proc. of the third IEEE Int. Workshop on Cellular Neural Networks and their Application (CNNA-94), pp. 91-96, Rome, 1994.
- [12] R. Dominguez-Castro, S. Espejo, A. Rodriguez-Vazquez, R. Carmona, P. Földesy, Á. Zarándy, P. Szolgay, T. Szirányi, and T. Roska, "A 0.8µm CMOS 2D Programmable Mixed-Signal Focal-Plane Array Processor with On-Chip Binary Imaging and Instruction Storage", accepted for publication in IEEE J. Solid-State Circuits, 1997.
- [13] Proceedings of the International Workshop on Cellular Neural Networks and their Applications: CNNA-90, Budapest, 1990; CNNA-92, Munich, 1992; CNNA-94, Rome, 1994; CNNA-96, Seville, 1996.
- [14] Special Issues on Cellular Neural Networks: Int. J. CTA, Sep-Oct 92, Jan-Feb 96, May-June 96.
- [15] "CNN Software Library, (Templates and Algorithms)", edited by T. Roska, L. Kék, L. Nemes, and Á. Zarándy, Comp. and Auto. Ins. of the Hung. Acad. of Sci., DNS-1-1997, Budapest, 1997.
- [16] L.O. Chua, M. Hasler, G. Moschytz, and J. Neirynck, "Autonomous Cellular Neural Networks: A Unified Paradigm for Pattern Formation and Active Wave Propagation", IEEE Transactions on Circuits and Systems I, vol. 42, no. 10, pp. 559-577, October 1995.