### Copyright © 1994, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

# PERFORMANCE ANALYSIS AND OPTIMIZATION OF MIXED ASYNCHRONOUS SYNCHRONOUS SYSTEMS

by

J. Teich, S. Sriram, L. Thiele, and M. Martin

Memorandum No. UCB/ERL M94/95

30 November 1994

# PERFORMANCE ANALYSIS AND OPTIMIZATION OF MIXED ASYNCHRONOUS SYSTEMS

by

J. Teich, S. Sriram, L. Thiele, and M. Martin

Memorandum No. UCB/ERL M94/95
30 November 1994

#### **ELECTRONICS RESEARCH LABORATORY**

College of Engineering University of California, Berkeley 94720

# Performance Analysis and Optimization of Mixed

#### Asynchronous Synchronous Systems

J. Teich, S. Sriram

Department of EECS

University of California at Berkeley

Berkeley, CA 94720

email: teich@hoff.eecs.berkeley.edu

L. Thiele, M. Martin
Institute on Microelectronics
University of Saarland
D-66041 Saarbrücken, Germany
email: thiele@ee.uni-sb.de

November 30, 1994

#### Abstract

This paper deals with the system-level performance analysis and optimization of a class of digital systems we call mixed asynchronous – synchronous systems. In such a system, each computation module is either synchronous (i.e., clocked) or asynchronous (i.e., selftimed). The communication between all the modules is assumed to be selftimed. In order to adequately describe the timing of such architectures, we introduce a graph model called MASS which is based on several extensions of the model of timed marked graphs. The first extension is that the node set V is partitioned into synchronous nodes  $V_S$  and selftimed nodes  $V_A$ . Another extension is to specify additional schedule constraints on synchronous nodes: A synchronous node can only fire at ticks of its local module clock. Based on these extensions, we analyze the behavior of MASS, in particular period, periodicity and

maximal throughput rate. Our results are bounds on the maximal throughput rate of a MASS. These can be computed in polynomial time. For several interesting cases (e.g.,  $V = V_A$  selftimed system,  $V = V_S$  globally asynchronous locally synchronous system), the maximal throughput rate is determined exactly using these bounds.

Finally, we introduce the optimization problem of assigning appropriate clock phases to synchronous nodes so to maximize the throughput rate of the resulting system. An exact solution as well as a polynomial time algorithm for nearly optimal phase assignment are presented. We claim that the MASS model is valuable for future generations of system-level CAD tools due to its simplicity and due to the proven efficiency of performance analysis and optimization algorithms.

#### 1 Introduction

This paper is concerned with the performance analysis and optimization of a class of digital systems we call mixed asynchronous – synchronous systems. In such a system, each module is either synchronous, i.e., synchronously clocked <sup>1</sup> or asynchronous, i.e., selftimed. <sup>2</sup> The communication between all the modules is assumed to be selftimed.

#### 1.1 Motivation

In pure synchronous systems, a periodic clock signal is used to control its state. The advantages of synchronous design styles can be summarized as follows:

#### • Maturity of existing CAD design tools

<sup>&</sup>lt;sup>1</sup>Synchronously clocked circuit modules compute functions by separating stages of combinational logic with latches or registers that are clocked with a globally distributed clock.

<sup>&</sup>lt;sup>2</sup>Asynchronous systems do not employ a global clock for enforcing system activities. In case they use *handshakes* to sequence operations, they are called *selftimed* [Sei80].

- Hazard avoidance and elimination
- No circuit overhead due to handshaking, completion signal generation, etc.

Today, architectures that consist of a combination of dedicated circuits, (i.e., custom VLSI circuits) and general purpose components (e.g., memory blocks, A/D and D/A converters, DSPs, etc.) are becoming increasingly attractive and can be built at relatively low costs. Such systems are often referred to as embedded systems (see, e.g., [GVNG94] and references therein). Very often, they contain asynchronous components such as selftimed circuits for the following reasons:

#### • Low power

In asynchronous circuits, precharging and discharging of transistors occurs only in those portions of the circuit that are currently involved in an operation.

• Average-case instead of worst-case performance
In a synchronous system, the maximal clock rate is determined by the slowest computation module.

#### • Reliability

The communication based on a selftimed handshake is reliable due to paradigm of *delay-insensitivity* [MFR85, Mar89, BS89]. Hence, physical design variations may have no influence on the correctness of the circuit. Also, a better technology migration is possible.

#### • Easing of global timing issues

High speed synchronous subsystems must be designed carefully in order to function correctly. Typical problems are *clock skew* and correct critical path estimation.

• Popularity in todays systems

Most of today's digital systems have asynchronous interfaces (e.g., interrupt handling).

For a more complete summary of the advantages of asynchronous systems, and for an overview of the state of the art in design methodologies

for asynchronous circuit design, we refer to the collection of excellent papers edited in [GB93, MM94].

However, at a certain point in the design hierarchy, the communication costs for building a selftimed circuit (e.g., for dual-rail coding [Sei80], differential logic [MBM89], handshake logic) are not justified any more by the possible gains. As a consequence, there is an optimum ground for the structure of a system where we do not have a monolithic synchronous system, but a mixture of asynchronous and synchronous design styles. Some of these architectures are also referred to as globally asynchronous locally synchronous systems (GALS) ([Sha84], [WB93]). For a discussion on advantages and disadvantages, see [GJ93]. Aspects of describing such systems can be found in [Sei80], [Sha84], [Sub91], and [WB93]. Whereas [Sei80] and [Sha84] describe implementation aspects when building globally asynchronous locally synchronous systems, the approach presented in [WB93] proposes a language oriented approach to the design of globally asynchronous locally synchronous systems. There, the goal is to derive a circuit from the high level language SIGNAL. Finally, the work described in [WTWL94] concentrates on synthesizing a synchronous finite state machine from a mixed synchronous/asynchronous state graph with the same behavior.

#### 1.2 Goals

Unfortunately, none of the work described above proposes a model that enables the exact timing behavior of a system containing synchronous and asynchronous modules. Here, we are concerned with an exact analysis of performance, e.g., in determining the minimal period (or maximal throughput rate) achievable by such systems. Primarily, our goal is to generate a timing model that satisfies the following requirements:

- Simplicity and Methodology Independence
  In order to be amenable for CAD, the model should be simple and not focus one one particular implementation style or design methodology (e.g., circuit delay model).
- Efficiency
  For the same reasons as above, we would like to investigate efficient

techniques for analysis and optimization, i.e., polynomial time algorithms, if possible.

#### • Exactness

The model should mirror the exact timing of mixed asynchronous – synchronous systems.

#### • Generality

It should be able to analyze and optimize GALS (all nodes are locally synchronous) or purely selftimed architectures (all nodes are asynchronous nodes) as special cases.

The theory of mixed asynchronous – synchronous systems as introduced here combines aspects of asynchronous as well as synchronous performance analysis. In the realm of synchronous architecture design, Leiserson et al. ([LRS83]) have developed a theory for analysis and optimization of signal flow graphs (SFG).

In the domain of asynchronous systems, graph models such as timed Petri net models have been investigated for analysis of performance [Ram74, RH80, Bur90]. In case of deterministic computation times, the corresponding Petri net is decision-free and can be represented by a marked graph [CH71], for a classification see e.g., [Pet81]. In [Rei68] and [RH80], it is shown that under certain conditions, systems modeled by marked graphs have an asymptotically periodic behavior and that the minimal period of such a system is given by the maximal cycle mean. A detailed analysis of this class of event systems is contained in [BCOQ92].

Based on these results, we will introduce a graph model called MASS (mixed asynchronous – synchronous systems) which is an extended timed marked graph model. The first extension is that the node set of a MASS is a partition of asynchronous and synchronous nodes. Also, in contrast to marked graphs in which a computation module can commence its operation if all incoming arcs contain valid data, a synchronous node in an MASS can only start or finish its computation at a tick of its local module clock.

#### 1.3 Overview

In section 2, we will formally define the MASS model. In particular, definitions and results concerned with timed marked graphs are reviewed first

(e.g., model of computation, scheduling, throughput rate). In section 3, we analyze MASS graphs in terms of maximal throughput rates. It turns out that the maximal throughput rate of a MASS with node set V can be determined in polynomial time for most of the interesting cases, e.g., for selftimed systems  $(V = V_A)$ , and GALS  $(V = V_S)$ . In section 4, the problem of optimal phase assignment is introduced. For systems, where all synchronous modules are driven by the same clock oscillator, clock signals may have a significant clock skew, hence different modules have different clock phases. Phase assignment addresses the optimization problem of maximizing the throughput of a system by adjusting the clock phases appropriately. An exact solution as well as a nearly optimal phase assignment algorithm with polynomial runtime are presented. The resulting set of node phases may serve as input to placement and clock routing CAD tools. Finally, in section 5, we show how realistic implementation issues may be expressed in the MASS model, e.g., finite buffering, and interfacing of synchronous and selftimed hardware (e.g., stoppable clocks). We conclude that the MASS model is valuable for future generations of system-level CAD tools, where mixtures of synchronous and selftimed design styles will be typically encountered.

#### 2 Models of computation

#### 2.1 Marked graphs and their unfolding

The main purpose of this section is to introduce some notation and to recall the computational model associated with marked graphs, because this is the basis for the extensions described in this paper. For more details, see e.g., [Pet81, BCOQ92] and the references therein.

**Definition 1 (Marked Graph)** A timed marked graph G = (V, A, d, h) denotes a directed graph with

- nodes  $V = \{v_1, v_2, \cdots, v_{|V|}\},\$
- arcs  $A = \{a_1, a_2, \dots, a_{|A|}\}$ , where any arc is an ordered pair of nodes  $a_p = (v_i, v_j)$ ,

• a distance function  $d: A \to Z_{\geq 0}$  and a weight function  $h: A \to R_{\geq 0}$ <sup>3</sup> We use  $d_{ij}$  and  $h_{ij}$  as short hand notations for the distance  $d(v_i, v_j)$ and weight  $h(v_i, v_j)$  associated with arc  $(v_i, v_j)$ , respectively.

A node of G represents a computation module. Each arc  $a_p = (v_i, v_j)$  represents a queue of data directed from the computation module assigned to  $v_i$  to the computation module assigned to  $v_j$ . Thus the computation module  $v_i$  places the results of its calculation onto arc  $a_p$  and the data on  $a_p$  are available as inputs to the computation module  $v_j$ . The distances and weights in Definition 1 are interpreted as follows:

- $d_{ij}$  denotes the initial number of data (tokens) on arc  $a_p = (v_i, v_j)$ .
- $h_{ij}$  denotes the holding time of a token in the queue associated with arc  $(v_i, v_j)$ .

These quantities are related to the token game one can play on timed marked graphs: If a node  $v_i$  fires, then the first token of each queue ending in  $v_i$  is removed and one token enters each queue originating from  $v_i$ . The transfer is assumed to take zero time.

It remains to be addressed when such an event can take place. To this end, we say that a node can fire only if it is enabled. Now, a token must spend the holding time  $h_{ij}$  in the queue from  $v_i$  to  $v_j$  before contributing to the enabling of the downstream node  $v_j$ . The firing of a node can't take place before all tokens already being in the queues contribute to the enabling. A node need not fire immediately after it is enabled. Many other essentially equivalent notions of marked graphs are possible, see e.g., [Rei68, Pet81, BCOQ92] and the references therein.

Example 1 As an example, consider the marked graph on the left hand side of Fig. 1. It is characterized by  $V = \{v_1, v_2\}$ ,  $A = \{a_1, a_2, a_3\}$ ,  $a_1 = (v_1, v_1)$ ,  $a_2 = (v_1, v_2)$ ,  $a_3 = (v_2, v_1)$ ,  $d_{11} = 1$ ,  $d_{12} = 0$ ,  $d_{21} = 2$ ,  $h_{11} = 2.5$ ,  $h_{12} = 3.5$  and  $h_{21} = 2$ . One possible evolution of the marked graph is shown for three time instances. At time t = 0, all tokens are placed onto the arcs. At time t = 2, both tokens in arc  $(v_2, v_1)$  contribute to the enabling of  $v_1$ . At time t = 2.5 the node is enabled because the holding time of the token in arc  $(v_1, v_1)$ 

<sup>&</sup>lt;sup>3</sup>Let Z stand for the set of integer numbers, and let R stand for the set of real numbers.

is elapsed. Now, a token transfer may take place. If the firing happens at t = 2.5, the new situation shown in the middle graph results.



Figure 1: Example of a timed marked graph and its evolution

To be more precise about the above model, e.g., with respect to the initial conditions and the properties of the queues, the concept of (event)—unfolding is introduced which represents the set of all evolutions of a timed marked graph and includes information on the corresponding scheduling. Most of the results described in this paper will be derived using the notion of unfolding.

**Definition 2 (Unfolding)** The unfolding of a timed marked graph G = (V, A, d, h) is an infinite directed graph  $\Sigma(G) = (V_{\Sigma}, A_{\Sigma}, h_{\Sigma})$  where

- 1.  $V_{\Sigma}$  contains nodes  $v_i(k)$  for all  $v_i \in V$  and for all integers  $k > -\max\{d_{ij} : v_j \in V \text{ such that } (v_i, v_j) \in A\}$ ,
- 2.  $A_{\Sigma} = \{(v_i(k-d_{ij}), v_j(k)) : (v_i, v_j) \in A \land k \in \mathbb{Z}_{>0}\},$
- 3. to each arc in  $A_{\Sigma}$  the weight of the corresponding arc in G, i.e.,  $h_{\Sigma}$ :  $A_{\Sigma} \to R$  with  $h_{\Sigma}(v_i(k-d_{ij}), v_j(k)) = h_{ij}$  for all  $(v_i(k-d_{ij}), v_j(k)) \in A_{\Sigma}$  is assigned.

A node  $v_i(k)$  of the unfolding represents the kth firing of node  $v_i$  in the marked graph. An arc from  $v_i(k)$  to  $v_j(l)$  expresses the fact that the lth firing of node  $v_i$  can take place only after the kth firing of node  $v_i$ . The nodes  $v_i(k)$  for  $k \leq 0$  represent the initial conditions of the marked graph, i.e., the placement of  $d_{ij}$  tokens into the queue corresponding to arc  $(v_i, v_j)$ .

The scheduling of a timed marked graph that has been introduced informally, can now be described more precisely. An admissible schedule is defined as follows:

**Definition 3 (Admissible Schedule)** An admissible schedule function  $\sigma: V_{\Sigma} \to R_{>0}$  satisfies:

1. 
$$\sigma_i(k) = 0$$
 for all  $v_i(k) \in V_{\Sigma} \land k \leq 0$ ,

2. 
$$\sigma_j(k) \geq \sigma_i(k - d_{ij}) + h_{ij}$$
 for all  $(v_i(k - d_{ij}), v_j(k)) \in A_{\Sigma}$ 

where  $\sigma(v_i(k))$  is written  $\sigma_i(k)$ .

Here,  $\sigma_j(k)$  denotes the time when node  $v_j$  fires for the kth time. From the interpretation of a marked graph it should be obvious that this can take place only if the corresponding input tokens are in the queues  $(v_i, v_j)$  for at least time  $h_{ij}$ . As these tokens originate from the  $(k - d_{ij})$ th firings of nodes  $v_i$ , an admissible schedule of node  $v_j$  satisfies  $\sigma_j(k) \geq \sigma_i(k - d_{ij}) + h_{ij}$ . The first condition in Definition 3 serves to consider the initial timing conditions of the token game. All initial tokens are placed into the queues at time 0.

An example explaining this model of computation is given next.

**Example 2** Consider the marked graph shown on the left hand side of Fig. 1 again. The corresponding unfolding is shown in Fig. 2. Obviously, the graph contains for each node of the marked graph  $v_i$  an infinite number of nodes  $v_i(k)$  corresponding to its k firings. The weights assigned to the arcs correspond to the holding times  $h_{ij}$ . If the nodes fire as soon as they are enabled, the following admissible schedule (called free schedule) is obtained:

$$\sigma_1 = \begin{pmatrix} 0 & 2.5 & 5 & 8 & 10.5 & \dots \end{pmatrix}$$

$$\sigma_2 = \begin{pmatrix} 0 & 0 & 6 & 8.5 & 11.5 & 14 & \dots \end{pmatrix}$$

Therefore, nodes  $v_1$  and  $v_2$  fire at times  $t=2.5, 5, 8, 10.5, \ldots$  and  $t=6, 8.5, 11.5, 14, \ldots$ , respectively. This result may be compared to the representation in Fig. 1.



Figure 2: Unfolding of the marked graph shown in Fig. 1

#### 2.2 Maximal-rate schedules of marked graphs

For obvious reasons, one is interested in a schedule of a given marked graph that lead to the maximal throughput rate. This throughput rate expresses the average time interval between two successive firings of any node of maximal-rate schedules. This observation leads to the following definition:

Definition 4 (Maximal-Rate Schedule) The rate R of an admissible schedule  $\sigma$  is defined as  $R(\sigma) = \frac{1}{P(\sigma)}$  with the average period

$$P(\sigma) = \max \left\{ \lim_{k \to \infty} \frac{\sigma_i(k)}{k} : v_i \in V \right\}$$

The maximal rate  $R_{max}$  is defined by  $R_{max} = \frac{1}{P_{min}}$  with the minimal average period

$$P_{min} = \min \{P(\sigma) : \sigma \text{ is an admissible schedule } \}$$

In order to compare the timing of different realizations, we are particularly interested in periodic admissible schedules, because in real-time systems for digital signal processing, for example, one is very often faced with periodic signals. Also, a finite state controller can only enforce a periodic schedule.

**Definition 5 (L-Periodic Schedule)** An admissible schedule of a marked graph G = (V, A, d, h) is L-periodic with  $L \in \mathbb{Z}_{>0}$  and period  $P \geq 0$ , if for all  $v_i \in V$ 

$$\sigma_i(k+L) = \sigma_i(k) + LP, \ \forall k \in \mathbb{Z}_{>0} \tag{1}$$

The following theorem can be similarly found in [Rei68] for computation graphs.

**Theorem 1 (Maximal Cycle Mean)** A marked graph G = (V, A, d, h) has an admissible periodic schedule with period P iff  $\forall v_i \in V \ \exists \tau_i \in R$ , such that

$$\tau_j - \tau_i \ge h_{ij} - Pd_{ij} \ \forall (v_i, v_j) \in A. \tag{2}$$

Using the definition of the maximal cycle mean  $P_{cm}$  of G with

$$P_{cm} = \max \left\{ \frac{\sum_{(v_i, v_j) \in W} h_{ij}}{\sum_{(v_i, v_j) \in W} d_{ij}} : W \in C(G) \right\}$$
(3)

where W contains all arcs in a directed cycle and C(G) contains all simple directed cycles of G the following statements hold:

- There are admissible 1-periodic schedules for all  $P \geq P_{cm}$ .
- A 1-periodic admissible schedule with  $P = P_{cm}$  has minimal period, i.e.,  $P_{cm} = P_{min}$ .

**Proof**: The proof is given in [CDQV85]. The first equation (2) can be proven by combining the admissible schedule equation  $\sigma_j(k) \geq \sigma_i(k-d_{ij}) + h_{ij}$  with (1) and choosing e.g.,  $\tau_i = \sigma_i(1)$  for all  $v_i \in V$ . Then the one-periodic schedule

$$\sigma_i(k) = \tau_i + (k-1)P, \ \forall k \in \mathbb{Z}_{>0} \tag{4}$$

is admissible. The proof of the maximal cycle mean (3) uses the dual representation of (2).

 $P_{cm}$  is the maximum mean of accumulated path weights  $h_{ij}$  and number of tokens  $d_{ij}$  for all directed cycles of G. Using the above theorem, the maximal-rate schedule can be determined by solving the linear program

$$P_{cm} = \min\{P : \tau_j - \tau_i \ge h_{ij} - Pd_{ij} \ \forall (v_i, v_j) \in A\}$$
 (5)

Note that the maximum cycle mean of a marked graph G = (V, A, d, h) can be determined efficiently in time O(|V||A|) using an extension of an algorithm described in [Kar78]. The algorithm in [Kar78] considers the case  $d_{ij} = 1$  for all  $(v_i, v_j) \in A$  only. Feasible potentials  $\tau_i$  for all  $v_i \in V$  can be determined using a shortest path algorithm in time O(|V||A|) [CLR90].

We have already shown that the evolution of a marked graph can be obtained by calculating the longest paths in the unfolding. But the above results do not necessarily mean that an evolution of a marked graph results in a maximal-rate periodic schedule. It can be shown that in case of a strongly connected marked graph (there is a directed path form any node to any other node), the schedule becomes L-periodic. A detailed analysis is given in [CDQV85, BCOQ92].

Example 3 We compute the maximal throughput rate for the marked graph used in Examples 1 and 2. The maximal cycle mean

$$P_{cm} = \max\left\{\frac{h_{11}}{d_{11}}, \frac{h_{12} + h_{21}}{d_{12} + d_{21}}\right\} = \max\{2.5, 2.75\} = 2.75$$

could have also been determined by solving the linear program (5) (for example by solving a shortest path problem) which yields  $\tau_1 = 0$  and  $\tau_2 = 3.5$ . Using (1), an admissible fastest periodic schedule is obtained as

$$\sigma_1(k) = 2.75(k-1), \ \sigma_2(k) = 3.5 + 2.75(k-1) \ \ \forall k \in \mathbb{Z}_{>0}$$

In the following, we will analyze the performance of mixed asynchronous – synchronous systems. But first, we will introduce the required extensions to the timed marked graph model in order to adequately describe such systems.

## 2.3 Modeling of mixed asynchronous – synchronous systems

We assume that in a mixed asynchronous – synchronous system, a synchronous node can only send or receive data at its local clock ticks. The formal model of MASS to be defined next is an extended marked graph model.

Definition 6 (Mixed Asynchronous – Synchronous System) A mixed asynchronous – synchronous system (MASS) denotes an extended marked graph G = (V, A, d, h, r, p). The set of nodes V is partitioned into disjoint subsets  $V_A$  and  $V_S$ , corresponding to asynchronous and synchronous nodes, respectively. In addition, the function  $r: V \to N$  assigns a clock period, the function  $p: V \to R$  assigns a clock phase  $0 \le p_i < r_i$  to each synchronous node  $v_i \in V_S$ . For  $v_i \in V_A$ ,  $r_i = 1$ ,  $p_i = 0$  holds.

In the unfolding  $\Sigma(G)$ , the clock period  $r_i$  and the clock phase  $p_i$  are assigned to each node  $v_i(k) \in V_{\Sigma}$ .

In this paper, we restrict ourselves to analyzing single clock rate MASS  $(r_j = 1 \text{ for all } v_j \in V)$ . Hence, we will often simply denote a MASS graph as G = (V, A, d, h, p). The following example clarifies the chosen representation of mixed asynchronous – synchronous systems.

Example 4 Again, the system described in Example 1 is considered. Now, we suppose that node  $v_2$  is a synchronous node with the local clock phase  $p_2 = 0.1$ . By definition, we have  $p_1 = 0$  as  $v_1$  is an asynchronous node. Fig. 3 shows the corresponding extended graph.



Figure 3: Example of a mixed asynchronous – synchronous system

The major difference over the purely asynchronous case is the token game played on the MASS. In case of a single clock rate for all synchronous nodes, we normalize its value to 1. To each synchronous node  $v_j$ , a local clock phase  $p_j$  is assigned, i.e., the local clock is delayed with respect to the global one by  $p_j$ . As a consequence, the time instances  $\sigma_j(k)$  when a synchronous node can complete an operation is constrained as follows:

$$\sigma_j(k) - p_j \in Z$$

This restriction is motivated by the fact that a synchronous module can deliver a value at its local clock ticks only. Therefore, the firing of the node is delayed until the next clock event. In order to simplify the notation, to each asynchronous node the clock phase  $p_i = 0$  is assigned.

As a result, the definitions of admissible schedules must be extended as follows:

**Definition 7 (Admissible Schedule)** An admissible schedule function of a mixed asynchronous – synchronous system is a function  $\sigma: V_{\Sigma} \to R_{\geq 0}$  that satisfies

1. 
$$\sigma_i(k)^* = 0$$
 for all  $v_i(k) \in V_{\Sigma} \land k \leq 0$ ,

2. 
$$\sigma_j(k)^* \geq F_j(\sigma_i(k-d_{ij})^* + h_{ij}^*)$$
 for all  $(v_i(k-d_{ij}), v_j(k)) \in A_{\Sigma}$ 

where the function  $F_i$  is

$$F_j(a) = \begin{cases} a : v_j \in V_A \\ \lceil a \rceil : v_j \in V_S \end{cases}$$

and 
$$\sigma_i(k)^* = \sigma_i(k) - p_i$$
,  $h_{ij}^* = h_{ij} + p_i - p_j$ .

In case of an asynchronous node, the same definition as in Definition 3 is obtained because  $\sigma_j(k)^* \geq F_j(\sigma_i(k-d_{ij})^* + h_{ij}^*)$  is equivalent to  $\sigma_j(k) - p_j \geq \sigma_i(k-d_{ij}) - p_i + h_{ij} + p_i - p_j$  which yields  $\sigma_j(k) \geq \sigma_i(k-d_{ij}) + h_{ij}$ .

For synchronous nodes,  $\sigma_j(k)^*$  is integral as desired. Moreover, in comparison to the asynchronous case the firing time of a synchronous node must satisfy

$$\sigma_j(k) \geq \sigma_i(k - d_{ij}) + h_{ij}$$

On the other hand, the node should be able to fire as soon as possible, i.e., at the *next* available clock tick. Consequently, we have

$$\sigma_j(k) \geq p_j + \lceil \sigma_i(k - d_{ij}) + h_{ij} - p_j \rceil$$

which is identical to Definition 7. Note that the term on the right hand side of this inequality is always greater than or equal to  $\sigma_i(k-d_{ij}) + h_{ij}$  but smaller than  $\sigma_i(k-d_{ij}) + h_{ij} + 1$ , as desired.

Example 5 The same system as in Example 4 is used. Its unfolding is shown in Fig. 4. The earliest possible firing times  $\sigma_j(k)$  are assigned to the nodes. It can be seen that the second firing of the synchronous node  $v_2$  happens at time  $\sigma_2(2) = 9.1$  because  $\sigma_1(2) + 3.5 = 8.5$  has been rounded up to the next integer plus  $p_2 = 0.1$ .



Figure 4: Unfolding of the system shown in Fig. 3

#### 3 Maximal – rate schedules of mixed asynchronous – synchronous systems

Now, we will analyze the periodic behavior of MASS graphs and their maximal throughput rates. We will derive bounds on the maximal throughput rates; these can be computed in polynomial time. For several interesting subclasses of MASS, these bounds are tight, i.e., the exact maximal throughput rate can be determined in polynomial time.

Obviously, one possible maximal-rate schedule is obtained if a node fires as soon as it is enabled. This fact is elaborated in the following theorem.

Theorem 2 (Free Schedule) The following conditions for the free schedule of a MASS are equivalent:

1. There is no admissible schedule with smaller firing times  $\sigma_i(k)$ .

2. 
$$\sigma_j(k)^* = \max\{0, F_j(\sigma_i(k-d_{ij})^* + h_{ij}^*) : (v_i(k-d_{ij}), v_j(k)) \in A_{\Sigma}\}$$

The free schedule is a maximal-rate schedule according to Definition 4.

**Proof**: Let us suppose that there is a schedule with a smaller kth firing time  $\sigma_i(k)$  for node  $v_i$ . Comparing the second condition with condition 2. in Definition 7 yields that one of the predecessors of  $v_i(k)$  must have a smaller firing time, too. Using induction and condition 1. in Definition 7, this leads to a contradiction. Now, suppose that for some node  $v_i(k)$ , we have  $\sigma_j(k)^* > \max\{0, F_j(\sigma_i(k-d_{ij})^* + h_{ij}^*) : (v_i(k-d_{ij}), v_j(k)) \in A_{\Sigma}\}$ . Then the firing time  $\sigma_i(k)$  can be reduced without violating condition 1. or 2. in Definition 7. The free schedule is a maximal-rate schedule because of Definition 4.

Note that the above theorem is closely related to Bellman's shortest path equations. Consequently, the free schedule can be computed by solving a variation of a longest path problem on the unfolding:

- 1. The nodes of the unfolding are ordered topologically, i.e., if there is an arc from node  $v_i(k)$  to node  $v_j(l)$ , then  $v_i(k)$  precedes  $v_j(l)$  in the ordering.
- 2. Then, the node potentials  $\sigma_j(k)$  are successively determined according to

$$\sigma_j(k)^* = \max\{0, F_j(\sigma_i(k-d_{ij})^* + h_{ij}^*) : (v_i(k-d_{ij}), v_j(k)) \in A_{\Sigma}\}$$

Now, bounds on the average period of the maximal-rate schedule determined above will be given. These bounds can be computed efficiently (in polynomial time). In particular, they can be related to the maximal cycle mean of marked graphs with appropriately chosen weights.

**Theorem 3 (Rate Bounds)** The minimal average period  $P_{min}$  of a MASS G = (V, A, d, h, p) can be bounded by

$$P_{cm}(\hat{G}) \le P_{min} \le P_{cm}(\tilde{G})$$

where  $P_{cm}(\hat{G})$  and  $P_{cm}(\tilde{G})$  denote the maximal cycle means of the marked graphs  $\hat{G}$  and  $\tilde{G}$ , respectively, which are defined as follows:

• 
$$\hat{G} = (V, A, d, \hat{h})$$
 with

$$\hat{h} = \left\{ egin{array}{ll} h_{ij}^* & : & v_j \in V_A \\ \lceil h_{ij}^* 
ceil & : & v_i, v_j \in V_S \\ h_{ij}^* & : & v_i \in V_A \ , & v_j \in V_S \end{array} \right.$$

•  $\tilde{G} = (V, A, d, \tilde{h})$  with

$$\tilde{h} = \left\{ egin{array}{ll} h_{ij}^* & : & v_j \in V_A \\ \lceil h_{ij}^* 
ceil & : & v_i, v_j \in V_S \\ h_{ij}^* + 1 & : & v_i \in V_A \ , & v_j \in V_S \end{array} \right.$$

**Proof**: The firing times of G,  $\hat{G}$  and  $\tilde{G}$  in the case of free schedules are denoted  $\sigma_i(k)$ ,  $\hat{\sigma}_i(k)$  and  $\tilde{\sigma}_i(k)$ , respectively. Moreover, remember that  $\sigma_i(k)^* = \sigma_i(k) - p_i$  and  $h_{ij}^* = h_{ij} + p_i - p_j$ .

If we show that for free schedules in G,  $\hat{G}$  and  $\tilde{G}$  the relations  $\hat{\sigma}_i(k) \leq \sigma_i(k)^* \leq \tilde{\sigma}_i(k)$  hold for all nodes  $v_i \in V$ ,  $k \in \mathbb{Z}_{>0}$ , then  $P_{cm}(\hat{G}) \leq P_{min} \leq P_{cm}(\tilde{G})$  holds as  $P_{min}(\hat{G}) = P_{cm}(\hat{G})$  and  $P_{min}(\tilde{G}) = P_{cm}(\tilde{G})$ .

For all nodes without predecessor we have  $\hat{\sigma}_i(k) = \tilde{\sigma}_i(k) = \sigma_i(k)^* = 0$ . Consequently, the initial conditions for all unfoldings are identical with  $\hat{\sigma}_i(k) = \sigma_i(k)^* = \tilde{\sigma}_i(k)$ .

Let us consider an arc  $(v_i(k-d_{ij}), v_j(k))$  in the unfoldings of G,  $\hat{G}$  and  $\tilde{G}$ . We will show now that the inequalities implied by such an arc in G:

$$\sigma_j(k)^* - \sigma_i(k - d_{ij})^* \ge r_{ij}$$

for some  $r_{ij}$  is less strict in the case of  $\hat{G}$ :

$$\sigma_j(k)^* - \sigma_i(k - d_{ij})^* \ge r_{ij} - \delta$$
 ,  $\delta \ge 0$ 

and stricter in the case of  $\tilde{G}$ :

$$\sigma_j(k)^* - \sigma_i(k - d_{ij})^* \ge r_{ij} + \gamma$$
,  $\gamma \ge 0$ 

As this holds for all arcs,  $\hat{\sigma}_i(k) \leq \sigma_i(k)^* \leq \tilde{\sigma}_i(k)$  follows because of the monotonicity of the Bellman-type equations for free schedules, see Theorem 2. Let us consider three cases

- $v_j \in V_A$ : In the unfolding of G we have  $\sigma_j(k)^* \geq \sigma_i(k d_{ij})^* + h_{ij}^*$  which leads to  $\sigma_j(k)^* \sigma_i(k d_{ij})^* \geq h_{ij}^*$ . As  $\hat{h}_{ij} = \tilde{h}_{ij} = h_{ij}^*$  in this case, we have  $\delta = \gamma = 0$ .
- $v_i, v_j \in V_S$ : In the unfolding of G we have  $\sigma_j(k)^* \geq \lceil \sigma_i(k d_{ij})^* + h_{ij}^* \rceil$ . As  $\sigma_i(k d_{ij})^*$  is integral in a free schedule, we have  $\sigma_j(k)^* \geq \sigma_i(k d_{ij})^* + \lceil h_{ij}^* \rceil$  and  $\sigma_j(k)^* \sigma_i(k d_{ij})^* \geq \lceil h_{ij}^* \rceil$ . As  $\hat{h}_{ij} = \tilde{h}_{ij} = \lceil h_{ij}^* \rceil$  in this case, we have  $\delta = \gamma = 0$ .
- $v_i \in V_A, v_j \in V_S$ : In the unfolding of G we have  $\sigma_j(k)^* \geq \lceil \sigma_i(k-d_{ij})^* + h_{ij}^* \rceil$  which leads to  $\sigma_j(k)^* \sigma_i(k-d_{ij})^* \geq \lceil \sigma_i(k-d_{ij})^* + h_{ij}^* \rceil \sigma_i(k-d_{ij})^*$ . Consequently, we have  $r_{ij} = \lceil \sigma_i(k-d_{ij})^* + h_{ij}^* \rceil \sigma_i(k-d_{ij})^*$ . Now,  $r_{ij} 1 < h_{ij}^* \leq r_{ij}$  and  $\hat{h}_{ij} = h_{ij}^*$  which leads to  $\hat{h}_{ij} = r_{ij} \delta$  with  $1 > \delta \geq 0$ . On the other hand,  $\tilde{h}_{ij} = h_{ij}^* + 1$  leads to  $\tilde{h}_{ij} = r_{ij} + \gamma$  with  $1 \geq \gamma > 0$ .

Example 6 Consider the MASS on the left side of Fig. 5 with an asynchronous node  $v_1$  and a synchronous node  $v_2$ . In the middle, resp. on the right hand side of Fig. 5, the associated marked graphs  $\tilde{G}$  and  $\hat{G}$  are shown. The maximal cycle means of the associated marked graphs which determine the bounds for  $P_{min}$  are  $P_{cm}(\tilde{G}) = 3.25$  and  $P_{cm}(\hat{G}) = 3.1$ , respectively. Without considering the unfolding of G, we can say according to Theorem 3 that  $3.1 \leq P_{min}(G) \leq 3.25$ . By the determination of a maximal-rate schedule for G using Theorem 2, we get  $P_{min}(G) = 28/9$ .



Figure 5: Example of a MASS and associated marked graphs  $ilde{G}$  and  $\hat{G}$ 

As a result of the above theorem, we have  $P_{cm}(\hat{G}) = P_{min} = P_{cm}(\tilde{G})$  for the subclass of MASS that do not have arcs from asynchronous to synchronous nodes because  $\hat{G} = \tilde{G}$ . Note that this subclass of MASS includes selftimed systems  $(V = V_A)$  and GALS (globally asynchronous locally synchronous systems)  $(V = V_S)$  as special cases.

It turns out that one can analyze the behavior of this subclass in much more detail. In particular, we are interested in the determination of a maximal-rate periodic schedule in this case. We already know that the corresponding period is  $P_{cm}(\hat{G})$  where  $\hat{G} = (V, A, d, \lceil h^* \rceil)$ . As a simple consequence of the proof given for Theorem 3, one can even compare the free schedules of both graphs directly.

Corollary 1 The free schedule of a MASS G = (V, A, d, h, p) containing no arcs from asynchronous to synchronous nodes is identical to that of the corresponding marked graph  $\hat{G}$  defined in Theorem 3. In particular we have  $\sigma_i(k) = \hat{\sigma}_i(k) + p_i$  for all  $v_i(k) \in V_{\Sigma}$  where  $\sigma_i(k)$  and  $\hat{\sigma}_i(k)$  denote the firing times of node i in G and  $\hat{G}$  in a free schedule, respectively.

**Proof**: The proof is based on the fact that according to the case  $v_i, v_j \in V_S$  and  $v_j \in V_A$ , the inequalities which determine the firing times in G and  $\hat{G}$  are identical, i.e.,  $\delta = \gamma = 0$ . Therefore, we have  $\sigma_i(k)^* = \sigma_i(k) - p_i = \hat{\sigma}_i(k)$ .

Finally, the following theorem and the corresponding constructive proof lead to periodic maximal-rate admissible schedules for a subclass of mixed asynchronous - synchronous systems.

**Theorem 4** Given is a MASS G containing no arcs from asynchronous to synchronous nodes. Then any admissible schedule of  $\hat{G}$  as defined in Theorem 3 leads to an admissible schedule for G with

$$\sigma_i(k) = \begin{cases} \lceil \hat{\sigma}_i(k) \rceil + p_i & : \quad v_i \in V_S \\ \hat{\sigma}_i(k) + p_i + 1 & : \quad v_i \in V_A \end{cases}$$

with the same computation rate. If G contains synchronous nodes only, then there is an L-periodic maximal-rate schedule of G where L denotes the number of tokens in the simple cycle which determines the maximal cycle mean

of  $\hat{G}$ . 4

**Proof:** An admissible schedule for  $\hat{G}$  satisfies  $\hat{\sigma}_{j}(k) \geq \hat{\sigma}_{i}(k - d_{ij}) + \hat{h}_{ij}$ . Note that we have defined  $\sigma_{i}(k)^{*} = \sigma_{i}(k) - p_{i}$  in Definition 7.

For  $v_i, v_j \in V_A$  we have  $\hat{h}_{ij} = h_{ij}^*$  which leads to  $\hat{\sigma}_j(k) \geq \hat{\sigma}_i(k - d_{ij}) + h_{ij}^*$ . With  $\hat{\sigma}_j(k) = \sigma_j(k)^* - 1$  and  $\hat{\sigma}_i(k - d_{ij}) = \sigma_i(k - d_{ij})^* - 1$  we get  $\sigma_j(k)^* \geq \sigma_i(k - d_{ij})^* + h_{ij}^*$  which is admissible for G.

For  $v_j \in V_A$ ,  $v_i \in V_S$  we have  $\hat{h}_{ij} = h_{ij}^*$  which again leads to  $\hat{\sigma}_j(k) \geq \hat{\sigma}_i(k - d_{ij}) + h_{ij}^*$ . With  $\hat{\sigma}_j(k) = \sigma_j(k)^* - 1$  and  $\sigma_i(k - d_{ij})^* = \lceil \hat{\sigma}_i(k - d_{ij}) \rceil$  we get  $\sigma_j(k)^* \geq \hat{\sigma}_i(k - d_{ij})^* + h_{ij}^* + 1 \geq h_{ij}^* + \sigma_i(k - d_{ij})^*$ . Therefore,  $\sigma_j(k)^* \geq \sigma_i(k - d_{ij})^* + h_{ij}^*$  which is admissible for G.

For  $v_i, v_j \in V_S$  we have  $\hat{h}_{ij} = \lceil h_{ij}^* \rceil$  which leads to  $\hat{\sigma}_j(k) \geq \hat{\sigma}_i(k-d_{ij}) + \lceil h_{ij}^* \rceil$  and also  $\lceil \hat{\sigma}_j(k) \rceil \geq \lceil \hat{\sigma}_i(k-d_{ij}) + \lceil h_{ij}^* \rceil \rceil = \lceil \hat{\sigma}_i(k-d_{ij}) \rceil + \lceil h_{ij}^* \rceil$ . With  $\sigma_j(k)^* = \lceil \hat{\sigma}_j(k) \rceil$  and  $\sigma_i(k-d_{ij})^* = \lceil \hat{\sigma}_i(k-d_{ij}) \rceil$  we get  $\sigma_j(k)^* \geq \sigma_i(k-d_{ij})^* + \lceil h_{ij}^* \rceil = \lceil \sigma_i(k-d_{ij})^* + h_{ij}^* \rceil$  which is admissible for G.

Now we will prove the periodicity part. First, a 1-periodic maximal-rate admissible schedule for  $\hat{G}$  is constructed, see section 2.2. This leads to

$$\hat{\sigma}_i(k+1) = \hat{\sigma}_i(k) + P_{cm}(\hat{G})$$

According to Theorem 1,  $P_{cm}(\hat{G}) = \frac{H}{L}$  where L is the number of tokens and H is the sum of weights in the critical cycle. Note that  $H, L \in Z$  as  $\hat{h}_{ij}, d_{ij} \in Z$ . Therefore, we can write

$$\hat{\sigma}_i(k+L) = \hat{\sigma}_i(k) + H$$

With  $\sigma_i(k)^* = \lceil \hat{\sigma}_i(k) \rceil$  we have

$$\sigma_i(k+L)^* = \sigma_i(k)^* + H$$

which clearly is an L-periodic schedule. A remark: The above result can be extended to the case where G contains asynchronous nodes. The construction of a maximal-rate periodic schedule is again possible. In this case one may use the fact that there are no paths from asynchronous nodes to synchronous nodes. Therefore, both parts of G can be dealt with independently.

<sup>&</sup>lt;sup>4</sup>We assume for simplicity, that this cycle is unique. However, it can be easily shown that the result can be extended to the case where the critical cycle is not unique. Let the numbers of tokens in different critical cycles have different values  $L_i$ . Then, the solution has a periodicity at most  $\operatorname{lcm}(L_i)$  over all critical cycles.

Example 7 Given the MASS of Fig. 6 with two synchronous nodes  $v_1$  and  $v_2$ , clock phases  $p_1 = 0.1$  and  $p_2 = 0.6$  and holding times  $h_{12} = 1.9$  and  $h_{21} = 1.7$ , respectively. If we determine the transformed arc weights, we get  $h_{12}^* = 1.4$  and  $h_{21}^* = 2.2$ . Then we have  $\hat{h}_{12} = 2$  and  $\hat{h}_{21} = 3$ , which can be seen in the associated marked graph  $\hat{G}$  shown on the right hand side of Fig. 6. We obtain  $P_{min}(\hat{G}) = 2.5$ . Considering the unfolding with corresponding firing times, we can see that there is a 2-periodic schedule for G according to Theorem 4. For example, we obtain the following schedule:

$$\sigma_1(2k) = 0.1 + 5k, \ \sigma_1(2k+1) = 3.1 + 5k,$$
  
$$\sigma_2(2k) = 0.6 + 5k, \ \sigma_2(2k+1) = 2.6 + 5k \ \forall k \in \mathbb{Z}_{>0}$$



Figure 6: MASS G, associated marked graph  $\hat{G}$  and unfolding of G

#### 4 Optimal phase assignment

Thus far, we know from Theorem 3 that the average period  $P_{min}$  of a MASS G can be bounded by  $P_{cm}(\hat{G}) \leq P_{min} \leq P_{cm}(\tilde{G})$ . The bounds on the minimal average period strongly depend on the values  $\hat{h}$  and  $\tilde{h}$  of  $\hat{G}$  and  $\tilde{G}$ , respectively. By definition,  $h_{ij}^* = h_{ij} + p_i - p_j$ . Hence, the minimal average period depends on the given node phase assignment.

Let us suppose we have the freedom to adjust the clock phases of synchronous nodes of a given MASS. Then we might be able to find a combination of phases leading to another MASS having a smaller average period. Hence, the objective of this chapter is to investigate the effect of the phases of synchronous nodes on the minimal average period of topologically equivalent MASS graphs.

Example 8 Given the MASS of Fig. 6 with two synchronous nodes  $v_1$  and  $v_2$ , and holding times  $h_{12} = 1.9$ , and  $h_{21} = 1.7$ , respectively. The average periods for different combinations of clock phases can be computed as follows:

1. 
$$p_1 = 0.5$$
,  $p_2 = 0.5 \Rightarrow h_{12}^* = 1.9$ ,  $\hat{h}_{12} = \lceil 1.9 \rceil = 2$ ,  $h_{21}^* = 1.7$ ,  $\hat{h}_{21} = \lceil 1.7 \rceil = 2 \Rightarrow P_{cm}(\hat{G}) = P_{min} = 2$ .

2. 
$$p_1 = 0.1, p_2 = 0.6 \Rightarrow P_{cm}(\hat{G}) = P_{min} = 2.5.$$

3. 
$$p_1 = 0.2, p_2 = 0.0 \Rightarrow P_{cm}(\hat{G}) = P_{min} = 2.5.$$

Fig. 7 is a density plot of the period  $P_{min}$  in dependence of  $p_1$  (x-axis) and  $p_2$  (y-axis). For all combinations of clock phases  $p_1, p_2 \in [0, 1)$ , it turns out that only two different periods exist.

This simple example shows that the average period may vary due to different assigned clock phases. In this chapter, we therefore would like to consider the *optimal phase assignment problem*. Before we introduce the optimization problem, we define when a MASS G is called *phase optimal*.

**Definition 8** Given the set of MASS  $M = \{G = (V, A, d, h, p) : 0 \le p_i < 1\}$ . We define

$$P_{opt} := \min_{G \in \mathcal{M}} P_{min}(G)$$



Figure 7: Density plot  $P_{min}$  in dependence of  $p_1$  (x-axis) and  $p_2$  (y-axis).  $P_{min}(G) = 2$  is shown in black,  $P_{min}(G) = 2.5$  is shown in white.

and call  $P_{opt}$  the optimal average period of the set of MASS M. A MASS  $G \in M$  is said to be phase optimal, if  $P_{min}(G) = P_{opt}$ .

In other words, a MASS G is called phase optimal if there exists no MASS G' with same topology and same weight and distance functions, but different phase assignment p and smaller average period. In this section, we are going to address the following questions:

- How can we find the minimal period  $P_{opt}$  in case the clock phases of all synchronous nodes are freely adjustable?
- Sometimes, some clock phases are fixed, others are not. What is the best adjustment of the remaining clock phases then?

We will answer these questions by proposing a simple optimization procedure. First, we introduce an exact procedure for finding an optimal phase if the MASS contains no asynchronous nodes. In the subsequent section, we will consider the general case. We will introduce a polynomial time procedure for finding a MASS  $G \in M$  such that  $P_{min}(G)$  satisfies  $P_{min}(G) \leq P_{opt} + 1$ .

#### 4.1 Exact phase optimization

From Theorem 1, we know that when given a MASS containing no arcs from asynchronous to synchronous nodes, then the free schedule is identical to that of the corresponding marked graph given in Theorem 3. Furthermore, Theorem 4 states how an admissible schedule for  $\hat{G}$  can be related to an admissible schedule for G with the same computation rate. Therefore, we are going to analyze the influence of the clock phases on the period of  $\hat{G}$ .

**Theorem 5** Given a set of MASS M containing synchronous nodes only, an optimal phase MASS  $G \in M$  may be found by solving the following MILP:

minimize P  
subject to  

$$\tau_{j} - \tau_{i} + Pd_{ij} \geq \tilde{h}_{ij} \geq h_{ij} + p_{i} - p_{j}, \quad \tilde{h}_{ij} \in Z \quad \forall (v_{i}, v_{j}) \in A$$

$$0 \leq p_{i} \leq 1 \quad \forall v_{i} \in V$$
(6)

Let  $P = \frac{H}{L}$ ,  $H, L \in \mathbb{Z}$  and H, L coprime, be an optimal solution of (6) and p an optimal phase. Then the minimal average period of the corresponding MASS is  $P_{min} = P = P_{opt}$ . Furthermore, the schedule  $\sigma_i(k+L) = \sigma_i(k) + H$  is an L-periodic maximal-rate admissible schedule for G where

$$\sigma_i(k) = p_i + \left[\tau_i + (k-1)\frac{H}{L}\right] \quad \forall 1 \le k \le L$$
 (7)

**Proof:** Let the phases of a given MASS be fixed. Then we are able to determine  $P_{min}$  by determining the minimal average period of the marked graph  $\hat{G}$  by solving the linear program (see (5))

$$P_{min} = \min\{P : \tau_j - \tau_i \ge \lceil h_{ij}^* \rceil - Pd_{ij} \ \forall (v_i, v_j) \in A\}$$

Using  $h_{ij}^* = h_{ij} + p_i - p_j$  and replacing  $\tilde{h}_{ij} = \lceil h_{ij}^* \rceil$  by  $\tilde{h}_{ij} \in Z$  and  $\tilde{h}_{ij} \geq h_{ij}^*$ , the final MILP in (6) is obtained.

The MILP has 2|V|+1 rational variables (vectors  $\tau$ , p and P) and |A| integer variables  $\tilde{h}_{ij}$ . The proof that (7) is a periodic admissible schedule is almost trivial. From (4) in the proof of Theorem 1, we know that  $\hat{\sigma}_i(k) = \tau_i + (k-1)\frac{H}{L}$ ,  $k \in \mathbb{Z}_{>0}$  is a maximal-rate periodic schedule for  $\hat{G}$ . We apply the transformation described in Theorem 4 which transforms this admissible schedule for  $\hat{G}$  to obtain the following admissible schedule for G with the same computation rate

$$\sigma_i(k) = \lceil \hat{\sigma}_i(k) \rceil + p_i = \left\lceil \tau_i + (k-1) \frac{H}{L} \right\rceil + p_i \quad \forall k \in \mathbb{Z}_{>0}$$

**Example 9** Given the MASS G in Example 8. Solving the corresponding MILP in (6) provides the solution P=2, the vector  $\tau=(\tau_1,\tau_2)=(0.0,1.0)$  and an optimal phase vector  $p=(p_1,p_2)=(0.9,0.2)$ . With  $\tilde{h}_{12}=3$  and  $\tilde{h}_{21}=1$ , we have H=2, L=1. From the periodic admissible schedule  $\hat{\sigma}$  of  $\hat{G}$  with

$$\hat{\sigma}_1(k) = \tau_1 + (k-1)2 = 2(k-1)$$

$$\hat{\sigma}_2(k) = \tau_2 + (k-1)2 = 1.0 + 2(k-1)$$

an admissible schedule  $\sigma$  for G becomes

$$\sigma_1(k) = \lceil 2(k-1) \rceil + 0.9 = 2k - 1.1$$
  
 $\sigma_2(k) = \lceil 2(k-1) + 1.0 \rceil + 0.2 = 2k - 0.8$ 

Note that the MILP in Theorem 5 can be changed in a simple manner to account for other constraints concerning  $p_i$ .

## 4.2 A polynomial time procedure for nearly optimal phase assignment

In this section we give an efficient procedure to find a phase assignment p for a set of MASS M such that the corresponding MASS  $G \in M$  has a period  $P_{min}(G)$  at most one unit greater than  $P_{opt}$ .

**Definition 9** Given the MASS G = (V, A, d, h, p), define  $G^A = (V, A, d, h)$  to be a marked graph in which all nodes are assumed to be asynchronous, i.e.,  $F_j(a) = a$  for all  $v_i \in V$  in Definition 7.

Since the nodes in  $G^A$  may start computation at arbitrary time instances,  $P_{cm}(G^A) \leq P_{opt}$ . In other words, a lower bound on  $P_{opt}$  is  $P_{cm}(G^A)$ , the maximal cycle mean of  $G^A$ .

Lemma 1 Let  $\sigma_A$  be an admissible 1-periodic schedule for  $G^A$  with period  $\lceil P_{cm}(G^A) \rceil$  and  $\sigma_i^A(k) = \tau_i + (k-1) \lceil P_{cm}(G^A) \rceil$ . Then  $\sigma^A$  is a 1-periodic admissible schedule for the MASS G with the phase assignment  $p_i = \tau_i - \lfloor \tau_i \rfloor \ \forall v_i \in V_S \ (and \ p_i = 0 \ \forall v_i \in V_A)$ .

**Proof:** A schedule as required by Lemma 1 can be determined using (4) and (5) in time O(|A||V|). Using

$$\sigma_i^A(k) = p_i + \lfloor \tau_i \rfloor + (k-1) \lceil P_{cm}(G^A) \rceil \quad \forall v_i \in V_S$$

one obtains  $\sigma_i^A(k) - p_i \in Z$  for all  $v_i \in V_S$ . Additionally,

$$\sigma_j^A(k) \ge \sigma_i^A(k - d_{ij}) + h_{ij} \quad \forall (v_i, v_j) \in A$$

because  $\sigma_i^A(k)$  is an admissible schedule for  $G^A$ . As a result, we have

$$\sigma_j^A(k) - p_j \ge \sigma_i^A(k - d_{ij}) - p_i + (h_{ij} + p_i - p_j) \quad \forall v_j \in V$$

Moreover, for  $v_i \in V_S$  one obtains

$$\sigma_j^A(k) - p_j \ge \lceil \sigma_i^A(k - d_{ij}) - p_i + (h_{ij} + p_i - p_j) \rceil$$

because  $\sigma_j^A(k) - p_j \in Z$  for all  $v_j \in V_S$ . Thus, by Definition 7,  $\sigma^A$  is a one-periodic admissible schedule for G with the phase assignments  $p_i = \tau_i - \lfloor \tau_i \rfloor \ \forall v_i \in V_S$ .

**Theorem 6** The phase assignment p as described in Lemma 1 leads to a MASS G whose minimal average period satisfies  $P_{min}(G) \leq P_{opt} + 1$ .

**Proof**: From Lemma 1 we know, that the phase assignment p leads to a MASS G with  $P_{min}(G) \leq \lceil P_{cm}(G^A) \rceil$ . Moreover, we have  $P_{cm}(G^A) \leq P_{opt}$  which leads to  $P_{cm}(G^A) - P_{opt} \leq 0$  and  $\lceil P_{cm}(G^A) \rceil - P_{opt} \leq 1$ . As a result, we have  $P_{min}(G) - P_{opt} \leq 1$ . The entire procedure to determine p takes O(|V||A|) time according to (4) and (5).

The whole procedure is explained in the following example.

**Example 10** Consider the MASS shown in Fig. 6 again. We obtain  $P_{cm}(G^A)$  = 1.8 and  $\lceil P_{cm}(G^A) \rceil = 2$ . Thus, we can use Lemma 1 to determine a phase assignment  $p_1$ ,  $p_2$  such that the resulting MASS G satisfies  $P_{min}(G) \leq 2$ . The following steps lead to a phase assignment according to Lemma 1 and Theorem 6:

• We obtain the following 1-periodic schedule for  $G^A$ :

$$\sigma_1^A(k) = \tau_1 + (k-1)2$$
  
 $\sigma_2^A(k) = \tau_2 + (k-1)2$ 

where  $\tau_1 = 0$  and  $\tau_2 = 0$ .

• Then we have

$$p_1 = \tau_1 - \lfloor \tau_1 \rfloor = 0$$

$$p_2 = \tau_2 - \lfloor \tau_2 \rfloor = 0$$

• For the phase assignment defined above,

$$\sigma_1(k) = 2k, \quad \sigma_2(k) = 2k$$

is an admissible schedule for G. Therefore, we have found a MASS G  $(p_1 = p_2 = 0)$  that is provably nearly optimal, in this case even optimal.

#### 5 Modeling Issues of Implementation

Finally, we would like to show how the MASS model may be used for representing relevant aspects of the design of mixed asynchronous – synchronous systems. It should be obvious that MASS includes the class of systems that can be described by marked graphs, hence selftimed systems with deterministic computation times of nodes can be analyzed.

#### 5.1 Modeling of finite queue sizes

It is possible to model an asynchronous communication involving a queue of limited size. Fig. 8a may model the following situation: There is a selftimed communication from module  $v_1$  to module  $v_2$  with a buffer size of 3. A node cannot fire until its output queues contain at least one empty buffer. Two of the buffers contain one initial token each. This situation can be modeled by using a forward arc  $(v_1, v_2)$  to model the tokens and a backward arc  $(v_2, v_1)$  to model the empty buffers (spaces, bubbles [GS90, Wil94]). Module  $v_2$  is not pipelined, i.e., a new token can not be processed before the previous one has been transferred. The evaluation time of module  $v_2$  is 2, a token needs time 0.2 to propagate from  $v_1$  to  $v_2$ , a space needs time 0.1 to propagate from  $v_2$  to  $v_1$ . This situation is modeled by introducing the appropriate holding times for the tokens.



Figure 8: Modeling mixed asynchronous - synchronous systems

#### 5.2 Simulation of synchronous systems

The MASS model can be used for modeling synchronous systems. Starting point is a given synchronous signal flow graph consisting of combinational modules and their interconnections which may contain synchronous registers. For the sake of simplicity, we assume only one clock phase to clock all of the registers (i.e., single clock, monophase, edge-triggered). There are many ways of modeling such a system in the MASS model.

For example, if only the sequence of operations is of concern, one may simply replace the registers by initial tokens and the combinational modules by asynchronous nodes of a marked graph.

Another possibility is the modeling shown in Figs. 8b,c: The synchronous registers are replaced by synchronous nodes as shown in Fig. 8b. The combinational modules are modeled using asynchronous nodes whose holding times are the delays of the corresponding combinational modules, see Fig. 8c. The timing corresponds to that of the synchronous system if input tokens are continuously available and if all accumulated delay times between registers are smaller than the clock period.

The next possibility takes into account the global clock generation of a synchronous system. An extra clock node is responsible for the simultaneous transfer of tokens to all synchronous nodes each of which models one register. The connection between the clock node and the synchronous nodes is bidirectional and corresponds to an interconnection with a queue length of one. This prevents the accumulation of clock tokens and guarantees that if the longest delay path between two registers (synchronous nodes) is larger than the clock period, the period of the whole system is slowed down correspondingly. The modeling of the registers is shown in Fig. 8d where combinational modules are modeled as in Fig. 8c.

#### 5.3 Interfacing

Finally, some remarks concerning interfacing asynchronous and synchronous subsystems are presented. The MASS shown in Fig. 8e models the situation where node  $v_1$  belongs to an asynchronous subsystem whereas node  $v_2$  models the 'first' register of a synchronous subsystem. This input register accepts tokens only at integral time instances. The holding time h models

the communication time and the processing time of the synchronous node  $v_2$ , e.g., for executing the communication protocol. If the asynchronous part is faster than the synchronous one, tokens will accumulate in the queue  $(v_1, v_2)$ . This can be avoided by using finite length queues, as in Fig. 8a. If the synchronous part is slower, the synchronous node  $v_2$  does not process a token every clock tick, but all other nodes in the synchronous subsystem do. This behavior can be implemented in hardware by sending a token – bit in addition to the data which signals that a register contains valid data. The clock operates continuously. In essence, the operation of an asynchronous system is simulated by the synchronous one.

Another possibility is shown in Fig. 8f. Here, the concept of a global clock node is used, which is bidirectionally connected to all synchronous nodes of the synchronous subsystem. The incoming token is split into one that represents the data and another one that serves to enable the clock signal. In a circuit implementation, this corresponds to stopping the clock of the synchronous subsystem if no input data is present.

If the output of the synchronous subsystem is connected to an asynchronous subsystem, similar situations occur.

#### 6 Summary

We have introduced a graph model called MASS for expressing the timing of mixed asynchronous – synchronous systems. A MASS is an extended timed marked graph where the node set V is partitioned into asynchronous nodes  $V_A$  and synchronous nodes  $V_S$ . The synchronous nodes have the property that they can fire only at instances of their local clock tick. Based on this model, we analyzed timing in terms of minimal average period (or equivalently, the maximal throughput rate) and periodicity. It has been shown that bounds on the maximal throughput rate can be determined based on maximal cycle mean computations on related marked graphs. For interesting subclasses of MASS (e.g., GALS  $(V = V_S)$ ), the lower and upper bounds are identical, hence the maximal throughput rate is determined exactly. Hence, efficient performance analysis is possible.

Next, we investigated the problem of optimal phase assignment. In contrast to existing approaches for phase optimization in level-sensitive circuits

[DR89, SMO92], we assumed different clock phases due to clock skew in edge-triggered synchronous circuits. The assumption of deterministic clock phases is justified in systems where different subsystems receive clock signals from the same clock oscillator. Based on these assumptions, we formulated the problem of optimal phase assignment. An exact solution (based on the solution of a MILP) as well as a polynomial time algorithm that returns a close-to-optimal solution have been proposed. The phase assignment may be used as input for placement and clock routing tools.

Finally, we have shown how finite buffering and interfacing of selftimed and synchronous subsystems can be modeled using the MASS model.

#### 7 Conclusions and Future Research

In this paper, we assumed deterministic node holding (communication + computation) times. For synchronous nodes, this assumption is justified by the fact that we assume that the corresponding synchronous subsystems have been individually optimized and synthesized using high level synthesis tools. Such tools typically return a fixed clock period and fixed latency. This assumption, however, may not be accurate for asynchronous nodes for which node holding times depend on the implementation style of the selftimed circuit. Whereas fixed holding times may be justified by design styles like micropipelines [Sut89], some implementation styles create data-dependent completion signals, and, as a result, achieve better average throughput rates [GS90]. In the latter case, a stochastic modeling of holding times for asynchronous nodes is required. This is one subject for future research. We would also like to extend our analysis to multiple clock rate synchronous nodes. Another issue of interest is to be able to concatenate high-level synthesis tools for synchronous circuits (see [WC91] for an overview), and asynchronous circuits (see papers edited in [GB93, MM94]), and investigate the influence of local node optimizations on the system-level performance, given e.g., by a MASS graph. Finally, the MASS model may be useful also in the domain of interface synthesis, which synergistically combines asynchronous and synchronous design styles. Due to the proven efficiency of the presented analysis algorithms, we believe that the MASS model is an interesting model for system-level CAD tools.

The first author would like to thank the DFG for the grant (TE163/4-1) that supported his work at UC Berkeley. S. Sriram was supported by SRC under grant 94-DC-008. Thanks also to Edward Lee for many helpful comments on the subject of this paper, and to the Ptolemy project for supplemental support.

#### References

- [BCOQ92] F. Baccelli, G. Cohen, G.J. Olsder, and J.-P. Quadrat. Synchronization and Linearity. John Wiley, Sons, New York, 1992.
- [BS89] E. Brunvand and R. F. Sproull. Translating Concurrent Programs into Delay-Insensitive circuits. In *Prof. of ICCAD*, pages 262–265, 1989.
- [Bur90] S.M. Burns. Performance Analysis and Optimization of Asynchronous Circuits. PhD thesis, CIT, 1990.
- [CDQV85] G. Cohen, D. Dubois, J. P. Quadrat, and M. Viot. A Linear-System-Theoretic View of Discrete-Event Processes and Its Use for Performance Evaluation in Manufacturing. *IEEE Transactions on Automatic Control*, AC-30, No. 3:210-220, March 1985.
- [CH71] F. Commoner and A.W. Holt. Marked directed graphs. Journal of Computer and System Sciences, 5:511-523, 1971.
- [CLR90] T. Cormen, C. Leiserson, and R. Rivest. Introduction to algorithms. McGraw-Hill, Cambridge, MA, 1990.
- [DR89] M. Dagenais and R. Rumin. On the calculation of optimal clocking parameters in synchronous circuits with level-sensitive latches. *IEEE Transactions on CAD*, 8(3):268-278, 1989.
- [GB93] G. Gopalakrishnan and E. Brunvand. Special Issue on asynchronous design. *INTEGRATION*, the VLSI journal, 15(3), 1993.

- [GJ93] G. Gopalakrishnan and L. Josephson. Towards amalgating the synchronous and asynchronous styles. In TAU'93, ACM Workshop on Timing Issues in the specification and synthesis of digital systems, page paper 10, Malente, Germany, 1993.
- [GS90] M. R. Greenstreet and K. Steiglitz. Bubbles can make self-timed pipelines fast. Journal of VLSI Signal Processing, 2(3):139-148, 1990.
- [GVNG94] D. Gajski, F. Vahid, S. Naranyan, and J. Gong. Specification and Design of Embedded Systems. Prentice Hall, Englewood Cliffs, NJ, 1994.
- [Kar78] R.M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23:309-311, 1978.
- [LRS83] C.E. Leiserson, F.M. Rose, and J.B. Saxe. Optimizing synchronous circuitry by retiming. In *Proc. Third Caltech Conf. on VLSI*, pages 87-116, 1983.
- [Mar89] A. J. Martin. Programming in VLSI: From Communicating Processes to Delay-Sensitive Circuits. In C. A. R. Hoare, editor, UT Year of Programming Institute on Concurrent Programming, pages 1-64. Addison-Wesley, 1989.
- [MBM89] T. H. Meng, R. W. Brodersen, and D. G. Messerschmitt. Automatic synthesis of asynchronous circuits from high-level specifications. *IEEE Transactions on Computer-Aided Design*, 8(11):1185-1205, 1989.
- [MFR85] C. E. Molnar, T. P. Fang, and F. U. Rosenberger. Synthesis of Delay-Insensitive Modules. In Proc. 1985 Chapel Hill Conference on Advanced Research in VLSI, pages 67-86, 1985.
- [MM94] T. H. Meng and S. Malik. Special Issue: Asynchronous circuit design for VLSI signal processing. Journal of VLSI Signal Processing, 7(1-2), 1994.
- [Pet81] J.L. Peterson. Petri Net Theory and Modeling of Systems. Prentice Hall, Reading, Mass., 1981.

- [Ram74] C. Ramachandani. Analysis of asynchronous concurrent systems by Petri nets. Technical Report Project MAC, TR-120, MIT, Cambridge, MA, 1974.
- [Rei68] R. Reiter. Scheduling parallel computations. J. of the ACM, 15:590-599, 1968.
- [RH80] C. V. Ramamoorthy and G. S. Ho. Performance evaluation of asynchronous concurrent systems using Petri nets. *IEEE Transactions on Software Engineering*, SE-6(5):440-449, 1980.
- [Sei80] C. E. Seitz. System Timing, pages 219-262. Introduction to VLSI Systems, C. A. Mead and L. A. Conway eds. Addison-Wesley, Reading, MA, 1980.
- [Sha84] D. M. Shapiro. Globally Asynchronous, Locally Synchronous Systems. PhD thesis, Stanford University, 1984.
- [SMO92] K. Sakallah, T. Mudge, and O. Olukotun. Analysis and design of latch-controlled synchronous digital circuits. *IEEE Trans. on CAD*, 11(3):322-333, 1992.
- [Sub91] P. A. Subrahmanyam. Automated synthesis of mixed-mode (asynchronous and synchronous) systems. AT&T Technical Journal, 70(1):111-24, 1991.
- [Sut89] I. E. Sutherland. Micropipelines. Communication of the ACM, 32:720-738, 1989.
- [WB93] K. Wolinski and M. Belhadj. Vers la synthese automatique de programmes signal. Technical Report 746, IRISA, Rennes, France, 1993.
- [WC91] R. A. Walker and R. Camposano. A Survey of High-Level Synthesis Systems. Kluwer, Norwell, MA, 1991.
- [Wil94] T. E. Williams. Performance of iterative computation in self-timed rings. Journal of VLSI Signal Processing, 7(1-2):17-31, 1994.

[WTWL94] T. Wu, T. Tien, A. Wu, and Y. Lin. A synthesis method for mixed synchronous/asynchronous behavior. In Proc. EDAC, pages 277-281, 1994.