Hardlock: A Concurrent Real-time Multicore Locking Unit

Strøm, Tórur Biskopstø; Schoeberl, Martin

Published in:
Proceedings of 2018 IEEE 21st International Symposium on Real-Time Distributed Computing (ISORC)

Link to article, DOI:
10.1109/ISORC.2018.00010

Publication date:
2018

Document Version
Peer reviewed version

Link back to DTU Orbit

Citation (APA):

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

- Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
- You may not further distribute the material or use it for any profit-making activity or commercial gain
- You may freely distribute the URL identifying the publication in the public portal

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
Abstract—To use multicore processors, an application needs to split computation into several threads that execute on different processing cores. As those threads work together towards a common goal, they need to exchange data in a controlled way. A common communication paradigm between cooperating threads is using shared data structures protected by locks.

Implementing a lock on top of shared memory can easily result in a bottleneck on a multicore processor due to the congestion on the shared memory. However, the number of locks in use is usually low and using the large external memory to support locks is over-provisioning a resource. This paper presents an efficient implementation of locking by providing dedicated hardware support for locking on-chip. This locking unit supports a restricted number of locks without the need to get off-chip. The unit can process lock acquisitions in 2 clock cycles and releases in 1 clock cycle.

I. INTRODUCTION

On any system where multiple tasks interleave, and/or execute concurrently, the need for synchronization arises to prevent data model corruption. An example would be 2 tasks, $\tau_1$, that updates a record structure, and $\tau_2$, which reads the record. If $\tau_2$ reads the record before $\tau_1$ is finished updating the entire record, $\tau_2$ may read incorrect data.

One solution to this issue is locking. A lock is a synchronization mechanism that protects some critical region and employs the notion of ownership, e.g., if $\tau_1$ has acquired the lock that protects the record, $\tau_2$ will not read the record until $\tau_1$ has released the lock and $\tau_2$ has acquired it.

Locks, as well as other synchronization mechanisms (mutexes, semaphores, etc.), consist of several operations that must execute in a synchronous manner, i.e., be atomic. For locks, this means that a task should not be able to partially own a lock. It either owns it or not. Acquiring or releasing a lock should mimic a single operation. Therefore, locks and other synchronization mechanisms commonly employ hardware supported atomic operations to ensure that the higher-level operations become atomic.

Compare-And-Swap (CAS) is an example of a common hardware supported atomic operation. It checks whether a memory location has the expected value and, if so, swaps it with a newly supplied value, all in a single operation. Locks built upon CAS work generally well but suffer from two potential issues. Firstly, when two or more tasks contend for the same memory location, i.e., try to update the value of the same memory location, there is a small possibility that one or more of the tasks will suffer from starvation. This happens if a task tries to swap the value with its own, but with every attempt another task has already swapped it, resulting in the initial task never being able to swap the value and therefore, never acquiring the lock. Secondly, the locking operation requires multiple memory operations in addition to CAS. When multiple cores connect to the same memory, an arbiter handles access to the memory. There are different arbitration policies, but often, in the worst case, the time that an arbiter prevents a core from accessing the memory scales linearly with the number of cores, because another core is using it. This potentially results in the locking time scaling linearly as well.

Whilst these issues are undesirable on any system, they are egregious on real-time systems where tasks’ worst-case execution time (WCET) affects, or is the basis for, their scheduling. If the potential for starvation is present, hard guarantees on the WCET cannot be made, and the schedules are therefore potentially broken.

In this paper we present and evaluate our design and implementation of Hardlock, a concurrent real-time hardware locking unit. We designed the unit with the goal of minimizing the WCET and being starvation free. It is based on previous work from our patent application [1], although this is the first time it is implemented in Chisel, as well as integrated with a multicore processor and evaluated.

This paper is organized in 5 sections: The next section, Section II, presents related work in the field and Section III provides background on the T-CREST multicores platform, which is used for the implementation and evaluation of the locking unit. Section IV describes our design and our implementation. Section V evaluates our implementation and Section VI concludes the paper.

II. RELATED WORK

Carter et al. [2] compare 6 lock implementations, 2 of which are software implementations with CAS or similar atomic operation support, and 4 hardware implementations. Their findings show that hardware implementations can reduce the lock acquisition and release time by 25-94% compared to well-tuned software locks. Using their own benchmark with heavy contention, the hardware locks outperform the software
locks by up to 75%, whilst on a SPLASH-2 benchmark suite, the hardware locks perform 3-6% better.

Patel et al. [3] describe a hardware implementation of a multi-word CAS (MCAS) for multi-core systems. They find that on average their implementation is up to 13.8 times faster than locks. They also claim that starvation does not occur, however, it is unclear whether the claim applies to MCAS requests themselves or whether it also applies to multiple cores continuously making MCAS requests. From the implementation description it seems to be the former, which means that the potential starvation issue of CAS also applies to MCAS.

Afshar et al. [4] propose a synchronization unit connected to all cores, like the Hardlock. It also contains a field for each core to register synchronization participation. However, they designed the unit for low-power systems in a producer/consumer relationship, thus only the power consumption, and not the performance, is tested. Additionally, the unit has a shared counter field, meaning that some arbitration, which is not described, must be done to update the counter.

Milik and Hryniewicz [5] present a complete distributed control system, that also includes hardware memory semaphores. The semaphore allows consumers to be notified as soon as data is ready, or the producer to be notified when it can update data. The semaphores are not centralized. Instead, each consumer has its own semaphore that notifies, or is notified, when data is consumed, or available, respectively. There can only be one producer per semaphore. It is not clear if multiple producers for the same semaphore are allowed, and if so, how the semaphore handles arbitration.

Braojos et al. [6] investigate pre-emptive global resource sharing protocols. They also present their own protocol that features an increased schedulability ratio of task sets and strong task progress guarantees. The Hardlock operates at the core level and therefore does not make any guarantees about the behaviour of threads on the same core pre-empting each other. This is not an issue in this paper, as the T-CREST platform that we integrate the Hardlock with is not configured for more than one thread per core. However, the platform can easily support pre-emptive global resource sharing protocols by utilizing the Hardlock as a global lock and then managing queues and priorities in software.

US Patent 5,276,886 [7] provides atomic access to single-bit locking flags, but does not provide any support for more sophisticated locking protocols.

Altera provides a “mutex core” [8], which implements atomic test-and-set functionality on a register with fields for an owner and a value. However, that unit does not provide support for enqueuing tasks. Therefore, guaranteeing an ordering of tasks entering the critical section (according to priorities of a FIFO policy) must be done in software.

US Patent 8,321,872 [9] describes a hardware unit that provides multiple mutex registers with additional “waiters” flags. The hardware unit can trigger interrupts when locking or unlocking, such that an operating system can adapt appropriate scheduling. The actual handling of the wait queue is carried out by the operating system.

The hardware unit described in US Patent 7,062,583 [10] uses semaphores instead of mutexes, i.e., more than one task can gain access to a shared resource. The hardware unit supports both spin-locking and suspension; in the latter case, the hardware unit triggers an interrupt when the semaphore becomes available. Again, queue handling must be done in software. US Patent Application 11/116,972 [11] builds on that patent, but notably extends it with the possibility to allocate semaphores dynamically.

US Patent Application 10/764,967 [12] proposes hardware queues for resource management. These queues are, however, not used for ordering accesses to a shared resource; instead a queue implements a pool of resources, from which processors can acquire a resource when needed.

Besides using a CAS operation on a shared, external main memory, an on-chip shared scratchpad memory can support synchronization [13]. The shared scratchpad memory is arbitrated in a time-division multiplexing manner for normal read or write operations providing a time-predictable memory for shared data structures. The arbitration scheme is extended, allowing larger access slots where two memory operations, a read and a write, can be performed by a single core. With those two operations executed atomically, locks can be implemented. However, this extension of time slots also leads to higher worst-case access time for normal read and write operations. A variation, with one dedicated synchronization slot, leads to a smaller increase of the worst-case memory access time at the cost of longer lock acquisition times. In contrast, our approach avoids mixing normal access to shared memory and a locking protocol by providing a dedicated locking unit. We envision also using on-chip shared memory for shared data structures protected by a lock from our locking unit.

In our previous work [14] we implement hardware locks for a Java processor that support queues of waiting tasks. The locks support 3 types of atomic operations: requesting a lock, checking ownership, and releasing a lock. Using varying number of processors, we compare the hardware implementation to a software implementation. In all cases the hardware routines are significantly faster than the software routines. This also applies for the benchmarks with a high lock use. The difference between the Java locking unit and the proposed locking unit in this paper, is that the Java locking unit tracks locked memory locations using a content-addressable-memory and has a FIFO queue for each lock. This requires arbitration of requests. The unit in this paper does not rely on FIFO queues and can therefore be without the request arbitration, i.e., cores can concurrently issue requests that the unit processes concurrently, although in the case of contention only one core receives the lock.

III. THE T-CREST PLATFORM

This section provides an overview of the T-CREST multicore platform that we use to implement and evaluate our
The T-CREST multicore processor is designed to support WCET analysis and a local scratchpad memory. Patmos, an open-source WCET analysis tool, is available for this processor. Furthermore, an implementation of WCET analysis tool from AbsInt is supported by a compiler that optimizes for WCET. The platform consists of several processing cores connected to two NoCs: one for core-to-core message passing and one for access to the shared, external memory and connected to the locking unit.

Fig. 1: The T-CREST multicore architecture with several processor cores connected to two NoCs: one for core-to-core message passing and one for access to the shared, external memory and connected to the locking unit.

The T-CREST multicore is a processor optimized for low WCET and to simplify static WCET analysis. All components have been designed to be time-predictable and to enable WCET analysis. The platform consists of several processing cores connected to two networks-on-chip (NoCs): (1) a NoC for message passing between cores, called Argo and (2) a memory NoC for access to shared, external memory.

The processing cores are RISC style processors, called Patmos. Patmos is written in Chisel, a new hardware construction language. Therefore, to easily integrate with the cores, our locking unit implementation is also written in Chisel. The implementation is part of the Patmos source repository and therefore does not require separate retrieval.

Patmos contains special cache memories optimized for WCET analysis and a local scratchpad memory. Patmos is supported by a compiler that optimizes for WCET, based on the LLVM framework. The compiler provides flow facts for the aiT WCET analysis tool from AbsInt and also incorporates WCET path information for aiT. Furthermore, an open-source WCET analysis tool, platin, is available for Patmos.

The Argo NoC provides message passing between processing cores via virtual point-to-point channels. Data is pushed from one core’s local scratchpad memory to a destination scratchpad memory. To be time predictable, the Argo NoC uses static time-division multiplexing for access control to router and link resources. The network interface between the processor and the Argo NoC is synchronized with the time-division multiplexing schedule. This results in a NoC structure without flow control and no additional buffering.

The Argo NoC is also useful in supporting synchronization primitives. Furthermore, the original T-CREST platform supports locks in shared memory with a software-based implementation. It works as follows: to provide a coherent view of the main memory the write-through data cache is invalidated and then Lamport’s bakery algorithm is used to implement locks. This implementation of locks is inefficient, so the proposed locking unit replaces it.

### IV. Design of the Locking Unit

We designed our locking unit with two main goals in mind:

- Minimizing WCET
- Being starvation free

To this end, it is important that cores interacting with unrelated locks do not experience interference from each other’s interactions, i.e., cores can access the unit concurrently and not through some arbiter.

Figure 2 depicts the general architecture of the Hardlock. The Hardlock will stall the Patmos processor cores connected to shared memory, a message passing network-on-chip, and the locking unit.

The Argo NoC is also useful in supporting synchronization primitives. Furthermore, the original T-CREST platform supports locks in shared memory with a software-based implementation. It works as follows: to provide a coherent view of the main memory the write-through data cache is invalidated and then Lamport’s bakery algorithm is used to implement locks. This implementation of locks is inefficient, so the proposed locking unit replaces it.

### Table I: io bundle composition

<table>
<thead>
<tr>
<th>Name</th>
<th>Size</th>
<th>Direction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>en</td>
<td>1</td>
<td>Input</td>
<td>Request activation</td>
</tr>
<tr>
<td>op</td>
<td>1</td>
<td>Input</td>
<td>Request type, i.e., acquisition or release</td>
</tr>
<tr>
<td>sel</td>
<td>(\log_2(m))</td>
<td>Input</td>
<td>Lock ID</td>
</tr>
<tr>
<td>block</td>
<td>1</td>
<td>Output</td>
<td>Core blocked status</td>
</tr>
</tbody>
</table>

The Argo NoC provides message passing between processing cores via virtual point-to-point channels. Data is pushed from one core’s local scratchpad memory to a destination scratchpad memory. To be time predictable, the Argo NoC uses static time-division multiplexing for access control to router and link resources. The network interface between the processor and the Argo NoC is synchronized with the time-division multiplexing schedule. This results in a NoC structure without flow control and no additional buffering.
Fig. 2: The overall architecture of the Hardlock connecting $n$ cores to $m$ locks

Fig. 3: Composition of a lock from Figure 2 showing core signals connected to respective queue register bits and the current register
Acquisition 2
Release 1

An upper bound for this is the sum of acquisition time \( a \), release time \( r \) and critical section \( c \), of all cores potentially involved in the lock for 1 round of round-robin, i.e., 
\[
T_{\text{WCET}} = \sum_{i=0}^{n} a + r + c_i = n \times (2 + 1) + \sum_{i=0}^{n} c_i,
\]
where \( i \) is the id of the core, and \( n \) is the number of potentially involved cores.

Figure 3 also shows how the current register value together with a core’s queue register generate the core’s \( \text{blk} \) signal for the lock. Shown in Figure 4 the unit merges each core’s \( \text{blk} \) signal from each lock, i.e., the unit merges \( \text{blk}_0 \) from each lock to create a single \( \text{blk} \) which it routes to \( \text{io}_0 \).

V. Evaluation

Our evaluation consists of simulation, synthesis, and measurements.

Simulation

We simulate the Hardlock using Chisel’s built in testing framework. From the simulation we can verify that the unit behaves correctly and derive WCET performance. Table II shows the WCET of a lock request and release for the Hardlock. The circular priority encoders in the Hardlock find the next awaiting cores within a single cycle. We add 1 cycle for making the request and updating the queue register.

The Hardlock does not dictate how long cores hold the locks, as this is application dependent. Also, the unit does not prevent deadlocks. The software must handle these issues, e.g., always acquire locks in the same order, no infinite loops while holding locks, etc. Another apparent issue with the unit is the limited number of locks. Whilst this is configurable during hardware generation, it does not have the benefit of CAS where every memory address is a potential lock, i.e., practically infinite locks. However, given an application requiring a limited number of locks, the unit provides nearly optimal performance with regards to cycle count.

Synthesis

For synthesis we target the Cyclone IV FPGA used on the Altera DE2-115 development board. We use the Quartus II Web Edition version 15.0 [29]. Table III contains both the total hardware resources of the unit, as well as the resources used by the OCP connection wrapper, when synthesized with different number of locks and connected to different number of Patmos cores, in logic cells and registers (a logic cell contains a 4-bit lookup table in the Cyclone IV). To put the numbers in relation to the resource usage of a processor, a simple RISC style processor requires about 3000 logic cells and Patmos consumes about 9000 logic cells. Therefore, the hardware resource consumption of the Hardlock, which serves several processors, is negligible.

The number of registers used correspond to the number of queue and current registers, i.e., \( m \times (n + \log_2(n)) + n \).
## TABLE III: Hardlock hardware resource usage

<table>
<thead>
<tr>
<th>Cores</th>
<th>Locks</th>
<th>Total</th>
<th>Logic Cells</th>
<th>Registers</th>
<th>Logic Cells</th>
<th>Registers</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>2</td>
<td>15</td>
<td>8</td>
<td>2</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>2</td>
<td>4</td>
<td>32</td>
<td>14</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>2</td>
<td>16</td>
<td>105</td>
<td>50</td>
<td>5</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>4</td>
<td>2</td>
<td>43</td>
<td>16</td>
<td>4</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>4</td>
<td>4</td>
<td>96</td>
<td>28</td>
<td>4</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>4</td>
<td>16</td>
<td>331</td>
<td>100</td>
<td>14</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>8</td>
<td>2</td>
<td>110</td>
<td>30</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>8</td>
<td>4</td>
<td>229</td>
<td>52</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>8</td>
<td>16</td>
<td>807</td>
<td>184</td>
<td>17</td>
<td>8</td>
<td>8</td>
</tr>
</tbody>
</table>

The cell count corresponds to the number of locks and connected cores. When the number of cores increases, the queue and current registers grow, which results in the priority encoder growing. When the lock count increases, the unit grows almost proportionally with the count. Therefore, we find that the unit is useful with a high number of cores as long as we keep the number of locks low, and vice versa.

Another potential issue with connecting a high number of cores is the limitation on the clock frequency. The circular priority encoder must check all queue registers, which can lead to a very long critical path. However, with the number of processing cores fitting into the FPGA, we did not run into that limitation.

### Measurement

We have created a small test program to measure the performance of uncontended locks when running on the FPGA and using the Hardlock. The program contains 2 loops that execute on all cores and each run 1024 iterations. The first loop measures the time before and after acquiring a lock, as well as the time before and after releasing a lock:

```c
stop1 = TIMER_CLK_LOW;
lock(id);
stop2 = TIMER_CLK_LOW;
unlock(id);
stop3 = TIMER_CLK_LOW;
```

The second loop measures the time before and after acquiring and immediately releasing a lock:

```c
stop1 = TIMER_CLK_LOW;
lock(id);
unlock(id);
stop2 = TIMER_CLK_LOW;
```

For both loops, each core has its own lock to ensure that no core experiences contention.

Table IV contains the results of running the program on 8 cores using 8 locks. Within an operation and core, the average, minimum, and maximum cycle count is identical. In addition, the measured time is equal to our WCET from Table II. Only core 7 has a different acquisition time. This is a result of the circular priority encoder setting the current register to the highest value when the queue is empty. When core 7 then requests the lock, it is already the owner, and therefore allowed to continue a cycle sooner. This is therefore the best-case scenario.

Note, that we also tested the program with different numbers of cores and locks. If each core has its own lock, the results are the same, with the last core always spending 1 cycle less during acquisition.

We have also created 2 test programs to measure and compare the performance of the Hardlock with the SSPM lock [13] for contended locks. The programs are functionally identical, but one uses our locking unit and the other uses the SSPM lock. We therefore refer to them simply as a single program.

The program runs on 4 Patmos cores that all share a single integer array. The size of the array corresponds to the number of locks used, as each lock protects its respective element in the array. All cores also share a counter (cntl) that starts at 10000. Before a core tries to acquire an element specific lock, it first locks the counter, reads its value, decrements it, and unlocks it, as shown in the following excerpt:

```c
lock(0);
if(cntl == 0) {
    unlock(0);
    while(cnt2 > 0) {asm(""n);}
    break;
}
int _cnt = cnt1--;
unlock(0);
```

The core then calculates element and lock to use, after which it acquires the lock, increments the array element, and then executes a busy-wait to simulate a longer critical section, before releasing the lock:

```c
int fldid = _cnt%ki;
int lckid = fldid%LCK_CNT;
lock(lckid);
data[fldid]++;
for(int j = 0; j < WAIT; j++)
    asm(""n);
unlock(lckid);
```

When the counter reaches 0 the cores have incremented the array elements a total of 10000 times. The program then verifies correct execution and lock behaviour by summing all entries and comparing the sum to 10000.

The counter creates a possible point of contention across all cores. The individual element locks allow contention reduction by increasing the number of elements/locks, i.e., using 1 element lock means all cores try to update the same element, whereas with 8 locks, cores update different elements. The number of busy-wait iterations controls the length of each element lock’s critical section.

Figure 5 and Figure 6 contain the results of executing the programs. The y-axis represents the total time in cycles it takes all cores to decrement the counter to 0. The x-axis represents the number of element locks used, i.e., for each busy-wait variant we vary the number of elements used from 1 to 8. The legend contains the lock types (SSPM and Hardlock) and the number of iterations in the busy-loop (10-10000).
TABLE IV: Measured uncontended lock performance in cycles

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>5</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>6</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>7</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
</tbody>
</table>

Fig. 5: Contended lock performance with small critical sections, showing the Hardlock performing faster

The results show that the smaller the critical section is, the more significant the improvement of the Hardlock over the SSPM is, and vice versa, e.g., Figure 6 shows that if the critical section busy-waits for a thousand iterations, the benefits of the Hardlock are negligible, although the Hardlock always performs at least as well as the SSPM. For both locking mechanisms it is evident that having a single lock, and thereby the biggest contention, causes significant overhead. Increasing the lock count beyond the core count has no benefit. We attribute the slight variation seen for Hardlock 10 to locking the counter, as the lock used for the counter is the same as for the first array element.

Source Access

The T-CREST project is an open-source project and therefore we also provide the Hardlock unit and the evaluation benchmarks in open source. Our work is available at [30]. A README explaining how to run the tests and reproduce the results is available at [31] and we provide an Ubuntu virtual machine [32] containing all the build tools.

VI. CONCLUSION

Performance improvements are currently achieved by building multicore processors. Tasks split into several threads need to communicate and coordinate their work, which is commonly done with shared data structures protected by locks. Therefore, the performance of locking is an important aspect of the overall performance.

In this paper we presented a dedicated locking unit in hardware that bounds the acquisition of locks to 2 cycles, if the lock is free, and releases in 1 cycle. With an implementation of the locking unit in an FPGA we showed that the hardware cost is relatively low, using 807 logic cells when supporting 8 cores and 16 locks.

We also ran test programs that verified the acquisition and release performance. Additionally, we compared the unit to another locking implementation and showed that, for short critical sections, programs can perform twice as fast.

Additionally, the process for selecting the next lock owner prevents starvation. Our unit is therefore a good solution for time-predictable computer architectures.
Acknowledgment

This work was partially funded by the Faroese Research Council under project no. 0607.

REFERENCES