OC April2002 Alvaro Slides

AC: PRESTO = Precursory Research for Embryonic Science and Technology, and JST = Japan Science and Technology
Quad-tree image compression using reconfigurable free-space optical interconnections and pipelined parallel processors



LCD/SLM


Alvaro Cassinelli*, Makoto Naruse*,** and Masatoshi Ishikawa*
Ishikawa-Hashimoto lab. University of Tokyo*, PRESTO JST**

Plan of the presentation
I. OCULAR architectures for computing
- Reconfigurable Single Stage (OCULAR-I)
- Reconfigurable Multi-stage (OCULAR-II)


II. OCULAR-II demonstration: Quad-tree compression.
- Quad-tree compression algorithm
- Set-up and Demonstration
- Discussion

III. Conclusion and further work

I. OCULAR architectures for computing
I.1 Reconfigurable Single Stage (OCULAR-I)
I.2 Reconfigurable Multi-stage (OCULAR-II)

OCULAR = Optoelectronic Computer Using Laser Arrays with Reconfiguration

[Figure labels: Processing Element Array, Photo Detector Array, VCSEL array, Optical Interconnections, 2D array of data, Optical feed-back, Output]

I.1 Single-stage paradigm for parallel computing
Optical technology offers enhanced parallel communication primitives…
…of great benefit for network-based parallel computers (= distributed memory, as opposed to shared memory)

Static: fixed interconnection (X, Y, and Z)
…switches inside processors (local control)
[Figure labels: processors P1, P2, … Pn; inside each processor: Mem, ALU, control, mux]

Dynamic: reconfigurable interconnection (X, Y or Z)
…switches outside processors (local or global/external control possible)
[Figure labels: processors P1, P2, … Pn; controller]

I.1 Dynamic architecture vs. static [slide not shown in main presentation]
In an n-degree static topology,
each processor has n distinct
optoelectronic I/O ports…

[Figure labels: processors P1, P2, … Pn; switches; interconnections]

Technologically challenging
Non-reusable architecture
Poor scalability

…anyway, static
networks can be
redesigned as
single-stage
dynamic
networks…

[Figure labels: P1, P2, … Pn; feed-back loop]

…processors, switches and
interconnections located in
distinct modules

Optimal use of electronics, optoelectronics and optics
Scalability, hardware reusability in other topologies
Possible introduction of multiple stages…

I.1 OCULAR-I system architecture
dynamic single stage…

…optical architecture
[Figure labels: VCSEL array, Photo-detector array, Elementary Processor Array (P1, P2, … Pn), Optical interconnection module (X, Y, Z), Optical feed-back]

[ Modular architecture ]
2D optoelectronic processing layer (PD-PE-VCSEL)
+
Switches and interconnections: reconfigurable diffractive optics module

Processing Module
[ VCSEL array ]
850 nm VCSELs
Modulation > 1 GHz (possible 10-50 GHz)

[ Photo-detector array ]
Si photo-detectors with integrated amplifier / threshold

[ SIMD Processor array ]
8x8 PEs (on FPGA)
Each PE: registers A and B, ALU, local memory (24 bits), mapped I/O (VCSEL, PD), 4-neighbors
Electronic mesh for rapid short-range communication between PEs.

Each array attached to a PCB
10 MHz operation demonstrated

Reconfigurable interconnection module

Folded 4-f system
14 x 25 x 6.2 cm
The module generates the interconnection pattern…

alvaro: In this optical interconnection module, we require adjustable components to adapt the diffraction position onto the LD and PD. We have designed a zooming Fourier transform lens as the adjustable component. The focal length is adjustable from 360 mm to 440 mm by moving one of the lenses, as illustrated in the figure. This function is important for matching interconnection parameters such as the pixel pitches of the VCSEL array, the PD array and the CGH, and for compensating for wavelength variation of the VCSEL array.
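The matching role of the zoom lens can be illustrated numerically. In the paraxial approximation, a grating of period Λ in front of a Fourier transform lens of focal length f sends its first diffraction order to x = λ·f/Λ in the focal plane. The sketch below only applies that textbook relation; the CGH period, the target pitch and the wavelength drift are illustrative assumptions, not values taken from the slides.

```python
def first_order_offset_mm(wavelength_nm, focal_mm, period_um):
    """Paraxial first-order position x = lambda * f / period, in mm."""
    wavelength_mm = wavelength_nm * 1e-6
    period_mm = period_um * 1e-3
    return wavelength_mm * focal_mm / period_mm


def focal_for_offset_mm(wavelength_nm, offset_mm, period_um):
    """Focal length (mm) that puts the first order at offset_mm."""
    wavelength_mm = wavelength_nm * 1e-6
    period_mm = period_um * 1e-3
    return offset_mm * period_mm / wavelength_mm


# Hypothetical CGH period, chosen so that an 850 nm beam lands
# 0.25 mm away when the zoom lens sits mid-range (f = 400 mm).
PERIOD_UM = 1360.0

nominal_offset = first_order_offset_mm(850, 400, PERIOD_UM)

# Assumed 5 nm wavelength drift: focal length that restores the same offset.
refocused = focal_for_offset_mm(855, nominal_offset, PERIOD_UM)

print(f"offset at 850 nm, f = 400 mm: {nominal_offset:.3f} mm")
print(f"focal length needed at 855 nm: {refocused:.1f} mm")
```

Under these assumed numbers the corrected focal length stays inside the 360 mm to 440 mm zoom range quoted in the note.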

Laser diode

FT
lens

The CGH is generated by an
optically addressable SLM,
using a laser diode and a
liquid crystal display coupled
through a fiber optic plate.

=

X
Y
Z

…it is therefore responsible for
interconnection and switching

Space-invariant interconnections – good/bad?
Free-space – alignment issues?
Multi-level CGH – good diffraction efficiency
Reconfiguration (“switch”) freq. – 100 Hz…
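What "space-invariant" means here can be shown with a toy model: a single hologram applies the same set of relative shifts to every emitter, so the pattern on the detector plane is the source pattern convolved with the hologram's spot pattern. The sketch below is a minimal illustration under that assumption; the function name and the 4x4 array size are hypothetical.

```python
def space_invariant_interconnect(sources, shifts):
    """Apply the same (d_row, d_col) shifts to every ON source.

    Returns the number of beams landing on each detector.
    Beams falling outside the array are lost.
    """
    rows, cols = len(sources), len(sources[0])
    received = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not sources[r][c]:
                continue
            for d_row, d_col in shifts:
                rr, cc = r + d_row, c + d_col
                if 0 <= rr < rows and 0 <= cc < cols:
                    received[rr][cc] += 1
    return received


# Two emitters ON, one hologram sending light one pixel right and one pixel down.
sources = [[0] * 4 for _ in range(4)]
sources[0][0] = 1
sources[2][2] = 1
received = space_invariant_interconnect(sources, [(0, 1), (1, 0)])
for row in received:
    print(row)
```

The limitation is visible in the model: two emitters cannot be given different fan-out patterns unless the hologram is changed between steps.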

I.2 Multi-stage paradigm for parallel computing
A single-stage architecture can be “spanned” into multi-stages.

Single-Stage [computing]: Mesh, Cube Cycle, Tree, Pyramid, Hypercube, De Bruijn
Multi-Stages [computing & networking]: Delta, Omega, Benes, Clos, Shuffle/exchange, Banyan

[Figure labels: processors P1, P2, … Pn repeated at Stage 1, Stage 2, … Stage m; Switch & Interconnection modules S&I-1, S&I-2, … S&I-m]

The cost of multiplying the processors is paid back as…

Simplicity & Speed – S & I does not need to be complex
(shuffle-exchange networks).

Scalability / Reconfigurability – for different topologies.
Pipelining – possible.
Theoretical background – Multi-stage architectures have
been studied for decades in networking applications…
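To make the "simple S & I" point concrete, here is a minimal sketch of destination-tag routing through shuffle-exchange stages, as in a classical Omega network. Each stage applies a fixed perfect shuffle followed by a two-state exchange switch. It illustrates the textbook scheme, not the OCULAR-II interconnection itself; all names are hypothetical.

```python
def perfect_shuffle(node, n_bits):
    """Fixed interconnection: rotate the n-bit address left by one."""
    msb = (node >> (n_bits - 1)) & 1
    return ((node << 1) & ((1 << n_bits) - 1)) | msb


def route(src, dst, n_bits):
    """Route src to dst through n_bits shuffle-exchange stages.

    At each stage the exchange switch sets the lowest address bit
    to the next destination bit (most significant bit first).
    Returns the list of visited addresses.
    """
    node = src
    path = [node]
    for stage in range(n_bits):
        node = perfect_shuffle(node, n_bits)
        wanted_bit = (dst >> (n_bits - 1 - stage)) & 1
        node = (node & ~1) | wanted_bit
        path.append(node)
    return path


print(route(5, 2, 3))
```

With 2^n processors, n identical stages are enough to reach any destination, which is why each switch and interconnection module can stay simple.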

I.2 OCULAR-II system architecture
[Figure labels: Two-layer module; Optoelectronic processing module (Photo-detector array, Elementary Processor Array, VCSEL array); Optical interconnection module (x3)]

II. Quad-tree compression on OCULAR-II
II.1 Quad-tree compression algorithm
II.2 Set-up and Demonstration
II.3 Discussion
[Figure labels: Sender array (PE array, VCSELs); Interconnection module (SLM); Receiver array (Photo Detectors, PE array); Electrical feed-back through host computer]

II.1 Principle of the quad-tree compression algorithm

Image… …corresponding tree
[Figure: image split into quadrants A, B, C, D at each level, next to the corresponding tree with level 3 (root), level 2, level 1 and level 0]

- This group of pixels is a level 2 leaf of address B
- level 1 leaf of address DB
- …this pixel is a level 0 leaf of address CDA
- …this pixel is NOT a leaf

Leaf = ( level , address )

Image as a tree = ( 2 , B ) + ( 1 , DB ) + ( 0 , CDA )
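The (level, address) encoding above can be reproduced with a short recursive sketch. It assumes the quadrant layout shown later in the slides (A top-left, B top-right, D bottom-left, C bottom-right) and a square binary image whose side is a power of two. The function name and the test image are illustrative, not taken from the presentation.

```python
# (label, row offset, column offset) of each quadrant, in units of half the block size.
QUADRANTS = (("A", 0, 0), ("B", 0, 1), ("D", 1, 0), ("C", 1, 1))


def quadtree_leaves(image, row=0, col=0, size=None, address=""):
    """Return the list of (level, address) leaves of a binary image."""
    if size is None:
        size = len(image)
    pixels = [image[r][c]
              for r in range(row, row + size)
              for c in range(col, col + size)]
    if all(pixels):
        # Uniform ON block: a leaf whose level is log2(block size).
        return [(size.bit_length() - 1, address)]
    if not any(pixels):
        return []  # uniform OFF block: nothing to store
    half = size // 2
    leaves = []
    for label, d_row, d_col in QUADRANTS:
        leaves += quadtree_leaves(image, row + d_row * half,
                                  col + d_col * half, half, address + label)
    return leaves


# 8x8 image holding the three leaves of the slide's example.
image = [[0] * 8 for _ in range(8)]
for r in range(0, 4):          # quadrant B, 4x4 block
    for c in range(4, 8):
        image[r][c] = 1
for r in (4, 5):               # quadrant D, sub-quadrant B, 2x2 block
    for c in (2, 3):
        image[r][c] = 1
image[6][4] = 1                # C, then D, then A: a single pixel

leaves = quadtree_leaves(image)
print(leaves)  # [(2, 'B'), (1, 'DB'), (0, 'CDA')]
```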

II.1 Quad-tree compression on OCULAR-II architecture

AC: Rem: data from the receiver side to the sender side is electronically fed back through the host computer…

• Initialization
- Load the 2^N x 2^N image. ON pixels are set as lowest-level leaves in the local PE memories.

• From stage to stage (array n to array n+1, then array n+1 to array n+2, …)
1) Detect upper leaves
- sequentially broadcast the leaf values to the corresponding upper PE.
- compare on the receiver side.
- update the leaf level of the upper-level PE if its corners turn out to be lower-level “false” leaves.
2) Cutting branches
- broadcast, in parallel, a signal resetting the false low-level leaves.

• End on last stage
- Download data from the last array.
- Save data (level, address) from the PEs which are still leaves.
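The stage-to-stage procedure can be mimicked in software as a bottom-up pass. The sketch below is a sequential stand-in for what the PE arrays do in parallel: it ignores the optical broadcast and the host feed-back, and it reports leaves as (level, block row, block column) rather than as letter addresses. All names are hypothetical.

```python
def compress_bottom_up(image):
    """Bottom-up quad-tree pass: detect upper leaves, then cut branches."""
    size = len(image)
    n_levels = size.bit_length() - 1          # N for a 2^N x 2^N image
    # Initialization: ON pixels are level 0 leaves.
    current = {(r, c): bool(image[r][c])
               for r in range(size) for c in range(size)}
    kept = []
    for level in range(n_levels):
        blocks = size >> (level + 1)          # blocks per side on the upper level
        upper = {}
        for r in range(blocks):
            for c in range(blocks):
                corners = [(2 * r, 2 * c), (2 * r, 2 * c + 1),
                           (2 * r + 1, 2 * c), (2 * r + 1, 2 * c + 1)]
                # Detect upper leaves: all four corners must be leaves.
                is_leaf = all(current[p] for p in corners)
                upper[(r, c)] = is_leaf
                if is_leaf:
                    # Cutting branches: reset the false lower-level leaves.
                    for p in corners:
                        current[p] = False
        kept += [(level, r, c) for (r, c), on in current.items() if on]
        current = upper
    kept += [(n_levels, r, c) for (r, c), on in current.items() if on]
    return kept


image = [[0] * 8 for _ in range(8)]
for r in range(0, 4):
    for c in range(4, 8):
        image[r][c] = 1       # one 4x4 block
for r in (4, 5):
    for c in (2, 3):
        image[r][c] = 1       # one 2x2 block
image[6][4] = 1               # one isolated pixel

result = compress_bottom_up(image)
print(sorted(result))  # [(0, 6, 4), (1, 2, 1), (2, 0, 1)]
```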

Example: interconnection for processing of level 1
1) Detecting leaves
…Is A a level one leaf?
[Figure: CCD image of PD plane showing the first-order and zero-order spots; quadrants A, B, C, D. Legend: broadcasting PE on array n; computing PE on array n+1]

2) Conditional broadcast
…If so, A must update its leaf level and cut lower branches.
[Figure labels: A, B, C, D]

[slide not shown in main presentation]

II.2 OCULAR-II demonstrator setup
• Demonstration is carried out on a two-layer OCULAR-II prototype.
[Figure labels: PE array 1, VCSEL array, Optical interconnection module, PD array, PE array 2]
Multiple-layer processing is simulated thanks to electronic feed-back between the first and second processor arrays.

• Interconnections for each level are time-multiplexed on the SLM module.
[Figure: CGH and diffraction pattern for Level 0, Level 1 and Level 2]

• Two-level CGHs are used (enough diffraction efficiency).

…quad-tree algorithm and hypercube network
Image 2^(n/2) x 2^(n/2) pixels large

2^n elementary processors arranged in
an n-dimensional hypercube topology
[Figure: hypercube with dimensions X, Y, Z, W]

Quad-tree on OCULAR-II: pairs of (6-dimensional) hypercube links are generated
and multiplexed in time thanks to the SLM-based interconnection module…

…on level 1: X, Z

…on level 2: Y, W
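One way to see why each level needs only a pair of hypercube links is to give every PE an address made of its row bits followed by its column bits. The corners of a quadrant at a given level then differ by a single column bit or a single row bit. The sketch below assumes this particular bit assignment, which is hypothetical: the slides do not state how X, Y, Z, W map to address bits.

```python
def node_address(row, col, bits_per_axis):
    """Hypercube address: row bits in the high half, column bits in the low half."""
    return (row << bits_per_axis) | col


def level_link_dimensions(level, bits_per_axis):
    """Pair of hypercube dimensions used when processing `level` (level >= 1)."""
    column_dim = level - 1
    row_dim = bits_per_axis + level - 1
    return column_dim, row_dim


def corner_neighbors(row, col, level, bits_per_axis):
    """Addresses reached from (row, col) along the two links of `level`."""
    column_dim, row_dim = level_link_dimensions(level, bits_per_axis)
    address = node_address(row, col, bits_per_axis)
    return address ^ (1 << column_dim), address ^ (1 << row_dim)


# 4x4 image: 16 PEs, 4-dimensional hypercube.
print(level_link_dimensions(1, 2), level_link_dimensions(2, 2))  # (0, 2) (1, 3)

# 8x8 image: 64 PEs, 6-dimensional hypercube.
print(corner_neighbors(2, 4, 1, 3))  # (21, 28)
```

If dimensions 0 to 3 are named X, Y, Z, W, the 4x4 case gives X and Z on level 1 and Y and W on level 2, which is consistent with the pairs listed above.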


II.2 Quad-Tree Compression Demonstration Setup
[Figure labels: CGH monitor; “receiver” array (SIMD + PD); Monitor CCD; Interconnection module; “sender” array (SIMD + VCSELs)]

Control and results on host computer…

Example: holograms required during level 1 processing.
1) Broadcast hologram (quadrant comparison)
[Figure: first-order and zero-order spots; quadrants A, B, C, D; potential leaf on level one]

2) Re-Broadcast hologram (cutting branches)
[Figure: quadrants A, B, C, D. Legend: broadcasting PE; computing PE]

[slide not shown in main presentation]

Level 0. Detecting upper leaves.

[Figure: Level 0 quadrants, labelled
A B
D C
and level 0 leaves, marked true or false]

…symbolic representation of the initial tree, containing 28
level 0 leaves (most of them false)

Detail of level 0 broadcasting

[slide not shown in main presentation]

[Figure: sender array and receiver array (photo-detector chip surface as seen through the alignment CCD camera). Legend: “D” corners with leaf bit ON; “D” corners with leaf bit OFF.]

In this demonstration we used
two-level phase CGHs
computed by SA.
Only the 1st order of diffraction is
used as the interconnection
pattern.

Level 0. Cutting branches.
[Figure: quadrants A, B, C, D; newly created leaf on level 1]

Level 1. Detecting upper leaves.
[Figure: Level 1 quadrants, labelled
A B
D C]

Level 1. Cutting branches.
[Figure: quadrants A, B, C, D; newly created leaf on level 2]

Level 2. Detecting leaves and cutting branches.
[Figure: Level 2 quadrants, labelled
A B
D C]

…symbolic representation of the encoded image as a
minimal tree with seven leaves.

II.3 Discussion
Compression of a 2^N x 2^N pixel image takes O(5·N) clock cycles...
The SIMD array, VCSELs and photo-detectors can run at more than 100 MHz…



two million 1024x1024 images compressed per second!

8x8
image
(N=3)

28 pixels ON = 28 initial leaves.

15 iterations…

…only seven final leaves

However, SLM reconfiguration limits operation to a hundred hertz at most....
Also, one has to remember that our chips are only 8x8 pixels large.
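The figures quoted on this slide follow from simple arithmetic. The sketch below reads O(5·N) as exactly 5 clock cycles per tree level, which is an assumption made for illustration, and it ignores the SLM reconfiguration time.

```python
def cycles_per_image(n_levels, cycles_per_level=5):
    """Clock cycles needed for a 2^N x 2^N image, with N = n_levels."""
    return cycles_per_level * n_levels


def images_per_second(clock_hz, n_levels, cycles_per_level=5):
    """Throughput when the processing arrays are the only limit."""
    return clock_hz / cycles_per_image(n_levels, cycles_per_level)


# 8x8 demonstrator image: N = 3.
print(cycles_per_image(3))            # 15

# 1024x1024 image: N = 10, with the arrays clocked at 100 MHz.
print(images_per_second(100e6, 10))   # 2000000.0
```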

III. Conclusion and further work
III.1 Summary
III.2 Research underway and further work

III.1 Summary

We have successfully tested the OCULAR-II multistage architecture with
reconfigurable optical interconnections by implementing quad-tree
compression on binary images (= example of embedded hypercube).

However…
The optically addressed SLM-based interconnection module imposes the strongest
bandwidth limitation (a hundred hertz).
Electronic feed-back through the host computer generates parasitic signals
and synchronization problems!
Alignment is not difficult, but may become a critical issue in “true” multistage
architectures...

III.2 Further work: OCULAR-III
[ Research underway ]
Alignment issues (between 2D arrays)
- dynamic alignment using actuators and control theory.
- pre-aligned connectors using fiber bundles.
[Figure label: Fiber bundle]

Concurrent multistage paradigm using fixed interconnections
- design of fixed, guided-wave-based, pre-aligned
interconnection modules (the processor array is in
charge of the switching function) => OCULAR-III
Design of an integrated (VLSI) optoelectronic layer (with switching…)

[ Future research directions ]
- Test of these “modular” architectures for building
computing and networking MINs.
- Design of all-optical networks using the above paradigm.
[Figure labels: network, interconnection modules, processor arrays, IBnC]

http://www.k2.t.u-tokyo.ac.jp/index-e.html