Nikolaos Voros · Michael Huebner · Georgios Keramidas · Diana Goehringer · Christos Antonopoulos · Pedro C. Diniz (Eds.)

Applied Reconfigurable Computing. Architectures, Tools, and Applications

14th International Symposium, ARC 2018, Santorini, Greece, May 2–4, 2018, Proceedings

Lecture Notes in Computer Science 10824

Commenced publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

  Editorial Board

  David Hutchison Lancaster University, Lancaster, UK

  Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA

  Josef Kittler University of Surrey, Guildford, UK

  Jon M. Kleinberg Cornell University, Ithaca, NY, USA

  Friedemann Mattern ETH Zurich, Zurich, Switzerland

  John C. Mitchell Stanford University, Stanford, CA, USA

  Moni Naor Weizmann Institute of Science, Rehovot, Israel

  C. Pandu Rangan Indian Institute of Technology Madras, Chennai, India

  Bernhard Steffen TU Dortmund University, Dortmund, Germany

  Demetri Terzopoulos University of California, Los Angeles, CA, USA

  Doug Tygar University of California, Berkeley, CA, USA

Gerhard Weikum Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7407

Editors

Nikolaos Voros, Technological Educational Institute of Western Greece, Antirrio, Greece
Michael Huebner, Ruhr-Universität Bochum, Bochum, Germany
Georgios Keramidas, Technological Educational Institute of Western Greece, Antirrio, Greece
Diana Goehringer, Technische Universität Dresden, Dresden, Germany
Christos Antonopoulos, Technological Educational Institute of Western Greece, Antirrio, Greece
Pedro C. Diniz, INESC-ID, Lisbon, Portugal

ISSN 0302-9743    ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-78889-0    ISBN 978-3-319-78890-6 (eBook)
https://doi.org/10.1007/978-3-319-78890-6
Library of Congress Control Number: 2018937393
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues
© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature.

  

Preface

  Reconfigurable computing platforms offer increased performance gains and energy efficiency through coarse-grained and fine-grained parallelism coupled with their ability to implement custom functional, storage, and interconnect structures. As such, they have been gaining wide acceptance in recent years, spanning the spectrum from highly specialized custom controllers to general-purpose high-end programmable computing systems. The flexibility and configurability of these platforms, coupled with increasing technology integration, have enabled sophisticated platforms that facilitate both static and dynamic reconfiguration, rapid system prototyping, and early design verification. Configurability is emerging as a key technology for substantial product life-cycle savings in the presence of evolving product requirements, standards, and interface specifications.

The growth of the capacity of reconfigurable devices, such as FPGAs, has created a wealth of new research opportunities and intricate engineering challenges. Within the past decade, reconfigurable architectures have evolved from a uniform sea of programmable logic elements to fully reconfigurable systems-on-chip (SoCs) with integrated multipliers, memory elements, processors, and standard I/O interfaces. One of the foremost challenges facing reconfigurable application developers today is how to best exploit these novel and innovative resources to achieve the highest possible performance and energy efficiency; additional challenges include the design and implementation of next-generation architectures, along with languages, compilers, synthesis technologies, and physical design tools to enable highly productive design methodologies.

The International Applied Reconfigurable Computing (ARC) symposium series provides a forum for dissemination and discussion of ongoing research efforts in this transformative research area. The series started in 2005 in Algarve, Portugal. The second edition of the symposium (ARC 2006) took place in Delft, The Netherlands, and was the first edition to have selected papers published as a Springer LNCS (Lecture Notes in Computer Science) volume. Subsequent editions of the symposium have been held in Rio de Janeiro, Brazil (ARC 2007), London, UK (ARC 2008), Karlsruhe, Germany (ARC 2009), Bangkok, Thailand (ARC 2010), Belfast, UK (ARC 2011), Hong Kong, SAR China (ARC 2012), California, USA (ARC 2013), Algarve, Portugal (ARC 2014), Bochum, Germany (ARC 2015), Rio de Janeiro, Brazil (ARC 2016), and Delft, The Netherlands (ARC 2017).

This LNCS volume includes the papers selected for the 14th edition of the symposium (ARC 2018), held in Santorini, Greece, during May 2–4, 2018. The symposium attracted a large number of very good papers, describing interesting work on reconfigurable computing-related subjects. A total of 78 papers were submitted to the symposium from 28 countries. In particular, the authors of the submitted papers are from the following countries: Australia (3), Belgium (5), Bosnia and Herzegovina (4), Brazil (24), China (22), Colombia (1), France (3), Germany (40), Greece (44), India (10), Iran (4), Ireland (4), Italy (5), Japan (22), Malaysia (2), The Netherlands (5), New Zealand (1), Norway (2), Poland (3), Portugal (3), Russia (8), Singapore (7), South Korea (2), Spain (4), Sweden (3), Switzerland (1), UK (18), and USA (11).

Submitted papers were evaluated by at least three members of the Program Committee. The average number of reviews per submission was 3.7. After careful selection, 29 papers were accepted as full papers (an acceptance rate of 37.2%) and 22 as short papers. The accepted papers led to a very interesting symposium program, which we consider to constitute a representative overview of ongoing research efforts in reconfigurable computing, a rapidly evolving and maturing field. In addition, the symposium included a special session dedicated to funded research projects. The purpose of this session was to present the recent accomplishments, preliminary ideas, or work-in-progress of ongoing research projects. Nine EU- and national-funded projects were selected for presentation in this session.

Several people contributed to the success of the 2018 edition of the symposium. We would like to acknowledge the support of all the members of this year's symposium Steering and Program Committees in reviewing papers, in helping with the paper selection, and in giving valuable suggestions. Special thanks also to the additional researchers who contributed to the reviewing process, to all the authors who submitted papers to the symposium, and to all the symposium attendees. In addition, special thanks to Dr. Christos Antonopoulos from the Technological Educational Institute of Western Greece for organizing the research project special session. Last but not least, we are especially indebted to Anna Kramer from Springer for her support and work in publishing this book and to Pedro C. Diniz from INESC-ID, Lisbon, Portugal, for his strong support regarding the publication of the proceedings as part of the LNCS series.

February 2018

Nikolaos Voros
Michael Huebner
Georgios Keramidas
Diana Goehringer

  

Organization

The 2018 Applied Reconfigurable Computing Symposium (ARC 2018) was organized by the Technological Educational Institute of Western Greece, the Ruhr-Universität Bochum, Germany, and the Technische Universität Dresden, Germany. The symposium took place at the Bellonio Conference Center in Fira, the capital of Santorini, Greece.

  General Chairs

Nikolaos Voros, Technological Educational Institute of Western Greece
Michael Huebner, Ruhr-Universität Bochum, Germany

  Program Chairs

Georgios Keramidas, Technological Educational Institute of Western Greece
Diana Goehringer, TU Dresden, Germany

  Publicity Chairs

Luigi Carro, UFRGS, Brazil
Chao Wang, USTC, China
Dimitrios Soudris, NTUA, Greece
Stephan Wong, TU Delft, The Netherlands

  EU Projects Track Chair

Christos Antonopoulos, Technological Educational Institute of Western Greece

  Proceedings Chair

Pedro C. Diniz, INESC-ID, Lisbon, Portugal

  Web Chair

Christos Antonopoulos, Technological Educational Institute of Western Greece

  Steering Committee

Hideharu Amano, Keio University, Japan
Jürgen Becker, Universität Karlsruhe (TH), Germany
Mladen Berekovic, Braunschweig University of Technology, Germany
Koen Bertels, Delft University of Technology, The Netherlands
Katherine (Compton) Morrow, University of Wisconsin-Madison, USA
George Constantinides, Imperial College of Science, UK
Pedro C. Diniz, INESC-ID, Portugal
Philip H. W. Leong, University of Sydney, Australia
Walid Najjar, University of California Riverside, USA
Roger Woods, The Queen's University of Belfast, UK

  Program Committee

Hideharu Amano, Keio University, Japan
Zachary Baker, Los Alamos National Laboratory, USA
Jürgen Becker, Karlsruhe Institute of Technology, Germany
Mladen Berekovic, C3E, TU Braunschweig, Germany
Nikolaos Bellas, University of Thessaly, Greece
Neil Bergmann, University of Queensland, Australia
Alessandro Biondi, Scuola Superiore Sant'Anna, Italy
João Bispo, FEUP/Universidade do Porto, Portugal
Michaela Blott, Xilinx, Ireland
Vanderlei Bonato, University of São Paulo, Brazil
Christos Bouganis, Imperial College, UK
João Cardoso, FEUP/Universidade do Porto, Portugal
Luigi Carro, Instituto de Informática/UFRGS, Brazil
Ray Cheung, City University of Hong Kong, SAR China
Daniel Chillet, AIRN - IRISA/ENSSAT, France
Steven Derrien, Université de Rennes 1, France
Giorgos Dimitrakopoulos, Democritus University of Thrace, Greece
Pedro C. Diniz, INESC-ID, Portugal
António Ferrari, Universidade de Aveiro, Portugal
João Canas Ferreira, INESC TEC/University of Porto, Portugal
Ricardo Ferreira, Universidade Federal de Viçosa, Brazil
Apostolos Fournaris, Technological Educational Institute of Western Greece, Greece
Carlo Galuzzi, TU Delft, The Netherlands
Roberto Giorgi, University of Siena, Italy
Marek Gorgon, AGH University of Science and Technology, Poland
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg, Germany
Jim Harkin, University of Ulster, UK
Christian Hochberger, TU Darmstadt, Germany
Christoforos Kachris, ICCS, Greece
Kimon Karras, Think Silicon S.A., Greece
Fernanda Kastensmidt, Universidade Federal do Rio Grande do Sul - UFRGS, Brazil
Chrysovalantis Kavousianos, University of Ioannina, Greece
Krzysztof Kepa, GE Global Research, USA
Andreas Koch, TU Darmstadt, Germany
Stavros Koubias, University of Patras, Greece
Dimitrios Kritharidis, Intracom Telecom, Greece
Vianney Lapotre, Université de Bretagne-Sud - Lab-STICC, France
Eduardo Marques, University of São Paulo, Brazil
Konstantinos Masselos, University of Peloponnese, Greece
Cathal McCabe, Xilinx, Ireland
Antonio Miele, Politecnico di Milano, Italy
Takefumi Miyoshi, e-trees.Japan, Inc., Japan
Walid Najjar, University of California Riverside, USA
Horácio Neto, INESC-ID/IST/U Lisboa, Portugal
Dimitris Nikolos, University of Patras, Greece
Roman Obermeisser, University of Siegen, Germany
Kyprianos Papadimitriou, Technical University of Crete, Greece
Monica Pereira, Universidade Federal do Rio Grande do Norte, Brazil
Thilo Pionteck, Otto-von-Guericke Universität Magdeburg, Germany
Marco Platzner, University of Paderborn, Germany
Mihalis Psarakis, University of Piraeus, Greece
Kyle Rupnow, Advanced Digital Sciences Center, USA
Marco Domenico Santambrogio, Politecnico di Milano, Italy
Kentaro Sano, Tohoku University, Japan
Yukinori Sato, Tokyo Institute of Technology, Japan
António Beck Filho, Universidade Federal do Rio Grande do Sul, Brazil
Yuichiro Shibata, Nagasaki University, Japan
Cristina Silvano, Politecnico di Milano, Italy
Dimitrios Soudris, NTUA, Greece
Theocharis Theocharides, University of Cyprus, Cyprus
George Theodoridis, University of Patras, Greece
David Thomas, Imperial College, UK
Chao Wang, USTC, China
Markus Weinhardt, Osnabrück University of Applied Sciences, Germany
Theerayod Wiangtong, KMITL, Thailand
Roger Woods, Queens University Belfast, UK
Yoshiki Yamaguchi, University of Tsukuba, Japan

  Additional Reviewers

Dimitris Bakalis, University of Patras, Greece
Guilherme Bileki, University of São Paulo, Brazil
Ahmet Erdem, Politecnico di Milano, Italy
Panagiotis Georgiou, University of Ioannina, Greece
Adele Maleki, University of Siegen, Germany
Farnam Khalili Maybodi, University of Siena, Italy
Marco Procaccini, University of Siena, Italy
Jose Rodriguez, University of California Riverside, USA
Bashar Romanous, University of California Riverside, USA
Leandro Rosa, University of São Paulo, Brazil
Skyler Windh, University of California Riverside, USA
Vasileios Zois, University of California Riverside, USA

  Sponsors

The 2018 Applied Reconfigurable Computing Symposium (ARC 2018) was sponsored by:

  

Contents

Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis
Jiang Su, Julian Faraone, Junyi Liu, Yiren Zhao, David B. Thomas, Philip H. W. Leong, and Peter Y. K. Cheung
Jiang Su, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Gianluca Durelli, David B. Thomas, Philip H. W. Leong, and Peter Y. K. Cheung
Kazusa Musha, Tomohiro Kudoh, and Hideharu Amano
Panagiotis G. Mousouliotis and Loukas P. Petrou
Konstantinos Katsantonis, Christoforos Kachris, and Dimitrios Soudris
Lukas Johannes Jung and Christian Hochberger
Kalindu Herath, Alok Prakash, and Thambipillai Srikanthan
Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S. K. Nandy, and Ranjani Narayan
Junsik Kim and Jaehyun Park
Kazuei Hironaka, Ng. Anh Vu Doan, and Hideharu Amano
Tim Hansmeier, Marco Platzner, and David Andrews
Almabrok Abdoalnasir, Mihalis Psarakis, and Anastasios Dounis
Santhi Natarajan, N. KrishnaKumar, Debnath Pal, and S. K. Nandy
Masahiro Fukuda and Yasushi Inoguchi
Deepayan Bhowmik and Kofi Appiah
Emmanuel Ofori-Attah, Xiaohang Wang, and Michael Opoku Agyeman
Augusto G. Erichsen, Anderson L. Sartor, Jeckson D. Souza, Monica M. Pereira, Stephan Wong, and Antonio C. S. Beck
Fabio Benevenuti and Fernanda Lima Kastensmidt
Milind Parelkar and Darshan Jetly
Christos P. Antonopoulos, Konstantinos Antonopoulos, Christos Panagiotou, and Nikolaos S. Voros
Bruno da Silva, Laurent Segers, An Braeken, Kris Steenhaut, and Abdellah Touhafi
Lampros Pyrgas and Paris Kitsos
Raheel Afsharmazayejani, Fahimeh Yazdanpanah, Amin Rezaei, Mohammad Alaei, and Masoud Daneshtalab
Luca Sterpone and Ludovica Bozzoli
Benedikt Janßen, Florian Kästner, Tim Wingender, and Michael Huebner
Osvaldo Navarro and Michael Huebner
Rafael Fão de Moura, Michael Guilherme Jordan, Antonio Carlos Schneider Beck, and Mateus Beck Rutzig
Jeckson Dellagostin Souza, Anderson L. Sartor, Luigi Carro, Mateus Beck Rutzig, Stephan Wong, and Antonio C. S. Beck
Kamil Piszczek, Piotr Janus, and Tomasz Kryjak
Sikandar Khan, Kyprianos Papadimitriou, Giorgio Buttazzo, and Kostas Kalaitzakis
Jens Rettkowski and Diana Goehringer
Björn Liebig, Julian Oppermann, Oliver Sinnen, and Andreas Koch
Habib ul Hasan Khan, Ahmed Kamal, and Diana Goehringer
Julián Caba, João M. P. Cardoso, Fernando Rincón, Julio Dondo, and Juan Carlos López
Konstantinos Georgopoulos, Pavlos Malakonakis, Nikolaos Tampouratzis, Antonis Nikitakis, Grigorios Chrysos, Apostolos Dollas, Dionysios Pnevmatikatos, and Ioannis Papaefstathiou
Kris Heid, Jakob Wenzel, and Christian Hochberger
Augusto W. Hoppe, Fernanda Lima Kastensmidt, and Jürgen Becker
Pedro H. Exenberger Becker, Anderson L. Sartor, Marcelo Brandalero, Tiago Trevisan Jost, Stephan Wong, Luigi Carro, and Antonio C. Beck
Mário Lopes Ferreira, João Canas Ferreira, and Michael Huebner
Paulo Garcia, Deepayan Bhowmik, Andrew Wallace, Robert Stewart, and Greg Michaelson
Ayan Palchaudhuri and Anindya Sundar Dhar
Umar Ibrahim Minhas, Roger Woods, and George Karakonstantis
Santhi Natarajan, N. KrishnaKumar, H. V. Anuchan, Debnath Pal, and S. K. Nandy
Zhenhua Guo, Baoyu Fan, Yaqian Zhao, Xuelei Li, Shixin Wei, and Long Li
Hoang-Gia Vu, Takashi Nakada, and Yasuhiko Nakashima
Uzaif Sharif and Shahnam Mirzaei
Johannes Pfau, Shalina Percy Delicia Figuli, Steffen Bähr, and Jürgen Becker
Peter Littlewood, Shahnam Mirzaei, and Krishna Murthy Kattiyan Ramamoorthy
Nikolaos Tzanis, Grigorios Proiskos, Michael Birbas, and Alexios Birbas
Gennaro S. Rodrigues, Ádria Barros de Oliveira, Fernanda Lima Kastensmidt, and Alberto Bosio
Florian Fricke, André Werner, Keyvan Shahin, and Michael Huebner
Christoforos Kachris, Ioannis Stamelos, Elias Koromilas, and Dimitrios Soudris
Jürgen Becker and Falco K. Bapp
Panayiotis Alefragis, George Theodoridis, Merkourios Katsimpris, Christos Valouxis, Christos Gogos, George Goulas, Nikolaos Voros, Simon Reder, Koray Kasnakli, Marcus Bednara, David Müller, Umut Durak, and Juergen Becker
Christos Antonopoulos, Georgios Keramidas, Nikolaos S. Voros, Michael Huebner, Fynn Schwiegelshohn, Diana Goehringer, Maria Dagioglou, Georgios Stavrinos, Stasinos Konstantopoulos, and Vangelis Karkaletsis
Pavlos Malakonakis, Konstantinos Georgopoulos, Aggelos Ioannou, Luciano Lavagno, Ioannis Papaefstathiou, and Iakovos Mavroidis
Ahmad Sadek, Ananya Muddukrishna, Lester Kalms, Asbjørn Djupdal, Ariel Podlubne, Antonio Paolillo, Diana Goehringer, and Magnus Jahre

  Machine Learning and Neural Networks

  

Approximate FPGA-Based LSTMs Under Computation Time Constraints

Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis

Department of Electrical and Electronic Engineering, Imperial College London, London, UK
{michail.rizakis14,stylianos.venieris10,a.kouris16,christos-savvas.bouganis}@imperial.ac.uk

Abstract. Recurrent Neural Networks, with the prominence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. Nevertheless, the highest performing LSTM models are becoming increasingly demanding in terms of computational and memory load. At the same time, emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed system required up to 6.5× less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25× higher accuracy under the same computation time constraints.

Keywords: LSTM · Low-rank approximation · Pruning · FPGAs

1 Introduction

Recurrent Neural Networks (RNNs) are a machine learning model which offers the capability of recognising long-range dependencies in sequential and temporal data. RNN models, with the prevalence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art performance in various AI applications including scene labelling [ ]. Moreover, LSTMs have been successfully employed for AI tasks in complex environments including human trajectory prediction [ ] on mobile robots, with more recent systems combining language and image processing in tasks such as image captioning [ ].

Despite the high predictive power of LSTMs, their computational and memory demands pose a challenge with respect to deployment in latency-sensitive and power-constrained environments. Modern intelligent systems such as mobile robots and drones that employ LSTMs to perceive their surroundings often operate under time-constrained, latency-critical settings. In such scenarios, retrieving the best possible output from an LSTM given a constraint in computation time may be necessary to ensure the timely operation of the system. Moreover, the requirements of such applications for low absolute power consumption, which would enable a longer battery life, prohibit the deployment of high-performance, but power-hungry platforms, such as multi-core CPUs and GPUs. In this context, FPGAs constitute a promising target device that can combine customisation and reconfigurability to achieve high performance at a low power envelope.

In this work, an approximate computing scheme along with a novel hardware architecture for LSTMs are proposed as an end-to-end framework to address the problem of high-performance LSTM deployment in time-constrained settings. Our approach comprises an iterative approximation method that applies simultaneously low-rank compression and pruning of the LSTM model with a tunable number of refinement iterations. This iterative process enables our framework to (i) exploit the resilience of the target application to approximations, (ii) explore the trade-off between computational and memory load and application-level accuracy and (iii) execute the LSTM under a time constraint with increasing accuracy as a function of the computation time budget. At the hardware level, our system consists of a novel FPGA-based architecture which exploits the inherent parallelism of the LSTM, parametrised with respect to the level of compression and pruning. By optimising the parameters of the approximation method, the proposed framework generates a system tailored to the target application, the available FPGA resources and the computation time constraints. To the best of our knowledge, this is the first work in the literature to address the deployment of LSTMs under computation time constraints.

  2 Background

2.1 LSTM Networks

A vanilla RNN typically processes an input and generates an output at each time step. Internally, the network has recurrent connections from the output at one time step to the hidden units at the next time step, which enables it to capture sequential patterns. The LSTM model differs from vanilla RNNs in that it comprises control units named gates, instead of layers. A typical LSTM has four gates. The input gate (Eq. (1)) is responsible for determining how much of the current input will propagate to the output. The forget gate (Eq. (2)) is responsible for determining whether the previous state of the LSTM will be forgotten or not, while the output gate (Eq. (3)) determines how much of the current state will be allowed to propagate to the final output of the LSTM at the current time step. Computationally, the gates are matrix-vector multiplication blocks, followed by a nonlinear elementwise activation function. The equations for the LSTM model are shown below:

i^{(t)} = σ(W_{ix} x^{(t)} + W_{ih} h^{(t−1)})    (1)
f^{(t)} = σ(W_{fx} x^{(t)} + W_{fh} h^{(t−1)})    (2)
o^{(t)} = σ(W_{ox} x^{(t)} + W_{oh} h^{(t−1)})    (3)
c^{(t)} = f^{(t)} ⊙ c^{(t−1)} + i^{(t)} ⊙ tanh(W_{cx} x^{(t)} + W_{ch} h^{(t−1)})    (4)
h^{(t)} = c^{(t)} ⊙ o^{(t)}    (5)

i^{(t)}, f^{(t)} and o^{(t)} are the input, forget and output gates respectively, c^{(t)} is the current state of the LSTM, h^{(t−1)} is the previous output, x^{(t)} is the current input at time t and σ(·) represents the sigmoid function. Eq. (5) is frequently found in the literature as h^{(t)} = c^{(t)} ⊙ tanh(o^{(t)}), with tanh(·) applied to the output gate. In this work, we follow the image captioning LSTM proposed in [ ], which removes the tanh(·) from the output gate and therefore we end up with Eq. (5). Finally, all the W matrices denote the weight matrices that contain the trainable parameters of the model, which are assumed to be provided.
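To make the gate computations concrete, the following minimal NumPy sketch implements Eqs. (1)–(5) for a single time step; the function and the weight-dictionary layout are illustrative choices of this text, not part of the original paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time step following Eqs. (1)-(5).

    W is a dict of weight matrices: W['ix'] and W['ih'] feed the input
    gate, W['fx']/W['fh'] the forget gate, and so on (hypothetical layout).
    """
    i = sigmoid(W['ix'] @ x_t + W['ih'] @ h_prev)   # Eq. (1): input gate
    f = sigmoid(W['fx'] @ x_t + W['fh'] @ h_prev)   # Eq. (2): forget gate
    o = sigmoid(W['ox'] @ x_t + W['oh'] @ h_prev)   # Eq. (3): output gate
    # Eq. (4): new cell state from the forget and input paths
    c = f * c_prev + i * np.tanh(W['cx'] @ x_t + W['ch'] @ h_prev)
    h = c * o                                       # Eq. (5): no tanh on the output gate
    return h, c
```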

3 Related Work

The effectiveness of RNNs has attracted the attention of the architecture and reconfigurable computing communities. Li et al. [ ] proposed an FPGA-based accelerator for the training of an RNN language model. In [ ], the authors focus on the optimised deployment of the Gated Recurrent Unit (GRU) model [ ] in data centres with server-grade FPGAs, ASICs, GPUs and CPUs and propose an algorithmic memoisation-based method to reduce the computational load at the expense of increased memory footprint. The authors of [ ] present an empirical study of the effect of different architectural designs on the computational resources, on-chip memory capacity and off-chip memory bandwidth requirements of an LSTM model. Finally, Guan et al. [ ] proposed an FPGA-based LSTM accelerator optimised for speech recognition on a Xilinx VC707 FPGA platform.

From an algorithmic perspective, recent works have followed a model-hardware co-design approach. Han et al. [ ] proposed an FPGA-based speech recognition engine that employs a load-balance-aware compression scheme in order to compress the LSTM model size. Wang et al. [ ] presented a method that addresses compression at several levels including the use of circulant matrices for three of the LSTM gates and the quantisation of the trained parameters, together with the corresponding ASIC-based hardware architecture. Zhang et al. [ ] presented an FPGA-based accelerator for a Long-Term Recurrent Convolutional Network (LRCN) for video footage description that consists of a CNN followed by an LSTM. Their design focuses on balancing the resource allocation between the layers of the LRCN and pruning the fully-connected and LSTM layers. These approaches deviate from the faithful mapping of the original model and employ retraining to compensate for the introduced error of each proposed method. Finally, He and Sun [ ] focused on CNNs and investigated algorithmic strategies for model selection under computation time constraints for both training and testing.

Our work differs from the majority of existing efforts by proposing a hardware architecture together with an approximate computing method for LSTMs that is application-aware and tunable with respect to the required computation time. Our scheme is similar to [ ] in proposing an approximation to the model, but in contrast to these methods does not require a retraining phase and assumes no access to the full training set. Instead, with a limited subset of labelled data, our scheme compensates for the induced error by means of iterative refinement, making it suitable for applications where the dataset is privacy-critical and where the quality of the approximation improves as the time availability increases.

4 Methodology

In this section, the main components of the proposed framework are presented (Fig. 1). Given an LSTM model with its set of weight matrices and a small application evaluation set, the proposed system searches for an appropriate approximation scheme that meets the application's needs, by applying low-rank compression and pruning on the model. The design space is traversed by means of a roofline model to determine the highest performing configuration of the proposed architecture on the target FPGA. In this manner, the trade-off between computation time and application-level error is explored for different approximation schemes. The design point to be implemented on the device is selected based on user-specified requirements for the maximum computation time or application-level error tolerance.

Fig. 1. Design flow of the proposed framework

4.1 Approximations for LSTMs

At the core of an LSTM's computational workload lie the matrix-vector multiplications in each of the four gates. Neural networks have been extensively studied and shown to have redundancy in terms of their trained parameters [ ]. To reduce the computational demands of the LSTM, we propose an approximate computing scheme that enables the tuning between computational cost and application-level accuracy. The proposed approach exploits the statistical redundancy of the LSTM by acting at two levels: (i) approximating the weight matrices with a low-rank, SVD-based decomposition and (ii) pruning the network by sparsifying the weight matrices based on an importance criterion of their elements.

Low-rank approximation. Based on the set of LSTM Eqs. (1)–(5), each gate consists of two weight matrices corresponding to the current input and previous output vectors respectively. In our scheme, we construct an augmented matrix by concatenating the input and output weight matrices, as shown in Eq. (7). Similarly, we concatenate the input and previous output vectors (Eq. (6)), and thus the overall gate computation is given by Eq. (8):

x̃^{(t)} = [x^{(t)T} h^{(t−1)T}]^T    (6)
W_i = [W_{ix} W_{ih}], ∀i ∈ [1, 4]    (7)
y_i = nonlin(W_i x̃^{(t)}), ∀i ∈ [1, 4]    (8)

where nonlin(·) is either the sigmoid function σ(·) or tanh(·). In this way, a single weight matrix is formed for each gate, denoted by W_i ∈ R^{R×C} for the i-th gate. We perform a full SVD decomposition on the four augmented matrices independently as W_i = U_i Σ_i V_i^T, ∀i ∈ [1, 4], where U_i ∈ R^{R×R}, Σ_i ∈ R^{R×C} and V_i ∈ R^{C×C}, and employ a rank-1 approximation W_i ≈ σ_1^i u_1^i v_1^{iT} by keeping the singular vectors that correspond to the largest singular value.
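As an illustration, the rank-1 step can be sketched with NumPy's SVD; the matrix sizes below are arbitrary placeholders, not dimensions from the paper:

```python
import numpy as np

def rank1_factors(W):
    """Leading singular triplet of W, so that sigma1 * outer(u1, v1) is the
    rank-1 approximation used in Sect. 4.1."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return S[0], U[:, 0], Vt[0, :]

# Augmented gate matrix per Eq. (7): concatenate input and recurrent weights
W_ix, W_ih = np.random.randn(64, 32), np.random.randn(64, 64)  # illustrative sizes
W_aug = np.hstack([W_ix, W_ih])
sigma1, u1, v1 = rank1_factors(W_aug)
W_rank1 = sigma1 * np.outer(u1, v1)
```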

Pruning by means of network sparsification. The second level of approximation on the LSTM comprises the structured pruning of the connectivity between neurons. With each neural connection being captured as an element of the weight matrices, we express network pruning as sparsification applied on the augmented weight matrices (Eq. (7)). To represent a sparse LSTM, we introduce four binary mask matrices F_i ∈ {0, 1}^{R×C}, ∀i ∈ [1, 4], with each entry representing whether a connection is pruned or not. Overall, we employ the following notation for a (weight, mask) matrix pair: {W_i, F_i | i ∈ [1, 4]}.

In the proposed scheme, we explore sparsity with respect to the connections per output neuron and constrain each output to have the same number of inputs. We cast LSTM pruning as an optimisation problem of the following form:

min ||W_i − F_i ⊙ W_i||_F^2,  s.t. ||f_j^i||_0 = NZ, ∀i ∈ [1, 4], ∀j ∈ [1, R]    (9)

where f_j^i is the j-th row of F_i and NZ is the number of non-zero elements on each row, with ||·||_0 denoting the number of non-zero entries in a vector. The solution to the optimisation problem in Eq. (9) is given by keeping the NZ elements on each row of W_i with the highest absolute value and setting their indices to 1 in F_i.
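The row-wise solution of Eq. (9) amounts to a top-NZ magnitude selection per row, sketched below (function name and structure are ours):

```python
import numpy as np

def prune_rows(W, NZ):
    """Solve Eq. (9): keep the NZ largest-magnitude entries in each row of W.

    Returns the binary mask F and the pruned matrix F ⊙ W.
    """
    F = np.zeros_like(W)
    top = np.argsort(-np.abs(W), axis=1)[:, :NZ]   # indices of NZ largest |W| per row
    F[np.arange(W.shape[0])[:, None], top] = 1.0
    return F, F * W
```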

In contrast to the existing approaches, the proposed pruning method does not employ retraining and hence removes this computationally expensive step and the requirement for the training set, which is important for privacy-critical applications. Even though our sparsification method does not explicitly capture the impact of pruning on the application-level accuracy, our design space exploration, detailed in Sect. 5, searches over different levels of sparsity and as a result explores the effect of pruning on the application.

Hybrid compression and pruning. By applying both low-rank approximation and pruning, we end up with the following weight matrix approximation:

W_i = F_i ⊙ (σ_1^i u_1^i v_1^{iT})    (10)

In this setting, for the i-th gate the ranking of the absolute values in each row of the rank-1 approximation σ_1^i u_1^i v_1^{iT} depends only on v_1^i, with each element of σ_1^i u_1^i operating as a shared scaling factor for all elements of a row. Therefore, for the i-th gate all the rows of F_i become identical and hence can be represented by a single mask vector f^i ∈ {0, 1}^C. This leads to a weight matrix with zeros along (C − NZ) of its columns, which is described by the following expression:

W_i = σ_1^i u_1^i (f^i ⊙ v_1^i)^T    (11)

Applying N_steps refinement iterations, the overall gate computation becomes:

y_i = Σ_{n=1}^{N_steps} σ_1^{i(n)} u_1^{i(n)} (f^{i(n)} ⊙ v_1^{i(n)})^T x̃^{(t)}    (12)

In order to obtain a refinement mechanism, we propose an iterative algorithm, presented in Algorithm 1, that employs both the low-rank approximation and pruning methods to progressively update the weight matrix. On lines 4–6, the first approximation of the weight matrix is constructed by obtaining the rank-1 approximation of the original matrix and applying pruning in order to have NZ non-zero elements on each row, as in Eq. (11). Next, the weight matrix is refined for N_steps iterations, by computing the error matrix E (line 10) and employing its pruned rank-1 approximation as an update (line 15).

Different combinations of levels of sparsity and refinement iterations correspond to different design points in the computation-accuracy space. In this respect, the number of non-zero elements in each binary mask vector and the number of refinement iterations are exposed to the design space exploration as tunable parameters (NZ, N_steps) to explore the LSTM computation-accuracy trade-off.

4.2 Architecture

The proposed FPGA architecture for LSTMs is illustrated in Fig. 2. The main strategy of the architecture includes the exploitation of the coarse-grained parallelism across the four gates, together with the fine-grained parallelism in the dot-product and elementwise operations of the LSTM, allowing for a compile-time tunable performance-resource trade-off.

Algorithm 1. Iterative LSTM Model Approximation
Inputs:
  1: Weight matrices W_i ∈ R^{R×C}, ∀i ∈ [1, 4]
  2: Number of non-zero elements, NZ
  3: Number of refinement iterations, N_steps
Steps:
  1: -- For all gates --
  2: for i = 1 to 4 do
  3:   -- Initialise weight matrix approximation --
  4:   u_1^{i(0)}, σ_1^{i(0)}, v_1^{i(0)} = SVD_1(W_i)
  5:   f^{i(0)} ← solution to Eq. (9) for vector v_1^{i(0)}
  6:   W_i^{(0)} = σ_1^{i(0)} u_1^{i(0)} (f^{i(0)} ⊙ v_1^{i(0)})^T
  7:   -- Apply refinements --
  8:   for n = 1 to N_steps do
  9:     -- Compute error matrix --
 10:     E = W_i − W_i^{(n−1)}
 11:     -- Compute refinement --
 12:     u_1^{i(n)}, σ_1^{i(n)}, v_1^{i(n)} = SVD_1(E)
 13:     f^{i(n)} ← solution to optimisation problem (9) for vector v_1^{i(n)}
 14:     -- Update weight matrix approximation --
 15:     W_i^{(n)} = W_i^{(n−1)} + σ_1^{i(n)} u_1^{i(n)} (f^{i(n)} ⊙ v_1^{i(n)})^T
 16:   end for
 17: end for
Notes: SVD_1(X) returns the rank-1 SVD-based approximation of X.

SVD and Binary Masks Precomputation. In Algorithm 1, the number of refinement iterations (N_steps), the level of sparsity (NZ) and the trained weight matrices are data-independent and known at compile time. As such, the required SVD decompositions along with the corresponding binary masks are precomputed for all N_steps iterations at compile time. As a result, the singular values σ_1^{i(n)}, the vectors u_1^{i(n)} and only the non-zero elements of the sparse f^{i(n)} ⊙ v_1^{i(n)} are stored in the off-chip memory, so that they can be looked up at run time.
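A direct software rendering of Algorithm 1 for one gate is sketched below; it also produces the per-iteration factors that, per the paragraph above, would be precomputed and stored off-chip (structure and names are our own, not the paper's):

```python
import numpy as np

def approximate_gate_weights(W, NZ, N_steps):
    """Iterative LSTM model approximation (Algorithm 1) for a single gate.

    Returns the per-iteration (sigma1, u1, masked v1) factors and the
    resulting approximation of the augmented weight matrix W.
    """
    def pruned_rank1(X):
        # Rank-1 SVD of X with the mask of Eq. (9) applied to v1
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        sigma1, u1, v1 = S[0], U[:, 0], Vt[0, :]
        f = np.zeros_like(v1)
        f[np.argsort(-np.abs(v1))[:NZ]] = 1.0      # keep NZ largest |v1| entries
        return sigma1, u1, f * v1

    # Lines 4-6: initialise the approximation
    sigma1, u1, v1m = pruned_rank1(W)
    W_approx = sigma1 * np.outer(u1, v1m)
    factors = [(sigma1, u1, v1m)]
    # Lines 8-15: refine against the error matrix
    for _ in range(N_steps):
        E = W - W_approx                            # line 10: error matrix
        sigma1, u1, v1m = pruned_rank1(E)
        W_approx += sigma1 * np.outer(u1, v1m)      # line 15: update
        factors.append((sigma1, u1, v1m))
    return factors, W_approx
```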

Inter-gate and Intra-gate Parallelism. In the proposed architecture, each gate is allocated a dedicated hardware gate unit, with all gates operating in parallel. At each LSTM time-step t, a hardware gate unit computes its output by performing N_steps refinement iterations as in Eq. (12). At the beginning of the time-step, the current vector x̃^{(t)} is stored on-chip as it will be reused in each iteration by all four gates. The vectors u_1^{i(n)} and v_1^{i(n)} for each gate, along with their singular values σ_1^{i(n)}, are streamed into the architecture from the off-chip memory in a tiled manner.

Fig. 2. Diagram of proposed hardware architecture

u_1^{i(n)} and v_1^{i(n)} are tiled with tile sizes of T_r and T_c respectively, leading to ⌈R/T_r⌉ and ⌈C/T_c⌉ tiles sequentially streamed in the architecture. At each gate, a dot-product unit is responsible for computing the dot product of the current tile of v_1^{i(n)} with the corresponding elements of the input x̃^{(t)}. The dot-product unit is unrolled by a factor of T_c in order to process one tile of v_1^{i(n)} per cycle. After accumulating the partial results of all the ⌈C/T_c⌉ tiles, the result is produced and multiplied with the scalar σ_1^{i(n)}. The multiplication result is passed as a constant operand to a multiplier array, with u_1^{i(n)} as the other operand. The multiplier array has a size of T_r in order to match the tiling of u_1^{i(n)}. As a final stage, an array of T_r accumulators performs the summation across the N_steps iterations as expressed in Eq. (12), to produce the final gate output.

The outputs from the input, forget and output gates are passed through a sigmoid unit, while the output of the cell gate is passed through a tanh unit. After the nonlinearities stage, the produced outputs are multiplied element-by-element as dictated by the LSTM equations to produce the cell state c^{(t)} (Eq. (4)) and the LSTM output h^{(t)} (Eq. (5)). The three multiplier arrays and the one adder array all have a size of T_r to match the tile size of the incoming vectors and exploit the available parallelism.
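The dataflow of a single gate unit can be mimicked in software to clarify the tiling; the code below is a behavioural sketch (not the hardware itself), assuming for simplicity that R and C are multiples of T_r and T_c:

```python
import numpy as np

def gate_unit(factors, x_tilde, T_r, T_c):
    """Behavioural model of one hardware gate unit (Fig. 2).

    factors: per-iteration (sigma1, u1, v1_masked) tuples from Algorithm 1.
    Returns the pre-nonlinearity gate output of Eq. (12).
    """
    R, C = factors[0][1].shape[0], x_tilde.shape[0]
    acc = np.zeros(R)                        # models the T_r-wide accumulator array
    for sigma1, u1, v1 in factors:           # N_steps refinement iterations
        dot = 0.0
        for c in range(0, C, T_c):           # dot-product unit: one T_c-wide tile per cycle
            dot += v1[c:c + T_c] @ x_tilde[c:c + T_c]
        scaled = sigma1 * dot                # multiply with the scalar sigma1
        for r in range(0, R, T_r):           # multiplier array over T_r-wide tiles of u1
            acc[r:r + T_r] += scaled * u1[r:r + T_r]
    return acc
```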

5 Design Space Exploration

Having parametrised the proposed approximation method over NZ and N_steps and its underlying architecture over NZ and the tile sizes (T_r, T_c), corresponding metrics need to be employed for exploring the effects of each parameter on performance and accuracy. The approximation method parameters are studied based on an application-level evaluation metric that measures the impact of each applied approximation on the accuracy of the target application. In terms of the hardware architecture, roofline performance modelling is employed for exhaustively exploring the design space formed by all possible tile size combinations, to obtain the highest performing design point (discussed in Sect. 5.1). Based on those two metrics, the computation time-accuracy trade-off is explored.

5.1 Roofline Model

The design space of architectural configurations for all tile size combinations of T_r and T_c is explored exhaustively by means of performance modelling. The roofline model [ ] is used to develop a performance model for the proposed architecture.
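For reference, the bound that a roofline model evaluates per design point is the minimum of the platform's peak compute and the product of the design's computation-to-communication ratio with the memory bandwidth; the numbers below are placeholders, not the paper's platform figures:

```python
def attainable_gflops(ops_per_byte, peak_gflops, mem_bw_gbs):
    """Classic roofline bound: performance is limited either by the
    platform's peak compute or by off-chip memory traffic."""
    return min(peak_gflops, ops_per_byte * mem_bw_gbs)

# Illustrative sweep over (T_r, T_c) design points -> ops/byte (assumed values)
designs = {(8, 8): 1.2, (16, 16): 2.5, (32, 32): 4.8}
best = max(designs, key=lambda d: attainable_gflops(designs[d], 180.0, 12.8))
```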