Nikolaos Voros · Michael Huebner · Georgios Keramidas · Diana Goehringer · Christos Antonopoulos · Pedro C. Diniz (Eds.)

Applied Reconfigurable Computing. Architectures, Tools, and Applications

14th International Symposium, ARC 2018, Santorini, Greece, May 2–4, 2018, Proceedings

Lecture Notes in Computer Science 10824

Commenced publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

  Editorial Board

  David Hutchison Lancaster University, Lancaster, UK

  Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA

  Josef Kittler University of Surrey, Guildford, UK

  Jon M. Kleinberg Cornell University, Ithaca, NY, USA

  Friedemann Mattern ETH Zurich, Zurich, Switzerland

  John C. Mitchell Stanford University, Stanford, CA, USA

  Moni Naor Weizmann Institute of Science, Rehovot, Israel

  C. Pandu Rangan Indian Institute of Technology Madras, Chennai, India

  Bernhard Steffen TU Dortmund University, Dortmund, Germany

  Demetri Terzopoulos University of California, Los Angeles, CA, USA

  Doug Tygar University of California, Berkeley, CA, USA

Gerhard Weikum Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7407

Editors

Nikolaos Voros, Technological Educational Institute of Western Greece, Antirrio, Greece
Michael Huebner, Ruhr-Universität Bochum, Bochum, Germany
Georgios Keramidas, Technological Educational Institute of Western Greece, Antirrio, Greece
Diana Goehringer, Technische Universität Dresden, Dresden, Germany
Christos Antonopoulos, Technological Educational Institute of Western Greece, Antirrio, Greece
Pedro C. Diniz, INESC-ID, Lisbon, Portugal

ISSN 0302-9743    ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-78889-0    ISBN 978-3-319-78890-6 (eBook)
https://doi.org/10.1007/978-3-319-78890-6
Library of Congress Control Number: 2018937393
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues
© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature.

  

Preface

  Reconfigurable computing platforms offer increased performance gains and energy efficiency through coarse-grained and fine-grained parallelism coupled with their ability to implement custom functional, storage, and interconnect structures. As such, they have been gaining wide acceptance in recent years, spanning the spectrum from highly specialized custom controllers to general-purpose high-end programmable computing systems. The flexibility and configurability of these platforms, coupled with increasing technology integration, have enabled sophisticated platforms that facilitate both static and dynamic reconfiguration, rapid system prototyping, and early design verification. Configurability is emerging as a key technology for substantial product life-cycle savings in the presence of evolving product requirements, standards, and interface specifications.

The growth of the capacity of reconfigurable devices, such as FPGAs, has created a wealth of new research opportunities and intricate engineering challenges. Within the past decade, reconfigurable architectures have evolved from a uniform sea of programmable logic elements to fully reconfigurable systems-on-chip (SoCs) with integrated multipliers, memory elements, processors, and standard I/O interfaces. One of the foremost challenges facing reconfigurable application developers today is how to best exploit these novel and innovative resources to achieve the highest possible performance and energy efficiency; additional challenges include the design and implementation of next-generation architectures, along with languages, compilers, synthesis technologies, and physical design tools to enable highly productive design methodologies.

The International Applied Reconfigurable Computing (ARC) symposium series provides a forum for dissemination and discussion of ongoing research efforts in this transformative research area. The series started in 2005 in Algarve, Portugal. The second edition of the symposium (ARC 2006) took place in Delft, The Netherlands, and was the first edition to have selected papers published as a Springer LNCS (Lecture Notes in Computer Science) volume. Subsequent editions of the symposium have been held in Rio de Janeiro, Brazil (ARC 2007), London, UK (ARC 2008), Karlsruhe, Germany (ARC 2009), Bangkok, Thailand (ARC 2010), Belfast, UK (ARC 2011), Hong Kong, SAR China (ARC 2012), California, USA (ARC 2013), Algarve, Portugal (ARC 2014), Bochum, Germany (ARC 2015), Rio de Janeiro, Brazil (ARC 2016), and Delft, The Netherlands (ARC 2017).

This LNCS volume includes the papers selected for the 14th edition of the symposium (ARC 2018), held in Santorini, Greece, during May 2–4, 2018. The symposium attracted a large number of very good papers, describing interesting work on reconfigurable computing-related subjects. A total of 78 papers were submitted to the symposium from 28 countries. In particular, the authors of the submitted papers are from the following countries: Australia (3), Belgium (5), Bosnia and Herzegovina (4), Brazil (24), China (22), Colombia (1), France (3), Germany (40), Greece (44), India (10), Iran (4), Ireland (4), Italy (5), Japan (22), Malaysia (2), The Netherlands (5), New Zealand (1), Norway (2), Poland (3), Portugal (3), Russia (8), Singapore (7), South Korea (2), Spain (4), Sweden (3), Switzerland (1), UK (18), and USA (11).

Submitted papers were evaluated by at least three members of the Program Committee. The average number of reviews per submission was 3.7. After careful selection, 29 papers were accepted as full papers (an acceptance rate of 37.2%) and 22 as short papers. The accepted papers led to a very interesting symposium program, which we consider to constitute a representative overview of ongoing research efforts in reconfigurable computing, a rapidly evolving and maturing field. In addition, the symposium included a special session dedicated to funded research projects. The purpose of this session was to present the recent accomplishments, preliminary ideas, or work-in-progress of ongoing research projects. Nine EU- and national-funded projects were selected for presentation in this session.

Several people contributed to the success of the 2018 edition of the symposium. We would like to acknowledge the support of all the members of this year's symposium Steering and Program Committees in reviewing papers, in helping with the paper selection, and in giving valuable suggestions. Special thanks also to the additional researchers who contributed to the reviewing process, to all the authors who submitted papers to the symposium, and to all the symposium attendees. In addition, special thanks to Dr. Christos Antonopoulos from the Technological Educational Institute of Western Greece for organizing the research project special session. Last but not least, we are especially indebted to Anna Kramer from Springer for her support and work in publishing this book and to Pedro C. Diniz from INESC-ID, Lisbon, Portugal, for his strong support regarding the publication of the proceedings as part of the LNCS series.

February 2018

Nikolaos Voros
Michael Huebner
Georgios Keramidas
Diana Goehringer

  

Organization

The 2018 Applied Reconfigurable Computing Symposium (ARC 2018) was organized by the Technological Educational Institute of Western Greece, the Ruhr-Universität Bochum, Germany, and the Technische Universität Dresden, Germany. The symposium took place at the Bellonio Conference Center in Fira, the capital of Santorini, Greece.

  General Chairs

Nikolaos Voros, Technological Educational Institute of Western Greece
Michael Huebner, Ruhr-Universität Bochum, Germany

  Program Chairs

Georgios Keramidas, Technological Educational Institute of Western Greece
Diana Goehringer, TU Dresden, Germany

  Publicity Chairs

Luigi Carro, UFRGS, Brazil
Chao Wang, USTC, China
Dimitrios Soudris, NTUA, Greece
Stephan Wong, TU Delft, The Netherlands

  EU Projects Track Chair

Christos Antonopoulos, Technological Educational Institute of Western Greece

  Proceedings Chair

Pedro C. Diniz, INESC-ID, Lisbon, Portugal

  Web Chair

Christos Antonopoulos, Technological Educational Institute of Western Greece

  Steering Committee

Hideharu Amano, Keio University, Japan
Jürgen Becker, Universität Karlsruhe (TH), Germany
Mladen Berekovic, Braunschweig University of Technology, Germany
Koen Bertels, Delft University of Technology, The Netherlands
Katherine (Compton) Morrow, University of Wisconsin-Madison, USA
George Constantinides, Imperial College of Science, UK
Pedro C. Diniz, INESC-ID, Portugal
Philip H. W. Leong, University of Sydney, Australia
Walid Najjar, University of California Riverside, USA
Roger Woods, The Queen's University of Belfast, UK

  Program Committee

Hideharu Amano, Keio University, Japan
Zachary Baker, Los Alamos National Laboratory, USA
Jürgen Becker, Karlsruhe Institute of Technology, Germany
Mladen Berekovic, C3E, TU Braunschweig, Germany
Nikolaos Bellas, University of Thessaly, Greece
Neil Bergmann, University of Queensland, Australia
Alessandro Biondi, Scuola Superiore Sant'Anna, Italy
João Bispo, FEUP/Universidade do Porto, Portugal
Michaela Blott, Xilinx, Ireland
Vanderlei Bonato, University of São Paulo, Brazil
Christos Bouganis, Imperial College, UK
João Cardoso, FEUP/Universidade do Porto, Portugal
Luigi Carro, Instituto de Informática/UFRGS, Brazil
Ray Cheung, City University of Hong Kong, SAR China
Daniel Chillet, AIRN - IRISA/ENSSAT, France
Steven Derrien, Université de Rennes 1, France
Giorgos Dimitrakopoulos, Democritus University of Thrace, Greece
Pedro C. Diniz, INESC-ID, Portugal
António Ferrari, Universidade de Aveiro, Portugal
João Canas Ferreira, INESC TEC/University of Porto, Portugal
Ricardo Ferreira, Universidade Federal de Viçosa, Brazil
Apostolos Fournaris, Technological Educational Institute of Western Greece, Greece
Carlo Galuzzi, TU Delft, The Netherlands
Roberto Giorgi, University of Siena, Italy
Marek Gorgon, AGH University of Science and Technology, Poland
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg, Germany
Jim Harkin, University of Ulster, UK
Christian Hochberger, TU Darmstadt, Germany
Christoforos Kachris, ICCS, Greece
Kimon Karras, Think Silicon S.A., Greece
Fernanda Kastensmidt, Universidade Federal do Rio Grande do Sul - UFRGS, Brazil
Chrysovalantis Kavousianos, University of Ioannina, Greece
Krzysztof Kepa, GE Global Research, USA
Andreas Koch, TU Darmstadt, Germany
Stavros Koubias, University of Patras, Greece
Dimitrios Kritharidis, Intracom Telecom, Greece
Vianney Lapotre, Université de Bretagne-Sud - Lab-STICC, France
Eduardo Marques, University of São Paulo, Brazil
Konstantinos Masselos, University of Peloponnese, Greece
Cathal McCabe, Xilinx, Ireland
Antonio Miele, Politecnico di Milano, Italy
Takefumi Miyoshi, e-trees.Japan, Inc., Japan
Walid Najjar, University of California Riverside, USA
Horácio Neto, INESC-ID/IST/U Lisboa, Portugal
Dimitris Nikolos, University of Patras, Greece
Roman Obermeisser, University of Siegen, Germany
Kyprianos Papadimitriou, Technical University of Crete, Greece
Monica Pereira, Universidade Federal do Rio Grande do Norte, Brazil
Thilo Pionteck, Otto-von-Guericke Universität Magdeburg, Germany
Marco Platzner, University of Paderborn, Germany
Mihalis Psarakis, University of Piraeus, Greece
Kyle Rupnow, Advanced Digital Sciences Center, USA
Marco Domenico Santambrogio, Politecnico di Milano, Italy
Kentaro Sano, Tohoku University, Japan
Yukinori Sato, Tokyo Institute of Technology, Japan
António Beck Filho, Universidade Federal do Rio Grande do Sul, Brazil
Yuichiro Shibata, Nagasaki University, Japan
Cristina Silvano, Politecnico di Milano, Italy
Dimitrios Soudris, NTUA, Greece
Theocharis Theocharides, University of Cyprus, Cyprus
George Theodoridis, University of Patras, Greece
David Thomas, Imperial College, UK
Chao Wang, USTC, China
Markus Weinhardt, Osnabrück University of Applied Sciences, Germany
Theerayod Wiangtong, KMITL, Thailand
Roger Woods, Queens University Belfast, UK
Yoshiki Yamaguchi, University of Tsukuba, Japan

  Additional Reviewers

Dimitris Bakalis, University of Patras, Greece
Guilherme Bileki, University of São Paulo, Brazil
Ahmet Erdem, Politecnico di Milano, Italy
Panagiotis Georgiou, University of Ioannina, Greece
Adele Maleki, University of Siegen, Germany
Farnam Khalili Maybodi, University of Siena, Italy
Marco Procaccini, University of Siena, Italy
Jose Rodriguez, University of California Riverside, USA
Bashar Romanous, University of California Riverside, USA
Leandro Rosa, University of São Paulo, Brazil
Skyler Windh, University of California Riverside, USA
Vasileios Zois, University of California Riverside, USA

  Sponsors

The 2018 Applied Reconfigurable Computing Symposium (ARC 2018) was sponsored by:

  

Contents

Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis
Jiang Su, Julian Faraone, Junyi Liu, Yiren Zhao, David B. Thomas, Philip H. W. Leong, and Peter Y. K. Cheung
Jiang Su, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Gianluca Durelli, David B. Thomas, Philip H. W. Leong, and Peter Y. K. Cheung
Kazusa Musha, Tomohiro Kudoh, and Hideharu Amano
Panagiotis G. Mousouliotis and Loukas P. Petrou
Konstantinos Katsantonis, Christoforos Kachris, and Dimitrios Soudris
Lukas Johannes Jung and Christian Hochberger
Kalindu Herath, Alok Prakash, and Thambipillai Srikanthan
Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S. K. Nandy, and Ranjani Narayan
Junsik Kim and Jaehyun Park
Kazuei Hironaka, Ng. Anh Vu Doan, and Hideharu Amano
Tim Hansmeier, Marco Platzner, and David Andrews
Almabrok Abdoalnasir, Mihalis Psarakis, and Anastasios Dounis
Santhi Natarajan, N. KrishnaKumar, Debnath Pal, and S. K. Nandy
Masahiro Fukuda and Yasushi Inoguchi
Deepayan Bhowmik and Kofi Appiah
Emmanuel Ofori-Attah, Xiaohang Wang, and Michael Opoku Agyeman
Augusto G. Erichsen, Anderson L. Sartor, Jeckson D. Souza, Monica M. Pereira, Stephan Wong, and Antonio C. S. Beck
Fabio Benevenuti and Fernanda Lima Kastensmidt
Milind Parelkar and Darshan Jetly
Christos P. Antonopoulos, Konstantinos Antonopoulos, Christos Panagiotou, and Nikolaos S. Voros
Bruno da Silva, Laurent Segers, An Braeken, Kris Steenhaut, and Abdellah Touhafi
Lampros Pyrgas and Paris Kitsos
Raheel Afsharmazayejani, Fahimeh Yazdanpanah, Amin Rezaei, Mohammad Alaei, and Masoud Daneshtalab
Luca Sterpone and Ludovica Bozzoli
Benedikt Janßen, Florian Kästner, Tim Wingender, and Michael Huebner
Osvaldo Navarro and Michael Huebner
Rafael Fão de Moura, Michael Guilherme Jordan, Antonio Carlos Schneider Beck, and Mateus Beck Rutzig
Jeckson Dellagostin Souza, Anderson L. Sartor, Luigi Carro, Mateus Beck Rutzig, Stephan Wong, and Antonio C. S. Beck
Kamil Piszczek, Piotr Janus, and Tomasz Kryjak
Sikandar Khan, Kyprianos Papadimitriou, Giorgio Buttazzo, and Kostas Kalaitzakis
Jens Rettkowski and Diana Goehringer
Björn Liebig, Julian Oppermann, Oliver Sinnen, and Andreas Koch
Habib ul Hasan Khan, Ahmed Kamal, and Diana Goehringer
Julián Caba, João M. P. Cardoso, Fernando Rincón, Julio Dondo, and Juan Carlos López
Konstantinos Georgopoulos, Pavlos Malakonakis, Nikolaos Tampouratzis, Antonis Nikitakis, Grigorios Chrysos, Apostolos Dollas, Dionysios Pnevmatikatos, and Ioannis Papaefstathiou
Kris Heid, Jakob Wenzel, and Christian Hochberger
Augusto W. Hoppe, Fernanda Lima Kastensmidt, and Jürgen Becker
Pedro H. Exenberger Becker, Anderson L. Sartor, Marcelo Brandalero, Tiago Trevisan Jost, Stephan Wong, Luigi Carro, and Antonio C. Beck
Mário Lopes Ferreira, João Canas Ferreira, and Michael Huebner
Paulo Garcia, Deepayan Bhowmik, Andrew Wallace, Robert Stewart, and Greg Michaelson
Ayan Palchaudhuri and Anindya Sundar Dhar
Umar Ibrahim Minhas, Roger Woods, and George Karakonstantis
Santhi Natarajan, N. KrishnaKumar, H. V. Anuchan, Debnath Pal, and S. K. Nandy
Zhenhua Guo, Baoyu Fan, Yaqian Zhao, Xuelei Li, Shixin Wei, and Long Li
Hoang-Gia Vu, Takashi Nakada, and Yasuhiko Nakashima
Uzaif Sharif and Shahnam Mirzaei
Johannes Pfau, Shalina Percy Delicia Figuli, Steffen Bähr, and Jürgen Becker
Peter Littlewood, Shahnam Mirzaei, and Krishna Murthy Kattiyan Ramamoorthy
Nikolaos Tzanis, Grigorios Proiskos, Michael Birbas, and Alexios Birbas
Gennaro S. Rodrigues, Ádria Barros de Oliveira, Fernanda Lima Kastensmidt, and Alberto Bosio
Florian Fricke, André Werner, Keyvan Shahin, and Michael Huebner
Christoforos Kachris, Ioannis Stamelos, Elias Koromilas, and Dimitrios Soudris
Jürgen Becker and Falco K. Bapp
Panayiotis Alefragis, George Theodoridis, Merkourios Katsimpris, Christos Valouxis, Christos Gogos, George Goulas, Nikolaos Voros, Simon Reder, Koray Kasnakli, Marcus Bednara, David Müller, Umut Durak, and Juergen Becker
Christos Antonopoulos, Georgios Keramidas, Nikolaos S. Voros, Michael Huebner, Fynn Schwiegelshohn, Diana Goehringer, Maria Dagioglou, Georgios Stavrinos, Stasinos Konstantopoulos, and Vangelis Karkaletsis
Pavlos Malakonakis, Konstantinos Georgopoulos, Aggelos Ioannou, Luciano Lavagno, Ioannis Papaefstathiou, and Iakovos Mavroidis
Ahmad Sadek, Ananya Muddukrishna, Lester Kalms, Asbjørn Djupdal, Ariel Podlubne, Antonio Paolillo, Diana Goehringer, and Magnus Jahre

  Machine Learning and Neural Networks

  

Approximate FPGA-Based LSTMs Under Computation Time Constraints

Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis

Department of Electrical and Electronic Engineering, Imperial College London, London, UK
{michail.rizakis14,stylianos.venieris10,a.kouris16,christos-savvas.bouganis}@imperial.ac.uk

Abstract. Recurrent Neural Networks, with the prominence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. Nevertheless, the highest performing LSTM models are becoming increasingly demanding in terms of computational and memory load. At the same time, emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed system required up to 6.5× less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25× higher accuracy under the same computation time constraints.

Keywords: LSTM · Low-rank approximation · Pruning · FPGAs

1 Introduction

Recurrent Neural Networks (RNNs) are a machine learning model which offers the capability of recognising long-range dependencies in sequential and temporal data. RNN models, with the prevalence of Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art performance in various AI applications including scene labelling [ ]. Moreover, LSTMs have been successfully employed for AI tasks in complex environments including human trajectory prediction [ ] on mobile robots, with more recent systems combining language and image processing in tasks such as image captioning [ ].

Despite the high predictive power of LSTMs, their computational and memory demands pose a challenge with respect to deployment in latency-sensitive and power-constrained environments. Modern intelligent systems such as mobile robots and drones that employ LSTMs to perceive their surroundings often operate under time-constrained, latency-critical settings. In such scenarios, retrieving the best possible output from an LSTM given a constraint in computation time may be necessary to ensure the timely operation of the system. Moreover, the requirements of such applications for low absolute power consumption, which would enable a longer battery life, prohibit the deployment of high-performance, but power-hungry platforms, such as multi-core CPUs and GPUs. In this context, FPGAs constitute a promising target device that can combine customisation and reconfigurability to achieve high performance at a low power envelope.

In this work, an approximate computing scheme along with a novel hardware architecture for LSTMs are proposed as an end-to-end framework to address the problem of high-performance LSTM deployment in time-constrained settings. Our approach comprises an iterative approximation method that applies simultaneously low-rank compression and pruning of the LSTM model with a tunable number of refinement iterations. This iterative process enables our framework to (i) exploit the resilience of the target application to approximations, (ii) explore the trade-off between computational and memory load and application-level accuracy and (iii) execute the LSTM under a time constraint with increasing accuracy as a function of the computation time budget. At the hardware level, our system consists of a novel FPGA-based architecture which exploits the inherent parallelism of the LSTM, parametrised with respect to the level of compression and pruning. By optimising the parameters of the approximation method, the proposed framework generates a system tailored to the target application, the available FPGA resources and the computation time constraints. To the best of our knowledge, this is the first work in the literature to address the deployment of LSTMs under computation time constraints.

  2 Background

2.1 LSTM Networks

A vanilla RNN typically processes an input and generates an output at each time step. Internally, the network has recurrent connections from the output at one time step to the hidden units at the next time step, which enables it to capture sequential patterns. The LSTM model differs from vanilla RNNs in that it comprises control units named gates, instead of layers. A typical LSTM has four gates. The input gate (Eq. (1)) is responsible for determining how much of the current input will propagate to the output. The forget gate (Eq. (2)) is responsible for determining whether the previous state of the LSTM will be forgotten or not, while the output gate (Eq. (3)) determines how much of the current state will be allowed to propagate to the final output of the LSTM at the current time step. Computationally, the gates are matrix-vector multiplication blocks, followed by a nonlinear elementwise activation function. The equations for the LSTM model are shown below:

i^{(t)} = σ(W_{ix} x^{(t)} + W_{ih} h^{(t−1)})    (1)
f^{(t)} = σ(W_{fx} x^{(t)} + W_{fh} h^{(t−1)})    (2)
o^{(t)} = σ(W_{ox} x^{(t)} + W_{oh} h^{(t−1)})    (3)
c^{(t)} = f^{(t)} ⊙ c^{(t−1)} + i^{(t)} ⊙ tanh(W_{cx} x^{(t)} + W_{ch} h^{(t−1)})    (4)
h^{(t)} = c^{(t)} ⊙ o^{(t)}    (5)

i^{(t)}, f^{(t)} and o^{(t)} are the input, forget and output gates respectively, c^{(t)} is the current state of the LSTM, h^{(t−1)} is the previous output, x^{(t)} is the current input at time t and σ(·) represents the sigmoid function. Eq. (5) is frequently found in the literature as h^{(t)} = c^{(t)} ⊙ tanh(o^{(t)}), with tanh(·) applied to the output gate. In this work, we follow the image captioning LSTM proposed in [ ], which removes the tanh(·) from the output gate and therefore we end up with Eq. (5). Finally, all the W matrices denote the weight matrices that contain the trainable parameters of the model, which are assumed to be provided.
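To make the gate computations concrete, the following minimal NumPy sketch implements Eqs. (1)–(5) for a single time step; the function and the weight-dictionary layout are illustrative choices of this text, not part of the original paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time step following Eqs. (1)-(5).

    W is a dict of weight matrices: W['ix'] and W['ih'] feed the input
    gate, W['fx']/W['fh'] the forget gate, and so on (hypothetical layout).
    """
    i = sigmoid(W['ix'] @ x_t + W['ih'] @ h_prev)   # Eq. (1): input gate
    f = sigmoid(W['fx'] @ x_t + W['fh'] @ h_prev)   # Eq. (2): forget gate
    o = sigmoid(W['ox'] @ x_t + W['oh'] @ h_prev)   # Eq. (3): output gate
    # Eq. (4): new cell state from the forget and input paths
    c = f * c_prev + i * np.tanh(W['cx'] @ x_t + W['ch'] @ h_prev)
    h = c * o                                       # Eq. (5): no tanh on the output gate
    return h, c
```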

3 Related Work

The effectiveness of RNNs has attracted the attention of the architecture and reconfigurable computing communities. Li et al. [ ] proposed an FPGA-based accelerator for the training of an RNN language model. In [ ], the authors focus on the optimised deployment of the Gated Recurrent Unit (GRU) model [ ] in data centres with server-grade FPGAs, ASICs, GPUs and CPUs and propose an algorithmic memoisation-based method to reduce the computational load at the expense of increased memory footprint. The authors of [ ] present an empirical study of the effect of different architectural designs on the computational resources, on-chip memory capacity and off-chip memory bandwidth requirements of an LSTM model. Finally, Guan et al. [ ] proposed an FPGA-based LSTM accelerator optimised for speech recognition on a Xilinx VC707 FPGA platform.

From an algorithmic perspective, recent works have followed a model-hardware co-design approach. Han et al. [ ] proposed an FPGA-based speech recognition engine that employs a load-balance-aware compression scheme in order to compress the LSTM model size. Wang et al. [ ] presented a method that addresses compression at several levels including the use of circulant matrices for three of the LSTM gates and the quantisation of the trained parameters, together with the corresponding ASIC-based hardware architecture. Zhang et al. [ ] presented an FPGA-based accelerator for a Long-Term Recurrent Convolutional Network (LRCN) for video footage description that consists of a CNN followed by an LSTM. Their design focuses on balancing the resource allocation between the layers of the LRCN and pruning the fully-connected and LSTM layers. These approaches deviate from the faithful mapping of the original model and employ retraining to compensate for the introduced error of each proposed method. Finally, He and Sun [ ] focused on CNNs and investigated algorithmic strategies for model selection under computation time constraints for both training and testing.

Our work differs from the majority of existing efforts by proposing a hardware architecture together with an approximate computing method for LSTMs that is application-aware and tunable with respect to the required computation time. Our scheme is similar to [ ] in proposing an approximation to the model, but in contrast to these methods does not require a retraining phase and assumes no access to the full training set. Instead, with a limited subset of labelled data, our scheme compensates for the induced error by means of iterative refinement, making it suitable for applications where the dataset is privacy-critical and where the quality of the approximation improves as the time availability increases.

4 Methodology

In this section, the main components of the proposed framework are presented (Fig. 1). Given an LSTM model with its set of weight matrices and a small application evaluation set, the proposed system searches for an appropriate approximation scheme that meets the application's needs, by applying low-rank compression and pruning on the model. The design space is traversed by means of a roofline model to determine the highest performing configuration of the proposed architecture on the target FPGA. In this manner, the trade-off between computation time and application-level error is explored for different approximation schemes. The design point to be implemented on the device is selected based on user-specified requirements for the maximum computation time or application-level error tolerance.

Fig. 1. Design flow of the proposed framework

4.1 Approximations for LSTMs

At the core of an LSTM's computational workload lie the matrix-vector multiplications in each of the four gates. Neural networks have been extensively studied and shown to have redundancy in terms of their trained parameters [ ]. To reduce the computational demands of the LSTM, we propose an approximate computing scheme that enables the tuning between computational cost and application-level accuracy. The proposed approach exploits the statistical redundancy of the LSTM by acting at two levels: (i) approximating the weight matrices with a low-rank, SVD-based decomposition and (ii) pruning the network by sparsifying the weight matrices based on an importance criterion of their elements.

Low-rank approximation. Based on the set of LSTM Eqs. (1)–(5), each gate consists of two weight matrices corresponding to the current input and previous output vectors respectively. In our scheme, we construct an augmented matrix by concatenating the input and output weight matrices, as shown in Eq. (7). Similarly, we concatenate the input and previous output vectors (Eq. (6)), and thus the overall gate computation is given by Eq. (8):

x̃^{(t)} = [x^{(t)T} h^{(t−1)T}]^T    (6)
W_i = [W_{ix} W_{ih}], ∀i ∈ [1, 4]    (7)
y_i = nonlin(W_i x̃^{(t)}), ∀i ∈ [1, 4]    (8)

where nonlin(·) is either the sigmoid function σ(·) or tanh(·). In this way, a single weight matrix is formed for each gate, denoted by W_i ∈ R^{R×C} for the i-th gate. We perform a full SVD decomposition on the four augmented matrices independently as W_i = U_i Σ_i V_i^T, ∀i ∈ [1, 4], where U_i ∈ R^{R×R}, Σ_i ∈ R^{R×C} and V_i ∈ R^{C×C}, and employ a rank-1 approximation W_i ≈ σ_1^i u_1^i v_1^{iT} by keeping the singular vectors that correspond to the largest singular value.
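As an illustration, the rank-1 step can be sketched with NumPy's SVD; the matrix sizes below are arbitrary placeholders, not dimensions from the paper:

```python
import numpy as np

def rank1_factors(W):
    """Leading singular triplet of W, so that sigma1 * outer(u1, v1) is the
    rank-1 approximation used in Sect. 4.1."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return S[0], U[:, 0], Vt[0, :]

# Augmented gate matrix per Eq. (7): concatenate input and recurrent weights
W_ix, W_ih = np.random.randn(64, 32), np.random.randn(64, 64)  # illustrative sizes
W_aug = np.hstack([W_ix, W_ih])
sigma1, u1, v1 = rank1_factors(W_aug)
W_rank1 = sigma1 * np.outer(u1, v1)
```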

Pruning by means of network sparsification. The second level of approximation on the LSTM comprises the structured pruning of the connectivity between neurons. With each neural connection being captured as an element of the weight matrices, we express network pruning as sparsification applied on the augmented weight matrices (Eq. (7)). To represent a sparse LSTM, we introduce four binary mask matrices F_i ∈ {0, 1}^{R×C}, ∀i ∈ [1, 4], with each entry representing whether a connection is pruned or not. Overall, we employ the following notation for a (weight, mask) matrix pair: {W_i, F_i | i ∈ [1, 4]}.

In the proposed scheme, we explore sparsity with respect to the connections per output neuron and constrain each output to have the same number of inputs. We cast LSTM pruning as an optimisation problem of the following form:

min ||W_i − F_i ⊙ W_i||_F^2,  s.t. ||f_j^i||_0 = NZ, ∀i ∈ [1, 4], ∀j ∈ [1, R]    (9)

where f_j^i is the j-th row of F_i and NZ is the number of non-zero elements on each row, with ||·||_0 denoting the number of non-zero entries in a vector. The solution to the optimisation problem in Eq. (9) is given by keeping the NZ elements on each row of W_i with the highest absolute value and setting their indices to 1 in F_i.
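The row-wise solution of Eq. (9) amounts to a top-NZ magnitude selection per row, sketched below (function name and structure are ours):

```python
import numpy as np

def prune_rows(W, NZ):
    """Solve Eq. (9): keep the NZ largest-magnitude entries in each row of W.

    Returns the binary mask F and the pruned matrix F ⊙ W.
    """
    F = np.zeros_like(W)
    top = np.argsort(-np.abs(W), axis=1)[:, :NZ]   # indices of NZ largest |W| per row
    F[np.arange(W.shape[0])[:, None], top] = 1.0
    return F, F * W
```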

In contrast to the existing approaches, the proposed pruning method does not employ retraining and hence removes this computationally expensive step and the requirement for the training set, which is important for privacy-critical applications. Even though our sparsification method does not explicitly capture the impact of pruning on the application-level accuracy, our design space exploration, detailed in Sect. 5, searches over different levels of sparsity and as a result explores the effect of pruning on the application.

Hybrid compression and pruning. By applying both low-rank approximation and pruning, we end up with the following weight matrix approximation:

W_i = F_i ⊙ (σ_1^i u_1^i v_1^{iT})    (10)

In this setting, for the i-th gate the ranking of the absolute values in each row of the rank-1 approximation σ_1^i u_1^i v_1^{iT} depends only on v_1^i, with each element of σ_1^i u_1^i operating as a shared scaling factor for all elements of a row. Therefore, for the i-th gate all the rows of F_i become identical and hence can be represented by a single mask vector f^i ∈ {0, 1}^C. This leads to a weight matrix with zeros along (C − NZ) of its columns, which is described by the following expression:

W_i = σ_1^i u_1^i (f^i ⊙ v_1^i)^T    (11)

Applying N_steps refinement iterations, the overall gate computation becomes:

y_i = Σ_{n=1}^{N_steps} σ_1^{i(n)} u_1^{i(n)} (f^{i(n)} ⊙ v_1^{i(n)})^T x̃^{(t)}    (12)

In order to obtain a refinement mechanism, we propose an iterative algorithm, presented in Algorithm 1, that employs both the low-rank approximation and pruning methods to progressively update the weight matrix. On lines 4–6, the first approximation of the weight matrix is constructed by obtaining the rank-1 approximation of the original matrix and applying pruning in order to have NZ non-zero elements on each row, as in Eq. (11). Next, the weight matrix is refined for N_steps iterations, by computing the error matrix E (line 10) and employing its pruned rank-1 approximation as an update (line 15).

Different combinations of levels of sparsity and refinement iterations correspond to different design points in the computation-accuracy space. In this respect, the number of non-zero elements in each binary mask vector and the number of refinement iterations are exposed to the design space exploration as tunable parameters (NZ, N_steps) to explore the LSTM computation-accuracy trade-off.

4.2 Architecture

The proposed FPGA architecture for LSTMs is illustrated in Fig. 2. The main strategy of the architecture includes the exploitation of the coarse-grained parallelism across the four gates, together with the fine-grained parallelism in the dot-product and elementwise operations of the LSTM, allowing for a compile-time tunable performance-resource trade-off.

Algorithm 1. Iterative LSTM Model Approximation
Inputs:
  1: Weight matrices W_i ∈ R^{R×C}, ∀i ∈ [1, 4]
  2: Number of non-zero elements, NZ
  3: Number of refinement iterations, N_steps
Steps:
  1: -- For all gates --
  2: for i = 1 to 4 do
  3:   -- Initialise weight matrix approximation --
  4:   u_1^{i(0)}, σ_1^{i(0)}, v_1^{i(0)} = SVD_1(W_i)
  5:   f^{i(0)} ← solution to Eq. (9) for vector v_1^{i(0)}
  6:   W_i^{(0)} = σ_1^{i(0)} u_1^{i(0)} (f^{i(0)} ⊙ v_1^{i(0)})^T
  7:   -- Apply refinements --
  8:   for n = 1 to N_steps do
  9:     -- Compute error matrix --
 10:     E = W_i − W_i^{(n−1)}
 11:     -- Compute refinement --
 12:     u_1^{i(n)}, σ_1^{i(n)}, v_1^{i(n)} = SVD_1(E)
 13:     f^{i(n)} ← solution to optimisation problem (9) for vector v_1^{i(n)}
 14:     -- Update weight matrix approximation --
 15:     W_i^{(n)} = W_i^{(n−1)} + σ_1^{i(n)} u_1^{i(n)} (f^{i(n)} ⊙ v_1^{i(n)})^T
 16:   end for
 17: end for
Notes: SVD_1(X) returns the rank-1 SVD-based approximation of X.

SVD and Binary Masks Precomputation. In Algorithm 1, the number of refinement iterations (N_steps), the level of sparsity (NZ) and the trained weight matrices are data-independent and known at compile time. As such, the required SVD decompositions along with the corresponding binary masks are precomputed for all N_steps iterations at compile time. As a result, the singular values σ_1^{i(n)}, the vectors u_1^{i(n)} and only the non-zero elements of the sparse f^{i(n)} ⊙ v_1^{i(n)} are stored in the off-chip memory, so that they can be looked up at run time.
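A direct software rendering of Algorithm 1 for one gate is sketched below; it also produces the per-iteration factors that, per the paragraph above, would be precomputed and stored off-chip (structure and names are our own, not the paper's):

```python
import numpy as np

def approximate_gate_weights(W, NZ, N_steps):
    """Iterative LSTM model approximation (Algorithm 1) for a single gate.

    Returns the per-iteration (sigma1, u1, masked v1) factors and the
    resulting approximation of the augmented weight matrix W.
    """
    def pruned_rank1(X):
        # Rank-1 SVD of X with the mask of Eq. (9) applied to v1
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        sigma1, u1, v1 = S[0], U[:, 0], Vt[0, :]
        f = np.zeros_like(v1)
        f[np.argsort(-np.abs(v1))[:NZ]] = 1.0      # keep NZ largest |v1| entries
        return sigma1, u1, f * v1

    # Lines 4-6: initialise the approximation
    sigma1, u1, v1m = pruned_rank1(W)
    W_approx = sigma1 * np.outer(u1, v1m)
    factors = [(sigma1, u1, v1m)]
    # Lines 8-15: refine against the error matrix
    for _ in range(N_steps):
        E = W - W_approx                            # line 10: error matrix
        sigma1, u1, v1m = pruned_rank1(E)
        W_approx += sigma1 * np.outer(u1, v1m)      # line 15: update
        factors.append((sigma1, u1, v1m))
    return factors, W_approx
```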

Inter-gate and Intra-gate Parallelism. In the proposed architecture, each gate is allocated a dedicated hardware gate unit, with all gates operating in parallel. At each LSTM time-step t, a hardware gate unit computes its output by performing N_steps refinement iterations as in Eq. (12). At the beginning of the time-step, the current vector x̃^{(t)} is stored on-chip as it will be reused in each iteration by all four gates. The vectors u_1^{i(n)} and v_1^{i(n)} for each gate, along with their singular values σ_1^{i(n)}, are streamed into the architecture from the off-chip memory in a tiled manner.

Fig. 2. Diagram of proposed hardware architecture

u_1^{i(n)} and v_1^{i(n)} are tiled with tile sizes of T_r and T_c respectively, leading to ⌈R/T_r⌉ and ⌈C/T_c⌉ tiles sequentially streamed in the architecture. At each gate, a dot-product unit is responsible for computing the dot product of the current tile of v_1^{i(n)} with the corresponding elements of the input x̃^{(t)}. The dot-product unit is unrolled by a factor of T_c in order to process one tile of v_1^{i(n)} per cycle. After accumulating the partial results of all the ⌈C/T_c⌉ tiles, the result is produced and multiplied with the scalar σ_1^{i(n)}. The multiplication result is passed as a constant operand to a multiplier array, with u_1^{i(n)} as the other operand. The multiplier array has a size of T_r in order to match the tiling of u_1^{i(n)}. As a final stage, an array of T_r accumulators performs the summation across the N_steps iterations as expressed in Eq. (12), to produce the final gate output.

The outputs from the input, forget and output gates are passed through a sigmoid unit, while the output of the cell gate is passed through a tanh unit. After the nonlinearities stage, the produced outputs are multiplied element-by-element as dictated by the LSTM equations to produce the cell state c^{(t)} (Eq. (4)) and the LSTM output h^{(t)} (Eq. (5)). The three multiplier arrays and the one adder array all have a size of T_r to match the tile size of the incoming vectors and exploit the available parallelism.
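The dataflow of a single gate unit can be mimicked in software to clarify the tiling; the code below is a behavioural sketch (not the hardware itself), assuming for simplicity that R and C are multiples of T_r and T_c:

```python
import numpy as np

def gate_unit(factors, x_tilde, T_r, T_c):
    """Behavioural model of one hardware gate unit (Fig. 2).

    factors: per-iteration (sigma1, u1, v1_masked) tuples from Algorithm 1.
    Returns the pre-nonlinearity gate output of Eq. (12).
    """
    R, C = factors[0][1].shape[0], x_tilde.shape[0]
    acc = np.zeros(R)                        # models the T_r-wide accumulator array
    for sigma1, u1, v1 in factors:           # N_steps refinement iterations
        dot = 0.0
        for c in range(0, C, T_c):           # dot-product unit: one T_c-wide tile per cycle
            dot += v1[c:c + T_c] @ x_tilde[c:c + T_c]
        scaled = sigma1 * dot                # multiply with the scalar sigma1
        for r in range(0, R, T_r):           # multiplier array over T_r-wide tiles of u1
            acc[r:r + T_r] += scaled * u1[r:r + T_r]
    return acc
```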

5 Design Space Exploration

Having parametrised the proposed approximation method over NZ and N_steps and its underlying architecture over NZ and the tile sizes (T_r, T_c), corresponding metrics need to be employed for exploring the effects of each parameter on performance and accuracy. The approximation method parameters are studied based on an application-level evaluation metric that measures the impact of each applied approximation on the accuracy of the target application. In terms of the hardware architecture, roofline performance modelling is employed for exhaustively exploring the design space formed by all possible tile size combinations, to obtain the highest performing design point (discussed in Sect. 5.1). Based on those two metrics, the computation time-accuracy trade-off is explored.

5.1 Roofline Model

The design space of architectural configurations for all tile size combinations of T_r and T_c is explored exhaustively by means of performance modelling. The roofline model [ ] is used to develop a performance model for the proposed architecture.
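For reference, the bound that a roofline model evaluates per design point is the minimum of the platform's peak compute and the product of the design's computation-to-communication ratio with the memory bandwidth; the numbers below are placeholders, not the paper's platform figures:

```python
def attainable_gflops(ops_per_byte, peak_gflops, mem_bw_gbs):
    """Classic roofline bound: performance is limited either by the
    platform's peak compute or by off-chip memory traffic."""
    return min(peak_gflops, ops_per_byte * mem_bw_gbs)

# Illustrative sweep over (T_r, T_c) design points -> ops/byte (assumed values)
designs = {(8, 8): 1.2, (16, 16): 2.5, (32, 32): 4.8}
best = max(designs, key=lambda d: attainable_gflops(designs[d], 180.0, 12.8))
```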