
  Lei Chen · Christian S. Jensen · Cyrus Shahabi · Xiaochun Yang · Xiang Lian (Eds.)

  LNCS 10366
  Web and Big Data
  First International Joint Conference, APWeb-WAIM 2017
  Beijing, China, July 7–9, 2017
  Proceedings, Part I

  

Lecture Notes in Computer Science 10366

  Commenced Publication in 1973
  Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

  Editorial Board

  David Hutchison Lancaster University, Lancaster, UK

  Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA

  Josef Kittler University of Surrey, Guildford, UK

  Jon M. Kleinberg Cornell University, Ithaca, NY, USA

  Friedemann Mattern ETH Zurich, Zurich, Switzerland

  John C. Mitchell Stanford University, Stanford, CA, USA

  Moni Naor Weizmann Institute of Science, Rehovot, Israel

  C. Pandu Rangan Indian Institute of Technology, Madras, India

  Bernhard Steffen TU Dortmund University, Dortmund, Germany

  Demetri Terzopoulos University of California, Los Angeles, CA, USA

  Doug Tygar University of California, Berkeley, CA, USA

  Gerhard Weikum Max Planck Institute for Informatics, Saarbrücken, Germany

  Lei Chen · Christian S. Jensen · Cyrus Shahabi · Xiaochun Yang · Xiang Lian (Eds.)

  Web and Big Data

  First International Joint Conference, APWeb-WAIM 2017
  Beijing, China, July 7–9, 2017
  Proceedings, Part I

  Editors

  Lei Chen, Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
  Christian S. Jensen, Computer Science, Aarhus University, Aarhus N, Denmark
  Cyrus Shahabi, Computer Science, University of Southern California, Los Angeles, CA, USA
  Xiaochun Yang, Northeastern University, Shenyang, China
  Xiang Lian, Kent State University, Kent, OH, USA

ISSN 0302-9743

  ISSN 1611-3349 (electronic) Lecture Notes in Computer Science

ISBN 978-3-319-63578-1

  ISBN 978-3-319-63579-8 (eBook)
  DOI 10.1007/978-3-319-63579-8

  Library of Congress Control Number: 2017947034

  LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

  © Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the

material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,

broadcasting, reproduction on microfilms or in any other physical way, and transmission or information

storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now

known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication

does not imply, even in the absence of a specific statement, that such names are exempt from the relevant

protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are

believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors

give a warranty, express or implied, with respect to the material contained herein or for any errors or

omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in

published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG

  

Preface

  This volume (LNCS 10366) and its companion volume (LNCS 10367) contain the proceedings of the first Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, called APWeb-WAIM. This new joint conference aims to attract participants from different scientific communities as well as from industry, and not merely from the Asia-Pacific region, but also from other continents. The objective is to enable the sharing and exchange of ideas, experiences, and results in the areas of World Wide Web and big data, thus covering Web technologies, database systems, information management, software engineering, and big data. The first APWeb-WAIM conference was held in Beijing during July 7–9, 2017.

  As a new Asia-Pacific flagship conference focusing on research, development, and applications in relation to Web information management, APWeb-WAIM builds on the successes of APWeb and WAIM: APWeb was previously held in Beijing (1998), Hong Kong (1999), Xi’an (2000), Changsha (2001), Xi’an (2003), Hangzhou (2004), Shanghai (2005), Harbin (2006), Huangshan (2007), Shenyang (2008), Suzhou (2009), Busan (2010), Beijing (2011), Kunming (2012), Sydney (2013), Changsha (2014), Guangzhou (2015), and Suzhou (2016); and WAIM was held in Shanghai (2000), Xi’an (2001), Beijing (2002), Chengdu (2003), Dalian (2004), Hangzhou (2005), Hong Kong (2006), Huangshan (2007), Zhangjiajie (2008), Suzhou (2009), Jiuzhaigou (2010), Wuhan (2011), Harbin (2012), Beidaihe (2013), Macau (2014), Qingdao (2015), and Nanchang (2016). With the fast development of Web-related technologies, we expect that APWeb-WAIM will become an increasingly popular forum that brings together outstanding researchers and developers in the field of Web and big data from around the world.

  The high-quality program documented in these proceedings would not have been possible without the authors who chose APWeb-WAIM for disseminating their findings. Out of 240 submissions to the research track and 19 to the demonstration track, the conference accepted 44 regular research papers (18%), 32 short research papers, and ten demonstrations. The contributed papers address a wide range of topics, such as spatial data processing and data quality, graph data processing, data mining, privacy and semantic analysis, text and log data management, social networks, data streams, query processing and optimization, topic modeling, machine learning, recommender systems, and distributed data processing.

  The technical program also included keynotes by Profs. Sihem Amer-Yahia (National Center for Scientific Research, CNRS, France), Masaru Kitsuregawa (National Institute of Informatics, NII, Japan), and Mohamed Mokbel (University of Minnesota, Twin Cities, USA) as well as tutorials by Prof. Reynold Cheng (The University of Hong Kong, SAR China), Prof. Guoliang Li (Tsinghua University, China), Prof. Arijit Khan (Nanyang Technological University, Singapore), and Prof. Yu Zheng (Microsoft Research Asia, China). We are grateful to these distinguished scientists for their invaluable contributions to the conference program.

  Because this is a new joint conference, teamwork is particularly important for the success of APWeb-WAIM. We are deeply thankful to the Program Committee members and the external reviewers for lending their time and expertise to the conference. Special thanks go to the local Organizing Committee led by Jun He, Yongxin Tong, and Shimin Chen. Thanks also go to the workshop co-chairs (Matthias Renz, Shaoxu Song, and Yang-Sae Moon), demo co-chairs (Sebastian Link, Shuo Shang, and Yoshiharu Ishikawa), industry co-chairs (Chen Wang and Weining Qian), tutorial co-chairs (Andreas Züfle and Muhammad Aamir Cheema), sponsorship chair (Junjie Yao), proceedings co-chairs (Xiang Lian and Xiaochun Yang), and publicity co-chairs (Hongzhi Yin, Lei Zou, and Ce Zhang). Their efforts were essential to the success of the conference. Last but not least, we wish to express our gratitude to the Webmaster (Zhao Cao) for all the hard work and to our sponsors, who generously supported the smooth running of the conference.

  We hope you enjoy the exciting program of APWeb-WAIM 2017 as documented in these proceedings.

  June 2017

  Xiaoyong Du
  Beng Chin Ooi
  M. Tamer Özsu
  Bin Cui
  Lei Chen
  Christian S. Jensen
  Cyrus Shahabi

  

Organization

Organizing Committee

  General Co-chairs
    Xiaoyong Du, Renmin University of China, China
    Beng Chin Ooi, National University of Singapore, Singapore
    M. Tamer Özsu, University of Waterloo, Canada

  Program Co-chairs
    Lei Chen, Hong Kong University of Science and Technology, China
    Christian S. Jensen, Aalborg University, Denmark
    Cyrus Shahabi, The University of Southern California, USA

  Workshop Co-chairs
    Matthias Renz, George Mason University, USA
    Shaoxu Song, Tsinghua University, China
    Yang-Sae Moon, Kangwon National University, South Korea

  Demo Co-chairs
    Sebastian Link, The University of Auckland, New Zealand
    Shuo Shang, King Abdullah University of Science and Technology, Saudi Arabia
    Yoshiharu Ishikawa, Nagoya University, Japan

  Industrial Co-chairs
    Chen Wang, Innovation Center for Beijing Industrial Big Data, China
    Weining Qian, East China Normal University, China

  Proceedings Co-chairs
    Xiang Lian, Kent State University, USA
    Xiaochun Yang, Northeast University, China

  Tutorial Co-chairs
    Andreas Züfle, George Mason University, USA
    Muhammad Aamir Cheema, Monash University, Australia

  ACM SIGMOD China Lectures Co-chairs
    Guoliang Li, Tsinghua University, China
    Hongzhi Wang, Harbin Institute of Technology, China

  Publicity Co-chairs
    Hongzhi Yin, The University of Queensland, Australia
    Lei Zou, Peking University, China
    Ce Zhang, Eidgenössische Technische Hochschule ETH, Switzerland

  Local Organization Co-chairs
    Jun He, Renmin University of China, China
    Yongxin Tong, Beihang University, China
    Shimin Chen, Chinese Academy of Sciences, China

  Sponsorship Chair
    Junjie Yao, East China Normal University, China

  Web Chair
    Zhao Cao, Beijing Institute of Technology, China

  Steering Committee Liaison
    Yanchun Zhang, Victoria University, Australia

Senior Program Committee

  Dieter Pfoser, George Mason University, USA
  Ilaria Bartolini, University of Bologna, Italy
  Jianliang Xu, Hong Kong Baptist University, SAR China
  Mario Nascimento, University of Alberta, Canada
  Matthias Renz, George Mason University, USA
  Mohamed Mokbel, University of Minnesota, USA
  Ralf Hartmut Güting, Fernuniversität in Hagen, Germany
  Seungwon Hwang, Yonsei University, South Korea
  Sourav S. Bhowmick, Nanyang Technological University, Singapore
  Tingjian Ge, University of Massachusetts Lowell, USA
  Vincent Oria, New Jersey Institute of Technology, USA
  Walid Aref, Purdue University, USA
  Wook-Shin Han, Pohang University of Science and Technology, Korea
  Yoshiharu Ishikawa, Nagoya University, Japan

Program Committee

  Alex Delis, University of Athens, Greece
  Alex Thomo, University of Victoria, Canada

  Aviv Segev, Korea Advanced Institute of Science and Technology, South Korea
  Baoning Niu, Taiyuan University of Technology, China
  Bin Cui, Peking University, China
  Bin Yang, Aalborg University, Denmark
  Carson Leung, University of Manitoba, Canada
  Chih-Hua Tai, National Taipei University, China
  Cuiping Li, Renmin University of China, China
  Daniele Riboni, University of Cagliari, Italy
  Defu Lian, University of Electronic Science and Technology of China, China
  Dejing Dou, University of Oregon, USA
  Demetris Zeinalipour, Max Planck Institute for Informatics, Germany and University of Cyprus, Cyprus
  Dhaval Patel, Indian Institute of Technology Roorkee, India
  Dimitris Sacharidis, Technische Universität Wien, Vienna, Austria
  Fei Chiang, McMaster University, Canada
  Ganzhao Yuan, South China University of Technology, China
  Giovanna Guerrini, Universita di Genova, Italy
  Guoliang Li, Tsinghua University, China
  Guoqiong Liao, Jiangxi University of Finance and Economics, China
  Hailong Sun, Beihang University, China
  Han Su, University of Southern California, USA
  Hiroaki Ohshima, Kyoto University, Japan
  Hong Chen, Renmin University of China, China
  Hongyan Liu, Tsinghua University, China
  Hongzhi Wang, Harbin Institute of Technology, China
  Hongzhi Yin, The University of Queensland, Australia
  Hua Li, Aalborg University, Denmark
  Hua Lu, Aalborg University, Denmark
  Hua Wang, Victoria University, Melbourne, Australia
  Hua Yuan, University of Electronic Science and Technology of China, China
  Iulian Sandu Popa, Inria and PRiSM Lab, University of Versailles Saint-Quentin, France
  James Cheng, Chinese University of Hong Kong, SAR China
  Jeffrey Xu Yu, Chinese University of Hong Kong, SAR China
  Jiaheng Lu, University of Helsinki, Finland
  Jiajun Liu, Renmin University of China, China
  Jialong Han, Nanyang Technological University, Singapore
  Jian Yin, Zhongshan University, China
  Jianliang Xu, Hong Kong Baptist University, SAR China
  Jianmin Wang, Tsinghua University, China
  Jiannan Wang, Simon Fraser University, Canada
  Jianting Zhang, City College of New York, USA

  Jinchuan Chen, Renmin University of China, China
  Ju Fan, National University of Singapore, Singapore
  Jun Gao, Peking University, China
  Junfeng Zhou, Yanshan University, China
  Junhu Wang, Griffith University, Australia
  Kai Zeng, University of California, Berkeley, USA
  Karine Zeitouni, PRISM University of Versailles St-Quentin, Paris, France
  Kyuseok Shim, Seoul National University, Korea
  Lei Zou, Peking University, China
  Lei Chen, Hong Kong University of Science and Technology, SAR China
  Leong Hou U., University of Macau, SAR China
  Liang Hong, Wuhan University, China
  Lianghuai Yang, Zhejiang University of Technology, China
  Long Guo, Peking University, China
  Man Lung Yiu, Hong Kong Polytechnical University, SAR China
  Markus Endres, University of Augsburg, Germany
  Maria Damiani, University of Milano, Italy
  Meihui Zhang, Singapore University of Technology and Design, Singapore
  Mihai Lupu, Vienna University of Technology, Austria
  Mirco Nanni, ISTI-CNR Pisa, Italy
  Mizuho Iwaihara, Waseda University, Japan
  Mohammed Eunus Ali, Bangladesh University of Engineering and Technology, Bangladesh
  Peer Kroger, Ludwig-Maximilians-University of Munich, Germany
  Peiquan Jin, University of Science and Technology of China
  Peng Wang, Fudan University, China
  Yaokai Feng, Kyushu University, Japan
  Wookey Lee, Inha University, Korea
  Raymond Chi-Wing Wong, Hong Kong University of Science and Technology, SAR China
  Richong Zhang, Beihang University, China
  Sanghyun Park, Yonsei University, Korea
  Sangkeun Lee, Oak Ridge National Laboratory, USA
  Sanjay Madria, Missouri University of Science and Technology, USA
  Shengli Wu, Jiangsu University, China
  Shi Gao, University of California, Los Angeles, USA
  Shimin Chen, Chinese Academy of Sciences, China
  Shuai Ma, Beihang University, China
  Shuo Shang, King Abdullah University of Science and Technology, Saudi Arabia
  Sourav S Bhowmick, Nanyang Technological University, Singapore
  Stavros Papadopoulos, Intel Labs and MIT, USA
  Takahiro Hara, Osaka University, Japan

  Tieyun Qian, Wuhan University, China
  Ting Deng, Beihang University, China
  Tru Cao, Ho Chi Minh City University of Technology, Vietnam
  Vicent Zheng, Advanced Digital Sciences Center, Singapore
  Vinay Setty, Aalborg University, Denmark
  Wee Ng, Institute for Infocomm Research, Singapore
  Wei Wang, University of New South Wales, Australia
  Weining Qian, East China Normal University, China
  Weiwei Sun, Fudan University, China
  Wei-Shinn Ku, Auburn University, USA
  Wenjia Li, New York Institute of Technology, USA
  Wen Zhang, Wuhan University, China
  Wolf-Tilo Balke, Braunschweig University of Technology, Germany
  Xiang Lian, Kent State University, USA
  Xiang Zhao, National University of Defence Technology, China
  Xiangliang Zhang, King Abdullah University of Science and Technology, Saudi Arabia
  Xiangmin Zhou, RMIT University, Australia
  Xiaochun Yang, Northeast University, China
  Xiaofeng He, East China Normal University, China
  Xiaoyong Du, Renmin University of China, China
  Xike Xie, University of Science and Technology of China, China
  Xingquan Zhu, Florida Atlantic University, USA
  Xuan Zhou, Renmin University of China, China
  Yanghua Xiao, Fudan University, China
  Yang-Sae Moon, Kangwon National University, South Korea
  Yasuhiko Morimoto, Hiroshima University, Japan
  Yijie Wang, National University of Defense Technology, China
  Yingxia Shao, Peking University, China
  Yong Zhang, Tsinghua University, China
  Yongxin Tong, Beihang University, China
  Yoshiharu Ishikawa, Nagoya University, Japan
  Yu Gu, Northeast University, China
  Yuan Fang, Institute for Infocomm Research, Singapore
  Yueguo Chen, Renmin University of China, China
  Yunjun Gao, Zhejiang University, China
  Zakaria Maamar, Zayed University, United Arab Emirates
  Zhaonian Zou, Harbin Institute of Technology, China
  Zhengjia Fu, Advanced Digital Sciences Center, Singapore
  Zhiguo Gong, University of Macau, SAR China
  Zouhaier Brahmia, University of Sfax, Tunisia

Keynotes

  

A Holistic View of Human Factors in Crowdsourcing

  Sihem Amer-Yahia

  

CNRS, University of Grenoble Alpes, Grenoble, France

sihem.amer-yahia@cnrs.fr

Abstract. For over 40 years, organization studies have examined human factors in physical workplaces and their influence on the ability of an individual to perform a task, or a set of tasks, alone or in collaboration with others. In a virtual marketplace, the crowd is typically volatile, its arrival and departure asynchronous, and its levels of attention and accuracy diverse. This has generated a wealth of new research ranging from studying workers’ fatigue in task completion to examining the role of motivation in task assignment. I will review such work and argue that we need a holistic view to take full advantage of human factors such as skills, expected wage, and motivation in improving the performance of a crowdsourcing platform.

  

Experience on XXX Health such as Earth Health and Human Health Through Big Data

  Masaru Kitsuregawa (1,2)

1 The University of Tokyo, Tokyo, Japan
2 National Institute of Informatics, Tokyo, Japan

kitsure@tkl.iis.u-tokyo.ac.jp

Abstract. We have been working in the area of so-called ‘Health’. In this talk, we present our experiences in solving problems of earth environmental health and human health with big data system technologies. We also ask what type of platform would be suitable for societal health as a whole.

  

Thinking Spatial

  Mohamed Mokbel

  

Department of Computer Science and Engineering, University of Minnesota

mokbel@umn.edu

Abstract. The need to manage and analyze spatial data is hampered by the lack of specialized systems to support such data. System builders mostly build general-purpose systems that are generic enough to handle any kind of attributes. Whenever there is a pressing need for spatial data support, it is treated as an afterthought that can be addressed by adding new data types, extensions, or spatial cartridges to existing systems. This talk advocates for dealing with spatial data as first-class citizens, and for always thinking spatially whenever it comes to system design. This is well justified by the proliferation of location-based applications that rely mainly on spatial data. The talk will go through various system designs and show how they would be different if we had designed them while thinking spatially. Examples of these systems include database systems, big data systems, recommender systems, social networks, and crowdsourcing.

  

Contents – Part I

  Tutorials

    Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, and Eddie Ng

  Spatial Data Processing and Data Quality

    Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou
    Dawei Gao, Yongxin Tong, Yudian Ji, and Ke Xu
    Jianguo Wu, Jianwen Xiang, Dongdong Zhao, Huanhuan Li, Qing Xie, and Xiaoyi Hu
    Zizhe Xie, Qizhi Liu, and Zhifeng Bao

  Graph Data Processing

    Chuxu Zhang, Lu Yu, Chuang Liu, Zi-Ke Zhang, and Tao Zhou
    Yanxia Xu, Jinjing Huang, An Liu, Zhixu Li, Hongzhi Yin, and Lei Zhao
    Guohua Li, Weixiong Rao, and Zhongxiao Jin
    Wei Shi, Weiguo Zheng, Jeffrey Xu Yu, Hong Cheng, and Lei Zou
    Qiang Xu, Xin Wang, Junhu Wang, Yajun Yang, and Zhiyong Feng
    Qun Liao, Lei Sun, He Du, and Yulu Yang

  Data Mining, Privacy and Semantic Analysis

    Chunlin Zhong, Yi Yu, Suhua Tang, Shin’ichi Satoh, and Kai Xing
    Ningning Ma, Hai-Tao Zheng, and Xi Xiao
    Shushu Liu, An Liu, Zhixu Li, Guanfeng Liu, Jiajie Xu, Lei Zhao, and Kai Zheng
    Jerry Chun-Wei Lin, Jiexiong Zhang, and Philippe Fournier-Viger
    Ting Huang, Ruizhang Huang, Bowei Liu, and Yingying Yan
    Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, and Han-Chieh Chao

  Text and Log Data Management

    Ming Chen, Lin Li, and Qing Xie
    Shengluan Hou, Yu Huang, Chaoqun Fei, Shuhan Zhang, and Ruqian Lu
    Chunxia Zhang, Sen Wang, Jiayu Wu, and Zhendong Niu
    Zhenyu Zhao, Guozheng Rao, and Zhiyong Feng
    Jinwei Guo, Jiahao Wang, Peng Cai, Weining Qian, Aoying Zhou, and Xiaohang Zhu
    Huan Zhou, Huiqi Hu, Tao Zhu, Weining Qian, Aoying Zhou, and Yukun He

  Social Networks

    Xia Lv, Peiquan Jin, Lin Mu, Shouhong Wan, and Lihua Yue
    Yang Wu, Cheng Long, Ada Wai-Chee Fu, and Zitong Chen
    Kaixia Li, Zhao Cao, and Dacheng Qu
    Yu Qiao, Jun Wu, Lei Zhang, and Chongjun Wang
    Naoki Kito, Xiangmin Zhou, Dong Qin, Yongli Ren, Xiuzhen Zhang, and James Thom
    Tianchen Zhu, Zhaohui Peng, Xinghua Wang, and Xiaoguang Hong

  Data Mining and Data Streams

    Song Wu, Xingjun Wang, Hai Jin, and Haibao Chen
    Qiong Li, Xiaowang Zhang, and Zhiyong Feng
    Xiutao Shi, Liqiang Wang, Shijun Liu, Yafang Wang, Li Pan, and Lei Wu
    Hui Wen, Minglan Li, and Zhili Ye
    Chuxu Zhang, Chuang Liu, Lu Yu, Zi-Ke Zhang, and Tao Zhou
    Yangming Liu, Suyun Zhao, Hong Chen, Cuiping Li, and Yanmin Lu

  Query Processing

    Xiaoying Zhang, Hui Peng, Lei Dong, Hong Chen, and Hui Sun
    Kento Sugiura and Yoshiharu Ishikawa
    Zhijin Lv, Ben Chen, and Xiaohui Yu
    Ying Wang, Ming Zhong, Yuanyuan Zhu, Xuhui Li, and Tieyun Qian
    Yuan Tian, Peiquan Jin, Shouhong Wan, and Lihua Yue
    Zepeng Fang, Chen Lin, and Yun Liang

  Topic Modeling

    Yunfeng Chen, Lei Zhang, Xin Li, Yu Zong, Guiquan Liu, and Enhong Chen
    Lin Xiao, Zhang Min, and Zhang Yongfeng
    Linjing Wei, Heyan Huang, Yang Gao, Xiaochi Wei, and Chong Feng
    Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang, and Bowei Liu
    Ming Xu, Yang Cai, Hesheng Wu, Chongjun Wang, and Ning Li
    Jinjing Zhang, Jing Wang, and Li Li

  Author Index

  

Contents – Part II

  Machine Learning

    Qi Ye, Changlei Zhu, Gang Li, and Feng Wang
    Xiang Li, Rui Yan, and Ming Zhang
    Hai-Tao Zheng, Xin Yao, Yong Jiang, Shu-Tao Xia, and Xi Xiao

  Recommendation Systems

    Yan Zhao, Jia Zhu, Mengdi Jia, Wenyan Yang, and Kai Zheng
    Qinyong Wang, Hongzhi Yin, and Hao Wang
    Fei Yu, Zhijun Li, Shouxu Jiang, and Xiaofei Yang
    Wenli Yu, Li Li, Jingyuan Wang, Dengbao Wang, Yong Wang, Zhanbo Yang, and Min Huang
    Shuhei Kishida, Seiji Ueda, Atsushi Keyaki, and Jun Miyazaki
    Xiaolu Xing, Chaofeng Sha, and Junyu Niu

  Distributed Data Processing and Applications

    Chenhao Yang, Ben He, and Jungang Xu
    Ningnan Zhou, Xuan Zhou, Xiao Zhang, Xiaoyong Du, and Shan Wang
    Tinghai Pang, Lei Duan, Jyrki Nummenmaa, Jie Zuo, and Peng Zhang
    Shiliang Fan, Yubin Yang, Wenyang Lu, and Ping Song
    Tao Zhu, Huiqi Hu, Weining Qian, Aoying Zhou, Mengzhan Liu, and Qiong Zhao
    Hao Liu, Jiang Xiao, Xianjun Guo, Haoyu Tan, Qiong Luo, and Lionel M. Ni

  Machine Learning and Optimization

    Guangxuan Song, Wenwen Qu, Yilin Wang, and Xiaoling Wang
    Donghui Wang, Peng Cai, Weining Qian, Aoying Zhou, Tianze Pang, and Jing Jiang
    Tao Xie, Bin Wu, and Bai Wang
    Yixuan Liu, Zihao Gao, and Mizuho Iwaihara
    Jun Yin and Xiaoming Li
    Yongheng Wang, Guidan Chen, and Zengwang Wang
    Hai-Tao Zheng, Zhuren Wang, and Xi Xiao

  Demo Papers

    Jiawei Jiang, Ming Huang, Jie Jiang, and Bin Cui
    Dezhi Zhang, Peiquan Jin, Xiaoliang Wang, Chengcheng Yang, and Lihua Yue
    Yihai Xi, Ning Wang, Xiaoyu Wu, Yuqing Bao, and Wutong Zhou
    Xuguang Bao, Lizhen Wang, and Qing Xiao
    Yuan Liu, Xin Wang, and Qiang Xu
    Longlong Xu, Wutao Lin, Xiaorong Wang, Zhenhui Xu, Wei Chen, and Tengjiao Wang
    Leonard K.M. Poon, Chun Fai Leung, Peixian Chen, and Nevin L. Zhang
    Mingyan Teng, Qiao Sun, Buqiao Deng, Lei Sun, and Xiongpai Qin
    Qiao Sun, Xiongpai Qin, Buqiao Deng, and Wei Cui
    Weiwei Wang and Jianqiu Xu

  Author Index

   Tutorials

  

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

  Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, and Eddie Ng

  

University of Hong Kong, Pokfulam Road, Pokfulam, Hong Kong

{ckcheng,zphuang,ydzheng2,jyan}@cs.hku.hk, kywong2@connect.hku.hk,

ngheii@gmail.com

http://www.cs.hku.hk/~ckcheng/

  

Abstract. A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures, such as PCRW, PathSim, and HeteSim, can be used in various applications, including information retrieval, entity resolution, and product recommendation. These metrics are based on the use of meta paths, essentially a sequence of node classes and edge types between two nodes in an HIN. In this tutorial, we will give a detailed review of meta paths, as well as how they are used to define relevance. In a large and complex HIN, retrieving meta paths manually can be complex, expensive, and error-prone. Hence, we will explore systematic methods for finding meta paths. In particular, we will study a solution based on the Query-by-Example (QBE) paradigm, which allows us to discover meta paths in an effective and efficient manner.

We further generalise the notion of meta path to “meta structure”, which is a directed acyclic graph of object types with edge types connecting them. A meta structure, which is more expressive than a meta path, can describe a complex relationship between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We will discuss three relevance measures based on meta structures. Due to the computational complexity of these measures, we also study an algorithm with data structures proposed to support their evaluation. Finally, we will examine solutions for performing query recommendation based on meta paths. We will also discuss future research directions.

1 Background

  Heterogeneous information networks (HINs), such as DBLP and DBpedia, have recently received a lot of attention. These data sources, containing a vast number of inter-related facts, facilitate the discovery of interesting knowledge. Figure 1(a) illustrates an HIN, which describes the relationships among authors, papers, venues, and topics. For example, Jiawei Han ($a_2$) has written a VLDB paper ($p_{2,2}$), which mentions the topic “efficient” ($t_3$).

  Fig. 1. HIN, meta paths, and meta structures. (a) An HIN with object types author, paper, venue, and topic, and edge types write, mention, and publish. (b) Meta paths $P_1$ and $P_2$ and meta structure $S$.

  Given two HIN objects a and b, the evaluation of their relevance is of fundamental importance. This quantifies the degree of closeness between a and b. In Fig. 1(a), Jian Pei ($a_1$) and Jiawei Han ($a_2$) have a high relevance score, since they have both published papers with keyword “mining” in the same venue (KDD). Relevance finds its applications in information retrieval, recommendation, and clustering: a researcher can retrieve papers that have high relevance in terms of topics and venues in DBLP; in YAGO, relevance facilitates the extraction of actors who are close to a given director. As another example, in entity resolution applications, duplicated HIN object pairs having high relevance scores (e.g., two different objects in an HIN referring to the same real-world person) can be identified and removed from the HIN.

  Relevance Computation. In this tutorial, we will explore different ways of computing the relevance between two graph objects, for instance, neighborhood-based measures, such as common neighbors and Jaccard’s coefficient, and graph-theoretic measures based on random walks, such as Personalized PageRank and SimRank. These measures do not consider object and edge type information in an HIN. We will discuss the concept of meta paths. A meta path is a sequence of object types with edge types between them. Figure 1(b) illustrates a meta path $P_1$, which states that two authors ($A_1$ and $A_2$) are related by their publications in the same venue ($V$). Another meta path $P_2$ says that two authors have written papers containing the same topic ($T$). We will discuss several meta-path-based relevance measures, including PathCount, PathSim, and Path Constrained Random Walk (PCRW). These measures have been shown to be better than those that do not consider object and edge type information.
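  To make the meta-path-based measures more concrete, the following sketch counts path instances along the author–paper–venue–paper–author meta path and normalises them in the way PathSim is commonly defined (twice the number of paths between x and y, divided by the sum of their self-path counts). The toy network, the dictionary names, and the function names are assumptions made for illustration; they are not the data of Fig. 1 or code from the tutorial.

```python
from collections import Counter

# Hypothetical toy HIN (not the data of Fig. 1), stored as typed adjacency maps.
WRITE = {            # author -> papers written
    "a1": ["p1", "p2"],
    "a2": ["p3", "p4"],
    "a3": ["p5"],
}
PUBLISH = {          # paper -> venue where it was published
    "p1": "KDD", "p2": "ICDM", "p3": "KDD", "p4": "VLDB", "p5": "AAAI",
}


def venue_counts(author):
    """Count paths author -write-> paper -publish-> venue, grouped by venue."""
    return Counter(PUBLISH[p] for p in WRITE[author])


def path_count(x, y):
    """PathCount along A-P-V-P-A: paths from x to y that meet at a shared venue."""
    cx, cy = venue_counts(x), venue_counts(y)
    return sum(cx[v] * cy[v] for v in cx)


def path_sim(x, y):
    """PathSim: path count normalised by the self-path counts of x and y."""
    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0


print(path_count("a1", "a2"), path_sim("a1", "a2"), path_sim("a1", "a3"))
```

  In this toy network, a1 and a2 share one venue (KDD), so they obtain a non-zero score, whereas a1 and a3 share none. A type-agnostic measure such as common neighbors would not distinguish a shared venue from a shared topic.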

  We will further discuss meta structures, recently proposed to depict the relationship of two graph objects. A meta structure is essentially a directed acyclic graph of object and edge types. Figure 1(b) illustrates a meta structure $S$, which depicts that two authors are relevant if they have published papers in the same venue, and have also mentioned the same topic. A meta path (e.g., $P_1$ or $P_2$) is a special case of a meta structure. However, a meta path fails to capture complex relationships of the kind that a meta structure (e.g., $S$) expresses conveniently. We will discuss how meta structures can be used to formulate three relevance definitions, as well as their efficient calculation.
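  The difference between a meta path and a meta structure can be sketched by enumerating matching instances on a small example. The code below assumes a hypothetical toy network and simply lists the paper pairs that satisfy both branches of $S$ (same venue and a shared topic); it is an illustrative count only and is not claimed to be one of the three relevance measures covered in the tutorial.

```python
from itertools import product

# Hypothetical toy HIN (illustrative only, not the data of Fig. 1).
WRITE = {"a1": ["p1", "p2"], "a2": ["p3", "p4"]}                  # author -> papers
PUBLISH = {"p1": "KDD", "p2": "ICDM", "p3": "KDD", "p4": "KDD"}   # paper -> venue
MENTION = {                                                       # paper -> topics
    "p1": {"mining", "graph"},
    "p2": {"mining"},
    "p3": {"mining"},
    "p4": {"efficient"},
}


def instances_of_s(x, y):
    """List (paper_x, venue, topic, paper_y) tuples that match meta structure S."""
    found = []
    for px, py in product(WRITE[x], WRITE[y]):
        if PUBLISH[px] != PUBLISH[py]:
            continue  # the venue branch of the DAG must agree
        for topic in sorted(MENTION[px] & MENTION[py]):
            found.append((px, PUBLISH[px], topic, py))  # the topic branch agrees too
    return found


def instances_of_p1(x, y):
    """Same-venue paper pairs only (meta path P1), shown for comparison."""
    return [(px, py) for px, py in product(WRITE[x], WRITE[y])
            if PUBLISH[px] == PUBLISH[py]]


print(instances_of_s("a1", "a2"))
print(instances_of_p1("a1", "a2"))
```

  Here the meta path alone matches two paper pairs, but only one of them also shares a topic, so the meta structure is the stricter pattern.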

  Meta Path Discovery. There are often a huge number of meta paths between a pair of HIN objects. It can be very difficult, even for a domain expert, to identify the right meta paths. We will discuss a recently proposed meta path discovery algorithm, in which users provide example instances of source and target objects through a Query-by-Example paradigm, to derive meta paths automatically. We will demonstrate an HIN search engine prototype based on this algorithm.

  Query Recommendation. We will study the use of meta paths in query recommendation, where queries are suggested to web search users based on their previous query histories. It is possible to use a knowledge base (an HIN) and its related meta paths to perform effective query recommendation. The approach is especially useful for long-tail queries that rarely appear in query logs.
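  The Query-by-Example idea described under Meta Path Discovery can be illustrated with a deliberately naive sketch: enumerate all simple paths of bounded length between each example pair, abstract each path to its sequence of object and edge types, and rank the resulting meta paths by how many example pairs they connect. This exhaustive enumeration is a stand-in chosen for clarity, not the discovery algorithm covered in the tutorial; the toy network and all names are hypothetical.

```python
from collections import Counter

# Hypothetical toy HIN: node -> object type, plus typed directed edges.
NODE_TYPE = {"a1": "A", "a2": "A", "a3": "A",
             "p1": "P", "p2": "P", "p3": "P",
             "KDD": "V", "mining": "T"}
EDGES = [("a1", "write", "p1"), ("a2", "write", "p2"), ("a3", "write", "p3"),
         ("p1", "publish", "KDD"), ("p2", "publish", "KDD"),
         ("p1", "mention", "mining"), ("p3", "mention", "mining")]

# Adjacency lists; every edge can also be traversed backwards as label^-1.
ADJ = {}
for src, label, dst in EDGES:
    ADJ.setdefault(src, []).append((label, dst))
    ADJ.setdefault(dst, []).append((label + "^-1", src))


def meta_paths(source, target, max_len=4):
    """Return the set of meta paths (type and edge-label sequences) linking
    source to target via simple paths with at most max_len edges."""
    results = set()

    def walk(node, visited, signature):
        if node == target and len(signature) > 1:
            results.add(tuple(signature))
            return
        if (len(signature) - 1) // 2 >= max_len:
            return  # edge budget exhausted
        for label, nxt in ADJ.get(node, []):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, signature + [label, NODE_TYPE[nxt]])

    walk(source, {source}, [NODE_TYPE[source]])
    return results


def rank_meta_paths(examples, max_len=4):
    """Rank meta paths by the number of example (source, target) pairs they connect."""
    votes = Counter()
    for s, t in examples:
        votes.update(meta_paths(s, t, max_len))
    return votes.most_common()


for path, support in rank_meta_paths([("a1", "a2"), ("a1", "a3")]):
    print(support, " ".join(path))
```

  Exhaustive enumeration grows quickly with path length, which is one reason a dedicated discovery algorithm is needed on large HINs.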

  2 Proposed Schedule

  The following is our proposed schedule of the 90-min tutorial.

  – Introduction (15 min). We will discuss the basic model of HIN, and discuss applications based on it, such as search, relevance computation, query recommendation, and data integration (10 min). We will also introduce meta paths, a fundamental HIN analysis tool, and give an overview of the tutorial (5 min).
  – Main contents (60 min). Next, we will introduce meta paths, and how they facilitate the computation of various relevance measures (10 min). We then explain the process of discovering meta paths (15 min). We discuss a novel query recommendation framework based on meta paths (15 min). We will also present meta structures, which are the latest development of meta paths (15 min). We will demonstrate an HIN search engine prototype based on meta paths (5 min).
  – Conclusions (15 min). We will conclude the tutorial and discuss future directions (5 min). The rest of the time will be dedicated to Q&A (10 min).

  3 Intended Audience

  The tutorial is designed for researchers interested in the latest developments in the field of HINs, especially regarding meta-paths for novel applications. The HIN search demonstration will give insight to software practitioners developing recommendation facilities for HINs.


  4 Biography of Presenters

  Reynold Cheng is an Associate Professor in the Department of Computer Science at the University of Hong Kong. He obtained his PhD from the Department of Computer Science of Purdue University in 2005. He was granted an Outstanding Young Researcher Award 2011–12 by HKU. He was the recipient of the 2010 Research Output Prize in the Department of Computer Science of HKU. He also received the U21 Fellowship in 2011. He received the Performance Reward in years 2006 and 2007 awarded by the Hong Kong Polytechnic University. He is a member of the IEEE, the ACM, and ACM SIGMOD. He is an editorial board member of TKDE, DAPD and IS, and was a guest editor for TKDE, DAPD, and Geoinformatica. He is an area chair of ICDE 2017, senior PC member of BigData 2017 and DASFAA 2015, PC co-chair of APWeb 2015, area chair for CIKM 2014, and workshop co-chair of ICDE 2014. He received an Outstanding Service Award in CIKM 2009. He has served as a PC member and reviewer for top conferences and journals.

  Zhipeng Huang is a 2nd year Ph.D. student in the CS department of HKU, supervised by Prof. Nikos Mamoulis and Dr. Reynold Cheng. He received his bachelor's degree from the EECS department of PKU in 2015. His research interests cover data mining, data management and data cleaning.

  Yudian Zheng is a 4th year Ph.D. student in the CS department of HKU, supervised by Dr. Reynold Cheng. Yudian’s research interests cover crowdsourcing, data management and data cleaning. He has published full research papers in well-established database and data mining conferences/journals, including SIGMOD, VLDB, KDD, WWW, ICDE, and TKDE. He has also taken internships in Microsoft Research and Google Research.

  Jing Yan is a 1st year MPhil student supervised by Dr. Reynold Cheng in the CS department of HKU. His research interests include data management and data mining, with emphasis on knowledge graphs and data cleaning.

  Ka Yu Wong is currently an MSc student in the CS department of HKU.

  Eddie Ng is currently an MSc student in the CS department of HKU.

  Acknowledgements. Reynold Cheng, Zhipeng Huang, Yudian Zheng, and Jing Yan were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 17229116 and 17205115) and the University of Hong Kong (Projects 102009508 and 104004129).

  References

  

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)

2. Huang, Z., Cautis, B., Cheng, R., Zheng, Y.: KB-enabled query recommendation for long-tail queries. In: CIKM, pp. 2107–2112 (2016)

3. Huang, Z., Zheng, Y., Cheng, R., Sun, Y., Mamoulis, N., Li, X.: Meta structure:


  

4. Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Mach. Learn. 81(1), 53–67 (2010)

5. Ley, M.: DBLP Computer Science Bibliography (2005)

6. Meng, C., Cheng, R., Maniu, S., Senellart, P., Zhang, W.: Discovering meta-paths in large heterogeneous information networks. In: WWW, pp. 754–764 (2015)

7. Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: Exemplar queries: give me an example of what you need. PVLDB 7(5), 365–376 (2014)

8. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)

9. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: Pathsim: meta path-based top-k similarity search in heterogeneous information networks. In: PVLDB, pp. 992–1003 (2011)

10. Yu, X., Ren, X., Sun, Y., Sturt, B., Khandelwal, U., Gu, Q., Norick, B., Han, J.: Recommendation in heterogeneous information networks with implicit user feedback. In: RecSys, pp. 347–350 (2013)

  Spatial Data Processing and Data Quality

  

TrajSpark: A Scalable and Efficient In-Memory Management System for Big Trajectory Data

  Zhigang Zhang, Cheqing Jin (✉), Jiali Mao, Xiaolin Yang, and Aoying Zhou

School of Data Science and Engineering, East China Normal University, Shanghai, China
{zgzhang,jlmao1231,xlyang}@stu.ecnu.edu.cn, {cqjin,ayzhou}@sei.ecnu.edu.cn

  Abstract. The widespread application of mobile positioning devices has generated big trajectory data. Existing disk-based trajectory management systems can no longer provide scalable and low-latency query services. In view of that, we present TrajSpark, a distributed in-memory system to consistently offer efficient management of trajectory data. TrajSpark introduces a new abstraction called IndexTRDD to manage trajectory segments, and exploits a global and local indexing mechanism to accelerate trajectory queries. Furthermore, to alleviate the essential partitioning overhead, it adopts the time-decay model to monitor the change of data distribution and updates the data-partition structure adaptively. This model avoids repartitioning existing data when a new batch of data arrives. Extensive experiments on three types of trajectory queries over both real and synthetic datasets demonstrate that TrajSpark outperforms state-of-the-art systems.

  Keywords: Big trajectory data · In-memory · Low latency query

1 Introduction

  Recently, with the explosive development of positioning techniques and the popular use of intelligent electronic devices, trajectory data of MOs (Moving Objects) has accumulated rapidly in many applications, such as location-based services (LBS) and geographical information systems (GIS). For example, DiDi, the largest one-stop consumer transportation platform in China, now has 1.5 million registered active drivers and provides services for more than 300 million passengers. The total length of all trajectories generated on this platform reached around 13 billion kilometers in 2015. Moreover, the volume of trajectory data keeps surging: in March 2016, the number of trajectories generated in one day already exceeded 10 million. It is challenging to provide real-time service over such data. However, as almost all existing trajectory management systems are disk-oriented (e.g., TrajStore), they cannot support low-latency query services upon big trajectory data.


  In-memory computing systems are now widely used to provide low-latency query services. For instance, Spark, a distributed in-memory computing system, has been widely adopted. Spark provides a data abstraction called RDDs (Resilient Distributed Datasets) to maintain a collection of objects that are partitioned across a cluster of machines. Users can manipulate RDDs conveniently through a set of predefined operations. However, Spark lacks an indexing mechanism over RDDs and needs to scan the whole dataset for a given query. Recently, some Spark-based system prototypes have been proposed to process big spatial data, including SpatialSpark and Simba. Amongst them, SpatialSpark implements the spatial join query on