Lei Chen · Christian S. Jensen · Cyrus Shahabi · Xiaochun Yang · Xiang Lian (Eds.)
LNCS 10366
Web and Big Data
First International Joint Conference, APWeb-WAIM 2017
Beijing, China, July 7–9, 2017
Proceedings, Part I
Lecture Notes in Computer Science 10366
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison Lancaster University, Lancaster, UK
Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler University of Surrey, Guildford, UK
Jon M. Kleinberg Cornell University, Ithaca, NY, USA
Friedemann Mattern ETH Zurich, Zurich, Switzerland
John C. Mitchell Stanford University, Stanford, CA, USA
Moni Naor Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan Indian Institute of Technology, Madras, India
Bernhard Steffen TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos University of California, Los Angeles, CA, USA
Doug Tygar University of California, Berkeley, CA, USA
Gerhard Weikum Max Planck Institute for Informatics, Saarbrücken, Germany
Editors
Lei Chen, Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Christian S. Jensen, Computer Science, Aarhus University, Aarhus N, Denmark
Cyrus Shahabi, Computer Science, University of Southern California, Los Angeles, CA, USA
Xiaochun Yang, Northeastern University, Shenyang, China
Xiang Lian, Kent State University, Kent, OH, USA
ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-63578-1
ISBN 978-3-319-63579-8 (eBook)
DOI 10.1007/978-3-319-63579-8
Library of Congress Control Number: 2017947034
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
© Springer International Publishing AG 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
Preface
This volume (LNCS 10366) and its companion volume (LNCS 10367) contain the proceedings of the first Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, called APWeb-WAIM. This new joint conference aims to attract participants from different scientific communities as well as from industry, and not merely from the Asia Pacific region, but also from other continents. The objective is to enable the sharing and exchange of ideas, experiences, and results in the areas of World Wide Web and big data, thus covering Web technologies, database systems, information management, software engineering, and big data. The first APWeb-WAIM conference was held in Beijing during July 7–9, 2017.
As a new Asia-Pacific flagship conference focusing on research, development, and applications in relation to Web information management, APWeb-WAIM builds on the successes of APWeb and WAIM: APWeb was previously held in Beijing (1998), Hong Kong (1999), Xi’an (2000), Changsha (2001), Xi’an (2003), Hangzhou (2004), Shanghai (2005), Harbin (2006), Huangshan (2007), Shenyang (2008), Suzhou (2009), Busan (2010), Beijing (2011), Kunming (2012), Sydney (2013), Changsha (2014), Guangzhou (2015), and Suzhou (2016); and WAIM was held in Shanghai (2000), Xi’an (2001), Beijing (2002), Chengdu (2003), Dalian (2004), Hangzhou (2005), Hong Kong (2006), Huangshan (2007), Zhangjiajie (2008), Suzhou (2009), Jiuzhaigou (2010), Wuhan (2011), Harbin (2012), Beidaihe (2013), Macau (2014), Qingdao (2015), and Nanchang (2016). With the fast development of Web-related technologies, we expect that APWeb-WAIM will become an increasingly popular forum that brings together outstanding researchers and developers in the field of Web and big data from around the world.
The high-quality program documented in these proceedings would not have been possible without the authors who chose APWeb-WAIM for disseminating their findings. Out of 240 submissions to the research track and 19 to the demonstration track, the conference accepted 44 regular research papers (18%), 32 short research papers, and ten demonstrations. The contributed papers address a wide range of topics, such as spatial data processing and data quality, graph data processing, data mining, privacy and semantic analysis, text and log data management, social networks, data streams, query processing and optimization, topic modeling, machine learning, recommender systems, and distributed data processing.
The technical program also included keynotes by Profs. Sihem Amer-Yahia (National Center for Scientific Research, CNRS, France), Masaru Kitsuregawa (National Institute of Informatics, NII, Japan), and Mohamed Mokbel (University of Minnesota, Twin Cities, USA) as well as tutorials by Prof. Reynold Cheng (The University of Hong Kong, SAR China), Prof. Guoliang Li (Tsinghua University, China), Prof. Arijit Khan (Nanyang Technological University, Singapore), and Prof. Yu Zheng (Microsoft Research Asia, China). We are grateful to these distinguished scientists for their invaluable contributions to the conference program.
Because APWeb-WAIM is a new joint conference, teamwork is particularly important for its success. We are deeply thankful to the Program Committee members and the external reviewers for lending their time and expertise to the conference. Special thanks go to the local Organizing Committee led by Jun He, Yongxin Tong, and Shimin Chen. Thanks also go to the workshop co-chairs (Matthias Renz, Shaoxu Song, and Yang-Sae Moon), demo co-chairs (Sebastian Link, Shuo Shang, and Yoshiharu Ishikawa), industry co-chairs (Chen Wang and Weining Qian), tutorial co-chairs (Andreas Züfle and Muhammad Aamir Cheema), sponsorship chair (Junjie Yao), proceedings co-chairs (Xiang Lian and Xiaochun Yang), and publicity co-chairs (Hongzhi Yin, Lei Zou, and Ce Zhang). Their efforts were essential to the success of the conference. Last but not least, we wish to express our gratitude to the Webmaster (Zhao Cao) for all the hard work and to our sponsors who generously supported the smooth running of the conference.
We hope you enjoy the exciting program of APWeb-WAIM 2017 as documented in these proceedings.

June 2017

Xiaoyong Du
Beng Chin Ooi
M. Tamer Özsu
Bin Cui
Lei Chen
Christian S. Jensen
Cyrus Shahabi
Organization
Organizing Committee

General Co-chairs
Xiaoyong Du, Renmin University of China, China
Beng Chin Ooi, National University of Singapore, Singapore
M. Tamer Özsu, University of Waterloo, Canada

Program Co-chairs
Lei Chen, Hong Kong University of Science and Technology, China
Christian S. Jensen, Aalborg University, Denmark
Cyrus Shahabi, The University of Southern California, USA

Workshop Co-chairs
Matthias Renz, George Mason University, USA
Shaoxu Song, Tsinghua University, China
Yang-Sae Moon, Kangwon National University, South Korea

Demo Co-chairs
Sebastian Link, The University of Auckland, New Zealand
Shuo Shang, King Abdullah University of Science and Technology, Saudi Arabia
Yoshiharu Ishikawa, Nagoya University, Japan

Industrial Co-chairs
Chen Wang, Innovation Center for Beijing Industrial Big Data, China
Weining Qian, East China Normal University, China

Proceedings Co-chairs
Xiang Lian, Kent State University, USA
Xiaochun Yang, Northeastern University, China

Tutorial Co-chairs
Andreas Züfle, George Mason University, USA
Muhammad Aamir Cheema, Monash University, Australia

ACM SIGMOD China Lectures Co-chairs
Guoliang Li, Tsinghua University, China
Hongzhi Wang, Harbin Institute of Technology, China

Publicity Co-chairs
Hongzhi Yin, The University of Queensland, Australia
Lei Zou, Peking University, China
Ce Zhang, Eidgenössische Technische Hochschule ETH, Switzerland

Local Organization Co-chairs
Jun He, Renmin University of China, China
Yongxin Tong, Beihang University, China
Shimin Chen, Chinese Academy of Sciences, China

Sponsorship Chair
Junjie Yao, East China Normal University, China

Web Chair
Zhao Cao, Beijing Institute of Technology, China

Steering Committee Liaison
Yanchun Zhang, Victoria University, Australia

Senior Program Committee
Dieter Pfoser, George Mason University, USA
Ilaria Bartolini, University of Bologna, Italy
Jianliang Xu, Hong Kong Baptist University, SAR China
Mario Nascimento, University of Alberta, Canada
Matthias Renz, George Mason University, USA
Mohamed Mokbel, University of Minnesota, USA
Ralf Hartmut Güting, Fernuniversität in Hagen, Germany
Seungwon Hwang, Yonsei University, South Korea
Sourav S. Bhowmick, Nanyang Technological University, Singapore
Tingjian Ge, University of Massachusetts Lowell, USA
Vincent Oria, New Jersey Institute of Technology, USA
Walid Aref, Purdue University, USA
Wook-Shin Han, Pohang University of Science and Technology, Korea
Yoshiharu Ishikawa, Nagoya University, Japan

Program Committee
Alex Delis, University of Athens, Greece
Alex Thomo, University of Victoria, Canada
Aviv Segev, Korea Advanced Institute of Science and Technology, South Korea
Baoning Niu, Taiyuan University of Technology, China
Bin Cui, Peking University, China
Bin Yang, Aalborg University, Denmark
Carson Leung, University of Manitoba, Canada
Chih-Hua Tai, National Taipei University, China
Cuiping Li, Renmin University of China, China
Daniele Riboni, University of Cagliari, Italy
Defu Lian, University of Electronic Science and Technology of China, China
Dejing Dou, University of Oregon, USA
Demetris Zeinalipour, Max Planck Institute for Informatics, Germany and University of Cyprus, Cyprus
Dhaval Patel, Indian Institute of Technology Roorkee, India
Dimitris Sacharidis, Technische Universität Wien, Vienna, Austria
Fei Chiang, McMaster University, Canada
Ganzhao Yuan, South China University of Technology, China
Giovanna Guerrini, Universita di Genova, Italy
Guoliang Li, Tsinghua University, China
Guoqiong Liao, Jiangxi University of Finance and Economics, China
Hailong Sun, Beihang University, China
Han Su, University of Southern California, USA
Hiroaki Ohshima, Kyoto University, Japan
Hong Chen, Renmin University of China, China
Hongyan Liu, Tsinghua University, China
Hongzhi Wang, Harbin Institute of Technology, China
Hongzhi Yin, The University of Queensland, Australia
Hua Li, Aalborg University, Denmark
Hua Lu, Aalborg University, Denmark
Hua Wang, Victoria University, Melbourne, Australia
Hua Yuan, University of Electronic Science and Technology of China, China
Iulian Sandu Popa, Inria and PRiSM Lab, University of Versailles Saint-Quentin, France
James Cheng, Chinese University of Hong Kong, SAR China
Jeffrey Xu Yu, Chinese University of Hong Kong, SAR China
Jiaheng Lu, University of Helsinki, Finland
Jiajun Liu, Renmin University of China, China
Jialong Han, Nanyang Technological University, Singapore
Jian Yin, Zhongshan University, China
Jianliang Xu, Hong Kong Baptist University, SAR China
Jianmin Wang, Tsinghua University, China
Jiannan Wang, Simon Fraser University, Canada
Jianting Zhang, City College of New York, USA
Jinchuan Chen, Renmin University of China, China
Ju Fan, National University of Singapore, Singapore
Jun Gao, Peking University, China
Junfeng Zhou, Yanshan University, China
Junhu Wang, Griffith University, Australia
Kai Zeng, University of California, Berkeley, USA
Karine Zeitouni, PRISM University of Versailles St-Quentin, Paris, France
Kyuseok Shim, Seoul National University, Korea
Lei Zou, Peking University, China
Lei Chen, Hong Kong University of Science and Technology, SAR China
Leong Hou U, University of Macau, SAR China
Liang Hong, Wuhan University, China
Lianghuai Yang, Zhejiang University of Technology, China
Long Guo, Peking University, China
Man Lung Yiu, Hong Kong Polytechnic University, SAR China
Markus Endres, University of Augsburg, Germany
Maria Damiani, University of Milano, Italy
Meihui Zhang, Singapore University of Technology and Design, Singapore
Mihai Lupu, Vienna University of Technology, Austria
Mirco Nanni, ISTI-CNR Pisa, Italy
Mizuho Iwaihara, Waseda University, Japan
Mohammed Eunus Ali, Bangladesh University of Engineering and Technology, Bangladesh
Peer Kroger, Ludwig-Maximilians-University of Munich, Germany
Peiquan Jin, University of Science and Technology of China, China
Peng Wang, Fudan University, China
Yaokai Feng, Kyushu University, Japan
Wookey Lee, Inha University, Korea
Raymond Chi-Wing Wong, Hong Kong University of Science and Technology, SAR China
Richong Zhang, Beihang University, China
Sanghyun Park, Yonsei University, Korea
Sangkeun Lee, Oak Ridge National Laboratory, USA
Sanjay Madria, Missouri University of Science and Technology, USA
Shengli Wu, Jiangsu University, China
Shi Gao, University of California, Los Angeles, USA
Shimin Chen, Chinese Academy of Sciences, China
Shuai Ma, Beihang University, China
Shuo Shang, King Abdullah University of Science and Technology, Saudi Arabia
Sourav S Bhowmick, Nanyang Technological University, Singapore
Stavros Papadopoulos, Intel Labs and MIT, USA
Takahiro Hara, Osaka University, Japan
Tieyun Qian, Wuhan University, China
Ting Deng, Beihang University, China
Tru Cao, Ho Chi Minh City University of Technology, Vietnam
Vicent Zheng, Advanced Digital Sciences Center, Singapore
Vinay Setty, Aalborg University, Denmark
Wee Ng, Institute for Infocomm Research, Singapore
Wei Wang, University of New South Wales, Australia
Weining Qian, East China Normal University, China
Weiwei Sun, Fudan University, China
Wei-Shinn Ku, Auburn University, USA
Wenjia Li, New York Institute of Technology, USA
Wen Zhang, Wuhan University, China
Wolf-Tilo Balke, Braunschweig University of Technology, Germany
Xiang Lian, Kent State University, USA
Xiang Zhao, National University of Defence Technology, China
Xiangliang Zhang, King Abdullah University of Science and Technology, Saudi Arabia
Xiangmin Zhou, RMIT University, Australia
Xiaochun Yang, Northeastern University, China
Xiaofeng He, East China Normal University, China
Xiaoyong Du, Renmin University of China, China
Xike Xie, University of Science and Technology of China, China
Xingquan Zhu, Florida Atlantic University, USA
Xuan Zhou, Renmin University of China, China
Yanghua Xiao, Fudan University, China
Yang-Sae Moon, Kangwon National University, South Korea
Yasuhiko Morimoto, Hiroshima University, Japan
Yijie Wang, National University of Defense Technology, China
Yingxia Shao, Peking University, China
Yong Zhang, Tsinghua University, China
Yongxin Tong, Beihang University, China
Yoshiharu Ishikawa, Nagoya University, Japan
Yu Gu, Northeastern University, China
Yuan Fang, Institute for Infocomm Research, Singapore
Yueguo Chen, Renmin University of China, China
Yunjun Gao, Zhejiang University, China
Zakaria Maamar, Zayed University, United Arab Emirates
Zhaonian Zou, Harbin Institute of Technology, China
Zhengjia Fu, Advanced Digital Sciences Center, Singapore
Zhiguo Gong, University of Macau, SAR China
Zouhaier Brahmia, University of Sfax, Tunisia

Keynotes
A Holistic View of Human Factors in Crowdsourcing

Sihem Amer-Yahia
CNRS, University of Grenoble Alpes, Grenoble, France
sihem.amer-yahia@cnrs.fr

Abstract. For over 40 years, organization studies have examined human factors in physical workplaces and their influence on the ability of an individual to perform a task, or a set of tasks, alone or in collaboration with others. In a virtual marketplace, the crowd is typically volatile, its arrival and departure asynchronous, and its levels of attention and accuracy diverse. This has generated a wealth of new research ranging from studying workers’ fatigue in task completion to examining the role of motivation in task assignment. I will review such work and argue that we need a holistic view to take full advantage of human factors such as skills, expected wage and motivation, in improving the performance of a crowdsourcing platform.
Experience on XXX Health such as Earth Health and Human Health Through Big Data

Masaru Kitsuregawa
The University of Tokyo, Tokyo, Japan
National Institute of Informatics, Tokyo, Japan
kitsure@tkl.iis.u-tokyo.ac.jp

Abstract. We have been working in the area of so-called ‘Health’. In this talk, we present our experiences in solving problems in earth environmental health and human health with big data system technologies. We are wondering what type of platform would be suitable for societal health as a whole.
Thinking Spatial
Mohamed Mokbel
Department of Computer Science and Engineering, University of Minnesota
mokbel@umn.edu
Abstract. The need to manage and analyze spatial data is hampered by the lack of specialized systems to support such data. System builders mostly build general-purpose systems that are generic enough to handle any kind of attributes. Whenever there is a pressing need for spatial data support, it is treated as an afterthought that can be addressed by adding new data types, extensions, or spatial cartridges to existing systems. This talk advocates for dealing with spatial data as first-class citizens, and for always thinking spatially whenever it comes to system design. This is well justified by the proliferation of location-based applications that mainly rely on spatial data. The talk will go through various system designs and show how they would be different if we had designed them while thinking spatially. Examples of these systems include database systems, big data systems, recommender systems, social networks, and crowdsourcing.
Contents – Part I
Tutorials
Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, and Eddie Ng
Spatial Data Processing and Data Quality
Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou
Dawei Gao, Yongxin Tong, Yudian Ji, and Ke Xu
Jianguo Wu, Jianwen Xiang, Dongdong Zhao, Huanhuan Li, Qing Xie, and Xiaoyi Hu
Zizhe Xie, Qizhi Liu, and Zhifeng Bao
Graph Data Processing
Chuxu Zhang, Lu Yu, Chuang Liu, Zi-Ke Zhang, and Tao Zhou
Yanxia Xu, Jinjing Huang, An Liu, Zhixu Li, Hongzhi Yin, and Lei Zhao
Guohua Li, Weixiong Rao, and Zhongxiao Jin
Wei Shi, Weiguo Zheng, Jeffrey Xu Yu, Hong Cheng, and Lei Zou
Qiang Xu, Xin Wang, Junhu Wang, Yajun Yang, and Zhiyong Feng
Qun Liao, Lei Sun, He Du, and Yulu Yang
Data Mining, Privacy and Semantic Analysis
Chunlin Zhong, Yi Yu, Suhua Tang, Shin’ichi Satoh, and Kai Xing
Ningning Ma, Hai-Tao Zheng, and Xi Xiao
Shushu Liu, An Liu, Zhixu Li, Guanfeng Liu, Jiajie Xu, Lei Zhao, and Kai Zheng
Jerry Chun-Wei Lin, Jiexiong Zhang, and Philippe Fournier-Viger
Ting Huang, Ruizhang Huang, Bowei Liu, and Yingying Yan
Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, and Han-Chieh Chao
Text and Log Data Management
Ming Chen, Lin Li, and Qing Xie
Shengluan Hou, Yu Huang, Chaoqun Fei, Shuhan Zhang, and Ruqian Lu
Chunxia Zhang, Sen Wang, Jiayu Wu, and Zhendong Niu
Zhenyu Zhao, Guozheng Rao, and Zhiyong Feng
Jinwei Guo, Jiahao Wang, Peng Cai, Weining Qian, Aoying Zhou, and Xiaohang Zhu
Huan Zhou, Huiqi Hu, Tao Zhu, Weining Qian, Aoying Zhou, and Yukun He
Social Networks
Xia Lv, Peiquan Jin, Lin Mu, Shouhong Wan, and Lihua Yue
Yang Wu, Cheng Long, Ada Wai-Chee Fu, and Zitong Chen
Kaixia Li, Zhao Cao, and Dacheng Qu
Yu Qiao, Jun Wu, Lei Zhang, and Chongjun Wang
Naoki Kito, Xiangmin Zhou, Dong Qin, Yongli Ren, Xiuzhen Zhang, and James Thom
Tianchen Zhu, Zhaohui Peng, Xinghua Wang, and Xiaoguang Hong
Data Mining and Data Streams
Song Wu, Xingjun Wang, Hai Jin, and Haibao Chen
Qiong Li, Xiaowang Zhang, and Zhiyong Feng
Xiutao Shi, Liqiang Wang, Shijun Liu, Yafang Wang, Li Pan, and Lei Wu
Hui Wen, Minglan Li, and Zhili Ye
Chuxu Zhang, Chuang Liu, Lu Yu, Zi-Ke Zhang, and Tao Zhou
Yangming Liu, Suyun Zhao, Hong Chen, Cuiping Li, and Yanmin Lu
Query Processing
Xiaoying Zhang, Hui Peng, Lei Dong, Hong Chen, and Hui Sun
Kento Sugiura and Yoshiharu Ishikawa
Zhijin Lv, Ben Chen, and Xiaohui Yu
Ying Wang, Ming Zhong, Yuanyuan Zhu, Xuhui Li, and Tieyun Qian
Yuan Tian, Peiquan Jin, Shouhong Wan, and Lihua Yue
Zepeng Fang, Chen Lin, and Yun Liang
Topic Modeling
Yunfeng Chen, Lei Zhang, Xin Li, Yu Zong, Guiquan Liu, and Enhong Chen
Lin Xiao, Zhang Min, and Zhang Yongfeng
Linjing Wei, Heyan Huang, Yang Gao, Xiaochi Wei, and Chong Feng
Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang, and Bowei Liu
Ming Xu, Yang Cai, Hesheng Wu, Chongjun Wang, and Ning Li
Jinjing Zhang, Jing Wang, and Li Li
Author Index
Contents – Part II
Machine Learning
Qi Ye, Changlei Zhu, Gang Li, and Feng Wang
Xiang Li, Rui Yan, and Ming Zhang
Hai-Tao Zheng, Xin Yao, Yong Jiang, Shu-Tao Xia, and Xi Xiao
Recommendation Systems
Yan Zhao, Jia Zhu, Mengdi Jia, Wenyan Yang, and Kai Zheng
Qinyong Wang, Hongzhi Yin, and Hao Wang
Fei Yu, Zhijun Li, Shouxu Jiang, and Xiaofei Yang
Wenli Yu, Li Li, Jingyuan Wang, Dengbao Wang, Yong Wang, Zhanbo Yang, and Min Huang
Shuhei Kishida, Seiji Ueda, Atsushi Keyaki, and Jun Miyazaki
Xiaolu Xing, Chaofeng Sha, and Junyu Niu
Distributed Data Processing and Applications
Chenhao Yang, Ben He, and Jungang Xu
Ningnan Zhou, Xuan Zhou, Xiao Zhang, Xiaoyong Du, and Shan Wang
Tinghai Pang, Lei Duan, Jyrki Nummenmaa, Jie Zuo, and Peng Zhang
Shiliang Fan, Yubin Yang, Wenyang Lu, and Ping Song
Tao Zhu, Huiqi Hu, Weining Qian, Aoying Zhou, Mengzhan Liu, and Qiong Zhao
Hao Liu, Jiang Xiao, Xianjun Guo, Haoyu Tan, Qiong Luo, and Lionel M. Ni
Machine Learning and Optimization
Guangxuan Song, Wenwen Qu, Yilin Wang, and Xiaoling Wang
Donghui Wang, Peng Cai, Weining Qian, Aoying Zhou, Tianze Pang, and Jing Jiang
Tao Xie, Bin Wu, and Bai Wang
Yixuan Liu, Zihao Gao, and Mizuho Iwaihara
Jun Yin and Xiaoming Li
Yongheng Wang, Guidan Chen, and Zengwang Wang
Hai-Tao Zheng, Zhuren Wang, and Xi Xiao
Demo Papers
Jiawei Jiang, Ming Huang, Jie Jiang, and Bin Cui
Dezhi Zhang, Peiquan Jin, Xiaoliang Wang, Chengcheng Yang, and Lihua Yue
Yihai Xi, Ning Wang, Xiaoyu Wu, Yuqing Bao, and Wutong Zhou
Xuguang Bao, Lizhen Wang, and Qing Xiao
Yuan Liu, Xin Wang, and Qiang Xu
Longlong Xu, Wutao Lin, Xiaorong Wang, Zhenhui Xu, Wei Chen, and Tengjiao Wang
Leonard K.M. Poon, Chun Fai Leung, Peixian Chen, and Nevin L. Zhang
Mingyan Teng, Qiao Sun, Buqiao Deng, Lei Sun, and Xiongpai Qin
Qiao Sun, Xiongpai Qin, Buqiao Deng, and Wei Cui
Weiwei Wang and Jianqiu Xu
Author Index
Tutorials
Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks
Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, and Eddie Ng
University of Hong Kong, Pokfulam Road, Pokfulam, Hong Kong
{ckcheng,zphuang,ydzheng2,jyan}@cs.hku.hk, kywong2@connect.hku.hk,
ngheii@gmail.com
http://www.cs.hku.hk/~ckcheng/
Abstract. A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures, such as PCRW, PathSim, and HeteSim, can be used in various applications, including information retrieval, entity resolution, and product recommendation. These metrics are based on the use of meta paths, essentially a sequence of node classes and edge types between two nodes in a HIN. In this tutorial, we will give a detailed review of meta paths, as well as how they are used to define relevance. In a large and complex HIN, retrieving meta paths manually can be complex, expensive, and error-prone. Hence, we will explore systematic methods for finding meta paths. In particular, we will study a solution based on the Query-by-Example (QBE) paradigm, which allows us to discover meta paths in an effective and efficient manner.

We further generalise the notion of meta path to “meta structure”, which is a directed acyclic graph of object types with edge types connecting them. A meta structure, which is more expressive than a meta path, can describe complex relationships between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We will discuss three relevance measures based on meta structures. Due to the computational complexity of these measures, we also study an algorithm with data structures proposed to support their evaluation. Finally, we will examine solutions for performing query recommendation based on meta paths. We will also discuss future research directions.

1 Background
Heterogeneous information networks (HINs), such as DBLP and DBpedia, have recently received a lot of attention. These data sources, containing a vast number of inter-related facts, facilitate the discovery of interesting knowledge. Figure 1(a) illustrates an HIN, which describes the relationships among objects of four types (author, paper, venue, and topic). For example, Jiawei Han ($a_2$) has written a VLDB paper ($p_{2,2}$), which mentions the topic “efficient” ($t_3$).

Fig. 1. HIN, meta paths, and meta structures. (a) An example HIN with object types author, paper, venue, and topic, and edge types write, mention, and publish. (b) Meta paths $P_1$ and $P_2$ and meta structure $S$.

Given two HIN objects a and b, the evaluation of their relevance is of fundamental importance. This quantifies the degree of closeness between a and b.
In Fig. 1(a), Jian Pei ($a_1$) and Jiawei Han ($a_2$) have a high relevance score, since they have both published papers with the keyword “mining” in the same venue (KDD). Relevance finds its applications in information retrieval, recommendation, and clustering: a researcher can retrieve papers that have high relevance in terms of topics and venues in DBLP; in YAGO, relevance facilitates the extraction of actors who are close to a given director. As another example, in entity resolution applications, duplicated HIN object pairs having high relevance scores (e.g., two different objects in an HIN referring to the same real-world person) can be identified and removed from the HIN.
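To make the data model concrete, the following is a minimal sketch of an HIN as a typed graph. The class name `HIN`, its methods, the `^-1` suffix for inverse edges, and the toy objects are all assumptions made for illustration; they are not taken from the tutorial or from any particular library.

```python
from collections import defaultdict


class HIN:
    """A tiny heterogeneous information network with typed nodes and edges."""

    def __init__(self):
        self.node_type = {}           # node id -> object type
        self.adj = defaultdict(set)   # (node id, edge type) -> neighbour ids

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, src, etype, dst):
        # Store the edge and its inverse so paths can be followed both ways.
        self.adj[(src, etype)].add(dst)
        self.adj[(dst, etype + "^-1")].add(src)

    def neighbours(self, node, etype):
        # .get avoids creating empty entries for missing keys.
        return self.adj.get((node, etype), set())


# Hypothetical toy data: two authors, two papers, one venue, one topic.
hin = HIN()
for node, ntype in [("a1", "author"), ("a2", "author"),
                    ("p1", "paper"), ("p2", "paper"),
                    ("KDD", "venue"), ("mining", "topic")]:
    hin.add_node(node, ntype)
for src, etype, dst in [("a1", "write", "p1"), ("a2", "write", "p2"),
                        ("p1", "publish", "KDD"), ("p2", "publish", "KDD"),
                        ("p1", "mention", "mining"), ("p2", "mention", "mining")]:
    hin.add_edge(src, etype, dst)

# Papers published at KDD, found by following the inverse "publish" edge.
print(sorted(hin.neighbours("KDD", "publish^-1")))  # ['p1', 'p2']
```

Keying the adjacency map on (node, edge type) keeps type information available at every step, which is what distinguishes an HIN from an untyped graph.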
Relevance Computation. In this tutorial, we will explore different ways of computing the relevance between two graph objects, for instance, neighborhood-based measures, such as common neighbors and Jaccard’s coefficient, and graph-theoretic measures based on random walks, such as Personalized PageRank and SimRank. These measures do not consider object and edge type information in an HIN. We will discuss the concept of meta paths. A meta path is a sequence of object types with edge types between them. Figure 1(b) illustrates a meta path $P_1$, which states that two authors ($A_1$ and $A_2$) are related by their publications in the same venue ($V$). Another meta path $P_2$ says that two authors have written papers containing the same topic ($T$). We will discuss several meta-path-based relevance measures, including PathCount, PathSim, and Path Constrained Random Walk (PCRW). These measures have been shown to be better than those that do not consider object and edge type information.
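As an illustration of the meta-path-based measures named above, the sketch below evaluates PathCount, PathSim, and PCRW on a toy HIN for the venue-based meta path (author, paper, venue, paper, author). The data, the function names, and the `^-1` notation for inverse edges are assumptions; the definitions follow the commonly cited forms: PathCount counts path instances, PathSim normalises the count for a symmetric meta path by the two self-path counts, and PCRW is the probability that a random walk restricted to the meta path ends at the target.

```python
from collections import defaultdict

# Hypothetical toy HIN: (source, edge type, target) triples.
EDGES = [
    ("a1", "write", "p1"), ("a1", "write", "p2"),
    ("a2", "write", "p3"), ("a3", "write", "p4"),
    ("p1", "publish", "KDD"), ("p2", "publish", "KDD"),
    ("p3", "publish", "KDD"), ("p4", "publish", "VLDB"),
]
adj = defaultdict(list)
for s, e, d in EDGES:
    adj[(s, e)].append(d)
    adj[(d, e + "^-1")].append(s)   # inverse edge

# Venue-based meta path, written as its sequence of edge types.
P1 = ["write", "publish", "publish^-1", "write^-1"]


def path_count(src, dst, meta_path):
    """Number of path instances from src to dst that follow meta_path."""
    counts = {src: 1}
    for etype in meta_path:
        nxt = defaultdict(int)
        for node, c in counts.items():
            for nb in adj.get((node, etype), []):
                nxt[nb] += c
        counts = nxt
    return counts.get(dst, 0)


def path_sim(x, y, meta_path):
    """PathSim for a symmetric meta path: 2*|x~>y| / (|x~>x| + |y~>y|)."""
    denom = path_count(x, x, meta_path) + path_count(y, y, meta_path)
    return 2 * path_count(x, y, meta_path) / denom if denom else 0.0


def pcrw(src, dst, meta_path):
    """Probability that a walk constrained to meta_path goes from src to dst."""
    probs = {src: 1.0}
    for etype in meta_path:
        nxt = defaultdict(float)
        for node, p in probs.items():
            nbs = adj.get((node, etype), [])
            for nb in nbs:
                nxt[nb] += p / len(nbs)   # uniform choice among typed neighbours
        probs = nxt
    return probs.get(dst, 0.0)


print(path_count("a1", "a2", P1))  # 2
print(path_sim("a1", "a2", P1))    # 0.8
```

In this toy network, a1 and a3 never share a venue, so all three measures return zero for that pair along this meta path.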
We will further discuss meta structures, recently proposed to depict the relationship between two graph objects. A meta structure is essentially a directed acyclic graph of object and edge types. Figure 1(b) illustrates a meta structure $S$, which depicts that two authors are relevant if they have published papers in the same venue, and have also mentioned the same topic. A meta path (e.g., $P_1$ or $P_2$) is a special case of a meta structure. However, a meta path fails to capture complex relationships of the kind that can be conveniently expressed by a meta structure (e.g., $S$). We will discuss how meta structures can be used to formulate three relevance definitions, as well as their efficient calculation.
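The difference between a meta path and a meta structure can be seen on a small example. The sketch below counts instances of a structure in which two authors' papers share both a venue and a topic, and contrasts it with the venue-only path count. This plain instance count is an illustrative assumption; it is not claimed to be one of the three relevance measures discussed in the tutorial, and the data and function names are hypothetical.

```python
from itertools import product

# Hypothetical toy HIN, stored as one adjacency map per edge type.
WRITE = {"a1": {"p1"}, "a2": {"p2"}, "a3": {"p3"}}
PUBLISH = {"p1": {"KDD"}, "p2": {"KDD"}, "p3": {"KDD"}}
MENTION = {"p1": {"mining"}, "p2": {"mining"}, "p3": {"efficient"}}


def venue_path_instances(x, y):
    """Instances of the venue-based meta path between authors x and y."""
    return sum(
        len(PUBLISH.get(p, set()) & PUBLISH.get(q, set()))
        for p, q in product(WRITE.get(x, set()), WRITE.get(y, set()))
    )


def structure_instances(x, y):
    """Instances where the two papers share a venue AND a topic."""
    total = 0
    for p, q in product(WRITE.get(x, set()), WRITE.get(y, set())):
        shared_venues = PUBLISH.get(p, set()) & PUBLISH.get(q, set())
        shared_topics = MENTION.get(p, set()) & MENTION.get(q, set())
        # Each (venue, topic) combination completes one instance.
        total += len(shared_venues) * len(shared_topics)
    return total


print(venue_path_instances("a1", "a3"))  # 1
print(structure_instances("a1", "a3"))   # 0
print(structure_instances("a1", "a2"))   # 1
```

Here a1 and a3 are related under the venue-based meta path, but not under the structure, because their papers mention different topics.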
Meta Path Discovery. There are often a huge number of meta paths between a pair of HIN objects. It can be very difficult, even for a domain expert, to identify the right meta paths. We will discuss a recently proposed meta path discovery algorithm, in which users provide example instances of source and target objects through a Query-by-Example paradigm in order to derive meta paths automatically. We will demonstrate a HIN search engine prototype based on this algorithm.

Query Recommendation. We will study the use of meta paths in query recommendation, where queries are suggested to web search users based on their previous query histories. As studied in prior work, it is possible to use a knowledge base (a HIN) and its related meta paths to perform effective query recommendation. The approach is especially useful for long-tail queries that rarely appear in query logs.
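The Query-by-Example idea behind meta path discovery can be sketched naively: enumerate the edge-type sequences of all bounded-length walks between each example pair, then rank the sequences by how many example pairs they connect. This brute-force enumeration is an assumption made for illustration and is not the discovery algorithm presented in the tutorial; the data, the function names, and the length bound are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy HIN: (source, edge type, target) triples.
EDGES = [
    ("a1", "write", "p1"), ("a2", "write", "p2"), ("a3", "write", "p3"),
    ("p1", "publish", "KDD"), ("p2", "publish", "KDD"), ("p3", "publish", "VLDB"),
    ("p1", "mention", "mining"), ("p2", "mention", "mining"), ("p3", "mention", "mining"),
]
adj = defaultdict(list)   # node -> [(edge type, neighbour)]
for s, e, d in EDGES:
    adj[s].append((e, d))
    adj[d].append((e + "^-1", s))   # inverse edge


def meta_paths_between(src, dst, max_len):
    """Edge-type sequences of all walks of length <= max_len from src to dst."""
    found = set()
    stack = [(src, ())]
    while stack:
        node, etypes = stack.pop()
        if node == dst and etypes:
            found.add(etypes)
        if len(etypes) < max_len:
            for etype, nb in adj[node]:
                stack.append((nb, etypes + (etype,)))
    return found


def discover(examples, max_len=4):
    """Rank candidate meta paths by the number of example pairs they connect."""
    support = Counter()
    for src, dst in examples:
        for mp in meta_paths_between(src, dst, max_len):
            support[mp] += 1
    return support.most_common()


# User-provided example pairs of (source author, target author).
EXAMPLES = [("a1", "a2"), ("a1", "a3")]
for meta_path, n in discover(EXAMPLES):
    print(n, " -> ".join(meta_path))
```

On this toy data the topic-based meta path connects both example pairs, while the venue-based one connects only the first, so the topic-based path is ranked first. The enumeration grows exponentially with the length bound, which is one reason a dedicated discovery algorithm is needed on real HINs.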
2 Proposed Schedule

The following is our proposed schedule for the 90-min tutorial.

– Introduction (15 min). We will discuss the basic model of HIN, and discuss applications based on it, such as search, relevance computation, query recommendation, and data integration (10 min). We will also introduce meta paths, a fundamental HIN analysis tool, and give an overview of the tutorial (5 min).
– Main contents (60 min). Next, we will introduce meta paths and how they facilitate the computation of various relevance measures (10 min). We then explain the process of discovering meta paths (15 min). We discuss a novel query recommendation framework based on meta paths (15 min). We will also present meta structures, which are the latest development of meta paths (15 min). We will demonstrate a HIN search engine prototype based on meta paths (5 min).
– Conclusions (15 min). We will conclude the tutorial and discuss future directions (5 min). The rest of the time will be dedicated to Q&A (10 min).
3 Intended Audience
The tutorial is designed for researchers interested in the latest developments in the field of HINs, especially regarding the use of meta paths in novel applications. The HIN search demonstration will give software practitioners insight into developing recommendation facilities for HINs.
4 Biography of Presenters
Reynold Cheng is an Associate Professor in the Department of Computer Science at the University of Hong Kong. He obtained his PhD from the Department of Computer Science of Purdue University in 2005. He was granted an Outstanding Young Researcher Award 2011–12 by HKU. He was the recipient of the 2010 Research Output Prize in the Department of Computer Science of HKU. He also received the U21 Fellowship in 2011. He received the Performance Reward in 2006 and 2007, awarded by the Hong Kong Polytechnic University. He is a member of the IEEE, the ACM, and ACM SIGMOD. He is an editorial board member of TKDE, DAPD and IS, and was a guest editor for TKDE, DAPD, and Geoinformatica. He is an area chair of ICDE 2017, senior PC member of BigData 2017 and DASFAA 2015, PC co-chair of APWeb 2015, area chair for CIKM 2014, and workshop co-chair of ICDE 2014. He received an Outstanding Service Award in CIKM 2009. He has served as a PC member and reviewer for top conferences and journals.
Zhipeng Huang is a 2nd year Ph.D. student in the CS department of HKU, supervised by Prof. Nikos Mamoulis and Dr. Reynold Cheng. He received his bachelor's degree from the EECS department of PKU in 2015. His research interests cover data mining, data management and data cleaning.
Yudian Zheng is a 4th year Ph.D. student in the CS department of HKU, supervised by Dr. Reynold Cheng. Yudian's research interests cover crowdsourcing, data management and data cleaning. He has published full research papers in well-established database and data mining conferences and journals, including SIGMOD, VLDB, KDD, WWW, ICDE, and TKDE. He has also taken internships at Microsoft Research and Google Research.
Jing Yan is a 1st year MPhil student supervised by Dr. Reynold Cheng in the CS department of HKU. His research interests include data management and data mining, with emphasis on knowledge graphs and data cleaning.
Ka Yu Wong is currently an MSc student in the CS department of HKU.
Eddie Ng is currently an MSc student in the CS department of HKU.
Acknowledgements. Reynold Cheng, Zhipeng Huang, Yudian Zheng, and Jing Yan were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 17229116 and 17205115) and the University of Hong Kong (Projects 102009508 and 104004129).
References
1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
2. Huang, Z., Cautis, B., Cheng, R., Zheng, Y.: KB-enabled query recommendation for long-tail queries. In: CIKM, pp. 2107–2112 (2016)
3. Huang, Z., Zheng, Y., Cheng, R., Sun, Y., Mamoulis, N., Li, X.: Meta structure:
4. Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Mach. Learn. 81(1), 53–67 (2010)
5. Ley, M.: DBLP Computer Science Bibliography (2005)
6. Meng, C., Cheng, R., Maniu, S., Senellart, P., Zhang, W.: Discovering meta-paths in large heterogeneous information networks. In: WWW, pp. 754–764 (2015)
7. Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: Exemplar queries: give me an example of what you need. PVLDB 7(5), 365–376 (2014)
8. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)
9. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: Pathsim: meta path-based top-k similarity search in heterogeneous information networks. In: PVLDB, pp. 992–1003 (2011)
10. Yu, X., Ren, X., Sun, Y., Sturt, B., Khandelwal, U., Gu, Q., Norick, B., Han, J.: Recommendation in heterogeneous information networks with implicit user feedback. In: RecSys, pp. 347–350 (2013)

Spatial Data Processing and Data Quality
TrajSpark: A Scalable and Efficient In-Memory Management System for Big Trajectory Data
Zhigang Zhang, Cheqing Jin (✉), Jiali Mao, Xiaolin Yang, and Aoying Zhou
School of Data Science and Engineering,
East China Normal University, Shanghai, China
{zgzhang,jlmao1231,xlyang}@stu.ecnu.edu.cn,
{cqjin,ayzhou}@sei.ecnu.edu.cn
Abstract. The widespread application of mobile positioning devices has generated big trajectory data. Existing disk-based trajectory management systems can no longer provide scalable and low latency query services. In view of that, we present TrajSpark, a distributed in-memory system that consistently offers efficient management of trajectory data. TrajSpark introduces a new abstraction called IndexTRDD to manage trajectory segments, and exploits a global and local indexing mechanism to accelerate trajectory queries. Furthermore, to alleviate the essential partitioning overhead, it adopts a time-decay model to monitor changes in the data distribution and updates the data-partition structure adaptively. This model avoids repartitioning existing data when a new batch of data arrives. Extensive experiments with three types of trajectory queries on both real and synthetic datasets demonstrate that TrajSpark outperforms state-of-the-art systems.
Keywords: Big trajectory data · In-memory · Low latency query
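The benefit of a global index over a full scan, as mentioned in the abstract, can be illustrated with a small stand-alone sketch. This is not TrajSpark's IndexTRDD and uses no Spark API: it assumes a uniform grid as the global index (cell to partition), and a plain filter inside each candidate partition stands in for a local index. The cell width `CELL`, the `(id, x, y)` point format, and the function names are all hypothetical.

```python
from collections import defaultdict

CELL = 10.0  # assumed grid cell width of the global index


def cell_of(x, y):
    """Map a coordinate to the grid cell (partition key) that contains it."""
    return (int(x // CELL), int(y // CELL))


def build_partitions(points):
    """Group (id, x, y) points by grid cell; each cell acts as one partition."""
    parts = defaultdict(list)
    for point in points:
        parts[cell_of(point[1], point[2])].append(point)
    return parts


def range_query(parts, xmin, ymin, xmax, ymax):
    """Return (hits, scanned): points inside the rectangle, and how many
    points were actually inspected after pruning by the global grid."""
    cx0, cy0 = cell_of(xmin, ymin)
    cx1, cy1 = cell_of(xmax, ymax)
    hits, scanned = [], 0
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            # Only partitions overlapping the query rectangle are visited.
            for point in parts.get((cx, cy), ()):
                scanned += 1
                if xmin <= point[1] <= xmax and ymin <= point[2] <= ymax:
                    hits.append(point)
    return hits, scanned


if __name__ == "__main__":
    points = [("a", 1, 1), ("b", 12, 3), ("c", 25, 25), ("d", 8, 9), ("e", 55, 5)]
    parts = build_partitions(points)
    hits, scanned = range_query(parts, 0, 0, 15, 10)
    print(sorted(p[0] for p in hits), "scanned", scanned, "of", len(points))
```

Without the grid, every point would have to be inspected for every query; with it, only the partitions whose cells overlap the query rectangle are touched.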
1 Introduction
Recently, with the explosive development of positioning techniques and the popular use of intelligent electronic devices, trajectory data of MOs (Moving Objects) has accumulated rapidly in many applications, such as location-based services (LBS) and geographical information systems (GIS). For example, DiDi, the largest one-stop consumer transportation platform in China, now has 1.5 million registered active drivers and provides services for more than 300 million passengers. The total length of all trajectories generated on this platform reached around 13 billion kilometers in 2015. Moreover, the volume of trajectory data is surging: in March 2016, the number of trajectories generated in one day already exceeded 10 million. It is challenging to provide real-time service over such data. However, as almost all existing trajectory management systems are disk-oriented (e.g., TrajStore), they cannot support low latency query services upon big trajectory data.
Recently, in-memory computing systems have been widely used to provide low latency query services. One prominent example is Spark, a distributed in-memory computing system. Spark provides a data abstraction called RDDs (Resilient Distributed Datasets) to maintain a collection of objects that are partitioned across a cluster of machines. Users can manipulate RDDs conveniently through a set of predefined operations. However, Spark lacks an indexing mechanism upon RDDs and needs to scan the whole dataset for a given query. Recently, some Spark-based system prototypes have been proposed to process big spatial data, including SpatialSpark and Simba. Amongst them, SpatialSpark implements the spatial join query on