
Lei Chen · Christian S. Jensen · Cyrus Shahabi · Xiaochun Yang · Xiang Lian (Eds.)

LNCS 10366 Web and Big Data First International Joint Conference, APWeb-WAIM 2017 Beijing, China, July 7–9, 2017 Proceedings, Part I

Lecture Notes in Computer Science 10366

Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison Lancaster University, Lancaster, UK

Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA

Josef Kittler University of Surrey, Guildford, UK

Jon M. Kleinberg Cornell University, Ithaca, NY, USA

Friedemann Mattern ETH Zurich, Zurich, Switzerland

John C. Mitchell Stanford University, Stanford, CA, USA

Moni Naor Weizmann Institute of Science, Rehovot, Israel

C. Pandu Rangan Indian Institute of Technology, Madras, India

Bernhard Steffen TU Dortmund University, Dortmund, Germany

Demetri Terzopoulos University of California, Los Angeles, CA, USA

Doug Tygar University of California, Berkeley, CA, USA

Gerhard Weikum Max Planck Institute for Informatics, Saarbrücken, Germany

Editors

Lei Chen Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

Christian S. Jensen Computer Science, Aarhus University, Aarhus N, Denmark

Cyrus Shahabi Computer Science, University of Southern California, Los Angeles, CA, USA

Xiaochun Yang Northeastern University, Shenyang, China

Xiang Lian Kent State University, Kent, OH, USA

ISSN 0302-9743

ISSN 1611-3349 (electronic) Lecture Notes in Computer Science

ISBN 978-3-319-63578-1

ISBN 978-3-319-63579-8 (eBook)

DOI 10.1007/978-3-319-63579-8

Library of Congress Control Number: 2017947034

LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature

The registered company is Springer International Publishing AG

Preface

This volume (LNCS 10366) and its companion volume (LNCS 10367) contain the proceedings of the first Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, called APWeb-WAIM. This new joint conference aims to attract participants from different scientific communities as well as from industry, and not merely from the Asia Pacific region, but also from other continents. The objective is to enable the sharing and exchange of ideas, experiences, and results in the areas of World Wide Web and big data, thus covering Web technologies, database systems, information management, software engineering, and big data. The first APWeb-WAIM conference was held in Beijing during July 7–9, 2017.

As a new Asia-Pacific flagship conference focusing on research, development, and applications in relation to Web information management, APWeb-WAIM builds on the successes of APWeb and WAIM: APWeb was previously held in Beijing (1998), Hong Kong (1999), Xi’an (2000), Changsha (2001), Xi’an (2003), Hangzhou (2004), Shanghai (2005), Harbin (2006), Huangshan (2007), Shenyang (2008), Suzhou (2009), Busan (2010), Beijing (2011), Kunming (2012), Sydney (2013), Changsha (2014), Guangzhou (2015), and Suzhou (2016); and WAIM was held in Shanghai (2000), Xi’an (2001), Beijing (2002), Chengdu (2003), Dalian (2004), Hangzhou (2005), Hong Kong (2006), Huangshan (2007), Zhangjiajie (2008), Suzhou (2009), Jiuzhaigou (2010), Wuhan (2011), Harbin (2012), Beidaihe (2013), Macau (2014), Qingdao (2015), and Nanchang (2016). With the fast development of Web-related technologies, we expect that APWeb-WAIM will become an increasingly popular forum that brings together outstanding researchers and developers in the field of Web and big data from around the world.

The high-quality program documented in these proceedings would not have been possible without the authors who chose APWeb-WAIM for disseminating their findings. Out of 240 submissions to the research track and 19 to the demonstration track, the conference accepted 44 regular research papers (18%), 32 short research papers, and ten demonstrations. The contributed papers address a wide range of topics, such as spatial data processing and data quality, graph data processing, data mining, privacy and semantic analysis, text and log data management, social networks, data streams, query processing and optimization, topic modeling, machine learning, recommender systems, and distributed data processing.

The technical program also included keynotes by Profs. Sihem Amer-Yahia (National Center for Scientific Research, CNRS, France), Masaru Kitsuregawa (National Institute of Informatics, NII, Japan), and Mohamed Mokbel (University of Minnesota, Twin Cities, USA) as well as tutorials by Prof. Reynold Cheng (The University of Hong Kong, SAR China), Prof. Guoliang Li (Tsinghua University, China), Prof. Arijit Khan (Nanyang Technological University, Singapore), and Prof. Yu Zheng (Microsoft Research Asia, China). We are grateful to these distinguished scientists for their invaluable contributions to the conference program.

Because APWeb-WAIM is a new joint conference, teamwork is particularly important for its success. We are deeply thankful to the Program Committee members and the external reviewers for lending their time and expertise to the conference. Special thanks go to the local Organizing Committee led by Jun He, Yongxin Tong, and Shimin Chen. Thanks also go to the workshop co-chairs (Matthias Renz, Shaoxu Song, and Yang-Sae Moon), demo co-chairs (Sebastian Link, Shuo Shang, and Yoshiharu Ishikawa), industry co-chairs (Chen Wang and Weining Qian), tutorial co-chairs (Andreas Züfle and Muhammad Aamir Cheema), sponsorship chair (Junjie Yao), proceedings co-chairs (Xiang Lian and Xiaochun Yang), and publicity co-chairs (Hongzhi Yin, Lei Zou, and Ce Zhang). Their efforts were essential to the success of the conference. Last but not least, we wish to express our gratitude to the Webmaster (Zhao Cao) for all the hard work and to our sponsors, who generously supported the smooth running of the conference.

We hope you enjoy the exciting program of APWeb-WAIM 2017 as documented in these proceedings.

June 2017

Xiaoyong Du
Beng Chin Ooi
M. Tamer Özsu
Bin Cui
Lei Chen
Christian S. Jensen
Cyrus Shahabi

Organization

Organizing Committee

General Co-chairs

Xiaoyong Du Renmin University of China, China
Beng Chin Ooi National University of Singapore, Singapore
M. Tamer Özsu University of Waterloo, Canada

Program Co-chairs

Lei Chen Hong Kong University of Science and Technology, China
Christian S. Jensen Aalborg University, Denmark
Cyrus Shahabi The University of Southern California, USA

Workshop Co-chairs

Matthias Renz George Mason University, USA
Shaoxu Song Tsinghua University, China
Yang-Sae Moon Kangwon National University, South Korea

Demo Co-chairs

Sebastian Link The University of Auckland, New Zealand
Shuo Shang King Abdullah University of Science and Technology, Saudi Arabia
Yoshiharu Ishikawa Nagoya University, Japan

Industrial Co-chairs

Chen Wang Innovation Center for Beijing Industrial Big Data, China
Weining Qian East China Normal University, China

Proceedings Co-chairs

Xiang Lian Kent State University, USA
Xiaochun Yang Northeastern University, China

Tutorial Co-chairs

Andreas Züfle George Mason University, USA
Muhammad Aamir Cheema Monash University, Australia

ACM SIGMOD China Lectures Co-chairs

Guoliang Li Tsinghua University, China
Hongzhi Wang Harbin Institute of Technology, China

Publicity Co-chairs

Hongzhi Yin The University of Queensland, Australia
Lei Zou Peking University, China
Ce Zhang Eidgenössische Technische Hochschule ETH, Switzerland

Local Organization Co-chairs

Jun He Renmin University of China, China
Yongxin Tong Beihang University, China
Shimin Chen Chinese Academy of Sciences, China

Sponsorship Chair

Junjie Yao East China Normal University, China

Web Chair

Zhao Cao Beijing Institute of Technology, China

Steering Committee Liaison

Yanchun Zhang Victoria University, Australia

Senior Program Committee

Dieter Pfoser George Mason University, USA
Ilaria Bartolini University of Bologna, Italy
Jianliang Xu Hong Kong Baptist University, SAR China
Mario Nascimento University of Alberta, Canada
Matthias Renz George Mason University, USA
Mohamed Mokbel University of Minnesota, USA
Ralf Hartmut Güting Fernuniversität in Hagen, Germany
Seungwon Hwang Yonsei University, South Korea
Sourav S. Bhowmick Nanyang Technological University, Singapore
Tingjian Ge University of Massachusetts Lowell, USA
Vincent Oria New Jersey Institute of Technology, USA
Walid Aref Purdue University, USA
Wook-Shin Han Pohang University of Science and Technology, Korea
Yoshiharu Ishikawa Nagoya University, Japan

Program Committee

Alex Delis University of Athens, Greece
Alex Thomo University of Victoria, Canada

Aviv Segev Korea Advanced Institute of Science and Technology, South Korea
Baoning Niu Taiyuan University of Technology, China
Bin Cui Peking University, China
Bin Yang Aalborg University, Denmark
Carson Leung University of Manitoba, Canada
Chih-Hua Tai National Taipei University, China
Cuiping Li Renmin University of China, China
Daniele Riboni University of Cagliari, Italy
Defu Lian University of Electronic Science and Technology of China, China
Dejing Dou University of Oregon, USA
Demetris Zeinalipour Max Planck Institute for Informatics, Germany and University of Cyprus, Cyprus
Dhaval Patel Indian Institute of Technology Roorkee, India
Dimitris Sacharidis Technische Universität Wien, Vienna, Austria
Fei Chiang McMaster University, Canada
Ganzhao Yuan South China University of Technology, China
Giovanna Guerrini Universita di Genova, Italy
Guoliang Li Tsinghua University, China
Guoqiong Liao Jiangxi University of Finance and Economics, China
Hailong Sun Beihang University, China
Han Su University of Southern California, USA
Hiroaki Ohshima Kyoto University, Japan
Hong Chen Renmin University of China, China
Hongyan Liu Tsinghua University, China
Hongzhi Wang Harbin Institute of Technology, China
Hongzhi Yin The University of Queensland, Australia
Hua Li Aalborg University, Denmark
Hua Lu Aalborg University, Denmark
Hua Wang Victoria University, Melbourne, Australia
Hua Yuan University of Electronic Science and Technology of China, China
Iulian Sandu Popa Inria and PRiSM Lab, University of Versailles Saint-Quentin, France
James Cheng Chinese University of Hong Kong, SAR China
Jeffrey Xu Yu Chinese University of Hong Kong, SAR China
Jiaheng Lu University of Helsinki, Finland
Jiajun Liu Renmin University of China, China
Jialong Han Nanyang Technological University, Singapore
Jian Yin Zhongshan University, China
Jianliang Xu Hong Kong Baptist University, SAR China
Jianmin Wang Tsinghua University, China
Jiannan Wang Simon Fraser University, Canada
Jianting Zhang City College of New York, USA
Jinchuan Chen Renmin University of China, China
Ju Fan National University of Singapore, Singapore
Jun Gao Peking University, China
Junfeng Zhou Yanshan University, China
Junhu Wang Griffith University, Australia
Kai Zeng University of California, Berkeley, USA
Karine Zeitouni PRISM University of Versailles St-Quentin, Paris, France
Kyuseok Shim Seoul National University, Korea
Lei Zou Peking University, China
Lei Chen Hong Kong University of Science and Technology, SAR China
Leong Hou U. University of Macau, SAR China
Liang Hong Wuhan University, China
Lianghuai Yang Zhejiang University of Technology, China
Long Guo Peking University, China
Man Lung Yiu Hong Kong Polytechnic University, SAR China
Markus Endres University of Augsburg, Germany
Maria Damiani University of Milano, Italy
Meihui Zhang Singapore University of Technology and Design, Singapore
Mihai Lupu Vienna University of Technology, Austria
Mirco Nanni ISTI-CNR Pisa, Italy
Mizuho Iwaihara Waseda University, Japan
Mohammed Eunus Ali Bangladesh University of Engineering and Technology, Bangladesh
Peer Kroger Ludwig-Maximilians-University of Munich, Germany
Peiquan Jin University of Science and Technology of China
Peng Wang Fudan University, China
Yaokai Feng Kyushu University, Japan
Wookey Lee Inha University, Korea
Raymond Chi-Wing Wong Hong Kong University of Science and Technology, SAR China
Richong Zhang Beihang University, China
Sanghyun Park Yonsei University, Korea
Sangkeun Lee Oak Ridge National Laboratory, USA
Sanjay Madria Missouri University of Science and Technology, USA
Shengli Wu Jiangsu University, China
Shi Gao University of California, Los Angeles, USA
Shimin Chen Chinese Academy of Sciences, China
Shuai Ma Beihang University, China
Shuo Shang King Abdullah University of Science and Technology, Saudi Arabia
Sourav S. Bhowmick Nanyang Technological University, Singapore
Stavros Papadopoulos Intel Labs and MIT, USA
Takahiro Hara Osaka University, Japan
Tieyun Qian Wuhan University, China
Ting Deng Beihang University, China
Tru Cao Ho Chi Minh City University of Technology, Vietnam
Vicent Zheng Advanced Digital Sciences Center, Singapore
Vinay Setty Aalborg University, Denmark
Wee Ng Institute for Infocomm Research, Singapore
Wei Wang University of New South Wales, Australia
Weining Qian East China Normal University, China
Weiwei Sun Fudan University, China
Wei-Shinn Ku Auburn University, USA
Wenjia Li New York Institute of Technology, USA
Wen Zhang Wuhan University, China
Wolf-Tilo Balke Braunschweig University of Technology, Germany
Xiang Lian Kent State University, USA
Xiang Zhao National University of Defense Technology, China
Xiangliang Zhang King Abdullah University of Science and Technology, Saudi Arabia
Xiangmin Zhou RMIT University, Australia
Xiaochun Yang Northeastern University, China
Xiaofeng He East China Normal University, China
Xiaoyong Du Renmin University of China, China
Xike Xie University of Science and Technology of China, China
Xingquan Zhu Florida Atlantic University, USA
Xuan Zhou Renmin University of China, China
Yanghua Xiao Fudan University, China
Yang-Sae Moon Kangwon National University, South Korea
Yasuhiko Morimoto Hiroshima University, Japan
Yijie Wang National University of Defense Technology, China
Yingxia Shao Peking University, China
Yong Zhang Tsinghua University, China
Yongxin Tong Beihang University, China
Yoshiharu Ishikawa Nagoya University, Japan
Yu Gu Northeastern University, China
Yuan Fang Institute for Infocomm Research, Singapore
Yueguo Chen Renmin University of China, China
Yunjun Gao Zhejiang University, China
Zakaria Maamar Zayed University, United Arab Emirates
Zhaonian Zou Harbin Institute of Technology, China
Zhengjia Fu Advanced Digital Sciences Center, Singapore
Zhiguo Gong University of Macau, SAR China
Zouhaier Brahmia University of Sfax, Tunisia

Keynotes

A Holistic View of Human Factors in Crowdsourcing

Sihem Amer-Yahia

CNRS, University of Grenoble Alpes, Grenoble, France

sihem.amer-yahia@cnrs.fr

Abstract. For over 40 years, organization studies have examined human factors in physical workplaces and their influence on the ability of an individual to perform a task, or a set of tasks, alone or in collaboration with others. In a virtual marketplace, the crowd is typically volatile, its arrival and departure asynchronous, and its levels of attention and accuracy diverse. This has generated a wealth of new research ranging from studying workers’ fatigue in task completion to examining the role of motivation in task assignment. I will review such work and argue that we need a holistic view to take full advantage of human factors such as skills, expected wage and motivation, in improving the performance of a crowdsourcing platform.

Experience on XXX Health such as Earth Health and Human Health Through Big Data

Masaru Kitsuregawa 1,2

1 The University of Tokyo, Tokyo, Japan

2 National Institute of Informatics, Tokyo, Japan

kitsure@tkl.iis.u-tokyo.ac.jp

Abstract. We have been working in the area of so-called ‘Health’. In this talk, we present our experiences in solving problems of earth environmental health and human health with big data system technologies. We ask what type of platform would be suitable for societal health as a whole.

Thinking Spatial

Mohamed Mokbel

Department of Computer Science and Engineering, University of Minnesota

mokbel@umn.edu

Abstract. The need to manage and analyze spatial data is hampered by the lack of specialized systems to support such data. System builders mostly build general-purpose systems that are generic enough to handle any kind of attributes. Whenever there is a pressing need for spatial data support, it is treated as an afterthought that can be addressed by adding new data types, extensions, or spatial cartridges to existing systems. This talk advocates for dealing with spatial data as first-class citizens, and for always thinking spatially whenever it comes to system design. This is well justified by the proliferation of location-based applications that rely mainly on spatial data. The talk will go through various system designs and show how they would be different if we had designed them while thinking spatially. Examples of these systems include database systems, big data systems, recommender systems, social networks, and crowdsourcing.

Contents – Part I

Tutorials

Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, and Eddie Ng

Spatial Data Processing and Data Quality

Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou

Dawei Gao, Yongxin Tong, Yudian Ji, and Ke Xu

Jianguo Wu, Jianwen Xiang, Dongdong Zhao, Huanhuan Li, Qing Xie, and Xiaoyi Hu

Zizhe Xie, Qizhi Liu, and Zhifeng Bao

Graph Data Processing

Chuxu Zhang, Lu Yu, Chuang Liu, Zi-Ke Zhang, and Tao Zhou

Yanxia Xu, Jinjing Huang, An Liu, Zhixu Li, Hongzhi Yin, and Lei Zhao

Guohua Li, Weixiong Rao, and Zhongxiao Jin

Wei Shi, Weiguo Zheng, Jeffrey Xu Yu, Hong Cheng, and Lei Zou

Qiang Xu, Xin Wang, Junhu Wang, Yajun Yang, and Zhiyong Feng

Qun Liao, Lei Sun, He Du, and Yulu Yang

Data Mining, Privacy and Semantic Analysis

Chunlin Zhong, Yi Yu, Suhua Tang, Shin’ichi Satoh, and Kai Xing

Ningning Ma, Hai-Tao Zheng, and Xi Xiao

Shushu Liu, An Liu, Zhixu Li, Guanfeng Liu, Jiajie Xu, Lei Zhao, and Kai Zheng

Jerry Chun-Wei Lin, Jiexiong Zhang, and Philippe Fournier-Viger

Ting Huang, Ruizhang Huang, Bowei Liu, and Yingying Yan

Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, and Han-Chieh Chao

Text and Log Data Management

Ming Chen, Lin Li, and Qing Xie

Shengluan Hou, Yu Huang, Chaoqun Fei, Shuhan Zhang, and Ruqian Lu


Chunxia Zhang, Sen Wang, Jiayu Wu, and Zhendong Niu

Zhenyu Zhao, Guozheng Rao, and Zhiyong Feng

Jinwei Guo, Jiahao Wang, Peng Cai, Weining Qian, Aoying Zhou, and Xiaohang Zhu

Huan Zhou, Huiqi Hu, Tao Zhu, Weining Qian, Aoying Zhou, and Yukun He

Social Networks

Xia Lv, Peiquan Jin, Lin Mu, Shouhong Wan, and Lihua Yue

Yang Wu, Cheng Long, Ada Wai-Chee Fu, and Zitong Chen

Kaixia Li, Zhao Cao, and Dacheng Qu

Yu Qiao, Jun Wu, Lei Zhang, and Chongjun Wang

Naoki Kito, Xiangmin Zhou, Dong Qin, Yongli Ren, Xiuzhen Zhang, and James Thom

Tianchen Zhu, Zhaohui Peng, Xinghua Wang, and Xiaoguang Hong

Data Mining and Data Streams

Song Wu, Xingjun Wang, Hai Jin, and Haibao Chen

Qiong Li, Xiaowang Zhang, and Zhiyong Feng

Xiutao Shi, Liqiang Wang, Shijun Liu, Yafang Wang, Li Pan, and Lei Wu

Hui Wen, Minglan Li, and Zhili Ye

Chuxu Zhang, Chuang Liu, Lu Yu, Zi-Ke Zhang, and Tao Zhou

Yangming Liu, Suyun Zhao, Hong Chen, Cuiping Li, and Yanmin Lu

Query Processing

Xiaoying Zhang, Hui Peng, Lei Dong, Hong Chen, and Hui Sun

Kento Sugiura and Yoshiharu Ishikawa

Zhijin Lv, Ben Chen, and Xiaohui Yu

Ying Wang, Ming Zhong, Yuanyuan Zhu, Xuhui Li, and Tieyun Qian

Yuan Tian, Peiquan Jin, Shouhong Wan, and Lihua Yue

Zepeng Fang, Chen Lin, and Yun Liang

Topic Modeling

Yunfeng Chen, Lei Zhang, Xin Li, Yu Zong, Guiquan Liu, and Enhong Chen

Lin Xiao, Zhang Min, and Zhang Yongfeng


Linjing Wei, Heyan Huang, Yang Gao, Xiaochi Wei, and Chong Feng

Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang, and Bowei Liu

Ming Xu, Yang Cai, Hesheng Wu, Chongjun Wang, and Ning Li

Jinjing Zhang, Jing Wang, and Li Li

Author Index

Contents – Part II

Machine Learning

Qi Ye, Changlei Zhu, Gang Li, and Feng Wang

Xiang Li, Rui Yan, and Ming Zhang

Hai-Tao Zheng, Xin Yao, Yong Jiang, Shu-Tao Xia, and Xi Xiao

Recommendation Systems

Yan Zhao, Jia Zhu, Mengdi Jia, Wenyan Yang, and Kai Zheng

Qinyong Wang, Hongzhi Yin, and Hao Wang

Fei Yu, Zhijun Li, Shouxu Jiang, and Xiaofei Yang

Wenli Yu, Li Li, Jingyuan Wang, Dengbao Wang, Yong Wang, Zhanbo Yang, and Min Huang

Shuhei Kishida, Seiji Ueda, Atsushi Keyaki, and Jun Miyazaki

Xiaolu Xing, Chaofeng Sha, and Junyu Niu

Distributed Data Processing and Applications

Chenhao Yang, Ben He, and Jungang Xu

Ningnan Zhou, Xuan Zhou, Xiao Zhang, Xiaoyong Du, and Shan Wang

Tinghai Pang, Lei Duan, Jyrki Nummenmaa, Jie Zuo, and Peng Zhang

Shiliang Fan, Yubin Yang, Wenyang Lu, and Ping Song

Tao Zhu, Huiqi Hu, Weining Qian, Aoying Zhou, Mengzhan Liu, and Qiong Zhao

Hao Liu, Jiang Xiao, Xianjun Guo, Haoyu Tan, Qiong Luo, and Lionel M. Ni

Machine Learning and Optimization

Guangxuan Song, Wenwen Qu, Yilin Wang, and Xiaoling Wang

Donghui Wang, Peng Cai, Weining Qian, Aoying Zhou, Tianze Pang, and Jing Jiang

Tao Xie, Bin Wu, and Bai Wang

Yixuan Liu, Zihao Gao, and Mizuho Iwaihara


Jun Yin and Xiaoming Li

Yongheng Wang, Guidan Chen, and Zengwang Wang

Hai-Tao Zheng, Zhuren Wang, and Xi Xiao

Demo Papers

Jiawei Jiang, Ming Huang, Jie Jiang, and Bin Cui

Dezhi Zhang, Peiquan Jin, Xiaoliang Wang, Chengcheng Yang, and Lihua Yue

Yihai Xi, Ning Wang, Xiaoyu Wu, Yuqing Bao, and Wutong Zhou

Xuguang Bao, Lizhen Wang, and Qing Xiao

Yuan Liu, Xin Wang, and Qiang Xu

Longlong Xu, Wutao Lin, Xiaorong Wang, Zhenhui Xu, Wei Chen, and Tengjiao Wang

Leonard K.M. Poon, Chun Fai Leung, Peixian Chen, and Nevin L. Zhang

Mingyan Teng, Qiao Sun, Buqiao Deng, Lei Sun, and Xiongpai Qin

Qiao Sun, Xiongpai Qin, Buqiao Deng, and Wei Cui

Weiwei Wang and Jianqiu Xu

Author Index

Tutorials

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, and Eddie Ng

University of Hong Kong, Pokfulam Road, Pokfulam, Hong Kong

{ckcheng,zphuang,ydzheng2,jyan}@cs.hku.hk, kywong2@connect.hku.hk, ngheii@gmail.com

http://www.cs.hku.hk/~ckcheng/

Abstract. A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures, such as PCRW, PathSim, and HeteSim, can be used in various applications, including information retrieval, entity resolution, and product recommendation. These metrics are based on the use of meta-paths, essentially a sequence of node classes and edge types between two nodes in a HIN. In this tutorial, we will give a detailed review of meta-paths, as well as how they are used to define relevance. In a large and complex HIN, retrieving meta paths manually can be complex, expensive, and error-prone. Hence, we will explore systematic methods for finding meta paths. In particular, we will study a solution based on the Query-by-Example (QBE) paradigm, which allows us to discover meta-paths in an effective and efficient manner.

We further generalise the notion of meta path to “meta structure”, which is a directed acyclic graph of object types with edge types connecting them. A meta structure, which is more expressive than a meta path, can describe a complex relationship between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We will discuss three relevance measures based on meta structures. Due to the computational complexity of these measures, we also study an algorithm with data structures proposed to support their evaluation. Finally, we will examine solutions for performing query recommendation based on meta-paths. We will also discuss future research directions.

1 Background

Heterogeneous information networks (HINs), such as DBLP and DBpedia, have recently received a lot of attention. These data sources, containing a vast number of inter-related facts, facilitate the discovery of interesting knowledge. Figure 1(a) illustrates an HIN, which describes the relationships among objects of different types (authors, papers, venues, and topics). For example, Jiawei Han ($a_2$) has written a VLDB paper ($p_{2,2}$), which mentions the topic “efficient” ($t_3$).

Fig. 1. HIN, meta paths, and meta structures. (Panel (a): an example HIN with object types author, paper, venue, and topic, and edge types write, mention, and publish. Panel (b): meta paths $P_1$ and $P_2$ and meta structure $S$.)

Given two HIN objects a and b, the evaluation of their relevance is of fundamental importance. Relevance quantifies the degree of closeness between a and b. In Fig. 1(a), Jian Pei and Jiawei Han ($a_2$) have a high relevance score, since they have both published papers with keyword “mining” in the same venue (KDD). Relevance finds its applications in information retrieval, recommendation, and clustering: a researcher can retrieve papers that have high relevance in terms of topics and venues in DBLP; in YAGO, relevance facilitates the extraction of actors who are close to a given director. As another example, in entity resolution applications, duplicated HIN object pairs having high relevance scores (e.g., two different objects in an HIN referring to the same real-world person) can be identified and removed from the HIN.
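To make the data model concrete, the following is a minimal sketch of how a typed graph of this kind could be stored. The class name `HIN`, its methods, the node identifiers, and the `^-1` naming for reverse edges are all assumptions made for illustration; only the facts about Jiawei Han's VLDB paper and the topic “efficient” come from the text, and the direction of the publish edge (paper to venue) is likewise assumed.

```python
from collections import defaultdict


class HIN:
    """A tiny heterogeneous information network: typed nodes, typed edges."""

    def __init__(self):
        self.node_type = {}           # node id -> object type
        self.adj = defaultdict(set)   # (node id, edge type) -> neighbor ids

    def add_node(self, node, obj_type):
        self.node_type[node] = obj_type

    def add_edge(self, src, edge_type, dst):
        # Store both directions; the reverse direction is named "<type>^-1".
        self.adj[(src, edge_type)].add(dst)
        self.adj[(dst, edge_type + "^-1")].add(src)

    def neighbors(self, node, edge_type):
        # .get avoids creating empty entries in the defaultdict on lookup.
        return self.adj.get((node, edge_type), set())


hin = HIN()
hin.add_node("a2", "author")    # Jiawei Han
hin.add_node("p2_2", "paper")   # his VLDB paper
hin.add_node("VLDB", "venue")
hin.add_node("t3", "topic")     # the topic "efficient"
hin.add_edge("a2", "write", "p2_2")
hin.add_edge("p2_2", "publish", "VLDB")   # assumed direction: paper -> venue
hin.add_edge("p2_2", "mention", "t3")

# Topics mentioned by papers that a2 has written.
papers = hin.neighbors("a2", "write")
topics = {t for p in papers for t in hin.neighbors(p, "mention")}
```

Keeping the edge type in the adjacency key means that a traversal can follow only the edge types it is asked to follow, which is the property that type-aware relevance measures rely on.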

Relevance Computation. In this tutorial, we will explore different ways of computing the relevance between two graph objects, for instance, neighborhood-based measures, such as common neighbors and Jaccard’s coefficient, and graph-theoretic measures based on random walks, such as Personalized PageRank and SimRank. These measures do not consider object and edge type information in an HIN. We will discuss the concept of meta paths. A meta path is a sequence of object types with edge types between them. Figure 1(b) illustrates a meta path $P_1$, which states that two authors ($A_1$ and $A_2$) are related by their publications in the same venue ($V$). Another meta path $P_2$ says that two authors have written papers containing the same topic ($T$). We will discuss several meta-path-based relevance measures, including PathCount, PathSim, and Path Constrained Random Walk (PCRW). These measures have been shown to be better than those that do not consider object and edge type information.
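The contrast between type-agnostic and meta-path-based measures can be sketched on a toy example. The data below is hypothetical, and the function names are chosen for illustration. Jaccard's coefficient is computed on direct neighbors only; PathCount and PathSim are computed along the author-paper-venue-paper-author meta path, with PathSim taken as twice the number of path instances between x and y divided by the sum of the instances from x to itself and from y to itself.

```python
# Hypothetical toy data: papers written by each author, and each paper's venue.
writes = {
    "a1": {"p1", "p2"},
    "a2": {"p3", "p4"},
    "a3": {"p5"},
}
venue_of = {"p1": "KDD", "p2": "ICDM", "p3": "KDD", "p4": "VLDB", "p5": "AAAI"}


def jaccard(x, y, neighbors):
    """Type-agnostic baseline: |N(x) & N(y)| / |N(x) | N(y)|."""
    union = neighbors[x] | neighbors[y]
    return len(neighbors[x] & neighbors[y]) / len(union) if union else 0.0


def venue_counts(author):
    """Count author -write-> paper -publish-> venue half-paths per venue."""
    counts = {}
    for paper in writes[author]:
        venue = venue_of[paper]
        counts[venue] = counts.get(venue, 0) + 1
    return counts


def path_count(x, y):
    """PathCount along A-P-V-P-A: path instances that start at x and end at y."""
    cx, cy = venue_counts(x), venue_counts(y)
    return sum(n * cy.get(venue, 0) for venue, n in cx.items())


def path_sim(x, y):
    """PathSim for a symmetric meta path: 2 * PC(x, y) / (PC(x, x) + PC(y, y))."""
    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0


print(jaccard("a1", "a2", writes))   # 0.0: the authors share no direct neighbor
print(path_sim("a1", "a2"))          # 0.5: they share the venue KDD
```

In this toy data the two authors have no paper in common, so a measure over direct neighbors sees no relationship, while the venue meta path connects them through KDD.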

We will further discuss meta structures, recently proposed to depict the relationship of two graph objects. A meta structure is essentially a directed acyclic graph of object and edge types. Figure 1(b) illustrates a meta structure $S$, which depicts that two authors are relevant if they have published papers in the same venue, and have also mentioned the same topic. A meta path (e.g., $P_1$ or $P_2$) is a special case of a meta structure. However, a meta path fails to capture complex relationships that can be conveniently expressed by a meta structure (e.g., $S$). We will discuss how meta structures can be used to formulate three relevance definitions, as well as their efficient calculation.
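As a rough illustration of why a meta structure is stricter than either meta path alone, the sketch below enumerates instances of a structure like S on hypothetical data. It assumes that S requires the same pair of papers to share both a venue and a topic; the function name `struct_instances` and the tuple layout are made up for this example, and counting instances is only one simple way to turn a meta structure into a score, not necessarily one of the three definitions covered in the tutorial.

```python
# Hypothetical toy data.
writes = {"a1": {"p1", "p2"}, "a2": {"p3", "p4"}}
venue_of = {"p1": "KDD", "p2": "ICDM", "p3": "KDD", "p4": "KDD"}
topics_of = {
    "p1": {"mining"},
    "p2": {"mining"},
    "p3": {"mining", "efficient"},
    "p4": {"efficient"},
}


def struct_instances(x, y):
    """List instances (x, paper_x, venue, topic, paper_y, y) of the structure.

    Assumption: one instance needs a paper by x and a paper by y that share
    BOTH the same venue and the same topic.
    """
    instances = []
    for px in sorted(writes[x]):
        for py in sorted(writes[y]):
            if venue_of[px] != venue_of[py]:
                continue  # the venue branch of the structure is not satisfied
            for topic in sorted(topics_of[px] & topics_of[py]):
                instances.append((x, px, venue_of[px], topic, py, y))
    return instances


for instance in struct_instances("a1", "a2"):
    print(instance)
```

Here p1 and p4 share a venue but no topic, and p2 and p3 share a topic but not a venue, so each of those pairs satisfies one meta path without satisfying the combined structure.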

Meta Path Discovery. There are often a huge number of meta paths between a pair of HIN objects. It can be very difficult, even for a domain expert, to identify the right meta paths. We will discuss a recently proposed meta path discovery algorithm, in which users provide example instances of source and target objects through a Query-by-Example paradigm, and meta paths are derived automatically. We will demonstrate a HIN search engine prototype based on this algorithm.

Query Recommendation. We will study the use of meta paths in query recommendation, where queries are suggested to web search users based on their previous query histories. As studied in prior work, it is possible to use a knowledge base (a HIN) and its related meta-paths to perform effective query recommendation. The approach is especially useful for long-tail queries that rarely appear in query logs.
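The Query-by-Example idea behind meta path discovery can be illustrated with a deliberately naive sketch: enumerate short simple paths between each example pair, abstract every path to its sequence of object and edge types, and rank the resulting meta paths by how many example pairs they connect. This brute-force enumeration is an assumption made for illustration and is not the discovery algorithm presented in the tutorial; the graph, the names `meta_paths_between` and `discover`, and the `^-1` suffix for reverse edges are hypothetical.

```python
from collections import Counter

# Hypothetical toy HIN.
node_type = {
    "a1": "author", "a2": "author", "a3": "author",
    "p1": "paper", "p2": "paper", "p3": "paper",
    "KDD": "venue", "t1": "topic",
}
edges = [
    ("a1", "write", "p1"), ("a2", "write", "p2"), ("a3", "write", "p3"),
    ("p1", "publish", "KDD"), ("p2", "publish", "KDD"), ("p3", "publish", "KDD"),
    ("p1", "mention", "t1"), ("p2", "mention", "t1"),
]

# Adjacency list with explicit reverse edges.
adj = {}
for src, etype, dst in edges:
    adj.setdefault(src, []).append((etype, dst))
    adj.setdefault(dst, []).append((etype + "^-1", src))


def meta_paths_between(src, dst, max_len=4):
    """Meta paths (as tuples) of simple paths src -> dst with <= max_len edges."""
    found = set()
    stack = [(src, (node_type[src],), {src})]
    while stack:
        node, meta_path, seen = stack.pop()
        if node == dst and len(meta_path) > 1:
            found.add(meta_path)
            continue
        if (len(meta_path) - 1) // 2 >= max_len:
            continue  # edge budget exhausted
        for etype, nxt in adj.get(node, []):
            if nxt not in seen:
                stack.append((nxt, meta_path + (etype, node_type[nxt]), seen | {nxt}))
    return found


def discover(examples, max_len=4):
    """Rank meta paths by the number of example pairs they connect."""
    votes = Counter()
    for src, dst in examples:
        votes.update(meta_paths_between(src, dst, max_len))
    return votes.most_common()


for meta_path, support in discover([("a1", "a2"), ("a1", "a3")]):
    print(support, " ".join(meta_path))
```

With these two example pairs, the venue-based meta path connects both pairs while the topic-based one connects only the first, so the venue path ranks first. Exhaustive enumeration like this grows quickly with path length, which is one reason a dedicated discovery algorithm is needed on real HINs.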

2 Proposed Schedule

The following is our proposed schedule of the 90-min tutorial.

– Introduction (15 min). We will present the basic model of HINs and discuss applications based on it, such as search, relevance computation, query recommendation, and data integration (10 min). We will also introduce meta-paths, a fundamental HIN analysis tool, and give an overview of the tutorial (5 min).
– Main contents (60 min). Next, we will introduce meta paths and how they facilitate the computation of various relevance measures (10 min). We then explain the process of discovering meta paths (15 min). We discuss a novel query recommendation framework based on meta paths (15 min). We will also present meta structures, which are the latest development of meta paths (15 min). We will demonstrate a HIN search engine prototype based on meta paths (5 min).
– Conclusions (15 min). We will conclude the tutorial and discuss future directions (5 min). The rest of the time will be dedicated to Q&A (10 min).

3 Intended Audience

The tutorial is designed for researchers interested in the latest developments in the field of HINs, especially the use of meta paths in novel applications. The HIN search demonstration will give software practitioners insight into developing recommendation facilities for HINs.


4 Biography of Presenters

Reynold Cheng is an Associate Professor in the Department of Computer Science at the University of Hong Kong (HKU). He obtained his PhD from the Department of Computer Science of Purdue University in 2005. He was granted an Outstanding Young Researcher Award 2011–12 by HKU. He was the recipient of the 2010 Research Output Prize in the Department of Computer Science of HKU. He also received the U21 Fellowship in 2011. He received the Performance Reward from the Hong Kong Polytechnic University in 2006 and 2007. He is a member of the IEEE, the ACM, and ACM SIGMOD. He is an editorial board member of TKDE, DAPD and IS, and was a guest editor for TKDE, DAPD, and Geoinformatica. He is an area chair of ICDE 2017, senior PC member of BigData 2017 and DASFAA 2015, PC co-chair of APWeb 2015, area chair for CIKM 2014, and workshop co-chair of ICDE 2014. He received an Outstanding Service Award in CIKM 2009. He has served as a PC member and reviewer for top conferences and journals.

Zhipeng Huang is a 2nd year Ph.D. student in the CS department of HKU, supervised by Prof. Nikos Mamoulis and Dr. Reynold Cheng. He received his bachelor's degree from the EECS department of PKU in 2015. His research interests cover data mining, data management and data cleaning.

Yudian Zheng is a 4th year Ph.D. student in the CS department of HKU, supervised by Dr. Reynold Cheng. Yudian's research interests cover crowdsourcing, data management and data cleaning. He has published full research papers in well-established database and data mining conferences and journals, including SIGMOD, VLDB, KDD, WWW, ICDE, and TKDE. He has also completed internships at Microsoft Research and Google Research.

Jing Yan is a 1st year MPhil student supervised by Dr. Reynold Cheng in the CS department of HKU. His research interests include data management and data mining, with emphasis on knowledge graphs and data cleaning.

Ka Yu Wong is currently an MSc student in the CS department of HKU.

Eddie Ng is currently an MSc student in the CS department of HKU.

Acknowledgements. Reynold Cheng, Zhipeng Huang, Yudian Zheng, and Jing Yan were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 17229116 and 17205115) and the University of Hong Kong (Projects 102009508 and 104004129).

References

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC - 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
2. Huang, Z., Cautis, B., Cheng, R., Zheng, Y.: KB-enabled query recommendation for long-tail queries. In: CIKM, pp. 2107–2112 (2016)
3. Huang, Z., Zheng, Y., Cheng, R., Sun, Y., Mamoulis, N., Li, X.: Meta structure:
4. Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Mach. Learn. 81(1), 53–67 (2010)
5. Ley, M.: DBLP Computer Science Bibliography (2005)
6. Meng, C., Cheng, R., Maniu, S., Senellart, P., Zhang, W.: Discovering meta-paths in large heterogeneous information networks. In: WWW, pp. 754–764 (2015)
7. Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: Exemplar queries: give me an example of what you need. PVLDB 7(5), 365–376 (2014)
8. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)
9. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: Pathsim: meta path-based top-k similarity search in heterogeneous information networks. In: PVLDB, pp. 992–1003 (2011)
10. Yu, X., Ren, X., Sun, Y., Sturt, B., Khandelwal, U., Gu, Q., Norick, B., Han, J.: Recommendation in heterogeneous information networks with implicit user feedback. In: RecSys, pp. 347–350 (2013)

Spatial Data Processing and Data Quality

TrajSpark: A Scalable and Efficient In-Memory Management System for Big Trajectory Data

Zhigang Zhang, Cheqing Jin(B), Jiali Mao, Xiaolin Yang, and Aoying Zhou

School of Data Science and Engineering, East China Normal University, Shanghai, China

{zgzhang,jlmao1231,xlyang}@stu.ecnu.edu.cn, {cqjin,ayzhou}@sei.ecnu.edu.cn

Abstract. The widespread application of mobile positioning devices has generated big trajectory data. Existing disk-based trajectory management systems can no longer provide scalable and low latency query services. In view of that, we present TrajSpark, a distributed in-memory system to consistently offer efficient management of trajectory data. TrajSpark introduces a new abstraction called IndexTRDD to manage trajectory segments, and exploits a global and local indexing mechanism to accelerate trajectory queries. Furthermore, to alleviate the essential partitioning overhead, it adopts the time-decay model to monitor the change of data distribution and updates the data-partition structure adaptively. This model avoids repartitioning existing data when a new batch of data arrives. Extensive experiments with three types of trajectory queries on both real and synthetic datasets demonstrate that TrajSpark outperforms state-of-the-art systems.

Keywords: Big trajectory data · In-memory · Low latency query

1 Introduction

Recently, with the explosive development of positioning techniques and the popular use of intelligent electronic devices, trajectory data of moving objects (MOs) has accumulated rapidly in many applications, such as location-based services (LBS) and geographical information systems (GIS). For example, DiDi, the largest one-stop consumer transportation platform in China, now has 1.5 million registered active drivers and provides services for more than 300 million passengers. The total length of all trajectories generated on this platform reached around 13 billion kilometers in 2015. Moreover, the volume of trajectory data is surging: in March 2016, the number of trajectories generated in one day already exceeded 10 million. It is challenging to provide real-time services over such data. However, as almost all existing trajectory management systems are disk-oriented (e.g., TrajStore), they cannot support low latency query services over big trajectory data.


Recently, in-memory computing systems have been widely used to provide low latency query services. For instance, Spark, a distributed in-memory computing system, provides a data abstraction called RDDs (Resilient Distributed Datasets) to maintain a collection of objects that are partitioned across a cluster of machines. Users can manipulate RDDs conveniently through a batch of predefined operations. However, Spark lacks an indexing mechanism over RDDs and needs to scan the whole dataset for a given query. Recently, some Spark-based system prototypes have been proposed to process big spatial data, including SpatialSpark and Simba. Amongst them, SpatialSpark implements the spatial join query on
