9.2 System Overview
In this section, we describe the system architecture of CloudWeaver. We first briefly review Hadoop and its components. Hadoop is an integrated system for Map/Reduce jobs. It runs on a large cluster with HDFS (Hadoop Distributed File System). HDFS has a single Namenode, which manages the file system namespace and regulates client access to files. Each machine runs a Datanode, which manages the storage attached to that machine. Each data file in HDFS is stored as many small data blocks, typically of a fixed size. Each block has two or three replicas located on different Datanodes. Using multiple copies of small data blocks provides better availability and accessibility. The Map/Reduce execution engine is built on top of HDFS. The user submits a Map/Reduce job configuration through the job client. A master node maintains a job tracker and forks many slave nodes to execute map/reduce tasks. Each slave node runs a task tracker, which manages the map or reduce task instances on that node.
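To make the block-and-replica layout concrete, the following minimal sketch (in Python, with assumed names such as place_replicas and an assumed 64 MB block size; this is not Hadoop's actual API) shows how a file could be split into fixed-size blocks and each block assigned to distinct Datanodes:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # assumed fixed block size (e.g. 64 MB)
REPLICATION = 3                # two to three replicas per block, as described above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, offset, length) tuples covering the whole file."""
    blocks = []
    for block_id, offset in enumerate(range(0, file_size, block_size)):
        length = min(block_size, file_size - offset)
        blocks.append((block_id, offset, length))
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct Datanodes (simple round-robin)."""
    placement = {}
    ring = itertools.cycle(datanodes)
    for block_id, _, _ in blocks:
        replicas = []
        while len(replicas) < min(replication, len(datanodes)):
            node = next(ring)
            if node not in replicas:
                replicas.append(node)
        placement[block_id] = replicas
    return placement

# Example: a 200 MB file on a four-node cluster.
blocks = split_into_blocks(200 * 1024 * 1024)
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]))
```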
Compared to Hadoop, our proposed system for generic clouds, called CloudWeaver, has the following extensions:
• A Cloud Monitor is added to monitor the resource utilization of each processor and the consumption status of processor output (i.e., results). It is also used to add new servers to the cloud or shut down some of the computing resources.
• The Hadoop cloud is extended into a generic cloud for general-purpose computing. To enable this, CloudWeaver jobs are also extended from MapReduce job descriptions to DAGs of general-purpose operators (see the sketch after this list).
• A Workload Manager (WLM) is added to automate the assignment of processors to tasks and the mapping of jobs to processors. More detail is given in Section 9.3.
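As a minimal sketch of such a job description (the names Operator and Job are assumptions for illustration, not CloudWeaver's actual interface), a job submitted to CloudWeaver could be represented as a DAG of general-purpose operators with named input files:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    """One node of the job DAG: a general-purpose operator."""
    name: str
    kind: str                                        # e.g. "extract", "join", "aggregate", "map", "reduce"
    inputs: list = field(default_factory=list)       # upstream operators feeding this one
    input_files: list = field(default_factory=list)  # named source files in the cloud storage

@dataclass
class Job:
    """A query job: a marked operator tree / DAG with named input files."""
    root: Operator

# Example: extract two files, join them, then aggregate the join result.
extract_a = Operator("extract_A", "extract", input_files=["/data/A"])
extract_b = Operator("extract_B", "extract", input_files=["/data/B"])
join_ab   = Operator("join_AB", "join", inputs=[extract_a, extract_b])
agg       = Operator("agg", "aggregate", inputs=[join_ab])
job = Job(root=agg)
```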
Figure 9.1 shows the system architecture of CloudWeaver, which consists of a job client, a central workload manager (WLM), servers (also called workers or slave nodes), a name node, and a storage system (the data node).
The generic cloud is provided as a hardware computing facility. The user may or may not know the details of its configuration, and the configuration can change. The user submits a query job to the generic cloud from a client computer. The query job can be considered a marked operator tree, so the workflow (or data flow) is known. The names of the input files of some of the nodes are also included. We assume that these input files reside in the storage system of the generic cloud; in other words, the cloud can read the files by their names.
We also relax the assumption of HDFS by supporting both shared file systems and non-HDFS shared-nothing file systems. In shared file systems, each processor can access storage directly through a common interface; other distributed file systems may have different storage policies. We assume that the name node and data node in CloudWeaver provide an interface for accessing data as small blocks, similar to HDFS. Our scheduling algorithm is designed to run many tasks on small data blocks to improve performance and achieve load balance.
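The following sketch shows what such a block-level interface could look like (the class and method names here, e.g. NameNode.lookup and DataNode.read_block, are assumptions for illustration, not HDFS's or CloudWeaver's actual API): the name node keeps a directory from file names to block locations, and data nodes serve individual blocks.

```python
class NameNode:
    """Directory of all files: maps a file name to the locations of its blocks."""
    def __init__(self):
        self.directory = {}          # file name -> [(block_id, datanode_name), ...]

    def register(self, path, block_locations):
        self.directory[path] = list(block_locations)

    def lookup(self, path):
        """Return the block list for a file, so a server knows where to read."""
        return self.directory[path]

class DataNode:
    """Storage attached to one machine; stores and serves small blocks."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}             # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

    def write_block(self, block_id, data):
        self.blocks[block_id] = data
```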
Fig. 9.1 Architecture of CloudWeaver (the job client submits a job, i.e., a DAG of general-purpose operators, to the master node, which hosts the workload manager, job tracker, and cloud monitor; slave nodes run task trackers that manage task instances; the name node and data node provide the file-block directory and general data access over the file system)
9.2.1 Components
In this section, we briefly discuss the components that extend Hadoop in CloudWeaver: the workload manager, the cloud monitor, and the generic cloud.
9.2.1.1 Workload Manager
The workload manager accepts the query job and is responsible for processing it. It knows the status of the whole system: where the name node is, where the computing servers are, and where the storage system is. Any change to the cloud environment is noticed by the workload manager. The WLM looks at the operator tree of a query job and processes the job in a data-driven model: it schedules small tasks to run on servers. Each task takes a small block of input and generates some output files. The WLM then schedules the intermediate result files to feed other operators until the output of the query job is generated. Throughout this process, the WLM tracks changes to the generic cloud and the progress of the work, so it can dynamically utilize all available resources.
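The sketch below illustrates this data-driven model in simplified form (the names, such as run_job and run_task, and the single-consumer DAG are assumptions for illustration, not the actual WLM implementation): whenever an input or intermediate block is available and a server is idle, a small task is launched on it, and its output blocks are queued for the downstream operator.

```python
import collections

def run_job(dag, ready_blocks, servers, run_task):
    """Data-driven scheduling loop: feed available blocks to idle servers.

    dag          -- maps each operator to the single operator consuming its output
    ready_blocks -- iterable of (operator, block) pairs that are ready to process
    servers      -- list of currently idle servers (assumed non-empty; may change at run time)
    run_task     -- callable(server, operator, block) -> list of output blocks
    """
    queue = collections.deque(ready_blocks)
    while queue:
        operator, block = queue.popleft()
        server = servers.pop(0)                 # pick any idle server
        outputs = run_task(server, operator, block)
        servers.append(server)                  # the server becomes idle again
        consumer = dag.get(operator)            # downstream operator, if any
        if consumer is not None:
            for out_block in outputs:
                queue.append((consumer, out_block))
        # Blocks with no consumer form the final output of the query job.
```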
The name node maintains a directory of all the files; it can be considered a file system interface. The source input files reside in the storage system. When a server processes a task, it asks the name node for the access addresses of the files and then reads them, or writes new files as results.
The storage system can be either shared storage or a shared-nothing structure. We assume that there is a central node that maintains all the related files in the generic cloud.
9.2.1.2 Cloud Monitor
Since the system is based on a producer-consumer model, the output of lower-tier tasks is used as input for upper-tier tasks. Each task stores its output (i.e., intermediate results) on the local disk in blocks of a predefined size. The intermediate result blocks are then read by upper-tier tasks.
If the number of intermediate result blocks keeps increasing, the Cloud Monitor can notify the WLM to increase the number of upper-tier tasks so that the growing number of blocks is consumed.
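A minimal sketch of this feedback rule, assuming a hypothetical WLM interface with add_consumer_task() and remove_consumer_task() and an arbitrary backlog threshold (these are illustrative assumptions, not CloudWeaver's actual policy):

```python
def monitor_tick(pending_blocks, consumer_tasks, wlm, backlog_per_task=4):
    """One monitoring step of the producer-consumer feedback loop.

    pending_blocks -- number of intermediate result blocks not yet consumed
    consumer_tasks -- number of upper-tier tasks currently running
    wlm            -- object exposing add_consumer_task() / remove_consumer_task()
    """
    # If producers outpace consumers, ask the WLM to start more upper-tier tasks.
    if pending_blocks > backlog_per_task * consumer_tasks:
        wlm.add_consumer_task()
    # If the backlog has drained, some consumer tasks can be released.
    elif consumer_tasks > 1 and pending_blocks < consumer_tasks:
        wlm.remove_consumer_task()
```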
9.2.1.3 Generic Cloud
A generic cloud has a cluster of servers with computing power. The large data set is either stored in the cloud or can be passed into the cloud to fulfill a data processing job. Each data processing job is called a job for short in the rest of this chapter; we mainly study queries. A job can be parallelized into small tasks, which are executed on different servers, and performance is improved through this parallelism.
The servers can have different computing power and storage sizes. Scheduling jobs in a generic cloud to achieve the best response time is a hard optimization problem.
A predefined scheduling algorithm can hardly cope with a changing environment. In this chapter we solve the scheduling problem at run time: we check the data processing requirements and the cloud status at run time, and determine the number of tasks and their assignment to servers. Because each partitioning and scheduling step is based on the data that is currently available to be processed, we believe that this data-driven method can best balance the workload of the servers in the generic cloud and achieve the best performance.
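As a simplified illustration of this run-time decision (the function name plan_assignments and the proportional-to-speed rule are assumptions for illustration, not the chapter's actual algorithm), the number of tasks can simply follow the number of currently available blocks, and blocks can be handed to servers in proportion to their relative speed:

```python
def plan_assignments(available_blocks, server_speeds):
    """Assign currently available blocks to servers in proportion to their speed.

    available_blocks -- list of block ids ready to be processed right now
    server_speeds    -- dict: server name -> relative processing speed
    Returns a dict: server name -> list of blocks to process next.
    """
    total_speed = sum(server_speeds.values())
    assignments = {server: [] for server in server_speeds}
    servers = sorted(server_speeds, key=server_speeds.get, reverse=True)
    i = 0
    for server in servers:
        # Faster servers receive proportionally more of the available blocks.
        share = round(len(available_blocks) * server_speeds[server] / total_speed)
        assignments[server] = available_blocks[i:i + share]
        i += share
    # Any leftover blocks (from rounding) go to the fastest server.
    assignments[servers[0]].extend(available_blocks[i:])
    return assignments

# Example: 10 blocks, one fast and one slow server.
print(plan_assignments(list(range(10)), {"fast": 2.0, "slow": 1.0}))
```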
We consider SQL-like query processing over large data sets with MapReduce support.
Query Job
A user query job can be described by an operator graph. The operators include Extract, Join, Aggregate functions, and Map/Reduce. The map and reduce functions are provided by the user.
In a parallel environment, the work of an operator can be partitioned into several small tasks that run in parallel on different servers. Each task can be run by an executable file: the command takes some input files and generates output files. In this way, we can direct data files and the executable file to different servers, and the executable file consumes the input files and produces the outputs.
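A minimal sketch of launching such a task as an external executable (the command-line convention here, inputs as arguments and the output given via "-o", is only an assumption for illustration; any agreed convention would do):

```python
import subprocess

def run_operator_task(executable, input_files, output_file):
    """Run one operator task: an executable that reads input files and writes an output file."""
    cmd = [executable, *input_files, "-o", output_file]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"task failed: {result.stderr}")
    return output_file

# Example (hypothetical paths): an extract task over one input block.
# run_operator_task("./extract", ["/blocks/A_part0"], "/blocks/extract_A_part0.out")
```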
Input
We assume that the input data are very large. The large input files may be considered as tables in an RDBMS; we can think of each file as a big table.
In order for the WLM to make use of parallelism, it needs to know how each operator can be parallelized. For example, a simple table extract can be arbitrarily partitioned into many small files, and each file can be processed by an extract task on any server. The sort operator can also be parallelized into many small sorting tasks with a merger task, but its output cannot be finalized until all the input data have been processed. This kind of operator is called a blocking operator.
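A small sketch of how the WLM might record this distinction (the classification below is an illustrative assumption rather than an exhaustive list):

```python
# Operators whose output is fixed only after all input has been seen (blocking),
# versus operators that can emit output as soon as each input block is processed.
BLOCKING_OPERATORS = {"sort", "aggregate"}
PIPELINED_OPERATORS = {"extract", "map"}

def can_emit_early(operator_kind):
    """True if downstream tasks may start consuming this operator's partial output."""
    return operator_kind in PIPELINED_OPERATORS
```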
The join operator is another, more complex example. We can partition the input and use many servers to process the join. When the WLM wants to schedule more servers for the join, the state of the original join servers needs to be migrated or adjusted.
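To see why scaling a join requires migrating state, consider a hash-partitioned join (a simplification for illustration; the function names and the modulo-hash partitioning are assumptions, not the actual migration protocol): when the number of join servers changes, the keys whose owning server changes must be moved.

```python
import zlib

def partition_owner(key, num_servers):
    """Hash partitioning: the server index responsible for a join key."""
    return zlib.crc32(key.encode()) % num_servers

def keys_to_migrate(keys, old_servers, new_servers):
    """Keys whose owning server changes when the join is scaled out."""
    return [k for k in keys
            if partition_owner(k, old_servers) != partition_owner(k, new_servers)]

# Example: scaling a hash join from 2 to 3 servers moves some of the keys.
print(keys_to_migrate(["k%d" % i for i in range(10)], 2, 3))
```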
A key difference between our work and others is that we do not assume an SMP machine or a cluster of identical servers; instead, we deal with machines of different power, and we deal with the dynamic nature of data throughput. For example, in a join the processing speed is constant, but the result output rate varies during the whole processing period.
Our framework also differs substantially from Map/Reduce and Hadoop. The Hadoop system aims to provide abstraction (virtualization) of the underlying hardware, file locations, and load balancing, so that application programs can focus on writing map and reduce functions. CloudWeaver provides similar functionality but focuses on coordinating the execution of complex jobs (virtualizing execution flow management and optimization away from the programmer). The execution of Map/Reduce or Hadoop is much simpler, since it has only two phases and does not involve complex data flows. This is similar to SQL: users specify what they want in SQL and do not need to specify how to execute or optimize the queries; optimization and execution are done automatically by the database system. Our system is a powerful implementation for data processing in a cloud environment.
A whole data processing system has three important components layered on top of each other. The first is the user's input describing the job. The second is the parallelization and execution. The third, such as storage, is provided by the infrastructure. In our system, the workload manager focuses on the second layer, execution, by conducting the processor/task mapping. We assume that selecting the right set of servers is handled by the infrastructure. Hadoop and Map/Reduce provide all three layers, which makes them less generic. Our system can take a user's input job and achieve good performance over arbitrary infrastructure.
Dryad is similar to our system in the sense that it parallelizes a sequential data flow, but it only performs local optimization for operators that run slower, whereas our algorithm schedules the whole job DAG in a data-driven fashion, which is more flexible and extensible. Moreover, Dryad has not been extended to schedule multiple jobs.