
23.2 Scientific Compute Tasks

Dissecting the characteristics of scientific computing is like mapping genes. At the larger scale, the double-helix structure of DNA in a chromosome, we notice a strong similarity in the exterior structure of genes. As we scrutinize the smaller details, however, we notice that the tasks pertaining to, and the information confined in, each gene are different. Similarly, scientific computing tasks may seem indistinguishable to non-practitioners, while researchers perceive the breadth of the tasks. Each field of science conducts various compute tasks with different computing strategies. We are therefore more interested in how a particular strategy is selected for accomplishing a scientific compute task.

Scientific compute tasks are mainly concerned with two things: processing time and result. A compute task that can be completed in less time is generally favorable. With quicker task completion, other jobs in the queue can be staged and sequenced for processing in order to obtain the final result. The implications drawn from the final result, after a series of analyses of the output data, determine the success of the corresponding compute task. The result, however, depends on the correctness of the underlying logic. Since the logic is formulated by the process designer or researcher, a proper and efficient implementation of the logic will yield shorter processing time and the expected result, provided that the logic is verifiably correct for the task.

Designing the implementation of algorithms for a scientific compute task can be trivial. In some cases, however, complexities arise and make the task non-trivial. The complexities are primarily caused by two factors: the size of the input and the algorithms in the processing block. The input to a compute task can be huge, so the time needed to process it completely becomes longer. On the other side, the processing block may implement complex algorithms which in turn require a hefty amount of resources. This creates the need for superior computation capability. Huge inputs and complex algorithms imply intensive computing. Using compute intensity as the criterion, we can categorize scientific compute tasks into two groups: resource-intensive tasks and data-intensive tasks.

A resource-intensive task refers to a task that makes use of plenty of compute resources. Compute resources can be I/O, CPU cycles, memory, temporary storage, and also energy. Examples of resource-intensive tasks in the scientific realm are system modeling, image rendering (especially in 3D), and forecasting. Such tasks involve chains of complex computations which demand a large allocation of resources. Slightly different from the previous one, a data-intensive task refers to a task that deals with processing a huge amount of input data. An example of a task that falls into this category is data mining over a huge data set (iris data in biometrics, protein data in biology, access log data in computer networks, etc.). However, on a resource-constrained processing node, a data-intensive task can also become a resource-intensive task if a proper strategy for processing such data cannot be applied.

Enabling parallelism has been a strategy used in the execution and accomplishment of scientific compute tasks, especially the complex ones. With parallelism, a complex compute task which requires an enormous amount of resources, say equivalent to the specifications of a minicomputer or mainframe, can be divided into smaller parts, each to be fitted and deployed onto less powerful infrastructure, for example PC-class compute nodes. Techniques and paradigms for achieving parallelism have also evolved, from clustering to grid computing and now cloud computing.
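As an illustration only (not taken from the chapter), the following minimal Python sketch divides a large numerical task into smaller chunks and processes them with a pool of worker processes, mimicking how one large task can be spread over several PC-class nodes; the function simulate_chunk, the chunk sizes, and the worker count are hypothetical placeholders.

```python
from multiprocessing import Pool


def simulate_chunk(bounds):
    """Hypothetical unit of work: one slice of a larger scientific computation."""
    start, end = bounds
    # Placeholder kernel standing in for a complex computation.
    return sum(x * x for x in range(start, end))


def run_in_parallel(total_size=1_000_000, n_workers=4):
    """Split the task into equal chunks and farm them out to worker processes."""
    chunk = total_size // n_workers
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(n_workers)]
    with Pool(processes=n_workers) as pool:
        partial_results = pool.map(simulate_chunk, ranges)
    # Combine the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(run_in_parallel())
```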

In classical clustering, a virtually superior compute node, called a compute cluster, is formed from several physical compute nodes which share the same software system. The nodes in a compute cluster are also placed in close vicinity and, in the general case, have the same hardware specifications. Hence, a compute cluster is basically built over homogeneous compute nodes. Homogeneity among the elements of a compute cluster lessens the communication overhead among nodes in the cluster. At the same time, however, homogeneity brings inflexibility in extending the compute capability. Since the requirements are tight, future node provisioning and configuration can be a problem for an institution with limited resources.

Grid computing enables resource pooling from distributed sets of compute nodes. A compute-node set can be a cluster of commodity computers and servers, or simply an individual node. Different from a traditional compute cluster, which is built on top of homogeneity, a grid can be formed from diverse systems. A grid federates storage, network, and compute resources which are geographically distributed and generally heterogeneous and dynamic. A grid defines and provides a set of standard protocols, middleware, toolkits, and services to discover and share the constituent distributed resources. Upon the creation of a grid, computation power equivalent to supercomputers and large dedicated clusters can be built and utilized at a lower price than the purchase of a mainframe or supercomputer.

There are two types of grid based on usage model, namely the institutional grid and the community grid. In an institutional grid, utilization of the grid's resources is possible only for the institutions or individuals donating compute resources to the grid. In contrast, the community grid model also offers compute resources to public users. Nevertheless, resource utilization is usually tied to a contract. A grid user is allocated a certain amount of resources, and extra resource allocations are made possible only after approval of a proposal requesting the additional resources (Foster et al., 2008).

A common technique for accomplishing scientific compute tasks in the grid is batch processing. This technique is mainly aimed at solving data-intensive tasks. Initially, the data are segmented into several sequences, and workflows pertinent to the segments (Simmhan, Barga, Lazowska, & Szalay, 2008) are then created by a scheduler. A batch process is then executed to process all the sequences and output the result. The overall process is depicted in Fig. 23.1. In the figure, we redraw and combine common processes in the implementations described in Matsunaga, Tsugawa, and Fortes (2008) and Liu and Orban (2008). The job scheduler manages the assignment of workflows for processing on the compute infrastructure, which is the grid. The outputs of the process are chunks of data which have to be merged later to yield the final result.

Based on the scheme in Fig. 23.1, we can see that the main idea is to implement parallelization through simultaneous processing of the segmented data. Through the creation of multiple workflows and their delegation to worker/compute nodes in the grid, an initially time-consuming process over a large data set (a data-intensive task) can be reduced to sets of smaller processes running in parallel. Consequently, this approach shortens the time to yield the result compared with processing the data serially.
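The segment/process/merge pattern behind the scheme in Fig. 23.1 can be summarized in code. The sketch below is an illustration only, not taken from the implementations cited above: it segments an input data set, lets a simple scheduler assign one workflow per segment to a pool of worker processes, and finally merges the per-segment outputs. The names segment_data, workflow, and merge_results are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def segment_data(records, n_segments):
    """Split the input data set into roughly equal segments."""
    size = max(1, len(records) // n_segments)
    return [records[i:i + size] for i in range(0, len(records), size)]


def workflow(segment):
    """Hypothetical workflow applied to one segment (e.g., filter then aggregate)."""
    return sum(value for value in segment if value >= 0)


def merge_results(chunks):
    """Merge the per-segment output chunks into the final result."""
    return sum(chunks)


def run_batch(records, n_workers=4):
    """Scheduler: assign one workflow per segment to the worker pool, then merge."""
    segments = segment_data(records, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        chunks = list(pool.map(workflow, segments))
    return merge_results(chunks)


if __name__ == "__main__":
    data = list(range(-100, 100_000))
    print(run_batch(data))
```

Processing the segments in parallel rather than serially is what shortens the overall completion time, at the cost of the final merge step over the output chunks.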


Fig. 23.1 A batch-process scheme for processing large scientific data