23.4.1 Setup and Experiment on Tiny Cloud Infrastructure and Platform
Our research lab has diverse types of compute nodes, ranging from mobile devices such as PDAs to desktops and workstations. We run some server programs on the workstations and use the desktops for personal research purposes. We wanted to assess how difficult it is to create a tiny cloud comprising a small number of compute nodes. In concrete numbers, our workstations add up to 20 cores in total. We are familiar with the operating systems we use (several distributions of GNU/Linux and Windows) and have been managing and administering our infrastructure ourselves. In our view, if setting up a tiny cloud does not take too much time and effort, research institutions outside the computer science and computer engineering fields should also be able to transform their legacy systems into a cloud on their own.
Since we are interested in MapReduce, we experimented on our workstations by installing and configuring Hadoop in the network and then running some MapReduce-based applications. We wanted to assess the difficulty of installing Hadoop on a cluster ourselves and its performance when running MapReduce applications. The specifications of our infrastructure are listed in Table 23.2.
Table 23.2 Infrastructure for running a tiny compute cloud

Part       Item            Description
Hardware   Processors      20 cores in total, with speeds ranging from 1.6 GHz to 3.0 GHz
           RAM             At least 2 GB per physical server
Software   OSes            RHEL 4 64-bit, Fedora Linux 12 64-bit (Linux kernel 2.6.31), Windows Server 2003 64-bit
           Other software  Hadoop 0.20, CloudBurst 1.0.1, CrossBow 0.1.3
Network    Ethernet        Gigabit Ethernet
           IP addresses    Public IPv4 addresses
           Firewall        Two-level firewall, no NAT
In our experience, installing and configuring Hadoop on a cluster may take some time but should not pose many technical difficulties. Cloudera's Hadoop distribution (http://www.cloudera.com/hadoop) streamlines the installation of packages and dependencies through the yum package manager. The challenge lies in configuring the connection between the master node and the slave nodes. Since Hadoop requires an SSH connection with root-level access to each slave node in order to start its daemons and other startup scripts (sketched below), difficulties can arise in configuring the permissions and in scaling the infrastructure horizontally within a network shielded by NAT. There is also a security concern regarding the network configuration of the cluster: enabling remote root login can open the door to a security breach when a variety of software is installed on the machines. Some of that software may have security holes or bugs, and once one is exploited, the whole system can be affected.
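To make the SSH requirement concrete, the sketch below mimics in Python what Hadoop's start scripts (which are actually shell scripts, such as bin/start-dfs.sh) effectively do on the master: read the list of slaves from the configuration directory and start a DataNode daemon on each of them over SSH. The installation paths are assumptions; every one of these SSH connections must already be authorized, which is precisely the permission step discussed above.

import subprocess

# Hypothetical installation paths; adjust to the actual Hadoop home.
HADOOP_HOME = "/usr/lib/hadoop"
SLAVES_FILE = HADOOP_HOME + "/conf/slaves"

# Read the slave hostnames, one per line, skipping blank lines.
with open(SLAVES_FILE) as f:
    slaves = [line.strip() for line in f if line.strip()]

# Start a DataNode on every slave. This requires passwordless
# (key-based) SSH from the master to each host; behind NAT, the
# master may not even be able to reach the slaves directly.
for host in slaves:
    subprocess.call(["ssh", host,
                     HADOOP_HOME + "/bin/hadoop-daemon.sh",
                     "start", "datanode"])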
Virtualization can isolate workloads, thus improving security. It also improves the utilization of idle resources, since several VMs can run simultaneously. However, our findings indicate that transforming an existing infrastructure into a totally virtualized one still poses tough difficulties. We are interested in a type 1 hypervisor such as the Xen hypervisor (http://xen.org), which runs directly on the computer hardware and manages all of its guest operating systems. An operating system runs as a Xen guest in one of two modes, dom0 or domU. In dom0 mode, the operating system is assigned as the main guest OS with more privileges; the dom0 guest OS also manages and controls communication between Xen and the other guest OSes. In domU mode, the guest OS has fewer privileges and is restricted from direct network and I/O access. The major problem we encountered was native OS support for running Xen as the virtualization software. We wanted to use our version of Fedora Linux as the dom0 guest, but since Xen requires modifications to the Linux kernel, it was not possible to run Xen without manually patching the kernel or downgrading to an older version of the OS whose kernel still supports dom0 mode. We consider this a serious hurdle, especially if the kernel patching is to be carried out by somebody with minimal system administration experience. Downgrading the OS, on the other hand, brings incompatibility problems with other software and with the software build process.
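For reference, once dom0 is working, each domU guest is described by a configuration file written in Python syntax and started from dom0 with a command such as xm create hadoop-node1.cfg. The sketch below is a minimal, illustrative domU definition; all paths, names, and sizes are assumptions, not values from our deployment.

# /etc/xen/hadoop-node1.cfg -- illustrative domU guest definition.
kernel  = "/boot/vmlinuz-2.6.18-xen"        # a Xen-aware guest kernel
ramdisk = "/boot/initrd-2.6.18-xen.img"     # matching initial ramdisk
memory  = 2048                              # RAM for the guest, in MB
name    = "hadoop-node1"                    # domain name shown by xm list
vif     = ["bridge=xenbr0"]                 # bridged networking via dom0
disk    = ["file:/var/xen/hadoop-node1.img,xvda,w"]  # file-backed root disk
root    = "/dev/xvda ro"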
Beyond the technical difficulties, we noted that Hadoop has the potential to be used as a platform for broader scientific applications. Besides the streaming feature, Hadoop also provides a built-in web interface for analyzing the logs and configuration of executed jobs. This eases the collection of log data for further analysis. For example, in one of our experiments we ran a PatternRecog job, which counted the occurrences of a certain pattern in several input files. Through the web interface, the total execution time can be inspected along with more detailed information, such as the execution time and details of each phase, including the Map and Reduce phases. This helped us not only in analyzing and solving data-intensive problems but also in learning what happened on each compute node.
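The streaming feature mentioned above allows such a job to be written in a scripting language instead of Java. The following is a minimal sketch of a pattern-counting job in that style; it is not the source of our PatternRecog job, and the pattern, file names, and paths are illustrative assumptions.

#!/usr/bin/env python
# mapper.py -- emit (match, 1) for every occurrence of the pattern
# found on an input line read from standard input.
import re
import sys

PATTERN = re.compile(r"GGCATG")   # hypothetical pattern to count

for line in sys.stdin:
    for match in PATTERN.finditer(line):
        print("%s\t1" % match.group())

#!/usr/bin/env python
# reducer.py -- sum the counts emitted by the mappers. Hadoop
# streaming delivers lines with the same key contiguously.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))

The two scripts are then submitted through the streaming jar shipped with Hadoop 0.20 (the exact jar name varies by release), for example: hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py. The job's progress and per-phase timings then appear in the web interface as described above.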