This article assumes a few things:
- You have the hardware to run multiple virtual machines. The Hadoop cluster in this article uses four machines, in this case four virtual machines, each requiring its own hard drive space, memory, and processor(s). I recommend 10+ GB of hard drive space per VM, 2 GB of memory each if you don’t need a GUI (4 GB each if you do), and a multi-core processor. That’s 40 GB of hard drive space and 8 GB of memory (16 GB with a GUI). The machine I’m setting this up on is an i7 with 20 GB of RAM and a 500 GB hard drive. If you actually plan to use the cluster, you’ll probably want far more than 10 GB of disk per VM, since the OS alone takes up 5 GB. Keep this in mind when setting it up.
- You need to have virtualization turned ON in your BIOS. If you can’t get a VM to run, this could be the problem.
- You can install Oracle VM VirtualBox, and turn up a Virtual Machine running your favorite Linux distribution.
- You have some familiarity with Linux.
- I’m using CentOS 6.4 for my Linux distribution. This article assumes you are too.
Step 1 – Get your first virtual machine up and running.
Install VirtualBox, turn up a virtual machine, and get your Linux distribution (64-bit) installed. I used southpark for my domain and cartman (cartman.southpark.com) as the host name of my first machine. You need to come up with something, even if it’s server1, server2, and so on. As you can tell, I went with a South Park naming scheme. The host names will be cartman, stan, kyle, and kenny, all on the southpark domain.
Step 2 – Configure network
I’ve set all the network connections to Bridged inside VirtualBox, which lets my home DHCP server and gateway control much of the networking. I am, however, going to manually give each virtual machine a static IP; it’s not good when they move around.
In a terminal window, type the following commands to discover the network settings, as the OS defaults to DHCP.
ifconfig will show you the IP and the netmask.
route -n will show you the gateway.
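If you’d rather capture the gateway than read it off the screen, something like the sketch below should work. The function name default_gateway is my own invention, the 192.168.1.x sample output is made up, and it assumes the usual net-tools column layout (Destination, Gateway, Genmask, Flags).

```shell
#!/bin/sh
# Hypothetical helper: print the default gateway from `route -n` output on stdin.
# Assumes the net-tools columns: Destination Gateway Genmask Flags ...
default_gateway() {
    # The default route has destination 0.0.0.0 and the G (gateway) flag set.
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2; exit }'
}

# Made-up sample output, standing in for a real `route -n` run.
sample='Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 eth0'

printf '%s\n' "$sample" | default_gateway
```

On the VM itself you would feed it the real thing with route -n | default_gateway.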
In the GUI, select System > Preferences > Network Connections > eth0. Manually configure the network settings to match what is in your terminal window. For the DNS servers, I use Google’s 8.8.8.8 and 8.8.4.4. Set the search domain to your domain.
Save those settings, restart the service with
service network restart, and then verify that you can still reach the network.
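If you skipped the GUI, the same static settings can live in /etc/sysconfig/network-scripts/ifcfg-eth0. This is only a sketch: the ifcfg_static function is a name I made up, and the addresses passed to it are placeholders for whatever ifconfig and route -n showed you.

```shell
#!/bin/sh
# Sketch: print an ifcfg-eth0 body for a static address on CentOS 6.
# Arguments (placeholder values below): IP, netmask, gateway, search domain.
ifcfg_static() {
    cat <<EOF
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=$1
NETMASK=$2
GATEWAY=$3
DNS1=8.8.8.8
DNS2=8.8.4.4
DOMAIN=$4
EOF
}

# Example values only; substitute your own network settings.
ifcfg_static 192.168.1.101 255.255.255.0 192.168.1.1 southpark.com
```

As root, you would redirect that output into the ifcfg-eth0 file and then run service network restart as above.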
Step 3 – Edit the hosts file
$ vi /etc/hosts
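Each node should be able to resolve the others by name, so the hosts file gets one line per machine. Here is a rough sketch that prints those lines; the hosts_entries function is hypothetical and the 192.168.1.x addresses are placeholders for the static IPs you actually assign.

```shell
#!/bin/sh
# Sketch: print /etc/hosts entries (IP, fully qualified name, short name).
# The domain matches this article; the IPs further down are placeholders.
domain=southpark.com

hosts_entries() {
    # Reads "IP shortname" pairs from stdin, one per line.
    while read -r ip name; do
        printf '%s\t%s.%s\t%s\n' "$ip" "$name" "$domain" "$name"
    done
}

hosts_entries <<EOF
192.168.1.101 cartman
192.168.1.102 stan
192.168.1.103 kyle
192.168.1.104 kenny
EOF
```

Paste the resulting lines below the existing localhost entries in /etc/hosts.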
Step 4 – Final tweaks
Run yum update to bring every package up to its newest version. After that, I’m disabling
iptables. You can disable it or configure it; there are hundreds of articles on both, so I’m not going into that here. After all that, go ahead and shut down the virtual machine.
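For reference, this step boils down to roughly the following on CentOS 6. It is a sketch, not a hardened script: the run and final_tweaks helpers are my own, and by default it only prints the commands (set DRY_RUN=0 and run as root to actually execute them).

```shell
#!/bin/sh
# Sketch of the Step 4 commands for CentOS 6.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

final_tweaks() {
    run yum -y update            # bring every package up to date
    run service iptables stop    # stop the firewall now
    run chkconfig iptables off   # keep it off after a reboot
    run shutdown -h now          # power off so the VM can be cloned
}

final_tweaks
```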
Step 5 – Clone one into four
Now I’m going to clone cartman into stan, kyle, and kenny. With your virtual machine off, right-click it in VirtualBox and select Clone. Name the new clone and check the box to reinitialize the MAC address of the network cards. Select Full clone and wait for it to finish. Repeat this for kyle and kenny.
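The same cloning can be done from the host’s command line with VBoxManage clonevm, which makes a full clone with fresh MAC addresses unless told otherwise. The sketch below assumes the source VM is registered in VirtualBox under the name cartman; the run and clone_all helpers are my own, and it only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: clone the powered-off source VM once per remaining host name.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clone_all() {
    src=$1
    shift
    for name in "$@"; do
        # --register adds the new VM to the VirtualBox manager list.
        run VBoxManage clonevm "$src" --name "$name" --register
    done
}

# VM names are assumptions based on the host names in this article.
clone_all cartman stan kyle kenny
```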
Step 6 – Tweak your clones
OK, now we have to fix the network connection on the clones. VirtualBox has defined new MAC addresses for each of the machines, but because I cloned them, the old MAC address is still stored inside each clone. To fix this, we are going to boot the virtual machines one at a time, make the corrections, and then shut each one down, until they’re all fixed.
Repeat these steps for all of your clones:
- Boot clone
- Right-click the clone you plan to fix and choose Settings. Under Network, click the arrow next to the Advanced link. Here you will find the MAC address displayed. Write it down; you’re going to need it in a minute.
- Start the clone up.
- Fix hostname
$ vi /etc/sysconfig/network
Edit the line labeled
HOSTNAME=stan.southpark.com (use the correct hostname)
Save the change
- Fix MAC Address & Set network interface to static
- In the GUI, select System > Preferences > Network Connections > eth0
- In here you can change the MAC address to the one you pulled from VirtualBox
- You can also set up the network settings like you did for the first virtual machine.
- Save the changes
- Restart service
- ping google.com and verify the IP is correct
- Shut down machine
- Repeat for next clone
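If you are fixing the clones without a GUI, two small helpers cover most of the list above. Both function names are hypothetical. The first turns the 12-digit MAC that VirtualBox shows into the colon-separated form Linux expects; the second rewrites the HOSTNAME line in a file you point it at (on the clone that would be /etc/sysconfig/network). The MAC value used here is just an example.

```shell
#!/bin/sh
# Hypothetical helpers for the per-clone fixes.

# Turn VirtualBox's 12-hex-digit MAC (e.g. 080027A1B2C3) into colon form.
format_mac() {
    echo "$1" | sed 's/../&:/g; s/:$//'
}

# Rewrite the HOSTNAME= line of the given file, leaving other lines alone.
# Usage: set_hostname_line NEW_HOSTNAME FILE
set_hostname_line() {
    sed "s/^HOSTNAME=.*/HOSTNAME=$1/" "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}

# Example MAC only; use the one VirtualBox shows for your clone.
format_mac 080027A1B2C3
```

On the host, VBoxManage showvminfo stan --machinereadable prints a macaddress1 line with the value to feed into format_mac. On the clone, set_hostname_line stan.southpark.com /etc/sysconfig/network does the hostname edit as root.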
Step 7 – Install Hadoop
Thanks to Matthew for discovering this extremely simple way of getting this going.
On cartman, download cloudera-manager-installer.bin to your root. You can get it by clicking the download link at the bottom of the Download Cloudera Standard page.
Make sure you’re also in root, and then run these commands:
$ chmod u+x cloudera-manager-installer.bin
$ sudo ./cloudera-manager-installer.bin
If you’re in the OS GUI, the Cloudera installer will pop up a little box informing you of its progress through this part of the process. At the end, it will pop up another box telling you to open an IP:port in a browser. If you’re doing this all from the command line, you can jump over to Matt’s post on installing a cluster on droplet servers; he has a good walkthrough over there.
*Note: It took nearly 10 minutes for the port to respond in a browser after the pop up telling me to go there.
Sit back and let Cloudera do the work for you.
What you’re looking for here is the Cloudera Manager login page. The default login is
admin. On the next screen you need to define the hosts; just put their IP addresses in there one by one, separated by commas. There are a few options that depend on the type of cluster you’re looking for, but I would suggest the embedded databases. From here on out it’s pretty self-explanatory. If you have issues, I’d suggest posting to their forums.
I only gave my VMs 10 GB of storage each, as this was mostly a test run. This proved to be insufficient. I couldn’t find a minimum amount anywhere, but I did find this pretty decent article about planning for your storage needs. If you’re actually planning on using your cluster, you’re probably going to want to understand your storage needs.
I’ve added a few additional screenshots I took during the installation. They should at least give you an idea of what decisions you’ll have to make along the way.
You can see from that last screenshot that the installation was far from perfect. All of the machines are throwing several errors, mostly due to the lack of storage space. Oh, and as you can see, having all these virtual machines actually doing things ate up all my memory.
I’ll attempt to extend the disks and run some metrics against this VM cluster. We’re currently gathering data from several Hadoop clusters to do an article on performance comparison. I may update this post at that time.