This article assumes a few things:

  • You have the hardware to run multiple virtual machines.  The Hadoop cluster in this article uses four machines, in this case four virtual machines, each requiring its own hard drive space, memory and processor(s).  I recommend 10+ GB of hard drive space per VM, 2 GB of memory each if you don’t need a GUI, 4 GB each if you do, and a multi-core processor.  That’s 40 GB of hard drive space and 8 GB of memory (16 GB with GUI).  The machine I’m setting this up on is an i7 with 20 GB RAM and a 500 GB HD.  If you actually plan to use this cluster, you’ll want far more than 10 GB of hard drive per VM, since the OS alone takes up 5 GB.  Keep this in mind when setting it up.
  • You need to have virtualization turned ON in your BIOS.  If you can’t get a VM to run, this could be the problem.
  • You can install Oracle VM VirtualBox, and turn up a Virtual Machine running your favorite Linux distribution.
  • You have some familiarity with Linux.
  • I’m using CentOS 6.4 for my Linux distribution.  This article assumes you are too.
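If your numbers differ from mine, the totals are easy to recompute. This is just a small sketch of the arithmetic from the first bullet; the variable names are my own, and the per-VM values are the minimums suggested above.

```shell
#!/usr/bin/env bash
# Rough host sizing for the four-VM cluster described in this article.
vm_count=4          # cartman, stan, kyle, kenny
disk_gb_per_vm=10   # the minimum suggested above; plan for more in real use
mem_gb_no_gui=2     # per VM, without a desktop
mem_gb_gui=4        # per VM, with a desktop

total_disk_gb=$((vm_count * disk_gb_per_vm))
total_mem_no_gui_gb=$((vm_count * mem_gb_no_gui))
total_mem_gui_gb=$((vm_count * mem_gb_gui))

echo "Disk: ${total_disk_gb} GB"
echo "Memory without GUI: ${total_mem_no_gui_gb} GB"
echo "Memory with GUI: ${total_mem_gui_gb} GB"
```

Remember that the host OS needs memory and disk of its own on top of these totals.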

Resources

Oracle VM VirtualBox
CentOS Mirror list
Download Cloudera Standard

Step 1 – Get your first virtual machine up and running.

Install VirtualBox, turn up a virtual machine, and get your Linux distribution (64-bit) installed.  I used southpark for my domain, and cartman (cartman.southpark.com) as the host name of my first machine.  You need to come up with something, even if it’s server1, server2 and so on.  As you can tell, I went with a South Park naming scheme.  The host names will be cartman, stan, kyle, and kenny, all on the southpark domain.

Step 2 – Configure network

I’ve set all the network connections to Bridged inside of VirtualBox.  This will allow my home DHCP server and gateway to control much of the networking.  I am, however, going to manually set each virtual machine’s IP to static; it’s not good when they move around.

In a terminal window, type the following commands to discover the network settings, as the OS defaults to DHCP.

ifconfig will show you the IP and the netmask.
route -n will show you the gateway.

In the GUI, select System > Preferences > Network Connections > eth0.  Manually configure the network settings to match what is in your terminal window.  For the DNS servers, I use Google’s 8.8.8.8 and 8.8.4.4.  Set the search domain to your domain.

Save those settings, restart the service with service network restart, and then check to verify you can still ping google.com.
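If you’re working without the GUI, the same settings can live in /etc/sysconfig/network-scripts/ifcfg-eth0.  Here’s a rough sketch of what that file might look like for cartman.  The IP matches the hosts file used in Step 3, but the netmask and gateway are assumptions for a typical home network, so use the values that ifconfig and route -n gave you.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch for cartman)
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes              # bring the interface up at boot
BOOTPROTO=none          # static addressing instead of DHCP
IPADDR=192.168.6.220
NETMASK=255.255.255.0   # assumption: take yours from ifconfig
GATEWAY=192.168.6.1     # assumption: take yours from route -n
DNS1=8.8.8.8
DNS2=8.8.4.4
DOMAIN=southpark.com    # search domain
```

After editing the file, service network restart applies it, same as with the GUI.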

Step 3 – Edit the hosts file

Now I’m going to edit the hosts file, just to be sure the machines always know where the other machines are.  Edit your hosts file like below, substituting your own IPs and host names.

$ vi /etc/hosts

192.168.6.220 cartman.southpark.com
192.168.6.221 stan.southpark.com
192.168.6.222 kyle.southpark.com
192.168.6.223 kenny.southpark.com
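If you’d rather script this than edit by hand, something like the following would work.  It’s only a sketch: add_host and HOSTS_FILE are names I made up, and by default it writes to a scratch file so you can look at the result before touching /etc/hosts.

```shell
#!/usr/bin/env bash
# Target file. Defaults to a scratch copy so the sketch can be tried safely;
# set HOSTS_FILE=/etc/hosts (as root) to apply it for real.
HOSTS_FILE=${HOSTS_FILE:-./hosts.sample}

# Append "IP FQDN" unless the FQDN is already listed.
add_host() {
    local ip=$1 fqdn=$2
    # -F: literal match, -w: whole word, so reruns don't add duplicates.
    if ! grep -qwF -- "$fqdn" "$HOSTS_FILE" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$fqdn" >> "$HOSTS_FILE"
    fi
}

add_host 192.168.6.220 cartman.southpark.com
add_host 192.168.6.221 stan.southpark.com
add_host 192.168.6.222 kyle.southpark.com
add_host 192.168.6.223 kenny.southpark.com
```

Because the check runs before every append, running the script twice leaves the file unchanged the second time.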

Step 4 – Final tweaks

Run a yum update to bring every package up to its newest version.  I’m also disabling SELinux and iptables.  You can disable them or configure them properly; there are hundreds of articles on both, so I’m not going into them here.  After all that, go ahead and shut down the virtual machine.
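For reference, here’s roughly what those tweaks look like on CentOS 6.  Treat it as a sketch rather than a hardened procedure: the function names are mine, and final_tweaks is deliberately not called, because it changes the system and needs root.

```shell
#!/usr/bin/env bash
# Rewrite the SELINUX= line in an SELinux config file.
# The change takes full effect after a reboot.
disable_selinux_in() {
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}

# Run as root on the VM. Defined here but not invoked automatically.
final_tweaks() {
    yum -y update
    disable_selinux_in /etc/selinux/config
    setenforce 0              # switch to permissive for the current session
    service iptables stop     # stop the firewall now
    chkconfig iptables off    # and keep it off after reboot
}
```

Call final_tweaks as root once you’re happy with it.  Turning off SELinux and the firewall is fine for a throwaway test cluster on a home network, but I wouldn’t do it on anything exposed.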

Step 5 – Clone one into four

Now I’m going to clone cartman into stan, kyle, and kenny.  With your virtual machine off, right-click it in VirtualBox and select Clone.  Name the new clone (stan first) and check the box to reinitialize the MAC address of the network cards.  Select Full clone and wait for it to finish.  Repeat this for kyle and kenny.
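The same cloning can be done from the host’s command line with VBoxManage.  The sketch below only prints the commands so you can check them first.  It assumes the VirtualBox VM is named cartman, which may not match what you called yours, and clone_cmd is my own helper name.

```shell
#!/usr/bin/env bash
# Build the clone command for one VM. With no --options given, clonevm makes
# a full clone and generates new MAC addresses for the network adapters.
clone_cmd() {
    printf 'VBoxManage clonevm %s --name %s --register\n' "$1" "$2"
}

for name in stan kyle kenny; do
    clone_cmd cartman "$name"
done
```

Once the printed commands look right, paste them into a terminal on the host (with cartman shut down) to run them.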

Step 6 – Tweak your clones

OK, now we have to fix the network connection on the clones.  VirtualBox has assigned a new MAC address to each machine, but because they are clones, each one still has cartman’s MAC address stored in its network configuration.  To fix this, we’re going to boot the virtual machines one at a time, make the corrections, and shut each one down, until they’re all fixed.

Repeat these steps for all of your clones:

  1. Boot clone
    • Right-click the clone you plan to fix and choose Settings.  Under Network, click the arrow next to the Advanced link.  Here you will find the MAC Address displayed.  Write it down; you’re going to need it in a minute.
    • Start the clone up.
  2. Fix hostname
    • $ vi /etc/sysconfig/network
      Edit the HOSTNAME line so it reads HOSTNAME=stan.southpark.com (use the correct hostname for each clone)
      Save the change
  3. Fix MAC Address & Set network interface to static
    • In the GUI, select System > Preferences > Network Connections > eth0
    • In here you can change the MAC Address to the one you pulled from VirtualBox
    • You can also set up the network settings like you did for the first virtual machine, using this clone’s own static IP.
    • Save the changes
  4. Restart service
    • Run service network restart
    • Test:  ping google.com, then verify the IP is correct with ifconfig
  5. Shut down machine
  6. Repeat for next clone
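A couple of these per-clone fixes can be scripted.  This is a sketch under my own assumptions: format_mac and set_hostname_in are helper names I made up, the VM is assumed to be named stan in VirtualBox, and the MAC shown is a sample value, not a real adapter.

```shell
#!/usr/bin/env bash
# Convert VirtualBox's 12-hex-digit MAC (e.g. 0800271A2B3C) to colon form.
format_mac() {
    printf '%s\n' "$1" | sed 's/../&:/g; s/:$//'
}

# Set HOSTNAME= in a CentOS 6 style network file
# (normally /etc/sysconfig/network).
set_hostname_in() {
    local file=$1 fqdn=$2
    sed -i "s/^HOSTNAME=.*/HOSTNAME=${fqdn}/" "$file"
}

# On the host, read the clone's MAC without opening the settings dialog:
#   VBoxManage showvminfo stan --machinereadable | grep '^macaddress1='
# On the clone, as root:
#   set_hostname_in /etc/sysconfig/network stan.southpark.com
format_mac 0800271A2B3C   # sample value only
```

The colon-separated form is what goes into the MAC Address field in the GUI (or HWADDR in ifcfg-eth0).  On CentOS 6, udev may also have cached cartman’s MAC in /etc/udev/rules.d/70-persistent-net.rules, which can make the new adapter show up as eth1; if you hit that, it’s worth checking that file.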

Step 7 – Install Hadoop

Thanks to Matthew for discovering this extremely simple way of getting this going.

On cartman, download cloudera-manager-installer.bin to your root directory; you can get it from the download link at the bottom of the Download Cloudera Standard page.

Make sure you’re also logged in as root, and then run these commands:
$ chmod u+x cloudera-manager-installer.bin
$ sudo ./cloudera-manager-installer.bin

If you’re in the OS GUI, the Cloudera installer will pop up a little box informing you of its progress through this part of the process.  At the end, it will pop up another box telling you to open an IP:port in a browser.  If you’re doing this all from the command line, you can jump over to Matt’s post on installing a cluster on droplet servers; he has a good walkthrough over there.

*Note: It took nearly 10 minutes for the port to respond in a browser after the pop up telling me to go there.
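Rather than refreshing the browser for ten minutes, you could poll the address from a terminal.  This is a minimal sketch using curl; wait_for_http is my own name, and the URL in the usage note assumes port 7180, which is the usual Cloudera Manager default.  Use whatever IP:port the installer actually tells you.

```shell
#!/usr/bin/env bash
# Return 0 as soon as the URL answers over HTTP, or 1 after `tries` attempts.
# Defaults: 60 tries, 15 seconds apart (about 15 minutes in total).
wait_for_http() {
    local url=$1 tries=${2:-60} delay=${3:-15}
    local i=1
    while [ "$i" -le "$tries" ]; do
        # -s: quiet, -o /dev/null: discard the page, fail fast if unreachable
        if curl -s -o /dev/null --connect-timeout 5 "$url"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example: wait_for_http http://cartman.southpark.com:7180/ && echo "Cloudera Manager is answering"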

Sit back and let Cloudera do the work for you.

What you’re looking for here is the Cloudera Manager login page.  The default login is admin / admin.  On the next screen you need to define the hosts; just enter their IP addresses, separated by commas.  A few options depend on the type of cluster you’re looking for, but I would suggest the embedded databases.  From here on out it’s pretty self-explanatory.  If you have issues, I’d suggest posting to their forums.
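If you don’t feel like typing the IPs out, you can build the comma-separated list from the hosts file you set up in Step 3.  This is just a convenience sketch; ips_csv is a name I made up.

```shell
#!/usr/bin/env bash
# Print the IP (first column) of every non-comment hosts-file line that
# mentions the given domain, joined with commas.
ips_csv() {
    local hosts_file=$1 domain=$2
    awk -v d="$domain" 'index($0, d) && $1 !~ /^#/ { print $1 }' "$hosts_file" \
        | paste -sd, -
}

# Usage on cartman:
#   ips_csv /etc/hosts southpark.com
```

Paste the resulting line into the host field on the Cloudera Manager screen.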

I only gave my VMs 10 GB of storage each, as this was mostly a test run.  That proved to be insufficient.  I couldn’t find a minimum amount anywhere, but I did find this pretty decent article about planning for your storage needs.  If you’re actually planning on using your cluster, you’ll want to understand your storage needs first.

I’ve added a few additional screenshots I took during the installation.  They should at least give you an idea of what decisions you’ll have to make during the installation.

Good luck!

[Screenshots: Cloudera Manager installation steps]

You can see from that last screenshot that the installation was far from perfect.  All of the machines are throwing several errors, mostly due to the lack of storage space.  Oh, and as you can see, having all these virtual machines actually doing things ate up all my memory.

 

I’ll attempt to extend the disks and run some metrics against this VM cluster.  We’re currently gathering data from several Hadoop clusters for an article comparing their performance.  I may update this post at that time.
