Virtual Hadoop / HBase Clusters on Mac OS X with Parallels and Fusion

Rather than pay the power bill on the 10-node Hadoop / HBase cluster I was previously using for testing, I thought it might be better to recreate the cluster using virtual machines. I use a 64-bit Mac Pro desktop running Mac OS X 10.5.6, which can support 32 Gigs of RAM, so it seemed to make sense to give it a try on that machine.

In the world of Mac virtualization, there are two players battling it out for domination: the upstart Parallels Desktop and the incumbent VMware Fusion. Parallels was the first mover, which won it a considerable user base, but Fusion entered the market with better features and more stability. Parallels came back with a better product in its next release, and the two have been fighting it out ever since with no clear winner.

I started with Parallels because it was the only option early on, and I stayed with it, upgrading continually through version 3.5. I hadn't upgraded to Parallels 4.0, the current version as of this writing, because most of its new features were aimed at Windows desktop users, and I was mostly using Parallels to virtualize Linux servers. So I started by creating a 4-node Hadoop / HBase cluster in Parallels.

Rather than suffer through 4 Linux installs, I just made a master Linux instance and copied it out to create the 4 machines. Parallels assigns a random MAC address to each instance, so each one is uniquely addressable over the network. I got a Hadoop cluster up and running, launched HBase and started importing test data. Each node had one CPU, 2 Gigs of RAM (the maximum for Parallels 3.5) and 32 Gigs of storage. The import speed was fairly good, and if it were not for Activity Monitor, I probably wouldn't have noticed the 20% CPU usage when the cluster was idle. Everything seemed acceptable until I decided to arrange the Linux instances on another monitor. As soon as I did this, I started to get draw errors all over the place and, eventually, even after I had quit Parallels, I was forced to reboot my Mac. It was an avoidable situation but not an ideal one, so I took a look at VMware Fusion.
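
As a rough sketch of the resource budget behind a setup like this, the snippet below totals the RAM and disk for the 4 nodes described above and estimates how many 2 Gig guests would fit on the host. The 32 Gigs of installed RAM and the 4 Gigs held back for Mac OS X are assumptions for illustration, not measured figures, and `max_nodes` is a hypothetical helper.

```python
def max_nodes(host_ram_gb, node_ram_gb, host_reserve_gb):
    """Return how many guests of node_ram_gb fit after reserving RAM for the host."""
    if node_ram_gb <= 0:
        raise ValueError("node_ram_gb must be positive")
    usable_gb = host_ram_gb - host_reserve_gb
    return max(0, usable_gb // node_ram_gb)


# Figures from the post: 4 nodes, each with 2 Gigs of RAM and 32 Gigs of storage.
NODES = 4
NODE_RAM_GB = 2
NODE_DISK_GB = 32

total_ram_gb = NODES * NODE_RAM_GB    # RAM committed to guests
total_disk_gb = NODES * NODE_DISK_GB  # disk committed to guests

# Assumptions (not from the post): the host has its full 32 Gigs installed
# and 4 Gigs are kept free for Mac OS X itself.
ASSUMED_HOST_RAM_GB = 32
ASSUMED_HOST_RESERVE_GB = 4

fit = max_nodes(ASSUMED_HOST_RAM_GB, NODE_RAM_GB, ASSUMED_HOST_RESERVE_GB)

print(f"{NODES} nodes use {total_ram_gb} GB RAM and {total_disk_gb} GB disk")
print(f"Up to {fit} nodes of {NODE_RAM_GB} GB would fit under these assumptions")
```

RAM is only one constraint; CPU and disk I/O contention on a single host will usually limit a virtual cluster before memory does.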




With VMware Fusion I had the benefit of running the latest version, and it was also easy to set up. As with the Parallels setup, I created a master Linux instance and copied it out, but this time I gave each instance 2 CPUs. I didn't think this would add much, if anything, to the performance, but it more closely simulated the physical setup, so I went with it. I ran the same versions of Hadoop and HBase (0.19.1) as I had on the Parallels setup and got slightly better insert performance. The important differences were that Fusion consumed no noticeable CPU when the cluster was idle, and the whole thing didn't crash my system when I moved windows around. Parallels may have remedied these problems in their 4.0 release, but I haven't had the time to test that out.

Considering the 25 amps it takes to run the physical 10-node cluster, simulating the cluster on one machine is a far better use of power if you can live without the raw speed. I can't tell you how many circuits I blew trying to balance the cluster with the other equipment in the server room! Of course, if you need the speed, you have to deal with the power penalty, but the setup within virtual machines allows me to develop code and simulate machine and rack failures without paying an outlandish power bill. I'm also able to test other distributed filesystems like Cloudstore and MapReduce frameworks such as Skynet. I've also tested up to 16 virtual nodes at the same time, though 10 is usually sufficient for what I'm doing.
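
To put the 25 amps in perspective, here is a back-of-the-envelope sketch of the power draw and running cost. Only the 25 amp figure comes from the post. The 120 volt supply and the $0.15 per kWh rate are assumptions chosen for illustration, and the calculation ignores power factor, so treat the result as an order-of-magnitude estimate.

```python
def power_watts(amps, volts):
    """Apparent power drawn by a load: P = I * V (power factor ignored)."""
    return amps * volts


def monthly_cost(watts, rate_per_kwh, hours=24 * 30):
    """Cost of running a constant load for the given number of hours."""
    kwh = watts / 1000 * hours
    return kwh * rate_per_kwh


CLUSTER_AMPS = 25            # from the post
ASSUMED_VOLTS = 120          # assumption: standard US circuit voltage
ASSUMED_RATE_PER_KWH = 0.15  # assumption: illustrative electricity rate in USD

watts = power_watts(CLUSTER_AMPS, ASSUMED_VOLTS)
cost = monthly_cost(watts, ASSUMED_RATE_PER_KWH)

print(f"Cluster draw: {watts} W")
print(f"Estimated cost for a 30-day month: ${cost:.2f}")
```

Under these assumptions the physical cluster draws about 3 kW around the clock, which is also why it was so easy to trip breakers when sharing circuits with other equipment.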

If you have experience virtualizing Linux clusters, please leave a comment with your experiences.


Comments (4)

Nathan Charles from

With Fusion, something I find really handy is the ability to run headless, since I generally ssh into VMs. The following command adds a Headless option to the View menu:

defaults write com.vmware.fusion fluxCapacitor -bool YES

Even André Fiskvik from Norway

I tried setting up the same here as well, but found that setting up a VMware ESXi solution on a Unix server was better (too bad they don't provide ESXi on OS X yet).

milkfilk from dc

Nice post, I like the pic for inspiration. Is that Gentoo or a stripped-down Ubuntu?

Anders from Boston, MA

I used Gentoo. Good eye! I needed a small footprint and tend to compile everything myself so I went Gentoo for this. I'm sure Ubuntu would work fine as well.


About Me:


Name: Anders Brownworth
Location: Boston, USA
Work: Writing iPhone and Android applications.
Play: Technology, World Traveler and Licensed Helicopter Pilot