Gluster Performance Testing & Tuning Part 1

For the past several weeks I’ve been experimenting with and testing GlusterFS, setting it up between some of my cluster nodes.

The current system consists of 16 nodes connected by 1 Gb Ethernet. The machines are HP G6 servers; each has 72 GB of RAM and two 6-core processors. I configured two 2 TB drives (7200 RPM) as /dev/sdb and /dev/sdc and formatted them as XFS. I am using Ubuntu 14.04 LTS with the latest kernel version.

I have had numerous trials and tribulations getting gluster working. The performance has been much slower than I expected, and as this blog develops I hope to post what I have modified or changed.

During the debugging process, I ran into multiple confusing issues related to the documentation, as well as my own impatience at getting this to work! One of the most frustrating and unintuitive issues is related to the “peer probe” process, i.e. how a machine communicates with its neighbors. I wanted to use a separate network for gluster communication, since all of my machines have more than one Ethernet port. So in this case, I had eth0 for primary network traffic, and eth1, in theory, dedicated to gluster traffic.

To make this work, I needed to edit my /etc/hosts file, so that there was a line for each of the hosts pointing to their second network address.

In my case, I added lines like:
192.168.50.51 gshpg6-01
192.168.50.52 gshpg6-02
192.168.50.53 gshpg6-03
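
As a rough sketch, the full set of entries could be generated rather than typed by hand. This assumes the numbering pattern in the three lines above (addresses starting at 192.168.50.51, hostnames starting at gshpg6-01) simply continues for all 16 nodes, which may not match a real layout. The function name `hosts_lines` and its parameters are made up for illustration.

```python
def hosts_lines(node_count, subnet="192.168.50", first_octet=51, prefix="gshpg6-"):
    """Return /etc/hosts lines for the storage network.

    Assumption: addresses and hostnames are numbered consecutively,
    as in the three example lines (192.168.50.51 -> gshpg6-01, ...).
    """
    lines = []
    for index in range(node_count):
        address = f"{subnet}.{first_octet + index}"
        hostname = f"{prefix}{index + 1:02d}"
        lines.append(f"{address} {hostname}")
    return lines


if __name__ == "__main__":
    # Print the entries so they can be reviewed before being
    # appended to /etc/hosts on each machine.
    print("\n".join(hosts_lines(16)))
```

The output is only printed, not written, so it can be checked before touching /etc/hosts.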

Each machine had a default hostname like hpg6-01, hpg6-02, hpg6-03. What is confusing is that, at least in the current version, it’s not abundantly clear which network gluster will communicate over. So to get this to work, I had to make sure I “probed” each machine using the hostname I set up for the second network (gshpg6-01 rather than hpg6-01, for example).
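
To make the probing step concrete, here is a minimal sketch that builds one `gluster peer probe` command per storage-network hostname. The helper names (`probe_commands`, `probe_all`) and the `dry_run` switch are my own, not part of gluster. By default it only prints the commands; actually running them would need root on a node that is already part of the pool.

```python
import subprocess


def probe_commands(hostnames):
    """Build a `gluster peer probe <host>` command for each hostname."""
    return [["gluster", "peer", "probe", name] for name in hostnames]


def probe_all(hostnames, dry_run=True):
    """Print the probe commands, or run them when dry_run is False."""
    for command in probe_commands(hostnames):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a probe fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Use the second-network names (gshpg6-NN), not the default
    # hostnames (hpg6-NN), so peers are registered on eth1's network.
    probe_all(["gshpg6-02", "gshpg6-03"])
```

The important part is the hostname list: probing with the default names would register the peers on the primary network instead.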

Another thing I had to reconcile is the theoretical maximum transfer rate. Since I am going over a 1 Gb/s connection, the best case I could hope for would be roughly 100 MB/s of transfer (1 Gb/s is 125 MB/s before protocol overhead). At first I was only seeing around 5-10 MB/s, and sometimes, depending on the benchmark, 35-50 MB/s.
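
The arithmetic behind that ceiling can be sketched as follows. The framing numbers are assumptions: a standard 1500-byte MTU, 20-byte IP and TCP headers with no options, and 38 bytes of Ethernet framing per packet (preamble, header, checksum, and inter-frame gap). Real throughput will be lower once gluster and filesystem overhead are added.

```python
def line_rate_mb_per_s(gigabits_per_s):
    """Convert a link speed in Gb/s to MB/s (1 Gb = 1000 Mb, 8 bits per byte)."""
    return gigabits_per_s * 1000 / 8


def tcp_payload_fraction(mtu=1500, ip_header=20, tcp_header=20, ethernet_overhead=38):
    """Fraction of bytes on the wire that carry TCP payload.

    Assumes full-size frames and no IP/TCP options.
    """
    payload = mtu - ip_header - tcp_header
    on_wire = mtu + ethernet_overhead
    return payload / on_wire


if __name__ == "__main__":
    raw = line_rate_mb_per_s(1)
    usable = raw * tcp_payload_fraction()
    print(f"raw line rate: {raw:.1f} MB/s")
    print(f"best-case TCP payload rate: {usable:.1f} MB/s")
    # Compare the observed benchmark numbers against the raw rate.
    for observed in (5, 10, 35, 50):
        print(f"{observed} MB/s is {observed / raw:.0%} of line rate")
```

Under these assumptions the usable rate comes out a little under 120 MB/s, so 5-10 MB/s is far enough below the ceiling that the network hardware itself is worth checking.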

So today, the first thing I did was check my network hardware. As it turns out, likely due to a loose cable, one of my machines was only negotiating a connection at 100 Mb/s (not 1 Gb/s). I noticed this when I was peeking at the switch and saw an “orange” light instead of all green. After swapping out and reseating the cable, I was able to get all the nodes talking to each other at 1 Gb/s. Since gluster is a distributed system, a single slow link like this could create an obvious bottleneck as gluster round-robins connections.
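
The same check can be done from software instead of by looking at switch lights. On Linux the negotiated speed of an interface is exposed in Mb/s at /sys/class/net/<interface>/speed. This is a minimal sketch, assuming eth1 is the gluster interface as described earlier; the function names are made up for illustration.

```python
from pathlib import Path


def link_speed_mbps(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mb/s, or None if unknown.

    Reading the file can fail (for example when the link is down), and
    some drivers report -1 when the speed is unknown.
    """
    try:
        text = (Path(sysfs_root) / interface / "speed").read_text().strip()
        speed = int(text)
    except (OSError, ValueError):
        return None
    return speed if speed > 0 else None


def is_degraded(speed_mbps, expected_mbps=1000):
    """True when a known speed is below the expected link speed."""
    return speed_mbps is not None and speed_mbps < expected_mbps


if __name__ == "__main__":
    speed = link_speed_mbps("eth1")
    if speed is None:
        print("eth1: speed unknown")
    elif is_degraded(speed):
        print(f"eth1: only {speed} Mb/s, check the cable")
    else:
        print(f"eth1: {speed} Mb/s")
```

Run on every node, this would flag a machine that negotiated 100 Mb/s without a trip to the switch.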

So in my next post, I am going to start working on some better performance testing.