Monday, March 15, 2010

Cluster Interconnect Network Benchmarking

Netperf is a great tool for benchmarking and stress testing your cluster interconnect. Oracle RAC demands a high-bandwidth, low-latency interconnect in order to effectively utilize Cache Fusion. Cache Fusion, in a nutshell, allows each server in the cluster to access blocks of data in the other servers’ memory rather than going to disk. In this particular cluster, with 128GB of memory in each of 4 servers, we have an effective 512GB cache. The cluster interconnect therefore becomes critical, because it is essentially another (slightly slower) memory bus. Oracle uses UDP to communicate between servers in the cluster, not only for heartbeat, but also for these fast server-to-server memory copy operations.

Here’s how to benchmark your cluster interconnect network:

  1. Download the source tarball from ftp://ftp.netperf.org/netperf/netperf-2.4.5.tar.gz
  2. Extract the tarball, cd into the resulting directory, and run “./configure --enable-demo=yes; make; make install”
  3. Add the following line to your /etc/services file: “netperf 12865/tcp”
  4. Add /etc/xinetd.d/netperf with the following contents:

     # default: on
     # description: Netperf service
     service netperf
     {
         disable     = no
         socket_type = stream
         protocol    = tcp
         wait        = no
         user        = root
         server      = /usr/local/bin/netserver
     }

  5. Open port 12865 TCP and UDP in your iptables configuration (see the example rules after this list).
  6. Test your installation by typing “netperf”, which will run a 10-second test against the loopback adapter.
  7. Repeat the above steps on every server in the cluster.
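
For step 5, the exact firewall changes depend on your existing rule set; here’s a minimal sketch for a stock RHEL 5 iptables setup (adjust the chain and position to match your own rules):

# Allow the netperf control port (12865) over TCP and UDP
iptables -A INPUT -p tcp --dport 12865 -j ACCEPT
iptables -A INPUT -p udp --dport 12865 -j ACCEPT

# Persist the rules across reboots on RHEL
service iptables save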

Now, to test your cluster interconnect, here are some tests you can run:

  1. TCP Stream Test (10 seconds): “root@server1# netperf -H server2 -t TCP_STREAM”
  2. UDP Stream Test (10 seconds): “root@server1# netperf -H server2 -t UDP_STREAM”
  3. UDP Request/Response (10 seconds): “root@server1# netperf -H server2 -t UDP_RR”

And, the test that I believe best represents continuous Oracle Interconnect (Cache Fusion) traffic:

1 hour long UDP stream test:

[root@server1 ~]# netperf -t UDP_STREAM -D 10 -l 3600 -H server2
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to server2 (x.x.x.x) port 0 AF_INET : demo
Interim result: 9920.38 10^6bits/s over 10.01 seconds
Interim result: 9925.23 10^6bits/s over 10.00 seconds
Interim result: 9922.43 10^6bits/s over 10.00 seconds

At the end of the run you should get a summary like this:

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

1048576   65507  300.05     5682971      0     9925.78
1048576          300.05     3647379            6370.45

It looks like we are getting 9925 megabits per second from our 10 gigabit Ethernet connections. We are using jumbo frames (9000 MTU); in my experience you can’t get more than 5-6 gigabits per second without them. You also want to verify that the error count is 0; otherwise you may have packet loss or physical network issues on the cluster interconnect.
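
To confirm jumbo frames are actually in effect, here’s a quick sketch (it assumes the private interconnect is eth2; substitute your own interface):

# Check the current MTU on the interconnect NIC
ip link show eth2 | grep mtu

# Set jumbo frames until the next reboot
ip link set eth2 mtu 9000

# To make it persistent on RHEL 5, add MTU=9000 to
# /etc/sysconfig/network-scripts/ifcfg-eth2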

I recommend running these tests and benchmarks on both the public and private network connections, between every server.
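
A simple way to automate that is a loop like the following (the hostnames are placeholders for your own public and private names):

# Run a 30-second TCP stream test from this node to every other
# node, over both the public and private networks
for host in server2 server3 server4 server2-priv server3-priv server4-priv; do
    echo "== $host =="
    netperf -H $host -t TCP_STREAM -l 30
done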

Happy benchmarking!

Thursday, March 11, 2010

Benchmarking multi-server Data Warehouse performance

I figured out a way to run a benchmark across multiple servers in the cluster simultaneously. All 4 servers are synchronized to an NTP time source, and I scheduled an at job to run the same benchmark at the same time on all 4 servers, so the data points align in time. Each data point consists of exactly 60 seconds of I/O traffic. On all 4 nodes I tail -f the .csv file, so I can watch each data point be reported in real time as it is captured, with every node in sync down to the split second.
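
Here’s a rough sketch of the scheduling step (the wrapper script name and hostnames are placeholders, and it assumes passwordless root SSH between nodes):

# Queue the same benchmark wrapper to start at 02:00 on every node;
# NTP keeps the clocks aligned so the runs kick off together
for host in server1 server2 server3 server4; do
    ssh root@$host 'echo "/root/run_benchmark.sh > /root/results.csv" | at 02:00'
done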

Here are the results of the 2 node benchmark I ran, pushing just over 4GB/sec:

[Chart: 2-node benchmark results, just over 4GB/sec]

And the 4 node benchmark, getting close to 5GB/sec.  Almost 5 gigabytes a second is pretty impressive for a storage system that cost less than $200K…  I know I could build a server with multiple attached MSA70 shelves that would have faster performance, but this is a shared-everything, fully scale-out architecture, not a single-node “all eggs in one basket” scenario:

[Chart: 4-node benchmark results, approaching 5GB/sec]

Tuesday, March 9, 2010

Benchmarking the HP MSA 2000 Storage

I spent a lot of time last week benchmarking the MSA 2000 storage arrays with a DSS/sequential workload.  I’m going to run you through my process so you can see where I made breakthroughs in performance.

Since we have 4 storage arrays, I initially decided that the best way to obtain maximum performance was to find the configuration that gives the fastest single-array performance. Having 4 arrays let me carve each one a different way, run identical benchmarks against all of them, and get 4 result sets in far less time than it would take to carve all 4 identically and then recarve them for each test.

To start out, I chose 4 possible configurations that I thought had potential for good performance (we have 48 disks + 1 global spare per array):

  • Configuration #1: 4x RAID 5 devices of 12 disks each, striped into a RAID 0+5 configuration.
  • Configuration #2: 4x RAID 5 devices of 12 disks each, split across both shelves (6 from each shelf), striped into a RAID 0+5 configuration.
  • Configuration #3: 6x RAID 5 devices of 8 disks each, striped into a RAID 0+5 configuration.
  • Configuration #4: 4x RAID 6 devices of 12 disks each – RAID 0+6 was not supported by this array, so I did the striping in Oracle.

After creating each of the underlying virtual disks, I then placed a total of 16 volumes of approximately 400GB each on top (the actual volume size varies depending on how the storage is carved, but anywhere from 360GB to 403GB each was obtained), and told the Oracle ORION benchmarking tool to treat these as a RAID 0 striped configuration across all 16, which is what Oracle likes to do to achieve greater performance (see the sketch below).
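
For reference, here’s a sketch of the kind of ORION run used (the flags are from the 11g ORION documentation; the device paths and test name are placeholders):

# dw.lun lists the 16 volumes, one device path per line
echo "/dev/mapper/vol01" >  dw.lun
echo "/dev/mapper/vol02" >> dw.lun
# ...repeat for the remaining 14 volumes

# A DSS-style run: large sequential I/O striped across every LUN in
# dw.lun; num_disks tells ORION how many physical spindles are behind it
./orion -run dss -testname dw -num_disks 48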

Here are the results of these initial tests:

[Chart: results of the initial tests across the 4 configurations]

The first thing that jumped right out at me is how much faster the RAID 6 configuration was than the other 3.  I wanted to figure out why this was the case, when my past experience has led me to believe that RAID 6 is usually slightly slower than RAID 5.  The only thing that was different was that the array only supports plain RAID 6, not RAID 6+0 or 0+6.  For all of the RAID 5 tests, I was using the array’s capability of RAID 0+5 (they call it RAID 50) to stripe across multiple RAID 5 virtual disks.  My theory was that for some reason, utilizing the array’s RAID 50 technology was harming performance.

To test this theory, I ran a new test, this time without using RAID 0+5 on the array, simply carving 4 virtual disks of 12 disks each in a RAID 5 configuration.  Of course, Oracle still stripes across all 16 of the volumes I created on these 4 virtual disks, so the effect is the same; the only difference is that the striping is now done in software rather than hardware.  Here are the results of those tests:

[Chart: results with striping done in software rather than the array’s RAID 50]

Now we’re getting somewhere!  This was my first breakthrough: the array’s RAID 50 (RAID 0+5) feature was harming performance, so we could achieve greater throughput by leaving it disabled and striping in software instead.  As you can see, the performance difference between RAID 5 and RAID 6 is negligible, so we could now base our decision on redundancy and availability rather than performance.

We were now topping out at around 1000 MB/sec, or close to 1GB a second of data transfer rate.  This is not bad for a single array, but I knew the MSA 2000 could perform better.  Now, to determine how to get that last bit of performance out of the storage.

I had one more idea for improving performance.  Up until now, I had been creating 16 volumes, 4 on each virtual disk, and using Oracle to stripe across them; each virtual disk was either a RAID 5 or RAID 6 set containing 12 physical disks.  My new theory was that carving 4 volumes per virtual disk and having Oracle stripe across all 4 forced the hard disks to do unnecessary seeking, since each read or write had to access all 4 volumes simultaneously.  So rather than carving 4 volumes per virtual disk, I carved just one large volume per virtual disk, 4 in total, and tested like that:

[Chart: results with one large volume per virtual disk]

My theory was correct.  By creating only 4 volumes instead of 16, each volume containing the entire space of a RAID 5 or 6 virtual disk with 12 physical drives in it, we were able to achieve a maximum performance rate of about 1150MB/sec.  This was a 15% improvement over our previous test results, approaching the array’s rated maximum of 1.24GB/sec for a fully loaded unit.

The last step was to take our two best-performing single-array configurations, deploy them across all 4 units, and test aggregate performance.  Here are the results of that test:

[Chart: aggregate performance of the two best configurations across all 4 arrays]

This is looking really good.  One thing I should mention is that with a single server, I can never fully saturate all 4 storage arrays.  We are now reaching peak performance of around 2700 MB/sec, which is getting close to the limits of our host bus adapters.  It will actually take at least 2 of these servers doing maximum sequential reads to fully saturate the capability of this storage.

Next steps:

  • We’ve determined which RAID 5 or 6 configuration gives us the best performance; now we need to decide between RAID 5 and RAID 6.  With RAID 5, we can withstand a single disk failure per 12 disks, and the global spare will be brought online immediately.  With RAID 6, we can withstand 2 disk failures per 12 disks.  Capacity is the other consideration: RAID 5 leaves us 25.6TB of usable space versus 23.4TB for RAID 6 (4 arrays x 4 virtual disks x 11 or 10 data disks x 146GB).  If we decide to mirror across storage arrays for further redundancy, these figures are halved to 12.8TB for RAID 5 and 11.7TB for RAID 6.
  • Optimal sizing for redo logs and archive logs:  I would prefer to present 16 total disks, 4 from each array, to achieve the best performance.  With RAID 5, each of these 16 disks will be roughly 1.6TB in size; with RAID 6, roughly 1.45TB.  I recommend partitioning them at the Linux level, taking a small slice from the beginning of each disk for redo or archive logs (a partitioning sketch follows this list).  For example, taking 50GB from each disk would give us 800GB total for redo logs; likewise, taking 100GB from each disk would give us 1.6TB total for archive logs.  I’ve scheduled a meeting with the DBA team to make decisions here.
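
Here’s a rough sketch of that partitioning on a single LUN (the device name and slice sizes are placeholders; repeat for all 16 disks):

# Carve a 50GB redo slice and a 100GB archive slice from the front
# of the LUN, leaving the remainder for the data area
parted -s /dev/mapper/lun01 mklabel gpt
parted -s /dev/mapper/lun01 mkpart redo 1MB 50GB
parted -s /dev/mapper/lun01 mkpart arch 50GB 150GB
parted -s /dev/mapper/lun01 mkpart data 150GB 100%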

All in all, this is looking like a very exciting, high-performing data warehouse solution.  Now I want to figure out a way to benchmark more than one node simultaneously and obtain meaningful results.  I suppose I could start ORION on 2 nodes simultaneously and aggregate their results; however, that might not be ideal.

Any blog readers out there care to offer suggestions for benchmarking multi-node performance?  SNMP data gathering at the SAN fabric level?  Software benchmark tools?

Installing the Oracle 11g Data Warehouse

2 weeks ago I travelled to Sacramento, CA, to physically install the servers and storage for this new Oracle 11g data warehouse.  The servers are 4 of the HP ProLiant DL585 G6.  These servers are pretty incredible, using the new 6-core AMD Opteron CPUs for a total of 24 cores per server, with 3MB of cache per core.  Each server came bundled with 128GB of RAM, which, pooled together, will give us a massive half terabyte of cache for Oracle 11g RAC to use!

The network consists of a pair of Cisco Catalyst 4900M switches, each with 24 copper CX4 10 gigabit Ethernet ports, and each server has a pair of dual-port Intel 10GbE adapters, so we can have bonded connections to both the public IP network and the private Oracle RAC interconnect (a sample bonding config is sketched below).  With half a terabyte of total cache and a 10 gigabit interconnect, cache hits through Oracle Cache Fusion should be extremely fast, almost as fast as local memory.
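
For the curious, a RHEL 5 bonding setup for the interconnect side looks something like this (the interface names, bonding mode, and address are examples, not our exact config):

# /etc/modprobe.conf
alias bond1 bonding
options bond1 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
IPADDR=192.168.10.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
MTU=9000

# /etc/sysconfig/network-scripts/ifcfg-eth2 (repeat for eth3)
DEVICE=eth2
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none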

On the SAN network we have a pair of HP/Brocade 8/24 8 gigabit switches, with 4 of the 8Gb QLogic adapters per server, for a total SAN bandwidth of 32Gb per server.  Each storage array has 4x 4Gb connections to the SAN, for a total of 16Gb per array.

The storage consists of 4 of the HP MSA 2000fc G2 storage arrays, each with an HP MSA70 expansion shelf, for a total of 49x 146GB 10k RPM disks per array (24 in the MSA 2000 itself and 25 in the expansion shelf), 196 disks in total.  The odd 49th disk leaves you with an uneven number, but it works great as a global hot spare in the event of a disk failure.

I spent the week installing and configuring the storage and servers, and getting everything on the network.  Thanks to my network engineer, James, for helping on the Cisco side.  At this point all of my HP/Brocade switches have been zoned, all of my servers have a base Red Hat Linux 5 installation completed, and I’m ready to start benchmarking storage.  Now that everything is network accessible and physically cabled, I can head back home and do the rest remotely.

Here’s what it looks like – Storage on the top, servers on the bottom, SAN switches at the very top, and you can’t see them, but network switches in the middle facing the back:

[Photo IMG_0883: the rack, with storage arrays on top, servers on the bottom, and SAN switches at the very top]

Oracle 11g Data Warehouse Storage

I’m in the middle of deploying an Oracle Data Warehouse solution and we decided to go a new direction with the storage.  Normally, for a large Oracle cluster like this, we’d purchase a large HP or NetApp storage array and put a couple hundred disks in it.  As you can imagine, this costs a lot of money, usually several hundred thousand USD.  There are some limitations to this approach for data warehousing.  Let me explain:

Sequential read performance is our most important criterion for the data warehouse storage.  Even write performance is not critical, because data will be loaded infrequently through an ETL process.  There will only be a few users at a time on the system, but those users want their queries to run very fast.  These are complex queries with JOINs that perform many full table scans.  Random I/O performance is not that important, since this is not your typical transactional database or OLTP application.

From past benchmarking of a fully loaded HP EVA 8100, I’ve determined that it can do about 75K random I/Os per second, and about 1.6GB/sec of sequential reads.  The random performance is pretty good, but knowing we have a hard upper limit of 1.6GB/sec is very limiting.  If we need more performance from the data warehouse, we have to purchase a second storage array and stripe across both, costing even more money.  We needed a more cost-effective solution.

Enter the HP MSA 2000.  These are “budget” storage arrays that are not nearly as powerful as an EVA, don’t hold nearly as many disks, and don’t have nearly as much random I/O performance.  However, they do have pretty good sequential performance.  Here are their performance characteristics:

MSA2 G2 Performance                  MSA2300fc G2        MSA2000sa G2   MSA2000i G2
Protocol (host connect)              4Gb Fibre Channel   3Gb SAS        1GbE Ethernet

MSA2000 RAID 10 Performance Results
Random Reads IOPS                    22,874              21,861         13,658
Random Writes IOPS                   15,008              14,491         14,399
Random Mix IOPS (60/40 read/write)   17,878              17,700         13,493
Sequential Reads MB/s                1,238               1,050          274
Sequential Writes MB/s               532                 532            266

MSA2000 RAID 5 Performance Results
Random Reads IOPS                    22,044              21,000         12,335
Random Writes IOPS                   2,714               2,714          2,714
Random Mix IOPS (60/40 read/write)   5,325               4,926          5,325
Sequential Reads MB/s                1,238               1,050          274
Sequential Writes MB/s               725                 617            266

MSA2000 RAID 6 Performance Results
Random Reads IOPS                    21,975              21,000         12,292
Random Writes IOPS                   1,876               1,876          1,876
Random Mix IOPS (60/40 read/write)   4,293               3,865          3,851
Sequential Reads MB/s                1,238               1,050          274
Sequential Writes MB/s               772                 729            266

Looking at the table above, the MSA 2000 we’re interested in is the MSA2300fc G2, which can do 1,238 MB/sec of sequential reads.  At 1.24GB/second, this is getting very close to the EVA 8100’s 1.6GB/second.  Now consider that this array, with a couple shelves of disks, costs about 20% of the price of an EVA 8100.

For our data warehouse solution, we decided to purchase 4 of these smaller MSA2300fc G2 arrays and use Oracle ASM to stripe across them, hopefully achieving an aggregate sequential throughput of approximately 5GB/second (4x 1.24GB/second).
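
As a sketch, the ASM side of that striping looks something like this (the disk paths are placeholders, and external redundancy is used because RAID is handled by the arrays):

# Create one ASM disk group with a LUN from each of the 4 arrays;
# ASM stripes extents across all of them automatically
sqlplus / as sysasm <<'EOF'
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY
  DISK '/dev/raw/raw1', '/dev/raw/raw2',
       '/dev/raw/raw3', '/dev/raw/raw4';
EOF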

This solution fits a data warehouse workload better than a single large storage array, because as performance and capacity needs grow, we can simply add more MSA 2000s.

Saturday, March 6, 2010

Welcome to Storage Adventures

I wanted to start blogging about some of the storage work I've been doing lately. I'm currently building out a mid-sized data warehouse and I've been benchmarking some storage. Hopefully others will find this stuff interesting and I can save some other storage engineers a lot of time. Look for more posts soon.