Hardware Requirements for HBase
HBase is a powerful and flexible technology, but with that flexibility comes a requirement for proper configuration and tuning. Time for some general guidelines for configuring HBase clusters. Your “mileage” may vary, depending on the specific compute requirements of your RegionServers (custom coprocessors, for example) and any other applications you choose to co-locate on the cluster.
Region Server
The first temptation to resist when configuring your RegionServers is to sink a lot of money into top-of-the-line enterprise systems. Don’t do it! HBase is typically deployed on plain vanilla commodity x86 servers.
Now, don’t take this statement as a license to deploy the cheapest, lowest-quality servers you can find. Yes, HBase is designed to recover from node failures, but your availability suffers during recovery periods, so hardware quality and redundancy do matter.
Redundant power supplies as well as redundant network interface cards are a good idea for production deployments. Typically, organizations choose two sockets with four to six cores each.
The second temptation to resist is to configure your servers with maximum storage and memory capacity. A common configuration includes 6 to 12 terabytes (TB) of disk space and 48 to 96 gigabytes (GB) of RAM. Disk RAID controllers are unnecessary because HDFS provides data protection when disks fail.
HBase needs dedicated read and write caches (the BlockCache and the MemStores, respectively), and both are allocated from the RegionServer’s Java heap. Keep this in mind when you read about HBase configuration variables, because you’ll see that there is a direct relationship between a RegionServer’s disk capacity and its Java heap. Check out the excellent discussion on HBase RegionServer memory scaling.
The article states that you can estimate the ratio of raw disk space to Java heap by following this formula:
(Region Size / MemStore Size) × HDFS Replication Factor × Heap Fraction for MemStores
Using default HBase configuration variables provides this ratio:
10 GB / 128 MB × 3 × 0.4 = 96, or 96 MB of raw disk space for every 1 MB of Java heap.
That ratio works out to roughly 3 TB of raw disk capacity per RegionServer when 32 GB of RAM is allocated to the Java heap.
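If you want to check the arithmetic yourself, here’s a minimal sketch in Java that works through the same ratio; the hard-coded values are the defaults cited above, plugged in by hand rather than read from a live cluster’s configuration:

    public class RegionServerSizing {
        public static void main(String[] args) {
            // Default values cited above (assumed, not read from hbase-site.xml)
            double regionSizeMb = 10 * 1024;        // 10 GB maximum region size
            double memstoreFlushSizeMb = 128;       // 128 MB MemStore flush size
            double replicationFactor = 3;           // HDFS default replication
            double heapFractionForMemstores = 0.4;  // 40% of the heap reserved for MemStores

            // MB of raw disk served per MB of Java heap
            double diskToHeapRatio = regionSizeMb / memstoreFlushSizeMb
                    * replicationFactor * heapFractionForMemstores;  // = 96

            double heapGb = 32;                                      // RegionServer Java heap
            double rawDiskTb = heapGb * diskToHeapRatio / 1024;      // = 3 TB raw

            System.out.printf("Disk-to-heap ratio: %.0f MB of disk per 1 MB of heap%n", diskToHeapRatio);
            System.out.printf("Raw disk per RegionServer with a %.0f GB heap: %.1f TB%n", heapGb, rawDiskTb);
        }
    }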
What you end up with, then, is 1 TB of usable space per RegionServer, since the default HDFS replication factor is 3. That number is still impressive in terms of database storage per node, but it’s less impressive when you consider that commodity servers can typically accommodate eight or more drives of 2 to 4 TB apiece.
The overarching problem as of this writing is that current Java Virtual Machines (JVMs) struggle to provide efficient memory management (garbage collection, to be exact) with large heap spaces (greater than 32 GB, for example).
Yes, there are garbage-collection tuning parameters you can use, and you should check with your JVM vendor to make sure you have the latest options, but you won’t be able to tune your way around this limitation entirely at this time.
The memory-management issue will eventually be resolved, but for now you should be aware that you may run into a problem if your HBase storage requirements run from hundreds of terabytes to a petabyte or more. You can easily increase the region size to 20 GB, which gets you to 6 TB raw and 2 TB usable per RegionServer.
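As a quick sanity check on that 20 GB figure, here’s the same back-of-the-envelope arithmetic with only the region size changed (leaving everything else at the defaults is an assumption about how you’d apply the tweak):

    public class RegionServerSizingLargeRegions {
        public static void main(String[] args) {
            double regionSizeMb = 20 * 1024;        // region size raised from 10 GB to 20 GB
            double memstoreFlushSizeMb = 128;       // MemStore flush size left at the default
            double replicationFactor = 3;           // HDFS default replication
            double heapFractionForMemstores = 0.4;  // MemStore heap fraction left at the default

            double ratio = regionSizeMb / memstoreFlushSizeMb
                    * replicationFactor * heapFractionForMemstores;  // = 192

            double heapGb = 32;
            double rawTb = heapGb * ratio / 1024;                    // = 6 TB raw
            double usableTb = rawTb / replicationFactor;             // = 2 TB usable

            System.out.printf("Ratio: %.0f:1, raw: %.0f TB, usable: %.0f TB%n", ratio, rawTb, usableTb);
        }
    }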
You can make other tweaks (reducing the MemStore size for read-heavy workloads, for example), but you won’t make huge leaps in usable space until we have a JVM that handles garbage collection efficiently with huge heaps.
You can find ways around the JVM garbage-collection issue for RegionServers, but the solutions are new and not yet part of the mainline HBase distribution as of this writing.
Master Servers
The MasterServer doesn’t consume system resources the way RegionServers do, but you should still provide hardware redundancy, including RAID, to keep it running in the event of a component failure. For good measure, also configure a backup MasterServer in the cluster. A common configuration is 4 CPU cores, 8 GB to 16 GB of RAM, and 1 GbE networking. If you choose to co-locate the MasterServer and Zookeeper on the same node, 16 GB of RAM is recommended.
Zookeeper
Like the MasterServer, Zookeeper doesn’t require an extensive hardware configuration, but it should not be starved of (or have to compete for) system resources. Zookeeper, the coordination service for an HBase cluster, sits in the data path for clients. If Zookeeper can’t do its job, timeouts will occur, and the results can be disastrous.
The hardware requirements for Zookeeper are otherwise the same as for the MasterServer, except that a dedicated disk should be provided for the process. For smaller clusters, you can co-locate Zookeeper with the MasterServer, but remember that Zookeeper needs enough system resources to run whenever it needs to.