Solving big data challenges requires managing large, highly distributed data stores along with computationally and data-intensive applications. Virtualization provides an extra level of efficiency that helps make big data platforms practical. Although virtualization is not technically a requirement for big data analysis, software frameworks run more efficiently in a virtualized environment.
Virtualization has three characteristics that support the scalability and operational efficiency required for big data environments:
Segmentation: In virtualization, multiple applications and operating systems are supported in a single physical system by partitioning the available resources.
Isolation: Each virtual machine is isolated from its host physical system and other virtual machines. Because of this isolation, if one virtual instance goes down, the other virtual machines and the host system will not be affected. In addition, data is not shared between one virtual instance and another.
Encapsulation: A virtual machine can be represented (and even stored) as a single file, so you can easily identify it based on the services it provides.
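These three characteristics can be pictured with a small, purely illustrative Python sketch. The `Host` class and its numbers are toy assumptions, not a real hypervisor API:

```python
import json

class Host:
    """A physical host whose resources are partitioned among virtual machines."""
    def __init__(self, cpus, ram_gb):
        self.free_cpus, self.free_ram = cpus, ram_gb
        self.vms = {}

    def create_vm(self, name, cpus, ram_gb):
        # Segmentation: carve a slice out of the host's physical resources.
        if cpus > self.free_cpus or ram_gb > self.free_ram:
            raise RuntimeError("insufficient physical resources")
        self.free_cpus -= cpus
        self.free_ram -= ram_gb
        # Isolation: each VM keeps its own private state; nothing is shared.
        self.vms[name] = {"cpus": cpus, "ram_gb": ram_gb, "data": {}}
        return self.vms[name]

    def encapsulate(self, name):
        # Encapsulation: the whole VM can be represented as a single file's contents.
        return json.dumps(self.vms[name])

host = Host(cpus=16, ram_gb=64)
host.create_vm("analytics", cpus=8, ram_gb=32)
host.create_vm("webserver", cpus=4, ram_gb=16)
print(host.encapsulate("analytics"))
```

If one VM's `data` dictionary is corrupted or the VM is deleted, the other entries are untouched, which mirrors the isolation property described above.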
Big data server virtualization
In server virtualization, a single physical server is divided into several virtual servers. Device hardware and resources—including RAM, CPU, hard disk, and network controller—can be turned into a series of virtual machines that each manage their own applications and operating system.
A virtual machine (VM) is a software representation of a physical machine that can perform the same functions as the physical machine. A thin layer of software, known as a virtual machine monitor or hypervisor, is inserted between the hardware and the virtual machines.
Server virtualization uses a hypervisor to make efficient use of physical resources. Of course, installation, configuration, and administrative effort are associated with setting up these virtual machines.
Server virtualization helps ensure that your platform can scale as needed to handle the large volumes and diverse types of data involved in big data analysis. You may not know how much capacity you will need before you begin your analysis. This uncertainty makes server virtualization all the more valuable, giving your environment the ability to meet unexpected demand for processing very large data sets.
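A rough capacity sketch shows why this elasticity matters. All of the sizes here (host RAM, VM size, per-VM working set) are illustrative assumptions, not benchmarks:

```python
# Toy sketch: estimating how many virtual servers an unpredictable
# data set would require, and how many physical hosts can carry them.

HOST_CAPACITY_GB = 256          # total RAM on one physical server (assumed)
VM_SIZE_GB = 32                 # RAM given to each virtual server (assumed)

def vms_needed(dataset_gb, per_vm_working_set_gb=24):
    """How many virtual servers a data set of a given size would require."""
    return -(-dataset_gb // per_vm_working_set_gb)   # ceiling division

def provision(dataset_gb):
    """Map a data-set size onto virtual and physical server counts."""
    needed = vms_needed(dataset_gb)
    per_host = HOST_CAPACITY_GB // VM_SIZE_GB        # VMs one host can hold
    hosts = -(-needed // per_host)
    return {"vms": needed, "physical_hosts": hosts}

print(provision(500))   # a hypothetical 500 GB working data set
```

Because VMs are just software, `provision` can be re-run as the data set grows, whereas adding physical hosts takes procurement and setup time.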
Additionally, server virtualization provides the foundation that enables many cloud services used as data sources in big data analysis. Virtualization increases the efficiency of the cloud, which facilitates the optimization of many complex systems.
Big data application virtualization
Application infrastructure virtualization provides an efficient way to manage applications in the context of customer demand. The application is packaged so that it is decoupled from the underlying physical computer system. This helps improve application management and overall portability.
In addition, application infrastructure virtualization software typically allows for commercial and technical usage policies to be written to ensure that each of your applications leverages virtual and physical resources in a predictable manner. Efficiencies are gained because you can easily allocate IT resources according to the relative business value of your applications.
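One way to picture such a usage policy is a weighted share of a resource pool, where applications with higher business value receive proportionally more capacity. The application names and weights below are hypothetical:

```python
def allocate_cpu_shares(total_cpus, app_priorities):
    """Split a CPU pool among applications in proportion to business priority.

    app_priorities maps application name -> priority weight (higher weight
    means higher business value). All names and weights are illustrative.
    """
    total_weight = sum(app_priorities.values())
    return {app: total_cpus * weight / total_weight
            for app, weight in app_priorities.items()}

shares = allocate_cpu_shares(32, {"fraud-detection": 5,
                                  "reporting": 2,
                                  "batch-archive": 1})
print(shares)
```

Real application-infrastructure products express policies far more richly (reservations, limits, time windows), but the proportional-share idea is the common core.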
Application infrastructure virtualization, used together with server virtualization, can help ensure that business service-level agreements are met. Server virtualization monitors CPU and memory usage, but it does not account for differences in workload priority when allocating resources.
Big data network virtualization
Network virtualization provides an efficient way to use networking as a pool of connection resources. Instead of relying on the physical network for managing traffic, you can create multiple virtual networks all utilizing the same physical implementation. This can be useful if you need to define a network for data gathering with a certain set of performance characteristics and capacity and another network for applications with different performance and capacity.
Virtualizing the network helps reduce network bottlenecks and improves your ability to manage the large distributed data sets required for big data analysis.
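The idea of several virtual networks sharing one physical link can be sketched as a bandwidth pool with per-network guarantees. The link speed and network names below are assumptions for illustration:

```python
class PhysicalLink:
    """One physical network whose capacity is split into virtual networks."""
    def __init__(self, bandwidth_gbps):
        self.available = bandwidth_gbps
        self.vnets = {}

    def create_vnet(self, name, guaranteed_gbps):
        # Each virtual network reserves a slice of the physical capacity,
        # so bulk data gathering cannot starve latency-sensitive traffic.
        if guaranteed_gbps > self.available:
            raise RuntimeError("physical link is fully allocated")
        self.available -= guaranteed_gbps
        self.vnets[name] = guaranteed_gbps

link = PhysicalLink(bandwidth_gbps=40)
link.create_vnet("data-gathering", guaranteed_gbps=25)  # bulk ingest traffic
link.create_vnet("applications", guaranteed_gbps=10)    # interactive traffic
print(link.vnets, link.available)
```

Production network virtualization (VLANs, overlays, SDN) is of course far more sophisticated, but the capacity-partitioning principle is the same.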
Big data processor and memory virtualization
Processor virtualization helps optimize processor utilization and maximize performance. Memory virtualization decouples memory from individual servers so that it can be pooled and shared across them.
In big data analysis, you may have repeated queries of large data sets and the creation of advanced analytic algorithms, all designed to look for patterns and trends that are not yet understood. These advanced analytics can require lots of processing power (CPU) and memory (RAM). Without sufficient CPU and memory resources, some of these computations can take a very long time to complete.
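Memory virtualization's decoupling can be pictured as a pool that satisfies one job's demand from whichever servers have free RAM. The server names, sizes, and allocator are all toy assumptions:

```python
class MemoryPool:
    """Pool RAM from several physical servers into one logical pool."""
    def __init__(self, servers):
        # servers maps server name -> available RAM in GB (illustrative values)
        self.free = dict(servers)

    def allocate(self, job, gb_needed):
        """Satisfy one job's memory demand from whichever servers have room."""
        plan, remaining = {}, gb_needed
        for server, free_gb in self.free.items():
            if remaining == 0:
                break
            take = min(free_gb, remaining)
            if take:
                plan[server] = take
                self.free[server] -= take
                remaining -= take
        if remaining:
            raise MemoryError("pool exhausted")
        return plan

pool = MemoryPool({"server-a": 64, "server-b": 64, "server-c": 64})
print(pool.allocate("pattern-mining", 100))  # spans more than one server
```

The point of the sketch is that a 100 GB analytic job can run even though no single server has 100 GB free, because the pool spans machines.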
Big data and storage virtualization
Data virtualization can be used to create a platform for dynamic linked data services. This allows data to be easily searched and linked through a unified reference source. As a result, data virtualization provides an abstract service that delivers data in a consistent form regardless of the underlying physical database. In addition, data virtualization exposes cached data to all applications to improve performance.
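A minimal sketch of that abstraction is a layer that answers queries in one consistent record shape, whatever the backing store looks like, and caches results. The two "backends" below are plain in-memory stand-ins for real databases; all names and records are hypothetical:

```python
class DataVirtualizationLayer:
    """Serve data in one consistent form regardless of the physical store."""
    def __init__(self):
        self.sql_backend = {1: ("alice", "2024-01-05")}                 # row tuples
        self.doc_backend = {2: {"name": "bob", "date": "2024-02-11"}}   # documents
        self.cache = {}

    def get_customer(self, cid):
        if cid in self.cache:              # cached data speeds up repeat access
            return self.cache[cid]
        if cid in self.sql_backend:        # normalize relational rows...
            name, date = self.sql_backend[cid]
            record = {"id": cid, "name": name, "date": date}
        else:                              # ...and document records alike
            doc = self.doc_backend[cid]
            record = {"id": cid, "name": doc["name"], "date": doc["date"]}
        self.cache[cid] = record
        return record

layer = DataVirtualizationLayer()
print(layer.get_customer(1))
print(layer.get_customer(2))   # same record shape, different physical store
```

Callers never learn (or care) which physical database answered, which is exactly the "unified reference source" the paragraph describes.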
Storage virtualization combines physical storage resources so that they are more effectively shared. This reduces the cost of storage and makes it easier to manage data stores required for big data analysis.