Block size and its impact on storage performance
Explore structured versus unstructured data, how storage segments respond to block size changes, and the differences between I/O-driven and throughput-driven workloads.
This article examines how block size affects storage performance. It defines structured and unstructured data, describes how different storage segments react to changes in block size, and explains the differences between I/O-driven and throughput-driven workloads. It also covers how to calculate throughput and how to select storage products based on workload type.
Block size and why it matters
In computing, a block (sometimes called a physical record) is a sequence of bytes or bits processed as a unit. The amount of data processed or transferred in a single operation to or from a system or storage device is called the block size, and it is one of the factors that determine storage performance. Block size is a key factor in performance benchmarking of storage products and in categorizing products into block, file, and object segments.
Structured Data vs. Unstructured Data
Structured data is organized in a standardized format, usually in a table with rows and columns, making it easily accessible to people and software. It is usually quantitative data, meaning it can be counted or measured, and can include data types such as numbers, short text, and dates. Structured data is well suited for analysis and can be combined with other data sets and stored in a relational database.
Unstructured data simply refers to data sets that are not stored in a structured database format, typically large collections of files. Unstructured data has an internal structure, but that structure is not predefined through a data model. It can be human-generated or machine-generated, in text or non-text formats.
Typically, the block size for structured data falls between 4KB and 128KB, and in some cases it can reach 512KB. In contrast, the block size range for unstructured data is much larger and can easily reach the MB range.
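To make these ranges concrete, here is a minimal Python sketch (an illustration, not from the article) that counts how many I/O operations a 1 GiB transfer requires at block sizes typical of each category:

```python
# Illustrative only: how many I/O operations a 1 GiB transfer needs
# at block sizes typical for structured vs. unstructured data.

def io_count(transfer_bytes: int, block_size_bytes: int) -> int:
    """Number of I/O operations to move transfer_bytes at a given block size."""
    # Round up: a partial final block still costs one I/O.
    return -(-transfer_bytes // block_size_bytes)

GIB = 1024 ** 3
for label, block in [("4KB (structured)", 4 * 1024),
                     ("128KB (structured)", 128 * 1024),
                     ("1MB (unstructured)", 1024 * 1024)]:
    print(f"{label:>20}: {io_count(GIB, block):>9,} I/Os")
```

The same amount of data costs 256 times as many I/Os at 4KB as at 1MB, which is why block size drives the I/O-driven versus throughput-driven distinction discussed below.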
Block sizes for structured and unstructured data
OLTP (online transaction processing) is a type of data processing that handles many concurrent transactions, such as online banking, shopping, order entry, or sending a text message. OLAP (online analytical processing) is a software technology for analyzing business data from different perspectives; organizations collect and store this data from multiple sources, such as websites, applications, smart meters, and internal systems.
Most OLTP workloads follow structured data patterns, whereas most OLAP workloads follow unstructured data patterns; the main difference between them is the block size.
Throughput/IOPS formula based on block size
Storage throughput (also known as data transfer rate) measures the amount of data transferred to and from a storage device per second. Typically, throughput is measured in MB/s. Throughput is closely related to IOPS and block size.
IOPS (Input/Output Operations Per Second) is a standard unit of measurement for the maximum number of reads/writes to non-contiguous storage locations.
Throughput (MB/s) = IOPS × block size (KB/IO) ÷ 1024

In this formula, KB/IO is the block size. So, depending on the block size, each workload is either I/O-driven or throughput-driven: at a given throughput, a workload with higher IOPS is using a smaller block size, while a workload with higher throughput per operation is using a larger block size.
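As a quick sanity check, here is a minimal sketch of the calculation in Python; the IOPS and block-size figures are illustrative assumptions, not numbers from the article:

```python
# Minimal sketch of the throughput formula above:
# Throughput (MB/s) = IOPS x block size (KB/IO) / 1024

def throughput_mb_s(iops: float, block_size_kb: float) -> float:
    """Throughput in MB/s for a given IOPS rate and block size in KB."""
    return iops * block_size_kb / 1024

# Small-block (I/O-driven) workload: high IOPS, modest throughput.
print(throughput_mb_s(iops=50_000, block_size_kb=4))    # ~195 MB/s
# Large-block (throughput-driven) workload: low IOPS, high throughput.
print(throughput_mb_s(iops=500, block_size_kb=1024))    # 500 MB/s
```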
Storage performance based on block size
Storage technologies respond differently depending on block size, so storage recommendations vary with block size and response time. Block storage is better suited to applications with smaller block sizes, while file-level and object storage are better suited to applications with larger block sizes.
Storage technologies and their ranges based on block size
Block storage has always been the choice for production workloads with smaller block sizes, and these applications drive higher IOPS. Performance numbers for the IOPS each array is capable of are published in every block storage release note. Meanwhile, file-level storage, or any NFS storage, is better suited to block sizes greater than 1MB.
Object storage is a relatively new product category, introduced to store files and folders across multiple sites with a performance range similar to NFS.
Object storage requires a load balancer to distribute data across storage systems, which also helps improve performance. Both NFS and object storage have higher response times than block storage because each I/O must travel over the network to the disk and back to complete the I/O cycle. The average response time for NFS and object storage is above 10 milliseconds.
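One way to see why response time matters is a back-of-the-envelope sketch using Little's Law, which the article does not state explicitly; the queue depth here is an assumption:

```python
# Illustrative sketch (an assumption based on Little's Law, not from the
# article): achievable IOPS at a fixed queue depth falls as response time rises.

def max_iops(outstanding_ios: int, response_time_ms: float) -> float:
    """Upper bound on IOPS for a given queue depth and per-I/O response time."""
    return outstanding_ios * 1000 / response_time_ms

# Block storage at ~1 ms vs. NFS/object storage above 10 ms, queue depth 32:
print(max_iops(32, 1.0))    # 32,000 IOPS
print(max_iops(32, 10.0))   #  3,200 IOPS
```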
File system storage can accommodate a wider range of block sizes. The architecture of file system storage can be adjusted to handle striping for most block sizes and improve overall performance. Typically, file system storage is used to implement data lakes, analytical workloads, and high-performance computing. Most file system storage also uses agent-level software installed on the server to better distribute data over the network and improve performance.
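As a rough illustration of how striping spreads a large I/O across disks (a hypothetical layout; the stripe unit and disk count are assumptions, not values from the article):

```python
# Hypothetical striping sketch: how a file-system stripe layout maps a byte
# offset to a disk and a stripe unit on that disk. The stripe unit and disk
# count below are illustrative assumptions.

STRIPE_UNIT = 128 * 1024   # bytes written to one disk before moving to the next
NUM_DISKS = 4

def locate(offset: int) -> tuple[int, int]:
    """Return (disk index, stripe unit index on that disk) for a byte offset."""
    unit = offset // STRIPE_UNIT        # which stripe unit overall
    return unit % NUM_DISKS, unit // NUM_DISKS

# A 1MB sequential read touches all four disks in parallel:
for off in range(0, 1024 * 1024, STRIPE_UNIT):
    print(off, locate(off))
```

Because a single large I/O lands on several disks at once, throughput scales with the stripe width, which is why this architecture favors larger block sizes.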
InfiniBand-based setups are the file system storage of choice for large-scale data lake or HPC deployments, where workloads are throughput-driven and large amounts of data must be ingested in a short period of time.
vSAN was introduced as a block storage product for VMware workloads and has been very successful with OLTP workloads. More recently, vSAN has also been used for workloads with larger block sizes, especially backup workloads where response-time requirements may be less critical. Working in vSAN's favor are its improved architecture and larger cluster sizes, which help raise overall performance.
Workloads, block sizes, and appropriate storage
Since storage products perform differently at different block sizes, how do you choose storage based on your workload's block size? Here are some examples:
Consider common workloads and their respective block sizes. Mapping a workload's block size against its overall performance requirements points to the right storage product.
For block sizes under 256KB, most block storage performs well regardless of vendor, because block storage architecture is best suited to small-block workloads. Conversely, larger-block workloads such as RMAN or Veeam backups are better suited to NFS or object storage, because they are throughput-driven. Other design parameters, such as throughput requirements, total capacity, and read/write percentages, help determine the size of the solution.
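Putting this guidance together, the selection logic might be sketched as a simple decision rule; the thresholds below are illustrative readings of the article, not vendor sizing guidance:

```python
# A rough selection heuristic distilled from the guidance above. The
# thresholds are illustrative assumptions, not definitive sizing rules.

def suggest_storage(block_size_kb: float, latency_sensitive: bool) -> str:
    if block_size_kb <= 256 and latency_sensitive:
        return "block storage"           # small blocks, low response time
    if block_size_kb >= 1024:
        return "NFS or object storage"   # throughput-driven, e.g. backups
    return "file system storage"         # flexible middle ground

print(suggest_storage(8, latency_sensitive=True))      # OLTP -> block storage
print(suggest_storage(2048, latency_sensitive=False))  # RMAN/Veeam -> NFS/object
```

In practice the choice would also weigh the other design parameters mentioned above, such as total capacity and read/write mix.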
Final Thoughts
Hopefully, this analysis helps IT engineers and architects design their setups according to the nature of their application workloads and block sizes.