Starting from this example, we will look at how to solve problems in a two-dimensional domain using two-dimensional grids and blocks. As we know, threads can be organized into a multi-dimensional block, and blocks can likewise be organized into a multi-dimensional grid. This feature of the CUDA architecture enables us to create a two-dimensional or even three-dimensional thread hierarchy, so that solving two- or three-dimensional problems becomes easier and more efficient.

In this example, we will perform square matrix multiplication. The two input matrices, M and N, are of size Width x Width; the output matrix P has the same size. If you have studied linear algebra, you will know that the product of two square matrices is a square matrix of the same size. For example, to calculate entry (A, B) of the output matrix, we use row A of one input matrix and column B of the other. We first take the leftmost element of row A and multiply it by the top element of column B. Next, we take the second element from the left in row A and multiply it by the second element from the top in column B. We do this for all the elements in row A and column B, and then take the sum of the products. That sum is the value at entry (A, B) of the output matrix.

As you can see, this kind of operation is highly parallel, which makes it a perfect fit for CUDA. We exploit the parallelism by assigning each entry of the output matrix a thread of its own. Each thread fetches the data it needs, performs all the calculations, and later writes the result back to the output matrix.

Matrix Multiplication with Global Memory source file:

The reason the CUDA architecture has many memory types is to increase the memory access speed, so that the data transfer rate can match the data processing rate. The following example shows why matching these two rates is so important for GPU computation. One of the most important measures of a processor's computational ability is its floating-point operation rate (flops). We assume that, in order to perform one floating-point operation, the runtime needs to transfer one single-precision floating-point datum from global memory to the computational kernel. With this in mind, we can proceed to our example.

The NVIDIA Tesla C2075 companion processor supports 144 gigabytes per second (GB/s) of global memory access bandwidth. With 4 bytes in each single-precision floating-point datum, we can load no more than 36 (144/4) giga single-precision data per second. Since the computational kernel cannot compute more floating-point data than global memory can deliver, it will execute no more than 36 gigaflops. The actual computational capability of our Tesla card is 1 teraflops (1000 gigaflops), but because of the limited memory access speed, we can achieve less than 4 percent of that peak. In other words, the highest achievable floating-point throughput is limited by the rate at which the input data can be transferred from global memory to the computational kernel.

To address this problem, the CUDA architecture provides several types of memory that can speed up the data loading process. We will see how to use them in later examples; for now, we need to know the specifications of the different memory types. This figure is from the website, originally found in the book Programming Massively Parallel Processors: A Hands-on Approach. There are in total four types of memory designed for GPU cards with the CUDA architecture. Global memory, located in the grid, has a large storage capacity but limited speed, and can be read and written from all the blocks in the CUDA system. Shared memory, located in each block, has a small storage capacity (16 KB per block) but a fast access speed, and can be read and written by all the threads within that block. Constant memory, also located in the grid, has a very small storage capacity (64 KB per GPU) but a very fast access speed, and can be read (but not written) by any thread. There is also local memory, located in each thread.

Matrix Multiplication with Shared Memory source file:

We have already seen a speed improvement of hundreds of times. So why do we need a shared memory version? Well, to answer that question, we first need to look at the relationship between global memory and our program.
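The one-thread-per-output-entry scheme with a two-dimensional grid and block can be sketched as follows. This is a minimal illustration, not the source file linked above: the names `MatMulKernel`, `Md`, `Nd`, `Pd`, and the 16x16 block size are my own assumptions.

```cuda
// Sketch of a naive global-memory square matrix multiplication kernel.
// Each thread computes exactly one entry (row, col) of the output matrix P.
__global__ void MatMulKernel(const float *Md, const float *Nd,
                             float *Pd, int Width)
{
    // 2D thread hierarchy: map this thread to one output entry.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < Width && col < Width) {
        float sum = 0.0f;
        // Dot product of row `row` of M with column `col` of N,
        // every operand fetched directly from global memory.
        for (int k = 0; k < Width; ++k)
            sum += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = sum;  // write the result back to global memory
    }
}
```

A launch for a Width x Width problem might then use `dim3 block(16, 16); dim3 grid((Width + 15) / 16, (Width + 15) / 16); MatMulKernel<<<grid, block>>>(Md, Nd, Pd, Width);` so that the grid covers one thread per output entry, matching the assignment described above.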