A Review on GPU Programming Strategies and Recent Trends in GPU Computing

  • Nisha Chandran S. Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
  • Durgaprasad Gangodkar Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
  • Ankush Mittal Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
Keywords: Parallel Programming, GPUs, CUDA, Fermi, Debugging, Programming Strategies, Optimization


Advancements in the fields of internet and cloud computing have resulted in huge volumes of multimedia data, and processing this data has become more complex and computationally intensive. With the advent of scalable, inexpensive Graphics Processing Units (GPUs) offering very high computational power, processing such big data has become less expensive and more efficient. Rapid developments in programming languages and in programming and debugging tools further add to the ease of GPU programming. However, utilizing the resources of the GPU fully and effectively is still a challenge. The goal of this paper is to present a brief review of NVIDIA's state-of-the-art Fermi architecture and to survey the different programming and optimization strategies adopted by researchers to accelerate GPU computation. This survey aims to provide researchers with knowledge of the different programming and optimization techniques in GPU programming and to motivate them to architect highly efficient parallel algorithms that extract the maximum available capability of GPUs. The paper also explores some recent trends in the field of GPU programming.
