Multicore Computing - 11. CUDA programming


Let's build a parallel program that runs on the GPU.

CUDA runs only on NVIDIA GPUs.

*How to run CUDA code on Google Colab: https://www.geeksforgeeks.org/how-to-run-cuda-c-c-on-jupyter-notebook-in-google-colaboratory/


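Motivation

As clock speeds rose, heat and power consumption grew with them, which created the need to use multiple cores instead.

Many-core GPUs are used especially where real-time, high-resolution 3D graphics are needed. (They specialize in parallelized work.)

A CPU typically has about 2 to 16 cores, whereas a GPU has roughly 100 to 10,000 cores.

A CPU core provides a variety of complex features to improve performance. (out-of-order execution, branch prediction, pipelining, ...)

A GPU core, unlike a CPU core, provides only simple functionality. (in-order execution, single instruction)

A processor in which one instruction runs identically on many cores at once is called a SIMD processor.

Each SM (streaming multiprocessor) can be thought of as one SIMD processor.

The attempt to use the GPU's computing power for general-purpose work, the way a CPU is used, is called GPGPU. (e.g. matrix computation)

For GPU programming (on NVIDIA hardware) we use CUDA.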

CUDA

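CUDA can be used from C, C++, and Python.

The CPU is called the host and the GPU is called the device.

Code that runs on the CPU is called host code, and code that runs on the GPU is called device code.

A function that performs the parallel work is called a CUDA kernel function. CUDA threads execute it.

One kernel function launch has one grid, and a grid consists of blocks. Each block consists of multiple threads.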

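The goal is to have the GPU perform the computation.

1. First, copy the required data into GPU memory.

2. The CPU tells the GPU to perform the computation.

3. The GPU performs the computation and stores the result in GPU memory.

4. Copy the result from GPU memory back to main memory.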
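The four steps above can be sketched roughly as follows. This is a minimal illustration, not code from the lecture: the kernel `double_elements` and the helper `double_on_gpu` are hypothetical names, only the CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) are real API, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread doubles one element.
__global__ void double_elements(int *data, int n)
{
    int i = threadIdx.x;   // single block, so threadIdx.x is the element index
    if (i < n)
        data[i] *= 2;
}

// Hypothetical helper that walks through steps 1 to 4 for a small array
// (n must fit in one block here).
void double_on_gpu(int *h_data, int n)
{
    int *d_data = nullptr;                       // pointer into GPU memory
    size_t bytes = n * sizeof(int);
    cudaMalloc((void **)&d_data, bytes);

    // 1. copy the input from main (host) memory to GPU (device) memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. the CPU tells the GPU to run the kernel: 1 block of n threads
    double_elements<<<1, n>>>(d_data, n);

    // 3. the GPU computes and leaves the result in d_data (GPU memory)

    // 4. copy the result back; this cudaMemcpy waits for the kernel to finish
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
}

int main(void)
{
    int h_data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    double_on_gpu(h_data, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", h_data[i]);
    printf("\n");
    return 0;
}
```

Compile with `nvcc`. Keeping separate `h_` and `d_` pointers makes it visible which memory each buffer lives in, since a device pointer must not be dereferenced in host code.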

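The hello-world example is written as a CUDA kernel function.

The launch `hello_world<<<1, 5>>>()` means: create 1 block, and have that block create 5 threads. The part inside `<<< >>>` is the execution configuration.

Marking a function with __global__ means that it is called from the host (CPU) and executes on the GPU.

When the launch statement executes, the hello_world code is handed over to the GPU and control returns to main. The launch itself is asynchronous, so main has to call cudaDeviceSynchronize() if it needs to wait until the kernel has finished.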
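A hello-world kernel along these lines might look like the sketch below. The original listing is not reproduced in this post, so the body of `hello_world` and the helper `launch_hello` are assumptions; the `<<<1, 5>>>` configuration matches the description of 1 block with 5 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__: launched from host code, executed on the GPU.
__global__ void hello_world(void)
{
    // Every one of the 5 threads runs this line once.
    printf("Hello world from thread %u\n", threadIdx.x);
}

// Hypothetical helper: launches the kernel and waits for it to finish.
cudaError_t launch_hello(void)
{
    hello_world<<<1, 5>>>();          // execution configuration: 1 block, 5 threads
    return cudaDeviceSynchronize();   // wait so the device-side printf output is flushed
}

int main(void)
{
    cudaError_t err = launch_hello();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Without the `cudaDeviceSynchronize()` call, main could return before the GPU threads have printed anything.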

gridDim, blockDim (1,1,1), (2,1,1) / (3,1,1), (2,1,1), (1,1,1)

blockIdx = (blockIdx.x, blockIdx.y, blockIdx.z)

threadIdx = (threadIdx.x, threadIdx.y, threadIdx.z)

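These (gridDim, blockDim, blockIdx, threadIdx) are built-in variables that are created automatically inside every kernel. gridDim and blockDim are declared as type dim3, and blockIdx and threadIdx as the closely related type uint3; each has x, y, and z components.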

This execution configuration creates 2 blocks, each of which has 5 threads.

Here a total of 4 blocks are laid out in a 2D grid, and each block creates and runs 8 threads arranged in 3D.
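To make the built-in variables concrete, here is a small sketch of the 4-block, 8-threads-per-block case. The shapes `dim3 grid(2, 2)` and `dim3 block(2, 2, 2)` are an assumption (the post only states the totals), and `record_ids` and `count_correct_ids` are hypothetical names. Each thread flattens its coordinates into one linear id.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread flattens its (block, thread) coordinates into a linear id
// and records that id in out[id].
__global__ void record_ids(int *out)
{
    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    int block_id  = blockIdx.y * gridDim.x + blockIdx.x;              // 2D grid
    int thread_id = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                  + threadIdx.x;                                       // 3D block
    int id = block_id * threads_per_block + thread_id;
    out[id] = id;
}

// Hypothetical helper: runs the kernel and counts the slots holding their own index.
int count_correct_ids(void)
{
    dim3 grid(2, 2);        // assumed shape: 4 blocks in 2D
    dim3 block(2, 2, 2);    // assumed shape: 8 threads in 3D
    const int total = 4 * 8;

    int h_out[total];
    for (int i = 0; i < total; i++)
        h_out[i] = -1;

    int *d_out = nullptr;
    cudaMalloc((void **)&d_out, total * sizeof(int));
    cudaMemset(d_out, 0xFF, total * sizeof(int));   // every int starts as -1
    record_ids<<<grid, block>>>(d_out);
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int correct = 0;
    for (int i = 0; i < total; i++)
        if (h_out[i] == i)
            correct++;
    return correct;
}

int main(void)
{
    printf("%d of 32 ids recorded correctly\n", count_correct_ids());
    return 0;
}
```

If the flattening is right, every one of the 32 threads lands on a distinct slot, so no two threads write to the same element.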

The first for loop moves the data in host memory to device memory.

Then the computed result c is copied back from device memory to CPU (host) memory.
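The listing this passage refers to is not reproduced here, so the following is only a sketch of the same pattern under assumptions: an element-wise vector addition c = a + b, with hypothetical names `vec_add` and `add_on_gpu`, and `cudaMemcpy` used for the copies in both directions. Unlike a single-block launch, it spreads the work over several blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vec_add(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    if (i < n)                                       // last block may have spare threads
        c[i] = a[i] + b[i];
}

// Hypothetical helper: copies a and b to the device, adds them, copies c back.
void add_on_gpu(const int *a, const int *b, int *c, int n)
{
    size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // host memory -> device memory
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                               // threads per block (a common choice)
    int blocks  = (n + threads - 1) / threads;       // round up to cover all n elements
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // device memory -> host memory (waits for the kernel to finish)
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

int main(void)
{
    const int n = 1000;
    static int a[n], b[n], c[n];
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }
    add_on_gpu(a, b, c, n);
    printf("c[0] = %d, c[%d] = %d\n", c[0], n - 1, c[n - 1]);
    return 0;
}
```

The `if (i < n)` guard matters because 1000 is not a multiple of 256: the last block has threads whose index falls past the end of the arrays.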

Thrust

Thrust is a C++ template library built on CUDA. (An STL-like library for CUDA.)

It provides a high-level interface while still using the GPU. It already ships as part of the CUDA Toolkit.

https://docs.nvidia.com/cuda/thrust/index.html


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <iostream>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);

    // initialize individual elements
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;
    
    // H.size() returns the size of vector H
    std::cout << "H has size " << H.size() << std::endl;

    // print contents of H
    for(int i = 0; i < H.size(); i++)
        std::cout << "H[" << i << "] = " << H[i] << std::endl;

    // resize H
    H.resize(2);
    
    std::cout << "H now has size " << H.size() << std::endl;

    // Copy host_vector H to device_vector D
    thrust::device_vector<int> D = H;
    
    // elements of D can be modified
    D[0] = 99;
    D[1] = 88;
    
    // print contents of D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    // H and D are automatically deleted when the function returns
    return 0;
}

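A host_vector is stored in host memory, and a device_vector is stored in GPU (device) memory. Assigning one to the other, as in `D = H` above, copies the data between the two memories.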

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iterator>   // std::ostream_iterator
#include <iostream>

int main(void)
{
    // allocate three device_vectors with 10 elements
    thrust::device_vector<int> X(10);
    thrust::device_vector<int> Y(10);
    thrust::device_vector<int> Z(10);

    // initialize X to 0,1,2,3, ....
    thrust::sequence(X.begin(), X.end());

    // compute Y = -X
    thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

    // fill Z with twos
    thrust::fill(Z.begin(), Z.end(), 2);

    // compute Y = X mod 2
    thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

    // replace all the ones in Y with tens
    thrust::replace(Y.begin(), Y.end(), 1, 10);

    // print Y
    thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));
   
    return 0;    
}


jsdysw
