C++ Parallel Programming with Visual Studio 2010 · download.microsoft.com/.../pdf/1-2_VS2010_Cpp/1-2_VS2010Cpp.pdf

최흥배 (Choi Heung-bae)

Server Programmer, MAIET Entertainment

Microsoft Visual C++ MVP

Twitter: @jacking75

C++ Parallel Programming with Visual Studio 2010

Contents

1. The Multi-Core Era
2. Parallel Programming Is Hard
3. Evolution
4. Concurrency Runtime
5. Parallel Patterns Library (PPL)

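The Multi-Core Era

Multi-core computers are already mainstream

The era of throughput computing

• Single-CPU speed improvements have reached their limit.

• The industry has changed direction toward multi-core CPUs.

• Throughput has become the most important factor.

• Intel and AMD CPU architectures are moving beyond multi-core toward heterogeneous designs.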

AMD - Bulldozer Architecture

• A cluster that fuses two cores.

• It is called a "Bulldozer Module".

• The basic unit of Bulldozer is a module that can run two threads in parallel. The basic unit is two, not one!

• A 4-core Bulldozer CPU therefore contains two Bulldozer Modules.

• This is not Hyper-Threading!

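Differences from Hyper-Threading

• With Intel's Hyper-Threading, two threads share all of the CPU's resources at instruction granularity.

• In Bulldozer, each of the two threads has its own dedicated integer pipeline.

• The instruction decoder, the floating-point unit, and similar resources are still shared by the two threads.

• Integer work sees no contention between the two threads, so throughput is high.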

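Intel - Larrabee Architecture

• For now, Intel has given up on shipping "Larrabee" as a graphics product.

• However, Intel still urgently needs a data-parallel processor core architecture that it can integrate into the CPU. That has not changed.

• Larrabee's goal is to find out how to build an architecture that is flexible, highly efficient, and easy to program.

• Intel executives expect general-purpose data-parallel cores like Larrabee to be integrated into the CPU.

• In terms of efficiency, a heterogeneous configuration that combines large superscalar cores with small cores specialized for data parallelism is preferable.

• The reason is that the workloads whose performance needs to grow from here on are data-parallel and dominated by floating-point math.

• Amdahl's law still holds, so Intel cannot abandon the large superscalar core either. A heterogeneous design is the inevitable result.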
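The Amdahl's law argument can be made concrete with a small calculation. The sketch below is an illustration added to these notes, not something from the original slides; the function name amdahl_speedup and the sample numbers are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where
//   p = fraction of the run time that can be parallelized (0..1)
//   n = number of cores working on the parallel part
// The name amdahl_speedup is hypothetical, chosen for this illustration.
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / static_cast<double>(n));
}
```

With p = 0.5 the speedup can never reach 2 no matter how many cores are added, which is why a fast core for the serial part still matters alongside many small data-parallel cores.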

Parallel Programming Is Hard

Parallel programming? Multi-core?

Is that something you eat? (munch munch)

race condition, deadlock

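void SetReUseSocket()
{
    ………

    // m_bUsed is checked and later written without any synchronization,
    // so two threads can interleave between the check and the write.
    if( false == m_bUsed ) {
        LOG("SetReUseSocket() | Failed");
        return;
    }

    LOG("SetReUseSocket()");
    m_bUsed = true;

    ………
}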
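In the function above, the test of m_bUsed and the later assignment are separate steps, so another thread can run between them. One way to close that window is to make the test and the update a single atomic operation. The sketch below assumes a C++11 compiler with <atomic> (newer than Visual Studio 2010); the class and method names are hypothetical and this is not the original socket code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the socket class on the slide.
class ReusableSocket
{
public:
    ReusableSocket() : m_bUsed(false) {}

    // Atomically flips m_bUsed from false to true.
    // Returns true only for the single caller that performed the flip,
    // so two threads can never both pass the check.
    bool TryMarkUsed()
    {
        bool expected = false;
        return m_bUsed.compare_exchange_strong(expected, true);
    }

    // Releases the flag so the socket can be claimed again.
    void MarkUnused() { m_bUsed.store(false); }

private:
    std::atomic<bool> m_bUsed;
};
```

A caller that gets false from TryMarkUsed() knows another thread already owns the socket and can log the failure and return, without ever observing a half-updated state.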

Evolution

2002 and 2010

OS - 2001 and 2009

Windows XP    Windows 7

Visual Studio - 2002 and 2010

Visual Studio .NET (2002)

Visual Studio 2010

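October 2008: Craig Mundie, Microsoft's Chief Research and Strategy Officer

- Acknowledged that Win32 is not well suited to asynchronous, parallel computing.

- Windows 7 and Windows Server 2008 R2 take the first steps toward solving the problem.

- Windows can already handle machines with 2 or 3 cores, but it was not designed for machines with 8, 16, 32, or more cores.

- The first seeds for making Windows a better parallel/asynchronous processing platform began to be planted in 2009.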

[Figure: Threads 1 to 6 on a two-core machine. Some threads are running on Core 1 and Core 2; the rest are shown as non-running threads.]

UMS - Cooperative Scheduling

[Figure: Core 1 and Core 2 with six threads, each drawn as a paired User Thread and Kernel Thread (1 to 6).]

Figure source: PDC 09

UMS - Completion List

• When the kernel-mode block of a waiting thread is released,

• the corresponding user-mode thread is placed on the Completion List,

• and it resumes after the thread currently running on the core has finished.

NUMA

[Figure: processor topology hierarchy, Group > NUMA Node > Socket > Core > LP (logical processor). As drawn: 2 groups, each with 2 NUMA nodes; each NUMA node has 2 sockets, each socket has 4 cores, and each core has 4 LPs.]


Concurrency Runtime

[Figure: the native concurrency stack in Visual Studio 2010. Native Libraries: Parallel Pattern Library, Asynchronous Agents Library, Data Structures. Native Concurrency Runtime: Task Scheduler, Resource Manager. Operating System: Threads, UMS Threads. Tools: Visual Studio 2010 Parallel Debugger, Profiler Concurrency Analysis, Intel Parallel Studio.]


Concurrency Runtime (ConcRT)

[Figure: ConcRT components. Parallel Patterns Library, Asynchronous Agents Library, and Synchronization Data Structures sit on top of the Task Scheduler and the Resource Manager, which sit on top of the OS.]

Parallel Patterns Library (PPL)

• Provides general-purpose containers and algorithms for fine-grained parallel work.

• Imperative parallelism: parallel_for, parallel_for_each, and others

• Task parallelism: task_group, structured_task_group

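Asynchronous Agents Library (AAL)

• Provides fine-grained data flow and task pipelining through an actor-based model and message passing.

• With the AAL, a task can keep doing work while it waits for data from another component.

• Use the AAL when multiple entities communicate with each other asynchronously.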

class GameAI : public agent
{
    .....
    void run()
    {
        // Send the request.
        ......
        send(_target, wstring(L"request"));

        // Read the response.
        int response = receive(_source);

        // Move the agent to the finished state.
        done();
    }

private:
    ISource<int>& _source;
    ITarget<wstring>& _target;
};

class GameLogic : public agent
{
    .....
    void run()
    {
        // Send the request.
        ......
        send(_target, wstring(L"request"));

        // Read the response.
        int response = receive(_source);

        // Move the agent to the finished state.
        done();
    }

private:
    ISource<int>& _source;
    ITarget<wstring>& _target;
};

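Synchronization Data Structures

• Provides several data structures for synchronizing access to shared data from multiple threads.

• Like a critical section, a synchronization object makes other threads wait until the shared data is available.

• critical_section, reader_writer_lock, event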

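Task Scheduler

• Schedules and coordinates tasks at run time.

• Uses cooperative scheduling and a work-stealing algorithm to use processing resources as efficiently as possible.

• The Concurrency Runtime supplies a default scheduler, so you do not have to manage one yourself.

• To tune for a specific application and get higher performance, you can change the scheduler policy or associate particular tasks with a particular scheduler.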

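Resource Manager

• Its purpose is to manage computing resources such as processors and memory.

• When the workload changes at run time, it reallocates resources so that they are used as efficiently as possible.

• It abstracts the computing resources and interacts mainly with the Task Scheduler.

• The Resource Manager can be fine-tuned for higher performance.

• It lets other parallel libraries and concurrency runtimes integrate their computing-resource management.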

[Figure: Big and Small tasks assigned to CPU0, CPU1, … CPUN, drawn in two different arrangements.]

Demo: cooperative scheduling in ConcRT

More Cores and Resource Management

• If two ConcRT-based processes are running on a 4-core machine, one ConcRT instance runs on Core 0 and Core 1, and the second runs on Core 2 and Core 3.

http://blogs.msdn.com/blogfiles/nativeconcurrency/WindowsLiveWriter/CrossProcessResourceManagementdoweneedit_C8F7/CPRMBlogChart_2.jpg

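ConcRT memory leak?

#include "stdafx.h"
#include <crtdbg.h>   // _CrtSetDbgFlag
#include <ppl.h>

using namespace Concurrency;

int main()
{
    // Report leaked CRT allocations when the program exits (debug builds).
    _CrtSetDbgFlag( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF );

    parallel_invoke( [] { }, [] { } );

    return 0;
}

• In debug mode, the code above reports a memory leak.

• The reason is that the program exits before the Task Scheduler and the Resource Manager have been destroyed.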

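#include <windows.h>  // CreateEvent, WaitForSingleObject, Sleep
#include <crtdbg.h>   // _CrtSetDbgFlag
#include <ppl.h>

using namespace Concurrency;

int main()
{
    // Manual-reset event that is signaled when the scheduler shuts down.
    HANDLE hEvent = CreateEvent( NULL, TRUE, FALSE, NULL );

    CurrentScheduler::Create( SchedulerPolicy() );
    CurrentScheduler::RegisterShutdownEvent( hEvent );

    _CrtSetDbgFlag( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF );

    parallel_invoke( [] {}, [] {} );

    // Detach from the scheduler, then wait for its shutdown event.
    CurrentScheduler::Detach();
    WaitForSingleObject( hEvent, INFINITE );
    CloseHandle( hEvent );

    // Short pause before exiting (kept from the original slide).
    Sleep(500);

    return 0;
}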

Parallel Patterns Library (PPL)

Three features of the PPL

• Task parallelism

• Parallel algorithms

• Parallel containers and objects

When you start a 3D game…

Loading font resources, loading texture resources, loading 3D model resources, and so on…

http://review.nate.com/view/8441754/review/2/PRO

Task Parallelism

[Figure: task_group compared with structured_task_group. Labels: Main Thread, Thread A, Thread B; task_group1.run(task2); structured_task_group1.run(task1); structured_task_group1.run(task2).]

Demo: ConcRT tasks

for( i = 0; i < 1000000; ++i )
{
    …………
    …………
}

http://review.nate.com/view/8441754/review/2/PRO

Parallel Algorithms

parallel_for

!!! parallel_for

parallel_for_each

!!! parallel_for_each

http://cfile4.uf.tistory.com/original/113607184AA3C1F57F8CD8

http://cfile24.uf.tistory.com/original/13371F184AA3C21766B527

parallel_invoke

!!! parallel_invoke

http://cfile4.uf.tistory.com/original/113607184AA3C1F57F8CD8

http://cfile24.uf.tistory.com/original/13371F184AA3C21766B527

parallel objects - combinable

concurrent containers

concurrent_vector

• An STL vector-style container suited to parallel programming.

• Its interface is broadly similar to vector, with some restrictions.

• Modifying the value of an existing element is not thread-safe. To modify an existing element, take a lock with a synchronization object.

• Using concurrent_vector
  - Include the header file first. The header for concurrent_vector is "concurrent_vector.h".
  - Usage is almost the same as the STL vector.

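Feature                             vector            concurrent_vector
Appending                           not thread-safe   thread-safe
Element access                      not thread-safe   thread-safe
Iterator access and traversal       not thread-safe   thread-safe
push_back                           supported         supported
insert                              supported         not supported
clear                               removes all       removes all
erase                               supported         not supported
pop_back                            supported         not supported
Pointer arithmetic, e.g. &v[0]+2    supported         not supported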

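• grow_by and grow_to_at_least resemble vector's resize but, unlike resize, are safe to call concurrently.

• Appending or resizing does not change the position of existing indices or iterators.

• No specialization for bool is defined.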

concurrent_queue

• An STL queue-style container suited to parallel programming.

• Enqueue and dequeue operations are thread-safe.

• Iterators are supported, but they are not thread-safe.

• front and pop are not supported. Use try_pop instead.

• back is not supported, so the last element cannot be referenced.

• Instead of size, it provides unsafe_size. As the name says, unsafe_size is not thread-safe.

• Usage
  - Include "concurrent_queue.h".
  - Usage is similar to the STL queue.

Thread-safe concurrent_queue functions

- Every operation that enqueues to or dequeues from a concurrent_queue is thread-safe.

- empty

- push

- get_allocator

- try_pop

- empty is thread-safe, but another thread may grow or shrink the queue before the call returns, so the result can already be out of date.

concurrent_queue functions that are not thread-safe

- clear

- unsafe_end

- unsafe_begin

- unsafe_size

Looking at ConcRT after learning Intel's TBB……

Looking at TBB after learning ConcRT……

References

AMD announces an overview of its next architectures, "Bulldozer" and "Bobcat"
Original: http://pc.watch.impress.co.jp/docs/column/kaigai/20091112_328392.html
Translation: http://jacking.tistory.com/487

AMD's 8-core desktop CPU "Zambezi", arriving in 2011
Original: http://pc.watch.impress.co.jp/docs/column/kaigai/20091126_331235.html
Translation: http://jacking.tistory.com/514

Core i5/i7
Original: http://www.atmarkit.co.jp/fwin2k/words/011corei5/corei5.html
Translation: http://jacking.tistory.com/510





Larrabee's change of plans: what went wrong?

Microsoft also aims to improve parallel processing in Windows 7
http://jacking.tistory.com/355

[PDC09] Developing Applications for Scale-Up Servers Running Windows Server 2008 R2
http://microsoftpdc.com/Sessions/SVR18

[PDC09] Lighting up Windows Server 2008 R2 Using the ConcRT on UMS

The Concurrency Runtime's event, which knows how to yield
http://vsts2010.tistory.com/109

Cross Process Resource Management - do we need it now?
http://blogs.msdn.com/nativeconcurrency/archive/2010/04/07/cross-process-resource-management-do-we-need-it-now.aspx























Concurrency::task_group leaks memory
http://social.msdn.microsoft.com/Forums/en/parallelcppnative/thread/15799a79-cca0-4c51-85e3-64ea1e26981d

MSDN - Concurrency Runtime
http://msdn.microsoft.com/en-us/library/dd504870(VS.100).aspx

VSTS 2010 study blog
http://vsts2010.net

Parallel Programming in Native Code
http://blogs.msdn.com/nativeconcurrency/default.aspx

Author's blog
http://jacking.tistory.com/
