[0312 조진현] good bye dx9
29. No longer just for games!!! The newcomer (literally "a stone that rolled in"). Diagram labels: Win32 Application (x3), Future Graphics Components, Direct3D API, GDI, Direct3D API, User-Mode Driver, HAL Device, DXGI, Device Driver Interface (DDI), Kernel-mode Driver, Hardware, Graphics Hardware.
38. CPU processing optimization! CPU performance improvement issues: Debug Layer, Draw Call Management, Constant Updates, Resource Updates, State Management, Shader Linkage, Asynchronous Queries, …
39. Stronger multithreading support. Changes on the Direct3D API side. Diagram labels: ID3D11Device, ID3D10Device, ID3D11DeviceContext.
66. Why you should use XNA Math: more inlining, better SIMD optimization. D3DXMath is finished!
82. Fundamentally, the bone count is high!!! Optimizing skinning computation: make full use of the various numeric data; find ways to reduce the amount of data; on DX10, use the StreamOut feature.
83. Enter H/W tessellation! The problems…: rising graphics quality, the limits of texturing, multipass rendering becoming the norm, the cost of skinning.
84. Adopted in DX11! ATI's efforts: started with the RADEON 8500 series; introduced in the Xbox 360; rolled out across the ATI product line.
94. DX9 vs DX11 Tessellation Pipeline.
DX9: Input Assembler, Tessellator, Vertex Shader, Rasterizer, Pixel Shader, Output Merger.
DX11: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Output, Rasterizer, Pixel Shader, Output Merger. Side label: Memory / Resources.
102. LOD determination. Pipeline diagram: Hull Shader, Tessellator, Domain Shader, Geometry Shader, Rasterizer, Pixel Shader, Output Merger.
103. DX11 Adaptive Tessellation: Parallel~~~ Pipeline diagram: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Rasterizer, Pixel Shader, Output Merger. Hull Shader annotations: patch control points from the VS; Control Points (identified by ID); Patch Constant Data (including Tessellation Factors).
104. DX11 Adaptive Tessellation. Pipeline diagram: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Output, Rasterizer, Pixel Shader, Output Merger. Tessellator annotation: generates barycentric coordinates.
106.
[domain("tri")]
DS_OUTPUT DS( HS_CONSTANT_DATA_OUTPUT input,
              float3 BarycentricCoordinates : SV_DomainLocation,
              const OutputPatch<HS_CONTROL_POINT_OUTPUT, 3> TrianglePatch )
{
    DS_OUTPUT output = (DS_OUTPUT)0;
    // Interpolate world space position with barycentric coordinates
    float3 vWorldPos = BarycentricCoordinates.x * TrianglePatch[0].vWorldPos
                     + BarycentricCoordinates.y * TrianglePatch[1].vWorldPos
                     + BarycentricCoordinates.z * TrianglePatch[2].vWorldPos;
107. DX11 Adaptive Tessellation. Pipeline diagram: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Output, Rasterizer, Pixel Shader, Output Merger. Domain Shader annotations: surface evaluation, displacement mapping. Inputs: Hull Shader output control points and Tessellator stage output texture coordinates. Output: vertex position.
108. What are the benefits of H/W tessellation? Uses the GPU's processing power (1 pass)! Saves memory! Reduces bandwidth! Better rendering quality. Better performance.
117. DX9 to DX11: it's hard!!! Port from DX9 to DX10! Revise how every pipeline stage is used. Use state objects. Bind shaders through an InputLayout. Use constant buffers. etc.
121.
cbuffer cbPerFrame : register( b0 )
{
    matrix g_mViewProjection;
    float  g_fTessellationFactor;
};
struct VS_CONTROL_POINT_INPUT
{
    float3 vPosition : POSITION;
};
struct VS_CONTROL_POINT_OUTPUT
{
    float3 vPosition : POSITION;
};
126.
struct DS_OUTPUT
{
    float4 vPosition : SV_POSITION;
};
[domain("tri")]
DS_OUTPUT DS( HS_CONSTANT_DATA_OUTPUT input,
              float3 UVW : SV_DomainLocation,
              const OutputPatch<HS_OUTPUT, 3> patched )
{
    DS_OUTPUT Output;
    // Interpolate the three control points with the barycentric weights, then project.
    float3 finalPos = UVW.x * patched[0].vPosition
                    + UVW.y * patched[1].vPosition
                    + UVW.z * patched[2].vPosition;
    Output.vPosition = mul( float4( finalPos, 1 ), g_mViewProjection );
    return Output;
}
137. Multi-Threaded Rendering. Diagram labels: ID3D11Device; ID3D11DeviceContext: Immediate Context, Deferred Context; free thread; Rendering Command.
153.
ID3D11CommandList
int                   m_iRenderThreadCount;
HANDLE*               m_pDeferredThreadHandleArray;
HANDLE*               m_pBeginDeferredEventHandleArray;
HANDLE*               m_pEndDeferredEventHandleArray;
ID3D11DeviceContext** m_ippDeferredContextArray;
ID3D11CommandList**   m_ippCommandListArray;
154. CreateDeferredContext
ID3D11DeviceContext* ipResultPointer = NULL;
for( int i = 0; i < m_iRenderThreadCount; ++i )
{
    ipResultPointer = JinRenderUtil::CreateDeferredContext( this->m_ipGPU );
    m_ippDeferredContextArray[ i ] = ipResultPointer;
    // Pass the index by value. Passing &i hands every thread the address of a loop
    // variable that changes (and goes out of scope) before the thread reads it.
    // The thread procedure then casts the argument back with (int)(INT_PTR).
    m_pDeferredThreadHandleArray[ i ] = (HANDLE)_beginthreadex(
        NULL, 0, DeferredProcForRenderPerScene, (void*)(INT_PTR)i, CREATE_SUSPENDED, NULL );
    ResumeThread( this->m_pDeferredThreadHandleArray[ i ] );
}
155. FinishCommandList
for(;;)
{
    // Wait until the main thread signals that this worker may start recording.
    ::WaitForSingleObject( GetBeginDeferredEventHandle( iInstance ), INFINITE );
    RenderSomething();
    // Close recording on the deferred context and obtain the command list.
    hr = ipDeferredContextPtr->FinishCommandList( FALSE, &ipCommandListPtr );
    assert( hr == S_OK );
    // Tell the main thread that this worker's command list is ready.
    ::SetEvent( Jin3D::GetInstance()->GetEndDeferredEventHandle( iInstance ) );
}
156. ExecuteCommandList
for( int i = 0; i < m_iRenderThreadCount; ++i )
{
    // Play back each recorded command list on the immediate context.
    m_ipImmediateContext->ExecuteCommandList( m_ippCommandListArray[ i ], TRUE );
}
157. What is MTR (multi-threaded rendering)? Create threads (as many as there are cores). Generate commands (per thread). Render (main thread). Submit commands (to the GPU).
163. Layer diagram.
Applications: media playback or processing, media UI, recognition, etc.; Technical.
Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
Domain Languages: Accelerator, Brook+, Rapidmind, Ct.
Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
Processors: CPU, GPU, Larrabee; nVidia, Intel, AMD, S3, etc.
166. CPU vs GPU. [Diagram: a CPU with four cores (CPU 0 to CPU 3) and an L2 cache, next to a GPU made of many SIMD units and an L2 cache]
167. CPU: 4 cores; 4-float-wide SIMD; 3 GHz; 48-96 GFlops; 2x HyperThreaded; 64 kB L1 cache per core; 20 GB/s to memory; $200; 200 W. [Diagram: CPU 0 to CPU 3, L2 cache]
168.
GPU: 32 cores; 32-float-wide SIMD; 1 GHz; 1 TeraFlop; 32x "HyperThreaded"; 64 kB L1 cache per core; 150 GB/s to memory; $200; 200 W. [Diagram: array of SIMD units, L2 cache]
171. Example HLSL code
#define BLOCK_SIZE 256
// The element type float is an assumption; the transcript shows no template arguments.
StructuredBuffer<float>   gBuf1;
StructuredBuffer<float>   gBuf2;
RWStructuredBuffer<float> gBufOut;
[numthreads(BLOCK_SIZE,1,1)]
void VectorAdd( uint3 id : SV_DispatchThreadID )
{
    // Buffers are indexed by a scalar, so use the x component of the thread ID.
    gBufOut[id.x] = gBuf1[id.x] + gBuf2[id.x];
}
172. Compile the HLSL code
hr = D3DX11CompileFromFile(
    "myCode.hlsl",   // path to .hlsl file
    NULL, NULL,
    "VectorAdd",     // entry point
    pProfile,
    NULL,            // Flags
    NULL, NULL,
    &pBlob,          // compiled shader
    &pErrorBlob,     // error log
    NULL );
173. Initialize DirectCompute
hr = D3D11CreateDevice(
    NULL,                      // default gfx adapter
    D3D_DRIVER_TYPE_HARDWARE,  // use hw
    NULL,                      // not sw rasterizer
    uCreationFlags,            // Debug, Threaded, etc.
    NULL,                      // feature levels
    0,                         // size of above
    D3D11_SDK_VERSION,         // SDK version
    ppDeviceOut,               // D3D Device
    &FeatureLevelOut,          // of actual device
    ppContextOut );            // immediate device context
174. Creating and setting the CS (compute shader)
pD3D->CreateComputeShader( pBlob->GetBufferPointer(), pBlob->GetBufferSize(), NULL, &pMyShader ); // hw fmt
// Note: CSSetShader is a method of ID3D11DeviceContext, so it must be called on the context.
pD3D->CSSetShader( pMyShader, NULL, 0 );
176. Buffer setup for input
D3D11_BUFFER_DESC desc;
ZeroMemory( &desc, sizeof(desc) );
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
desc.StructureByteStride = uElementSize;
desc.ByteWidth = uElementSize * uCount;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
pD3D->CreateBuffer( &desc, pInput, ppBuffer );
177. View setup
D3D11_UNORDERED_ACCESS_VIEW_DESC desc;
ZeroMemory( &desc, sizeof(desc) );
desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
desc.Buffer.FirstElement = 0;
desc.Format = DXGI_FORMAT_UNKNOWN;
desc.Buffer.NumElements = uCount;
pD3D->CreateUnorderedAccessView(
    pBuffer,    // Buffer view is into
    &desc,      // above data
    &pMyUAV );  // result
179.
pDev11->Dispatch(3, 2, 1);
[numthreads(4, 4, 1)]
void MyCS(…)
[Figure: a 3 x 2 grid of thread groups (group IDs 00, 10, 20 in the first row and 01, 11, 21 in the second), each holding a 4 x 4 block of threads labelled 00 to 33]
180. Buffer setup for reading back results
D3D11_BUFFER_DESC desc;
ZeroMemory( &desc, sizeof(desc) );
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.Usage = D3D11_USAGE_STAGING;
desc.BindFlags = 0;
desc.MiscFlags = 0;
// desc.ByteWidth must also be set to the size of the data to read back; the slide does not show it.
pD3D->CreateBuffer( &desc, NULL, &StagingBuf );
183. The Teraflop Today. N-Body Demo App:
AMD Phenom II X4 940 3GHz + Radeon HD 5850: CPU 13.7 GFlops (multicore SSE, not cache-aware); GPU 537 GFlops (DirectCompute).
Intel Xeon E5410 2.33GHz + Radeon HD 5870: CPU 25.5 GFlops (multicore SSE, not cache-aware); GPU 722 GFlops (DirectCompute).
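Slide 179 dispatches 3 x 2 x 1 thread groups, each with 4 x 4 x 1 threads. As a rough CPU-side illustration of how each thread's SV_DispatchThreadID follows from its group ID and its ID within the group, the sketch below enumerates the same grid in plain C++. It runs nothing on the GPU, and the names UInt3, dispatchThreadId and enumerateDispatch are made up for this illustration; they are not part of Direct3D.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for HLSL's uint3.
struct UInt3 {
    std::uint32_t x, y, z;
};

// Per component: SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID.
inline UInt3 dispatchThreadId(const UInt3& groupId, const UInt3& numThreads,
                              const UInt3& groupThreadId) {
    // A thread's ID within its group is always smaller than the group size.
    assert(groupThreadId.x < numThreads.x);
    assert(groupThreadId.y < numThreads.y);
    assert(groupThreadId.z < numThreads.z);
    return UInt3{groupId.x * numThreads.x + groupThreadId.x,
                 groupId.y * numThreads.y + groupThreadId.y,
                 groupId.z * numThreads.z + groupThreadId.z};
}

// Lists every thread ID that Dispatch(groups) would launch for a shader
// declared with [numthreads(numThreads)]. The loop order is arbitrary; on the
// GPU the threads run in parallel with no guaranteed order.
inline std::vector<UInt3> enumerateDispatch(const UInt3& groups, const UInt3& numThreads) {
    std::vector<UInt3> ids;
    ids.reserve(static_cast<std::size_t>(groups.x) * groups.y * groups.z *
                numThreads.x * numThreads.y * numThreads.z);
    for (std::uint32_t gz = 0; gz < groups.z; ++gz)
        for (std::uint32_t gy = 0; gy < groups.y; ++gy)
            for (std::uint32_t gx = 0; gx < groups.x; ++gx)
                for (std::uint32_t tz = 0; tz < numThreads.z; ++tz)
                    for (std::uint32_t ty = 0; ty < numThreads.y; ++ty)
                        for (std::uint32_t tx = 0; tx < numThreads.x; ++tx)
                            ids.push_back(dispatchThreadId(UInt3{gx, gy, gz}, numThreads,
                                                           UInt3{tx, ty, tz}));
    return ids;
}
```

For Dispatch(3, 2, 1) with numthreads(4, 4, 1), enumerateDispatch(UInt3{3, 2, 1}, UInt3{4, 4, 1}) yields 96 thread IDs, with x ranging over 0..11 and y over 0..7. This is why a one-dimensional kernel such as VectorAdd only needs the x component of the ID to index its buffers.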