Parallel Processing on GPUs with the Fermi Architecture
(original title: "Processamento Paralelo em GPUs na Arquitetura Fermi")
Arnaldo Tavares, Tesla Sales Manager, Latin America

Product availability:

| Product                 | Inventory  | Lead time for big orders        | Notes                         |
|-------------------------|------------|---------------------------------|-------------------------------|
| C1060                   | 200 units  | 8 weeks                         | Build to order                |
| M1060                   | 500 units  | 8 weeks                         | Build to order                |
| S1070-400               | 50 units   | 10 weeks                        | Build to order                |
| S1070-500               |            | 10 weeks                        | Build to order                |
| M2050                   | 2000 units | 8 weeks                         | Will maintain inventory       |
| S2050                   |            | 8 weeks                         | Sold out through mid-July     |
| C2050                   |            | 8 weeks                         | Sold out through mid-July     |
| M2070 / C2070           |            | Sept-Oct 2010 (for Latin America) | Get PO in now to get priority |
| M2070-Q                 |            | Oct 2010                        | Get PO in now to get priority |
Quadro or Tesla?
- Numerical Analytics (e.g. MATLAB, Mathematica)
- 3D Modeling / Animation (e.g. 3ds, Maya, Softimage)
- Computational Biology (e.g. AMBER, NAMD, VMD)
- Video Editing / FX (e.g. Adobe CS5, Avid)
GPU Computing: CPU + GPU Co-Processing
- CPU: 4 cores, 48 GigaFlops (double precision)
- GPU: 515 GigaFlops (double precision)
- Average efficiency in Linpack: 50%
GPU speedups by application:
- 146X: Medical Imaging (U of Utah)
- 36X: Molecular Dynamics (U of Illinois, Urbana)
- 18X: Video Transcoding (Elemental Tech)
- 50X: MATLAB Computing (AccelerEyes)
- 100X: Astrophysics (RIKEN)
- 149X: Financial Simulation (Oxford)
- 47X: Linear Algebra (Universidad Jaime)
- 20X: 3D Ultrasound (Techniscan)
- 130X: Quantum Chemistry (U of Illinois, Urbana)
- 30X: Gene Sequencing (U of Maryland)
CUDA tools and ecosystem (available and future):
- Tools and compilers: Parallel Nsight Visual Studio IDE, PGI CUDA Fortran, PGI CUDA x86, PGI Accelerators, CAPS HMPP, TotalView Debugger, Allinea DDT Debugger, TauCUDA Perf Tools, ParaTools VampirTrace, Platform LSF Cluster Manager, Bright Cluster Manager
- Libraries: Thrust C++ Template Lib, CUDA FFT, CUDA BLAS, MAGMA (LAPACK), EMPhotonics CULAPACK, NVIDIA NPP Perf Primitives, NVIDIA RNG & SPARSE, Video Libraries, AccelerEyes Jacket (MATLAB), Wolfram Mathematica
- Seismic / oil & gas: StoneRidge RTM, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Seismic City RTM, Acceleware RTM Solver, Tsunami RTM, Headwave Suite, ffA SVI Pro, Paradigm RTM, Paradigm SKUA, VSG Open Inventor, Panorama Tech
- Bio-Chemistry: NAMD, LAMMPS, HOOMD, VMD, AMBER, TeraChem, GAMESS, Acellera ACEMD, DL-POLY
- BioInformatics: MUMmerGPU, GPU-HMMR, CUDA-MEME, CUDA-EC, OpenEye ROCS, PIPER Docking
- CAE: Autodesk Moldflow, Prometch Particleworks
- Video: Elemental Video
- Rendering: Refractive SW Octane, Caustic Graphics, Lightworks Artisan
- Finance: SciComp SciFinance, Murex MACS
- EDA: Agilent ADS SPICE, Gauda OPC
- Other: Siemens 4D Ultrasound, Digisens Medical, Manifold GIS
3 of Top 5 Supercomputers
[Chart: Linpack Teraflops and power in Megawatts for the top systems]
This is a projection based on Moore's law and does not represent a committed roadmap.
Tesla Roadmap
[Chart: projected peak double-precision Gigaflops by year, 2007-2012: Tesla T10, T20, and T20A vs. Intel Nehalem 3 GHz, Westmere 3 GHz, and 8-core Sandy Bridge 3 GHz]
Project Denver
Tesla C2050 / C2070 specifications:

|                                 | Tesla C2050                     | Tesla C2070                     |
|---------------------------------|---------------------------------|---------------------------------|
| Number of cores                 | 448                             | 448                             |
| Caches                          | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache |
| Floating-point peak performance | 1030 Gigaflops (single), 515 Gigaflops (double) | 1030 Gigaflops (single), 515 Gigaflops (double) |
| GPU memory                      | 3 GB (2.625 GB with ECC on)     | 6 GB (5.25 GB with ECC on)      |
| Memory bandwidth                |                                 |                                 |
| System I/O                      |                                 |                                 |
| Power                           | 238 W (max)                     | 238 W (max)                     |
| Availability                    | Shipping now                    | Shipping now                    |
SM Architecture
- 32 CUDA cores per SM (512 total)
- 16 load/store units: source and destination addresses calculated for 16 threads per clock
- 4 special function units (sine, cosine, square root, etc.)
- 64 KB of RAM for shared memory and L1 cache (configurable)
- Dual warp scheduler
- Interconnect network; 64 KB configurable cache/shared memory; uniform cache
CUDA Core
[Diagram: dispatch port, operand collector, FP unit, INT unit, result queue]
10x faster application context switching (each program receives a time slice of processing resources)
[Timeline figure: activity snapshot of Kernel 0 through Kernel 3 executing across CPU and GPU]
GPU computing language stack, each path compiling down to the PTX ISA:
- HLSL -> DirectX 11 Compute
- OpenCL C -> OpenCL Driver
- C for CUDA -> CUDA Driver
Thread Hierarchy
- Kernels (simple C programs) are executed by threads
- Threads are grouped into blocks
- Threads in a block can synchronize execution
| CUDA         | Hardware Level | Memory Access |
|--------------|----------------|---------------|
| Thread       | CUDA Core      | Registers     |
| Thread Block | SM             | Shared Memory |
| Grid         | Device (GPU)   | Global Memory |
|              | Node           | Host Memory   |

[Diagram: Grid 1 on the device, containing Blocks (0, 0) through (2, 1)]
Compilation flow: the C code is modified into parallel CUDA code; NVCC (Open64) compiles the device code, the rest of the C application goes through the CPU compiler, and the linker combines both into a single executable.
Parallel C for CUDA code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Software Programming
- Double precision
- Compiler optimizations
- Vista 32/64
- Mac OS X
- 3D textures
- HW interpolation
Benefits
- Higher performance for scientific CUDA applications

Features
- Cluster management software tools available on Tesla only
- TCC (Tesla Compute Cluster) driver for Windows supported only on Tesla
- Integrated OEM workstations and servers
- Professional ISVs certify CUDA applications only on Tesla
- 2- to 4-day stress testing and memory burn-in for reliability
- Added margin in memory and core clocks for added reliability