
Parallel Processing on GPUs with the Fermi Architecture
(Processamento Paralelo em GPUs na Arquitetura Fermi)

Arnaldo Tavares
Tesla Sales Manager for Latin America
1

Product Availability Update

Product     Inventory                           Lead time for big orders   Notes
C1060       200 units                           8 weeks                    Build to order
M1060       500 units                           8 weeks                    Build to order
S1070-400   50 units                            10 weeks                   Build to order
S1070-500   25 units + 75 being built           10 weeks                   Build to order
M2050       Shipping now; building 20K for Q2   8 weeks                    Sold out through mid-July
S2050       Shipping now; building 200 for Q2   8 weeks                    Sold out through mid-July
C2050       2000 units                          8 weeks                    Will maintain inventory
M2070       Sept 2010                           -                          Get PO in now to get priority
C2070       Sept-Oct 2010 (for Latin America)   -                          Get PO in now to get priority
M2070-Q     Oct 2010                            -                          -

Quadro or Tesla?

Computer Aided Design


e.g. CATIA, SolidWorks, Siemens NX

Numerical Analytics
e.g. MATLAB, Mathematica

3D Modeling / Animation
e.g. 3ds Max, Maya, Softimage

Computational Biology
e.g. AMBER, NAMD, VMD

Video Editing / FX
e.g. Adobe CS5, Avid

Computer Aided Engineering


e.g. ANSYS, SIMULIA/ABAQUS

GPU Computing
CPU + GPU Co-Processing

CPU (4 cores): 48 GigaFlops (double precision)
GPU: 515 GigaFlops (double precision)
(Average efficiency in Linpack: 50%)
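
A minimal sketch of the co-processing model (not from the deck; the kernel and variable names are illustrative): a kernel launch returns control to the CPU immediately, so the host can work on part of the data while the GPU processes the rest.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void double_gpu(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                          // GPU half of the work
}

int main(void) {
    const int n = 1 << 20;
    const int half = n / 2;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, half * sizeof(float));
    cudaMemcpy(d, h, half * sizeof(float), cudaMemcpyHostToDevice);

    double_gpu<<<(half + 255) / 256, 256>>>(d, half);    // asynchronous launch: returns immediately

    for (int i = half; i < n; ++i) h[i] *= 2.0f;         // CPU works on the other half meanwhile

    cudaMemcpy(h, d, half * sizeof(float), cudaMemcpyDeviceToHost);  // implicitly waits for the kernel
    printf("first=%f last=%f\n", h[0], h[n - 1]);
    cudaFree(d);
    free(h);
    return 0;
}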
3

GPU speedups of 50x-150x:

146X - Medical Imaging - U of Utah
36X - Molecular Dynamics - U of Illinois, Urbana
18X - Video Transcoding - Elemental Technologies
50X - MATLAB Computing - AccelerEyes
100X - Astrophysics - RIKEN
149X - Financial Simulation - Oxford
47X - Linear Algebra - Universidad Jaime I
20X - 3D Ultrasound - Techniscan
130X - Quantum Chemistry - U of Illinois, Urbana
30X - Gene Sequencing - U of Maryland
4

Increasing Number of Professional CUDA Apps


Available Now
CUDA C/C++
Tools

Future
Parallel Nsight Visual Studio IDE, ParaTools VampirTrace, MAGMA (LAPACK), StoneRidge RTM, MATLAB, PGI CUDA x86, TotalView Debugger

PGI Accelerators, CAPS HMPP, EMPhotonics CULAPACK, OpenGeoSolutions OpenSEIS, VSG Open Inventor, NAMD, LAMMPS, MUMmerGPU, GPU-HMMER, Autodesk Moldflow

Platform LSF Cluster Manager, Bright Cluster Manager, Thrust C++ Template Library, GeoStar Seismic Suite, Seismic City RTM, HOOMD, VMD, CUDA-MEME, CUDA-EC, Prometech Particleworks

TauCUDA Performance Tools, Allinea DDT Debugger, NVIDIA NPP Performance Primitives, Acceleware RTM Solver, Tsunami RTM, TeraChem, GAMESS, PIPER Docking

PGI CUDA Fortran, CUDA FFT, CUDA BLAS, Headwave Suite, ffA SVI Pro, AMBER

AccelerEyes Jacket (MATLAB), Wolfram Mathematica, NVIDIA RNG & SPARSE Libraries, NVIDIA Video Libraries, CUDA Libraries, Paradigm RTM, Paradigm SKUA, Panorama Tech

Libraries

Oil & Gas

BigDFT, ABINIT, CP2K

Acellera ACEMD

DL-POLY

Bio-Chemistry

GROMACS, CUDA-BLASTP, CUDA-SW++ (Smith-Waterman), ACUSIM AcuSolve 1.8


Announced

OpenEye ROCS

BioInformatics

HEX Protein Docking


Remcom XFdtd 7.0, ANSYS Mechanical, LSTC LS-DYNA 971, Metacomp CFD++, FluiDyna OpenFOAM, MSC.Software Marc 2010.2
5

CAE

Available

Increasing Number of Professional CUDA Apps


Available Now
Adobe Premiere Pro CS5, MainConcept CUDA Encoder, Bunkspeed Shot (iray), ARRI various apps, GenArts Sapphire, Fraunhofer JPEG2000, Random Control Arion, TDVision TDVCodec, Cinnafilm Pixel Strings, ILM Plume, Blackmagic DaVinci, Assimilate SCRATCH, Autodesk 3ds Max, Cebas finalRender, Chaos Group V-Ray GPU, Works Zebra Zeany, The Foundry Kronos

Future

Video

Elemental Video
Refractive Software Octane

Rendering

mental images iray (OEM)


NAG RNG
Finance

NVIDIA OptiX (SDK)


Numerix Risk, Hanweck Options Analytics, CST Microwave Studio, SPEAG SEMCAD X

Caustic Graphics
SciComp SciFinance, Murex MACS, Agilent ADS SPICE, Gauda OPC

Weta Digital PantaRay


RMS Risk Mgt Solutions

Lightworks Artisan

Aquimin AlphaVision, Agilent EMPro 2010, Synopsys TCAD

EDA

Acceleware FDTD Solver Acceleware EM Solution

Rocketick Verilog Sim

Siemens 4D Ultrasound
Other

Digisens Medical
Manifold GIS

Schrodinger Core Hopping

Useful Progress Med

MVTec Machine Vision

MotionDSP Ikena Video


Announced

Dalsa Machine Vision, Digital Vision, Anarchy Photo
6

Available

3 of Top 5 Supercomputers

(Chart: Linpack performance in Teraflops and power in Megawatts for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100.)


What if Every Supercomputer Had Fermi?

(Chart: Top 500 Supercomputers, Nov 2009, Linpack Teraflops, with the Fermi configuration needed to reach each tier:)

Top 50: 450 GPUs, 110 TeraFlops, $2.2M
Top 100: 225 GPUs, 55 TeraFlops, $1.1M
Top 150: 150 GPUs, 37 TeraFlops, $740K

Hybrid ExaScale Trajectory

2008: 1 TFLOP, 7.5 kW
2010: 1.27 PFLOPS, 2.55 MW
2017*: 2 EFLOPS, 10 MW

* Projection based on Moore's law; does not represent a committed roadmap.

10

Tesla Roadmap

11

The March of the GPUs


(Chart 1: Peak double-precision floating point, GFlops/sec, 2007-2012 - NVIDIA GPUs T10, T20, T20A vs. x86 CPUs Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.)

(Chart 2: Peak memory bandwidth, GBytes/sec, 2007-2012 - NVIDIA GPUs (ECC off) T10, T20, T20A vs. the same x86 CPUs.)

12

Project Denver

13

Expected Tesla Roadmap with Project Denver

14

Workstation / Data Center Solutions


Workstations: up to 4x Tesla C2050/70 GPUs

OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U

Integrated CPU-GPU server: 2x Tesla M2050/70 GPUs in 1U


15

Tesla C-Series Workstation GPUs

Tesla C2050 / Tesla C2070

Processors: Tesla 20-series GPU
Number of cores: 448
Caches: 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
Floating point peak performance: 1030 Gigaflops (single precision), 515 Gigaflops (double precision)
GPU memory: 3 GB (2.625 GB with ECC on) on the C2050; 6 GB (5.25 GB with ECC on) on the C2070
Memory bandwidth: 144 GB/s (GDDR5)
System I/O: PCIe x16 Gen2
Power: 238 W (max)
Availability: shipping now (both models)
16

How is the GPU Used?

Basic component: the Streaming Multiprocessor (SM)

SIMD: Single Instruction, Multiple Data - the same instruction is issued to all cores, but each core operates on different data.
SIMD within an SM; MIMD across SMs on the GPU chip.

Source: Presentation from Felipe A. Cruz, Nagasaki University
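
A hedged illustration of the SIMD/SIMT point (the kernel name is mine): all 32 threads of a warp share one instruction stream, so a data-dependent branch is handled by running the two paths one after the other with inactive threads masked off.

__global__ void divergent_branch(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Both branches are issued to the whole warp; threads that did not take
    // a path are masked off, so the two paths execute one after the other.
    if (i % 2 == 0)
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
}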

17

The Use of GPUs and Bottleneck Analysis

Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology

18

The Fermi Architecture


3 billion transistors
16 Streaming Multiprocessors (SMs)
6 x 64-bit memory partitions = 384-bit memory interface
Host interface: connects the GPU to the CPU via PCI-Express
GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
19

SM Architecture
32 CUDA cores per SM (512 total)
16 load/store units: source and destination addresses calculated for 16 threads per clock
4 special function units (sine, cosine, square root, etc.)
64 KB of RAM per SM, configurable as shared memory and L1 cache
Dual warp scheduler

(Diagram: instruction cache, two warp schedulers with two dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache.)

20

Dual Warp Scheduler


1 warp = 32 parallel threads

2 warps are issued and executed concurrently

Each warp is dispatched to 16 CUDA cores
Most instructions can be dual-issued (exception: double-precision instructions)
The dual-issue model allows near-peak hardware performance
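
A small sketch of how warp granularity shows up in code (the kernel name is illustrative): the built-in warpSize variable (32 on Fermi) lets each thread compute which warp and lane it belongs to.

__global__ void warp_coordinates(int *warp_id, int *lane_id) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    warp_id[i] = threadIdx.x / warpSize;   // which warp of the block this thread belongs to (warpSize == 32)
    lane_id[i] = threadIdx.x % warpSize;   // position of the thread inside its warp
}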

21

CUDA Core Architecture


New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs Newly designed integer ALU optimized for 64-bit and extended precision operations Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision

Instruction Cache Scheduler Scheduler Dispatch Dispatch

Register File Core Core Core Core

Core Core Core Core


Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core

CUDA Core
Dispatch Port Operand Collector FP Unit INT Unit

Core Core Core Core


Core Core Core Core Load/Store Units x 16

Result Queue

Special Func Units x 4

Interconnect Network
64K Configurable Cache/Shared Mem Uniform Cache

22

Fused Multiply-Add Instruction (FMA)
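
FMA computes a*b + c with a single rounding step, unlike a separate multiply followed by an add. A minimal sketch (the kernel names are illustrative) using the CUDA math functions fmaf and fma, which map to the FMA instruction:

__global__ void saxpy_fma(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fmaf(a, x[i], y[i]);   // a*x[i] + y[i] with a single rounding step (single precision)
}

__global__ void daxpy_fma(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fma(a, x[i], y[i]);    // the 64-bit double-precision variant
}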

23

GigaThread™ Hardware Thread Scheduler (HTS)


Hierarchically manages thousands of simultaneously active threads

10x faster application context switching (each program receives a time slice of processing resources)

HTS

Concurrent kernel execution

24

GigaThread Hardware Thread Scheduler


Concurrent Kernel Execution + Faster Context Switch
(Diagram: timeline comparing serial kernel execution, where kernels 1-5 run one after another, with parallel kernel execution, where independent kernels share the GPU and run concurrently.)
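
A hedged sketch of concurrent kernel execution (kernel and stream names are mine): independent kernels launched into different CUDA streams may overlap on Fermi, while launches into a single stream still execute serially.

#include <cuda_runtime.h>

__global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernel_b(float *y) { y[threadIdx.x] *= 2.0f; }

int main(void) {
    float *x, *y;
    cudaMalloc(&x, 256 * sizeof(float));
    cudaMalloc(&y, 256 * sizeof(float));
    cudaMemset(x, 0, 256 * sizeof(float));
    cudaMemset(y, 0, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different streams may execute concurrently on Fermi.
    kernel_a<<<1, 256, 0, s1>>>(x);
    kernel_b<<<1, 256, 0, s2>>>(y);

    cudaDeviceSynchronize();          // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}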


25

GigaThread Streaming Data Transfer Engine


Dual DMA engines
Simultaneous CPU-to-GPU and GPU-to-CPU data transfers
Fully overlapped with CPU and GPU processing time

(Activity snapshot diagram: kernels 0-3 execute while the two streaming data transfer (SDT) engines move data in both directions, overlapped with CPU and GPU work.)
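
A hedged sketch of how the dual DMA engines can be exercised from CUDA (buffer, stream, and kernel names are mine): with page-locked host memory and cudaMemcpyAsync in separate streams, an upload, a download, and kernel work can proceed at the same time.

#include <cuda_runtime.h>

__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, n * sizeof(float));   // page-locked host memory, required for async copies
    cudaMallocHost(&h_b, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Upload buffer A in stream s1 while buffer B is processed and downloaded in stream s2;
    // the two copies can use the two DMA engines simultaneously.
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);
    cudaMemcpyAsync(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost, s2);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}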

26

Cached Memory Hierarchy


First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

Shared memory / L1 cache per SM (64 KB)
Improves bandwidth and reduces latency

Unified L2 cache (768 KB)
Fast, coherent data sharing across all cores in the GPU

Global memory (up to 6 GB)
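
A minimal sketch of the configurable 64 KB split (the kernel and wrapper names are illustrative): cudaFuncSetCacheConfig lets a program request a larger L1 or a larger shared-memory partition for a particular kernel.

#include <cuda_runtime.h>

__global__ void stencil3(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

void launch_stencil(const float *d_in, float *d_out, int n) {
    // Request the larger L1 split (48 KB L1 / 16 KB shared memory) for this kernel;
    // cudaFuncCachePreferShared would request the opposite split.
    cudaFuncSetCacheConfig(stencil3, cudaFuncCachePreferL1);
    stencil3<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
}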

27

CUDA: Compute Unified Device Architecture


NVIDIA's parallel computing architecture
A software development platform targeting the GPU architecture
Device-level APIs and language integration

Software stack (top to bottom):

Applications using DirectX, OpenCL, the CUDA Driver API, or C, C++, Fortran, Java, Python, ...
HLSL / OpenCL C / C for CUDA
DirectX 11 Compute / OpenCL Driver / C Runtime for CUDA / CUDA Driver
PTX (ISA)
CUDA support in the kernel-level driver
CUDA parallel compute engines inside the GPU

28

Thread Hierarchy
Kernels (simple C programs) are executed by threads
Threads are grouped into blocks
Threads in a block can synchronize execution

Blocks are grouped into a grid

Blocks are independent (they must be able to execute in any order)

Source: Presentation from Felipe A. Cruz, Nagasaki University

29

Memory and Hardware Hierarchy


Threads access registers; CUDA cores execute threads
Threads within a block share data/results via shared memory; streaming multiprocessors (SMs) execute blocks
Grids share results via global memory (after kernel-wide global synchronization); the GPU executes grids
Source: Presentation from Felipe A. Cruz, Nagasaki University
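
A hedged sketch tying the levels together (the kernel name is mine): per-thread values live in registers, threads of a block cooperate through __shared__ memory with __syncthreads(), and each block writes its partial result to global memory for the rest of the grid.

// Launch with 256 threads per block (a power of two), e.g. block_sums<<<nblocks, 256>>>(...).
__global__ void block_sums(const float *in, float *block_out, int n) {
    __shared__ float tile[256];                     // shared memory: visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread index lives in registers

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // block-level synchronization

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_out[blockIdx.x] = tile[0];            // one partial sum per block goes to global memory
}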

30

Full View of the Hierarchy Model

CUDA level    Hardware level    Memory access
Thread        CUDA core         Registers
Block         SM                Shared memory
Grid          GPU               Global memory
Device        Node              Host memory

31

IDs and Dimensions


Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions are set at launch time and can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim

(Diagram: a device running Grid 1, a 3x2 arrangement of blocks; Block (1,1) is expanded into a 5x3 arrangement of threads.)
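
A small sketch of the built-in variables in use (the kernel and buffer names are illustrative): a 2D grid of 2D blocks covers a matrix, and each thread derives its row and column from blockIdx, blockDim, and threadIdx.

__global__ void add_matrices(const float *a, const float *b, float *c, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column from 2D block/thread IDs
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    if (row < height && col < width)
        c[row * width + col] = a[row * width + col] + b[row * width + col];
}

void launch_add(const float *d_a, const float *d_b, float *d_c, int width, int height) {
    dim3 block(16, 16);                                        // blockDim
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);               // gridDim, set at launch time
    add_matrices<<<grid, block>>>(d_a, d_b, d_c, width, height);
}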

32

Compiling C for CUDA Applications


void serial_function( ) { ... }

void other_function(int ... ) { ... }

void saxpy_serial(float ... ) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void main( ) {
    float x;
    saxpy_serial(..);
    ...
}

Compilation flow:
Key kernels (C for CUDA) -> NVCC (Open64): modified into parallel CUDA code and compiled to CUDA object files
Rest of the C application -> CPU compiler: compiled to CPU object files
Linker: combines CUDA and CPU object files into a single CPU-GPU executable
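
As a hedged example (the file and program names are mine), a single .cu file holding both host and device code can be built in one step; nvcc compiles the device portion, hands the host portion to the system C/C++ compiler, and links the result:

nvcc -arch=sm_20 -o saxpy saxpy.cu

Here -arch=sm_20 targets the Fermi (compute capability 2.0) architecture described in this deck.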

33

C for CUDA : C with a few keywords


Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
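
The slide shows only the kernel and its launch; a hedged sketch of the host-side code that would surround it (the wrapper and buffer names are mine) allocates device memory, copies the inputs over, launches the kernel, and copies the result back:

#include <cuda_runtime.h>

// saxpy_parallel is the kernel shown above.
void saxpy_on_gpu(int n, float a, const float *x, float *y) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // result back to the host
    cudaFree(d_x);
    cudaFree(d_y);
}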

34

Software Programming

Source: Presentation from Andreas Klöckner, NYU

35

Software Programming

Source: Presentation from Andreas Klöckner, NYU

36

Software Programming

Source: Presentation from Andreas Klöckner, NYU

37

Software Programming

Source: Presentation from Andreas Klöckner, NYU

38

Software Programming

Source: Presentation from Andreas Klöckner, NYU

39

Software Programming

Source: Presentation from Andreas Klöckner, NYU

40

Software Programming

Source: Presentation from Andreas Klöckner, NYU

41

Software Programming

Source: Presentation from Andreas Klöckner, NYU

42

CUDA C/C++ Leadership


CUDA Toolkit 1.0 (July 2007): C compiler, C extensions, single-precision BLAS and FFT, SDK with 40 examples

CUDA Toolkit 1.1 (Nov 2007): Win XP 64, atomics support, multi-GPU support

CUDA Visual Profiler (April 2008)

CUDA Toolkit 2.0 (Aug 2008): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation

CUDA Toolkit 2.2: cuda-gdb HW debugger

CUDA Toolkit 2.3 (July 2009): DP FFT, 16/32-bit conversion intrinsics, performance enhancements

Parallel Nsight Beta (Nov 2009)

CUDA Toolkit 3.0 (Mar 2010): C++ inheritance, Fermi architecture support, tools updates, driver/runtime interop

43

Why should I choose Tesla over consumer cards?


Feature -> Benefit

Features:

4x higher double-precision performance (20-series) -> Higher performance for scientific CUDA applications

ECC, available only on Tesla & Quadro (20-series) -> Data reliability inside the GPU and on DRAM memories

Bi-directional PCI-E communication (Tesla has dual DMA engines; GeForce has only one) -> Higher performance for CUDA applications by overlapping communication and computation

Larger memory for larger data sets (3 GB and 6 GB products) -> Higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)

Cluster management software tools, available on Tesla only -> Needed for GPU monitoring and job scheduling in data center deployments

TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla -> Higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and services

Integrated OEM workstations and servers -> Trusted, reliable systems built for Tesla products

Professional ISVs certify CUDA applications only on Tesla -> Bug reproduction, support, and feature requests handled for Tesla only

2- to 4-day stress testing and memory burn-in; added margin in memory and core clocks -> Built for 24/7 computing in data center and workstation environments; always the same clocks for known, reliable performance

Quality & Warranty:

Manufactured and guaranteed by NVIDIA; 3-year warranty from HP -> No changes in key components like GPU and memory without notice; reliable, long-life products

Support & Lifecycle:

Enterprise support with higher priority for CUDA bugs and requests; 18-24 months availability plus 6-month EOL notice -> Ability to influence the CUDA and GPU roadmap, early access to feature requests, and reliable product supply

44
