
ANSYS High Performance Computing (HPC)

2011 CAE Associates

Why HPC

As we have seen, calculation of crack extension often requires multiple solutions of the model. If the full crack path is known, then each of these solutions can be run on separate machines simultaneously. However, if the path of the crack is determined by the previous solution, then each run must be made sequentially.

Use of HPC can greatly reduce the overall time it takes for these calculations.

Why HPC

We have also seen that there is significant scatter and variation in the fatigue properties as well as variation in the geometry, material properties and loading. To account for these variations a statistical assessment is needed, which will require many simulations.

Use of HPC can greatly reduce the overall time it takes for these calculations.

ANSYS HPC

The default ANSYS licensing allows the use of 2 cores. Use of additional cores requires HPC licenses.
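As a point of reference, here is a minimal sketch (not from the original presentation) of how a core count is typically requested when launching a Mechanical APDL batch run from a script. The executable name ansys130 and the file names are assumptions for illustration; -b, -np, -dis, -i, and -o are standard MAPDL launch arguments.

# Minimal sketch: launch an MAPDL batch run with a chosen core count.
# Requesting more than 2 cores consumes HPC licenses, as noted above.
import subprocess

subprocess.run([
    "ansys130",          # assumed v13 MAPDL executable name
    "-b",                # batch mode
    "-dis",              # Distributed ANSYS; omit for a shared-memory (SMP) run
    "-np", "8",          # number of cores to use
    "-i", "flange.dat",  # hypothetical input file
    "-o", "flange.out",  # hypothetical output file
], check=True)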

Memory Options

Shared Memory ANSYS (SMP)


- Multiple processors on one machine, all accessing the same RAM
- Limited by memory bandwidth
- Most, but not all, of the solution phase runs in parallel
- Tops out between 4 and 8 cores

Distributed ANSYS (MPP)

- Can run over a cluster of machines or use multiple processors on one machine
- In the case of clusters, limited by interconnect speed
- Entire solution phase runs in parallel (including stiffness matrix generation, linear equation solving, and results calculation)
- For distributed runs on one machine, the non-solution phases run in SMP mode
- Does not support all analysis types, elements, etc.
- Requires MPI software
- Extends performance to a larger number of cores

Solver Options

SMP Solvers

- Sparse
- JCG
- ICCG
- PCG
- QMR
- AMG

MPP Solvers

- Sparse
- JCG
- PCG

Scalability

Amdahl's law: scalability is limited by serial computations (a short numerical sketch follows the hardware notes below).

Scalability limiters:


- Large contact pairs
- Constraint equations across domain partitions
- MPCs
- Hardware

For distributed ANSYS, hardware must be balanced:

- Fast processors require fast interconnects
- Cannot have lots of I/O to a single disk
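To make the Amdahl's law point concrete, here is a minimal Python sketch (not from the original presentation) of the ideal speedup 1/((1 - p) + p/n) for a job whose parallel fraction is p running on n cores; the 95% parallel fraction used below is an illustrative assumption, not an ANSYS measurement.

def amdahl_speedup(n_cores, parallel_fraction):
    """Ideal Amdahl's-law speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# Even if 95% of the run is perfectly parallel, the speedup flattens out quickly:
for cores in (2, 4, 8, 16, 64):
    print(cores, round(amdahl_speedup(cores, 0.95), 2))
# 2 -> 1.9, 4 -> 3.48, 8 -> 5.93, 16 -> 9.14, 64 -> 15.42; the ceiling is 1/0.05 = 20x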

FEA Benchmark Problem

- Bolted Flange with O-Ring
- Nonlinear material properties (Hyperelastic O-Ring)
- Large Deformation
- Nonlinear Contact
- 1 Million Degrees of Freedom

DHCAD5650 High End Workstation


- (2) Intel Hex Core 2.66 GHz Processors (12 Cores total)
- 24 GB RAM
- (4) 300 GB Toshiba SAS 15,000 RPM drives in a RAID 5 Configuration

FEA Benchmark Performance


[Chart: Solver speed-up vs. number of cores (0-14) on a single machine, ANSYS v13, comparing the SPARSE, DSPARSE, and AMG solvers]

ANSYS Inc. Benchmark

Distributed 4M DOF Sparse Solver - Linear Elastic, SOLID186s

These runs were done on an Intel cluster containing 1000+ nodes, where each node contained two 6-core Westmere processors, 24 GB of RAM, fast I/O, and an InfiniBand DDR2 (~2500 MB/s) interconnect.
[Chart: Solver speed-up vs. number of cores (0-60) for the distributed 4M DOF sparse solver benchmark]

Disk Drive Speed

The bolted flange analysis was run on two different drives of our high-end workstation to compare the influence of disk speed on solution time. The RAID array completed the solution almost twice as fast as the SATA drive:

Run #1: PCG Solver, 12 CPU, In-Core, RAID Array. Wall time = 8754 sec.
Run #2: PCG Solver, 12 CPU, In-Core, SATA Drive. Wall time = 16822 sec.
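As a quick arithmetic check (not part of the original slides), the ratio of the two wall times backs up the "almost twice as fast" statement:

# Ratio of the two wall times reported above.
raid_time, sata_time = 8754.0, 16822.0  # seconds
print(f"RAID speed-up over SATA: {sata_time / raid_time:.2f}x")  # ~1.92x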


Hyperthreading

Hyperthreading allows one physical processor to appear as two logical processors to the operating system. This allows the operating system to perform two different processes simultaneously. It does not, however, allow the processor to do two of the same type of operation simultaneously (e.g., floating-point operations). This form of parallel processing is only effective when a system has many lightweight tasks.
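As an aside (not from the slides), the physical vs. logical processor split can be inspected directly from the operating system; the sketch below uses the third-party psutil package, which is an assumption about the available tooling.

# With Hyperthreading enabled, the OS reports roughly twice as many logical
# processors as physical cores, which is why requesting "all CPUs" in a solver
# can oversubscribe the floating-point units.
import psutil  # third-party package, assumed installed

physical = psutil.cpu_count(logical=False)  # physical cores
logical = psutil.cpu_count(logical=True)    # hardware threads visible to the OS
print(f"{physical} physical cores exposed as {logical} logical processors")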


Hyperthreading and ANSYS

The bolted flange analysis was run with Hyperthreading on and then again with it off to determine its influence.

Run #1: PCG Solver, 12 CPUs, Hyperthreading Off. Wall time = 8754 sec.
Run #2: PCG Solver, 24 CPUs, Hyperthreading On. Wall time = 8766 sec.

An LS-Dyna analysis was also run in the same manner as above with the following results.

Run #1: 12 CPUs, Hyperthreading Off. Wall time = 19560 sec.
Run #2: 24 CPUs, Hyperthreading On. Wall time = 32918 sec.


GPU Accelerator

- Available at v13 for ANSYS Mechanical only
- The ANSYS job launches on the CPU, sends floating-point operations to the GPU for the heavy lifting, and returns data to the CPU to finish the job
- Works in SMP mode only (in v13), using the GPU card's memory

Solvers: Sparse, PCG, JCG

- Model limits for the sparse solver depend on the largest front size: ~3M DOF for the 3 GB Tesla C2050 and ~6M DOF for the 6 GB Tesla C2070
- Model limits for the PCG/JCG solvers: ~1.5M DOF for the 3 GB Tesla C2050 and ~3M DOF for the 6 GB Tesla C2070


GPU Accelerator

Shows an additional 1.65x speedup in a static test case:


Run #1: Sparse Solver, 4 CPUs, GPU Off. Wall time = 240 sec.
Run #2: Sparse Solver, 4 CPUs, GPU On. Wall time = 146 sec.

GPU test cases run on small models do not show great improvement.


GPU Accelerator Performance

ANSYS Inc. Benchmark

