
ANSYS High Performance Computing (HPC)

2011 CAE Associates

Why HPC

As we have seen, calculation of crack extension often requires multiple solutions of the model. If the full crack path is known, then each of these solutions can be run on separate machines simultaneously. However, if the path of the crack is determined by the previous solution, then each run must be made sequentially.

Use of HPC can greatly reduce the overall time it takes for these calculations.

Why HPC

We have also seen that there is significant scatter and variation in the fatigue properties as well as variation in the geometry, material properties and loading. To account for these variations a statistical assessment is needed, which will require many simulations.

Use of HPC can greatly reduce the overall time it takes for these calculations.

ANSYS HPC

The default ANSYS licensing allows the use of 2 cores. Use of additional cores requires HPC licenses.
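As a point of reference, here is a minimal sketch (not from the original presentation) of how a core count is typically requested when launching a Mechanical APDL batch run from a script. The executable name ansys130 and the file names are assumptions for illustration; -b, -np, -dis, -i, and -o are standard MAPDL launch arguments.

# Minimal sketch: launch an MAPDL batch run with a chosen core count.
# Requesting more than 2 cores consumes HPC licenses, as noted above.
import subprocess

subprocess.run([
    "ansys130",          # assumed v13 MAPDL executable name
    "-b",                # batch mode
    "-dis",              # Distributed ANSYS; omit for a shared-memory (SMP) run
    "-np", "8",          # number of cores to use
    "-i", "flange.dat",  # hypothetical input file
    "-o", "flange.out",  # hypothetical output file
], check=True)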

Memory Options

Shared Memory ANSYS (SMP)


- Multiple processors on one machine, all accessing the same RAM
- Limited by memory bandwidth
- Most, but not all, of the solution phase runs in parallel
- Tops out between 4 and 8 cores

Distributed ANSYS (MPP)

- Can run over a cluster of machines or use multiple processors on one machine
- In the case of clusters, limited by interconnect speed
- Entire solution phase runs in parallel (including stiffness matrix generation, linear equation solving, and results calculation)
- For distributed runs on one machine, the non-solution phases run in SMP mode
- Does not support all analysis types, elements, etc.
- Requires MPI software
- Extends performance to a larger number of cores

Solver Options

SMP Solvers

- Sparse
- JCG
- ICCG
- PCG
- QMR
- AMG

MPP Solvers

- Sparse
- JCG
- PCG

Scalability

Amdahl's law: scalability is limited by serial computations (a short numerical sketch follows the hardware notes below).

Scalability limiters:


- Large contact pairs
- Constraint equations across domain partitions
- MPCs
- Hardware

For distributed ANSYS, hardware must be balanced:

- Fast processors require fast interconnects
- Cannot have lots of I/O to a single disk
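To make the Amdahl's law point concrete, here is a minimal Python sketch (not from the original presentation) of the ideal speedup 1/((1 - p) + p/n) for a job whose parallel fraction is p running on n cores; the 95% parallel fraction used below is an illustrative assumption, not an ANSYS measurement.

def amdahl_speedup(n_cores, parallel_fraction):
    """Ideal Amdahl's-law speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# Even if 95% of the run is perfectly parallel, the speedup flattens out quickly:
for cores in (2, 4, 8, 16, 64):
    print(cores, round(amdahl_speedup(cores, 0.95), 2))
# 2 -> 1.9, 4 -> 3.48, 8 -> 5.93, 16 -> 9.14, 64 -> 15.42; the ceiling is 1/0.05 = 20x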

FEA Benchmark Problem

- Bolted Flange with O-Ring
- Nonlinear material properties (Hyperelastic O-Ring)
- Large Deformation
- Nonlinear Contact
- 1 Million Degrees of Freedom

DHCAD5650 High End Workstation


- (2) Intel Hex Core 2.66 GHz Processors (12 Cores total)
- 24 GB RAM
- (4) 300 GB Toshiba SAS 15,000 RPM drives in a RAID 5 Configuration

FEA Benchmark Performance


[Chart: Solver speed-up vs. number of cores (0-14) on a single machine, ANSYS v13, comparing the SPARSE, DSPARSE, and AMG solvers]

ANSYS Inc. Benchmark

Distributed 4M DOF Sparse Solver - Linear Elastic, SOLID186s

These runs were done on an Intel cluster containing 1000+ nodes, where each node contained two 6-core Westmere processors, 24 GB of RAM, fast I/O, and an InfiniBand DDR2 (~2500 MB/s) interconnect.
[Chart: Solver speed-up vs. number of cores (0-60) for the distributed 4M DOF sparse solver benchmark]

Disk Drive Speed

The bolted flange analysis was run on two different drives of our high-end workstation to compare the influence of disk speed on solution time. The RAID array completed the solution almost twice as fast as the SATA drive:

Run #1: PCG Solver, 12 CPU, In-Core, RAID Array. Wall time = 8754 sec.
Run #2: PCG Solver, 12 CPU, In-Core, SATA Drive. Wall time = 16822 sec.
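As a quick arithmetic check (not part of the original slides), the ratio of the two wall times backs up the "almost twice as fast" statement:

# Ratio of the two wall times reported above.
raid_time, sata_time = 8754.0, 16822.0  # seconds
print(f"RAID speed-up over SATA: {sata_time / raid_time:.2f}x")  # ~1.92x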


Hyperthreading

Hyperthreading allows one physical processor to appear as two logical processors to the operating system. This allows the operating system to perform two different processes simultaneously. It does not, however, allow the processor to do two of the same type of operation simultaneously (e.g., floating-point operations). This form of parallel processing is only effective when a system has many lightweight tasks.
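As an aside (not from the slides), the physical vs. logical processor split can be inspected directly from the operating system; the sketch below uses the third-party psutil package, which is an assumption about the available tooling.

# With Hyperthreading enabled, the OS reports roughly twice as many logical
# processors as physical cores, which is why requesting "all CPUs" in a solver
# can oversubscribe the floating-point units.
import psutil  # third-party package, assumed installed

physical = psutil.cpu_count(logical=False)  # physical cores
logical = psutil.cpu_count(logical=True)    # hardware threads visible to the OS
print(f"{physical} physical cores exposed as {logical} logical processors")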


Hyperthreading and ANSYS

The bolted flange analysis was run with Hyperthreading on and then again with it off to determine its influence.

Run #1: PCG Solver, 12 CPUs, Hyperthreading Off. Wall time = 8754 sec.
Run #2: PCG Solver, 24 CPUs, Hyperthreading On. Wall time = 8766 sec.

An LS-Dyna analysis was also run in the same manner as above with the following results.

Run #1: 12 CPUs, Hyperthreading Off. Wall time = 19560 sec.
Run #2: 24 CPUs, Hyperthreading On. Wall time = 32918 sec.


GPU Accelerator

- Available at v13 for ANSYS Mechanical only
- The ANSYS job launches on the CPU, sends floating-point operations to the GPU for the heavy lifting, and returns data to the CPU to finish the job
- Works in SMP mode only (in v13), using the GPU card's memory

Solvers: Sparse, PCG, JCG

- Model limits for the sparse solver depend on the largest front size: ~3M DOF for the 3 GB Tesla C2050 and ~6M DOF for the 6 GB Tesla C2070
- Model limits for the PCG/JCG solvers: ~1.5M DOF for the 3 GB Tesla C2050 and ~3M DOF for the 6 GB Tesla C2070


GPU Accelerator

Shows an additional 1.65x speedup in a static test case:


Run #1: Sparse Solver, 4 CPUs, GPU Off. Wall time = 240 sec.
Run #2: Sparse Solver, 4 CPUs, GPU On. Wall time = 146 sec.

GPU test cases run on small models do not show great improvement.


GPU Accelerator Performance

ANSYS Inc. Benchmark

