šŸ¦š Edition 12: How to Select the Right GPU Instance on AWS?


Imagination is the exaggeration of the data you have in your brain. I want to train a diffusion model to compose a piece of music on a lovely evening in Amsterdam. To do that, however, I need an AWS GPU instance. We machine learning engineers are always baffled about which GPU instance on AWS is the optimal one. I did a short study on this, and the outcome is:

If you are doing an HPC (High Performance Computing) job like drug discovery or another high-precision workload, we suggest the P (historically called Performance-Heavy) instance family. Otherwise, we recommend the G (historically called Graphics-Heavy) instance family. The cost charts for the noted GPU instances are below.
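Before looking at the prices, you may want to double-check which GPU actually sits behind each family. A small boto3 sketch like the one below can list them; the region and the family wildcards are my assumptions, not part of the original study.

```python
# A minimal sketch: list the GPU model, count, and memory behind the P and G
# families so you can compare hardware, not just price.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption

paginator = ec2.get_paginator("describe_instance_types")
pages = paginator.paginate(
    Filters=[{"Name": "instance-type", "Values": ["p3.*", "p4d.*", "g4dn.*", "g5.*"]}]
)

for page in pages:
    for itype in page["InstanceTypes"]:
        for gpu in itype.get("GpuInfo", {}).get("Gpus", []):
            print(
                itype["InstanceType"],
                gpu["Manufacturer"],
                gpu["Name"],
                f"x{gpu['Count']}",
                f"{gpu['MemoryInfo']['SizeInMiB']} MiB per GPU",
            )
```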

šŸ„ P3 and P4 Instances Cost

šŸ° G4 and G5 Instances Cost

šŸ¹ Always Don't Select GPU on the Price Ground

Please don't always pick a GPU purely on price. We ran a small experiment: we trained a SciBERT transformer model on 100K data points.

The result on a g4dn.2xlarge machine:

We ran the same configuration on a g5.xlarge machine too. The result:

So if we use a g5.xlarge machine, we can save 20% of our budget.
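To be explicit about why the hourly price alone is misleading: what you actually pay for is a whole training run, i.e. hourly price times wall-clock time. The numbers below are placeholders just to show the arithmetic, not the measured results from the run above.

```python
# Back-of-the-envelope sketch with placeholder numbers (not the measured results):
# the metric that matters is cost per training run, not cost per hour.

def training_cost(price_per_hour: float, wall_clock_hours: float) -> float:
    """Cost of one full training run on a given instance."""
    return price_per_hour * wall_clock_hours

# A cheaper-per-hour instance can still cost more per run if the newer GPU
# finishes the same job significantly faster.
g4dn_run = training_cost(price_per_hour=0.75, wall_clock_hours=10.0)  # placeholder values
g5_run = training_cost(price_per_hour=1.00, wall_clock_hours=6.0)     # placeholder values

print(f"g4dn run: ${g4dn_run:.2f}, g5 run: ${g5_run:.2f}")
print(f"saving on g5: {1 - g5_run / g4dn_run:.0%}")
```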

Other benefits of the G5 family over the G4 family are:

  • The NVIDIA Ampere architecture is a modern architecture; it supports all the precision formats.

  • We should follow mixed precision training in PyTorch-based projects. There are different floating-point datatypes - FP32, FP16, TF32, BF16 (a minimal training-loop sketch follows below).

We ran the apples-to-apples comparison with the fp16 datatype because tf32 and bf16 need the Ampere architecture, which is only available under the g5 instance family.
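Here is a minimal PyTorch mixed precision (AMP) training-loop sketch. The model and data are toy placeholders, not the SciBERT setup; on a g4dn (Turing) you would use `torch.float16` as shown, and on a g5 (Ampere) you could switch the `dtype` to `torch.bfloat16`.

```python
import torch
from torch import nn

# Toy model and data just to show the AMP pattern; not the SciBERT training code.
device = "cuda"
model = nn.Linear(512, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# GradScaler guards fp16 against gradient underflow; with bf16 it is usually
# unnecessary because of the larger dynamic range.
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 2, (32,), device=device)

    optimizer.zero_grad(set_to_none=True)
    # float16 works on g4dn (Turing); on g5 (Ampere) bfloat16 is also an option.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = loss_fn(logits, y)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```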

šŸ¼ PS : What is TF32 and BF16?

BF16

If you have access to Ampere or newer hardware you can use bf16 for your training and evaluation. While bf16 has worse precision than fp16, it has a much, much bigger dynamic range. Therefore, if in the past you were experiencing overflow issues while training the model, bf16 will prevent this from happening most of the time. Remember that in fp16 the biggest number you can have is `65504` and any number above that will overflow. A bf16 number can be as large as `3.39e+38` (!) which is about the same as fp32 - because both have 8 bits used for the numerical range.
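You can verify this dynamic-range difference yourself in a couple of lines of PyTorch (a quick sketch, not from the original experiment):

```python
import torch

# Largest representable value in each floating-point format.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
print(torch.finfo(torch.float32).max)   # ~3.40e+38

# 70,000 overflows to inf in fp16 but survives in bf16 (with some rounding,
# since bf16 trades mantissa precision for range).
print(torch.tensor(70_000.0, dtype=torch.float16))   # tensor(inf, dtype=torch.float16)
print(torch.tensor(70_000.0, dtype=torch.bfloat16))  # tensor(70144., dtype=torch.bfloat16)
```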

TF32

The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8 bits), but instead of 23 bits of precision it has only 10 bits (the same as fp16) and uses only 19 bits in total.

Itā€™s magical in the sense that you can use the normal fp32 training and/or inference code and by enabling tf32 support you can get up to 3x throughput improvement.

When this is done, CUDA will automatically switch to using tf32 instead of fp32 wherever it is possible. This, of course, assumes that the GPU in use is from the Ampere series.
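In PyTorch, "enabling tf32 support" comes down to two flags (a minimal sketch; the defaults have changed between PyTorch versions, so check what your version does):

```python
import torch

# Allow TF32 tensor cores for matmuls and cuDNN convolutions.
# These flags only take effect on Ampere (or newer) GPUs, e.g. the A10G in g5.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# The existing fp32 training / inference code stays unchanged; CUDA decides
# where TF32 can be used under the hood.
```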

Like all cases with reduced precision this may or may not be satisfactory for your needs, so you have to experiment and see. According to NVIDIA research the majority of machine learning training shouldnā€™t be impacted and showed the same perplexity and convergence as the fp32 training.

FP8

I will create a blog post on this shortly.

For further reading, I found the detailed AWS GPU instance documentation:

šŸÆ Reference


I will publish the next Edition on Sunday.

This is the 12th Edition. If you have any feedback, please don't hesitate to share it with me. And if you love my work, do share it with your colleagues.

It takes time to research and document this - please become a paid subscriber and support my work.

Cheers!!

Raahul

