⏳ The 5th Edition: Load Test Comparison of In-house Self-hosted LLM Models
Load Test Comparison of In-house Self-hosted LLM Models
💫 Cost of Summarization: GPT 4 - 8K Context Length
💰 Pricing of the Model
Cost of Prompt: $30 per million tokens
Cost of Response: $60 per million tokens
🏎️ Cost Formula
Tokens per Article (in thousands) * No. of Articles (in thousands) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $30 = $180,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $60 = $180,000
💵 Total Cost
Input Cost + Output Cost = $180,000 + $180,000 = $360,000
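The arithmetic above can be sketched as a small helper. This is a minimal illustration rather than the author's code: the function name `api_cost_usd` and its parameter names are assumptions, while the figures (6 million articles, 1000 prompt tokens, 500 response tokens, $30 and $60 per million tokens) come from this section.

```python
def api_cost_usd(articles, prompt_tokens, response_tokens,
                 prompt_price_per_m, response_price_per_m):
    """Return (input_cost, output_cost, total_cost) in USD for a pay-per-token API.

    Prices are quoted per 1 million tokens, so token counts are divided by 1e6.
    """
    input_cost = articles * prompt_tokens / 1_000_000 * prompt_price_per_m
    output_cost = articles * response_tokens / 1_000_000 * response_price_per_m
    return input_cost, output_cost, input_cost + output_cost


# GPT-4 8K figures from this section.
gpt4_8k = api_cost_usd(6_000_000, 1000, 500, 30, 60)
print(gpt4_8k)
```

Swapping in the prices of another hosted model reuses the same formula unchanged.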
💫 Cost of Summarization: GPT 4 - 32K Context Length
💰 Pricing of the Model
Cost of Prompt: $60 per million tokens
Cost of Response: $120 per million tokens
🏎️ Cost Formula
Tokens per Article (in thousands) * No. of Articles (in thousands) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $60 = $360,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $120 = $360,000
💵 Total Cost
Input Cost + Output Cost = $360,000 + $360,000 = $720,000
💫 Cost of Summarization: Anthropic Claude V1
💰 Pricing of the Model
Cost of Prompt: $11 per million tokens
Cost of Response: $32 per million tokens
🏎️ Cost Formula
Tokens per Article (in thousands) * No. of Articles (in thousands) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $11 = $66,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $32 = $96,000
💵 Total Cost
Input Cost + Output Cost = $66,000 + $96,000 = $162,000
💫 Cost of Summarization: InstructGPT DaVinci
💰 Pricing of the Model
Cost of Prompt: $20 per million tokens
Cost of Response: $20 per million tokens
🏎️ Cost Formula
Tokens per Article (in thousands) * No. of Articles (in thousands) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $20 = $120,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $20 = $60,000
💵 Total Cost
Input Cost + Output Cost = $120,000 + $60,000 = $180,000
💫 Cost of Summarization: Curie
💰 Pricing of the Model
Cost of Prompt: $2 per million tokens
Cost of Response: $2 per million tokens
🏎️ Cost Formula
Tokens per Article (in thousands) * No. of Articles (in thousands) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $2 = $12,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $2 = $6,000
💵 Total Cost
Input Cost + Output Cost = $12,000 + $6,000 = $18,000
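To see the hosted options side by side, the totals can be recomputed from the listed prices. A minimal sketch: the prices and token volumes are the ones stated in the sections above, while the dictionary layout and variable names are only illustrative.

```python
# Prices in USD per 1 million tokens (prompt, response), as listed above.
PRICES = {
    "GPT-4 8K": (30, 60),
    "GPT-4 32K": (60, 120),
    "Claude V1": (11, 32),
    "InstructGPT DaVinci": (20, 20),
    "Curie": (2, 2),
}

INPUT_TOKENS_M = 6_000   # 6 million articles * 1000 tokens, in millions of tokens
OUTPUT_TOKENS_M = 3_000  # 6 million articles * 500 tokens, in millions of tokens

totals = {
    name: INPUT_TOKENS_M * prompt + OUTPUT_TOKENS_M * response
    for name, (prompt, response) in PRICES.items()
}

# Print from cheapest to most expensive.
for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name:<20} ${total:>9,}")
```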
⚠️ Disclaimer
It's not a direct apples-to-apples comparison:
If you use an API service such as Azure/OpenAI, you don't need to assemble a load balancer, autoscaling, and the other supporting parts. The API encapsulates everything, and you only bear the final price. If you operate a self-hosted API instead, you must take care of the batcher service and the cluster operation (Kubernetes) yourself.
The API services use other frameworks (RLHF or a toxicity detection layer) alongside the LLM itself to provide a better-moderated response.
💫 Cost of Summarization: 1.3 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro per hour
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens per Hour) * Node Cost (per Hour)
💰 Derived Details of the Model
Cost per Million Tokens: (10 <node cost> / (9135.57 <tokens_per_s> * 8 <8 Core machine> * 3600)) * 1000000 = $0.038
Average Latency: ~50 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens
9 Billion * 1 / (9135.57 <achieved on a single node with a single core> * 8 * 3600) * 10
9000000000 * (1 / (9135.57 * 8 * 3600)) * 10 = $342
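The self-hosted formula can be written out the same way. This is a sketch under stated assumptions: the function and parameter names are hypothetical, throughput is assumed to scale linearly across the 8 units of the node (which is what the "* 8" factor above implies), and the result is in whatever currency the hourly node cost is quoted in.

```python
def self_hosted_cost(total_tokens, tokens_per_s, units_per_node=8,
                     node_cost_per_hour=10.0):
    """Return (node_hours, total_cost) for a batch job on one node.

    Assumes throughput scales linearly from one unit to all units on the
    node, which is what the "* 8" factor in the post implies.
    """
    tokens_per_hour = tokens_per_s * units_per_node * 3600
    node_hours = total_tokens / tokens_per_hour
    return node_hours, node_hours * node_cost_per_hour


# 1.3B model: 9 billion tokens at the measured 9135.57 tokens per second.
hours, cost = self_hosted_cost(9_000_000_000, 9135.57)
print(f"{hours:.1f} node-hours, about {cost:.0f} in node cost")
```

The intermediate `node_hours` value is useful on its own, since it shows how long a single node would need to run for the whole job.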
💫 Cost of Summarization: 2.7 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro per hour
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens per Hour) * Node Cost (per Hour)
💰 Derived Details of the Model
Cost per Million Tokens: (10 <node cost> / (7538.69 <tokens_per_s> * 8 <8 Core machine> * 3600)) * 1000000 = $0.046
Average Latency: ~63 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens
9 Billion * 1 / (7538.69 <achieved on a single node with a single core> * 8 * 3600) * 10
9000000000 * (1 / (7538.69 * 8 * 3600)) * 10 = $414.5
Thank you for reading Musings on AI. This post is public so feel free to share it.
💫 Cost of Summarization: 6.7 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro per hour
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens per Hour) * Node Cost (per Hour)
💰 Derived Details of the Model
Cost per Million Tokens: (10 <node cost> / (4285.98 <tokens_per_s> * 8 <8 Core machine> * 3600)) * 1000000 = $0.081
Average Latency: ~70 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens
9 Billion * 1 / (4285.98 <achieved on a single node with a single core> * 8 * 3600) * 10
9000000000 * (1 / (4285.98 * 8 * 3600)) * 10 = $729.12
💫 Cost of Summarization: 13 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro per hour (With AWS Discount)
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens per Hour) * Node Cost (per Hour)
💰 Derived Details of the Model
Cost per Million Tokens: (10 <node cost> / (1564.10 <tokens_per_s> * 8 <8 Core machine> * 3600)) * 1000000 = $0.22
Average Latency: ~190 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens
9 Billion * 1 / (1564.10 <achieved on a single node with a single core> * 8 * 3600) * 10
9000000000 * (1 / (1564.10 * 8 * 3600)) * 10 = $1998
💫 Cost of Summarization: 30 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro per hour (With AWS Discount)
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens per Hour) * Node Cost (per Hour)
💰 Derived Details of the Model
Cost per Million Tokens: (10 <node cost> / (370.2 <tokens_per_s> * 8 <8 Core machine> * 3600)) * 1000000 = $0.937
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens
9 Billion * 1 / (370.2 <achieved on a single node with a single core> * 8 * 3600) * 10
9000000000 * (1 / (370.2 * 8 * 3600)) * 10 = $8441
The load test did not complete properly because of a CUDA memory error, which is why I couldn't generate the latency figures. In production, 8 shards of the node will be used, so we won't face this problem.
💫 Cost of Summarization: 7 Billion LLAMA-2 Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro per hour (With AWS Discount)
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens per Hour) * Node Cost (per Hour)
💰 Derived Details of the Model
Cost per Million Tokens: (10 <node cost> / (2725.52 <tokens_per_s> * 8 <8 Core machine> * 3600)) * 1000000 = $0.13
Average Latency: ~120 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens
9 Billion * 1 / (2725.52 <achieved on a single node with a single core> * 8 * 3600) * 10
9000000000 * (1 / (2725.52 * 8 * 3600)) * 10 = $1146
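The per-model figures for the self-hosted runs can be reproduced in one pass. A minimal sketch: the throughput numbers are copied from the sections above, the node cost of 10 per hour is the value used in the calculations, and the labels, constant names, and helper function are illustrative only.

```python
# Measured tokens per second per unit, copied from the sections above.
THROUGHPUT = {
    "1.3B": 9135.57,
    "2.7B": 7538.69,
    "6.7B": 4285.98,
    "13B": 1564.10,
    "30B": 370.2,
    "LLaMA-2 7B": 2725.52,
}

NODE_COST_PER_HOUR = 10   # the node cost used in the post's calculations
UNITS_PER_NODE = 8        # the "* 8" factor in the post
TOTAL_TOKENS = 9_000_000_000


def cost_per_million(tokens_per_s):
    """Node cost of generating 1 million tokens at the given throughput."""
    tokens_per_hour = tokens_per_s * UNITS_PER_NODE * 3600
    return NODE_COST_PER_HOUR / tokens_per_hour * 1_000_000


per_million = {name: cost_per_million(tps) for name, tps in THROUGHPUT.items()}
totals = {name: c * TOTAL_TOKENS / 1_000_000 for name, c in per_million.items()}

for name in THROUGHPUT:
    print(f"{name:<11} {per_million[name]:.3f} per 1M tokens, total {totals[name]:,.0f}")
```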
💻 Self-hosted Backend
The In-house Code Repo
Continuous Batching: A Distributed Serving System for Transformer-Based Generative Models
vLLM - PagedAttention
🧐 Observation
The node (A100, p4d.24xlarge) is really hard to get. I spent 20+ hours with AWS to reserve the capacity for the load test. I would recommend going with a 3-year reservation because it is affordable. The cost has been calculated with the 3-year reserved price.
I have not tested the fine-tuned model in this experiment. I have found that a 7B model would cost around $350 for the same scenario. In production, we should go with the fine-tuned model.
I explicitly tested the facebook/opt model family because it comes in several sizes. The Falcon, LLaMA, and MosaicML MPT models are built on more modern, optimized architectures, so their inference time and cost should be lower than those of the OPT family.
I used 1000 tokens per input and 500 tokens per output.
The sample prompts follow the same philosophy: the prompt length varies from 900 to 1100 tokens and the response length varies from 450 to 550 tokens.
I used vLLM and continuous batching in the backend.
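The length ranges described above could be sampled along these lines. This is only a sketch of one way to build such a workload: the uniform distribution, the fixed seed, and the `sample_lengths` helper are assumptions, and only the 900 to 1100 and 450 to 550 token ranges come from the post.

```python
import random


def sample_lengths(n, seed=0):
    """Return n (prompt_tokens, response_tokens) pairs within the stated ranges."""
    rng = random.Random(seed)  # fixed seed so a load test can be repeated
    # randint is inclusive on both ends, matching "900 to 1100" and "450 to 550".
    return [(rng.randint(900, 1100), rng.randint(450, 550)) for _ in range(n)]


workload = sample_lengths(1000)
mean_prompt = sum(p for p, _ in workload) / len(workload)
mean_response = sum(r for _, r in workload) / len(workload)
print(f"mean prompt {mean_prompt:.0f} tokens, mean response {mean_response:.0f} tokens")
```

With symmetric ranges the averages stay close to the 1000 and 500 tokens used in the cost calculations.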
📜 The One Liner
I will publish the next Edition on Thursday.
This is the 5th Edition. If you have any feedback, please don’t hesitate to share it with me, and if you love my work, do share it with your colleagues.
Cheers!!
Raahul