🦙 The 4th Edition: Llama 2.0

What I have learned in the last 46 hours.

Source: MidJourney

Yesterday I was writing a blog post on Polars when suddenly my LinkedIn feed was flooded with news of Llama 2. It is a great day for the open-source ecosystem. Congratulations to M&M (Microsoft and Meta).

โ„น๏ธ Details

📈 Business

Satya Nadella is probably the best dealmaker of our generation. Little surprise; under Satya's leadership, Microsoft's share price has grown 940% in 9 years. Can you name any other current CEO who can move a $2.6 trillion behemoth at such a pace?

🔹 Early investment in OpenAI.

🔹 Now an investment in open source.

🕵️‍♀️ Observations

🔹 A threat to open-source LLM startups. Mosaic (already sold) and RedPajama, among others, are in significant trouble.

🔹 This pumps Meta onto the AI scene. With this announcement, Zuck is signaling how strong Meta's AI position is. They will now own one of the most widely adopted LLMs and have one of the best training datasets in the world.

🔹 This further strengthens Microsoft's dominant position in the AI space. With this partnership, they now have exclusive partnerships with the top LLMs (OpenAI, Meta), priority access to Nvidia GPUs, and strategic assets like GitHub and Azure. It will certainly push Azure.

🔹 The collaboration between Microsoft and AMD will grow stronger. Like Nvidia's TensorRT or AWS's Neuron SDK, AMD will likely publish its own quantization SDK so that Llama models can run on AMD chips.

🔹 Qualcomm chips will run Meta's AI on mobile devices by 2024. Data from WhatsApp and Threads (Instagram) will be processed on the edge device, which is a threat to privacy.

๐Ÿ‹๏ธ Training Cost

There is an important plot in the Llama 2 paper: it directly reports the pretraining GPU-hours for each model. Costs below, assuming $1.50 per A100-hour from LambdaAPI; a quick script to reproduce the math follows the list.

🔹 The 7B model: 184,320 A100-hours ≈ $276,480

🔹 The 13B model: 368,640 A100-hours ≈ $552,960

🔹 The 34B model: 1,038,336 A100-hours ≈ $1.56M!

🔹 The 70B model: 1,720,320 A100-hours ≈ $2.58M!
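As a sanity check, here is the same arithmetic as a few lines of Python. The GPU-hour figures are the ones reported in the Llama 2 paper; the $1.50/hour rate is the assumption from above.

```python
# Back-of-the-envelope pretraining cost. GPU-hours are the A100-hour
# figures reported in the Llama 2 paper; $1.50/hour is the assumed
# LambdaAPI rate from above.
A100_USD_PER_HOUR = 1.50

gpu_hours = {
    "7B": 184_320,
    "13B": 368_640,
    "34B": 1_038_336,
    "70B": 1_720_320,
}

for size, hours in gpu_hours.items():
    print(f"{size}: {hours:,} A100-hours -> ${hours * A100_USD_PER_HOUR:,.0f}")
```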

🌻 Model Download

🔹 Please request access here

🔹 I received the email from Meta with the token within an hour.

🔹 `chmod 755 download.sh`

🔹 Execute the file with the token. When passing the URL to the download script, make sure you paste a URL that begins with https://download.llamameta.net and not with https://l.facebook.com. If you copy the link address from the email, you will get an https://l.facebook.com address, which is incorrect. Copy-paste it as plain text through a text editor first, before passing it to the download script.

🧑‍🏫 Evaluation

🔹 The paper evaluates their model in many ways.

From the paper: "While reviewing these results, it is important to note that human evaluations can be noisy due to limitations of the prompt set, subjectivity of the review guidelines, subjectivity of individual raters, and the inherent difficulty of comparing generations."

I evaluated LLaMA-2 Chat! It seems to be of similar quality to the latest Vicuna. I am excited to see how much the community will be able to improve it using the LLaMA-2 base and their fine-tuning pipelines!

🔹 Leaderboard

🔹 There are some claims that it is better than ChatGPT or StarCoder, but no solid proof.


🪪 License

 Source 

🔹 Watched the episode `Mark Zuckerberg: Future of AI at Meta, Facebook, Instagram, and WhatsApp | Lex Fridman Podcast #383`. Lex asked many times about the license issue of the LLaMA-1 models because the weights were leaked: Facebook's LLaMA is being openly distributed via torrents, and it was really hard to detect whether any organization was using the models in production.

🔹 So now businesses can use the models, with one catch: Llama 2 is available for commercial use only if you have <700M MAU. Products that cross that threshold:

  • YouTube (~2.5B)

  • WeChat (1.3B)

  • TikTok (1B)

  • LinkedIn (900M)

  • Snap (750M). Snapchat: "We have 750M MAUs". Hmm, so that's why the cutoff is 700M 🤔

💻 Model Execution on Mac

🔹 Llama 2 weights have already been quantized and are available in llama.cpp for local inference (see the Python sketch after this list)! Details

🔹 Run Llama 2 on your MacBook with GPU support!

🔹 Both Llama 2 7B and 13B are now available on MLC LLM through the CLI. The 7B model generates ~46 tok/s on an Apple M2 Max and ~156 tok/s on an RTX 4090. Stay tuned for the web version, as well as 13B and 70B model optimizations landing soon! Details
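For illustration, here is one way to run the quantized weights from Python via the llama-cpp-python bindings. The model path and layer count below are hypothetical placeholders, not values from the linked guides; point them at whatever quantized file you produced with llama.cpp.

```python
# Minimal local inference via the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.q4_0.bin",  # hypothetical path
    n_ctx=2048,        # context window
    n_gpu_layers=32,   # offload layers to the GPU (Metal on Apple Silicon)
)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(out["choices"][0]["text"])
```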

🌸 Fine-Tune Llama 2 with a Few Lines of Code!

🔹 With Llama 2 landing in the Hugging Face ecosystem, it is very easy to fine-tune this architecture using various tools from the HF ecosystem (TRL, PEFT, ...): 4-bit quantization and PEFT let you fine-tune llama2-7b on a single Google Colab instance! A sketch follows the links below.

🔹 Source

🔹 Fine-tuning script
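Below is a minimal sketch of that 4-bit + PEFT (QLoRA-style) setup using the HF stack. The dataset and hyperparameters are illustrative assumptions, not the exact recipe from the linked script, and the Llama 2 repo on the Hub is gated, so accept the license there first.

```python
# Sketch: 4-bit quantization + LoRA fine-tuning of llama2-7b with
# transformers, peft, trl, and bitsandbytes.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # gated model: request access on the Hub

# Load the base model in 4-bit (NF4) to fit on a single Colab GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Train only small LoRA adapters instead of the full 7B weights
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")  # example dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="llama2-7b-sft",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=4,
                           learning_rate=2e-4, max_steps=500),
)
trainer.train()
```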

๐Ÿ Deployment

🔹 The intuition: the 70B model needs 70B parameters × 4 bytes (fp32) = 280 GB, i.e. roughly 340 to 360 GB of GPU memory once you budget up to ~40% runtime overhead (see the sketch after this list).

🔹 I have found that this script was tested on a Ray cluster of 4 × g5.24xlarge (4 A10Gs each, 16 GPUs ≈ 384 GB in total) and works for all of the model sizes (7B, 13B, and 70B).

🔹 The PR
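A minimal sketch of that memory intuition follows; the 30% overhead factor for activations and KV cache is my rough assumption, not a measured number.

```python
# Rough serving-memory estimate: weights = params x bytes-per-param,
# plus an overhead factor for activations/KV cache (assumed, not measured).
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def serving_memory_gb(params_billion: float, dtype: str,
                      overhead: float = 0.3) -> float:
    weights_gb = params_billion * BYTES_PER_PARAM[dtype]  # 1B params * N bytes ~ N GB
    return weights_gb * (1 + overhead)

for dtype in BYTES_PER_PARAM:
    print(f"70B @ {dtype}: ~{serving_memory_gb(70, dtype):.0f} GB")
# fp32 comes out at ~364 GB, in line with the 340-360 GB intuition above
```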

🤯 Load Test

💫 Cost of Summarization: Self-Hosted 7B Model

💰 Pricing of the Model

Instance cost: p4d.24xlarge ≈ €10/hour (with AWS discount)

🤖 Model Details: Llama 2 7B, served with vLLM

๐ŸŽ๏ธ Cost of Formula

Total Tokens (Input + Output) × (1 / Tokens per Hour) × Node Cost per Hour

💰 Derived Details of the Model

  • Cost per million tokens: (10 <node cost per hour> / (2725.52 <tokens per second> × 8 <GPUs per node> × 3600 <seconds per hour>)) × 1,000,000 ≈ $0.13

  • Average latency: ~120 ms

💵 Total Cost

  • Total tokens (input + output): 9 billion

    • Input tokens, 6 billion: 6 million articles with 1000 tokens each

    • Output tokens, 3 billion: 6 million articles with 500 tokens each

  • 9 billion × 1 / (2725.52 <throughput I achieved with a single node and a single GPU> × 8 × 3600) × 10

  • 9,000,000,000 × (1 / (2725.52 × 8 × 3600)) × 10 ≈ $1,146 (see the sketch after the raw numbers below)

  • backend: vLLM

  • dur_s: 65.96

  • tokens_per_s: 2725.52

  • qps: 1.52

  • successful_responses: 100

  • prompt_token_count: 74,999

  • response_token_count: 104,779

  • median_token_latency: 0.1209 s

  • median_e2e_latency: 45.41 s
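A small script that reproduces the cost arithmetic above. Like the formula, it assumes the single-GPU throughput scales linearly across the node's 8 GPUs.

```python
# Reproduces the summarization cost arithmetic above.
TOKENS_PER_S_PER_GPU = 2725.52   # measured with vLLM on a single GPU
GPUS_PER_NODE = 8                # p4d.24xlarge carries 8 x A100
NODE_COST_PER_HOUR = 10.0        # discounted p4d.24xlarge price from above

node_tokens_per_hour = TOKENS_PER_S_PER_GPU * GPUS_PER_NODE * 3600

cost_per_million = NODE_COST_PER_HOUR / node_tokens_per_hour * 1_000_000
print(f"Cost per million tokens: ~{cost_per_million:.2f}")  # ~0.13

total_tokens = 6_000_000 * (1000 + 500)  # 6M articles: 1000 in + 500 out
total_cost = total_tokens / node_tokens_per_hour * NODE_COST_PER_HOUR
print(f"Total summarization cost: ~{int(total_cost)}")      # ~1146
```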

⚙️ Configuration

Let's talk about a scenario: we need to summarize 6 million Wikipedia articles.

  • So I used 1000 input tokens and 500 output tokens per article.

  • To keep the sample prompts realistic, the prompt token length varies from 900 to 1100 tokens and the response token length varies from 450 to 550 tokens.

  • Used vLLM with continuous batching in the backend (see the sketch after this list).

  • I have not tested a fine-tuned model in this experiment. If you fine-tune with a lower-precision data type, the cost will be lower.

  • For comparison, the cost will be higher if you use OpenAI or another hosted service. I will publish the detailed load-test analysis in the next edition.
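For completeness, here is a minimal sketch of the kind of vLLM client used in such a load test. vLLM's engine performs the continuous batching internally, so the client simply submits all prompts at once; the article texts below are placeholders.

```python
# Minimal vLLM summarization client; the engine batches requests
# continuously under the hood.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # one engine per GPU in the test above

articles = ["<article text>"] * 8  # placeholder articles
prompts = [f"Summarize the following article:\n{a}" for a in articles]
params = SamplingParams(temperature=0.7, max_tokens=500)  # ~500 output tokens each

for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:200])
```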

**

I will publish the next Edition on Sunday.

This is the 4th Edition. If you have any feedback, please don't hesitate to share it with me. And if you love my work, do share it with your colleagues.

Cheers!!

Raahul

**
