Are you planning to buy a GPU to run LLMs at home? A cost/benefit analysis in 2024.
Summary
- Cost Efficiency: APIs are generally more cost-effective for heavy users because you pay only for actual usage, unlike fixed monthly subscriptions, which can still impose usage limits.
- Local Hardware Expenses: High-end GPUs like the Nvidia RTX 3090 or RTX 4090 carry significant upfront costs plus ongoing electricity costs, making them less economical for continuous use.
- Token Costs: The annual electricity bill for a GPU could buy hundreds of millions more tokens via cloud APIs than the same GPU can generate locally in a year.
- Practicality: Cloud APIs provide greater scalability and flexibility without the need for local hardware maintenance. For most users, they offer a more practical and economical solution for running large language models.
- Conclusion: Cloud APIs are recommended over local GPUs for most users, providing a more cost-effective, scalable, and maintenance-free option for accessing large language models.
Intro
As an IT professional who leverages cloud and AI technologies, including large language models (LLMs), in both professional and personal capacities, I've often wondered whether it is more economical and efficient to run LLMs locally on my own hardware or to rely on cloud services and APIs. This article delves into exactly that question, exploring the pros and cons of running LLMs on-premises (at home) versus utilizing the vast resources of cloud-based APIs. We'll analyze the costs, performance, and practicality of each option to provide a comprehensive overview for anyone grappling with this decision. Let's first take a look at the data I've put together.
Data
Vendor | Model | Context Window | Model IQ | Price/1M Tokens (USD) | Speed (Tokens/s) | Latency (s) |
---|---|---|---|---|---|---|
OpenAI | GPT-4o | 128k | 100 | $7.50 | 80.6 | 0.52 |
OpenAI | GPT-4 Turbo | 128k | 94 | $15.00 | 27.8 | 0.69 |
Microsoft Azure | GPT-4 Turbo | 128k | 94 | $15.00 | 28.0 | 0.55 |
OpenAI | GPT-4 | 8k | 93 | $37.50 | 22.6 | 0.86 |
Microsoft Azure | GPT-4 | 8k | 93 | $37.50 | 20.2 | 0.55 |
OpenAI | GPT-3.5 Turbo | 16k | 65 | $0.75 | 65.9 | 0.45 |
Microsoft Azure | GPT-3.5 Turbo | 16k | 65 | $0.75 | 57.8 | 0.32 |
OpenAI | GPT-3.5 Turbo Instruct | 4k | 60 | $1.63 | 73.1 | 0.36 |
Microsoft Azure | GPT-3.5 Turbo Instruct | 4k | 60 | $1.63 | 137.4 | 0.60 |
Google | Gemini 1.5 Pro | 1m | 93 | $5.25 | 63.8 | 1.33 |
Google | Gemini 1.5 Flash | 1m | 83 | $0.53 | 142.7 | 1.32 |
Google | Gemini 1.0 Pro | 33k | 62 | $0.75 | 86.6 | 2.30 |
Fireworks | Gemma 7B | 8k | 57 | $0.20 | 234.8 | 0.26 |
Deepinfra | Gemma 7B | 8k | 57 | $0.07 | 64.9 | 0.29 |
Groq | Gemma 7B | 8k | 57 | $0.07 | 1029 | 0.88 |
Together.ai | Gemma 7B | 8k | 57 | $0.20 | 140.5 | 0.40 |
Replicate | Llama 3 (70B) | 8k | 88 | $1.18 | 48.2 | 1.85 |
Amazon Bedrock | Llama 3 (70B) | 8k | 88 | $2.86 | 48.8 | 0.48 |
OctoAI | Llama 3 (70B) | 8k | 88 | $0.90 | 61.2 | 0.29 |
Microsoft Azure | Llama 3 (70B) | 8k | 88 | $5.67 | 18.3 | 2.68 |
Fireworks | Llama 3 (70B) | 8k | 88 | $0.90 | 116.9 | 0.26 |
Deepinfra | Llama 3 (70B) | 8k | 88 | $0.64 | 20.9 | 0.35 |
Groq | Llama 3 (70B) | 8k | 88 | $0.64 | 357.3 | 0.40 |
Perplexity | Llama 3 (70B) | 8k | 88 | $1.00 | 48.1 | 0.33 |
Together.ai | Llama 3 (70B) | 8k | 88 | $0.90 | 126.1 | 0.63 |
Replicate | Llama 3 (8B) | 8k | 65 | $0.10 | 81.3 | 1.71 |
Amazon Bedrock | Llama 3 (8B) | 8k | 65 | $0.38 | 79.2 | 0.29 |
OctoAI | Llama 3 (8B) | 8k | 65 | $0.15 | 132.1 | 0.21 |
Microsoft Azure | Llama 3 (8B) | 8k | 65 | $0.55 | 76.5 | 0.89 |
Fireworks | Llama 3 (8B) | 8k | 65 | $0.20 | 252.6 | 0.25 |
Deepinfra | Llama 3 (8B) | 8k | 65 | $0.08 | 111.3 | 0.19 |
Groq | Llama 3 (8B) | 8k | 65 | $0.06 | 357 | 0.35 |
Perplexity | Llama 3 (8B) | 8k | 65 | $0.20 | 119.8 | 0.25 |
Together.ai | Llama 3 (8B) | 8k | 65 | $0.20 | 265.4 | 0.40 |
Fireworks | Code Llama (70B) | 4k | 58 | $0.90 | nodata | nodata |
Deepinfra | Code Llama (70B) | 4k | 58 | $0.60 | 33.5 | 0.45 |
Perplexity | Code Llama (70B) | 16k | 58 | $1.00 | nodata | nodata |
Together.ai | Code Llama (70B) | 4k | 58 | $0.90 | 30.0 | 0.55 |
Replicate | Llama 2 Chat (70B) | 4k | 50 | $1.18 | 52.7 | 1.92 |
Amazon Bedrock | Llama 2 Chat (70B) | 4k | 50 | $2.10 | 43.5 | 0.49 |
OctoAI | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 130.4 | 0.21 |
Microsoft Azure | Llama 2 Chat (70B) | 4k | 50 | $1.60 | 17.4 | 3.25 |
Fireworks | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 92.5 | 0.33 |
Deepinfra | Llama 2 Chat (70B) | 4k | 50 | $0.76 | 111.9 | 0.20 |
Perplexity | Llama 2 Chat (70B) | 4k | 50 | $1.00 | nodata | nodata |
Together.ai | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 34.3 | 0.83 |
Replicate | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 80.3 | 1.49 |
Amazon Bedrock | Llama 2 Chat (13B) | 4k | 36 | $0.81 | 50.5 | 0.34 |
OctoAI | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 130.8 | 0.21 |
Microsoft Azure | Llama 2 Chat (13B) | 4k | 36 | $0.84 | 44.0 | 1.51 |
Fireworks | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 106.8 | 0.31 |
Deepinfra | Llama 2 Chat (13B) | 4k | 36 | $0.35 | 110.5 | 0.20 |
Together.ai | Llama 2 Chat (13B) | 4k | 36 | $0.30 | 44.5 | 0.46 |
Replicate | Llama 2 Chat (7B) | 4k | 27 | $0.10 | 149.4 | 1.29 |
Microsoft Azure | Llama 2 Chat (7B) | 4k | 27 | $0.56 | 72.4 | 1.03 |
Fireworks | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 165.8 | 0.27 |
Deepinfra | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 21.5 | 0.41 |
Together.ai | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 91.8 | 0.45 |
Mistral | Mixtral 8x22B | 65k | 78 | $3.00 | 67.6 | 0.28 |
OctoAI | Mixtral 8x22B | 65k | 78 | $1.20 | 89.7 | 0.27 |
Fireworks | Mixtral 8x22B | 65k | 78 | $1.20 | 79.5 | 0.25 |
Deepinfra | Mixtral 8x22B | 65k | 78 | $0.65 | 37.4 | 0.25 |
Perplexity | Mixtral 8x22B | 16k | 78 | $1.00 | nodata | nodata |
Together.ai | Mixtral 8x22B | 65k | 78 | $1.20 | 54.9 | 0.72 |
Mistral | Mistral Large | 33k | 75 | $6.00 | 38.7 | 0.36 |
Amazon Bedrock | Mistral Large | 33k | 75 | $6.00 | 34.5 | 0.43 |
Microsoft Azure | Mistral Large | 33k | 75 | $6.00 | 25.0 | 2.07 |
Mistral | Mistral Medium | 33k | 73 | $4.05 | 38.0 | 0.51 |
Mistral | Mistral Small | 33k | 71 | $1.50 | 36.6 | 0.35 |
Microsoft Azure | Mistral Small | 33k | 71 | $1.50 | 61.5 | 1.25 |
Mistral | Mixtral 8x7B | 33k | 65 | $0.70 | 66.6 | 0.31 |
Replicate | Mixtral 8x7B | 33k | 65 | $0.47 | 107.3 | 1.57 |
Amazon Bedrock | Mixtral 8x7B | 33k | 65 | $0.51 | 63.9 | 0.36 |
OctoAI | Mixtral 8x7B | 33k | 65 | $0.45 | 84.8 | 0.26 |
Lepton AI | Mixtral 8x7B | 33k | 65 | $0.50 | 72.4 | 0.37 |
Fireworks | Mixtral 8x7B | 33k | 65 | $0.50 | 254.0 | 0.25 |
Deepinfra | Mixtral 8x7B | 33k | 65 | $0.24 | 62.1 | 0.20 |
Groq | Mixtral 8x7B | 33k | 65 | $0.24 | 552.4 | 0.44 |
Perplexity | Mixtral 8x7B | 16k | 65 | $0.60 | 110.7 | 0.26 |
Together.ai | Mixtral 8x7B | 33k | 65 | $0.60 | 86.3 | 0.39 |
Mistral | Mistral 7B | 33k | 39 | $0.25 | 79.9 | 0.30 |
Replicate | Mistral 7B | 33k | 39 | $0.10 | 81.5 | 1.52 |
Amazon Bedrock | Mistral 7B | 33k | 39 | $0.16 | 72.1 | 0.33 |
OctoAI | Mistral 7B | 33k | 39 | $0.15 | 152.9 | 0.21 |
Fireworks | Mistral 7B | 33k | 39 | $0.20 | 264.1 | 0.18 |
Deepinfra | Mistral 7B | 33k | 39 | $0.07 | 71.8 | 0.30 |
Perplexity | Mistral 7B | 16k | 39 | $0.20 | 124.8 | 0.27 |
Together.ai | Mistral 7B | 8k | 39 | $0.20 | 63.6 | 0.33 |
Baseten | Mistral 7B | 4k | 39 | $0.20 | 216.2 | 0.18 |
Anthropic | Claude 3.5 Sonnet | 200k | 100 | $6.00 | 79.8 | 0.84 |
Amazon Bedrock | Claude 3 Opus | 200k | 94 | $30.00 | 22.4 | 1.78 |
Anthropic | Claude 3 Opus | 200k | 94 | $30.00 | 24.4 | 2.00 |
Amazon Bedrock | Claude 3 Sonnet | 200k | 78 | $6.00 | 46.5 | 0.83 |
nodata | Claude 3 Sonnet | 200k | 78 | $6.00 | nodata | nodata |
Anthropic | Claude 3 Sonnet | 200k | 78 | $6.00 | 60.6 | 0.97 |
Amazon Bedrock | Claude 3 Haiku | 200k | 72 | $0.50 | 96.6 | 0.45 |
nodata | Claude 3 Haiku | 200k | 72 | $0.50 | nodata | nodata |
Anthropic | Claude 3 Haiku | 200k | 72 | $0.50 | 148.1 | 0.55 |
Anthropic | Claude 2.0 | 100k | 69 | $12.00 | 38.7 | 1.27 |
Amazon Bedrock | Claude 2.1 | 200k | 63 | $12.00 | 35.8 | 1.67 |
Anthropic | Claude 2.1 | 200k | 63 | $12.00 | 37.2 | 1.21 |
Amazon Bedrock | Claude Instant | 100k | 63 | $1.20 | 80.0 | 0.54 |
Anthropic | Claude Instant | 100k | 63 | $1.20 | 98.0 | 0.57 |
Amazon Bedrock | Command Light | 4k | nodata | $0.38 | 33.9 | 0.56 |
Cohere | Command Light | 4k | nodata | $0.38 | 67.4 | 0.26 |
Amazon Bedrock | Command | 4k | nodata | $1.63 | 23.2 | 0.55 |
Cohere | Command | 4k | nodata | $1.25 | 24.5 | 0.57 |
Cohere | Command-R+ | 128k | 74 | $6.00 | 61.8 | 0.31 |
Microsoft Azure | Command-R+ | 128k | 74 | $6.00 | 59.2 | 0.47 |
Cohere | Command-R | 128k | 62 | $0.75 | 147.4 | 0.20 |
Microsoft Azure | Command-R | 128k | 62 | $0.75 | 48.2 | 0.46 |
Deepinfra | OpenChat 3.5 | 8k | 54 | $0.07 | 67.4 | 0.31 |
Together.ai | OpenChat 3.5 | 8k | 54 | $0.20 | 73.2 | 0.35 |
Lepton AI | DBRX | 33k | 74 | $0.90 | nodata | nodata |
Fireworks | DBRX | 33k | 74 | $1.20 | 51.9 | 0.37 |
Databricks | DBRX | 33k | 74 | $3.38 | 96.9 | 0.64 |
Together.ai | DBRX | 33k | 74 | $1.20 | 72.5 | 0.49 |
AI21 Labs | Jamba Instruct | 256k | 63 | $0.55 | 66.2 | 0.44 |
DeepSeek | DeepSeek-V2 | 128k | 82 | $0.17 | 16.9 | 1.59 |
Together.ai | Arctic | 4k | 63 | $2.40 | 72.7 | 0.50 |
Together.ai | Qwen2 (72B) | 128k | nodata | $0.90 | 42.5 | 0.58 |
Monthly subs
First, let's discuss what the majority of people use. When it comes to AI services, many leading companies, such as OpenAI, Anthropic, and Google, offer monthly subscription plans. At an average cost of around £18 per month, these plans mainly sell convenience for non-technical folks: everyone knows how to use a chat app. In some cases, subscribers also get priority access to newer models and features. But if you are a really heavy user of GPT-4, you will have noticed that there is a limit to how much you can use it before it says, "You have exceeded your limit and will be able to resume at xx:yy. Would you like to switch to GPT-3.5 for now (our cheapest model, which we give away for free anyway)?" You would think a paid monthly subscription gives you unlimited access, and in a way that's true, but it is also somewhat throttled. So, how much of that AI could you get by paying for the API alone?
Let's take OpenAI as an example. The monthly subscription costs $20, but the table shows that 1M tokens of GPT-4 Turbo sell for $15 via the API. So for $20 I could technically use about 1.33M tokens in total (in/out, meaning the words I submit to the chat plus the words I receive from the LLM).
Let's break down what 1M tokens is, to give you a better sense of scale. On average, a fluent adult reads about 200-300 words per minute.
One million tokens is roughly 750,000 words; at 250 words per minute, it would take approximately 50 hours to read.
For a book equivalent, an average novel contains about 80,000-100,000 words.
One million tokens (≈ 750,000 words) is equivalent to about 7-9 average novels.
And if you think of A4 pages, a typical A4 page with standard margins and 12-point font contains about 500 words. So 750,000 words would fill approximately 1,500 A4 pages.
In terms of articles, if we consider an average long-form article to be about 2,000 words, 1 million tokens would be equivalent to about 375 such articles.
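If you want to sanity-check these conversions (and the $20-of-API-tokens claim above), here is a tiny back-of-the-envelope script. The 0.75 words-per-token ratio and 250 wpm reading speed are the rough approximations used above, not exact figures.

```python
# Back-of-the-envelope scale of 1M tokens, using the article's rough
# conversion factors: ~0.75 words per token, 250 words per minute.
SUB_PRICE_USD = 20.0        # ChatGPT Plus monthly subscription
API_PRICE_PER_M_USD = 15.0  # GPT-4 Turbo, per 1M tokens (from the table)

tokens_for_sub_m = SUB_PRICE_USD / API_PRICE_PER_M_USD  # ~1.33M tokens

words = 1_000_000 * 0.75    # ~750,000 words in 1M tokens
hours = words / 250 / 60    # ~50 hours of reading
novels = words / 90_000     # ~8 average novels
a4_pages = words / 500      # ~1,500 A4 pages
articles = words / 2_000    # ~375 long-form articles

print(f"$20 buys ~{tokens_for_sub_m:.2f}M tokens at $15/1M")
print(f"1M tokens = {words:,.0f} words = {hours:.0f}h reading = "
      f"~{novels:.1f} novels = ~{a4_pages:,.0f} A4 pages = ~{articles:.0f} articles")
```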
This gives you an idea of how substantial 1 million tokens is. It's a significant amount of text, equivalent to several books or many hours of reading. So, what's the conclusion so far? Even with my heavy use of AI, I am not generating 50 hours' worth of reading, not even close, and yet I hit the GPT-4 throttle quite frequently. So, would I save money by just using APIs? Yes, absolutely. A subscription doesn't care if you are on holiday or simply busy with other stuff; with APIs, you pay only for what you use. Recently, Anthropic released its mid-sized model, Claude 3.5 Sonnet, which matches the IQ score of GPT-4o in the table above. At $6 per 1M tokens, Sonnet costs less than half of GPT-4 Turbo's $15 and a fraction of the original GPT-4's $37.50. A subscription gives you the convenience of a chat window with a premium model, plus vendor lock-in; but you could host your own chat window, and solutions like https://openwebui.com/ will let you use a variety of different models without committing to a fixed monthly bill.
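For completeness, here is what the pay-as-you-go route looks like in practice. Most providers in the table expose OpenAI-compatible endpoints, so one client library covers them; the base URL and model ID below are illustrative examples, so check your provider's docs for the exact values.

```python
# Minimal pay-as-you-go chat call against an OpenAI-compatible endpoint.
# Assumes `pip install openai` and an API key in the environment; the
# base_url and model ID are examples, not the only options.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # provider-specific
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize this article in 3 bullets."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
print("tokens billed:", resp.usage.total_tokens)  # you pay only for these
```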
Private Hardware vs API
As we dive deeper into the debate of local GPUs versus cloud-based APIs for running large language models (LLMs), it's worth understanding each choice's broader implications. Opting for local GPUs means investing in fairly powerful hardware like the Nvidia RTX 3090 or RTX 4090, which comes with substantial upfront costs (used prices on eBay are currently roughly £600 for a 3090 and around £1,500 for a 4090) but offers control and data privacy. The operational costs, mainly electricity, also play a significant role in long-term budgeting. Conversely, APIs from leading providers such as OpenAI, Google, or Anthropic present a different set of advantages and challenges: they offer scalability and ease of access, with variable costs based on usage that can balloon into a large bill if not watched carefully.
Now, it's not a secret that we pay an ungodly price for electricity here in the UK to a local energy mafia fixated on ESG scores. As of April 2024, the average unit price per kilowatt-hour (kWh) for customers on the default tariff (Standard Variable Tariff, or SVT) is capped at 24.50p. The RTX 3090 draws 350W under load and around 100W at idle; the RTX 4090 draws 450W under load with a similar idle draw. Let's calculate the cost for a scenario where we use the GPU actively for 12 hours and let it idle for the other 12 hours each day, considering both active and idle power consumption.
Electricity price: £0.245 per kWh
Power consumption:
- Idle: 100W (0.1 kW) for both cards
- Active use:
  - RTX 3090: 350W (0.35 kW)
  - RTX 4090: 450W (0.45 kW)
Daily energy consumption:
- RTX 3090:
  - Active: 0.35 kW * 12 hours = 4.2 kWh
  - Idle: 0.1 kW * 12 hours = 1.2 kWh
  - Total: 4.2 kWh + 1.2 kWh = 5.4 kWh per day
- RTX 4090:
  - Active: 0.45 kW * 12 hours = 5.4 kWh
  - Idle: 0.1 kW * 12 hours = 1.2 kWh
  - Total: 5.4 kWh + 1.2 kWh = 6.6 kWh per day
Daily cost:
- RTX 3090: 5.4 kWh * £0.245 = £1.323 per day
- RTX 4090: 6.6 kWh * £0.245 = £1.617 per day
Monthly cost:
- RTX 3090: £1.323 * 30 = £39.69 per month
- RTX 4090: £1.617 * 30 = £48.51 per month
Yearly cost:
- RTX 3090: £39.69 * 12 = £476.28 per year
- RTX 4090: £48.51 * 12 = £582.12 per year
In summary, if you use a GPU for 12 hours actively and let it idle for 12 hours each day, it will set you back a bloody half a grand a year in the UK!
Cost | RTX 3090 | RTX 4090 |
---|---|---|
Per Day | ~£1.32 | ~£1.62 |
Per Month | ~£40 | ~£50 |
Per Year | ~£480 | ~£580 |
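If you want to re-run this for your own tariff or duty cycle, here is a minimal sketch of the same arithmetic; the wattages are the assumptions stated above, and months are treated as 30 days to match the calculation.

```python
# Electricity cost for a GPU on a 12h-active / 12h-idle daily duty cycle.
# £0.245/kWh is the April 2024 UK SVT cap used above; months are treated
# as 30 days to match the article's arithmetic (so a "year" is 360 days).
PRICE_PER_KWH_GBP = 0.245

def yearly_cost(active_w: float, idle_w: float = 100.0,
                active_h: float = 12.0, idle_h: float = 12.0) -> float:
    """Annual electricity cost in GBP for the given daily duty cycle."""
    daily_kwh = (active_w * active_h + idle_w * idle_h) / 1000.0
    return daily_kwh * PRICE_PER_KWH_GBP * 30 * 12

for name, watts in [("RTX 3090", 350.0), ("RTX 4090", 450.0)]:
    print(f"{name}: £{yearly_cost(watts):.2f} per year")
# RTX 3090: £476.28 per year
# RTX 4090: £582.12 per year
```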
I was quite shocked to see the numbers. Bear in mind that this doesn't even account for the power consumption of the rest of the system, especially the CPU if you run a dual-socket setup. So, next, I was curious to see how much AI I could "squeeze out" of my GPU versus the API. The numbers are actually quite crazy.
Let's first make an assumption: what are we running? Right now, some of the best open-source general-intelligence models are Llama 3 70B and Mixtral 8x22B. The "IQ" of GPT-4 is 93, GPT-3.5 is 65, and Llama 3 70B is 88, so Llama sits right between GPT-4 and GPT-3.5 in terms of intelligence. The cost of 1M tokens is $15 for GPT-4 Turbo, $0.75 for GPT-3.5 Turbo, and just $0.64 for Llama 3 70B on Groq and DeepInfra. Let's keep these numbers in mind.
But can we run Llama 3 70B on our 24GB-VRAM GPUs? The simple answer is no, not out of the box. We can run 7B or 13B models out of the box, but a 70B model needs a quantized build and an optimized inference stack. For context, quantization techniques like 8-bit or even 4-bit quantization dramatically reduce a model's memory footprint; methods such as GPTQ or bitsandbytes' LLM.int8() quantize weights with minimal loss in quality. vLLM is an open-source LLM inference engine built around PagedAttention, an algorithm that manages attention keys and values in fixed-size blocks, much like virtual memory pages, cutting KV-cache waste and boosting throughput. Note that even then the weights themselves have to fit: a 70B model at 4-bit still needs roughly 35-40GB, so a single 24GB card requires even more aggressive quantization or partial CPU offloading. But for the sake of argument, let's just say that we can run the 70B model on an RTX 3090/4090. Several people on Reddit have reported it generating roughly 20 tokens/s.
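To make that concrete, here is a minimal vLLM sketch for serving a quantized checkpoint. The model ID is just an example of an AWQ-quantized Llama 3 build; as noted above, a 4-bit 70B will not actually fit on a single 24GB card, so substitute a smaller model or a more aggressive quant for real use.

```python
# Minimal sketch: serving a quantized model locally with vLLM.
# Assumes `pip install vllm`; the checkpoint below is illustrative --
# a 4-bit 70B needs ~40GB, so pick something that fits your VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",  # example AWQ checkpoint
    quantization="awq",            # 4-bit AWQ weights
    gpu_memory_utilization=0.95,   # leave a little headroom
    max_model_len=4096,            # smaller context -> smaller KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```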
Let's compare the cost/benefit of running Llama 3 70B at home on an RTX 3090 vs. the API:
Home GPU (RTX 3090) costs:
- Initial cost: £600
- Annual electricity cost: ~£470
- Performance: 20 tokens/s
DeepInfra/Groq API costs:
- £0.50 per 1M tokens (the table's $0.64, converted to GBP)
So if we spent the electricity money on the API instead, how many millions of tokens would an annual electricity bill of £470 buy?
Tokens on DeepInfra for £470 GBP:
£470 / (£0.50 / 1M) = 940M tokens
And how many millions of tokens would match the £600 purchase price of the RTX 3090?
Tokens on DeepInfra for £600:
£600 / (£0.50 / 1M) = 1,200M tokens
Conversely, given that the RTX 3090's throughput is modest, how many millions of tokens can it actually generate in a year?
Tokens per second: 20T/s
Seconds in a year: 365 * 24 * 60 * 60 = 31,536,000sec
Tokens in a year (100% utilization): 20T/s * 31,536,000sec = 630,720,000 T/y
However, we agreed that our usage pattern is 12h per day, so:
Tokens in a year (12h/day): 630,720,000 / 2 = 315,360,000 ≈ 315M tokens
So, let's recap:
- You'd need to use 940 million tokens on DeepInfra to match annual electricity costs of running the RTX3090.
- You'd need to use 1,200 million tokens on DeepInfra to match the purchase price of the RTX3090.
- The RTX3090 can generate about 315 million tokens per year at 12h/day usage.
- The total cost of the RTX 3090 for the first year is:
£600 + £470 = £1,070
- TCO of RTX3090 equivalent in tokens on DeepInfra:
£1,070 / (£0.50 / 1M) = 2,140M tokens
Essentially, if you use more than 2,140M tokens in the first year, you break even, and from then on the RTX 3090 becomes more cost-effective. However, keep in mind that the RTX 3090 can only generate about 315M tokens per year at 12h/day usage. Even running it under load 24/7 (roughly 630M tokens) would not match the cost benefit of the API, due to slow throughput and electricity costs. Also, the vast majority of people won't be able to run a 70B model even with the technical optimization tricks above and will be limited to models like 8B, 13B, or 30B. And we have not even discussed the API costs of smaller models: Llama 3 8B costs just $0.06/1M tokens on Groq, and bear in mind that its IQ of 65 matches GPT-3.5.
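All of the break-even arithmetic above fits in a few lines, so here is a sketch you can replay with your own prices; every constant mirrors the article's assumptions.

```python
# Break-even sketch: local RTX 3090 vs. API tokens, per the article's numbers.
GPU_PRICE_GBP = 600.0           # used RTX 3090
ANNUAL_ELECTRICITY_GBP = 470.0  # 12h load / 12h idle at £0.245/kWh
API_PRICE_GBP_PER_M = 0.50      # Llama 3 70B on DeepInfra/Groq, approx.
LOCAL_TOKENS_PER_S = 20.0       # reported single-card 70B throughput

seconds_per_year = 365 * 24 * 60 * 60
local_tokens_m = LOCAL_TOKENS_PER_S * seconds_per_year / 2 / 1e6  # 12h/day

electricity_equiv_m = ANNUAL_ELECTRICITY_GBP / API_PRICE_GBP_PER_M
gpu_price_equiv_m = GPU_PRICE_GBP / API_PRICE_GBP_PER_M
tco_equiv_m = (GPU_PRICE_GBP + ANNUAL_ELECTRICITY_GBP) / API_PRICE_GBP_PER_M

print(f"API tokens for electricity money: {electricity_equiv_m:,.0f}M")
print(f"API tokens for GPU price:         {gpu_price_equiv_m:,.0f}M")
print(f"First-year TCO in API tokens:     {tco_equiv_m:,.0f}M")
print(f"Local output per year (12h/day):  {local_tokens_m:,.0f}M")
```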
Conclusion
Unless you have other uses for the GPU, like gaming or video editing, or specific reasons to run locally, API solutions are far more cost-effective and flexible for most use cases, especially if you're not pushing extremely high token volumes. I argued with myself that I need a local GPU because I do some fine-tuning and also host diffusion models, so surely I have to own one. But guess what? Thanks to services like https://www.runpod.io/, you can rent a GPU at a very competitive price, especially if you choose Spot GPUs or build serverless microservices. For example, I use a serverless Faster-Whisper service to transcribe YouTube videos and then summarize the transcripts using cheap models on DeepInfra, and that beats my local infrastructure on cost of operation and ownership every time.