Are you planning to buy a GPU to run LLMs at home? A cost/benefit analysis in 2024.

Summary

  • Cost Efficiency: For heavy users, pay-as-you-go APIs are generally more cost-effective than fixed monthly subscriptions, which are often throttled despite the flat fee.
  • Local Hardware Expenses: High-end GPUs like the Nvidia RTX 3090 or RTX 4090 incur significant upfront and ongoing electricity costs, making them less economical for continuous use.
  • Token Costs: The annual electricity bill for a local GPU would buy hundreds of millions more tokens through cloud APIs than that same GPU can generate in a year.
  • Practicality: Cloud APIs provide greater scalability and flexibility without the need for local hardware maintenance. For most users, they offer a more practical and economical solution for running large language models.
  • Conclusion: Cloud APIs are recommended over local GPUs for most users, providing a more cost-effective, scalable, and maintenance-free option for accessing large language models.

Intro

As an IT professional who leverages cloud and AI technologies, including large language models (LLMs), in both professional and personal capacities, I've often found myself wondering whether it is more economical and efficient to run LLMs locally on my own hardware, or to rely on cloud services and APIs. This article delves into that very question, exploring the pros and cons of running LLMs on-premises, at home, versus utilizing the vast resources of cloud-based APIs. We'll analyze the costs, performance, and practicality of each option to provide a comprehensive overview for anyone grappling with this decision. Let's first take a look at the data that I've put together.

Data

| Vendor | Model | Context Window | Model IQ | Price / 1M Tokens | Speed (tokens/s) | Latency (s) |
|---|---|---|---|---|---|---|
| OpenAI | GPT-4o | 128k | 100 | $7.50 | 80.6 | 0.52 |
| OpenAI | GPT-4 Turbo | 128k | 94 | $15.00 | 27.8 | 0.69 |
| Microsoft Azure | GPT-4 Turbo | 128k | 94 | $15.00 | 28.0 | 0.55 |
| OpenAI | GPT-4 | 8k | 93 | $37.50 | 22.6 | 0.86 |
| Microsoft Azure | GPT-4 | 8k | 93 | $37.50 | 20.2 | 0.55 |
| OpenAI | GPT-3.5 Turbo | 16k | 65 | $0.75 | 65.9 | 0.45 |
| Microsoft Azure | GPT-3.5 Turbo | 16k | 65 | $0.75 | 57.8 | 0.32 |
| OpenAI | GPT-3.5 Turbo Instruct | 4k | 60 | $1.63 | 73.1 | 0.36 |
| Microsoft Azure | GPT-3.5 Turbo Instruct | 4k | 60 | $1.63 | 137.4 | 0.60 |
| Google | Gemini 1.5 Pro | 1m | 93 | $5.25 | 63.8 | 1.33 |
| Google | Gemini 1.5 Flash | 1m | 83 | $0.53 | 142.7 | 1.32 |
| Google | Gemini 1.0 Pro | 33k | 62 | $0.75 | 86.6 | 2.30 |
| Fireworks | Gemma 7B | 8k | 57 | $0.20 | 234.8 | 0.26 |
| Deepinfra | Gemma 7B | 8k | 57 | $0.07 | 64.9 | 0.29 |
| Groq | Gemma 7B | 8k | 57 | $0.07 | 1029 | 0.88 |
| Together.ai | Gemma 7B | 8k | 57 | $0.20 | 140.5 | 0.40 |
| Replicate | Llama 3 (70B) | 8k | 88 | $1.18 | 48.2 | 1.85 |
| Amazon Bedrock | Llama 3 (70B) | 8k | 88 | $2.86 | 48.8 | 0.48 |
| OctoAI | Llama 3 (70B) | 8k | 88 | $0.90 | 61.2 | 0.29 |
| Microsoft Azure | Llama 3 (70B) | 8k | 88 | $5.67 | 18.3 | 2.68 |
| Fireworks | Llama 3 (70B) | 8k | 88 | $0.90 | 116.9 | 0.26 |
| Deepinfra | Llama 3 (70B) | 8k | 88 | $0.64 | 20.9 | 0.35 |
| Groq | Llama 3 (70B) | 8k | 88 | $0.64 | 357.3 | 0.40 |
| Perplexity | Llama 3 (70B) | 8k | 88 | $1.00 | 48.1 | 0.33 |
| Together.ai | Llama 3 (70B) | 8k | 88 | $0.90 | 126.1 | 0.63 |
| Replicate | Llama 3 (8B) | 8k | 65 | $0.10 | 81.3 | 1.71 |
| Amazon Bedrock | Llama 3 (8B) | 8k | 65 | $0.38 | 79.2 | 0.29 |
| OctoAI | Llama 3 (8B) | 8k | 65 | $0.15 | 132.1 | 0.21 |
| Microsoft Azure | Llama 3 (8B) | 8k | 65 | $0.55 | 76.5 | 0.89 |
| Fireworks | Llama 3 (8B) | 8k | 65 | $0.20 | 252.6 | 0.25 |
| Deepinfra | Llama 3 (8B) | 8k | 65 | $0.08 | 111.3 | 0.19 |
| Groq | Llama 3 (8B) | 8k | 65 | $0.06 | 357 | 0.35 |
| Perplexity | Llama 3 (8B) | 8k | 65 | $0.20 | 119.8 | 0.25 |
| Together.ai | Llama 3 (8B) | 8k | 65 | $0.20 | 265.4 | 0.40 |
| Fireworks | Code Llama (70B) | 4k | 58 | $0.90 | n/a | n/a |
| Deepinfra | Code Llama (70B) | 4k | 58 | $0.60 | 33.5 | 0.45 |
| Perplexity | Code Llama (70B) | 16k | 58 | $1.00 | n/a | n/a |
| Together.ai | Code Llama (70B) | 4k | 58 | $0.90 | 30.0 | 0.55 |
| Replicate | Llama 2 Chat (70B) | 4k | 50 | $1.18 | 52.7 | 1.92 |
| Amazon Bedrock | Llama 2 Chat (70B) | 4k | 50 | $2.10 | 43.5 | 0.49 |
| OctoAI | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 130.4 | 0.21 |
| Microsoft Azure | Llama 2 Chat (70B) | 4k | 50 | $1.60 | 17.4 | 3.25 |
| Fireworks | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 92.5 | 0.33 |
| Deepinfra | Llama 2 Chat (70B) | 4k | 50 | $0.76 | 111.9 | 0.20 |
| Perplexity | Llama 2 Chat (70B) | 4k | 50 | $1.00 | n/a | n/a |
| Together.ai | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 34.3 | 0.83 |
| Replicate | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 80.3 | 1.49 |
| Amazon Bedrock | Llama 2 Chat (13B) | 4k | 36 | $0.81 | 50.5 | 0.34 |
| OctoAI | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 130.8 | 0.21 |
| Microsoft Azure | Llama 2 Chat (13B) | 4k | 36 | $0.84 | 44.0 | 1.51 |
| Fireworks | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 106.8 | 0.31 |
| Deepinfra | Llama 2 Chat (13B) | 4k | 36 | $0.35 | 110.5 | 0.20 |
| Together.ai | Llama 2 Chat (13B) | 4k | 36 | $0.30 | 44.5 | 0.46 |
| Replicate | Llama 2 Chat (7B) | 4k | 27 | $0.10 | 149.4 | 1.29 |
| Microsoft Azure | Llama 2 Chat (7B) | 4k | 27 | $0.56 | 72.4 | 1.03 |
| Fireworks | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 165.8 | 0.27 |
| Deepinfra | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 21.5 | 0.41 |
| Together.ai | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 91.8 | 0.45 |
| Mistral | Mixtral 8x22B | 65k | 78 | $3.00 | 67.6 | 0.28 |
| OctoAI | Mixtral 8x22B | 65k | 78 | $1.20 | 89.7 | 0.27 |
| Fireworks | Mixtral 8x22B | 65k | 78 | $1.20 | 79.5 | 0.25 |
| Deepinfra | Mixtral 8x22B | 65k | 78 | $0.65 | 37.4 | 0.25 |
| Perplexity | Mixtral 8x22B | 16k | 78 | $1.00 | n/a | n/a |
| Together.ai | Mixtral 8x22B | 65k | 78 | $1.20 | 54.9 | 0.72 |
| Mistral | Mistral Large | 33k | 75 | $6.00 | 38.7 | 0.36 |
| Amazon Bedrock | Mistral Large | 33k | 75 | $6.00 | 34.5 | 0.43 |
| Microsoft Azure | Mistral Large | 33k | 75 | $6.00 | 25.0 | 2.07 |
| Mistral | Mistral Medium | 33k | 73 | $4.05 | 38.0 | 0.51 |
| Mistral | Mistral Small | 33k | 71 | $1.50 | 36.6 | 0.35 |
| Microsoft Azure | Mistral Small | 33k | 71 | $1.50 | 61.5 | 1.25 |
| Mistral | Mixtral 8x7B | 33k | 65 | $0.70 | 66.6 | 0.31 |
| Replicate | Mixtral 8x7B | 33k | 65 | $0.47 | 107.3 | 1.57 |
| Amazon Bedrock | Mixtral 8x7B | 33k | 65 | $0.51 | 63.9 | 0.36 |
| OctoAI | Mixtral 8x7B | 33k | 65 | $0.45 | 84.8 | 0.26 |
| Lepton AI | Mixtral 8x7B | 33k | 65 | $0.50 | 72.4 | 0.37 |
| Fireworks | Mixtral 8x7B | 33k | 65 | $0.50 | 254.0 | 0.25 |
| Deepinfra | Mixtral 8x7B | 33k | 65 | $0.24 | 62.1 | 0.20 |
| Groq | Mixtral 8x7B | 33k | 65 | $0.24 | 552.4 | 0.44 |
| Perplexity | Mixtral 8x7B | 16k | 65 | $0.60 | 110.7 | 0.26 |
| Together.ai | Mixtral 8x7B | 33k | 65 | $0.60 | 86.3 | 0.39 |
| Mistral | Mistral 7B | 33k | 39 | $0.25 | 79.9 | 0.30 |
| Replicate | Mistral 7B | 33k | 39 | $0.10 | 81.5 | 1.52 |
| Amazon Bedrock | Mistral 7B | 33k | 39 | $0.16 | 72.1 | 0.33 |
| OctoAI | Mistral 7B | 33k | 39 | $0.15 | 152.9 | 0.21 |
| Fireworks | Mistral 7B | 33k | 39 | $0.20 | 264.1 | 0.18 |
| Deepinfra | Mistral 7B | 33k | 39 | $0.07 | 71.8 | 0.30 |
| Perplexity | Mistral 7B | 16k | 39 | $0.20 | 124.8 | 0.27 |
| Together.ai | Mistral 7B | 8k | 39 | $0.20 | 63.6 | 0.33 |
| Baseten | Mistral 7B | 4k | 39 | $0.20 | 216.2 | 0.18 |
| Anthropic | Claude 3.5 Sonnet | 200k | 100 | $6.00 | 79.8 | 0.84 |
| Amazon Bedrock | Claude 3 Opus | 200k | 94 | $30.00 | 22.4 | 1.78 |
| Anthropic | Claude 3 Opus | 200k | 94 | $30.00 | 24.4 | 2.00 |
| Amazon Bedrock | Claude 3 Sonnet | 200k | 78 | $6.00 | 46.5 | 0.83 |
| Google | Claude 3 Sonnet | 200k | 78 | $6.00 | n/a | n/a |
| Anthropic | Claude 3 Sonnet | 200k | 78 | $6.00 | 60.6 | 0.97 |
| Amazon Bedrock | Claude 3 Haiku | 200k | 72 | $0.50 | 96.6 | 0.45 |
| Google | Claude 3 Haiku | 200k | 72 | $0.50 | n/a | n/a |
| Anthropic | Claude 3 Haiku | 200k | 72 | $0.50 | 148.1 | 0.55 |
| Anthropic | Claude 2.0 | 100k | 69 | $12.00 | 38.7 | 1.27 |
| Amazon Bedrock | Claude 2.1 | 200k | 63 | $12.00 | 35.8 | 1.67 |
| Anthropic | Claude 2.1 | 200k | 63 | $12.00 | 37.2 | 1.21 |
| Amazon Bedrock | Claude Instant | 100k | 63 | $1.20 | 80.0 | 0.54 |
| Anthropic | Claude Instant | 100k | 63 | $1.20 | 98.0 | 0.57 |
| Amazon Bedrock | Command Light | 4k | n/a | $0.38 | 33.9 | 0.56 |
| Cohere | Command Light | 4k | n/a | $0.38 | 67.4 | 0.26 |
| Amazon Bedrock | Command | 4k | n/a | $1.63 | 23.2 | 0.55 |
| Cohere | Command | 4k | n/a | $1.25 | 24.5 | 0.57 |
| Cohere | Command-R+ | 128k | 74 | $6.00 | 61.8 | 0.31 |
| Microsoft Azure | Command-R+ | 128k | 74 | $6.00 | 59.2 | 0.47 |
| Cohere | Command-R | 128k | 62 | $0.75 | 147.4 | 0.20 |
| Microsoft Azure | Command-R | 128k | 62 | $0.75 | 48.2 | 0.46 |
| Deepinfra | OpenChat 3.5 | 8k | 54 | $0.07 | 67.4 | 0.31 |
| Together.ai | OpenChat 3.5 | 8k | 54 | $0.20 | 73.2 | 0.35 |
| Lepton AI | DBRX | 33k | 74 | $0.90 | n/a | n/a |
| Fireworks | DBRX | 33k | 74 | $1.20 | 51.9 | 0.37 |
| Databricks | DBRX | 33k | 74 | $3.38 | 96.9 | 0.64 |
| Together.ai | DBRX | 33k | 74 | $1.20 | 72.5 | 0.49 |
| AI21 Labs | Jamba Instruct | 256k | 63 | $0.55 | 66.2 | 0.44 |
| DeepSeek | DeepSeek-V2 | 128k | 82 | $0.17 | 16.9 | 1.59 |
| Together.ai | Arctic | 4k | 63 | $2.40 | 72.7 | 0.50 |
| Together.ai | Qwen2 (72B) | 128k | n/a | $0.90 | 42.5 | 0.58 |

Monthly subs

First, let's discuss what the majority of people use. When it comes to AI services, many leading companies, such as OpenAI, Anthropic, and Google, offer monthly subscription plans. At an average cost of around £18 per month, these plans mainly sell convenience for non-tech folks: everyone knows how to use a chat app, and subscribers sometimes also get priority access to newer models and features. But if you are a really heavy user of GPT-4, you will have noticed that there is a limit to how much you can use it before it says something like, "You have exceeded your limit and will be able to resume at xx:yy. Would you like to switch to GPT-3.5 for now?" (their cheapest model, which is given away for free anyway). You would think a paid monthly subscription gives you unlimited access, and in a way that's true, but it is also somewhat throttled. So, how much of that AI could you get paying for the API alone?

Let's take OpenAI as an example. The monthly subscription costs $20. However, from the table above we can see that the API sells 1M tokens of GPT-4 Turbo for $15. So for $20 I could technically use about 1.33M tokens in total (in and out, meaning the words I submit to the chat plus the words I receive back from the LLM).

Let's break down what 1M tokens means, to give you a better sense of scale. On average, a fluent adult reader can read about 200-300 words per minute, and 1M tokens is roughly equivalent to 750,000 words. At 250 words per minute, it would take approximately 50 hours to read 1 million tokens.

For a book equivalent, an average novel contains about 80,000-100,000 words.
One million tokens (≈ 750,000 words) is equivalent to about 7-9 average novels.

And if you think of A4 pages, a typical A4 page with standard margins and 12-point font contains about 500 words. So 750,000 words would fill approximately 1,500 A4 pages.

In terms of articles, if we consider an average long-form article to be about 2,000 words, 1 million tokens would be equivalent to about 375 such articles.
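If you want to play with the scale arithmetic yourself, here is a minimal Python sketch of the numbers above; the 0.75 words-per-token ratio, the 250 wpm reading speed, and the page/novel sizes are the rough assumptions already stated, not measured values:

```python
# Back-of-the-envelope scale of 1M tokens, using the stated assumptions:
# ~0.75 words per token and a 250 words-per-minute reading speed.
TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75    # rough English average
READING_WPM = 250         # fluent adult reader
WORDS_PER_PAGE = 500      # A4 page, standard margins, 12-point font
WORDS_PER_NOVEL = 90_000  # mid-point of the 80k-100k range

words = TOKENS * WORDS_PER_TOKEN
print(f"Words:           {words:,.0f}")                      # 750,000
print(f"Reading time:    {words / READING_WPM / 60:.0f} h")  # ~50 hours
print(f"A4 pages:        {words / WORDS_PER_PAGE:,.0f}")     # ~1,500
print(f"Novels:          {words / WORDS_PER_NOVEL:.1f}")     # ~8.3
# The subscription comparison: at GPT-4 Turbo's $15/1M API price,
# a $20 monthly fee buys about 1.33M tokens.
print(f"Tokens for $20:  {20 / 15 * TOKENS:,.0f}")           # 1,333,333
```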

This gives you an idea of how substantial 1 million tokens is: a significant amount of text, equivalent to several books or many hours of reading. So, what's the conclusion so far? Even with my heavy use of AI, I don't think I am generating 50 hours' worth of reading, not even close, and yet I hit the GPT-4 throttle quite frequently. So, do I save money by just using APIs? Yes, absolutely. A subscription doesn't care if you are on holiday or just busy with other stuff; with APIs, you pay only for what you use. Recently, Anthropic released its mid-tier model, Claude 3.5 Sonnet, which matches GPT-4o's IQ score of 100 in the table above. The price for 1M tokens of Sonnet is just $6, well under half of the $15 OpenAI asks for GPT-4 Turbo. A subscription buys you the convenience of a chat window with a premium model, plus vendor lock-in; but you could host your own chat window instead, and solutions like https://openwebui.com/ let you use a variety of different models without committing to a fixed monthly bill.
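To illustrate how little lock-in there really is: most providers in the table expose OpenAI-compatible endpoints, so a single self-hosted chat front end or script can talk to any of them. A minimal sketch, assuming DeepInfra's published OpenAI-compatible endpoint (the API key and model ID are placeholders you would substitute for your own):

```python
# Minimal sketch: one OpenAI-style client pointed at a different provider.
# Assumes the provider exposes an OpenAI-compatible endpoint (DeepInfra,
# Groq, and Together.ai all do); key and model ID are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # swap for your provider
    api_key="YOUR_DEEPINFRA_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize this article in 3 bullets."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

Switching vendors then means changing two strings, not rebuilding your workflow.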

Private Hardware vs API

As we dive deeper into the debate of local GPUs versus cloud-based APIs for running large language models (LLMs), understanding each choice's broader implications is crucial. Opting for local GPUs involves investing in fairly powerful hardware like the Nvidia RTX 3090 or RTX 4090, which comes with substantial upfront costs (used prices on eBay are currently roughly £600 for a 3090 and around £1,500 for a 4090) but offers control and data privacy. The operational costs, mainly electricity, also play a significant role in long-term budgeting. Conversely, using APIs from leading providers such as OpenAI, Google, or Anthropic presents a different set of advantages and challenges: they offer scalability and ease of access with variable costs based on usage, which can balloon quite quickly into a large bill if not used carefully.

Now, it's no secret that we pay an ungodly price for electricity here in the UK to a local energy mafia fixated on ESG scores. As of April 2024, the average unit price per kilowatt-hour (kWh) for customers on the default tariff (Standard Variable Tariff, or SVT) is capped at 24.50p. The RTX 3090 draws about 350W under load and around 100W at idle; the RTX 4090 draws about 450W under load with a similar 100W idle. Let's calculate the cost for a scenario where we use the GPU actively for 12 hours and then let it idle for 12 hours each day, considering both active and idle power consumption.

Electricity price: £0.245 per kWh

Power consumption:

  • Idle: 100W (0.1 kW) for both cards

  • Active use:
    RTX 3090: 350W (0.35 kW)
    RTX 4090: 450W (0.45 kW)

  • Daily energy consumption:

    • RTX 3090:
      Active: 0.35 kW * 12 hours = 4.2 kWh
      Idle: 0.1 kW * 12 hours = 1.2 kWh
      Total: 4.2 kWh + 1.2 kWh = 5.4 kWh per day

    • RTX 4090:
      Active: 0.45 kW * 12 hours = 5.4 kWh
      Idle: 0.1 kW * 12 hours = 1.2 kWh
      Total: 5.4 kWh + 1.2 kWh = 6.6 kWh per day

Daily cost:

  • RTX 3090: 5.4 kWh * £0.245 = £1.323 per day
  • RTX 4090: 6.6 kWh * £0.245 = £1.617 per day

Monthly cost:

  • RTX 3090: £1.323 * 30 = £39.69 per month
  • RTX 4090: £1.617 * 30 = £48.51 per month

Yearly cost:

  • RTX 3090: £39.69 * 12 = £476.28 per year
  • RTX 4090: £48.51 * 12 = £582.12 per year

In summary, if you use a GPU for 12 hours actively and let it idle for 12 hours each day, it will set you back a bloody half a grand a year in the UK!

| | RTX 3090 | RTX 4090 |
|---|---|---|
| Per Day | ~£1.32 | ~£1.62 |
| Per Month | ~£40 | ~£50 |
| Per Year | ~£480 | ~£580 |
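If you want to rerun these numbers for your own tariff and usage pattern, here is a minimal Python sketch of the calculation above; the wattages, the ~100W idle draw, and the 12h/12h split are the stated assumptions:

```python
# Electricity cost of a home GPU under the stated assumptions:
# UK SVT cap of £0.245/kWh, 12h active + 12h idle per day, ~100W idle draw.
PRICE_PER_KWH = 0.245  # GBP

def gpu_cost(active_watts, idle_watts=100, active_hours=12):
    idle_hours = 24 - active_hours
    kwh_per_day = (active_watts * active_hours + idle_watts * idle_hours) / 1000
    daily = kwh_per_day * PRICE_PER_KWH
    return daily, daily * 30, daily * 30 * 12  # day, ~month, ~year (360 days)

for name, watts in [("RTX 3090", 350), ("RTX 4090", 450)]:
    day, month, year = gpu_cost(watts)
    print(f"{name}: £{day:.2f}/day  £{month:.2f}/month  £{year:.2f}/year")
# RTX 3090: £1.32/day  £39.69/month  £476.28/year
# RTX 4090: £1.62/day  £48.51/month  £582.12/year
```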

I was quite shocked to see the numbers. Bear in mind that this doesn't even take into account the power consumption of the rest of the system, especially the CPU if you run a dual-socket machine. So, next, I was curious to see "how much AI I can squeeze out" of my GPU versus the API. The numbers are actually quite crazy.

Let's first make an assumption: what are we running? Right now, one of the best open-source general-intelligence models is something like Llama3-70B or Mixtral 8x22B. The "IQ" of GPT-4 is 93, GPT-3.5's is 65, and Llama3-70B's is 88, so Llama sits right between GPT-4 and GPT-3.5 in terms of intelligence. Per the table, 1M tokens costs $15 on GPT-4 Turbo, $0.75 on GPT-3.5, and just $0.64 for Llama3-70B on Groq and DeepInfra. Let's keep these numbers in mind.

But can we run Llama3-70B on our 24GB-VRAM GPUs? The simple answer is no, not out of the box. We could run 7B or 13B-parameter models out of the box, but to run a 70B model on a GPU with 24GB of VRAM, we would need a quantized model and perhaps something like vLLM to optimize serving further. For context, quantization techniques like 8-bit or even 4-bit quantization can dramatically reduce a model's memory footprint, and libraries like GPTQ or bitsandbytes (LLM.int8()) can quantize models with minimal loss in quality. vLLM is an open-source library for LLM inference acceleration built around PagedAttention, an algorithm inspired by virtual-memory paging that manages attention keys and values in non-contiguous memory blocks; this cuts KV-cache waste and lets you serve larger batches and longer contexts on the same VRAM. Even at 4-bit, though, a 70B model weighs in around 35-40GB, so a single 24GB card still has to offload some layers to CPU RAM. But for the sake of argument, let's just say that we can run the 70B model on an RTX 3090/4090; several people have reported on Reddit that such setups generate roughly 20 tokens/s.
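For the curious, here is a minimal sketch of loading a 4-bit quantized model with Hugging Face transformers and bitsandbytes. The model ID and prompt are illustrative; as noted above, a 70B model does not fit in 24GB even at 4-bit, so the sketch uses an 8B model that does:

```python
# Minimal sketch: 4-bit quantized inference with transformers + bitsandbytes.
# Requires: pip install torch transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # fits in 24 GB at 4-bit

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spills layers to CPU RAM if VRAM runs out
)

inputs = tokenizer(
    "Explain PagedAttention in one sentence.", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```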

Let's compare the cost/benefit of running Llama3-70B at home on an RTX 3090 versus via API:

Home GPU RTX3090 costs:

  • Initial cost: £600
  • Annual electricity cost: ~£470 (the ~£476 calculated above, rounded down)
  • Performance: 20 tokens/s

DeepInfra/Groq API costs:

  • ~£0.50 per 1M tokens (the $0.64 price converted at roughly $1.27 to the pound)

So if we were to spend the electricity money on APIs instead, how many millions of tokens would the annual £470 buy?

Tokens on DeepInfra for £470:

£470 / (£0.50 / 1M) = 940M tokens

And how many tokens would we need to buy to match the £600 purchase price of the RTX 3090?

Tokens on DeepInfra for £600:

£600 / (£0.50 / 1M) = 1,200M tokens

Conversely, given that the performance of the RTX 3090 is not great, I was curious how many millions of tokens it can actually generate in a year.

Tokens per second: 20 T/s

Seconds in a year: 365 * 24 * 60 * 60 = 31,536,000 s

Tokens in a year (100% utilization): 20 T/s * 31,536,000 s = 630,720,000 T/y

However, we agreed that our usage pattern is 12h per day, so:

Tokens in a year (12h/day): 630,720,000 / 2 = 315,360,000 ≈ 315M tokens

So, let's recap:

  • You'd need to use 940 million tokens on DeepInfra to match annual electricity costs of running the RTX3090.
  • You'd need to use 1,200 million tokens on DeepInfra to match the purchase price of the RTX3090.
  • The RTX3090 can generate about 315 million tokens per year at 12h/day usage.
  • The total cost of the RTX 3090 for the first year is:

£600 + £470 = £1,070

  • The first-year TCO of the RTX 3090, expressed in tokens on DeepInfra:

£1,070 / (£0.50 / 1M) = 2,140M tokens

Essentially, this means you would need to push more than 2,140M tokens through the card in its first year before it breaks even and becomes more cost-effective than the API. However, the RTX 3090 can only generate about 315M tokens per year at 12h/day usage, and even running under load 24/7 it would manage only around 630M, so it never catches up, thanks to its slow performance and the electricity cost. Also, a vast majority of people won't be able to run a 70B model at all, even with the various technical optimization tricks, and will be limited to models like 8B, 13B, or 30B. And we have not even discussed API prices for smaller models: Llama3-8B costs just $0.06/1M tokens on Groq, and bear in mind that its IQ of 65 matches GPT-3.5.
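For completeness, here is the whole break-even argument as one minimal Python sketch, using the figures above (the 20 tokens/s local throughput and the ~£0.50/1M API price are the stated assumptions):

```python
# Break-even comparison: RTX 3090 at home vs a Llama3-70B API.
GPU_PRICE = 600.0           # GBP, used RTX 3090
ANNUAL_ELECTRICITY = 470.0  # GBP, ~12h active / 12h idle per day
API_PRICE_PER_M = 0.50      # GBP per 1M tokens (DeepInfra/Groq, converted)
TOKENS_PER_SEC = 20         # reported local Llama3-70B speed

first_year_tco = GPU_PRICE + ANNUAL_ELECTRICITY
breakeven_tokens_m = first_year_tco / API_PRICE_PER_M

seconds_per_year = 365 * 24 * 60 * 60
max_tokens_m = TOKENS_PER_SEC * seconds_per_year / 1e6  # 24/7 utilization
actual_tokens_m = max_tokens_m / 2                      # 12h/day pattern

print(f"First-year TCO:        £{first_year_tco:,.0f}")        # £1,070
print(f"Break-even vs API:     {breakeven_tokens_m:,.0f}M")    # 2,140M
print(f"GPU output (24/7):     {max_tokens_m:,.0f}M/year")     # ~631M
print(f"GPU output (12h/day):  {actual_tokens_m:,.0f}M/year")  # ~315M
# Even at 100% utilization the card produces less than a third of the
# tokens needed to break even -- the API wins at any realistic volume.
```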

Conclusion

Unless you have other uses for the GPU, like gaming or video editing, or specific reasons to run locally, the API solutions are far more cost-effective and flexible for most use cases, especially if you're not pushing extremely high volumes of tokens. I argued with myself that I need a local GPU because I do some fine-tuning and also host diffusion models, so surely I have to own one. But guess what? Thanks to services like https://www.runpod.io/, you can rent a GPU at a very competitive price, especially if you choose Spot GPUs or build serverless microservices. For example, I use a serverless Faster-Whisper service to transcribe YouTube videos and then summarize the transcriptions with cheap models on DeepInfra, and that beats my local infrastructure on cost of operation and ownership every time.
