Are you planning to buy a GPU to run LLMs at home? A cost/benefit analysis in 2024.

Summary

  • Cost Efficiency: For heavy users, pay-as-you-go APIs are generally more cost-effective than fixed monthly subscriptions, which are often throttled despite the flat fee.
  • Local Hardware Expenses: High-end GPUs like the Nvidia RTX 3090 or RTX 4090 incur significant upfront and ongoing electricity costs, making them less economical for continuous use.
  • Token Costs: The annual electricity bill for a local GPU would buy hundreds of millions more tokens through cloud APIs than that same GPU can generate in a year.
  • Practicality: Cloud APIs provide greater scalability and flexibility without the need for local hardware maintenance. For most users, they offer a more practical and economical solution for running large language models.
  • Conclusion: Cloud APIs are recommended over local GPUs for most users, providing a more cost-effective, scalable, and maintenance-free option for accessing large language models.

Intro

As an IT professional who leverages cloud and AI technologies, including large language models (LLMs), in both professional and personal capacities, I've often found myself wondering whether it is more economical and efficient to run LLMs locally on my own hardware, or to rely on cloud services and APIs. This article delves into that very question, exploring the pros and cons of running LLMs on-premises, at home, versus utilizing the vast resources of cloud-based APIs. We'll analyze the costs, performance, and practicality of each option to provide a comprehensive overview for anyone grappling with this decision. Let's first take a look at the data that I've put together.

Data

| Vendor | Model | Context Window | Model IQ | Price / 1M Tokens | Speed (tokens/s) | Latency (s) |
|---|---|---|---|---|---|---|
| OpenAI | GPT-4o | 128k | 100 | $7.50 | 80.6 | 0.52 |
| OpenAI | GPT-4 Turbo | 128k | 94 | $15.00 | 27.8 | 0.69 |
| Microsoft Azure | GPT-4 Turbo | 128k | 94 | $15.00 | 28.0 | 0.55 |
| OpenAI | GPT-4 | 8k | 93 | $37.50 | 22.6 | 0.86 |
| Microsoft Azure | GPT-4 | 8k | 93 | $37.50 | 20.2 | 0.55 |
| OpenAI | GPT-3.5 Turbo | 16k | 65 | $0.75 | 65.9 | 0.45 |
| Microsoft Azure | GPT-3.5 Turbo | 16k | 65 | $0.75 | 57.8 | 0.32 |
| OpenAI | GPT-3.5 Turbo Instruct | 4k | 60 | $1.63 | 73.1 | 0.36 |
| Microsoft Azure | GPT-3.5 Turbo Instruct | 4k | 60 | $1.63 | 137.4 | 0.60 |
| Google | Gemini 1.5 Pro | 1m | 93 | $5.25 | 63.8 | 1.33 |
| Google | Gemini 1.5 Flash | 1m | 83 | $0.53 | 142.7 | 1.32 |
| Google | Gemini 1.0 Pro | 33k | 62 | $0.75 | 86.6 | 2.30 |
| Fireworks | Gemma 7B | 8k | 57 | $0.20 | 234.8 | 0.26 |
| Deepinfra | Gemma 7B | 8k | 57 | $0.07 | 64.9 | 0.29 |
| Groq | Gemma 7B | 8k | 57 | $0.07 | 1029 | 0.88 |
| Together.ai | Gemma 7B | 8k | 57 | $0.20 | 140.5 | 0.40 |
| Replicate | Llama 3 (70B) | 8k | 88 | $1.18 | 48.2 | 1.85 |
| Amazon Bedrock | Llama 3 (70B) | 8k | 88 | $2.86 | 48.8 | 0.48 |
| OctoAI | Llama 3 (70B) | 8k | 88 | $0.90 | 61.2 | 0.29 |
| Microsoft Azure | Llama 3 (70B) | 8k | 88 | $5.67 | 18.3 | 2.68 |
| Fireworks | Llama 3 (70B) | 8k | 88 | $0.90 | 116.9 | 0.26 |
| Deepinfra | Llama 3 (70B) | 8k | 88 | $0.64 | 20.9 | 0.35 |
| Groq | Llama 3 (70B) | 8k | 88 | $0.64 | 357.3 | 0.40 |
| Perplexity | Llama 3 (70B) | 8k | 88 | $1.00 | 48.1 | 0.33 |
| Together.ai | Llama 3 (70B) | 8k | 88 | $0.90 | 126.1 | 0.63 |
| Replicate | Llama 3 (8B) | 8k | 65 | $0.10 | 81.3 | 1.71 |
| Amazon Bedrock | Llama 3 (8B) | 8k | 65 | $0.38 | 79.2 | 0.29 |
| OctoAI | Llama 3 (8B) | 8k | 65 | $0.15 | 132.1 | 0.21 |
| Microsoft Azure | Llama 3 (8B) | 8k | 65 | $0.55 | 76.5 | 0.89 |
| Fireworks | Llama 3 (8B) | 8k | 65 | $0.20 | 252.6 | 0.25 |
| Deepinfra | Llama 3 (8B) | 8k | 65 | $0.08 | 111.3 | 0.19 |
| Groq | Llama 3 (8B) | 8k | 65 | $0.06 | 357 | 0.35 |
| Perplexity | Llama 3 (8B) | 8k | 65 | $0.20 | 119.8 | 0.25 |
| Together.ai | Llama 3 (8B) | 8k | 65 | $0.20 | 265.4 | 0.40 |
| Fireworks | Code Llama (70B) | 4k | 58 | $0.90 | n/a | n/a |
| Deepinfra | Code Llama (70B) | 4k | 58 | $0.60 | 33.5 | 0.45 |
| Perplexity | Code Llama (70B) | 16k | 58 | $1.00 | n/a | n/a |
| Together.ai | Code Llama (70B) | 4k | 58 | $0.90 | 30.0 | 0.55 |
| Replicate | Llama 2 Chat (70B) | 4k | 50 | $1.18 | 52.7 | 1.92 |
| Amazon Bedrock | Llama 2 Chat (70B) | 4k | 50 | $2.10 | 43.5 | 0.49 |
| OctoAI | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 130.4 | 0.21 |
| Microsoft Azure | Llama 2 Chat (70B) | 4k | 50 | $1.60 | 17.4 | 3.25 |
| Fireworks | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 92.5 | 0.33 |
| Deepinfra | Llama 2 Chat (70B) | 4k | 50 | $0.76 | 111.9 | 0.20 |
| Perplexity | Llama 2 Chat (70B) | 4k | 50 | $1.00 | n/a | n/a |
| Together.ai | Llama 2 Chat (70B) | 4k | 50 | $0.90 | 34.3 | 0.83 |
| Replicate | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 80.3 | 1.49 |
| Amazon Bedrock | Llama 2 Chat (13B) | 4k | 36 | $0.81 | 50.5 | 0.34 |
| OctoAI | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 130.8 | 0.21 |
| Microsoft Azure | Llama 2 Chat (13B) | 4k | 36 | $0.84 | 44.0 | 1.51 |
| Fireworks | Llama 2 Chat (13B) | 4k | 36 | $0.20 | 106.8 | 0.31 |
| Deepinfra | Llama 2 Chat (13B) | 4k | 36 | $0.35 | 110.5 | 0.20 |
| Together.ai | Llama 2 Chat (13B) | 4k | 36 | $0.30 | 44.5 | 0.46 |
| Replicate | Llama 2 Chat (7B) | 4k | 27 | $0.10 | 149.4 | 1.29 |
| Microsoft Azure | Llama 2 Chat (7B) | 4k | 27 | $0.56 | 72.4 | 1.03 |
| Fireworks | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 165.8 | 0.27 |
| Deepinfra | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 21.5 | 0.41 |
| Together.ai | Llama 2 Chat (7B) | 4k | 27 | $0.20 | 91.8 | 0.45 |
| Mistral | Mixtral 8x22B | 65k | 78 | $3.00 | 67.6 | 0.28 |
| OctoAI | Mixtral 8x22B | 65k | 78 | $1.20 | 89.7 | 0.27 |
| Fireworks | Mixtral 8x22B | 65k | 78 | $1.20 | 79.5 | 0.25 |
| Deepinfra | Mixtral 8x22B | 65k | 78 | $0.65 | 37.4 | 0.25 |
| Perplexity | Mixtral 8x22B | 16k | 78 | $1.00 | n/a | n/a |
| Together.ai | Mixtral 8x22B | 65k | 78 | $1.20 | 54.9 | 0.72 |
| Mistral | Mistral Large | 33k | 75 | $6.00 | 38.7 | 0.36 |
| Amazon Bedrock | Mistral Large | 33k | 75 | $6.00 | 34.5 | 0.43 |
| Microsoft Azure | Mistral Large | 33k | 75 | $6.00 | 25.0 | 2.07 |
| Mistral | Mistral Medium | 33k | 73 | $4.05 | 38.0 | 0.51 |
| Mistral | Mistral Small | 33k | 71 | $1.50 | 36.6 | 0.35 |
| Microsoft Azure | Mistral Small | 33k | 71 | $1.50 | 61.5 | 1.25 |
| Mistral | Mixtral 8x7B | 33k | 65 | $0.70 | 66.6 | 0.31 |
| Replicate | Mixtral 8x7B | 33k | 65 | $0.47 | 107.3 | 1.57 |
| Amazon Bedrock | Mixtral 8x7B | 33k | 65 | $0.51 | 63.9 | 0.36 |
| OctoAI | Mixtral 8x7B | 33k | 65 | $0.45 | 84.8 | 0.26 |
| Lepton AI | Mixtral 8x7B | 33k | 65 | $0.50 | 72.4 | 0.37 |
| Fireworks | Mixtral 8x7B | 33k | 65 | $0.50 | 254.0 | 0.25 |
| Deepinfra | Mixtral 8x7B | 33k | 65 | $0.24 | 62.1 | 0.20 |
| Groq | Mixtral 8x7B | 33k | 65 | $0.24 | 552.4 | 0.44 |
| Perplexity | Mixtral 8x7B | 16k | 65 | $0.60 | 110.7 | 0.26 |
| Together.ai | Mixtral 8x7B | 33k | 65 | $0.60 | 86.3 | 0.39 |
| Mistral | Mistral 7B | 33k | 39 | $0.25 | 79.9 | 0.30 |
| Replicate | Mistral 7B | 33k | 39 | $0.10 | 81.5 | 1.52 |
| Amazon Bedrock | Mistral 7B | 33k | 39 | $0.16 | 72.1 | 0.33 |
| OctoAI | Mistral 7B | 33k | 39 | $0.15 | 152.9 | 0.21 |
| Fireworks | Mistral 7B | 33k | 39 | $0.20 | 264.1 | 0.18 |
| Deepinfra | Mistral 7B | 33k | 39 | $0.07 | 71.8 | 0.30 |
| Perplexity | Mistral 7B | 16k | 39 | $0.20 | 124.8 | 0.27 |
| Together.ai | Mistral 7B | 8k | 39 | $0.20 | 63.6 | 0.33 |
| Baseten | Mistral 7B | 4k | 39 | $0.20 | 216.2 | 0.18 |
| Anthropic | Claude 3.5 Sonnet | 200k | 100 | $6.00 | 79.8 | 0.84 |
| Amazon Bedrock | Claude 3 Opus | 200k | 94 | $30.00 | 22.4 | 1.78 |
| Anthropic | Claude 3 Opus | 200k | 94 | $30.00 | 24.4 | 2.00 |
| Amazon Bedrock | Claude 3 Sonnet | 200k | 78 | $6.00 | 46.5 | 0.83 |
| Google | Claude 3 Sonnet | 200k | 78 | $6.00 | n/a | n/a |
| Anthropic | Claude 3 Sonnet | 200k | 78 | $6.00 | 60.6 | 0.97 |
| Amazon Bedrock | Claude 3 Haiku | 200k | 72 | $0.50 | 96.6 | 0.45 |
| Google | Claude 3 Haiku | 200k | 72 | $0.50 | n/a | n/a |
| Anthropic | Claude 3 Haiku | 200k | 72 | $0.50 | 148.1 | 0.55 |
| Anthropic | Claude 2.0 | 100k | 69 | $12.00 | 38.7 | 1.27 |
| Amazon Bedrock | Claude 2.1 | 200k | 63 | $12.00 | 35.8 | 1.67 |
| Anthropic | Claude 2.1 | 200k | 63 | $12.00 | 37.2 | 1.21 |
| Amazon Bedrock | Claude Instant | 100k | 63 | $1.20 | 80.0 | 0.54 |
| Anthropic | Claude Instant | 100k | 63 | $1.20 | 98.0 | 0.57 |
| Amazon Bedrock | Command Light | 4k | n/a | $0.38 | 33.9 | 0.56 |
| Cohere | Command Light | 4k | n/a | $0.38 | 67.4 | 0.26 |
| Amazon Bedrock | Command | 4k | n/a | $1.63 | 23.2 | 0.55 |
| Cohere | Command | 4k | n/a | $1.25 | 24.5 | 0.57 |
| Cohere | Command-R+ | 128k | 74 | $6.00 | 61.8 | 0.31 |
| Microsoft Azure | Command-R+ | 128k | 74 | $6.00 | 59.2 | 0.47 |
| Cohere | Command-R | 128k | 62 | $0.75 | 147.4 | 0.20 |
| Microsoft Azure | Command-R | 128k | 62 | $0.75 | 48.2 | 0.46 |
| Deepinfra | OpenChat 3.5 | 8k | 54 | $0.07 | 67.4 | 0.31 |
| Together.ai | OpenChat 3.5 | 8k | 54 | $0.20 | 73.2 | 0.35 |
| Lepton AI | DBRX | 33k | 74 | $0.90 | n/a | n/a |
| Fireworks | DBRX | 33k | 74 | $1.20 | 51.9 | 0.37 |
| Databricks | DBRX | 33k | 74 | $3.38 | 96.9 | 0.64 |
| Together.ai | DBRX | 33k | 74 | $1.20 | 72.5 | 0.49 |
| AI21 Labs | Jamba Instruct | 256k | 63 | $0.55 | 66.2 | 0.44 |
| DeepSeek | DeepSeek-V2 | 128k | 82 | $0.17 | 16.9 | 1.59 |
| Together.ai | Arctic | 4k | 63 | $2.40 | 72.7 | 0.50 |
| Together.ai | Qwen2 (72B) | 128k | n/a | $0.90 | 42.5 | 0.58 |

Monthly subs

First, let's discuss what the majority of people use. When it comes to AI services, many leading companies, such as OpenAI, Anthropic, and Google, offer monthly subscription plans. At an average cost of around £18 per month, these plans mainly sell convenience for non-tech folks: everyone knows how to use a chat app, and subscribers sometimes also get priority access to newer models and features. But if you are a really heavy user of GPT-4, you will have noticed that there is a limit to how much you can use it before it says something like, "You have exceeded your limit and will be able to resume at xx:yy. Would you like to switch to GPT-3.5 for now?" (their cheapest model, which is given away for free anyway). You would think a paid monthly subscription gives you unlimited access, and in a way that's true, but it is also somewhat throttled. So, how much of that AI could you get paying for the API alone?

Let's take OpenAI as an example. The monthly subscription costs $20. However, from the table above we can see that the API sells 1M tokens of GPT-4 Turbo for $15. So for $20 I could technically use about 1.33M tokens in total (in and out, meaning the words I submit to the chat plus the words I receive back from the LLM).

Let's break down what 1M tokens means, to give you a better sense of scale. On average, a fluent adult reader can read about 200-300 words per minute, and 1M tokens is roughly equivalent to 750,000 words. At 250 words per minute, it would take approximately 50 hours to read 1 million tokens.

For a book equivalent, an average novel contains about 80,000-100,000 words.
One million tokens (≈ 750,000 words) is equivalent to about 7-9 average novels.

And if you think of A4 pages, a typical A4 page with standard margins and 12-point font contains about 500 words. So 750,000 words would fill approximately 1,500 A4 pages.

In terms of articles, if we consider an average long-form article to be about 2,000 words, 1 million tokens would be equivalent to about 375 such articles.
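If you want to play with the scale arithmetic yourself, here is a minimal Python sketch of the numbers above; the 0.75 words-per-token ratio, the 250 wpm reading speed, and the page/novel sizes are the rough assumptions already stated, not measured values:

```python
# Back-of-the-envelope scale of 1M tokens, using the stated assumptions:
# ~0.75 words per token and a 250 words-per-minute reading speed.
TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75    # rough English average
READING_WPM = 250         # fluent adult reader
WORDS_PER_PAGE = 500      # A4 page, standard margins, 12-point font
WORDS_PER_NOVEL = 90_000  # mid-point of the 80k-100k range

words = TOKENS * WORDS_PER_TOKEN
print(f"Words:           {words:,.0f}")                      # 750,000
print(f"Reading time:    {words / READING_WPM / 60:.0f} h")  # ~50 hours
print(f"A4 pages:        {words / WORDS_PER_PAGE:,.0f}")     # ~1,500
print(f"Novels:          {words / WORDS_PER_NOVEL:.1f}")     # ~8.3
# The subscription comparison: at GPT-4 Turbo's $15/1M API price,
# a $20 monthly fee buys about 1.33M tokens.
print(f"Tokens for $20:  {20 / 15 * TOKENS:,.0f}")           # 1,333,333
```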

This gives you an idea of how substantial 1 million tokens is: a significant amount of text, equivalent to several books or many hours of reading. So, what's the conclusion so far? Even with my heavy use of AI, I don't think I am generating 50 hours' worth of reading, not even close, and yet I hit the GPT-4 throttle quite frequently. So, do I save money by just using APIs? Yes, absolutely. A subscription doesn't care if you are on holiday or just busy with other stuff; with APIs, you pay only for what you use. Recently, Anthropic released its mid-tier model, Claude 3.5 Sonnet, which matches GPT-4o's IQ score of 100 in the table above. The price for 1M tokens of Sonnet is just $6, well under half of the $15 OpenAI asks for GPT-4 Turbo. A subscription buys you the convenience of a chat window with a premium model, plus vendor lock-in; but you could host your own chat window instead, and solutions like https://openwebui.com/ let you use a variety of different models without committing to a fixed monthly bill.
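To illustrate how little lock-in there really is: most providers in the table expose OpenAI-compatible endpoints, so a single self-hosted chat front end or script can talk to any of them. A minimal sketch, assuming DeepInfra's published OpenAI-compatible endpoint (the API key and model ID are placeholders you would substitute for your own):

```python
# Minimal sketch: one OpenAI-style client pointed at a different provider.
# Assumes the provider exposes an OpenAI-compatible endpoint (DeepInfra,
# Groq, and Together.ai all do); key and model ID are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # swap for your provider
    api_key="YOUR_DEEPINFRA_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize this article in 3 bullets."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

Switching vendors then means changing two strings, not rebuilding your workflow.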

Private Hardware vs API

As we dive deeper into the debate of local GPUs versus cloud-based APIs for running large language models (LLMs), understanding each choice's broader implications is crucial. Opting for local GPUs involves investing in fairly powerful hardware like the Nvidia RTX 3090 or RTX 4090, which comes with substantial upfront costs (used prices on eBay are currently roughly £600 for a 3090 and around £1,500 for a 4090) but offers control and data privacy. The operational costs, mainly electricity, also play a significant role in long-term budgeting. Conversely, using APIs from leading providers such as OpenAI, Google, or Anthropic presents a different set of advantages and challenges: they offer scalability and ease of access with variable costs based on usage, which can balloon quite quickly into a large bill if not used carefully.

Now, it's no secret that we pay an ungodly price for electricity here in the UK to a local energy mafia fixated on ESG scores. As of April 2024, the average unit price per kilowatt-hour (kWh) for customers on the default tariff (Standard Variable Tariff, or SVT) is capped at 24.50p. The RTX 3090 draws about 350W under load and around 100W at idle; the RTX 4090 draws about 450W under load with a similar 100W idle. Let's calculate the cost for a scenario where we use the GPU actively for 12 hours and then let it idle for 12 hours each day, considering both active and idle power consumption.

Electricity price: £0.245 per kWh

Power consumption:

  • Idle: 100W (0.1 kW) for both cards

  • Active use:
    RTX 3090: 350W (0.35 kW)
    RTX 4090: 450W (0.45 kW)

  • Daily energy consumption:

    • RTX 3090:
      Active: 0.35 kW * 12 hours = 4.2 kWh
      Idle: 0.1 kW * 12 hours = 1.2 kWh
      Total: 4.2 kWh + 1.2 kWh = 5.4 kWh per day

    • RTX 4090:
      Active: 0.45 kW * 12 hours = 5.4 kWh
      Idle: 0.1 kW * 12 hours = 1.2 kWh
      Total: 5.4 kWh + 1.2 kWh = 6.6 kWh per day

Daily cost:

  • RTX 3090: 5.4 kWh * £0.245 = £1.323 per day
  • RTX 4090: 6.6 kWh * £0.245 = £1.617 per day

Monthly cost:

  • RTX 3090: £1.323 * 30 = £39.69 per month
  • RTX 4090: £1.617 * 30 = £48.51 per month

Yearly cost:

  • RTX 3090: £39.69 * 12 = £476.28 per year
  • RTX 4090: £48.51 * 12 = £582.12 per year

In summary, if you use a GPU for 12 hours actively and let it idle for 12 hours each day, it will set you back a bloody half a grand a year in the UK!

| | RTX 3090 | RTX 4090 |
|---|---|---|
| Per Day | ~£1.32 | ~£1.62 |
| Per Month | ~£40 | ~£50 |
| Per Year | ~£480 | ~£580 |
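If you want to rerun these numbers for your own tariff and usage pattern, here is a minimal Python sketch of the calculation above; the wattages, the ~100W idle draw, and the 12h/12h split are the stated assumptions:

```python
# Electricity cost of a home GPU under the stated assumptions:
# UK SVT cap of £0.245/kWh, 12h active + 12h idle per day, ~100W idle draw.
PRICE_PER_KWH = 0.245  # GBP

def gpu_cost(active_watts, idle_watts=100, active_hours=12):
    idle_hours = 24 - active_hours
    kwh_per_day = (active_watts * active_hours + idle_watts * idle_hours) / 1000
    daily = kwh_per_day * PRICE_PER_KWH
    return daily, daily * 30, daily * 30 * 12  # day, ~month, ~year (360 days)

for name, watts in [("RTX 3090", 350), ("RTX 4090", 450)]:
    day, month, year = gpu_cost(watts)
    print(f"{name}: £{day:.2f}/day  £{month:.2f}/month  £{year:.2f}/year")
# RTX 3090: £1.32/day  £39.69/month  £476.28/year
# RTX 4090: £1.62/day  £48.51/month  £582.12/year
```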

I was quite shocked to see the numbers. Bear in mind that this doesn't even take into account the power consumption of the rest of the system, especially the CPU if you run a dual-socket machine. So, next, I was curious to see "how much AI I can squeeze out" of my GPU versus the API. The numbers are actually quite crazy.

Let's first make an assumption: what are we running? Right now, one of the best open-source general-intelligence models is something like Llama3-70B or Mixtral 8x22B. The "IQ" of GPT-4 is 93, GPT-3.5's is 65, and Llama3-70B's is 88, so Llama sits right between GPT-4 and GPT-3.5 in terms of intelligence. Per the table, 1M tokens costs $15 on GPT-4 Turbo, $0.75 on GPT-3.5, and just $0.64 for Llama3-70B on Groq and DeepInfra. Let's keep these numbers in mind.

But can we run Llama3-70B on our 24GB-VRAM GPUs? The simple answer is no, not out of the box. We could run 7B or 13B-parameter models out of the box, but to run a 70B model on a GPU with 24GB of VRAM, we would need a quantized model and perhaps something like vLLM to optimize serving further. For context, quantization techniques like 8-bit or even 4-bit quantization can dramatically reduce a model's memory footprint, and libraries like GPTQ or bitsandbytes (LLM.int8()) can quantize models with minimal loss in quality. vLLM is an open-source library for LLM inference acceleration built around PagedAttention, an algorithm inspired by virtual-memory paging that manages attention keys and values in non-contiguous memory blocks; this cuts KV-cache waste and lets you serve larger batches and longer contexts on the same VRAM. Even at 4-bit, though, a 70B model weighs in around 35-40GB, so a single 24GB card still has to offload some layers to CPU RAM. But for the sake of argument, let's just say that we can run the 70B model on an RTX 3090/4090; several people have reported on Reddit that such setups generate roughly 20 tokens/s.
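For the curious, here is a minimal sketch of loading a 4-bit quantized model with Hugging Face transformers and bitsandbytes. The model ID and prompt are illustrative; as noted above, a 70B model does not fit in 24GB even at 4-bit, so the sketch uses an 8B model that does:

```python
# Minimal sketch: 4-bit quantized inference with transformers + bitsandbytes.
# Requires: pip install torch transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # fits in 24 GB at 4-bit

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spills layers to CPU RAM if VRAM runs out
)

inputs = tokenizer(
    "Explain PagedAttention in one sentence.", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```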

Let's compare the cost/benefit of running Llama3-70B at home on an RTX 3090 versus via API:

Home GPU RTX3090 costs:

  • Initial cost: £600
  • Annual electricity cost: ~£470 (the ~£476 calculated above, rounded down)
  • Performance: 20 tokens/s

DeepInfra/Groq API costs:

  • ~£0.50 per 1M tokens (the $0.64 price converted at roughly $1.27 to the pound)

So if we were to spend the electricity money on APIs instead, how many millions of tokens would the annual £470 buy?

Tokens on DeepInfra for £470:

£470 / (£0.50 / 1M) = 940M tokens

And how many tokens would we need to buy to match the £600 purchase price of the RTX 3090?

Tokens on DeepInfra for £600:

£600 / (£0.50 / 1M) = 1,200M tokens

Conversely, given that the performance of the RTX 3090 is not great, I was curious how many millions of tokens it can actually generate in a year.

Tokens per second: 20 T/s

Seconds in a year: 365 * 24 * 60 * 60 = 31,536,000 s

Tokens in a year (100% utilization): 20 T/s * 31,536,000 s = 630,720,000 T/y

However, we agreed that our usage pattern is 12h per day, so:

Tokens in a year (12h/day): 630,720,000 / 2 = 315,360,000 ≈ 315M tokens

So, let's recap:

  • You'd need to use 940 million tokens on DeepInfra to match annual electricity costs of running the RTX3090.
  • You'd need to use 1,200 million tokens on DeepInfra to match the purchase price of the RTX3090.
  • The RTX3090 can generate about 315 million tokens per year at 12h/day usage.
  • The total cost of the RTX 3090 for the first year is:

£600 + £470 = £1,070

  • The first-year TCO of the RTX 3090, expressed in tokens on DeepInfra:

£1,070 / (£0.50 / 1M) = 2,140M tokens

Essentially, this means you would need to push more than 2,140M tokens through the card in its first year before it breaks even and becomes more cost-effective than the API. However, the RTX 3090 can only generate about 315M tokens per year at 12h/day usage, and even running under load 24/7 it would manage only around 630M, so it never catches up, thanks to its slow performance and the electricity cost. Also, a vast majority of people won't be able to run a 70B model at all, even with the various technical optimization tricks, and will be limited to models like 8B, 13B, or 30B. And we have not even discussed API prices for smaller models: Llama3-8B costs just $0.06/1M tokens on Groq, and bear in mind that its IQ of 65 matches GPT-3.5.
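For completeness, here is the whole break-even argument as one minimal Python sketch, using the figures above (the 20 tokens/s local throughput and the ~£0.50/1M API price are the stated assumptions):

```python
# Break-even comparison: RTX 3090 at home vs a Llama3-70B API.
GPU_PRICE = 600.0           # GBP, used RTX 3090
ANNUAL_ELECTRICITY = 470.0  # GBP, ~12h active / 12h idle per day
API_PRICE_PER_M = 0.50      # GBP per 1M tokens (DeepInfra/Groq, converted)
TOKENS_PER_SEC = 20         # reported local Llama3-70B speed

first_year_tco = GPU_PRICE + ANNUAL_ELECTRICITY
breakeven_tokens_m = first_year_tco / API_PRICE_PER_M

seconds_per_year = 365 * 24 * 60 * 60
max_tokens_m = TOKENS_PER_SEC * seconds_per_year / 1e6  # 24/7 utilization
actual_tokens_m = max_tokens_m / 2                      # 12h/day pattern

print(f"First-year TCO:        £{first_year_tco:,.0f}")        # £1,070
print(f"Break-even vs API:     {breakeven_tokens_m:,.0f}M")    # 2,140M
print(f"GPU output (24/7):     {max_tokens_m:,.0f}M/year")     # ~631M
print(f"GPU output (12h/day):  {actual_tokens_m:,.0f}M/year")  # ~315M
# Even at 100% utilization the card produces less than a third of the
# tokens needed to break even -- the API wins at any realistic volume.
```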

Conclusion

Unless you have other uses for the GPU, like gaming or video editing, or specific reasons to run locally, the API solutions are far more cost-effective and flexible for most use cases, especially if you're not pushing extremely high volumes of tokens. I argued with myself that I need a local GPU because I do some fine-tuning and also host diffusion models, so surely I have to own one. But guess what? Thanks to services like https://www.runpod.io/, you can rent a GPU at a very competitive price, especially if you choose Spot GPUs or build serverless microservices. For example, I use a serverless Faster-Whisper service to transcribe YouTube videos and then summarize the transcriptions with cheap models on DeepInfra, and that beats my local infrastructure on cost of operation and ownership every time.
