Is the GPU working? #4531

Closed
15731807423 opened this issue May 20, 2024 · 7 comments
Labels: gpu, nvidia (Issues relating to Nvidia GPUs and CUDA)


@15731807423

[Screenshot attached: 微信截图_20240520112116]

After running 'ollama run llama3:70b', CPU and GPU utilization rose to 100% while the model was being loaded into RAM and VRAM, then dropped back to 0%. I then sent a message and the model began to answer. The GPU only spiked to 100% at the very start and immediately fell back to 0%, and only the CPU kept working. Is this normal?

@pdevine
Contributor

pdevine commented May 20, 2024

@15731807423 what's the output of ollama ps? It should tell you how much of the model is on the GPU and how much is on the CPU.

@15731807423
Author

15731807423 commented May 20, 2024

@pdevine
This should show the occupied RAM and VRAM.
The GPU utilization has stayed at 0% the whole time it is answering.

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:70b      be39eb53a197    41 GB   42%/58% CPU/GPU 4 minutes from now

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:latest   a6990ed6be41    5.4 GB  100% GPU        4 minutes from now

@pdevine
Contributor

pdevine commented May 20, 2024

@15731807423 looks like 70b is being partially offloaded, and 8b is running fully on the GPU. When you do /set verbose, how many tokens/second are you getting? With llama3:latest I would expect about 120-125 toks/sec with a 4090. 70b will be much, much slower, both because almost half of the model is on the CPU and because it's a huge model. You should be getting around 2-3 toks/sec, although it will vary depending on your CPU.

Here's my ollama ps output on the 4090:

$ ollama ps
NAME      	ID          	SIZE 	PROCESSOR      	UNTIL
llama3:70b	be39eb53a197	41 GB	40%/60% CPU/GPU	4 minutes from now

@15731807423
Author

@pdevine
What I don't understand is that the GPU utilization is always 0%. It only spikes to 100% for an instant at the beginning and drops back to 0% the next second, while the CPU keeps working until the answer is complete. Is that correct?

(base) PS C:\Windows\System32> ollama run llama3

/set verbose
Set 'verbose' mode.
你好
😊 你好!我是 Chatbot,很高兴见到你!如果你需要帮助或想聊天,请随时问我。 😊

total duration: 5.8423836s
load duration: 5.4839949s
prompt eval count: 12 token(s)
prompt eval duration: 17.113ms
prompt eval rate: 701.22 tokens/s
eval count: 34 token(s)
eval duration: 334.75ms
eval rate: 101.57 tokens/s

(base) PS C:\Windows\System32> ollama run llama3:70b

/set verbose
Set 'verbose' mode.
你好
😊 Ni Hao! (您好) Welcome! How can I help you today? 🤔

total duration: 13.0373642s
load duration: 6.6727ms
prompt eval count: 12 token(s)
prompt eval duration: 2.312915s
prompt eval rate: 5.19 tokens/s
eval count: 22 token(s)
eval duration: 10.71453s
eval rate: 2.05 tokens/s

pdevine added the gpu and nvidia (Issues relating to Nvidia GPUs and CUDA) labels on May 20, 2024
@frederickjjoubert

I think this might be related to #1651? It doesn't look like ollama is using the GPU on PopOS.

@pdevine
Contributor

pdevine commented May 20, 2024

It is using the GPU, but it's not using it particularly efficiently, because the model is split across the CPU and GPU and because of the limitations of the computer (like slow memory). You can turn the GPU off entirely in the repl with:

>>> /set parameter num_gpu 0

Which should show you the difference in performance. You can also load a lower number of layers (e.g. /set parameter num_gpu 1), which will offload most of the model's layers to the CPU. I believe the reason the activity monitor shows the GPU not doing much has to do with the bandwidth to the GPU and the contention between system memory and the GPU itself. That said, it's possible we can eke more speed out of this in the future if we're more clever about how we load the model onto the GPU.

Back to CPU only (using num_gpu 0) I get:

total duration:       3m25.479681006s
load duration:        4.023984693s
prompt eval count:    208 token(s)
prompt eval duration: 41.733919s
prompt eval rate:     4.98 tokens/s
eval count:           259 token(s)
eval duration:        2m39.571141s
eval rate:            1.62 tokens/s

or roughly half the speed of the GPU.
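
If you want to experiment with this outside of the repl, the same option can also be passed per-request through the HTTP API. A minimal sketch, assuming num_gpu is honored in the request options the same way it is via /set parameter (the prompt and value here are just illustrative):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 0 }
}'

Setting "num_gpu": 0 keeps everything on the CPU for that request; omitting it lets ollama pick the layer split automatically.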

@dhiltgen
Collaborator

To expand on what Patrick mentioned: the 42% of the model that is loaded into system memory and doing inference calculations on the CPU is significantly slower than the GPU, so the GPU quickly finishes its calculations for each step of inference and then sits idle waiting for the CPU to catch up. The closer you can get to 100% on GPU, the better the performance will be. If you have further questions, let us know.
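
If you want to see that burst-then-idle pattern directly (Task Manager averages utilization over a fairly long window, so short spikes mostly read as 0%), you can sample the GPU at a fixed interval while an answer is generating. A minimal sketch using standard nvidia-smi query options, nothing ollama-specific:

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv --loop=1

With a partial offload you would expect brief jumps in utilization.gpu each time the GPU layers run, separated by stretches near 0% while the CPU layers are working.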

dhiltgen self-assigned this on May 22, 2024