Is the GPU working? #4531

Closed
15731807423 opened this issue May 20, 2024 · 7 comments
Labels: gpu, nvidia (Issues relating to Nvidia GPUs and CUDA)


@15731807423

[Screenshot attached: 微信截图_20240520112116]

After running 'ollama run llama3:70b', CPU and GPU utilization rose to 100% while the model was being loaded into RAM and VRAM, then dropped back to 0%. I then sent a message and the model began to answer. The GPU only spiked to 100% at the very start and immediately fell back to 0%, and only the CPU kept working. Is this normal?

@pdevine
Contributor

pdevine commented May 20, 2024

@15731807423 what's the output of ollama ps? It should tell you how much of the model is on the GPU and how much is on the CPU.

@15731807423
Author

15731807423 commented May 20, 2024

@pdevine
This should show the occupied RAM and VRAM.
The GPU utilization has stayed at 0% the whole time it is answering.

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:70b      be39eb53a197    41 GB   42%/58% CPU/GPU 4 minutes from now

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:latest   a6990ed6be41    5.4 GB  100% GPU        4 minutes from now

@pdevine
Contributor

pdevine commented May 20, 2024

@15731807423 looks like 70b is being partially offloaded, and 8b is running fully on the GPU. When you do /set verbose, how many tokens/second are you getting? With llama3:latest I would expect about 120-125 toks/sec with a 4090. 70b will be much, much slower, both because almost half of the model is on the CPU and because it's a huge model. You should be getting around 2-3 toks/sec, although it will vary depending on your CPU.

Here's my ollama ps output on the 4090:

$ ollama ps
NAME      	ID          	SIZE 	PROCESSOR      	UNTIL
llama3:70b	be39eb53a197	41 GB	40%/60% CPU/GPU	4 minutes from now

@15731807423
Author

@pdevine
What I don't understand is that the GPU utilization is always 0%. It only spikes to 100% for an instant at the beginning and drops back to 0% the next second, while the CPU keeps working until the answer is complete. Is that correct?

(base) PS C:\Windows\System32> ollama run llama3

/set verbose
Set 'verbose' mode.
你好
😊 你好!我是 Chatbot,很高兴见到你!如果你需要帮助或想聊天,请随时问我。 😊

total duration: 5.8423836s
load duration: 5.4839949s
prompt eval count: 12 token(s)
prompt eval duration: 17.113ms
prompt eval rate: 701.22 tokens/s
eval count: 34 token(s)
eval duration: 334.75ms
eval rate: 101.57 tokens/s

(base) PS C:\Windows\System32> ollama run llama3:70b

/set verbose
Set 'verbose' mode.
你好
😊 Ni Hao! (您好) Welcome! How can I help you today? 🤔

total duration: 13.0373642s
load duration: 6.6727ms
prompt eval count: 12 token(s)
prompt eval duration: 2.312915s
prompt eval rate: 5.19 tokens/s
eval count: 22 token(s)
eval duration: 10.71453s
eval rate: 2.05 tokens/s

pdevine added the gpu and nvidia (Issues relating to Nvidia GPUs and CUDA) labels on May 20, 2024
@frederickjjoubert

I think this might be related to #1651? It doesn't look like ollama is using the GPU on PopOS.

@pdevine
Contributor

pdevine commented May 20, 2024

It is using the GPU, but it's not using it particularly efficiently, because the model is split across the CPU and GPU and because of the limitations of the computer (like slow memory). You can turn the GPU off entirely in the repl with:

>>> /set parameter num_gpu 0

Which should show you the difference in performance. You can also load a lower number of layers (e.g. /set parameter num_gpu 1), which will offload most of the model's layers to the CPU. I believe the reason the activity monitor shows the GPU not doing much has to do with the bandwidth to the GPU and the contention between system memory and the GPU itself. That said, it's possible we can eke more speed out of this in the future if we're more clever about how we load the model onto the GPU.

Back to CPU only (using num_gpu 0) I get:

total duration:       3m25.479681006s
load duration:        4.023984693s
prompt eval count:    208 token(s)
prompt eval duration: 41.733919s
prompt eval rate:     4.98 tokens/s
eval count:           259 token(s)
eval duration:        2m39.571141s
eval rate:            1.62 tokens/s

or roughly half the speed of the GPU.
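
If you want to experiment with this outside of the repl, the same option can also be passed per-request through the HTTP API. A minimal sketch, assuming num_gpu is honored in the request options the same way it is via /set parameter (the prompt and value here are just illustrative):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 0 }
}'

Setting "num_gpu": 0 keeps everything on the CPU for that request; omitting it lets ollama pick the layer split automatically.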

@dhiltgen
Collaborator

To expand on what Patrick mentioned: the 42% of the model that is loaded into system memory and doing inference calculations on the CPU is significantly slower than the GPU, so the GPU quickly finishes its calculations for each step of inference and then sits idle waiting for the CPU to catch up. The closer you can get to 100% on GPU, the better the performance will be. If you have further questions, let us know.
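
If you want to see that burst-then-idle pattern directly (Task Manager averages utilization over a fairly long window, so short spikes mostly read as 0%), you can sample the GPU at a fixed interval while an answer is generating. A minimal sketch using standard nvidia-smi query options, nothing ollama-specific:

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv --loop=1

With a partial offload you would expect brief jumps in utilization.gpu each time the GPU layers run, separated by stretches near 0% while the CPU layers are working.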

dhiltgen self-assigned this on May 22, 2024