error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2' #4529
Comments
I have also encountered this problem, and I suspect the cause is here:
I got the same error on a Windows system:
The likely cause is that llama.cpp/convert-hf-to-gguf.py broke at some point during its rapid iteration. I hit the same problem when exporting and quantizing qwen2 with the latest llama.cpp, but GGUF models exported and quantized with an older llama.cpp version work fine. You can try modifying that file as @binganao did, or simply roll llama.cpp back and try again:

cd llama.cpp
git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72

Check that release for details. Then import and re-quantize the ModelScope/HF folder of qwen2 according to the official Ollama documentation. Hopefully this solves your problem.
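The rollback-and-reconvert workflow above can be sketched end to end. This is only a sketch: the model path and output filenames are placeholders, and the `make` target and `quantize` binary name assume what llama.cpp shipped around that commit, so adjust to your checkout.

```shell
# Roll llama.cpp back to a commit that predates the llama3 tokenizer changes.
cd llama.cpp
git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72
make clean && make quantize

# Placeholder paths: point these at your local Qwen2 model folder.
python convert-hf-to-gguf.py /path/to/qwen2 --outfile qwen2-f16.gguf
./quantize qwen2-f16.gguf qwen2-q4_0.gguf q4_0
```

After this, the resulting q4_0 GGUF can be imported into Ollama via a Modelfile as the official documentation describes.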
I tried binganao's method, but it didn't work. However, rolling back to a previous version as you suggested resolved the issue. Thank you!
I just tried a Qwen2 model I made recently with llama.cpp ./main and it loaded and generated with no issues. Are we sure this isn't ollama needing an update?
I have the same issue when exporting and quantizing qwen1.5-7b-chat (Error: llama runner process has terminated: signal: aborted (core dumped)). I also tried Treedy2020's method:
The problem was that llama.cpp changed how the tokenizer works because of the llama3 tokenization changes. This should be fixed in:
Could this be re-opened?
Something strange is going on too. While I have 0.1.41 installed (Arch Linux):
So upon further inspection, this is how it's built: it builds the tag 476fb8e, which is the 0.1.41 tag: https://github.com/ollama/ollama/releases/tag/v0.1.41. The llama.cpp version is the tag ggerganov/llama.cpp@5921b8f, which is just a week old. Am I missing something here to get qwen2 working?
Did you reboot your machine or do
@cyp0633 yes! :) I did both (a couple of times), but it didn't help. Could someone else run
Same issue happened to me.
The issue can be closed again. Removing the version installed through the script made things work; the version is as expected now.
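When two installs coexist like this, a stale script-installed binary earlier on PATH can shadow the packaged one. A quick check, as a sketch (assumes a POSIX-ish shell; `which -a` is common but not strictly POSIX):

```shell
# List every ollama binary on PATH; the first one listed is what runs.
which -a ollama

# Confirm the version of the binary that actually resolves.
ollama --version
```

If the version printed here disagrees with what the package manager reports, remove the shadowing copy.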
Update ollama to version 0.1.42, then it works.
I was using LM Studio and just had to update, by the way.
What is the issue?
I carefully read the README documentation and tried to follow it, but something went wrong:
time=2024-05-20T10:06:02.688+08:00 level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama2132883000/runners/cuda_v11/ollama_llama_server --model /root/autodl-tmp/models/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 33525"
time=2024-05-20T10:06:02.690+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T10:06:02.690+08:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T10:06:02.691+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="140401842012160" timestamp=1716170762
INFO [main] system info | n_threads=64 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140401842012160" timestamp=1716170762 total_threads=128
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="127" port="33525" tid="140401842012160" timestamp=1716170762
llama_model_loader: loaded meta data with 21 key-value pairs and 483 tensors from /root/autodl-tmp/models/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = merge5-1
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-05-20T10:06:02.944+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T10:06:03.285+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
time=2024-05-20T10:06:03.535+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/05/20 - 10:06:03 | 500 | 2.178464527s | 127.0.0.1 | POST "/api/chat"
time=2024-05-20T10:06:07.831+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
(The runner then retried on port 43339; the loader output and the "unknown pre-tokenizer type: 'qwen2'" failure were identical to the first attempt above.)
time=2024-05-20T10:06:08.656+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.120574757
time=2024-05-20T10:06:08.688+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
I looked at the provided qwen1.5 models from 4B to 72B, so this should be provided by the tokenizer as well.
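For anyone triaging similar reports, the failing metadata key can be read straight out of a loader log like the one above. A minimal sketch; the `parse_loader_kv` helper and its regex are my own illustration, not part of ollama or llama.cpp:

```python
import re

# Extract the "- kv N: key type = value" metadata pairs that
# llama_model_loader prints, so the offending tokenizer.ggml.pre
# value can be checked at a glance.
def parse_loader_kv(log_text):
    pairs = {}
    pattern = re.compile(r"- kv\s+\d+:\s+(\S+)\s+\S+\s+=\s+(.+)")
    for line in log_text.splitlines():
        m = pattern.search(line)
        if m:
            pairs[m.group(1)] = m.group(2).strip()
    return pairs

sample = (
    "llama_model_loader: - kv  11: tokenizer.ggml.model str = gpt2\n"
    "llama_model_loader: - kv  12: tokenizer.ggml.pre str = qwen2\n"
)
print(parse_loader_kv(sample)["tokenizer.ggml.pre"])  # prints: qwen2
```

A `tokenizer.ggml.pre` value the installed llama.cpp doesn't recognize is exactly what triggers the "unknown pre-tokenizer type" abort, so this is a fast way to compare a model's metadata against the runner version.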
OS
Linux
GPU
Nvidia
CPU
Other
Ollama version
client version is 0.1.38