Dirt Cheap Local AI: RAG with an Nvidia P102-100 Mining GPU

I ran a documentation Retrieval-Augmented Generation (RAG) pipeline (186,000 chunks) on an old Nvidia P102-100 mining GPU using a llama.cpp CUDA backend container (Qwen 2.5-7B-Instruct Q6_K) on Linux Mint. I walk through the setup, retrieval behavior (k=5), context usage, and where the system fails (hello, ffmpeg).

In this video I:
• Build a RAG pipeline over nine different tech projects
• Run a 6-bit Qwen 2.5 7B Instruct LLM locally
• Compare “huge” vs “reasonable” context windows and VRAM usage
• Time retrieval and generation on a handful of real-world questions
• Show where the system does great, where it hallucinates, and where it finally just says “I don’t know”

If you care about local AI, weird GPUs, or squeezing real work out of “e-waste” hardware, this one’s for you.

⏱ Chapters
00:00 – Can a $40 mining card run RAG?
00:32 – What is the P102-100, really?
01:18 – Model, corpus, and RAG setup
02:05 – 32K vs 8K context and VRAM usage
02:32 – How I measured retrieval, tokens, and latency
02:57 – Query #1 results
04:15 – Query #2 results
04:51 – Query #3 results
05:37 – Query #4 results
06:26 – What we learned about this “e-waste” GPU
07:42 – Next steps and future experiments

Gear & setup (high level)
• GPU: Nvidia P102-100 (Pascal mining card, 10 GB)
• Platform: X99 system (Xeon E5-2690 v4, 64 GB DDR4 RAM), single-card inference
• Model: Qwen 2.5 7B Instruct, 6-bit quant
• Corpus: 186k semantic chunks from 9 projects
• RAG: FAISS + sentence-transformers all-MiniLM-L6-v2

More experiments with old and new hardware coming soon.

#nvidia #localai #localllm #MiningGPU #LocalAI #RAG #RetrievalAugmentedGeneration #qwenai #LLM #Linux #Homelab #llamacpp #rag #retrievalaugmentedgeneration #linuxai
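
For anyone curious why the 32K vs 8K context comparison matters on a 10 GB card, here is a rough back-of-the-envelope sketch of the KV-cache size at each setting. It is not taken from the video: the layer count, KV-head count, and head dimension are assumptions based on the published Qwen 2.5 7B configuration, and the cache is assumed to be stored in 16-bit floats. Treat the numbers as an estimate, not a measurement.

```python
# Rough KV-cache size estimate for a 7B model at two context lengths.
# ASSUMPTIONS (not from the video): 28 layers, 4 KV heads (grouped-query
# attention), head dimension 128, and a 16-bit (fp16) cache.
# Check the model's own config before relying on these values.

N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes needed to hold keys and values for n_ctx tokens."""
    # The leading 2 accounts for storing both a key and a value per position.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * n_ctx


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for n_ctx in (32768, 8192):
        print(f"n_ctx={n_ctx}: ~{gib(kv_cache_bytes(n_ctx)):.2f} GiB KV cache")
```

Under these assumptions a 32K context reserves about 1.75 GiB for the cache and an 8K context about 0.44 GiB, on top of the model weights. Actual usage depends on the runtime and its cache settings.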

12+
1 view
12 hours ago
