
Building a Self-Hosted PDF Chat App with Deepseek-r1: Exploring RAG, Embeddings, and Vector Databases

Building a fully local PDF question-answering system using DeepSeek-R1, Ollama, and a simple vector store — no API keys, no data leaving your machine.

DeepSeek
RAG
Self-Hosting
Privacy

Due to work projects I've got going on right now, it's about time I learned about Retrieval Augmented Generation (RAG), text embeddings and vector databases.

I will be using Ollama to self-host a distilled version of the recently released DeepSeek-R1, so if you want to follow along you can check out my post on how to host your own model.

Don't worry if what I just typed makes no sense to you. I will try to explain these concepts as simply as I can. I've often heard that teaching something is the best way to test your own knowledge!

In this post I will cover:

  • What model I chose to use, why I chose that model, and how to host it
  • Core concepts of RAG
  • How I built the AI Chat app
  • What I learned about RAG, text embeddings, and vector databases

Full disclosure: I didn't write any of the code for the chat app. I used the AI IDE that's making waves — Cursor.


Why I Chose DeepSeek-R1

With the recent release of DeepSeek, anyone can now run models that 'think' before they respond. This capability — essentially reasoning before answering — was previously only available in closed ecosystems like OpenAI's, and only if you paid. Not anymore.

The full R1 is an incredibly powerful model but costs thousands in hardware to run, so I'm running a distilled version, DeepSeek-R1-Distill-Qwen-14B-GGUF, on an RTX 3080 GPU.

But what does 'distilled' mean? In simple terms, model distillation is a process where a smaller model (in this case Qwen-14B) is trained to mimic the behaviour of a larger, more capable model. In other words, the 'thinking' — otherwise known as Chain-of-Thought — was taught to the smaller Qwen model by the massive full DeepSeek-R1. The result is a small model that runs on consumer hardware and has been trained to think before it answers.

The combination of open source licensing, reasoning capabilities, and distilled versions came at the perfect time, as I need to explore GenAI development and apply this knowledge in my career. I can now explore AI concepts, work with my own local data, and not pay a penny in API calls — all with the bonus of a highly capable AI fully available to me.

How to Host DeepSeek

As mentioned previously, I'm using Ollama to host the model. I won't go into how to set up Ollama (see my post) but I will show you how to pull the model I'm using.

First, open a terminal and run the following command:

ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M

Once the model has downloaded, you need to run Ollama as a server. If you didn't know, Ollama can run in the background as a local server that you can make API calls to. This is how we will call the model from our app. Run this command in your terminal:

ollama serve

That's it! You now have a 'server' running locally.

Core Concepts Made Simple

Before we dive into the app, let's cover some core concepts you need to understand. They explain the why behind what I've built.

Retrieval-Augmented Generation (RAG)

What is RAG?

RAG is a technique to enhance a model's capabilities by giving it access to knowledge it wasn't originally trained on. This means that instead of relying solely on what the model already knows (from its training data), we can use RAG to retrieve relevant information from internal data sources for more accurate answers.

Why is RAG needed?

Think of a scenario. We have a user. They ask a model:

User: "How well did we do in Q1?"

AI: "I'm sorry. I don't have access to that information."

The question itself isn't hard for a model to answer, but the data required to answer it (internal financial reports and metrics) was never in the model's training set. That data lives inside the user's organisation. This is where RAG comes in.

Implementing RAG allows us to give the models access to data they were not originally trained on — such as the company's Q1 financial report — significantly enhancing their use in personal and enterprise use cases.

In this new scenario where we have implemented RAG, the chat would look something like:

User: "How well did we do in Q1?"

AI: "Based on the latest Q1 financial report, we performed..."
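Under the hood, the retrieved text is simply stitched into the prompt before the model sees it. Here's a minimal sketch of that augmentation step — the function name and the report snippet are made up for illustration:

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved document chunks with the user's question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical chunk retrieved from the Q1 financial report
chunks = ["Q1 revenue was £2.4m, up 18% year on year."]
prompt = build_rag_prompt("How well did we do in Q1?", chunks)
print(prompt)
```

The model never "learns" the report; it just answers with the report sitting in its context window.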

Chunking

What is chunking?

Chunking is a method to break down files into smaller sections (or chunks). These chunks are then processed individually, making it easier for models to handle vast volumes of data without exceeding context limits.

Why is chunking needed?

Today's models have something called a 'context window'. You can think of it as short-term memory: the larger the context window, the more information you can pass to the model in your chat session.

The issue is that you can't just load the model with a heap of documentation, as eventually you will hit the context limit. That's fine if you're working with a small handful of documents, but if you want to implement a model for use in a large enterprise, chunking is the way to go.

Chunking ensures data is broken down into digestible pieces and allows the model to process and retrieve the relevant chunks during a query.

Embeddings

What are embeddings?

Embedding is the process of using an embedding model to convert text (for example, the company's Q1 report) into numbers. Sounds weird, I know. We do it because large language models, under the hood, are just maths: they understand words via numbers rather than text like we do, and those numerical representations let them capture the relationships and meaning in documents much better.

Why are embeddings needed?

Going back to our Q1 report question:

  • The user asks "How well did we do in Q1?". We need a way for the model to understand that this question relates to the financial Q1 report we have given it access to.
  • We convert the user's question into embeddings. The system can then compare the numbers representing the question against the stored document embeddings and find the most relevant information — the Q1 report.
  • That's how models retrieve (the R in RAG) the right data for natural language queries.

Embeddings are awesome because they allow models to measure how similar a user's question — and therefore the question's embeddings — are to the data it has access to, even if the user phrases things slightly differently.

How do embeddings work?

I touched on this slightly just above but thought it worth delving just a little deeper.

  • Words are converted to vectors (numbers): Every word or sentence gets converted into a set of vectors.
  • Similarity comparison: The model searches for pieces of text that have similar numerical representations.
  • Search efficiency: instead of scanning the whole document on every query, the model just compares vectors and looks for similar ones.

For the user's question, the embedding system can recognise that "What were the sales figures in Q1?" is closely related to the "Q1 Sales Report" section of the financial report, even if the wording isn't an exact match.
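To make the similarity idea concrete, here's a toy sketch using cosine similarity, a common way to compare vectors. The 3-dimensional vectors are completely made up — real embedding models produce hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Measure how similar two vectors are (1.0 = pointing the same way)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: the question and the sales section point the same way
question_vec = [0.9, 0.1, 0.2]   # "What were the sales figures in Q1?"
sales_vec = [0.85, 0.15, 0.25]   # "Q1 Sales Report" section
holiday_vec = [0.1, 0.9, 0.3]    # unrelated "holiday policy" section

print(cosine_similarity(question_vec, sales_vec))    # high, close to 1
print(cosine_similarity(question_vec, holiday_vec))  # much lower
```

Because the question and the sales section have similar vectors, the system knows they're about the same thing — no keyword matching required.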

Vector Databases

What is a vector database?

I'm not going to go into a lot of detail about what a vector database is, but I will explain its part in this process.

So, we've converted our PDFs into vectors using an embedding model. What do we do with them?

A traditional database (like SQL) is great for structured data — like customer records or sales numbers. But what if we're working with unstructured data? A traditional database isn't optimised for searching by meaning or intent; it can only match the data itself.

Why is a vector database needed?

Going back again to our Q1 question:

  • A traditional database might search for the user's exact question, word for word. The problem is, unless that exact string of words appears verbatim in the PDF, the user won't get what they are looking for.
  • A vector database takes the embeddings (numbers) we loaded using the embeddings model, finds the relevant or similar content, and passes that back to the LLM.

It's quite easy to see why they are needed. Without them, natural language queries would very rarely return what we are looking for.
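A vector database essentially does the following, just at massive scale and with clever indexing. This brute-force toy version (vectors and texts made up) shows the store-then-search idea:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorStore:
    """Stores (vector, text) pairs and returns the most similar texts."""

    def __init__(self):
        self.items = []

    def add(self, vector: list[float], text: str) -> None:
        self.items.append((vector, text))

    def search(self, query_vector: list[float], k: int = 1) -> list[str]:
        # Rank every stored chunk by similarity to the query, best first
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add([0.9, 0.1], "Q1 Sales Report: revenue up 18%")
store.add([0.1, 0.9], "Office holiday policy")

# A made-up query vector for "How well did we do in Q1?" sits near the sales chunk
print(store.search([0.8, 0.2], k=1))  # → ['Q1 Sales Report: revenue up 18%']
```

Real vector databases like FAISS use approximate nearest-neighbour indexes so this search stays fast across millions of chunks, but the retrieval idea is the same.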

Building the Chat App

First of all — why did I build the app? Great question, reader.

I built the app because, as hinted at before, I'm starting to face this challenge in my work projects and need to skill up in at least the concepts behind this stuff.

Here's the high-level architecture behind the app: the user uploads a PDF, it gets chunked and embedded into a local vector store, and then when they ask a question, the relevant chunks are retrieved and passed to DeepSeek-R1 to generate a grounded answer.

Coding

I can't code. Not sure if I ever will learn to code 'properly'. As a result, I used Cursor, the AI-assisted coding application.

Cursor is the leading AI coding app. I 100% recommend checking it out. You can use the latest and greatest models and simple, plain English to spin up projects very fast.

The User Interface (UI)

I chose to use Streamlit as the frontend for this application.

Streamlit is a simple, open source UI framework built entirely in Python. It's great for quick experimentation or demo apps such as the PDF one I built.

The UI consists of a sidebar where you upload your PDF and a main chat window where you can ask questions about it. Once a PDF is processed, its chunks and embeddings are cached in session state so repeat queries are fast.

Implementing RAG

I used LangChain for a lot of the RAG implementation in my app. LangChain is "a composable framework to build context-aware, reasoning applications with large language models (LLMs)".

Step 1: Chunking

To extract the text from my local files I first used PyPDFLoader from LangChain.

Then, to chunk the text, I used RecursiveCharacterTextSplitter, also from LangChain.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

The splitter is configured with two key parameters:

  • chunk_size: the maximum number of characters in each chunk (1000 here). Once a chunk reaches that size, a new one is started.
  • chunk_overlap: the number of characters shared between consecutive chunks (200 here), meaning each chunk begins with the last 200 characters of the previous one. This overlap helps maintain context across chunk boundaries and improves retrieval accuracy.
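To see what these two parameters actually do, here's a deliberately simplified, pure-Python sliding-window version of chunking (the real RecursiveCharacterTextSplitter is smarter, preferring paragraph and sentence boundaries). Toy sizes are used so the overlap is visible:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a chunk_size window over the text, stepping back chunk_overlap each time."""
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=4)
print(chunks)  # → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Notice how each chunk repeats the last 4 characters of the previous one — that's the overlap preserving context across the boundary.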

Step 2: Embedding

I like running stuff locally as much as possible. Luckily, LangChain offers an Ollama embeddings library that makes it very easy to use local embedding models.

I chose the model nomic-embed-text, an open source embedding model you can find in the Ollama library.

from langchain_community.embeddings import OllamaEmbeddings
 
embeddings = OllamaEmbeddings(model="nomic-embed-text")

Step 3: Vector Database

After some research I chose to use FAISS (Facebook AI Similarity Search) as my vector database. LangChain offers a simple way to use FAISS with just a single import:

from langchain_community.vectorstores import FAISS

Then to actually use the vector database we pass it the chunks from before and call the nomic-embed-text embeddings model:

vectorstore = FAISS.from_documents(chunks, embeddings)

Adding Intelligence

Now we add the distilled DeepSeek model into the mix so we can not only chat with our data but utilise a model that thinks before it responds.

Make sure you've followed the How to Host DeepSeek section to pull the model and start your local server.

Here is the function to define the model we are going to use:

from langchain_community.llms import Ollama
 
llm = Ollama(model="hf.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M")

It's very important that you've run:

ollama serve

If you haven't, you won't be able to call the model from the Streamlit app.

Seeing the Thinking

Now the cool part. We get to see the DeepSeek model 'think' through the user's question before it responds.

As a demo, I uploaded my CV and asked the model to tell me what the email address is in the PDF. The model first emits a <think>...</think> block where it reasons about which chunk of the document is most relevant, then produces its final answer. This chain-of-thought output is what makes the distilled R1 models particularly compelling for document Q&A — you can see exactly why it reached its conclusion.
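If you want to display the reasoning and the answer separately in the UI, the model's raw output can be split on those tags. Here's a small sketch — the helper name and the sample output string are made up for illustration:

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return thinking, answer

# Made-up model output in the R1 style
raw = "<think>The email appears in the contact chunk.</think>The email address is jane@example.com."
thinking, answer = split_thinking(raw)
print(thinking)  # → The email appears in the contact chunk.
print(answer)    # → The email address is jane@example.com.
```

In Streamlit you could then render the thinking in a collapsible section and the answer as the main chat message.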

Lessons Learned

Understanding RAG

We now understand the core concepts of RAG — chunking, embeddings, and vector databases — and the benefit of implementing it: it improves the accuracy of a model's responses by grounding them in external document context.

The Newly Available Self-Hosting Power

By highlighting what DeepSeek-R1 is, and model distillation, we have increased our understanding of what's now available on consumer-grade hardware.

Democratising Building

Although only briefly touched on, new AI tools such as Cursor, combined with these amazing open source models, mean that building things is now easier than ever. With just a few hours and your native language you can start to build and learn at your own pace — focusing on the concepts rather than the syntax!