Retriever SDG Toolkit: Turn Your Documents Into Retriever Training Data
If you are building a RAG system, you have probably hit this wall: the generator is good, the vector database is fast, the prompt is carefully tuned, and the answer is still wrong because the right passage never made it into context.
That is a retrieval problem. More specifically, it is often a data problem. General-purpose embedding models understand broad semantic similarity, but they do not know the fine-grained distinctions in your product docs, tickets, policies, codebase, manuals, or internal taxonomy. To teach a retriever those distinctions, you need domain-specific training and evaluation data: realistic queries, positive passages, held-out evals, and enough metadata to know whether the retriever actually found the right evidence.
The hard part is not asking an LLM to write questions about a document. The hard part is keeping every generated question tied to the exact chunk, document, or multi-hop evidence set that a retriever should recover. Many RAG tutorials stop at chunk, embed, retrieve, and prompt. Fine-tuning recipes often begin once labeled query-passage pairs already exist. The gap in between is where developers lose the most time.
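To make that constraint concrete, here is a minimal sketch of a record that carries its own provenance, plus a check for queries whose evidence pointers no longer resolve. The `SyntheticQuery` class, its field names, and `orphaned_queries` are hypothetical and chosen for illustration; they are not the toolkit's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticQuery:
    """One generated query plus the evidence a retriever should recover.

    Hypothetical schema for illustration only.
    """
    query_id: str
    text: str
    # Each positive is a (doc_id, chunk_id) pair.
    # A multi-hop query lists more than one pair.
    positives: tuple[tuple[str, str], ...]

    @property
    def is_multi_hop(self) -> bool:
        return len(self.positives) > 1


def orphaned_queries(queries, known_chunks):
    """Return queries with at least one positive that is missing from the corpus."""
    return [
        q for q in queries
        if any(p not in known_chunks for p in q.positives)
    ]


# Usage: the second query points at a chunk that was dropped during re-chunking.
known = {("doc-a", "0"), ("doc-a", "1"), ("doc-b", "0")}
queries = [
    SyntheticQuery("q1", "How do I rotate an API key?", (("doc-a", "1"),)),
    SyntheticQuery("q2", "Which plan includes SSO and audit logs?",
                   (("doc-b", "0"), ("doc-c", "4"))),
]
broken = orphaned_queries(queries, known)
```

If a query loses its link to the evidence, it cannot serve as a training positive or as an evaluation label, so a check like this is worth running whenever the chunking changes.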
The new data-designer-retrieval-sdg toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.
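For readers unfamiliar with the target format, the sketch below writes the file layout commonly used by BEIR-style datasets: `corpus.jsonl`, `queries.jsonl`, and a tab-separated `qrels/<split>.tsv`. It is not the toolkit's exporter. The `export_beir` function and its arguments are hypothetical, and it assumes the generated data has already been filtered into plain dictionaries.

```python
import csv
import json
from pathlib import Path


def export_beir(out_dir, corpus, queries, qrels, split="test"):
    """Write corpus, queries, and relevance judgments in a BEIR-style layout.

    corpus:  {chunk_id: {"title": str, "text": str}}
    queries: {query_id: query_text}
    qrels:   {query_id: {chunk_id: relevance_score}}
    (Hypothetical helper; argument shapes are assumptions.)
    """
    out = Path(out_dir)
    (out / "qrels").mkdir(parents=True, exist_ok=True)

    with open(out / "corpus.jsonl", "w", encoding="utf-8") as f:
        for chunk_id, doc in corpus.items():
            row = {"_id": chunk_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row) + "\n")

    with open(out / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")

    with open(out / "qrels" / f"{split}.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for chunk_id, score in judged.items():
                writer.writerow([query_id, chunk_id, score])

    return out
```

Keeping the relevance judgments in a separate file per split means the same corpus and queries can back both a training split and a held-out evaluation split.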
