
Building Enterprise-Scale RAG Systems with Fireworks AI and MongoDB Atlas
By Fireworks AI | 4/9/2025
This project is completely open-source, including all documentation, code, and tests, which can be found at: https://github.com/shubcodes/earnings-ai-demo
In the fast-paced world of enterprise data, extracting actionable insights from vast amounts of unstructured information is a challenge many organizations face. Whether it’s earnings calls, financial reports, legal documents, or technical specifications, the ability to retrieve, synthesize, and act on this information quickly can make or break a company’s competitive edge. Enter Retrieval-Augmented Generation (RAG) – a cutting-edge solution combining Large Language Models (LLMs) with powerful retrieval systems to deliver contextually rich, actionable insights in real time.
Let’s face it: traditional search systems just don’t cut it anymore. They’re limited to keyword matching and fail to grasp the semantic and contextual relationships necessary for enterprise-scale decision-making. RAG stands out because it retrieves semantically relevant context first and then uses an LLM to synthesize that context into grounded, actionable answers, rather than returning a list of keyword hits.
Our enterprise RAG system is built using Fireworks AI for inference and MongoDB Atlas for vector storage. Here’s an overview of the pipeline:
Document Processing Pipeline
The ability to process diverse document formats is the cornerstone of any robust RAG system. Without this capability, organizations are left with data silos and limited insights. The document processing pipeline ensures that content from PDFs, DOCX files, and plain text is not only extracted but also enriched with metadata for better querying.
from pathlib import Path
from typing import Dict

class DocumentExtractor:
    def extract_text(self, file_path: str) -> Dict:
        # Route to a format-specific extractor based on the file extension
        path = Path(file_path)
        if path.suffix.lower() == '.pdf':
            text = self._extract_pdf(path)
        elif path.suffix.lower() == '.docx':
            text = self._extract_docx(path)
        else:
            # Anything else is treated as UTF-8 plain text
            text = path.read_text(encoding='utf-8')
        return {
            "text": text,
            "metadata": {
                "filename": path.name,
                "file_type": path.suffix.lower()[1:],  # extension without the dot
                "file_size": path.stat().st_size
            }
        }
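The _extract_pdf and _extract_docx helpers are elided above. A minimal sketch of how they might look as methods on DocumentExtractor, assuming the pypdf and python-docx libraries (the demo repo may use different parsers):

from pathlib import Path

from docx import Document  # python-docx, assumed
from pypdf import PdfReader  # assumed

def _extract_pdf(self, path: Path) -> str:
    # Concatenate the extracted text of every page
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def _extract_docx(self, path: Path) -> str:
    # Join paragraph text; tables and headers would need extra handling
    doc = Document(str(path))
    return "\n".join(p.text for p in doc.paragraphs)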
Key features:
- Automatic format detection based on the file extension (PDF, DOCX, or plain text)
- Metadata enrichment with filename, file type, and file size for downstream filtering
- A UTF-8 plain-text fallback for any other format
Efficient transcription of audio content is another vital component. Fireworks’ Whisper V3 Turbo can transcribe one hour of audio in just 3 seconds, making it 20x faster than competing solutions such as OpenAI Whisper. This speed advantage significantly reduces latency and enhances user experiences.
from typing import Dict

from fireworks.client.audio import AudioInference  # audio client from the fireworks-ai SDK

class AudioTranscriber:
    def __init__(self, api_key: str, base_url: str = "https://api.fireworks.ai/audio/v3-turbo"):
        self.client = AudioInference(api_key=api_key, base_url=base_url)

    def transcribe_audio(self, file_path: str) -> Dict:
        # Send the raw audio bytes to Whisper V3 Turbo
        with open(file_path, 'rb') as audio_file:
            response = self.client.transcribe(audio=audio_file.read())
        return {
            "transcription": response.text,
            "duration": response.metadata.duration
        }
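Using it takes a couple of lines; the file name below is illustrative:

transcriber = AudioTranscriber(api_key="YOUR_FIREWORKS_API_KEY")
result = transcriber.transcribe_audio("q3_earnings_call.mp3")  # hypothetical file
print(f"{result['duration']}s of audio -> {len(result['transcription'])} characters")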
Once the content is extracted, converting it into vector embeddings allows the system to understand and process queries semantically. This step is critical for enabling similarity searches and advanced data retrieval.
from typing import List

import numpy as np

def generate_document_embedding(self, text: str, prefix: str = "", method: str = "mean") -> List[float]:
    # Embed a long document by chunking it and pooling the chunk embeddings
    chunks = self._chunk_text(text)
    chunk_embeddings = self.generate_embeddings_batch(chunks, prefix)
    if method == "mean":
        return np.mean(chunk_embeddings, axis=0).tolist()
    elif method == "max":
        return np.max(chunk_embeddings, axis=0).tolist()
    raise ValueError(f"Unknown pooling method: {method}")
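The _chunk_text and generate_embeddings_batch helpers are elided above. A minimal sketch, assuming naive fixed-size word chunking and Fireworks’ OpenAI-compatible embeddings endpoint with nomic-embed-text (the model and chunk size are assumptions; nomic models expect prefixes like "search_document: " and "search_query: ", which is what the prefix argument is for):

from typing import List

from openai import OpenAI  # Fireworks exposes an OpenAI-compatible API

client = OpenAI(
    api_key="YOUR_FIREWORKS_API_KEY",
    base_url="https://api.fireworks.ai/inference/v1",
)

def _chunk_text(self, text: str, chunk_size: int = 512) -> List[str]:
    # Naive fixed-size chunking by word count; production systems usually
    # chunk by tokens with overlap
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def generate_embeddings_batch(self, chunks: List[str], prefix: str = "") -> List[List[float]]:
    # One API call embeds the whole batch of chunks
    response = client.embeddings.create(
        model="nomic-ai/nomic-embed-text-v1.5",  # assumed embedding model
        input=[prefix + chunk for chunk in chunks],
    )
    return [item.embedding for item in response.data]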
Efficient storage and retrieval of vector embeddings ensure that enterprise systems can handle queries at scale without compromising performance. MongoDB Atlas provides a scalable, high-performance solution for this task.
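Before anything can be retrieved, documents and their embeddings have to be written to Atlas. A minimal write path with pymongo (the database and collection names are assumptions; the text, embeddings, and metadata fields match the query code below):

from pymongo import MongoClient

mongo = MongoClient("YOUR_ATLAS_CONNECTION_STRING")
collection = mongo["earnings_ai"]["documents"]  # assumed names

def store_document(doc: dict, embedding: list) -> None:
    # One record per document: raw text, its vector, and file metadata
    collection.insert_one({
        "text": doc["text"],
        "embeddings": embedding,
        "metadata": doc["metadata"],
    })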
from typing import Dict, List

def query_similar(self, query_embedding: List[float], limit: int = 5, filters: Dict = None) -> List[Dict]:
    # Run an Atlas $vectorSearch and return the top matches with their scores
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embeddings",
                # Search more candidates than needed for better recall
                "numCandidates": limit * 10,
                "limit": limit
            }
        }
    ]
    if filters:
        pipeline.append({"$match": filters})
    pipeline.append({
        "$project": {
            "text": 1,
            "metadata": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    })
    # self.collection is the pymongo collection holding the documents
    return list(self.collection.aggregate(pipeline))
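$vectorSearch requires a vector index named vector_index on the embeddings field. A sketch of how it could be created with pymongo 4.7+ (the dimension count assumes 768-dimensional nomic-embed vectors; the index can also be defined in the Atlas UI):

from pymongo.operations import SearchIndexModel

index = SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embeddings",    # field queried by $vectorSearch above
                "numDimensions": 768,    # assumption: nomic-embed-text-v1.5
                "similarity": "cosine",
            }
        ]
    },
)
collection.create_search_index(model=index)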
The intelligence of the RAG pipeline lies in its ability to synthesize information and generate coherent, actionable answers. Fireworks AI powers this layer, ensuring low-latency, high-accuracy responses. By integrating with MongoDB Atlas for semantic retrieval, Fireworks AI orchestrates an end-to-end workflow where user queries are processed efficiently and contextually enriched responses are returned.
When a user submits a query, the RAG system performs the following steps (a sketch of the flow follows the list):
1. Embed the query with the same embedding model used for the documents.
2. Run a $vectorSearch against MongoDB Atlas to retrieve the most relevant chunks.
3. Assemble the retrieved text and metadata into a context prompt.
4. Generate a grounded answer with a Fireworks-hosted LLM, returning the source and similarity score alongside it.
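A condensed sketch of that flow, assuming embedder and store are instances of the embedding and storage classes above, and a Fireworks-hosted Llama model (the model name is illustrative):

from fireworks.client import Fireworks

llm = Fireworks(api_key="YOUR_FIREWORKS_API_KEY")

def answer_query(query: str) -> dict:
    # Steps 1-2: embed the query and retrieve similar chunks from Atlas
    query_embedding = embedder.generate_document_embedding(query, prefix="search_query: ")
    matches = store.query_similar(query_embedding, limit=5)

    # Step 3: assemble the retrieved text into a context prompt
    context = "\n\n".join(m["text"] for m in matches)

    # Step 4: generate a grounded answer with a Fireworks LLM
    response = llm.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # illustrative
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [m["metadata"]["filename"] for m in matches],
        "scores": [m["score"] for m in matches],
    }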
Including the source and confidence score in the response provides critical benefits: answers stay traceable to the originating documents, low-confidence results can be flagged or filtered before they reach users, and analysts can audit exactly which evidence drove each conclusion.
To highlight the system’s efficiency: audio alone is transcribed at roughly one hour of content every 3 seconds, and Fireworks achieves this level of performance through advanced quantization techniques and hardware optimizations, ensuring top-tier throughput and low cost without sacrificing accuracy.
The future of enterprise RAG systems lies in expanding capabilities while maintaining efficiency. Here’s how Fireworks AI and MongoDB Atlas are set to evolve:
Imagine a system where specialized agents handle specific types of data: for instance, a financial agent tuned for earnings calls and reports, a legal agent for contracts and filings, and a technical agent for product specifications.
These agents would collaborate within a unified architecture, enabling seamless cross-domain queries.
Structured data, such as tables and charts, remains a challenge for many RAG systems. Future iterations will integrate advanced table-parsing models, enabling queries that reach directly into financial tables and charts instead of treating them as opaque blobs of text.
Organizations often operate in competitive landscapes where comparative analysis is key. Multi-tenant RAG systems will allow companies to analyze their own documents alongside competitors’ public filings, while keeping each tenant’s data strictly isolated.
Fireworks AI will continue to push the boundaries of scalability through further advances in quantization, hardware optimization, and inference serving.
The Fireworks AI + MongoDB Atlas stack revolutionizes how enterprises extract insights from unstructured data. By leveraging cutting-edge transcription, retrieval, and generation technologies, organizations can accelerate decision-making, reduce time-to-insight, and unlock the full potential of their data.
Ready to transform your data strategy? Start building on fireworks.ai