Infrastructure that enables easy access to unstructured data through AI interfaces is the best investment any team can make in their future. Building these systems comes with many challenges. In this post, we discuss how we accurately answer questions across thousands of documents with different formats and content.
It’s about halfway through 2024, and almost a year since our Baseplate Hacker News launch. The majority of AI applications I see today use some form of RAG, or Retrieval Augmented Generation, which is just a fancy term for searching for information and then stuffing the results into an LLM’s prompt. I look back at that launch video and see some of the early ideas that inspired our multi-modal knowledge base today, such as image embeddings with CLIP. The tools we can use to achieve our vision have improved a lot in the past year. Below are some techniques we’ve used to build the multi-modal knowledge base for DealPage.
The basic building blocks of our system will be familiar if you’ve seen other RAG systems: ingestion & chunking, search & reranking, and generation.
How It Works at DealPage
We manage hundreds of knowledge bases for our customers. They contain a number of different file formats, ranging from spreadsheets to email threads to presentations. Users (and our AI agent) need to be able to find the right content at the right time to complete their work.
This makes effective handling of different file formats a huge priority for us. To make it even more complex, each customer has hundreds of separate “Deal Libraries” that help them organize their customer-facing documents like contracts.
Before You Get Started
It’s important to have a set of evals you can run. There are infinite setups you can use for search, and only a few will actually work for your use case. Testing objectively against a test dataset is really, really important as you iterate.
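As a simplified illustration, a retrieval eval can be as small as a recall@k check over a labeled set of queries; the test-set format and the `search` function below are placeholders, not our actual harness.

```python
# Minimal recall@k sketch; the test set and `search` function are illustrative
# placeholders, not DealPage's actual eval harness.
test_set = [
    # each case: a query plus the IDs of chunks a correct answer should cite
    {"query": "What is your data retention policy?", "relevant_ids": {"chunk_812"}},
    {"query": "Do you support SSO via SAML?", "relevant_ids": {"chunk_94", "chunk_95"}},
]

def recall_at_k(search, k: int = 10) -> float:
    """Fraction of queries where at least one relevant chunk lands in the top k."""
    hits = 0
    for case in test_set:
        results = search(case["query"], top_k=k)  # returns ranked chunk IDs
        if case["relevant_ids"] & set(results):
            hits += 1
    return hits / len(test_set)
```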
Ingestion & Chunking
We’ve experimented with various chunking, parsing, and cleaning strategies over months of iteration. We have an opinionated way of doing this across tons of document types that provides the highest accuracy and best UX for our users.
Documents: PDFs, Pptx (PowerPoint), Docx (Word)
We use Unstructured.io to parse document elements into text, tables, and images.
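For illustration, element-level parsing with the open-source unstructured library looks roughly like this; the file name and parameters are placeholders, and our production setup differs in its details.

```python
# Sketch of element-level parsing with the open-source `unstructured` library.
from unstructured.partition.auto import partition

# hi_res strategy and table inference help pull tables/images out of PDFs
elements = partition(
    filename="security_whitepaper.pdf",  # placeholder file
    strategy="hi_res",
    infer_table_structure=True,
)

texts, tables, images = [], [], []
for el in elements:
    if el.category == "Table":
        tables.append(el)   # HTML for the table is in el.metadata.text_as_html
    elif el.category == "Image":
        images.append(el)
    else:
        texts.append(el.text)
```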
We chunk the text using a technique called “semantic chunking”, which breaks documents into searchable chunks without losing their cohesiveness or meaning. A write-up from Langchain on this topic can be found here.
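To make the idea concrete, here is a minimal sketch of semantic chunking: embed consecutive sentences and start a new chunk wherever the similarity between neighbors drops. The `embed` function is a stand-in for an embedding model, and the threshold is arbitrary.

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75, max_sentences=12):
    """Group consecutive sentences, starting a new chunk when the cosine
    similarity between adjacent sentence embeddings falls below `threshold`.
    `embed` is a stand-in for an embedding model."""
    if not sentences:
        return []
    vectors = []
    for sentence in sentences:
        v = np.asarray(embed(sentence), dtype=float)
        vectors.append(v / np.linalg.norm(v))
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        similarity = float(prev_vec @ vec)
        if similarity < threshold or len(current) >= max_sentences:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```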
For tables and images, we feed them into multi-modal LLMs to generate a relevant caption. Once the captions are generated, we embed them along with the text chunks using our fine-tuned embedding model.
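As a hedged example, captioning a figure with a multi-modal LLM through the OpenAI SDK might look like the snippet below; the model choice and prompt wording are assumptions, not our exact setup.

```python
# Illustrative image captioning with a multi-modal LLM; model and prompt are
# assumptions rather than DealPage's production configuration.
from openai import OpenAI

client = OpenAI()

def caption_image(image_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write a dense, searchable caption describing this figure."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```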
Embeddings allow us to search effectively through large troves of documents based on a user’s query. This is sometimes called “Neural Search” or “Semantic Search”, because we use large AI models to encode the meaning behind a question and match it against document chunks with similar meaning.
For example, our embedding models understand that “data security” and “PII” are related terms, and we’re able to use that understanding to make our search results extremely high quality even for natural language queries with out-of-domain keywords.
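Under the hood this boils down to nearest-neighbor search over chunk embeddings. A brute-force sketch is below; in production a vector database does this work, and `embed` again stands in for the fine-tuned embedding model.

```python
import numpy as np

def top_chunks(query, chunks, embed, k=100):
    """Rank chunks by cosine similarity to the query embedding (brute force)."""
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    matrix = np.stack([np.asarray(embed(c), dtype=float) for c in chunks])
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix @ q
    order = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in order]
```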
It’s important to note that for images and tables, we only embed and search over the caption. We feed the original content back into the model during generation time to avoid losing any information.
Spreadsheets: Xlsx (Excel), .csv
We first parse the table elements from each sheet. However, since these sheets can be large (especially for security questionnaires), we further split the tables into smaller ones with 5 rows each, while preserving the column headers. For example, a table with 10 rows would be split into two smaller tables of 5 rows each, with the headers repeated in both.
These table chunks are captioned and embedded the same way as described above for PDF tables.
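A rough sketch of the splitting step with pandas, assuming an Xlsx input (a CSV would go through read_csv instead); the chunk size and output format are simplified here.

```python
import pandas as pd

def split_sheets(path: str, rows_per_chunk: int = 5):
    """Split every sheet into small tables of `rows_per_chunk` rows, each
    keeping the column headers, so large questionnaires become independently
    searchable chunks."""
    sheets = pd.read_excel(path, sheet_name=None)  # dict of sheet name -> DataFrame
    chunks = []
    for name, df in sheets.items():
        for start in range(0, len(df), rows_per_chunk):
            piece = df.iloc[start:start + rows_per_chunk]
            # headers repeat in every piece because to_html re-emits the columns
            chunks.append((name, piece.to_html(index=False)))
    return chunks
```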
URLs
For URL imports, we use Jina AI’s URL reader, which does a good job of parsing HTML from URLs into LLM-friendly markdown.
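Calling the reader is as simple as prefixing the target URL with the reader endpoint; a minimal sketch:

```python
import requests

def fetch_markdown(url: str) -> str:
    """Fetch LLM-friendly markdown for a page via Jina AI's reader endpoint."""
    response = requests.get(f"https://r.jina.ai/{url}", timeout=30)
    response.raise_for_status()
    return response.text
```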
Emails
We similarly import, transform, and clean emails from their native HTML into a markdown-esque format that is easy for our models and users to read.
Images
For JPEGs and PNGs, we use the same image captioning method described above and embed the captions.
Verified Answers
In DealPage, we allow users to add ad-hoc question/answer pairs. For these we embed a concatenated version of the question + answer as if it were a chunk.
Search & Reranking
We fine-tuned both our embedding model and our reranker on a dataset we created from our own internal documents, synthetic data, publicly available documents, and anonymized data from our design partners.
We discussed embedding models above: they convert text into large multi-dimensional vectors that can be searched through quickly. Reranking models take a chunk of text, compare it against the original query, and produce a score for how relevant that chunk is.
This yields a 10-15% improvement in the quality of the sources we use to generate answers, and allows us to show our users a confidence score for each source.
We consistently experiment with top K (the number of reranked results) and retrieve top K (the number of results before reranking), and settled on values of around 10 and 100. These provide a good balance of performance and speed.
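For illustration, reranking with a cross-encoder can look like the snippet below; the public checkpoint is only a stand-in for our fine-tuned reranker.

```python
from sentence_transformers import CrossEncoder

# Public checkpoint used as a stand-in for our fine-tuned reranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=10):
    """Score each (query, chunk) pair and keep the `top_k` most relevant chunks."""
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```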
Generation
During generation, the retrieved text is fed into the model’s system message, together with a basic prompt that gives the model this context.
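An illustrative sketch of how that system message might be assembled; the wording and chunk formatting are placeholders rather than our exact production prompt.

```python
def build_system_message(chunks):
    """Assemble a system message from retrieved chunks (illustrative wording)."""
    context = "\n\n".join(f"[Source {i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "You are a helpful assistant answering questions about the company's "
        "products and policies. Answer using only the sources below, and cite "
        "the sources you relied on.\n\n" + context
    )
```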
If a chunk is a table chunk, its HTML table is fed in instead of the caption. For images, we append a user message containing each image after generating a signed URL for it.
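An illustrative sketch of that appended message, assuming an OpenAI-style content array and presigned image URLs:

```python
def image_user_message(signed_urls):
    """Append the retrieved images as a user message (OpenAI-style content array)."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "These images were retrieved as sources:"},
            *[
                {"type": "image_url", "image_url": {"url": url}}
                for url in signed_urls
            ],
        ],
    }
```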
The Roadmap
Going forward, there are a few directions in which we plan to improve our Knowledge Base. As we generate more data, there are always further opportunities for fine-tuning.
We also want to add better support for structured queries (e.g. to SQL databases) on top of our existing spreadsheet queries.
Lastly, we want to support new modalities, specifically videos and clickthrough demos. Understanding and working with these formats is a big part of our roadmap for Paige.