This post was originally published on Variant’s blog.

Our company handbook, hosted at https://handbook.variant.no, is a resource we at Variant refer to all the time, both internally and externally. Even though it’s packed with valuable information, we recognize that it’s not always easy to get quick answers from it. So we’re trying out a little experiment: leveraging the power of Large Language Models (LLMs), specifically GPT-3.5, to answer questions about the content of our handbook.

We were inspired by Greg Richardson’s implementation of a similar feature, which he did for the Supabase documentation. You can read more about how they did it in this blog post.

In this blog post, I’ll take you through how we implemented this for our handbook using the Pinecone and OpenAI APIs.

Indexing the Handbook

To start off, we need to retrieve and index the content of the handbook somehow. Fortunately, we already have an implementation for indexing the handbook’s content, because we do this for our existing handbook search engine, which uses Algolia. The details here are not important, and might differ from how you would do it for your own website. But in essence, our search indexer runs through all the .mdx files in our handbook and retrieves all the text content, split into sections, in JSON format. For our normal search, all of these index items are then uploaded to Algolia. In my case, I wanted to store them in a vector database instead.

Why a vector database?

If you’ve used ChatGPT, you might have noticed that it remembers what you’ve said within the same conversation, but not what you’ve said in previous conversations. This is because the model is (at the moment, at least) a blank slate for each new conversation, and only knows what it was originally trained on (roughly, information from the public Internet up to and including 2021). This means that if you want to have a conversation with the model about a narrow topic or domain, and get good and up-to-date answers, you need to provide it with context. And this is where the vector database comes in.

The vector database effectively functions as a long-term memory that we can feed the LLM with. This means we can provide the model with context from our handbook, which it can use to give better answers. A vector database is a better fit here than a traditional database because it lets us quickly find texts that relate to each other. Why that matters will become clearer later.

Saving the index

I chose Pinecone as my vector database, mainly because it’s a managed service and I didn’t want to spend too much time setting up and maintaining a database. But there are other alternatives available as well, as listed in the OpenAI cookbook for vector databases.

What I want to do in this case is save the different index items to Pinecone. Each index item is a partial section from our handbook and looks something like this:

{
  "title": "En variants håndbok",
  "url": "https://handbook.variant.no/#en-variants-håndbok",
  "content": "Om du ikke er en variant men liker det du leser,\n ta en titt på ledige stillinger hos oss. Mer info\nom oss på nettsiden vår .",
  "department": ["Trondheim", "Oslo", "Bergen", "Molde"]
}

The problem though is that vectors are essentially just arrays with floating-point numbers in them. So how do we represent a piece of text as a vector? In order to do that, we’ll have to convert the content to an “embedding”. Embeddings, in the context of machine learning, are a way to represent complex data, like words, sentences, or even images, as points in a multi-dimensional space (a vector). The magic of embeddings is that they can arrange words (or other data) in this multi-dimensional space so that similar words are close together, and dissimilar words are far apart. This allows us to more easily identify relationships between words and sentences with similar semantic meaning, just by comparing the distance between vectors.
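To make the idea of “distance between vectors” concrete, here is a minimal sketch in plain JavaScript. The three-dimensional vectors are made up for illustration (real embeddings from the API have 1536 dimensions), but the cosine similarity function is the same measure Pinecone uses when the index metric is set to cosine:

```javascript
// Cosine similarity: 1 means same direction (very similar),
// 0 means unrelated, -1 means opposite.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" — invented values for illustration only.
const office = [0.9, 0.1, 0.0];
const workplace = [0.8, 0.2, 0.1];
const banana = [0.0, 0.1, 0.95];

// Semantically similar words end up closer together:
console.log(cosineSimilarity(office, workplace) > cosineSimilarity(office, banana)); // true
```

Querying a vector database is essentially this comparison, done efficiently across the whole index at once.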

Luckily for us, the OpenAI API has an endpoint for creating embeddings. Using the Node.js library for the OpenAI API, I can take the content field from the section above and create an embedding for it like this:

import { Configuration, OpenAIApi } from "openai";

const content = index[0].content;
const configuration = new Configuration({
  apiKey: openAIApiKey,
});
const openaiClient = new OpenAIApi(configuration);
const embeddingResponse = await openaiClient.createEmbedding({
  model: "text-embedding-ada-002",
  input: content,
});
const [{ embedding }] = embeddingResponse.data.data;

This will create a vector (an array of floating-point numbers), which is the embedding I can save to my vector database. Since the embeddings created by the text-embedding-ada-002 model have 1536 output dimensions, the index in Pinecone must be created with support for exactly 1536 dimensions. For Pinecone, this can be done through a simple API call:

curl --location 'https://controller.eu-west4-gcp.pinecone.io/databases' \
--header 'Api-Key: <your-api-key>' \
--header 'accept: text/plain' \
--header 'content-type: application/json' \
--data '
  {
    "metric": "cosine",
    "pods": 1,
    "replicas": 1,
    "pod_type": "p2.x1",
    "metadata_config": {
      "indexed": ["department"]
    },
    "dimension": 1536,
    "name": "handbook-index"
  }
  '

In this case, I’ve also specified that the department field should be indexed, so that I can filter the results by department later. This also ensures that no other metadata fields are indexed, which saves memory and makes queries faster.

As noted earlier, the index splits each section into multiple parts. This is probably a good thing for the queries later, since embeddings of shorter, more focused passages tend to match a query more precisely than embeddings of long ones. So instead of saving each entire section as one vector in Pinecone, there are several small parts. But each item is stored with the full content of its section as metadata, so that I can retrieve the full content if a query hits any part of the section.
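The actual splitting happens in our search indexer, so it isn’t shown here, but the idea can be sketched like this. Note that splitting on sentence boundaries with a fixed chunk size is an illustrative assumption, not our real implementation:

```javascript
// Illustrative sketch: split a section into chunks of a few sentences each,
// keeping the full section text with every chunk so it can be retrieved later.
function splitSection(section, sentencesPerChunk = 2) {
  const sentences = section.content.split(/(?<=[.!?])\s+/);
  const chunks = [];
  for (let i = 0; i < sentences.length; i += sentencesPerChunk) {
    chunks.push({
      title: section.title,
      url: section.url,
      content: sentences.slice(i, i + sentencesPerChunk).join(" "),
      fullContent: section.content, // full section, stored as metadata
    });
  }
  return chunks;
}
```

Each chunk gets its own embedding, but they all carry the same fullContent, which is what we hand to the LLM later.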

So when saving the items, I take the embedding created above by the OpenAI API and save it, along with metadata, to Pinecone using their Node.js library:

const upsertRequest = {
  vectors: [
    {
      id: inputChecksum,
      values: embedding, // the embedding from earlier
      metadata: {
        title,
        content,       // the content the embedding is created from
        fullContent,   // the full content of the entire section
        url,
        department,
      },
    },
  ],
  namespace: "handbook-namespace",
};
await pineconeIndex.upsert({ upsertRequest });

Just to summarise what we’ve done so far:

  1. Retrieve the handbook content as sections, split into smaller parts.
  2. Create an embedding for each part with the OpenAI embeddings API.
  3. Upsert each embedding to Pinecone, along with the section’s metadata.

Now we have a vector database index that can be queried for relevant sections in the handbook, based on the questions being asked.

Retrieving the relevant sections

Since the input to ask questions about the handbook is open to anyone, we have to take extra care that we do not prompt GPT-3.5 with questions that do not comply with OpenAI’s usage policies. To ensure compliance, we can utilize their free moderation endpoint, which verifies whether a question aligns with their guidelines. So when the user asks a question, we first check if it complies with OpenAI’s usage policies:

const moderationResponse = await openai.createModeration({ input: question });
const [results] = moderationResponse.data.results;
if (results.flagged) {
  throw new Error("Doesn't comply with OpenAI usage policy");
}

If it passes, the next step is to query the vector database for relevant sections in the handbook. However, the question must first be transformed into an embedding. The idea here is to convert the question to a vector, which we can then query the Pinecone database with. This will allow us to find related sections in the handbook, just by comparing the distance between the section-vectors and the question-vector.

So to achieve this, we create an embedding for the question, as we did for the handbook sections:

const embeddingResponse = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: question,
});
const [{ embedding }] = embeddingResponse.data.data;

With the question converted, we can now query the vector database for relevant and related handbook sections:

const queryRequest: QueryRequest = {
  vector: embedding,
  topK: 5,
  includeValues: false,
  includeMetadata: true,
  namespace: "handbook-namespace",
};
const queryResponse = await index.query({ queryRequest });
const uniqueFullContents = queryResponse.matches
  .map((m) => m.metadata)
  .map((m) => m.fullContent)
  .reduce(reduceToUniqueValues, []);

The query above returns the top 5 most relevant sections in the handbook, based on the question. We make sure to filter out duplicates, since the sections are split into multiple parts and we don’t want to prompt GPT-3.5 with the same section more than once.
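The reduceToUniqueValues helper isn’t shown above; a minimal version (my sketch — the actual helper in our repository may differ) is just a reducer that drops values it has already seen:

```javascript
// Reducer that keeps only the first occurrence of each value,
// preserving the order returned by the similarity search.
const reduceToUniqueValues = (acc, value) =>
  acc.includes(value) ? acc : [...acc, value];

const sections = ["A", "B", "A", "C", "B"];
console.log(sections.reduce(reduceToUniqueValues, [])); // [ 'A', 'B', 'C' ]
```

Preserving order matters here, since the matches come back sorted by relevance.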

Why use an LLM?

Now you might be asking yourself: why prompt GPT-3.5 for an answer when we’ve already pulled the relevant sections out of the database? Because the LLM can summarise the relevant sections and answer succinctly with regard to your question. The alternative would be to print out all the section contents and let you read through them to find the answer yourself, but I don’t find that very satisfying.

The prompt

Much can be said about how to construct a good prompt for GPT-3.5, but I’ll keep it short here. The prompt is constructed by combining the question with the relevant sections from the handbook, then sent to GPT-3.5 for completion:

const prompt = `
You are a very enthusiastic Variant representative who loves to help people! Given the following sections from the Variant handbook, answer the question using only that information. If you are unsure and the answer is not written in the handbook, say "Sorry, I don't know how to help with that." Please do not write URLs that you cannot find in the context section.

Context section:
${uniqueFullContents.join("\n---\n")}

Question: """
${question}
"""
`;

In addition to giving it the relevant sections from the handbook, we also set a tone-of-voice and some preconditions on how to answer the question. And when not to try to answer, for that matter!
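One practical detail the prompt glosses over is that the question and all the context sections have to fit within the model’s context window, alongside the tokens reserved for the answer. A rough sketch of trimming the context to a budget — the roughly-4-characters-per-token estimate is a crude heuristic and an assumption on my part, not our actual implementation:

```javascript
// Crude token estimate: English text averages roughly 4 characters per token.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Keep the most relevant sections (they arrive sorted by relevance)
// until the token budget is exhausted.
function fitToBudget(sections, maxTokens) {
  const kept = [];
  let used = 0;
  for (const section of sections) {
    const cost = estimateTokens(section);
    if (used + cost > maxTokens) break;
    kept.push(section);
    used += cost;
  }
  return kept;
}
```

Since the matches are sorted by relevance, truncating from the end drops the least relevant context first.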

Now, finally, we’re ready to ask GPT-3.5 for an answer:

const completionOptions: CreateCompletionRequest = {
  model: 'text-davinci-003',
  prompt,
  max_tokens: 512,
  temperature: 0,
  stream: false,
};
const res = await openai.createCompletion(completionOptions);
const { choices } = res.data;
const answer = choices[0].text;
console.log(answer); // or display it in the UI of your choice

Keep in mind that GPT-3.5’s responses are not fully deterministic, so they may vary slightly between runs even with the temperature set to 0. However, GPT-3.5 is adept at generating accurate answers when given enough context to do so.

So to summarise what we do when the user asks a question:

  1. Inspect the question for flagged content.
  2. Generate an embedding using the question text.
  3. Query the vector database for relevant handbook content.
  4. Create a natural language prompt containing the question and relevant content, providing sufficient context for GPT-3.5.
  5. Submit the prompt to GPT-3.5 to receive an answer.

And those are the very basics of how we built an LLM integration for our handbook, based on Pinecone and the APIs from OpenAI. The full implementation, with all the details, can be viewed in the open source repository for our handbook. The most pertinent files are likely generate-embeddings.mjs, for how we do indexing and insertion into Pinecone, and openai-data.ts, for how we handle user queries.