RAG on Multiple Terms and Conditions Documents Varying By Geography¶


In this demo, we build a RAG pipeline over policy documents that vary by geography, using car rental terms and conditions for several US states.

Approach:

Label documents during ingestion
Propagate the labels on the documents all the way into the vector store
During retrieval, make the LLM generate label filters based on the question
Pass the label filters into the vector store for retrieval
Make the LLM cite the sources of the response during response synthesis
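
To make the idea behind this approach concrete, here is a minimal, self-contained sketch of label-filtered retrieval. It does not use Indexify; the `Chunk` class, the `filtered_search` function, and the two-dimensional embeddings are hypothetical stand-ins chosen only for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Chunk:
    """A toy stand-in for an indexed chunk: text, embedding, and labels."""
    text: str
    embedding: list[float]
    labels: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_embedding, label_filter, top_k=5):
    # Step 1: keep only chunks whose labels match every key in the filter.
    candidates = [
        c for c in chunks
        if all(c.labels.get(k) == v for k, v in label_filter.items())
    ]
    # Step 2: rank the surviving chunks by similarity to the query.
    candidates.sort(key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return candidates[:top_k]


chunks = [
    Chunk("CA late return rule", [1.0, 0.0], {"state": "california"}),
    Chunk("HI late return rule", [0.9, 0.1], {"state": "hawaii"}),
    Chunk("CA fuel policy", [0.0, 1.0], {"state": "california"}),
]
results = filtered_search(chunks, [1.0, 0.0], {"state": "hawaii"})
print([c.text for c in results])
```

Only the Hawaii chunk is returned, even though a California chunk is closer to the query embedding. That is the point of filtering on labels before ranking: a question about one state should never be answered from another state's terms.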

Setup¶

%pip install indexify indexify-extractor-sdk openai

# Download Indexify Server
!curl https://getindexify.ai | sh

# Install Poppler (required for PDF extraction)
# You can use brew on MacOS.
!sudo apt-get install -y poppler-utils

# Download Extractors
!indexify-extractor download tensorlake/chunk-extractor
!indexify-extractor download tensorlake/minilm-l6
!indexify-extractor download tensorlake/pdf-extractor

After installing the libraries and downloading the server and the extractors, restart the runtime. Then start the Indexify server and the extractors.

Open 2 terminals and run the following commands:

# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server

Create Extraction Policies¶

Instantiate the Indexify Client

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()

First, create an extraction graph whose policies extract the text from each PDF, split that text into chunks, and compute an embedding for each chunk.

extraction_graph_spec = """
name: 'knowledgebase'
extraction_policies:
  - extractor: 'tensorlake/pdf-extractor'
    name: 'pdfextractor'
  - extractor: 'tensorlake/chunk-extractor'
    name: 'chunks'
    content_source: 'pdfextractor'
    input_params:
      chunk_size: 512
      overlap: 150
  - extractor: 'tensorlake/minilm-l6'
    name: 'terms'
    content_source: 'chunks'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
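
To get a feel for what `chunk_size: 512` and `overlap: 150` imply, here is a minimal sliding-window sketch. It is not the implementation of `tensorlake/chunk-extractor`; it assumes sizes are measured in characters and that chunks are cut at fixed offsets, whereas the real extractor may count tokens or split on separators. The `chunk_text` function is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 150) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1000
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With these settings each new chunk starts 362 characters after the previous one, so a sentence that straddles a chunk boundary is likely to appear whole in at least one chunk.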

Upload the PDF Files¶

import requests

# Download the Sixt terms and conditions for three states.
req = requests.get("https://www.sixt.com/shared/t-c/sixt_US_en_CALIFORNIA.pdf")
req1 = requests.get("https://www.sixt.com/shared/t-c/sixt_US_en_HAWAII.pdf")
req2 = requests.get("https://www.sixt.com/shared/t-c/sixt_US_en_ILLINOIS.pdf")

# Write each response to its own file.
with open('sixt_US_en_CALIFORNIA.pdf', 'wb') as f:
    f.write(req.content)

with open('sixt_US_en_HAWAII.pdf', 'wb') as f:
    f.write(req1.content)

with open('sixt_US_en_ILLINOIS.pdf', 'wb') as f:
    f.write(req2.content)

content_id_ca = client.upload_file('knowledgebase', path="sixt_US_en_CALIFORNIA.pdf", labels={"state": "california"})
client.wait_for_extraction(content_id_ca)

content_id_ha = client.upload_file('knowledgebase', path="sixt_US_en_HAWAII.pdf", labels={"state": "hawaii"})
client.wait_for_extraction(content_id_ha)

content_id_il = client.upload_file('knowledgebase', path="sixt_US_en_ILLINOIS.pdf", labels={"state": "illinois"})
client.wait_for_extraction(content_id_il)
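
With more than a handful of files, typing each label by hand becomes error-prone. The sketch below derives the label from the file name instead. It assumes every file follows the `sixt_US_en_<STATE>.pdf` pattern used in this demo and that underscores inside the state part stand for spaces; the `state_label_from_filename` helper is hypothetical.

```python
from pathlib import Path

PREFIX = "sixt_US_en_"  # assumed naming convention, taken from the files in this demo


def state_label_from_filename(path: str) -> dict[str, str]:
    """Build a {"state": ...} label from a file named sixt_US_en_<STATE>.pdf."""
    stem = Path(path).stem  # e.g. "sixt_US_en_CALIFORNIA"
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected file name: {path}")
    state = stem[len(PREFIX):].replace("_", " ").lower()
    return {"state": state}


for name in ["sixt_US_en_CALIFORNIA.pdf", "sixt_US_en_HAWAII.pdf", "sixt_US_en_ILLINOIS.pdf"]:
    print(name, state_label_from_filename(name))
```

The returned dictionary can be passed as the `labels=` argument of `client.upload_file`, which keeps the lowercase label values consistent with the predicates the LLM is asked to generate later.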

What is happening behind the scenes¶

Indexify reacts to ingestion events by evaluating the existing extraction policies and triggering the extractors they name. In this graph, once the PDF extractor has extracted text, bytes, and JSON from a document, Indexify automatically runs the chunk extractor on that output and then the embedding extractor on the chunks, and populates an index with the embeddings.

You can upload hundreds of PDF files at once, and Indexify extracts and indexes their contents without manual intervention. To speed up extraction, deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the work among them transparently.
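
The order in which the extractors run follows from the `content_source` fields of the extraction graph. The sketch below illustrates that dependency chain with plain Python; it is not how Indexify's scheduler is implemented, and the `execution_order` function is hypothetical.

```python
from collections import deque

# The same three policies as in the extraction graph, written as dictionaries.
policies = [
    {"name": "pdfextractor", "extractor": "tensorlake/pdf-extractor"},
    {"name": "chunks", "extractor": "tensorlake/chunk-extractor", "content_source": "pdfextractor"},
    {"name": "terms", "extractor": "tensorlake/minilm-l6", "content_source": "chunks"},
]


def execution_order(policies):
    """Return policy names in the order their inputs become available."""
    # Group policy names by the policy whose output they consume.
    # Policies without a content_source consume the ingested file itself (key None).
    by_source = {}
    for policy in policies:
        by_source.setdefault(policy.get("content_source"), []).append(policy["name"])

    order = []
    queue = deque(by_source.get(None, []))
    while queue:
        name = queue.popleft()
        order.append(name)
        # Finishing this policy unblocks every policy that reads its output.
        queue.extend(by_source.get(name, []))
    return order


print(execution_order(policies))
```

Each policy becomes runnable as soon as the policy named in its `content_source` has produced output, which is why uploading a file is the only manual step.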

Perform RAG¶

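Initialize the OpenAI client and define a function that asks the LLM for a label filter, searches the index with that filter, and then asks the LLM to answer the question with citations.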

import os
from openai import OpenAI

oai_client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

def answer_question(question) -> str:
    # Step 1: ask the LLM to turn any US state named in the question into a label predicate.
    chat_completion = oai_client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f"given the question {question}, if there is the name of a US state, generate a predicate such as state=texas or state=new york. The predicate name and value should be in lowercase.",
            }
        ],
        model="gpt-3.5-turbo",
    )
    query_filter = chat_completion.choices[0].message.content.strip()

    # Step 2: retrieve only the chunks whose labels match the predicate.
    search_results = client.search_index("knowledgebase.terms.embedding", question, top_k=5, filters=[query_filter])

    # Step 3: pair each chunk with its content_id so the answer can cite it.
    context = ""
    for result in search_results:
        context += f"content_id: {result['content_id']}\n text: {result['text']}\n"

    # Step 4: ask the LLM to answer from the context and cite the content_ids.
    chat_completion = oai_client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f" Answer the question based on the context provided below and provide citation in the response as 'Citation: '. The context has the citation to content_ids and the text below it. \n Context: {context} \n \n Question: {question}",
            }
        ],
        model="gpt-3.5-turbo",
    )
    answer = chat_completion.choices[0].message.content
    print(answer)
    return answer
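
The `query_filter` string comes straight from the LLM, so it may contain extra words, capital letters, or a state that was never ingested. The sketch below shows one way to validate it before it reaches the vector store. The `parse_state_filter` helper and the `KNOWN_STATES` set are hypothetical, and treating an empty filter list as "no filtering" is an assumption about `search_index`, not documented behavior.

```python
import re

# States that were uploaded with a "state" label in this demo.
KNOWN_STATES = {"california", "hawaii", "illinois"}

# Matches "state=<known state>", tolerating spaces and optional quotes.
_STATE_PATTERN = re.compile(
    r"state\s*=\s*['\"]?(" + "|".join(sorted(map(re.escape, KNOWN_STATES))) + r")\b"
)


def parse_state_filter(llm_output: str) -> list[str]:
    """Return ["state=<name>"] if the output names a known state, else []."""
    match = _STATE_PATTERN.search(llm_output.lower())
    if not match:
        return []
    return [f"state={match.group(1)}"]


print(parse_state_filter("state=california"))
print(parse_state_filter("The predicate is State = 'Hawaii'."))
print(parse_state_filter("state=texas"))
```

Restricting the result to known label values keeps a malformed or unexpected reply from silently turning into a filter that matches nothing.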

answer_question("If I rent a car from Sixt in California, how many days do I have to return the vehicle before being considered overdue?")

answer_question("If I rent a car from Sixt in Hawaii, how many days do I have to return the vehicle before being considered overdue?")