Setup

%pip install indexify indexify-extractor-sdk indexify-langchain

Download Indexify Server

Download Extractors

!indexify-extractor download tensorlake/chunk-extractor
!indexify-extractor download tensorlake/minilm-l6
!indexify-extractor download tensorlake/marker

After installing the libraries and downloading the server and the extractors, restart the runtime. Then run the Indexify server together with the extractors.

Open 2 terminals and run the following commands:

# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
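
Before moving on, it can help to confirm that the server is accepting connections. The sketch below is a plain TCP check and not part of the Indexify API; the helper name is_port_open is hypothetical, and the host and port (localhost:8900) are assumed defaults, so adjust them if your server listens elsewhere.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Assumption: the Indexify server listens on localhost:8900.
print("server reachable:", is_port_open("localhost", 8900))
```

If this prints False, check that the command in Terminal 1 is still running before starting the extractors.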

Download the Rental PDF

import requests
req = requests.get("https://www.timescar-rental.com/pdf/agreement/en_agreement_200401.pdf")

with open('en_agreement_200401.pdf','wb') as f:
    f.write(req.content)
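
A failed download can leave an HTML error page saved under the .pdf name, so a quick sanity check is worthwhile before running any extractor. This is a minimal sketch with a hypothetical helper, looks_like_pdf, which only checks that the file starts with the PDF magic bytes; it does not validate the rest of the file.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every PDF file begins with the ASCII marker "%PDF-".
        return f.read(5) == b"%PDF-"


print(looks_like_pdf("en_agreement_200401.pdf"))
```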

Test the extractors

We will try MarkdownExtractor first. It extracts all the content of a document in one pass and hands it to the next extractors in the chain as a Markdown-formatted document, which can then be used for question answering.

from indexify_extractor_sdk import load_extractor, Content

mdextractor, mdconfig_cls = load_extractor("indexify_extractors.markdown.markdown_extractor:MarkdownExtractor")
content = Content.from_file("en_agreement_200401.pdf")

md_result = mdextractor.extract(content)
# Take the first extracted item and decode its bytes as UTF-8 text.
text_content = next(item.data.decode('utf-8') for item in md_result)
text_content

Create Extraction Graph

Instantiate the Indexify Client

from indexify import IndexifyClient
client = IndexifyClient()

First, create an extraction graph whose policies convert the rental PDF to Markdown, split the Markdown into chunks, and compute an embedding for each chunk.

from indexify import ExtractionGraph

extraction_graph_spec = """
name: "knowledgebase"
extraction_policies:
  - extractor: "tensorlake/markdown-extractor"
    name: "md-extraction"

  - extractor: "tensorlake/chunk-extractor"
    name: "chunks"
    content_source: "md-extraction"
    input_params:
      chunk_size: 512
      overlap: 150

  - extractor: "tensorlake/minilm-l6"
    name: "get-embeddings"
    content_source: "chunks"
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
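
To build intuition for chunk_size and overlap, here is a minimal sliding-window chunker. It is only an illustration: it counts characters, and the real tensorlake/chunk-extractor may count differently or split on other boundaries. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 150) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# With the values from the graph, 1000 characters give windows
# starting at offsets 0, 362, and 724.
print(len(chunk_text("x" * 1000)))  # 3
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.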

Upload PDF File

cid = client.upload_file("knowledgebase", path="en_agreement_200401.pdf")
client.wait_for_extraction(cid)

What is happening behind the scenes

Indexify responds to each ingestion event by evaluating the existing policies and triggering the extractors they name. In this graph, once the Markdown extractor has finished converting the PDF, its output is passed automatically to the chunk extractor, and the resulting chunks are passed to the embedding extractor, which computes embeddings and populates an index.
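
The chaining is expressed through content_source: a policy without one reads the uploaded content, and any other policy reads the output of the policy it names. The sketch below resolves that order for the three policies in this graph. It is a plain-Python illustration with a hypothetical helper, execution_order, and makes no claim about how Indexify's scheduler is implemented.

```python
def execution_order(policies: list[dict]) -> list[str]:
    """Order policy names so that each one comes after its content_source."""
    by_name = {p["name"]: p for p in policies}
    ordered: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in ordered:
            return
        if name in visiting:
            raise ValueError(f"cycle detected at policy {name!r}")
        visiting.add(name)
        source = by_name[name].get("content_source")
        if source is not None:
            if source not in by_name:
                raise KeyError(f"unknown content_source {source!r}")
            visit(source)  # the upstream policy must run first
        visiting.discard(name)
        ordered.append(name)

    for policy in policies:
        visit(policy["name"])
    return ordered


# The same policies as in the graph, deliberately listed out of order.
policies = [
    {"name": "get-embeddings", "content_source": "chunks"},
    {"name": "chunks", "content_source": "md-extraction"},
    {"name": "md-extraction"},
]
print(execution_order(policies))
# ['md-extraction', 'chunks', 'get-embeddings']
```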

You can upload hundreds of rental PDF files at once, and Indexify will extract and index their contents without manual intervention. To speed up extraction, deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the workload among them transparently.

Perform RAG

Initialize the LangChain retriever.

from indexify_langchain import IndexifyRetriever
params = {"name": "get-embeddings.embedding", "top_k": 3}
retriever = IndexifyRetriever(client=client, params=params)
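
Setting top_k to 3 asks the index for the three chunks whose embeddings are closest to the query embedding. Here is a minimal sketch of that idea in pure Python, using cosine similarity over toy 2-D vectors. The real index and its distance metric are managed by Indexify, so the functions, labels, and vectors here are illustrative assumptions only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (label, score) pairs most similar to the query, best first."""
    scored = [(label, cosine(query, vec)) for label, vec in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for chunk vectors.
docs = {
    "insurance": [1.0, 0.1],
    "fuel": [0.1, 1.0],
    "damages": [0.9, 0.3],
    "hours": [-1.0, 0.0],
}
query = [1.0, 0.2]
print([label for label, _ in top_k(query, docs)])
# ['insurance', 'damages', 'fuel']
```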

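Now create a chain that prompts OpenAI with the data retrieved from Indexify, giving a simple Q&A bot. ChatOpenAI comes from the langchain-openai package and reads the API key from the OPENAI_API_KEY environment variable.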

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
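
In the chain, the dict step runs the retriever on the question to produce the context and passes the question through unchanged; both values then fill the template. The sketch below reproduces only that prompt-assembly step in plain Python, without LangChain or a model call. build_prompt and the sample chunk are hypothetical, and LangChain's own rendering of retrieved documents into the prompt may differ.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and fill both template slots."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Hypothetical retrieved chunk, for illustration only.
sample_chunks = ["The Renter and the Driver shall be responsible for damages."]
print(build_prompt(sample_chunks, "Who is responsible for damages?"))
```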

Now ask any question about the ingested rental PDF:

chain.invoke("Who will be responsible for damages not compensated by the insurance?")
# The Renter and the Driver shall be responsible for damages not compensated for by the insurance benefit or compensation.