PDF QA
In [ ]:
%pip install -q indexify-extractor-sdk
- Download the required extractors:
  - tensorlake/minilm-l6: an embedding extractor based on the MiniLM-L6 model.
  - tensorlake/chunk-extractor: a text chunking extractor.
  - tensorlake/marker: a PDF-to-markdown extractor.
In [ ]:
!indexify-extractor download tensorlake/minilm-l6
!indexify-extractor download tensorlake/chunk-extractor
!indexify-extractor download tensorlake/marker
- Start the Indexify extractor server in a separate terminal using the indexify-extractor join-server command.
In [ ]:
!indexify-extractor join-server
- Install the indexify package using pip.
In [ ]:
%pip install -q indexify
Indexify Client Setup
- Import the IndexifyClient class from the indexify package.
- Create an instance of IndexifyClient called client.
In [4]:
from indexify import IndexifyClient
client = IndexifyClient()
Create an Extraction Graph
- Import the ExtractionGraph class from the indexify package.
- Define the extraction graph specification in YAML format:
  - Set the name of the extraction graph to "pdfqa".
  - Define the extraction policies:
    - Use the "tensorlake/marker" extractor for PDF-to-markdown conversion and name it "mdextract".
    - Use the "tensorlake/chunk-extractor" for text chunking and name it "chunker", with input parameters:
      - chunk_size: 1000 (size of each text chunk)
      - overlap: 100 (overlap between consecutive chunks)
      - content_source: "mdextract" (source of content for chunking)
    - Use the "tensorlake/minilm-l6" extractor for embedding and name it "pdfembedding", with content source "chunker".
- Create an ExtractionGraph object from the YAML specification using ExtractionGraph.from_yaml().
- Create the extraction graph on the Indexify client using client.create_extraction_graph().
In [13]:
from indexify import ExtractionGraph
extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
- extractor: 'tensorlake/marker'
name: 'mdextract'
- extractor: 'tensorlake/chunk-extractor'
name: 'chunker'
input_params:
chunk_size: 1000
overlap: 100
content_source: 'mdextract'
- extractor: 'tensorlake/minilm-l6'
name: 'pdfembedding'
content_source: 'chunker'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
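To build intuition for the chunker's parameters, here is a minimal sketch of sliding-window chunking with chunk_size=1000 and overlap=100. This is illustrative only; the actual tensorlake/chunk-extractor may split on token or sentence boundaries rather than raw characters:

# Illustrative sketch only, not the tensorlake/chunk-extractor implementation.
def sketch_chunks(text: str, chunk_size: int = 1000, overlap: int = 100):
    # Each chunk starts chunk_size - overlap characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

print([len(c) for c in sketch_chunks("x" * 2500)])  # [1000, 1000, 700]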
Document Ingestion
- Add the PDF document to the "pdfqa" extraction graph using client.upload_file().
- Wait for extraction to complete using client.wait_for_extraction().
In [ ]:
content_id = client.upload_file("pdfqa", "chess.pdf")
client.wait_for_extraction(content_id)
Context Retrieval Function
- Define a function called get_context that takes a question, an index name, and top_k as parameters.
- Search the specified index using client.search_index() with the given question and top_k.
- Concatenate the retrieved passages into a single context string.
- Return the context string.
In [32]:
def get_context(question: str, index: str, top_k=3):
    # Retrieve the top_k most relevant passages for the question
    # (the original hard-coded top_k=3 here, ignoring the parameter).
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context
Prompt Creation Function
- Define a function called create_prompt that takes a question and context as parameters.
- Create a prompt string that includes the question and the context.
- Return the prompt string.
In [33]:
def create_prompt(question, context):
return f"Answer the question, based on the context.\n question: {question} \n context: {context}"
Question Answering
- Define a question string.
- Call the get_context function with the question, the index name ("pdfqa.pdfembedding.embedding"), and top_k (default 3) to retrieve the relevant context. Note that the index name is derived from the extraction graph name ("pdfqa") and the embedding policy name ("pdfembedding").
In [34]:
question = "Who is the greatest player of all time and what is his record?"
context = get_context(question, "pdfqa.pdfembedding.embedding")
question = "Who is the greatest player of all time and what is his record?"
context = get_context(question, "pdfqa.pdfembedding.embedding")
Setup OpenAI Client
- Import the OpenAI class from the openai package.
- Create an instance of the OpenAI client called client_openai with the API key.
In [ ]:
from openai import OpenAI
client_openai = OpenAI(api_key="")
Answering Question with OpenAI
- Call the create_prompt function with the question and the retrieved context to generate the prompt.
- Use the client_openai.chat.completions.create() method to send the prompt to the OpenAI API:
  - Set the model to "gpt-3.5-turbo".
  - Pass the prompt as a message with the "user" role.
- Print the generated answer from the API response.
In [ ]:
prompt = create_prompt(question, context)
chat_completion = client_openai.chat.completions.create(
messages=[
{
"role": "user",
"content": prompt,
}
],
model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
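The retrieval, prompting, and generation steps above can be composed into a single helper. answer_question below is a hypothetical convenience wrapper around the functions and clients defined in this notebook, not part of the Indexify or OpenAI APIs:

def answer_question(question: str, index: str = "pdfqa.pdfembedding.embedding", top_k: int = 3) -> str:
    # Retrieve the top_k most relevant chunks, build the prompt, and query the model.
    context = get_context(question, index, top_k=top_k)
    prompt = create_prompt(question, context)
    chat_completion = client_openai.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo",
    )
    return chat_completion.choices[0].message.content

print(answer_question("Who is the greatest player of all time and what is his record?"))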