PDF QA
In [ ]:
%pip install -q indexify-extractor-sdk
- Download the required extractors:
  - tensorlake/minilm-l6: an embedding extractor based on the MiniLM-L6 model.
  - tensorlake/chunk-extractor: a text chunking extractor.
  - tensorlake/marker: a PDF-to-markdown extractor.
In [ ]:
!indexify-extractor download tensorlake/minilm-l6
!indexify-extractor download tensorlake/chunk-extractor
!indexify-extractor download tensorlake/marker
- Start the Indexify extractor server in a separate terminal using the indexify-extractor join-server command.
In [ ]:
!indexify-extractor join-server
- Install the indexify package using pip.
In [ ]:
%pip install -q indexify
Indexify Client Setup
- Import the IndexifyClient class from the indexify package.
- Create an instance of IndexifyClient called client.
In [4]:
from indexify import IndexifyClient
client = IndexifyClient()
Create an Extraction Graph
- Import the ExtractionGraph class from the indexify package.
- Define the extraction graph specification in YAML format:
  - Set the name of the extraction graph to "pdfqa".
  - Define the extraction policies:
    - Use the "tensorlake/marker" extractor for PDF-to-markdown conversion and name it "mdextract".
    - Use the "tensorlake/chunk-extractor" for text chunking and name it "chunker", with input parameters:
      - chunk_size: 1000 (size of each text chunk)
      - overlap: 100 (overlap between consecutive chunks)
      - content_source: "mdextract" (source of content for chunking)
    - Use the "tensorlake/minilm-l6" extractor for embedding and name it "pdfembedding", with content source "chunker".
- Create an ExtractionGraph object from the YAML specification using ExtractionGraph.from_yaml().
- Create the extraction graph on the Indexify client using client.create_extraction_graph().
In [13]:
from indexify import ExtractionGraph
extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
- extractor: 'tensorlake/marker'
name: 'mdextract'
- extractor: 'tensorlake/chunk-extractor'
name: 'chunker'
input_params:
chunk_size: 1000
overlap: 100
content_source: 'mdextract'
- extractor: 'tensorlake/minilm-l6'
name: 'pdfembedding'
content_source: 'chunker'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
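To build intuition for the chunker's parameters, here is a minimal sketch of sliding-window chunking with chunk_size=1000 and overlap=100. This is illustrative only; the actual tensorlake/chunk-extractor may split on token or sentence boundaries rather than raw characters:

# Illustrative sketch only, not the tensorlake/chunk-extractor implementation.
def sketch_chunks(text: str, chunk_size: int = 1000, overlap: int = 100):
    # Each chunk starts chunk_size - overlap characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

print([len(c) for c in sketch_chunks("x" * 2500)])  # [1000, 1000, 700]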
Document Ingestion
- Add the PDF document to the "pdfqa" extraction graph using client.upload_file().
- Wait for extraction to complete using client.wait_for_extraction().
In [ ]:
content_id = client.upload_file("pdfqa", "chess.pdf")
client.wait_for_extraction(content_id)
Context Retrieval Function
- Define a function called get_context that takes a question, an index name, and top_k as parameters.
- Search the specified index using client.search_index() with the given question and top_k.
- Concatenate the retrieved passages into a single context string.
- Return the context string.
In [32]:
def get_context(question: str, index: str, top_k=3):
    # Retrieve the top_k most relevant passages for the question
    # (the original hard-coded top_k=3 here, ignoring the parameter).
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context
Prompt Creation Function
- Define a function called create_prompt that takes a question and context as parameters.
- Create a prompt string that includes the question and the context.
- Return the prompt string.
In [33]:
def create_prompt(question, context):
return f"Answer the question, based on the context.\n question: {question} \n context: {context}"
Question Answering
- Define a question string.
- Call the get_context function with the question, the index name ("pdfqa.pdfembedding.embedding"), and top_k (default 3) to retrieve the relevant context. Note that the index name is derived from the extraction graph name ("pdfqa") and the embedding policy name ("pdfembedding").
In [34]:
question = "Who is the greatest player of all time and what is his record?"
context = get_context(question, "pdfqa.pdfembedding.embedding")
question = "Who is the greatest player of all time and what is his record?"
context = get_context(question, "pdfqa.pdfembedding.embedding")
Setup OpenAI Client
- Import the OpenAI class from the openai package.
- Create an instance of the OpenAI client called client_openai with the API key.
In [ ]:
from openai import OpenAI
client_openai = OpenAI(api_key="")
Answering Question with OpenAI
- Call the create_prompt function with the question and the retrieved context to generate the prompt.
- Use the client_openai.chat.completions.create() method to send the prompt to the OpenAI API:
  - Set the model to "gpt-3.5-turbo".
  - Pass the prompt as a message with the "user" role.
- Print the generated answer from the API response.
In [ ]:
prompt = create_prompt(question, context)
chat_completion = client_openai.chat.completions.create(
messages=[
{
"role": "user",
"content": prompt,
}
],
model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
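The retrieval, prompting, and generation steps above can be composed into a single helper. answer_question below is a hypothetical convenience wrapper around the functions and clients defined in this notebook, not part of the Indexify or OpenAI APIs:

def answer_question(question: str, index: str = "pdfqa.pdfembedding.embedding", top_k: int = 3) -> str:
    # Retrieve the top_k most relevant chunks, build the prompt, and query the model.
    context = get_context(question, index, top_k=top_k)
    prompt = create_prompt(question, context)
    chat_completion = client_openai.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo",
    )
    return chat_completion.choices[0].message.content

print(answer_question("Who is the greatest player of all time and what is his record?"))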