Introduction¶
This notebook demonstrates how Indexify makes it easier to extract insights from complex SEC filings such as the Form 10-K annual report. Using Uber's 10-K as an example, we show how to run question answering on the filing text and how schema-based extraction can pull key data points from the unstructured document. Together, the two techniques form a practical toolkit for working with dense financial filings.
Setup¶
%pip install indexify indexify-extractor-sdk
# Download Indexify Server
!curl https://getindexify.ai | sh
# Install Poppler (required for PDF extraction)
# You can use brew on MacOS.
!sudo apt-get install -y poppler-utils
# Download Extractors
!indexify-extractor download tensorlake/chunk-extractor
!indexify-extractor download tensorlake/minilm-l6
!indexify-extractor download tensorlake/pdf-extractor
After installing the libraries and downloading the server and the extractors, restart the runtime. Then start the Indexify server and connect the extractors to it.
Open two terminals and run the following commands:
# Terminal 1
./indexify server -d
# Terminal 2
indexify-extractor join-server
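Before moving on, it can help to confirm that the server is actually accepting connections. The sketch below is a generic TCP check written with the standard library only; is_port_open is a hypothetical helper, and the host and port (localhost:8900) are an assumption about the server's default address, so adjust them if your setup differs.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Assumption: the Indexify server listens on localhost:8900.
print("server reachable:", is_port_open("localhost", 8900))
```

If this prints False, check that the command in Terminal 1 is still running before starting the extractors.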
Test the extractors¶
We will try the PDFExtractor first. It extracts text and tables from the document in a single pass and hands the result to the next extractors in the chain, which prepare it for question answering.
We'll start by downloading Uber's Form 10-K report.
import requests
req = requests.get("https://d18rn0p25nwr6d.cloudfront.net/CIK-0001543151/6fabd79a-baa9-4b08-84fe-deab4ef8415f.pdf")
with open('form10-k.pdf','wb') as f:
    f.write(req.content)
from indexify_extractor_sdk import load_extractor, Content
pdfextractor, pdfconfig_cls = load_extractor("indexify_extractors.pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("form10-k.pdf")
config = pdfconfig_cls()
pdf_result = pdfextractor.extract(content, config)
text_content = next(content.data.decode('utf-8') for content in pdf_result if content.content_type == 'text/plain')
print(text_content)
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 ____________________________________________ FORM 10-K ____________________________________________ (Mark One) ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended December 31, 2023 OR ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from_____ to _____ Commission File Number: 001-38902 ____________________________________________ UBER TECHNOLOGIES, INC. (Exact name of registrant as specified in its charter) ____________________________________________ Delaware 45-2647441 (State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identification No.) 1725 3rd Street San Francisco, California 94158 (Address of principal executive offices, including zip code) (415) 612-8582 (Registrant’s telephone number, including area code) ____________________________________________ Securities registered pursuant to Section 12(b) of the Act: Title of each class Trading Symbol(s) Name of each exchange on which registered Common Stock, par value $0.00001 per share UBER New York Stock Exchange Securities registered pursuant to Section 12(g) of the Act: None Indicate by check mark whether the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☒ No ☐ Indicate by check mark whether the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. Yes ☐ No ☒ Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes ☒ No ☐ Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes ☒ No ☐ Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.
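The next(...) expression above keeps only the first text/plain item. When an extractor returns several pieces of different types, grouping them by content type makes it easier to see what came back. The sketch below is a minimal illustration using a hypothetical stand-in class, FakeContent, that mirrors only the content_type and data attributes used above; it does not call the real extractor, and the sample items are hand-written.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeContent:
    """Hypothetical stand-in exposing only the two attributes used above."""
    content_type: str
    data: bytes


def group_by_content_type(results):
    """Map each content type to the list of raw payloads of that type."""
    grouped = defaultdict(list)
    for item in results:
        grouped[item.content_type].append(item.data)
    return dict(grouped)


# Hand-written sample items, not real extractor output.
sample = [
    FakeContent("text/plain", b"FORM 10-K"),
    FakeContent("application/json", b'{"rows": []}'),
    FakeContent("text/plain", b"UBER TECHNOLOGIES, INC."),
]

grouped = group_by_content_type(sample)
# Join every text/plain payload instead of keeping only the first one.
all_text = "\n".join(chunk.decode("utf-8") for chunk in grouped.get("text/plain", []))
print(sorted(grouped))
print(all_text)
```

The same function should work on pdf_result as long as its items expose content_type and data, as they do in the cell above.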
Create a Client¶
Instantiate the Indexify Client
from indexify import IndexifyClient
client = IndexifyClient()
1. Question Answering Task¶
Extraction Graph Setup¶
Import the ExtractionGraph class from the indexify package.
Define the extraction graph specification in YAML format:
- Set the name of the extraction graph to "pdfqa".
- Define the extraction policies:
  - Use the "tensorlake/pdf-extractor" extractor to extract the PDF content and name it "docextractor".
  - Use the "tensorlake/chunk-extractor" extractor for text chunking and name it "chunker".
    - Set the input parameters for the chunker:
      - chunk_size: 1000 (size of each text chunk)
      - overlap: 100 (overlap between chunks)
    - Set the content source for chunking to "docextractor".
  - Use the "tensorlake/minilm-l6" extractor for embedding and name it "embedder".
    - Set the content source for embedding to "chunker".
Create an ExtractionGraph object from the YAML specification using ExtractionGraph.from_yaml().
Create the extraction graph on the Indexify client using client.create_extraction_graph().
from indexify import ExtractionGraph
extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
  - extractor: 'tensorlake/pdf-extractor'
    name: 'docextractor'
  - extractor: 'tensorlake/chunk-extractor'
    name: 'chunker'
    input_params:
      chunk_size: 1000
      overlap: 100
    content_source: 'docextractor'
  - extractor: 'tensorlake/minilm-l6'
    name: 'embedder'
    content_source: 'chunker'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
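To make the chunk_size and overlap parameters concrete, here is a minimal sliding-window sketch in plain Python. It assumes chunks are measured in characters and cut at fixed offsets; the actual chunk extractor may split on separators or count differently, so treat chunk_text as a hypothetical illustration of the two parameters rather than the extractor's algorithm.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that share overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 2,500-character input, used only to show the window arithmetic.
sample_text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample_text, chunk_size=1000, overlap=100)
print([len(c) for c in chunks])
```

With these settings each window starts 900 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. That overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.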
Upload the FORM 10-K PDF File¶
content_id = client.upload_file("pdfqa", "form10-k.pdf")
client.wait_for_extraction(content_id)
'357d5a0d5e9a7d30'
What is happening behind the scenes¶
Indexify responds to ingestion events by evaluating all existing policies and triggering the extractors they require. Once the PDF extractor has finished extracting texts, bytes, and JSONs from the document, the chunk extractor splits the text into chunks, and the embedding extractor then computes embeddings and populates an index.
With Indexify, you can upload hundreds of PDF files at once, and the platform handles extraction and indexing without manual intervention. To speed up extraction, you can deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the workload among them.
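The order in which the policies run follows from their content_source links. The sketch below is a conceptual illustration only, not Indexify's scheduler: it rebuilds the "pdfqa" policy chain as plain dictionaries and uses the standard-library graphlib module to derive an order in which every policy runs after the policy that feeds it.

```python
from graphlib import TopologicalSorter

# The three policies of the "pdfqa" graph, deliberately listed out of order.
policies = [
    {"name": "embedder", "content_source": "chunker"},
    {"name": "docextractor"},  # no content_source: fed by the uploaded file
    {"name": "chunker", "content_source": "docextractor"},
]

# Map each policy to the set of policies it depends on.
dependencies = {
    p["name"]: ({p["content_source"]} if "content_source" in p else set())
    for p in policies
}

# static_order() yields each policy only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A content_source that forms a loop would make static_order() raise graphlib.CycleError, which is one way to catch a misconfigured graph before uploading anything.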
Perform RAG with OpenAI¶
def get_context(question: str, index: str, top_k=3):
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context = context + f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context
question = "What are the disclosures with respect to Foreign Subsidiaries?"
context = get_context(question, "pdfqa.embedder.embedding")
context
'content id: 42c552ac1b572bd3 \n\n passage: harm to the acquired company’s brand.\nIn addition, our acquisition of Careem has increased our risks under the U.S. Foreign Corrupt Practices Act (“FCPA”) and other similar laws outside the United\nStates. Our existing and planned safeguards, including training and compliance programs to discourage corrupt practices by such parties, may not prove effective,\nand such parties may engage in conduct for which we could be held responsible.\nWe may not receive a favorable return on investment for prior or future business combinations, and we cannot predict whether these transactions will be\naccretive to the value of our common stock. It is also possible that acquisitions, combinations, divestitures, joint ventures, or other strategic transactions we\nannounce could be viewed negatively by the press, investors, platform users, or regulators, any or all of which may adversely affect our reputation and our\ncontent id: 5c647fd06e97f75a \n\n passage: by the release of the valuation allowance due to deferred tax liabilities recorded as a result of the acquisitions providing an additional source of taxable income to\nsupport the realizability of pre-existing deferred tax assets.\nFor the year ended December 31, 2022, the increase in the valuation allowance was primarily attributable to an increase in deferred tax assets resulting from\nthe loss from operations, offset by the deferred tax impact from the transfer of certain intangible assets among our wholly-owned subsidiaries.\nITEM 9. CHANGES IN AND DISAGREEMENTS WITH ACCOUNTANTS ON ACCOUNTING AND FINANCIAL DISCLOSURE\nNone.\nITEM 9A. 
CONTROLS AND PROCEDURES\nEvaluation of Disclosure Controls and Procedures\nWe maintain disclosure controls and procedures that are designed to provide reasonable assurance that information required to be disclosed in reports that we\ncontent id: 24992794bef930f6 \n\n passage: authorizations of management and directors of the company; and (iii) provide reasonable assurance regarding prevention or timely detection of unauthorized\nacquisition, use, or disposition of the company’s assets that could have a material effect on the financial statements.\nBecause of its inherent limitations, internal control over financial reporting may not prevent or detect misstatements. Also, projections of any evaluation of\neffectiveness to future periods are subject to the risk that controls may become inadequate because of changes in conditions, or that the degree of compliance with\nthe policies or procedures may deteriorate.\n71\n'
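The get_context helper above relies on search_index to return the top_k chunks closest to the question. As a rough mental model, assuming the index ranks chunks by cosine similarity between embeddings (the server's actual metric and storage may differ), the toy sketch below ranks three hand-written 3-dimensional vectors against a query vector. The vectors, texts, and the names cosine, top_k, and toy_index are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the k index entries most similar to query_vec, best first."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True)
    return ranked[:k]


# Hand-written toy index; real embeddings come from the embedding extractor.
toy_index = [
    {"content_id": "a", "text": "foreign subsidiaries and tax", "embedding": [0.9, 0.1, 0.0]},
    {"content_id": "b", "text": "driver incentives", "embedding": [0.0, 1.0, 0.0]},
    {"content_id": "c", "text": "internal controls", "embedding": [0.5, 0.5, 0.7]},
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
# Same string layout that get_context builds from the search results.
toy_context = "".join(
    f"content id: {r['content_id']} \n\n passage: {r['text']}\n" for r in results
)
print(toy_context)
```

Raising top_k gives the language model more passages to draw on at the cost of a longer prompt; the retrieved passages above show that the nearest chunks are not always directly about the question.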
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"
prompt = create_prompt(question, context)
from openai import OpenAI
client_openai = OpenAI()
Now ask any question related to the ingested Form 10-K PDF document.
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
The disclosures with respect to Foreign Subsidiaries include risks under the U.S. Foreign Corrupt Practices Act and other similar laws, potential negative impacts on reputation from acquisitions or strategic transactions, and the evaluation of controls and procedures to prevent unauthorized acquisition, use, or disposition of assets.
2. Schema-based Retrieval Task¶
Alert: The following example calls the OpenAI API on the extracted document content and can consume a large number of OpenAI credits. Proceed at your own risk.
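To get a feel for the scale before running it, a rough estimate can be made from the length of the extracted text. The sketch below assumes the common rule of thumb of about 4 characters per token for English text, and takes the price as a parameter; the price used in the example is a placeholder, not a real OpenAI rate, so look up current pricing for the model you use. Both helper names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token count, assuming about chars_per_token characters per token."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Rough input cost in USD for sending text once at the given price."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens


# Synthetic stand-in for extracted text; 400,000 characters is an arbitrary size.
sample = "x" * 400_000
print(estimate_tokens(sample))  # about 100,000 tokens under the 4-chars rule
print(estimate_cost(sample, usd_per_1k_tokens=0.5))  # placeholder price
```

Passing text_content from the PDF extractor test above to estimate_tokens gives an order-of-magnitude figure for this filing. The estimate covers input tokens only and ignores any repeated calls the extractor may make.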
Extraction Graph Setup¶
extraction_graph_spec = """
name: 'pdfschema'
extraction_policies:
  - extractor: 'tensorlake/pdf-extractor'
    name: 'docextractor'
  - extractor: 'tensorlake/schema'
    name: 'schemaprocessor'
    input_params:
      service: 'openai'
      schema: {'properties': {'file_number': {'title': 'File Number', 'type': 'string'}, 'registrant_name': {'title': 'Registrant Name', 'type': 'string'}, 'jurisdiction': {'title': 'Jurisdiction', 'type': 'string'}, 'employer_id_number': {'title': 'Employer Id Number', 'type': 'string'}, 'address': {'title': 'Address', 'type': 'string'}, 'telephone_number': {'title': 'Telephone Number', 'type': 'string'}, 'title_of_each_class': {'title': 'Title Of Each Class', 'type': 'string'}, 'trading_symbol': {'title': 'Trading Symbol', 'type': 'string'}, 'name_of_exchange': {'title': 'Name Of Exchange', 'type': 'string'}}, 'required': ['file_number', 'registrant_name', 'jurisdiction', 'employer_id_number', 'address', 'telephone_number', 'title_of_each_class', 'trading_symbol', 'name_of_exchange'], 'title': 'Form', 'type': 'object'}
    content_source: 'docextractor'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
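The schema passed to the extractor lists string properties and marks all of them as required. Once a record comes back, a small check like the one below can flag fields that are missing or not strings. This is a minimal sketch using a shortened copy of the schema and hand-written records (the values are taken from the cover page printed earlier); it does not assume any particular return format from Indexify, and find_problems is a hypothetical helper.

```python
# Shortened copy of the schema above, kept to three fields for brevity.
schema = {
    "properties": {
        "file_number": {"title": "File Number", "type": "string"},
        "registrant_name": {"title": "Registrant Name", "type": "string"},
        "trading_symbol": {"title": "Trading Symbol", "type": "string"},
    },
    "required": ["file_number", "registrant_name", "trading_symbol"],
    "title": "Form",
    "type": "object",
}


def find_problems(record: dict, schema: dict) -> list[str]:
    """List required fields that are missing and string fields with non-string values."""
    problems = []
    for field in schema.get("required", []):
        if field not in record:
            problems.append(f"missing: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in record and spec.get("type") == "string" and not isinstance(record[field], str):
            problems.append(f"not a string: {field}")
    return problems


# Hand-written records for illustration, not actual extractor output.
complete = {
    "file_number": "001-38902",
    "registrant_name": "UBER TECHNOLOGIES, INC.",
    "trading_symbol": "UBER",
}
incomplete = {"file_number": 138902, "registrant_name": "UBER TECHNOLOGIES, INC."}

print(find_problems(complete, schema))
print(find_problems(incomplete, schema))
```

For full JSON Schema validation a dedicated library would be the better choice; the check here only covers the two rules this particular schema uses.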
Upload the FORM 10-K PDF File¶
content_id = client.upload_file("pdfschema", "form10-k.pdf")
client.wait_for_extraction(content_id)
'6c898c11de955629'
View the extracted content¶
# Use the ID returned by upload_file; the ID shown above is specific to the original run.
client.get_extracted_content(content_id)