Each organization sits on a goldmine of information that often goes to waste. The fundamental problem is that their valuable knowledge is scattered across thousands of unstructured documents, buried in emails, presentations, and internal wikis. At Level, we faced this exact challenge. And because one of our core advantages is the knowledge we capture from our portfolio and partners, we quickly worked on a solution. Today, we feel that we have developed the infrastructure and applications to yield alpha in our investing processes. Our system is not just a scaled RAG chatbot, which is the broader industry’s current state-of-the-art. Instead, our solution intelligently extracts entities from unstructured documents and connects them with our existing Level Ventures dataset. This system captures subjective information that’s more difficult to semantically model at scale, while our structured dataset captures objective data points. Together, the magic comes from the applications that harness this marriage to drive efficiency in our processes and find deeper insights at scale.
Concretely, the documents we’ve ingested have already enabled a host of valuable use cases we rely on daily:
For fund investing, it allows us to compare and contrast a manager’s updates on companies and theses against those of other managers.
For company investing, it ensures we’re notified of important updates across the portfolio and helps us identify experts in our network during diligence.
For thesis development, it offers a market pulse by identifying venture trends that are gaining traction and sources non-consensus documents in our corpus.
And shortly, we expect to build magical products that take advantage of our unstructured infrastructure in new ways. Some ideas we are working on include:
Relationship predicates, which model organization-to-organization or people-to-people relationships in unstructured data. We believe proper modeling will allow us to map customer channels and networks, predicting subsequent links and evolutions, among other things.
DeepResearch agents that harness our proprietary unstructured dataset to drive opinions, summaries, and answers to questions from our curated sources.
Market simulations of opinions over time, to predict evolutions in views within the venture market.
Identification of weak signals within the venture market before they become fully formed narratives.
In this post, we'll walk you through the decision points and implementation of our unstructured data pipeline and share practical insights we learned along the way. Later, we'll double-click into the various use cases that have transformed our operations.
Design Constraints: Building around Real-World Usage
Before diving into our implementation, it's essential to understand the key constraints that shaped our approach. These requirements ensured our system would deliver practical value in a fast-paced venture environment:
Seamless Integration and Evolving with Structured Data Our existing Level dataset entity resolves organizations, people, funding rounds, investments, and more from over a dozen sources. Any unstructured data system needed to integrate seamlessly with this foundation, allowing us to enrich—rather than fragment—our knowledge base. Further, if our structured dataset morphs with new entity resolutions, renamings, or corrections (which are plentiful in the early-stage venture dataset landscape), the matches made against the structured dataset must also evolve.
Real-Time Processing In venture capital, timing is critical. Our system needed to process documents within minutes of uploading and integrate its findings into our dataset in under 15 minutes. This rapid turnaround ensures that teams can make informed decisions with the freshest possible information.
Contextual Document Organization Documents don't exist in isolation. Our system needed to preserve relationships between related materials—like grouping all files from a single data room—to maintain critical context and enable a grouped analysis.
Format Flexibility Information comes in many forms. Our infrastructure must handle virtually any document type: PDFs, text files, podcasts, webpages, spreadsheets, and even screenshots. This flexibility ensures we capture insights regardless of their original format.
Human-in-the-Loop Correction AI isn't perfect. When our system makes extraction or entity resolution errors, we need a straightforward mechanism for human experts to override and correct these outputs, ensuring data quality improves over time.
Handling Scale Well We intend to extract entities from both our internal documents and online sources, and we expect our pipeline to reach hundreds of thousands of documents. Operating at this scale is required to reduce search costs in addition to boosting alpha discovery.
We reviewed a few solutions, including Glean, V7 Go, Perplexity, and Notion AI, and decided it would be best to build our tool in-house.
Implementation
Our solution aims to extract entities from raw documents and organize the subjective nuggets of information into categories. We leveraged state-of-the-art LLMs and built agentic processes to ensure high-quality extractions. Furthermore, we built a human-in-the-loop framework for monitoring the system, allowing for human edits when necessary. This section details our approach, challenges, and the key technical solutions we've implemented to create a scalable and accurate system.
Entity Extraction and Organization
The process of extracting entities from raw documents is as follows: using VC-context-aware prompting, we extract mentions of entities, no matter how brief or extensive. Then, we enrich each entity by organizing its information into categories. The fundamental entity types and the enriched metadata we’ve focused on so far are as follows (a minimal schema sketch follows the examples):
People Entities: Individuals mentioned in documents
We capture contextual information (their role, actions, relationships)
We extract subjective information (opinions about them, sentiment)
Simple example: "Elon Musk announced Tesla's new battery technology." → Person: Elon Musk; Context: Tesla executive involved with battery technology
Company Entities: Organizations including businesses, investment firms, and government agencies
We record context (industry, relationships, activities)
We identify roadmap information (future plans, strategies)
We extract traction data (growth metrics, achievements)
We capture subjective assessments (reputation, evaluations)
Simple example: "Stripe processed over $640B last year and plans to expand into Latin America" → Company: Stripe; Traction: $640B processed; Roadmap: Latin American expansion
Domain Entities: Narratives and perspectives being shared in the document
We extract supporting evidence and context
We identify the author's stated opinions
Simple example: "AI safety remains underinvested despite increasing capabilities" → Domain: AI safety; Context: current investment levels vs. capability growth
Of note, we use prompt engineering to ensure that entity mentions are deduplicated within a document (e.g., numerous mentions of "Elon" should still be filed under "Elon Musk" rather than each mention being labeled uniquely). Further, we did a substantial amount of prompt engineering on GPT-4o to get sensible output. In our evaluation, although the model was good at extracting entities the way we intended, it still made obvious mistakes on harder cases.
A Dynamic, Scalable UX to Track Progress and Correct Agentic Processes
The front-end observability of this pipeline initially relied on Airtable. The system connected a document sheet to the entities extracted from each document, and we orchestrated API calls with webhooks to process and fill the tables. However, given our data volume, we quickly hit Airtable's size limitations. This led us to adopt NocoDB, an open-source Airtable alternative that let us manage database limits and scale to our needs.
In summary, our front end now consists of four interconnected tables:
Document table (source materials)
People entities table
Company entities table
Domain entities table
Each row represents an entity and maintains links to its source documents, while each document links to all of its extracted entities, creating a bidirectional relationship that facilitates easy navigation and search. The columns represent important extractions, metadata, or agentic decision points.
For agent transparency, the core information used for key decisions - like entity resolution picks and metadata research - is shared directly on the UX and easily modifiable. Human edits trigger webhooks that update and maintain downstream processes. Common failure points are also evaluated directly in the UX. This allows someone on our team to audit the agent’s decisions and modify the decision-making. Often, we find that the agents make the right decisions. However, sometimes, under challenging scenarios, we work to correct or enrich the dataset manually using the user interface we’ve designed.
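To illustrate the correction loop, here is a minimal FastAPI sketch of a webhook receiver for human edits. The payload shape and the reprocess_downstream helper are assumptions for the example, not our production handler.

```python
# Hypothetical webhook receiver for human edits made in the front-end tables.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/entity-edited")
async def entity_edited(request: Request):
    payload = await request.json()
    table = payload.get("table")               # "people", "companies", "domains"
    row_id = payload.get("row_id")
    changed = payload.get("changed_fields", {})

    # If a human overrode an entity-resolution pick, refresh anything downstream.
    if "resolved_entity_id" in changed:
        await reprocess_downstream(table, row_id, changed["resolved_entity_id"])
    return {"status": "ok"}

async def reprocess_downstream(table: str, row_id: str, entity_id: str) -> None:
    # Placeholder: in practice this would enqueue jobs that update derivative
    # tables and links back into the structured dataset.
    print(f"refreshing derivatives for {table}/{row_id} -> {entity_id}")
```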
The Entity Resolution Challenge
Beyond Simple Name Matching
Entity resolution represents the most complex challenge in our pipeline. Matching solely on entity names proves insufficient for accurate results: companies share names, and the same entity is often represented slightly differently in text vs. our systems (e.g., “LUX” vs. “Lux Capital”). Resolving people and companies comes with several challenges:
Documents rarely include definitive identifiers like LinkedIn profiles or company websites.
Contextual understanding is required to disambiguate entities properly, and even then, further venture context can be necessary.
Solution: Agentic Research-Driven Approach
After extensive experimentation with various methods, we developed a research-driven approach using the Google Search API within an agentic process, which works as follows (a simplified code sketch follows these steps):
Any mention of an entity in a document must include a metadata component that states what kind of entity it was. For instance, was the company an AI customer, investor, or consulting firm? This happens at the entity extraction phase.
Next, our system generates contextually informed search queries based on document content during entity resolution.
These queries are sent to the Google Search API.
An agent evaluates the returned link options to determine relevant matches.
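The sketch below illustrates the research step, assuming the Google Custom Search JSON API as the programmatic entry point to Google Search; the query construction and the final judging step are simplified stand-ins for our prompts and agents.

```python
# Simplified research step: build a context-aware query, fetch candidate links,
# and hand them to an LLM judge. Environment variables and helpers are assumed.
import os
import requests

GOOGLE_KEY = os.environ["GOOGLE_API_KEY"]
GOOGLE_CX = os.environ["GOOGLE_CSE_ID"]

def build_query(entity_name: str, entity_kind: str, doc_context: str) -> str:
    # Fold in the metadata captured at extraction time (e.g. "early-stage
    # fintech company", "partner at Lux Capital") to disambiguate the name.
    return f"{entity_name} {entity_kind} {doc_context}"

def search_candidates(query: str, num: int = 5) -> list[dict]:
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": GOOGLE_KEY, "cx": GOOGLE_CX, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"title": it["title"], "link": it["link"], "snippet": it.get("snippet", "")}
        for it in resp.json().get("items", [])
    ]

def pick_match(entity_name: str, doc_context: str, candidates: list[dict]):
    # In the real pipeline an LLM agent weighs the candidates against the
    # document context and may return None when nothing fits.
    ...
```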
We benchmarked this approach against alternatives, including Perplexity and ChatGPT, which have web search capabilities. Google consistently outperformed other options, particularly for obscure people and companies. We attribute this to Google's superior indexing capabilities, though we're open to reevaluating as other services improve.
Sophisticated Matching Logic
Once an entity is researched, we execute resolution queries that incorporate flexible matching criteria:
For People:
Name variations and nicknames (John vs. Johnny vs. Jon)
Professional background (previous employers, roles)
Educational history
X links, LinkedIn links
For Companies:
Common aliases and abbreviations
Subsidiary relationships
Historical name changes
Homepage, X links, LinkedIn links
After identifying potential matches, an agent uses all available metadata to evaluate which candidates from our structured dataset, if any, correspond to the extracted entity. This approach delivers significantly better accuracy than simplistic text matching, partly because our dataset encompasses half a billion people entities and approximately ten million organizations, meaning there are many duplicate names.
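As a flavor of this matching logic, here is a toy scoring function over hypothetical candidate fields (links, names, aliases); the weights and field names are illustrative, not our production resolver.

```python
# Toy candidate scoring: exact link matches dominate, fuzzy name/alias
# agreement is the fallback; the agent reviews the top candidates (or "None").
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def score_company_candidate(extracted: dict, candidate: dict) -> float:
    score = 0.0
    # Definitive identifiers, when present, outweigh everything else.
    for link_field in ("homepage", "x_url", "linkedin_url"):
        if extracted.get(link_field) and extracted[link_field] == candidate.get(link_field):
            score += 1.0
    # Otherwise rely on fuzzy agreement across names, aliases, and past names.
    names = [candidate.get("name", "")] + candidate.get("aliases", [])
    score += max(name_similarity(extracted["name"], n) for n in names)
    return score

# Example: "LUX" scores highly against the "Lux Capital" record once its
# aliases and links are considered, despite the surface-form mismatch.
```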
For entities outside the startup ecosystem (politicians, NGOs, etc.), we've integrated Wikidata as a supplementary reference source.
Quality Assurance Through Confidence Scoring
A key innovation in our approach is the assignment of confidence scores to entity matches. Our agents can select "None" when uncertain, allowing us to focus human review efforts on low-confidence matches.
Because LLMs are not very good at returning numerical answers, and because we found that QA models at the same level of intelligence did not pick out mistakes the original LLM had made, we spent a lot of time laying out the various failure cases of our LLM operations and prompt engineering checks to catch those failure cases well.
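In practice, this pushed us toward categorical confidence labels with an explicit escape hatch rather than raw numbers. A minimal sketch with illustrative names:

```python
# Categorical confidence with an explicit "none"; low confidence routes to review.
from enum import Enum
from pydantic import BaseModel

class MatchConfidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    NONE = "none"       # the agent found no acceptable match

class ResolutionDecision(BaseModel):
    entity_name: str
    matched_id: str | None
    confidence: MatchConfidence
    rationale: str      # surfaced in the UX so a human can audit the pick

def needs_human_review(decision: ResolutionDecision) -> bool:
    return decision.confidence in (MatchConfidence.LOW, MatchConfidence.NONE)
```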
System Integrations
Document Ingestion Methods
Our pipeline accepts virtually any document type and processes it without human intervention. We designed around three types of document ingesters: manual uploads, agentic scraping (where a document, webpage, or audio file may or may not be attached), and traditional scraping. As such, we’ve built various “feeder” integrations (a common-interface sketch follows the list):
Direct Upload Form: Upload any document through a simple interface
Email Integration: Tag emails to trigger automated processing of message bodies and/or attachments
Google Drive Folder Ingestion: Process entire document collections
Regular Content Updates: Automated ingestion of podcasts and blog posts
Third-Party Integrators: API connections with external content sources, like quarterly transcripts
Call notes: either uploading transcripts or notes from a meeting
Stalkers: these aim to collect recent podcasts, blog posts, Tweets, or any updates from a curated set of key opinion leaders.
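A sketch of the common shape these feeders share, assuming a hypothetical RawDocument type and ingestion entry point:

```python
# Hypothetical feeder interface: every source yields the same document shape.
from dataclasses import dataclass, field
from typing import Iterable, Protocol

@dataclass
class RawDocument:
    source: str                  # "upload", "email", "gdrive", "podcast", ...
    uri: str                     # original location or message id
    mime_type: str
    content: bytes
    group_id: str | None = None  # keeps files from one data room together
    metadata: dict = field(default_factory=dict)

class Feeder(Protocol):
    def fetch_new(self) -> Iterable[RawDocument]:
        """Return documents that have not been ingested yet."""

def run_feeders(feeders: list[Feeder]) -> None:
    for feeder in feeders:
        for doc in feeder.fetch_new():
            enqueue_for_extraction(doc)

def enqueue_for_extraction(doc: RawDocument) -> None:
    # Placeholder for the pipeline entry point.
    print(f"queued {doc.uri} from {doc.source}")
```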
Quality Assurance Macro-Processes
After completing the entire ingestion process, each step undergoes automated QA checks through secondary agent workflows designed to catch common failure patterns. Documents that trigger these patterns are automatically flagged for review, which enables humans to fix challenges much faster.
We also periodically audit and reprocess low-confidence entity matches as our system continuously improves.
Python Stack Shout-outs
OpenRouter, which we use to make our LLM calls and experiment with various models (a minimal call example follows this list).
LangFuse covers almost all of our observability needs.
FastAPI, which we’ve leveraged to create our flexible API service in Python and AWS.
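For reference, a minimal example of routing a call through OpenRouter's OpenAI-compatible endpoint; the model id and prompt here are placeholders, not our production prompts.

```python
# Minimal OpenRouter call via the OpenAI SDK's compatible client.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

document_text = "Stripe processed over $640B last year..."  # placeholder input

response = client.chat.completions.create(
    model="openai/gpt-4o",  # swapping model ids is how we experiment
    messages=[
        {"role": "system", "content": "Extract people, companies, and domains as JSON."},
        {"role": "user", "content": document_text},
    ],
)
print(response.choices[0].message.content)
```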
Example Extractions
For decks
From The Weekend Fund 3.0 deck, some of the extracted people are:
Will Peng
Opinion: Will Peng appreciates Ryan Hoover for his quick decision-making and incredible network, which helped in closing enterprise accounts and top technology hires.
Metadata: Founder of Northstar
Alex Bouaziz
Opinion: Alex Bouaziz describes Ryan Hoover and Vedika Jain as awesome, noting their deep appreciation of founders and ability to move mountains.
Metadata: Founder of Deel
Tessa Mu
Nugget: Tessa Mu describes Ryan Hoover as tirelessly supportive, consistently going above and beyond to help with various aspects of business growth.
Metadata: Founder of Unicycle
Some examples from the 67 extracted companies are:
MainStreet
Opinion: The document suggests that MainStreet's pivot and subsequent growth were impressive, leading to significant investment interest.
Context: MainStreet is highlighted as a portfolio company of Weekend Fund, showcasing its growth and success.
InVideo
Nugget: The document suggests that InVideo's rapid growth and investment interest are notable.
Funding Traction: InVideo achieved a 3.3x markup after Sequoia India led its Series A.
Sequoia India
Context: Mentioned as the lead investor in InVideo's Series A
Kairos Society
Nugget: N/A
Context: Vedika Jain's prior experience in venture capital, indicating her background in investment and technology.
The domains extracted were:
Remote work
Supporting Info: Supporting information includes the statistic that 80% of surveyed CEOs plan to allow remote work at least part-time, and 47% will allow full-time remote work post-pandemic. Additionally, Deel, a company providing payroll solutions for remote teams, is highlighted as a successful investment, with 1,400+ companies using their services, including notable customers like Notion, Andela, and Bubble.
Audio/Voice
Opinion: The opinion expressed in the document is optimistic about the Audio/Voice domain, suggesting that it is a burgeoning field with significant growth potential.
Supporting Info: Supporting information includes the doubling of AirPods sales year-over-year, the statistic that 1/3 of U.S. adults own a smart speaker, and the success of Clubhouse as a new platform in the audio/voice space. Additionally, Voiceflow, a portfolio company, is mentioned as an example, with ~5% of Alexa voice apps powered by Voiceflow and usage by major companies like Amazon, NYT, McDonalds, and BMW.
Digitization of India
Opinion: The opinion expressed is positive, suggesting that the digitization of India is a promising area for investment due to the rapid technological adoption and infrastructure development. The document implies that this trend will continue to create opportunities for companies that can capitalize on the growing digital ecosystem in India.
Supporting Info: Supporting information includes the doubling of India's internet connectivity over the last four years and the rapid adoption of digital infrastructure such as India Stack, Aadhar, and digital payments. Dukaan, a portfolio company, is provided as an example of success in this domain, with 3 million stores created in only 7 months since launch and over 2 million page views every day.
For podcasts
From this recent BG2 episode (Bill Gurley and Brad Gerstner podcast):
Some of the 21 people mentioned are:
Satya Nadella
Opinion: The speakers interpret Satya's comments as a cautious approach to AI spending, emphasizing the need for a sustainable business model and infrastructure investment.
Kevin Weil
Context: Kevin Weil is mentioned as the person running the product team at OpenAI, indicating his role in the company's product development efforts.
Some of the 36 company mentions are:
XAI
Opinion: Grok3 is seen as an impressive new entrant in the AI model market, with strong execution on the product side.
Traction: Grok3 rocketed to the top of app downloads on the iPhone charts, indicating strong consumer interest and traction.
Roadmap: Grok3 is leveraging the X platform to drive app downloads and consumer engagement, with plans to expand its voice capabilities.
Microsoft
Traction: Microsoft is investing heavily in AI infrastructure, spending 80% of its free cash flow on CapEx related to AI.
Roadmap: Microsoft is focused on expanding its AI capabilities and infrastructure, with partnerships and potential acquisitions to enhance its position.
Context: Discussed as a major player in the AI space, with significant investment in AI infrastructure and partnerships.
Some of the 5 domains extracted were:
Pre-Training in AI Models
Opinion: The speakers express skepticism about the future potential of pre-training in AI models. They mention that while some see Grok3's success as proof of pre-training's potential, they believe it has hit a ceiling. They argue that building bigger clusters and more parameters won't necessarily lead to significant improvements. They reference opinions from others like Ilya and Andreessen, who share similar concerns about the limitations of pre-training.
AI Regulation and US-China Relations
Opinion: The speakers express concern that AI regulation and the framing of US-China relations as an 'AI war' could hinder US competitiveness and innovation. They argue that regulatory actions like the diffusion rule may backfire by limiting US companies' ability to compete globally. They also express skepticism about the idea of winning an AI war with China, emphasizing the need to focus on innovation rather than control.
For Quarterly Transcripts
Quarterly transcripts offer another way to gather perspectives and enrich our dataset. See this example of Nvidia’s Q3 Fiscal 2025 transcript.
Although less critical for our investing context, our pipeline also extracted the various analysts who asked questions:
Aaron Rakers: Analyst at Wells Fargo
Timothy Arcuri: Analyst at UBS
… and 12 others
Some of the company mentions:
Adobe
Traction: Adobe Firefly delivers AI services using NVIDIA's technology.
Context: Adobe Firefly is mentioned as one of the AI-native companies seeing success with NVIDIA's infrastructure.
Cloudera
Traction: Cloudera is using NVIDIA AI to build Co-Pilots and agents.
Context: Cloudera is mentioned as a company working with NVIDIA to accelerate the development of AI applications.
Nutanix
Traction: Nutanix is using NVIDIA AI to build Co-Pilots and agents.
Context: Nutanix is mentioned as a company working with NVIDIA to accelerate the development of AI applications.
Oracle
Traction: OCI is planning to deploy NVIDIA H200 infrastructure to meet rising demand for AI training and inference workloads.
Context: OCI is mentioned as a future provider of NVIDIA H200-powered cloud instances.
Bernstein Research
Context: Bernstein Research is mentioned as a participant in the NVIDIA earnings call.
… there were 67 total company entities mentioned.
Domains (there were nine total):
Industrial AI and Robotics
Opinion: The opinion expressed about Industrial AI and Robotics is highly positive. Colette Kress highlights the adoption of NVIDIA Omniverse by some of the largest industrial manufacturers in the world, suggesting a strong belief in its potential to drive growth for NVIDIA. The mention of Foxconn using digital twins and industrial AI built on NVIDIA Omniverse to drive new levels of efficiency further supports this positive outlook.
Supporting Info: Supporting information includes the financial results shared by Colette Kress, such as the adoption of NVIDIA Omniverse by some of the largest industrial manufacturers in the world. The mention of Foxconn using digital twins and industrial AI built on NVIDIA Omniverse to drive new levels of efficiency, with an expected reduction of over 30% in annual kilowatt-hour usage in its Mexico facility, is highlighted as a key factor driving Industrial AI and Robotics' growth. Additionally, the breakthroughs in physical AI and foundation models that understand the physical world are mentioned as supporting details.
Gaming and AI PCs
Opinion: The opinion expressed about Gaming and AI PCs is positive. Colette Kress highlights the increase in gaming revenue and the introduction of new GeForce RTX AI PCs, suggesting a strong belief in their potential to drive growth for NVIDIA. The mention of strong back-to-school sales and the anticipation of Microsoft's Copilot+ capabilities further supports this positive outlook.
Supporting Info: Supporting information includes the financial results shared by Colette Kress, such as the increase in gaming revenue to $3.3 billion, up 14% sequentially and 15% year-on-year. The mention of strong back-to-school sales and the introduction of new GeForce RTX AI PCs with Microsoft's Copilot+ capabilities are highlighted as key factors driving Gaming and AI PCs' growth. Additionally, the celebration of the 25th anniversary of the GeForce 256 and its impact on computing graphics and the AI revolution are mentioned as supporting details.
For Emails
From a public newsletter email from LaunchGravity regarding the YC 2025 Winter Batch Demo Day.
There were 10 total people mentions, 59 company mentions, and 8 domain mentions.
Interesting extractions were:
Leaping AI
Traction: Leaping AI's voice AI agents automate up to 70% of customer support calls while maintaining a 90% customer satisfaction rate.
Agent-Driven Automation
Supporting Info: The author supports their opinion by listing several startups in the Y Combinator Winter 2025 batch that are leveraging AI agents for automation. Examples include Caseflood, which replaces law firm admin with AI agents, and Tally, which automates accounting, tax, and audit processes.
Context: The author mentions Agent-Driven Automation as a significant trend in the Y Combinator Winter 2025 batch, highlighting its application across various industries such as law, healthcare, accounting, and enterprise systems. The context is to showcase the innovative use of AI agents to automate and optimize business processes, thereby reducing costs and increasing efficiency.
Vertical AI in Regulated Markets
Opinion: The document conveys a positive sentiment towards Vertical AI in Regulated Markets, suggesting that the tailored AI solutions for these industries can significantly improve efficiency and compliance. The ability to automate complex processes and reduce timeframes is seen as a valuable innovation.
Supporting Info: The author provides examples of startups like Archon, which reduces FedRAMP compliance time, and HealthKey, which matches patient EHR data with clinical trial eligibility. These examples illustrate the practical benefits and applications of Vertical AI in regulated markets.
Application-Level Use Cases
Once integrated, the unstructured dataset, combined with the structured dataset, enables numerous powerful applications for generating alpha in our investment process. We categorize these as either "application-level" (directly surfacing insights) or "unstructured derivatives" (a data layer processed further on top of the unstructured dataset).
Portfolio Monitoring
Our primary use case is tracking updates related to specific companies over time. This visibility across our extensive portfolio helps us identify companies with valuable information "nuggets." We extract semantic triggers that flag essential opportunities, such as key hires, traction metrics, or product roadmap developments. By capturing author opinions (typically from managers reporting on their companies), we can classify sentiment on the company. We further extract relevant excerpts from the mention, such as revenue growth (e.g., “30% QoQ”).
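A toy sketch of what such a derivative could look like; the trigger taxonomy and regex are illustrative, and in our pipeline the classification itself is LLM-driven.

```python
# Illustrative "semantic trigger" derivative over extracted company mentions.
import re
from dataclasses import dataclass

TRIGGERS = ("key_hire", "traction_metric", "roadmap_update", "fundraise")

@dataclass
class PortfolioSignal:
    company_id: str
    trigger: str        # one of TRIGGERS
    sentiment: str      # author sentiment: "positive", "neutral", "negative"
    excerpt: str        # e.g. "revenue grew 30% QoQ"

GROWTH_RE = re.compile(r"\d+(\.\d+)?%\s*(QoQ|MoM|YoY)", re.IGNORECASE)

def growth_excerpts(mention_text: str) -> list[str]:
    # Pull short, quotable metrics like "30% QoQ" out of a mention.
    return [m.group(0) for m in GROWTH_RE.finditer(mention_text)]
```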
These unstructured derivatives power various internal decision-making tools.
Relationship Modeling
We identify relationships mentioned in documents, particularly customer connections and interpersonal networks. Our system tracks 12 organization relationship types (including supplier, competitor, reference, and partnership) and 11 person-to-person relationship types (such as customer, reference, and co-worker).
While structured data provides some relationship insights, certain connections—especially customer/supplier networks—are difficult to map without unstructured data analysis. By integrating person-to-person relationships into our structured dataset's network primitive, we enable more accurate relationship inferences and intelligent networking capabilities.
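For illustration, a relationship predicate can be thought of as a provenance-carrying triple; the relation literals below are examples drawn from the taxonomies mentioned above, not the complete sets.

```python
# Illustrative predicate triple with provenance back to the source document.
from dataclasses import dataclass
from typing import Literal

OrgRelation = Literal["supplier", "competitor", "reference", "partnership"]
PersonRelation = Literal["customer", "reference", "co-worker"]

@dataclass
class RelationshipPredicate:
    subject_id: str          # resolved entity id in the structured dataset
    relation: str            # an OrgRelation or PersonRelation value
    object_id: str
    source_document_id: str  # provenance back to the unstructured corpus
    excerpt: str             # the sentence that asserted the relationship

# e.g. "...1,400+ companies using their services, including Notion..." ->
# RelationshipPredicate(subject_id="org:deel", relation="supplier",
#                       object_id="org:notion", source_document_id="doc:wf3-deck",
#                       excerpt="notable customers like Notion, Andela, and Bubble")
```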
You can see some of the extracted predicates from The Weekend Fund 3.0 deck referenced above.
Expertise Search
Our internal "Knowledge" application allows unified searching and reading of the unstructured dataset. Users can view summaries of when someone is mentioned in documents and read content authored by specific individuals. By incorporating unstructured data into our Expertise Data Primitive, we can identify subject matter experts during diligence processes via search.
We’ve also begun to explore tabular network-exploration modalities in a filterable view, which allow us to model second-degree connections.
Domain Intelligence
We've developed an edge in identifying emerging domains and tracking domain momentum by analyzing narratives from a curated set of authors. This provides strategic advantages in opportunity identification and market positioning. One example is using unstructured data and large language models to evolve market and opinion maps over time.
Structured Data Enrichment
We leverage unstructured data to extract and organize structured information—funding rounds, investments, acquisitions, positions, and new companies—which we then reintegrate into our primary dataset.
This critical process strengthens historical and up-to-date data, enhancing our modeling capabilities and helping us identify rare opportunities. Until 2025, this process was entirely manual. Automation has accelerated our processing speed 10x and improved our ability to add new companies, funding rounds, and acquisitions to our dataset before they appear in licensed databases—if they ever do.
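A sketch of the kind of structured record we pull back out before re-integration; the field names are illustrative.

```python
# Illustrative funding-round record extracted from unstructured text.
from datetime import date
from pydantic import BaseModel

class FundingRoundRecord(BaseModel):
    company_id: str | None   # resolved id, or None if the company is new to us
    company_name: str
    round_type: str          # "Seed", "Series A", ...
    amount_usd: float | None
    lead_investor: str | None
    announced_on: date | None
    source_document_id: str

# e.g. "Sequoia India led InVideo's Series A" ->
record = FundingRoundRecord(
    company_id=None,
    company_name="InVideo",
    round_type="Series A",
    amount_usd=None,
    lead_investor="Sequoia India",
    announced_on=None,
    source_document_id="doc:wf3-deck",
)
```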
Cross-Reference Analysis
We analyze document groups collectively—the aggregate narratives and updates they contain—to compare manager data rooms for context. We cross-reference information from different managers to build a more complete and verified picture for companies with multiple updates from different authors.
Orchestration + End-to-End Pipeline
In summary, our pipeline functions as follows (a stub sketch of the flow follows the list):
Various feeder types extract and add content to our unstructured pipeline.
The unstructured pipeline extracts entities from any given document, enriches them, and entity-resolves them against our dataset.
Unstructured derivatives extract various features and content from the unstructured dataset.
Applications serve the information in context with our structured dataset for various use cases.
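A stub sketch of that flow, with placeholder functions standing in for each stage:

```python
# End-to-end flow in miniature; each stub stands in for a stage described above.
def convert_to_text(doc: dict) -> str:           # format flexibility (PDF, audio, ...)
    return doc["content"]

def extract_entities(text: str) -> list[dict]:   # LLM extraction + in-document dedup
    return []

def resolve_entities(entities: list[dict], text: str) -> list[dict]:
    return entities                              # research, matching, confidence

def store(doc: dict, resolved: list[dict]) -> None:
    pass                                         # NocoDB tables + structured links

def derive(resolved: list[dict]) -> None:
    pass                                         # triggers, predicates, enrichment

def process_document(doc: dict) -> None:
    text = convert_to_text(doc)
    resolved = resolve_entities(extract_entities(text), text)
    store(doc, resolved)
    derive(resolved)
```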
Beyond the technical achievements, this system has become a fundamental differentiator for Level Ventures. It enhances every aspect of our investment process—from thesis development to portfolio support—giving us an edge in identifying non-obvious opportunities. We believe that this platform will bring our search costs of frontier technologies down significantly and boost alpha discovery.
For our portfolio funds, we're excited to begin offering this framework as a service soon, enabling them to extract similar insights from their own unstructured data.
While we continue to value the human elements of venture investing, our structured + unstructured data approach ensures we're bringing the best of both worlds to every decision. In our next post, we'll share specific market insights from this system and preview our upcoming domain intelligence capabilities this summer.