Data is the core infrastructure of enterprise software. As enterprises adopt AI and data-driven workflows, the systems that collect, process, and govern data have become a primary source of competitive advantage. The “data layer” determines not just how software scales, but how effectively organizations generate insight, maintain compliance, and innovate. This layer is undergoing a structural shift. Cloud-native architectures, real-time and unstructured data, and AI workloads have transformed data infrastructure from a back-office function into a strategic control point across the enterprise.
This research maps the Enterprise Data Layer end-to-end – from data generation and ingestion to storage, analytics, and governance – highlighting the technologies reshaping each segment, the operational challenges enterprises face, and the resulting investment opportunities.
The goal is simple: identify where value is being created today in the data stack, and where it will concentrate next as data becomes the foundation of enterprise software.
1. Data Generation & Labelling
Data is the fuel for AI models – a key layer in the stack determining model quality, safety, and scope. High-quality data is foundational for building large AI models. The performance of today’s advanced large language models (LLMs) depends heavily on the quality and scale of their training data. However, the push toward ever-larger models is outstripping the available supply of real-world, labelled data. The pool of usable internet data is running dry.
This scarcity has sparked solutions enabling companies to generate new synthetic data and unlock previously inaccessible unstructured data through automated data labelling.
1.1 Synthetic Data
Synthetic data has emerged as the structural remedy to the lack of real-world data: it delivers abundant, label-rich, and compliant datasets, allows edge case scenarios to be generated on-demand, and is available at a materially lower cost. By augmenting limited real datasets with extensive artificial examples, synthetic data enables continued performance improvements at scale without being bottlenecked by data scarcity.
Synthetic data is particularly valuable when real data is scarce, expensive, or sensitive (e.g. medical records or personal information) or where critical edge-case coverage and outcome diversity are key for the application (e.g. autonomous vehicles, healthcare). While powerful, synthetic data introduces risks including reduced realism, bias amplification, and potential model degradation when it substitutes too heavily for real-world data.
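As a minimal illustration of the augmentation pattern (not any vendor's method), the sketch below fits a simple per-class Gaussian to a scarce real dataset and samples additional synthetic rows from it; the features and labels are invented:

```python
import numpy as np

# Hypothetical scarce real dataset: (transaction_amount, account_age_days) with fraud labels
real_X = np.array([[120.0, 30.0], [90.0, 45.0], [2500.0, 2.0], [15.0, 400.0], [3100.0, 5.0]])
real_y = np.array([0, 0, 1, 0, 1])

def synthesize(X, y, per_class=100, seed=0):
    """Sample synthetic rows from a per-class Gaussian fitted to the scarce real data."""
    rng = np.random.default_rng(seed)
    synth_X, synth_y = [], []
    for label in np.unique(y):
        cls = X[y == label]
        mean = cls.mean(axis=0)
        cov = np.cov(cls, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularise tiny classes
        synth_X.append(rng.multivariate_normal(mean, cov, size=per_class))
        synth_y.append(np.full(per_class, label))
    return np.vstack(synth_X), np.concatenate(synth_y)

aug_X, aug_y = synthesize(real_X, real_y)
# Training would then run on the real rows plus the much larger synthetic set
print(aug_X.shape, np.bincount(aug_y))  # (200, 2) [100 100]
```

Real synthetic-data platforms use far richer generators (GANs, diffusion models, simulators and digital twins), but the pattern is the same: fit to scarce real data, sample broadly, then train on the mix.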
Market Dynamics:
The use of synthetic data has increased dramatically over the last few years as foundational models have continued to scale and increase in size, exhausting the supply of readily available real data. As models continue to scale, this trend is only likely to accelerate. (See illustration below)

Several trends are shaping the synthetic data landscape:
- Vertical, closed-loop stacks: Startups are converging on domain-specific generators that encode industry edge cases and workflows, with vertical-first GTM and productised offerings (APIs, dataset packs, marketplaces).
- Multimodal by default: Modern pipelines span 3D, video, sensor, time-series, and tabular data – often built on digital twins or game engines to generate rare, safety-critical scenarios.
- Moats in governance and fidelity: Differentiation is shifting to coverage and realism metrics, privileged real-data access, time-to-coverage SLAs, and governance tooling (bias audits, collapse detection, traceability).
- Platforms over point solutions: Leading vendors are evolving from data providers into end-to-end platforms with coverage planning, controllable generation, auto-labelling, and quality scoring.
- Synthetic data as simulation infrastructure: Beyond training, synthetic data increasingly underpins simulation and experimentation – from clinical trials to marketing and software testing.
Notable Companies:
1.2 Data Labelling
Data labelling gives structure and meaning to raw data so models can learn. Early LLMs relied heavily on labelled examples for supervised learning and instruction tuning (clear input-output pairs that taught models how to follow tasks and reflect user intent).
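For concreteness, a single supervised instruction-tuning record is little more than an input-output pair plus labelling metadata; the schema below is illustrative rather than any provider's format:

```python
# Illustrative labelled record for instruction tuning (the schema is hypothetical)
example = {
    "instruction": "Summarise the customer complaint in one sentence.",
    "input": "The invoice I received on 12 March lists a subscription I cancelled in January...",
    "output": "Customer was billed for a subscription cancelled two months earlier.",
    "labels": {"domain": "customer-support", "quality_score": 0.92, "annotator_id": "a-1042"},
}
```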
High-quality annotation remains critical because it drives:
- Accuracy: Life-or-death distinctions, like telling a pedestrian from a traffic sign.
- Scale: Large labelled datasets make models more robust and performant.
- Adaptability: Good labels help models generalize and capture rare edge cases.
- Domain nuance: Legal, medical, and customer-support models need precise, contextual annotation.
- Safety: Labels underpin hallucination detection, evaluation, and alignment.
The bottleneck is supply. The world generates enormous amounts of unstructured text, images, and audio – but only a tiny share is clean, consistent, and usable for training. This gap created a market for specialist labelling firms that transform messy public data into training-ready datasets and help enterprises label proprietary data to ground their internal LLMs with domain-specific context.
Market Dynamics:
The category began as labour-intensive annotation work with large workforces tagging images, transcribing audio, and classifying text. Early providers solved a staffing and workflow problem: sourcing, coordinating, and quality-controlling a global annotator pool. Over time, they evolved into full data-operations platforms responsible for consistency, QA, and delivery of high-quality labelled data. Today, expectations have shifted again: providers are increasingly asked to integrate labelled datasets into model pipelines, becoming broader training-data infrastructure players.
The market is now shifting on two fronts:
- Automation: Routine labelling is being semi-automated through AI-assisted annotation, heuristics, and active learning, cutting cost and cycle time while preserving quality (see the sketch after this list).
- Shift to High-Skill Labour: Human work is moving from basic tagging to high-skill reinforcement learning from human feedback (RLHF) and preference modelling, where domain experts (clinicians, lawyers, bankers) shape model behaviour on specialised tasks.
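The automation side typically relies on active learning, which is simple in outline: let the current model score everything, and send only the examples it is least certain about to human annotators. A minimal sketch, assuming a scikit-learn-style `predict_proba` interface (the model and budget are placeholders):

```python
import numpy as np

def select_for_human_review(model, unlabelled_X, budget=50):
    """Uncertainty sampling: route only the examples the model is least sure about to human annotators."""
    proba = model.predict_proba(unlabelled_X)   # assumed scikit-learn-style interface, shape (n_examples, n_classes)
    confidence = proba.max(axis=1)              # confidence of each top prediction
    return np.argsort(confidence)[:budget]      # indices of the least-confident examples

# The remaining examples are auto-labelled with the model's own predictions and spot-checked,
# shrinking the share of data that needs full manual annotation on each iteration.
```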
Labelling has evolved from commodity annotation to a strategic data-operations layer that produces scarce, high-signal supervision for training, fine-tuning, and continuous improvement of both enterprise and frontier models. Beyond labelling data to feed directly into LLMs, enterprises are using AI-powered labelling solutions to create structured datasets from proprietary internal data (using tools like Encord) or to enrich datasets with publicly available sources (for example, automating CRM enrichment with tools like Kernel or Freckle).
2. Data Ingestion & Transformation
The ingestion and transformation layer moves raw data from source systems into analytics-ready formats within data warehouses and lakes. It includes the pipelines and tools that extract data, transform it, and load it for downstream use. As data volume, variety, and velocity have grown, this layer has evolved from batch ETL to cloud-native ELT, real-time streaming, and increasingly AI-assisted pipelines.
Three trends are reshaping the layer:
- ETL → ELT → Streaming as real-time use cases proliferate
- Unstructured data ingestion driven by AI workloads
- Intelligent pipelines with greater automation and adaptability
2.1 ETL → ELT → Streaming
From batch ETL to ELT:
Traditional ETL emerged when compute and storage were scarce, requiring data to be transformed before loading. With distributed systems (Hadoop, Spark) and cloud warehouses, ELT became viable: raw data is loaded first, then transformed in-warehouse. Lower storage costs and scalable compute made ELT more flexible and faster, establishing it as the backbone of the modern data stack. This model is tightly coupled with cloud platforms like Snowflake and BigQuery, and transformation tools such as dbt.
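The distinction is easiest to see in code. In ELT, raw records land in the warehouse untransformed and the modelling happens afterwards as in-warehouse SQL; the sketch below uses sqlite3 purely as a stand-in for Snowflake or BigQuery, with made-up table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# 1. Load: raw events land in a staging table exactly as extracted from the source system
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount_cents INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o-1", 1299, "DE"), ("o-2", 450, "de"), ("o-3", 9900, "FR")],
)

# 2. Transform: cleaning and modelling run inside the warehouse, typically managed by a tool like dbt
conn.execute("""
    CREATE TABLE orders_by_country AS
    SELECT UPPER(country) AS country, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    GROUP BY UPPER(country)
""")
print(conn.execute("SELECT * FROM orders_by_country ORDER BY country").fetchall())
# [('DE', 17.49), ('FR', 99.0)]
```

Under traditional ETL, the cleaning and aggregation in step 2 would have run on a separate transformation server before any data reached the warehouse.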
Rise of streaming data pipelines:
Real-time use cases—live analytics, personalization, fraud detection—have driven adoption of streaming architectures. Tools like Apache Kafka, Apache Flink, and Spark Streaming process continuous event streams from sources such as IoT devices, applications, and logs, delivering insights in seconds rather than hours.
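As a toy illustration of the streaming pattern (not a production design), the loop below consumes a hypothetical payment-events topic with the open-source kafka-python client and applies a trivial fraud rule to each event as it arrives; real deployments would push this logic into Flink or Spark Streaming jobs:

```python
import json
from kafka import KafkaConsumer  # open-source kafka-python client

consumer = KafkaConsumer(
    "payment-events",                               # hypothetical topic
    bootstrap_servers="localhost:9092",             # assumes a locally running broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:                             # processes each event seconds after it is produced
    payment = record.value
    if payment.get("amount", 0) > 10_000:           # toy fraud rule; real systems run Flink jobs or ML models here
        print(f"flag for review: {payment['id']}")
```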
Most modern architectures now combine batch and streaming pipelines, applying each where latency and cost trade-offs make sense.
Market dynamics:
The integration market is mature but rapidly shifting cloudward. Legacy vendors (Informatica, Talend, IBM) have been displaced or forced to modernize, while cloud-native ELT players like Fivetran and Matillion scaled quickly by focusing on simple, reliable ingestion into cloud warehouses. By the mid-2020s, ELT had largely replaced on-prem ETL.
Streaming is the fastest-growing segment. Kafka has become the default data backbone, with startups building higher-level abstractions on top: Glassflow (stream deduplication and joins for ClickHouse), Estuary (sub-second real-time pipelines), Quix (IoT and industrial streaming), and Tinybird (real-time analytics on ClickHouse).
Notable Companies:
2.2 Unstructured Data Transformation
IBM estimates that over 80% of enterprise data is unstructured, spanning documents, text, images, audio, video, logs, and social content. Historically, transformation pipelines focused on structured data, as unstructured processing relied on brittle NLP, OCR, and ML systems with limited accuracy. The rise of LLMs has materially improved extraction and understanding of unstructured data, unlocking use cases from AI search and copilots to compliance monitoring and document summarization.
Two dominant approaches have emerged to connect unstructured data to downstream systems:
Category 1, Extract-and-Structure Approach: Unstructured data is converted into structured formats (e.g. CSV, JSON) by extracting predefined fields before storage or analysis. Formerly known as intelligent document processing (IDP), this approach has evolved with LLMs to deliver higher accuracy and flexibility. Vendors like Unstructured, Rossum, and Graphlit preprocess documents, extract text and metadata, and map them into queryable schemas.
Category 2, Direct-Load with On-Demand Extraction: Unstructured content is stored in raw or semi-structured form and indexed in vector databases. Information is retrieved and interpreted at query time via LLMs, typically using retrieval-augmented generation (RAG). This model avoids upfront structuring and is well-suited to search and Q&A use cases, as seen with Glean and Hebbia.
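A stripped-down sketch of the Category 2 pattern: chunks are stored alongside their embeddings, and at query time the closest chunks are retrieved and passed to an LLM as grounding context. The hash-based embedding function below is only a stand-in for a real embedding model, and the documents are invented:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: hashes words into a fixed-size vector and normalises it."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# Ingest: keep the raw chunks and index their embeddings (the role a vector database plays)
chunks = [
    "Q3 revenue grew 14% year over year, driven by the EMEA region.",
    "The retention policy requires customer records to be deleted after 7 years.",
    "Onboarding a new vendor requires a completed security questionnaire.",
]
index = np.stack([embed(c) for c in chunks])

# Query time: retrieve the most similar chunks and hand them to an LLM as grounding context
query = "how long do we keep customer records?"
scores = index @ embed(query)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]
prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
```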
The optimal approach depends on the use case, particularly requirements around latency, accuracy, repeatability, and downstream system integration:
- Category 1:
  - Advantages:
    - Higher initial accuracy through specialized extraction models
    - Explicit data validation and quality controls
    - Better for compliance-heavy environments requiring audit trails
    - Lower query-time latency since data is pre-structured
    - More cost-effective for repeated queries on the same data
  - Challenges:
    - Requires upfront schema definition
    - Less flexible for ad-hoc queries
    - Additional preprocessing overhead
    - Potential information loss during extraction
  - Use cases: Excels in scenarios requiring high accuracy, regulatory compliance, and repeated queries, such as financial processing, invoice automation, and compliance-heavy industries like healthcare and insurance.
- Category 2:
  - Advantages:
    - Maximum flexibility for diverse query patterns
    - Preserves original document context
    - No upfront schema requirements
    - Better for exploratory analysis and semantic search
    - Easier to add new data sources
  - Challenges:
    - Higher query-time costs (LLM inference)
    - Potential accuracy variability
    - More challenging to ensure consistent results
    - Harder to implement compliance controls
  - Use cases: Dominates in knowledge work, research, and customer support scenarios where query patterns are unpredictable and semantic understanding is paramount: legal, customer success, research, and knowledge search.
Notable Companies:
2.3 Intelligent Data Pipelines
Traditional data pipelines are manual and brittle: engineers write and maintain transformation code, manage schedules, debug failures, and resolve data quality issues by hand. Intelligent data pipelines aim to automate this lifecycle by embedding AI across development, testing, monitoring, and optimization—shifting from manual pipeline engineering to agentic data engineering. Vendors such as Monte Carlo, Bigeye, Soda Data, Great Expectations, Prophecy, and Mage are driving this transition.
Market dynamics cluster around four areas:
- Data Pipeline Observability: Platforms like Monte Carlo, Bigeye, and Soda Data monitor pipeline health (volume, schema, drift) and alert teams to costly failures (a minimal version of such a check is sketched after this list).
- Automated Testing & CI/CD: Tools such as Great Expectations bring test-driven development to data, enabling automated quality gates and safer deployments.
- No-Code and AI-Assisted Pipeline Design: Vendors like Prophecy and Mage use visual builders and LLM copilots to generate and maintain pipeline code, dramatically reducing development time for standard use cases.
- Self-Optimizing Pipelines: Cloud services (AWS Glue, Azure Data Factory, GCP Dataflow) and startups like Upsolver, Ascend.io, and Nexla automate scaling, schema evolution, performance tuning, and failure recovery.
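To make the observability idea concrete, here is a minimal freshness-and-volume check of the kind such platforms run continuously; the thresholds and inputs are hypothetical and would normally come from warehouse metadata rather than hand-rolled code:

```python
import datetime as dt

def check_table_health(row_count: int, expected_min_rows: int,
                       last_loaded_at: dt.datetime, max_staleness_hours: int = 6) -> list[str]:
    """Return alerts for the two most common silent pipeline failures: missing data and stale loads."""
    alerts = []
    if row_count < expected_min_rows:
        alerts.append(f"volume anomaly: {row_count} rows, expected at least {expected_min_rows}")
    staleness = dt.datetime.now(dt.timezone.utc) - last_loaded_at
    if staleness > dt.timedelta(hours=max_staleness_hours):
        alerts.append(f"freshness anomaly: last successful load was {staleness} ago")
    return alerts

# In a real deployment these inputs come from warehouse metadata queries run on a schedule
print(check_table_health(row_count=1_200, expected_min_rows=10_000,
                         last_loaded_at=dt.datetime(2025, 1, 1, tzinfo=dt.timezone.utc)))
```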
Most tools remain point solutions, strong in one layer of the pipeline lifecycle. As with DevOps, we expect consolidation toward a small number of platforms that unify orchestration, observability, and AI-assisted development into a single workflow.
Notable Companies:
3. Data Storage
The data storage layer is the foundation of modern data infrastructure, spanning warehouses, lakes, and specialized databases. As data volumes and formats grow—and AI becomes embedded across products and operations—storage must support unstructured data, real-time access, and tight integration with ML pipelines. Traditional databases built for structured reporting are no longer sufficient.
This shift has driven two core architectural responses:
- Vector databases, which store and search unstructured data via embeddings, enabling semantic search, retrieval, and AI-native applications.
- Lakehouse architectures, which unify data lakes and warehouses into a single, scalable foundation for analytics and AI, enabled by open formats such as Delta Lake and Apache Iceberg.
The data storage layer has evolved in response to rising scale, complexity, and performance demands. Early systems centered on relational databases, optimized for structured, transactional workloads with ACID (atomicity, consistency, isolation, durability) guarantees. As applications scaled and data became more heterogeneous, distributed SQL and NoSQL databases emerged to support flexibility and web-scale use cases.
The explosion of data volume then drove the split between data warehouses (optimized, governed analytics) and data lakes (cheap, schema-less storage). This separation added operational complexity, prompting the rise of the lakehouse: a unified architecture combining warehouse reliability with lake-scale economics, enabled by formats like Delta Lake and Apache Iceberg.
Today, AI workloads are reshaping the stack again. Lakehouses have become the system of record, complemented by purpose-built databases—notably vector databases—to support real-time, unstructured, and AI-native workloads.
3.1 Vector Databases
Vector databases store data as high-dimensional embeddings, enabling semantic similarity search rather than exact-match queries. They are now core AI infrastructure, powering use cases such as semantic search, recommendations, and Retrieval-Augmented Generation (RAG) by grounding model outputs in relevant enterprise data. Gartner estimates that by 2026, over 30% of enterprises will use vector databases to augment AI models with proprietary data.
Market Dynamics:
The generative AI wave triggered rapid adoption and heavy funding – Weaviate, Zilliz (Milvus), and Pinecone raised significant capital, alongside projects like Qdrant and Chroma. However, vector search is quickly commoditizing as incumbents integrate it natively (e.g. AWS OpenSearch, PostgreSQL pgVector). As a result, standalone vector databases must differentiate on performance, scale, and advanced AI use cases.
Notable Companies:
3.2 Data Lakehouse
A data lakehouse unifies data lakes and warehouses by storing all raw data – structured, semi-structured, and unstructured – in low-cost cloud object storage. A transactional metadata layer adds warehouse features like schema enforcement, ACID transactions, and SQL analytics. This architecture solves the limitations of separate data lakes and warehouses: it combines cheap, scalable storage with fast queries and governance, enabling business intelligence, machine learning, and streaming analytics on a single platform without costly ETL pipelines or vendor lock-in.
Market Dynamics:
The data lakehouse has quickly moved from niche concept to mainstream architecture for large data platforms. Industry surveys show a surge in adoption and understanding over the past few years: by 2025, over 38% of organizations reported a “detailed understanding” of lakehouse concepts, up from just 4% in 2023. A recent survey of enterprise data teams found that enabling AI use cases is now the top driver of data architecture investments. Over 85% of organizations have budgeted for and are implementing AI-ready data infrastructure, with the lakehouse identified as the strategic core.
There are three open table formats that power lakehouse technology: Apache Iceberg, Apache Hudi, and Delta Lake. Apache Iceberg has gained the most traction recently and is becoming the default format, primarily due to its vendor-neutral status under the Apache Software Foundation and superior performance. Major platforms and cloud providers (Snowflake, AWS, and GCP) have widely adopted it, and enterprises using Apache Iceberg can move data between platforms without vendor lock-in. A rich ecosystem of startups, including Dremio and Onehouse, is building tools to optimize lakehouse storage cost and performance, especially for Iceberg. Meanwhile, incumbents like Databricks (which coined the lakehouse term) and Snowflake are in a high-profile race to dominate the unified analytics market. Cloud giants AWS, Google, and Microsoft have all launched lakehouse offerings as well (AWS Lake Formation, Google BigLake, and Microsoft Fabric) to blend lake and warehouse capabilities.
Notable Companies:
4. Data query & analysis
Enterprises manage two fundamentally different data types: unstructured content (documents, emails, PDFs, audio) and structured tables in warehouses and lakes. Historically, insight came almost entirely from structured data, as extracting value from unstructured content was manual, slow, and expensive.
Market Trends:
Vector stores deployed alongside lakes and lakehouses now index unstructured assets and semantically join them with relational data. Natural-language interfaces query across both, enabling analysis that is both statistically precise and contextually aware.
RAG has become the core data primitive: LLMs retrieve relevant enterprise content to ground answers in internal data. Leading systems such as Hebbia go beyond simple vector search, separating question understanding from retrieval and composing evidence across many documents with citations. In parallel, the Model Context Protocol (MCP) has emerged as an AI-native query gateway—an open standard from Anthropic that securely exposes tools like SQL, vector search, and governed documents through a unified interface. Together, RAG and MCP enable AI-native agents that turn fragmented enterprise knowledge into executable workflows.
The application layer is shifting from passive Q&A to agentic execution. With MCP, applications can retrieve data and take actions across systems (e.g. drafting contracts, updating CRMs). As a result, winners are moving beyond narrow RAG point solutions toward vertical platforms that automate end-to-end workflows and produce domain-specific outputs.

Four models coexist:
- Vertical Platforms: Purpose-built platforms with wide-ranging AI features that automate a broad set of a given vertical's workflows
- Point Solutions: Target specific pain points in a vertical workflow but lack breadth and interoperability across workflows
- Foundation Models & General Tools: Generalist AI platforms offer flexible capabilities but require heavy customisation for vertical precision, though foundation models are starting to be trained for vertical use cases (e.g. OpenAI for financial modelling)
- General Productivity Point Solutions: Horizontal AI tools solve narrow tasks common to all firms (e.g. note taking / search) but lack domain context and integration depth
Enterprises increasingly prefer vertical platforms, driven by consolidation benefits, tighter integrations, and higher output quality. While generalist tools handle simple tasks, complex workflows demand domain accuracy, workflow-native design, expert account management, and flexible model choice.
This has fueled rapid adoption of vertical AI platforms such as Luminance, Harvey and Legora (legal), Model ML and Rillet (finance), Abridge (healthcare), and Liberate (insurance). Their moat combines proprietary retrieval, domain-tuned models, and generation optimized for industry-specific accuracy and formats.
Looking ahead, differentiation will shift from capability to cost and latency. As buyers compare AI apps against each other—not against humans—winners will optimize aggressively through model routing, context caching, and efficient retrieval. Providers that fail to do so will see margins compress as customers demand faster responses and lower cost per task.
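A hedged sketch of what model routing means in practice: simple requests go to a small, cheap model and only long or reasoning-heavy ones escalate to a larger one. The model names, per-token prices, and the difficulty heuristic below are placeholders, not any vendor's pricing:

```python
# Hypothetical per-1K-token prices for a small and a large model
MODEL_PRICES = {"small": 0.0002, "large": 0.0100}

def route(prompt: str, needs_reasoning: bool) -> str:
    """Send short, simple requests to the cheap model; escalate long or reasoning-heavy ones."""
    if needs_reasoning or len(prompt.split()) > 400:
        return "large"
    return "small"

def estimated_cost(prompt: str, needs_reasoning: bool, expected_output_tokens: int = 500) -> float:
    model = route(prompt, needs_reasoning)
    total_tokens = len(prompt.split()) * 1.3 + expected_output_tokens  # rough token estimate
    return MODEL_PRICES[model] * total_tokens / 1000

print(estimated_cost("Summarise this meeting note.", needs_reasoning=False))  # routed to the small model
```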
Notable Companies:
5. Data Governance
The data governance layer for enterprise AI has evolved rapidly over the past decade, unfolding in three waves:
- Pre–Gen AI governance, centered on cataloging and control
- Gen AI–driven transformation, expanding governance to unstructured data and models
- AI/ML security and agentic compliance, addressing the risks of autonomous systems
Across the last two waves, governance has expanded well beyond lineage and cataloging to include model transparency, data ethics, cross-border sovereignty, and real-time monitoring of data usage and quality.
5.1 Pre-GenAI
Legacy platforms such as Collibra, Alation, and BigID established the foundations of enterprise data governance through catalogs, lineage, stewardship, and policy enforcement. While effective for BI and warehouse-centric stacks, these systems struggled to scale with cloud-native data sprawl and emerging AI workloads.
5.2 The GenAI Transformation
Generative AI fundamentally changed governance requirements. Enterprises now needed to govern unstructured data, vector stores, models, and LLM pipelines—often in real time. This drove demand for:
- AI-driven automation in discovery, classification, and anomaly detection
- Collaborative governance spanning legal, compliance, product, and data teams
- Regulatory agility to adapt to fast-changing privacy and AI regulations
Platforms like Atlan and Alation have embedded these capabilities, while BigID expanded into data security posture management (DSPM), unifying security, privacy, and compliance across traditional and AI-native data assets.
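As a minimal illustration of automated discovery and classification, the sketch below flags columns that look like personal data from their names and sampled values; the patterns and column names are illustrative, and production DSPM tools go far beyond regex checks:

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def classify_column(name: str, sample_values: list[str]) -> list[str]:
    """Flag a column as sensitive if its name or sampled values look like personal data."""
    tags = []
    if any(hint in name.lower() for hint in ("email", "ssn", "iban", "phone", "dob")):
        tags.append("sensitive-by-name")
    for label, pattern in PII_PATTERNS.items():
        if any(pattern.search(v) for v in sample_values):
            tags.append(label)
    return tags

print(classify_column("contact_email", ["jane.doe@example.com", "n/a"]))
# ['sensitive-by-name', 'email']
```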
5.3 AI/ML Security & Agentic Compliance
Since 2022, a new class of AI/ML-native governance platforms has emerged, focused on securing AI systems themselves. These tools address:
- Model behavior monitoring, including drift, bias, hallucinations, and attacks
- Agentic lineage and auditability, as AI agents continuously ingest and generate data
Together, these platforms enable continuous governance across AI pipelines—combining security, compliance, and ethics while preserving the speed required for enterprise AI deployment.
Notable Companies:
6. Conclusion
As AI becomes embedded in every workflow, the battle moves down the stack: the winners will be the companies that control data quality, governance, and integration—not those with marginally better models. The enterprise data layer is consolidating into fewer, broader platforms designed to deliver measurable work products under compliance constraints.
- Data is the durable moat: Proprietary data + synthetic edge-case coverage + lineage + eval systems > model novelty.
- Consolidation gravity: Orchestration, observability, governance, and eval are bundling; standalone tools become features.
- Outcome > capability: Procurement shifts toward applications that ship completed outputs with audit trails and SLAs.
- Lakehouse backbone: Iceberg-led open table formats are becoming the neutral substrate for GenAI applications.
- Human-in-the-loop (HITL) moves up-market: Differentiation shifts to expert feedback, rules, evaluation, and governance—not cheap annotation.