Data Pipelines

Tools for data ingestion, transformation, and orchestration

18 tools

Airbyte

Freemium

Open-source data integration platform. 300+ connectors for moving data to warehouses and vector stores.

Apache Airflow

Open Source

Workflow orchestration platform. Schedule and monitor complex data pipelines with Python DAGs.

Apify

Freemium

Cloud platform for web scraping and automation. Pre-built scrapers for popular sites. Proxy and storage included.

Azure Document Intelligence

Freemium

Microsoft's OCR and document extraction service. Extracts data from forms, receipts, and invoices with pre-built and custom models.

Browserbase

Freemium

Headless browser infrastructure for AI agents. Run browser sessions at scale for web automation.

Crawlee

Open Source

Web scraping and browser automation library. Handles anti-bot protections. TypeScript and Python.

Dagster

Freemium

Data orchestration platform. Asset-based pipelines with built-in testing and type checking.

dbt

Freemium

SQL-based data transformation tool. Define models, test, and document your data pipeline in version-controlled SQL.

Docling

Open Source

IBM's document conversion library. Converts PDF, DOCX, and PPTX to Markdown or JSON. OCR and table extraction built in.

Embedchain

Open Source

RAG framework by Mem0. Create AI apps over any data in minutes. Supports 20+ data source types.

Firecrawl

Freemium

Turn websites into LLM-ready data. Crawl, scrape, and convert web pages to clean Markdown for RAG.

Jina Reader

Freemium

Convert any URL to LLM-friendly text. Simple API: prefix the target URL with https://r.jina.ai/. Free tier available.
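
The prefix convention can be sketched in a few lines of Python. This is a minimal illustration, assuming the reader endpoint is https://r.jina.ai/ and that it returns plain text. The helper names `reader_url` and `fetch_as_text` are hypothetical, not part of any Jina SDK, and rate limits or API keys are not handled.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Assumed reader endpoint; the target URL is appended to it unchanged.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the reader URL by prefixing an absolute http(s) URL."""
    parts = urlsplit(target_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target_url!r}")
    return READER_PREFIX + target_url


def fetch_as_text(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the converted page text (hypothetical helper; needs network access)."""
    request = Request(
        reader_url(target_url),
        headers={"User-Agent": "reader-example/0.1"},  # illustrative value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the URL; call fetch_as_text() to actually retrieve content.
    print(reader_url("https://example.com/docs"))
```

Calling `fetch_as_text("https://example.com/docs")` would then return the page as text suitable for chunking or embedding, subject to the service's current limits.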

LlamaParse

Freemium

LlamaIndex's document parser. Handles complex PDFs with tables, charts, and mixed layouts for RAG.

Marker

Open Source

Fast PDF to Markdown converter. Handles complex layouts, tables, and equations. Local processing.

MegaParse

Open Source

Universal document parser. Supports PDF, DOCX, PPTX, and more. Integrates with LangChain and LlamaIndex.

Prefect

Freemium

Modern workflow orchestration. Python-native, with retries, caching, and observability built in.

R2R (SciPhi)

Open Source

Production-ready RAG engine. Ingestion, search, and generation in one system with knowledge graph support.

Unstructured

Freemium

ETL for unstructured data. Extract and transform PDFs, images, HTML, and Office docs for LLM consumption.
