Data Pipelines
Tools for data ingestion, transformation, and orchestration
18 tools
Airbyte
Freemium
Open-source data integration platform. 300+ connectors for moving data to warehouses and vector stores.
Apache Airflow
Open Source
Workflow orchestration platform. Schedule and monitor complex data pipelines with Python DAGs.
Apify
Freemium
Cloud platform for web scraping and automation. Pre-built scrapers for popular sites. Proxy and storage included.
Azure Document Intelligence
Freemium
Microsoft's OCR and document extraction. Forms, receipts, invoices with pre-built and custom models.
Browserbase
Freemium
Headless browser infrastructure for AI agents. Run browser sessions at scale for web automation.
Crawlee
Open Source
Web scraping and browser automation library. Handles anti-bot protections. TypeScript and Python.
Dagster
Freemium
Data orchestration platform. Asset-based pipelines with built-in testing and type checking.
dbt
Freemium
SQL-based data transformation tool. Define models, test, and document your data pipeline in version-controlled SQL.
Docling
Open Source
IBM's document conversion library. PDF, DOCX, PPTX to Markdown/JSON. OCR and table extraction built-in.
Embedchain
Open Source
RAG framework by Mem0. Create AI apps over any data in minutes. Supports 20+ data source types.
Firecrawl
Freemium
Turn websites into LLM-ready data. Crawl, scrape, and convert web pages to clean Markdown for RAG.
Jina Reader
Freemium
Convert any URL to LLM-friendly text. Simple API: prefix the target URL with https://r.jina.ai/. Free tier available.
LlamaParse
Freemium
LlamaIndex's document parser. Handles complex PDFs with tables, charts, and mixed layouts for RAG.
Marker
Open Source
Fast PDF to Markdown converter. Handles complex layouts, tables, and equations. Local processing.
MegaParse
Open Source
Universal document parser. Supports PDF, DOCX, PPTX, and more. Integrates with LangChain and LlamaIndex.
Prefect
Freemium
Modern workflow orchestration. Python-native, with retries, caching, and observability built-in.
R2R (SciPhi)
Open Source
Production-ready RAG engine. Ingestion, search, and generation in one system with knowledge graph support.
Unstructured
Freemium
ETL for unstructured data. Extract and transform PDFs, images, HTML, and Office docs for LLM consumption.
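The Jina Reader entry describes its API as a URL prefix. As a minimal sketch of what that looks like in practice, the helper below builds the reader URL for a given page. The function name `reader_url` and the validation step are illustrative assumptions, not part of Jina's tooling; the sketch only constructs the URL and makes no network request.

```python
from urllib.parse import urlsplit

# Prefix used by Jina Reader: the target URL is appended to it as-is.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Return the Jina Reader URL for an absolute http(s) URL.

    Hypothetical helper for illustration; it only builds the string.
    """
    parts = urlsplit(target)
    # Reject relative or non-web URLs early rather than sending them upstream.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


print(reader_url("https://example.com/docs"))
# prints: https://r.jina.ai/https://example.com/docs
```

The resulting URL can then be fetched with any HTTP client; rate limits and API-key requirements depend on the plan in use and are not covered here.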