
DataStore: Pandas-Compatible API with SQL Optimization

DataStore is chDB's pandas-compatible API that combines the familiar pandas DataFrame interface with the power of SQL query optimization. Write pandas-style code, get ClickHouse performance.

Key Features

  • Pandas Compatibility: 209 pandas DataFrame methods, 56 .str methods, 42+ .dt methods
  • SQL Optimization: Operations automatically compile to optimized SQL queries
  • Lazy Evaluation: Operations are deferred until results are needed
  • 630+ API Methods: Comprehensive API surface for data manipulation
  • ClickHouse Extensions: Additional accessors (.arr, .json, .url, .ip, .geo) not available in pandas
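
Because the `.str` and `.dt` accessors mirror pandas, standard pandas accessor code carries over unchanged. A minimal illustration using plain pandas itself (these are the same calls DataStore documents as supported; the ClickHouse-specific extension accessors such as `.arr` are not shown here):

```python
import pandas as pd

# Standard pandas accessor calls -- the same surface DataStore's
# .str and .dt accessors are documented to mirror.
s = pd.Series(["new york", "berlin", "tokyo"])
print(s.str.title().tolist())   # string accessor -> ['New York', 'Berlin', 'Tokyo']

ts = pd.Series(pd.to_datetime(["2024-01-15", "2024-06-01"]))
print(ts.dt.year.tolist())      # datetime accessor -> [2024, 2024]
```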

Architecture

DataStore Architecture

DataStore uses lazy evaluation with dual-engine execution:

  1. Lazy Operation Chain: Operations are recorded, not executed immediately
  2. Smart Engine Selection: QueryPlanner routes each segment to optimal engine (chDB for SQL, Pandas for complex ops)
  3. Intermediate Caching: Results cached at each step for fast iterative exploration

See Execution Model for details.
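The record-then-execute flow can be sketched with a toy lazy chain. This is a simplified illustration of the concept, not chDB's actual planner; the class and method names are invented for the sketch:

```python
class LazyChain:
    """Toy lazy operation chain: methods record steps,
    and nothing executes until collect() is called."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []  # recorded operations, not yet executed

    def filter(self, pred):
        # Return a new chain with the step appended -- no work done yet.
        return LazyChain(self.data, self.ops + [("filter", pred)])

    def map(self, fn):
        return LazyChain(self.data, self.ops + [("map", fn)])

    def collect(self):
        # Execution happens only here, once results are needed.
        rows = self.data
        for kind, fn in self.ops:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

chain = LazyChain([1, 2, 3, 4]).filter(lambda x: x > 2).map(lambda x: x * 10)
print(len(chain.ops))   # 2 recorded steps, nothing executed yet
print(chain.collect())  # [30, 40]
```

In DataStore, the analogous trigger points are calls like `to_df()` or `to_sql()`, which is also what lets the planner batch recorded steps into a single SQL query.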

One-Line Migration from Pandas

# Before (pandas)
import pandas as pd
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()

# After (DataStore) - just change the import!
from chdb import datastore as pd
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()

Your existing pandas code works unchanged, but now runs on the ClickHouse engine.

Performance Comparison

DataStore delivers significant performance improvements over pandas, especially for aggregation and complex pipelines:

| Operation | Pandas | DataStore | Speedup |
| --- | --- | --- | --- |
| GroupBy count | 347 ms | 17 ms | 19.93x |
| Complex pipeline | 2,047 ms | 380 ms | 5.39x |
| Filter + Sort + Head | 1,537 ms | 350 ms | 4.40x |
| GroupBy agg | 406 ms | 141 ms | 2.88x |

Benchmark on 10M rows. See benchmark script and Performance Guide for details.
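Numbers like these can be sanity-checked locally with a simple wall-clock harness. A hedged sketch (synthetic data, smaller than the 10M-row benchmark, timing only the pandas side; to time the DataStore side, swap the import for `from chdb import datastore as pd`):

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: 1M rows for a quick local check (the published
# benchmark uses 10M rows).
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["NY", "LA", "SF"], n),
    "salary": rng.random(n) * 100_000,
})

start = time.perf_counter()
result = df.groupby("city")["salary"].mean()
elapsed = time.perf_counter() - start
print(f"pandas groupby mean: {elapsed * 1000:.1f} ms")
```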

When to Use DataStore

Use DataStore when:

  • Working with large datasets (millions of rows)
  • Performing aggregations and groupby operations
  • Querying data from files, databases, or cloud storage
  • Building complex data pipelines
  • You want the pandas API with better performance

Use raw SQL API when:

  • You prefer writing SQL directly
  • You need fine-grained control over query execution
  • Working with ClickHouse-specific features not exposed in pandas API

Feature Comparison

| Feature | pandas | polars | DuckDB | DataStore |
| --- | --- | --- | --- | --- |
| Pandas API compatible | - | Partial | No | Full |
| Lazy evaluation | No | Yes | Yes | Yes |
| SQL query support | No | Yes | Yes | Yes |
| ClickHouse functions | No | No | No | Yes |
| String/DateTime accessors | Yes | Yes | No | Yes + extras |
| Array/JSON/URL/IP/Geo | No | Partial | No | Yes |
| Direct file queries | No | Yes | Yes | Yes |
| Cloud storage support | No | Limited | Yes | Yes |

API Statistics

| Category | Count | Coverage |
| --- | --- | --- |
| DataFrame methods | 209 | 100% of pandas |
| Series.str accessor | 56 | 100% of pandas |
| Series.dt accessor | 42+ | 100%+ (includes ClickHouse extras) |
| Series.arr accessor | 37 | ClickHouse-specific |
| Series.json accessor | 13 | ClickHouse-specific |
| Series.url accessor | 15 | ClickHouse-specific |
| Series.ip accessor | 9 | ClickHouse-specific |
| Series.geo accessor | 14 | ClickHouse-specific |
| Total API methods | 630+ | - |

Getting Started

API Reference

Advanced Topics

Configuration & Debugging

Pandas User Guides

Quick Example

from chdb import datastore as pd

# Read data from various sources
ds = pd.read_csv("sales.csv")
# or: ds = pd.DataStore.uri("s3://bucket/sales.parquet")
# or: ds = pd.DataStore.from_mysql("mysql://user:pass@host/db/table")

# Familiar pandas operations - automatically optimized to SQL
result = (ds
    .filter(ds['amount'] > 1000)           # WHERE amount > 1000
    .groupby('region')                      # GROUP BY region
    .agg({'amount': ['sum', 'mean']})       # SUM(amount), AVG(amount)
    .sort_values('sum', ascending=False)    # ORDER BY sum DESC
    .head(10)                               # LIMIT 10
)

# View the generated SQL
print(result.to_sql())

# Execute and get results
df = result.to_df()  # Returns pandas DataFrame

Next Steps