Skip to main content
Skip to main content

Execution Engine Configuration

DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.

Available Engines

EngineDescriptionBest For
autoAutomatically selects best engine per operationGeneral use (default)
chdbForces all operations through ClickHouse SQLLarge datasets, aggregations
pandasForces all operations through pandasCompatibility testing, pandas-specific features

Setting the Engine

Global Configuration

from chdb.datastore.config import config

# Option 1: Using set method
config.set_execution_engine('auto')    # Default
config.set_execution_engine('chdb')    # Force ClickHouse
config.set_execution_engine('pandas')  # Force pandas

# Option 2: Using shortcuts
config.use_auto()     # Auto-select
config.use_chdb()     # Force ClickHouse
config.use_pandas()   # Force pandas

Checking Current Engine

print(config.execution_engine)  # 'auto', 'chdb', or 'pandas'

Auto Mode

In auto mode (default), DataStore selects the optimal engine for each operation:

Operations Executed in chDB

  • SQL-compatible filtering (filter(), where())
  • Column selection (select())
  • Sorting (sort(), orderby())
  • Grouping and aggregation (groupby().agg())
  • Joins (join(), merge())
  • Distinct (distinct(), drop_duplicates())
  • Limiting (limit(), head(), tail())

Operations Executed in pandas

  • Custom apply functions (apply(custom_func))
  • Complex pivot tables with custom aggregations
  • Operations not expressible in SQL
  • When input is already a pandas DataFrame

Example

from chdb import datastore as pd
from chdb.datastore.config import config

config.use_auto()  # Default

ds = pd.read_csv("data.csv")

# This uses chDB (SQL)
result = (ds
    .filter(ds['amount'] > 100)   # SQL: WHERE
    .groupby('region')            # SQL: GROUP BY
    .agg({'amount': 'sum'})       # SQL: SUM()
)

# This uses pandas (custom function)
result = ds.apply(lambda row: complex_calculation(row), axis=1)

chDB Mode

Force all operations through ClickHouse SQL:

config.use_chdb()

When to Use

  • Processing large datasets (millions of rows)
  • Heavy aggregation workloads
  • When you want maximum SQL optimization
  • Consistent behavior across all operations

Performance Characteristics

Operation TypePerformance
GroupBy/AggregationExcellent (up to 20x faster)
Complex FilteringExcellent
SortingVery Good
Simple Single FiltersGood (slight overhead)

Limitations

  • Custom Python functions may not be supported
  • Some pandas-specific features require conversion

pandas Mode

Force all operations through pandas:

config.use_pandas()

When to Use

  • Compatibility testing with pandas
  • Using pandas-specific features
  • Debugging pandas-related issues
  • When data is already in pandas format

Performance Characteristics

Operation TypePerformance
Simple Single OperationsGood
Custom FunctionsExcellent
Complex AggregationsSlower than chDB
Large DatasetsMemory intensive

Cross-DataStore Engine

Configure the engine for operations that combine columns from different DataStores:

# Set cross-DataStore engine
config.set_cross_datastore_engine('auto')
config.set_cross_datastore_engine('chdb')
config.set_cross_datastore_engine('pandas')

Example

ds1 = pd.read_csv("sales.csv")
ds2 = pd.read_csv("inventory.csv")

# This operation involves two DataStores
result = ds1.join(ds2, on='product_id')
# Uses cross_datastore_engine setting

Engine Selection Logic

Auto Mode Decision Tree

Operation requested
    │
    ├─ Can be expressed in SQL?
    │      │
    │      ├─ Yes → Use chDB
    │      │
    │      └─ No → Use pandas
    │
    └─ Cross-DataStore operation?
           │
           └─ Use cross_datastore_engine setting

Function-Level Override

Some functions can have their engine explicitly configured:

from chdb.datastore.config import function_config

# Force specific functions to use specific engine
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper', 'lower')

See Function Config for details.


Performance Comparison

Benchmark results on 10M rows:

Operationpandas (ms)chdb (ms)Speedup
GroupBy count3471719.93x
Combined ops1,5352346.56x
Complex pipeline2,0473805.39x
Filter+Sort+Head1,5373504.40x
GroupBy agg4061412.88x
Single filter2765260.52x

Key insights:

  • chDB excels at aggregations and complex pipelines
  • pandas is slightly faster for simple single operations
  • Use auto mode to get the best of both

Best Practices

1. Start with Auto Mode

config.use_auto()  # Let DataStore decide

2. Profile Before Forcing

config.enable_profiling()
# Run your workload
# Check profiler report to see where time is spent

3. Force Engine for Specific Workloads

# For heavy aggregation workloads
config.use_chdb()

# For pandas compatibility testing
config.use_pandas()

4. Use explain() to Understand Execution

ds = pd.read_csv("data.csv")
query = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'sum'})

# See what SQL will be generated
query.explain()

Troubleshooting

Issue: Operation slower than expected

# Check current engine
print(config.execution_engine)

# Enable debug to see what's happening
config.enable_debug()

# Try forcing specific engine
config.use_chdb()  # or config.use_pandas()

Issue: Unsupported operation in chdb mode

# Some pandas operations aren't supported in SQL
# Solution: use auto mode
config.use_auto()

# Or explicitly convert to pandas first
df = ds.to_df()
result = df.some_pandas_specific_operation()

Issue: Memory issues with large data

# Use chdb engine to avoid loading all data into memory
config.use_chdb()

# Filter early to reduce data size
result = ds.filter(ds['date'] >= '2024-01-01').to_df()