
Key Differences from pandas

While DataStore is highly compatible with pandas, there are important differences to understand.

Summary Table

Aspect       | pandas             | DataStore
-------------|--------------------|----------------------
Execution    | Eager (immediate)  | Lazy (deferred)
Return types | DataFrame/Series   | DataStore/ColumnExpr
Row order    | Preserved          | Preserved (automatic)
inplace      | Supported          | Not supported
Index        | Full support       | Simplified
Memory       | All data in memory | Data at source

1. Lazy vs Eager Execution

pandas (Eager)

Operations execute immediately:

import pandas as pd

df = pd.read_csv("data.csv")  # Loads entire file NOW
result = df[df['age'] > 25]   # Filters NOW
grouped = result.groupby('city')['salary'].mean()  # Aggregates NOW

DataStore (Lazy)

Operations are deferred until results are needed:

from chdb import datastore as pd

ds = pd.read_csv("data.csv")  # Just records the source
result = ds[ds['age'] > 25]   # Just records the filter
grouped = result.groupby('city')['salary'].mean()  # Just records

# Execution happens here:
print(grouped)        # Executes when displaying
df = grouped.to_df()  # Or when converting to pandas

Why It Matters

Lazy execution enables:

  • Query optimization: Multiple operations compile to one SQL query
  • Column pruning: Only needed columns are read
  • Filter pushdown: Filters apply at the source
  • Memory efficiency: Don't load data you don't need
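The deferral described above can be sketched in plain Python, without chdb: each operation only records itself, and nothing runs until the result is requested. This is an illustrative toy, not the chdb implementation.

```python
# Minimal illustration of lazy evaluation: each method records an
# operation instead of running it; collect() executes the whole chain.
class LazyFrame:
    def __init__(self, rows, ops=()):
        self._rows = rows  # the "source" (stands in for a file)
        self._ops = ops    # recorded, not-yet-executed operations

    def filter(self, predicate):
        # Returns a NEW LazyFrame; nothing is scanned yet.
        return LazyFrame(self._rows, self._ops + (("filter", predicate),))

    def select(self, key):
        return LazyFrame(self._rows, self._ops + (("select", key),))

    def collect(self):
        # Execution trigger: replay the recorded operations in order.
        out = self._rows
        for kind, arg in self._ops:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            elif kind == "select":
                out = [r[arg] for r in out]
        return out

lf = LazyFrame([{"age": 30, "city": "NY"}, {"age": 20, "city": "LA"}])
result = lf.filter(lambda r: r["age"] > 25).select("city")  # still lazy
print(result.collect())  # → ['NY']
```

Because the whole chain is visible at `collect()` time, an engine like DataStore can compile it into a single optimized SQL query instead of executing step by step.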

2. Return Types

pandas

df['col']           # Returns pd.Series
df[['a', 'b']]      # Returns pd.DataFrame
df[df['x'] > 10]    # Returns pd.DataFrame
df.groupby('x')     # Returns DataFrameGroupBy

DataStore

ds['col']           # Returns ColumnExpr (lazy)
ds[['a', 'b']]      # Returns DataStore (lazy)
ds[ds['x'] > 10]    # Returns DataStore (lazy)
ds.groupby('x')     # Returns LazyGroupBy

Converting to pandas Types

# Get pandas DataFrame
df = ds.to_df()
df = ds.to_pandas()

# Get pandas Series from column
series = ds['col'].to_pandas()

# Or trigger execution
print(ds)  # Automatically converts for display

3. Execution Triggers

DataStore executes when you need actual values:

Trigger           | Example          | Notes
------------------|------------------|--------------------
print() / repr()  | print(ds)        | Display needs data
len()             | len(ds)          | Need row count
.columns          | ds.columns       | Need column names
.dtypes           | ds.dtypes        | Need type info
.shape            | ds.shape         | Need dimensions
.values           | ds.values        | Need actual data
.index            | ds.index         | Need index
to_df()           | ds.to_df()       | Explicit conversion
Iteration         | for row in ds    | Need to iterate
equals()          | ds.equals(other) | Need comparison
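How a lazy object turns display and length checks into execution triggers can be sketched with Python's dunder methods. This is a pure-Python illustration of the mechanism, not chdb's actual code:

```python
# Sketch of how repr() and len() can act as execution triggers on a
# lazy object (illustrative only; not the chdb implementation).
class LazyValue:
    def __init__(self, compute):
        self._compute = compute
        self.executions = 0  # count materializations for the demo

    def _materialize(self):
        self.executions += 1
        return self._compute()

    def __repr__(self):  # print(ds) -> display needs data
        return repr(self._materialize())

    def __len__(self):   # len(ds) -> needs row count
        return len(self._materialize())

lazy = LazyValue(lambda: [1, 2, 3])
assert lazy.executions == 0   # nothing has run yet
length = len(lazy)            # trigger: forces execution
assert (length, lazy.executions) == (3, 1)
```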

Operations That Stay Lazy

Operation      | Returns
---------------|------------
filter()       | DataStore
select()       | DataStore
sort()         | DataStore
groupby()      | LazyGroupBy
join()         | DataStore
ds['col']      | ColumnExpr
ds[['a', 'b']] | DataStore
ds[condition]  | DataStore

4. Row Order

pandas

Row order is always preserved:

df = pd.read_csv("data.csv")
print(df.head())  # Always same order as file

DataStore

Row order is automatically preserved for most operations:

ds = pd.read_csv("data.csv")
print(ds.head())  # Matches file order

# Filter preserves order
ds_filtered = ds[ds['age'] > 25]  # Same order as pandas

DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.

When Order Is Preserved

  • File sources (CSV, Parquet, JSON, etc.)
  • pandas DataFrame sources
  • Filter operations
  • Column selection
  • After explicit sort() or sort_values()
  • Operations that define order (nlargest(), nsmallest(), head(), tail())

When Order May Differ

  • After groupby() aggregations (use sort_values() to ensure consistent order)
  • After merge() / join() with certain join types
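The position-tracking idea behind rowNumberInAllBlocks() can be shown in plain Python: tag each row with its original position, let an intermediate step scramble the order, then sort by the tag to restore it. A toy sketch, not chdb's implementation:

```python
# Illustration of restoring source order after an order-scrambling step,
# analogous to tracking row positions internally.
rows = ["a", "b", "c", "d"]
tagged = list(enumerate(rows))  # remember original positions: (0, 'a'), ...

# Some processing step that does not preserve order:
shuffled = sorted(tagged, key=lambda t: t[1], reverse=True)

# Sort by the remembered position to recover the source order.
restored = [value for _, value in sorted(shuffled)]
assert restored == rows
```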

5. No inplace Parameter

pandas

df.drop(columns=['col'], inplace=True)  # Modifies df
df.fillna(0, inplace=True)              # Modifies df
df.rename(columns={'old': 'new'}, inplace=True)

DataStore

inplace=True is not supported. Always assign the result:

ds = ds.drop(columns=['col'])           # Returns new DataStore
ds = ds.fillna(0)                       # Returns new DataStore
ds = ds.rename(columns={'old': 'new'})  # Returns new DataStore

Why No inplace?

DataStore uses immutable operations to enable:

  • Query building (lazy evaluation)
  • Thread safety
  • Easier debugging
  • Cleaner code
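The immutability described above is what makes lazy query building possible: every method returns a new object carrying the accumulated steps, and the original is never touched. A minimal pure-Python sketch (hypothetical class, not the chdb API):

```python
# Why immutable operations compose: each step returns a new object,
# so a chain of transformations is just a growing recorded pipeline.
class Pipeline:
    def __init__(self, steps=()):
        self.steps = steps

    def drop(self, col):
        return Pipeline(self.steps + (f"drop {col}",))

    def fillna(self, val):
        return Pipeline(self.steps + (f"fillna {val}",))

p1 = Pipeline()
p2 = p1.drop("col").fillna(0)  # p1 is untouched; p2 holds both steps
assert p1.steps == ()
assert p2.steps == ("drop col", "fillna 0")
```

An `inplace=True` mutation would have to execute (or invalidate) the recorded pipeline immediately, defeating lazy evaluation.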

6. Index Support

pandas

Full index support:

df = df.set_index('id')
df.loc['user123']           # Label-based access
df.loc['a':'z']             # Label-based slicing
df.reset_index()
df.index.name = 'user_id'

DataStore

Simplified index support:

# Basic operations work
ds.loc[0:10]               # Integer position
ds.iloc[0:10]              # Same as loc for DataStore

# For pandas-style index operations, convert first
df = ds.to_df()
df = df.set_index('id')
df.loc['user123']

DataStore Source Matters

  • DataFrame source: Preserves pandas index
  • File source: Uses simple integer index

7. Comparison Behavior

Comparing with pandas

pandas doesn't recognize DataStore objects:

import pandas as pd
from chdb import datastore as ds

pdf = pd.DataFrame({'a': [1, 2, 3]})
dsf = ds.DataFrame({'a': [1, 2, 3]})

# This doesn't work as expected
pdf == dsf  # pandas doesn't know DataStore

# Solution: convert DataStore to pandas
pdf.equals(dsf.to_pandas())  # True

Using equals()

# DataStore.equals() also works
dsf.equals(pdf)  # Compares with pandas DataFrame

8. Type Inference

pandas

Uses numpy/pandas types:

df['col'].dtype  # int64, float64, object, datetime64, etc.

DataStore

May use ClickHouse types:

ds['col'].dtype  # Int64, Float64, String, DateTime, etc.

# Types are converted when going to pandas
df = ds.to_df()
df['col'].dtype  # Now pandas type

Explicit Casting

# Force specific type
ds['col'] = ds['col'].astype('int64')
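As a rough orientation, the correspondence between ClickHouse type names and the pandas dtypes they typically become after conversion looks like the mapping below. This table is an illustrative assumption, not an exhaustive or authoritative list; check the actual conversion behavior in your chdb version.

```python
# Illustrative (NOT exhaustive or authoritative) mapping between
# ClickHouse type names and typical pandas dtypes after conversion.
CLICKHOUSE_TO_PANDAS = {
    "Int64": "int64",
    "Float64": "float64",
    "String": "object",
    "DateTime": "datetime64[ns]",
}

assert CLICKHOUSE_TO_PANDAS["String"] == "object"
```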

9. Memory Model

pandas

All data lives in memory:

df = pd.read_csv("huge.csv")  # 10GB in memory!

DataStore

Data stays at source until needed:

ds = pd.read_csv("huge.csv")  # Just metadata
ds = ds.filter(ds['year'] == 2024)  # Still just metadata

# Only filtered result is loaded
df = ds.to_df()  # Maybe only 1GB now
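The "data stays at source" idea is analogous to Python generators: rows are produced on demand, and only the rows that survive the filter are ever held in memory. A stdlib-only sketch:

```python
# Generator-based sketch of "data stays at source": rows are produced
# lazily, and only matching rows are materialized in the result.
def read_rows():
    # Stand-in for streaming a huge file row by row.
    for year in (2022, 2023, 2024, 2024):
        yield {"year": year}

filtered = (r for r in read_rows() if r["year"] == 2024)  # nothing read yet
result = list(filtered)  # only the filtered rows are held in memory
assert result == [{"year": 2024}, {"year": 2024}]
```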

10. Error Messages

Different Error Sources

  • pandas errors: come from the pandas library
  • DataStore errors: come from chDB or ClickHouse, so you may see ClickHouse-style messages:

# Example of a ClickHouse-style error
# "Code: 62. DB::Exception: Syntax error..."

Debugging Tips

# View the SQL to debug
print(ds.to_sql())

# See execution plan
ds.explain()

# Enable debug logging
from chdb.datastore.config import config
config.enable_debug()

Migration Checklist

When migrating from pandas:

  • Change import statement
  • Remove inplace=True parameters
  • Add explicit to_df() where pandas DataFrame is required
  • Add sorting if row order matters
  • Use to_pandas() for comparison tests
  • Test with representative data sizes

Quick Reference

pandas                | DataStore
----------------------|-----------------------------
df[condition]         | Same (returns DataStore)
df.groupby()          | Same (returns LazyGroupBy)
df.drop(inplace=True) | ds = ds.drop()
df.equals(other)      | ds.to_pandas().equals(other)
df.loc['label']       | ds.to_df().loc['label']
print(df)             | Same (triggers execution)
len(df)               | Same (triggers execution)