Pandas Compatibility

DataStore implements 209 pandas DataFrame methods for full API compatibility. Your existing pandas code works with minimal changes.

Compatibility Approach

# Typical migration - just change the import
- import pandas as pd
+ from chdb import datastore as pd

# Your code works unchanged
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()

Key principles:

  • All 209 pandas DataFrame methods implemented
  • Lazy evaluation for SQL optimization
  • Automatic type wrapping (DataFrame → DataStore, Series → ColumnExpr)
  • Immutable operations (no inplace=True)

Attributes and Properties

| Property | Description | Triggers Execution |
| --- | --- | --- |
| `shape` | (rows, columns) tuple | Yes |
| `columns` | Column names (Index) | Yes |
| `dtypes` | Column data types | Yes |
| `values` | NumPy array | Yes |
| `index` | Row index | Yes |
| `size` | Number of elements | Yes |
| `ndim` | Number of dimensions | No |
| `empty` | Is DataFrame empty | Yes |
| `T` | Transpose | Yes |
| `axes` | List of axes | Yes |

Examples:

from chdb import datastore as pd

ds = pd.read_csv("data.csv")

print(ds.shape)      # (1000, 5)
print(ds.columns)    # Index(['name', 'age', 'city', 'salary', 'dept'])
print(ds.dtypes)     # name: object, age: int64, ...
print(ds.empty)      # False

Indexing and Selection

| Method | Description | Example |
| --- | --- | --- |
| `df['col']` | Select column | `ds['age']` |
| `df[['col1', 'col2']]` | Select columns | `ds[['name', 'age']]` |
| `df[condition]` | Boolean indexing | `ds[ds['age'] > 25]` |
| `df.loc[...]` | Label-based access | `ds.loc[0:10, 'name']` |
| `df.iloc[...]` | Integer-based access | `ds.iloc[0:10, 0:3]` |
| `df.at[...]` | Single value by label | `ds.at[0, 'name']` |
| `df.iat[...]` | Single value by position | `ds.iat[0, 0]` |
| `df.head(n)` | First n rows | `ds.head(10)` |
| `df.tail(n)` | Last n rows | `ds.tail(10)` |
| `df.sample(n)` | Random sample | `ds.sample(100)` |
| `df.select_dtypes()` | Select by dtype | `ds.select_dtypes(include='number')` |
| `df.query()` | Query expression | `ds.query('age > 25')` |
| `df.where()` | Conditional replace | `ds.where(ds['age'] > 0, 0)` |
| `df.mask()` | Inverse of `where()` | `ds.mask(ds['age'] < 0, 0)` |
| `df.isin()` | Value membership | `ds['city'].isin(['NYC', 'LA'])` |
| `df.get()` | Safe column access | `ds.get('col', default=None)` |
| `df.xs()` | Cross-section | `ds.xs('key')` |
| `df.pop()` | Remove and return column | `ds.pop('col')` |
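
The selection patterns above can be exercised on a small frame. The sketch below uses plain pandas to illustrate the shared API; per the Compatibility Approach, the same code should run against DataStore by swapping the import:

```python
import pandas as pd

ds = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [23, 31, 45],
    "city": ["NYC", "LA", "NYC"],
})

adults = ds[ds["age"] > 25]                    # boolean indexing
subset = ds[["name", "age"]]                   # column subset
numeric = ds.select_dtypes(include="number")   # dtype-based selection
queried = ds.query("age > 25 and city == 'NYC'")

print(adults["name"].tolist())   # ['Bob', 'Cy']
print(queried["name"].tolist())  # ['Cy']
```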

Statistical Methods

| Method | Description | SQL Equivalent |
| --- | --- | --- |
| `mean()` | Mean value | `AVG()` |
| `median()` | Median value | `MEDIAN()` |
| `mode()` | Mode value | - |
| `std()` | Standard deviation | `STDDEV()` |
| `var()` | Variance | `VAR()` |
| `min()` | Minimum | `MIN()` |
| `max()` | Maximum | `MAX()` |
| `sum()` | Sum | `SUM()` |
| `prod()` | Product | - |
| `count()` | Non-null count | `COUNT()` |
| `nunique()` | Unique count | `UNIQ()` |
| `value_counts()` | Value frequencies | `GROUP BY` |
| `quantile()` | Quantile | `QUANTILE()` |
| `describe()` | Summary statistics | - |
| `corr()` | Correlation matrix | `CORR()` |
| `cov()` | Covariance matrix | `COV()` |
| `corrwith()` | Pairwise correlation | - |
| `rank()` | Rank values | `RANK()` |
| `abs()` | Absolute values | `ABS()` |
| `round()` | Round values | `ROUND()` |
| `clip()` | Clip values | - |
| `cumsum()` | Cumulative sum | Window function |
| `cumprod()` | Cumulative product | Window function |
| `cummin()` | Cumulative min | Window function |
| `cummax()` | Cumulative max | Window function |
| `diff()` | Difference | Window function |
| `pct_change()` | Percent change | Window function |
| `skew()` | Skewness | `SKEW()` |
| `kurt()` | Kurtosis | `KURT()` |
| `sem()` | Standard error | - |
| `all()` | All true | - |
| `any()` | Any true | - |
| `idxmin()` | Index of min | - |
| `idxmax()` | Index of max | - |

Examples:

ds = pd.read_csv("data.csv")

# Basic statistics
print(ds['salary'].mean())
print(ds['age'].std())
print(ds.describe())

# Group statistics
print(ds.groupby('department')['salary'].mean())
print(ds.groupby('city').agg({'salary': ['mean', 'std'], 'age': 'count'}))

Data Manipulation

| Method | Description |
| --- | --- |
| `drop()` | Drop rows/columns |
| `drop_duplicates()` | Remove duplicates |
| `duplicated()` | Mark duplicates |
| `dropna()` | Remove missing values |
| `fillna()` | Fill missing values |
| `ffill()` | Forward fill |
| `bfill()` | Backward fill |
| `interpolate()` | Interpolate values |
| `replace()` | Replace values |
| `rename()` | Rename columns/index |
| `rename_axis()` | Rename axis |
| `assign()` | Add new columns |
| `astype()` | Convert types |
| `convert_dtypes()` | Infer types |
| `copy()` | Copy DataFrame |

Examples:

ds = pd.read_csv("data.csv")

# Drop operations
result = ds.drop(columns=['unused_col'])
result = ds.drop_duplicates(subset=['user_id'])
result = ds.dropna(subset=['email'])

# Fill operations
result = ds.fillna(0)
result = ds.fillna({'age': 0, 'name': 'Unknown'})

# Transform operations
result = ds.rename(columns={'old_name': 'new_name'})
result = ds.assign(
    full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
    age_group=lambda x: pd.cut(x['age'], bins=[0, 25, 50, 100])
)

Sorting and Ranking

| Method | Description |
| --- | --- |
| `sort_values()` | Sort by values |
| `sort_index()` | Sort by index |
| `nlargest()` | N largest values |
| `nsmallest()` | N smallest values |

Examples:

# Sort by single column
result = ds.sort_values('salary', ascending=False)

# Sort by multiple columns
result = ds.sort_values(['department', 'salary'], ascending=[True, False])

# Get top/bottom N
result = ds.nlargest(10, 'salary')
result = ds.nsmallest(5, 'age')

Reshaping

| Method | Description |
| --- | --- |
| `pivot()` | Reshape by index/column values (no aggregation) |
| `pivot_table()` | Pivot with aggregation |
| `melt()` | Unpivot |
| `stack()` | Stack columns to index |
| `unstack()` | Unstack index to columns |
| `transpose()` / `T` | Transpose |
| `explode()` | Explode lists to rows |
| `squeeze()` | Reduce dimensions |
| `droplevel()` | Drop index level |
| `swaplevel()` | Swap index levels |
| `reorder_levels()` | Reorder index levels |

Examples:

# Pivot table
result = ds.pivot_table(
    values='amount',
    index='region',
    columns='product',
    aggfunc='sum'
)

# Melt (unpivot)
result = ds.melt(
    id_vars=['name'],
    value_vars=['score1', 'score2', 'score3'],
    var_name='test',
    value_name='score'
)

# Explode arrays
result = ds.explode('tags')

Combining / Joining

| Method | Description |
| --- | --- |
| `merge()` | SQL-style merge |
| `join()` | Join on index |
| `concat()` | Concatenate |
| `append()` | Append rows |
| `combine()` | Combine with function |
| `combine_first()` | Combine with priority |
| `update()` | Update values |
| `compare()` | Show differences |

Examples:

# Merge (join)
result = pd.merge(df1, df2, on='id', how='left')
result = df1.join(df2, on='id')

# Concatenate
result = pd.concat([df1, df2, df3])
result = pd.concat([df1, df2], axis=1)

Binary Operations

| Method | Description |
| --- | --- |
| `add()` / `radd()` | Addition |
| `sub()` / `rsub()` | Subtraction |
| `mul()` / `rmul()` | Multiplication |
| `div()` / `rdiv()` | Division |
| `truediv()` / `rtruediv()` | True division |
| `floordiv()` / `rfloordiv()` | Floor division |
| `mod()` / `rmod()` | Modulo |
| `pow()` / `rpow()` | Power |
| `dot()` | Matrix multiplication |

Examples:

# Arithmetic operations
result = ds['col1'].add(ds['col2'])
result = ds['price'].mul(ds['quantity'])

# With fill_value for missing data
result = ds['col1'].add(ds['col2'], fill_value=0)

Comparison Operations

| Method | Description |
| --- | --- |
| `eq()` | Equal |
| `ne()` | Not equal |
| `lt()` | Less than |
| `le()` | Less than or equal |
| `gt()` | Greater than |
| `ge()` | Greater than or equal |
| `equals()` | Test equality |
| `compare()` | Show differences |
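
These are the method forms of the element-wise comparison operators. A minimal sketch with plain pandas (DataStore is documented to mirror the same API):

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 5, 3])

print(a.eq(b).tolist())  # [True, False, True]
print(a.lt(b).tolist())  # [False, True, False]
print(a.equals(b))       # False (not all elements match)
```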

Function Application

| Method | Description |
| --- | --- |
| `apply()` | Apply function |
| `applymap()` | Apply element-wise |
| `map()` | Map values |
| `agg()` / `aggregate()` | Aggregate |
| `transform()` | Transform |
| `pipe()` | Pipe functions |
| `groupby()` | Group by |

Examples:

# Apply function
result = ds['name'].apply(lambda x: x.upper())
result = ds.apply(lambda row: row['a'] + row['b'], axis=1)

# Aggregate
result = ds.agg({'col1': 'sum', 'col2': 'mean'})
result = ds.agg(['sum', 'mean', 'std'])

# Pipe
result = (ds
    .pipe(filter_active)
    .pipe(calculate_metrics)
    .pipe(format_output)
)

Time Series

| Method | Description |
| --- | --- |
| `rolling()` | Rolling window |
| `expanding()` | Expanding window |
| `ewm()` | Exponentially weighted window |
| `resample()` | Resample time series |
| `shift()` | Shift values |
| `asfreq()` | Convert frequency |
| `asof()` | Latest value as of a given time |
| `at_time()` | Select at time |
| `between_time()` | Select time range |
| `first()` / `last()` | First/last periods |
| `to_period()` | Convert to period |
| `to_timestamp()` | Convert to timestamp |
| `tz_convert()` | Convert timezone |
| `tz_localize()` | Localize timezone |

Examples:

# Rolling window
result = ds['value'].rolling(window=7).mean()

# Expanding window
result = ds['value'].expanding().sum()

# Shift
result = ds['value'].shift(1)  # Lag
result = ds['value'].shift(-1)  # Lead

Missing Data

| Method | Description |
| --- | --- |
| `isna()` / `isnull()` | Detect missing |
| `notna()` / `notnull()` | Detect non-missing |
| `dropna()` | Drop missing |
| `fillna()` | Fill missing |
| `ffill()` | Forward fill |
| `bfill()` | Backward fill |
| `interpolate()` | Interpolate |
| `replace()` | Replace values |
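
A quick sketch of the detect/fill/drop cycle, shown with plain pandas since DataStore exposes the same methods:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])

print(s.isna().tolist())     # [False, True, False, True]
print(s.fillna(0).tolist())  # [1.0, 0.0, 3.0, 0.0]
print(s.ffill().tolist())    # [1.0, 1.0, 3.0, 3.0]  (carry last value forward)
print(s.dropna().tolist())   # [1.0, 3.0]
```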

I/O Methods

| Method | Description |
| --- | --- |
| `to_csv()` | Export to CSV |
| `to_json()` | Export to JSON |
| `to_excel()` | Export to Excel |
| `to_parquet()` | Export to Parquet |
| `to_feather()` | Export to Feather |
| `to_sql()` | Export to SQL database |
| `to_pickle()` | Export to pickle |
| `to_html()` | HTML table |
| `to_latex()` | LaTeX table |
| `to_markdown()` | Markdown table |
| `to_string()` | String representation |
| `to_dict()` | Dictionary |
| `to_records()` | Records |
| `to_numpy()` | NumPy array |
| `to_clipboard()` | Clipboard |

See I/O Operations for detailed documentation.
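
As a small illustration (plain pandas shown; the DataStore exporters are documented to behave the same way), `to_csv()` without a path returns the CSV text, and `to_dict(orient='records')` yields one dict per row:

```python
import pandas as pd

ds = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

csv_text = ds.to_csv(index=False)          # CSV as a string when no path is given
as_dict = ds.to_dict(orient="records")     # row-oriented list of dicts

print(csv_text.splitlines()[0])  # id,name
print(as_dict)                   # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```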


Iteration

| Method | Description |
| --- | --- |
| `items()` | Iterate (column name, Series) pairs |
| `iterrows()` | Iterate (index, row Series) pairs |
| `itertuples()` | Iterate rows as namedtuples |
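
The three iterators differ in what each step yields. A sketch with plain pandas (on a lazy DataStore, iterating would presumably trigger execution first):

```python
import pandas as pd

ds = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# items(): (column name, column Series) pairs
cols = [name for name, _ in ds.items()]                    # ['a', 'b']

# iterrows(): (index, row Series) pairs; convenient but slow on large data
sums = [row["a"] + row["b"] for _, row in ds.iterrows()]   # [4, 6]

# itertuples(): namedtuples; usually the fastest iteration form
tuples = [(t.a, t.b) for t in ds.itertuples(index=False)]  # [(1, 3), (2, 4)]
```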

Key Differences from Pandas

1. Return Types

# Pandas returns Series
pdf['col']  # → pd.Series

# DataStore returns ColumnExpr (lazy)
ds['col']   # → ColumnExpr

2. Lazy Execution

# DataStore operations are lazy
result = ds.filter(ds['age'] > 25)  # Not executed yet
df = result.to_df()  # Executed here

3. No inplace Parameter

# Pandas
df.drop(columns=['col'], inplace=True)

# DataStore (always returns new object)
ds = ds.drop(columns=['col'])

4. Comparing Results

# Use to_pandas() for comparison
pd.testing.assert_frame_equal(
    ds.to_pandas(),
    expected_df
)

See Key Differences for complete details.