Key Differences from pandas
While DataStore is highly compatible with pandas, there are important differences to understand.
Summary Table
| Aspect | pandas | DataStore |
|---|---|---|
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved for most operations (automatic) |
| inplace | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |
1. Lazy vs Eager Execution
pandas (Eager)
Operations execute immediately:
DataStore (Lazy)
Operations are deferred until results are needed:
Why It Matters
Lazy execution enables:
- Query optimization: Multiple operations compile to one SQL query
- Column pruning: Only needed columns are read
- Filter pushdown: Filters apply at the source
- Memory efficiency: Don't load data you don't need
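To make the first point concrete, here is a minimal toy sketch of how deferred operations can collapse into a single SQL statement. `LazyQuery`, its methods, and the `events` table are hypothetical names invented for this illustration; they are not DataStore's actual classes, and the real query planner is certainly more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyQuery:
    """Toy lazy query builder (hypothetical; not DataStore's real class)."""
    source: str
    columns: tuple = ("*",)
    filters: tuple = ()

    def select(self, *cols):
        # Record the projection; nothing is read yet.
        return LazyQuery(self.source, cols, self.filters)

    def filter(self, condition):
        # Record the predicate; nothing is read yet.
        return LazyQuery(self.source, self.columns, self.filters + (condition,))

    def to_sql(self):
        # All recorded steps collapse into a single statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.source}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql


query = (
    LazyQuery("events")
    .filter("age > 30")
    .select("name", "age")
    .filter("country = 'DE'")
)
print(query.to_sql())
# SELECT name, age FROM events WHERE (age > 30) AND (country = 'DE')
```

Because the projection and both predicates end up in one statement, the engine that runs it can read only the `name` and `age` columns and apply the filters at the source.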
2. Return Types
pandas
DataStore
Converting to pandas Types
3. Execution Triggers
DataStore executes when you need actual values:
| Trigger | Example | Notes |
|---|---|---|
| print() / repr() | print(ds) | Display needs data |
| len() | len(ds) | Need row count |
| .columns | ds.columns | Need column names |
| .dtypes | ds.dtypes | Need type info |
| .shape | ds.shape | Need dimensions |
| .values | ds.values | Need actual data |
| .index | ds.index | Need index |
| to_df() | ds.to_df() | Explicit conversion |
| Iteration | for row in ds | Need to iterate |
| equals() | ds.equals(other) | Need comparison |
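The trigger idea can be illustrated with a small toy model. `LazyTable` below is a hypothetical stand-in, not the DataStore API: it counts how often its pipeline runs, so you can see that building filters costs nothing while `len()` and iteration each force execution. This toy re-runs the pipeline on every trigger; whether the real library caches results is not something this sketch tries to model.

```python
class LazyTable:
    """Toy stand-in for a lazy table (hypothetical; not the DataStore API)."""

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)
        self.executions = 0  # how many times the pipeline actually ran

    def filter(self, predicate):
        # Stays lazy: returns a new object and runs nothing.
        return LazyTable(self._rows, self._predicates + (predicate,))

    def _execute(self):
        self.executions += 1
        return [r for r in self._rows if all(p(r) for p in self._predicates)]

    def __len__(self):
        return len(self._execute())  # len() needs a row count

    def __iter__(self):
        return iter(self._execute())  # iteration needs the rows

    def __repr__(self):
        return f"LazyTable({self._execute()!r})"  # display needs data


table = LazyTable([{"age": 25}, {"age": 41}, {"age": 67}])
adults = table.filter(lambda r: r["age"] > 30).filter(lambda r: r["age"] < 65)

runs_before = adults.executions   # 0: building the pipeline ran nothing
row_count = len(adults)           # first trigger
rows = [row for row in adults]    # second trigger
runs_after = adults.executions    # 2
```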
Operations That Stay Lazy
| Operation | Returns |
|---|---|
| filter() | DataStore |
| select() | DataStore |
| sort() | DataStore |
| groupby() | LazyGroupBy |
| join() | DataStore |
| ds['col'] | ColumnExpr |
| ds[['a', 'b']] | DataStore |
| ds[condition] | DataStore |
4. Row Order
pandas
Row order is always preserved:
DataStore
Row order is automatically preserved for most operations:
DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.
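The general technique of tracking row positions can be sketched in plain Python. This is only a conceptual analogue under simplifying assumptions: the shuffle stands in for an engine that returns rows in arbitrary order, and the `enumerate` index plays the role of the hidden row number. It does not reproduce how DataStore uses rowNumberInAllBlocks() internally.

```python
import random

rows = ["delta", "alpha", "charlie", "bravo"]

# Tag every row with its original position (the toy analogue of a row number).
tagged = list(enumerate(rows))

# Simulate an engine that hands rows back in arbitrary order.
random.Random(0).shuffle(tagged)

# Apply a filter, then restore the original order by sorting on the position.
kept = [(pos, value) for pos, value in tagged if value != "charlie"]
restored = [value for pos, value in sorted(kept)]

print(restored)  # ['delta', 'alpha', 'bravo']
```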
When Order Is Preserved
- File sources (CSV, Parquet, JSON, etc.)
- pandas DataFrame sources
- Filter operations
- Column selection
- After explicit sort() or sort_values()
- Operations that define order (nlargest(), nsmallest(), head(), tail())
When Order May Differ
- After groupby() aggregations (use sort_values() to ensure consistent order)
- After merge() / join() with certain join types
5. No inplace Parameter
pandas
DataStore
inplace=True is not supported. Always assign the result:
Why No inplace?
DataStore uses immutable operations to enable:
- Query building (lazy evaluation)
- Thread safety
- Easier debugging
- Cleaner code
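The immutable style can be shown with a frozen dataclass. `Pipeline` and its `dropped` field are hypothetical names for this sketch only; the point is the pattern of returning a modified copy, which is why the result has to be assigned.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pipeline:
    """Toy immutable pipeline (hypothetical; not DataStore's real class)."""
    dropped: frozenset = frozenset()

    def drop(self, column):
        # Return a modified copy instead of mutating self.
        return replace(self, dropped=self.dropped | {column})


base = Pipeline()
without_id = base.drop("id")  # assign the result, as with ds = ds.drop(...)
base.drop("name")             # result discarded: base is unchanged

print(sorted(base.dropped))        # []
print(sorted(without_id.dropped))  # ['id']
```

Since `base` never changes, it can be shared between threads or reused as the starting point for several different pipelines.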
6. Index Support
pandas
Full index support:
DataStore
Simplified index support:
DataStore Source Matters
- DataFrame source: Preserves pandas index
- File source: Uses simple integer index
7. Comparison Behavior
Comparing with pandas
pandas doesn't recognize DataStore objects:
Using equals()
8. Type Inference
pandas
Uses numpy/pandas types:
DataStore
May use ClickHouse types:
Explicit Casting
9. Memory Model
pandas
All data lives in memory:
DataStore
Data stays at source until needed:
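As a rough analogy using only the standard library, the sketch below streams rows from a source and keeps just the values it needs. The in-memory `StringIO` stands in for a file, and `matching_names` is a hypothetical helper; DataStore's own mechanism (pushing work down to chDB or ClickHouse) is different, but the memory behaviour is similar in spirit.

```python
import csv
import io

# Stand-in for a file at the source (made-up sample data).
source = io.StringIO("name,age,city\nAnn,34,Oslo\nBob,19,Rome\nCid,52,Lima\n")


def matching_names(handle, min_age):
    # Rows are pulled from the source one at a time; only one column is kept.
    for record in csv.DictReader(handle):
        if int(record["age"]) >= min_age:
            yield record["name"]


names = matching_names(source, min_age=30)  # generator: nothing read yet
position_before = source.tell()             # still at the start of the source
result = list(names)                        # reading happens here

print(result)  # ['Ann', 'Cid']
```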
10. Error Messages
Different Error Sources
- pandas errors: From pandas library
- DataStore errors: From chDB or ClickHouse
Debugging Tips
Migration Checklist
When migrating from pandas:
- Change import statement
- Remove inplace=True parameters
- Add explicit to_df() where pandas DataFrame is required
- Add sorting if row order matters
- Use to_pandas() for comparison tests
- Test with representative data sizes
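During a gradual migration, test code may receive either a pandas object or a lazy one. One possible approach, sketched here with a hypothetical `materialize` helper and a stub class in place of a real DataStore, is to convert anything that exposes `to_df()` before comparing. Only the `to_df()` method name comes from this guide; everything else is illustrative.

```python
def materialize(obj):
    """Return obj.to_df() when the object offers it, otherwise obj itself."""
    to_df = getattr(obj, "to_df", None)
    return to_df() if callable(to_df) else obj


class FakeLazyFrame:
    """Stub standing in for a lazy object in this sketch."""

    def __init__(self, data):
        self._data = data

    def to_df(self):
        # A real lazy object would execute its query here.
        return list(self._data)


expected = [1, 2, 3]
from_lazy = materialize(FakeLazyFrame((1, 2, 3)))  # converted via to_df()
from_eager = materialize(expected)                 # passed through untouched
```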
Quick Reference
| pandas | DataStore |
|---|---|
| df[condition] | Same (returns DataStore) |
| df.groupby() | Same (returns LazyGroupBy) |
| df.drop(inplace=True) | ds = ds.drop() |
| df.equals(other) | ds.to_pandas().equals(other) |
| df.loc['label'] | ds.to_df().loc['label'] |
| print(df) | Same (triggers execution) |
| len(df) | Same (triggers execution) |