# Performance Guide
DataStore delivers significant performance improvements over pandas for many operations. This guide explains why and how to optimize your workloads.
## Why DataStore Is Faster
### 1. SQL Pushdown

Filters, projections, and aggregations are pushed down to the data source and executed as SQL rather than row by row in Python.
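A minimal sketch of the pattern - the import path, the `from_csv` constructor, and the pandas-style indexing below are illustrative assumptions, not confirmed DataStore API (`explain()` is covered later in this guide):

```python
from datastore import DataStore  # import path is an assumption

# Instead of loading every row into Python and filtering there,
# the chain below can compile to a single SQL query such as:
#   SELECT name, amount FROM events WHERE amount > 100
ds = DataStore.from_csv("events.csv")               # assumed constructor
result = ds[ds["amount"] > 100][["name", "amount"]]  # assumed indexing API
print(result.explain())                              # show the compiled plan
```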
### 2. Column Pruning

Only the columns a query actually touches are read from storage.
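Sketched with the same hypothetical API as above - only the referenced columns ever leave disk:

```python
# 'user_id' and 'amount' are read; every other column in events.csv
# is skipped entirely (assumed pandas-style column selection).
ds = DataStore.from_csv("events.csv")
totals = ds[["user_id", "amount"]].to_df()
```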
### 3. Lazy Evaluation

Operations execute nothing on their own; multiple chained steps compile into a single query that runs only when results are requested.
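A sketch under the same API assumptions - building the pipeline runs nothing:

```python
# Each step just extends the query plan; no data is touched yet.
pipeline = (
    ds[ds["status"] == "active"]   # filter
    .sort_values("amount")         # sort (assumed method name)
    .head(100)                     # limit
)
df = pipeline.to_df()  # one compiled query runs here, not three passes
```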
## Benchmark: DataStore vs pandas

### Test Environment
- Data: 10 million rows
- Hardware: Standard laptop
- File format: CSV
### Results
| Operation | pandas (ms) | DataStore (ms) | Winner |
|---|---|---|---|
| GroupBy count | 347 | 17 | DataStore (19.93x) |
| Combined ops | 1,535 | 234 | DataStore (6.56x) |
| Complex pipeline | 2,047 | 380 | DataStore (5.39x) |
| MultiFilter+Sort+Head | 1,963 | 366 | DataStore (5.36x) |
| Filter+Sort+Head | 1,537 | 350 | DataStore (4.40x) |
| Head/Limit | 166 | 45 | DataStore (3.69x) |
| Ultra-complex (10+ ops) | 1,070 | 338 | DataStore (3.17x) |
| GroupBy agg | 406 | 141 | DataStore (2.88x) |
| Select+Filter+Sort | 1,217 | 443 | DataStore (2.75x) |
| Filter+GroupBy+Sort | 466 | 184 | DataStore (2.53x) |
| Filter+Select+Sort | 1,285 | 533 | DataStore (2.41x) |
| Sort (single) | 1,742 | 1,197 | DataStore (1.45x) |
| Filter (single) | 276 | 526 | pandas (1.91x) |
| Sort (multiple) | 947 | 1,477 | pandas (1.56x) |
### Key Insights
- GroupBy operations: DataStore up to 19.93x faster
- Complex pipelines: DataStore 5-6x faster (SQL pushdown benefit)
- Simple single-step operations (single filter, multi-column sort): pandas can be up to ~2x faster, but the absolute gap is a fraction of a second
- Best use case: Multi-step operations with groupby/aggregation
- Zero-copy: `to_df()` has no data conversion overhead
## When DataStore Wins
### Heavy Aggregations
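A typical aggregation win, sketched with assumed pandas-style `groupby`/`agg` names:

```python
# GroupBy + aggregation runs as a single SQL GROUP BY instead of
# building intermediate Python objects (assumed API).
summary = (
    ds.groupby("region")
      .agg(total=("amount", "sum"), orders=("order_id", "count"))
      .to_df()
)
```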
### Complex Pipelines
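Under the same assumptions, a multi-step pipeline collapses into one query:

```python
# filter -> groupby -> sort -> limit, compiled and executed once.
top_regions = (
    ds[ds["year"] == 2024]
    .groupby("region")
    .agg(total=("amount", "sum"))
    .sort_values("total", ascending=False)  # assumed method name
    .head(10)
    .to_df()
)
```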
### Large File Processing
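Because the filter is pushed into the scan, rows that fail the predicate never reach Python memory (assumed `from_csv` constructor):

```python
big = DataStore.from_csv("large_events.csv")  # assumed constructor
recent = big[big["created_at"] >= "2024-01-01"].to_df()
```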
### Multiple Column Operations
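Combining column pruning with pushdown, under the same API assumptions, means only the touched columns are ever scanned:

```python
# Three columns scanned; the filter is applied during the scan.
slim = ds[["user_id", "amount", "region"]]
result = slim[slim["amount"] > 0].to_df()
```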
## When pandas Is Comparable

In most scenarios, DataStore matches or exceeds pandas performance. However, pandas can be faster in these specific cases:
### Small Datasets (<1,000 rows)
### Simple Slice Operations
### Custom Python Lambda Functions
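When a step genuinely needs arbitrary Python, a practical pattern is to do the set-based work in DataStore first and convert late (the DataStore side is assumed API; the `apply` call is standard pandas):

```python
# Filter in DataStore, then drop to pandas only for the part that
# needs a Python lambda. to_df() is near-free (see the next section).
df = ds[ds["amount"] > 100].to_df()
df["label"] = df["amount"].apply(lambda x: "big" if x > 1_000 else "small")
```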
Even where pandas wins, the gap is a few hundred milliseconds on 10 million rows - rarely significant in practice - and DataStore's advantages in complex operations far outweigh these edge cases.
For fine-grained control over execution, see Execution Engine Configuration.
## Zero-Copy DataFrame Integration
DataStore reads from and writes to pandas DataFrames with zero-copy semantics.

Key implications:

- `to_df()` is essentially free - no serialization or memory copying
- Creating a DataStore from a pandas DataFrame is instant
- Memory is shared between DataStore and pandas views
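A round trip, assuming a `from_df` constructor (hypothetical name):

```python
import pandas as pd
from datastore import DataStore  # import path is an assumption

df = pd.DataFrame({"a": range(1_000_000), "b": 1.0})
ds = DataStore.from_df(df)  # assumed constructor: instant, no copy
out = ds.to_df()            # essentially free: memory is shared
```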
## Optimization Tips

### 1. Use Parquet Instead of CSV
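A one-time conversion pays for itself quickly. The conversion below uses plain pandas (real pandas API); the `from_parquet` constructor is an assumption:

```python
import pandas as pd

# Convert once...
pd.read_csv("events.csv").to_parquet("events.parquet")

# ...then read the columnar file on every subsequent run.
ds = DataStore.from_parquet("events.parquet")  # assumed constructor
```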
Expected improvement: 3-10x faster reads
### 2. Filter Early
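Put predicates as early in the chain as possible, sketched with the same assumed API (`ds` as built earlier):

```python
# Good: the filter prunes rows before the expensive sort.
fast = ds[ds["status"] == "active"].sort_values("amount")

# Worse: the heavy sort runs over rows that are about to be discarded.
slow = ds.sort_values("amount")
slow = slow[slow["status"] == "active"]
```

With lazy evaluation the compiler may reorder some of this for you, but writing filters first keeps the intent clear and helps whenever a step forces execution.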
### 3. Select Only Needed Columns
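The same idea applies to columns (assumed API):

```python
# Good: only two columns are scanned and materialized.
df = ds[["user_id", "amount"]].to_df()

# Worse: materializes every column, then throws most of them away.
df = ds.to_df()[["user_id", "amount"]]
```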
### 4. Leverage SQL Aggregations
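Let the engine do the aggregation so only the summary crosses into Python (the DataStore side is assumed API; the second variant is standard pandas):

```python
# Good: one GROUP BY in the engine, a handful of result rows out.
avg = ds.groupby("region").agg(avg_amount=("amount", "mean")).to_df()

# Worse: pull 10 million rows into pandas just to average them.
avg = ds.to_df().groupby("region")["amount"].mean()
```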
### 5. Use `head()` Instead of Full Queries
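`head()` is mentioned elsewhere in this guide; the rest of the chain is assumed API:

```python
# head() compiles to a LIMIT, so the engine can stop after 5 rows
# instead of materializing the full result.
sample = ds[ds["amount"] > 100].head(5).to_df()
```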
### 6. Batch Operations
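Keep a pipeline as one lazy chain instead of materializing between steps (assumed API):

```python
# Good: one chained pipeline, one execution.
df = ds[ds["amount"] > 0].groupby("region").agg(n=("id", "count")).to_df()

# Worse: to_df() in the middle forces an extra full execution, and the
# follow-up groupby then runs in pandas instead of the engine.
tmp = ds[ds["amount"] > 0].to_df()
df = tmp.groupby("region").size()
```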
### 7. Use `explain()` to Optimize
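`explain()` appears in this guide; the chain around it is assumed API, and the plan format is engine-specific:

```python
# Inspect the compiled plan before running anything, e.g. to confirm
# that a filter was pushed down rather than applied after the scan.
query = ds[ds["amount"] > 100].groupby("region").agg(n=("id", "count"))
print(query.explain())
```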
## Profiling Your Workload

### Enable Profiling
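This guide doesn't pin down a built-in profiler, so here is a stopwatch harness using only the standard library (`ds` below is a DataStore built as in the earlier sketches):

```python
import time

def timed(label, fn, repeats=5):
    """Run fn several times and report the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best * 1000:.1f} ms (best of {repeats})")
```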
### Identify Bottlenecks
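With the harness above, time each stage in isolation to see which one dominates (chain methods are assumed API):

```python
timed("filter only", lambda: ds[ds["amount"] > 0].to_df())
timed("filter+groupby", lambda: ds[ds["amount"] > 0]
                                  .groupby("region")
                                  .agg(n=("id", "count"))
                                  .to_df())
```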
### Compare Approaches
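The same harness compares two shapes of one computation, such as a pushed-down pipeline against materialize-then-pandas (the pandas side is standard pandas):

```python
timed("pushdown", lambda: ds[ds["amount"] > 0]
                            .groupby("region")
                            .agg(n=("id", "count"))
                            .to_df())
timed("pandas", lambda: ds.to_df()
                          .query("amount > 0")
                          .groupby("region")
                          .size())
```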
## Best Practices Summary
| Practice | Impact |
|---|---|
| Use Parquet files | 3-10x faster reads |
| Filter early | Reduce data processing |
| Select needed columns | Reduce I/O and memory |
| Use GroupBy/aggregations | Up to 20x faster |
| Batch operations | Avoid repeated execution |
| Profile before optimizing | Find real bottlenecks |
| Use `explain()` | Verify query optimization |
| Use `head()` for samples | Avoid full table scans |
## Quick Decision Guide
| Your Workload | Recommendation |
|---|---|
| GroupBy/aggregation | Use DataStore |
| Complex multi-step pipeline | Use DataStore |
| Large files with filters | Use DataStore |
| Simple slice operations | Either (comparable performance) |
| Custom Python lambda functions | Use pandas or convert late |
| Very small data (<1,000 rows) | Either (negligible difference) |
For automatic optimal engine selection, use `config.set_execution_engine('auto')` (the default).
See Execution Engine Configuration for details.