Troubleshooting
Installation Issues
Platform Compatibility
chDB currently supports Python 3.8+ on macOS and Linux (x86_64 and ARM64). Windows support is not available yet.
Supported Platforms:
macOS (x86_64 and ARM64)
Linux (x86_64 and ARM64)
Python 3.8, 3.9, 3.10, 3.11, 3.12+
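The support matrix above can be checked from the interpreter itself. The following is a minimal sketch using only the standard library; the helper name `is_supported` is hypothetical (it is not part of chDB), and the machine strings are an assumption based on what `platform.machine()` commonly reports (`arm64` on macOS, `aarch64` on Linux).

```python
import platform
import sys

# Values taken from the support list above; macOS reports itself as "Darwin".
SUPPORTED_SYSTEMS = {"Darwin", "Linux"}
# Assumption: common spellings returned by platform.machine().
SUPPORTED_MACHINES = {"x86_64", "arm64", "aarch64"}
MIN_PYTHON = (3, 8)


def is_supported(system, machine, version):
    """Return True if the OS, CPU architecture, and Python version all match."""
    return (
        system in SUPPORTED_SYSTEMS
        and machine.lower() in SUPPORTED_MACHINES
        and tuple(version[:2]) >= MIN_PYTHON
    )


ok = is_supported(platform.system(), platform.machine(), sys.version_info)
print(f"{platform.system()} {platform.machine()} "
      f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{'supported' if ok else 'not supported'}")
```

Running this before `pip install chdb` helps separate a platform mismatch from a broken installation.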
Import Errors
If you encounter import errors:
ImportError: No module named '_chdb'
ImportError: No module named 'chdb'
Solution: Ensure chDB is properly installed:
pip uninstall chdb
pip install chdb
Alternative installation methods:
# Force reinstall
pip install --force-reinstall chdb
# Install specific version
pip install chdb==3.7.0
Check installation:
import chdb
print(f"chDB version: {chdb.__version__}")
print(f"Engine version: {chdb.engine_version}")
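If `import chdb` itself fails, the check above cannot run. A common cause is that pip installed the package into a different interpreter than the one running the script. The sketch below inspects the current interpreter without importing the package; `describe_install` is a hypothetical helper name, and it relies only on `importlib` from the standard library.

```python
import importlib.util
import sys
from importlib import metadata


def describe_install(package):
    """Report whether `package` is visible to this interpreter, without importing it."""
    spec = importlib.util.find_spec(package)
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "interpreter": sys.executable,
        "importable": spec is not None,
        "location": getattr(spec, "origin", None),
        "installed_version": version,
    }


for key, value in describe_install("chdb").items():
    print(f"{key}: {value}")
```

If `importable` is False, installing with the interpreter path shown (`<interpreter> -m pip install chdb`) targets the same environment the script runs in.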
Python Version Issues
If you get Python version compatibility errors:
ERROR: chdb requires Python >=3.8
Solution: Upgrade your Python version:
# Check current Python version
python --version
# Use Python 3.8+ explicitly
python3.8 -m pip install chdb
Query Execution Issues
Memory Issues
If you encounter memory-related errors:
Memory limit exceeded
Out of memory while executing query
Solutions:
Use Streaming Queries for Large Datasets
from chdb import session as chs
sess = chs.Session()
# Process large datasets with streaming
rows_cnt = 0
with sess.send_query("SELECT * FROM numbers(1000000)", "CSV") as stream_result:
    for chunk in stream_result:
        # Process chunk by chunk to avoid memory issues
        rows_cnt += chunk.rows_read()
print(f"Processed {rows_cnt} rows")
Use File-based Sessions for Persistence
# Use persistent storage to reduce memory usage
sess = chs.Session("large_dataset.chdb") # File-based storage
# Instead of in-memory
# sess = chs.Session() # In-memory storage
Process Data in Smaller Batches
import chdb
# Good: Process in batches
for i in range(0, 1000000, 10000):
    result = chdb.query(f"SELECT * FROM numbers({i}, 10000)")
    # Process batch
# Avoid: Loading entire dataset at once
# result = chdb.query("SELECT * FROM numbers(1000000)")
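The batch loop above works cleanly when the total row count is a multiple of the batch size. When it is not, the final batch needs a smaller count. A minimal sketch of that arithmetic follows; `batch_ranges` and `batch_queries` are hypothetical helper names, and the generated SQL reuses the `numbers(offset, count)` form from the example above.

```python
def batch_ranges(total_rows, batch_size):
    """Yield (offset, count) pairs that cover total_rows; the last batch may be short."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total_rows, batch_size):
        yield offset, min(batch_size, total_rows - offset)


def batch_queries(total_rows, batch_size):
    """Build one query string per batch."""
    return [
        f"SELECT * FROM numbers({offset}, {count})"
        for offset, count in batch_ranges(total_rows, batch_size)
    ]


for sql in batch_queries(25, 10):
    print(sql)  # each string could be passed to chdb.query(sql)
```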
Use Column Selection
# Good: Select only needed columns
result = chdb.query("SELECT id, name FROM large_table WHERE id > 100")
# Avoid: Select all columns
# result = chdb.query("SELECT * FROM large_table WHERE id > 100")
File Access Issues
If you encounter file access errors:
Permission denied: Cannot read file
File not found: /path/to/file.csv
Cannot determine file format
Solutions:
Check File Permissions and Path
import os
import chdb
# Check if file exists
file_path = "data.csv"
if not os.path.exists(file_path):
    # Stop here instead of running a query that is known to fail
    raise FileNotFoundError(f"File does not exist: {file_path}")
# Use absolute path
abs_path = os.path.abspath(file_path)
result = chdb.query(f"SELECT * FROM file('{abs_path}', 'CSV')")
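The existence and permission checks can be folded into one helper that also escapes the path before it is placed inside the SQL string. This is a sketch under stated assumptions: `file_query` is a hypothetical name, and the escaping assumes backslash escapes inside single-quoted SQL string literals.

```python
import os
import tempfile


def file_query(path, fmt):
    """Validate `path`, then build a SELECT over file(); raises if the file is unusable."""
    abs_path = os.path.abspath(path)
    if not os.path.isfile(abs_path):
        raise FileNotFoundError(f"File not found: {abs_path}")
    if not os.access(abs_path, os.R_OK):
        raise PermissionError(f"Permission denied: {abs_path}")
    # Escape backslashes first, then single quotes, for the SQL string literal.
    quoted = abs_path.replace("\\", "\\\\").replace("'", "\\'")
    return f"SELECT * FROM file('{quoted}', '{fmt}')"


# Demonstrate with a temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
    print(file_query(tmp.name, "CSV"))
```

Raising early gives a Python traceback that names the path, which is usually easier to act on than an error raised from inside the query.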
Supported File Formats
# chDB supports 60+ formats including:
result = chdb.query("SELECT * FROM file('data.parquet', 'Parquet')")
result = chdb.query("SELECT * FROM file('data.csv', 'CSV')")
result = chdb.query("SELECT * FROM file('data.json', 'JSONEachRow')")
result = chdb.query("SELECT * FROM file('data.orc', 'ORC')")
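When the format name is passed explicitly, it can be derived from the file extension. The mapping below is only a sketch built from the four examples above; `guess_format` is a hypothetical helper, and mapping `.json` to `JSONEachRow` assumes newline-delimited JSON, as in the example.

```python
import os

# Extension-to-format pairs taken from the examples in this section.
FORMAT_BY_EXTENSION = {
    ".csv": "CSV",
    ".parquet": "Parquet",
    ".json": "JSONEachRow",  # assumption: one JSON object per line
    ".orc": "ORC",
}


def guess_format(path):
    """Return the format name for a known extension, or raise ValueError."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return FORMAT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(
            f"No format known for extension {ext!r}; pass one explicitly"
        ) from None


print(guess_format("data.parquet"))
```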
File Format Detection Issues
# Explicitly specify format and schema if auto-detection fails
result = chdb.query("""
SELECT * FROM file('data.csv', 'CSV',
'id UInt32, name String, age UInt8')
""")
Working with Remote Files
# HTTP/HTTPS files
result = chdb.query("""
SELECT * FROM url('https://example.com/data.csv', 'CSV')
""")
Connection and Session Issues
Session Already Exists Error
Session already exists
Solution: Only one session can be active at a time per process:
from chdb import session as chs
# Close existing session before creating new one
if 'sess' in locals():
    sess.close()
sess = chs.Session()
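One way to avoid leaving a session open is to tie its lifetime to a `with` block. The sketch below uses `contextlib.closing` from the standard library, which calls `close()` on exit even if an exception is raised. `DemoSession` is a hypothetical stand-in that mimics the one-session-per-process rule so the example runs without chDB; `run_in_session` is likewise an illustrative name.

```python
from contextlib import closing


class DemoSession:
    """Hypothetical stand-in for a session; allows only one open instance."""

    open_count = 0

    def __init__(self):
        if DemoSession.open_count:
            raise RuntimeError("Session already exists")
        DemoSession.open_count += 1

    def close(self):
        DemoSession.open_count = 0


def run_in_session(factory, work):
    """Create a session, run work(session), and always close the session."""
    with closing(factory()) as sess:
        return work(sess)


# Two consecutive uses succeed because each session is closed on exit.
run_in_session(DemoSession, lambda s: "first")
print(run_in_session(DemoSession, lambda s: "second"))
```

With chDB, the factory would be `chs.Session`, assuming the session object exposes `close()` as used in the snippet above.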
DB-API Connection Issues
import chdb.dbapi as dbapi
# Always close connections properly
conn = dbapi.connect()
try:
    cur = conn.cursor()
    try:
        cur.execute("SELECT 1")
        result = cur.fetchone()
    finally:
        # Runs only if the cursor was created, so cur is always defined here
        cur.close()
finally:
    conn.close()
# Or use context manager for automatic cleanup
with dbapi.connect() as conn:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    result = cur.fetchone()
    cur.close()
Performance Issues
If queries are running slowly:
Solutions:
Use Efficient Query Patterns
# Good: Select specific columns
result = chdb.query("SELECT id, name FROM users WHERE id > 100")
# Good: Use LIMIT for exploration
result = chdb.query("SELECT * FROM large_table LIMIT 100")
# Avoid: Select all columns from large tables
# result = chdb.query("SELECT * FROM users WHERE id > 100")
Optimize Data Formats
# Parquet is usually faster than CSV for analytical queries
result = chdb.query("SELECT * FROM file('data.parquet', 'Parquet')")
# For repeated queries, consider using session with persistent storage
from chdb import session as chs
sess = chs.Session("analytics.chdb")
# Load data once
sess.query("CREATE TABLE users AS SELECT * FROM file('users.parquet', 'Parquet')")
# Query multiple times efficiently
result1 = sess.query("SELECT COUNT(*) FROM users WHERE age > 25")
result2 = sess.query("SELECT department, AVG(age) FROM users GROUP BY department")
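To check whether a change such as switching to Parquet or to a persistent session actually helps, it is useful to time the same query before and after. The following is a small timing sketch using only the standard library; `time_call` is a hypothetical helper, and the workload shown is a placeholder rather than a chDB call.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn() `repeats` times; return (last_result, median_seconds)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        timings.append(time.perf_counter() - start)
    return result, statistics.median(timings)


# Placeholder workload; with chDB this could be
# lambda: sess.query("SELECT COUNT(*) FROM users WHERE age > 25")
result, seconds = time_call(lambda: sum(range(100_000)))
print(f"median wall time: {seconds:.6f} s")
```

The median is used instead of the mean so that a single slow first run, for example while caches warm up, does not dominate the figure.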
Use Column-based Operations
# Good: Use aggregations and grouping
result = chdb.query("""
SELECT department, COUNT(*), AVG(salary)
FROM employees
GROUP BY department
ORDER BY AVG(salary) DESC
""")
# Good: Use window functions for analytics
result = chdb.query("""
SELECT name, salary,
rank() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees
""")
DataFrame Integration Issues
Pandas DataFrame Problems
import chdb  # needed for chdb.query() below
import chdb.dataframe as cdf
import pandas as pd
# Ensure DataFrames have proper column types
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85.5, 92.0, 88.5]
})
# Use chDB dataframe query
result = cdf.query("SELECT name, score FROM __tbl__ WHERE score > 85", tbl=df)
# Or use Python table engine
result = chdb.query("SELECT name, AVG(score) FROM Python(df) GROUP BY name")
Arrow Integration Issues
import pyarrow as pa
import chdb
# Create Arrow table with proper types
arrow_table = pa.table({
    'id': pa.array([1, 2, 3], type=pa.int64()),
    'name': pa.array(['Alice', 'Bob', 'Charlie'], type=pa.string()),
    'score': pa.array([85.5, 92.0, 88.5], type=pa.float64())
})
# Query Arrow table
result = chdb.query("SELECT * FROM Python(arrow_table) WHERE score > 85")
UDF (User Defined Functions) Issues
UDF Import/Registration Problems
from chdb.udf import chdb_udf
from chdb import query
# Ensure UDF is stateless and uses proper imports
@chdb_udf()
def clean_text(text):
    # Import modules inside the function
    import re
    return re.sub(r'[^\w\s]', '', text.lower())
# Test UDF
result = query("SELECT clean_text('Hello, World!') as cleaned")
print(result)
UDF Type Issues
from chdb.udf import chdb_udf
# Specify return type if not String
@chdb_udf(return_type="UInt64")
def calculate_sum(a, b):
    return int(a) + int(b)
# All input arguments are strings, convert as needed
@chdb_udf()
def process_json(json_str):
    import json
    try:
        data = json.loads(json_str)
        return str(data.get('value', 0))
    except (ValueError, TypeError, AttributeError):
        # Malformed JSON, non-string input, or JSON that is not an object
        return '0'
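Because a UDF body is ordinary Python, its string-in, string-out logic can be exercised as a plain function before the decorator is applied. The sketch below leaves the `@chdb_udf()` decorator out on purpose so it runs without chDB; `extract_value` is a hypothetical name for the same logic as `process_json`.

```python
def extract_value(json_str):
    """Return the 'value' field of a JSON object as a string, or '0' on any problem."""
    import json  # imported inside the function, as a UDF body would require

    try:
        data = json.loads(json_str)
        return str(data.get("value", 0))
    except (ValueError, TypeError, AttributeError):
        # ValueError: malformed JSON; TypeError: non-string input;
        # AttributeError: valid JSON that is not an object (for example a list)
        return "0"


# UDF inputs arrive as strings, so the checks use strings too.
assert extract_value('{"value": 7}') == "7"
assert extract_value("{}") == "0"
assert extract_value("not json") == "0"
assert extract_value("[1, 2]") == "0"
print("all UDF body checks passed")
```

Once the plain function behaves as expected, the decorator can be added back and the function called from SQL.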
Streaming Query Issues
Resource Not Released
from chdb import session as chs
sess = chs.Session()
# Always close streaming results if not fully consumed
stream_result = sess.send_query("SELECT * FROM numbers(1000000)", "CSV")
try:
    for i, chunk in enumerate(stream_result):
        if i >= 5:  # Early termination
            break
        # Process chunk
finally:
    stream_result.close()  # Important: release resources
# Or use with statement for automatic cleanup
with sess.send_query("SELECT * FROM numbers(1000000)", "CSV") as stream_result:
    for chunk in stream_result:
        # Process chunk
        pass
# Automatically closed
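The early-termination pattern can be wrapped in a small helper so that the `close()` call is never forgotten. This is a sketch only: `DemoStream` is a hypothetical stand-in for a streaming result (an iterable with a `close()` method), and `take_chunks` is an illustrative name, not a chDB function.

```python
from itertools import islice


class DemoStream:
    """Hypothetical stand-in for a streaming result: iterable with close()."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self.closed = False

    def __iter__(self):
        return self._chunks

    def close(self):
        self.closed = True


def take_chunks(stream, limit):
    """Return at most `limit` chunks and close the stream even if iteration fails."""
    try:
        return list(islice(stream, limit))
    finally:
        stream.close()


stream = DemoStream(["a", "b", "c", "d"])
print(take_chunks(stream, 2), stream.closed)  # prints: ['a', 'b'] True
```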
Arrow RecordBatch Issues
import pyarrow as pa
# Ensure proper batch size for memory management
stream_result = sess.send_query("SELECT * FROM numbers(100000)", "Arrow")
# Use appropriate batch size
batch_reader = stream_result.record_batch(rows_per_batch=10000)
for batch in batch_reader:
    print(f"Processing batch: {batch.num_rows} rows")
    # Process batch
stream_result.close()
Debug and Diagnostics
Enable Verbose Logging
import chdb
# Use the Pretty output format for human-readable results while debugging
result = chdb.query("SELECT 1", "Pretty")
# Check query performance metrics
print(f"Rows read: {result.rows_read()}")
print(f"Bytes read: {result.bytes_read()}")
print(f"Elapsed time: {result.elapsed()} seconds")
Session with Debug Parameters
from chdb import session as chs
# Create session with debug logging
sess = chs.Session("debug.chdb?log-level=debug&verbose")
result = sess.query("SELECT version()", "Pretty")
print(result)
Command Line Debug Mode
# Run chDB from command line with debug output
python3 -m chdb "SELECT version()" Pretty
python3 -m chdb "SELECT count() FROM numbers(100)" JSON
Getting Help
If you need additional help:
Check the GitHub Issues
Read the ClickHouse Documentation
Join the Discord Community
Check the Project Documentation
Error Reporting
When reporting errors, please include:
chDB version: print(chdb.__version__)
Python version: print(sys.version)
Operating system
Complete error traceback
Minimal example that reproduces the issue
import chdb
import sys
print(f"chDB version: {chdb.__version__}")
print(f"Python version: {sys.version}")
print(f"Engine version: {chdb.engine_version}")
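The snippet above fails if chDB cannot be imported, which is often exactly when a report is needed. The sketch below gathers the same details but tolerates a failed import; `environment_report` is a hypothetical helper name, and it reads `__version__` and `engine_version` defensively in case either attribute is missing.

```python
import platform
import sys


def environment_report():
    """Collect version details for a bug report; tolerates a missing or broken chDB."""
    lines = [
        f"Python version: {sys.version.split()[0]}",
        f"Operating system: {platform.system()} {platform.release()} "
        f"({platform.machine()})",
    ]
    try:
        import chdb
    except Exception as exc:  # broad on purpose: a broken native library may not raise ImportError
        lines.append(f"chDB import failed: {type(exc).__name__}: {exc}")
    else:
        lines.append(f"chDB version: {getattr(chdb, '__version__', 'unknown')}")
        lines.append(f"Engine version: {getattr(chdb, 'engine_version', 'unknown')}")
    return "\n".join(lines)


print(environment_report())
```

The output can be pasted into an issue together with the full traceback and a minimal reproducing example.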
Common Error Messages
“Session already exists”: Only one session can be active per process. Close the existing session before creating a new one.
“Memory limit exceeded”: Use streaming queries, file-based sessions, or process data in smaller batches.
“File not found”: Check the file path, prefer absolute paths, and make sure the file exists and is readable.
“Cannot determine file format”: Specify the file format and schema explicitly in the query.
“ImportError: No module named ‘_chdb’”: Reinstall chDB: pip uninstall chdb && pip install chdb
“Python version not supported”: chDB requires Python 3.8+. Upgrade your Python installation.
Frequently Asked Questions
Q: What platforms does chDB support?
A: chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64). Windows support is not available yet.
Q: Can chDB work with large datasets?
A: Yes, chDB can handle large datasets efficiently. Use streaming queries, file-based sessions, and persistent storage for very large datasets.
Q: Can I use chDB in production?
A: Yes, chDB is production-ready and part of the ClickHouse family. Test thoroughly in your specific environment and follow best practices for resource management.
Q: How does chDB compare to SQLite?
A: chDB is optimized for analytical workloads (OLAP) while SQLite is better for transactional workloads (OLTP). chDB offers better performance for complex analytical queries, aggregations, and data processing tasks.
Q: What file formats does chDB support?
A: chDB supports 70+ formats including Parquet, CSV, JSON, Arrow, ORC, and many more. See the ClickHouse formats documentation for the complete list.
Q: Can I query Pandas DataFrames directly?
A: Yes, chDB provides multiple ways to query Pandas DataFrames:
chdb.dataframe.query() function
Python(df) table engine
DataFrame-to-Parquet conversion
Q: How do I optimize query performance?
A: Use column selection instead of SELECT *, leverage Parquet format for better performance, use persistent sessions for repeated queries, and consider using streaming for large datasets.
Q: Can I use external Python libraries in UDFs?
A: Yes, but you must import all required modules inside the UDF function. UDFs should be stateless and pure Python functions.