User Defined Functions (UDF) ============================= chDB supports Python User Defined Functions (UDFs) that allow you to extend SQL capabilities with custom Python logic. UDFs can be called from SQL queries and integrate seamlessly with chDB's query engine. Overview -------- Python UDFs in chDB provide: - **Custom Business Logic**: Implement domain-specific calculations and transformations - **External Library Integration**: Use any Python library within your SQL queries - **Stateless Operations**: Functions are called for each row of input data API Reference ------------- .. automodule:: chdb.udf :members: :show-inheritance: Creating UDFs -------------- **Basic UDF Registration** .. code-block:: python from chdb.udf import chdb_udf from chdb import query # Simple UDF with default String return type @chdb_udf() def sum_udf(lhs, rhs): return int(lhs) + int(rhs) # Use the UDF in a query result = query("SELECT sum_udf(12, 22)") print(result) # Returns: 34 **Specifying Return Types** .. code-block:: python # UDF with specific return type @chdb_udf(return_type="UInt64") def calculate_total(price, quantity): return int(price) * int(quantity) # UDF returning Float64 @chdb_udf(return_type="Float64") def calculate_average(total, count): return float(total) / float(count) if int(count) > 0 else 0.0 # Use in complex queries result = query(""" SELECT product_name, calculate_total(price, quantity) as total_value, calculate_average(total_sales, days_active) as avg_daily_sales FROM sales_data """) Text Processing UDFs -------------------- **String Manipulation** .. code-block:: python from chdb.udf import chdb_udf from chdb import query @chdb_udf() def clean_text(text): # Import modules inside the function import re import string # Remove punctuation and normalize text = text.translate(str.maketrans('', '', string.punctuation)) text = re.sub(r'\s+', ' ', text.strip().lower()) return text @chdb_udf() def extract_domain(email): import re pattern = r'@([A-Za-z0-9.-]+\.[A-Za-z]{2,})' match = re.search(pattern, email) return match.group(1) if match else '' # Usage example result = query(""" SELECT email, extract_domain(email) as domain, clean_text(description) as clean_desc FROM user_data """) **Advanced Text Analysis** .. code-block:: python @chdb_udf() def sentiment_score(text): # Simple sentiment scoring (you could use NLTK, TextBlob, etc.) import re positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful'] negative_words = ['bad', 'terrible', 'awful', 'horrible', 'poor'] words = re.findall(r'\b\w+\b', text.lower()) positive_count = sum(1 for word in words if word in positive_words) negative_count = sum(1 for word in words if word in negative_words) return str(positive_count - negative_count) JSON and Data Processing UDFs ------------------------------ **JSON Manipulation** .. code-block:: python @chdb_udf() def parse_json_field(json_str, field_path): import json try: data = json.loads(json_str) # Support nested field paths like "user.profile.name" fields = field_path.split('.') current = data for field in fields: if isinstance(current, dict) and field in current: current = current[field] else: return '' return str(current) except: return '' @chdb_udf() def json_array_length(json_str): import json try: data = json.loads(json_str) if isinstance(data, list): return str(len(data)) return '0' except: return '0' # Usage in analytics queries result = query(""" SELECT user_id, parse_json_field(metadata, 'profile.age') as age, parse_json_field(metadata, 'preferences.theme') as theme, json_array_length(tags) as tag_count FROM user_events """) Mathematical and Statistical UDFs ---------------------------------- **Custom Calculations** .. code-block:: python @chdb_udf(return_type="Float64") def compound_interest(principal, rate, time): # Calculate compound interest import math p = float(principal) r = float(rate) / 100 # Convert percentage to decimal t = float(time) amount = p * math.pow(1 + r, t) return amount - p # Return interest earned @chdb_udf(return_type="Float64") def haversine_distance(lat1, lon1, lat2, lon2): # Calculate distance between two coordinates import math # Convert latitude and longitude from degrees to radians lat1, lon1, lat2, lon2 = map(math.radians, [float(lat1), float(lon1), float(lat2), float(lon2)]) # Haversine formula dlat = lat2 - lat1 dlon = lon2 - lon1 a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2 c = 2 * math.asin(math.sqrt(a)) r = 6371 # Radius of earth in kilometers return c * r Data Validation UDFs -------------------- **Input Validation and Cleaning** .. code-block:: python @chdb_udf() def validate_email(email): import re pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' return 'true' if re.match(pattern, email) else 'false' @chdb_udf() def normalize_phone(phone): import re # Remove all non-digit characters digits = re.sub(r'[^\d]', '', phone) # Format as US phone number if len(digits) == 10: return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}" elif len(digits) == 11 and digits[0] == '1': return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}" else: return digits @chdb_udf() def clean_currency(amount): import re # Remove currency symbols and formatting cleaned = re.sub(r'[^\d.-]', '', amount) try: return str(float(cleaned)) except: return '0.0' UDF with External Libraries ---------------------------- **Using Third-party Libraries** .. code-block:: python @chdb_udf() def hash_password(password): # Example using hashlib (built-in) import hashlib return hashlib.sha256(password.encode()).hexdigest() @chdb_udf() def parse_user_agent(ua_string): # This would require installing user-agents library # pip install user-agents try: from user_agents import parse user_agent = parse(ua_string) return f"{user_agent.browser.family}|{user_agent.os.family}" except ImportError: # Fallback if library not available return "unknown|unknown" @chdb_udf() def geocode_ip(ip_address): # This would require a geolocation service # For demo purposes, returning mock data import re # Simple IP validation if re.match(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$', ip_address): return f"Country:Unknown,City:Unknown" return "Invalid IP" UDF Best Practices and Guidelines ----------------------------------- **Important Notes from README** Based on the official documentation, here are the key guidelines for chDB UDFs: 1. **Stateless Functions**: UDFs should be stateless. Only User Defined Functions are supported, not User Defined Aggregation Functions (UDAFs). 2. **Default Return Type**: Default return type is String. Specify return type explicitly for other types. 3. **Input Arguments**: All function arguments are received as strings (TabSeparated format), so convert them as needed. 4. **Row Processing**: The function is called for each line of input, similar to: .. code-block:: python # Conceptual execution model def sum_udf(lhs, rhs): return int(lhs) + int(rhs) for line in sys.stdin: args = line.strip().split('\t') lhs = args[0] rhs = args[1] print(sum_udf(lhs, rhs)) sys.stdout.flush() 5. **Module Imports**: Import all required Python modules **inside the function**: .. code-block:: python @chdb_udf() def func_use_json(arg): import json # Import inside the function try: data = json.loads(arg) return str(data.get('key', '')) except: return '' 6. **Python Interpreter**: The Python interpreter used is the same as the one running the script (from ``sys.executable``). UDF Configuration and Advanced Usage ------------------------------------ **UDF Path Configuration** .. code-block:: python import chdb # Specify custom UDF path when querying result = chdb.query( "SELECT my_custom_function(column) FROM table", udf_path="./my_udfs/" ) **Supported Data Types** UDFs support all ClickHouse data types. Specify return type using ClickHouse type names: .. code-block:: python @chdb_udf(return_type="UInt32") def count_words(text): import re words = re.findall(r'\b\w+\b', text) return len(words) @chdb_udf(return_type="Array(String)") def split_text(text, delimiter): return text.split(delimiter) @chdb_udf(return_type="Decimal(10,2)") def calculate_tax(amount, rate): return float(amount) * float(rate) / 100 **Error Handling in UDFs** Always implement proper error handling: .. code-block:: python @chdb_udf() def safe_divide(a, b): try: numerator = float(a) denominator = float(b) if denominator == 0: return 'NULL' return str(numerator / denominator) except (ValueError, TypeError): return 'ERROR' Performance Considerations -------------------------- **Optimization Tips** 1. **Minimize Imports**: Only import what you need inside the function 2. **Avoid Heavy Computations**: UDFs are called for each row 3. **Use Built-in Functions When Possible**: ClickHouse built-ins are usually faster 4. **Cache Expensive Operations**: For complex calculations, consider pre-processing data .. code-block:: python @chdb_udf() def optimized_function(input_data): # Import only what's needed import json # Handle common cases quickly if not input_data or input_data == 'null': return '' try: # Main processing data = json.loads(input_data) return str(data.get('result', '')) except: return '' Debugging UDFs -------------- **Testing UDFs Locally** .. code-block:: python # Test your UDF independently before using in queries @chdb_udf() def test_function(input_val): import json try: data = json.loads(input_val) return str(data.get('value', 0)) except Exception as e: # Debug: print error (remove in production) print(f"Error: {e}") return '0' # Test locally first test_input = '{"value": 42}' result = test_function(test_input) print(f"Test result: {result}") # Then use in query query_result = query("SELECT test_function('{}') as result".format(test_input)) **Common Issues and Solutions** 1. **Import Errors**: Always import inside the function 2. **Type Conversion**: All inputs are strings, convert explicitly 3. **Return Type Mismatch**: Ensure returned value matches declared type 4. **Exception Handling**: Wrap operations in try-catch blocks Complete Example: Log Processing UDF ------------------------------------ Here's a comprehensive example that demonstrates multiple UDF concepts: .. code-block:: python from chdb.udf import chdb_udf from chdb import query @chdb_udf() def parse_log_entry(log_line): """Parse Apache/Nginx log entries""" import re import json from datetime import datetime # Apache Common Log Format regex pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(.*?)" (\d+) (\d+|-) "(.*?)" "(.*?)"' try: match = re.match(pattern, log_line) if not match: return json.dumps({"error": "Invalid log format"}) ip, timestamp, request, status, size, referer, user_agent = match.groups() # Parse request request_parts = request.split(' ') method = request_parts[0] if len(request_parts) > 0 else '' path = request_parts[1] if len(request_parts) > 1 else '' result = { "ip": ip, "timestamp": timestamp, "method": method, "path": path, "status": int(status), "size": int(size) if size != '-' else 0, "referer": referer if referer != '-' else '', "user_agent": user_agent } return json.dumps(result) except Exception as e: return json.dumps({"error": str(e)}) @chdb_udf() def extract_log_field(log_json, field_name): """Extract specific field from parsed log JSON""" import json try: data = json.loads(log_json) return str(data.get(field_name, '')) except: return '' # Example usage sample_log = '192.168.1.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"' result = query(f""" SELECT parse_log_entry('{sample_log}') as parsed, extract_log_field(parse_log_entry('{sample_log}'), 'ip') as client_ip, extract_log_field(parse_log_entry('{sample_log}'), 'status') as status_code """) print(result) See Also -------- - :doc:`troubleshooting` - UDF troubleshooting and common issues - :doc:`examples` - More UDF examples and use cases - :doc:`api` - Complete API reference - `ClickHouse Data Types `_ - Supported return types