NL-Cube Technical Details
This document provides in-depth technical information about NL-Cube’s implementation, covering database integration, LLM functionality, and the data processing pipeline.
DuckDB Integration
NL-Cube uses DuckDB as its embedded analytics database engine, chosen for its performance and ease of deployment.
Connection Management
NL-Cube implements a sophisticated multi-database connection management system:
Connection Pool: Maintains a configurable pool of database connections using r2d2
// ConnectionManager implementation
impl ManageConnection for DuckDBConnectionManager {
    type Connection = Connection;
    type Error = duckdb::Error;

    fn connect(&self) -> Result<Self::Connection, Self::Error> {
        Connection::open(&self.connection_string)
    }

    fn is_valid(&self, conn: &mut Self::Connection) -> Result<(), Self::Error> {
        conn.execute("SELECT 1", [])?;
        Ok(())
    }

    fn has_broken(&self, _conn: &mut Self::Connection) -> bool {
        false
    }
}
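With the manager defined, the pool itself is built through r2d2's builder. A minimal sketch, assuming a hypothetical DuckDBConnectionManager::new constructor (the real constructor and pool-size configuration may differ):

// Build a pool around the manager. Constructor name and pool size
// are illustrative assumptions, not taken from the NL-Cube source.
let manager = DuckDBConnectionManager::new("data/main.db");
let pool = r2d2::Pool::builder()
    .max_size(8) // the configurable pool size
    .build(manager)?;

// Checking out a connection blocks until one is free.
let conn = pool.get()?;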
Multi-Database Support: Each subject gets its own DuckDB database file
pub struct MultiDbConnectionManager {
    main_db_path: String,
    data_dir: PathBuf,
    attached_dbs: Arc<Mutex<HashMap<String, String>>>,
}
Thread Safety: Blocking database operations are executed in dedicated Tokio tasks
tokio::task::spawn_blocking(move || {
    // Database operations that might block
    let conn = Connection::open(&db_path_string)?;
    // ...
})
Schema Management
The SchemaManager component maintains metadata about database schemas:
- Schema Discovery: Scans subject directories for database files
- Cache Management: Maintains a cached view of schemas and tables
- Refresh Mechanism: Periodically updates schema information
pub struct SchemaManager {
    schema_cache: RwLock<HashMap<String, Vec<String>>>,
    last_refresh: RwLock<chrono::DateTime<chrono::Utc>>,
    data_dir: PathBuf,
}
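A refresh pass might look like the following sketch, which rescans the data directory and rebuilds the cache. The method name, the .db file convention, and the use of std::sync::RwLock are assumptions for illustration:

impl SchemaManager {
    // Rescan the data directory for subject databases and rebuild the
    // cached subject -> tables map. (Sketch only: in the real refresh,
    // table names would be read from each database rather than left empty.)
    pub fn refresh(&self) -> std::io::Result<()> {
        let mut cache = HashMap::new();
        for entry in std::fs::read_dir(&self.data_dir)? {
            let path = entry?.path();
            if path.extension().map_or(false, |ext| ext == "db") {
                if let Some(subject) = path.file_stem().and_then(|s| s.to_str()) {
                    cache.insert(subject.to_string(), Vec::new());
                }
            }
        }
        *self.schema_cache.write().unwrap() = cache;
        *self.last_refresh.write().unwrap() = chrono::Utc::now();
        Ok(())
    }
}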
Query Execution Flow
When executing a natural language query:
- Context Gathering: Schema information is extracted from the database
- LLM Translation: Natural language is translated to SQL
- Execution: SQL is executed against the appropriate subject database
- Arrow Conversion: Results are converted to Apache Arrow format
- Response: Arrow data is returned to the client for visualization
// Simplified query flow
let table_metadata = app_state.get_table_metadata(Some(&target_subject)).await?;
let raw_sql = mgr.generate_sql(&payload.question, &table_metadata).await?;
let conn = duckdb::Connection::open(&db_path)?;
let mut stmt = conn.prepare(&raw_sql)?;
let arrow_batch = stmt.query_arrow([])?;
LLM Integration
NL-Cube features a flexible LLM integration system that supports multiple backends through a provider architecture.
Provider Architecture
The SqlGenerator trait defines the interface for LLM providers:
#[async_trait]
pub trait SqlGenerator: Send + Sync {
async fn generate_sql(&self, question: &str, schema: &str) -> Result<String, LlmError>;
}
Currently implemented providers include:
Ollama Provider: For local model execution
pub struct OllamaProvider {
    client: reqwest::Client,
    api_url: String,
    model: String,
}
Remote Provider: For cloud-based LLM APIs
pub struct RemoteLlmProvider {
    client: reqwest::Client,
    api_url: String,
    api_key: String,
    model: String,
}
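To make the provider architecture concrete, here is a hedged sketch of how OllamaProvider could implement the trait against Ollama's /api/generate endpoint. The prompt assembly is simplified, and the sketch assumes LlmError converts from reqwest errors:

#[async_trait]
impl SqlGenerator for OllamaProvider {
    async fn generate_sql(&self, question: &str, schema: &str) -> Result<String, LlmError> {
        // Assemble the prompt (NL-Cube uses the full template shown below).
        let prompt = format!("Question: {question}\nSchema:\n{schema}");

        // With streaming disabled, Ollama returns a single JSON object
        // whose "response" field holds the generated text.
        let body = serde_json::json!({
            "model": self.model,
            "prompt": prompt,
            "stream": false,
        });
        let reply: serde_json::Value = self
            .client
            .post(format!("{}/api/generate", self.api_url))
            .json(&body)
            .send()
            .await?
            .json()
            .await?;

        Ok(reply["response"].as_str().unwrap_or_default().to_string())
    }
}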
Prompt Engineering
NL-Cube uses carefully crafted prompts to guide the LLM in generating SQL:
- Schema Context: Database structure is provided to the LLM
- Clear Instructions: The prompt contains specific SQL generation rules
- Examples: Sample queries help the model understand the expected output
Example prompt template:
### Instructions:
Your task is to convert a question into a SQL query for DuckDB, given a database schema.
Adhere to these rules:
- **Be careful with column names - they are case sensitive**
- **Use the exact spelling of column names as provided in the schema**
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use Table Aliases** to prevent ambiguity.
- When creating a ratio, always cast the numerator as float
### Input:
Generate a SQL query that answers the question `{question}`.
This query will run on a DuckDB database with the following tables and columns:
{schema}
### Expected SQL Format:
- Use lowercase for SQL keywords (SELECT, FROM, WHERE, etc.)
- Reference column names exactly as shown in the schema
- Make sure to use double quotes around column names with spaces or special characters
- End your query with a semicolon
### Response:
Based on your instructions, here is the SQL query I have generated:
```sql
SQL Extraction and Validation
After receiving the LLM response:
- SQL Extraction: The generated SQL is extracted from the response
- Validation: Basic syntax checks ensure the SQL is well-formed
- Parameter Stripping: Any parameters are properly formatted
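A minimal extraction sketch: since the prompt ends with an opening ```sql fence, the query is whatever the model emits before the closing fence. The function name is illustrative:

// Extract the SQL from the LLM response. The model is primed with an
// opening ```sql fence, so take everything up to the closing fence
// (or the whole response if no fence appears), then normalize it.
fn extract_sql(response: &str) -> String {
    let body = match response.find("```sql") {
        Some(start) => &response[start + 6..],
        None => response,
    };
    let sql = match body.find("```") {
        Some(end) => &body[..end],
        None => body,
    }
    .trim();
    // The prompt rules require a trailing semicolon.
    if sql.ends_with(';') { sql.to_string() } else { format!("{sql};") }
}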
Data Processing Pipeline
File Ingestion
NL-Cube supports ingesting data from CSV and Parquet files through a flexible ingestor architecture:
pub trait FileIngestor: Send + Sync {
    fn ingest(
        &self,
        path: &Path,
        table_name: &str,
        subject: &str,
    ) -> Result<schema::TableSchema, IngestError>;
}
The ingestion process for each file type:
- CSV Ingestion:
  - Read sample rows to infer schema
  - Create table with appropriate column types
  - Use DuckDB's read_csv_auto for optimized loading (see the sketch after this list)
  - Verify row count after ingestion
- Parquet Ingestion:
  - Extract schema from Parquet metadata
  - Create matching table structure
  - Use DuckDB's read_parquet for optimized loading
  - Handle binary fields and large files
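A minimal sketch of the CSV path, assuming ingestion boils down to a single CREATE TABLE ... AS SELECT over read_csv_auto (NL-Cube's actual ingestor also performs schema inference and error handling):

// Load a CSV into a new table via DuckDB's read_csv_auto, then verify
// the row count. (Sketch: the function name is illustrative.)
fn ingest_csv(conn: &duckdb::Connection, table: &str, path: &str) -> duckdb::Result<i64> {
    conn.execute_batch(&format!(
        "CREATE TABLE \"{table}\" AS SELECT * FROM read_csv_auto('{path}');"
    ))?;
    conn.query_row(&format!("SELECT count(*) FROM \"{table}\""), [], |row| row.get(0))
}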
Schema Inference
NL-Cube uses a combination of techniques to infer the schema from data files:
- Type Detection: Analyzes sample data to determine column types
- Nullability: Determines if columns can contain NULL values
- Constraints: Identifies potential primary keys and constraints
The schema is represented using a type-safe model:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum DataType {
    Integer,
    BigInt,
    Double,
    String,
    Boolean,
    Date,
    Timestamp,
    Unknown(String),
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ColumnSchema {
pub name: String,
pub data_type: DataType,
pub nullable: bool,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TableSchema {
pub name: String,
pub columns: Vec<ColumnSchema>,
}
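Type detection ultimately maps raw type names onto this enum. A hedged sketch of such a mapping (the type names covered are a subset, and the function illustrates the approach rather than NL-Cube's exact logic):

// Map a DuckDB type name onto the schema model, falling back to
// Unknown for anything unrecognized.
fn map_duckdb_type(name: &str) -> DataType {
    match name.to_ascii_uppercase().as_str() {
        "INTEGER" | "INT" => DataType::Integer,
        "BIGINT" => DataType::BigInt,
        "DOUBLE" | "FLOAT" => DataType::Double,
        "VARCHAR" | "TEXT" => DataType::String,
        "BOOLEAN" => DataType::Boolean,
        "DATE" => DataType::Date,
        "TIMESTAMP" => DataType::Timestamp,
        other => DataType::Unknown(other.to_string()),
    }
}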
Apache Arrow Integration
NL-Cube uses Apache Arrow for efficient data interchange:
- Query Results: Database query results are converted to Arrow format
- Streaming: Data is streamed to the client in batches
- Zero-Copy: Arrow's columnar buffers can be handed between components without copying or per-row re-serialization
- Client Integration: Seamlessly integrates with Perspective visualization
// Convert query results to Arrow
let arrow_batch = stmt.query_arrow([])?;
let schema = arrow_batch.get_schema();
let record_batches = arrow_batch.collect::<Vec<_>>();
// Serialize to IPC format
let mut buffer = Vec::new();
let mut file_writer = arrow::ipc::writer::FileWriter::try_new(&mut buffer, schema.deref())?;
for batch in &record_batches {
    file_writer.write(batch)?;
}
file_writer.finish()?;
Web API Reference
NL-Cube provides a comprehensive REST API for programmatic access.
Core Endpoints
Query Execution
POST /api/query
Executes a raw SQL query against the selected subject database.
Request:
{
"query": "SELECT * FROM orders LIMIT 10;"
}
Response:
- Content-Type: application/vnd.apache.arrow.file
- Headers:
  - X-Total-Count: Number of rows
  - X-Execution-Time: Execution time in ms
  - X-Columns: JSON array of column names
  - X-Generated-SQL: The executed SQL query
Natural Language Query
POST /api/nl-query
Translates a natural language question to SQL and executes it.
Request:
{
"question": "What are the top 5 products by revenue?"
}
Response:
- Content-Type: application/vnd.apache.arrow.file
- Headers:
  - X-Total-Count: Number of rows
  - X-Execution-Time: Execution time in ms
  - X-Columns: JSON array of column names
  - X-Generated-SQL: The generated SQL query
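Calling this endpoint from Rust might look like the following sketch (the base URL is a placeholder and error handling is simplified):

// Post a question and read the generated SQL back from the headers.
// (Sketch: host and port are placeholders.)
let client = reqwest::Client::new();
let resp = client
    .post("http://localhost:3000/api/nl-query")
    .json(&serde_json::json!({ "question": "What are the top 5 products by revenue?" }))
    .send()
    .await?;

if let Some(sql) = resp.headers().get("X-Generated-SQL") {
    println!("generated SQL: {}", sql.to_str()?);
}
let arrow_ipc = resp.bytes().await?; // the Arrow IPC payload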
Database Management
GET /api/subjects
Returns a list of available subject databases.
GET /api/subjects/{subject}
Returns details about a specific subject, including tables.
POST /api/subjects/{subject}
Creates a new subject database.
POST /api/subjects/select/{subject}
Selects a subject as the current database context.
DELETE /api/subjects/{subject}
Deletes a subject database.
File Upload
POST /api/upload/{subject}
Uploads files to a subject database for ingestion.
- Content-Type: multipart/form-data
- Supports CSV and Parquet files
Schemas
GET /api/schema
Returns the schema definition for the current subject.
Reports
GET /api/reports
Returns a list of saved reports.
GET /api/reports/{id}
Returns a specific saved report.
POST /api/reports
Saves a new report.
DELETE /api/reports/{id}
Deletes a saved report.
Authentication
NL-Cube currently uses stateless authentication. Future versions will support OAuth.
Error Handling
API errors follow a consistent format:
{
"error": "Error message",
"status": 400
}
Common status codes:
- 200: Success
- 400: Bad Request (invalid parameters)
- 404: Not Found (subject or resource not found)
- 500: Internal Server Error
Performance Optimizations
NL-Cube includes several performance optimizations:
- Connection Pooling: Reuses database connections for efficiency
- Async Processing: Non-blocking I/O for web requests
- Blocking Task Offloading: CPU-intensive tasks run in dedicated threads
- Arrow Data Format: Efficient data interchange
- Static File Embedding: UI assets are embedded in the binary
Critical database operations are executed in dedicated blocking tasks:
// Run blocking work on a dedicated thread; `tx` is the sending half of
// a channel (e.g. a tokio oneshot) created by the caller to hand the
// result back to the async context.
tokio::task::spawn_blocking(move || {
    let rt = tokio::runtime::Handle::current();
    let result = rt.block_on(async {
        // Database operations
    });
    let _ = tx.send(result);
});
Security Considerations
- Input Validation: All user inputs are validated before processing
- Parameterized Queries: Prevents SQL injection (see the sketch after this list)
- File Type Validation: Only allows approved file formats
- Resource Limits: Prevents excessive resource usage
- Cross-Origin Protection: CORS headers restrict access
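As an illustration of the parameterized-query point, a hedged sketch using the duckdb crate (the table and binding are examples, not NL-Cube's actual queries):

use duckdb::params;

// Bind user input as a parameter instead of splicing it into the SQL
// string, so it can never be interpreted as SQL.
let mut stmt = conn.prepare("SELECT * FROM orders WHERE customer_id = ?")?;
let mut rows = stmt.query(params![user_supplied_id])?;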
Deployment Considerations
Resource Requirements
Minimum recommended specifications:
- 4GB RAM
- 2 CPU cores
- 1GB free disk space
Scaling
For larger deployments:
- Increase pool_size in configuration
- Allocate more memory for DuckDB
- Consider using a more powerful machine
Environment Variables
- DATA_DIR: Override the data directory location
- DEV_MODE: Enable template hot-reloading (development only)
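At startup the server might resolve these roughly as follows (the default data directory shown is an assumption):

// Read configuration from the environment, with an assumed default.
let data_dir = std::env::var("DATA_DIR").unwrap_or_else(|_| "./data".to_string());
let dev_mode = std::env::var("DEV_MODE").is_ok();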