NL-Cube Developer Guide

This guide provides information for developers who want to contribute to or extend NL-Cube. It covers project setup, code organization, and contribution guidelines.

Development Environment Setup

Prerequisites

Rust (1.85.0 or later, 2024 edition; earlier toolchains do not support the 2024 edition)
Git
DuckDB
Ollama (optional, for local LLM testing)

Setting Up the Development Environment

Clone the repository

git clone https://github.com/joefrost01/nl-cube.git
cd nl-cube

Install Rust dependencies

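The project uses Cargo for dependency management. All dependencies are declared in Cargo.toml, and Cargo downloads and compiles them automatically on the first build, so no separate install step is required. To download them ahead of time, run cargo fetch.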

Install local LLM (optional)

For local LLM testing, install Ollama and pull a SQL-focused model:

# Install Ollama from https://ollama.ai/download
ollama pull sqlcoder

Configure the application

Create a config.toml file in the project root:

data_dir = "data"

[database]
connection_string = "nl-cube.db"
pool_size = 5

[web]
host = "127.0.0.1"
port = 3000
static_dir = "static"

[llm]
backend = "ollama"
model = "sqlcoder"
api_url = "http://localhost:11434/api/generate"

Building and Running

For development:

cargo run

For production build:

cargo build --release

Running Tests

cargo test

Code Organization

NL-Cube is organized into several key modules:

Project Structure

nl-cube/
├── src/                 # Rust source code
│   ├── config.rs        # Configuration management
│   ├── db/              # Database connection and schema management
│   ├── ingest/          # File ingestion (CSV, Parquet)
│   ├── llm/             # Language model integration
│   ├── util/            # Utility functions
│   ├── web/             # Web server and API
│   └── main.rs          # Application entry point
├── static/              # Frontend assets
│   ├── css/             # Stylesheets
│   ├── js/              # JavaScript modules
│   └── index.html       # Main application page
├── docs/                # Documentation (Quarto)
├── templates/           # HTML templates
├── Cargo.toml           # Rust dependencies
├── config.toml          # Configuration file
└── README.md            # Project overview

Core Modules

1. Configuration (`src/config.rs`)

Handles parsing configuration from files and command-line arguments:

pub struct AppConfig {
    pub database: DatabaseConfig,
    pub web: WebConfig,
    pub llm: LlmConfig,
    pub data_dir: String,
}
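As a rough illustration of how the sections of config.toml could map onto nested structs, the sketch below mirrors the sample configuration shown earlier. The field names come from that sample; the field types, the Default values, and the bind_address helper are assumptions made for this example and may differ from the actual definitions in src/config.rs.

```rust
// Sketch only: field names follow the sample config.toml. The field types,
// the Default values, and `bind_address` are assumptions for illustration.

#[derive(Debug, Clone, PartialEq)]
pub struct DatabaseConfig {
    pub connection_string: String,
    pub pool_size: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct WebConfig {
    pub host: String,
    pub port: u16,
    pub static_dir: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct LlmConfig {
    pub backend: String,
    pub model: String,
    pub api_url: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct AppConfig {
    pub database: DatabaseConfig,
    pub web: WebConfig,
    pub llm: LlmConfig,
    pub data_dir: String,
}

impl Default for AppConfig {
    /// Mirrors the values from the sample config.toml.
    fn default() -> Self {
        Self {
            database: DatabaseConfig {
                connection_string: "nl-cube.db".to_string(),
                pool_size: 5,
            },
            web: WebConfig {
                host: "127.0.0.1".to_string(),
                port: 3000,
                static_dir: "static".to_string(),
            },
            llm: LlmConfig {
                backend: "ollama".to_string(),
                model: "sqlcoder".to_string(),
                api_url: "http://localhost:11434/api/generate".to_string(),
            },
            data_dir: "data".to_string(),
        }
    }
}

impl WebConfig {
    /// Hypothetical helper: the "host:port" string a listener would bind to.
    pub fn bind_address(&self) -> String {
        format!("{}:{}", self.host, self.port)
    }
}
```

With these defaults, AppConfig::default().web.bind_address() yields "127.0.0.1:3000", which matches the [web] section of the sample file.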

2. Database (`src/db/`)

db_pool.rs: Connection pool management
multi_db_pool.rs: Multiple database support
schema_manager.rs: Schema tracking and cache

3. Ingestion (`src/ingest/`)

csv.rs: CSV file processor
parquet.rs: Parquet file processor
schema.rs: Schema definition types

4. LLM Integration (`src/llm/`)

models.rs: Data structures for LLM interactions
providers/: LLM backend implementations
- ollama.rs: Ollama integration
- remote.rs: Remote API integration

5. Web Server (`src/web/`)

handlers/: API and UI request handlers
routes.rs: URL routing
state.rs: Application state management
static_files.rs: Static file serving
templates.rs: Template rendering

Frontend Structure

HTML: Basic structure in static/index.html
CSS: Styling in static/css/nlcube.css
JavaScript:
- nlcube.js: Main application logic
- perspective-utils.js: Visualization handling
- query-utils.js: Query management
- upload-utils.js: File upload management
- reports-utils.js: Saved reports handling

Key Components

AppState

The central state container that holds shared resources:

pub struct AppState {
    pub config: AppConfig,
    pub db_pool: Pool<DuckDBConnectionManager>,
    pub llm_manager: Arc<Mutex<LlmManager>>,
    pub data_dir: PathBuf,
    pub subjects: RwLock<Vec<String>>,
    pub startup_time: chrono::DateTime<chrono::Utc>,
    pub current_subject: RwLock<Option<String>>,
    pub schema_manager: SchemaManager,
}
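To show how the lock-protected fields might be used from request handlers, here is a trimmed-down sketch that keeps only subjects and current_subject. It assumes std::sync::RwLock (the project may use an async lock instead), and SubjectState, select_subject, current, and select_from_worker are hypothetical names introduced for this example.

```rust
use std::sync::{Arc, RwLock};

/// Trimmed-down stand-in for AppState: only the two subject fields are kept.
pub struct SubjectState {
    pub subjects: RwLock<Vec<String>>,
    pub current_subject: RwLock<Option<String>>,
}

impl SubjectState {
    pub fn new(subjects: Vec<String>) -> Self {
        Self {
            subjects: RwLock::new(subjects),
            current_subject: RwLock::new(None),
        }
    }

    /// Hypothetical helper: select a subject only if it is in the known list.
    /// Returns whether the subject was known (and therefore selected).
    pub fn select_subject(&self, name: &str) -> bool {
        // Take the read lock in its own scope so it is released before writing.
        let known = {
            let subjects = self.subjects.read().expect("subjects lock poisoned");
            subjects.iter().any(|s| s == name)
        };
        if known {
            let mut current = self
                .current_subject
                .write()
                .expect("current_subject lock poisoned");
            *current = Some(name.to_string());
        }
        known
    }

    /// Returns a copy of the currently selected subject, if any.
    pub fn current(&self) -> Option<String> {
        self.current_subject
            .read()
            .expect("current_subject lock poisoned")
            .clone()
    }
}

/// Shares the state with another thread through an Arc, similar to how
/// handlers receive an Arc<AppState>.
pub fn select_from_worker(state: Arc<SubjectState>, name: &'static str) -> bool {
    std::thread::spawn(move || state.select_subject(name))
        .join()
        .expect("worker panicked")
}
```

Keeping the read and write locks in separate scopes avoids holding both at once, which reduces the chance of lock-ordering problems as more fields are added.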

LlmManager

Manages language model interactions through the SqlGenerator trait:

#[async_trait]
pub trait SqlGenerator: Send + Sync {
    async fn generate_sql(&self, question: &str, schema: &str) -> Result<String, LlmError>;
}

IngestManager

Handles file ingestion through the FileIngestor trait:

pub trait FileIngestor: Send + Sync {
    fn ingest(
        &self,
        path: &Path,
        table_name: &str,
        subject: &str,
    ) -> Result<schema::TableSchema, IngestError>;
}

Web Server

Built using Axum, the web server provides both API endpoints and UI routes:

pub fn ui_routes() -> Router<Arc<AppState>> {
    Router::new()
        .route("/", get(handlers::ui::index_handler))
        .route("/static/{*path}", get(static_handler))
}

pub fn api_routes() -> Router<Arc<AppState>> {
    Router::new()
        .nest(
            "/api",
            Router::new()
                // Query endpoints
                .route("/query", post(handlers::api::execute_query))
                .route("/nl-query", post(sync_nl_query_handler))
                // ... other routes
        )
}

Schema Management

The SchemaManager tracks database schemas for LLM context:

pub struct SchemaManager {
    schema_cache: RwLock<HashMap<String, Vec<String>>>,
    last_refresh: RwLock<chrono::DateTime<chrono::Utc>>,
    data_dir: PathBuf,
}
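The cache shape above can be exercised with a small, dependency-free sketch. It assumes the map goes from a subject name to the table names in that subject, and it swaps chrono for std::time::Instant. SchemaCache, set_tables, tables_for, and is_stale are hypothetical names, not the actual SchemaManager API.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Stand-in for SchemaManager. Assumption: keys are subject names and
/// values are the table names within that subject.
pub struct SchemaCache {
    schema_cache: RwLock<HashMap<String, Vec<String>>>,
    last_refresh: RwLock<Instant>,
}

impl SchemaCache {
    pub fn new() -> Self {
        Self {
            schema_cache: RwLock::new(HashMap::new()),
            last_refresh: RwLock::new(Instant::now()),
        }
    }

    /// Replace the cached table list for one subject and record the refresh time.
    pub fn set_tables(&self, subject: &str, tables: Vec<String>) {
        self.schema_cache
            .write()
            .expect("schema cache lock poisoned")
            .insert(subject.to_string(), tables);
        *self.last_refresh.write().expect("refresh lock poisoned") = Instant::now();
    }

    /// Return a copy of the cached table names, or an empty list for
    /// subjects that have not been cached.
    pub fn tables_for(&self, subject: &str) -> Vec<String> {
        self.schema_cache
            .read()
            .expect("schema cache lock poisoned")
            .get(subject)
            .cloned()
            .unwrap_or_default()
    }

    /// True when the last refresh is older than `max_age`.
    pub fn is_stale(&self, max_age: Duration) -> bool {
        self.last_refresh
            .read()
            .expect("refresh lock poisoned")
            .elapsed()
            > max_age
    }
}

impl Default for SchemaCache {
    fn default() -> Self {
        Self::new()
    }
}
```

Returning cloned table lists keeps the read lock short-lived, at the cost of a copy per lookup. For small schemas that trade-off is usually acceptable.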

Design Patterns

NL-Cube uses several key design patterns:

1. Repository Pattern

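Database and LLM interactions are encapsulated behind traits and managers, so callers depend on an interface rather than on a concrete backend. For example, SqlGenerator hides which LLM provider produces the SQL:

#[async_trait]
pub trait SqlGenerator: Send + Sync {
    async fn generate_sql(&self, question: &str, schema: &str) -> Result<String, LlmError>;
}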

2. Dependency Injection

Components receive their dependencies through constructors:

pub fn new_with_multi_db(
    config: AppConfig,
    db_pool: Pool<DuckDBConnectionManager>,
    multi_db_manager: Arc<MultiDbConnectionManager>,
    llm_manager: LlmManager,
    data_dir: PathBuf,
) -> Self { ... }

3. Trait-Based Polymorphism

Interfaces are defined as traits, allowing for multiple implementations:

pub trait FileIngestor: Send + Sync { ... }

pub struct CsvIngestor { ... }
pub struct ParquetIngestor { ... }

4. Actor Model

Blocking work runs on a separate thread via tokio::task::spawn_blocking, and the result is sent back to the awaiting task over a oneshot channel:

let (tx, rx) = oneshot::channel();
tokio::task::spawn_blocking(move || {
    // Task execution
    let _ = tx.send(result);
});
// Wait for result
match rx.await { ... }
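The same hand-off can be tried without an async runtime. The sketch below is a standard-library analog, not the project's code: std::thread::spawn stands in for spawn_blocking, std::sync::mpsc stands in for the oneshot channel, and run_on_worker is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// Run a closure on a worker thread and hand the result back over a channel.
/// Std-only analog of the spawn_blocking + oneshot pattern.
pub fn run_on_worker<T, F>(job: F) -> Result<T, mpsc::RecvError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let result = job();
        // The receiver may already be gone; ignoring the send error mirrors
        // `let _ = tx.send(result)` in the async version.
        let _ = tx.send(result);
    });
    // Blocks until the worker sends a value. Returns Err if the worker
    // dropped the sender without sending anything.
    rx.recv()
}
```

Unlike rx.await, rx.recv() blocks the calling thread, so this variant is only suitable outside async code. In a Tokio application the oneshot version shown above keeps the runtime free while the blocking work runs.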

Extension Points

Adding a New File Ingestor

1. Create a new struct that implements the FileIngestor trait
2. Add the ingestor to IngestManager
3. Update file type detection in the upload handler

Example:

pub struct JsonIngestor { ... }

impl FileIngestor for JsonIngestor {
    fn ingest(
        &self,
        path: &Path,
        table_name: &str,
        subject: &str,
    ) -> Result<TableSchema, IngestError> {
        // Implementation
    }
}
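The file type detection step might look like the following sketch, which picks an ingestor from the file extension. TableSchema and IngestError are simplified stand-ins for the project's types, the ingest bodies are placeholders, and ingestor_for and ingest_file are hypothetical helpers rather than functions that exist in the codebase.

```rust
use std::path::Path;

/// Simplified stand-ins for the project's schema and error types.
#[derive(Debug, PartialEq)]
pub struct TableSchema {
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub enum IngestError {
    UnsupportedFormat(String),
}

pub trait FileIngestor: Send + Sync {
    fn ingest(
        &self,
        path: &Path,
        table_name: &str,
        subject: &str,
    ) -> Result<TableSchema, IngestError>;
}

pub struct CsvIngestor;
pub struct JsonIngestor;

impl FileIngestor for CsvIngestor {
    fn ingest(&self, _path: &Path, table_name: &str, subject: &str) -> Result<TableSchema, IngestError> {
        // Placeholder: a real implementation would load the CSV file here.
        Ok(TableSchema { name: format!("{subject}.{table_name}") })
    }
}

impl FileIngestor for JsonIngestor {
    fn ingest(&self, _path: &Path, table_name: &str, subject: &str) -> Result<TableSchema, IngestError> {
        // Placeholder: a real implementation would load the JSON file here.
        Ok(TableSchema { name: format!("{subject}.{table_name}") })
    }
}

/// Hypothetical dispatch helper: choose an ingestor from the file extension,
/// ignoring case.
pub fn ingestor_for(path: &Path) -> Result<Box<dyn FileIngestor>, IngestError> {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(str::to_ascii_lowercase)
        .unwrap_or_default();
    match ext.as_str() {
        "csv" => Ok(Box::new(CsvIngestor)),
        "json" => Ok(Box::new(JsonIngestor)),
        other => Err(IngestError::UnsupportedFormat(other.to_string())),
    }
}

/// Detect the format, then delegate to the matching ingestor.
pub fn ingest_file(path: &Path, table_name: &str, subject: &str) -> Result<TableSchema, IngestError> {
    ingestor_for(path)?.ingest(path, table_name, subject)
}
```

Because callers only see Box<dyn FileIngestor>, supporting another format in this sketch means adding one struct, one impl, and one match arm.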

Adding a New LLM Provider

1. Create a new struct that implements the SqlGenerator trait
2. Add the provider to LlmManager
3. Update the configuration schema

Example:

pub struct CustomLlmProvider { ... }

#[async_trait]
impl SqlGenerator for CustomLlmProvider {
    async fn generate_sql(&self, question: &str, schema: &str) -> Result<String, LlmError> {
        // Implementation
    }
}
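One way to wire a new provider to the backend setting in config.toml is to parse the string into an enum and match on it. The sketch below is illustrative only: the trait is a synchronous stand-in for SqlGenerator with String as the error type, the "remote" and "custom" backend values are assumptions (only "ollama" appears in the sample configuration), and LlmBackend and build_provider are hypothetical names.

```rust
use std::str::FromStr;

/// Hypothetical enum for the `backend` setting in the [llm] section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlmBackend {
    Ollama,
    Remote,
    Custom,
}

impl FromStr for LlmBackend {
    type Err = String;

    /// Parse the configured backend name, ignoring case and surrounding spaces.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value.trim().to_ascii_lowercase().as_str() {
            "ollama" => Ok(Self::Ollama),
            "remote" => Ok(Self::Remote),
            "custom" => Ok(Self::Custom),
            other => Err(format!("unknown LLM backend: {other}")),
        }
    }
}

/// Synchronous stand-in for the async SqlGenerator trait, with String
/// standing in for LlmError.
pub trait SqlGenerator: Send + Sync {
    fn generate_sql(&self, question: &str, schema: &str) -> Result<String, String>;
}

pub struct CustomLlmProvider;

impl SqlGenerator for CustomLlmProvider {
    fn generate_sql(&self, question: &str, schema: &str) -> Result<String, String> {
        // Placeholder: a real provider would call its model here.
        Ok(format!("-- question: {question}\n-- schema: {schema}\nSELECT 1;"))
    }
}

/// Hypothetical factory: map the parsed backend to a provider instance.
pub fn build_provider(backend: LlmBackend) -> Option<Box<dyn SqlGenerator>> {
    match backend {
        LlmBackend::Custom => Some(Box::new(CustomLlmProvider)),
        // The existing providers are outside the scope of this sketch.
        LlmBackend::Ollama | LlmBackend::Remote => None,
    }
}
```

Parsing the backend name once at startup means a typo in config.toml fails early with a clear message instead of surfacing on the first query.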

Extending the API

1. Add new route definitions in src/web/routes.rs
2. Create handler functions in src/web/handlers/api.rs
3. Update the API documentation

Example:

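// In routes.rs. Inside the Router nested under "/api", paths are relative
// to that prefix, so this route is served at /api/custom.
.route("/custom", post(handlers::api::custom_handler))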

// In handlers/api.rs
pub async fn custom_handler(
    State(app_state): State<Arc<AppState>>,
    Json(payload): Json<CustomRequest>,
) -> Result<Json<CustomResponse>, (StatusCode, String)> {
    // Implementation
}

Contributing Guidelines

Pull Request Process

1. Fork the repository and create a feature branch
2. Make your changes with appropriate tests
3. Ensure all tests pass with cargo test
4. Format your code with cargo fmt
5. Check for linting issues with cargo clippy
6. Submit a pull request with a clear description of changes

Coding Standards

Follow the Rust API Guidelines
Use async/await consistently for asynchronous code
Add doc comments to public interfaces
Include error handling with custom error types
Format code with rustfmt
Use clippy to catch common mistakes

Documentation

Update documentation for API changes
Add doc comments for public functions
Include examples where appropriate
Update the changelog for significant changes

Testing

Add unit tests for new functionality
Include integration tests for API endpoints
Test with different configurations
Verify performance for data-intensive operations
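As an example of the unit-test layout these guidelines ask for, the sketch below pairs a small helper with a tests module in the same file. The helper table_name_from_path is hypothetical and is not part of the NL-Cube codebase; it only gives the tests something concrete to check.

```rust
use std::path::Path;

/// Hypothetical helper: derive a SQL-friendly table name from a file name.
/// Non-alphanumeric characters become underscores and letters are lowercased.
pub fn table_name_from_path(path: &Path) -> Option<String> {
    let stem = path.file_stem()?.to_str()?;
    let mut name: String = stem
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c.to_ascii_lowercase() } else { '_' })
        .collect();
    if name.is_empty() {
        return None;
    }
    // Prefix names that start with a digit so they are valid unquoted identifiers.
    if name.starts_with(|c: char| c.is_ascii_digit()) {
        name.insert(0, '_');
    }
    Some(name)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lowercases_and_replaces_separators() {
        assert_eq!(
            table_name_from_path(Path::new("data/Sales Report-2024.csv")),
            Some("sales_report_2024".to_string())
        );
    }

    #[test]
    fn prefixes_leading_digit() {
        assert_eq!(
            table_name_from_path(Path::new("2024.parquet")),
            Some("_2024".to_string())
        );
    }
}
```

Tests written this way run with cargo test and sit next to the code they cover. Integration tests for API endpoints would more typically go in a separate tests/ directory.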

Troubleshooting Development Issues

Common Issues

Database connection errors:
- Check the database file path in the configuration
- Verify that DuckDB is installed and is the correct version
- Increase the connection pool size if needed

LLM integration issues:
- Verify that Ollama is running (curl http://localhost:11434/api/version)
- Check that the model is available (ollama list)
- Review the prompt template for errors

Build errors:
- Update Rust to the latest version (rustup update)
- Remove stale build artifacts (cargo clean)
- Check for incompatible dependency versions
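For the LLM integration checks, a quick reachability probe can be done from Rust as well as with curl. The sketch below only tests whether a TCP connection to the given address succeeds. It does not confirm that the Ollama API responds or that a model is loaded, and is_reachable is a hypothetical helper name.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` ("host:port") succeeds
/// within `timeout`. Address strings that cannot be resolved count as
/// unreachable.
pub fn is_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.to_socket_addrs() {
        // Try every resolved address (for example IPv4 and IPv6 for "localhost").
        Ok(mut addrs) => addrs.any(|a| TcpStream::connect_timeout(&a, timeout).is_ok()),
        Err(_) => false,
    }
}
```

For example, is_reachable("localhost:11434", Duration::from_secs(1)) returns true only when something is listening on the default Ollama port used in the sample configuration.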

Performance Profiling

For performance issues, use these tools:

Tokio Console: Monitor async tasks
Flamegraph: Visualize CPU usage
DHAT: Analyze heap allocations

Example flamegraph generation:

cargo install flamegraph
CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --bin nl-cube
