Roadmap

NL-Cube Roadmap

This document outlines the planned future development of NL-Cube. The roadmap is organized into short-term, medium-term, and long-term goals to provide a clear vision of where the project is headed.

Data Import Enhancements

Additional Import Formats

Expand NL-Cube’s data ingestion capabilities to support all formats that DuckDB can handle:

JSON Files: Support for structured JSON data files
Excel Files: Direct import of XLSX/XLS spreadsheets
ORC Files: Support for Optimized Row Columnar format
Avro Files: Support for Apache Avro format
XML Files: Structured XML document ingestion
HTTP/REST Sources: Direct import from REST APIs
Database Connections: Import from other databases

Implementation will leverage DuckDB’s built-in capabilities:

-- Example of how DuckDB handles different formats
CREATE TABLE json_table AS SELECT * FROM read_json('data.json', auto_detect=true);
CREATE TABLE excel_table AS SELECT * FROM read_excel('data.xlsx');
CREATE TABLE orc_table AS SELECT * FROM read_orc('data.orc');

Enhanced Schema Inference

Improved type detection for edge cases
Better handling of date/time formats
Detection of primary and foreign keys
Smart detection of hierarchical data
Preservation of original metadata

Expanded LLM Support

Configurable LLM Integration via Rig

Integration with Rig crate for LLM orchestration
User-selectable models through configuration
Dynamically switchable LLM providers
Model version tracking and compatibility checks
Performance benchmarking across models

Configuration example:

# Different models can be configured
[[llm.models]]
name = "sqlcoder-34b"
provider = "databricks"
model_id = "databricks/dbrx-instruct"

[[llm.models]]
name = "arctic-sqler"
provider = "anthropic"
model_id = "claude-3-opus-20240229"

Prompt Template Customization

User-editable prompt templates
Domain-specific prompt optimization
Few-shot learning examples for specialized domains
Prompt versioning and A/B testing

Report Management

Report Generation and Saving

Template-Based Reports: Customizable report templates
Scheduled Reports: Automated report generation
Export Formats: PDF, Excel, HTML, and Markdown
Collaboration: Sharing and commenting on reports
Versioning: Track changes to reports over time

Visualization Library

Expanded chart types and visualizations
Custom visualization themes
Interactive dashboards
Embeddable report widgets
Annotation and markup tools

Advanced Data Modeling

Table Relationship Definition

GUI for defining relationships between tables
Automatic foreign key detection
Entity-relationship diagram generation
Join path recommendations for queries
Referential integrity enforcement

Example relationship definition:

{
  "relationships": [
    {
      "from": {
        "table": "orders",
        "column": "customer_id"
      },
      "to": {
        "table": "customers",
        "column": "id"
      },
      "type": "many-to-one"
    }
  ]
}

Schema Management

Allow addition of data to existing tables
Schema evolution and migration
Column-level metadata and descriptions
Data quality constraints
Schema versioning

Real-Time Data Processing

Hot Watch Folder

Automated ingestion of files from watched directories
Configurable processing rules based on file patterns
Error handling and notification for problematic files
Throttling and batching for high-volume scenarios
Processing history and audit logs

Configuration example:

[[watch_folders]]
path = "data/incoming/sales"
subject = "sales"
pattern = "*.csv"
table_prefix = "sales_"
poll_interval_seconds = 30

Streaming Data Support

Integration with Apache Kafka and other streaming platforms
Real-time data processing pipelines
Windowed aggregations on streaming data
Configurable stream processors
Stream-to-table materialization

Example Kafka configuration:

[[streaming.sources]]
type = "kafka"
bootstrap_servers = "kafka1:9092,kafka2:9092"
topic = "sales_data"
group_id = "nl-cube-consumer"
subject = "sales"
table = "real_time_sales"

Aggregation Engine

User-defined aggregate definitions
Incremental aggregation updates
Time-based and event-based windows
Direct Perspective integration for live updates
Materialized view management

Security and Multi-User Support

OAuth Integration

Support for OAuth 2.0 authentication flows
Integration with identity providers (Google, GitHub, Microsoft)
JWT token handling
Role-based authorization
API key management for programmatic access

Multi-User Mode

User account management
Personalized settings and preferences
Resource quotas and usage tracking
Activity logging and audit trails
Access control for subjects and reports

Local LLM Integration

Bundled LLM Support

Option to bundle lightweight local LLMs
Optimized models for SQL generation
No internet dependency for core functionality
Fine-tuning tools for domain-specific datasets
Model switchover between local and remote as needed

Pre-Loaded Datasets

Domain-specific sample datasets
Example queries and reports
Guided tutorials using sample data
Benchmarking datasets
Easy data reset and refresh

User Experience Improvements

NL Query Auto-Complete

Intelligent suggestions as you type
Auto-completion for column names and values
Query history integration
Context-aware suggestions based on schema
Semantic understanding of partial queries

Progressive Web App

Offline capability
Mobile-friendly responsive design
Native app-like experience
Push notifications
Background synchronization

Keyboard Shortcuts and Power User Features

Comprehensive keyboard navigation
Customizable keyboard shortcuts
Command palette for quick actions
Batch operations
Query scripting capabilities

Enterprise Features

Advanced Administration

Centralized deployment management
Usage analytics and monitoring
Backup and disaster recovery
Resource governance
Health checks and diagnostics

Integration Ecosystem

API for third-party integration
Plugin architecture
Webhooks for event-driven workflows
SSO integration
Enterprise data catalog integration

Development Milestones

Short-Term (3-6 months)

Additional file formats: JSON, Excel
Report saving and management
Basic relationship definition
Data append to existing tables
Auto-complete for column names

Medium-Term (6-12 months)

Hot watch folder for auto-ingestion
Streaming data support (Kafka)
User-defined aggregates
OAuth security integration
Local LLM bundling options

Long-Term (12+ months)

Full multi-user mode
Enterprise administration
Advanced streaming analytics
Comprehensive plugin system
AI-assisted data modeling

Feedback and Prioritization

The NL-Cube roadmap is guided by user feedback and community needs. We welcome contributions and suggestions through:

GitHub issues and discussions
Community forums
User feedback surveys
Usage analytics

Priority will be given to features that:

Improve core natural language query capabilities
Enhance user experience for non-technical users
Expand data connectivity options
Simplify deployment and administration

To contribute to the roadmap or provide feedback, please open an issue on the GitHub repository.