We introduce a framework for privacy-preserving, on-device AI agent workloads. By decoupling agentic tasks into function selection and argument generation, both handled by local LLM orchestration, our system delivers accuracy approaching that of cloud-based models while fully protecting user data from third-party exposure and enabling cost-efficient execution on consumer hardware.
Classification: Which tool to invoke?
Generation: Tool-specific arguments
Privacy-preserving orchestration
AgentFlux: Decoupled Post-Training Pipeline & Inference Framework
See how AgentFlux orchestrates privacy-preserving tool calling across multiple applications, performing complex tasks while keeping all sensitive data on your device.
Demo: AgentFlux performing file-operation tasks entirely on-device, compared with the Llama-3.1-8B model
AgentFlux bridges the performance gap between frontier orchestration models and locally deployable systems.
Agents that run fully offline, protecting sensitive user data from third-party exposure
No reasoning latency overhead, fast execution on consumer hardware
Training for tool ecosystems that evolve over time
By decoupling fine-tuning and introducing dynamic adapter loading, AgentFlux democratizes agentic AI, bringing practical autonomy to the edge.
AgentFlux enables local consolidation, analysis, and reporting across blockchain and traditional finance, keeping all private data on-device. Only anonymized outputs leave the user's machine.
Local models execute privacy-preserving tasks over sensitive data while collaborating with cloud models on large-scale public data, such as web-search summarization.
AgentFlux addresses data leakage in coding assistants by executing sensitive portions of the workload locally and cost-efficiently while maintaining access to entire codebases.
AI systems are rapidly expanding from chatbots and media generation to robotics and financial applications. Leading AI platforms run in the cloud, sending all user queries, often including sensitive context like code, preferences, and past interactions, to third-party providers.
User data, including medical and financial records, is routinely exposed to cloud providers.
Cloud APIs charge per token and throttle requests, with true steady-state costs still unknown.
AgentFlux introduces a new framework for edge computing that partitions agentic workloads into two distinct tasks: selecting which functions to call and generating their arguments. This partitioning enables a hierarchical architecture that achieves end-to-end accuracy comparable to state-of-the-art cloud models while preserving privacy and running efficiently on consumer-grade GPUs.
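A minimal sketch of this partitioning, assuming a hypothetical `llm` handle with constrained-choice and schema-guided decoding helpers (`choose` and `fill_schema` are illustrative names, not the actual AgentFlux API):

```python
# One orchestration step, decoupled into (1) function selection and
# (2) argument generation. `llm.choose` / `llm.fill_schema` are
# hypothetical helpers for constrained and schema-guided decoding.

def orchestrate_step(llm, tools, context):
    """`tools` maps each tool name to the JSON schema of its arguments."""
    # Stage 1 -- classification: decoding is restricted to valid tool names.
    tool_name = llm.choose(
        prompt=f"{context}\n\nWhich tool should be invoked next?",
        options=list(tools),
    )
    # Stage 2 -- generation: produce arguments for the chosen tool only,
    # constrained by that tool's argument schema.
    args = llm.fill_schema(
        prompt=f"{context}\n\nGenerate arguments for `{tool_name}`.",
        schema=tools[tool_name],
    )
    return tool_name, args
```

Because each stage is a narrower problem than free-form tool calling, each can be handled by a smaller, specialized local model.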
AgentFlux architecture showing the decoupled post-training pipeline and inference framework with specialized LoRA adapters
Agentic systems autonomously solve complex tasks through iterative cycles: decomposing goals into discrete steps, executing each by invoking external tools, and dynamically adjusting based on tool outputs. Success hinges on LLM orchestration: the system's ability to accurately select the right tool and generate correct arguments at each decision point.
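In outline, that cycle is just a loop. The sketch below stubs the model and tools behind illustrative callables (`select_action`, `is_finished`) and a toy tool table; none of these are the actual AgentFlux interfaces:

```python
# Generic agentic loop: pick an action, invoke a tool, observe, adapt.
# `select_action` and `is_finished` stand in for LLM orchestration calls.

TOOLS = {
    # Toy example tool; real deployments expose Filesystem, Notion, etc.
    "read_file": lambda path: open(path, encoding="utf-8").read(),
}

def run_agent(goal, select_action, is_finished, max_steps=10):
    history = [("goal", goal)]
    for _ in range(max_steps):
        # Orchestration: choose the next tool and its arguments.
        tool_name, args = select_action(history)
        # Execute the tool and feed the observation back into the loop.
        observation = TOOLS[tool_name](**args)
        history.append((tool_name, args, observation))
        if is_finished(history):  # goal reached or nothing left to do
            break
    return history
```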
AgentFlux fundamentally reimagines this orchestration. Rather than relying on a monolithic LLM orchestrator, it employs multiple specialized LoRA adapters trained through a decoupled post-training pipeline and coordinated by a novel inference framework.
Functions as a classifier, identifying the optimal tool for each workflow step during inference.
Produces precise, context-appropriate arguments for the selected tool at each step.
Dynamically loads the tool selector adapter to determine which tool to invoke.
Dynamically loads the corresponding argument generator adapter to construct the tool's input parameters.
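With Hugging Face PEFT, the two dynamic-loading steps above can be sketched as follows; the base model matches the demo, while the adapter paths are hypothetical placeholders:

```python
# Sketch of dynamic LoRA adapter swapping with Hugging Face PEFT.
# Adapter paths are hypothetical; the PEFT calls themselves are real API.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Register both specialized adapters on the same base model.
model = PeftModel.from_pretrained(base, "adapters/tool-selector",
                                  adapter_name="tool_selector")
model.load_adapter("adapters/argument-generator",
                   adapter_name="argument_generator")

# Step 1: activate the classifier adapter to pick the tool.
model.set_adapter("tool_selector")
# ... run constrained decoding over tool names ...

# Step 2: hot-swap to the generator adapter to build the arguments.
model.set_adapter("argument_generator")
# ... run schema-guided generation for the selected tool ...
```

Because both adapters share one base model, swapping costs only the adapter weights, which keeps the pipeline practical on consumer hardware.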
Base model routes to relevant toolset (Filesystem, Notion, Monday.com)
Load Tool Selector LoRA adapter → Classify which specific tool to invoke
Load Argument Generator LoRA adapter → Generate precise, structured arguments
Execute in containerized sandbox → Return observation → Continue or summarize
Complete inference pipeline showing hierarchical orchestration with dynamic LoRA adapter loading
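Putting the four stages together, one pipeline iteration might look like the following sketch; `model` is the PEFT-wrapped base model from the previous snippet, `route_toolset`, `classify_tool`, and `generate_args` are illustrative callables, and the "agentflux-sandbox" Docker image is a hypothetical placeholder:

```python
# One iteration of the four-stage hierarchical pipeline above.
import json
import subprocess

def pipeline_step(model, context, toolsets,
                  route_toolset, classify_tool, generate_args):
    # 1. Base model routes to the relevant toolset (Filesystem, Notion, ...).
    toolset = route_toolset(model, context, list(toolsets))

    # 2. Tool Selector adapter classifies which specific tool to invoke.
    model.set_adapter("tool_selector")
    tool_name = classify_tool(model, context, toolsets[toolset])

    # 3. Argument Generator adapter produces structured arguments.
    model.set_adapter("argument_generator")
    args = generate_args(model, context, tool_name)

    # 4. Execute in a containerized sandbox; the observation is fed back
    #    into the loop, which then continues or summarizes.
    result = subprocess.run(
        ["docker", "run", "--rm", "agentflux-sandbox",
         tool_name, json.dumps(args)],
        capture_output=True, text=True,
    )
    return result.stdout
```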