Ensemble Reasoning
Ensemble Reasoning is a new methodology that joins reasoning LLMs with an ensemble of modules that further the reasoning process. These modules include other supporting LLMs, machine learning models (such as GNNs), knowledge graph query services, and others. Ensemble Reasoning dramatically improves the speed, cost, and capability of AI Agents by moving much of "tool use" into the reasoning process of the LLM itself.
An article discussing Ensemble Reasoning at a high level: https://blog.vital.ai/2025/01/13/agents-and-ensemble-reasoning/
The implementation of the core functionality is found in the GitHub repository: https://github.com/vital-ai/vital-llm-reasoner
A deployable server incorporating the core functionality is found in the GitHub repository: https://github.com/vital-ai/vital-llm-reasoner-server
The server can be deployed as a Docker container in an ARM Linux environment that includes NVIDIA GPU(s).
The current implementation uses the QwQ-32B-Preview model: https://huggingface.co/Qwen/QwQ-32B-Preview
Other reasoning models will be supported going forward.
Inference for the reasoning model runs on vLLM or Llama.cpp, and this same server infrastructure is used to serve the Ensemble.
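As a rough illustration, the sketch below streams tokens from the primary model using vLLM's async engine. The model name comes from above; the sampling values and request id are placeholders, exact vLLM signatures vary across versions, and this is not the actual server implementation.

```python
# Minimal sketch: stream tokens from the primary reasoning model with vLLM.
# Sampling values and the request id are illustrative placeholders.
import asyncio
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

async def stream_reasoning_tokens(prompt: str):
    # In a real server the engine would be created once, not per request.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="Qwen/QwQ-32B-Preview"))
    params = SamplingParams(max_tokens=4096, temperature=0.6)
    emitted = 0
    # vLLM yields cumulative RequestOutput objects; emit only the new text.
    async for output in engine.generate(prompt, params, request_id="r1"):
        text = output.outputs[0].text
        yield text[emitted:]
        emitted = len(text)

async def main():
    async for delta in stream_reasoning_tokens("Prove that 17 is prime."):
        print(delta, end="", flush=True)

asyncio.run(main())
```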
Reasoning tokens are consumed in a streaming context from the primary reasoning model and piped to the ensemble. Tokens produced by the ensemble are streamed back into the primary model and appended to the current reasoning trace to further the reasoning process.
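The loop below sketches that pipeline in simplified form. The `<ensemble>...</ensemble>` delimiters and the `generate_stream`/`run_ensemble` callables are hypothetical stand-ins, not the repository's actual stream protocol.

```python
# Sketch of the streaming loop: watch the primary model's token stream for an
# ensemble request, resolve it, and splice the answer into the trace.
import re

OPEN, CLOSE = "<ensemble>", "</ensemble>"   # hypothetical delimiters

def reason_with_ensemble(prompt, generate_stream, run_ensemble, max_rounds=8):
    trace = prompt
    for _ in range(max_rounds):
        buffer = ""
        for token in generate_stream(trace):   # primary model token stream
            buffer += token
            if CLOSE in buffer:                # model has emitted a request
                break
        trace += buffer
        match = re.search(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE),
                          buffer, re.DOTALL)
        if match is None:                      # no request: reasoning is done
            return trace
        # Pipe the request to the ensemble and append its answer to the
        # reasoning trace so the primary model continues with it in context.
        trace += run_ensemble(match.group(1))
    return trace
```

Because the ensemble output is appended to the trace, the next round of generation continues with that knowledge already in the primary model's context.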
There may be highly specific dependencies on CUDA functions, vLLM/Llama.cpp versions, PyTorch versions, or the exact reasoning model version in order to support manipulating the token inference stream of the primary model, especially in a performant way.
Ensemble members (aka "ensemble tools") are being implemented in the core GitHub repository. These implementations are generally wrappers/connectors to tools such as KGraphService for knowledge graph queries.
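A minimal sketch of what such a wrapper might look like follows. The `EnsembleTool` base class and the `query` method on the KGraphService client are assumptions for illustration, not the repository's actual interfaces.

```python
# Sketch of an ensemble member interface and a knowledge graph connector.
from abc import ABC, abstractmethod

class EnsembleTool(ABC):
    """One ensemble member: takes a request string extracted from the
    reasoning trace and returns text to stream back into the primary model."""
    name: str

    @abstractmethod
    def handle(self, request: str) -> str: ...

class KGraphQueryTool(EnsembleTool):
    name = "kgraph"

    def __init__(self, client):
        self.client = client   # e.g. a KGraphService connection (assumed API)

    def handle(self, request: str) -> str:
        results = self.client.query(request)   # hypothetical client method
        # Serialize results compactly so they cost few tokens in the trace.
        return "\n".join(str(r) for r in results)
```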
Currently in development in the core GitHub repository is a framework to manage requests to the ensemble tools, optimizing the flow of information back into the primary model without stalling inference. This may involve reordering ensemble requests and optimizing the reasoner prompts to arrive at a just-in-time (JIT) framework that delivers knowledge to the reasoner exactly when the reasoner requires it. Wherever possible, ensemble members should run entirely in the same container as the primary model, utilize a cache of off-container data (including warming/pre-populating it), and/or issue highly optimized queries when requests go outside the container over the network.
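The sketch below illustrates the JIT idea under those constraints: requests are dispatched asynchronously as soon as they are detected, answers are cached in-container, and the token stream blocks only when a needed answer has not yet resolved. All names here are illustrative, not the framework under development.

```python
# Sketch of a JIT broker: prefetch ensemble answers ahead of need and cache
# them in-container so the primary model's token stream rarely stalls.
import asyncio

class JITEnsembleBroker:
    def __init__(self, run_ensemble):
        self.run_ensemble = run_ensemble       # async callable: request -> str
        self.cache: dict[str, asyncio.Task] = {}

    def prefetch(self, request: str) -> None:
        """Start resolving a request without blocking inference.
        Must be called from within a running event loop."""
        if request not in self.cache:
            self.cache[request] = asyncio.create_task(
                self.run_ensemble(request))

    async def result(self, request: str) -> str:
        """Block only if the answer has not already resolved."""
        self.prefetch(request)
        return await self.cache[request]

    async def warm(self, likely_requests: list[str]) -> None:
        """Pre-populate the cache before inference starts."""
        for req in likely_requests:
            self.prefetch(req)
        await asyncio.gather(*(self.cache[r] for r in likely_requests))
```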