ADR-0022: LongRunning Node Implementation¶
Status¶
Accepted
Date¶
2025-02-27
Context¶
In ADR-0016, we outlined the need for a node type that can handle long-running processes with checkpoints. These long-running nodes enable workflows to:
- Process work incrementally over multiple sessions
- Save state between executions
- Resume from the last checkpoint
- Handle complex multi-step processes that may be paused and resumed
Long-running nodes are particularly useful for:
- Processing that spans multiple sessions
- Workflows that need to wait for external events or human interaction
- Resource-intensive tasks that should be broken into manageable chunks
- Processes that may be interrupted but need to resume from a checkpoint
Decision¶
We will implement a new crate floxide-longrunning
that provides the LongRunningNode
trait and related implementations as described in ADR-0016. The implementation will follow these design decisions:
1. Core LongRunningOutcome
Enum¶
We will implement a LongRunningOutcome
enum that represents the two possible outcomes of a long-running process:
pub enum LongRunningOutcome<T, S> {
/// Processing is complete with result
Complete(T),
/// Processing needs to be suspended with saved state
Suspend(S),
}
2. LongRunningNode
Trait¶
We will implement the LongRunningNode
trait as outlined in ADR-0016:
#[async_trait]
pub trait LongRunningNode<Context, Action>: Send + Sync
where
Context: Send + Sync + 'static,
Action: ActionType + Send + Sync + 'static,
Self::State: Serialize + Deserialize<'static> + Send + Sync + 'static,
Self::Output: Send + 'static,
{
/// Type representing the node's processing state
type State;
/// Type representing the final output
type Output;
/// Process the next step, potentially suspending execution
async fn process(
&self,
state: Option<Self::State>,
ctx: &mut Context,
) -> Result<LongRunningOutcome<Self::Output, Self::State>, FloxideError>;
/// Get the node's unique identifier
fn id(&self) -> NodeId;
}
3. LongRunningActionExt
Extension Trait¶
We will provide a LongRunningActionExt
trait to extend ActionType
with long-running specific actions:
pub trait LongRunningActionExt: ActionType {
/// Create a suspend action for long-running nodes
fn suspend() -> Self;
/// Create a resume action for long-running nodes
fn resume() -> Self;
/// Create a complete action for long-running nodes
fn complete() -> Self;
/// Check if this is a suspend action
fn is_suspend(&self) -> bool;
/// Check if this is a resume action
fn is_resume(&self) -> bool;
/// Check if this is a complete action
fn is_complete(&self) -> bool;
}
4. Concrete Implementations¶
We will provide these concrete implementations:
SimpleLongRunningNode
: A long-running node that uses a closure for processingLongRunningNodeAdapter
: An adapter to use a long-running node as a standard nodeStateStore
: A trait for storing and retrieving node statesInMemoryStateStore
: A simple in-memory implementation ofStateStore
for testing
5. Workflow Integration¶
The LongRunningNodeAdapter
will implement the Node
trait, allowing long-running nodes to be used in standard workflows. This adapter will handle state management and action conversion.
Consequences¶
Advantages¶
- State Persistence: Enables workflows to save and resume state across multiple executions.
- Incremental Processing: Allows breaking down large tasks into manageable chunks.
- Checkpoint Recovery: Provides a mechanism for resuming from the last successful checkpoint.
- Workflow Suspension: Supports pausing workflows for external events or human interaction.
- Resource Efficiency: Prevents long-running tasks from blocking workflow execution.
Disadvantages¶
- Complexity: Adds another node type, increasing the conceptual overhead.
- State Management: Requires proper state serialization and storage infrastructure.
- Debugging Challenges: Stateful workflows can be more difficult to debug and reason about.
- Implementation Overhead: Users need to implement proper state management.
Implementation Notes¶
- States must be serializable and deserializable to be properly stored between executions.
- The actual storage mechanism (database, file system, etc.) is left to the implementation.
- The
LongRunningNodeAdapter
allows using long-running nodes seamlessly in standard workflows. - Integration tests demonstrate proper state management and resumption.
Alternatives Considered¶
1. Using Event-Driven Nodes for Long-Running Processes¶
We considered implementing long-running process support as a special case of event-driven nodes. However, the state management requirements are sufficiently different to warrant a separate abstraction.
2. Implicit State Management in the Workflow Engine¶
We considered building state management directly into the workflow engine rather than in the nodes. While this would simplify the node implementation, it would make the workflow engine more complex and limit flexibility.
3. Framework-Provided Storage Backend¶
We considered implementing a standard storage backend for states. However, we decided that providing a trait-based interface allows users to implement storage that best fits their needs.