ADR-0022: LongRunning Node Implementation

Status

Accepted

Date

2025-02-27

Context

In ADR-0016, we outlined the need for a node type that can handle long-running processes with checkpoints. These long-running nodes enable workflows to:

Process work incrementally over multiple sessions
Save state between executions
Resume from the last checkpoint
Handle complex multi-step processes that may be paused and resumed

Long-running nodes are particularly useful for:

Processing that spans multiple sessions
Workflows that need to wait for external events or human interaction
Resource-intensive tasks that should be broken into manageable chunks
Processes that may be interrupted but need to resume from a checkpoint

Decision

We will implement a new crate, floxide-longrunning, that provides the LongRunningNode trait and related implementations as described in ADR-0016. The implementation will follow these design decisions:

1. Core `LongRunningOutcome` Enum

We will implement a LongRunningOutcome enum that represents the two possible outcomes of a single processing step, completion with a result or suspension with saved state:

pub enum LongRunningOutcome<T, S> {
    /// Processing is complete with result
    Complete(T),
    /// Processing needs to be suspended with saved state
    Suspend(S),
}
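As an illustration of how a step function might produce these two outcomes, the following sketch sums a slice in fixed-size chunks and suspends with a cursor after each chunk. The names `SumState` and `sum_step` are hypothetical and not part of the crate; the enum is repeated (with extra derives) only so the sketch compiles on its own.

```rust
/// Repeats the enum from this ADR so the sketch is self-contained.
#[derive(Debug, PartialEq)]
pub enum LongRunningOutcome<T, S> {
    /// Processing is complete with result
    Complete(T),
    /// Processing needs to be suspended with saved state
    Suspend(S),
}

/// Hypothetical checkpoint: where to continue and what has been summed so far.
#[derive(Debug, Clone, PartialEq)]
pub struct SumState {
    pub next_index: usize,
    pub partial_sum: u64,
}

/// Hypothetical step function: processes at most `chunk` items per call.
/// Assumes `chunk > 0` and that `state` came from an earlier call on the same slice.
pub fn sum_step(
    items: &[u64],
    chunk: usize,
    state: Option<SumState>,
) -> LongRunningOutcome<u64, SumState> {
    // No saved state means this is the first execution.
    let SumState { next_index, partial_sum } =
        state.unwrap_or(SumState { next_index: 0, partial_sum: 0 });

    let end = (next_index + chunk).min(items.len());
    let partial_sum = partial_sum + items[next_index..end].iter().sum::<u64>();

    if end == items.len() {
        LongRunningOutcome::Complete(partial_sum)
    } else {
        LongRunningOutcome::Suspend(SumState { next_index: end, partial_sum })
    }
}
```

A caller would store the `SumState` carried by `Suspend` and pass it back on the next execution until `Complete` is returned.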

2. `LongRunningNode` Trait

We will implement the LongRunningNode trait as outlined in ADR-0016:

#[async_trait]
pub trait LongRunningNode<Context, Action>: Send + Sync
where
    Context: Send + Sync + 'static,
    Action: ActionType + Send + Sync + 'static,
    Self::State: Serialize + Deserialize<'static> + Send + Sync + 'static,
    Self::Output: Send + 'static,
{
    /// Type representing the node's processing state
    type State;

    /// Type representing the final output
    type Output;

    /// Process the next step, potentially suspending execution
    async fn process(
        &self,
        state: Option<Self::State>,
        ctx: &mut Context,
    ) -> Result<LongRunningOutcome<Self::Output, Self::State>, FloxideError>;

    /// Get the node's unique identifier
    fn id(&self) -> NodeId;
}
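To illustrate the contract, the sketch below implements a node that waits for a human decision. It uses a simplified, synchronous stand-in for the trait so that it compiles without external crates: the `Action` parameter, `async_trait`, the serde bounds, `FloxideError`, and `NodeId` are all omitted or replaced, and `ReviewContext`, `WaitingState`, and `ApprovalNode` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum LongRunningOutcome<T, S> {
    Complete(T),
    Suspend(S),
}

/// Simplified stand-in for the trait in this ADR (assumption: synchronous,
/// `String` errors, `&'static str` ids, no `Action` type parameter).
pub trait LongRunningNode<Context> {
    type State;
    type Output;

    fn process(
        &self,
        state: Option<Self::State>,
        ctx: &mut Context,
    ) -> Result<LongRunningOutcome<Self::Output, Self::State>, String>;

    fn id(&self) -> &'static str;
}

/// Hypothetical context: the decision is written by something outside the workflow.
#[derive(Default)]
pub struct ReviewContext {
    pub decision: Option<bool>,
}

/// Checkpoint: how many times the node has run without finding a decision.
#[derive(Debug, Clone, PartialEq)]
pub struct WaitingState {
    pub polls: u32,
}

pub struct ApprovalNode {
    pub max_polls: u32,
}

impl LongRunningNode<ReviewContext> for ApprovalNode {
    type State = WaitingState;
    type Output = bool;

    fn process(
        &self,
        state: Option<WaitingState>,
        ctx: &mut ReviewContext,
    ) -> Result<LongRunningOutcome<bool, WaitingState>, String> {
        // A decision has arrived: finish with it.
        if let Some(decision) = ctx.decision {
            return Ok(LongRunningOutcome::Complete(decision));
        }
        // Otherwise record one more poll and suspend, up to the limit.
        let polls = state.map_or(0, |s| s.polls) + 1;
        if polls > self.max_polls {
            return Err(format!("no decision after {} polls", self.max_polls));
        }
        Ok(LongRunningOutcome::Suspend(WaitingState { polls }))
    }

    fn id(&self) -> &'static str {
        "approval"
    }
}
```

Passing `None` as the state marks a first execution; every later call receives the state from the previous `Suspend`.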

3. `LongRunningActionExt` Extension Trait

We will provide a LongRunningActionExt trait to extend ActionType with long-running specific actions:

pub trait LongRunningActionExt: ActionType {
    /// Create a suspend action for long-running nodes
    fn suspend() -> Self;

    /// Create a resume action for long-running nodes
    fn resume() -> Self;

    /// Create a complete action for long-running nodes
    fn complete() -> Self;

    /// Check if this is a suspend action
    fn is_suspend(&self) -> bool;

    /// Check if this is a resume action
    fn is_resume(&self) -> bool;

    /// Check if this is a complete action
    fn is_complete(&self) -> bool;
}
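One way an action type might satisfy this extension trait is sketched below. `ActionType` is replaced by an empty local stand-in because its methods are not shown in this ADR, and `JobAction` and `next_step` are hypothetical names used only for illustration.

```rust
/// Local stand-in for the real `ActionType` trait (assumption: its actual
/// methods are defined elsewhere and are not needed for this sketch).
pub trait ActionType {}

pub trait LongRunningActionExt: ActionType {
    fn suspend() -> Self;
    fn resume() -> Self;
    fn complete() -> Self;
    fn is_suspend(&self) -> bool;
    fn is_resume(&self) -> bool;
    fn is_complete(&self) -> bool;
}

/// Hypothetical action enum with one variant unrelated to long-running nodes.
#[derive(Debug, Clone, PartialEq)]
pub enum JobAction {
    Suspend,
    Resume,
    Complete,
    Retry,
}

impl ActionType for JobAction {}

impl LongRunningActionExt for JobAction {
    fn suspend() -> Self {
        JobAction::Suspend
    }
    fn resume() -> Self {
        JobAction::Resume
    }
    fn complete() -> Self {
        JobAction::Complete
    }
    fn is_suspend(&self) -> bool {
        matches!(self, JobAction::Suspend)
    }
    fn is_resume(&self) -> bool {
        matches!(self, JobAction::Resume)
    }
    fn is_complete(&self) -> bool {
        matches!(self, JobAction::Complete)
    }
}

/// Generic code can branch on the long-running actions without knowing the concrete type.
pub fn next_step<A: LongRunningActionExt>(action: &A) -> &'static str {
    if action.is_suspend() {
        "persist state and pause"
    } else if action.is_resume() {
        "reload state and continue"
    } else if action.is_complete() {
        "follow the completion route"
    } else {
        "not a long-running action"
    }
}
```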

4. Concrete Implementations

We will provide these concrete implementations:

SimpleLongRunningNode: A long-running node that uses a closure for processing
LongRunningNodeAdapter: An adapter to use a long-running node as a standard node
StateStore: A trait for storing and retrieving node states
InMemoryStateStore: A simple in-memory implementation of StateStore for testing
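The storage pieces in this list could take roughly the following shape. This is a sketch under assumptions: the method names `save`, `load`, and `remove`, the use of already serialized `String` states, and the `String` error type are illustrative and may differ from the crate's actual signatures.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

/// Hypothetical shape of the `StateStore` trait: serialized states keyed by node id.
pub trait StateStore {
    fn save(&self, node_id: &str, state: String) -> Result<(), String>;
    fn load(&self, node_id: &str) -> Result<Option<String>, String>;
    fn remove(&self, node_id: &str) -> Result<(), String>;
}

/// In-memory store for tests; contents are lost when the process exits.
#[derive(Default)]
pub struct InMemoryStateStore {
    states: Mutex<HashMap<String, String>>,
}

impl InMemoryStateStore {
    /// Lock the map, turning a poisoned lock into a plain error message.
    fn lock(&self) -> Result<MutexGuard<'_, HashMap<String, String>>, String> {
        self.states
            .lock()
            .map_err(|e| format!("state store lock poisoned: {e}"))
    }
}

impl StateStore for InMemoryStateStore {
    fn save(&self, node_id: &str, state: String) -> Result<(), String> {
        // Overwrites any earlier checkpoint for the same node.
        self.lock()?.insert(node_id.to_owned(), state);
        Ok(())
    }

    fn load(&self, node_id: &str) -> Result<Option<String>, String> {
        let guard = self.lock()?;
        let value = guard.get(node_id).cloned();
        Ok(value)
    }

    fn remove(&self, node_id: &str) -> Result<(), String> {
        // Removing a missing entry is treated as success.
        self.lock()?.remove(node_id);
        Ok(())
    }
}
```

The `Mutex` lets the store be shared behind `&self`, which matters if several nodes checkpoint through the same store.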

5. Workflow Integration

The LongRunningNodeAdapter will implement the Node trait, allowing long-running nodes to be used in standard workflows. The adapter will handle state management and convert each LongRunningOutcome into the corresponding workflow action.
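The adapter's role can be sketched as a load, process, persist cycle. The version below is a simplified assumption rather than the crate's implementation: it is synchronous, keeps checkpoints in a plain `HashMap` instead of a StateStore, drops the context argument, and uses a hypothetical two-variant `Action` enum; `run_once` and `CountTo` are illustrative names.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum LongRunningOutcome<T, S> {
    Complete(T),
    Suspend(S),
}

/// Hypothetical action type standing in for an `ActionType` with long-running extensions.
#[derive(Debug, PartialEq)]
pub enum Action {
    Suspend,
    Complete,
}

/// Simplified synchronous long-running node (assumption, see lead-in).
pub trait LongRunningNode {
    type State: Clone;
    type Output;
    fn process(
        &self,
        state: Option<Self::State>,
    ) -> Result<LongRunningOutcome<Self::Output, Self::State>, String>;
    fn id(&self) -> String;
}

/// Adapter sketch: wraps a node and keeps its checkpoint between invocations.
pub struct LongRunningNodeAdapter<N: LongRunningNode> {
    node: N,
    checkpoints: HashMap<String, N::State>,
    pub last_output: Option<N::Output>,
}

impl<N: LongRunningNode> LongRunningNodeAdapter<N> {
    pub fn new(node: N) -> Self {
        Self { node, checkpoints: HashMap::new(), last_output: None }
    }

    /// One invocation as a standard node: load state, process, then persist or clear.
    pub fn run_once(&mut self) -> Result<Action, String> {
        let id = self.node.id();
        let saved = self.checkpoints.get(&id).cloned();
        match self.node.process(saved)? {
            LongRunningOutcome::Suspend(state) => {
                self.checkpoints.insert(id, state);
                Ok(Action::Suspend)
            }
            LongRunningOutcome::Complete(output) => {
                // A finished node no longer needs its checkpoint.
                self.checkpoints.remove(&id);
                self.last_output = Some(output);
                Ok(Action::Complete)
            }
        }
    }
}

/// Example node: counts up by one per invocation until `target` is reached.
pub struct CountTo {
    pub target: u32,
}

impl LongRunningNode for CountTo {
    type State = u32;
    type Output = u32;

    fn process(&self, state: Option<u32>) -> Result<LongRunningOutcome<u32, u32>, String> {
        let count = state.unwrap_or(0) + 1;
        if count >= self.target {
            Ok(LongRunningOutcome::Complete(count))
        } else {
            Ok(LongRunningOutcome::Suspend(count))
        }
    }

    fn id(&self) -> String {
        "count-to".to_string()
    }
}
```

In this sketch a failed `process` call leaves the previous checkpoint untouched, so the next invocation retries from the last successful step.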

Consequences

Advantages

State Persistence: Enables workflows to save and resume state across multiple executions.
Incremental Processing: Allows breaking down large tasks into manageable chunks.
Checkpoint Recovery: Provides a mechanism for resuming from the last successful checkpoint.
Workflow Suspension: Supports pausing workflows for external events or human interaction.
Resource Efficiency: Prevents long-running tasks from blocking workflow execution.

Disadvantages

Complexity: Adds another node type, increasing the conceptual overhead.
State Management: Requires proper state serialization and storage infrastructure.
Debugging Challenges: Stateful workflows can be more difficult to debug and reason about.
Implementation Overhead: Users need to define serializable state types and provide a StateStore that fits their environment.

Implementation Notes

States must be serializable and deserializable to be properly stored between executions.
The actual storage mechanism (database, file system, etc.) is left to the chosen StateStore implementation.
The LongRunningNodeAdapter allows using long-running nodes seamlessly in standard workflows.
Integration tests demonstrate proper state management and resumption.

Alternatives Considered

1. Using Event-Driven Nodes for Long-Running Processes

We considered implementing long-running process support as a special case of event-driven nodes. However, the state management requirements are sufficiently different to warrant a separate abstraction.

2. Implicit State Management in the Workflow Engine

We considered building state management directly into the workflow engine rather than in the nodes. While this would simplify the node implementation, it would make the workflow engine more complex and limit flexibility.

3. Framework-Provided Storage Backend

We considered implementing a standard storage backend for states. However, we decided that providing a trait-based interface allows users to implement storage that best fits their needs.
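As an example of what a user-supplied backend might look like under this trait-based approach, the sketch below stores each node's serialized state in its own file. The trait signatures, the `FileStateStore` name, and the `<node_id>.state` file layout are assumptions for illustration, and node ids are assumed to be valid file names.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::PathBuf;

/// Hypothetical shape of the `StateStore` trait, using `std::io` errors.
pub trait StateStore {
    fn save(&self, node_id: &str, state: &str) -> io::Result<()>;
    fn load(&self, node_id: &str) -> io::Result<Option<String>>;
    fn remove(&self, node_id: &str) -> io::Result<()>;
}

/// File-backed store: one file per node inside `dir`.
pub struct FileStateStore {
    dir: PathBuf,
}

impl FileStateStore {
    /// Creates the directory if it does not exist yet.
    pub fn new(dir: impl Into<PathBuf>) -> io::Result<Self> {
        let dir = dir.into();
        fs::create_dir_all(&dir)?;
        Ok(Self { dir })
    }

    fn path_for(&self, node_id: &str) -> PathBuf {
        self.dir.join(format!("{node_id}.state"))
    }
}

impl StateStore for FileStateStore {
    fn save(&self, node_id: &str, state: &str) -> io::Result<()> {
        // Note: this overwrite is not atomic; a crash mid-write can corrupt the file.
        fs::write(self.path_for(node_id), state)
    }

    fn load(&self, node_id: &str) -> io::Result<Option<String>> {
        match fs::read_to_string(self.path_for(node_id)) {
            Ok(contents) => Ok(Some(contents)),
            // A missing file simply means there is no checkpoint yet.
            Err(e) if e.kind() == ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }

    fn remove(&self, node_id: &str) -> io::Result<()> {
        match fs::remove_file(self.path_for(node_id)) {
            Ok(()) => Ok(()),
            Err(e) if e.kind() == ErrorKind::NotFound => Ok(()),
            Err(e) => Err(e),
        }
    }
}
```

A more robust backend would likely write to a temporary file and rename it into place, so that a checkpoint is either fully written or absent.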