Testing and verifying concurrent systems is hard due to their non-deterministic nature — verifying behavior that changes with each execution is difficult. Race conditions thrive in the non-deterministic world of thread scheduling. Even more challenging is verifying timeliness constraints, i.e. ensuring that operations complete within specified deadlines or that service guarantees are maintained under load. Traditional testing approaches struggle with concurrency, and mocking strategies often fail to capture the subtle interactions between threads, time, and shared state that cause real production failures.
The io-sim Haskell library, developed
by Well-Typed in partnership with engineers from IOG and Quviq, offers a
compelling solution to this problem. The library provides a pure simulation environment
for IO computations, enabling deterministic execution of concurrent code with
accurate time simulation and detailed execution traces. Unlike other testing
approaches, with io-sim one is able to write highly concurrent, real-time
systems and verify their timeliness constraints in a deterministic manner, by
accurately simulating GHC’s runtime system (e.g. asynchronous exceptions,
timeouts & delays, etc.).
This blog post introduces and explores io-sim through a practical example:
debugging an elevator controller that violates its response time requirements.
There’s also this great blog post announcing
io-simand it goes a bit more into detail about its features!
The Problem
Consider a simple elevator located in a three-floor building (ground, first , second). It takes roughly 1 second for the elevator to go up and down between each floor. The service requirement is: no passenger should wait more than 4 seconds from pressing the call button until the elevator doors open at their floor. It should be possible to test and verify this requirement when writing our elevator controller.
This ensures a reasonable quality of service and prevents frustration. Given the short distance between floors, 4 seconds is sensible. In the worst case, the elevator must travel from ground to second floor and back again.
Here’s a first attempt at modelling the system. Let’s start with the core data structures:
data Direction = Up | Down | None
deriving (Eq, Show)
data Floor = Ground | First | Second
deriving (Eq, Ord, Show, Enum)
data ElevatorState = ElevatorState
{ currentFloor :: Floor
, moving :: Direction
, requests :: [Floor]
} deriving (Show)The elevator’s state tracks three things: where it currently is, which direction it’s moving (if any), and a queue of floor requests.
The system has two main components that run concurrently:
- An elevator controller that continuously processes the request queue
- Button press handler that adds new floor requests
Let’s look at the controller first:
-- | Initialize an empty elevator state.
--
-- The elevator starts on the ground floor
--
initElevator :: IO (TVar ElevatorState)
initElevator = newTVarIO $ ElevatorState Ground None []
-- | Elevator controller logic.
--
-- 1. Read the current 'ElevatorState'
-- 2. Check if there are any requested floors
-- 3.
-- 3.1. Block waiting for new requests if there aren't any
-- 3.2. If there any requests, move to the floor at the top of the queue.
--
-- Straightforward FIFO elevator algorithm.
--
elevatorController :: TVar ElevatorState -> IO ()
elevatorController elevatorVar = forever $ do
-- Atomically get the next floor from the queue
(nextFloor, dir) <- atomically $ do
state <- readTVar elevatorVar
case requests state of
[] -> retry -- Block until a request arrives
(targetFloor:rs) -> do
-- Remove the floor from queue and start moving
let direction = getDirection (currentFloor state) targetFloor
writeTVar elevatorVar $ state
{ moving = direction, requests = rs }
return (targetFloor, direction)
putStrLn ("Going " ++ show dir ++ " to " ++ show nextFloor)
moveToFloor elevatorVar nextFloorThe moveToFloor function simulates the physical movement of the elevator.
moveToFloor :: TVar ElevatorState -> Floor -> IO ()
moveToFloor elevatorVar targetFloor = do
elevatorState <- readTVarIO elevatorVar
when (currentFloor elevatorState /= targetFloor) $ do
-- Takes 1 second to move between floors
threadDelay (1000000 * numberOfFloorsToMove)
atomically $
modifyTVar elevatorVar (\elevatorState' ->
elevatorState' { currentFloor = targetFloor
, moving = Idle
}
)
putStrLn ("Arrived at " ++ show targetFloor)The buttonPress function handles both external calls (someone waiting for
the elevator) and internal requests (someone inside selecting a destination):
-- | Whenever a button is pressed this function is called.
--
-- There are two scenarios when a button is pressed:
--
-- 1. When a person is calling the elevator to a floor in order to enter it.
-- 2. When a person is inside the elevator and wants to instruct the elevator
-- to go to a particular floor.
--
buttonPress :: TVar ElevatorState -> Floor -> IO ()
buttonPress elevatorVar floor = do
putStrLn ("Pressing button to " ++ show floor)
atomically $
modifyTVar elevatorVar $ \state -> do
case requests state of
rs@(nextFloor:_)
| let mostRecentRequestedFloor = last rs
, nextFloor /= floor
|| mostRecentRequestedFloor /= floor ->
state { requests = rs ++ [floor] }
| otherwise -> state
[] -> state { requests = [floor] }Consider the following example scenario and timeline:
- The elevator starts on the ground floor.
- Person A is on the first floor and presses the button to call the elevator to the first floor.
- While the elevator is going up, Person B arrives on the ground floor calls it to the ground floor.
- Elevator arrives at the first floor.
- Person A enters and presses the button to go to the second floor.
- Elevator goes to the ground floor to pick up Person B.
- Person B enters and presses the button to go to the first floor.
- Elevator goes to the second floor.
- Elevator goes to the first floor.
-- | This example mimicks the scenario above, pressing buttons in the right
-- order.
elevatorExample :: [Floor] -> IO ()
elevatorExample floors = do
elevator <- initElevator
withAsync (elevatorController elevator)
$ \controllerAsync -> do
-- Simulate multiple people pressing buttons simultaneously
forConcurrently_ floors (buttonPress elevator)
threadDelay (10 * 1000000)
cancel controllerAsync
elevatorExample [First, Ground, Second, First]This function spawns the elevator controller and then simulates multiple button presses happening concurrently. Let’s trace through our example:
Pressing button to First
Going Up to First
Pressing button to Ground
Pressing button to Second
Pressing button to First
Arrived at First
Going Down to Ground
Arrived at Ground
Going Up to Second
Arrived at Second
Going Down to First
Arrived at First
Does such a simple implementation adhere to the specified time constraints? The answer is no, a FIFO elevator algorithm is easy to implement but can be inefficient if the requests are spread out across floors, leading to more travel time.
How would one go about to test/verify this? Testing timeliness constraints in concurrent IO is tricky, due to its non-deterministic nature and limited observability.
io-sim: Deterministic IO Simulator
io-sim closes the gap between the code that’s actually run in
production and the code that runs in tests. Combined with property based
testing techniques it is possible to simulate execution of a program for years
worth of simulated time and find reproducible, rare edge-case bugs.
io-sim achieves this by taking advantage of the
io-classes set of packages,
which offers a class-based API compatible with most of the core Haskell
packages, including mtl. In general the APIs follow the base or async
io-sim is a time based, discrete event simulator. Which means, it provides a
granular execution trace that can be used from inspecting the commit order of
STM transactions to validating a high level, temporal logic property over some
abstract trace. The best part is that code requires minimal changes to use
io-sim, just polymorphic type signatures that work with both IO and
IOSim monads. Here’s the elevator controller code refactored for testing
with io-sim:
initElevator :: MonadSTM m => m (TVar m ElevatorState)
initElevator = ...
elevatorController
:: ( MonadSTM m
, MonadDelay m
, MonadSay m
)
=> TVar m ElevatorState -> m ()
elevatorController elevatorVar =
...
say ("Going " ++ show dir ++ " to " ++ show nextFloor)
...
moveToFloor
:: ( MonadSTM m
, MonadDelay m
, MonadSay m
)
=> TVar m ElevatorState -> Floor -> m ()
moveToFloor elevatorVar targetFloor = do
...
say ("Arrived at " ++ show targetFloor)
getDirection :: Floor -> Floor -> Direction
getDirection from to = ...
buttonPress
:: ( MonadSTM m
, MonadSay m
)
=> TVar m ElevatorState -> Floor -> m ()
buttonPress elevatorVar floor = do
say ("Pressing button to " ++ show floor)
...
elevatorExample
:: ( MonadSTM m
, MonadAsync m
, MonadDelay m
, MonadSay m
)
=> [Floor]
-> m ()
elevatorExample floors = ...Notice that only type signatures and IO operations needed changes. The core
business logic remains identical. When instantiated to IO, say becomes
putStrLn, but in the IOSim monad it produces traceable events.
main :: IO ()
main = do
let simpleExample :: [Floor]
simpleExample = [First, Ground, Second, First]
-- Runs the 'elevatorExample' in IO. This outputs exactly the same output
-- as before
elevatorExample simpleExample
-- Runs the 'elevatorExample' in IOSim.
putStrLn . intercalate "\n"
. map show
. selectTraceEventsSayWithTime
-- ^ Extracts only the 'say' events from the 'SimTrace' and
-- attaches the timestamp for each event
--
-- selectTraceEventsSayWithTime :: SimTrace a -> [(Time, String)]
--
-- This function takes a 'SimTrace' and filters all 'EventSay'
-- traces. It also captures the time of the trace event.
$ runSimTrace (elevatorExample simpleExample)
-- ^ Runs example in 'IOSim'
--
-- runSimTrace :: (forall s. IOSim s a) -> SimTrace a
--
-- This function runs a IOSim program, yielding an execution trace.Running the program above, the first noticeable thing is that when the program
runs in IO, it actually takes 10 real seconds to run due to the
threadDelay calls. However, when the program runs in IOSim the output is
instantaneous. This is because io-sim operates on simulated time rather than
wall-clock time, i.e. only the internal clock advances when threads execute
time-dependent operations like threadDelay or timeouts. Between these
operations, the simulation executes as if it had infinite CPU speed, i.e. all
computations at a given timestamp complete instantly, yet remain sequentially
ordered and deterministic.
(Time 0s,"Pressing button to First")
(Time 0s,"Going Up to First")
(Time 0s,"Pressing button to Ground")
(Time 0s,"Pressing button to Second")
(Time 0s,"Pressing button to First")
(Time 1s,"Arrived at First")
(Time 1s,"Going Down to Ground")
(Time 2s,"Arrived at Ground")
(Time 2s,"Going Up to Second")
(Time 4s,"Arrived at Second")
(Time 4s,"Going Down to First")
(Time 5s,"Arrived at First")
This particular scenario doesn’t violate the constraint. To find violations,
property-based testing can explore the space of possible request patterns. The
only problem is that our say traces are strings which is not a very
functional way of tracing things.
contra-tracer: Structured Tracing
While say provides basic, string-based tracing, real systems need structured
tracing of domain-specific events. String-based logging quickly becomes
inadequate when trying to verify complex properties or analyze system behavior
programmatically. Tracing strongly-typed events that can be filtered,
analyzed, and used in property tests is much better. The
contra-tracer library
provides a contravariant tracing abstraction that integrates seamlessly with
io-sim.
The key advantages of structured tracing:
- Type Safety: Events are strongly typed, preventing typos and logging errors.
- Composability: Tracers can be filtered, transformed, and combined.
- Testability: Events can be programmatically analyzed in tests.
All one needs to do is to have a custom trace type:
data ElevatorTrace = ButtonPress Floor
| Going Direction Floor
| ArrivedAt Floor
deriving (Eq, Show, Typeable)And substitute all calls to say for traceWith tracer (ButtonPress floor),
for example.
With structured tracing in place, extracting and analyzing traces becomes type-safe and straightforward:
-- | Extract typed elevator events with timestamps
extractElevatorEvents :: SimTrace a -> [(Time, ElevatorTrace)]
extractElevatorEvents =
selectTraceEventsDynamicWithTimeProperty-Based Testing: Verifying Timing Constraints
The elevator system began with a clear requirement: no passenger should wait more than 4 seconds. The FIFO implementation seemed reasonable, but the elevator can end up travelling between the bottom and top floors whilst someone in the middle waits their turn.
With typed traces from contra-tracer and deterministic simulation from
io-sim, QuickCheck can systematically explore the space of possible request
patterns and verify this property.
To verify our timing constraint, we need to:
- Generate random sequences of floor requests
- Run each sequence through the elevator simulation
- Check that every passenger gets service within 4 seconds
Let’s start with the test data generation:
-- | 'Floor' Arbitrary instance.
--
-- Randomly generate floors. The shrink instance is the most important here
-- since it will be responsible for generating a simpler counterexample.
--
instance Arbitrary Floor where
arbitrary = elements [Ground, First, Second]
shrink Second = [Ground, First]
shrink First = [Ground]
shrink Ground = []The shrinking strategy is important because when QuickCheck finds a failing
case with many floors, it will try simpler combinations to find the minimal
reproduction of the original input.
To verify the property that no passenger waits more than 4 seconds for the elevator to arrive to its floor, one needs to track the button presses and measure how long until the elevator arrives.
The property works by maintaining a map of pending requests. Each
ButtonPress adds an entry (keeping the earliest if multiple people request
the same floor), and each ArrivedAt checks if that floor was requested and
whether the wait exceeded 4 seconds:
-- Traverse the event trace and check if there is any gap longer than 4s
-- between requests and the elevator arriving at the request's floor.
--
violatesFourSecondRule :: [(Time, ElevatorTrace)] -> Property
violatesFourSecondRule events = counterexample (intercalate "\n" $ map show events)
$ checkViolations events Map.empty
where
checkViolations :: [(Time, ElevatorTrace)] -> Map Floor DiffTime -> Property
-- Fail if there are pending requests
checkViolations [] pending =
counterexample ("Elevator never arrived at: " ++ show pending)
(Map.null pending)
checkViolations ((Time t, event):rest) pending =
case event of
-- Add request to the pending requests map. Note that if there's
-- already a request for a particular floor, overwriting the
-- timestamp is not the right thing to do because there's an older
-- request that shouldn't be forgotten.
--
ButtonPress floor ->
checkViolations rest (Map.alter (maybe (Just t) Just) floor pending)
-- The elevator arrived at a floor. Check if it took more than 4
-- seconds to do so. If not continue and remove the request from
-- the pending map.
--
ArrivedAt floor ->
case Map.lookup floor pending of
Nothing ->
checkViolations rest pending
Just requestTime
| let time = t - requestTime
counterexample ( "Passenger waited "
++ show time
++ " for the elevator to arrive to the "
++ show floor
++ " floor"
) False
| otherwise -> checkViolations rest (Map.delete floor pending)
_ -> checkViolations rest pendingThen it is just a matter of running the example for randomly generated inputs,
extract the trace and use QuickCheck to assert if the property is true or not.
prop_no_passenger_waits_4_seconds :: [Floor] -> Property
prop_no_passenger_waits_4_seconds floors =
-- Run the button press sequence and get the execution trace
--
let trace = extractElevatorEvents
$ runSimTrace
$ elevatorExample (Tracer (emit traceM)) floors
in violatesFourSecondRule trace
whereRunning this property, QuickCheck quickly finds a counterexample:
*** Failed! Falsified (after 8 tests and 2 shrinks):
[Second,Ground,First]
(Time 0s,ButtonPress Second)
(Time 0s,Going Up Second)
(Time 0s,ButtonPress Ground)
(Time 0s,ButtonPress First)
(Time 2s,ArrivedAt Second)
(Time 2s,Going Down Ground)
(Time 4s,ArrivedAt Ground)
(Time 4s,Going Up First)
(Time 5s,ArrivedAt First)
Passenger waited 5s for the elevator to arrive to the First floor
The counterexample is minimal thanks to QuickCheck’s shrinking. Here, one
can imagine three passengers, pressing a button almost at the same time. Since
the elevator starts on the Ground floor and it is the Second floor passenger
that wins the race, the elevator starts going to the Second floor and queues
the Ground and the First floor requests, by this order. It then takes 5
seconds in total for the elevator to arrive at the First floor, violating
the timeliness requirement.
With property test in place, it is possible to iterate on better algorithms with
confidence. prop_no_passenger_waits_4_seconds property will be able to assert
if any of the improvements actually meet the timing requirements.
Using io-sim in the Real World
Real systems don’t explicitly block, they perform actual work that takes time.
To make such code testable with io-sim, one can introduce a typeclass abstraction
(e.g. MonadElevator m) with methods like moveElevator. In production,
this would perform real hardware control; in tests, it would use threadDelay to
simulate the operation’s duration.
In this elevator example, in a real system, there would be a sensor which would
inform the controller at what time the elevator arrive at a specific floor, at
which point the internal logic about the current floor of the elevator would be
updated. With suitable abstraction, that implementation could replace our simplification
using threadDelay.
io-sim can accurately simulate the standard IO operations, but this additional
abstraction also introduces the challenge of verifying that the model accurately
describes the real-world interactions. For example, 1 second is actually a very
fast elevator, so our model and timeliness requirements may have to be modified
slightly.
That’s a topic left for another blog post!
Related Tools and Libraries
The Haskell ecosystem offers several libraries to test concurrent systems, each one addresses different aspects of the problem. Here are two of the most popular and known ones:
Each takes a slightly different approach to exploring thread schedules, invariants, or state-space, and all have proven useful in practice.
dejafu explores all possible thread interleavings to find concurrency bugs.
The library offers a similar typeclass abstraction to io-classes for
concurrency primitives, allowing testing code that uses threads, MVars and
STM.
quickcheck-state-machine tests stateful programs using state machine models
with pre and post-conditions. The library can find race conditions through
parallel testing. It excels at testing APIs with complex state dependencies,
e.g. databases or file systems, but focuses on state correctness rather than
temporal properties.
io-sim distinguishes itself by being the only time-based simulator. One
can’t easily ask “what happens when this operation takes 150ms instead of
15ms?” with dejafu nor quickcheck-state-machine. io-sim enables testing
of timeout logic, retry mechanisms, timeliness constraints, etc. The ability
to compress years of simulated execution into seconds of test runtime makes
io-sim particularly valuable for testing long-running systems where bugs
emerge only after extended operation.
Conclusion
The key insight is that io-sim simulates the actual behavior of Haskell’s
runtime. STM transactions, thread scheduling, and time passing behave exactly
as in production, but deterministically.
For concurrent Haskell systems with timing requirements, e.g. network
protocols, distributed systems, or real-time controllers, io-sim allows the
verification of time-sensitive properties. The library offers
much more than shown here, including thread scheduling exploration testing with
partial order reduction.
The complete code examples are available here.