This post is about how to write tests that can catch race conditions in a reproducible way. The approach is programming language agnostic, and should work in most languages that have a decent multi-threaded story. It's a white-box testing approach, meaning you will have to modify the software under test.
In my previous post, we had a look at how to mechanically derive parallel tests that can uncover race conditions from a sequential fake1.
One of the nice things about the approach is that it's a black-box testing technique, i.e. it doesn't require the user to change the software under test. On the other hand, threads will interleave differently when we rerun the tests, thereby potentially causing different outcomes. This in turn creates problems for the shrinking of failing test cases2.
As a workaround, I suggested that when a race condition is found in the unmodified code, one could swap the shared memory module for one that introduces sleeps around the operations. This creates less non-determinism, because the jitter of each operation will have less of an impact, and therefore helps shrinking. This isn't a satisfactory solution, of course, and I left a to-do item to implement a deterministic scheduler, like the authors do in the paper that first introduced parallel property-based testing.
The idea of the deterministic scheduler is that it should be possible to rerun a multi-threaded program and get exactly the same interleaving of threads each time.
The deterministic scheduler from the above mentioned paper is called PULSE. It was supposedly released under the BSD license, however I've not been able to find it. PULSE is written in Erlang and the paper uses it to test Erlang code. In Erlang everything is triggered by message passing, so I think the correct way of thinking about what PULSE does is that it acts as a person-in-the-middle proxy. In other words, an Erlang process doesn't send a message directly to another process, but instead asks the scheduler to send it to the process. That way all messages go via the scheduler and it can choose the message order. Note that a seed can be used to introduce randomness, without introducing non-determinism.
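To make the proxy idea a bit more concrete, here's a rough sketch of my own (it is not PULSE's actual implementation; the names `SchedulerState`, `send` and `deliverOne` are made up for illustration): processes hand their outgoing messages to the scheduler, and the scheduler uses a seeded pseudorandom number generator to pick which in-flight message to deliver next.

```haskell
import qualified Data.Map as Map
import System.Random (StdGen, randomR)

type Pid = Int
type Msg = String

data SchedulerState = SchedulerState
  { inFlight  :: [(Pid, Msg)]        -- messages handed over to the scheduler
  , mailboxes :: Map.Map Pid [Msg]   -- per-process inboxes
  , prng      :: StdGen              -- seeded, so runs are reproducible
  }

-- A process "sends" by handing the message to the scheduler...
send :: Pid -> Msg -> SchedulerState -> SchedulerState
send to msg s = s { inFlight = inFlight s ++ [(to, msg)] }

-- ...and the scheduler decides which in-flight message gets delivered next.
deliverOne :: SchedulerState -> SchedulerState
deliverOne s
  | null (inFlight s) = s
  | otherwise =
      let (i, g')   = randomR (0, length (inFlight s) - 1) (prng s)
          (to, msg) = inFlight s !! i
          rest      = take i (inFlight s) ++ drop (i + 1) (inFlight s)
      in s { inFlight  = rest
           , mailboxes = Map.insertWith (flip (++)) to [msg] (mailboxes s)
           , prng      = g'
           }
```

Since the only source of randomness is the seeded generator, rerunning with the same seed replays the same delivery order.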
I implemented a proxy scheduler like this in Haskell (using `distributed-process`, think Haskell trying to be like Erlang) about 6 years ago, but I didn't know how to do it in a non-message-passing setting. I was therefore happy to see that my previous post inspired matklad to write a post where he shows how he'd do it in a multi-threaded shared memory setting.
In this post I'll port matklad's approach from Rust to Haskell and hook it up to the parallel property-based testing machinery from my previous post.
Another difference between matklad's approach and the approach in this post is that matklad uses an ad hoc correctness criterion, whereas I follow the parallel property-based testing paper and use linearisability. An ad hoc criterion can be faster to check than linearisability, but depending on how complicated your system is, it might be harder to find one. Linearisability checking, on the other hand, follows mechanically (for free) from a sequential (single-threaded) model/fake.
If you know what you are doing, then by all means figure out an ad hoc correctness criterion like matklad does. If, on the other hand, you haven't tested much concurrent code before, then I'd recommend starting with the linearisability checking approach that we are about to describe3.
In order to explain what we'd like to do, it's helpful to consider an example of a race condition. The textbook example of a race condition is a counter that is incremented by two threads at the same time.
One possible interleaving of the two threads that yields the correct result is the following:
Time | Thread 1 | Thread 2 | | Integer value |
---|---|---|---|---|
0 | | | | 0 |
1 | read value | | ← | 0 |
2 | increase value | | | 0 |
3 | write back | | → | 1 |
4 | | read value | ← | 1 |
5 | | increase value | | 1 |
6 | | write back | → | 2 |
However there are other interleavings where one of the threads overwrites the other thread's increment, yielding an incorrect result:
Time | Thread 1 | Thread 2 | | Integer value |
---|---|---|---|---|
0 | | | | 0 |
1 | read value | | ← | 0 |
2 | | read value | ← | 0 |
3 | increase value | | | 0 |
4 | | increase value | | 0 |
5 | write back | | → | 1 |
6 | | write back | → | 1 |
In most programming languages the thread interleaving is non-deterministic, so we get irreproducible failures, also sometimes known as "Heisenbugs".
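To make this concrete, here's a minimal standalone sketch (separate from the scheduler we're about to build) of the racy increment using plain `IORef`s and the `async` library:

```haskell
import Control.Concurrent.Async (concurrently_)
import Data.IORef

racyIncrements :: IO Int
racyIncrements = do
  ref <- newIORef (0 :: Int)
  let incr = do
        n <- readIORef ref      -- read ...
        writeIORef ref (n + 1)  -- ... and write back, not atomically with the read
  concurrently_ incr incr       -- run the two increments on separate threads
  readIORef ref                 -- 1 or 2, depending on the interleaving
```

In practice the window between the read and the write is so small that rerunning this rarely exhibits the lost update, which is exactly why rerunning tests is such a poor way of reproducing race conditions.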
What we'd like is to be able to start a program with some seed, such that if the same seed is used we get the same thread interleaving and therefore a reproducible result.
The idea, due to matklad, is to insert pauses around each shared memory operation (the reads and writes) and have a scheduler unpause one thread at a time. The scheduler is parametrised by a seed, which is fed into a pseudorandom number generator, which in turn lets the scheduler deterministically choose which thread to unpause.
In the rest of this post we will port matklad's deterministic scheduler from Rust to Haskell, hopefully in a way that shows that this can be done in any other language with decent multi-threaded programming primitives. Then we'll do a short recap of how parallel property-based testing works, and finally we'll hook up the deterministic scheduler to the parallel property-based testing machinery.
The implementation of the deterministic scheduler can be split up into three parts. First we'll implement a way for the spawned threads to communicate with the scheduler; this communication channel will be used to pause and unpause the threads. After that we'll make a wrapper datatype around Haskell's threads which also includes the communication channel. Finally, we'll have all the pieces to implement the deterministic scheduler itself.
The scheduler needs to be able to communicate with the running threads, in order to be able to deterministically unpause, or "step", one thread at a time.
We'll use Haskell's `TMVar`s for this, but any kind of shared memory will do. Haskell's `MVar`s can be thought of as boxes that contain a value, where taking something out of an empty box blocks, and putting something into a full box blocks as well. Here "blocks" means that the run-time will suspend the thread that tries the blocking action and only wake it up when the `MVar` changes, i.e. it's an efficient way of waiting compared to busy-waiting or spinning. The `T` in `TMVar`s adds STM transactions around `MVar`s; we'll see an example of what these are useful for shortly.
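As a small illustration of this blocking behaviour (separate from the scheduler code that follows), here the main thread blocks on `takeMVar` until the forked thread fills the box:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar

mvarExample :: IO ()
mvarExample = do
  box <- newEmptyMVar
  _ <- forkIO $ do
    threadDelay 100000   -- pretend to do some work
    putMVar box "done"   -- fill the box, waking up the blocked reader
  msg <- takeMVar box    -- blocks until the box is full
  putStrLn msg
```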
We'll call our communication channel `Signal`:
data Signal = SingleThreaded | MultiThreaded (TMVar ())
deriving Eq
There are two ways to create a `Signal`, one for single-threaded and another for multi-threaded execution:
newSingleThreadedSignal :: Signal
newSingleThreadedSignal = SingleThreaded
newMultiThreadedSignal :: IO Signal
newMultiThreadedSignal = MultiThreaded <$> newEmptyTMVarIO
The idea being that in the single-threaded case the scheduler shouldn't be doing anything. In particular pausing a thread in the single-threaded case is a no-op:
pause :: Signal -> IO ()
pause SingleThreaded = return ()
pause (MultiThreaded tmvar) = atomically (takeTMVar tmvar)
Notice that in the multi-threaded case the pause operation will try to take a value from the `TMVar`, and also notice that the `TMVar` starts off being empty, so this will cause the thread to block.
The way the scheduler can unpause the thread is by putting a unit value into the `TMVar`, which will cause the `takeTMVar` to finish.
unpause :: Signal -> IO ()
unpause SingleThreaded = error
"unpause: a single thread should never be paused"
unpause (MultiThreaded tmvar) = atomically (putTMVar tmvar ())
For our scheduler implementation we'll also need a way to check if a thread is paused:
isPaused :: Signal -> STM Bool
isPaused SingleThreaded = return False
isPaused (MultiThreaded tmvar) = isEmptyTMVar tmvar
It's also useful to be able to check if all threads are paused:
waitUntilAllPaused :: [Signal] -> IO ()
waitUntilAllPaused signals = atomically $ do
bs <- mapM isPaused signals
guard (and bs)
Notice that STM makes this easy, as we can do the whole check atomically. Moreover, since `guard False` in STM is `retry`, the transaction blocks until one of the signals changes, rather than returning a stale answer.
Having implemented the communication channel between the thread and the scheduler, we are now ready to introduce our "managed" threads (we call them "managed" because they are managed by the scheduler). These threads are basically a wrapper around Haskell's `Async` threads that also includes our communication channel, `Signal`.
data ManagedThreadId a = ManagedThreadId
{ _mtidName :: String
, _mtidSignal :: Signal
, _mtidAsync :: Async a
}
deriving Eq
Our managed thread can be spawned as follows:
spawn :: String -> (Signal -> IO a) -> IO (ManagedThreadId a)
spawn name io = do
s <- newMultiThreadedSignal
a <- async (io s)
return (ManagedThreadId name s a)
Notice that the spawned IO action gets access to the communication channel.
The `Async` thread API exposes a way to check whether a thread is still executing, has thrown an exception, or has finished and yielded a result. We'll extend this by also being able to check if the thread is paused, as follows.
data ThreadStatus a = Paused | Finished a | Threw SomeException
getThreadStatus :: ManagedThreadId a -> IO (ThreadStatus a)
getThreadStatus mtid = atomically go
where
go = do
res <- pollSTM (_mtidAsync mtid)
case res of
Nothing -> do
b <- isPaused (_mtidSignal mtid)
if b
then return Paused
else go
Just (Left err) -> return (Threw err)
Just (Right x) -> return (Finished x)
We've now got all the pieces we need to implement our deterministic scheduler.
The idea is to wait until all threads are paused, then step one of them and wait until it either pauses again or finishes. If it pauses again, we repeat the stepping. If it finishes, we remove it from the list of threads to schedule and continue stepping the remaining ones.
schedule :: RandomGen g => [ManagedThreadId a] -> g -> IO ([a], g)
schedule mtids0 gen0 = do
res <- timeout 1000000 (waitUntilAllPaused (map _mtidSignal mtids0))
case res of
Nothing -> error "schedule: all threads didn't pause within a second"
Just () -> do
go mtids0 gen0 []
where
go :: RandomGen g => [ManagedThreadId a] -> g -> [a] -> IO ([a], g)
go [] gen acc = return (reverse acc, gen)
go mtids gen acc = do
let (ix, gen') = randomR (0, length mtids - 1) gen
mtid = mtids !! ix
unpause (_mtidSignal mtid)
status <- getThreadStatus mtid
case status of
Finished x -> do
go (mtids \\ [mtid]) gen' (x : acc)
Paused -> go mtids gen' acc
Threw err -> error ("schedule: " ++ show err)
We can now also implement a useful higher-level combinator:
mapConcurrently :: RandomGen g => (Signal -> a -> IO b) -> [a] -> g -> IO ([b], g)
mapConcurrently f xs gen = do
mtids <- forM (zip [0..] xs) $ \(i, x) ->
spawn ("Thread " ++ show i) (\sig -> f sig x)
schedule mtids gen
To show that our scheduler is indeed deterministic, let's implement the race condition between two increments from the introduction.
First let's introduce an interface for shared memory. The idea is that we will use two different instances of this interface: a "real" one which just does what we'd expect from shared memory, and a "fake" one which pauses around the real operations.
data SharedMemory a = SharedMemory
{ memReadIORef :: IORef a -> IO a
, memWriteIORef :: IORef a -> a -> IO ()
}
realMem :: SharedMemory a
realMem = SharedMemory readIORef writeIORef
fakeMem :: Signal -> SharedMemory a
fakeMem signal =
SharedMemory
{ memReadIORef = \ref -> do
pause signal
x <- readIORef ref
pause signal
return x
, memWriteIORef = \ref x -> do
pause signal
writeIORef ref x
pause signal
}
The real one will be used when we deploy the actual software and the fake one while we do our testing.
We can now implement our counter example against the shared memory interface:
data AtomicCounter = AtomicCounter (IORef Int)
newCounter :: IO AtomicCounter
newCounter = do
ref <- newIORef 0
return (AtomicCounter ref)
incr :: SharedMemory Int -> AtomicCounter -> IO ()
incr mem (AtomicCounter ref) = do
i <- memReadIORef mem ref
memWriteIORef mem ref (i + 1)
get :: SharedMemory Int -> AtomicCounter -> IO Int
get mem (AtomicCounter ref) = memReadIORef mem ref
Finally, we can implement the race condition test using the counter and two threads that do increments:
test :: Int -> IO (Int, Bool, Int)
test seed = do
counter <- newCounter
mtid1 <- spawn "0" (\signal -> incr (fakeMem signal) counter)
mtid2 <- spawn "1" (\signal -> incr (fakeMem signal) counter)
let gen = mkStdGen seed
_ <- schedule [mtid1, mtid2] gen
two <- get realMem counter
return (seed, two == 2, two)
test1 :: IO ()
test1 = mapM_ (\seed -> print =<< test seed) [0..10]
The test is parametrised by a seed for the scheduler. If we run it with different seeds we get different outcomes:
>>> test1
(0,True,2)
(1,True,2)
(2,False,1)
(3,True,2)
(4,False,1)
(5,True,2)
(6,False,1)
(7,False,1)
(8,True,2)
(9,True,2)
(10,False,1)
If we fix the seed to one which makes our test fail:
test2 :: IO ()
test2 = let seed = 2 in replicateM_ 10 (print =<< test seed)
Then we get the same outcome every time:
>>> test2
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
These quick tests seem to suggest that our scheduler is in fact deterministic.
In our counter example above we had two concurrent increments, and in this case it's easy to see what the answer must be (the counter must have the value two, if we start counting from zero and increment by one).
However for more complicated scenarios it gets less clear, consider:
- Two increments and a get operation all happening concurrently, what's the right return value of the get? It depends, it can be 0, 1 or 2;
- Consider the counter being on a remote server, with clients doing the increments and gets over some network. Imagine a client first does an increment and this request times out, or the client crashes; then another client does a get operation. What's the return value of the get? It depends: it can be 0 or 1, depending on whether the timeout or crash happened before or after the server received the increment;
- The above gets a lot more complicated with more operations involved, or with more complicated data structures than a counter, e.g. a key-value store with deletes.
Luckily there's a correctness criterion for concurrent programs like these which is based on a sequential model: linearisability, introduced in Linearizability: a correctness condition for concurrent objects by Herlihy and Wing (1990). It hides the complexity of non-determinism and crashing threads and works on arbitrary data structures. This is what we use in parallel property-based testing.
The idea in a nutshell: execute commands in parallel, collect a concurrent history of when each command started and stopped executing, try to find an interleaving of commands which satisfies the sequential model. For a more detailed explanation see my previous post.
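To give a feel for what the linearisability check amounts to, here's a deliberately naive sketch (the real machinery, which deals with pending operations and references, is in the previous post; the `step` function and the per-thread pairing of commands with responses are simplifications introduced here for illustration):

```haskell
-- All ways of interleaving the operations of different threads while keeping
-- each thread's own program order intact.
interleavings :: [[op]] -> [[op]]
interleavings threads = go (filter (not . null) threads)
  where
    go [] = [[]]
    go ts = [ op : rest
            | (before, op : ops, after) <- picks ts
            , rest <- go (filter (not . null) (before ++ [ops] ++ after))
            ]
    picks xs = [ (take i xs, xs !! i, drop (i + 1) xs) | i <- [0 .. length xs - 1] ]

-- A history linearises if some interleaving replays against the sequential
-- model such that every observed response matches what the model predicts.
linearisable :: Eq resp
             => (model -> cmd -> (model, resp))  -- the sequential model/fake
             -> model                            -- its initial state
             -> [[(cmd, resp)]]                  -- observed (command, response) pairs, per thread
             -> Bool
linearisable step model0 threads = any (replays model0) (interleavings threads)
  where
    replays _ []                  = True
    replays m ((cmd, resp) : ops) =
      let (m', resp') = step m cmd
      in  resp == resp' && replays m' ops
```

For the counter, the model would be an `Int`, an increment would bump the model and return unit, and a get would return the model's current value. Enumerating all interleavings like this is exponential, but that's fine for the short histories that property-based testing generates.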
The point is not to reimplement the parallel property-based testing machinery from my previous post here, but merely show that integrating the deterministic scheduler isn't too much work.
We need to change the code from the previous post in three different places: the sequential module, the parallel module and the counter example itself.
First we import the library code that we wrote above in this post:
+import qualified ManagedThread2 as Scheduler
Then we move the `runCommandMonad` method from the `ParallelModel` class into the `StateModel` class and change it so that it has access to the communication channel to the scheduler (`Signal`):
+ runCommandMonad :: proxy state -> CommandMonad state a -> Scheduler.Signal -> IO a
We then change the `runCommands` function to use `runCommandMonad` and to use a single-threaded `Signal`, i.e. one that doesn't do any pauses:
runCommands :: forall state. StateModel state
- => Commands state -> PropertyM (CommandMonad state) ()
-runCommands (Commands cmds0) = go initialState emptyEnv cmds0
+ => Commands state -> PropertyM IO ()
+runCommands (Commands cmds0) =
+ hoist (flip (runCommandMonad (Proxy :: Proxy state)) Scheduler.newSingleThreadedSignal) $
+ go initialState emptyEnv cmds0
where
go :: state -> Env state -> [Command state (Var (Reference state))]
-> PropertyM (CommandMonad state) ()
In order for this to typecheck we need a helper function that changes the underlying monad of a `PropertyM` (QuickCheck's monadic properties):
+hoist :: Monad m => (forall x. m x -> IO x) -> PropertyM m a -> PropertyM IO a
+hoist nat (MkPropertyM f) = MkPropertyM $ \g ->
+ let
+ MkGen h = f (fmap (fmap (return . ioProperty)) g)
+ in
+ MkGen (\r n -> nat (h r n))
Again we import the deterministic scheduler that we defined in this post:
+import qualified ManagedThread2 as Scheduler
As we said above, the `runCommandMonad` method was moved into the sequential testing module:
- runCommandMonad :: proxy state -> CommandMonad state a -> IO a
We'll reuse QuickCheck's seed for our scheduler; the following helper function extracts the seed from QuickCheck's `PropertyM`:
+getSeed :: PropertyM m QCGen
+getSeed = MkPropertyM (\f -> MkGen (\r n -> unGen (f r) r n))
We now have all the pieces we need to rewrite `runParallelCommands` to use the deterministic scheduler:
runParallelCommands :: forall state. ParallelModel state
=> ParallelCommands state -> PropertyM IO ()
runParallelCommands cmds0@(ParallelCommands forks0) = do
+ gen <- getSeed
+ liftIO (putStrLn ("Seed: " ++ show gen))
forM_ (parallelCommands cmds0) $ \cmd -> do
let name = commandName cmd
monitor (tabulate "Commands" [name] . classify True name)
monitor (tabulate "Concurrency" (map (show . length . unFork) forks0))
q <- liftIO newTQueueIO
c <- liftIO newAtomicCounter
- env <- liftIO (runForks q c emptyEnv forks0)
+ env <- liftIO (runForks q c emptyEnv gen forks0)
hist <- History <$> liftIO (atomically (flushTQueue q))
let ok = linearisable env (interleavings hist)
unless ok (monitor (counterexample (show hist)))
assert ok
where
- runForks :: TQueue (Event state) -> AtomicCounter -> Env state -> [Fork state]
- -> IO (Env state)
- runForks _q _c env [] = return env
- runForks q c env (Fork cmds : forks) = do
- envs <- liftIO $
- mapConcurrently (runParallelReal q c env) (zip [Pid 0..] cmds)
+ runForks :: RandomGen g => TQueue (Event state) -> AtomicCounter -> Env state -> g
+ -> [Fork state] -> IO (Env state)
+ runForks _q _c env _gen [] = return env
+ runForks q c env gen (Fork cmds : forks) = do
+ (envs, gen') <- liftIO $
+ Scheduler.mapConcurrently (runParallelReal q c env) (zip [Pid 0..] cmds) gen
let env' = combineEnvs (env : envs)
- runForks q c env' forks
+ runForks q c env' gen' forks
runParallelReal :: TQueue (Event state) -> AtomicCounter -> Env state
- -> (Pid, Command state (Var (Reference state))) -> IO (Env state)
- runParallelReal q c env (pid, cmd) = do
+ -> Scheduler.Signal -> (Pid, Command state (Var (Reference state))) -> IO (Env state)
+ runParallelReal q c env signal (pid, cmd) = do
atomically (writeTQueue q (Invoke pid cmd))
- eResp <- try (runCommandMonad (Proxy :: Proxy state) (runReal (fmap (lookupEnv env) cmd)))
+ eResp <- try (runCommandMonad (Proxy :: Proxy state) (runReal (fmap (lookupEnv env) cmd)) signal)
case eResp of
Left (err :: SomeException) ->
error ("runParallelReal: " ++ displayException err)
We start by replacing our sleeps (`threadDelay`s) and direct manipulation of shared memory (`{read,write}IORef`) with operations from the shared memory interface:
+import qualified ManagedThread2 as Scheduler
-incrRaceCondition :: IO ()
-incrRaceCondition = do
- n <- readIORef gLOBAL_COUNTER
- threadDelay 100
- writeIORef gLOBAL_COUNTER (n + 1)
- threadDelay 100
+incrRaceCondition :: Scheduler.SharedMemory Int -> IO ()
+incrRaceCondition mem = do
+ n <- liftIO (Scheduler.memReadIORef mem gLOBAL_COUNTER)
+ Scheduler.memWriteIORef mem gLOBAL_COUNTER (n + 1)
-get :: IO Int
-get = readIORef gLOBAL_COUNTER
+get :: Scheduler.SharedMemory Int -> IO Int
+get mem = Scheduler.memReadIORef mem gLOBAL_COUNTER
We need to pass in the `Signal` communication channel when constructing the shared memory interface. With our change to `runCommandMonad` we have access to the signal when translating the `CommandMonad` into `IO`, so we can simply pass the signal through using the reader monad (recall that `ReaderT Scheduler.Signal IO a` is isomorphic to `Scheduler.Signal -> IO a`).
+ type CommandMonad Counter = ReaderT Scheduler.Signal IO
+
+ runCommandMonad _ m sig = runReaderT m sig
The construction of the fake shared memory interface happens in the `runReal` function, where `ask` retrieves the signal via the reader monad:
-- We also need to explain which part of the counter API each command
-- corresponds to.
- runReal :: Command Counter r -> IO (Response Counter r)
- runReal Get = Get_ <$> get
- runReal Incr = Incr_ <$> incrRaceCondition
+ runReal :: Command Counter r -> CommandMonad Counter (Response Counter r)
+ runReal cmd = do
+ sig <- ask
+ let mem = Scheduler.fakeMem sig
+ case cmd of
+ Get -> liftIO (Get_ <$> get mem)
+ Incr -> liftIO (Incr_ <$> incrRaceCondition mem)
The final changes are in the parallel property itself, where we can now remove the `replicateM_ 10` that repeats the test 10 times: because the thread scheduling is now deterministic, we don't need to repeat the test in order to avoid being unlucky and only getting thread interleavings that don't reveal the bug.
prop_parallelCounter :: ParallelCommands Counter -> Property
prop_parallelCounter cmds = monadicIO $ do
- replicateM_ 10 $ do
- run reset
- runParallelCommands cmds
+ run reset
+ runParallelCommands cmds
assert True
Running the parallel property gives output such as the following:
>>> quickCheck prop_parallelCounter
Seed: SMGen 14250666030628800360 1954972351745194697
Seed: SMGen 13912848539649022280 6105520832690741705
Seed: SMGen 11982463081258021613 5563494797767522969
Seed: SMGen 3766496530906674898 8913882928510646053
Seed: SMGen 9878140450988724144 11431408445192688375
Seed: SMGen 10677049786290338516 2728325560351012375
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785rink)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785rinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785rinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
Seed: SMGen 8857820011662424543 17283242182436244785shrinks)...
*** Failed! Assertion failed (after 7 tests and 3 shrinks):
ParallelCommands [Fork [Incr,Incr],Fork [Get]]
History [Invoke (Pid 0) Incr,Invoke (Pid 1) Incr,Ok (Pid 0) (Incr_ ()),Ok (Pid 1) (Incr_ ()),Invoke (
Pid 0) Get,Ok (Pid 0) (Get_ 1)]
We can see that different seeds are used up until the test fails, then shrinking is done with the same seed.
I hope I've managed to give a glimpse of how we can deterministically test multi-threaded code using a deterministic scheduler, and how this technique can be applied to parallel property-based testing.
While this seems to work, there are several ways in which it can be improved upon:
-
Some seeds don't give the minimal counterexample (the one we saw above with two concurrent increments followed by a get). While shrinking can be improved, as already pointed out in my previous post, the problem could also be that shrinking changes the interleavings. Let's say we generated three concurrent increments followed by a get; this triggers the race condition if one of the increments overwrites another thread's increment. It could be that trying to shrink away any of the increments (to get to the minimal counterexample) fails, because removing any of them causes the scheduler to unpause the remaining ones in a different order, thus potentially failing to trigger the race condition.
One possible solution to this problem could be to "tombstone" the shrunk commands/threads rather than removing them, and then change `runReal` so that tombstoned commands get run using an instance of the shared memory interface in which the pauses happen but not the mutation of the memory (see the sketch after this list). The idea being that the scheduler will still use the pseudorandom number generator for the shrunk commands, and thus the original interleavings will be preserved.

Another possible solution, due to Daniel Gustafsson, is to save more information than merely the seed. In particular, the scheduler could generate the first interleaving using a seed, but then save the interleaving (the order in which the threads were unpaused), so that subsequent reruns can reuse the interleaving (or shrunk subsets of it) instead of the seed, thereby preserving the original interleaving.
-
Currently random interleavings are checked, but we could also imagine enumerating all interleavings up to some depth. This would be more in line with what model checkers do. Perhaps SmallCheck could be used for this? It would also be interesting to compare this approach to what the dejafu library does.
-
While the approach in this post works in all languages due to its white-box nature, it's interesting to consider what it would take to turn it into a black-box approach, where by black-box I mean that the programmer doesn't need to change their code to get the deterministic testing.
Two black-box approaches that I'm aware of are:
- Intercepting and recording the syscalls that the multi-threaded program makes, and then somehow using the recorded trace to deterministically reproduce the same execution when the program is rerun. I believe this is what Mozilla's time-travelling debugger, rr, and Facebook's hermit do;
- Antithesis' deterministic hypervisor.
Both of these approaches involve a lot of engineering work though, and I'm curious if we can get there cheaper?
One thing I'm interested in is: what if we had a programming language that's able to switch between the fake and real shared memory interface, depending on whether or not we are testing? The multi-threaded code that the user writes wouldn't need to be changed to get the deterministic testing, i.e. a black-box approach.
Implementing a new language and rewriting all your code in that language is a lot of work as well, though. Perhaps existing languages can be incrementally changed to expose scheduler hooks or to allow user-defined schedulers? Either way, it seems to me that this should be solved at the language level rather than the OS level, but maybe that's partly because I don't understand the OS-level solutions well enough. I'd be curious to hear other opinions or ideas.
-
We've looked at linearisability, but what about other consistency models, for example eventual consistency? See John Hughes et al.'s Mysteries of Dropbox: Property-Based Testing of a Distributed Synchronization Service (2016), as well as matklad's post, for hints.
-
Partial-order reduction: during concurrent execution we can sometimes commute two operations without changing the outcome, e.g. the order of two increments doesn't matter, they end up in the same state either way. We can exploit this fact to check fewer histories;
-
We looked at shared memory, but there are other ways of getting data races, e.g. via concurrent file system access, mmapped memory, etc. All these other ways of concurrently mutating some state would require interfaces with fake implementations that insert pauses around the actual mutation. It would be interesting to take an example of a concurrent (and persisted?) data structure, e.g. the LMAX Disruptor or Aeron's log buffers, and implement and test it in the same way we tested the counter.
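As a sketch of the tombstoning idea from the first point above, here's roughly what such an instance of the shared memory interface could look like: it keeps the pauses, so the scheduler still draws random numbers and unpauses threads in the same order, but it skips the actual mutation. This is an untested idea, and `tombstoneMem` is a name made up for illustration:

```haskell
tombstoneMem :: Signal -> SharedMemory a
tombstoneMem signal =
  SharedMemory
    { memReadIORef = \ref -> do
        pause signal
        x <- readIORef ref  -- reading has no side effect, so we keep it
        pause signal
        return x
    , memWriteIORef = \_ref _x -> do
        pause signal
        -- the write itself is skipped for tombstoned commands
        pause signal
    }
```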
If you have feedback or comments, or are interested in working on any of the above, feel free to get in touch.
Thanks to Daniel Gustafsson for discussing the tombstone solution to the problem where shrinking can cause different thread interleavings.
Footnotes
-
If you haven't heard of fakes before, think of them as a more elaborate test double than a mock. A mock of a component expects to be called in some particular way (i.e. it exposes only some limited subset of the component's API), and typically throws an exception when called in any other way. A fake, on the other hand, exposes the full API and can be called just like the real component, but unlike the real component it takes some shortcuts. For example a fake might lose all data when restarted (i.e. it keeps all data in memory, while the real component persists the data to some stable storage). ↩
-
The reason shrinking doesn't work so well with non-determinism is that shrinking stops when the property passes. So if some input causes the property to fail and we then rerun the property on a smaller input, it might be the case that the smaller input still contains the race condition, but because the interleaving of threads is non-deterministic we are unlucky, the race condition isn't triggered, the property passes, and the shrinking process stops. ↩
-
Jepsen's Knossos checker also uses linearisability checking and has found many bugs, so we are in good company.
Note that the most recent Jepsen analyses use the Elle checker, rather than the Knossos checker. The Elle checker doesn't do linearisability checking, but rather looks for cycles in the dependencies of database transactions. Checking for cycles is less general than linearisability checking, but also more efficient. See the Elle paper for details. ↩