Yes, that's alright. I initially thought the failure was caused by a crash, since I hit one while running these tests, but after trying again with openbox it looks like this test sequence randomly succeeds even though it is not expected to.
I'm relieved to learn that mutter is not used for testing :)