This uses the Mach copy-on-write (COW) mechanism to implement the write watch functionality.
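For readers not familiar with the Mach side, here is a rough, hypothetical sketch of the general idea (not the code in this MR): protecting the watched range with `VM_PROT_COPY` forces it into a copy-on-write state, and the per-page state can afterwards be queried with `mach_vm_page_query()`. The helper names are made up and the dirty/copied disposition check is only one possible way to observe the effect of the COW fault; the actual bookkeeping in the MR differs.

```c
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <mach/vm_region.h>

/* Arm (or re-arm) write watching on a range: protecting with VM_PROT_COPY
 * forces the pages into a copy-on-write state, so the first write to each
 * page after this call goes through the kernel's COW path instead of a
 * user-space SIGSEGV handler. */
static kern_return_t arm_write_watch( mach_vm_address_t addr, mach_vm_size_t size )
{
    return mach_vm_protect( mach_task_self(), addr, size, FALSE,
                            VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY );
}

/* Returns non-zero if the page at addr appears to have been written since
 * the last arm_write_watch() call, judged from its Mach disposition bits.
 * This check is illustrative only; the MR keeps its own bookkeeping on top
 * of the Mach state. */
static int page_was_written( mach_vm_address_t addr )
{
    integer_t disposition = 0, ref_count = 0;

    if (mach_vm_page_query( mach_task_self(), addr, &disposition, &ref_count ))
        return 0;
    return (disposition & (VM_PAGE_QUERY_PAGE_DIRTY | VM_PAGE_QUERY_PAGE_COPIED)) != 0;
}
```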
Below is the same micro-benchmark @gofman used in his [UFFD MR](https://gitlab.winehq.org/wine/wine/-/merge_requests/7871).
```
Parameters:
- number of concurrent threads;
- number of pages;
- delay between reading / resetting write watches (ms);
- random (1) or sequential (0) page write access;
- reset with WRITE_WATCH_FLAG_RESET in GetWriteWatch (1) or in a separate ResetWriteWatch call (0).

Result is in the form of <average write to page time, ns> / <average GetWriteWatch() time, mcs>.

Parameters      Windows       Mach COW        Fallback
6 1080 3 1 1    897 / 80      371 / 12634     66202 / 186
6 1080 3 1 0    855 / 87      369 / 12637     66766 / 187
8 8192 3 1 1    6526 / 268    627 / 113263    111053 / 485
8 8192 3 1 0    1197 / 509    623 / 113810    122921 / 489
8 8192 1 1 1    1227 / 412    636 / 118930    150628 / 388
8 8192 1 1 0    5721 / 144    631 / 120538    146392 / 384
8 64 1 1 1      572 / 7       490 / 1078      1000 / 89
8 64 1 1 0      530 / 13      500 / 1075      1167 / 77
```
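For orientation, a minimal harness shaped like the first parameter set (6 threads, 1080 pages, 3 ms delay, random access, reset via the flag) might look as follows. This is a hypothetical sketch with the timing left out, not the benchmark that produced the numbers above.

```c
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdlib.h>

#define NUM_THREADS 6
#define NUM_PAGES   1080
#define PAGE_SIZE   4096
#define DELAY_MS    3

static char *watched;
static volatile LONG stop;

/* Writer threads: touch random pages in the watched range (the real
 * benchmark also times each write; timing is omitted here for brevity). */
static DWORD WINAPI writer( void *arg )
{
    while (!stop) watched[(rand() % NUM_PAGES) * PAGE_SIZE] = 1;
    return 0;
}

int main( void )
{
    HANDLE threads[NUM_THREADS];
    void *addresses[NUM_PAGES];
    int i;

    watched = VirtualAlloc( NULL, NUM_PAGES * PAGE_SIZE,
                            MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH, PAGE_READWRITE );

    for (i = 0; i < NUM_THREADS; i++)
        threads[i] = CreateThread( NULL, 0, writer, NULL, 0, NULL );

    for (i = 0; i < 1000; i++)
    {
        ULONG_PTR count = NUM_PAGES;
        DWORD granularity;

        Sleep( DELAY_MS );
        /* Last benchmark parameter == 1: query and reset in a single call.
         * With parameter == 0 the flag would be 0 and ResetWriteWatch()
         * would be called separately afterwards. */
        GetWriteWatch( WRITE_WATCH_FLAG_RESET, watched, NUM_PAGES * PAGE_SIZE,
                       addresses, &count, &granularity );
    }

    InterlockedExchange( &stop, 1 );
    WaitForMultipleObjects( NUM_THREADS, threads, TRUE, INFINITE );
    return 0;
}
```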
All results are from the same M2 Max machine: the Windows column is Windows 11 on ARM in a VM running the x64 binary under emulation, and the Mach COW and fallback columns are Wine through Rosetta with and without this MR, respectively.
Unlike UFFD, which is consistently better than the fallback and comparable to Windows performance, this approach trades a good average write-to-page time for a bad average `GetWriteWatch()` time (in roughly equal ratios).
However, in real-world applications (such as the FFXIV + Dalamud mod framework/loader use case), this change reduces the cold-start time from about 25.5 s to 23.6 s, which includes loading a modern .NET 9 runtime into the game process and initializing a complex mod collection under fairly high GC pressure.
This is probably because the `GetWriteWatch()` calls the GC makes mostly happen concurrently, whereas in Wine's fallback implementation running threads are interrupted and often wait on the global virtual lock while the segfault is handled, blocking parallel accesses to write-watched memory and other VM operations.
Another advantage is that `VPROT_WRITEWATCH` is freed up for other purposes in the future. In addition, Rosetta is sometimes a bit finicky about the protections reported under the current implementation, but it has always behaved as expected so far in my testing with the new one.
On native ARM64 the `VM_PROT_COPY`/`SM_COW` mechanism also works as expected with native 16k pages (not that this matters much at the moment).
`GetWriteWatch()` with the reset flag also does not need to be transactional (unlike with UFFD), since only the marked pages are reset here rather than the entire range.
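Sketched with the hypothetical `arm_write_watch()` / `page_was_written()` helpers from the first example, the non-transactional reset amounts to re-arming only the pages just reported, so a write landing on any other page between the query and the reset is never lost:

```c
/* Collect the pages written since the last reset and re-arm only those.
 * Untouched pages keep whatever pending state they have, which is why no
 * range-wide transactional reset is needed.  Uses the hypothetical helpers
 * from the first sketch above. */
static size_t report_and_reset( mach_vm_address_t base, size_t pages, size_t page_size,
                                mach_vm_address_t *out )
{
    size_t i, count = 0;

    for (i = 0; i < pages; i++)
    {
        mach_vm_address_t page = base + i * page_size;

        if (!page_was_written( page )) continue;
        out[count++] = page;
        arm_write_watch( page, page_size );  /* re-arm just this page */
    }
    return count;
}
```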