Technical Debt
Hypervisor
CONFIG_PDX
Xen uses the term MFN for Machine Frame Number, which is synonymous with Linux’s PFN, and maps linearly to system/host/machine physical addresses.
For every page of RAM, a struct page_info is needed for tracking purposes. In the simple case, the frametable is an array of struct page_info[] indexed by MFN.
However, this is inefficient when a system has banks of RAM spread out in address space, as a large amount of space is wasted on frametable entries for non-existent frames. This wastes both virtual address space and RAM.
As a consequence, Xen has a compression scheme known as PDX which removes unused bits from the middle of MFNs to make a more tightly packed Page inDeX, which in turn reduces the size of the frametable for the system.
At the moment, PDX compression is unconditionally used.
However, PDX compression does come with a cost: extra complexity when converting between PFNs and pages, which is a common operation in Xen.
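To make the compression concrete, the following is a simplified, hypothetical sketch of the conversion. The real code derives its masks and shifts from the boot-time memory map rather than using fixed values, and all names and numbers below are illustrative only.

    /*
     * Illustrative sketch only, not Xen's actual implementation.  Suppose a
     * boot-time scan of the memory map finds that every MFN has zero bits in
     * [hole_start, hole_start + hole_width); those constant-zero bits can be
     * squeezed out to form the PDX, shrinking the frametable accordingly.
     */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    static unsigned int hole_start = 20;  /* example: lowest bit of the hole */
    static unsigned int hole_width = 8;   /* example: number of unused bits  */

    static uint64_t mfn_to_pdx(uint64_t mfn)
    {
        uint64_t low  = mfn & ((UINT64_C(1) << hole_start) - 1);
        uint64_t high = mfn >> (hole_start + hole_width);

        return low | (high << hole_start);                 /* hole removed */
    }

    static uint64_t pdx_to_mfn(uint64_t pdx)
    {
        uint64_t low  = pdx & ((UINT64_C(1) << hole_start) - 1);
        uint64_t high = pdx >> hole_start;

        return low | (high << (hole_start + hole_width));  /* hole reinstated */
    }

    int main(void)
    {
        uint64_t mfn = UINT64_C(0x10000012345);    /* bits 20-27 are zero */

        printf("mfn %#" PRIx64 " -> pdx %#" PRIx64 " -> mfn %#" PRIx64 "\n",
               mfn, mfn_to_pdx(mfn), pdx_to_mfn(mfn_to_pdx(mfn)));
        return 0;
    }

The frametable is then indexed by PDX rather than raw MFN, so the gap between RAM banks no longer costs frametable entries.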
ARM32 systems typically do have RAM banks in discrete locations, and want to use PDX compression, while ARM64 and x86 systems typically have RAM packed from 0 with no holes.
The goal of this work is to have CONFIG_PDX selected by ARM32 only. This requires slightly untangling the memory management code in ARM and x86 to give it a clean compile boundary where PDX conversions are used.
Waitqueue infrastructure
Livepatching safety in Xen depends on all CPUs rendezvousing on the return-to-guest path, with no stack frame. The vCPU waitqueue infrastructure undermines this safety by copying a stack frame sideways and longjmp()-ing away.
Waitqueues are only used by the introspection/mem_event/paging infrastructure, where the design of the rings causes some problems. A single 4k page is used for the ring, which serves both synchronous requests and lossless async requests. In practice, introspecting an 11-vcpu guest is sufficient for the waitqueue infrastructure to start being used.
A better design of ring would be to have a slot per vcpu for synchronous requests (simplifies producing and consuming of requests), and a multipage ring buffer (of negotiable size) with lossy semantics for async requests.
A design such as this would guarantee that Xen never has to block waiting for userspace to create enough space on the ring for a vcpu to write state out.
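As a purely illustrative sketch of such a design (no such ABI exists in Xen today, and every name below is invented rather than proposed), the shared area might look roughly like this:

    /*
     * Hypothetical layout: one fixed slot per vCPU carries synchronous
     * requests, and a separately sized ring carries async events with lossy
     * semantics, so Xen can always either write an event or bump a "lost"
     * counter instead of blocking on a waitqueue.
     */
    #include <stdint.h>

    struct sync_slot {
        uint32_t state;            /* e.g. IDLE / REQUEST / RESPONSE        */
        uint32_t _pad;
        uint8_t  payload[120];     /* request/response data for this vCPU   */
    };

    struct async_ring {
        uint32_t prod;             /* written by Xen                        */
        uint32_t cons;             /* written by the introspection agent    */
        uint32_t nr_slots;         /* negotiated at setup; spans many pages */
        uint32_t lost;             /* events dropped because ring was full  */
        /* nr_slots fixed-size event records follow */
    };

Because the synchronous slots are per-vCPU, producing a request never depends on space being freed for another vCPU, which is the property needed to avoid blocking in Xen.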
Note
There are other aspects of the existing ring infrastructure which are driving a redesign, but these don’t relate directly to the waitqueue infrastructure and livepatching safety.
The most serious problem is that the ring infrastructure is GFN based, which leaves either the guest able to mess with the ring, or a shattered host superpage where the ring used to be, and leaves the guest balloon driver able to prevent the introspection agent from connecting/reconnecting the ring.
As there are multiple compelling reasons to redesign the ring infrastructure, the plan is to introduce the new ring ABI, deprecate and remove the old ABI, and simply delete the waitqueue infrastructure at that point, rather than try to redesign livepatching from scratch in an attempt to cope with unwinding old stack frames.
Dom0
Remove xenstored's dependencies on unstable interfaces
Various xenstored implementations use libxc for two purposes. It would be a substantial advantage to move xenstored onto entirely stable interfaces, which would disconnect it from the internals of libxc.
Foreign mapping of the store ring
This is obsolete since Xen c/s 6a2de353a9 (2012), which allocated grant entries instead to allow xenstored to function as a stub-domain without dom0 permissions. Xen c/s 38eeb3864d dropped foreign mapping for cxenstored. However, there are no OCaml bindings for libxengnttab.
Work Items:
- Minimal tools/ocaml/libs/xg/ binding for tools/libs/gnttab/.
- Replicate Xen c/s 38eeb3864d for oxenstored as well.
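As a rough idea of what such a binding needs to wrap, the C-level equivalent using the existing libxengnttab calls looks roughly like the sketch below; the domid is a placeholder and error handling is minimal.

    /*
     * Sketch: map a domain's xenstore ring by its reserved grant reference
     * instead of by a privileged foreign mapping.
     */
    #include <sys/mman.h>
    #include <xengnttab.h>
    #include <xen/grant_table.h>       /* GNTTAB_RESERVED_XENSTORE */

    int main(void)
    {
        xengnttab_handle *xgt = xengnttab_open(NULL, 0);
        void *ring;

        if ( !xgt )
            return 1;

        /* The guest offers its store ring via a well-known grant reference. */
        ring = xengnttab_map_grant_ref(xgt, 1 /* placeholder domid */,
                                       GNTTAB_RESERVED_XENSTORE,
                                       PROT_READ | PROT_WRITE);
        if ( ring )
        {
            /* ... exchange xenstore messages over the ring here ... */
            xengnttab_unmap(xgt, ring, 1);
        }

        xengnttab_close(xgt);
        return 0;
    }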
Figuring out which domain(s) have gone away
Currently, the handling of domains is asymmetric.
- When a domain is created, the toolstack explicitly sends an XS_INTRODUCE(domid, store mfn, store evtchn) message to xenstored, to cause xenstored to connect to the guest ring, and fire the @introduceDomain watch.
- When a domain is destroyed, Xen fires VIRQ_DOM_EXC, which is bound by xenstored rather than the toolstack. Xenstored updates its idea of the status of domains, and fires the @releaseDomain watch.
- Xenstored uses xc_domain_getinfo() to work out which domain(s) have gone away, and only cares about the shutdown status.
- Furthermore, @releaseDomain (like VIRQ_DOM_EXC) is a single-bit message, which requires all listeners to evaluate whether the message applies to them or not. This results in a flurry of xc_domain_getinfo() calls from multiple entities in the system, which all serialise on the domctl lock in Xen.

Work Items:
- Figure out how shutdown status can be expressed in a stable way from Xen.
- Figure out if VIRQ_DOM_EXC and @releaseDomain can be extended or superseded to carry at least a domid, to make domain shutdown scale better.
- Figure out if VIRQ_DOM_EXC would better be bound by the toolstack, rather than xenstored.
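To illustrate why the current scheme scales poorly, here is a minimal sketch of binding VIRQ_DOM_EXC with libxenevtchn; the rescan helper is hypothetical and stands in for the xc_domain_getinfo() loop that real consumers end up performing on every notification, because the event itself carries no domid.

    #include <xenevtchn.h>
    #include <xen/xen.h>               /* VIRQ_DOM_EXC */

    /* Hypothetical stand-in for walking every domain's shutdown status. */
    static void rescan_all_domains(void)
    {
        /* would loop over xc_domain_getinfo() here */
    }

    int main(void)
    {
        xenevtchn_handle *xce = xenevtchn_open(NULL, 0);
        xenevtchn_port_or_error_t virq_port, port;

        if ( !xce )
            return 1;

        virq_port = xenevtchn_bind_virq(xce, VIRQ_DOM_EXC);
        if ( virq_port < 0 )
            return 1;

        for ( ;; )
        {
            port = xenevtchn_pending(xce);  /* blocks until an event arrives */
            if ( port < 0 )
                continue;

            if ( port == virq_port )
                rescan_all_domains();       /* single-bit event: no domid */

            xenevtchn_unmask(xce, port);
        }
    }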