Date:      Mon, 03 Jul 2023 09:34:57 +0200
From:      Corvin Köhne <corvink@FreeBSD.org>
To:        Elena Mihailescu <elenamihailescu22@gmail.com>
Cc:        freebsd-virtualization@freebsd.org, Mihai Carabas <mihai.carabas@gmail.com>,  Matthew Grooms <mgrooms@shrew.net>
Subject:   Re: Warm and Live Migration Implementation for bhyve
Message-ID:  <b66fb737fca369239b3953892132f7e29906564f.camel@FreeBSD.org>
In-Reply-To: <CAGOCPLg4ZeaRLK0VeRzifteXt3dJnSqZ=YT5BJ8EtH7%2BwMkTfA@mail.gmail.com>
References:   <CAGOCPLhJrNrysBM1vc87vfkX5jZLCmnyfGf%2Bcv2wmHFF1UhC-w@mail.gmail.com> <3d7ee1f6ff98fe9aede5a85702b906fc3014b6b6.camel@FreeBSD.org> <CAGOCPLg4ZeaRLK0VeRzifteXt3dJnSqZ=YT5BJ8EtH7%2BwMkTfA@mail.gmail.com>

On Tue, 2023-06-27 at 16:35 +0300, Elena Mihailescu wrote:
> Hi Corvin,
>
> Thank you for the questions! I'll respond to them inline.
>
> On Mon, 26 Jun 2023 at 10:16, Corvin Köhne <corvink@freebsd.org>
> wrote:
> >
> > Hi Elena,
> >
> > thanks for posting this proposal here.
> >
> > Some open questions from my side:
> >
> > 1. How is the data sent to the target? Does the host send a complete
> > dump and the target parses it? Or does the target request data one
> > by one and the host sends it as a response?
> >
> It's not a dump of the guest's state, it's transmitted in steps.
> However, some parts may be migrated as a chunk (e.g., the emulated
> devices' state is transmitted as the buffer generated from the
> snapshot functions).
>

How does the receiver know which chunk relates to which device? It
would be nice if you could start bhyve on the receiver side without
any other parameters, e.g. `bhyve --receive=127.0.0.1:1234`. For that
to work, the protocol has to carry some information about the device
configuration.
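
A minimal sketch of one way a stream could tell the receiver which
chunk belongs to which device: each chunk is prefixed with the device
name and the payload length. The frame layout and the names
`pack_chunk` and `unpack_chunks` are hypothetical and are not taken
from the patches under review.

```python
import struct

# Hypothetical frame layout (an assumption for this sketch):
#   2-byte name length | 4-byte payload length | device name | payload
# "!" selects network byte order without padding, so the header is 6 bytes.
HEADER = struct.Struct("!HI")


def pack_chunk(device: str, payload: bytes) -> bytes:
    """Prefix a device snapshot buffer with its name and length."""
    name = device.encode("utf-8")
    return HEADER.pack(len(name), len(payload)) + name + payload


def unpack_chunks(stream: bytes):
    """Yield (device, payload) pairs from a concatenation of frames."""
    offset = 0
    while offset < len(stream):
        name_len, payload_len = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        name = stream[offset:offset + name_len].decode("utf-8")
        offset += name_len
        payload = stream[offset:offset + payload_len]
        if len(payload) != payload_len:
            raise ValueError("truncated chunk for device %r" % name)
        offset += payload_len
        yield name, payload
```

With framing like this, the receiver can dispatch each payload to the
matching restore function by name instead of relying on both sides
being started with the same device list in the same order.
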

> I'll try to briefly describe the protocol we have implemented for
> migration; it may partially answer the second and third questions.
>
> The destination host waits for the source host to connect (through a
> socket).
> After that, the source sends its system specifications (hw_machine,
> hw_model, hw_pagesize). If the source and destination hosts have
> identical hardware configurations, the migration can take place.
>
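
The compatibility check described above can be sketched as follows.
The three field names come from the description; the JSON encoding and
the function names `encode_specs` and `check_specs` are assumptions
made for illustration only.

```python
import json

# Fields named in the description above. How they are actually encoded
# on the wire is not stated, so JSON is an assumption of this sketch.
SPEC_FIELDS = ("hw_machine", "hw_model", "hw_pagesize")


def encode_specs(specs: dict) -> bytes:
    """Serialize the source host's specifications for sending."""
    return json.dumps({k: specs[k] for k in SPEC_FIELDS},
                      sort_keys=True).encode("utf-8")


def check_specs(local: dict, remote_msg: bytes) -> list:
    """Return the fields that differ; an empty list means compatible."""
    remote = json.loads(remote_msg)
    return [k for k in SPEC_FIELDS if local.get(k) != remote.get(k)]
```

Returning the list of mismatching fields, rather than a plain boolean,
lets the destination report why a migration was refused.
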
> Then, if we have live migration, we migrate the memory in rounds
> (i.e., we get a list of the pages that have the dirty bit set, send
> it to the destination so that it knows which pages to expect, then
> send the pages through the socket; this process is repeated until the
> last round).
>
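
The round structure described above can be sketched as a small loop.
This is a simulation under stated assumptions, not the bhyve code:
`read_page`, `fetch_dirty` and `send` are hypothetical callbacks, and
the stop condition (`max_rounds` and `threshold`) is invented here,
since the description does not say how the last round is chosen.

```python
def migrate_memory(read_page, fetch_dirty, send, all_pages,
                   max_rounds=5, threshold=8):
    """Send memory in pre-copy rounds while the guest keeps running.

    read_page(idx)  -> bytes of one guest page (hypothetical callback)
    fetch_dirty()   -> set of pages dirtied since the previous call
    send(message)   -> transmit one message to the destination
    Returns the pages still dirty when the loop stops; those would be
    sent in the final round, after the vCPUs have been stopped.
    """
    pending = sorted(all_pages)            # first round: every page
    for _ in range(max_rounds):
        send(("pages", pending))           # announce what will follow
        for idx in pending:
            send(("data", idx, read_page(idx)))
        pending = sorted(fetch_dirty())    # written during this round
        if len(pending) <= threshold:      # assumed stop condition
            break
    return pending
```

Sending the page list ahead of the page contents means the destination
knows where to place each incoming page without per-page headers.
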
> Next, we stop the guest's vcpus, send the remaining memory (for live
> migration) or the guest's memory from vmctx->baseaddr for warm
> migration. Then, based on the suspend/resume feature, we get the
> state of the virtualized devices (the ones from the kernel space) and
> send this buffer to the destination. We repeat this for the emulated
> devices as well (the ones from the userspace).
>
> On the receiver host, we get the memory pages and place them at their
> corresponding positions in the guest's memory, use the restore
> functions for the state of the devices and start the guest's
> execution.
>
> Excluding the guest's memory transfer, the rest is based on the
> suspend/resume feature. We snapshot the guest's state, but instead of
> saving the data locally, we send it via network to the destination.
> On the destination host, we start a new virtual machine, but instead
> of reading/getting the state from the disk (i.e., the snapshot files)
> we get this state via the network from the source host.
>
> If the destination can properly resume the guest activity, it will
> send an "OK" to the source host so it can destroy/remove the guest
> from its end.
>
> Both warm and live migration are based on "cold migration". Cold
> migration means we suspend the guest on the source host, and restore
> the guest on the destination host from the snapshot files. Warm
> migration only does this using a socket, while live migration changes
> the way the memory is migrated.
>
> > 2. What happens if we add a new data section?
> >
> What are you referring to with a new data section? Is this question
> related to the third one? If so, see my answer below.
>
> > 3. What happens if the bhyve version differs on host and target
> > machine?
>
> The two hosts must be identical for migration, that's why we have the
> part where we check the specifications between the two migration
> hosts. They are expected to have the same version of bhyve and
> FreeBSD. We will add an additional check in the check specs part to
> see if we have the same FreeBSD build.
>
> As long as the changes in the virtual memory subsystem won't affect
> bhyve (and how the virtual machine sees/uses the memory), the
> migration constraints should only be related to suspend/resume. The
> state of the virtual devices is handled by the snapshot system, so if
> it is able to accommodate changes in the data structures, the
> migration process will not be affected.
>
> Thank you,
> Elena
>
> >
> >
> > --
> > Kind regards,
> > Corvin
> >
> > On Fri, 2023-06-23 at 13:00 +0300, Elena Mihailescu wrote:
> > > Hello,
> > >
> > > This mail presents the migration feature we have implemented for
> > > bhyve. Any feedback from the community is much appreciated.
> > >
> > > We have opened a stack of reviews on Phabricator
> > > (https://reviews.freebsd.org/D34717) that is meant to split the
> > > code into smaller parts so it can be more easily reviewed. A brief
> > > history of the implementation can be found at the bottom of this
> > > email.
> > >
> > > The migration mechanism we propose needs two main components in
> > > order to move a virtual machine from one host to another:
> > > 1. the guest's state (vCPUs, emulated and virtualized devices)
> > > 2. the guest's memory
> > >
> > > For the first part, we rely on the suspend/resume feature. We
> > > call the same functions as the ones used by suspend/resume, but
> > > instead of saving the data in files, we send it via the network.
> > >
> > > The most time consuming aspect of migration is transmitting guest
> > > memory. The UPB team has implemented two options to accomplish
> > > this:
> > > 1. Warm Migration: The guest execution is suspended on the source
> > > host while the memory is sent to the destination host. This
> > > method is less complex but may cause extended downtime.
> > > 2. Live Migration: The guest continues to execute on the source
> > > host while the memory is transmitted to the destination host.
> > > This method is more complex but offers reduced downtime.
> > >
> > > The proposed live migration procedure (pre-copy live migration)
> > > migrates the memory in rounds:
> > > 1. In the initial round, we migrate all the guest memory (all
> > > pages that are allocated)
> > > 2. In the subsequent rounds, we migrate only the pages that were
> > > modified since the previous round started
> > > 3. In the final round, we suspend the guest, migrate the
> > > remaining pages that were modified from the previous round and
> > > the guest's internal state (vCPU, emulated and virtualized
> > > devices).
> > >
> > > To detect the pages that were modified between rounds, we propose
> > > an additional dirty bit (virtualization dirty bit) for each
> > > memory page. This bit would be set every time the page's dirty
> > > bit is set. However, this virtualization dirty bit is reset only
> > > when the page is migrated.
> > >
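
A toy model of the two bits may help here. It assumes, as the
description implies but does not state outright, that the ordinary
dirty bit can be cleared for reasons unrelated to migration. The class
and method names are hypothetical.

```python
class PageFlags:
    """Tracks an ordinary dirty bit and a virtualization dirty bit."""

    def __init__(self, npages):
        self.dirty = [False] * npages        # ordinary dirty bit
        self.virt_dirty = [False] * npages   # proposed additional bit

    def write(self, idx):
        # A guest write sets both bits at the same time.
        self.dirty[idx] = True
        self.virt_dirty[idx] = True

    def clear_dirty(self, idx):
        # Assumed: something other than migration clears the ordinary
        # bit. The virtualization bit must survive this.
        self.dirty[idx] = False

    def collect_for_migration(self):
        # Only migration resets the virtualization bit.
        pages = [i for i, d in enumerate(self.virt_dirty) if d]
        for i in pages:
            self.virt_dirty[i] = False
        return pages
```

In this model a page that was written and then had its ordinary bit
cleared is still reported in the next migration round, which is the
property the second bit is meant to provide.
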
> > > The proposed implementation is split in two parts:
> > > 1. The first one, the warm migration, is just a wrapper on the
> > > suspend/resume feature which, instead of saving the suspended
> > > state on disk, sends it via the network to the destination
> > > 2. The second part, the live migration, uses the layer previously
> > > presented, but sends the guest's memory in rounds, as described
> > > above.
> > >
> > > The migration process works as follows:
> > > 1. we identify:
> > >  - VM_NAME - the name of the virtual machine which will be
> > > migrated
> > >  - SRC_IP - the IP address of the source host
> > >  - DST_IP - the IP address of the destination host
> > >  - DST_PORT - the port we want to use for migration (default is
> > > 24983)
> > > 2. we start a virtual machine on the destination host that will
> > > wait for a migration. Here, we must specify SRC_IP (and the port
> > > we want to open for migration, default is 24983).
> > > e.g.: bhyve ... -R SRC_IP:24983 guest_vm_dst
> > > 3. using bhyvectl on the source host, we start the migration
> > > process.
> > > e.g.: bhyvectl --migrate=DST_IP:24983 --vm=guest_vm
> > >
> > > A full tutorial on this can be found here:
> > > https://github.com/FreeBSD-UPB/freebsd-src/wiki/Virtual-Machine-Migration-using-bhyve
> > >
> > > For sending the migration request to a virtual machine, we use
> > > the same thread/socket that is used for suspend.
> > > For receiving a migration request, we used a similar approach to
> > > the resume process.
> > >
> > > As some of you may remember seeing similar emails from us on the
> > > freebsd-virtualization list, I'll present a brief history of this
> > > project:
> > > The first part of the project was the suspend/resume
> > > implementation which landed in bhyve in 2020, under the
> > > BHYVE_SNAPSHOT guard (https://reviews.freebsd.org/D19495).
> > > After that, we focused on two tracks:
> > > 1. adding various suspend/resume features (multiple device
> > > support - https://reviews.freebsd.org/D26387, CAPSICUM support -
> > > https://reviews.freebsd.org/D30471, having a uniform file format
> > > - at that time, during the bhyve bi-weekly calls, we concluded
> > > that the JSON format was the most suitable -
> > > https://reviews.freebsd.org/D29262) so we can remove the #ifdef
> > > BHYVE_SNAPSHOT guard.
> > > 2. implementing the migration feature for bhyve. Since this one
> > > relies on the save/restore, but does not modify its behaviour, we
> > > considered we can go in parallel with both tracks.
> > > We had various presentations in the FreeBSD Community on these
> > > topics: AsiaBSDCon2018, AsiaBSDCon2019, BSDCan2019, BSDCan2020,
> > > AsiaBSDCon2023.
> > >
> > > The first patches for warm and live migration were opened in
> > > 2021: https://reviews.freebsd.org/D28270,
> > > https://reviews.freebsd.org/D30954. However, the general feedback
> > > on these was that the patches are too big to be reviewed, so we
> > > should split them in smaller chunks (this was also true for some
> > > of the suspend/resume improvements). Thus, we split them into
> > > smaller parts.
> > > Also, as things changed in bhyve (i.e., capsicum support for
> > > suspend/resume was added this year), we rebased and updated our
> > > reviews.
> > >
> > > Thank you,
> > > Elena
> > >
> >

-- 
Kind regards,
Corvin



