Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 05 Jun 2023 17:32:57 +0200
From:      Corvin =?ISO-8859-1?Q?K=F6hne?= <corvink@FreeBSD.org>
To:        Vitaliy Gusev <gusev.vitaliy@gmail.com>, virtualization@freebsd.org
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: BHYVE SNAPSHOT image format proposal
Message-ID:  <6b98da58a5bd8e83bc466efa20b5a900298210aa.camel@FreeBSD.org>
In-Reply-To: <67FDC8A8-86A6-4AE4-85F0-FF7BEF9F2F06@gmail.com>
References:  <67FDC8A8-86A6-4AE4-85F0-FF7BEF9F2F06@gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help

[-- Attachment #1 --]
On Tue, 2023-05-23 at 19:05 +0300, Vitaliy Gusev wrote:
> Hi,
> 
> Here is a proposal for bhyve snapshot/checkpoint image format
> improvements.
> 

Hi Vitaliy,

thank you very much for sending this proposal to the mailing list.

> It implies moving snapshot code to nvlist engine. 
> 
> Current snapshot implementation has disadvantages:
> 
>  * 3 files per snapshot: .meta, .kern, vram
>  * Binary Stream format of data.
>  * Adding  optional variable - breaks resume
>  * Removing variable - breaks resume
>  * Changing saved order of variables - breaks resume
>  * Hard to get information about what is saved and decode.
>  * Hard to debug if somethings goes wrong
>  * No versions. If change code, resume of an old images can be
>    passed, but with UB.
> 
> New nvlist implementation should solve all things above. The first
> step -
> improve snapshot/checkpoint saving format. It eliminates three files
> usage
> per a snapshot.
> 
> 
> 1. BHYVE SNAPSHOT image format:  
> 
> > +———————————————————————————————————————+
> > |      HEADER PHYS - 4096 BYTES         |
> > +———————————————————————————————————————+
> > |                                       |
> > |                DATA                   |
> > |                                       |
> > +———————————————————————————————————————+
> 
> 2. HEADER PHYS format: 
> 
>  0    +—————————————————————————————————————————+ 
>       |        IDENT STRING  - 64 BYTES         |
>  64   +—————————————————————————————————————————+   
>       | NVLIST SIZE  - 4 BYTES   |  NVLIST DATA |
>  72   +—————————————————————————————————————————+
>       |                                         |
>       |               NVLIST DATA               |
>       |                                         |
>  4096 +—————————————————————————————————————————+
> 
> > 
> > IDENT STRING - Each producer can set its own value to specify
> > image.
> > NVLIST SIZE  - The following packed header nvlist data size.
> > NVLIST DATA - Packed nvlist header data.
> > 
> > 4KB should be enough for the HEADER to keep basic information about
> > Sections. However, it can
> > be enlarged lately, without breaking backward compatibility. 
> > 

I can't see an advantage of using a fixed sized header of 4KB. You have
to parse the offset and size of every section anyways. If it's for
alignment requirements you can still align all sections on save and set
the offset accordingly. So, why complicating things by using a fixed
header size?

The IDENT STRING seems to be very large. Even a GUID which should be a
global unique identifier uses just 16 Bytes. Additionally, it might be
worth using a dedicated ident and version field for an easier version
parsing. E.g.:

+------------------+-------------------+
| IDENT - 56 BYTES | VERSION - 8 BYTES |
+------------------+-------------------+

IDENT - "BHYVE CHECKPOINT IMAGE"
VERSION - 1 (as uint64_t)

Btw: I don't care but here we could leave some free space for possible
forward compatibility. E.g.:

+------------------+-------------------+-------------------------+
| IDENT - 16 BYTES | VERSION - 8 BYTES | _FREE_SPACE_ - 40 BYTES |
+------------------+-------------------+-------------------------+

> 3. NVLIST HEADER consists of Sections in the following format:
> 
>  * Name - string
>  * Type:    string:
>     - “text,   - plain text,
>     - “nvlist” - packed nvlist,
>     - “binary” - raw binary data.
>  * Size - Size of section - uint64
>  * Offset - Offset in image format - uint64
> > 
>     Predefined sections:  “config”, “devices”, “kernel”, “memory”. 
> 

LGTM.

> 
> 4. EXAMPLE:
> 
> 
>  IDENT STRING:
> 
>        "BHYVE CHECKPOINT IMAGE VERSION 1"
> 
>  NVLIST HEADER: 
> 
>   [config]
>         config.offset = 0x1000 (4096)
>         config.size = 0x1f6 (502)
>         config.type = "text"
>   [kernel]
>         kernel.offset = 0x11f6 (4598)
>         kernel.size = 0x19a7 (6567)
>         kernel.type = “nvlist"
>   [devices]
>         devices.offset = 0x2b9d (11165)
>         devices.size = 0x10145ba (16860602)
>         devices.type = "nvlist"
>   [memory]
>         memory.offset = 0x1200000 (18874368)
>         memory.size = 0x3ce00000 (1021313024)
>         memory.type = “binary"
> 
>  SECTIONS:
> 
>  [section "config" size 0x1f6 offset 0x1000]:
> > memory.size=1024M
> > x86.strictmsr=true
> > x86.vmexit_on_hlt=true
> > cpus=2
> > acpi_tables=true
> > pci.0.0.0.device=hostbridge
> > pci.0.31.0.device=lpc
> > pci.0.4.0.device=virtio-net
> > pci.0.4.0.backend=tap0
> > pci.0.7.0.device=fbuf
> > pci.0.7.0.tcp=10.42.0.78:5900
> > pci.0.7.0.w=1024
> > pci.0.7.0.h=768
> > pci.0.5.0.device=ahci
> > pci.0.5.0.port.0.type=cd
> > pci.0.5.0.port.0.path=/ISO/ubuntu-22.04.1-live-server-amd64.iso
> > lpc.bootrom=/usr/local/share/uefi-firmware/BHYVE_UEFI.fd
> > checkpoint.date="Wed Jan 25 23:48:29 2023"
> > name=ubuntu22
> 

Not sure if it's just an example for the "text" type. bhyve converts it
into a nvlist, so it could be saved directly as nvlist.

Btw: I would only implement the "text" type if there's an usecase that
can't be solved by one of the other types.

>  [section "kernel" size 0x19a7 offset 0x11f6]:
>    [vm]
>         vm.vds_version = 0x1 (1)
>         vm.cpu0.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00
> 00 00 00 ...  size=0x28
>         vm.cpu1.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00
> 00 00 00 ...  size=0x28
>         vm.checkpoint_tsc = 0xe2e0ac6fbe456 (3991273496896598)
>    [hpet]
>         hpet.vds_version = 0x1 (1)
>         hpet.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 ...  size=0x118
>    [vmx]
>         vmx.vds_version = 0x1 (1)
>         vmx.cpu_features = 0 (0)
>         vmx.cpu0.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 40 B4 21
> B9 FF FF FF FF ...  size=0x288
>         vmx.cpu1.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 00 00 67
> 41 D8 9B FF FF ...  size=0x288
>    [ioapic]
>         ioapic.vds_version = 0x1 (1)
>         ioapic.data(BINARY): 00 00 01 00 00 00 00 00 00 00 00 00 00
> 00 00 00 ...  size=0x208
>    [lapic]
>         lapic.vds_version = 0x1 (1)
>         lapic.cpu0.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 ...  size=0x460
>         lapic.cpu1.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 ...  size=0x460
>    [atpit]
>         atpit.vds_version = 0x1 (1)
>         atpit.data(BINARY): 00 00 00 00 00 00 00 00 54 AD 51 97 0F 0E
> 00 00 ...  size=0xa0
>    [atpic]
>         atpic.vds_version = 0x1 (1)
>         atpic.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 01 00
> 00 00 ...  size=0x84
>    [pmtimer]
>         pmtimer.vds_version = 0x1 (1)
>         pmtimer.uptime = 0x26fd133e5cc (2679274464716)
>    [rtc]
>         rtc.vds_version = 0x1 (1)
>         rtc.data(BINARY): 0A 00 00 00 00 FE FF FF 10 35 13 3D 40 F7
> 14 00 ...  size=0x98
> 
> —
> Thanks,
> Vitaliy Gusev
> 

All in all, it looks good. Keep on your work!

Regards checksum feature:
We should focus on enabling this feature by default before adding
advanced features. So, keep it simple and small.

Regards forward compatibility:
Backward compatibility is way more important than forward
compatibility. Nevertheless, forward compatibility would be nice to
have. So, we should keep it in mind when modifying the layout. For the
moment, just focus on a format which is backward compatible.


-- 
Kind regards,
Corvin

[-- Attachment #2 --]
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEgvRSla3m2t/H2U9G2FTaVjFeAmoFAmR+ACkACgkQ2FTaVjFe
AmqHiA//Z+21VQdlu4imactssAT4+30yMrPULmDp98ncd7Z+jyixmXNa8+i3b0Wb
XuTMBVERSGRxuaUjca2ChG98qlI3liLVgvRuvZqT2mi7VwGaQlAnMaEkL40mxq6M
H+y9zUnOXKUGjTHS2nKCJflPihcCx/Nj0uif4WiEePCazmbp/KEOMXbrTlEIUNC/
39OHjG0q2BPLp1P0c0eF4UJLXr8P0ieeeWDKWSSaLscJs59ON+H0cn8z28AbWxTk
KBo3i4UwXS/Ei/VIVmM2I3Xg51CeCEXdlpg8vf+/MNqiQ4/xUvQdDBdbzgUVAmwG
fhuKVds/DJa3vJPNHiS/EoLFOeyo5IXVQhowi2OVzDznwRXlmGcGHJ80/XDmHuZr
+N6ekQJVOH/OUrNcsFlCWenWmWYeajIq9F2EWBOnnLy6jnieQ6qCC0Tflkn1wvIN
FmS1qg2z7Y5OIxOKMOKfK8vckiaaAMrK76GehNjqBI3/7FzO7rMFqz1CcCPfkS9E
atf2AaSCIFiBJpTFyj2HZwdlL7nKNh8d7raACyC3G3MRMZgCZ8Fhiyp7t+LL4Dfk
F303H2zaFCo6lZR4GpRBsadmxo8QnXPw42eWLDy4kBP1WoA+TMbk8mnK/eONDvYz
aYmTEOcJIdIylWK/S2p8bPs1ydeBCF9e6xbK7ErGBTbtnprVKcs=
=oFeS
-----END PGP SIGNATURE-----

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?6b98da58a5bd8e83bc466efa20b5a900298210aa.camel>