From nobody Fri Jul 14 17:39:58 2023
X-Original-To: virtualization@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4R2dzJ01Jmz2tryk
	for <virtualization@mlmmj.nyi.freebsd.org>; Fri, 14 Jul 2023 17:40:12 +0000 (UTC)
	(envelope-from mgrooms@shrew.net)
Received: from mx1.shrew.net (mx1.shrew.net [38.97.5.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id 4R2dzH25Mdz4564
	for <virtualization@freebsd.org>; Fri, 14 Jul 2023 17:40:11 +0000 (UTC)
	(envelope-from mgrooms@shrew.net)
Authentication-Results: mx1.freebsd.org;
	dkim=none;
	spf=pass (mx1.freebsd.org: domain of mgrooms@shrew.net designates 38.97.5.131 as permitted sender) smtp.mailfrom=mgrooms@shrew.net;
	dmarc=none
Received: from mail.shrew.net (mail.shrew.prv [10.24.10.20])
	by mx1.shrew.net (8.15.2/8.15.2) with ESMTP id 36EHe4Mj085556
	for <virtualization@freebsd.org>; Fri, 14 Jul 2023 12:40:04 -0500 (CDT)
	(envelope-from mgrooms@shrew.net)
Received: from [10.22.200.32] (unknown [136.49.230.220])
	by mail.shrew.net (Postfix) with ESMTPSA id E7997194F36
	for <virtualization@freebsd.org>; Fri, 14 Jul 2023 12:39:58 -0500 (CDT)
Message-ID: <79fabe94-b800-c713-48ea-518da1f74e4d@shrew.net>
Date: Fri, 14 Jul 2023 12:39:58 -0500
List-Id: Discussion <freebsd-virtualization.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-virtualization
List-Help: <mailto:virtualization+help@freebsd.org>
List-Post: <mailto:virtualization@freebsd.org>
List-Subscribe: <mailto:virtualization+subscribe@freebsd.org>
List-Unsubscribe: <mailto:virtualization+unsubscribe@freebsd.org>
Sender: owner-freebsd-virtualization@freebsd.org
X-BeenThere: freebsd-virtualization@freebsd.org
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:102.0) Gecko/20100101
 Thunderbird/102.12.0
From: Matthew Grooms <mgrooms@shrew.net>
Subject: Re: BHYVE SNAPSHOT image format proposal
To: virtualization@freebsd.org
References: <67FDC8A8-86A6-4AE4-85F0-FF7BEF9F2F06@gmail.com>
 <6b98da58a5bd8e83bc466efa20b5a900298210aa.camel@FreeBSD.org>
 <8387AC83-6667-48E5-A3FA-11475EA96A5F@gmail.com>
 <d92db9bfbea181d6eb9d57b579d67e8e118ef4de.camel@FreeBSD.org>
 <986A83D8-E0E0-4030-9369-A5B3500E27C6@gmail.com>
Content-Language: en-US
In-Reply-To: <986A83D8-E0E0-4030-9369-A5B3500E27C6@gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.6.2 (mx1.shrew.net [10.24.10.10]); Fri, 14 Jul 2023 12:40:04 -0500 (CDT)
X-Spamd-Result: default: False [-1.21 / 15.00];
	NEURAL_HAM_SHORT(-1.00)[-1.000];
	NEURAL_SPAM_LONG(0.98)[0.984];
	NEURAL_HAM_MEDIUM(-0.89)[-0.890];
	R_SPF_ALLOW(-0.20)[+mx];
	MIME_GOOD(-0.10)[text/plain];
	FROM_EQ_ENVFROM(0.00)[];
	MLMMJ_DEST(0.00)[virtualization@freebsd.org];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	MIME_TRACE(0.00)[0:+];
	R_DKIM_NA(0.00)[];
	ASN(0.00)[asn:174, ipnet:38.0.0.0/8, country:US];
	DMARC_NA(0.00)[shrew.net];
	RCVD_COUNT_THREE(0.00)[3];
	RCPT_COUNT_ONE(0.00)[1];
	FROM_HAS_DN(0.00)[];
	ARC_NA(0.00)[];
	RCVD_TLS_LAST(0.00)[];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	PREVIOUSLY_DELIVERED(0.00)[virtualization@freebsd.org];
	TO_DN_NONE(0.00)[];
	MID_RHS_MATCH_FROM(0.00)[]
X-Rspamd-Queue-Id: 4R2dzH25Mdz4564
X-Spamd-Bar: -
X-ThisMailContainsUnwantedMimeParts: N

On 6/7/23 13:25, Vitaliy Gusev wrote:
> Hi Corvin,
> 
>> On 6 Jun 2023, at 15:59, Corvin Köhne <corvink@FreeBSD.org> wrote:
>>> ...
>>
>> We may have different version of the format from the same produce.
>> IMHO, it makes sense to have a dedicated IDENT and VERSION field to
>> easily figure out
>>
>> 1) if the producer of the image is known
>> 2) if we support that version of the producer
>>
>> Even if you allocated a huge amount of free space, someone would need
>> more. So, what do you think about this format:
>>
>> +---------------------------------------------------------------------+
>> | IDENT - 16 BYTES                                                    |
>> +-------------------+-----------------------+-------------------------+
>> | VERSION - 4 BYTES | NVLIST SIZE - 4 BYTES | NVLIST OFFSET - 8 BYTES |
>> +-------------------+-----------------------+-------------------------+
>> | POSSIBLE FREE SPACE (e.g. for custom data, alignment etc.)          |
>> +---------------------------------------------------------------------+
>> | NVLIST DATA                                                         |
>> +---------------------------------------------------------------------+
>> | POSSIBLE FREE SPACE (for whatever reason)                           |
>> +---------------------------------------------------------------------+
>> | SNAPSHOT DATA                                                       |
>> +——————————————————————————————————+
>>

I agree. The only values that need to exist in the initial header would 
be the identity and version. This could be as simple as "BHYVE" which is 
similar to how common network protocols identify themselves ...

SSH-protoversion-softwareversion SP comments CR LF

Note the use of a carriage return + line feed to signify the identity + 
version string. It avoids the need to reserve a specific number of bytes 
for this purpose and offers more free form expression of values. If you 
want to include a stream subtype, you could add another field resulting 
in something like ...

IDENT-SUBTYPE-VERSION CR LF

I think it would also be helpful to borrow the concept of using generic 
data stream section headers. Instead of expecting a fixed number bytes 
after the identity + version, you can append section headers that define 
the type and the length of each. For example ...

+-----------------------------------+
| bhyve-checkpoint-1.0\n\r          |
+-----------------------------------+
| TYPE - 2 BYTES | LENGTH - 8 BYTES |
+-----------------------------------+
| DATA ...                          |
+-----------------------------------+
| TYPE - 2 BYTES | LENGTH - 8 BYTES |
+-----------------------------------+
| DATA ...                          |
+-----------------------------------+

Example section types:

0x0001 = NVLIST A
0x0002 = NVLIST B
0x0003 = MEMORY
0x0004 = ...
0xFFFF = CHECKSUM

Any number of sections could be included in a single file. Sections can 
be added or without breaking the stream format. Required sections for 
the subtype could be accounted for and simple integrity checks could be 
performed based on section length. Checksum validation can be performed 
by only parsing the trailing checksum section and the rest as raw input. 
The same basic structure could also be used in a network stream so there 
is potential for code reuse.

> 
> Note, simple string "BHYVE CHECKPOINT IMAGE” has 22 bytes. So 16 bytes seems
> too small.
> 
> So I would not to complicate a header first.
> 
> I would rather describe ideas, conditions and then solutions:
> 
>  1. Need to distinguish snapshot image file from other files.
>     _Solution_: Header should have "magic id”.
> >  2. Need a barrier for resuming if image "is not ours”. Idea is not to
>     allow to resume images from other producers.
> 
>     The reason to have it and use it instead of header versioning:
> 
>     Imagine that mainstream has its own implementation and company’s
>     fork repo has its own implementation. How to *ensure* that the
>     *versions* in an image file are ours and not somebody’s else?
> 
>     _Solution_:  Header should have “Producer id” string.
> 
>     Example: Snapshot image file has empty Producer string , but bhyve
>     has current Producer as “MYCOMPANY”. Strings are not equal, resume
>     must fail.
> 

It would appear that you'd like the provider to be a string. I'm not 
sure why this couldn't be added by the provider as a value in an nvlist 
section. The consumer can filter on that value. Why should it be part of 
the stream format?

>  3. The Rule above does not restrict getting/decoding data from an image
>     file. It should be possible to look at an image file and analyse
>     internals, to get/decode values, etc.
>     _Solution_: Have additional option either in bhyve or bhyvectl to
>     get into image file.
> 

See my final comments regarding encoding.

>  4. Following nvlist header data should have a short information about
>     image file and its internals.
>     _Solution_: NV HEADER can have several sections: “config”, “kernel”,
>     “devices”, “memory”, …
>

See my final comments regarding encoding.

>  5. Versioning of NV HEADER. Idea is to have an information in
>     advance whether it is possible to be resumed or not. In other words,
>     before do resume, get information about ability to resume.
> 
>     _Solution_: Each Section should have “version”  and “subversion”.
>     While “version” is responsible for both  types of compatibility:
>     backward and forward, “subversion” is for forward compatibility only.
> 
>     _Rules for check_:
>              If bh_version == version && bh_subversion >= subversion  then
>                        Bhyve able to resume the Section
>              Else
>                        Bhyve cannot resume the Section
>              Endif
>     _
>     Example 1_: Section in image has “version=1", “subversion=5”,  Bhyve
>     has “version=1", “subversion=6". That means, bhyve can resume the
>     Section.
> 
>     _Example 2_: The same image Section, but bhyve has “version=1",
>     “subversion=4". Bhyve cannot resume the Section.
> 
>     Example 3: The same image Section, but bhyve has “version=2",
>     “subversion=5”. Bhyve cannot resume the Section.
> 
>     _Rules for increasing versions_:
>        -  If during code-change “backward” compatibility is broken,
>     “version” should be increased and “subversion” is set to 0.
>        -  If during code-change “forward” compatibility is broken,
>     “subversion” should be increased.
> 

More version values could be included I suppose, but it seems overly 
complicated to me. I see most of the change over time being in nvlist 
value sections. There's a lot of flexibility and validation that can 
occur there before pushing version numbers down into individual stream 
sections, but I guess I could be convinced otherwise.

In any case, we need to be careful to not rely too much on version 
checks. Otherwise it will be difficult to snapshot/restore/migrate 
between different versions of bhyve.

>  6. Other versioning in HEADER is redundant. If something is changed in
>     the format, “magic id” can be changed appropriately.
>     _Solution_: “magic id” should be stable and not changed for a long time.
> 

I think that's what the major version would be used for.

> 
> As result I would suggest to give at least 32 bytes for "magic id” / 
> ident and 32 bytes for “producer id”.
> > Format of entire image file can be:
> 
> 
> +-----------------------------------------------------------------+
> |               HEADER MAGIC ID     - 32 BYTES                    |
> +-----------------------------------------------------------------+
> |               HEADER PRODUCER ID  - 32 BYTES                    |
> +———————————————----------------————————--—-----------------------+
> |               NVLIST HEADER SIZE  -  4 BYTES                    |
> +-----------------------------------------------------------------+
> |               NVLIST HEADER DATA (SECTIONS)                     |
> +-----------------------------------------------------------------+
> |                       SNAPSHOT DATA                             |
> +-----------------------------------------------------------------+
> 
> 
> _MAGIC ID_: should be hardcoded: "BHYVE CHECKPOINT IMAGE”.
> 
> _PRODUCER ID_: can be empty and supported by producer, i.e. reserved.
> _
> _
> _NVLIST HEADER SIZE_: has enough dimension, but in general size is less 
> than 4KB
> 
> _NVLIST HEADER DATA_: Packed nvlist data, contains Sections:  “config”, 
> “kernel”, “devices”, “memory”, … :
> 
>     [config]
> 
>              offset = 0x1000 (4096)
> 
>              size = 0x1f6 (502)
> 
>              type = text
> 
>     vers = 1
>     subvers = 5
> 
>     [kernel]
> 
>              offset = 0x11f6 (4598)
> 
>              size = 0x19a7 (6567)
> 
>              type = nvlist
> 
>     vers = 1
>     subvers = 0
> 
>     [devices]
> 
>              offset = 0x2b9d (11165)
> 
>              size = 0x10145ba (16860602)
> 
>              type = nvlist
> 
>     vers = 2
>     subvers = 1
> 
>     [memory]
> 
>              offset = 0x1200000 (18874368)
> 
>              size = 0x3ce00000 (1021313024)
> 
>              type = pages
> 
>     vers = 1
>     subvers = 0
>

I see a need to define a format for bhyve so it's possible to mix 
different sections and encodings inside a unified stream. But all the 
data in your nvlist example above can be easily be represented as text. 
We already have JSON, YAML, XML, etc ... By adopting an preexisting 
format, we could retain the snapshot structure instead of lifting it up 
into the stream format. Even if we decide to break the structure up into 
different nvlist stream sections, using a common format would allow 
other tools to more easily parse and validate the structure inside these 
sections. Isn't that a good thing? Is there a reason we're trying to 
reinvent the wheel here?

Thanks,

-Matthew