virtio-comment message

Subject: Re: [virtio-comment] [RFC PATCH v2] virtio-rtc: Add device specification.
From: Peter Hilber <peter.hilber@opensynergy.com>
To: Parav Pandit <parav@nvidia.com>, "virtio-comment@lists.oasis-open.org" <virtio-comment@lists.oasis-open.org>
Date: Fri, 4 Aug 2023 18:08:15 +0200
On 31.07.23 05:15, Parav Pandit wrote:
> 
>> From: Peter Hilber <peter.hilber@opensynergy.com>
>> Sent: Thursday, July 27, 2023 3:51 PM
> 
> 
>>>> In the spec, clock ids are 64 bit wide to have unique ids when
>>>> supporting clock
>>> It is unclear the namespace of uniqueness.
>>> Do you mean unique id across multiple virtio devices?
>>
>> The idea was not to have a unique id across multiple virtio devices.
>> Rather, the idea was to simplify lifetime management for clocks in a single
>> virtio device (if such lifetimes would be introduced in the
>> future) by not having to worry about race conditions due to clock id reuse.
>>
> If the id not unique, 16-bit is likely more than enough.
> There should not be any race condition with respect to clock id as its per device.
> 

I think 64 bit is better since the cost is negligible, and e.g. the
device then does not have to care how it issues ids while the driver is
suspended.

>>> If so, how is it proposed, we probably need UUID?
>>>
>>
>> So it would be up to the device implementation to assign unique ids for each
>> clock lifetime; incrementing a counter should do.
>>
> Ok. uniqueness is per device hence any scheme by device is fine as you explained.
> 
>>>> hot-plugging in the future.
>>>>
>>>> Any feedback is appreciated very much.
>>>>
>>> This is a good feature.
>>>
>>> A network device has a PHC which is very similar to what is being proposed
>> here which is super set of this proposal.
>>>
>>> A network device will support packet timestamping as phase-1, followed by
>> PTP synchronized UTC/TAI format in phase-2 followed by PTP timestamping
>> offload in phase-3.
>>> Phase-1 is already being proposed using monotonic clock method at [1].
>>>
>>> [1]
>>> https://lists.oasis-open.org/archives/virtio-comment/202306/msg00287.h
>>> tml
>>>
>>> We would like to have a single virtio network device to expose the optional
>> RTC without creating a full blown new virtio device.
>>>
>>> A quick grep in the industry indicates that at least 15+ nics in industry with
>> dual PTP clock + nic functionality.
>>> And we would like virtio network device also to have it.
>>> At minimum such solution avoids creating an additional full blown virtio
>> device.
>>>
>>> As proposed here, RTC device as stand alone itself has its merit and use.
>>>
>>> So lets build the RTC device interface that can be of dual use.
>>> 1. as part of stand alone RTC device
>>> 2. as part of virtio network device
>>>
>>
>> To me it seems preferable to first try to keep the two devices separated. 
> Devices can be separate, the driver to device interface between them should be common to reuse it.
> 
>> AFAIU
>> PTP clocks are also usually represented as a device separate from network
>> devices.
>>
> Both the modes are in use, hence we want to support #1 and #2 both without duplicating the interfaces.
> 

What I wanted to refer to in my reply are the Linux kernel /dev/ptp*
interfaces, which are separate from the network device interfaces.

>> The extra effort to implement another simple virtio device (rather than putting
>> everything in a single device) seems to be small when compared to the
>> complexity of creating an extensible specification which should work both
>> standalone and as part of an already existing, but still growing, device.
> It may sound as small effort with only a focused view on RTC only device.
> But both types are needed.
> From actual implementation point of view and use case wise, yet another PCI device at scale of virtual machines is not desired, especially when it has tight usage completing on network device.
> 
> 

If the issue is PCI transport related resource consumption, the PCI
transport could be extended to support multi-function devices, in a way
which is agnostic to the device types involved.

>> E.g., the
>> RTC device should also get an alarm, with an associated eventq, in the future.
> Sure, it is ok. it will be optional when it is RTC device.
> 
>> Also, I wonder if the driver implementation should then work both as a
>> standalone Virtio driver and as part of the net driver.
>>
> Driver layer will be fine. Linux stack evolved over last 3 years to support multi-functionality device.
> A RTC/PTP driver will be a stand alone entity that works on underlying layered way to issue read cmd, config etc.
>  

As per my above reply, I think it would be better to tackle the problem
at the PCI transport level, then there would be no impact on individual
device drivers.

>> Maybe the above use case could also be addressed by a standalone RTC device
>> as follows:
>>
>> 1) The net device exports a UUID object to which the RTC device can refer to (so
>> the drivers can link the net device to the RTC clock).
>>
>> 2) If needed, the RTC device aligns the lifetime of the clock to the net device.
>> Then, clock lifetimes are no longer static, and can be aligned with the net
>> device. Through an eventq (to be added), the RTC device announces clock
>> creation/destruction, which can be synchronized to the net device state.
>>
> All of the infrastructure, cross device synchronization, its device side management will come with equal complexity.
> And we will end up with sub-optimal way to using multiple PCI devices.
> 

As per my above reply, I think it would be better to tackle the problem
at the PCI transport level...

> Net already has cvq required by RTC. So cvq commands can be common.
> Readq is additional, we need to define common between the two and way to discover it.
> 

... but if not doing that, I would still try to avoid mixing net and RTC
details. RTC should keep a dedicated cvq. Then the RTC driver could use
generic Virtio functionality to send/receive rather than using an
interface which is specific to the net driver cvq.

[snip]

>>> \footnote{The readq is a dedicated virtqueue to facilitate
>>>> +reducing latency for clock readings, which may be more
>>>> +performance-sensitive than other requests.} The driver uses the
>>>> +controlq for other requests to the device.
>>>> +
>>>> +\subsection{Feature bits}\label{sec:Device Types / RTC Device /
>>>> +Feature bits}
>>>> +
>>>> +\begin{description}
>>>> +\item[VIRTIO_RTC_F_READ_CROSS (0)] Device supports cross-
>> timestamping.
>>>> +\end{description}
>>>> +
>>>> +VIRTIO_RTC_F_READ_CROSS determines whether the device supports clock
>>>> +reading with cross-timestamping. Without this feature, the device
>>>> +does not support cross-timestamping. Even when this feature was
>>>> +negotiated, the device may still not support cross-timestamping for
>>>> +some clocks, or for particular requests.
>>>> +
>>> Can you please explain why even after feature negotiation, device may still
>> not support?
>>> Is it because the cross timestamping is per clock feature?
>>
>> Actually, the cross-timestamping support can be specific to the combination of
>> clock and hardware counter. This is explained in more detail in the context of
>> VIRTIO_RTC_M_READ_CROSS. I will mention this here as well.
>>
> Hardware counter is a misleading term here. It is really platform (system) specific identifier.
> 

Maybe with the currently defined hardware counters there is a one-to-one
correspondence to platforms, but that need not be the case. E.g. it
might be more convenient for a device on an Intel x86 platform to read
the Always Running Timer (ART) rather than the TSC derived from the ART,
and then the ART could become another HW counter in the spec for x86.

> Maybe I am wrong, but I think cross timestamping functionality can be achieved at the driver level, reading counter value from two sources?
> Unlikely? because the variance is lot?
> 

In short, cross-timestamping at the device side can be 

- more precise,
- practically insensitive to temporal behavior, and 
- less resource demanding due to less need to repeat clock readings,

if the device has a clock based on an appropriate hardware counter.

With the cross-timestamping, the device can convey its understanding of
the current time at the instant at which the hardware counter has a
specific value. Both the device and the driver should be able to read
the same hardware counter. When the device repeatedly conveys its
understanding of the current time and corresponding hardware counter
reading, a clock synchronization daemon at the driver side can infer the
offset and speed of the device's clock (relative to the driver's system
time, which can also be related to the same hardware counter).

If the driver side derives system time from the hardware counter (and if
the device does not manipulate the clock ATM), then a driver-side clock
synchronization daemon can synchronize the system time to a precision of
< 10 ns. For this, there are no special requirements on calibration or
clock reading performance, except that the hardware counter view by
device and driver should be synchronized.

A good illustration of cross-timestamping at the (virtual) device side
is the ptp_kvm arm64 implementation [3], which uses ktime_get_snapshot()
for cross-timestamping.

By contrast, cross-timestamping at the driver side (as one could say
PTP_SYS_OFFSET_EXTENDED does in the Linux kernel) results in inferior
precision, is sensitive to clock reading performance, and probably has
to be repeated more often. From my POV cross-timestamping should be done
at the device side, if the device has access to suitable hardware
counters.

> If so, why do we need to annotate arm/x86, can it be done without it as just system clock/cycles?
> 

As discussed above, additional counters may be used in the future.

Also, on x86, no TSC might be present or used (the Linux on my own
notebook until recently refused to keep using TSC due to a firmware
issue).

>>> If so, I guess no need of a feature bit as it is not serving any meaningful
>> purpose.
>>
>> The only purpose is to potentially save the device some initialization work or
>> resources in case the feature is not negotiated.
>>
> Ok. In that case we should move it to per clock enablement instead of feature bit.
> 

As per my above replies, I think it should be kept clock and hardware
counter specific. The feature bit could be removed if it seems not
helpful.

>>> Each clock information can advertise what is supports, right?
>>>
>>
>> The driver can use VIRTIO_RTC_M_CROSS_CAP to inquire about this (or the
>> driver could also just try if VIRTIO_RTC_M_READ_CROSS works).
>>
> VIRTIO_RTC_M_CROSS_CAP is reliable way to discover the device support.
> Direct command if it results in error can throw unwanted kernel errors resulting in false alarms.
> 

Agreed. The second option is not mentioned in the v2 spec so no action
should be necessary.

>>> If this clock device is a PCI device, on an empty system, a PCI device has to do
>> at least couple of DMA.
>>> 1. read the descriptor (250 to 300nsec) 2. Read the request (250 to
>>> 300nsec) 3. Write the response (50 nsec+) 4. raise interrupt.
>>>
>>> How is this time delta accounted in clock reading?
>>> Is driver expected to record system time and subtract it on receiving
>> response?
>>> Or?
>>
>> As a disclaimer, I am no expert for clock reading, so any additional insights are
>> appreciated very much.
>>
>> I think how the driver accounts for the time delta should remain outside of the
>> scope of the specification. In the current spec the interpretation of clock reading
>> is behavior internal to the driver.
> Well, directly reading the clock as device register can be more precise.
> But this can be limited to small number of clocks.
> 

This could be added if required. But to me it would only make sense
where cross-timestamping doesn't work (and where precision is
important).

>> Also, AFAIU interpretation of clock readings is a scientific and technical topic of
>> its own, which has been extensively discussed in the past decades. The choice of
>> a clock reading algorithm implies trade-offs between precision, implementation
>> complexity, and resource usage/clock reading speed (when re-reading the
>> clock).
>>
>> The Linux kernel has ioctl PTP_SYS_OFFSET_EXTENDED, which will read the
>> system time immediately before and after each PTP clock reading, leaving it up
>> to userspace how to interpret the clock reading. This ioctl can also be used with
>> the RFC Linux kernel driver implementing this spec [2], which exposes a virtual
>> PTP clock. So in this case, it would hardly help to stipulate Virtio driver behavior,
>> when the actual clock reading interpretation happens in userspace daemons
>> using PTP_SYS_OFFSET_EXTENDED, such as chrony.
>>
>> I see the above discussed VIRTIO_RTC_M_READ method mainly as a fallback in
>> cases where the VIRTIO_RTC_M_READ_CROSS method would not be available.
>> The VIRTIO_RTC_M_READ_CROSS method is more or less independent of the
>> time delta (which can be in the 100 ms range, as discussed in the driver patch
>> series cover letter [2]).
>>
> Do you mean that the system clock and device clock reading happening in single call, hence the _CROSS gives more precise result?
> 

The clock reading duration is only relevant for the simple
VIRTIO_RTC_M_READ method. For VIRTIO_RTC_M_READ_CROSS, the clock reading
duration is not critical, since the measurements are done at the device
side. The driver side just needs to know what time it would have
associated to the hardware counter reading returned by the device.

I think I will add some contents from my replies to the commit message.

Thanks for the feedback,

Peter

[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3bf725699bf62494b3e179f1795f08c7d749f061
Follow-Ups:
- RE: [virtio-comment] [RFC PATCH v2] virtio-rtc: Add device specification.
  - From: Parav Pandit <parav@nvidia.com>