virtio-comment message

Subject: RE: [virtio-comment] [RFC PATCH v2] virtio-rtc: Add device specification.
From: Parav Pandit <parav@nvidia.com>
To: Peter Hilber <peter.hilber@opensynergy.com>, "virtio-comment@lists.oasis-open.org" <virtio-comment@lists.oasis-open.org>
Date: Mon, 21 Aug 2023 05:42:46 +0000
> From: Peter Hilber <peter.hilber@opensynergy.com>
> Sent: Friday, August 4, 2023 9:38 PM
> 
> On 31.07.23 05:15, Parav Pandit wrote:
> >
> >> From: Peter Hilber <peter.hilber@opensynergy.com>
> >> Sent: Thursday, July 27, 2023 3:51 PM
> >
> >
> >>>> In the spec, clock ids are 64 bit wide to have unique ids when
> >>>> supporting clock
> >>> It is unclear the namespace of uniqueness.
> >>> Do you mean unique id across multiple virtio devices?
> >>
> >> The idea was not to have a unique id across multiple virtio devices.
> >> Rather, the idea was to simplify lifetime management for clocks in a
> >> single virtio device (if such lifetimes would be introduced in the
> >> future) by not having to worry about race conditions due to clock id reuse.
> >>
> > If the id not unique, 16-bit is likely more than enough.
> > There should not be any race condition with respect to clock id as its per
> device.
> >
> 
> I think 64 bit is better since the cost is negligible, and e.g. the device then does
> not have to care how it issues ids while the driver is suspended.

Can you please explain how 64-bit better when the number of clocks per device is not so much?
Even with 256 or 64K clocks device wont care how the ids are issued.

> > Both the modes are in use, hence we want to support #1 and #2 both without
> duplicating the interfaces.
> >
> 
> What I wanted to refer to in my reply are the Linux kernel /dev/ptp* interfaces,
> which are separate from the network device interfaces.
> 
Sure. It can be.
When the netdevice has the RTC features supports it will create /dev/ptp* and /sys/class/net/netdev devices.

All the plumbing that is done is for dual use.
So command etc definitions is for dual use as I explained before.
Those can be used for virtio-net device when it has in-built rtc.

When there is only rtc device with its own device-type, same functionality and commands are there.
Driver will create only /dev/ptp* device.

From SW point of view it is just matter of layering..
Already done by more than 13+ industry devices..

> >> The extra effort to implement another simple virtio device (rather
> >> than putting everything in a single device) seems to be small when
> >> compared to the complexity of creating an extensible specification
> >> which should work both standalone and as part of an already existing, but
> still growing, device.
> > It may sound as small effort with only a focused view on RTC only device.
> > But both types are needed.
> > From actual implementation point of view and use case wise, yet another PCI
> device at scale of virtual machines is not desired, especially when it has tight
> usage completing on network device.
> >
> >
> 
> If the issue is PCI transport related resource consumption, the PCI transport
> could be extended to support multi-function devices, in a way which is agnostic
> to the device types involved.
> 
All things needed by rtc is already present on net device. So there is no point of improving PCI transport etc.
PCI transport resource improvement goes just beyond PCI transport improvement. Such as APIC, IOMMU, device registers and more.
This is equally an overkill.

And defining and duplicating exact same commands in net is equally sub-optimal, specially when we know its need for last several months.

This is the right junction to reuse functionality on two device types without duplication.

> >> E.g., the
> >> RTC device should also get an alarm, with an associated eventq, in the
> future.
> > Sure, it is ok. it will be optional when it is RTC device.
> >
> >> Also, I wonder if the driver implementation should then work both as
> >> a standalone Virtio driver and as part of the net driver.
> >>
> > Driver layer will be fine. Linux stack evolved over last 3 years to support
> multi-functionality device.
> > A RTC/PTP driver will be a stand alone entity that works on underlying layered
> way to issue read cmd, config etc.
> >
> 
> As per my above reply, I think it would be better to tackle the problem at the
> PCI transport level, then there would be no impact on individual device drivers.
> 
There is no impact by better definition usable on two devices.

> >> Maybe the above use case could also be addressed by a standalone RTC
> >> device as follows:
> >>
> >> 1) The net device exports a UUID object to which the RTC device can
> >> refer to (so the drivers can link the net device to the RTC clock).
> >>
> >> 2) If needed, the RTC device aligns the lifetime of the clock to the net device.
> >> Then, clock lifetimes are no longer static, and can be aligned with
> >> the net device. Through an eventq (to be added), the RTC device
> >> announces clock creation/destruction, which can be synchronized to the net
> device state.
> >>
> > All of the infrastructure, cross device synchronization, its device side
> management will come with equal complexity.
> > And we will end up with sub-optimal way to using multiple PCI devices.
> >
> 
> As per my above reply, I think it would be better to tackle the problem at the
> PCI transport level...
> 
> > Net already has cvq required by RTC. So cvq commands can be common.
> > Readq is additional, we need to define common between the two and way to
> discover it.
> >
> 
>  I would still try to avoid mixing net and RTC details. RTC
> should keep a dedicated cvq. 
Can you please explain why a dedicate cvq is needed on net device?

> Then the RTC driver could use generic Virtio
> functionality to send/receive rather than using an interface which is specific to
> the net driver cvq.
> 
Please explain why?
We have dedicated vqs work ongoing for flow filters due to high rate of requests requirement handled by hw.

So would like to understand if we can extend similar dedicated cvq. Given most of the queries done via cvq for rts are not time sensitive.
Hence net cvq seems enough.

> >> Actually, the cross-timestamping support can be specific to the
> >> combination of clock and hardware counter. This is explained in more
> >> detail in the context of VIRTIO_RTC_M_READ_CROSS. I will mention this
> here as well.
> >>
> > Hardware counter is a misleading term here. It is really platform (system)
> specific identifier.
> >
> 
> Maybe with the currently defined hardware counters there is a one-to-one
> correspondence to platforms, but that need not be the case. E.g. it might be
> more convenient for a device on an Intel x86 platform to read the Always
> Running Timer (ART) rather than the TSC derived from the ART, and then the
> ART could become another HW counter in the spec for x86.
>
Maybe I am wrong, but it looks like this is platform feature where driver can device to read ART or ARM timer without any device knowledge?

Can you please explain, if the device does not have the knowledge of platform/system timer, can it still achieve the cross timestamping at driver level?
 
> > Maybe I am wrong, but I think cross timestamping functionality can be
> achieved at the driver level, reading counter value from two sources?
> > Unlikely? because the variance is lot?
> >
> 
> In short, cross-timestamping at the device side can be
> 
> - more precise,
> - practically insensitive to temporal behavior, and
> - less resource demanding due to less need to repeat clock readings,
> 
> if the device has a clock based on an appropriate hardware counter.
> 
> With the cross-timestamping, the device can convey its understanding of the
> current time at the instant at which the hardware counter has a specific value.
> Both the device and the driver should be able to read the same hardware
> counter. When the device repeatedly conveys its understanding of the current
> time and corresponding hardware counter reading, a clock synchronization
> daemon at the driver side can infer the offset and speed of the device's clock
> (relative to the driver's system time, which can also be related to the same
> hardware counter).
> 
> If the driver side derives system time from the hardware counter (and if the
> device does not manipulate the clock ATM), then a driver-side clock
> synchronization daemon can synchronize the system time to a precision of < 10
> ns. For this, there are no special requirements on calibration or clock reading
> performance, except that the hardware counter view by device and driver
> should be synchronized.
> 
> A good illustration of cross-timestamping at the (virtual) device side is the
> ptp_kvm arm64 implementation [3], which uses ktime_get_snapshot() for cross-
> timestamping.
> 
> By contrast, cross-timestamping at the driver side (as one could say
> PTP_SYS_OFFSET_EXTENDED does in the Linux kernel) results in inferior
> precision, is sensitive to clock reading performance, and probably has to be
> repeated more often. From my POV cross-timestamping should be done at the
> device side, if the device has access to suitable hardware counters.
> 
> > If so, why do we need to annotate arm/x86, can it be done without it as just
> system clock/cycles?
> >
> 
Your above detail explanation likely answers why device view of system counter is more precise.

> As discussed above, additional counters may be used in the future.
> 
Do you mean additional system counter?

> Also, on x86, no TSC might be present or used (the Linux on my own notebook
> until recently refused to keep using TSC due to a firmware issue).
> 
> >>> If so, I guess no need of a feature bit as it is not serving any
> >>> meaningful
> >> purpose.
> >>
> >> The only purpose is to potentially save the device some
> >> initialization work or resources in case the feature is not negotiated.
> >>
> > Ok. In that case we should move it to per clock enablement instead of feature
> bit.
> >
> 
> As per my above replies, I think it should be kept clock and hardware counter
> specific. The feature bit could be removed if it seems not helpful.
> 
Ok. lets remove it.

> >>> Each clock information can advertise what is supports, right?
> >>>
> >>
> >> The driver can use VIRTIO_RTC_M_CROSS_CAP to inquire about this (or
> >> the driver could also just try if VIRTIO_RTC_M_READ_CROSS works).
> >>
> > VIRTIO_RTC_M_CROSS_CAP is reliable way to discover the device support.
> > Direct command if it results in error can throw unwanted kernel errors
> resulting in false alarms.
> >
> 
> Agreed. The second option is not mentioned in the v2 spec so no action should
> be necessary.
> 
> >>> If this clock device is a PCI device, on an empty system, a PCI
> >>> device has to do
> >> at least couple of DMA.
> >>> 1. read the descriptor (250 to 300nsec) 2. Read the request (250 to
> >>> 300nsec) 3. Write the response (50 nsec+) 4. raise interrupt.
> >>>
> >>> How is this time delta accounted in clock reading?
> >>> Is driver expected to record system time and subtract it on
> >>> receiving
> >> response?
> >>> Or?
> >>
> >> As a disclaimer, I am no expert for clock reading, so any additional
> >> insights are appreciated very much.
> >>
> >> I think how the driver accounts for the time delta should remain
> >> outside of the scope of the specification. In the current spec the
> >> interpretation of clock reading is behavior internal to the driver.
> > Well, directly reading the clock as device register can be more precise.
> > But this can be limited to small number of clocks.
> >
> 
> This could be added if required. But to me it would only make sense where
> cross-timestamping doesn't work (and where precision is important).
> 
Ok. So this can be follow up post this work.

> >> Also, AFAIU interpretation of clock readings is a scientific and
> >> technical topic of its own, which has been extensively discussed in
> >> the past decades. The choice of a clock reading algorithm implies
> >> trade-offs between precision, implementation complexity, and resource
> >> usage/clock reading speed (when re-reading the clock).
> >>
> >> The Linux kernel has ioctl PTP_SYS_OFFSET_EXTENDED, which will read
> >> the system time immediately before and after each PTP clock reading,
> >> leaving it up to userspace how to interpret the clock reading. This
> >> ioctl can also be used with the RFC Linux kernel driver implementing
> >> this spec [2], which exposes a virtual PTP clock. So in this case, it
> >> would hardly help to stipulate Virtio driver behavior, when the
> >> actual clock reading interpretation happens in userspace daemons using
> PTP_SYS_OFFSET_EXTENDED, such as chrony.
> >>
> >> I see the above discussed VIRTIO_RTC_M_READ method mainly as a
> >> fallback in cases where the VIRTIO_RTC_M_READ_CROSS method would not
> be available.
> >> The VIRTIO_RTC_M_READ_CROSS method is more or less independent of
> the
> >> time delta (which can be in the 100 ms range, as discussed in the
> >> driver patch series cover letter [2]).
> >>
> > Do you mean that the system clock and device clock reading happening in
> single call, hence the _CROSS gives more precise result?
> >
> 
> The clock reading duration is only relevant for the simple VIRTIO_RTC_M_READ
> method. For VIRTIO_RTC_M_READ_CROSS, the clock reading duration is not
> critical, since the measurements are done at the device side. The driver side
> just needs to know what time it would have associated to the hardware counter
> reading returned by the device.
> 
> I think I will add some contents from my replies to the commit message.
> 
> Thanks for the feedback,
> 

I apologize for my slow response. I was not well when you replied.
After that lost track of your email in stacks of email until you sent the newer version.

Lets bring rtc functionality in the two devices. We are almost there, just need to refine the definitions.
References:
- Re: [virtio-comment] [RFC PATCH v2] virtio-rtc: Add device specification.
  - From: Peter Hilber <peter.hilber@opensynergy.com>