Subject: Re: [virtio-dev] [PATCH v3 2/2] virtio-fs: add DAX window


* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Mon, Jun 24, 2019 at 02:58:08PM +0100, Stefan Hajnoczi wrote:
> > On Tue, Jun 18, 2019 at 09:41:25PM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Feb 20, 2019 at 12:46:13PM +0000, Stefan Hajnoczi wrote:
> > > > Describe how shared memory region ID 0 is the DAX window and how
> > > > FUSE_SETUPMAPPING maps file ranges into the window.
> > > > 
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > > Note that this depends on the shared memory resource specification
> > > > extension that David Gilbert is working on.
> > > > https://lists.oasis-open.org/archives/virtio-comment/201901/msg00000.html
> > > > 
> > > > The FUSE_SETUPMAPPING message is part of the virtio-fs Linux patches:
> > > > https://gitlab.com/virtio-fs/linux/blob/virtio-fs/include/uapi/linux/fuse.h
> > > > ---
> > > >  virtio-fs.tex | 25 +++++++++++++++++++++++++
> > > >  1 file changed, 25 insertions(+)
> > > > 
> > > > diff --git a/virtio-fs.tex b/virtio-fs.tex
> > > > index 5df5b9c..abb1e48 100644
> > > > --- a/virtio-fs.tex
> > > > +++ b/virtio-fs.tex
> > > > @@ -157,6 +157,31 @@ The driver MUST submit FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET reques
> > > >  
> > > >  The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
> > > >  
> > > > +\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> > > > +
> > > > +FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
> > > > +driver-provided buffer and the device.  In cases where data transfer is
> > > > +undesirable, the device can map file contents into the DAX window shared memory
> > > > +region.  The driver then accesses file contents directly in device-owned memory
> > > > +without a data transfer.
> > > > +
> > > > +Shared memory region ID 0 is called the DAX window.  The driver maps a file
> > > > +range into the DAX window using the FUSE\_SETUPMAPPING request.  The mapping is
> > > > +removed using the FUSE\_REMOVEMAPPING request.
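
For concreteness, here is roughly what the mapping request looks like in
the virtio-fs fuse.h linked above; treat the field names and layout as
provisional until the FUSE patches land upstream:

struct fuse_setupmapping_in {
	uint64_t fh;      /* open file handle from FUSE_OPEN */
	uint64_t foffset; /* offset into the file to map */
	uint64_t len;     /* length of the mapping */
	uint64_t flags;   /* FUSE_SETUPMAPPING_FLAG_{READ,WRITE} */
	uint64_t moffset; /* offset into the DAX window */
};

FUSE_REMOVEMAPPING similarly identifies the range to tear down by its
DAX window offset and length.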
> > > 
> > > I don't see FUSE\_SETUPMAPPING or FUSE\_REMOVEMAPPING  under
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h
> > > Is it just me?
> > 
> > They are not upstream yet and can be found here:
> > 
> > https://gitlab.com/virtio-fs/linux/blob/virtio-fs/include/uapi/linux/fuse.h#L384
> > 
> > There is a chicken-and-egg problem.  Linux should merge this once the
> > spec has been accepted.  The spec makes reference to a new FUSE command
> > that is being added to Linux.  :D
> > 
> > I suggest we break it by merging the VIRTIO spec change first.  There
> > won't be a spec release so soon anyway and we can revert it in case
> > there are issues in Linux.  Miklos, the FUSE maintainer, is well aware of
> > virtio-fs and contributes to it, so it's unlikely that Linux will reject
> > these commands.
> > 
> > > > +
> > > > +After FUSE\_SETUPMAPPING has completed successfully the file range is accessible
> > > > +from the DAX window at the offset provided by the driver in the request.
> > > 
> > > Dgilbert's patches describing shared memory say that
> > > the legal ways to set up mappings are all implementation-dependent.
> > > How does the driver know which attributes to use for the
> > > mapping?
> > 
> > Two different types of mappings:
> > 1. The DAX window shared memory region described by DaveG's spec.
> > 2. The file mappings established using FUSE_SETUPMAPPING.
> > 
> > The virtio_fs.ko driver maps the DAX window, e.g. from a PCI BAR in an
> > implementation-defined way.  virtio_pci_*.c in Linux will have to help
> > out with the implementation-specific details here.
> > 
> > The only flags currently supported by FUSE_SETUPMAPPING are READ and
> > WRITE.  This depends on the file's access mode.  There is nothing
> > implementation-specific in FUSE_SETUPMAPPING.
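
For reference, the two flag bits as currently defined in the virtio-fs
tree (again, provisional until the patches are merged):

#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
#define FUSE_SETUPMAPPING_FLAG_READ  (1ull << 1)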
> 
> Sorry - I'm being unclear.
> The guest driver maps parts of the PCI BAR.
> What are the attributes of this mapping?
> This is unrelated to FUSE_SETUPMAPPING things -
> mapping is created by creating PTEs and such
> within guest, not by virtio things.

By attributes you mean... memory ordering, cacheability, etc.?

> 
> > > Also, we recently had a discussion about DAX support on hosts
> > > and safety wrt crashes. Do we need to expose this
> > > information to guests maybe?
> > 
> > No.  Although virtio-fs uses the DAX subsystem, it does not use NVDIMM's
> > persistence model (e.g. CPU cache flush for persistence).  FUSE_FSYNC is
> > sent when persistence is required.  Therefore virtio-fs is still using
> > the traditional file/block persistence model.  No changes necessary for
> > power failure, etc.
> > 
> > > Finally, do we want to have a way to express that the filesystem
> > > only allows RO mappings?
> > 
> > Thanks for this idea.  I'm discussing it with the FUSE community because
> > mount -o ro with FUSE currently doesn't involve the file system daemon.
> > 
> > > > +
> > > > +\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> > > > +
> > > > +The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
> > > 
> > > 
> > > Any alignment requirements?
> > 
> > Good point.  There are alignment requirements and the driver has no way
> > of knowing what they are.  I'll find a way to communicate them to the
> > guest, either via virtio or via FUSE.
> > 
> > > Also, with no limit on mappings, it looks like the guest can use up lots of
> > > host VMAs quickly. Shouldn't there be a limit on # of mappings?
> > 
> > The VM can only degrade its own performance, right?
> 
> Only if QEMU is put in a container where virtual memory is
> limited.
> It's generally not a good idea to be in a situation where the only
> way for the host to make progress is to allocate more memory without
> any limit.
> 
> If we are in a situation where we need to either kill
> the guest or hit swap, neither choice is good.

There is a bound; it's cache region size / page size, so that's ~1M
mappings worst case (e.g. 4GB cache, 4kB page size).  That limit can be
brought down if we impose a larger granularity somewhere (and the
reality is our kernel uses 2MB mapping chunks, I think).
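
(Spelling the numbers out: 4 GiB / 4 KiB = 1,048,576 mappings, whereas
with a 2 MiB mapping granularity the same 4 GiB window comes to only
4 GiB / 2 MiB = 2,048 mappings.)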

> > We haven't seen catastrophic problems that bring the system to its
> > knees.
> 
> Because you are not running malicious guests?

Hmm, I didn't realise a process having an excessive number of mappings
could harm any other process.

Dave

> >  But we're aware that increasing the number of VMAs slows down the
> > lookup.  There is currently no imposed limit.
> > 
> > Ideas have been discussed to avoid using (so many) VMAs but it seems
> > like that will take some time to develop and get upstream.  This will
> > not affect the virtio specification because the device interface doesn't
> > need to know about this.
> > 
> > Stefan
> 
> 
> One way to address this is to expose the # of mappings
> in the config space.
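
Purely as an illustration of that suggestion, not anything in the
current draft: assuming the config layout from the virtio-fs spec
patch (a tag string plus num_request_queues), the limit could be
advertised with an extra field along these lines, where
num_mapping_slots is a hypothetical name:

struct virtio_fs_config {
	char tag[36];
	le32 num_request_queues;
	le32 num_mapping_slots;	/* hypothetical: max concurrent DAX mappings */
};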
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK