Index Home About Blog
From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 20:12:32 UTC
Message-ID: <fa.g19r4as.n067ak@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0508011250210.3341@g5.osdl.org>

On Mon, 1 Aug 2005, Hugh Dickins wrote:
>
> > Aside, that brings up an interesting question - why should readonly
> > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > write access while readonly mappings of readonly files not? Or am I
> > horribly confused?
>
> Either you or I.  You'll have to spell that out to me in more detail,
> I don't see it that way.

We have always just done a COW if it's read-only - even if it's shared.

The point being that if a process mapped did a read-only mapping, and a
tracer wants to modify memory, the tracer is always allowed to do so, but
it's _not_ going to write anything back to the filesystem.  Writing
something back to an executable just because the user happened to mmap it
with MAP_SHARED (but read-only) _and_ the user had the right to write to
that fd is _not_ ok.

So VM_MAYWRITE is totally immaterial. We _will_not_write_ (and must not do
so) to the backing store through ptrace unless it was literally a writable
mapping (in which case VM_WRITE will be set, and the page table should be
marked writable in the first case).

So we have two choices:

 - not allow the write at all in ptrace (which I think we did at some
   point)

   This ends up being really inconvenient, and people seem to really
   expect to be able to write to readonly areas in debuggers. And doing
   "MAP_SHARED, PROT_READ" seems to be a common thing (Linux has supported
   that pretty much since day #1 when mmap was supported - long before
   writable shared mappings were supported, Linux accepted MAP_SHARED +
   PROT_READ not just because we could, but because Unix apps do use it).

or

 - turn a shared read-only page into a private page on ptrace write

   This is what we've been doing. It's strange, and it _does_ change
   semantics (it's not shared any more, so the debugger writing to it
   means that now you don't see changes to that file by others), so it's
   clearly not "correct" either, but it's certainly a million times better
   than writing out breakpoints to shared files..

At some point (for the longest time), when a debugger was used to modify a
read-only page, we also made it writable to the user, which was much
easier from a VM standpoint. Now we have this "maybe_mkwrite()" thing,
which is part of the reason for this particular problem.

Using the dirty flag for a "page is _really_ writable" is admittedly kind
of hacky, but it does have the advantage of working even when the -real-
write bit isn't set due to "maybe_mkwrite()". If it forces the s390 people
to add some more hacks for their strange VM, so be it..

[ Btw, on a totally unrelated note: anybody who is a git user and looks
  for when this maybe_mkwrite() thing happened, just doing

	git-whatchanged -p -Smaybe_mkwrite mm/memory.c

  in the bkcvs conversion pinpoints it immediately. Very useful git trick
  in case you ever have that kind of question. ]

I added Martin Schwidefsky to the Cc: explicitly, so that he can ping
whoever in the s390 team needs to figure out what the right thing is for
s390 and the dirty bit semantic change. Thanks for pointing it out.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 22:00:09 UTC
Message-ID: <fa.g2a73r4.k026qs@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0508011438450.3341@g5.osdl.org>

On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > We have always just done a COW if it's read-only - even if it's shared.
> >
> > The point being that if a process mapped did a read-only mapping, and a
> > tracer wants to modify memory, the tracer is always allowed to do so, but
> > it's _not_ going to write anything back to the filesystem.  Writing
> > something back to an executable just because the user happened to mmap it
> > with MAP_SHARED (but read-only) _and_ the user had the right to write to
> > that fd is _not_ ok.
>
> I'll need to think that through, but not right now.  It's a surprise
> to me, and it's likely to surprise the current kernel too.

Well, even if you did the write-back if VM_MAYWRITE is set, you'd still
have the case of having MAP_SHARED, PROT_READ _without_ VM_MAYWRITE being
set, and I'd expect that to actually be the common one (since you'd
normally use O_RDONLY to open a fd that you only want to map for reading).

And as mentioned, MAP_SHARED+PROT_READ does actually happen in real life.
Just do a google search on "MAP_SHARED PROT_READ -PROT_WRITE" and you'll
get tons of hits. For good reason too - because MAP_PRIVATE isn't actually
coherent on several old UNIXes.

So you'd still have to convert such a case to a COW mapping, so it's not
like you can avoid it.

Of course, if VM_MAYWRITE is not set, you could just convert it silently
to a MAP_PRIVATE at the VM level (that's literally what we used to do,
back when we didn't support writable shared mappings at all, all those
years ago), so at least now the COW behaviour would match the vma_flags.

> I'd prefer to say that if the executable was mapped shared from a writable fd,
> then the tracer will write back to it; but you're clearly against that.

Absolutely. I can just see somebody mapping an executable MAP_SHARED and
PROT_READ, and something as simple as doing a breakpoint while debugging
causing system-wide trouble.

I really don't think that's acceptable.

And I'm not making it up - add PROT_EXEC to the google search around, and
watch it being done exactly that way. Several of the hits mention shared
libraries too.

I strongly suspect that almost all cases will be opened with O_RDONLY, but
still..

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 22:10:20 UTC
Message-ID: <fa.g2q14b1.kgc7ah@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0508011455520.3341@g5.osdl.org>

On Mon, 1 Aug 2005, Linus Torvalds wrote:
>
> Of course, if VM_MAYWRITE is not set, you could just convert it silently
> to a MAP_PRIVATE at the VM level (that's literally what we used to do,
> back when we didn't support writable shared mappings at all, all those
> years ago), so at least now the COW behaviour would match the vma_flags.

Heh. I just checked. We still do exactly that:

                        if (!(file->f_mode & FMODE_WRITE))
                                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

some code never dies ;)

However, we still set the VM_MAYSHARE bit, and thats' the one that
mm/rmap.c checks for some reason. I don't see quite why - VM_MAYSHARE
doesn't actually ever do anything else than make sure that we try to
allocate a mremap() mapping in a cache-coherent space, I think (ie it's a
total no-op on any sane architecture, and as far as rmap is concerned on
all of them).

		Linus

Index Home About Blog