Index Home About Blog
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.44.0205181958140.30454-100000@home.transmeta.com>
Date: Sun, 19 May 2002 03:02:53 GMT
Message-ID: <fa.l3t3oiv.1n1g29r@ifi.uio.no>

On Sun, 19 May 2002, Rusty Russell wrote:
>
> Huh?  No, you ask for 2000 bytes into a buffer that can only take 1000
> bytes without hitting an unmapped page, returning EFAULT or giving a
> SIGSEGV is perfectly acceptable.

Bzzt, wrong answer.

Partial reads/writes are perfectly possible and non-buggy for any system
that uses variations of mmap/mprotect to implement user-level memory
management, for example persistent data-bases, garbage collection etc.

Which means that if half of a buffer used for "read()" just happens to be
marked non-writable for some GC purpose, the kernel HAS to have the
ability return the right answer (which in this case is to say "I could
only read 1000 bytes"). Because anything else just doesn't give the GC
library (or whatever) any way to recover nicely.

> As a coder, I'd *really* prefer that to hiding the bug!

Rusty, face it, you're wrong on this one. Just drop it.

		Linus



Newsgroups: fa.linux.kernel
From: Larry McVoy <lm@bitmover.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <20020518200540.N8794@work.bitmover.com>
Date: Sun, 19 May 2002 03:06:43 GMT
Message-ID: <fa.ib69oav.1k4mb84@ifi.uio.no>

On Sat, May 18, 2002 at 08:01:48PM -0700, Linus Torvalds wrote:
> On Sun, 19 May 2002, Rusty Russell wrote:
> >
> > Huh?  No, you ask for 2000 bytes into a buffer that can only take 1000
> > bytes without hitting an unmapped page, returning EFAULT or giving a
> > SIGSEGV is perfectly acceptable.
>
> Bzzt, wrong answer.

Linus is absolutely right.  The correct semantics are to return the number
of bytes read, if they are greater than zero, and on the next read return
the error.  This has been a corner case in read for a long time in various
Unix versions, and Linus has it right.  I went through this back at Sun
and we explored all the different ways, and the bottom line is that you
first ACK that you moved some data and then you NAK on the next read.
--
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm


Newsgroups: fa.linux.kernel
From: Larry McVoy <lm@bitmover.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <20020518210218.P8794@work.bitmover.com>
Date: Sun, 19 May 2002 04:03:17 GMT
Message-ID: <fa.i9mro9v.1mk4b8t@ifi.uio.no>

On Sun, May 19, 2002 at 02:01:25PM +1000, Rusty Russell wrote:
> But as we all know, it is harder to remove a feature from Linux, than
> to get the camel book through the eye of a needle (or something).

It's possible that I'm too tired to have grasped this, but if I have,
you're all wet.  In all cases, read needs to return the number of bytes
successfully moved.  If you ask for N and 1/2 of the way through N you
are going to get a fault, and you return SEGFAULT, now how can I ever
find out that N/2 bytes actually made it out to me?  I want to know that.
If you are arguing that return N/2 is wrong, you are incorrect.
--
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.44.0205182030160.31341-100000@home.transmeta.com>
Date: Sun, 19 May 2002 03:34:56 GMT
Message-ID: <fa.l2snnpv.1k1421i@ifi.uio.no>

On Sun, 19 May 2002, Rusty Russell wrote:
>
> Um, what about delivering a SIGSEGV?  So, copy_to/from_user always
> returns 0, but a signal is delivered.

That doesn't help. It's against some stupid SUS rule, I'm afraid.

(And THAT is a stupid rule, I 100% agree with. It means that some things
return -EFAULT, and other things do SIGSEGV, and the only difference is
whether something is a system call or is implemented as a library thing.
UNIX should always just have segfaulted, but there you are..)

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.33.0205211340080.3073-100000@penguin.transmeta.com>
Date: Tue, 21 May 2002 20:50:38 GMT
Message-ID: <fa.o9pvf6v.g4k5rd@ifi.uio.no>

On Thu, 16 May 2002, Pavel Machek wrote:
> I thought POSIX made it explicit that you may SIGSEGV or EFAULT at your
> option. If that SUS rule is stupid, we should just drop it.
>
> Performance advantage from MMX-copy-to-user is probably well worth it.

Stop this STUPID "it speeds things up" argument.

It's not TRUE.

We still have to do exactly the same things we do right now, because even
if we SIGSEGV we still have to return the right number of bytes
read/written.

SIGSEGV doesn't mean that the system call wouldn't complete. It removes
_none_ of the kernel fixup handling, because the SIGSEGV won't be
delivered until we return to user mode anyway. So please stop mixing these
two issues up.

There are two completely orthogonal issues:

 - Use SIGSEGV on system calls or not.

   Using SIGSEGV makes the system call vs library routine issue more
   regular, but it does not change the fact that the system call has to
   return _something_.

 - system call return value for partially successful read/write.

   We already have the exact same issue wrt something like SIGINT, and
   nobody sane would ever suggest that we always return the "whole" thing
   if we're interrupted by an external signal.

   Similarly, it's naive and stupid to suggest we return success if we get
   interrupted by a SIGSEGV/EFAULT.

On the first issue (SIGSEGV) I'm certainly open to trying that out,
although I'm fairly certain there was _some_ reason we didn't do this a
few years ago.

On the second issue, absolutely _nobody_ has shown any reason why we
should break the existing code that does this correctly, and I've shown
reasons why breaking it is STUPID.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.33.0205211500530.1307-100000@penguin.transmeta.com>
Date: Tue, 21 May 2002 22:05:33 GMT
Message-ID: <fa.ob9vcev.jk8539@ifi.uio.no>

On Tue, 21 May 2002, Andrew Morton wrote:
>
> Pavel makes a reasonable point that copy_*_user may elect
> to copy the data in something other than strictly ascending
> user virtual addresses.  In which case it's not possible to return
> a sane "how much was copied" number.

I don't agree that that is true.

Do you have _any_ reasonable implementation that would do that_

> And copy_*_user is buggy at present: it doesn't correctly handle
> the case where the source and destination of the copy are overlapping
> in the same physical page.  Example code below.

So we have memcpy() semantics for read()/write(), big deal.

The same way you aren't supposed to use memcpy() for overlapping areas,
you're not supposed to read/write into such areas, for all the same
reasons.

> One fix is to
> do the copy with descending addresses if src<dest or whatever.

No. That wouldn't work anyway, because the addresses are totally different
kinds.

> But then how to return the number of bytes??

The way we do now, which is the CORRECT way.

Stop this idiocy.

The current interface is quite well-defined, and has good semantics. Every
single argument against it has been totally bogus, with no redeeming
values.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.33.0205211423490.1405-100000@penguin.transmeta.com>
Date: Tue, 21 May 2002 21:28:04 GMT
Message-ID: <fa.oaqbcuv.j4g5j8@ifi.uio.no>

On Tue, 21 May 2002, Pavel Machek wrote:
>
> If you pass bad pointer to memcpy(), you don't expect any reasonable
> return value, right?

Actually, if I pass a bad pointer to memcpy(), and I have a handler for
the SIGSEGV that fixes up the thing, yes, by golly, I _do_ expect memcpy()
to get the right value.

If it doesn't, then the system is BROKEN.

Face it Pavel, you're WRONG. You're so incredibly wrong that this is not
worth even discussing. Linux does it right now, and I won't break that
correct behaviour just because somebody has a incorrect and silly view of
what is readable.

Face it, copy_to/from_user does the RIGHT THING, and stop whining about
it.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.44.0205182210330.878-100000@home.transmeta.com>
Date: Sun, 19 May 2002 05:24:02 GMT
Message-ID: <fa.mljskkv.1d2e2pi@ifi.uio.no>

On Sun, 19 May 2002, Rusty Russell wrote:
> > returns 0, but a signal is delivered.  Then the only places which need
> > to be clever are the mount option copying, and the signal delivery
> > code for SIGSEGV (which uses copy_to_user).
>
> Sorry, this doesn't work here either: this would return the wrong
> value from read.

Oh, read() has to return the right value, but we should _also_ do a
SIGSEGV, in my opinion (it would also catch all those programs that didn't
expect it).

However, that apparently flies in the face of UNIX history and apparently
some standard (whether it was POSIX or SuS or something else, I can't
remember, but that discussion came up earlier..)

> Of course, everyone who does more than one copy_to_user should be
> checking that return value, and not doing:
>
> 	if (copy_to_user(uptr....)
> 	   || copy_to_user(uptr+10,....)
> 		return -EFAULT
>
> So that such gc schemes actually work,

Note that _most_ system calls are simply just re-startable, ie if your
"stat()" system call dies half-way and returns EFAULT, you can re-start it
without having to know how much of the "stat" structure you might have
filled in. So for many things a plain -EFAULT is plenty good enough, and
your "copy_to/from_user() should return 0/-EFAULT" would work for them.

But read (and particularly write) are _not_ re-startable without the
knowledge of how much was written, because they change f_pos and other
things ("write()" in particular changes a _lot_ of "other things", namely
the data in the file itself, of course).

There are other system calls that aren't re-startable, but basically
read/write are the "big ones", and thus Linux should try its best to make
them work in an environment that requires restartability. Most programs
can live without various random ioctl's and special system calls, but very
very few programs/environments can live without read/write.

("restartable" here doesn't mean that the _kernel_ would re-start them,
but that a "gc-aware library" can make wrappers around them and correctly
restart them internally, if you see my point - kind of like how stdio
already handles the issue of EINTR returns for read/write, which is
actually very similar in nature).

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: AUDIT: copy_from_user is a deathtrap.
Original-Message-ID: <Pine.LNX.4.44.0205191125120.3104-100000@home.transmeta.com>
Date: Sun, 19 May 2002 18:30:37 GMT
Message-ID: <fa.m6tscqv.140u9gg@ifi.uio.no>

On Sat, 18 May 2002, Benjamin Herrenschmidt wrote:
>
> Looking at generic_file_write(), it ignore the count returned by
> copy_from_user and always commit a write for the whole requested
> count, regardless of how much could actually be read from userland.
> The result of copy_from_user is only used as an error condition.

And this is exactly what makes it re-startable.

A faulting write will fill some subsequent memory area with zeroes, but a
subsequent write can complete the original one.

It has to _commit_ the whole area, because it uses the pre-fault size
information to optimize away reads etc, ie if you do a

	write(fd, buf, 4096);

at a page-aligned offset, the write code knows that it shouldn't read the
old contents because they get overwritten.

Which is why we need to commit the whole 4096 bytes, even if we only
actually were able to get a single byte from user space.

But by then telling user space that we couldn't actually write more than 1
byte, we give user space the _ability_ to re-start the write with the
missing 4095 bytes.

> generic_file_read() on the other hand seems to be ok.

That one doesn't have any of the same issues.

		Linus



Index Home About Blog