deliverable/linux.git
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:09 +0000 (23:40 -0700)] 
pid namespaces: miscellaneous preparations for pid namespaces

* remove pid.h from pid_namespaces.h;
* rework is_(cgroup|global)_init;
* optimize (get|put)_pid_ns for init_pid_ns;
* declare task_child_reaper to return actual reaper.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:08 +0000 (23:40 -0700)] 
pid namespaces: make proc have multiple superblocks - one for each namespace

Each pid namespace has to be visible through its own proc mount.  Thus we
need to have per-namespace proc trees with their own superblocks.

We cannot easily show different pid namespaces via one global proc tree, since
each pid refers to different tasks in different namespaces.  E.g.  pid 1
refers to the init task in the initial namespace and to some other task when
seen from another namespace.  Moreover, a pid existing in one namespace may
not exist in another.

This approach has one more advantage: tasks from the init namespace can see
what tasks live in another namespace by reading entries from another proc
tree.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:07 +0000 (23:40 -0700)] 
pid namespaces: move alloc_pid() lower in copy_process()

When we create a new namespace we will need to allocate a struct pid that
has one extra struct upid in its array compared to the parent's.

Thus we need to know the new namespace (if any) in alloc_pid() to init this
struct upid properly, so move the alloc_pid() call lower in copy_process().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:06 +0000 (23:40 -0700)] 
pid namespaces: helpers to find the task by its numerical ids

When searching for a task by numerical id one may need to find it using the
global pid (as is done now in the kernel) or by its virtual id, e.g.  when
sending a signal to a task from one namespace the sender will specify the
task's virtual id and we should find the task by this value.
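
As an illustration only (this is a user-space model, not the code from this
patch), a lookup by (id, namespace) pair might be sketched as below.  All
names prefixed with model_ and the *_model structures are hypothetical, and a
linear scan stands in for the kernel's pid hash.

```c
#include <assert.h>
#include <stddef.h>

struct ns_model { int level; };            /* 0 is the init namespace */

struct task_model {
    const char *comm;
    int level;                             /* deepest namespace the task lives in */
    struct { int nr; const struct ns_model *ns; } numbers[4];
};

/* Return the task known as `nr` inside `ns`, or NULL if there is none. */
static const struct task_model *
model_find_task_by_pid_ns(const struct task_model *tasks, size_t n,
                          int nr, const struct ns_model *ns)
{
    size_t i;

    assert(tasks != NULL && ns != NULL);
    for (i = 0; i < n; i++) {
        const struct task_model *t = &tasks[i];

        if (ns->level <= t->level &&
            t->numbers[ns->level].ns == ns &&
            t->numbers[ns->level].nr == nr)
            return t;
    }
    return NULL;
}
```

With this model the same number can name different tasks: id 1 looked up in
the init namespace and id 1 looked up in a child namespace need not match.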

[akpm@linux-foundation.org: fix gfs2 linkage]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:06 +0000 (23:40 -0700)] 
pid namespaces: helpers to obtain pid numbers

When showing a pid to the user or getting the pid's numerical id for in-kernel
use, the value of this id may differ depending on the namespace.

This set of helpers is used to get the global pid nr, the virtual (i.e.  seen
by task in its namespace) nr and the nr as it is seen from the specified
namespace.
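
A rough user-space sketch of the three kinds of helper follows.  It is an
assumption-laden model, not the kernel implementation: the model_ names, the
fixed-size numbers[] array and the "return 0 when not visible" convention are
all chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct ns_model { int level; };                       /* 0 is the init namespace */
struct upid_model { int nr; const struct ns_model *ns; };
struct pid_model { int level; struct upid_model numbers[4]; };

/* Global nr: the number in the level-0 (init) namespace. */
static int model_pid_nr(const struct pid_model *pid)
{
    assert(pid != NULL);
    return pid->numbers[0].nr;
}

/* The nr as seen from `ns`, or 0 if the pid is not visible there. */
static int model_pid_nr_ns(const struct pid_model *pid,
                           const struct ns_model *ns)
{
    assert(pid != NULL && ns != NULL);
    if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}

/* Virtual nr: the number in the namespace the task itself lives in. */
static int model_pid_vnr(const struct pid_model *pid)
{
    assert(pid != NULL);
    return pid->numbers[pid->level].nr;
}
```

For a task that is pid 1234 globally and pid 1 in its own namespace, the
global helper yields 1234 and the virtual helper yields 1.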

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:05 +0000 (23:40 -0700)] 
pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid

Each struct upid element of struct pid has to be initialized properly, i.e.
its nr must be allocated from the appropriate pidmap and its ns set to the
appropriate namespace.

When allocating a new pid, we need to know the namespace this pid will live
in, so an additional argument is added to alloc_pid().
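
A minimal user-space sketch of the idea, under stated assumptions: a simple
per-namespace counter (last_pid) stands in for the pidmap, the array is
fixed-size, and every model_ name is hypothetical rather than taken from the
patch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4

struct ns_model {
    int level;                  /* 0 is the init namespace */
    int last_pid;               /* stand-in for the per-namespace pidmap */
    struct ns_model *parent;    /* NULL for the init namespace */
};

struct upid_model { int nr; struct ns_model *ns; };
struct pid_model { int level; struct upid_model numbers[MODEL_MAX_LEVEL]; };

/* Fill in one upid per namespace, walking from `ns` up to the root. */
static int model_alloc_pid(struct pid_model *pid, struct ns_model *ns)
{
    int i;

    assert(pid != NULL && ns != NULL);
    if (ns->level < 0 || ns->level >= MODEL_MAX_LEVEL)
        return -1;
    pid->level = ns->level;
    for (i = ns->level; i >= 0; i--, ns = ns->parent) {
        pid->numbers[i].nr = ++ns->last_pid;   /* id valid only inside ns */
        pid->numbers[i].ns = ns;
    }
    return 0;
}
```

Passing the namespace in explicitly is what lets the allocator know how many
upid entries to initialize and which allocator each number comes from.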

On the other hand, the rest of the kernel still uses the pid->nr and
pid->pid_chain fields, so these ones are still initialized, but this will be
removed soon.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:04 +0000 (23:40 -0700)] 
pid namespaces: add support for pid namespaces hierarchy

Each namespace has a parent and is characterized by its "level".  Level is the
number of the namespace generation.  E.g.  the init namespace has level 0,
a namespace cloned from it has level 1, the next one has level 2, and so on.
This level is not explicitly limited.
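
The parent/level relationship can be pictured with a small user-space model.
This is only a sketch: struct pid_ns_model and the ns_ helpers are
hypothetical names, and the real structure carries more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pid_ns_model {
    int level;                     /* 0 for the init namespace */
    struct pid_ns_model *parent;   /* NULL for the init namespace */
};

/* Create a child of `parent`; pass NULL to create the level-0 root. */
static struct pid_ns_model *ns_create(struct pid_ns_model *parent)
{
    struct pid_ns_model *ns = malloc(sizeof(*ns));

    if (ns == NULL)
        return NULL;
    ns->parent = parent;
    ns->level = parent ? parent->level + 1 : 0;
    return ns;
}

/* Walk up to the root; the number of steps taken equals the level. */
static int ns_count_ancestors(const struct pid_ns_model *ns)
{
    int n = 0;

    assert(ns != NULL);
    while (ns->parent != NULL) {
        ns = ns->parent;
        n++;
    }
    return n;
}
```

Only the upward (child to parent) link is modeled, which matches the note
that finding a namespace's children is not needed yet.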

A true hierarchy must have some way to find each namespace's children, but
this is not used in the patches, so this ability is not added (yet).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sukadev Bhattiprolu [Fri, 19 Oct 2007 06:40:03 +0000 (23:40 -0700)] 
pid namespaces: introduce struct upid

Since a task will be visible from different pid namespaces, each task has to
be addressed by multiple pids.  struct upid stores the information about
which id refers to which namespace.

The construction looks like this.  Each struct pid carries the reference
counter and the list of tasks attached to this pid.  At its end it has a
variable length array of struct upid-s.  Each struct upid has a numerical id
(the pid itself) and a pointer to the namespace this ID is valid in, and is
hashed into a pid_hash for searching the pids.
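
The layout described above can be sketched in user space with a C99 flexible
array member.  This is a simplified model under assumptions: the *_model
names and the level field are illustrative, and the task list and hash
linkage are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct ns_model { int level; struct ns_model *parent; };

struct upid_model {
    int nr;                        /* numerical id as seen in ns */
    struct ns_model *ns;           /* namespace this id is valid in */
};

struct pid_model {
    int count;                     /* reference counter */
    int level;                     /* index of the deepest namespace */
    struct upid_model numbers[];   /* level + 1 entries, one per namespace */
};

/* Allocate a pid for a task in `ns`; nrs[i] is its id at level i. */
static struct pid_model *pid_model_alloc(struct ns_model *ns, const int *nrs)
{
    struct pid_model *pid;
    int i;

    assert(ns != NULL && nrs != NULL && ns->level >= 0);
    pid = malloc(sizeof(*pid) +
                 (size_t)(ns->level + 1) * sizeof(struct upid_model));
    if (pid == NULL)
        return NULL;
    pid->count = 1;
    pid->level = ns->level;
    for (i = ns->level; i >= 0; i--, ns = ns->parent) {
        pid->numbers[i].nr = nrs[i];
        pid->numbers[i].ns = ns;
    }
    return pid;
}
```

A task one level below the init namespace therefore gets two entries, one
per namespace it is visible from.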

The nr and pid_chain fields are kept in struct pid for a while to keep the
kernel working (no patch initializes the upids yet), but they will be removed
at the end of this series when we switch to upids completely.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:03 +0000 (23:40 -0700)] 
pid namespaces: prepare proc_flush_task() to flush entries from multiple proc trees

The first part is trivial - we just make proc_flush_task() operate on an
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.

The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.

When flushing a task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush.  Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers.  But after __exit_signal() the task has detached
all its pids and this information is lost.

This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
the tree and keep them in the hash (since pids are still alive before
__exit_signal()) till the next shrink, but since proc_flush_task() does not
provide a 100% guarantee that the dentries will be flushed, this is OK.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:02 +0000 (23:40 -0700)] 
pid namespaces: introduce MS_KERNMOUNT flag

This flag tells the .get_sb callback that this is a kern_mount() call so that
it can trust the *data pointer to be a valid in-kernel one.  If this flag is
passed from a user process, it is cleared since the *data pointer is not a
valid kernel object.
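
The trust rule can be modeled in user space roughly as follows.  Everything
here is a sketch under assumptions: the MODEL_ constants and model_
functions are hypothetical, the bit positions are chosen for illustration,
and an int stands in for the kernel object that *data points to.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MS_RDONLY    (1UL << 0)
#define MODEL_MS_KERNMOUNT (1UL << 22)   /* bit position is illustrative */

/* Flags arriving from a user process: strip the kernel-internal bit. */
static unsigned long model_sanitize_user_flags(unsigned long flags)
{
    return flags & ~MODEL_MS_KERNMOUNT;
}

/* In-kernel mount path: mark the call as trusted. */
static unsigned long model_kern_mount_flags(unsigned long flags)
{
    return flags | MODEL_MS_KERNMOUNT;
}

/* get_sb-style callback: dereference `data` only for kernel mounts. */
static int model_get_sb(unsigned long flags, const void *data)
{
    if (!(flags & MODEL_MS_KERNMOUNT))
        return -1;                 /* user mount: data is not trusted */
    assert(data != NULL);          /* kernel callers must pass an object */
    return *(const int *)data;
}
```

Because the bit is cleared on the user path, a user process cannot make the
callback treat its data pointer as a kernel object by setting the flag itself.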

Looking a few steps ahead: this will be needed for proc to create the
superblock and store a valid pid namespace on it during the namespace
creation.  The reason why the namespace cannot live without a proc mount is
described in the appropriate patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:01 +0000 (23:40 -0700)] 
pid namespaces: move exit_task_namespaces()

Make a task release its namespaces after it has reparented all its children to
child_reaper, but before it notifies its parent about its death.

The reason to release namespaces after reparenting is that when a task exits
it may send a signal to its parent (SIGCHLD), but if the parent has already
exited its namespaces there will be no way to decide what pid to deliver to it
- the parent can be from a different namespace.

The reason to release namespaces before notifying the parent is that when a
task sends a SIGCHLD to its parent, the parent can call wait() on this task
and release it.  But releasing the mnt namespace implies dropping all the
mounts in the mnt namespace and NFS expects the task to have a valid sighand
pointer.

Thanks to Oleg for pointing out some races that can appear and helping with
patches and fixes.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 19 Oct 2007 06:40:00 +0000 (23:40 -0700)] 
pid namespaces: rework forget_original_parent()

A pid namespace is a "view" of a particular set of tasks on the system.  They
work in a similar way to filesystem namespaces.  A file (or a process) can be
accessed in multiple namespaces, but it may have a different name in each.  In
a filesystem, this name might be /etc/passwd in one namespace, but
/chroot/etc/passwd in another.

For processes, a process may have pid 1234 in one namespace, but be pid 1 in
another.  This allows new pid namespaces to have basically arbitrary pids, and
not have to worry about what pids exist in other namespaces.  This is
essential for checkpoint/restart where a restarted process's pid might collide
with the pid of an existing process on the system.

In this particular implementation, pid namespaces have a parent-child
relationship, just like processes.  A process in a pid namespace may see all
of the processes in the same namespace, as well as all of the processes in all
of the namespaces which are children of its namespace.  Processes may not,
however, see others which are in their parent's namespace, but not in their
own.  The same goes for sibling namespaces.
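
The visibility rule amounts to an ancestor check, which can be sketched in
user space as below.  This is a model only; struct ns_model and
model_ns_can_see() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ns_model { struct ns_model *parent; };   /* NULL parent: init namespace */

/*
 * True if a task living in `target` is visible to a viewer in `viewer`,
 * i.e. if `viewer` is `target` itself or one of its ancestors.
 */
static bool model_ns_can_see(const struct ns_model *viewer,
                             const struct ns_model *target)
{
    assert(viewer != NULL);
    for (; target != NULL; target = target->parent)
        if (target == viewer)
            return true;
    return false;
}
```

Under this model a parent namespace sees into its children, while children
and siblings cannot see outward or sideways.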

The known issue to be solved in the near future is signal handling at the
namespace boundary.  That is, currently the namespace's init is treated like
an ordinary task that can be killed from within a namespace.  Ideally, the
signal handling by the namespace's init should have two sides: when signaling
the init from its namespace, the init should look like a real init task, i.e.
receive only those signals that it explicitly wants to; when signaling the
init from one of the parent namespaces, init should look like an ordinary
task, i.e.  receive any signal, only taking the general permissions into
account.

The pid namespace was developed by Pavel Emelyanov and Sukadev Bhattiprolu and
we eventually came to almost the same implementation, which differed in some
details.  This set is based on Pavel's patches, but it includes comments and
patches from Sukadev.

Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and gave
valuable advice on how to make this set cleaner.

This patch:

We have to call exit_task_namespaces() only after the exiting task has
reparented all its children and is sure that no other threads will reparent
theirs to it.  Why this is needed is explained in the appropriate patch.  This
one only reworks forget_original_parent() so that after calling it a
task cannot be/become parent of any other task.

We check PF_EXITING instead of ->exit_state while choosing the new parent.
Note that tasklist_lock acts as a barrier: everyone who takes tasklist after
us (when forget_original_parent() drops it) must see PF_EXITING.

The other changes are just cleanups.  They just move some code from
exit_notify to forget_original_parent().  It is a bit silly to declare
ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
forget_original_parent(), unlock-lock-unlock tasklist, and then use
ptrace_dead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Walker [Fri, 19 Oct 2007 06:39:59 +0000 (23:39 -0700)] 
whitespace fixes: task exit handling

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Fri, 19 Oct 2007 06:39:58 +0000 (23:39 -0700)] 
mm/oom_kill.c: Use list_for_each_entry instead of list_for_each

mm/oom_kill.c: Convert list_for_each to list_for_each_entry in
oom_kill_process()
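
For readers unfamiliar with the two iterators, the shape of such a
conversion can be shown with a small user-space re-creation of the list
macros.  This is a sketch, not the kernel header: struct item, sum_old() and
sum_new() are hypothetical, and __typeof__ is a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)
#define list_for_each_entry(pos, head, member) \
    for (pos = list_entry((head)->next, __typeof__(*pos), member); \
         &pos->member != (head); \
         pos = list_entry(pos->member.next, __typeof__(*pos), member))

struct item { int value; struct list_head node; };

static void list_add_tail(struct list_head *new, struct list_head *head)
{
    new->prev = head->prev;
    new->next = head;
    head->prev->next = new;
    head->prev = new;
}

/* Before: iterate over list_heads, then convert each one by hand. */
static int sum_old(struct list_head *head)
{
    struct list_head *pos;
    int sum = 0;

    assert(head != NULL);
    list_for_each(pos, head) {
        struct item *it = list_entry(pos, struct item, node);
        sum += it->value;
    }
    return sum;
}

/* After: the macro hands back the containing structure directly. */
static int sum_new(struct list_head *head)
{
    struct item *it;
    int sum = 0;

    assert(head != NULL);
    list_for_each_entry(it, head, node)
        sum += it->value;
    return sum;
}
```

Both functions visit the same elements; the second form drops the temporary
list_head cursor and the explicit list_entry() call.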

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Fri, 19 Oct 2007 06:39:58 +0000 (23:39 -0700)] 
kernel/time/clocksource.c: Use list_for_each_entry instead of list_for_each

kernel/time/clocksource.c: Convert list_for_each to
list_for_each_entry in clocksource_resume(),
sysfs_override_clocksource() and show_available_clocksources()

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Fri, 19 Oct 2007 06:39:57 +0000 (23:39 -0700)] 
kernel/exit.c: Use list_for_each_entry(_safe) instead of list_for_each(_safe)

kernel/exit.c: Convert list_for_each(_safe) to
list_for_each_entry(_safe) in forget_original_parent(), exit_notify()
and do_wait()

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Fri, 19 Oct 2007 06:39:57 +0000 (23:39 -0700)] 
fs/super.c: use list_for_each_entry() instead of list_for_each()

fs/super.c: use list_for_each_entry() instead of list_for_each() in
sget()

[akpm@linux-foundation.org: clean up some crap while we're there]
Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Fri, 19 Oct 2007 06:39:56 +0000 (23:39 -0700)] 
fs/eventpoll.c: use list_for_each_entry() instead of list_for_each()

fs/eventpoll.c: use list_for_each_entry() instead of list_for_each()
in ep_poll_safewake()

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Fri, 19 Oct 2007 06:39:56 +0000 (23:39 -0700)] 
fs/file_table.c: use list_for_each_entry() instead of list_for_each()

fs/file_table.c: use list_for_each_entry() instead of list_for_each()
in fs_may_remount_ro()

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Berg [Fri, 19 Oct 2007 06:39:55 +0000 (23:39 -0700)] 
workqueue: debug flushing deadlocks with lockdep

In the following scenario:

code path 1:
  my_function() -> lock(L1); ...; flush_workqueue(); ...

code path 2:
  run_workqueue() -> my_work() -> ...; lock(L1); ...

you can get a deadlock when my_work() is queued or running
but my_function() has acquired L1 already.

This patch adds a pseudo-lock to each workqueue to make lockdep
warn about this scenario.
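
The effect of the pseudo-lock can be illustrated with a toy lock-order
tracker in user space.  This is a deliberately simplified sketch and not how
lockdep is implemented: it only detects direct two-lock inversions, runs
single-threaded, and every model_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { LOCK_L1, LOCK_WQ, NR_LOCKS };   /* LOCK_WQ is the workqueue pseudo-lock */

static bool held[NR_LOCKS];
static bool order[NR_LOCKS][NR_LOCKS]; /* order[a][b]: b taken while holding a */

/* Record the acquisition; return true if it inverts a recorded order. */
static bool model_acquire(int lock)
{
    bool inversion = false;
    int h;

    assert(lock >= 0 && lock < NR_LOCKS);
    for (h = 0; h < NR_LOCKS; h++) {
        if (!held[h] || h == lock)
            continue;
        if (order[lock][h])
            inversion = true;          /* seen earlier in the opposite order */
        order[h][lock] = true;
    }
    held[lock] = true;
    return inversion;
}

static void model_release(int lock)
{
    assert(lock >= 0 && lock < NR_LOCKS);
    held[lock] = false;
}

/* Code path 1: my_function() holds L1 and flushes the workqueue. */
static bool model_my_function(void)
{
    bool bad = model_acquire(LOCK_L1);

    bad |= model_acquire(LOCK_WQ);     /* the flush touches the pseudo-lock */
    model_release(LOCK_WQ);
    model_release(LOCK_L1);
    return bad;
}

/* Code path 2: run_workqueue() holds the pseudo-lock around my_work(). */
static bool model_run_workqueue(void)
{
    bool bad = model_acquire(LOCK_WQ);

    bad |= model_acquire(LOCK_L1);     /* my_work() takes L1 */
    model_release(LOCK_L1);
    model_release(LOCK_WQ);
    return bad;
}
```

The point of the model is that the inversion is reported as soon as both
orders have been observed, even if the two paths never actually race.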

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:39:54 +0000 (23:39 -0700)] 
Make access to task's nsproxy lighter

When someone wants to deal with some other task's namespaces it has to lock
the task and then get the desired namespace if it exists.  This is
slow on read-only paths and may be impossible in some cases.

E.g.  Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.

On the other hand, switching the namespace on a task (daemonize) and releasing
the namespace (after the last task exits) are rather rare operations and we
can sacrifice their speed to solve the issues above.

The access to other task namespaces is proposed to be performed
like this:

     rcu_read_lock();
     nsproxy = task_nsproxy(tsk);
     if (nsproxy != NULL) {
             /*
              * work with the namespaces here
              * e.g. get the reference on one of them
              */
     } /*
        * NULL task_nsproxy() means that this task is
        * almost dead (zombie)
        */
     rcu_read_unlock();

This patch has passed the review by Eric and Oleg :) and has,
of course, been tested.

[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sukadev Bhattiprolu [Fri, 19 Oct 2007 06:39:53 +0000 (23:39 -0700)] 
pid namespaces: move alloc_pid() to copy_process()

Move alloc_pid() into copy_process().  This will keep all pid and pid
namespace code together and simplify error handling when we support multiple
pid namespaces.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge E. Hallyn [Fri, 19 Oct 2007 06:39:52 +0000 (23:39 -0700)] 
pid namespaces: define is_global_init() and is_container_init()

is_init() is an ambiguous name for the pid==1 check.  Split it into
is_global_init() and is_container_init().

A cgroup init has its tsk->pid == 1.

A global init also has its tsk->pid == 1 and its active pid namespace
is the init_pid_ns.  But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.
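
The distinction between the two checks can be modeled in user space as
below.  This is a sketch only: the *_model structures and model_ functions
are hypothetical stand-ins for the kernel's types and helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model { int pid; };
struct ns_model { struct task_model *child_reaper; };

/* Stand-in for init_pid_ns; child_reaper is set once "at boot". */
static struct ns_model model_init_pid_ns;

/* Container (cgroup) init: any task whose pid is 1. */
static bool model_is_container_init(const struct task_model *tsk)
{
    assert(tsk != NULL);
    return tsk->pid == 1;
}

/* Global init: compare against the reaper of the init namespace. */
static bool model_is_global_init(const struct task_model *tsk)
{
    assert(tsk != NULL);
    return tsk == model_init_pid_ns.child_reaper;
}
```

The pointer comparison avoids looking at the task's active namespace at all,
which is the design choice the text describes.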

Changelog:

2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
  global init (/sbin/init) process. This would improve performance
  and remove dependence on the task_pid().

2.6.21-mm2-pidns2:

- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
  ppc,avr32}/traps.c for the _exception() call to is_global_init().
  This way, we kill only the cgroup if the cgroup's init has a
  bug rather than force a kernel panic.

[akpm@linux-foundation.org: fix comment]
[sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
[bunk@stusta.de: kernel/pid.c: remove unused exports]
[sukadev@us.ibm.com: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sukadev Bhattiprolu [Fri, 19 Oct 2007 06:39:51 +0000 (23:39 -0700)] 
pid namespaces: use task_pid() to find leader's pid

Use task_pid() to get leader's 'struct pid' and avoid the find_pid().

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sukadev Bhattiprolu [Fri, 19 Oct 2007 06:39:50 +0000 (23:39 -0700)] 
pid namespaces: rename child_reaper() function

Rename the child_reaper() function to task_child_reaper() to be similar to
other task_* functions and to distinguish the function from 'struct
pid_namespace.child_reaper'.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sukadev Bhattiprolu [Fri, 19 Oct 2007 06:39:49 +0000 (23:39 -0700)] 
pid namespaces: define and use task_active_pid_ns() wrapper

With multiple pid namespaces, a process is known by some pid_t in every
ancestor pid namespace.  Every time the process forks, the child process also
gets a pid_t in every ancestor pid namespace.

While a process is visible in >=1 pid namespaces, it can see pid_t's in only
one pid namespace.  We call this pid namespace its "active pid namespace",
and it is always the youngest pid namespace in which the process is known.

This patch defines and uses a wrapper to find the active pid namespace of a
process.  The implementation of the wrapper will be changed when support
for multiple pid namespaces is added.

Changelog:
2.6.22-rc4-mm2-pidns1:
- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
  task_active_pid_ns() in child_reaper() since task->nsproxy
  can be NULL during task exit (so child_reaper() continues to
  use init_pid_ns.child_reaper).

2.6.21-rc6-mm1:
- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
  process can have multiple pid namespaces.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelianov [Fri, 19 Oct 2007 06:39:48 +0000 (23:39 -0700)] 
pid namespaces: dynamic kmem cache allocator for pid namespaces

Add kmem_cache to pid_namespace to allocate pids from.

Since both implementations expand the struct pid to carry more numerical
values, each namespace should have a separate cache to store pids of different
sizes.

Each kmem cache is named "pid_<NR>", where <NR> is the number of numerical ids
on the pid.  Different namespaces with the same level of nesting will have the
same caches.
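
The naming scheme can be sketched in user space as follows.  This is an
illustration only: the model_ functions are hypothetical, and the mapping
"number of ids = nesting level + 1" is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format the cache name for a pid carrying `nr_ids` numerical ids. */
static int model_pid_cache_name(char *buf, size_t len, int nr_ids)
{
    int n;

    assert(buf != NULL && nr_ids > 0);
    n = snprintf(buf, len, "pid_%d", nr_ids);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;   /* -1: did not fit */
}

/* Namespaces at the same nesting level map to the same cache name. */
static int model_same_cache(int level_a, int level_b)
{
    char a[16], b[16];

    if (model_pid_cache_name(a, sizeof(a), level_a + 1) ||
        model_pid_cache_name(b, sizeof(b), level_b + 1))
        return 0;
    return strcmp(a, b) == 0;
}
```

Keying the cache on the number of ids rather than on the namespace means
sibling namespaces can share one cache of correctly sized objects.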

This patch has two FIXMEs that are to be fixed after we reach the consensus
about the struct pid itself.

The first one is that the namespace to free the pid from in free_pid() must be
taken from pid.  Now the init_pid_ns is used.

The second FIXME is about the cache allocation.  When we do know how long the
object will be then we'll have to calculate this size in create_pid_cachep.
Right now the sizeof(struct pid) value is used.

[akpm@linux-foundation.org: coding-style repair]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelianov [Fri, 19 Oct 2007 06:39:47 +0000 (23:39 -0700)] 
pid namespaces: make get_pid_ns() return the namespace itself

Make get_pid_ns() return the namespace itself to look like the other getters
and make the code using it look nicer.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelianov [Fri, 19 Oct 2007 06:39:46 +0000 (23:39 -0700)] 
pid namespaces: round up the API

The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.

The proposals are to
* equip the functions that return the integer with _nr suffix to
  represent that fact,
* and to make all functions work with task (not process) by making
  the common prefix of the same name.

For consistency the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge E. Hallyn [Fri, 19 Oct 2007 06:39:45 +0000 (23:39 -0700)] 
cgroups: implement namespace tracking subsystem

When a task enters a new namespace via a clone() or unshare(), a new cgroup
is created and the task moves into it.

This version names cgroups which are automatically created using
cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
cloned process.  (Thanks Pavel for the idea.)  This is safe because if the
process unshares again, it will create

/cgroups/(...)/node_<pid>/node_<pid>

The only possibilities (AFAICT) for a -EEXIST on unshare are

1. pid wraparound
2. a process fails an unshare, then tries again.

Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
succeed.
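
The nesting of node_<pid> directories can be illustrated with a small
user-space path builder.  This is a sketch under assumptions:
model_append_node() is a hypothetical helper and "/cgroups" is just an
example mount point; the real cgroup is created inside the kernel, not by
string manipulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append "/node_<pid>" to `path` in place; return -1 if it does not fit. */
static int model_append_node(char *path, size_t len, int pid)
{
    size_t used;
    int n;

    assert(path != NULL);
    used = strlen(path);
    if (used >= len)
        return -1;
    n = snprintf(path + used, len - used, "/node_%d", pid);
    if (n < 0 || (size_t)n >= len - used) {
        path[used] = '\0';         /* undo the partial write */
        return -1;
    }
    return 0;
}
```

Calling the helper twice with the same pid yields a nested path rather than
a name clash, which mirrors why a second unshare by the same process is safe.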

Changelog:
Name cloned cgroups as "node_<pid>".

[clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Fri, 19 Oct 2007 06:39:44 +0000 (23:39 -0700)] 
Add cgroupstats

This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported).

This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.

The current model for cgroupstats is a pull model; a push model (to post
statistics on interesting events) should be very easy to add.  Currently
user space requests statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
are returned to user space.

TODO's/NOTE:

This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, and accumulation of
taskstats into cgroup statistics in the future.

Sample output

# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0

# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0

If the approach looks good, I'll enhance and post the user space utility for
the same.

Feedback, comments, test results are always welcome!

[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago task cgroups: enable cgroups by default in some configs
Paul Jackson [Fri, 19 Oct 2007 06:39:43 +0000 (23:39 -0700)] 
task cgroups: enable cgroups by default in some configs

In pre-cgroup cpusets, a few config files enabled cpusets by default.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: simple task cgroup debug info subsystem
Paul Menage [Fri, 19 Oct 2007 06:39:43 +0000 (23:39 -0700)] 
Task Control Groups: simple task cgroup debug info subsystem

This example subsystem exports debugging information as an aid to diagnosing
refcount leaks, etc, in the cgroup framework.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: example CPU accounting subsystem
Paul Menage [Fri, 19 Oct 2007 06:39:42 +0000 (23:39 -0700)] 
Task Control Groups: example CPU accounting subsystem

This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.
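
The two reported quantities can be modelled in userspace as follows.  This is
a minimal Python sketch of the bookkeeping described above, not the
subsystem's code; the class name CpuAccount, its methods and the use of
floating-point seconds are all assumptions made for illustration.

```python
INTERVAL = 10.0  # seconds, the interval length named in the description


class CpuAccount:
    """Hypothetical per-cgroup accumulator for CPU time."""

    def __init__(self, start=0.0):
        self.total = 0.0          # total CPU seconds charged so far
        self.current = 0.0        # CPU seconds in the interval in progress
        self.last_complete = 0.0  # CPU seconds in the last complete interval
        self.interval_start = start

    def _roll(self, now):
        # Number of whole intervals that ended since interval_start.
        elapsed = int((now - self.interval_start) // INTERVAL)
        if elapsed >= 1:
            # If more than one interval ended, the most recent complete
            # interval saw no charges at all.
            self.last_complete = self.current if elapsed == 1 else 0.0
            self.current = 0.0
            self.interval_start += elapsed * INTERVAL

    def charge(self, now, cpu_seconds):
        """Account cpu_seconds of CPU time used by a task in the cgroup."""
        self._roll(now)
        self.total += cpu_seconds
        self.current += cpu_seconds

    def load_percent(self, now):
        """%CPU used during the last complete interval."""
        self._roll(now)
        return 100.0 * self.last_complete / INTERVAL
```

In this model the percentage is relative to one CPU, so a cgroup running on
several CPUs at once could report more than 100.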

Portions contributed by Balbir Singh <balbir@in.ibm.com>

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: make cpusets a client of cgroups
Paul Menage [Fri, 19 Oct 2007 06:39:39 +0000 (23:39 -0700)] 
Task Control Groups: make cpusets a client of cgroups

Remove the filesystem support logic from the cpusets system and make cpusets
a cgroup subsystem.

The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: automatic userspace notification of idle cgroups
Paul Menage [Fri, 19 Oct 2007 06:39:38 +0000 (23:39 -0700)] 
Task Control Groups: automatic userspace notification of idle cgroups

Add the following files to the cgroup filesystem:

notify_on_release - configures/reports whether the cgroup subsystem should
attempt to run a release script when this cgroup becomes unused

release_agent - configures/reports the release agent to be used for this
hierarchy (top level in each hierarchy only)

releasable - reports whether this cgroup would have been auto-released if
notify_on_release was true and a release agent was configured (mainly useful
for debugging)

To avoid locking issues, invoking the userspace release agent is done via a
workqueue task; cgroups that need to have their release agents invoked by
the workqueue task are linked on to a list.
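
From user space, the two writable files might be configured as in the Python
sketch below.  The function names are hypothetical, and the sketch assumes
that notify_on_release accepts "1" to enable notification and that
release_agent accepts a path to an executable; only the file names come from
the description above.

```python
import os


def enable_release_notification(cgroup_dir):
    """Ask for the release agent to be run when this cgroup becomes unused."""
    with open(os.path.join(cgroup_dir, "notify_on_release"), "w") as f:
        f.write("1\n")


def set_release_agent(hierarchy_root, agent_path):
    """Set the release agent; the file exists only at the hierarchy's top level."""
    with open(os.path.join(hierarchy_root, "release_agent"), "w") as f:
        f.write(agent_path + "\n")
```

A caller would typically set the agent once per hierarchy and then enable
notification on each cgroup it wants cleaned up automatically.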

[pj@sgi.com: Need to include kmod.h]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: shared cgroup subsystem group arrays
Paul Menage [Fri, 19 Oct 2007 06:39:36 +0000 (23:39 -0700)] 
Task Control Groups: shared cgroup subsystem group arrays

Replace the struct css_set embedded in task_struct with a pointer; all tasks
that have the same set of memberships across all hierarchies will share a
css_set object, and will be linked via their css_sets field to the "tasks"
list_head in the css_set.

Assuming that many tasks share the same cgroup assignments, this reduces
overall space usage and keeps the size of the task_struct down (three pointers
added to task_struct compared to a non-cgroups kernel, no matter how many
subsystems are registered).
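
The sharing idea can be sketched in Python as a lookup table keyed by the full
tuple of memberships.  This models the concept only; the class names and the
use of a dictionary are assumptions for illustration and say nothing about how
the kernel locates an existing css_set.

```python
class CssSet:
    """Hypothetical stand-in for a shared css_set object."""

    def __init__(self, memberships):
        self.memberships = memberships  # one cgroup per hierarchy
        self.tasks = []                 # tasks linked to this set


class CssSetTable:
    """Hands out one CssSet per distinct combination of memberships."""

    def __init__(self):
        self._sets = {}

    def attach(self, task, memberships):
        key = tuple(memberships)
        cset = self._sets.get(key)
        if cset is None:
            # First task with this combination: create the shared object.
            cset = self._sets[key] = CssSet(key)
        cset.tasks.append(task)
        return cset
```

Each task then stores a single reference to its set, so the per-task cost does
not grow with the number of registered subsystems.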

[akpm@linux-foundation.org: fix a printk]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: add procfs interface
Paul Menage [Fri, 19 Oct 2007 06:39:35 +0000 (23:39 -0700)] 
Task Control Groups: add procfs interface

Add:

/proc/cgroups - general system info

/proc/*/cgroup - per-task cgroup membership info
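
A reader of the per-task file might parse it as in the Python sketch below.
It assumes the three-field "hierarchy-ID:subsystems:path" line layout
documented for current kernels in cgroups(7); the exact fields emitted by this
patch may differ, and the function name is hypothetical.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup.

    Returns a list of (hierarchy_id, [subsystem, ...], path) tuples.
    """
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split at most twice so that a path containing ':' stays intact.
        hierarchy_id, subsystems, path = line.split(":", 2)
        names = subsystems.split(",") if subsystems else []
        entries.append((int(hierarchy_id), names, path))
    return entries
```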

[a.p.zijlstra@chello.nl: cgroups: bdi init hooks]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: add cgroup_clone() interface
Paul Menage [Fri, 19 Oct 2007 06:39:34 +0000 (23:39 -0700)] 
Task Control Groups: add cgroup_clone() interface

Add support for cgroup_clone(), a way to create new cgroups intended to
be used for systems such as namespace unsharing.  A new subsystem callback,
post_clone(), is added to allow subsystems to automatically configure cloned
cgroups.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: add fork()/exit() hooks
Paul Menage [Fri, 19 Oct 2007 06:39:33 +0000 (23:39 -0700)] 
Task Control Groups: add fork()/exit() hooks

This adds the necessary hooks to the fork() and exit() paths to ensure
that new children inherit their parent's cgroup assignments, and that
exiting processes release reference counts on their cgroups.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Add cgroup write_uint() helper method
Paul Menage [Fri, 19 Oct 2007 06:39:33 +0000 (23:39 -0700)] 
Add cgroup write_uint() helper method

Add write_uint() helper method for cgroup subsystems

This helper is analogous to the read_uint() helper method for
reporting u64 values to userspace. It's designed to reduce the amount
of boilerplate required for creating new cgroup subsystems.

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: add tasks file interface
Paul Menage [Fri, 19 Oct 2007 06:39:32 +0000 (23:39 -0700)] 
Task Control Groups: add tasks file interface

Add the per-directory "tasks" file for cgroupfs mounts; this allows the
user to determine which tasks are members of a cgroup by reading a
cgroup's "tasks", and to move a task into a cgroup by writing its pid to
its "tasks".

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Task Control Groups: basic task cgroup framework
Paul Menage [Fri, 19 Oct 2007 06:39:30 +0000 (23:39 -0700)] 
Task Control Groups: basic task cgroup framework

Generic Process Control Groups
--------------------------

There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.

This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.

The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:

- the userspace APIs are (somewhat) normalised

- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
 them as the resource-control portion of a virtual server system.

- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel

This patch:

Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago cpuset: zero malloc - revert the old cpuset fix
Paul Jackson [Fri, 19 Oct 2007 06:39:28 +0000 (23:39 -0700)] 
cpuset: zero malloc - revert the old cpuset fix

The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.

The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.

This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.

Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.

With the 2.6.22 kernel changes to issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign.  This cpuset code would occasionally trigger that warning.

The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.
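
The ordering of the check can be shown with a small Python model.  This is an
illustration of the check-before-store pattern, not the cpuset code; the
function name and the list standing in for the kmalloc'd array are
assumptions.

```python
def collect_pids(task_pids, capacity):
    """Store at most `capacity` pids, checking the bound before each store."""
    pidarray = [None] * capacity  # stands in for the kmalloc'd array
    n = 0
    for pid in task_pids:
        if n >= capacity:         # check first, so slot `capacity` is never written
            break
        pidarray[n] = pid
        n += 1
    return pidarray[:n]
```

With the old ordering (store, then compare), a capacity of zero combined with
one task appearing later would write slot 0 of a zero-length array, which is
the case described above.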

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago kernel-api docbook: fix content problems
Randy Dunlap [Fri, 19 Oct 2007 06:39:28 +0000 (23:39 -0700)] 
kernel-api docbook: fix content problems

Fix kernel-api docbook contents problems.

docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line:  of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: ignore on disk s_bmap_nr value
Jeff Mahoney [Fri, 19 Oct 2007 06:39:27 +0000 (23:39 -0700)] 
reiserfs: ignore on disk s_bmap_nr value

Implement support for file systems larger than 8 TiB.

The reiserfs superblock contains a 16 bit value for counting the number of
bitmap blocks.  The rest of the disk format supports file systems up to 2^32
blocks, but the bitmap block limitation artificially limits this to 8 TiB with
a 4KiB block size.

Rather than trust the superblock's 16-bit bitmap block count, we calculate it
dynamically based on the number of blocks in the file system.  When an
incorrect value is observed in the superblock, it is zeroed out, ensuring that
older kernels will not be able to mount the file system.
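
The limit and the dynamic calculation can be checked with a little
arithmetic.  The Python sketch below assumes that each bitmap block tracks one
filesystem block per bit, which is what makes a 16-bit count of 4KiB bitmap
blocks top out at about 8 TiB; the function names are hypothetical.

```python
TIB = 2 ** 40


def bitmap_blocks(block_count, block_size=4096):
    """Bitmap blocks needed to track block_count blocks (one bit per block)."""
    bits_per_bitmap_block = block_size * 8
    return (block_count + bits_per_bitmap_block - 1) // bits_per_bitmap_block


def max_bytes_with_16bit_count(block_size=4096):
    """Largest filesystem a 16-bit bitmap block count can describe."""
    max_bitmap_blocks = 0xFFFF
    return max_bitmap_blocks * (block_size * 8) * block_size
```

With 4KiB blocks each bitmap block covers 32768 blocks (128 MiB), so 65535
bitmap blocks cover just under 8 TiB, while a full 2^32-block filesystem needs
131072 bitmap blocks, which no longer fits in 16 bits.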

Userspace support has already been implemented and shipped in reiserfsprogs
3.6.20.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: remove first_zero_hint
Jeff Mahoney [Fri, 19 Oct 2007 06:39:26 +0000 (23:39 -0700)] 
reiserfs: remove first_zero_hint

The first_zero_hint metadata caching was never actually used, and it's of
dubious optimization quality.  This patch removes it.

It doesn't actually shrink the size of the reiserfs_bitmap_info struct, since
that doesn't work with block sizes larger than 8K.  There was a big fixme in
there, and with all the work lately in allowing block size > page size, I
might as well kill the fixme as well.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: fix usage of signed ints for block numbers
Jeff Mahoney [Fri, 19 Oct 2007 06:39:25 +0000 (23:39 -0700)] 
reiserfs: fix usage of signed ints for block numbers

Do a quick signedness check for block numbers.  There are a number of places
where signed integers are used for block numbers, which limits the usable file
system size to 8 TiB.  The disk format, excepting a problem which will be
fixed in the following patch, supports file systems up to 16 TiB in size.
This patch cleans up those sites so that we can enable the full usable size.
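
The two sizes follow from the number of usable bits in a block number.  A
short Python check, assuming the 4KiB block size used elsewhere in this
series (the function name is hypothetical):

```python
TIB = 2 ** 40


def max_fs_bytes(block_number_bits, block_size=4096):
    """Filesystem size addressable with the given number of block-number bits."""
    return (2 ** block_number_bits) * block_size
```

A signed 32-bit integer leaves 31 bits for non-negative block numbers, giving
8 TiB; treating the same field as unsigned gives 32 bits and 16 TiB.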

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: fix memset byte count during resize
Jeff Mahoney [Fri, 19 Oct 2007 06:39:25 +0000 (23:39 -0700)] 
reiserfs: fix memset byte count during resize

Correct the memset in reiserfs_resize to clear the memory allocated for the
new bitmap info structs.  Previously, it would clear the memory used by the
old size.  Depending on the contents of memory, this could cause incorrect
caching behavior for bitmap blocks in the newly allocated area.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: use is_reusable to catch corruption
Jeff Mahoney [Fri, 19 Oct 2007 06:39:24 +0000 (23:39 -0700)] 
reiserfs: use is_reusable to catch corruption

Build in is_reusable() unconditionally and use it to catch corruption before
it reaches the block freeing paths.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: dont use BUG when panicking
Jeff Mahoney [Fri, 19 Oct 2007 06:39:24 +0000 (23:39 -0700)] 
reiserfs: dont use BUG when panicking

Change reiserfs_panic() to use panic() initially instead of BUG().  Using
BUG() ignores the configurable panic behavior, so systems that should be
failing and rebooting are left hanging.  This causes problems in
active/standby HA scenarios.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago reiserfs: fix up lockdep warnings
Jeff Mahoney [Fri, 19 Oct 2007 06:39:23 +0000 (23:39 -0700)] 
reiserfs: fix up lockdep warnings

Add I_MUTEX_XATTR annotations to the inode locking in the reiserfs xattr code.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago JBD: Fix JBD warnings when compiling with CONFIG_JBD_DEBUG
Jose R. Santos [Fri, 19 Oct 2007 06:39:23 +0000 (23:39 -0700)] 
JBD: Fix JBD warnings when compiling with CONFIG_JBD_DEBUG

Note from Mingming's JBD2 fix:

Noticed that all the warnings occur when the debug level is 0.  Then found
that the "jbd2: Move jbd2-debug file to debugfs" patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b

changed jbd2_journal_enable_debug from int type to u8, which makes the
jbd_debug comparison always true when the debugging level is 0.  Thus
the compile warning occurs.

Thought about changing the jbd2_journal_enable_debug data type back to int,
but can't, because the jbd2-debug is moved to debug fs, where calling
debugfs_create_u8() to create the debugfs entry needs the value to be u8
type.

Even if we changed the data type back to int, the code would still be buggy:
the kernel should not print jbd2 debug messages if jbd2_journal_enable_debug
is set to 0.  But this is not the case.

The fix is to change the level of debugging to 1.  The same should be fixed in
ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we
probably should fix it all together.
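
Why level 0 misbehaves can be seen in a small Python model.  It assumes the
debug macro prints a message when the message's level is less than or equal to
the tunable, which is what the description above implies; the function name is
hypothetical and the range check stands in for the u8 type.

```python
def debug_enabled(message_level, enable_debug):
    """Model of the level check: print when message_level <= enable_debug."""
    if not 0 <= enable_debug <= 255:
        raise ValueError("enable_debug models a u8")
    return message_level <= enable_debug
```

For a message at level 0 the comparison holds for every possible u8 value,
including 0 ("debugging off"), so the compiler warns that it is always true
and the message is always printed.  Moving such messages to level 1 makes the
tunable's value 0 silence them.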

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago jbd: fix commit code to properly abort journal
Jan Kara [Fri, 19 Oct 2007 06:39:22 +0000 (23:39 -0700)] 
jbd: fix commit code to properly abort journal

We should really call journal_abort() and not __journal_abort_hard() in
case of errors.  The latter call does not record the error in the journal
superblock and thus the filesystem won't be marked as having errors later
(and the user could happily mount it without any warning).

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago jbd: config_jbd_debug cannot create /proc entry
Jose R. Santos [Fri, 19 Oct 2007 06:39:22 +0000 (23:39 -0700)] 
jbd: config_jbd_debug cannot create /proc entry

The jbd-debug file used to be located in /proc/sys/fs/jbd-debug, but
create_proc_entry() does not do lookups on file names that are more than
one directory deep.  This causes the entry creation to fail and hence, no
proc file is created.

Instead of fixing this in procfs, we might as well move the jbd-debug file to
debugfs, which would be the preferred location for this kind of tunable.
The new location is now /sys/kernel/debug/jbd/jbd-debug.

[akpm@linux-foundation.org: zillions of cleanups]
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago jbd: remove printk() from J_ASSERT macros
Chris Snook [Fri, 19 Oct 2007 06:39:21 +0000 (23:39 -0700)] 
jbd: remove printk() from J_ASSERT macros

Remove printk from J_ASSERT to preserve registers during BUG.

Signed-off-by: Chris Snook <csnook@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago JBD/ext3 cleanups: convert to kzalloc
Mingming Cao [Fri, 19 Oct 2007 06:39:20 +0000 (23:39 -0700)] 
JBD/ext3 cleanups: convert to kzalloc

Convert kmalloc to kzalloc() and get rid of the memset().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago isdn/sc: remove unused REQUEST_IRQ and unnecessary header file
Fernando Luis Vázquez Cao [Fri, 19 Oct 2007 06:39:20 +0000 (23:39 -0700)] 
isdn/sc: remove unused REQUEST_IRQ and unnecessary header file

REQUEST_IRQ is never used, so delete it. In the process get rid of the
macro FREE_IRQ which makes the code unnecessarily difficult to read.

Signed-off-by: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Acked-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago isdn: fix random hard freeze with AVM T1 cards
Karsten Keil [Fri, 19 Oct 2007 06:39:19 +0000 (23:39 -0700)] 
isdn: fix random hard freeze with AVM T1 cards

This applies the fix for the hard freeze debugged on AVM C4 cards to the AVM
T1 cards.

Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago isdn: fix random hard freeze with AVM cards using b1dma
Karsten Keil [Fri, 19 Oct 2007 06:39:19 +0000 (23:39 -0700)] 
isdn: fix random hard freeze with AVM cards using b1dma

This fixes the hard freeze debugged for AVM C4 cards using the b1dma
interface.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago isdn: fix random hard freeze with AVM c4 card part 2
Karsten Keil [Fri, 19 Oct 2007 06:39:18 +0000 (23:39 -0700)] 
isdn: fix random hard freeze with AVM c4 card part 2

One call was missing in the previous patch.

Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago fix a trivial typo in scripts/checkstack.pl
Joern Engel [Fri, 19 Oct 2007 06:39:18 +0000 (23:39 -0700)] 
fix a trivial typo in scripts/checkstack.pl

Trivial change in a comment.

Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Andre Haupt <andre@finow14.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Console events and accessibility
Samuel Thibault [Fri, 19 Oct 2007 06:39:17 +0000 (23:39 -0700)] 
Console events and accessibility

Some external modules like Speakup need to monitor console output.

This adds a VT notifier that such modules can use to get console output events:
allocation, deallocation, writes, other updates (cursor position, switch, etc.)

[akpm@linux-foundation.org: fix headers_check]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Add kernel/notifier.c
Alexey Dobriyan [Fri, 19 Oct 2007 06:39:16 +0000 (23:39 -0700)] 
Add kernel/notifier.c

There is a separate notifier header, but no separate notifier .c file.

Extract notifier code out of kernel/sys.c which will remain for
misc syscalls I hope. Merge kernel/die_notifier.c into kernel/notifier.c.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago setup vma->vm_page_prot by vm_get_page_prot()
Coly Li [Fri, 19 Oct 2007 06:39:15 +0000 (23:39 -0700)] 
setup vma->vm_page_prot by vm_get_page_prot()

This patch uses vm_get_page_prot() to setup vma->vm_page_prot.

Though inside vm_get_page_prot() the protection flags are ANDed with
(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.
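
The effect of the mask can be modelled in a few lines of Python.  The flag
values mirror the kernel's definitions as far as I know them and should be
treated as assumptions of this sketch, as should the function name; the point
is only that flags outside the mask cannot influence the lookup.

```python
VM_READ, VM_WRITE, VM_EXEC, VM_SHARED = 0x1, 0x2, 0x4, 0x8
VM_GROWSDOWN = 0x100  # example of a flag outside the mask (assumed value)

PROT_MASK = VM_READ | VM_WRITE | VM_EXEC | VM_SHARED


def protection_index(vm_flags):
    """Index into a 16-entry protection table for the given vm_flags."""
    return vm_flags & PROT_MASK
```

Callers that already pass only the four protection bits get the same result
with or without the mask, which is why the extra AND is harmless.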

Signed-off-by: Coly Li <coyli@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago remove unused flush_tlb_pgtables
Benjamin Herrenschmidt [Fri, 19 Oct 2007 06:39:14 +0000 (23:39 -0700)] 
remove unused flush_tlb_pgtables

Nobody uses flush_tlb_pgtables anymore, so this patch removes all remaining
traces of it from all archs.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago kmap leak fix for x86_32 kdump
Fernando Luis Vázquez Cao [Fri, 19 Oct 2007 06:39:14 +0000 (23:39 -0700)] 
kmap leak fix for x86_32 kdump

copy_oldmem_page should not return leaving a page frame from the
previous kernel mapped.

Signed-off-by: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago ps3av: remove unused fields in ps3av_monitor_quirks
Geert Uytterhoeven [Fri, 19 Oct 2007 06:39:13 +0000 (23:39 -0700)] 
ps3av: remove unused fields in ps3av_monitor_quirks

Remove the `clear_50' and `clear_vesa' fields of struct
ps3av_monitor_quirk, as they're currently unused.  We can always re-add
them when we really need them.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Console keyboard events and accessibility
Samuel Thibault [Fri, 19 Oct 2007 06:39:12 +0000 (23:39 -0700)] 
Console keyboard events and accessibility

Some blind people use a kernel engine called Speakup which uses hardware
synthesis to speak what gets displayed on the screen.  They use the
PC keyboard to control this engine (start/stop, accelerate, ...) and
also need to get keyboard feedback (to make sure to know what they are
typing, the caps lock status, etc.)

Up to now, the way it was done was very ugly.  Below is a patch to add a
notifier list for permitting a far better implementation, see ChangeLog
above for details.

You may wonder why this can't be done at the input layer.  The problem
is that what people want to monitor is the console keyboard, i.e. all
input keyboards that got attached to the console, and with the currently
active keymap (i.e. keysyms, not only keycodes).

This adds a keyboard notifier that such modules can use to get the keyboard
events and possibly eat them, at several stages:

- keycodes: even before translation into keysym.
- unbound keycodes: when no keysym is bound.
- unicode: when the keycode would get translated into a unicode character.
- keysym: when the keycode would get translated into a keysym.
- post_keysym: after the keysym got interpreted, so as to see the result
  (caps lock, etc.)

This also provides access to k_handler so as to permit simulation of
keypresses.
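
The staged notification described above can be modelled in user space
roughly as follows.  This is only a sketch, not the code from the patch:
the enum values, struct kbd_event, kbd_listener_register() and kbd_notify()
are hypothetical names chosen for illustration, and the listener table is
a plain array rather than a kernel notifier chain.

```c
#include <stddef.h>

/* Hypothetical stage identifiers, one per stage listed in the message. */
enum kbd_stage {
    STAGE_KEYCODE,          /* before translation into a keysym */
    STAGE_UNBOUND_KEYCODE,  /* no keysym is bound to the keycode */
    STAGE_UNICODE,          /* keycode translated to a unicode character */
    STAGE_KEYSYM,           /* keycode translated to a keysym */
    STAGE_POST_KEYSYM       /* after the keysym has been interpreted */
};

/* Hypothetical callback results: let the event continue, or eat it. */
enum kbd_verdict { KBD_PASS, KBD_EAT };

struct kbd_event {
    enum kbd_stage stage;
    unsigned int value;     /* keycode, unicode value or keysym, per stage */
    int down;               /* 1 = press, 0 = release */
};

typedef enum kbd_verdict (*kbd_callback)(const struct kbd_event *ev);

#define MAX_LISTENERS 8
static kbd_callback listeners[MAX_LISTENERS];
static size_t nr_listeners;

/* Add a listener; returns 0 on success, -1 if the table is full. */
int kbd_listener_register(kbd_callback cb)
{
    if (nr_listeners == MAX_LISTENERS)
        return -1;
    listeners[nr_listeners++] = cb;
    return 0;
}

/* Call listeners in registration order; stop at the first that eats. */
enum kbd_verdict kbd_notify(const struct kbd_event *ev)
{
    for (size_t i = 0; i < nr_listeners; i++)
        if (listeners[i](ev) == KBD_EAT)
            return KBD_EAT;
    return KBD_PASS;
}

/* Example listener: swallow keycode 58, but only at the keycode stage. */
enum kbd_verdict eat_keycode_58(const struct kbd_event *ev)
{
    if (ev->stage == STAGE_KEYCODE && ev->value == 58)
        return KBD_EAT;
    return KBD_PASS;
}
```

In this model the keyboard code would call kbd_notify() once per stage and
skip its own handling of that stage whenever the result is KBD_EAT.
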

[akpm@linux-foundation.org: various fixes]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago put declaration of put_filesystem() in fs.h
Miklos Szeredi [Fri, 19 Oct 2007 06:39:11 +0000 (23:39 -0700)] 
put declaration of put_filesystem() in fs.h

Declarations go into headers.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Ram Pai <linuxram@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago advansys: depends on VIRT_TO_BUS
Andrew Morton [Fri, 19 Oct 2007 06:39:10 +0000 (23:39 -0700)] 
advansys: depends on VIRT_TO_BUS

Fix powerpc allmodconfig build: advansys requires virt_to_bus() but powerpc
doesn't implement it.

Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik...
Linus Torvalds [Fri, 19 Oct 2007 02:31:54 +0000 (19:31 -0700)] 
Merge branch 'upstream-linus' of /linux/kernel/git/jgarzik/netdev-2.6

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  pcnet32: remove private net_device_stats structure
  vortex_up should initialize "err"
  pcnet32: remove compile warnings in non-napi mode
  pcnet32: fix non-napi packet reception
  fix EMAC driver for proper napi_synchronize API
  sky2: shutdown cleanup
  napi_synchronize: waiting for NAPI
  forcedeth msi bugfix
  gianfar: fix obviously wrong #ifdef CONFIG_GFAR_NAPI placement
  fs_enet: Update for API changes
  gianfar: remove orphan struct.
  forcedeth: fix rx-work condition in nv_rx_process_optimized() too

17 years ago Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
Linus Torvalds [Thu, 18 Oct 2007 23:00:02 +0000 (16:00 -0700)] 
Merge /pub/scm/linux/kernel/git/bart/ide-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6: (37 commits)
  ide: set drive->autotune in ide_pci_setup_ports()
  triflex: always tune PIO
  opti621: always tune PIO
  cy82c693: always tune PIO
  cs5520: always tune PIO
  alim15x3: always tune PIO
  ide: add IDE_HFLAG_LEGACY_IRQS host flag
  ide: add IDE_HFLAG_SERIALIZE host flag
  ide: add IDE_HFLAG_ERROR_STOPS_FIFO host flag
  piix: add DECLARE_ICH_DEV() macro
  pdc202xx_old: add DECLARE_PDC2026X_DEV() macro
  pdc202xx_new: add DECLARE_PDCNEW_DEV() macro
  aec62xx: no need to disable UDMA in ->init_hwif method for ATP850UF
  ide: remove .init_setup from ide_pci_device_t
  serverworks: remove ->init_setup
  scc_pata: remove ->init_setup
  pdc202xx_old: remove ->init_setup
  pdc202xx_new: remove ->init_setup
  hpt366: remove ->init_setup
  cmd64x: remove ->init_setup
  ...

17 years ago ide: set drive->autotune in ide_pci_setup_ports()
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:12 +0000 (00:30 +0200)] 
ide: set drive->autotune in ide_pci_setup_ports()

The majority of host drivers using the IDE PCI layer set drive->autotune;
the only exceptions are:

generic.c
ns87415.c
rz1000.c
trm290.c
* no ->set_pio_mode method

it821x.c:
* if memory allocation fails drive->autotune won't be set
  (but there also won't be ->set_pio_mode method in such case)

piix.c:
* MPIIX controller (no ->init_hwif method so also no ->set_pio_mode method)

However, if there is no ->set_pio_mode method, there are no changes in behavior
w.r.t. PIO tuning, so always set drive->autotune in ide_pci_setup_ports().
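
The reasoning can be illustrated with a small user-space model.  It is a
sketch under assumptions, not the IDE core: struct host, struct drive,
setup_drive() and tune_pio() are hypothetical stand-ins, and 0xff is an
arbitrary marker for "PIO mode never programmed".

```c
#include <stddef.h>

struct drive;

struct host {
    /* Optional method; NULL models a driver without ->set_pio_mode. */
    void (*set_pio_mode)(struct drive *drive, unsigned char pio);
};

struct drive {
    struct host *host;
    int autotune;           /* set unconditionally by the setup code */
    unsigned char pio;      /* programmed PIO mode; 0xff = never touched */
};

/* Model of the setup path: autotune is always requested. */
void setup_drive(struct drive *d, struct host *h)
{
    d->host = h;
    d->autotune = 1;
    d->pio = 0xff;
}

/* Tune only when asked to and when the host can do it; 1 if tuned. */
int tune_pio(struct drive *d, unsigned char best_pio)
{
    if (!d->autotune || d->host->set_pio_mode == NULL)
        return 0;
    d->host->set_pio_mode(d, best_pio);
    return 1;
}

/* Example method for a host that does support PIO tuning. */
void example_set_pio_mode(struct drive *d, unsigned char pio)
{
    d->pio = pio;
}
```

A host without the method is unaffected by the flag being set, which is
the point the message makes.
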

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago triflex: always tune PIO
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:11 +0000 (00:30 +0200)] 
triflex: always tune PIO

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago opti621: always tune PIO
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:11 +0000 (00:30 +0200)] 
opti621: always tune PIO

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago cy82c693: always tune PIO
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:11 +0000 (00:30 +0200)] 
cy82c693: always tune PIO

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago cs5520: always tune PIO
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:11 +0000 (00:30 +0200)] 
cs5520: always tune PIO

Since cs5520 uses VDMA, the best PIO mode was tuned anyway by ide_dma_check(),
but only if DMA was successfully initialized.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago alim15x3: always tune PIO
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:11 +0000 (00:30 +0200)] 
alim15x3: always tune PIO

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add IDE_HFLAG_LEGACY_IRQS host flag
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:11 +0000 (00:30 +0200)] 
ide: add IDE_HFLAG_LEGACY_IRQS host flag

Add IDE_HFLAG_LEGACY_IRQS host flag to tell ide_pci_setup_ports() to set
hwif->irq to legacy IRQ 14/15 (iff hwif->irq is not already set) and convert
atiixp, piix, serverworks, sis5513 and slc90e66 host drivers to use it.
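
The "iff hwif->irq is not already set" rule might look like the sketch
below.  The flag value, struct hwif layout and apply_legacy_irq() are
hypothetical; the sketch also assumes that an IRQ of 0 means "not set" and
that the first channel gets IRQ 14 and the second IRQ 15, the conventional
PC assignment.

```c
#define HFLAG_LEGACY_IRQS (1u << 0)   /* hypothetical bit position */

struct hwif {
    unsigned int channel;   /* 0 = primary port, 1 = secondary port */
    int irq;                /* 0 = not assigned yet (assumption) */
};

/* Fall back to the legacy IRQ only when the flag is set and no IRQ exists. */
void apply_legacy_irq(struct hwif *hwif, unsigned int host_flags)
{
    if ((host_flags & HFLAG_LEGACY_IRQS) && hwif->irq == 0)
        hwif->irq = hwif->channel ? 15 : 14;
}
```
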

While at it:

* In piix.c add IDE_HFLAGS_PIIX define and don't use ->init_hwif for MPIIX.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add IDE_HFLAG_SERIALIZE host flag
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:10 +0000 (00:30 +0200)] 
ide: add IDE_HFLAG_SERIALIZE host flag

Add IDE_HFLAG_SERIALIZE host flag to tell ide_pci_setup_ports() to set
hwif/mate->serialized and convert aec62xx, cs5530 and sc1200 host drivers
to use it.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add IDE_HFLAG_ERROR_STOPS_FIFO host flag
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:10 +0000 (00:30 +0200)] 
ide: add IDE_HFLAG_ERROR_STOPS_FIFO host flag

Add IDE_HFLAG_ERROR_STOPS_FIFO host flag and use it instead
of hwif->err_stops_fifo.  As a side-effect this change fixes
hwif->err_stops_fifo not being restored by ide_hwif_restore().

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago piix: add DECLARE_ICH_DEV() macro
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:10 +0000 (00:30 +0200)] 
piix: add DECLARE_ICH_DEV() macro

Add DECLARE_ICH_DEV() macro.

While at it:

* Add init_hwif_ich() (->init_hwif method) for ICH controllers.

* Rename init_chipset_piix() to init_chipset_ich() and use it only for
  ICH controllers.

* Remove no longer needed piix_is_ichx() helper.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago pdc202xx_old: add DECLARE_PDC2026X_DEV() macro
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:10 +0000 (00:30 +0200)] 
pdc202xx_old: add DECLARE_PDC2026X_DEV() macro

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago pdc202xx_new: add DECLARE_PDCNEW_DEV() macro
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:10 +0000 (00:30 +0200)] 
pdc202xx_new: add DECLARE_PDCNEW_DEV() macro

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago aec62xx: no need to disable UDMA in ->init_hwif method for ATP850UF
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:10 +0000 (00:30 +0200)] 
aec62xx: no need to disable UDMA in ->init_hwif method for ATP850UF

* No need to disable UDMA in ->init_hwif method for ATP850UF (and since we
  now always tune PIO it will be disabled by ->set_pio_mode calls anyway).

* Bump driver version.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: remove .init_setup from ide_pci_device_t
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
ide: remove .init_setup from ide_pci_device_t

Now that all users have been fixed, we can safely remove it.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago serverworks: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
serverworks: remove ->init_setup

Merge init_setup_{svwks,csb6}() into svwks_init_one().

While at it:

* Remove redundant dev->device checks.

* Operate on a local copy of serverworks_chipsets[] entry.

* Use pci_resource_start().

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago scc_pata: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
scc_pata: remove ->init_setup

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago pdc202xx_old: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
pdc202xx_old: remove ->init_setup

* Split off pdc202ata4_fixup_irq() helper from init_setup_pdc202ata4().

* Merge init_setup_{pdc202ata4,pdc20265,pdc202xx}() into pdc202xx_init_one().

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago pdc202xx_new: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
pdc202xx_new: remove ->init_setup

* Split off pdc20270_get_dev2() helper from init_setup_pdc20270().

* Merge init_setup_{pdcnew,pdc20270,pdc20276}() into pdc202new_init_one().

While at it:

* Change KERN_ level of interrupt fixup message from KERN_WARNING to KERN_INFO.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago hpt366: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
hpt366: remove ->init_setup

* Split off hpt{374,371,366}_init() helper from init_setup_hpt{374,371,366}().

* Merge init_setup_{374,372n,371,372a,302,366}() into hpt366_init_one().

While at it:

* Use "HPT36x" name for HPT366/HPT368 chipsets.

* Add .chip_name to struct hpt_info and use it to set d->name.

* Convert .max_ultra in struct hpt_info to .udma_mask and use it to set
  d->udma_mask.

* Fix hpt302 to use HPT302_ALLOW_ATA133_6 define.

* Change HPT366/HPT374 interrupt fixup message from KERN_WARNING to KERN_INFO.

* Use the second hpt366_chipsets[] entry for HPT37x chipsets using HPT36x PCI
  device ID and fix .enablebits/.host_flags for HPT36x hpt366_chipsets[] entry.

* Bump driver version.

Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago cmd64x: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:09 +0000 (00:30 +0200)] 
cmd64x: remove ->init_setup

Merge init_setup_{cmd64x,cmd646}() into cmd64x_init_one().

Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago aec62xx: remove ->init_setup
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:08 +0000 (00:30 +0200)] 
aec62xx: remove ->init_setup

Merge init_setup_{aec62xx,aec6x80}() into aec62xx_init_one().

While at it:

* Use id->driver_data instead of dev->device.

* Use ATA_UDMA6 define.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: use I/O ops directly part #2 (take 2)
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:08 +0000 (00:30 +0200)] 
ide: use I/O ops directly part #2 (take 2)

v2:
- bump host driver versions (as suggested by Sergei)
- use I/O ops directly in drivers/ide/setup-pci.c

Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: use pci_dev->revision
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:08 +0000 (00:30 +0200)] 
ide: use pci_dev->revision

Some places were using PCI_CLASS_REVISION instead of PCI_REVISION_ID so
they were not converted by commit 44c10138fd4bbc4b6d6bff0873c24902f2a9da65.
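
Both register names refer to PCI configuration offset 0x08: read as a
byte it is the revision ID, read as a dword it is the class code in the
upper 24 bits with the revision in the low byte.  The sketch below shows
that the two reads yield the same revision.  The config space is a plain
byte array and the helper names are hypothetical; real drivers would use
the kernel's PCI config accessors or the cached dev->revision instead.

```c
#include <stdint.h>

#define CFG_REVISION_ID    0x08  /* byte read: revision ID */
#define CFG_CLASS_REVISION 0x08  /* dword read: class code + revision */

/* Read one byte from a (simulated) configuration space image. */
uint8_t cfg_read_byte(const uint8_t *cfg, unsigned int off)
{
    return cfg[off];
}

/* Read a little-endian dword from the same image. */
uint32_t cfg_read_dword(const uint8_t *cfg, unsigned int off)
{
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/* Revision via the byte register. */
uint8_t revision_via_byte(const uint8_t *cfg)
{
    return cfg_read_byte(cfg, CFG_REVISION_ID);
}

/* Revision via the dword register: mask off the class code. */
uint8_t revision_via_dword(const uint8_t *cfg)
{
    return (uint8_t)(cfg_read_dword(cfg, CFG_CLASS_REVISION) & 0xff);
}
```
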

Cc: Auke Kok <auke-jan.h.kok@intel.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago cmd64x: Use dev->revision
Auke Kok [Thu, 18 Oct 2007 22:30:08 +0000 (00:30 +0200)] 
cmd64x: Use dev->revision

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago amd74xx: Omit PCI_REVISION_ID read
Auke Kok [Thu, 18 Oct 2007 22:30:07 +0000 (00:30 +0200)] 
amd74xx: Omit PCI_REVISION_ID read

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add ->mwdma_mask and ->swdma_mask to ide_pci_device_t (take 2)
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:07 +0000 (00:30 +0200)] 
ide: add ->mwdma_mask and ->swdma_mask to ide_pci_device_t (take 2)

* Add ->mwdma_mask and ->swdma_mask to ide_pci_device_t.

* Set ide_hwif_t DMA masks using DMA masks from ide_pci_device_t in
  setup-pci.c::ide_pci_setup_ports() (iff DMA base is valid; the ->init_hwif
  method may still override them).

* Convert IDE PCI host drivers to use ide_pci_device_t DMA masks.
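
A rough model of the mask handling described in the list above: masks
are copied from the per-device description only when the DMA base is
valid, and the ->init_hwif method runs afterwards so it can still narrow
them.  All struct and function names here are hypothetical, and the MASK_*
constants merely imitate the cumulative "mode N and below" style of the
ATA_* defines.

```c
#include <stddef.h>
#include <stdint.h>

/* Cumulative mode masks: bit N set means mode N is supported. */
#define MASK_SWDMA2 0x07
#define MASK_MWDMA2 0x07
#define MASK_UDMA2  0x07
#define MASK_UDMA5  0x3f

struct hwif {
    unsigned long dma_base;     /* 0 = no usable DMA base (assumption) */
    uint8_t swdma_mask, mwdma_mask, udma_mask;
};

struct pci_device_info {
    uint8_t swdma_mask, mwdma_mask, udma_mask;
    void (*init_hwif)(struct hwif *hwif);   /* optional, may be NULL */
};

/* Copy the masks iff DMA is usable, then let the driver override them. */
void setup_port(struct hwif *hwif, const struct pci_device_info *d)
{
    if (hwif->dma_base) {
        hwif->swdma_mask = d->swdma_mask;
        hwif->mwdma_mask = d->mwdma_mask;
        hwif->udma_mask = d->udma_mask;
    }
    if (d->init_hwif)
        d->init_hwif(hwif);
}

/* Example override: restrict the port to UDMA2 and below. */
void limit_to_udma2(struct hwif *hwif)
{
    hwif->udma_mask &= MASK_UDMA2;
}
```
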

While at it:

* Use ATA_{UDMA,MWDMA,SWDMA}* defines.

* hpt34x.c: add separate ide_pci_device_t instances for HPT343 and HPT345.

* serverworks.c: fix DMA masks being set before checking DMA base.

v2:
* Add missing masks to DECLARE_GENERIC_PCI_DEV() macro.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago pdc202xx_old: remove broken SWDMA support
Bartlomiej Zolnierkiewicz [Thu, 18 Oct 2007 22:30:07 +0000 (00:30 +0200)] 
pdc202xx_old: remove broken SWDMA support

The documentation doesn't mention SWDMA; moreover, all timings used
for SWDMA modes were over-clocked compared to the ATA spec.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>