Theodore Ts'o [Sat, 20 Apr 2013 19:46:17 +0000 (15:46 -0400)]
ext4: mark all metadata I/O with REQ_META
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's
important that metadata I/O requests are marked as such to avoid
priority inversions caused by I/O bandwidth throttling.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tao Ma [Fri, 19 Apr 2013 21:55:33 +0000 (17:55 -0400)]
ext4: fix readdir error in case inline_data+^dir_index.
Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.'. And what's
worse, we may meet with duplicate dir entries as the offset
for inline dir and non-inline one is quite different.
This patch just try to resolve this problem if dir_index
is disabled. In this case, f_pos is the real offset with
the dir block, so for inline dir, we just pretend as if
we are a dir block and returns the offset like a norml
dir block does.
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tao Ma [Fri, 19 Apr 2013 21:53:09 +0000 (17:53 -0400)]
ext4: fix readdir error in the case of inline_data+dir_index
Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.' and what's worse,
if there is a conversion happens when the user calls getdents
many times, he/she may get the same entry twice.
In theory, a dir block would also fail if it is converted to a
hashed-index based dir since f_pos will become a hash value, not the
real one, but it doesn't happen. And a deep investigation shows that
we uses a hash based solution even for a normal dir if the dir_index
feature is enabled.
So this patch just adds a new htree_inlinedir_to_tree for inline dir,
and if we find that the hash index is supported, we will do like what
we do for a dir block.
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Zheng Liu [Fri, 19 Apr 2013 21:49:23 +0000 (17:49 -0400)]
jbd2: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
The jbd2_alloc_handle() function is only called by new_handle(). So
this commit uses kmem_cache_zalloc() instead of
kmem_cache_alloc()/memset().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Darrick J. Wong [Fri, 19 Apr 2013 18:04:12 +0000 (14:04 -0400)]
ext4: mext_insert_extents should update extent block checksum
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Jan Kara [Fri, 19 Apr 2013 17:38:14 +0000 (13:38 -0400)]
ext4: move quota initialization out of inode allocation transaction
Inode allocation transaction is pretty heavy (246 credits with quotas
and extents before previous patch, still around 200 after it). This is
mostly due to credits required for allocation of quota structures
(credits there are heavily overestimated but it's difficult to make
better estimates if we don't want to wire non-trivial assumptions about
quota format into filesystem).
So move quota initialization out of allocation transaction. That way
transaction for quota structure allocation will be started only if we
need to look up quota structure on disk (rare) and furthermore it will
be started for each quota type separately, not for all of them at once.
This reduces maximum transaction size to 34 is most cases and to 73 in
the worst case.
[ Modified by tytso to clean up the cleanup paths for error handling.
Also use a separate call to ext4_std_error() for each failure so it
is easier for someone who is debugging a problem in this function to
determine which function call failed. ]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Thu, 18 Apr 2013 18:53:15 +0000 (14:53 -0400)]
ext4: reserve xattr index for Rich ACL support
Jan Kara <jack@suse.cz>
SUSE is carrying out of tree patches for Rich ACL support for ext4 as
they didn't get upstream due to opposition of some VFS maintainers.
Reserve xattr index for Rich ACLs so that it cannot be taken by
anything else which would force users to backup and reset their Rich
ACLs on files.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Jan Kara [Fri, 12 Apr 2013 04:03:42 +0000 (00:03 -0400)]
jbd2: reduce journal_head size
Remove unused t_cow_tid field (ext4 copy-on-write support doesn't seem
to be happening) and change b_modified and b_jlist to bitfields thus
saving 8 bytes in the structure.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Jan Kara [Fri, 12 Apr 2013 04:03:19 +0000 (00:03 -0400)]
ext4: clear buffer_uninit flag when submitting IO
Currently noone cleared buffer_uninit flag. This results in writeback
needlessly marking io_end as needing extent conversion scanning extent
tree for extents to convert. So clear the buffer_uninit flag once the
buffer is submitted for IO and the flag is transformed into
EXT4_IO_END_UNWRITTEN flag.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Jan Kara [Fri, 12 Apr 2013 03:56:53 +0000 (23:56 -0400)]
ext4: use io_end for multiple bios
Change writeback path to create just one io_end structure for the
extent to which we submit IO and share it among bios writing that
extent. This prevents needless splitting and joining of unwritten
extents when they cannot be submitted as a single bio.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Jan Kara [Fri, 12 Apr 2013 03:48:32 +0000 (23:48 -0400)]
ext4: make ext4_bio_write_page() use BH_Async_Write flags
So far ext4_bio_write_page() attached all the pages to ext4_io_end
structure. This makes that structure pretty heavy (1 KB for pointers
+ 16 bytes per page attached to the bio). Also later we would like to
share ext4_io_end structure among several bios in case IO to a single
extent needs to be split among several bios and pointing to pages from
ext4_io_end makes this complex.
We remove page pointers from ext4_io_end and use pointers from bio
itself instead. This isn't as easy when blocksize < pagesize because
then we can have several bios in flight for a single page and we have
to be careful when to call end_page_writeback(). However this is a
known problem already solved by block_write_full_page() /
end_buffer_async_write() so we mimic its behavior here. We mark
buffers going to disk with BH_Async_Write flag and in
ext4_bio_end_io() we check whether there are any buffers with
BH_Async_Write flag left. If there are not, we can call
end_page_writeback().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Lukas Czerner [Fri, 12 Apr 2013 03:37:19 +0000 (23:37 -0400)]
ext4: Use kstrtoul() instead of parse_strtoul()
In parse_strtoul() we're still using deprecated simple_strtoul(). Remove
parse_strtoul() altogether and replace it with kstrtoul()
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dmitry Monakhov [Fri, 12 Apr 2013 03:24:58 +0000 (23:24 -0400)]
ext4: defragmentation code cleanup
- grab_cache_page_write_begin() may not wait on page's writeback since
(
1d1d1a767206). But it is still reasonable to wait on page's writeback
here in order to be on the safe side.
- Fix miss typo: pass 'length' instead of 'end' to __block_write_begin()
https://bugzilla.kernel.org/show_bug.cgi?id=56241
TESTCASE: git://oss.sgi.com/xfs/cmds/xfstests.git
MKFS_OPTIONS="-b1024" ; ./check ext4/304
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Akira Fujita <a-fujita.rs.jp.nec.com>
Lukas Czerner [Thu, 11 Apr 2013 14:54:46 +0000 (10:54 -0400)]
ext4: do not convert to indirect with bigalloc enabled
With bigalloc feature enabled we do not support indirect addressing at all
so we have to prevent extent addressing to indirect addressing
conversion in this case. The problem has been introduced with the commit
"ext4: support simple conversion of extent-mapped inodes to use i_blocks"
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lukas Czerner [Thu, 11 Apr 2013 03:32:52 +0000 (23:32 -0400)]
ext4: move ext4_ind_migrate() into migrate.c
Move ext4_ind_migrate() into migrate.c file since it makes much more
sense and ext4_ext_migrate() is there as well.
Also fix tiny style problem - add spaces around "=" in "i=0".
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Wed, 10 Apr 2013 03:59:55 +0000 (23:59 -0400)]
ext4: fix miscellaneous big endian warnings
None of these result in any bug, but they makes sparse complain.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dmitry Monakhov [Wed, 10 Apr 2013 03:56:48 +0000 (23:56 -0400)]
ext4: fix big-endian bug in metadata checksum calculations
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Dmitry Monakhov [Wed, 10 Apr 2013 03:56:44 +0000 (23:56 -0400)]
ext4: fix big-endian bug in extent migration code
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Dmitri Monakho [Wed, 10 Apr 2013 02:48:36 +0000 (22:48 -0400)]
ext4: fix usless declarations
This patch should fix sparse complains about shadow declatations.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lukas Czerner [Wed, 10 Apr 2013 02:11:22 +0000 (22:11 -0400)]
ext4: introduce reserved space
Currently in ENOSPC condition when writing into unwritten space, or
punching a hole, we might need to split the extent and grow extent tree.
However since we can not allocate any new metadata blocks we'll have to
zero out unwritten part of extent or punched out part of extent, or in
the worst case return ENOSPC even though use actually does not allocate
any space.
Also in delalloc path we do reserve metadata and data blocks for the
time we're going to write out, however metadata block reservation is
very tricky especially since we expect that logical connectivity implies
physical connectivity, however that might not be the case and hence we
might end up allocating more metadata blocks than previously reserved.
So in future, metadata reservation checks should be removed since we can
not assure that we do not under reserve.
And this is where reserved space comes into the picture. When mounting
the file system we slice off a little bit of the file system space (2%
or 4096 clusters, whichever is smaller) which can be then used for the
cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
The number of reserved clusters can be set via sysfs, however it can
never be bigger than number of free clusters in the file system.
Note that this patch fixes the failure of xfstest 274 as expected.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Jan Kara [Tue, 9 Apr 2013 16:39:26 +0000 (12:39 -0400)]
ext4: improve credit estimate for EXT4_SINGLEDATA_TRANS_BLOCKS
Estimate of 27 credits for allocation of a block in extent based inode
is unnecessarily high. We can easily argue 20 is enough.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Andrey Sidorov [Tue, 9 Apr 2013 16:22:29 +0000 (12:22 -0400)]
ext4: speed-up releasing blocks on commit
Improve mb_free_blocks speed by clearing entire range at once instead
of iterating over each bit. Freeing block-by-block also makes buddy
bitmap subtree flip twice making most of the work a no-op. Very few
bits in buddy bitmap require change, e.g. freeing entire group is a 1
bit flip only. As a result, releasing blocks of 60G file now takes
5ms instead of 2.7s. This is especially good for non-preemptive
kernels as there is no rescheduling during release.
Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Eric Whitney [Tue, 9 Apr 2013 13:27:31 +0000 (09:27 -0400)]
ext4: fix free space estimate in ext4_nonda_switch()
Values stored in s_freeclusters_counter and s_dirtyclusters_counter
are both in cluster units. Remove the cluster to block conversion
applied to s_freeclusters_counter causing an inflated estimate of
free space because s_dirtyclusters_counter is not similarly
converted. Rename free_blocks and dirty_blocks to better reflect
the units these variables contain to avoid future confusion. This
fix corrects ENOSPC failures for xfstests 127 and 231 on bigalloc
file systems.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Jan Kara [Tue, 9 Apr 2013 13:21:41 +0000 (09:21 -0400)]
ext4: fix deadlock with quota feature
We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was
accounted even for quota files. Thus we could recurse back to quota code
when adding new blocks to quota file which can easily deadlock. Mark
hidden quota files properly.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dmitry Monakhov [Mon, 8 Apr 2013 17:02:25 +0000 (13:02 -0400)]
ext4: fix incorrect lock ordering for ext4_ind_migrate
existing locking ordering: journal-> i_data_sem, but
ext4_ind_migrate() grab locks in opposite order which may result in
deadlock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dr. Tilmann Bubeck [Mon, 8 Apr 2013 16:54:05 +0000 (12:54 -0400)]
ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOT
Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
associated attributes (like i_blocks, i_size, i_flags, ...) from the
specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
typically used to store a boot loader in a secure part of the
filesystem, where it can't be changed by a normal user by accident.
The data blocks of the previous boot loader will be associated with
the given inode.
This usercode program is a simple example of the usage:
int main(int argc, char *argv[])
{
int fd;
int err;
if ( argc != 2 ) {
printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
exit(1);
}
fd = open(argv[1], O_WRONLY);
if ( fd < 0 ) {
perror("open");
exit(1);
}
err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
if ( err < 0 ) {
perror("ioctl");
exit(1);
}
close(fd);
exit(0);
}
[ Modified by Theodore Ts'o to fix a number of bugs in the original code.]
Signed-off-by: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lukas Czerner [Thu, 4 Apr 2013 03:33:30 +0000 (23:33 -0400)]
ext4: print more info in ext4_print_free_blocks()
Additionally print i_allocated_meta_blocks information as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Lukas Czerner [Thu, 4 Apr 2013 03:33:28 +0000 (23:33 -0400)]
ext4: try to prepend extent to the existing one
Currently when inserting extent in ext4_ext_insert_extent() we would
only try to to see if we can append new extent to the found extent. If
we can not, then we proceed with adding new extent into the extent tree,
but then possibly merging it back again.
We can avoid this situation by trying to append and prepend new extent
to the existing ones. However since the new extent can be on either
sides of the existing extent, we have to pick the right extent to try to
append/prepend to.
This patch adds the conditions to pick the right extent to
append/prepend to and adds the actual prepending condition as well. This
will also eliminate the need to use "reserved" block for possibly
growing extent tree.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lukas Czerner [Thu, 4 Apr 2013 03:33:27 +0000 (23:33 -0400)]
ext4: Transfer initialized block to right neighbor if possible
Currently when converting extent to initialized we attempt to transfer
initialized block to the left neighbour if possible when certain
criteria are met. However we do not attempt to do the same for the
right neighbor.
This commit adds the possibility to transfer initialized block to the
right neighbour if:
1. We're not converting the whole extent
2. Both extents are stored in the same extent tree node
3. Right neighbor is initialized
4. Right neighbor is logically abutting the current one
5. Right neighbor is physically abutting the current one
6. Right neighbor would not overflow the length limit
This is basically the same logic as with transferring to the left. This
will gain us some performance benefits since it is faster than inserting
extent and then merging it.
It would also prevent some situation in delalloc patch when we might run
out of metadata reservation. This is due to the fact that we would
attempt to split the extent first (possibly allocating new metadata
block) even though we did not counted for that because it can (and will)
be merged again. This commit fix that scenario, because we no longer
need to split the extent in such case.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Lukas Czerner [Thu, 4 Apr 2013 03:32:34 +0000 (23:32 -0400)]
ext4: introduce ext4_get_group_number()
Currently on many places in ext4 we're using
ext4_get_group_no_and_offset() even though we're only interested in
knowing the block group of the particular block, not the offset within
the block group so we can use more efficient way to compute block
group.
This patch introduces ext4_get_group_number() which computes block
group for a given block much more efficiently. Use this function
instead of ext4_get_group_no_and_offset() everywhere where we're only
interested in knowing the block group.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lukas Czerner [Thu, 4 Apr 2013 02:12:52 +0000 (22:12 -0400)]
ext4: make ext4_block_in_group() much more efficient
Currently in when getting the block group number for a particular
block in ext4_block_in_group() we're using
ext4_get_group_no_and_offset() which uses do_div() to get the block
group and the remainer which is offset within the group.
We don't need all of that in ext4_block_in_group() as we only need to
figure out the group number.
This commit changes ext4_block_in_group() to calculate group number
directly. This shows as a big improvement with regards to cpu
utilization. Measuring fallocate -l 15T on fresh file system with perf
showed that 23% of cpu time was spend in the
ext4_get_group_no_and_offset(). With this change it completely
disappears from the list only bumping the occurrence of
ext4_init_block_bitmap() which is the biggest user of
ext4_block_in_group() by 4%. As the result of this change on my system
the fallocate call was approx. 10% faster.
However since there is '-g' option in mkfs which allow us setting
different groups size (mostly for developers) I've introduced new per
file system flag whether we have a standard block group size or
not. The flag is used to determine whether we can use the bit shift
optimization or not.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dmitry Monakhov [Thu, 4 Apr 2013 02:10:52 +0000 (22:10 -0400)]
ext4: unregister es_shrinker if mount failed
Otherwise destroyed ext_sb_info will be part of global shinker list
and result in the following OOPS:
JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136 /DH55TC
RIP: 0010:[<
ffffffff811bfb2d>] [<
ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:
ffff88011d5cbcd8 EFLAGS:
00010207
RAX:
6b6b6b6b6b6b6b6b RBX:
6b6b6b6b6b6b6b53 RCX:
0000000000000006
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
0000000000000246
RBP:
ffff88011d5cbce8 R08:
0000000000000002 R09:
0000000000000001
R10:
0000000000000001 R11:
0000000000000000 R12:
ffff88011cd3f848
R13:
ffff88011cd3f830 R14:
ffff88011cd3f000 R15:
0000000000000000
FS:
00007f7b721dd7e0(0000) GS:
ffff880121a00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00007fffa6f75038 CR3:
000000011bc1c000 CR4:
00000000000007e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Process mount (pid: 2758, threadinfo
ffff88011d5ca000, task
ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<
ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<
ffffffff81249381>] mount_bdev+0x331/0x340
[<
ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<
ffffffff81362035>] ext4_mount+0x15/0x20
[<
ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<
ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<
ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<
ffffffff8127aa56>] do_mount+0x376/0x380
[<
ffffffff8127ab98>] sys_mount+0x138/0x150
[<
ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
8
RIP [<
ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP <
ffff88011d5cbcd8>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Dmitry Monakhov [Thu, 4 Apr 2013 02:08:52 +0000 (22:08 -0400)]
ext4: fix journal callback list traversal
It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
->ext4_mb_free_metadata()
->ext4_journal_callback_del()
This results in the following issue:
WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be
ffff88019a4ec198, but was
6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107
Call Trace:
[<
ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
[<
ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
[<
ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
[<
ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
[<
ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
[<
ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
[<
ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
[<
ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
[<
ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
[<
ffffffff810ad630>] ? wake_up_bit+0x40/0x40
[<
ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
[<
ffffffff810ac6be>] kthread+0x10e/0x120
[<
ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
[<
ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
[<
ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
This patch fix the issue as follows:
- ext4_journal_commit_callback() make list truly traversial safe
simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
ext4_journal_callback_try_del()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com
Dmitry Monakhov [Thu, 4 Apr 2013 02:06:52 +0000 (22:06 -0400)]
jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback
The following race is possible:
[kjournald2] other_task
jbd2_journal_commit_transaction()
j_state = T_FINISHED;
spin_unlock(&journal->j_list_lock);
->jbd2_journal_remove_checkpoint()
->jbd2_journal_free_transaction();
->kmem_cache_free(transaction)
->j_commit_callback(journal, transaction);
-> USE_AFTER_FREE
WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be
ffff88019a4ec198, but was
6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107
Call Trace:
[<
ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
[<
ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
[<
ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
[<
ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
[<
ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
[<
ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
[<
ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
[<
ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
[<
ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
[<
ffffffff810ad630>] ? wake_up_bit+0x40/0x40
[<
ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
[<
ffffffff810ac6be>] kthread+0x10e/0x120
[<
ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
[<
ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
[<
ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
In order to demonstrace this issue one should mount ext4 with mount -o
discard option on SSD disk. This makes callback longer and race
window becomes wider.
In order to fix this we should mark transaction as finished only after
callbacks have completed
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Theodore Ts'o [Thu, 4 Apr 2013 02:04:52 +0000 (22:04 -0400)]
ext4: support simple conversion of extent-mapped inodes to use i_blocks
In order to make it simpler to test the code which support
i_blocks/indirect-mapped inodes, support the conversion of inodes
which are less than 12 blocks and which are contained in no more than
a single extent.
The primary intended use of this code is to converting freshly created
zero-length files and empty directories.
Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a
check that prevents the clearing of the extent flag. A simple patch
which allows "chattr -e <file>" to work will be checked into the
e2fsprogs git repository.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Thu, 4 Apr 2013 02:02:52 +0000 (22:02 -0400)]
ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.
Commit
deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit
e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().
Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time. To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.
As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started. This
should have a small (but probably not measurable) improvement for
ext4's scalability.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
Theodore Ts'o [Thu, 4 Apr 2013 02:00:52 +0000 (22:00 -0400)]
ext4: add might_sleep() annotations
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Theodore Ts'o [Thu, 4 Apr 2013 01:58:52 +0000 (21:58 -0400)]
ext4: add mutex_is_locked() assertion to ext4_truncate()
[ Added fixup from Lukáš Czerner which only checks the assertion when
the inode is not new and is not being freed. ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Wed, 3 Apr 2013 16:47:17 +0000 (12:47 -0400)]
ext4: refactor truncate code
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
ext4_truncate(). This saves over 60 lines of code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Wed, 3 Apr 2013 16:45:17 +0000 (12:45 -0400)]
ext4: refactor punch hole code
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
into ext4_punch_hole(). This saves over 150 lines of code.
This also fixes a potential bug when the punch_hole() code is racing
against indirect-to-extents or extents-to-indirect migation. We are
currently using i_mutex to protect against changes to the inode flag;
specifically, the append-only, immutable, and extents inode flags. So
we need to take i_mutex before deciding whether to use the
extents-specific or indirect-specific punch_hole code.
Also, there was a missing call to ext4_inode_block_unlocked_dio() in
the indirect punch codepath. This was added in commit
02d262dffcf4c
to block DIO readers racing against the punch operation in the
codepath for extent-mapped inodes, but it was missing for
indirect-block mapped inodes. One of the advantages of refactoring
the code is that it makes such oversights much less likely.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Wed, 3 Apr 2013 16:43:17 +0000 (12:43 -0400)]
ext4: fold ext4_alloc_blocks() in ext4_alloc_branch()
The older code was far more complicated than it needed to be because
of how we spliced in the ext4's new multiblock allocator into ext3's
indirect block code. By folding ext4_alloc_blocks() into
ext4_alloc_branch(), we make the code far more understable, shave off
over 130 lines of code and half a kilobyte of compiled object code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Zheng Liu [Wed, 3 Apr 2013 16:41:17 +0000 (12:41 -0400)]
ext4: fold ext4_generic_write_end() into ext4_write_end()
After collapsing the handling of data ordered and data writeback
codepath, ext4_generic_write_end() has only one caller,
ext4_write_end(). So we fold it into ext4_write_end().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Theodore Ts'o [Wed, 3 Apr 2013 16:39:17 +0000 (12:39 -0400)]
ext4: collapse handling of data=ordered and data=writeback codepaths
The only difference between how we handle data=ordered and
data=writeback is a single call to ext4_jbd2_file_inode(). Eliminate
code duplication by factoring out redundant the code paths.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Zheng Liu [Wed, 3 Apr 2013 16:27:18 +0000 (12:27 -0400)]
ext4: fix big-endian bugs which could cause fs corruptions
When an extent was zeroed out, we forgot to do convert from cpu to le16.
It could make us hit a BUG_ON when we try to write dirty pages out. So
fix it.
[ Also fix a bug found by Dmitry Monakhov where we were missing
le32_to_cpu() calls in the new indirect punch hole code.
There are a number of other big endian warnings found by static code
analyzers, but we'll wait for the next merge window to fix them all
up. These fixes are designed to be Obviously Correct by code
inspection, and easy to demonstrate that it won't make any
difference (and hence, won't introduce any bugs) on little endian
architectures such as x86. --tytso ]
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Linus Torvalds [Sun, 31 Mar 2013 22:12:43 +0000 (15:12 -0700)]
Linux 3.9-rc5
Linus Torvalds [Sun, 31 Mar 2013 18:41:47 +0000 (11:41 -0700)]
Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine fixes from Vinod Koul:
"Two fixes for slave-dmaengine.
The first one is for making slave_id value correct for dw_dmac and
the other one fixes the endieness in DT parsing"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dw_dmac: adjust slave_id accordingly to request line base
dmaengine: dw_dma: fix endianess for DT xlate function
Linus Torvalds [Sun, 31 Mar 2013 18:40:33 +0000 (11:40 -0700)]
Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"For a some fixes for Kernel 3.9:
- subsystem build fix when VIDEO_DEV=y, VIDEO_V4L2=m and I2C=m
- compilation fix for arm multiarch preventing IR_RX51 to be selected
- regression fix at bttv crop logic
- s5p-mfc/m5mols/exynos: a few fixes for cameras on exynos hardware"
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] [REGRESSION] bt8xx: Fix too large height in cropcap
[media] fix compilation with both V4L2 and I2C as 'm'
[media] m5mols: Fix bug in stream on handler
[media] s5p-fimc: Do not attempt to disable not enabled media pipeline
[media] s5p-mfc: Fix encoder control 15 issue
[media] s5p-mfc: Fix frame skip bug
[media] s5p-fimc: send valid m2m ctx to fimc_m2m_job_finish
[media] exynos-gsc: send valid m2m ctx to gsc_m2m_job_finish
[media] fimc-lite: Fix the variable type to avoid possible crash
[media] fimc-lite: Initialize 'step' field in fimc_lite_ctrl structure
[media] ir: IR_RX51 only works on OMAP2
Linus Torvalds [Sun, 31 Mar 2013 18:38:59 +0000 (11:38 -0700)]
Merge tag 'for-linus-
20130331' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Alright, this time from 10K up in the air.
Collection of fixes that have been queued up since the merge window
opened, hence postponed until later in the cycle. The pull request
contains:
- A bunch of fixes for the xen blk front/back driver.
- A round of fixes for the new IBM RamSan driver, fixing various
nasty issues.
- Fixes for multiple drives from Wei Yongjun, bad handling of return
values and wrong pointer math.
- A fix for loop properly killing partitions when being detached."
* tag 'for-linus-
20130331' of git://git.kernel.dk/linux-block: (25 commits)
mg_disk: fix error return code in mg_probe()
rsxx: remove unused variable
rsxx: enable error return of rsxx_eeh_save_issued_dmas()
block: removes dynamic allocation on stack
Block: blk-flush: Fixed indent code style
cciss: fix invalid use of sizeof in cciss_find_cfgtables()
loop: cleanup partitions when detaching loop device
loop: fix error return code in loop_add()
mtip32xx: fix error return code in mtip_pci_probe()
xen-blkfront: remove frame list from blk_shadow
xen-blkfront: pre-allocate pages for requests
xen-blkback: don't store dev_bus_addr
xen-blkfront: switch from llist to list
xen-blkback: fix foreach_grant_safe to handle empty lists
xen-blkfront: replace kmalloc and then memcpy with kmemdup
xen-blkback: fix dispatch_rw_block_io() error path
rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas()
Adding in EEH support to the IBM FlashSystem 70/80 device driver
block: IBM RamSan 70/80 error message bug fix.
block: IBM RamSan 70/80 branding changes.
...
Paul Walmsley [Sun, 31 Mar 2013 00:04:40 +0000 (00:04 +0000)]
Revert "lockdep: check that no locks held at freeze time"
This reverts commit
6aa9707099c4b25700940eb3d016f16c4434360d.
Commit
6aa9707099c4 ("lockdep: check that no locks held at freeze time")
causes problems with NFS root filesystems. The failures were noticed on
OMAP2 and 3 boards during kernel init:
[ BUG: swapper/0/1 still has locks held! ]
3.9.0-rc3-00344-ga937536 #1 Not tainted
-------------------------------------
1 lock held by swapper/0/1:
#0: (&type->s_umount_key#13/1){+.+.+.}, at: [<
c011e84c>] sget+0x248/0x574
stack backtrace:
rpc_wait_bit_killable
__wait_on_bit
out_of_line_wait_on_bit
__rpc_execute
rpc_run_task
rpc_call_sync
nfs_proc_get_root
nfs_get_root
nfs_fs_mount_common
nfs_try_mount
nfs_fs_mount
mount_fs
vfs_kern_mount
do_mount
sys_mount
do_mount_root
mount_root
prepare_namespace
kernel_init_freeable
kernel_init
Although the rootfs mounts, the system is unstable. Here's a transcript
from a PM test:
http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/
20130317194234/pm/37xxevm/37xxevm_log.txt
Here's what the test log should look like:
http://www.pwsan.com/omap/testlogs/test_v3.8/
20130218214403/pm/37xxevm/37xxevm_log.txt
Mailing list discussion is here:
http://lkml.org/lkml/2013/3/4/221
Deal with this for v3.9 by reverting the problem commit, until folks can
figure out the right long-term course of action.
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: <maciej.rutecki@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 30 Mar 2013 20:13:05 +0000 (13:13 -0700)]
Merge git://git./linux/kernel/git/nab/target-pending
Pull SCSI target fixes from Nicholas Bellinger:
"This includes the bug-fix for a >= v3.8-rc1 regression specific to
iscsi-target persistent reservation conflict handling (CC'ed to
stable), and a tcm_vhost patch to drop VIRTIO_RING_F_EVENT_IDX usage
so that in-flight qemu vhost-scsi-pci device code can detect the
proper vhost feature bits.
Also, there are two more tcm_vhost patches still being discussed by
MST and Asias for v3.9 that will be required for the in-flight qemu
vhost-scsi-pci device patch to function properly, and that should
(hopefully) be the last target fixes for this round."
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target: Fix RESERVATION_CONFLICT status regression for iscsi-target special case
tcm_vhost: Avoid VIRTIO_RING_F_EVENT_IDX feature bit
Andy Shevchenko [Wed, 20 Feb 2013 11:52:17 +0000 (13:52 +0200)]
dw_dmac: adjust slave_id accordingly to request line base
On some hardware configurations we have got the request line with the offset.
The patch introduces convert_slave_id() helper for that cases. The request line
base is came from the driver data provided by the platform_device_id table.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Arnd Bergmann [Sun, 3 Mar 2013 20:51:28 +0000 (20:51 +0000)]
dmaengine: dw_dma: fix endianess for DT xlate function
As reported by Wu Fengguang's build robot tracking sparse warnings, the
dma_spec arguments in the dw_dma_xlate are already byte swapped on
little-endian platforms and must not get swapped again. This code is
currently not used anywhere, but will be used in Linux 3.10 when the
ARM SPEAr platform starts using the generic DMA DT binding.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Rafael J. Wysocki [Fri, 29 Mar 2013 21:59:53 +0000 (22:59 +0100)]
PNP: List Rafael Wysocki as a maintainer
The Adam Belay's e-mail address in MAINTAINERS under PNP SUPPORT
is not valid any more and I started to maintain that code in the
meantime as a matter of fact, so list myself as a maintainer of it
along with Bjorn and remove the Adam's entry from it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 29 Mar 2013 18:47:43 +0000 (11:47 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull ceph fix from Sage Weil:
"This fixes a regression introduced during the last merge window when
mapping non-existent images."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: don't zero-fill non-image object requests
Alex Elder [Wed, 27 Mar 2013 14:16:30 +0000 (09:16 -0500)]
rbd: don't zero-fill non-image object requests
A result of ENOENT from a read request for an object that's part of
an rbd image indicates that there is a hole in that portion of the
image. Similarly, a short read for such an object indicates that
the remainder of the read should be interpreted a full read with
zeros filling out the end of the request.
This behavior is not correct for objects that are not backing rbd
image data. Currently rbd_img_obj_request_callback() assumes it
should be done for all objects.
Change rbd_img_obj_request_callback() so it only does this zeroing
for image objects. Encapsulate that special handling in its own
function. Add an assertion that the image object request is a bio
request, since we assume that (and we currently don't support any
other types).
This resolves a problem identified here:
http://tracker.ceph.com/issues/4559
The regression was introduced by
bf0d5f503dc11d6314c0503591d258d60ee9c944.
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
Linus Torvalds [Fri, 29 Mar 2013 18:13:25 +0000 (11:13 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"We've had a busy two weeks of bug fixing. The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old
race where compression and mmap combine forces to lose writes (me).
I'm fairly sure the mmap bug goes all the way back to the introduction
of the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.
I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting. I
double checked it here."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't drop path when printing out tree errors in scrub
Btrfs: fix wrong return value of btrfs_lookup_csum()
Btrfs: fix wrong reservation of csums
Btrfs: fix double free in the btrfs_qgroup_account_ref()
Btrfs: limit the global reserve to 512mb
Btrfs: hold the ordered operations mutex when waiting on ordered extents
Btrfs: fix space accounting for unlink and rename
Btrfs: fix space leak when we fail to reserve metadata space
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
Btrfs: fix race between mmap writes and compression
Btrfs: fix memory leak in btrfs_create_tree()
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Btrfs: fix missing qgroup reservation before fallocating
Btrfs: handle a bogus chunk tree nicely
Btrfs: update to use fs_state bit
Len Brown [Fri, 29 Mar 2013 18:02:30 +0000 (11:02 -0700)]
ia64 idle: delete stale (*idle)() function pointer
Commit
3e7fc708eb41 ("ia64 idle: delete pm_idle") in 3.9-rc1 didn't
finish the job, leaving an un-initialized reference to (*idle)().
[ Haven't seen a crash from this - but seems like we are just being
lucky that "idle" is zero so it does get initialized before we jump to
randomland - Len ]
Reported-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 29 Mar 2013 18:00:43 +0000 (11:00 -0700)]
Merge branch 'for-curr' of git://git./linux/kernel/git/vgupta/arc
Pull arc architecture fixes from Vineet Gupta:
"This includes fix for a serious bug in DMA mapping API, make
allyesconfig wreckage, removal of bogus email-list placeholder in
MAINTAINERS, a typo in ptrace helper code and last remaining changes
for syscall ABI v3 which we are finally starting to transition-to
internally.
The request is late than I intended to - but I was held up with
debugging a timer link list corruption, for which a proposed fix to
generic timer code was sent out to lkml/tglx earlier today."
* 'for-curr' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: Fix the typo in event identifier flags used by ptrace
arc: fix dma_address assignment during dma_map_sg()
ARC: Remove SET_PERSONALITY (tracks cross-arch change)
ARC: ABIv3: fork/vfork wrappers not needed in "no-legacy-syscall" ABI
ARC: ABIv3: Print the correct ABI ver
ARC: make allyesconfig build breakages
ARC: MAINTAINERS update for ARC
Josef Bacik [Fri, 29 Mar 2013 14:09:34 +0000 (08:09 -0600)]
Btrfs: don't drop path when printing out tree errors in scrub
A user reported a panic where we were panicing somewhere in
tree_backref_for_extent from scrub_print_warning. He only captured the trace
but looking at scrub_print_warning we drop the path right before we mess with
the extent buffer to print out a bunch of stuff, which isn't right. So fix this
by dropping the path after we use the eb if we need to. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Nicholas Bellinger [Fri, 29 Mar 2013 06:06:00 +0000 (23:06 -0700)]
target: Fix RESERVATION_CONFLICT status regression for iscsi-target special case
This patch fixes a regression introduced in v3.8-rc1 code where a failed
target_check_reservation() check in target_setup_cmd_from_cdb() was causing
an incorrect SAM_STAT_GOOD status to be returned during a WRITE operation
performed by an unregistered / unreserved iscsi initiator port.
This regression is only effecting iscsi-target due to a special case check
for TCM_RESERVATION_CONFLICT within iscsi_target_erl1.c:iscsit_execute_cmd(),
and was still correctly disallowing WRITE commands from backend submission
for unregistered / unreserved initiator ports, while returning the incorrect
SAM_STAT_GOOD status due to the missing SAM_STAT_RESERVATION_CONFLICT
assignment.
This regression was first introduced with:
commit
de103c93aff0bed0ae984274e5dc8b95899badab
Author: Christoph Hellwig <hch@lst.de>
Date: Tue Nov 6 12:24:09 2012 -0800
target: pass sense_reason as a return value
Go ahead and re-add the missing SAM_STAT_RESERVATION_CONFLICT assignment
during a target_check_reservation() failure, so that iscsi-target code
sends the correct SCSI status.
All other fabrics using target_submit_cmd_*() with a RESERVATION_CONFLICT
call to transport_generic_request_failure() are not effected by this bug.
Reported-by: Jeff Leung <jleung@curriegrad2004.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Nicholas Bellinger [Thu, 28 Mar 2013 00:23:41 +0000 (17:23 -0700)]
tcm_vhost: Avoid VIRTIO_RING_F_EVENT_IDX feature bit
This patch adds a VHOST_SCSI_FEATURES mask minus VIRTIO_RING_F_EVENT_IDX
so that vhost-scsi-pci userspace will strip this feature bit once
GET_FEATURES reports it as being unsupported on the host.
This is to avoid a bug where ->handle_kicks() are missed when EVENT_IDX
is enabled by default in userspace code.
(mst: Rename to VHOST_SCSI_FEATURES + add comment)
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Asias He <asias@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Michel Lespinasse [Thu, 28 Mar 2013 23:26:23 +0000 (16:26 -0700)]
Revert "mm: introduce VM_POPULATE flag to better deal with racy userspace programs"
This reverts commit
186930500985 ("mm: introduce VM_POPULATE flag to
better deal with racy userspace programs").
VM_POPULATE only has any effect when userspace plays racy games with
vmas by trying to unmap and remap memory regions that mmap or mlock are
operating on.
Also, the only effect of VM_POPULATE when userspace plays such games is
that it avoids populating new memory regions that get remapped into the
address range that was being operated on by the original mmap or mlock
calls.
Let's remove VM_POPULATE as there isn't any strong argument to mandate a
new vm_flag.
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 28 Mar 2013 22:54:25 +0000 (15:54 -0700)]
Merge tag 'usb-3.9-rc4' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg Kroah-Hartman:
"Here are some USB fixes to resolve issues reported recently, as well
as a new device id for the ftdi_sio driver."
* tag 'usb-3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: ftdi_sio: Add support for Mitsubishi FX-USB-AW/-BD
usb: Fix compile error by selecting USB_OTG_UTILS
USB: serial: fix hang when opening port
USB: EHCI: fix bug in iTD/siTD DMA pool allocation
xhci: Don't warn on empty ring for suspended devices.
usb: xhci: Fix TRB transfer length macro used for Event TRB.
usb/acpi: binding xhci root hub usb port with ACPI
usb: add find_raw_port_number callback to struct hc_driver()
usb: xhci: fix build warning
Linus Torvalds [Thu, 28 Mar 2013 22:53:33 +0000 (15:53 -0700)]
Merge tag 'tty-3.9-rc4' of git://git./linux/kernel/git/gregkh/tty
Pull TTY/serial fixes from Greg Kroah-Hartman:
"Here are some tty/serial driver fixes for 3.9.
The big thing here is the fix for the huge mess we caused renaming the
8250 driver accidentally in the 3.7 kernel release, without realizing
that there were users of the module options that suddenly broke. This
is now resolved, and, to top the injury off, we have a backwards-
compatible option for those users who got used to the new name since
3.7. Ugh, sorry about that.
Other than that, some other minor fixes for issues that have been
reported by users."
* tag 'tty-3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Xilinx: ARM: UART: clear pending irqs before enabling irqs
TTY: 8250, deprecated 8250_core.* options
TTY: 8250, revert module name change
serial: 8250_pci: Add WCH CH352 quirk to avoid Xscale detection
tty: atmel_serial_probe(): index of atmel_ports[] fix
Linus Torvalds [Thu, 28 Mar 2013 22:52:54 +0000 (15:52 -0700)]
Merge tag 'staging-3.9-rc4' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg Kroah-Hartman:
"Here are two tiny staging driver fixes to resolve issues that have
been reported."
* tag 'staging-3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: comedi: s626: fix continuous acquisition
staging: zcache: fix typo "64_BIT"
Linus Torvalds [Thu, 28 Mar 2013 22:52:14 +0000 (15:52 -0700)]
Merge tag 'driver-core-3.9-rc4' of git://git./linux/kernel/git/gregkh/driver-core
Pull sysfs fixes from Greg Kroah-Hartman:
"Here are two fixes for sysfs that resolve issues that have been found
by the Trinity fuzz tool, causing oopses in sysfs. They both have
been in linux-next for a while to ensure that they do not cause any
other problems."
* tag 'driver-core-3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
sysfs: handle failure path correctly for readdir()
sysfs: fix race between readdir and lseek
Linus Torvalds [Thu, 28 Mar 2013 22:51:33 +0000 (15:51 -0700)]
Merge tag 'char-misc-3.9-rc4' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg Kroah-Hartman:
"Here are some small char/misc driver fixes that resolve issues
recently reported against the 3.9-rc kernels. All have been in
linux-next for a while."
* tag 'char-misc-3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
VMCI: Fix process-to-process DRGAMs.
mei: ME hardware reset needs to be synchronized
mei: add mei_stop function to stop mei device
extcon: max77693: Initialize register of MUIC device to bring up it without platform data
extcon: max77693: Fix bug of wrong pointer when platform data is not used
extcon: max8997: Check the pointer of platform data to protect null pointer error
Linus Torvalds [Thu, 28 Mar 2013 20:47:31 +0000 (13:47 -0700)]
Merge tag 'pm+acpi-3.9-rc5' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael J Wysocki:
- Fix for a recent cpufreq regression related to acpi-cpufreq and
suspend/resume from Viresh Kumar.
- cpufreq stats reference counting fix from Viresh Kumar.
- intel_pstate driver fixes from Dirk Brandewie and Konrad Rzeszutek
Wilk.
- New ACPI suspend blacklist entry for Sony Vaio VGN-FW21M from Fabio
Valentini.
- ACPI Platform Error Interface (APEI) fix from Chen Gong.
- PCI root bridge hotplug locking fix from Yinghai Lu.
* tag 'pm+acpi-3.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PCI / ACPI: hold acpi_scan_lock during root bus hotplug
ACPI / APEI: fix error status check condition for CPER
ACPI / PM: fix suspend and resume on Sony Vaio VGN-FW21M
cpufreq: acpi-cpufreq: Don't set policy->related_cpus from .init()
cpufreq: stats: do cpufreq_cpu_put() corresponding to cpufreq_cpu_get()
intel-pstate: Use #defines instead of hard-coded values.
cpufreq / intel_pstate: Fix calculation of current frequency
cpufreq / intel_pstate: Add function to check that all MSRs are valid
Linus Torvalds [Thu, 28 Mar 2013 20:46:20 +0000 (13:46 -0700)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This removes IPsec ESN support from the talitos/caam drivers since
they were implemented incorrectly, causing interoperability problems
if ESN is used with them."
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
Revert "crypto: caam - add IPsec ESN support"
Revert "crypto: talitos - add IPsec ESN support"
Linus Torvalds [Thu, 28 Mar 2013 20:45:49 +0000 (13:45 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/cmarinas/linux-aarch64
Pull arm64 fix from Catalin Marinas:
"Fix IS_ENABLED() usage typo (missing CONFIG_ prefix)."
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64:
ARM64: early_printk: Fix check for CONFIG_ARM64_64K_PAGES
Linus Torvalds [Thu, 28 Mar 2013 20:44:40 +0000 (13:44 -0700)]
Merge tag 'fbdev-fixes-3.9-rc4' of git://gitorious.org/linux-omap-dss2/linux
Pull fbdev fixes from Tomi Valkeinen:
"Since Florian is still away/inactive, I volunteered to collect fbdev
fixes for 3.9 and changes for 3.10. I didn't receive any other fbdev
fixes than OMAP yet, but I didn't want to delay this further as
there's a compilation fix for OMAP1. So there could be still some
fbdev fixes on the way a bit later.
This contains:
- Fix OMAP1 compilation
- OMAP display fixes"
* tag 'fbdev-fixes-3.9-rc4' of git://gitorious.org/linux-omap-dss2/linux:
omapdss: features: fix supported outputs for OMAP4
OMAPDSS: tpo-td043 panel: fix data passing between SPI/DSS parts
omapfb: fix broken build on OMAP1
Linus Torvalds [Thu, 28 Mar 2013 20:43:46 +0000 (13:43 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ebiederm/user-namespace
Pull userns fixes from Eric W Biederman:
"The bulk of the changes are fixing the worst consequences of the user
namespace design oversight in not considering what happens when one
namespace starts off as a clone of another namespace, as happens with
the mount namespace.
The rest of the changes are just plain bug fixes.
Many thanks to Andy Lutomirski for pointing out many of these issues."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Restrict when proc and sysfs can be mounted
ipc: Restrict mounting the mqueue filesystem
vfs: Carefully propogate mounts across user namespaces
vfs: Add a mount flag to lock read only bind mounts
userns: Don't allow creation if the user is chrooted
yama: Better permission check for ptraceme
pid: Handle the exit of a multi-threaded init.
scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.
Konstantin Holoborodko [Thu, 28 Mar 2013 15:06:13 +0000 (00:06 +0900)]
usb: ftdi_sio: Add support for Mitsubishi FX-USB-AW/-BD
It enhances the driver for FTDI-based USB serial adapters
to recognize Mitsubishi Electric Corp. USB/RS422 Converters
as FT232BM chips and support them.
https://search.meau.com/?q=FX-USB-AW
Signed-off-by: Konstantin Holoborodko <klh.kernel@gmail.com>
Tested-by: Konstantin Holoborodko <klh.kernel@gmail.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wei Yongjun [Thu, 28 Mar 2013 15:43:43 +0000 (09:43 -0600)]
mg_disk: fix error return code in mg_probe()
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Miao Xie [Thu, 28 Mar 2013 08:12:15 +0000 (08:12 +0000)]
Btrfs: fix wrong return value of btrfs_lookup_csum()
If we don't find the expected csum item, but find a csum item which is
adjacent to the specified extent, we should return -EFBIG, or we should
return -ENOENT. But btrfs_lookup_csum() return -EFBIG even the csum item
is not adjacent to the specified extent. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao Xie [Thu, 28 Mar 2013 08:08:20 +0000 (08:08 +0000)]
Btrfs: fix wrong reservation of csums
We reserve the space for csums only when we write data into a file, in
the other cases, such as tree log, log replay, we don't do reservation,
so we can use the reservation of the transaction handle just for the former.
And for the latter, we should use the tree's own reservation. But the
function - btrfs_csum_file_blocks() didn't differentiate between these
two types of the cases, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Wang Shilong [Mon, 25 Mar 2013 11:08:23 +0000 (11:08 +0000)]
Btrfs: fix double free in the btrfs_qgroup_account_ref()
The function btrfs_find_all_roots is responsible to allocate
memory for 'roots' and free it if errors happen,so the caller should not
free it again since the work has been done.
Besides,'tmp' is allocated after the function btrfs_find_all_roots,
so we can return directly if btrfs_find_all_roots() fails.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 26 Mar 2013 19:31:45 +0000 (15:31 -0400)]
Btrfs: limit the global reserve to 512mb
A user reported a problem where he was getting early ENOSPC with hundreds of
gigs of free data space and 6 gigs of free metadata space. This is because the
global block reserve was taking up the entire free metadata space. This is
ridiculous, we have infrastructure in place to throttle if we start using too
much of the global reserve, so instead of letting it get this huge just limit it
to 512mb so that users can still get work done. This allowed the user to
complete his rsync without issues. Thanks
Cc: stable@vger.kernel.org
Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 26 Mar 2013 19:29:11 +0000 (15:29 -0400)]
Btrfs: hold the ordered operations mutex when waiting on ordered extents
We need to hold the ordered_operations mutex while waiting on ordered extents
since we splice and run the ordered extents list. We need to make sure anybody
else who wants to wait on ordered extents does actually wait for them to be
completed. This will keep us from bailing out of flushing in case somebody is
already waiting on ordered extents to complete. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Tue, 26 Mar 2013 19:26:55 +0000 (15:26 -0400)]
Btrfs: fix space accounting for unlink and rename
We are way over-reserving for unlink and rename. Rename is just some random
huge number and unlink accounts for tree log operations that don't actually
happen during unlink, not to mention the tree log doesn't take from the trans
block rsv anyway so it's completely useless. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Mon, 25 Mar 2013 20:03:35 +0000 (16:03 -0400)]
Btrfs: fix space leak when we fail to reserve metadata space
Dave reported a warning when running xfstest 275. We have been leaking delalloc
metadata space when our reservations fail. This is because we were improperly
calculating how much space to free for our checksum reservations. The problem
is we would sometimes free up space that had already been freed in another
thread and we would end up with negative usage for the delalloc space. This
patch fixes the problem by calculating how much space the other threads would
have already freed, and then calculate how much space we need to free had we not
done the reservation at all, and then freeing any excess space. This makes
xfstests 275 no longer have leaked space. Thanks
Cc: stable@vger.kernel.org
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Jan Schmidt [Thu, 21 Mar 2013 14:30:23 +0000 (14:30 +0000)]
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
When you take a snapshot, punch a hole where there has been data, then take
another snapshot and try to send an incremental stream, btrfs send would
give you EIO. That is because is_extent_unchanged had no support for holes
being punched. With this patch, instead of returning EIO we just return
0 (== the extent is not unchanged) and we're good.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Cc: Alexander Block <ablock84@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Rafael J. Wysocki [Thu, 28 Mar 2013 12:52:59 +0000 (13:52 +0100)]
Merge branch 'acpi-fixes' into fixes
* acpi-fixes:
PCI / ACPI: hold acpi_scan_lock during root bus hotplug
ACPI / APEI: fix error status check condition for CPER
ACPI / PM: fix suspend and resume on Sony Vaio VGN-FW21M
Rafael J. Wysocki [Thu, 28 Mar 2013 12:52:47 +0000 (13:52 +0100)]
Merge branch 'pm-fixes' into fixes
* pm-fixes:
cpufreq: acpi-cpufreq: Don't set policy->related_cpus from .init()
cpufreq: stats: do cpufreq_cpu_put() corresponding to cpufreq_cpu_get()
intel-pstate: Use #defines instead of hard-coded values.
cpufreq / intel_pstate: Fix calculation of current frequency
cpufreq / intel_pstate: Add function to check that all MSRs are valid
Linus Torvalds [Wed, 27 Mar 2013 22:50:24 +0000 (15:50 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/sfr/next-fixes
Pull powerpc build fixes from Stephen Rothwell:
"Just a couple of build fixes for powerpc all{mod,yes}config.
Submitted by me since BenH is on vacation."
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sfr/next-fixes:
powerpc: define the conditions where the ePAPR idle hcall can be supported
powerpc: make additional room in exception vector area
Linus Torvalds [Wed, 27 Mar 2013 19:56:25 +0000 (12:56 -0700)]
Merge tag 'stable/for-linus-3.9-rc4-tag' of git://git./linux/kernel/git/konrad/xen
Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
"This is mostly just the last stragglers of the regression bugs that
this merge window had. There are also two bug-fixes: one that adds an
extra layer of security, and a regression fix for a change that was
added in v3.7 (the v1 was faulty, the v2 works).
- Regression fixes for C-and-P states not being parsed properly.
- Fix possible security issue with guests triggering DoS via
non-assigned MSI-Xs.
- Fix regression (introduced in v3.7) with raising an event (v2).
- Fix hastily introduced band-aid during c0 for the CR3 blowup."
* tag 'stable/for-linus-3.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/events: avoid race with raising an event in unmask_evtchn()
xen/mmu: Move the setting of pvops.write_cr3 to later phase in bootup.
xen/acpi-stub: Disable it b/c the acpi_processor_add is no longer called.
xen-pciback: notify hypervisor about devices intended to be assigned to guests
xen/acpi-processor: Don't dereference struct acpi_processor on all CPUs.
Linus Torvalds [Wed, 27 Mar 2013 18:18:43 +0000 (11:18 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
- fix for potential 3.9 regression in handling of buttons for touchpads
following HID mt specification; potential because reportedly there is
no retail product on the market that would be using this feature, but
nevertheless we'd better follow the spec. Fix by Benjamin Tissoires.
- support for two quirky devices added by Josh Boyer.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: multitouch: fix touchpad buttons
HID: usbhid: fix build problem
HID: usbhid: quirk for MSI GX680R led panel
HID: usbhid: quirk for Realtek Multi-card reader
Linus Torvalds [Wed, 27 Mar 2013 16:25:11 +0000 (09:25 -0700)]
Merge tag 'iommu-fixes-v3.9-rc4' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
"Here are some fixes which have collected since Linux v3.9-rc1.
The most important one fixes a long-standing regressen which make
re-hotplugged devices unusable when AMD IOMMU is used.
The other patches fix build issues (build regression on OMAP and a
section mismatch). One patch just removes a duplicate header include."
* tag 'iommu-fixes-v3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Make sure dma_ops are set for hotplug devices
x86, io_apic: remove duplicated include from irq_remapping.c
iommu: OMAP: build only on OMAP2+
amd_iommu_init: remove __init from amd_iommu_erratum_746_workaround
Al Viro [Wed, 27 Mar 2013 15:20:30 +0000 (15:20 +0000)]
vfs/splice: Fix missed checks in new __kernel_write() helper
Commit
06ae43f34bcc ("Don't bother with redoing rw_verify_area() from
default_file_splice_from()") lost the checks to test existence of the
write/aio_write methods. My apologies ;-/
Eventually, we want that in fs/splice.c side of things (no point
repeating it for every buffer, after all), but for now this is the
obvious minimal fix.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Vrabel [Mon, 25 Mar 2013 14:11:19 +0000 (14:11 +0000)]
xen/events: avoid race with raising an event in unmask_evtchn()
In unmask_evtchn(), when the mask bit is cleared after testing for
pending and the event becomes pending between the test and clear, then
the upcall will not become pending and the event may be lost or
delayed.
Avoid this by always clearing the mask bit before checking for
pending. If a hypercall is needed, remask the event as
EVTCHNOP_unmask will only retrigger pending events if they were
masked.
This fixes a regression introduced in 3.7 by
b5e579232d635b79a3da052964cb357ccda8d9ea (xen/events: fix
unmask_evtchn for PV on HVM guests) which reordered the clear mask and
check pending operations.
Changes in v2:
- set mask before hypercall.
Cc: stable@vger.kernel.org
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 22 Mar 2013 14:34:28 +0000 (10:34 -0400)]
xen/mmu: Move the setting of pvops.write_cr3 to later phase in bootup.
We move the setting of write_cr3 from the early bootup variant
(see git commit
0cc9129d75ef8993702d97ab0e49542c15ac6ab9
"x86-64, xen, mmu: Provide an early version of write_cr3.")
to a more appropiate location.
This new location sets all of the other non-early variants
of pvops calls - and most importantly is before the
alternative_asm mechanism kicks in.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 22 Mar 2013 14:15:47 +0000 (10:15 -0400)]
xen/acpi-stub: Disable it b/c the acpi_processor_add is no longer called.
With the Xen ACPI stub code (CONFIG_XEN_STUB=y) enabled, the power
C and P states are no longer uploaded to the hypervisor.
The reason is that the Xen CPU hotplug code: xen-acpi-cpuhotplug.c
and the xen-acpi-stub.c register themselves as the "processor" type object.
That means the generic processor (processor_driver.c) stops
working and it does not call (acpi_processor_add) which populates the
per_cpu(processors, pr->id) = pr;
structure. The 'pr' is gathered from the acpi_processor_get_info function
which does the job of finding the C-states and figuring out PBLK address.
The 'processors->pr' is then later used by xen-acpi-processor.c (the one that
uploads C and P states to the hypervisor). Since it is NULL, we end
skip the gathering of _PSD, _PSS, _PCT, etc and never upload the power
management data.
The end result is that enabling the CONFIG_XEN_STUB in the build means that
xen-acpi-processor is not working anymore.
This temporary patch fixes it by marking the XEN_STUB driver as
BROKEN until this can be properly fixed.
CC: jinsong.liu@intel.com
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Roland Stigge [Tue, 26 Mar 2013 17:36:01 +0000 (18:36 +0100)]
usb: Fix compile error by selecting USB_OTG_UTILS
The current lpc32xx_defconfig breaks like this, caused by recent phy
restructuring:
LD init/built-in.o
drivers/built-in.o: In function `usb_hcd_nxp_probe':
drivers/usb/host/ohci-nxp.c:224: undefined reference to `isp1301_get_client'
drivers/built-in.o: In function `lpc32xx_udc_probe':
drivers/usb/gadget/lpc32xx_udc.c:3104: undefined reference to
`isp1301_get_client' distcc[27867] ERROR: compile (null) on localhost failed
make: *** [vmlinux] Error 1
Caused by
1c2088812f095df77f4b3224b65db79d7111a300 (usb: Makefile: fix
drivers/usb/phy/ Makefile entry)
This patch fixes this by selecting USB_OTG_UTILS in Kconfig which
causes the phy driver to be built again.
Signed-off-by: Roland Stigge <stigge@antcom.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric W. Biederman [Sun, 24 Mar 2013 21:28:27 +0000 (14:28 -0700)]
userns: Restrict when proc and sysfs can be mounted
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.
proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.
Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.
In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Fri, 22 Mar 2013 01:13:15 +0000 (18:13 -0700)]
ipc: Restrict mounting the mqueue filesystem
Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN
rights over the ipc namespace. The principle here is if you create
or have capabilities over it you can mount it, otherwise you get to live
with what other people have mounted.
This information is not particularly sensitive and mqueue essentially
only reports which posix messages queues exist. Still when creating a
restricted environment for an application to live any extra
information may be of use to someone with sufficient creativity. The
historical if imperfect way this information has been restricted has
been not to allow mounts and restricting this to ipc namespace
creators maintains the spirit of the historical restriction.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Fri, 22 Mar 2013 11:08:05 +0000 (04:08 -0700)]
vfs: Carefully propogate mounts across user namespaces
As a matter of policy MNT_READONLY should not be changable if the
original mounter had more privileges than creator of the mount
namespace.
Add the flag CL_UNPRIVILEGED to note when we are copying a mount from
a mount namespace that requires more privileges to a mount namespace
that requires fewer privileges.
When the CL_UNPRIVILEGED flag is set cause clone_mnt to set MNT_NO_REMOUNT
if any of the mnt flags that should never be changed are set.
This protects both mount propagation and the initial creation of a less
privileged mount namespace.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Fri, 22 Mar 2013 10:10:15 +0000 (03:10 -0700)]
vfs: Add a mount flag to lock read only bind mounts
When a read-only bind mount is copied from mount namespace in a higher
privileged user namespace to a mount namespace in a lesser privileged
user namespace, it should not be possible to remove the the read-only
restriction.
Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
remain read-only.
CC: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Fri, 15 Mar 2013 08:45:51 +0000 (01:45 -0700)]
userns: Don't allow creation if the user is chrooted
Guarantee that the policy of which files may be access that is
established by setting the root directory will not be violated
by user namespaces by verifying that the root directory points
to the root of the mount namespace at the time of user namespace
creation.
Changing the root is a privileged operation, and as a matter of policy
it serves to limit unprivileged processes to files below the current
root directory.
For reasons of simplicity and comprehensibility the privilege to
change the root directory is gated solely on the CAP_SYS_CHROOT
capability in the user namespace. Therefore when creating a user
namespace we must ensure that the policy of which files may be access
can not be violated by changing the root directory.
Anyone who runs a processes in a chroot and would like to use user
namespace can setup the same view of filesystems with a mount
namespace instead. With this result that this is not a practical
limitation for using user namespaces.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Benjamin Tissoires [Fri, 22 Mar 2013 17:53:57 +0000 (18:53 +0100)]
HID: multitouch: fix touchpad buttons
Commit "HID: multitouch: use the callback "report" instead..." breaks the
buttons of touchpads following the HID multitouch specification.
The buttons were emmitted through hid-input, but as now the events
are generated only in hid-multitouch, the buttons are not emmitted anymore.
The input_event() call is far much simpler than the hid-input one as
many of the different tests do not apply to multitouch touchpads.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Joerg Roedel [Tue, 26 Mar 2013 21:48:23 +0000 (22:48 +0100)]
iommu/amd: Make sure dma_ops are set for hotplug devices
There is a bug introduced with commit
27c2127 that causes
devices which are hot unplugged and then hot-replugged to
not have per-device dma_ops set. This causes these devices
to not function correctly. Fixed with this patch.
Cc: stable@vger.kernel.org
Reported-by: Andreas Degert <andreas.degert@googlemail.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
This page took 0.055311 seconds and 5 git commands to generate.