Michal Hocko [Sat, 10 Sep 2016 10:34:09 +0000 (20:34 +1000)]
oom: keep mm of the killed task available
oom_reap_task has to call exit_oom_victim in order to make sure that the
oom vicim will not block the oom killer for ever. This is, however,
opening new problems (e.g oom_killer_disable exclusion - see
74070542099c
("oom, suspend: fix oom_reaper vs. oom_killer_disable race")).
exit_oom_victim should be only called from the victim's context ideally.
One way to achieve this would be to rely on per mm_struct flags. We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:
do_exit
exit_mm
tsk->mm = NULL;
mmput
__mmput
exit_oom_victim
doesn't guarantee that exit_oom_victim will get called in a bounded amount
of time. At least exit_aio depends on IO which might get blocked due to
lack of memory and who knows what else is lurking there.
This patch takes a different approach. We remember tsk->mm into the
signal_struct and bind it to the signal struct life time for all oom
victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
have to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct. As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.
Increasing the signal_struct for something as unlikely as the oom killer
is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task->mm into the
signal_struct anyway. In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.
Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Sat, 10 Sep 2016 10:34:09 +0000 (20:34 +1000)]
mm,oom_reaper: do not attempt to reap a task twice
"mm, oom_reaper: do not attempt to reap a task twice" tried to give the
OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But
the usefulness of the flag is rather limited and actually never shown in
practice. If the flag is set, it means that the holder of mm->mmap_sem
cannot call up_write() due to presumably being blocked at unkillable wait
waiting for other thread's memory allocation. But since one of threads
sharing that mm will queue that mm immediately via task_will_free_mem()
shortcut (otherwise, oom_badness() will select the same mm again due to
oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is
unlikely helpful.
Let's always set MMF_OOM_REAPED.
Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Sat, 10 Sep 2016 10:34:08 +0000 (20:34 +1000)]
mm,oom_reaper: reduce find_lock_task_mm() usage
Patch series "fortify oom killer even more", v2.
This patch (of 9):
__oom_reap_task() can be simplified a bit if it receives a valid mm from
oom_reap_task() which also uses that mm when __oom_reap_task() failed. We
can drop one find_lock_task_mm() call and also make the __oom_reap_task()
code flow easier to follow. Moreover, this will make later patch in the
series easier to review. Pinning mm's mm_count for longer time is not
really harmful because this will not pin much memory.
This patch doesn't introduce any functional change.
Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 10 Sep 2016 10:34:08 +0000 (20:34 +1000)]
mm-swap-add-swap_cluster_list-checkpatch-fixes
WARNING: Missing a blank line after declarations
#91: FILE: mm/swapfile.c:285:
+ unsigned int tail = cluster_next(&list->tail);
+ cluster_set_next(&ci[tail], idx);
WARNING: line over 80 characters
#261: FILE: mm/swapfile.c:2347:
+ cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
total: 0 errors, 2 warnings, 222 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-swap-add-swap_cluster_list.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Huang Ying [Sat, 10 Sep 2016 10:34:08 +0000 (20:34 +1000)]
mm, swap: add swap_cluster_list
This is a code clean up patch without functionality changes. The
swap_cluster_list data structure and its operations are introduced to
provide some better encapsulation for the free cluster and discard cluster
list operations. This avoid some code duplication, improved the code
readability, and reduced the total line number.
Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexey Dobriyan [Sat, 10 Sep 2016 10:34:08 +0000 (20:34 +1000)]
mm: unrig VMA cache hit ratio
Current code doesn't count first FIND operation after VMA cache flush
(which happen surprisingly often) artificially increasing cache hit ratio.
On my regular setup the difference is:
Before After
==========================================================
* boot, login into KDE
vmacache_find_calls 446216 vmacache_find_calls 492741
vmacache_find_hits 277596 vmacache_find_hits 276096
~62.2% ~56.0%
* rebuild kernel (no changes to code, usual config)
vmacache_find_calls
1943007 vmacache_find_calls
2083718
vmacache_find_hits
1246123 vmacache_find_hits
1244146
~64.1% ~59.7%
* rebuild kernel (full rebuild, usual config)
vmacache_find_calls
32163155 vmacache_find_calls
33677183
vmacache_find_hits
27889956 vmacache_find_hits
27877591
~88.2% ~84.3%
Total: ~4% cache hit ratio.
If someone is counting _relative_ cache _miss_ ratio, misreporting is much
higher.
Link: http://lkml.kernel.org/r/20160822225009.GA3934@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
James Morse [Sat, 10 Sep 2016 10:34:08 +0000 (20:34 +1000)]
mm: pagewalk: fix the comment for test_walk
Modify the comment describing struct mm_walk->test_walk()s behaviour to
match the comment on walk_page_test() and the behaviour of
walk_page_vma().
Fixes: fafaa4264eba4 ("pagewalk: improve vma handling")
Link: http://lkml.kernel.org/r/1471622518-21980-1-git-send-email-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bart Van Assche [Sat, 10 Sep 2016 10:34:08 +0000 (20:34 +1000)]
do_generic_file_read(): fail immediately if killed
If a fatal signal has been received, fail immediately instead of trying to
read more data.
If wait_on_page_locked_killable() was interrupted then this page is most
likely is not PageUptodate() and in this case do_generic_file_read() will
fail after lock_page_killable().
See also commit
ebded02788b5 ("mm: filemap: avoid unnecessary calls to
lock_page when waiting for IO to complete during a read")
[oleg@redhat.com: changelog addition]
Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 10 Sep 2016 10:34:07 +0000 (20:34 +1000)]
mm/page_owner: don't define fields on struct page_ext by hard-coding
There is a memory waste problem if we define field on struct page_ext by
hard-coding. Entry size of struct page_ext includes the size of those
fields even if it is disabled at runtime. Now, extra memory request at
runtime is possible so page_owner don't need to define it's own fields by
hard-coding.
This patch removes hard-coded define and uses extra memory for storing
page_owner information in page_owner. Most of code are just mechanical
changes.
Link: http://lkml.kernel.org/r/1471315879-32294-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 10 Sep 2016 10:34:07 +0000 (20:34 +1000)]
mm/page_ext: support extra space allocation by page_ext user
Until now, if some page_ext users want to use it's own field on page_ext,
it should be defined in struct page_ext by hard-coding. It has a problem
that wastes memory in following situation.
struct page_ext {
#ifdef CONFIG_A
int a;
#endif
#ifdef CONFIG_B
int b;
#endif
};
Assume that kernel is built with both CONFIG_A and CONFIG_B. Even if we
enable feature A and doesn't enable feature B at runtime, each entry of
struct page_ext takes two int rather than one int. It's undesirable
result so this patch tries to fix it.
To solve above problem, this patch implements to support extra space
allocation at runtime. When need() callback returns true, it's extra
memory requirement is summed to entry size of page_ext. Also, offset for
each user's extra memory space is returned. With this offset, user can
use this extra space and there is no need to define needed field on
page_ext by hard-coding.
This patch only implements an infrastructure. Following patch will use it
for page_owner which is only user having it's own fields on page_ext.
Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 10 Sep 2016 10:34:07 +0000 (20:34 +1000)]
mm/page_ext: rename offset to index
Here, 'offset' means entry index in page_ext array. Following patch will
use 'offset' for field offset in each entry so rename current 'offset' to
prevent confusion.
Link: http://lkml.kernel.org/r/1471315879-32294-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 10 Sep 2016 10:34:07 +0000 (20:34 +1000)]
mm/page_owner: move page_owner specific function to page_owner.c
There is no reason that page_owner specific function resides on vmstat.c.
Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 10 Sep 2016 10:34:07 +0000 (20:34 +1000)]
mm/debug_pagealloc.c: don't allocate page_ext if we don't use guard page
What debug_pagealloc does is just mapping/unmapping page table.
Basically, it doesn't need additional memory space to memorize something.
But, with guard page feature, it requires additional memory to distinguish
if the page is for guard or not. Guard page is only used when
debug_guardpage_minorder is non-zero so this patch removes additional
memory allocation (page_ext) if debug_guardpage_minorder is zero.
It saves memory if we just use debug_pagealloc and not guard page.
Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 10 Sep 2016 10:34:07 +0000 (20:34 +1000)]
mm/debug_pagealloc.c: clean-up guard page handling code
Patch series "Reduce memory waste by page extension user".
This patchset tries to reduce memory waste by page extension user.
First case is architecture supported debug_pagealloc. It doesn't requires
additional memory if guard page isn't used. 8 bytes per page will be
saved in this case.
Second case is related to page owner feature. Until now, if page_ext
users want to use it's own fields on page_ext, fields should be defined in
struct page_ext by hard-coding. It has a following problem.
struct page_ext {
#ifdef CONFIG_A
int a;
#endif
#ifdef CONFIG_B
int b;
#endif
};
Assume that kernel is built with both CONFIG_A and CONFIG_B. Even if we
enable feature A and doesn't enable feature B at runtime, each entry of
struct page_ext takes two int rather than one int. It's undesirable waste
so this patch tries to reduce it. By this patchset, we can save 20 bytes
per page dedicated for page owner feature in some configurations.
This patch (of 6):
We can make code clean by moving decision condition for set_page_guard()
into set_page_guard() itself. It will help code readability. There is no
functional change.
Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Sat, 10 Sep 2016 10:34:06 +0000 (20:34 +1000)]
mm, vmscan: get rid of throttle_vm_writeout
throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
excessive pageout activity during the reclaim. Too many pages could be
put under writeback therefore LRUs would be full of unreclaimable pages
until the IO completes and in turn the OOM killer could be invoked.
There have been some important changes introduced since then in the
reclaim path though. Writers are throttled by balance_dirty_pages when
initiating the buffered IO and later during the memory pressure, the
direct reclaim is throttled by wait_iff_congested if the node is
considered congested by dirty pages on LRUs and the underlying bdi is
congested by the queued IO. The kswapd is throttled as well if it
encounters pages marked for immediate reclaim or under writeback which
signals that that there are too many pages under writeback already.
Finally should_reclaim_retry does congestion_wait if the reclaim cannot
make any progress and there are too many dirty/writeback pages.
Another important aspect is that we do not issue any IO from the direct
reclaim context anymore. In a heavy parallel load this could queue a lot
of IO which would be very scattered and thus unefficient which would just
make the problem worse.
This three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure so yet another
throttling point doesn't really seem helpful. Quite contrary, Mikulas
Patocka has reported that swap backed by dm-crypt doesn't work properly
because the swapout IO cannot make sufficient progress as the writeout
path depends on dm_crypt worker which has to allocate memory to perform
the encryption. In order to guarantee a forward progress it relies on the
mempool allocator. mempool_alloc(), however, prefers to use the
underlying (usually page) allocator before it grabs objects from the pool.
Such an allocation can dive into the memory reclaim and consequently to
throttle_vm_writeout. If there are too many dirty or pages under
writeback it will get throttled even though it is in fact a flusher to
clear pending pages.
[ 345.352536] kworker/u4:0 D
ffff88003df7f438 10488 6 2 0x00000000
[ 345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[ 345.352536]
ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
[ 345.352536]
ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
[ 345.352536]
ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
[ 345.352536] Call Trace:
[ 345.352536] [<
ffffffff818d466c>] schedule+0x3c/0x90
[ 345.352536] [<
ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
[ 345.352536] [<
ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
[ 345.352536] [<
ffffffff811407c3>] ? ktime_get+0xb3/0x150
[ 345.352536] [<
ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
[ 345.352536] [<
ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
[ 345.352536] [<
ffffffff8121d886>] congestion_wait+0x86/0x1f0
[ 345.352536] [<
ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
[ 345.352536] [<
ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
[ 345.352536] [<
ffffffff81211533>] shrink_zone_memcg+0x613/0x720
[ 345.352536] [<
ffffffff81211720>] shrink_zone+0xe0/0x300
[ 345.352536] [<
ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
[ 345.352536] [<
ffffffff81211e7f>] try_to_free_pages+0xef/0x300
[ 345.352536] [<
ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
[ 345.352536] [<
ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
[ 345.352536] [<
ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
[ 345.352536] [<
ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
[ 345.352536] [<
ffffffff81265dd7>] new_slab+0x2d7/0x6a0
[ 345.352536] [<
ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[ 345.352536] [<
ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
[ 345.352536] [<
ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[ 345.352536] [<
ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[ 345.352536] [<
ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[ 345.352536] [<
ffffffff81267ae1>] __slab_alloc+0x51/0x90
[ 345.352536] [<
ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[ 345.352536] [<
ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
[ 345.352536] [<
ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
[ 345.352536] [<
ffffffff811f6f11>] mempool_alloc+0x91/0x230
[ 345.352536] [<
ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
[ 345.352536] [<
ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
Let's just drop throttle_vm_writeout altogether. It is not very much
helpful anymore.
I have tried to test a potential writeback IO runaway similar to the one
described in the original patch which has introduced that [1]. Small
virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
rather slow NFS in a sync mode on the host) with 8 parallel writers each
writing 1G worth of data. As soon as the pagecache fills up and the
direct reclaim hits then I start anon memory consumer in a loop
(allocating 300M and exiting after populating it) in the background to
make the memory pressure even stronger as well as to disrupt the steady
state for the IO. The direct reclaim is throttled because of the
congestion as well as kswapd hitting congestion_wait due to nr_immediate
but throttle_vm_writeout doesn't ever trigger the sleep throughout the
test. Dirty+writeback are close to nr_dirty_threshold with some
fluctuations caused by the anon consumer.
[1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Sat, 10 Sep 2016 10:34:06 +0000 (20:34 +1000)]
mm: fix set pageblock migratetype in deferred struct page init
On x86_64 MAX_ORDER_NR_PAGES is usually 4M, and a pageblock is usually 2M,
so we only set one pageblock's migratetype in deferred_free_range() if pfn
is aligned to MAX_ORDER_NR_PAGES. That means it causes uninitialized
migratetype blocks, you can see from "cat /proc/pagetypeinfo", almost half
blocks are Unmovable.
Also we missed freeing the last block in deferred_init_memmap(), it causes
memory leak.
Fixes: ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Sat, 10 Sep 2016 10:34:06 +0000 (20:34 +1000)]
mem-hotplug: fix node spanned pages when we have a movable node
342332e6a925e9e ("mm/page_alloc.c: introduce kernelcore=mirror option")
rewrote the calculation of node spanned pages. But when we have a movable
node, the size of node spanned pages is double added. That's because we
have an empty normal zone, the present pages is zero, but its spanned
pages is not zero.
e.g.
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x0000007c7fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Node 1: 0x0000001080000000
[ 0.000000] Node 2: 0x0000002080000000
[ 0.000000] Node 3: 0x0000003080000000
[ 0.000000] Node 4: 0x0000003c80000000
[ 0.000000] Node 5: 0x0000004c80000000
[ 0.000000] Node 6: 0x0000005c80000000
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009ffff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x000000007552afff]
[ 0.000000] node 0: [mem 0x000000007bd46000-0x000000007bd46fff]
[ 0.000000] node 0: [mem 0x000000007bdcd000-0x000000007bffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x000000107fffffff]
[ 0.000000] node 1: [mem 0x0000001080000000-0x000000207fffffff]
[ 0.000000] node 2: [mem 0x0000002080000000-0x000000307fffffff]
[ 0.000000] node 3: [mem 0x0000003080000000-0x0000003c7fffffff]
[ 0.000000] node 4: [mem 0x0000003c80000000-0x0000004c7fffffff]
[ 0.000000] node 5: [mem 0x0000004c80000000-0x0000005c7fffffff]
[ 0.000000] node 6: [mem 0x0000005c80000000-0x0000006c7fffffff]
[ 0.000000] node 7: [mem 0x0000006c80000000-0x0000007c7fffffff]
node1:
[ 760.227767] Normal, start=0x1080000, present=0x0, spanned=0x1000000
[ 760.234024] Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
[ 760.240883] pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000
After apply this patch, the problem is fixed.
node1:
[ 289.770922] Normal, start=0x0, present=0x0, spanned=0x0
[ 289.776153] Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
[ 289.783019] pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000
Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:06 +0000 (20:34 +1000)]
mm, vmscan: make compaction_ready() more accurate and readable
The compaction_ready() is used during direct reclaim for costly order
allocations to skip reclaim for zones where compaction should be attempted
instead. It's combining the standard compaction_suitable() check with its
own watermark check based on high watermark with extra gap, and the result
is confusing at best.
This patch attempts to better structure and document the checks involved.
First, compaction_suitable() can determine that the allocation should
either succeed already, or that compaction doesn't have enough free pages
to proceed. The third possibility is that compaction has enough free
pages, but we still decide to reclaim first - unless we are already above
the high watermark with gap. This does not mean that the reclaim will
actually reach this watermark during single attempt, this is rather an
over-reclaim protection. So document the code as such. The check for
compaction_deferred() is removed completely, as it in fact had no proper
role here.
The result after this patch is mainly a less confusing code. We also skip
some over-reclaim in cases where the allocation should already succed.
Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:06 +0000 (20:34 +1000)]
mm-compaction-require-only-min-watermarks-for-non-costly-orders-fix orders-fix
Clarify why __isolate_free_page() does a order-0 watermark check with
apparent (1UL << order) gap, per Joonsoo.
Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:06 +0000 (20:34 +1000)]
mm, compaction: require only min watermarks for non-costly orders
The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. Then __isolate_free_page uses low watermark check to decide
if particular free page can be isolated. In the latter case, using low
watermark is needlessly pessimistic, as the free page isolations are only
temporary. For __compaction_suitable() the higher watermark makes sense
for high-order allocations where more freepages increase the chance of
success, and we can typically fail with some order-0 fallback when the
system is struggling to reach that watermark. But for low-order
allocation, forming the page should not be that hard. So using low
watermark here might just prevent compaction from even trying, and
eventually lead to OOM killer even if we are above min watermarks.
So after this patch, we use min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().
Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:05 +0000 (20:34 +1000)]
mm, compaction: use proper alloc_flags in __compaction_suitable()
The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. This check uses direct compactor's alloc_flags, but that's
wrong, since these flags are not applicable for freepage isolation.
For example, alloc_flags may indicate access to memory reserves, making
compaction proceed, and then fail watermark check during the isolation.
A similar problem exists for ALLOC_CMA, which may be part of alloc_flags,
but not during freepage isolation. In this case however it makes sense to
use ALLOC_CMA both in __compaction_suitable() and __isolate_free_page(),
since there's actually nothing preventing the freepage scanner to isolate
from CMA pageblocks, with the assumption that a page that could be
migrated once by compaction can be migrated also later by CMA allocation.
Thus we should count pages in CMA pageblocks when considering compaction
suitability and when isolating freepages.
To sum up, this patch should remove some false positives from
__compaction_suitable(), and allow compaction to proceed when free pages
required for compaction reside in the CMA pageblocks.
Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:05 +0000 (20:34 +1000)]
mm-compaction-create-compact_gap-wrapper-fix
Clarify the comment of compact_gap() wrt COMPACT_CLUSTER_MAX, per Joonsoo.
Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:05 +0000 (20:34 +1000)]
mm, compaction: create compact_gap wrapper
Compaction uses a watermark gap of (2UL << order) pages at various places
and it's not immediately obvious why. Abstract it through a compact_gap()
wrapper to create a single place with a thorough explanation.
Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:05 +0000 (20:34 +1000)]
mm, compaction: use correct watermark when checking compaction success
The __compact_finished() function uses low watermark in a check that has
to pass if the direct compaction is to finish and allocation should
succeed. This is too pessimistic, as the allocation will typically use
min watermark. It may happen that during compaction, we drop below the
low watermark (due to parallel activity), but still form the target
high-order page. By checking against low watermark, we might needlessly
continue compaction.
Similarly, __compaction_suitable() uses low watermark in a check whether
allocation can succeed without compaction. Again, this is unnecessarily
pessimistic.
After this patch, these check will use direct compactor's alloc_flags to
determine the watermark, which is effectively the min watermark.
Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:05 +0000 (20:34 +1000)]
mm-compaction-add-the-ultimate-direct-compaction-priority-fix
Use the MIN_COMPACT_PRIORITY alias instead of COMPACT_PRIO_SYNC_FULL to
disable heuristics "because this would be easier to follow and it would be
easier for future changes", per Michal.
Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:05 +0000 (20:34 +1000)]
mm, compaction: add the ultimate direct compaction priority
During reclaim/compaction loop, it's desirable to get a final answer from
unsuccessful compaction so we can either fail the allocation or invoke the
OOM killer. However, heuristics such as deferred compaction or pageblock
skip bits can cause compaction to skip parts or whole zones and lead to
premature OOM's, failures or excessive reclaim/compaction retries.
To remedy this, we introduce a new direct compaction priority called
COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
- ignore deferred compaction status for a zone
- ignore pageblock skip hints
- ignore cached scanner positions and scan the whole zone
The new priority should get eventually picked up by should_compact_retry()
and this should improve success rates for costly allocations using
__GFP_REPEAT, such as hugetlbfs allocations, and reduce some corner-case
OOM's for non-costly allocations.
Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:04 +0000 (20:34 +1000)]
mm, compaction: don't recheck watermarks after COMPACT_SUCCESS
Joonsoo has reminded me that in a later patch changing watermark checks
throughout compaction I forgot to update checks in
try_to_compact_pages() and compactd_do_work(). Closer inspection
however shows that they are redundant now in the success case, because
compact_zone() now reliably reports this with COMPACT_SUCCESS. So
effectively the checks just repeat (a subset) of checks that have just
passed. So instead of checking watermarks again, just test the return
value.
Note it's also possible that compaction would declare failure e.g.
because its find_suitable_fallback() is more strict than simple
watermark check, and then the watermark check we are removing would
then still succeed. After this patch this is not possible and it's
arguably better, because for long-term fragmentation avoidance we
should rather try a different zone than allocate with the unsuitable
fallback. If compaction of all zones fail and the allocation is
important enough, it will retry and succeed anyway.
Also remove the stray "bool success" variable from kcompactd_do_work().
Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:04 +0000 (20:34 +1000)]
mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS
COMPACT_PARTIAL has historically meant that compaction returned after
doing some work without fully compacting a zone. It however didn't
distinguish if compaction terminated because it succeeded in creating the
requested high-order page. This has changed recently and now we only
return COMPACT_PARTIAL when compaction thinks it succeeded, or the
high-order watermark check in compaction_suitable() passes and no
compaction needs to be done.
So at this point we can make the return value clearer by renaming it to
COMPACT_SUCCESS. The next patch will remove some redundant tests for
success where compaction just returned COMPACT_SUCCESS.
Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:04 +0000 (20:34 +1000)]
mm, compaction: cleanup unused functions
Since kswapd compaction moved to kcompactd, compact_pgdat() is not called
anymore, so we remove it. The only caller of __compact_pgdat() is
compact_node(), so we merge them and remove code that was only reachable
from kswapd.
Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 10 Sep 2016 10:34:04 +0000 (20:34 +1000)]
mm-compaction-make-whole_zone-flag-ignore-cached-scanner-positions-checkpatch-fixes
fix typo in comment
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Sat, 10 Sep 2016 10:34:04 +0000 (20:34 +1000)]
mm, compaction: make whole_zone flag ignore cached scanner positions
Patch series "make direct compaction more deterministic")
This is mostly a followup to Michal's oom detection rework, which
highlighted the need for direct compaction to provide better feedback in
reclaim/compaction loop, so that it can reliably recognize when compaction
cannot make further progress, and allocation should invoke OOM killer or
fail. We've discussed this at LSF/MM [1] where I proposed expanding the
async/sync migration mode used in compaction to more general "priorities".
This patchset adds one new priority that just overrides all the
heuristics and makes compaction fully scan all zones. I don't currently
think that we need more fine-grained priorities, but we'll see. Other
than that there's some smaller fixes and cleanups, mainly related to the
THP-specific hacks.
I've tested this with stress-highalloc in GFP_KERNEL order-4 and THP-like
order-9 scenarios. There's some improvement for compaction stats for the
order-4, which is likely due to the better watermarks handling. In the
previous version I reported mostly noise wrt compaction stats, and
decreased direct reclaim - now the reclaim is without difference. I
believe this is due to the less aggressive compaction priority increase in
patch 6.
"before" is a mmotm tree prior to 4.7 release plus the first part of the
series that was sent and merged separately
before after
order-4:
Compaction stalls 27216 30759
Compaction success 19598 25475
Compaction failures 7617 5283
Page migrate success 370510 464919
Page migrate failure 25712 27987
Compaction pages isolated 849601
1041581
Compaction migrate scanned
143146541 101084990
Compaction free scanned
208355124 144863510
Compaction cost 1403 1210
order-9:
Compaction stalls 7311 7401
Compaction success 1634 1683
Compaction failures 5677 5718
Page migrate success 194657 183988
Page migrate failure 4753 4170
Compaction pages isolated 498790 456130
Compaction migrate scanned 565371 524174
Compaction free scanned
4230296 4250744
Compaction cost 215 203
[1] https://lwn.net/Articles/684611/
This patch (of 11):
A recent patch has added whole_zone flag that compaction sets when
scanning starts from the zone boundary, in order to report that zone has
been fully scanned in one attempt. For allocations that want to try
really hard or cannot fail, we will want to introduce a mode where
scanning whole zone is guaranteed regardless of the cached positions.
This patch reuses the whole_zone flag in a way that if it's already passed
true to compaction, the cached scanner positions are ignored. Employing
this flag during reclaim/compaction loop will be done in the next patch.
This patch however converts compaction invoked from userspace via procfs
to use this flag. Before this patch, the cached positions were first
reset to zone boundaries and then read back from struct zone, so there was
a window where a parallel compaction could replace the reset values,
making the manual compaction less effective. Using the flag instead of
performing reset is more robust.
Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Sat, 10 Sep 2016 10:34:03 +0000 (20:34 +1000)]
mm/oom_kill.c: fix task_will_free_mem() comment
Attempt to demystify the task_will_free_mem() loop.
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Sat, 10 Sep 2016 10:34:03 +0000 (20:34 +1000)]
mm: memcontrol: add sanity checks for memcg->id.ref on get/put
Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zijun_hu [Sat, 10 Sep 2016 10:34:03 +0000 (20:34 +1000)]
bitops.h: move out get_count_order[_long]() from __KERNEL__ scope
move out get_count_order[_long]() definitions from scope limited by macro
__KERNEL__
it not only make both functions available in wider region regardless of
whether __KERNEL__ is defined but also keep original region for
get_count_order() before the recent commit
c513b4cd2fe9
("mm-vmalloc-fix-align-value-calculation-error-v2-fix-fix")
Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 10 Sep 2016 10:34:03 +0000 (20:34 +1000)]
mm-vmalloc-fix-align-value-calculation-error-v2-fix-fix
move get_count_order[_long] definitions to pick up fls_long()
Cc: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 10 Sep 2016 10:34:03 +0000 (20:34 +1000)]
mm-vmalloc-fix-align-value-calculation-error-v2-fix
locate get_count_order_long() next to get_count_order()
Cc: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zijun_hu [Sat, 10 Sep 2016 10:34:03 +0000 (20:34 +1000)]
mm-vmalloc-fix-align-value-calculation-error-v2
i provide another patch called v2 based on your suggestion as shown below
it have following correction against original patch v1
1) use name get_count_order_long() instead of get_order_long()
2) return -1 if @l == 0 to consist with get_order_long()
3) cast type to int before returning from get_count_order_long()
4) move up function parameter checking for __get_vm_area_node()
5) more commit message is offered to make issue and approach clear
any comments about new patch is welcome
Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 10 Sep 2016 10:34:02 +0000 (20:34 +1000)]
mm-vmalloc-fix-align-value-calculation-error-fix
s/get_order_long()/get_count_order_long()/ to match get_count_order()
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zijun_hu [Sat, 10 Sep 2016 10:34:02 +0000 (20:34 +1000)]
mm/vmalloc.c: fix align value calculation error
It causes double align requirement for __get_vm_area_node() if parameter
size is power of 2 and VM_IOREMAP is set in parameter flags, for example
size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000
get_count_order_long() is implemented and used instead of fls_long() for
fixing the bug, for example
size=0x10000 -> get_count_order_long(0x10000)=16 -> align=0x10000
Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ganesh Mahendran [Sat, 10 Sep 2016 10:34:02 +0000 (20:34 +1000)]
mm/zsmalloc: add per-class compact trace event
add per-class compact trace event to get number of migrated objects
and number of freed pages.
trace log is like below:
bash-3863 [001] .... 141.791366: zs_compact_start: pool zram0
bash-3863 [001] .... 141.791372: zs_compact: class 254: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791375: zs_compact: class 202: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791385: zs_compact: class 190: 1 objects migrated, 3 pages freed
bash-3863 [001] .... 141.791393: zs_compact: class 168: 2 objects migrated, 2 pages freed
bash-3863 [001] .... 141.791396: zs_compact: class 151: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791407: zs_compact: class 144: 5 objects migrated, 4 pages freed
bash-3863 [001] .... 141.791427: zs_compact: class 126: 8 objects migrated, 8 pages freed
bash-3863 [001] .... 141.791433: zs_compact: class 111: 1 objects migrated, 4 pages freed
bash-3863 [001] .... 141.791459: zs_compact: class 107: 18 objects migrated, 12 pages freed
bash-3863 [001] .... 141.791487: zs_compact: class 100: 18 objects migrated, 16 pages freed
bash-3863 [001] .... 141.791509: zs_compact: class 94: 18 objects migrated, 9 pages freed
bash-3863 [001] .... 141.791560: zs_compact: class 91: 44 objects migrated, 24 pages freed
bash-3863 [001] .... 141.791605: zs_compact: class 83: 35 objects migrated, 20 pages freed
bash-3863 [001] .... 141.791616: zs_compact: class 76: 8 objects migrated, 4 pages freed
bash-3863 [001] .... 141.791644: zs_compact: class 74: 21 objects migrated, 9 pages freed
bash-3863 [001] .... 141.791665: zs_compact: class 71: 18 objects migrated, 10 pages freed
bash-3863 [001] .... 141.791736: zs_compact: class 67: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791763: zs_compact: class 66: 22 objects migrated, 8 pages freed
bash-3863 [001] .... 141.791820: zs_compact: class 62: 18 objects migrated, 6 pages freed
bash-3863 [001] .... 141.791826: zs_compact: class 58: 1 objects migrated, 4 pages freed
bash-3863 [001] .... 141.791829: zs_compact: class 57: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791834: zs_compact: class 54: 2 objects migrated, 2 pages freed
...
bash-3863 [001] .... 141.791952: zs_compact: class 4: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791964: zs_compact: class 3: 14 objects migrated, 1 pages freed
bash-3863 [001] .... 141.791966: zs_compact: class 2: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791968: zs_compact: class 1: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791971: zs_compact: class 0: 0 objects migrated, 0 pages freed
bash-3863 [001] .... 141.791973: zs_compact_end: pool zram0: 155 pages compacted
Also this patch changes trace_zsmalloc_compact_start[end] to
trace_zs_compact_start[end] to keep function naming consistent with others
in zsmalloc.
Link: http://lkml.kernel.org/r/1467882338-4300-8-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ganesh Mahendran [Sat, 10 Sep 2016 10:34:02 +0000 (20:34 +1000)]
mm/zsmalloc: add trace events for zs_compact
Currently zsmalloc is widely used in android device. Sometimes, we want
to see how frequently zs_compact is triggered or how may pages freed by
zs_compact(), or which zsmalloc pool is compacted.
We have backported the zs_compact() to our product(kernel 3.18). It is
usefull for a longtime running device. But there is not a convenient
way to get the detailed information of zs_comapct() which is usefull
for performance optimization. Information about how much time
zs_compact used, which pool is compacted, how many page freed, etc.
With these information, we will know what is going on in zs_comapct.
And draw the relation between free mem and zs_comapct.
Most of the time, user can get the brief information from
trace_mm_shrink_slab_[start | end], but in some senario, they do not use
zsmalloc shrinker, but trigger compaction manually. So add some trace
events in zs_compact is convenient. Also we can add some zsmalloc
specific information(pool name, total compact pages, etc) in zsmalloc
trace.
This patch add two trace events for zs_compact(), below the trace log:
-----------------------------
root@land:/ # cat /d/tracing/trace
kswapd0-125 [007] ...1 174.176979: zsmalloc_compact_start: pool zram0
kswapd0-125 [007] ...1 174.181967: zsmalloc_compact_end: pool zram0: 608 pages compacted(total 1794)
kswapd0-125 [000] ...1 184.134475: zsmalloc_compact_start: pool zram0
kswapd0-125 [000] ...1 184.135010: zsmalloc_compact_end: pool zram0: 62 pages compacted(total 1856)
kswapd0-125 [003] ...1 226.927221: zsmalloc_compact_start: pool zram0
kswapd0-125 [003] ...1 226.928575: zsmalloc_compact_end: pool zram0: 250 pages compacted(total 2106)
-----------------------------
Link: http://lkml.kernel.org/r/1465289804-4913-1-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Sat, 10 Sep 2016 10:34:02 +0000 (20:34 +1000)]
mm: oom: deduplicate victim selection code for memcg and global oom
When selecting an oom victim, we use the same heuristic for both memory
cgroup and global oom. The only difference is the scope of tasks to
select the victim from. So we could just export an iterator over all
memcg tasks and keep all oom related logic in oom_kill.c, but instead we
duplicate pieces of it in memcontrol.c reusing some initially private
functions of oom_kill.c in order to not duplicate all of it. That looks
ugly and error prone, because any modification of select_bad_process
should also be propagated to mem_cgroup_out_of_memory.
Let's rework this as follows: keep all oom heuristic related code private
to oom_kill.c and make oom_kill.c use exported memcg functions when it's
really necessary (like in case of iterating over memcg tasks).
Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Konstantin Khlebnikov [Sat, 10 Sep 2016 10:34:02 +0000 (20:34 +1000)]
kernel/watchdog: use nmi registers snapshot in hardlockup handler
NMI handler doesn't call set_irq_regs(), it's set only by normal IRQ.
Thus get_irq_regs() returns NULL or stale registers snapshot with IP/SP
pointing to the code interrupted by IRQ which was interrupted by NMI.
NULL isn't a problem: in this case watchdog calls dump_stack() and
prints full stack trace including NMI. But if we're stuck in IRQ
handler then NMI watchlog will print stack trace without IRQ part at
all.
This patch uses registers snapshot passed into NMI handler as
arguments: these registers points exactly to the instruction
interrupted by NMI.
Fixes: 55537871ef66 ("kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup")
Link: http://lkml.kernel.org/r/146771764784.86724.6006627197118544150.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Sat, 10 Sep 2016 10:34:01 +0000 (20:34 +1000)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit
d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bhaktipriya Shridhar [Sat, 10 Sep 2016 10:34:01 +0000 (20:34 +1000)]
fs/ocfs2/dlm: remove deprecated create_singlethread_workqueue()
The workqueue "dlm_worker" queues a single work item &dlm->dispatched_work
and thus it doesn't require execution ordering. Hence, alloc_workqueue
has been used to replace the deprecated create_singlethread_workqueue
instance.
The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure.
Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.
Link: http://lkml.kernel.org/r/2b5ad8d6688effe1a9ddb2bc2082d26fbbe00302.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bhaktipriya Shridhar [Sat, 10 Sep 2016 10:34:01 +0000 (20:34 +1000)]
fs/ocfs2/super: remove deprecated create_singlethread_workqueue()
The workqueue "ocfs2_wq" queues multiple work items viz
&osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work,
&osb->osb_truncate_log_wq which require strict execution ordering. Hence,
an ordered dedicated workqueue has been used.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure because the workqueue is being used on a memory reclaim path.
Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bhaktipriya Shridhar [Sat, 10 Sep 2016 10:34:01 +0000 (20:34 +1000)]
fs/ocfs2/cluster: remove deprecated create_singlethread_workqueue()
The workqueue "o2net_wq" queues multiple work items viz
&old_sc->sc_shutdown_work, &sc->sc_rx_work, &sc->sc_connect_work which
require strict execution ordering. Hence, an ordered dedicated workqueue
has been used.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Link: http://lkml.kernel.org/r/ddc12e5766c79ba26f8a00d98049107f8a1d4866.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bhaktipriya Shridhar [Sat, 10 Sep 2016 10:34:01 +0000 (20:34 +1000)]
fs/ocfs2/dlmfs: remove deprecated create_singlethread_workqueue()
The workqueue "user_dlm_worker" queues a single work item &lockres->l_work
per user_lock_res instance and hence it doesn't require execution
ordering. Hence, alloc_workqueue has been used to replace the deprecated
create_singlethread_workqueue instance.
The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure.
Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.
Link: http://lkml.kernel.org/r/9748136d3a3b18138ad1d6ba708367aa1fe9f98c.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexey Dobriyan [Sat, 10 Sep 2016 10:34:00 +0000 (20:34 +1000)]
kbuild: simpler generation of assembly constants
gcc doesn't really look inside "asm" statements and more or less
directly emits it into assembly. So pretend "#define" is CPU
instruction.
C++ comment can't be used because sparc assembler doesn't understand it.
Link: http://lkml.kernel.org/r/20160713173646.GA1910@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Marek <mmarek@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 10 Sep 2016 10:34:00 +0000 (20:34 +1000)]
arm: arch/arm/include/asm/page.h needs personality.h
VM_DATA_DEFAULT_FLAGS uses READ_IMPLIES_EXEC, so page.h should include
personality.h to provide this.
This fixes no known bugs and can be safely ignored ;)
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Florian Fainelli [Sat, 10 Sep 2016 10:34:00 +0000 (20:34 +1000)]
MAINTAINERS: update email for VLYNQ bus entry
Link: http://lkml.kernel.org/r/1473218738-21836-1-git-send-email-f.fainelli@gmail.com
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill A. Shutemov [Sat, 10 Sep 2016 10:34:00 +0000 (20:34 +1000)]
mm: avoid endless recursion in dump_page()
dump_page() uses page_mapcount() to get mapcount of the page.
page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)) as mapcount doesn't
make sense for slab pages and the field in struct page used for other
information.
It leads to recursion if dump_page() called for slub page and DEBUG_VM
is enabled:
dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page -> ...
Let's avoid calling page_mapcount() for slab pages in dump_page().
Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ebru Akagunduz [Sat, 10 Sep 2016 10:34:00 +0000 (20:34 +1000)]
mm, thp: fix leaking mapped pte in __collapse_huge_page_swapin()
Currently, khugepaged does not permit swapin if there are enough young
pages in a THP. The problem is when a THP does not have enough young
pages, khugepaged leaks mapped ptes.
This patch prohibits leaking mapped ptes.
Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vegard Nossum [Sat, 10 Sep 2016 10:33:59 +0000 (20:33 +1000)]
stackdepot: fix mempolicy use-after-free
This patch fixes the following:
BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr
ffff88010b48102c
Read of size 2 by task trinity-c2/15425
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-proje
ct.org 04/01/2014
ffff88010b481040 ffff88010b557650 ffffffff81f08d11 ffff88011a40d380
ffff88010b481028 ffff88010b557678 ffffffff815dac7c ffff88010b557708
ffff88010b481028 ffff88011a40d380 ffff88010b5576f8 ffffffff815daf15
Call Trace:
[<
ffffffff81f08d11>] dump_stack+0x65/0x84
[<
ffffffff815dac7c>] kasan_object_err+0x1c/0x70
[<
ffffffff815daf15>] kasan_report_error+0x1f5/0x4c0
[<
ffffffff815db2fe>] __asan_report_load2_noabort+0x3e/0x40
[<
ffffffff815cb903>] alloc_pages_current+0x363/0x370 <---- use-after-free
[<
ffffffff81fa9954>] depot_save_stack+0x3f4/0x490
[<
ffffffff815d9bb5>] save_stack+0xb5/0xd0
[<
ffffffff815da211>] kasan_slab_free+0x71/0xb0
[<
ffffffff815d6643>] kmem_cache_free+0xa3/0x290
[<
ffffffff815c8149>] __mpol_put+0x19/0x20 <---- free
[<
ffffffff81260635>] do_exit+0x1515/0x2b70
[<
ffffffff81261dc4>] do_group_exit+0xf4/0x2f0
[<
ffffffff81281c5d>] get_signal+0x53d/0x1120
[<
ffffffff8119e993>] do_signal+0x83/0x1e20
[<
ffffffff810027af>] exit_to_usermode_loop+0xaf/0x140
[<
ffffffff810051e4>] syscall_return_slowpath+0x144/0x170
[<
ffffffff83ae406f>] ret_from_fork+0x2f/0x40
Read of size 2 by task trinity-c2/15425
The problem is that we may be calling alloc_pages() in a code path where
current->mempolicy has already been freed.
By passing __GFP_THISNODE we will always use default_mempolicy (which
cannot be freed).
Link: https://lkml.org/lkml/2016/7/29/277
Link: https://github.com/google/kasan/issues/35
Link: http://lkml.kernel.org/r/1471603265-31804-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Srikar Dronamraju [Sat, 10 Sep 2016 10:33:59 +0000 (20:33 +1000)]
mm/page_alloc.c: replace set_dma_reserve to set_memory_reserve
Expand the scope of the existing dma_reserve to accommodate other memory
reserves too. Accordingly rename variable dma_reserve to
nr_memory_reserve.
set_memory_reserve() also takes a new parameter that helps to identify if
the current value needs to be incremented.
Link: http://lkml.kernel.org/r/1470330729-6273-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Aruna Ramakrishna [Sat, 10 Sep 2016 10:33:59 +0000 (20:33 +1000)]
mm/slab: improve performance of gathering slabinfo stats
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).
This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo
statistics now only requires walking the slabs_free and slabs_partial
lists, and those lists are usually much smaller than slabs_full. We
tested this after growing the dentry cache to 70GB, and the performance
improved from 2s to 5ms.
Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill A. Shutemov [Sat, 10 Sep 2016 10:33:59 +0000 (20:33 +1000)]
khugepaged: fix use-after-free in collapse_huge_page()
hugepage_vma_revalidate() tries to re-check if we still should try to
collapse small pages into huge one after the re-acquiring mmap_sem.
The problem Dmitry Vyukov reported[1] is that the vma found by
hugepage_vma_revalidate() can be suitable for huge pages, but not the same
vma we had before dropping mmap_sem. And dereferencing original vma can
lead to fun results..
Let's use vma hugepage_vma_revalidate() found instead of assuming it's the
same as what we had before the lock was dropped. [1]
http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com
Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sudip Mukherjee [Sat, 10 Sep 2016 10:33:59 +0000 (20:33 +1000)]
MAINTAINERS: Maik has moved
Maik is no longer using the plusserver.de email, update with his
current email.
Link: http://lkml.kernel.org/r/1473007794-27960-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Cc: Maik Broemme <mbroemme@libmpq.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Sat, 10 Sep 2016 10:33:58 +0000 (20:33 +1000)]
ocfs2/dlm: fix race between convert and migration
Commit
ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not. This will introduce a race that right after old
master does umount ( means master will change), a new convert request
comes.
In this case, it will reset lockres state to DLM_RECOVERING and then retry
convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between ocfs2
and dlm, and then finally BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify the
race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Zhong [Sat, 10 Sep 2016 10:33:58 +0000 (20:33 +1000)]
mem-hotplug: Don't clear the only node in new_node_page()
394e31d2c ("mem-hotplug: alloc new page from a nearest neighbor node when
mem-offline") introduced new_node_page() for memory hotplug.
In new_node_page(), the nid is cleared before calling
__alloc_pages_nodemask(). But if it is the only node of the system, and
the first round allocation fails, it will not be able to get memory from
an empty nodemask, and will trigger oom.
The patch checks whether it is the last node on the system, and if it is, then
don't clear the nid in the nodemask.
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reported-by: John Allen <jallen@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Thu, 8 Sep 2016 19:23:13 +0000 (12:23 -0700)]
Merge tag 'ceph-for-4.8-rc6' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a 4.7 performance regression, caused by a typo in an if
condition"
* tag 'ceph-for-4.8-rc6' of git://github.com/ceph/ceph-client:
ceph: do not modify fi->frag in need_reset_readdir()
Linus Torvalds [Thu, 8 Sep 2016 19:19:24 +0000 (12:19 -0700)]
Merge branch 'dmi-for-linus' of git://git./linux/kernel/git/jdelvare/staging
Pull dmi fix from Jean Delvare.
* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
dmi-id: don't free dev structure after calling device_register
Linus Torvalds [Thu, 8 Sep 2016 19:05:15 +0000 (12:05 -0700)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"This is a slightly larger batch of fixes that we've been sitting on a
few -rcs. Most of them are simple oneliners, but there are two sets
that are slightly larger and worth pointing out:
- A set of patches to OMAP to deal with hwmod for RTC on am33xx
(beaglebone SoC, among others). It's the only clock that ever has
a valid offset of 0, so a new flag needed introduction once this
problem was discovered.
- A collection of CCI fixes for performance counters discovered once
people started using it on X-Gene CPUs"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (37 commits)
arm-cci: pmu: Fix typo in event name
Revert "ARM: tegra: fix erroneous address in dts"
ARM: dts: imx6qdl: Fix SPDIF regression
ARM: imx6: add missing BM_CLPCR_BYPASS_PMIC_READY setting for imx6sx
ARM: dts: imx7d-sdb: fix ti,x-plate-ohms property name
ARM: dts: kirkwood: Fix PCIe label on OpenRD
ARM: kirkwood: ib62x0: fix size of u-boot environment partition
bus: arm-ccn: make event groups reliable
bus: arm-ccn: fix hrtimer registration
bus: arm-ccn: fix PMU interrupt flags
ARM: tegra: Correct polarity for Tegra114 PMIC interrupt
MAINTAINERS: add tree entry for ARM/UniPhier architecture
ARM: sun5i: Fix typo in trip point temperature
MAINTAINERS: Switch to kernel.org account for Krzysztof Kozlowski
ARM: imx6ul: populates platform device at .init_machine
bus: arm-ccn: Add missing event attribute exclusions for host/guest
bus: arm-ccn: Correct required arguments for XP PMU events
bus: arm-ccn: Fix XP watchpoint settings bitmask
bus: arm-ccn: Do not attempt to configure XPs for cycle counter
bus: arm-ccn: Fix PMU handling of MN
...
Allen Hung [Fri, 15 Jul 2016 09:42:22 +0000 (17:42 +0800)]
dmi-id: don't free dev structure after calling device_register
dmi_dev is freed in error exit code but, according to the document
of device_register, it should never directly free device structure
after calling this function, even if it returned an error! Use
put_device() instead.
Signed-off-by: Allen Hung <allen_hung@dell.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Linus Torvalds [Thu, 8 Sep 2016 04:28:26 +0000 (21:28 -0700)]
Merge branch 'for-rc' of git://git./linux/kernel/git/rzhang/linux
Pull thermal fix from Zhang Rui:
"Only one patch this time, which fixes a crash in rcar_thermal driver.
From Dirk Behme"
* 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
thermal: rcar_thermal: Fix priv->zone error handling
Olof Johansson [Thu, 8 Sep 2016 04:25:08 +0000 (21:25 -0700)]
Merge tag 'sunxi-fixes-for-4.8' of https://git./linux/kernel/git/mripard/linux into fixes
Allwinner fixes for 4.8
A single patch fixing a typo in the temperature trip points in the A13
DTSI.
* tag 'sunxi-fixes-for-4.8' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux:
ARM: sun5i: Fix typo in trip point temperature
Signed-off-by: Olof Johansson <olof@lixom.net>
Suzuki K Poulose [Mon, 5 Sep 2016 15:27:53 +0000 (16:27 +0100)]
arm-cci: pmu: Fix typo in event name
For one of the CCI events exposed under sysfs, "snoop" was typo'd as
"snopp". Correct this such that users see the expected event name when
enumerating events via sysfs.
Cc: arm@kernel.org
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
Olof Johansson [Thu, 8 Sep 2016 04:24:22 +0000 (21:24 -0700)]
Merge tag 'imx-fixes-4.8-2' of git://git./linux/kernel/git/shawnguo/linux into fixes
i.MX fixes for 4.8, 2nd round:
- Fix misspelled "ti,x-plate-ohms" property name of touchscreen
controller for imx7d-sdb DTS.
- Add missing BM_CLPCR_BYPASS_PMIC_READY setting for i.MX6SX to get
suspend/resume work properly.
- Fix SPDIF regression on imx6qdl which caused by a clock update on
spdif device node.
* tag 'imx-fixes-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
ARM: dts: imx6qdl: Fix SPDIF regression
ARM: imx6: add missing BM_CLPCR_BYPASS_PMIC_READY setting for imx6sx
ARM: dts: imx7d-sdb: fix ti,x-plate-ohms property name
Signed-off-by: Olof Johansson <olof@lixom.net>
Olof Johansson [Thu, 8 Sep 2016 04:16:40 +0000 (21:16 -0700)]
Revert "ARM: tegra: fix erroneous address in dts"
This reverts commit
b5c86b7496d74f6e454bcab5166efa023e1f0459.
This is no longer needed due to other changes going into 4.8 to rename
the unit addresses on a large number of device nodes. So it was picked up
for v4.8-rc1 in error.
Reported-by: Ralf Ramsauer <ralf@ramses-pyramidenbau.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
Linus Torvalds [Wed, 7 Sep 2016 21:03:49 +0000 (14:03 -0700)]
Merge tag 'usercopy-v4.8-rc6-part2' of git://git./linux/kernel/git/kees/linux
Pull more hardened usercopyfixes from Kees Cook:
- force check_object_size() to be inline too
- move page-spanning check behind a CONFIG since it's triggering false
positives
[ Changed the page-spanning config option to depend on EXPERT in the
merge. That way it still gets build testing, and you can enable it if
you want to, but is never enabled for "normal" configurations ]
* tag 'usercopy-v4.8-rc6-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usercopy: remove page-spanning test for now
usercopy: force check_object_size() inline
Kees Cook [Wed, 7 Sep 2016 16:54:34 +0000 (09:54 -0700)]
usercopy: remove page-spanning test for now
A custom allocator without __GFP_COMP that copies to userspace has been
found in vmw_execbuf_process[1], so this disables the page-span checker
by placing it behind a CONFIG for future work where such things can be
tracked down later.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=
1373326
Reported-by: Vinson Lee <vlee@freedesktop.org>
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Wed, 7 Sep 2016 16:39:32 +0000 (09:39 -0700)]
usercopy: force check_object_size() inline
Just for good measure, make sure that check_object_size() is always
inlined too, as already done for copy_*_user() and __copy_*_user().
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Linus Torvalds [Wed, 7 Sep 2016 17:46:06 +0000 (10:46 -0700)]
Merge tag 'seccomp-v4.8-rc6' of git://git./linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"Fix UM seccomp vs ptrace, after reordering landed"
* tag 'seccomp-v4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Remove 2-phase API documentation
um/ptrace: Fix the syscall number update after a ptrace
um/ptrace: Fix the syscall_trace_leave call
Linus Torvalds [Wed, 7 Sep 2016 16:29:36 +0000 (09:29 -0700)]
Merge tag 'usercopy-v4.8-rc6' of git://git./linux/kernel/git/kees/linux
Pull hardened usercopy fixes from Kees Cook:
- inline copy_*_user() for correct use of __builtin_const_p() for
hardened usercopy and the recent compile-time checks.
- switch hardened usercopy to only check non-const size arguments to
avoid meaningless checks on likely-sane const values.
- update lkdtm usercopy tests to compenstate for the const checking.
* tag 'usercopy-v4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lkdtm: adjust usercopy tests to bypass const checks
usercopy: fold builtin_const check into inline function
x86/uaccess: force copy_*_user() to be inlined
Mickaël Salaün [Mon, 1 Aug 2016 21:01:57 +0000 (23:01 +0200)]
seccomp: Remove 2-phase API documentation
Fixes: 8112c4f140fa ("seccomp: remove 2-phase API")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Mickaël Salaün [Mon, 1 Aug 2016 21:01:56 +0000 (23:01 +0200)]
um/ptrace: Fix the syscall number update after a ptrace
Update the syscall number after each PTRACE_SETREGS on ORIG_*AX.
This is needed to get the potentially altered syscall number in the
seccomp filters after RET_TRACE.
This fix four seccomp_bpf tests:
> [ RUN ] TRACE_syscall.skip_after_RET_TRACE
> seccomp_bpf.c:1560:TRACE_syscall.skip_after_RET_TRACE:Expected -1 (
18446744073709551615) == syscall(39) (26)
> seccomp_bpf.c:1561:TRACE_syscall.skip_after_RET_TRACE:Expected 1 (1) == (*__errno_location ()) (22)
> [ FAIL ] TRACE_syscall.skip_after_RET_TRACE
> [ RUN ] TRACE_syscall.kill_after_RET_TRACE
> TRACE_syscall.kill_after_RET_TRACE: Test exited normally instead of by signal (code: 1)
> [ FAIL ] TRACE_syscall.kill_after_RET_TRACE
> [ RUN ] TRACE_syscall.skip_after_ptrace
> seccomp_bpf.c:1622:TRACE_syscall.skip_after_ptrace:Expected -1 (
18446744073709551615) == syscall(39) (26)
> seccomp_bpf.c:1623:TRACE_syscall.skip_after_ptrace:Expected 1 (1) == (*__errno_location ()) (22)
> [ FAIL ] TRACE_syscall.skip_after_ptrace
> [ RUN ] TRACE_syscall.kill_after_ptrace
> TRACE_syscall.kill_after_ptrace: Test exited normally instead of by signal (code: 1)
> [ FAIL ] TRACE_syscall.kill_after_ptrace
Fixes: 26703c636c1f ("um/ptrace: run seccomp after ptrace")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: James Morris <jmorris@namei.org>
Cc: user-mode-linux-devel@lists.sourceforge.net
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Mickaël Salaün [Mon, 1 Aug 2016 21:01:55 +0000 (23:01 +0200)]
um/ptrace: Fix the syscall_trace_leave call
Keep the same semantic as before the commit
26703c636c1f: deallocate
audit context and fake a proper syscall exit.
This fix a kernel panic triggered by the seccomp_bpf test:
> [ RUN ] global.ERRNO_valid
> BUG: failure at kernel/auditsc.c:1504/__audit_syscall_entry()!
> Kernel panic - not syncing: BUG!
Fixes: 26703c636c1f ("um/ptrace: run seccomp after ptrace")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: James Morris <jmorris@namei.org>
Cc: user-mode-linux-devel@lists.sourceforge.net
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Linus Torvalds [Tue, 6 Sep 2016 19:33:12 +0000 (12:33 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"This is the second pull request for the rdma subsystem. Most of the
patches are small and obvious. I took two patches in that are larger
than I wanted this late in the cycle.
The first is the hfi1 patch that implements a work queue to test the
QSFP read state. I originally rejected the first patch for this
(which would have place up to 20 seconds worth of udelays in their
probe routine). They then rewrote it the way I wanted (use delayed
work tasks to wait asynchronously up to 20 seconds for the QSFP to
come alive), so I can't really complain about the size of getting what
I asked for :-/.
The second is large because it switches the rcu locking in the debugfs
code. Since a locking change like this is done all at once, the size
it what it is. It resolves a litany of debug messages from the
kernel, so I pulled it in for -rc.
The rest are all typical -rc worthy patches I think.
There will still be a third -rc pull request from the rdma subsystem
this release. I hope to have that one ready to go by the end of this
week or early next.
Summary:
- a smattering of small fixes across the core, ipoib, i40iw, isert,
cxgb4, and mlx4
- a slightly larger group of fixes to each of mlx5 and hfi1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/hfi1: Rework debugfs to use SRCU
IB/hfi1: Make n_krcvqs be an unsigned long integer
IB/hfi1: Add QSFP sanity pre-check
IB/hfi1: Fix AHG KDETH Intr shift
IB/hfi1: Fix SGE length for misaligned PIO copy
IB/mlx5: Don't return errors from poll_cq
IB/mlx5: Use TIR number based on selector
IB/mlx5: Simplify code by removing return variable
IB/mlx5: Return EINVAL when caller specifies too many SGEs
IB/mlx4: Don't return errors from poll_cq
Revert "IB/mlx4: Return EAGAIN for any error in mlx4_ib_poll_one"
IB/ipoib: Fix memory corruption in ipoib cm mode connect flow
IB/core: Fix use after free in send_leave function
IB/cxgb4: Make _free_qp static to silence build warning
IB/isert: Properly release resources on DEVICE_REMOVAL
IB/hfi1: Fix the size parameter to find_first_bit
IB/mlx5: Fix the size parameter to find_first_bit
IB/hfi1: Clean up type used and casting
i40iw: Receive notification events correctly
i40iw: Update hw_iwarp_state
Kees Cook [Tue, 6 Sep 2016 18:26:12 +0000 (11:26 -0700)]
lkdtm: adjust usercopy tests to bypass const checks
The hardened usercopy is now consistently avoiding checks against const
sizes, since we really only want to perform runtime bounds checking
on lengths that weren't known at build time. To test the hardened usercopy
code, we must force the length arguments to be seen as non-const.
Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Wed, 31 Aug 2016 23:04:21 +0000 (16:04 -0700)]
usercopy: fold builtin_const check into inline function
Instead of having each caller of check_object_size() need to remember to
check for a const size parameter, move the check into check_object_size()
itself. This actually matches the original implementation in PaX, though
this commit cleans up the now-redundant builtin_const() calls in the
various architectures.
Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Tue, 6 Sep 2016 18:56:01 +0000 (11:56 -0700)]
x86/uaccess: force copy_*_user() to be inlined
As already done with __copy_*_user(), mark copy_*_user() as __always_inline.
Without this, the checks for things like __builtin_const_p() won't work
consistently in either hardened usercopy nor the recent adjustments for
detecting usercopy overflows at compile time.
The change in kernel text size is detectable, but very small:
text data bss dec hex filename
12118735 5768608 14229504 32116847 1ea106f vmlinux.before
12120207 5768608 14229504 32118319 1ea162f vmlinux.after
Signed-off-by: Kees Cook <keescook@chromium.org>
Linus Torvalds [Tue, 6 Sep 2016 18:15:07 +0000 (11:15 -0700)]
Merge branch 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu/integration
Pull mailbox fixes from Jassi Brar:
"Misc fixes for BCM mailbox driver
- Fix build warnings by making static functions used within the file.
- Check for potential NULL before dereferencing
- Fix link error by defining HAS_DMA dependency"
* 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
fix:mailbox:bcm-pdc-mailbox:mark symbols static where possible
mailbox: bcm-pdc: potential NULL dereference in pdc_shutdown()
mailbox: Add HAS_DMA Kconfig dependency to BCM_PDC_MBOX
Linus Torvalds [Tue, 6 Sep 2016 18:06:52 +0000 (11:06 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is really three fixes, but the SES one comes in a bundle of three
(making the replacement API available properly, using it and removing
the non-working one). The SES problem causes an oops on hpsa devices
because they attach virtual disks to the host which aren't SAS
attached (the replacement API ignores them).
The other two fixes are fairly minor: the sense key one means we
actually resolve a newly added sense key and the RDAC device
blacklisting is needed to prevent us annoying the universal XPORT lun
of various RDAC arrays"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sas: remove is_sas_attached()
scsi: ses: use scsi_is_sas_rphy instead of is_sas_attached
scsi: sas: provide stub implementation for scsi_is_sas_rphy
scsi: blacklist all RDAC devices for BLIST_NO_ULD_ATTACH
scsi: fix upper bounds check of sense key in scsi_sense_key_string()
Linus Torvalds [Tue, 6 Sep 2016 18:02:36 +0000 (11:02 -0700)]
Merge tag 'regmap-fix-v4.8-rc5' of git://git./linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"Several fixes here, the main one being the change from Lars-Peter
which I'd been letting soak in -next since the merge window in case it
uncovered further issues as it's a minimal fix rather than a change
addressing the root cause of the problems (which would've been too
invasive for -rc):
- The biggest change is a fix from Lars-Peter to ensure that we don't
create overlapping rbtree nodes which in turn avoids returning
corrupt cache values to users, fixing some issues that were exposed
by some recent optimisations with certain access patterns but had
been present for a long time.
- A fix from Elaine Zhang to stop us updating the cache if we get an
I/O error when writing to the hardware.
- A fix fromm Maarten ter Huurne to avoid uninitialized defaults in
cases where we have non-readable registers but are initializing the
cache by reading from the device"
* tag 'regmap-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: drop cache if the bus transfer error
regmap: rbtree: Avoid overlapping nodes
regmap: cache: Fix num_reg_defaults computation from reg_defaults_raw
Linus Torvalds [Tue, 6 Sep 2016 17:59:44 +0000 (10:59 -0700)]
Merge tag 'spi-fix-v4.8-rc5' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"As well as the usual driver fixes there's a couple of non-trivial core
fixes in here:
- Fixes for issues reported by Julia Lawall in the changes that were
sent last time to fix interaction between the bus lock and the
locking done for the SPI thread. I'd let this one cook for a while
to make sure nothing else came up in testing.
- A fix from Sien Wu for arithmetic overflows when calculating the
timeout for larger transfers (espcially common with slow buses with
flashes on them)"
* tag 'spi-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: Prevent unexpected SPI time out due to arithmetic overflow
spi: pxa2xx-pci: fix ACPI-based enumeration of SPI devices
MAINTAINERS: add myself as Samsung SPI maintainer
spi: Drop io_mutex in error paths
spi: sh-msiof: Avoid invalid clock generator parameters
spi: img-spfi: Remove spi_master_put in img_spfi_remove()
spi: mediatek: remove spi_master_put in mtk_spi_remove()
spi: qup: Remove spi_master_put in spi_qup_remove()
Linus Torvalds [Tue, 6 Sep 2016 17:43:54 +0000 (10:43 -0700)]
Merge tag 'regulator-fix-v4.8-rc5' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"Two things here, one an e-mail update for Krzysztof Kozlowski and the
other a couple of fixes for issues with incorrectly described voltages
in a couple of the Qualcomm regulator drivers that were breaking MMC
on some platforms"
* tag 'regulator-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: Change Krzysztof Kozlowski's email to kernel.org
regulator: qcom_smd: Fix voltage ranges for pma8084 ftsmps and pldo
regulator: qcom_smd: Fix voltage ranges for pm8x41
Linus Torvalds [Tue, 6 Sep 2016 17:36:12 +0000 (10:36 -0700)]
Merge tag 'pinctrl-v4.8-3' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Nothing special at all, just three SoC-specific driver fixes:
- Fix routing problems in pistachio (Imagination) and sunxi
(AllWinner)
- Fix an interrupt problem in the Cherryview (Intel)"
* tag 'pinctrl-v4.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: sunxi: fix uart1 CTS/RTS pins at PG on A23/A33
pinctrl: cherryview: Do not mask all interrupts in probe
pinctrl: pistachio: fix mfio pll_lock pinmux
Dirk Behme [Thu, 21 Apr 2016 10:24:55 +0000 (12:24 +0200)]
thermal: rcar_thermal: Fix priv->zone error handling
In case thermal_zone_xxx_register() returns an error, priv->zone
isn't NULL any more, but contains the error code.
This is passed to thermal_zone_device_unregister(), then. This checks
for priv->zone being NULL, but the error code is != NULL. So it works
with the error code as a pointer. Crashing immediately.
To fix this, reset priv->zone to NULL before entering
rcar_gen3_thermal_remove().
Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Mark Brown [Tue, 6 Sep 2016 11:32:09 +0000 (12:32 +0100)]
Merge remote-tracking branches 'spi/fix/lock', 'spi/fix/maintainers', 'spi/fix/put', 'spi/fix/pxa2xx', 'spi/fix/sh-msiof' and 'spi/fix/timeout' into spi-linus
Mark Brown [Tue, 6 Sep 2016 11:31:34 +0000 (12:31 +0100)]
Merge remote-tracking branches 'regulator/fix/email' and 'regulator/fix/qcom-smd' into regulator-linus
Linus Torvalds [Mon, 5 Sep 2016 18:10:00 +0000 (11:10 -0700)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes a regression in the cryptd code that breaks certain
accelerated AED algorithms as well as an older regression in the
caam driver that breaks IPsec"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: caam - fix IV loading for authenc (giv)decryption
crypto: cryptd - Use correct tfm object for AEAD tracking
Linus Torvalds [Mon, 5 Sep 2016 17:55:55 +0000 (10:55 -0700)]
Merge branch 'rc-fixes' of git://git./linux/kernel/git/mmarek/kbuild
Pull kbuild fix from Michal Marek:
"Fix for 'make deb-pkg'. The bug got introduced in v4.8-rc1"
* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
builddeb: Skip gcc-plugins when not configured
Nicolas Iooss [Sun, 28 Aug 2016 16:47:12 +0000 (18:47 +0200)]
ceph: do not modify fi->frag in need_reset_readdir()
Commit
f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
comparison operator with an assignment one.
This looks like a typo which is reported by clang when building the
kernel with some warning flags:
fs/ceph/dir.c:600:22: error: using the result of an assignment as a
condition without parentheses [-Werror,-Wparentheses]
} else if (fi->frag |= fpos_frag(new_pos)) {
~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
fs/ceph/dir.c:600:22: note: place parentheses around the assignment
to silence this warning
} else if (fi->frag |= fpos_frag(new_pos)) {
^
( )
fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
assignment into an inequality comparison
} else if (fi->frag |= fpos_frag(new_pos)) {
^~
!=
Fixes: f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fabio Estevam [Wed, 31 Aug 2016 13:56:48 +0000 (10:56 -0300)]
ARM: dts: imx6qdl: Fix SPDIF regression
Commit
833f2cbf7091 ("ARM: dts: imx6: change the core clock of spdif")
changed many more clocks than only the SPDIF core clock as stated in
the commit message.
The MLB clock has been added and this causes SPDIF regression as
reported by Xavi Drudis Ferran and also in this forum post:
https://forum.digikey.com/thread/34240
The MX6Q Reference Manual does not mention that MLB is a clock related
to SPDIF, so change it back to a dummy clock to restore SPDIF
functionality.
Thanks to Ambika for providing the fix at:
https://community.nxp.com/thread/387131
Fixes: 833f2cbf7091 ("ARM: dts: imx6: change the core clock of spdif")
Cc: <stable@vger.kernel.org> # 4.4.x
Reported-by: Xavi Drudis Ferran <xdrudis@tinet.cat>
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Tested-by: Xavi Drudis Ferran <xdrudis@tinet.cat>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Linus Torvalds [Sun, 4 Sep 2016 21:31:46 +0000 (14:31 -0700)]
Linux 4.8-rc5
Linus Torvalds [Sun, 4 Sep 2016 15:45:41 +0000 (08:45 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for an AMD erratum so machines without a BIOS fix work"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/AMD: Apply erratum 665 on machines without a BIOS fix
Linus Torvalds [Sun, 4 Sep 2016 15:43:45 +0000 (08:43 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixlet from the timers departement:
- A fix for scheduler stalls in the tick idle code affecting
NOHZ_FULL kernels
- A trivial compile fix"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/nohz: Fix softlockup on scheduler stalls in kvm guest
clocksource/drivers/atmel-pit: Fix compilation error
Linus Torvalds [Sun, 4 Sep 2016 00:29:58 +0000 (17:29 -0700)]
Merge tag 'dm-4.8-fixes-4' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a stable fix in both DM crypt and DM log-writes for too large bios
(as generated by bcache)
- two other stable fixes for DM log-writes
- a stable fix for a DM crypt bug that could result in freeing pointers
from uninitialized memory in the tfm allocation error path
- a DM bufio cleanup to discontinue using create_singlethread_workqueue()
* tag 'dm-4.8-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: remove use of deprecated create_singlethread_workqueue()
dm crypt: fix free of bad values after tfm allocation failure
dm crypt: fix error with too large bios
dm log writes: fix check of kthread_run() return value
dm log writes: fix bug with too large bios
dm log writes: move IO accounting earlier to fix error path
Linus Torvalds [Sat, 3 Sep 2016 19:40:45 +0000 (12:40 -0700)]
Merge branch 'for-linus-4.8' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"I'm still prepping a set of fixes for btrfs fsync, just nailing down a
hard to trigger memory corruption. For now, these are tested and ready."
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
Btrfs: fix endless loop in balancing block groups
Btrfs: kill invalid ASSERT() in process_all_refs()
Linus Torvalds [Sat, 3 Sep 2016 19:31:37 +0000 (12:31 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
"arm64 and arm/perf fixes:
- arm64 fix: debug exception unmasking on the CPU resume path
- ARM PMU fixes: memory leak on error path and NULL pointer
dereference"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kernel: Fix unmasked debug exceptions when restoring mdscr_el1
drivers/perf: arm_pmu: Fix NULL pointer dereference during probe
drivers/perf: arm_pmu: Fix leak in error path
This page took 0.074184 seconds and 5 git commands to generate.