Ext4 Filesystem
===============

This is a development version of the ext4 filesystem, a successor to
the ext3 filesystem which incorporates scalability and reliability
enhancements to support large (64-bit) filesystems, in keeping with
increasing disk capacities and state-of-the-art feature requirements.

Mailing list: linux-ext4@vger.kernel.org


1. Quick usage instructions:
============================

  - Compile and install the latest version of e2fsprogs (as of this
    writing, version 1.41) from:

    http://sourceforge.net/project/showfiles.php?group_id=2406

    or

    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/

    or grab the latest git repository from:

    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

  - Create a new filesystem using the ext4dev filesystem type:

	# mke2fs -t ext4dev /dev/hda1

    Or configure an existing ext3 filesystem to support extents and set
    the test_fs flag to indicate that it's ok for an in-development
    filesystem to touch this filesystem:

	# tune2fs -O extents -E test_fs /dev/hda1

    If the filesystem was created with 128-byte inodes, it can be
    converted to use 256-byte inodes for greater efficiency via:

	# tune2fs -I 256 /dev/hda1

    (Note: we currently do not have tools to convert an ext4dev
    filesystem back to ext3, so please do not try this on production
    filesystems.)

  - Mounting:

	# mount -t ext4dev /dev/hda1 /wherever

  - When comparing performance with other filesystems, remember that
    ext3/4 by default offers higher data integrity guarantees than most.
    So when comparing with a metadata-only journalling filesystem, such
    as XFS, JFS, or ReiserFS, use `mount -o data=writeback'.  And you
    might as well use `mount -o nobh' too along with it.  Making the
    journal larger than the mke2fs default often helps performance with
    metadata-intensive workloads.
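
As a rough sketch of the benchmarking advice above, the script below prints
(rather than runs) the commands for a test filesystem with a larger journal,
mounted in writeback mode.  The device and mount point are the ones used
earlier in this section.  The 128 MB journal size is an arbitrary value chosen
for illustration, and mke2fs may reject sizes outside the range it supports
for a given block size.

```shell
#!/bin/sh
# Dry run: the commands are only printed, because running them needs
# root and destroys any data on the device.
DEV=/dev/hda1		# device from the examples above
MNT=/wherever		# mount point from the examples above
JOURNAL_MB=128		# hypothetical journal size, in megabytes

bench_commands() {
	# mke2fs -J size= requests a journal of the given size in megabytes.
	echo "mke2fs -t ext4dev -J size=$JOURNAL_MB $DEV"
	# Metadata-only journaling, without buffer heads on data pages.
	echo "mount -t ext4dev -o data=writeback,nobh $DEV $MNT"
}

bench_commands
```

Once the printed commands have been reviewed, they can be run by hand as root.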

2. Features
===========

2.1 Currently available

* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magic
  numbers and internal redundancy in the tree
* improved file allocation (multi-block alloc, delayed alloc)
* fix 32000 subdirectory limit
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
* reduced e2fsck time via uninit_bg feature
* journal checksumming for robustness, performance
* persistent file preallocation (e.g. for streaming media, databases)
* ability to pack bitmaps and inode tables into larger virtual groups via the
  flex_bg feature
* large file support
* inode allocation using large virtual block groups via flex_bg

2.2 Candidate features for future inclusion

* Online defrag (patches available but not well tested)
* reduced mke2fs time via lazy itable initialization in conjunction with
  the uninit_bg feature (capability to do this is available in e2fsprogs
  but a kernel thread to do lazy zeroing of unused inode table blocks
  after filesystem is first mounted is required for safety)

Several others are under discussion; whether they all make it in is partly
a function of how much time everyone has to work on them.  Features like
metadata checksumming have been discussed and planned for a while, but no
patches exist yet, so they may not be on the near-term roadmap.

The big performance win will come with mballoc, delalloc and flex_bg
grouping of bitmaps and inode tables.  Some test results are available here:

 - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
 - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html

3. Options
==========

When mounting an ext4 filesystem, the following options are accepted:
(*) == default

extents		(*)	ext4 will use extents to address file data.  The
			file system will no longer be mountable by ext3.

noextents		ext4 will not use extents for newly created files

journal_checksum	Enable checksumming of the journal transactions.
			This will allow the recovery code in e2fsck and the
			kernel to detect corruption in the journal.  It is a
			compatible change and will be ignored by older kernels.

journal_async_commit	Commit block can be written to disk without waiting
			for descriptor blocks.  If enabled, older kernels
			cannot mount the device.  This will enable
			'journal_checksum' internally.

journal=update		Update the ext4 file system's journal to the current
			format.

journal=inum		When a journal already exists, this option is ignored.
			Otherwise, it specifies the number of the inode which
			will represent the ext4 file system's journal file.

journal_dev=devnum	When the external journal device's major/minor numbers
			have changed, this option allows the user to specify
			the new journal location.  The journal device is
			identified through its new major/minor numbers encoded
			in devnum.

noload			Don't load the journal on mounting.

data=journal		All data are committed into the journal prior to being
			written into the main file system.

data=ordered	(*)	All data are forced directly out to the main file
			system prior to its metadata being committed to the
			journal.

data=writeback		Data ordering is not preserved, data may be written
			into the main file system after its metadata has been
			committed to the journal.

commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
			every 'nrsec' seconds. The default value is 5 seconds.
			This means that if you lose your power, you will lose
			as much as the latest 5 seconds of work (your
			filesystem will not be damaged though, thanks to the
			journaling).  This default value (or any low value)
			will hurt performance, but it's good for data-safety.
			Setting it to 0 will have the same effect as leaving
			it at the default (5 seconds).
			Setting it to very large values will improve
			performance.

barrier=<0|1(*)>	This enables/disables the use of write barriers in
			the jbd code.  barrier=0 disables, barrier=1 enables.
			This also requires an IO stack which can support
			barriers, and if jbd gets an error on a barrier
			write, it will disable barriers again with a warning.
			Write barriers enforce proper on-disk ordering
			of journal commits, making volatile disk write caches
			safe to use, at some performance penalty.  If
			your disks are battery-backed in one way or another,
			disabling barriers may safely improve performance.

orlov		(*)	This enables the new Orlov block allocator. It is
			enabled by default.

oldalloc		This disables the Orlov block allocator and enables
			the old block allocator.  Orlov should have better
			performance - we'd like to get some feedback if it's
			the contrary for you.

user_xattr		Enables Extended User Attributes.  Additionally, you
			need to have extended attribute support enabled in the
			kernel configuration (CONFIG_EXT4_FS_XATTR).  See the
			attr(5) manual page and http://acl.bestbits.at/ to
			learn more about extended attributes.

nouser_xattr		Disables Extended User Attributes.

acl			Enables POSIX Access Control Lists support.
			Additionally, you need to have ACL support enabled in
			the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
			See the acl(5) manual page and http://acl.bestbits.at/
			for more information.

noacl			This option disables POSIX Access Control List
			support.

reservation

noreservation

bsddf		(*)	Make 'df' act like BSD.
minixdf			Make 'df' act like Minix.

check=none		Don't do extra checking of bitmaps on mount.
nocheck

debug			Extra debugging information is sent to syslog.

errors=remount-ro(*)	Remount the filesystem read-only on an error.
errors=continue		Keep going on a filesystem error.
errors=panic		Panic and halt the machine if an error occurs.

grpid			New objects get the group ID of their directory.
bsdgroups

nogrpid		(*)	New objects have the group ID of their creator.
sysvgroups

resgid=n		The group ID which may use the reserved blocks.

resuid=n		The user ID which may use the reserved blocks.

sb=n			Use alternate superblock at this location.

quota
noquota
grpquota
usrquota

bh		(*)	ext4 associates buffer heads to data pages to
nobh			(a) cache disk block mapping information
			(b) link pages into transaction to provide
			    ordering guarantees.
			"bh" option forces use of buffer heads.
			"nobh" option tries to avoid associating buffer
			heads (supported only for "writeback" mode).

mballoc		(*)	Use the multiple block allocator for block allocation.
nomballoc		Disable the multiple block allocator.
stripe=n		Number of filesystem blocks that mballoc will try
			to use for allocation size and alignment. For RAID5/6
			systems this should be the number of data
			disks * RAID chunk size in file system blocks.
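
Two of the options above take numbers that have to be worked out by hand:
journal_dev and stripe.  The shell sketch below shows both calculations.  The
function names are made up for this example.  The devnum layout is an
assumption: it follows the kernel's usual new-style device number encoding and
should be checked against the kernel source for the version in use.

```shell
#!/bin/sh
# Illustrative helpers only; neither function is part of e2fsprogs.

# journal_devnum MAJOR MINOR
# Assumed encoding (the kernel's new-style device number layout):
#   low 8 bits of minor | major << 8 | remaining minor bits << 12
journal_devnum() {
	echo $(( ($2 & 255) | ($1 << 8) | (($2 & ~255) << 12) ))
}

# stripe_blocks DATA_DISKS CHUNK_KIB BLOCK_BYTES
# stripe = number of data disks * RAID chunk size in filesystem blocks
stripe_blocks() {
	chunk_blocks=$(( $2 * 1024 / $3 ))
	echo $(( $1 * $chunk_blocks ))
}

# External journal now at major 8, minor 17.
echo "journal_dev=$(journal_devnum 8 17)"

# 4-disk RAID5 (3 data disks), 64 KiB chunks, 4096-byte blocks.
echo "stripe=$(stripe_blocks 3 64 4096)"
```

With these inputs the script prints journal_dev=2065 and stripe=48, which can
then be passed to `mount -o'.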

Data Mode
=========
There are 3 different data modes:

* writeback mode
In data=writeback mode, ext4 does not journal data at all.  This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling.  A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash.  This mode will
typically provide the best ext4 performance.

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction.  When
it's time to write the new metadata out to disk, the associated data blocks
are written first.  In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.

* journal mode
data=journal mode provides full data and metadata journaling.  All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state.  This mode is the slowest, except when data
needs to be read from and written to disk at the same time, where it
outperforms all other modes.
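
To make the default concrete: a mount option string that names no data=
option selects ordered mode.  The helper below is a hypothetical illustration,
not an existing tool; it reports which of the three modes a comma-separated
option string selects.

```shell
#!/bin/sh
# data_mode OPTIONS
# Print the data mode selected by a comma-separated mount option string.
data_mode() {
	mode=ordered		# data=ordered is the default
	for opt in $(echo "$1" | tr ',' ' '); do
		case $opt in
		data=journal|data=ordered|data=writeback)
			mode=${opt#data=} ;;
		esac
	done
	echo "$mode"
}

data_mode "rw,data=writeback,nobh"	# prints "writeback"
data_mode "rw,noatime"			# prints "ordered"
```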

References
==========

kernel source:	<file:fs/ext4/>
		<file:fs/jbd2/>

programs:	http://e2fsprogs.sourceforge.net/

useful links:	http://fedoraproject.org/wiki/ext3-devel
		http://www.bullopensource.org/ext4/
		http://ext4.wiki.kernel.org/index.php/Main_Page
		http://fedoraproject.org/wiki/Features/Ext4