diff --git a/en_US.ISO8859-1/books/handbook/Makefile b/en_US.ISO8859-1/books/handbook/Makefile index 84ff7e115f..748be3b437 100644 --- a/en_US.ISO8859-1/books/handbook/Makefile +++ b/en_US.ISO8859-1/books/handbook/Makefile @@ -245,6 +245,7 @@ SRCS+= desktop/chapter.xml SRCS+= disks/chapter.xml SRCS+= eresources/chapter.xml SRCS+= firewalls/chapter.xml +SRCS+= zfs/chapter.xml SRCS+= filesystems/chapter.xml SRCS+= geom/chapter.xml SRCS+= install/chapter.xml diff --git a/en_US.ISO8859-1/books/handbook/book.xml b/en_US.ISO8859-1/books/handbook/book.xml index 122db97a89..1b26ce9a86 100644 --- a/en_US.ISO8859-1/books/handbook/book.xml +++ b/en_US.ISO8859-1/books/handbook/book.xml @@ -237,6 +237,7 @@ &chap.audit; &chap.disks; &chap.geom; + &chap.zfs; &chap.filesystems; &chap.virtualization; &chap.l10n; diff --git a/en_US.ISO8859-1/books/handbook/bsdinstall/chapter.xml b/en_US.ISO8859-1/books/handbook/bsdinstall/chapter.xml index 36f6e87702..a1289fab0e 100644 --- a/en_US.ISO8859-1/books/handbook/bsdinstall/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/bsdinstall/chapter.xml @@ -1445,7 +1445,7 @@ Ethernet address 0:3:ba:b:92:d4, Host ID: 830b92d4. Another partition type worth noting is freebsd-zfs, used for partitions that will contain a &os; ZFS file system (). Refer to &man.gpart.8; for + linkend="zfs"/>). Refer to &man.gpart.8; for descriptions of the available GPT partition types. diff --git a/en_US.ISO8859-1/books/handbook/chapters.ent b/en_US.ISO8859-1/books/handbook/chapters.ent index 2c6fb96371..17a7abc412 100644 --- a/en_US.ISO8859-1/books/handbook/chapters.ent +++ b/en_US.ISO8859-1/books/handbook/chapters.ent @@ -37,6 +37,7 @@ + diff --git a/en_US.ISO8859-1/books/handbook/disks/chapter.xml b/en_US.ISO8859-1/books/handbook/disks/chapter.xml index b5e775a54e..3d3441f73f 100644 --- a/en_US.ISO8859-1/books/handbook/disks/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/disks/chapter.xml @@ -2160,7 +2160,7 @@ Filesystem 1K-blocks Used Avail Capacity Mounted on This section describes how to configure disk quotas for the UFS file system. To configure quotas on the ZFS file system, refer to + linkend="zfs-zfs-quota"/> Enabling Disk Quotas diff --git a/en_US.ISO8859-1/books/handbook/filesystems/chapter.xml b/en_US.ISO8859-1/books/handbook/filesystems/chapter.xml index a8d5e11682..cfba538bc8 100644 --- a/en_US.ISO8859-1/books/handbook/filesystems/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/filesystems/chapter.xml @@ -5,7 +5,7 @@ --> - File Systems Support + Other File Systems TomRhodesWritten @@ -29,8 +29,8 @@ native &os; file system has been the Unix File System UFS which has been modernized as UFS2. Since &os; 7.0, the Z File - System ZFS is also available as a native file - system. + System (ZFS) is also available as a native file + system. See for more information. In addition to its native file systems, &os; supports a multitude of other file systems so that data from other @@ -91,642 +91,6 @@ - - The Z File System (ZFS) - - The Z file system, originally developed by &sun;, - is designed to use a pooled storage method in that space is only - used as it is needed for data storage. It is also designed for - maximum data integrity, supporting data snapshots, multiple - copies, and data checksums. It uses a software data replication - model, known as RAID-Z. - RAID-Z provides redundancy similar to - hardware RAID, but is designed to prevent - data write corruption and to overcome some of the limitations - of hardware RAID. 
- - - ZFS Tuning - - Some of the features provided by ZFS - are RAM-intensive, so some tuning may be required to provide - maximum efficiency on systems with limited RAM. - - - Memory - - At a bare minimum, the total system memory should be at - least one gigabyte. The amount of recommended RAM depends - upon the size of the pool and the ZFS features which are - used. A general rule of thumb is 1GB of RAM for every 1TB - of storage. If the deduplication feature is used, a general - rule of thumb is 5GB of RAM per TB of storage to be - deduplicated. While some users successfully use ZFS with - less RAM, it is possible that when the system is under heavy - load, it may panic due to memory exhaustion. Further tuning - may be required for systems with less than the recommended - RAM requirements. - - - - Kernel Configuration - - Due to the RAM limitations of the &i386; platform, users - using ZFS on the &i386; architecture should add the - following option to a custom kernel configuration file, - rebuild the kernel, and reboot: - - options KVA_PAGES=512 - - This option expands the kernel address space, allowing - the vm.kvm_size tunable to be pushed - beyond the currently imposed limit of 1 GB, or the - limit of 2 GB for PAE. To find the - most suitable value for this option, divide the desired - address space in megabytes by four (4). In this example, it - is 512 for 2 GB. - - - - Loader Tunables - - The kmem address space can - be increased on all &os; architectures. On a test system - with one gigabyte of physical memory, success was achieved - with the following options added to - /boot/loader.conf, and the system - restarted: - - vm.kmem_size="330M" -vm.kmem_size_max="330M" -vfs.zfs.arc_max="40M" -vfs.zfs.vdev.cache.size="5M" - - For a more detailed list of recommendations for - ZFS-related tuning, see http://wiki.freebsd.org/ZFSTuningGuide. - - - - - Using <acronym>ZFS</acronym> - - There is a start up mechanism that allows &os; to mount - ZFS pools during system initialization. To - set it, issue the following commands: - - &prompt.root; echo 'zfs_enable="YES"' >> /etc/rc.conf -&prompt.root; service zfs start - - The examples in this section assume three - SCSI disks with the device names - da0, - da1, - and da2. - Users of IDE hardware should instead use - ad - device names. - - - Single Disk Pool - - To create a simple, non-redundant ZFS - pool using a single disk device, use - zpool: - - &prompt.root; zpool create example /dev/da0 - - To view the new pool, review the output of - df: - - &prompt.root; df -Filesystem 1K-blocks Used Avail Capacity Mounted on -/dev/ad0s1a 2026030 235230 1628718 13% / -devfs 1 1 0 100% /dev -/dev/ad0s1d 54098308 1032846 48737598 2% /usr -example 17547136 0 17547136 0% /example - - This output shows that the example - pool has been created and mounted. It - is now accessible as a file system. Files may be created - on it and users can browse it, as seen in the following - example: - - &prompt.root; cd /example -&prompt.root; ls -&prompt.root; touch testfile -&prompt.root; ls -al -total 4 -drwxr-xr-x 2 root wheel 3 Aug 29 23:15 . -drwxr-xr-x 21 root wheel 512 Aug 29 23:12 .. --rw-r--r-- 1 root wheel 0 Aug 29 23:15 testfile - - However, this pool is not taking advantage of any - ZFS features. To create a dataset on - this pool with compression enabled: - - &prompt.root; zfs create example/compressed -&prompt.root; zfs set compression=gzip example/compressed - - The example/compressed dataset is now - a ZFS compressed file system. 
Try - copying some large files to - /example/compressed. - - Compression can be disabled with: - - &prompt.root; zfs set compression=off example/compressed - - To unmount a file system, issue the following command - and then verify by using df: - - &prompt.root; zfs umount example/compressed -&prompt.root; df -Filesystem 1K-blocks Used Avail Capacity Mounted on -/dev/ad0s1a 2026030 235232 1628716 13% / -devfs 1 1 0 100% /dev -/dev/ad0s1d 54098308 1032864 48737580 2% /usr -example 17547008 0 17547008 0% /example - - To re-mount the file system to make it accessible - again, and verify with df: - - &prompt.root; zfs mount example/compressed -&prompt.root; df -Filesystem 1K-blocks Used Avail Capacity Mounted on -/dev/ad0s1a 2026030 235234 1628714 13% / -devfs 1 1 0 100% /dev -/dev/ad0s1d 54098308 1032864 48737580 2% /usr -example 17547008 0 17547008 0% /example -example/compressed 17547008 0 17547008 0% /example/compressed - - The pool and file system may also be observed by viewing - the output from mount: - - &prompt.root; mount -/dev/ad0s1a on / (ufs, local) -devfs on /dev (devfs, local) -/dev/ad0s1d on /usr (ufs, local, soft-updates) -example on /example (zfs, local) -example/data on /example/data (zfs, local) -example/compressed on /example/compressed (zfs, local) - - ZFS datasets, after creation, may be - used like any file systems. However, many other features - are available which can be set on a per-dataset basis. In - the following example, a new file system, - data is created. Important files will be - stored here, the file system is set to keep two copies of - each data block: - - &prompt.root; zfs create example/data -&prompt.root; zfs set copies=2 example/data - - It is now possible to see the data and space utilization - by issuing df: - - &prompt.root; df -Filesystem 1K-blocks Used Avail Capacity Mounted on -/dev/ad0s1a 2026030 235234 1628714 13% / -devfs 1 1 0 100% /dev -/dev/ad0s1d 54098308 1032864 48737580 2% /usr -example 17547008 0 17547008 0% /example -example/compressed 17547008 0 17547008 0% /example/compressed -example/data 17547008 0 17547008 0% /example/data - - Notice that each file system on the pool has the same - amount of available space. This is the reason for using - df in these examples, to show that the - file systems use only the amount of space they need and all - draw from the same pool. The ZFS file - system does away with concepts such as volumes and - partitions, and allows for several file systems to occupy - the same pool. - - To destroy the file systems and then destroy the pool as - they are no longer needed: - - &prompt.root; zfs destroy example/compressed -&prompt.root; zfs destroy example/data -&prompt.root; zpool destroy example - - - - - <acronym>ZFS</acronym> RAID-Z - - There is no way to prevent a disk from failing. One - method of avoiding data loss due to a failed hard disk is to - implement RAID. ZFS - supports this feature in its pool design. - - To create a RAID-Z pool, issue the - following command and specify the disks to add to the - pool: - - &prompt.root; zpool create storage raidz da0 da1 da2 - - - &sun; recommends that the amount of devices used in - a RAID-Z configuration is between - three and nine. For environments requiring a single pool - consisting of 10 disks or more, consider breaking it up - into smaller RAID-Z groups. If only - two disks are available and redundancy is a requirement, - consider using a ZFS mirror. Refer to - &man.zpool.8; for more details. - - - This command creates the storage - zpool. 
This may be verified using &man.mount.8; and - &man.df.1;. This command makes a new file system in the - pool called home: - - &prompt.root; zfs create storage/home - - It is now possible to enable compression and keep extra - copies of directories and files using the following - commands: - - &prompt.root; zfs set copies=2 storage/home -&prompt.root; zfs set compression=gzip storage/home - - To make this the new home directory for users, copy the - user data to this directory, and create the appropriate - symbolic links: - - &prompt.root; cp -rp /home/* /storage/home -&prompt.root; rm -rf /home /usr/home -&prompt.root; ln -s /storage/home /home -&prompt.root; ln -s /storage/home /usr/home - - Users should now have their data stored on the freshly - created /storage/home. Test by - adding a new user and logging in as that user. - - Try creating a snapshot which may be rolled back - later: - - &prompt.root; zfs snapshot storage/home@08-30-08 - - Note that the snapshot option will only capture a real - file system, not a home directory or a file. The - @ character is a delimiter used between - the file system name or the volume name. When a user's - home directory gets trashed, restore it with: - - &prompt.root; zfs rollback storage/home@08-30-08 - - To get a list of all available snapshots, run - ls in the file system's - .zfs/snapshot directory. For example, - to see the previously taken snapshot: - - &prompt.root; ls /storage/home/.zfs/snapshot - - It is possible to write a script to perform regular - snapshots on user data. However, over time, snapshots - may consume a great deal of disk space. The previous - snapshot may be removed using the following command: - - &prompt.root; zfs destroy storage/home@08-30-08 - - After testing, /storage/home can be - made the real /home using this - command: - - &prompt.root; zfs set mountpoint=/home storage/home - - Run df and - mount to confirm that the system now - treats the file system as the real - /home: - - &prompt.root; mount -/dev/ad0s1a on / (ufs, local) -devfs on /dev (devfs, local) -/dev/ad0s1d on /usr (ufs, local, soft-updates) -storage on /storage (zfs, local) -storage/home on /home (zfs, local) -&prompt.root; df -Filesystem 1K-blocks Used Avail Capacity Mounted on -/dev/ad0s1a 2026030 235240 1628708 13% / -devfs 1 1 0 100% /dev -/dev/ad0s1d 54098308 1032826 48737618 2% /usr -storage 26320512 0 26320512 0% /storage -storage/home 26320512 0 26320512 0% /home - - This completes the RAID-Z - configuration. To get status updates about the file systems - created during the nightly &man.periodic.8; runs, issue the - following command: - - &prompt.root; echo 'daily_status_zfs_enable="YES"' >> /etc/periodic.conf - - - - Recovering <acronym>RAID</acronym>-Z - - Every software RAID has a method of - monitoring its state. The status of - RAID-Z devices may be viewed with the - following command: - - &prompt.root; zpool status -x - - If all pools are healthy and everything is normal, the - following message will be returned: - - all pools are healthy - - If there is an issue, perhaps a disk has gone offline, - the pool state will look similar to: - - pool: storage - state: DEGRADED -status: One or more devices has been taken offline by the administrator. - Sufficient replicas exist for the pool to continue functioning in a - degraded state. -action: Online the device using 'zpool online' or replace the device with - 'zpool replace'. 
- scrub: none requested -config: - - NAME STATE READ WRITE CKSUM - storage DEGRADED 0 0 0 - raidz1 DEGRADED 0 0 0 - da0 ONLINE 0 0 0 - da1 OFFLINE 0 0 0 - da2 ONLINE 0 0 0 - -errors: No known data errors - - This indicates that the device was previously taken - offline by the administrator using the following - command: - - &prompt.root; zpool offline storage da1 - - It is now possible to replace - da1 after the system has been - powered down. When the system is back online, the following - command may issued to replace the disk: - - &prompt.root; zpool replace storage da1 - - From here, the status may be checked again, this time - without the flag to get state - information: - - &prompt.root; zpool status storage - pool: storage - state: ONLINE - scrub: resilver completed with 0 errors on Sat Aug 30 19:44:11 2008 -config: - - NAME STATE READ WRITE CKSUM - storage ONLINE 0 0 0 - raidz1 ONLINE 0 0 0 - da0 ONLINE 0 0 0 - da1 ONLINE 0 0 0 - da2 ONLINE 0 0 0 - -errors: No known data errors - - As shown from this example, everything appears to be - normal. - - - - Data Verification - - ZFS uses checksums to verify the - integrity of stored data. These are enabled automatically - upon creation of file systems and may be disabled using the - following command: - - &prompt.root; zfs set checksum=off storage/home - - Doing so is not recommended as - checksums take very little storage space and are used to - check data integrity using checksum verification in a - process is known as scrubbing. To verify the - data integrity of the storage pool, issue - this command: - - &prompt.root; zpool scrub storage - - This process may take considerable time depending on - the amount of data stored. It is also very - I/O intensive, so much so that only one - scrub may be run at any given time. After the scrub has - completed, the status is updated and may be viewed by - issuing a status request: - - &prompt.root; zpool status storage - pool: storage - state: ONLINE - scrub: scrub completed with 0 errors on Sat Jan 26 19:57:37 2013 -config: - - NAME STATE READ WRITE CKSUM - storage ONLINE 0 0 0 - raidz1 ONLINE 0 0 0 - da0 ONLINE 0 0 0 - da1 ONLINE 0 0 0 - da2 ONLINE 0 0 0 - -errors: No known data errors - - The completion time is displayed and helps to ensure - data integrity over a long period of time. - - Refer to &man.zfs.8; and &man.zpool.8; for other - ZFS options. - - - - ZFS Quotas - - ZFS supports different types of quotas: the refquota, - the general quota, the user quota, and the group quota. - This section explains the basics of each type and includes - some usage instructions. - - Quotas limit the amount of space that a dataset and its - descendants can consume, and enforce a limit on the amount - of space used by file systems and snapshots for the - descendants. Quotas are useful to limit the amount of space - a particular user can use. - - - Quotas cannot be set on volumes, as the - volsize property acts as an implicit - quota. - - - The - refquota=size - limits the amount of space a dataset can consume by - enforcing a hard limit on the space used. However, this - hard limit does not include space used by descendants, such - as file systems or snapshots. - - To enforce a general quota of 10 GB for - storage/home/bob, use the - following: - - &prompt.root; zfs set quota=10G storage/home/bob - - User quotas limit the amount of space that can be used - by the specified user. 
The general format is - userquota@user=size, - and the user's name must be in one of the following - formats: - - - - POSIX compatible name such as - joe. - - - - POSIX numeric ID such as - 789. - - - - SID name - such as - joe.bloggs@example.com. - - - - SID - numeric ID such as - S-1-123-456-789. - - - - For example, to enforce a quota of 50 GB for a user - named joe, use the - following: - - &prompt.root; zfs set userquota@joe=50G - - To remove the quota or make sure that one is not set, - instead use: - - &prompt.root; zfs set userquota@joe=none - - User quota properties are not displayed by - zfs get all. - Non-root users can - only see their own quotas unless they have been granted the - userquota privilege. Users with this - privilege are able to view and set everyone's quota. - - The group quota limits the amount of space that a - specified group can consume. The general format is - groupquota@group=size. - - To set the quota for the group - firstgroup to 50 GB, - use: - - &prompt.root; zfs set groupquota@firstgroup=50G - - To remove the quota for the group - firstgroup, or to make sure that - one is not set, instead use: - - &prompt.root; zfs set groupquota@firstgroup=none - - As with the user quota property, - non-root users can - only see the quotas associated with the groups that they - belong to. However, root or a user with the - groupquota privilege can view and set all - quotas for all groups. - - To display the amount of space consumed by each user on - the specified file system or snapshot, along with any - specified quotas, use zfs userspace. - For group information, use zfs - groupspace. For more information about - supported options or how to display only specific options, - refer to &man.zfs.1;. - - Users with sufficient privileges and root can list the quota for - storage/home/bob using: - - &prompt.root; zfs get quota storage/home/bob - - - - ZFS Reservations - - ZFS supports two types of space reservations. This - section explains the basics of each and includes some usage - instructions. - - The reservation property makes it - possible to reserve a minimum amount of space guaranteed - for a dataset and its descendants. This means that if a - 10 GB reservation is set on - storage/home/bob, if disk - space gets low, at least 10 GB of space is reserved - for this dataset. The refreservation - property sets or indicates the minimum amount of space - guaranteed to a dataset excluding descendants, such as - snapshots. As an example, if a snapshot was taken of - storage/home/bob, enough disk space - would have to exist outside of the - refreservation amount for the operation - to succeed because descendants of the main data set are - not counted by the refreservation - amount and so do not encroach on the space set. - - Reservations of any sort are useful in many situations, - such as planning and testing the suitability of disk space - allocation in a new system, or ensuring that enough space is - available on file systems for system recovery procedures and - files. - - The general format of the reservation - property is - reservation=size, - so to set a reservation of 10 GB on - storage/home/bob, use: - - &prompt.root; zfs set reservation=10G storage/home/bob - - To make sure that no reservation is set, or to remove a - reservation, use: - - &prompt.root; zfs set reservation=none storage/home/bob - - The same principle can be applied to the - refreservation property for setting a - refreservation, with the general format - refreservation=size. 
- - To check if any reservations or refreservations exist on - storage/home/bob, execute one of the - following commands: - - &prompt.root; zfs get reservation storage/home/bob -&prompt.root; zfs get refreservation storage/home/bob - - - - &linux; File Systems diff --git a/en_US.ISO8859-1/books/handbook/zfs/chapter.xml b/en_US.ISO8859-1/books/handbook/zfs/chapter.xml new file mode 100644 index 0000000000..0c3013c206 --- /dev/null +++ b/en_US.ISO8859-1/books/handbook/zfs/chapter.xml @@ -0,0 +1,4332 @@ + + + + + + + The Z File System (<acronym>ZFS</acronym>) + + + + + Tom + Rhodes + + Written by + + + + Allan + Jude + + Written by + + + + Benedict + Reuschling + + Written by + + + + Warren + Block + + Written by + + + + + The Z File System, or + ZFS, is an advanced file system designed to + overcome many of the major problems found in previous + designs. + + Originally developed at &sun;, ongoing open source + ZFS development has moved to the OpenZFS Project. + + ZFS has three major design goals: + + + + Data integrity: All data includes a + checksum of the data. + When data is written, the checksum is calculated and written + along with it. When that data is later read back, the + checksum is calculated again. If the checksums do not match, + a data error has been detected. ZFS will + attempt to automatically correct errors when data redundancy + is available. + + + + Pooled storage: physical storage devices are added to a + pool, and storage space is allocated from that shared pool. + Space is available to all file systems, and can be increased + by adding new storage devices to the pool. + + + + Performance: multiple caching mechanisms provide increased + performance. ARC is an + advanced memory-based read cache. A second level of + disk-based read cache can be added with + L2ARC, and disk-based + synchronous write cache is available with + ZIL. + + + + A complete list of features and terminology is shown in + . + + + What Makes <acronym>ZFS</acronym> Different + + ZFS is significantly different from any + previous file system because it is more than just a file system. + Combining the traditionally separate roles of volume manager and + file system provides ZFS with unique + advantages. The file system is now aware of the underlying + structure of the disks. Traditional file systems could only be + created on a single disk at a time. If there were two disks + then two separate file systems would have to be created. In a + traditional hardware RAID configuration, this + problem was avoided by presenting the operating system with a + single logical disk made up of the space provided by a number of + physical disks, on top of which the operating system placed a + file system. Even in the case of software + RAID solutions like those provided by + GEOM, the UFS file system + living on top of the RAID transform believed + that it was dealing with a single device. + ZFS's combination of the volume manager and + the file system solves this and allows the creation of many file + systems all sharing a pool of available storage. One of the + biggest advantages to ZFS's awareness of the + physical layout of the disks is that existing file systems can + be grown automatically when additional disks are added to the + pool. This new space is then made available to all of the file + systems. 
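For example, when another vdev is added to an existing pool, every dataset in that pool sees the additional space at once. This is only a sketch, assuming a pool named mypool and two spare disks, ada3 and ada4, chosen purely for illustration:

&prompt.root; zpool add mypool mirror /dev/ada3 /dev/ada4
&prompt.root; zfs list mypool

The AVAIL column reported by zfs list grows as soon as the new vdev has been added, without any further configuration of the individual datasets.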
ZFS also has a number of different + properties that can be applied to each file system, giving many + advantages to creating a number of different file systems and + datasets rather than a single monolithic file system. + + + + Quick Start Guide + + There is a startup mechanism that allows &os; to mount + ZFS pools during system initialization. To + enable it, add this line to + /etc/rc.conf: + + zfs_enable="YES" + + Then start the service: + + &prompt.root; service zfs start + + The examples in this section assume three + SCSI disks with the device names + da0, + da1, and + da2. Users + of SATA hardware should instead use + ada device + names. + + + Single Disk Pool + + To create a simple, non-redundant pool using a single + disk device: + + &prompt.root; zpool create example /dev/da0 + + To view the new pool, review the output of + df: + + &prompt.root; df +Filesystem 1K-blocks Used Avail Capacity Mounted on +/dev/ad0s1a 2026030 235230 1628718 13% / +devfs 1 1 0 100% /dev +/dev/ad0s1d 54098308 1032846 48737598 2% /usr +example 17547136 0 17547136 0% /example + + This output shows that the example pool + has been created and mounted. It is now accessible as a file + system. Files can be created on it and users can browse + it: + + &prompt.root; cd /example +&prompt.root; ls +&prompt.root; touch testfile +&prompt.root; ls -al +total 4 +drwxr-xr-x 2 root wheel 3 Aug 29 23:15 . +drwxr-xr-x 21 root wheel 512 Aug 29 23:12 .. +-rw-r--r-- 1 root wheel 0 Aug 29 23:15 testfile + + However, this pool is not taking advantage of any + ZFS features. To create a dataset on this + pool with compression enabled: + + &prompt.root; zfs create example/compressed +&prompt.root; zfs set compression=gzip example/compressed + + The example/compressed dataset is now a + ZFS compressed file system. Try copying + some large files to + /example/compressed. + + Compression can be disabled with: + + &prompt.root; zfs set compression=off example/compressed + + To unmount a file system, use + zfs umount and then verify with + df: + + &prompt.root; zfs umount example/compressed +&prompt.root; df +Filesystem 1K-blocks Used Avail Capacity Mounted on +/dev/ad0s1a 2026030 235232 1628716 13% / +devfs 1 1 0 100% /dev +/dev/ad0s1d 54098308 1032864 48737580 2% /usr +example 17547008 0 17547008 0% /example + + To re-mount the file system to make it accessible again, + use zfs mount and verify with + df: + + &prompt.root; zfs mount example/compressed +&prompt.root; df +Filesystem 1K-blocks Used Avail Capacity Mounted on +/dev/ad0s1a 2026030 235234 1628714 13% / +devfs 1 1 0 100% /dev +/dev/ad0s1d 54098308 1032864 48737580 2% /usr +example 17547008 0 17547008 0% /example +example/compressed 17547008 0 17547008 0% /example/compressed + + The pool and file system may also be observed by viewing + the output from mount: + + &prompt.root; mount +/dev/ad0s1a on / (ufs, local) +devfs on /dev (devfs, local) +/dev/ad0s1d on /usr (ufs, local, soft-updates) +example on /example (zfs, local) +example/data on /example/data (zfs, local) +example/compressed on /example/compressed (zfs, local) + + After creation, ZFS datasets can be + used like any file systems. However, many other features are + available which can be set on a per-dataset basis. In the + example below, a new file system called + data is created. 
Important files will be + stored here, so it is configured to keep two copies of each + data block: + + &prompt.root; zfs create example/data +&prompt.root; zfs set copies=2 example/data + + It is now possible to see the data and space utilization + by issuing df: + + &prompt.root; df +Filesystem 1K-blocks Used Avail Capacity Mounted on +/dev/ad0s1a 2026030 235234 1628714 13% / +devfs 1 1 0 100% /dev +/dev/ad0s1d 54098308 1032864 48737580 2% /usr +example 17547008 0 17547008 0% /example +example/compressed 17547008 0 17547008 0% /example/compressed +example/data 17547008 0 17547008 0% /example/data + + Notice that each file system on the pool has the same + amount of available space. This is the reason for using + df in these examples, to show that the file + systems use only the amount of space they need and all draw + from the same pool. ZFS eliminates + concepts such as volumes and partitions, and allows multiple + file systems to occupy the same pool. + + To destroy the file systems and then destroy the pool as + it is no longer needed: + + &prompt.root; zfs destroy example/compressed +&prompt.root; zfs destroy example/data +&prompt.root; zpool destroy example + + + + RAID-Z + + Disks fail. One method of avoiding data loss from disk + failure is to implement RAID. + ZFS supports this feature in its pool + design. RAID-Z pools require three or more + disks but provide more usable space than mirrored + pools. + + This example creates a RAID-Z pool, + specifying the disks to add to the pool: + + &prompt.root; zpool create storage raidz da0 da1 da2 + + + &sun; recommends that the number of devices used in a + RAID-Z configuration be between three and + nine. For environments requiring a single pool consisting + of 10 disks or more, consider breaking it up into smaller + RAID-Z groups. If only two disks are + available and redundancy is a requirement, consider using a + ZFS mirror. Refer to &man.zpool.8; for + more details. + + + The previous example created the + storage zpool. This example makes a new + file system called home in that + pool: + + &prompt.root; zfs create storage/home + + Compression and keeping extra copies of directories + and files can be enabled: + + &prompt.root; zfs set copies=2 storage/home +&prompt.root; zfs set compression=gzip storage/home + + To make this the new home directory for users, copy the + user data to this directory and create the appropriate + symbolic links: + + &prompt.root; cp -rp /home/* /storage/home +&prompt.root; rm -rf /home /usr/home +&prompt.root; ln -s /storage/home /home +&prompt.root; ln -s /storage/home /usr/home + + Users data is now stored on the freshly-created + /storage/home. Test by adding a new user + and logging in as that user. + + Try creating a file system snapshot which can be rolled + back later: + + &prompt.root; zfs snapshot storage/home@08-30-08 + + Snapshots can only be made of a full file system, not a + single directory or file. + + The @ character is a delimiter between + the file system name or the volume name. If an important + directory has been accidentally deleted, the file system can + be backed up, then rolled back to an earlier snapshot when the + directory still existed: + + &prompt.root; zfs rollback storage/home@08-30-08 + + To list all available snapshots, run + ls in the file system's + .zfs/snapshot directory. For example, to + see the previously taken snapshot: + + &prompt.root; ls /storage/home/.zfs/snapshot + + It is possible to write a script to perform regular + snapshots on user data. 
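A minimal sketch of such a script, reusing the storage/home file system from this example, creates a snapshot named after the current date and can be run periodically from cron:

#!/bin/sh
# Hypothetical example: take a dated snapshot of storage/home.
zfs snapshot storage/home@$(date +%Y-%m-%d)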
However, over time, snapshots can + consume a great deal of disk space. The previous snapshot can + be removed using the command: + + &prompt.root; zfs destroy storage/home@08-30-08 + + After testing, /storage/home can be + made the real /home using this + command: + + &prompt.root; zfs set mountpoint=/home storage/home + + Run df and mount to + confirm that the system now treats the file system as the real + /home: + + &prompt.root; mount +/dev/ad0s1a on / (ufs, local) +devfs on /dev (devfs, local) +/dev/ad0s1d on /usr (ufs, local, soft-updates) +storage on /storage (zfs, local) +storage/home on /home (zfs, local) +&prompt.root; df +Filesystem 1K-blocks Used Avail Capacity Mounted on +/dev/ad0s1a 2026030 235240 1628708 13% / +devfs 1 1 0 100% /dev +/dev/ad0s1d 54098308 1032826 48737618 2% /usr +storage 26320512 0 26320512 0% /storage +storage/home 26320512 0 26320512 0% /home + + This completes the RAID-Z + configuration. Daily status updates about the file systems + created can be generated as part of the nightly + &man.periodic.8; runs. Add this line to + /etc/periodic.conf: + + daily_status_zfs_enable="YES" + + + + Recovering <acronym>RAID-Z</acronym> + + Every software RAID has a method of + monitoring its state. The status of + RAID-Z devices may be viewed with this + command: + + &prompt.root; zpool status -x + + If all pools are + Online and everything + is normal, the message shows: + + all pools are healthy + + If there is an issue, perhaps a disk is in the + Offline state, the + pool state will look similar to: + + pool: storage + state: DEGRADED +status: One or more devices has been taken offline by the administrator. + Sufficient replicas exist for the pool to continue functioning in a + degraded state. +action: Online the device using 'zpool online' or replace the device with + 'zpool replace'. + scrub: none requested +config: + + NAME STATE READ WRITE CKSUM + storage DEGRADED 0 0 0 + raidz1 DEGRADED 0 0 0 + da0 ONLINE 0 0 0 + da1 OFFLINE 0 0 0 + da2 ONLINE 0 0 0 + +errors: No known data errors + + This indicates that the device was previously taken + offline by the administrator with this command: + + &prompt.root; zpool offline storage da1 + + Now the system can be powered down to replace + da1. When the system is back online, + the failed disk can replaced in the pool: + + &prompt.root; zpool replace storage da1 + + From here, the status may be checked again, this time + without so that all pools are + shown: + + &prompt.root; zpool status storage + pool: storage + state: ONLINE + scrub: resilver completed with 0 errors on Sat Aug 30 19:44:11 2008 +config: + + NAME STATE READ WRITE CKSUM + storage ONLINE 0 0 0 + raidz1 ONLINE 0 0 0 + da0 ONLINE 0 0 0 + da1 ONLINE 0 0 0 + da2 ONLINE 0 0 0 + +errors: No known data errors + + In this example, everything is normal. + + + + Data Verification + + ZFS uses checksums to verify the + integrity of stored data. These are enabled automatically + upon creation of file systems. + + + Checksums can be disabled, but it is + not recommended! Checksums take very + little storage space and provide data integrity. Many + ZFS features will not work properly with + checksums disabled. There is no noticeable performance gain + from disabling these checksums. + + + Checksum verification is known as + scrubbing. Verify the data integrity of + the storage pool with this command: + + &prompt.root; zpool scrub storage + + The duration of a scrub depends on the amount of data + stored. 
Larger amounts of data will take proportionally + longer to verify. Scrubs are very I/O + intensive, and only one scrub is allowed to run at a time. + After the scrub completes, the status can be viewed with + status: + + &prompt.root; zpool status storage + pool: storage + state: ONLINE + scrub: scrub completed with 0 errors on Sat Jan 26 19:57:37 2013 +config: + + NAME STATE READ WRITE CKSUM + storage ONLINE 0 0 0 + raidz1 ONLINE 0 0 0 + da0 ONLINE 0 0 0 + da1 ONLINE 0 0 0 + da2 ONLINE 0 0 0 + +errors: No known data errors + + The completion date of the last scrub operation is + displayed to help track when another scrub is required. + Routine scrubs help protect data from silent corruption and + ensure the integrity of the pool. + + Refer to &man.zfs.8; and &man.zpool.8; for other + ZFS options. + + + + + <command>zpool</command> Administration + + ZFS administration is divided between two + main utilities. The zpool utility controls + the operation of the pool and deals with adding, removing, + replacing, and managing disks. The + zfs utility + deals with creating, destroying, and managing datasets, + both file systems and + volumes. + + + Creating and Destroying Storage Pools + + Creating a ZFS storage pool + (zpool) involves making a number of + decisions that are relatively permanent because the structure + of the pool cannot be changed after the pool has been created. + The most important decision is what types of vdevs into which + to group the physical disks. See the list of + vdev types for details + about the possible options. After the pool has been created, + most vdev types do not allow additional disks to be added to + the vdev. The exceptions are mirrors, which allow additional + disks to be added to the vdev, and stripes, which can be + upgraded to mirrors by attaching an additional disk to the + vdev. Although additional vdevs can be added to expand a + pool, the layout of the pool cannot be changed after pool + creation. Instead, the data must be backed up and the + pool destroyed and recreated. + + Create a simple mirror pool: + + &prompt.root; zpool create mypool mirror /dev/ada1 /dev/ada2 +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada1 ONLINE 0 0 0 + ada2 ONLINE 0 0 0 + +errors: No known data errors + + Multiple vdevs can be created at once. Specify multiple + groups of disks separated by the vdev type keyword, + mirror in this example: + + &prompt.root; zpool create mypool mirror /dev/ada1 /dev/ada2 mirror /dev/ada3 /dev/ada4 + pool: mypool + state: ONLINE + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada1 ONLINE 0 0 0 + ada2 ONLINE 0 0 0 + mirror-1 ONLINE 0 0 0 + ada3 ONLINE 0 0 0 + ada4 ONLINE 0 0 0 + +errors: No known data errors + + Pools can also be constructed using partitions rather than + whole disks. Putting ZFS in a separate + partition allows the same disk to have other partitions for + other purposes. In particular, partitions with bootcode and + file systems needed for booting can be added. This allows + booting from disks that are also members of a pool. There is + no performance penalty on &os; when using a partition rather + than a whole disk. Using partitions also allows the + administrator to under-provision the + disks, using less than the full capacity. 
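As a sketch of how such a partition might be prepared (the disk ada6, the label, and the size of 930 GB on a nominal 1 TB disk are examples only), &man.gpart.8; can create a freebsd-zfs partition slightly smaller than the whole disk:

&prompt.root; gpart create -s gpt ada6
&prompt.root; gpart add -t freebsd-zfs -a 1m -s 930g -l zfsdisk6 ada6

The resulting partition, /dev/gpt/zfsdisk6, can then be used when creating or extending a pool.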
If a future + replacement disk of the same nominal size as the original + actually has a slightly smaller capacity, the smaller + partition will still fit, and the replacement disk can still + be used. + + Create a + RAID-Z2 pool using + partitions: + + &prompt.root; zpool create mypool raidz2 /dev/ada0p3 /dev/ada1p3 /dev/ada2p3 /dev/ada3p3 /dev/ada4p3 /dev/ada5p3 +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + raidz2-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + ada3p3 ONLINE 0 0 0 + ada4p3 ONLINE 0 0 0 + ada5p3 ONLINE 0 0 0 + +errors: No known data errors + + A pool that is no longer needed can be destroyed so that + the disks can be reused. Destroying a pool involves first + unmounting all of the datasets in that pool. If the datasets + are in use, the unmount operation will fail and the pool will + not be destroyed. The destruction of the pool can be forced + with , but this can cause undefined + behavior in applications which had open files on those + datasets. + + + + Adding and Removing Devices + + There are two cases for adding disks to a zpool: attaching + a disk to an existing vdev with + zpool attach, or adding vdevs to the pool + with zpool add. Only some + vdev types allow disks to + be added to the vdev after creation. + + A pool created with a single disk lacks redundancy. + Corruption can be detected but + not repaired, because there is no other copy of the data. + + The copies property may + be able to recover from a small failure such as a bad sector, + but does not provide the same level of protection as mirroring + or RAID-Z. Starting with a pool consisting + of a single disk vdev, zpool attach can be + used to add an additional disk to the vdev, creating a mirror. + zpool attach can also be used to add + additional disks to a mirror group, increasing redundancy and + read performance. If the disks being used for the pool are + partitioned, replicate the layout of the first disk on to the + second, gpart backup and + gpart restore can be used to make this + process easier. + + Upgrade the single disk (stripe) vdev + ada0p3 to a mirror by attaching + ada1p3: + + &prompt.root; zpool status + pool: mypool + state: ONLINE + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool attach mypool ada0p3 ada1p3 +Make sure to wait until resilver is done before rebooting. + +If you boot from pool 'mypool', you may need to update +boot code on newly attached disk 'ada1p3'. + +Assuming you use GPT partitioning and 'da0' is your new boot disk +you may use the following command: + + gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0 +&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1 +bootcode written to ada1 +&prompt.root; zpool status + pool: mypool + state: ONLINE +status: One or more devices is currently being resilvered. The pool will + continue to function, possibly in a degraded state. +action: Wait for the resilver to complete. 
+ scan: resilver in progress since Fri May 30 08:19:19 2014 + 527M scanned out of 781M at 47.9M/s, 0h0m to go + 527M resilvered, 67.53% done +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 (resilvering) + +errors: No known data errors +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: resilvered 781M in 0h0m with 0 errors on Fri May 30 08:15:58 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + +errors: No known data errors + + When adding disks to the existing vdev is not an option, + as for RAID-Z, an alternative method is to + add another vdev to the pool. Additional vdevs provide higher + performance, distributing writes across the vdevs. Each vdev + is reponsible for providing its own redundancy. It is + possible, but discouraged, to mix vdev types, like + mirror and RAID-Z. + Adding a non-redundant vdev to a pool containing mirror or + RAID-Z vdevs risks the data on the entire + pool. Writes are distributed, so the failure of the + non-redundant disk will result in the loss of a fraction of + every block that has been written to the pool. + + Data is striped across each of the vdevs. For example, + with two mirror vdevs, this is effectively a + RAID 10 that stripes writes across two sets + of mirrors. Space is allocated so that each vdev reaches 100% + full at the same time. There is a performance penalty if the + vdevs have different amounts of free space, as a + disproportionate amount of the data is written to the less + full vdev. + + When attaching additional devices to a boot pool, remember + to update the bootcode. + + Attach a second mirror group (ada2p3 + and ada3p3) to the existing + mirror: + + &prompt.root; zpool status + pool: mypool + state: ONLINE + scan: resilvered 781M in 0h0m with 0 errors on Fri May 30 08:19:35 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool add mypool mirror ada2p3 ada3p3 +&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2 +bootcode written to ada2 +&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3 +bootcode written to ada3 +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + mirror-1 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + ada3p3 ONLINE 0 0 0 + +errors: No known data errors + + Currently, vdevs cannot be removed from a pool, and disks + can only be removed from a mirror if there is enough remaining + redundancy. If only one disk in a mirror group remains, it + ceases to be a mirror and reverts to being a stripe, risking + the entire pool if that remaining disk fails. 
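When a disk only needs to be taken out of service temporarily, for example for cabling work, it can be offlined and later brought back online instead of being detached from the vdev. This sketch uses the device names from this section; with -t, the offline state lasts only until the next reboot:

&prompt.root; zpool offline -t mypool ada2p3
&prompt.root; zpool online mypool ada2p3

While the device is offline the mirror runs with reduced redundancy, and when it is onlined again only the changes made in the meantime are resilvered.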
+ + Remove a disk from a three-way mirror group: + + &prompt.root; zpool status + pool: mypool + state: ONLINE + scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool detach mypool ada2p3 +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + +errors: No known data errors + + + + Checking the Status of a Pool + + Pool status is important. If a drive goes offline or a + read, write, or checksum error is detected, the corresponding + error count increases. The status output + shows the configuration and status of each device in the pool + and the status of the entire pool. Actions that need to be + taken and details about the last scrub + are also shown. + + &prompt.root; zpool status + pool: mypool + state: ONLINE + scan: scrub repaired 0 in 2h25m with 0 errors on Sat Sep 14 04:25:50 2013 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + raidz2-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + ada3p3 ONLINE 0 0 0 + ada4p3 ONLINE 0 0 0 + ada5p3 ONLINE 0 0 0 + +errors: No known data errors + + + + Clearing Errors + + When an error is detected, the read, write, or checksum + counts are incremented. The error message can be cleared and + the counts reset with zpool clear + mypool. Clearing the + error state can be important for automated scripts that alert + the administrator when the pool encounters an error. Further + errors may not be reported if the old errors are not + cleared. + + + + Replacing a Functioning Device + + There are a number of situations where it m be + desirable to replace one disk with a different disk. When + replacing a working disk, the process keeps the old disk + online during the replacement. The pool never enters a + degraded state, + reducing the risk of data loss. + zpool replace copies all of the data from + the old disk to the new one. After the operation completes, + the old disk is disconnected from the vdev. If the new disk + is larger than the old disk, it may be possible to grow the + zpool, using the new space. See Growing a Pool. + + Replace a functioning device in the pool: + + &prompt.root; zpool status + pool: mypool + state: ONLINE + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool replace mypool ada1p3 ada2p3 +Make sure to wait until resilver is done before rebooting. + +If you boot from pool 'zroot', you may need to update +boot code on newly attached disk 'ada2p3'. + +Assuming you use GPT partitioning and 'da0' is your new boot disk +you may use the following command: + + gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0 +&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2 +&prompt.root; zpool status + pool: mypool + state: ONLINE +status: One or more devices is currently being resilvered. The pool will + continue to function, possibly in a degraded state. +action: Wait for the resilver to complete. 
+ scan: resilver in progress since Mon Jun 2 14:21:35 2014 + 604M scanned out of 781M at 46.5M/s, 0h0m to go + 604M resilvered, 77.39% done +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + replacing-1 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 (resilvering) + +errors: No known data errors +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: resilvered 781M in 0h0m with 0 errors on Mon Jun 2 14:21:52 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + +errors: No known data errors + + + + Dealing with Failed Devices + + When a disk in a pool fails, the vdev to which the disk + belongs enters the + degraded state. All + of the data is still available, but performance may be reduced + because missing data must be calculated from the available + redundancy. To restore the vdev to a fully functional state, + the failed physical device must be replaced. + ZFS is then instructed to begin the + resilver operation. + Data that was on the failed device is recalculated from + available redundancy and written to the replacement device. + After completion, the vdev returns to + online status. + + If the vdev does not have any redundancy, or if multiple + devices have failed and there is not enough redundancy to + compensate, the pool enters the + faulted state. If a + sufficient number of devices cannot be reconnected to the + pool, the pool becomes inoperative and data must be restored + from backups. + + When replacing a failed disk, the name of the failed disk + is replaced with the GUID of the device. + A new device name parameter for + zpool replace is not required if the + replacement device has the same device name. + + Replace a failed disk using + zpool replace: + + &prompt.root; zpool status + pool: mypool + state: DEGRADED +status: One or more devices could not be opened. Sufficient replicas exist for + the pool to continue functioning in a degraded state. +action: Attach the missing device and online it using 'zpool online'. + see: http://illumos.org/msg/ZFS-8000-2Q + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool DEGRADED 0 0 0 + mirror-0 DEGRADED 0 0 0 + ada0p3 ONLINE 0 0 0 + 316502962686821739 UNAVAIL 0 0 0 was /dev/ada1p3 + +errors: No known data errors +&prompt.root; zpool replace mypool 316502962686821739 ada2p3 +&prompt.root; zpool status + pool: mypool + state: DEGRADED +status: One or more devices is currently being resilvered. The pool will + continue to function, possibly in a degraded state. +action: Wait for the resilver to complete. + scan: resilver in progress since Mon Jun 2 14:52:21 2014 + 641M scanned out of 781M at 49.3M/s, 0h0m to go + 640M resilvered, 82.04% done +config: + + NAME STATE READ WRITE CKSUM + mypool DEGRADED 0 0 0 + mirror-0 DEGRADED 0 0 0 + ada0p3 ONLINE 0 0 0 + replacing-1 UNAVAIL 0 0 0 + 15732067398082357289 UNAVAIL 0 0 0 was /dev/ada1p3/old + ada2p3 ONLINE 0 0 0 (resilvering) + +errors: No known data errors +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: resilvered 781M in 0h0m with 0 errors on Mon Jun 2 14:52:38 2014 +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + +errors: No known data errors + + + + Scrubbing a Pool + + It is recommended that pools be + scrubbed regularly, + ideally at least once every month. 
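On &os;, routine scrubs can be scheduled with &man.periodic.8;. A sketch of such a configuration, to be adapted to local policy, adds these lines to /etc/periodic.conf (the threshold is the number of days that must pass before a pool is scrubbed again; see /etc/defaults/periodic.conf for the available settings):

daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="35"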
The + scrub operation is very disk-intensive and + will reduce performance while running. Avoid high-demand + periods when scheduling scrub or use vfs.zfs.scrub_delay + to adjust the relative priority of the + scrub to prevent it interfering with other + workloads. + + &prompt.root; zpool scrub mypool +&prompt.root; zpool status + pool: mypool + state: ONLINE + scan: scrub in progress since Wed Feb 19 20:52:54 2014 + 116G scanned out of 8.60T at 649M/s, 3h48m to go + 0 repaired, 1.32% done +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + raidz2-0 ONLINE 0 0 0 + ada0p3 ONLINE 0 0 0 + ada1p3 ONLINE 0 0 0 + ada2p3 ONLINE 0 0 0 + ada3p3 ONLINE 0 0 0 + ada4p3 ONLINE 0 0 0 + ada5p3 ONLINE 0 0 0 + +errors: No known data errors + + In the event that a scrub operation needs to be cancelled, + issue zpool scrub -s + mypool. + + + + Self-Healing + + The checksums stored with data blocks enable the file + system to self-heal. This feature will + automatically repair data whose checksum does not match the + one recorded on another device that is part of the storage + pool. For example, a mirror with two disks where one drive is + starting to malfunction and cannot properly store the data any + more. This is even worse when the data has not been accessed + for a long time, as with long term archive storage. + Traditional file systems need to run algorithms that check and + repair the data like &man.fsck.8;. These commands take time, + and in severe cases, an administrator has to manually decide + which repair operation must be performed. When + ZFS detects a data block with a checksum + that does not match, it tries to read the data from the mirror + disk. If that disk can provide the correct data, it will not + only give that data to the application requesting it, but also + correct the wrong data on the disk that had the bad checksum. + This happens without any interaction from a system + administrator during normal pool operation. + + The next example demonstrates this self-healing behavior. + A mirrored pool of disks /dev/ada0 and + /dev/ada1 is created. + + &prompt.root; zpool create healer mirror /dev/ada0 /dev/ada1 +&prompt.root; zpool status healer + pool: healer + state: ONLINE + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + healer ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0 ONLINE 0 0 0 + ada1 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool list +NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT +healer 960M 92.5K 960M 0% 1.00x ONLINE - + + Some important data that to be protected from data errors + using the self-healing feature is copied to the pool. A + checksum of the pool is created for later comparison. + + &prompt.root; cp /some/important/data /healer +&prompt.root; zfs list +NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT +healer 960M 67.7M 892M 7% 1.00x ONLINE - +&prompt.root; sha1 /healer > checksum.txt +&prompt.root; cat checksum.txt +SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f + + Data corruption is simulated by writing random data to the + beginning of one of the disks in the mirror. To prevent + ZFS from healing the data as soon as it is + detected, the pool is exported before the corruption and + imported again afterwards. + + + This is a dangerous operation that can destroy vital + data. It is shown here for demonstrational purposes only + and should not be attempted during normal operation of a + storage pool. Nor should this intentional corruption + example be run on any disk with a different file system on + it. 
Do not use any other disk device names other than the + ones that are part of the pool. Make certain that proper + backups of the pool are created before running the + command! + + + &prompt.root; zpool export healer +&prompt.root; dd if=/dev/random of=/dev/ada1 bs=1m count=200 +200+0 records in +200+0 records out +209715200 bytes transferred in 62.992162 secs (3329227 bytes/sec) +&prompt.root; zpool import healer + + The pool status shows that one device has experienced an + error. Note that applications reading data from the pool did + not receive any incorrect data. ZFS + provided data from the ada0 device with + the correct checksums. The device with the wrong checksum can + be found easily as the CKSUM column + contains a nonzero value. + + &prompt.root; zpool status healer + pool: healer + state: ONLINE + status: One or more devices has experienced an unrecoverable error. An + attempt was made to correct the error. Applications are unaffected. + action: Determine if the device needs to be replaced, and clear the errors + using 'zpool clear' or replace the device with 'zpool replace'. + see: http://www.sun.com/msg/ZFS-8000-9P + scan: none requested + config: + + NAME STATE READ WRITE CKSUM + healer ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0 ONLINE 0 0 0 + ada1 ONLINE 0 0 1 + +errors: No known data errors + + The error was detected and handled by using the redundancy + present in the unaffected ada0 mirror + disk. A checksum comparison with the original one will reveal + whether the pool is consistent again. + + &prompt.root; sha1 /healer >> checksum.txt +&prompt.root; cat checksum.txt +SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f +SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f + + The two checksums that were generated before and after the + intentional tampering with the pool data still match. This + shows how ZFS is capable of detecting and + correcting any errors automatically when the checksums differ. + Note that this is only possible when there is enough + redundancy present in the pool. A pool consisting of a single + device has no self-healing capabilities. That is also the + reason why checksums are so important in + ZFS and should not be disabled for any + reason. No &man.fsck.8; or similar file system consistency + check program is required to detect and correct this and the + pool was still available during the time there was a problem. + A scrub operation is now required to overwrite the corrupted + data on ada1. + + &prompt.root; zpool scrub healer +&prompt.root; zpool status healer + pool: healer + state: ONLINE +status: One or more devices has experienced an unrecoverable error. An + attempt was made to correct the error. Applications are unaffected. +action: Determine if the device needs to be replaced, and clear the errors + using 'zpool clear' or replace the device with 'zpool replace'. + see: http://www.sun.com/msg/ZFS-8000-9P + scan: scrub in progress since Mon Dec 10 12:23:30 2012 + 10.4M scanned out of 67.0M at 267K/s, 0h3m to go + 9.63M repaired, 15.56% done +config: + + NAME STATE READ WRITE CKSUM + healer ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0 ONLINE 0 0 0 + ada1 ONLINE 0 0 627 (repairing) + +errors: No known data errors + + The scrub operation reads data from + ada0 and rewrites any data with an + incorrect checksum on ada1. This is + indicated by the (repairing) output from + zpool status. 
+      After the operation is complete, the pool status
+      changes to:
+
+      &prompt.root; zpool status healer
+  pool: healer
+ state: ONLINE
+status: One or more devices has experienced an unrecoverable error.  An
+        attempt was made to correct the error.  Applications are unaffected.
+action: Determine if the device needs to be replaced, and clear the errors
+        using 'zpool clear' or replace the device with 'zpool replace'.
+   see: http://www.sun.com/msg/ZFS-8000-9P
+  scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012
+config:
+
+    NAME        STATE     READ WRITE CKSUM
+    healer      ONLINE       0     0     0
+      mirror-0  ONLINE       0     0     0
+       ada0     ONLINE       0     0     0
+       ada1     ONLINE       0     0 2.72K
+
+errors: No known data errors
+
+      After the scrub operation completes and all the data
+      has been synchronized from ada0 to
+      ada1, the error messages can be
+      cleared from the pool status by running
+      zpool clear.
+
+      &prompt.root; zpool clear healer
+&prompt.root; zpool status healer
+  pool: healer
+ state: ONLINE
+  scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012
+config:
+
+    NAME        STATE     READ WRITE CKSUM
+    healer      ONLINE       0     0     0
+      mirror-0  ONLINE       0     0     0
+       ada0     ONLINE       0     0     0
+       ada1     ONLINE       0     0     0
+
+errors: No known data errors
+
+      The pool is now back to a fully working state and all the
+      errors have been cleared.
+
+
+    Growing a Pool
+
+      The usable size of a redundant pool is limited by the
+      capacity of the smallest device in each vdev.  The smallest
+      device can be replaced with a larger device.  After completing
+      a replace or
+      resilver operation,
+      the pool can grow to use the capacity of the new device.  For
+      example, consider a mirror of a 1 TB drive and a
+      2 TB drive.  The usable space is 1 TB.  Then the
+      1 TB drive is replaced with another 2 TB drive, and the
+      resilvering process duplicates existing data.  Because
+      both of the devices now have 2 TB capacity, the mirror's
+      available space can be grown to 2 TB.
+
+      Expansion is triggered by using
+      zpool online -e on each device.  After
+      expansion of all devices, the additional space becomes
+      available to the pool.
+
+
+    Importing and Exporting Pools
+
+      Pools are exported before moving them
+      to another system.  All datasets are unmounted, and each
+      device is marked as exported but still locked so it cannot be
+      used by other disk subsystems.  This allows pools to be
+      imported on other machines, other
+      operating systems that support ZFS, and
+      even different hardware architectures (with some caveats, see
+      &man.zpool.8;).  When a dataset has open files,
+      zpool export -f can be used to force the
+      export of a pool.  Use this with caution.  The datasets are
+      forcibly unmounted, potentially resulting in unexpected
+      behavior by the applications which had open files on those
+      datasets.
+
+      Export a pool that is not in use:
+
+      &prompt.root; zpool export mypool
+
+      Importing a pool automatically mounts the datasets.  This
+      may not be the desired behavior, and can be prevented with
+      zpool import -N.
+      zpool import -o sets temporary properties
+      for this import only.
+      zpool import -o altroot= allows importing a
+      pool with a base mount point instead of the root of the file
+      system.  If the pool was last used on a different system and
+      was not properly exported, an import might have to be forced
+      with zpool import -f.
+      zpool import -a imports all pools that do
+      not appear to be in use by another system.
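+      For example, to import a pool without mounting any of its
+      datasets and then mount one dataset by hand, commands along
+      these lines can be used (a sketch only; the dataset name is
+      illustrative):
+
+      &prompt.root; zpool import -N mypool
+&prompt.root; zfs mount mypool/usr/home
+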
+ + List all available pools for import: + + &prompt.root; zpool import + pool: mypool + id: 9930174748043525076 + state: ONLINE + action: The pool can be imported using its name or numeric identifier. + config: + + mypool ONLINE + ada2p3 ONLINE + + Import the pool with an alternative root directory: + + &prompt.root; zpool import -o altroot=/mnt mypool +&prompt.root; zfs list +zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 110K 47.0G 31K /mnt/mypool + + + + Upgrading a Storage Pool + + After upgrading &os;, or if a pool has been imported from + a system using an older version of ZFS, the + pool can be manually upgraded to the latest version of + ZFS to support newer features. Consider + whether the pool may ever need to be imported on an older + system before upgrading. Upgrading is a one-way process. + Older pools can be upgraded, but pools with newer features + cannot be downgraded. + + Upgrade a v28 pool to support + Feature Flags: + + &prompt.root; zpool status + pool: mypool + state: ONLINE +status: The pool is formatted using a legacy on-disk format. The pool can + still be used, but some features are unavailable. +action: Upgrade the pool using 'zpool upgrade'. Once this is done, the + pool will no longer be accessible on software that does not support feat + flags. + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0 ONLINE 0 0 0 + ada1 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool upgrade +This system supports ZFS pool feature flags. + +The following pools are formatted with legacy version numbers and can +be upgraded to use feature flags. After being upgraded, these pools +will no longer be accessible by software that does not support feature +flags. + +VER POOL +--- ------------ +28 mypool + +Use 'zpool upgrade -v' for a list of available legacy versions. +Every feature flags pool has all supported features enabled. +&prompt.root; zpool upgrade mypool +This system supports ZFS pool feature flags. + +Successfully upgraded 'mypool' from version 28 to feature flags. +Enabled the following features on 'mypool': + async_destroy + empty_bpobj + lz4_compress + multi_vdev_crash_dump + + The newer features of ZFS will not be + available until zpool upgrade has + completed. zpool upgrade -v can be used to + see what new features will be provided by upgrading, as well + as which features are already supported. + + Upgrade a pool to support additional feature flags: + + &prompt.root; zpool status + pool: mypool + state: ONLINE +status: Some supported features are not enabled on the pool. The pool can + still be used, but some features are unavailable. +action: Enable all features using 'zpool upgrade'. Once this is done, + the pool may no longer be accessible by software that does not support + the features. See zpool-features(7) for details. + scan: none requested +config: + + NAME STATE READ WRITE CKSUM + mypool ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + ada0 ONLINE 0 0 0 + ada1 ONLINE 0 0 0 + +errors: No known data errors +&prompt.root; zpool upgrade +This system supports ZFS pool feature flags. + +All pools are formatted using feature flags. + + +Some supported features are not enabled on the following pools. Once a +feature is enabled the pool may become incompatible with software +that does not support the feature. See zpool-features(7) for details. 
+ +POOL FEATURE +--------------- +zstore + multi_vdev_crash_dump + spacemap_histogram + enabled_txg + hole_birth + extensible_dataset + bookmarks + filesystem_limits +&prompt.root; zpool upgrade mypool +This system supports ZFS pool feature flags. + +Enabled the following features on 'mypool': + spacemap_histogram + enabled_txg + hole_birth + extensible_dataset + bookmarks + filesystem_limits + + + The boot code on systems that boot from a pool must be + updated to support the new pool version. Use + gpart bootcode on the partition that + contains the boot code. See &man.gpart.8; for more + information. + + + + + Displaying Recorded Pool History + + Commands that modify the pool are recorded. Recorded + actions include the creation of datasets, changing properties, + or replacement of a disk. This history is useful for + reviewing how a pool was created and which user performed a + specific action and when. History is not kept in a log file, + but is part of the pool itself. The command to review this + history is aptly named + zpool history: + + &prompt.root; zpool history +History for 'tank': +2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1 +2013-02-27.18:50:58 zfs set atime=off tank +2013-02-27.18:51:09 zfs set checksum=fletcher4 tank +2013-02-27.18:51:18 zfs create tank/backup + + The output shows zpool and + zfs commands that were executed on the pool + along with a timestamp. Only commands that alter the pool in + some way are recorded. Commands like + zfs list are not included. When no pool + name is specified, the history of all pools is + displayed. + + zpool history can show even more + information when the options or + are provided. + displays user-initiated events as well as internally logged + ZFS events. + + &prompt.root; zpool history -i +History for 'tank': +2013-02-26.23:02:35 [internal pool create txg:5] pool spa 28; zfs spa 28; zpl 5;uts 9.1-RELEASE 901000 amd64 +2013-02-27.18:50:53 [internal property set txg:50] atime=0 dataset = 21 +2013-02-27.18:50:58 zfs set atime=off tank +2013-02-27.18:51:04 [internal property set txg:53] checksum=7 dataset = 21 +2013-02-27.18:51:09 zfs set checksum=fletcher4 tank +2013-02-27.18:51:13 [internal create txg:55] dataset = 39 +2013-02-27.18:51:18 zfs create tank/backup + + More details can be shown by adding . + History records are shown in a long format, including + information like the name of the user who issued the command + and the hostname on which the change was made. + + &prompt.root; zpool history -l +History for 'tank': +2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1 [user 0 (root) on :global] +2013-02-27.18:50:58 zfs set atime=off tank [user 0 (root) on myzfsbox:global] +2013-02-27.18:51:09 zfs set checksum=fletcher4 tank [user 0 (root) on myzfsbox:global] +2013-02-27.18:51:18 zfs create tank/backup [user 0 (root) on myzfsbox:global] + + The output shows that the + root user created + the mirrored pool with disks + /dev/ada0 and + /dev/ada1. The hostname + myzfsbox is also + shown in the commands after the pool's creation. The hostname + display becomes important when the pool is exported from one + system and imported on another. The commands that are issued + on the other system can clearly be distinguished by the + hostname that is recorded for each command. + + Both options to zpool history can be + combined to give the most detailed information possible for + any given pool. 
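+      For example, both can be given at once (a sketch only; the
+      output depends entirely on the history of the pool in
+      question):
+
+      &prompt.root; zpool history -il tank
+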
Pool history provides valuable information + when tracking down the actions that were performed or when + more detailed output is needed for debugging. + + + + Performance Monitoring + + A built-in monitoring system can display pool + I/O statistics in real time. It shows the + amount of free and used space on the pool, how many read and + write operations are being performed per second, and how much + I/O bandwidth is currently being utilized. + By default, all pools in the system are monitored and + displayed. A pool name can be provided to limit monitoring to + just that pool. A basic example: + + &prompt.root; zpool iostat + capacity operations bandwidth +pool alloc free read write read write +---------- ----- ----- ----- ----- ----- ----- +data 288G 1.53T 2 11 11.3K 57.1K + + To continuously monitor I/O activity, a + number can be specified as the last parameter, indicating a + interval in seconds to wait between updates. The next + statistic line is printed after each interval. Press + + Ctrl + C + to stop this continuous monitoring. + Alternatively, give a second number on the command line after + the interval to specify the total number of statistics to + display. + + Even more detailed I/O statistics can + be displayed with . Each device in the + pool is shown with a statistics line. This is useful in + seeing how many read and write operations are being performed + on each device, and can help determine if any individual + device is slowing down the pool. This example shows a + mirrored pool with two devices: + + &prompt.root; zpool iostat -v + capacity operations bandwidth +pool alloc free read write read write +----------------------- ----- ----- ----- ----- ----- ----- +data 288G 1.53T 2 12 9.23K 61.5K + mirror 288G 1.53T 2 12 9.23K 61.5K + ada1 - - 0 4 5.61K 61.7K + ada2 - - 1 4 5.04K 61.7K +----------------------- ----- ----- ----- ----- ----- ----- + + + + Splitting a Storage Pool + + A pool consisting of one or more mirror vdevs can be split + into two pools. Unless otherwise specified, the last member + of each mirror is detached and used to create a new pool + containing the same data. The operation should first be + attempted with . The details of the + proposed operation are displayed without it actually being + performed. This helps confirm that the operation will do what + the user intends. + + + + + <command>zfs</command> Administration + + The zfs utility is responsible for + creating, destroying, and managing all ZFS + datasets that exist within a pool. The pool is managed using + zpool. + + + Creating and Destroying Datasets + + Unlike traditional disks and volume managers, space in + ZFS is not + preallocated. With traditional file systems, after all of the + space is partitioned and assigned, there is no way to add an + additional file system without adding a new disk. With + ZFS, new file systems can be created at any + time. Each dataset + has properties including features like compression, + deduplication, caching, and quotas, as well as other useful + properties like readonly, case sensitivity, network file + sharing, and a mount point. Datasets can be nested inside + each other, and child datasets will inherit properties from + their parents. Each dataset can be administered, + delegated, + replicated, + snapshotted, + jailed, and destroyed as a + unit. There are many advantages to creating a separate + dataset for each different type or set of files. 
The only + drawbacks to having an extremely large number of datasets is + that some commands like zfs list will be + slower, and the mounting of hundreds or even thousands of + datasets can slow the &os; boot process. + + Create a new dataset and enable LZ4 + compression on it: + + &prompt.root; zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 781M 93.2G 144K none +mypool/ROOT 777M 93.2G 144K none +mypool/ROOT/default 777M 93.2G 777M / +mypool/tmp 176K 93.2G 176K /tmp +mypool/usr 616K 93.2G 144K /usr +mypool/usr/home 184K 93.2G 184K /usr/home +mypool/usr/ports 144K 93.2G 144K /usr/ports +mypool/usr/src 144K 93.2G 144K /usr/src +mypool/var 1.20M 93.2G 608K /var +mypool/var/crash 148K 93.2G 148K /var/crash +mypool/var/log 178K 93.2G 178K /var/log +mypool/var/mail 144K 93.2G 144K /var/mail +mypool/var/tmp 152K 93.2G 152K /var/tmp +&prompt.root; zfs create -o compress=lz4 mypool/usr/mydataset +&prompt.root; zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 781M 93.2G 144K none +mypool/ROOT 777M 93.2G 144K none +mypool/ROOT/default 777M 93.2G 777M / +mypool/tmp 176K 93.2G 176K /tmp +mypool/usr 704K 93.2G 144K /usr +mypool/usr/home 184K 93.2G 184K /usr/home +mypool/usr/mydataset 87.5K 93.2G 87.5K /usr/mydataset +mypool/usr/ports 144K 93.2G 144K /usr/ports +mypool/usr/src 144K 93.2G 144K /usr/src +mypool/var 1.20M 93.2G 610K /var +mypool/var/crash 148K 93.2G 148K /var/crash +mypool/var/log 178K 93.2G 178K /var/log +mypool/var/mail 144K 93.2G 144K /var/mail +mypool/var/tmp 152K 93.2G 152K /var/tmp + + Destroying a dataset is much quicker than deleting all + of the files that reside on the dataset, as it does not + involve scanning all of the files and updating all of the + corresponding metadata. + + Destroy the previously-created dataset: + + &prompt.root; zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 880M 93.1G 144K none +mypool/ROOT 777M 93.1G 144K none +mypool/ROOT/default 777M 93.1G 777M / +mypool/tmp 176K 93.1G 176K /tmp +mypool/usr 101M 93.1G 144K /usr +mypool/usr/home 184K 93.1G 184K /usr/home +mypool/usr/mydataset 100M 93.1G 100M /usr/mydataset +mypool/usr/ports 144K 93.1G 144K /usr/ports +mypool/usr/src 144K 93.1G 144K /usr/src +mypool/var 1.20M 93.1G 610K /var +mypool/var/crash 148K 93.1G 148K /var/crash +mypool/var/log 178K 93.1G 178K /var/log +mypool/var/mail 144K 93.1G 144K /var/mail +mypool/var/tmp 152K 93.1G 152K /var/tmp +&prompt.root; zfs destroy mypool/usr/mydataset +&prompt.root; zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 781M 93.2G 144K none +mypool/ROOT 777M 93.2G 144K none +mypool/ROOT/default 777M 93.2G 777M / +mypool/tmp 176K 93.2G 176K /tmp +mypool/usr 616K 93.2G 144K /usr +mypool/usr/home 184K 93.2G 184K /usr/home +mypool/usr/ports 144K 93.2G 144K /usr/ports +mypool/usr/src 144K 93.2G 144K /usr/src +mypool/var 1.21M 93.2G 612K /var +mypool/var/crash 148K 93.2G 148K /var/crash +mypool/var/log 178K 93.2G 178K /var/log +mypool/var/mail 144K 93.2G 144K /var/mail +mypool/var/tmp 152K 93.2G 152K /var/tmp + + In modern versions of ZFS, + zfs destroy is asynchronous, and the free + space might take several minutes to appear in the pool. Use + zpool get freeing + poolname to see the + freeing property, indicating how many + datasets are having their blocks freed in the background. + If there are child datasets, like + snapshots or other + datasets, then the parent cannot be destroyed. To destroy a + dataset and all of its children, use to + recursively destroy the dataset and all of its children. 
+ Use to list datasets + and snapshots that would be destroyed by this operation, but + do not actually destroy anything. Space that would be + reclaimed by destruction of snapshots is also shown. + + + + Creating and Destroying Volumes + + A volume is a special type of dataset. Rather than being + mounted as a file system, it is exposed as a block device + under + /dev/zvol/poolname/dataset. + This allows the volume to be used for other file systems, to + back the disks of a virtual machine, or to be exported using + protocols like iSCSI or + HAST. + + A volume can be formatted with any file system, or used + without a file system to store raw data. To the user, a + volume appears to be a regular disk. Putting ordinary file + systems on these zvols provides features + that ordinary disks or file systems do not normally have. For + example, using the compression property on a 250 MB + volume allows creation of a compressed FAT + file system. + + &prompt.root; zfs create -V 250m -o compression=on tank/fat32 +&prompt.root; zfs list tank +NAME USED AVAIL REFER MOUNTPOINT +tank 258M 670M 31K /tank +&prompt.root; newfs_msdos -F32 /dev/zvol/tank/fat32 +&prompt.root; mount -t msdosfs /dev/zvol/tank/fat32 /mnt +&prompt.root; df -h /mnt | grep fat32 +Filesystem Size Used Avail Capacity Mounted on +/dev/zvol/tank/fat32 249M 24k 249M 0% /mnt +&prompt.root; mount | grep fat32 +/dev/zvol/tank/fat32 on /mnt (msdosfs, local) + + Destroying a volume is much the same as destroying a + regular file system dataset. The operation is nearly + instantaneous, but it may take several minutes for the free + space to be reclaimed in the background. + + + + Renaming a Dataset + + The name of a dataset can be changed with + zfs rename. The parent of a dataset can + also be changed with this command. Renaming a dataset to be + under a different parent dataset will change the value of + those properties that are inherited from the parent dataset. + When a dataset is renamed, it is unmounted and then remounted + in the new location (which is inherited from the new parent + dataset). This behavior can be prevented with + . + + Rename a dataset and move it to be under a different + parent dataset: + + &prompt.root; zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 780M 93.2G 144K none +mypool/ROOT 777M 93.2G 144K none +mypool/ROOT/default 777M 93.2G 777M / +mypool/tmp 176K 93.2G 176K /tmp +mypool/usr 704K 93.2G 144K /usr +mypool/usr/home 184K 93.2G 184K /usr/home +mypool/usr/mydataset 87.5K 93.2G 87.5K /usr/mydataset +mypool/usr/ports 144K 93.2G 144K /usr/ports +mypool/usr/src 144K 93.2G 144K /usr/src +mypool/var 1.21M 93.2G 614K /var +mypool/var/crash 148K 93.2G 148K /var/crash +mypool/var/log 178K 93.2G 178K /var/log +mypool/var/mail 144K 93.2G 144K /var/mail +mypool/var/tmp 152K 93.2G 152K /var/tmp +&prompt.root; zfs rename mypool/usr/mydataset mypool/var/newname +&prompt.root; zfs list +NAME USED AVAIL REFER MOUNTPOINT +mypool 780M 93.2G 144K none +mypool/ROOT 777M 93.2G 144K none +mypool/ROOT/default 777M 93.2G 777M / +mypool/tmp 176K 93.2G 176K /tmp +mypool/usr 616K 93.2G 144K /usr +mypool/usr/home 184K 93.2G 184K /usr/home +mypool/usr/ports 144K 93.2G 144K /usr/ports +mypool/usr/src 144K 93.2G 144K /usr/src +mypool/var 1.29M 93.2G 614K /var +mypool/var/crash 148K 93.2G 148K /var/crash +mypool/var/log 178K 93.2G 178K /var/log +mypool/var/mail 144K 93.2G 144K /var/mail +mypool/var/newname 87.5K 93.2G 87.5K /var/newname +mypool/var/tmp 152K 93.2G 152K /var/tmp + + Snapshots can also be renamed like this. 
+      Due to the nature of snapshots, they cannot be renamed into
+      a different parent dataset.  To rename a recursive snapshot,
+      specify -r, and all snapshots with the same name in
+      child datasets will also be renamed.
+
+      &prompt.root; zfs list -t snapshot
+NAME                                USED  AVAIL  REFER  MOUNTPOINT
+mypool/var/newname@first_snapshot      0      -  87.5K  -
+&prompt.root; zfs rename mypool/var/newname@first_snapshot new_snapshot_name
+&prompt.root; zfs list -t snapshot
+NAME                                   USED  AVAIL  REFER  MOUNTPOINT
+mypool/var/newname@new_snapshot_name      0      -  87.5K  -
+
+
+    Setting Dataset Properties
+
+      Each ZFS dataset has a number of
+      properties that control its behavior.  Most properties are
+      automatically inherited from the parent dataset, but can be
+      overridden locally.  Set a property on a dataset with
+      zfs set
+      property=value
+      dataset.  Most properties have a limited
+      set of valid values; zfs get will display each possible
+      property and its valid values.  Most properties can be
+      reverted to their inherited values using
+      zfs inherit.
+
+      User-defined properties can also be set.  They become part
+      of the dataset configuration and can be used to provide
+      additional information about the dataset or its contents.  To
+      distinguish these custom properties from the ones supplied as
+      part of ZFS, a colon (:)
+      is used to create a custom namespace for the property.
+
+      &prompt.root; zfs set custom:costcenter=1234 tank
+&prompt.root; zfs get custom:costcenter tank
+NAME PROPERTY           VALUE SOURCE
+tank custom:costcenter  1234  local
+
+      To remove a custom property, use
+      zfs inherit with -r.  If
+      the custom property is not defined in any of the parent
+      datasets, it will be removed completely (although the changes
+      are still recorded in the pool's history).
+
+      &prompt.root; zfs inherit -r custom:costcenter tank
+&prompt.root; zfs get custom:costcenter tank
+NAME    PROPERTY           VALUE              SOURCE
+tank    custom:costcenter  -                  -
+&prompt.root; zfs get all tank | grep custom:costcenter
+&prompt.root;
+
+
+    Managing Snapshots
+
+      Snapshots are one
+      of the most powerful features of ZFS.  A
+      snapshot provides a read-only, point-in-time copy of the
+      dataset.  With Copy-On-Write (COW),
+      snapshots can be created quickly by preserving the older
+      version of the data on disk.  If no snapshots exist, space is
+      reclaimed for future use when data is rewritten or deleted.
+      Snapshots preserve disk space by recording only the
+      differences between the current dataset and a previous
+      version.  Snapshots are allowed only on whole datasets, not on
+      individual files or directories.  When a snapshot is created
+      from a dataset, everything contained in it is duplicated.
+      This includes the file system properties, files, directories,
+      permissions, and so on.  Snapshots use no additional space
+      when they are first created, only consuming space as the
+      blocks they reference are changed.  Recursive snapshots taken
+      with -r create a snapshot with the same name
+      on the dataset and all of its children, providing a consistent
+      moment-in-time snapshot of all of the file systems.  This can
+      be important when an application has files on multiple
+      datasets that are related or dependent upon each other.
+      Without snapshots, a backup would have copies of the files
+      from different points in time.
+
+      Snapshots in ZFS provide a variety of
+      features that even other file systems with snapshot
+      functionality lack.
+      A typical example of snapshot use is to
+      have a quick way of backing up the current state of the file
+      system when a risky action like a software installation or a
+      system upgrade is performed.  If the action fails, the
+      snapshot can be rolled back and the system returns to the same
+      state it had when the snapshot was created.  If the upgrade was
+      successful, the snapshot can be deleted to free up space.
+      Without snapshots, a failed upgrade often requires a restore
+      from backup, which is tedious, time consuming, and may require
+      downtime during which the system cannot be used.  Snapshots
+      can be rolled back quickly, even while the system is running
+      in normal operation, with little or no downtime.  The time
+      savings are enormous with multi-terabyte storage systems,
+      compared to the time required to copy the data from a backup.
+      Snapshots are not a replacement for a complete backup of a
+      pool, but can be used as a quick and easy way to store a copy
+      of the dataset at a specific point in time.
+
+
+      Creating Snapshots
+
+        Snapshots are created with zfs snapshot
+        dataset@snapshotname.
+        Adding -r creates a snapshot recursively,
+        with the same name on all child datasets.
+
+        Create a recursive snapshot of the entire pool:
+
+        &prompt.root; zfs list -t all
+NAME                                   USED  AVAIL  REFER  MOUNTPOINT
+mypool                                 780M  93.2G   144K  none
+mypool/ROOT                            777M  93.2G   144K  none
+mypool/ROOT/default                    777M  93.2G   777M  /
+mypool/tmp                             176K  93.2G   176K  /tmp
+mypool/usr                             616K  93.2G   144K  /usr
+mypool/usr/home                        184K  93.2G   184K  /usr/home
+mypool/usr/ports                       144K  93.2G   144K  /usr/ports
+mypool/usr/src                         144K  93.2G   144K  /usr/src
+mypool/var                            1.29M  93.2G   616K  /var
+mypool/var/crash                       148K  93.2G   148K  /var/crash
+mypool/var/log                         178K  93.2G   178K  /var/log
+mypool/var/mail                        144K  93.2G   144K  /var/mail
+mypool/var/newname                    87.5K  93.2G  87.5K  /var/newname
+mypool/var/newname@new_snapshot_name      0      -  87.5K  -
+mypool/var/tmp                         152K  93.2G   152K  /var/tmp
+&prompt.root; zfs snapshot -r mypool@my_recursive_snapshot
+&prompt.root; zfs list -t snapshot
+NAME                                        USED  AVAIL  REFER  MOUNTPOINT
+mypool@my_recursive_snapshot                   0      -   144K  -
+mypool/ROOT@my_recursive_snapshot              0      -   144K  -
+mypool/ROOT/default@my_recursive_snapshot      0      -   777M  -
+mypool/tmp@my_recursive_snapshot               0      -   176K  -
+mypool/usr@my_recursive_snapshot               0      -   144K  -
+mypool/usr/home@my_recursive_snapshot          0      -   184K  -
+mypool/usr/ports@my_recursive_snapshot         0      -   144K  -
+mypool/usr/src@my_recursive_snapshot           0      -   144K  -
+mypool/var@my_recursive_snapshot               0      -   616K  -
+mypool/var/crash@my_recursive_snapshot         0      -   148K  -
+mypool/var/log@my_recursive_snapshot           0      -   178K  -
+mypool/var/mail@my_recursive_snapshot          0      -   144K  -
+mypool/var/newname@new_snapshot_name           0      -  87.5K  -
+mypool/var/newname@my_recursive_snapshot       0      -  87.5K  -
+mypool/var/tmp@my_recursive_snapshot           0      -   152K  -
+
+        Snapshots are not shown by a normal
+        zfs list operation.  To list snapshots,
+        -t snapshot is appended to
+        zfs list.
+        -t all displays both file systems and snapshots.
+
+        Snapshots are not mounted directly, so no path is shown in
+        the MOUNTPOINT column.  There is no
+        mention of available disk space in the
+        AVAIL column, as snapshots cannot be
+        written to after they are created.  Compare the snapshot
+        to the original dataset from which it was created:
+
+        &prompt.root; zfs list -rt all mypool/usr/home
+NAME                                    USED  AVAIL  REFER  MOUNTPOINT
+mypool/usr/home                         184K  93.2G   184K  /usr/home
+mypool/usr/home@my_recursive_snapshot      0      -   184K  -
+
+        Displaying both the dataset and the snapshot together
+        reveals how snapshots work in
+        COW fashion.
They save + only the changes (delta) that were made + and not the complete file system contents all over again. + This means that snapshots take little space when few changes + are made. Space usage can be made even more apparent by + copying a file to the dataset, then making a second + snapshot: + + &prompt.root; cp /etc/passwd /var/tmp +&prompt.root; zfs snapshot mypool/var/tmp@after_cp +&prompt.root; zfs list -rt all mypool/var/tmp +NAME USED AVAIL REFER MOUNTPOINT +mypool/var/tmp 206K 93.2G 118K /var/tmp +mypool/var/tmp@my_recursive_snapshot 88K - 152K - +mypool/var/tmp@after_cp 0 - 118K - + + The second snapshot contains only the changes to the + dataset after the copy operation. This yields enormous + space savings. Notice that the size of the snapshot + mypool/var/tmp@my_recursive_snapshot + also changed in the USED + column to indicate the changes between itself and the + snapshot taken afterwards. + + + + Comparing Snapshots + + ZFS provides a built-in command to compare the + differences in content between two snapshots. This is + helpful when many snapshots were taken over time and the + user wants to see how the file system has changed over time. + For example, zfs diff lets a user find + the latest snapshot that still contains a file that was + accidentally deleted. Doing this for the two snapshots that + were created in the previous section yields this + output: + + &prompt.root; zfs list -rt all mypool/var/tmp +NAME USED AVAIL REFER MOUNTPOINT +mypool/var/tmp 206K 93.2G 118K /var/tmp +mypool/var/tmp@my_recursive_snapshot 88K - 152K - +mypool/var/tmp@after_cp 0 - 118K - +&prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot +M /var/tmp/ ++ /var/tmp/passwd + + The command lists the changes between the specified + snapshot (in this case + mypool/var/tmp@my_recursive_snapshot) + and the live file system. The first column shows the + type of change: + + + + + + + + The path or file was added. + + + + - + The path or file was deleted. + + + + M + The path or file was modified. + + + + R + The path or file was renamed. + + + + + + Comparing the output with the table, it becomes clear + that passwd + was added after the snapshot + mypool/var/tmp@my_recursive_snapshot + was created. This also resulted in a modification to the + parent directory mounted at + /var/tmp. + + Comparing two snapshots is helpful when using the + ZFS replication feature to transfer a + dataset to a different host for backup purposes. + + Compare two snapshots by providing the full dataset name + and snapshot name of both datasets: + + &prompt.root; cp /var/tmp/passwd /var/tmp/passwd.copy +&prompt.root; zfs snapshot mypool/var/tmp@diff_snapshot +&prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot mypool/var/tmp@diff_snapshot +M /var/tmp/ ++ /var/tmp/passwd ++ /var/tmp/passwd.copy +&prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot mypool/var/tmp@after_cp +M /var/tmp/ ++ /var/tmp/passwd + + A backup administrator can compare two snapshots + received from the sending host and determine the actual + changes in the dataset. See the + Replication section for + more information. + + + + Snapshot Rollback + + When at least one snapshot is available, it can be + rolled back to at any time. Most of the time this is the + case when the current state of the dataset is no longer + required and an older version is preferred. 
+      Scenarios such as local development tests gone wrong, botched
+      system updates hampering the system's overall functionality,
+      or the need to restore accidentally deleted files or
+      directories are all too common occurrences.  Luckily,
+      rolling back a snapshot is just as easy as typing
+      zfs rollback
+      snapshotname.
+      Depending on how many changes are involved, the operation
+      may take some time to finish.  During that time, the dataset
+      always remains in a consistent state, much like a database
+      that conforms to ACID principles performing a rollback.
+      This happens while the dataset is live and accessible,
+      without requiring downtime.  Once the snapshot
+      has been rolled back, the dataset has the same state as it
+      had when the snapshot was originally taken.  All other data
+      in that dataset that was not part of the snapshot is
+      discarded.  Taking a snapshot of the current state of the
+      dataset before rolling back to a previous one is a good idea
+      when some data is required later.  This way, the user can
+      roll back and forth between snapshots without losing data
+      that is still valuable.
+
+      In the first example, a snapshot is rolled back because
+      of a careless rm operation that removes
+      more data than was intended.
+
+      &prompt.root; zfs list -rt all mypool/var/tmp
+NAME                                   USED  AVAIL  REFER  MOUNTPOINT
+mypool/var/tmp                         262K  93.2G   120K  /var/tmp
+mypool/var/tmp@my_recursive_snapshot    88K      -   152K  -
+mypool/var/tmp@after_cp               53.5K      -   118K  -
+mypool/var/tmp@diff_snapshot              0      -   120K  -
+&prompt.user; ls /var/tmp
+passwd          passwd.copy
+&prompt.user; rm /var/tmp/passwd*
+&prompt.user; ls /var/tmp
+vi.recover
+&prompt.user;
+
+      At this point, the user realizes that too many files
+      were deleted and wants them back.  ZFS
+      provides an easy way to get them back using rollbacks, but
+      only when snapshots of important data are taken on a
+      regular basis.  To get the files back and start over from
+      the last snapshot, issue the command:
+
+      &prompt.root; zfs rollback mypool/var/tmp@diff_snapshot
+&prompt.user; ls /var/tmp
+passwd          passwd.copy     vi.recover
+
+      The rollback operation restored the dataset to the state
+      of the last snapshot.  It is also possible to roll back to a
+      snapshot that was taken much earlier and has other snapshots
+      that were created after it.  When trying to do this,
+      ZFS will issue this warning:
+
+      &prompt.root; zfs list -rt snapshot mypool/var/tmp
+NAME                                   USED  AVAIL  REFER  MOUNTPOINT
+mypool/var/tmp@my_recursive_snapshot    88K      -   152K  -
+mypool/var/tmp@after_cp               53.5K      -   118K  -
+mypool/var/tmp@diff_snapshot              0      -   120K  -
+&prompt.root; zfs rollback mypool/var/tmp@my_recursive_snapshot
+cannot rollback to 'mypool/var/tmp@my_recursive_snapshot': more recent snapshots exist
+use '-r' to force deletion of the following snapshots:
+mypool/var/tmp@after_cp
+mypool/var/tmp@diff_snapshot
+
+      This warning means that snapshots exist between the
+      current state of the dataset and the snapshot to which the
+      user wants to roll back.  To complete the rollback, these
+      snapshots must be deleted.  ZFS cannot
+      track all the changes between different states of the
+      dataset, because snapshots are read-only.
+      ZFS will not delete the affected
+      snapshots unless the user specifies -r to
+      indicate that this is the desired action.
If that is the + intention, and the consequences of losing all intermediate + snapshots is understood, the command can be issued: + + &prompt.root; zfs rollback -r mypool/var/tmp@my_recursive_snapshot +&prompt.root; zfs list -rt snapshot mypool/var/tmp +NAME USED AVAIL REFER MOUNTPOINT +mypool/var/tmp@my_recursive_snapshot 8K - 152K - +&prompt.user; ls /var/tmp +vi.recover + + The output from zfs list -t snapshot + confirms that the intermediate snapshots + were removed as a result of + zfs rollback -r. + + + + Restoring Individual Files from Snapshots + + Snapshots are mounted in a hidden directory under the + parent dataset: + .zfs/snapshots/snapshotname. + By default, these directories will not be displayed even + when a standard ls -a is issued. + Although the directory is not displayed, it is there + nevertheless and can be accessed like any normal directory. + The property named snapdir controls + whether these hidden directories show up in a directory + listing. Setting the property to visible + allows them to appear in the output of ls + and other commands that deal with directory contents. + + &prompt.root; zfs get snapdir mypool/var/tmp +NAME PROPERTY VALUE SOURCE +mypool/var/tmp snapdir hidden default +&prompt.user; ls -a /var/tmp +. .. passwd vi.recover +&prompt.root; zfs set snapdir=visible mypool/var/tmp +&prompt.user; ls -a /var/tmp +. .. .zfs passwd vi.recover + + Individual files can easily be restored to a previous + state by copying them from the snapshot back to the parent + dataset. The directory structure below + .zfs/snapshot has a directory named + exactly like the snapshots taken earlier to make it easier + to identify them. In the next example, it is assumed that a + file is to be restored from the hidden + .zfs directory by copying it from the + snapshot that contained the latest version of the + file: + + &prompt.root; rm /var/tmp/passwd +&prompt.user; ls -a /var/tmp +. .. .zfs vi.recover +&prompt.root; ls /var/tmp/.zfs/snapshot +after_cp my_recursive_snapshot +&prompt.root; ls /var/tmp/.zfs/snapshot/after_cp +passwd vi.recover +&prompt.root; cp /var/tmp/.zfs/snapshot/after_cp/passwd /var/tmp + + When ls .zfs/snapshot was issued, the + snapdir property might have been set to + hidden, but it would still be possible to list the contents + of that directory. It is up to the administrator to decide + whether these directories will be displayed. It is possible + to display these for certain datasets and prevent it for + others. Copying files or directories from this hidden + .zfs/snapshot is simple enough. Trying + it the other way around results in this error: + + &prompt.root; cp /etc/rc.conf /var/tmp/.zfs/snapshot/after_cp/ +cp: /var/tmp/.zfs/snapshot/after_cp/rc.conf: Read-only file system + + The error reminds the user that snapshots are read-only + and can not be changed after creation. No files can be + copied into or removed from snapshot directories because + that would change the state of the dataset they + represent. + + Snapshots consume space based on how much the parent + file system has changed since the time of the snapshot. The + written property of a snapshot tracks how + much space is being used by the snapshot. + + Snapshots are destroyed and the space reclaimed with + zfs destroy + dataset@snapshot. + Adding recursively removes all snapshots + with the same name under the parent dataset. 
Adding + to the command displays a list of the + snapshots that would be deleted and an estimate of how much + space would be reclaimed without performing the actual + destroy operation. + + + + + Managing Clones + + A clone is a copy of a snapshot that is treated more like + a regular dataset. Unlike a snapshot, a clone is not read + only, is mounted, and can have its own properties. Once a + clone has been created using zfs clone, the + snapshot it was created from cannot be destroyed. The + child/parent relationship between the clone and the snapshot + can be reversed using zfs promote. After a + clone has been promoted, the snapshot becomes a child of the + clone, rather than of the original parent dataset. This will + change how the space is accounted, but not actually change the + amount of space consumed. The clone can be mounted at any + point within the ZFS file system hierarchy, + not just below the original location of the snapshot. + + To demonstrate the clone feature, this example dataset is + used: + + &prompt.root; zfs list -rt all camino/home/joe +NAME USED AVAIL REFER MOUNTPOINT +camino/home/joe 108K 1.3G 87K /usr/home/joe +camino/home/joe@plans 21K - 85.5K - +camino/home/joe@backup 0K - 87K - + + A typical use for clones is to experiment with a specific + dataset while keeping the snapshot around to fall back to in + case something goes wrong. Since snapshots can not be + changed, a read/write clone of a snapshot is created. After + the desired result is achieved in the clone, the clone can be + promoted to a dataset and the old file system removed. This + is not strictly necessary, as the clone and dataset can + coexist without problems. + + &prompt.root; zfs clone camino/home/joe@backup camino/home/joenew +&prompt.root; ls /usr/home/joe* +/usr/home/joe: +backup.txz plans.txt + +/usr/home/joenew: +backup.txz plans.txt +&prompt.root; df -h /usr/home +Filesystem Size Used Avail Capacity Mounted on +usr/home/joe 1.3G 31k 1.3G 0% /usr/home/joe +usr/home/joenew 1.3G 31k 1.3G 0% /usr/home/joenew + + After a clone is created it is an exact copy of the state + the dataset was in when the snapshot was taken. The clone can + now be changed independently from its originating dataset. + The only connection between the two is the snapshot. + ZFS records this connection in the property + origin. Once the dependency between the + snapshot and the clone has been removed by promoting the clone + using zfs promote, the + origin of the clone is removed as it is now + an independent dataset. This example demonstrates it: + + &prompt.root; zfs get origin camino/home/joenew +NAME PROPERTY VALUE SOURCE +camino/home/joenew origin camino/home/joe@backup - +&prompt.root; zfs promote camino/home/joenew +&prompt.root; zfs get origin camino/home/joenew +NAME PROPERTY VALUE SOURCE +camino/home/joenew origin - - + + After making some changes like copying + loader.conf to the promoted clone, for + example, the old directory becomes obsolete in this case. + Instead, the promoted clone can replace it. This can be + achieved by two consecutive commands: zfs + destroy on the old dataset and zfs + rename on the clone to name it like the old + dataset (it could also get an entirely different name). 
+ + &prompt.root; cp /boot/defaults/loader.conf /usr/home/joenew +&prompt.root; zfs destroy -f camino/home/joe +&prompt.root; zfs rename camino/home/joenew camino/home/joe +&prompt.root; ls /usr/home/joe +backup.txz loader.conf plans.txt +&prompt.root; df -h /usr/home +Filesystem Size Used Avail Capacity Mounted on +usr/home/joe 1.3G 128k 1.3G 0% /usr/home/joe + + The cloned snapshot is now handled like an ordinary + dataset. It contains all the data from the original snapshot + plus the files that were added to it like + loader.conf. Clones can be used in + different scenarios to provide useful features to ZFS users. + For example, jails could be provided as snapshots containing + different sets of installed applications. Users can clone + these snapshots and add their own applications as they see + fit. Once they are satisfied with the changes, the clones can + be promoted to full datasets and provided to end users to work + with like they would with a real dataset. This saves time and + administrative overhead when providing these jails. + + + + Replication + + Keeping data on a single pool in one location exposes + it to risks like theft and natural or human disasters. Making + regular backups of the entire pool is vital. + ZFS provides a built-in serialization + feature that can send a stream representation of the data to + standard output. Using this technique, it is possible to not + only store the data on another pool connected to the local + system, but also to send it over a network to another system. + Snapshots are the basis for this replication (see the section + on ZFS + snapshots). The commands used for replicating data + are zfs send and + zfs receive. + + These examples demonstrate ZFS + replication with these two pools: + + &prompt.root; zpool list +NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT +backup 960M 77K 896M 0% 1.00x ONLINE - +mypool 984M 43.7M 940M 4% 1.00x ONLINE - + + The pool named mypool is the + primary pool where data is written to and read from on a + regular basis. A second pool, + backup is used as a standby in case + the primary pool becomes unavailable. Note that this + fail-over is not done automatically by ZFS, + but must be manually done by a system administrator when + needed. A snapshot is used to provide a consistent version of + the file system to be replicated. Once a snapshot of + mypool has been created, it can be + copied to the backup pool. Only + snapshots can be replicated. Changes made since the most + recent snapshot will not be included. + + &prompt.root; zfs snapshot mypool@backup1 +&prompt.root; zfs list -t snapshot +NAME USED AVAIL REFER MOUNTPOINT +mypool@backup1 0 - 43.6M - + + Now that a snapshot exists, zfs send + can be used to create a stream representing the contents of + the snapshot. This stream can be stored as a file or received + by another pool. The stream is written to standard output, + but must be redirected to a file or pipe or an error is + produced: + + &prompt.root; zfs send mypool@backup1 +Error: Stream can not be written to a terminal. +You must redirect standard output. + + To back up a dataset with zfs send, + redirect to a file located on the mounted backup pool. Ensure + that the pool has enough free space to accommodate the size of + the snapshot being sent, which means all of the data contained + in the snapshot, not just the changes from the previous + snapshot. 
+
+      &prompt.root; zfs send mypool@backup1 > /backup/backup1
+&prompt.root; zpool list
+NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
+backup  960M  63.7M   896M     6%  1.00x  ONLINE  -
+mypool  984M  43.7M   940M     4%  1.00x  ONLINE  -
+
+      The zfs send transferred all the data
+      in the snapshot called backup1 to
+      the pool named backup.  Creating
+      and sending these snapshots can be done automatically with a
+      &man.cron.8; job.
+
+      Instead of storing the backups as archive files,
+      ZFS can receive them as a live file system,
+      allowing the backed up data to be accessed directly.  To get
+      to the actual data contained in those streams,
+      zfs receive is used to transform the
+      streams back into files and directories.  The example below
+      combines zfs send and
+      zfs receive using a pipe to copy the data
+      from one pool to another.  The data can be used directly on
+      the receiving pool after the transfer is complete.  A dataset
+      can only be replicated to an empty dataset.
+
+      &prompt.root; zfs snapshot mypool@replica1
+&prompt.root; zfs send -v mypool@replica1 | zfs receive backup/mypool
+send from @ to mypool@replica1 estimated size is 50.1M
+total estimated size is 50.1M
+TIME        SENT   SNAPSHOT
+
+&prompt.root; zpool list
+NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
+backup  960M  63.7M   896M     6%  1.00x  ONLINE  -
+mypool  984M  43.7M   940M     4%  1.00x  ONLINE  -
+
+
+      Incremental Backups
+
+        zfs send can also determine the
+        difference between two snapshots and send only the
+        differences between the two.  This saves disk space and
+        transfer time.  For example:
+
+        &prompt.root; zfs snapshot mypool@replica2
+&prompt.root; zfs list -t snapshot
+NAME                    USED  AVAIL  REFER  MOUNTPOINT
+mypool@replica1        5.72M      -  43.6M  -
+mypool@replica2            0      -  44.1M  -
+&prompt.root; zpool list
+NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
+backup  960M  61.7M   898M     6%  1.00x  ONLINE  -
+mypool  960M  50.2M   910M     5%  1.00x  ONLINE  -
+
+        A second snapshot called
+        replica2 was created.  This
+        second snapshot contains only the changes that were made to
+        the file system since the previous snapshot,
+        replica1.  Using
+        zfs send -i and indicating the pair of
+        snapshots generates an incremental replica stream containing
+        only the data that has changed.  This can only succeed if
+        the initial snapshot already exists on the receiving
+        side.
+
+        &prompt.root; zfs send -v -i mypool@replica1 mypool@replica2 | zfs receive backup/mypool
+send from @replica1 to mypool@replica2 estimated size is 5.02M
+total estimated size is 5.02M
+TIME        SENT   SNAPSHOT
+
+&prompt.root; zpool list
+NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
+backup  960M  80.8M   879M     8%  1.00x  ONLINE  -
+mypool  960M  50.2M   910M     5%  1.00x  ONLINE  -
+
+&prompt.root; zfs list
+NAME           USED  AVAIL  REFER  MOUNTPOINT
+backup        55.4M   240G   152K  /backup
+backup/mypool 55.3M   240G  55.2M  /backup/mypool
+mypool        55.6M  11.6G  55.0M  /mypool
+
+&prompt.root; zfs list -t snapshot
+NAME                     USED  AVAIL  REFER  MOUNTPOINT
+backup/mypool@replica1   104K      -  50.2M  -
+backup/mypool@replica2      0      -  55.2M  -
+mypool@replica1         29.9K      -  50.0M  -
+mypool@replica2             0      -  55.0M  -
+
+        The incremental stream was successfully transferred.
+        Only the data that had changed was replicated, rather than
+        the entirety of replica1.  Only
+        the differences were sent, which took much less time to
+        transfer and saved disk space by not copying the complete
+        pool each time.  This is useful when relying on slow
+        networks or when costs per transferred byte must be
+        considered.
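+        When bandwidth is at a premium, the stream can also be
+        compressed on its way to disk or across the wire, for
+        example with gzip, and later decompressed with
+        gunzip -c piped into zfs receive.  This is
+        only a sketch of the idea and not part of the example
+        above:
+
+        &prompt.root; zfs send -i mypool@replica1 mypool@replica2 | gzip > /backup/incremental.gz
+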
+ + A new file system, + backup/mypool, is available with + all of the files and data from the pool + mypool. If + is specified, the properties of the dataset will be copied, + including compression settings, quotas, and mount points. + When is specified, all child datasets of + the indicated dataset will be copied, along with all of + their properties. Sending and receiving can be automated so + that regular backups are created on the second pool. + + + + Sending Encrypted Backups over + <application>SSH</application> + + Sending streams over the network is a good way to keep a + remote backup, but it does come with a drawback. Data sent + over the network link is not encrypted, allowing anyone to + intercept and transform the streams back into data without + the knowledge of the sending user. This is undesirable, + especially when sending the streams over the internet to a + remote host. SSH can be used to + securely encrypt data send over a network connection. Since + ZFS only requires the stream to be + redirected from standard output, it is relatively easy to + pipe it through SSH. To keep the + contents of the file system encrypted in transit and on the + remote system, consider using PEFS. + + A few settings and security precautions must be + completed first. Only the necessary steps required for the + zfs send operation are shown here. For + more information on SSH, see + . + + This configuration is required: + + + + Passwordless SSH access + between sending and receiving host using + SSH keys + + + + Normally, the privileges of the + root user are + needed to send and receive streams. This requires + logging in to the receiving system as + root. + However, logging in as + root is + disabled by default for security reasons. The + ZFS Delegation + system can be used to allow a + non-root user + on each system to perform the respective send and + receive operations. + + + + On the sending system: + + &prompt.root; zfs allow -u someuser send,snapshot mypool + + + + To mount the pool, the unprivileged user must own + the directory, and regular users must be allowed to + mount file systems. On the receiving system: + + &prompt.root; sysctl vfs.usermount=1 +vfs.usermount: 0 -> 1 +&prompt.root; echo vfs.usermount=1 >> /etc/sysctl.conf +&prompt.root; zfs create recvpool/backup +&prompt.root; zfs allow -u someuser create,mount,receive recvpool/backup +&prompt.root; chown someuser /recvpool/backup + + + + The unprivileged user now has the ability to receive and + mount datasets, and the home + dataset can be replicated to the remote system: + + &prompt.user; zfs snapshot -r mypool/home@monday +&prompt.user; zfs send -R mypool/home@monday | ssh someuser@backuphost zfs recv -dvu recvpool/backup + + A recursive snapshot called + monday is made of the file system + dataset home that resides on the + pool mypool. Then it is sent + with zfs send -R to include the dataset, + all child datasets, snaphots, clones, and settings in the + stream. The output is piped to the waiting + zfs receive on the remote host + backuphost through + SSH. Using a fully qualified + domain name or IP address is recommended. The receiving + machine writes the data to the + backup dataset on the + recvpool pool. Adding + to zfs recv + overwrites the name of the pool on the receiving side with + the name of the snapshot. causes the + file systems to not be mounted on the receiving side. When + is included, more detail about the + transfer is shown, including elapsed time and the amount of + data transferred. 
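+      To double-check which permissions are actually in place on
+      either side, zfs allow can be run with just
+      a dataset name.  This is shown here only as a quick
+      verification step, not as part of the procedure above:
+
+      &prompt.user; zfs allow mypool
+&prompt.user; zfs allow recvpool/backup
+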
+ + + + + Dataset, User, and Group Quotas + + Dataset quotas are + used to restrict the amount of space that can be consumed + by a particular dataset. + Reference Quotas work + in very much the same way, but only count the space + used by the dataset itself, excluding snapshots and child + datasets. Similarly, + user and + group quotas can be + used to prevent users or groups from using all of the + space in the pool or dataset. + + To enforce a dataset quota of 10 GB for + storage/home/bob: + + &prompt.root; zfs set quota=10G storage/home/bob + + To enforce a reference quota of 10 GB for + storage/home/bob: + + &prompt.root; zfs set refquota=10G storage/home/bob + + To remove a quota of 10 GB for + storage/home/bob: + + &prompt.root; zfs set quota=none storage/home/bob + + The general format is + userquota@user=size, + and the user's name must be in one of these formats: + + + + POSIX compatible name such as + joe. + + + + POSIX numeric ID such as + 789. + + + + SID name + such as + joe.bloggs@example.com. + + + + SID + numeric ID such as + S-1-123-456-789. + + + + For example, to enforce a user quota of 50 GB for the + user named joe: + + &prompt.root; zfs set userquota@joe=50G + + To remove any quota: + + &prompt.root; zfs set userquota@joe=none + + + User quota properties are not displayed by + zfs get all. + Non-root users can + only see their own quotas unless they have been granted the + userquota privilege. Users with this + privilege are able to view and set everyone's quota. + + + The general format for setting a group quota is: + groupquota@group=size. + + To set the quota for the group + firstgroup to 50 GB, + use: + + &prompt.root; zfs set groupquota@firstgroup=50G + + To remove the quota for the group + firstgroup, or to make sure that + one is not set, instead use: + + &prompt.root; zfs set groupquota@firstgroup=none + + As with the user quota property, + non-root users can + only see the quotas associated with the groups to which they + belong. However, + root or a user with + the groupquota privilege can view and set + all quotas for all groups. + + To display the amount of space used by each user on + a file system or snapshot along with any quotas, use + zfs userspace. For group information, use + zfs groupspace. For more information about + supported options or how to display only specific options, + refer to &man.zfs.1;. + + Users with sufficient privileges, and + root, can list the + quota for storage/home/bob using: + + &prompt.root; zfs get quota storage/home/bob + + + + Reservations + + Reservations + guarantee a minimum amount of space will always be available + on a dataset. The reserved space will not be available to any + other dataset. This feature can be especially useful to + ensure that free space is available for an important dataset + or log files. + + The general format of the reservation + property is + reservation=size, + so to set a reservation of 10 GB on + storage/home/bob, use: + + &prompt.root; zfs set reservation=10G storage/home/bob + + To clear any reservation: + + &prompt.root; zfs set reservation=none storage/home/bob + + The same principle can be applied to the + refreservation property for setting a + Reference + Reservation, with the general format + refreservation=size. + + This command shows any reservations or refreservations + that exist on storage/home/bob: + + &prompt.root; zfs get reservation storage/home/bob +&prompt.root; zfs get refreservation storage/home/bob + + + + Compression + + ZFS provides transparent compression. 
+      Compressing data at the block level as it is written not only
+      saves space, but can also increase disk throughput.  If data
+      is compressed by 25% but the compressed data is written to
+      the disk at the same rate as the uncompressed version, the
+      effective write speed is 125%.  Compression can also be a
+      great alternative to
+      Deduplication
+      because it does not require additional memory.
+
+      ZFS offers several different
+      compression algorithms, each with different trade-offs.  With
+      the introduction of LZ4 compression in
+      ZFS v5000, it is possible to enable
+      compression for the entire pool without the large performance
+      trade-off of other algorithms.  The biggest advantage to
+      LZ4 is the early abort
+      feature.  If LZ4 does not achieve at least
+      12.5% compression in the first part of the data, the block is
+      written uncompressed to avoid wasting CPU cycles trying to
+      compress data that is either already compressed or
+      incompressible.  For details about the different compression
+      algorithms available in ZFS, see the
+      Compression entry
+      in the terminology section.
+
+      The administrator can monitor the effectiveness of
+      compression using a number of dataset properties.
+
+      &prompt.root; zfs get used,compressratio,compression,logicalused mypool/compressed_dataset
+NAME                       PROPERTY       VALUE     SOURCE
+mypool/compressed_dataset  used           449G      -
+mypool/compressed_dataset  compressratio  1.11x     -
+mypool/compressed_dataset  compression    lz4       local
+mypool/compressed_dataset  logicalused    496G      -
+
+      The dataset is currently using 449 GB of space (the
+      used property).  Without compression, it would have taken
+      496 GB of space (the logicalused
+      property).  This results in a 1.11:1 compression
+      ratio.
+
+      Compression can have an unexpected side effect when
+      combined with
+      User Quotas.
+      User quotas restrict how much space a user can consume on a
+      dataset, but the measurements are based on how much space is
+      used after compression.  So if a user has
+      a quota of 10 GB, and writes 10 GB of compressible
+      data, they will still be able to store additional data.  If
+      they later update a file, say a database, with more or less
+      compressible data, the amount of space available to them will
+      change.  This can result in the odd situation where a user did
+      not increase the actual amount of data (the
+      logicalused property), but the change in
+      compression caused them to reach their quota limit.
+
+      Compression can have a similar unexpected interaction with
+      backups.  Quotas are often used to limit how much data can be
+      stored to ensure there is sufficient backup space available.
+      However, since quotas do not consider compression, more data
+      may be written than would fit with uncompressed
+      backups.
+
+
+    Deduplication
+
+      When enabled,
+      deduplication
+      uses the checksum of each block to detect duplicate blocks.
+      When a new block is a duplicate of an existing block,
+      ZFS writes an additional reference to the
+      existing data instead of the whole duplicate block.
+      Tremendous space savings are possible if the data contains
+      many duplicated files or repeated information.  Be warned:
+      deduplication requires an extremely large amount of memory,
+      and most of the space savings can be had without the extra
+      cost by enabling compression instead.
+
+      To activate deduplication, set the
+      dedup property on the target pool:
+
+      &prompt.root; zfs set dedup=on pool
+
+      Only new data being written to the pool will be
+      deduplicated.
Data that has already been written to the pool
+      will not be deduplicated merely by activating this option. A
+      pool with a freshly activated deduplication property will look
+      like this example:
+
+      &prompt.root; zpool list
+NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
+pool  2.84G  2.19M  2.83G   0%  1.00x  ONLINE  -
+
+      The DEDUP column shows the actual rate
+      of deduplication for the pool. A value of
+      1.00x shows that data has not been
+      deduplicated yet. In the next example, the ports tree is
+      copied three times into different directories on the
+      deduplicated pool created above.
+
+      &prompt.root; for d in dir1 dir2 dir3; do
+for> mkdir $d && cp -R /usr/ports $d &
+for> done
+
+      Redundant data is detected and deduplicated:
+
+      &prompt.root; zpool list
+NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
+pool  2.84G  20.9M  2.82G   0%  3.00x  ONLINE  -
+
+      The DEDUP column shows a factor of
+      3.00x. Multiple copies of the ports tree
+      data were detected and deduplicated, using only a third of the
+      space. The potential for space savings can be enormous, but
+      comes at the cost of having enough memory to keep track of the
+      deduplicated blocks.
+
+      Deduplication is not always beneficial, especially when
+      the data on a pool is not redundant.
+      ZFS can show potential space savings by
+      simulating deduplication on an existing pool:
+
+      &prompt.root; zdb -S pool
+Simulated DDT histogram:
+
+bucket              allocated                       referenced
+______   ______________________________   ______________________________
+refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
+------   ------   -----   -----   -----   ------   -----   -----   -----
+     1    2.58M    289G    264G    264G    2.58M    289G    264G    264G
+     2     206K   12.6G   10.4G   10.4G     430K   26.4G   21.6G   21.6G
+     4    37.6K    692M    276M    276M     170K   3.04G   1.26G   1.26G
+     8    2.18K   45.2M   19.4M   19.4M    20.0K    425M    176M    176M
+    16      174   2.83M   1.20M   1.20M    3.33K   48.4M   20.4M   20.4M
+    32       40   2.17M    222K    222K    1.70K   97.2M   9.91M   9.91M
+    64        9     56K   10.5K   10.5K      865   4.96M    948K    948K
+   128        2   9.50K      2K      2K      419   2.11M    438K    438K
+   256        5   61.5K     12K     12K    1.90K   23.0M   4.47M   4.47M
+    1K        2      1K      1K      1K    2.98K   1.49M   1.49M   1.49M
+ Total    2.82M    303G    275G    275G    3.20M    319G    287G    287G
+
+dedup = 1.05, compress = 1.11, copies = 1.00, dedup * compress / copies = 1.16
+
+      After zdb -S finishes analyzing the
+      pool, it shows the space reduction ratio that would be
+      achieved by activating deduplication. In this case,
+      1.16 is a very poor space saving ratio that
+      is mostly provided by compression. Activating deduplication
+      on this pool would not save any significant amount of space,
+      and is not worth the amount of memory required to enable
+      deduplication. Using the formula
+      ratio = dedup * compress / copies,
+      system administrators can plan the storage allocation,
+      deciding whether the workload will contain enough duplicate
+      blocks to justify the memory requirements. If the data is
+      reasonably compressible, the space savings may be very good.
+      Enabling compression first is recommended, and compression can
+      also provide greatly increased performance. Only enable
+      deduplication in cases where the additional savings will be
+      considerable and there is sufficient memory for the DDT.
+
+
+
+    <acronym>ZFS</acronym> and Jails
+
+      zfs jail and the corresponding
+      jailed property are used to delegate a
+      ZFS dataset to a
+      Jail.
+      zfs jail jailid
+      attaches a dataset to the specified jail, and
+      zfs unjail detaches it. For the dataset to
+      be controlled from within a jail, the
+      jailed property must be set.
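+      As a minimal sketch, assuming a hypothetical dataset
+      mypool/data/jail and a jail named
+      myjail, the dataset could be delegated
+      with:
+
+      &prompt.root; zfs set jailed=on mypool/data/jail
+&prompt.root; zfs jail myjail mypool/data/jail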
Once a + dataset is jailed, it can no longer be mounted on the + host because it may have mount points that would compromise + the security of the host. + + + + + Delegated Administration + + A comprehensive permission delegation system allows + unprivileged users to perform ZFS + administration functions. For example, if each user's home + directory is a dataset, users can be given permission to create + and destroy snapshots of their home directories. A backup user + can be given permission to use replication features. A usage + statistics script can be allowed to run with access only to the + space utilization data for all users. It is even possible to + delegate the ability to delegate permissions. Permission + delegation is possible for each subcommand and most + properties. + + + Delegating Dataset Creation + + zfs allow + someuser create + mydataset gives the + specified user permission to create child datasets under the + selected parent dataset. There is a caveat: creating a new + dataset involves mounting it. That requires setting the + &os; vfs.usermount &man.sysctl.8; to + 1 to allow non-root users to mount a + file system. There is another restriction aimed at preventing + abuse: non-root + users must own the mountpoint where the file system is to be + mounted. + + + + Delegating Permission Delegation + + zfs allow + someuser allow + mydataset gives the + specified user the ability to assign any permission they have + on the target dataset, or its children, to other users. If a + user has the snapshot permission and the + allow permission, that user can then grant + the snapshot permission to other + users. + + + + + Advanced Topics + + + Tuning + + There are a number of tunables that can be adjusted to + make ZFS perform best for different + workloads. + + + + vfs.zfs.arc_max + - Maximum size of the ARC. + The default is all RAM less 1 GB, + or one half of RAM, whichever is more. + However, a lower value should be used if the system will + be running any other daemons or processes that may require + memory. This value can only be adjusted at boot time, and + is set in /boot/loader.conf. + + + + vfs.zfs.arc_meta_limit + - Limit the portion of the + ARC + that can be used to store metadata. The default is one + fourth of vfs.zfs.arc_max. Increasing + this value will improve performance if the workload + involves operations on a large number of files and + directories, or frequent metadata operations, at the cost + of less file data fitting in the ARC. + This value can only be adjusted at boot time, and is set + in /boot/loader.conf. + + + + vfs.zfs.arc_min + - Minimum size of the ARC. + The default is one half of + vfs.zfs.arc_meta_limit. Adjust this + value to prevent other applications from pressuring out + the entire ARC. + This value can only be adjusted at boot time, and is set + in /boot/loader.conf. + + + + vfs.zfs.vdev.cache.size + - A preallocated amount of memory reserved as a cache for + each device in the pool. The total amount of memory used + will be this value multiplied by the number of devices. + This value can only be adjusted at boot time, and is set + in /boot/loader.conf. + + + + vfs.zfs.min_auto_ashift + - Minimum ashift (sector size) that + will be used automatically at pool creation time. The + value is a power of two. The default value of + 9 represents + 2^9 = 512, a sector size of 512 bytes. + To avoid write amplification and get + the best performance, set this value to the largest sector + size used by a device in the pool. 
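+          For example, a system being prepared for disks with
+          4 KB sectors could set this tunable before the pool is
+          created (a sketch; the value 12 corresponds
+          to 4 KB sectors, as described below):
+
+          &prompt.root; sysctl vfs.zfs.min_auto_ashift=12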
+
+          Many drives have 4 KB sectors. Using the default
+          ashift of 9 with
+          these drives results in write amplification on these
+          devices. Data that could be contained in a single
+          4 KB write must instead be written in eight 512-byte
+          writes. ZFS tries to read the native
+          sector size from all devices when creating a pool, but
+          many drives with 4 KB sectors report that their
+          sectors are 512 bytes for compatibility. Setting
+          vfs.zfs.min_auto_ashift to
+          12 (2^12 = 4096)
+          before creating a pool forces ZFS to
+          use 4 KB blocks for best performance on these
+          drives.
+
+          Forcing 4 KB blocks is also useful on pools where
+          disk upgrades are planned. Future disks are likely to use
+          4 KB sectors, and ashift values
+          cannot be changed after a pool is created.
+
+          In some specific cases, the smaller 512-byte block
+          size might be preferable. When used with 512-byte disks
+          for databases, or as storage for virtual machines, less
+          data is transferred during small random reads. This can
+          provide better performance, especially when using a
+          smaller ZFS record size.
+
+
+
+          vfs.zfs.prefetch_disable
+          - Disable prefetch. A value of 0 is
+          enabled and 1 is disabled. The default
+          is 0, unless the system has less than
+          4 GB of RAM. Prefetch works by
+          reading larger blocks than were requested into the
+          ARC
+          in hopes that the data will be needed soon. If the
+          workload has a large number of random reads, disabling
+          prefetch may actually improve performance by reducing
+          unnecessary reads. This value can be adjusted at any time
+          with &man.sysctl.8;.
+
+
+
+          vfs.zfs.vdev.trim_on_init
+          - Control whether new devices added to the pool have the
+          TRIM command run on them. This ensures
+          the best performance and longevity for
+          SSDs, but takes extra time. If the
+          device has already been secure erased, disabling this
+          setting will make the addition of the new device faster.
+          This value can be adjusted at any time with
+          &man.sysctl.8;.
+
+
+
+          vfs.zfs.write_to_degraded
+          - Control whether new data is written to a vdev that is
+          in the DEGRADED
+          state. Defaults to 0, preventing
+          writes to any top level vdev that is in a degraded state.
+          The administrator may wish to allow writing to degraded
+          vdevs to prevent the amount of free space across the vdevs
+          from becoming unbalanced, which will reduce read and write
+          performance. This value can be adjusted at any time with
+          &man.sysctl.8;.
+
+
+
+          vfs.zfs.vdev.max_pending
+          - Limit the number of pending I/O requests per device.
+          A higher value will keep the device command queue full
+          and may give higher throughput. A lower value will reduce
+          latency. This value can be adjusted at any time with
+          &man.sysctl.8;.
+
+
+
+          vfs.zfs.top_maxinflight
+          - Maximum number of outstanding I/Os per top-level
+          vdev. Limits the
+          depth of the command queue to prevent high latency. The
+          limit is per top-level vdev, meaning the limit applies to
+          each mirror,
+          RAID-Z, or
+          other vdev independently. This value can be adjusted at
+          any time with &man.sysctl.8;.
+
+
+
+          vfs.zfs.l2arc_write_max
+          - Limit the amount of data written to the L2ARC
+          per second. This tunable is designed to extend the
+          longevity of SSDs by limiting the
+          amount of data written to the device. This value can be
+          adjusted at any time with &man.sysctl.8;.
+
+
+
+          vfs.zfs.l2arc_write_boost
+          - The value of this tunable is added to vfs.zfs.l2arc_write_max
+          and increases the write speed to the
+          SSD until the first block is evicted
+          from the L2ARC.
+ This Turbo Warmup Phase is designed to + reduce the performance loss from an empty L2ARC + after a reboot. This value can be adjusted at any time + with &man.sysctl.8;. + + + + vfs.zfs.scrub_delay + - Number of ticks to delay between each I/O during a + scrub. + To ensure that a scrub does not + interfere with the normal operation of the pool, if any + other I/O is happening the + scrub will delay between each command. + This value controls the limit on the total + IOPS (I/Os Per Second) generated by the + scrub. The granularity of the setting + is deterined by the value of kern.hz + which defaults to 1000 ticks per second. This setting may + be changed, resulting in a different effective + IOPS limit. The default value is + 4, resulting in a limit of: + 1000 ticks/sec / 4 = + 250 IOPS. Using a value of + 20 would give a limit of: + 1000 ticks/sec / 20 = + 50 IOPS. The speed of + scrub is only limited when there has + been recent activity on the pool, as determined by vfs.zfs.scan_idle. + This value can be adjusted at any time with + &man.sysctl.8;. + + + + vfs.zfs.resilver_delay + - Number of milliseconds of delay inserted between + each I/O during a + resilver. To + ensure that a resilver does not interfere with the normal + operation of the pool, if any other I/O is happening the + resilver will delay between each command. This value + controls the limit of total IOPS (I/Os + Per Second) generated by the resilver. The granularity of + the setting is determined by the value of + kern.hz which defaults to 1000 ticks + per second. This setting may be changed, resulting in a + different effective IOPS limit. The + default value is 2, resulting in a limit of: + 1000 ticks/sec / 2 = + 500 IOPS. Returning the pool to + an Online state may + be more important if another device failing could + Fault the pool, + causing data loss. A value of 0 will give the resilver + operation the same priority as other operations, speeding + the healing process. The speed of resilver is only + limited when there has been other recent activity on the + pool, as determined by vfs.zfs.scan_idle. + This value can be adjusted at any time with + &man.sysctl.8;. + + + + vfs.zfs.scan_idle + - Number of milliseconds since the last operation before + the pool is considered idle. When the pool is idle the + rate limiting for scrub + and + resilver are + disabled. This value can be adjusted at any time with + &man.sysctl.8;. + + + + vfs.zfs.txg.timeout + - Maximum number of seconds between + transaction groups. + The current transaction group will be written to the pool + and a fresh transaction group started if this amount of + time has elapsed since the previous transaction group. A + transaction group my be triggered earlier if enough data + is written. The default value is 5 seconds. A larger + value may improve read performance by delaying + asynchronous writes, but this may cause uneven performance + when the transaction group is written. This value can be + adjusted at any time with &man.sysctl.8;. + + + + + + + + <acronym>ZFS</acronym> on i386 + + Some of the features provided by ZFS + are memory intensive, and may require tuning for maximum + efficiency on systems with limited + RAM. + + + Memory + + As a bare minimum, the total system memory should be at + least one gigabyte. The amount of recommended + RAM depends upon the size of the pool and + which ZFS features are used. A general + rule of thumb is 1 GB of RAM for every 1 TB of + storage. 
If the deduplication feature is used, a general + rule of thumb is 5 GB of RAM per TB of storage to be + deduplicated. While some users successfully use + ZFS with less RAM, + systems under heavy load may panic due to memory exhaustion. + Further tuning may be required for systems with less than + the recommended RAM requirements. + + + + Kernel Configuration + + Due to the address space limitations of the + &i386; platform, ZFS users on the + &i386; architecture must add this option to a + custom kernel configuration file, rebuild the kernel, and + reboot: + + options KVA_PAGES=512 + + This expands the kernel address space, allowing + the vm.kvm_size tunable to be pushed + beyond the currently imposed limit of 1 GB, or the + limit of 2 GB for PAE. To find the + most suitable value for this option, divide the desired + address space in megabytes by four. In this example, it + is 512 for 2 GB. + + + + Loader Tunables + + The kmem address space can be + increased on all &os; architectures. On a test system with + 1 GB of physical memory, success was achieved with + these options added to + /boot/loader.conf, and the system + restarted: + + vm.kmem_size="330M" +vm.kmem_size_max="330M" +vfs.zfs.arc_max="40M" +vfs.zfs.vdev.cache.size="5M" + + For a more detailed list of recommendations for + ZFS-related tuning, see . + + + + + + Additional Resources + + + + FreeBSD + Wiki - ZFS + + + + FreeBSD + Wiki - ZFS Tuning + + + + Illumos + Wiki - ZFS + + + + Oracle + Solaris ZFS Administration + Guide + + + + ZFS + Evil Tuning Guide + + + + ZFS + Best Practices Guide + + + + Calomel + Blog - ZFS Raidz Performance, Capacity + and Integrity + + + + + + <acronym>ZFS</acronym> Features and Terminology + + ZFS is a fundamentally different file + system because it is more than just a file system. + ZFS combines the roles of file system and + volume manager, enabling additional storage devices to be added + to a live system and having the new space available on all of + the existing file systems in that pool immediately. By + combining the traditionally separate roles, + ZFS is able to overcome previous limitations + that prevented RAID groups being able to + grow. Each top level device in a zpool is called a + vdev, which can be a simple disk or a + RAID transformation such as a mirror or + RAID-Z array. ZFS file + systems (called datasets) each have access + to the combined free space of the entire pool. As blocks are + allocated from the pool, the space available to each file system + decreases. This approach avoids the common pitfall with + extensive partitioning where free space becomes fragmented + across the partitions. + + + + + + zpool + + A storage pool is the most + basic building block of ZFS. A pool + is made up of one or more vdevs, the underlying devices + that store the data. A pool is then used to create one + or more file systems (datasets) or block devices + (volumes). These datasets and volumes share the pool of + remaining free space. Each pool is uniquely identified + by a name and a GUID. The features + available are determined by the ZFS + version number on the pool. + + + &os; 9.0 and 9.1 include support for + ZFS version 28. Later versions + use ZFS version 5000 with feature + flags. The new feature flags system allows greater + cross-compatibility with other implementations of + ZFS. + + + + + + vdev Types + + A pool is made up of one or more vdevs, which + themselves can be a single disk or a group of disks, in + the case of a RAID transform. 
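+          As an illustrative sketch (the device names are
+          hypothetical), a pool built from two mirror vdevs could
+          be created with:
+
+          &prompt.root; zpool create mypool mirror ada1 ada2 mirror ada3 ada4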
When + multiple vdevs are used, ZFS spreads + data across the vdevs to increase performance and + maximize usable space. + + + + Disk + - The most basic type of vdev is a standard block + device. This can be an entire disk (such as + /dev/ada0 + or + /dev/da0) + or a partition + (/dev/ada0p3). + On &os;, there is no performance penalty for using + a partition rather than the entire disk. This + differs from recommendations made by the Solaris + documentation. + + + + File + - In addition to disks, ZFS + pools can be backed by regular files, this is + especially useful for testing and experimentation. + Use the full path to the file as the device path + in the zpool create command. All vdevs must be + at least 128 MB in size. + + + + Mirror + - When creating a mirror, specify the + mirror keyword followed by the + list of member devices for the mirror. A mirror + consists of two or more devices, all data will be + written to all member devices. A mirror vdev will + only hold as much data as its smallest member. A + mirror vdev can withstand the failure of all but + one of its members without losing any data. + + + A regular single disk vdev can be upgraded + to a mirror vdev at any time with + zpool + attach. + + + + + RAID-Z + - ZFS implements + RAID-Z, a variation on standard + RAID-5 that offers better + distribution of parity and eliminates the + RAID-5 write + hole in which the data and parity + information become inconsistent after an + unexpected restart. ZFS + supports three levels of RAID-Z + which provide varying levels of redundancy in + exchange for decreasing levels of usable storage. + The types are named RAID-Z1 + through RAID-Z3 based on the + number of parity devices in the array and the + number of disks which can fail while the pool + remains operational. + + In a RAID-Z1 configuration + with four disks, each 1 TB, usable storage is + 3 TB and the pool will still be able to + operate in degraded mode with one faulted disk. + If an additional disk goes offline before the + faulted disk is replaced and resilvered, all data + in the pool can be lost. + + In a RAID-Z3 configuration + with eight disks of 1 TB, the volume will + provide 5 TB of usable space and still be + able to operate with three faulted disks. &sun; + recommends no more than nine disks in a single + vdev. If the configuration has more disks, it is + recommended to divide them into separate vdevs and + the pool data will be striped across them. + + A configuration of two + RAID-Z2 vdevs consisting of 8 + disks each would create something similar to a + RAID-60 array. A + RAID-Z group's storage capacity + is approximately the size of the smallest disk + multiplied by the number of non-parity disks. + Four 1 TB disks in RAID-Z1 + has an effective size of approximately 3 TB, + and an array of eight 1 TB disks in + RAID-Z3 will yield 5 TB of + usable space. + + + + Spare + - ZFS has a special pseudo-vdev + type for keeping track of available hot spares. + Note that installed hot spares are not deployed + automatically; they must manually be configured to + replace the failed device using + zfs replace. + + + + Log + - ZFS Log Devices, also known + as ZFS Intent Log (ZIL) + move the intent log from the regular pool devices + to a dedicated device, typically an + SSD. Having a dedicated log + device can significantly improve the performance + of applications with a high volume of synchronous + writes, especially databases. Log devices can be + mirrored, but RAID-Z is not + supported. 
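+              As a sketch with hypothetical pool and device
+              names, a mirrored log vdev could be added to an
+              existing pool with:
+
+              &prompt.root; zpool add mypool log mirror ada1p2 ada2p2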
If multiple log devices are used,
+              writes will be load balanced across them.
+
+
+
+              Cache
+              - Adding a cache vdev to a zpool will add the
+              storage of the cache to the L2ARC.
+              Cache devices cannot be mirrored. Since a cache
+              device only stores additional copies of existing
+              data, there is no risk of data loss.
+
+
+
+
+
+          Transaction Group
+          (TXG)
+
+          Transaction Groups are the way changed blocks are
+          grouped together and eventually written to the pool.
+          Transaction groups are the atomic unit that
+          ZFS uses to assert consistency. Each
+          transaction group is assigned a unique 64-bit
+          consecutive identifier. There can be up to three active
+          transaction groups at a time, one in each of these three
+          states:
+
+
+
+              Open - When a new
+              transaction group is created, it is in the open
+              state, and accepts new writes. There is always
+              a transaction group in the open state; however, the
+              transaction group may refuse new writes if it has
+              reached a limit. Once the open transaction group
+              has reached a limit, or the vfs.zfs.txg.timeout
+              has been reached, the transaction group advances
+              to the next state.
+
+
+
+              Quiescing - A short state
+              that allows any pending operations to finish while
+              not blocking the creation of a new open
+              transaction group. Once all of the transactions
+              in the group have completed, the transaction group
+              advances to the final state.
+
+
+
+              Syncing - All of the data
+              in the transaction group is written to stable
+              storage. This process will in turn modify other
+              data, such as metadata and space maps, that will
+              also need to be written to stable storage. The
+              process of syncing involves multiple passes. The
+              first and biggest pass writes all of the changed
+              data blocks; it is followed by the metadata, which
+              may take multiple passes to complete. Since
+              allocating space for the data blocks generates new
+              metadata, the syncing state cannot finish until a
+              pass completes that does not allocate any additional
+              space. The syncing state is also where
+              synctasks are completed.
+              Synctasks are administrative operations, such as
+              creating or destroying snapshots and datasets, that
+              modify the uberblock. Once the sync state is
+              complete, the transaction group in the quiescing
+              state is advanced to the syncing state.
+
+
+
+          All administrative functions, such as taking snapshots,
+          are written as part of the transaction group. When a
+          synctask is created, it is added to the currently open
+          transaction group, and that group is advanced as quickly
+          as possible to the syncing state to reduce the
+          latency of administrative commands.
+
+
+
+          Adaptive Replacement
+          Cache (ARC)
+
+          ZFS uses an Adaptive Replacement
+          Cache (ARC), rather than a more
+          traditional Least Recently Used (LRU)
+          cache. An LRU cache is a simple list
+          of items in the cache, sorted by when each object was
+          most recently used. New items are added to the top of
+          the list. When the cache is full, items from the
+          bottom of the list are evicted to make room for more
+          active objects. An ARC consists of
+          four lists: the Most Recently Used
+          (MRU) and Most Frequently Used
+          (MFU) objects, plus a ghost list for
+          each. These ghost lists track recently evicted objects
+          to prevent them from being added back to the cache.
+          This increases the cache hit ratio by avoiding objects
+          that have a history of only being used occasionally.
+ Another advantage of using both an + MRU and MFU is + that scanning an entire file system would normally evict + all data from an MRU or + LRU cache in favor of this freshly + accessed content. With ZFS, there is + also an MFU that only tracks the most + frequently used objects, and the cache of the most + commonly accessed blocks remains. + + + + L2ARC + + L2ARC is the second level + of the ZFS caching system. The + primary ARC is stored in + RAM. Since the amount of + available RAM is often limited, + ZFS can also use + cache vdevs. + Solid State Disks (SSDs) are often + used as these cache devices due to their higher speed + and lower latency compared to traditional spinning + disks. L2ARC is entirely optional, + but having one will significantly increase read speeds + for files that are cached on the SSD + instead of having to be read from the regular disks. + L2ARC can also speed up deduplication + because a DDT that does not fit in + RAM but does fit in the + L2ARC will be much faster than a + DDT that must be read from disk. The + rate at which data is added to the cache devices is + limited to prevent prematurely wearing out + SSDs with too many writes. Until the + cache is full (the first block has been evicted to make + room), writing to the L2ARC is + limited to the sum of the write limit and the boost + limit, and afterwards limited to the write limit. A + pair of &man.sysctl.8; values control these rate limits. + vfs.zfs.l2arc_write_max + controls how many bytes are written to the cache per + second, while vfs.zfs.l2arc_write_boost + adds to this limit during the + Turbo Warmup Phase (Write Boost). + + + + ZIL + + ZIL accelerates synchronous + transactions by using storage devices like + SSDs that are faster than those used + in the main storage pool. When an application requests + a synchronous write (a guarantee that the data has been + safely stored to disk rather than merely cached to be + written later), the data is written to the faster + ZIL storage, then later flushed out + to the regular disks. This greatly reduces latency and + improves performance. Only synchronous workloads like + databases will benefit from a ZIL. + Regular asynchronous writes such as copying files will + not use the ZIL at all. + + + + Copy-On-Write + + Unlike a traditional file system, when data is + overwritten on ZFS, the new data is + written to a different block rather than overwriting the + old data in place. Only when this write is complete is + the metadata then updated to point to the new location. + In the event of a shorn write (a system crash or power + loss in the middle of writing a file), the entire + original contents of the file are still available and + the incomplete write is discarded. This also means that + ZFS does not require a &man.fsck.8; + after an unexpected shutdown. + + + + Dataset + + Dataset is the generic term + for a ZFS file system, volume, + snapshot or clone. Each dataset has a unique name in + the format + poolname/path@snapshot. + The root of the pool is technically a dataset as well. + Child datasets are named hierarchically like + directories. For example, + mypool/home, the home + dataset, is a child of mypool + and inherits properties from it. This can be expanded + further by creating + mypool/home/user. This + grandchild dataset will inherit properties from the + parent and grandparent. Properties on a child can be + set to override the defaults inherited from the parents + and grandparents. 
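+              For example, a property such as compression that is
+              inherited from mypool/home could be
+              overridden on the grandchild dataset with:
+
+              &prompt.root; zfs set compression=gzip mypool/home/user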
Administration of datasets and their + children can be + delegated. + + + + File system + + A ZFS dataset is most often used + as a file system. Like most other file systems, a + ZFS file system is mounted somewhere + in the systems directory hierarchy and contains files + and directories of its own with permissions, flags, and + other metadata. + + + + Volume + + In additional to regular file system datasets, + ZFS can also create volumes, which + are block devices. Volumes have many of the same + features, including copy-on-write, snapshots, clones, + and checksumming. Volumes can be useful for running + other file system formats on top of + ZFS, such as UFS + virtualization, or exporting iSCSI + extents. + + + + Snapshot + + The + copy-on-write + (COW) design of + ZFS allows for nearly instantaneous, + consistent snapshots with arbitrary names. After taking + a snapshot of a dataset, or a recursive snapshot of a + parent dataset that will include all child datasets, new + data is written to new blocks, but the old blocks are + not reclaimed as free space. The snapshot contains + the original version of the file system, and the live + file system contains any changes made since the snapshot + was taken. No additional space is used. As new data is + written to the live file system, new blocks are + allocated to store this data. The apparent size of the + snapshot will grow as the blocks are no longer used in + the live file system, but only in the snapshot. These + snapshots can be mounted read only to allow for the + recovery of previous versions of files. It is also + possible to + rollback a live + file system to a specific snapshot, undoing any changes + that took place after the snapshot was taken. Each + block in the pool has a reference counter which keeps + track of how many snapshots, clones, datasets, or + volumes make use of that block. As files and snapshots + are deleted, the reference count is decremented. When a + block is no longer referenced, it is reclaimed as free + space. Snapshots can also be marked with a + hold. When a + snapshot is held, any attempt to destroy it will return + an EBUSY error. Each snapshot can + have multiple holds, each with a unique name. The + release command + removes the hold so the snapshot can deleted. Snapshots + can be taken on volumes, but they can only be cloned or + rolled back, not mounted independently. + + + + Clone + + Snapshots can also be cloned. A clone is a + writable version of a snapshot, allowing the file system + to be forked as a new dataset. As with a snapshot, a + clone initially consumes no additional space. As + new data is written to a clone and new blocks are + allocated, the apparent size of the clone grows. When + blocks are overwritten in the cloned file system or + volume, the reference count on the previous block is + decremented. The snapshot upon which a clone is based + cannot be deleted because the clone depends on it. The + snapshot is the parent, and the clone is the child. + Clones can be promoted, reversing + this dependency and making the clone the parent and the + previous parent the child. This operation requires no + additional space. Because the amount of space used by + the parent and child is reversed, existing quotas and + reservations might be affected. + + + + Checksum + + Every block that is allocated is also checksummed. + The checksum algorithm used is a per-dataset property, + see set. 
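+              For example, to select the
+              sha256 algorithm on a dataset (the
+              dataset name here is hypothetical):
+
+              &prompt.root; zfs set checksum=sha256 mypool/mydata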
+ The checksum of each block is transparently validated as + it is read, allowing ZFS to detect + silent corruption. If the data that is read does not + match the expected checksum, ZFS will + attempt to recover the data from any available + redundancy, like mirrors or RAID-Z). + Validation of all checksums can be triggered with scrub. + Checksum algorithms include: + + + + fletcher2 + + + + fletcher4 + + + + sha256 + + + + The fletcher algorithms are faster, + but sha256 is a strong cryptographic + hash and has a much lower chance of collisions at the + cost of some performance. Checksums can be disabled, + but it is not recommended. + + + + Compression + + Each dataset has a compression property, which + defaults to off. This property can be set to one of a + number of compression algorithms. This will cause all + new data that is written to the dataset to be + compressed. Beyond a reduction in space used, read and + write throughput often increases because fewer blocks + are read or written. + + + + LZ4 - + Added in ZFS pool version + 5000 (feature flags), LZ4 is + now the recommended compression algorithm. + LZ4 compresses approximately + 50% faster than LZJB when + operating on compressible data, and is over three + times faster when operating on uncompressible + data. LZ4 also decompresses + approximately 80% faster than + LZJB. On modern + CPUs, LZ4 + can often compress at over 500 MB/s, and + decompress at over 1.5 GB/s (per single CPU + core). + + + LZ4 compression is + only available after &os; 9.2. + + + + + LZJB - + The default compression algorithm. Created by + Jeff Bonwick (one of the original creators of + ZFS). LZJB + offers good compression with less + CPU overhead compared to + GZIP. In the future, the + default compression algorithm will likely change + to LZ4. + + + + GZIP - + A popular stream compression algorithm available + in ZFS. One of the main + advantages of using GZIP is its + configurable level of compression. When setting + the compress property, the + administrator can choose the level of compression, + ranging from gzip1, the lowest + level of compression, to gzip9, + the highest level of compression. This gives the + administrator control over how much + CPU time to trade for saved + disk space. + + + + ZLE - + Zero Length Encoding is a special compression + algorithm that only compresses continuous runs of + zeros. This compression algorithm is only useful + when the dataset contains large blocks of + zeros. + + + + + + Copies + + When set to a value greater than 1, the + copies property instructs + ZFS to maintain multiple copies of + each block in the + File System + or + Volume. Setting + this property on important datasets provides additional + redundancy from which to recover a block that does not + match its checksum. In pools without redundancy, the + copies feature is the only form of redundancy. The + copies feature can recover from a single bad sector or + other forms of minor corruption, but it does not protect + the pool from the loss of an entire disk. + + + + Deduplication + + Checksums make it possible to detect duplicate + blocks of data as they are written. With deduplication, + the reference count of an existing, identical block is + increased, saving storage space. To detect duplicate + blocks, a deduplication table (DDT) + is kept in memory. The table contains a list of unique + checksums, the location of those blocks, and a reference + count. When new data is written, the checksum is + calculated and compared to the list. 
If a match is + found, the existing block is used. The + SHA256 checksum algorithm is used + with deduplication to provide a secure cryptographic + hash. Deduplication is tunable. If + dedup is on, then + a matching checksum is assumed to mean that the data is + identical. If dedup is set to + verify, then the data in the two + blocks will be checked byte-for-byte to ensure it is + actually identical. If the data is not identical, the + hash collision will be noted and the two blocks will be + stored separately. Because DDT must + store the hash of each unique block, it consumes a very + large amount of memory. A general rule of thumb is + 5-6 GB of ram per 1 TB of deduplicated data). + In situations where it is not practical to have enough + RAM to keep the entire + DDT in memory, performance will + suffer greatly as the DDT must be + read from disk before each new block is written. + Deduplication can use L2ARC to store + the DDT, providing a middle ground + between fast system memory and slower disks. Consider + using compression instead, which often provides nearly + as much space savings without the additional memory + requirement. + + + + Scrub + + Instead of a consistency check like &man.fsck.8;, + ZFS has scrub. + scrub reads all data blocks stored on + the pool and verifies their checksums against the known + good checksums stored in the metadata. A periodic check + of all the data stored on the pool ensures the recovery + of any corrupted blocks before they are needed. A scrub + is not required after an unclean shutdown, but is + recommended at least once every three months. The + checksum of each block is verified as blocks are read + during normal use, but a scrub makes certain that even + infrequently used blocks are checked for silent + corruption. Data security is improved, especially in + archival storage situations. The relative priority of + scrub can be adjusted with vfs.zfs.scrub_delay + to prevent the scrub from degrading the performance of + other workloads on the pool. + + + + Dataset Quota + + ZFS provides very fast and + accurate dataset, user, and group space accounting in + addition to quotas and space reservations. This gives + the administrator fine grained control over how space is + allocated and allows space to be reserved for critical + file systems. + + ZFS supports different types of + quotas: the dataset quota, the reference + quota (refquota), the + user + quota, and the + group + quota. + + Quotas limit the amount of space that a dataset + and all of its descendants, including snapshots of the + dataset, child datasets, and the snapshots of those + datasets, can consume. + + + Quotas cannot be set on volumes, as the + volsize property acts as an + implicit quota. + + + + + Reference + Quota + + A reference quota limits the amount of space a + dataset can consume by enforcing a hard limit. However, + this hard limit includes only space that the dataset + references and does not include space used by + descendants, such as file systems or snapshots. + + + + User + Quota + + User quotas are useful to limit the amount of space + that can be used by the specified user. + + + + Group + Quota + + The group quota limits the amount of space that a + specified group can consume. + + + + Dataset + Reservation + + The reservation property makes + it possible to guarantee a minimum amount of space for a + specific dataset and its descendants. 
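+              For example, the 10 GB reservation described
+              below could be set with:
+
+              &prompt.root; zfs set reservation=10G storage/home/bob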
If a 10 GB + reservation is set on + storage/home/bob, and another + dataset tries to use all of the free space, at least + 10 GB of space is reserved for this dataset. If a + snapshot is taken of + storage/home/bob, the space used by + that snapshot is counted against the reservation. The + refreservation + property works in a similar way, but it + excludes descendants like + snapshots. + + Reservations of any sort are useful in many + situations, such as planning and testing the + suitability of disk space allocation in a new system, + or ensuring that enough space is available on file + systems for audio logs or system recovery procedures + and files. + + + + + Reference + Reservation + + The refreservation property + makes it possible to guarantee a minimum amount of + space for the use of a specific dataset + excluding its descendants. This + means that if a 10 GB reservation is set on + storage/home/bob, and another + dataset tries to use all of the free space, at least + 10 GB of space is reserved for this dataset. In + contrast to a regular + reservation, + space used by snapshots and decendant datasets is not + counted against the reservation. For example, if a + snapshot is taken of + storage/home/bob, enough disk space + must exist outside of the + refreservation amount for the + operation to succeed. Descendants of the main data set + are not counted in the refreservation + amount and so do not encroach on the space set. + + + + Resilver + + When a disk fails and is replaced, the new disk + must be filled with the data that was lost. The process + of using the parity information distributed across the + remaining drives to calculate and write the missing data + to the new drive is called + resilvering. + + + + Online + + A pool or vdev in the Online + state has all of its member devices connected and fully + operational. Individual devices in the + Online state are functioning + normally. + + + + Offline + + Individual devices can be put in an + Offline state by the administrator if + there is sufficient redundancy to avoid putting the pool + or vdev into a + Faulted state. + An administrator may choose to offline a disk in + preparation for replacing it, or to make it easier to + identify. + + + + Degraded + + A pool or vdev in the Degraded + state has one or more disks that have been disconnected + or have failed. The pool is still usable, but if + additional devices fail, the pool could become + unrecoverable. Reconnecting the missing devices or + replacing the failed disks will return the pool to an + Online state + after the reconnected or new device has completed the + Resilver + process. + + + + Faulted + + A pool or vdev in the Faulted + state is no longer operational. The data on it can no + longer be accessed. A pool or vdev enters the + Faulted state when the number of + missing or failed devices exceeds the level of + redundancy in the vdev. If missing devices can be + reconnected, the pool will return to a + Online state. If + there is insufficient redundancy to compensate for the + number of failed disks, then the contents of the pool + are lost and must be restored from backups. + + + + + +