Submitted by: Emily Boyd (emilyboyd at emilyboyd dot com) Sponsored by: Google Summer of Code 2005
		
			
				
	
	
		
			285 lines
		
	
	
	
		
			11 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
			
		
		
	
	
			285 lines
		
	
	
	
		
			11 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" [
 | |
| <!ENTITY base CDATA "../..">
 | |
| <!ENTITY date "$FreeBSD: www/en/projects/bigdisk/index.sgml,v 1.10 2005/02/16 23:56:09 brueffer Exp $">
 | |
| <!ENTITY title "Large data storage in FreeBSD">
 | |
| <!ENTITY % navincludes SYSTEM "../../includes.navdevelopers.sgml"> %navincludes;
 | |
| <!ENTITY % includes SYSTEM "../../includes.sgml"> %includes;
 | |
| 
 | |
| <!-- Status levels -->
 | |
| <!ENTITY status.na "<font color=green>N/A</font>">
 | |
| <!ENTITY status.done "<font color=green>Done</font>">
 | |
| <!ENTITY status.wip "<font color=blue>In progress</font>">
 | |
| <!ENTITY status.untested "<font color=orange>Needs testing</font>">
 | |
| <!ENTITY status.new "<font color=red>Not done</font>">
 | |
| <!ENTITY status.unknown "<font color=red>Unknown</font>">
 | |
| 
 | |
| <!-- The list of contributors was moved to a separate file so that it can
 | |
|   be used by other documents in the FreeBSD web site. -->
 | |
| 
 | |
| <!ENTITY % developers SYSTEM "../../developers.sgml"> %developers;
 | |
| 
 | |
| ]>
 | |
| 
 | |
| <html>
 | |
|   &header;
 | |
| 
 | |
|     <h2>Contents</h2>
 | |
|     <ul>
 | |
|       <li><a href="#background">Purpose and background</a></li>
 | |
|       <li><a href="#testing">Testing large capacities</a></li>
 | |
|       <li><a href="#userland">Userland Tools Status</a></li>
 | |
|       <li><a href="#ifnet-status">Kernel Driver Status</a></li>
 | |
|     </ul>
 | |
| 
 | |
|     <a name="background"></a>
 | |
|     <h2>Purpose and background</h2>
 | |
|     <h3>The UFS filesystem</h3>
 | |
|     <p>When the UFS filesystem was introduced to BSD in 1982, its use of 32
 | |
|       bit offsets and counters to address the storage was considered to be
 | |
|       ahead of its time.  Since most fixed-disk storage devices use 512 byte
 | |
|       sectors, 32 bits allowed for 2 Terabytes of storage.  That was an almost
 | |
|       un-imaginable quantity for the time.  But now that 250 and 400 Gigabyte
 | |
|       disks are available at consumer prices, it's trivial to build a hardware
 | |
|       or software based storage array that can exceed 2TB for a few thousand
 | |
|       dollars.</p>
 | |
| 
 | |
|     <p>The UFS2 filesystem was introduced in 2003 as a replacement to the
 | |
|       original UFS and provides 64 bit counters and offsets.  This allows for
 | |
|       files and filesystems to grow to 2^73 bytes (2^64 * 512) in size and
 | |
|       hopefully be sufficient for quite a long time.  UFS2 largely solved
 | |
|       the storage size limits imposed by the filesystem.  Unfortunately, many
 | |
|       tools and storage mechanisms still use or assume 32 bit values, often
 | |
|       keeping FreeBSD limited to 2TB.</p>
 | |
| 
 | |
|     <p>We need to ensure that FreeBSD supports large storage sizes and that
 | |
|       the benefits of UFS2 can actually be realized so that FreeBSD can remain
 | |
|       relevant in the enterprise world.  This page describes known issues and
 | |
|       limits and provides a focus for further auditing, validation, and
 | |
|       fixing.</p>
 | |
| 
 | |
|     <h3>Limits on disk partitioning</h3>
 | |
|     <p>The first limit that is encountered is in disk partitioning.  For x86
 | |
|       and amd64 PC's, the FDISK MBR table is used by the BIOS to partition the
 | |
|       disk into logical extents and identify which partition ('slice' in FreeBSD
 | |
|       terms) to boot from.  The MBR is defined to use 32 bit disk offsets,
 | |
|       and since it's an industry standard and interoperability is required,
 | |
|       there is nothing that can be done to change this.  As long as booting a
 | |
|       PC requires the MBR, the boot slice in FreeBSD is going to be limited to
 | |
|       2TB.</p>
 | |
| 
 | |
|     <p>The GPT partitioning scheme was introduced with the ia64 architecture
 | |
|       as an MBR replacement.  It provides 64 bit offsets and allows for an
 | |
|       arbitrary number of partitions.  It also provides a compatibility mode
 | |
|       with MBR where it can generate an MBR-compatible structure on the disk
 | |
|       for use with systems that don't understand GPT.  However, to get the
 | |
|       full benefits for boot storage, the BIOS and the FreeBSD loader must
 | |
|       understand it.  For secondary storage, GPT can be used by any
 | |
|       architecture regardless of BIOS or boot support.</p>
 | |
| 
 | |
|     <p>Many systems don't require an MBR or GPT, and even PCs don't require it
 | |
|       if booting and inter-operating with other OS's is not required.  The next
 | |
|       limit that comes in, though, is with the BSD disklabel.  This label
 | |
|       defines up to 8 partitions on a disk, MBR slice, or other storage extent
 | |
|       for filesystems and swap space.  Unfortunately, the on-disk format of the
 | |
|       disk label again uses 32 bit quantities, so it is also limited to 2TB.
 | |
|       Fixing this would require creating a new format that is incompatible
 | |
|       with the old and would require an update to the FreeBSD boot loader.
 | |
|       This would complicate interoperability and the upgrade path.  Also, if a
 | |
|       new format is going to be created, it should also address the 8 partition
 | |
|       limit that exists now.  Given these requirements, it's tempting to just
 | |
|       adopt the GPT format instead for secondary storage partitioning.</p>
 | |
| 
 | |
|     <a name="testing"></a>
 | |
|     <h2>Testing large capacities</h2>
 | |
|     <p>Even though large drives are cheap, it still isn't always feasible or
 | |
|       economical to test on real hardware.  Swap-backed memory disks, via the
 | |
|       md(4) driver, can provide a good substitute for some of the testing.
 | |
|       Backing with swap means that only the pages that are dirtied by data
 | |
|       are actually allocated, so a multi-terabyte storage can be simulated
 | |
|       with a minimal amount of physical RAM+swap.  Note that this is less true with
 | |
|       UFS1 since it will initialize all of the inode blocks during newfs,
 | |
|       which will dirty quite a bit of data.  But for UFS2, swap-backed md
 | |
|       has the potential for working well.  Unfortunately, the kernel md driver
 | |
|       has a number of 32-bit size limits of its own that need to be fixed.
 | |
|       Details are provided below.</p>
 | |
| 
 | |
|     <p>It is still possible to avoid disklabels and MBRs for testing by
 | |
|       using newfs directly on the raw disk or md disk.  Sysinstall can be
 | |
|       tested from a running system by just selecting Expert mode and just
 | |
|       performing the MBR and disklabel steps.  Beware that sysinstall might
 | |
|       have other bugs that will wipe out your existing system, so care must
 | |
|       be taken here!</p>
 | |
| 
 | |
|     <a name="userland"></a>
 | |
|     <h2>Userland Tool Status</h2>
 | |
| 
 | |
|     <p>The following userland tools need auditing and testing for 64-bit
 | |
|       cleanliness:</p>
 | |
| 
 | |
|     <table class="tblbasic">
 | |
|       <tr>
 | |
| 	<th> Task </th>
 | |
| 	<th> Responsible </th>
 | |
| 	<th> Last updated </th>
 | |
| 	<th> Status </th>
 | |
| 	<th> Details </th>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>newfs</td>
 | |
|         <td>&a.pjd;</td>
 | |
|         <td>19 Sept 2004</td>
 | |
|         <td>&status.done;</td>
 | |
| 	<td>Handling of '-s' option was fixed. Newfs should be now fully usable
 | |
|           for large file systems.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>df</td>
 | |
| 	<td> </td>
 | |
| 	<td> </td>
 | |
| 	<td>&status.new;</td>
 | |
| 	<td>An audit is needed to make sure that all reported fields are
 | |
|           64-bit clean.  There are reports with certain fields being incorrect
 | |
|           or negative with NFS volumes, which could either be an NFS or df
 | |
|           problem.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>du</td>
 | |
|         <td>&a.pjd;</td>
 | |
|         <td>7 Jan 2005</td>
 | |
|         <td>&status.done;</td>
 | |
| 	<td>Big files/directories handling was broken. It was fixed and du
 | |
| 	  should be now fully usable on large file systems with large
 | |
|           files/directories.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>growfs</td>
 | |
| 	<td>&a.scottl;</td>
 | |
| 	<td>12 Sept 2004</td>
 | |
| 	<td>&status.wip;</td>
 | |
| 	<td>Growfs has problems with expanding to new cylinder groups.  It also
 | |
|           initializes UFS2 inode blocks instead of leaving them for lazy
 | |
|           initialization.  It also needs a 64-bit audit.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>sysinstall</td>
 | |
| 	<td> </td>
 | |
| 	<td> </td>
 | |
| 	<td>&status.new;</td>
 | |
| 	<td>A full audit is needed.  Reports exist of problems with >1TB
 | |
|           partitions.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>fsck_ffs</td>
 | |
| 	<td>&a.pb;</td>
 | |
| 	<td>15 Jan 2005</td>
 | |
| 	<td>&status.wip;</td>
 | |
| 	<td>A full audit is needed. At least some printf format changes are
 | |
|           necessary.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>dump/restore</td>
 | |
| 	<td> </td>
 | |
| 	<td> </td>
 | |
| 	<td>&status.new;</td>
 | |
| 	<td>A full audit is needed. At least some printf format changes are
 | |
|           necessary in dump(8).</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>fsdb</td>
 | |
| 	<td> </td>
 | |
| 	<td> </td>
 | |
| 	<td>&status.new;</td>
 | |
| 	<td>A full audit is needed. At least some printf format changes are
 | |
|           necessary.</td>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
| 	<td>quota tools</td>
 | |
| 	<td> </td>
 | |
| 	<td> </td>
 | |
| 	<td>&status.new;</td>
 | |
| 	<td>Extensive changes are need. Disk quotas are currently
 | |
| 	  handled as 32-bit quantities, which limits the maximum
 | |
| 	  possible quota at 2TB. Two tasks are needed: 1) have the
 | |
| 	  current tools (kernel+userland, edquota for example) fail
 | |
| 	  gracefully when presented with 64-bit quantities and 2)
 | |
| 	  extend the quota file format and tools to 64-bit while
 | |
| 	  providing a compatibility mode and/or migration tools.</td>
 | |
|       </tr>
 | |
| 
 | |
|     </table>
 | |
| 
 | |
|     <a name="Kernel"></a>
 | |
|     <h2>Kernel Driver Status</h2>
 | |
| 
 | |
|     <p>Many storage peripherals simply are not designed to handle >2TB
 | |
|       capacities.  For those that are, an audit should be done to verify
 | |
|       that their drivers handle the sizes correctly and pass those sizes
 | |
|       correctly to the rest of the kernel.</p>
 | |
| 
 | |
|     <table class="tblbasic">
 | |
|       <tr>
 | |
| 	<th> Task </th>
 | |
| 	<th> Responsible </th>
 | |
| 	<th> Last updated </th>
 | |
| 	<th> Status </th>
 | |
| 	<th> Details </th>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
|         <td>md</td>
 | |
|         <td>&a.pjd;</td>
 | |
|         <td>17 Sept 2004</td>
 | |
|         <td>&status.done;</td>
 | |
|         <td>Swap backed disks can now be created up to 16TB in size on i386.
 | |
|           This corresponds to 2^32*4096.</td>
 | |
|       </tr>
 | |
|     </table>
 | |
| 
 | |
|     <h2>Subsystem Status</h2>
 | |
|     <p>Some filesystem-related subsystems require testing with >2TB volumes, or
 | |
|       need to be adapted. The following areas have been identified:</p>
 | |
| 
 | |
|     <table class="tblbasic">
 | |
|       <tr>
 | |
| 	<th> Task </th>
 | |
| 	<th> Responsible </th>
 | |
| 	<th> Last updated </th>
 | |
| 	<th> Status </th>
 | |
| 	<th> Details </th>
 | |
|       </tr>
 | |
| 
 | |
|       <tr>
 | |
|         <td>snapshots</td>
 | |
|         <td>&a.pb;</td>
 | |
|         <td>15 Jan 2004</td>
 | |
|         <td>&status.wip;</td>
 | |
|         <td>Taking snapshots fails on filesystems >2TB, returning EFBIG
 | |
| 	   (on a 5TB filesystem) and subsequently crashing the system in
 | |
| 	   softupdates.</td>
 | |
|       </tr>
 | |
|       <tr>
 | |
|         <td>quotas</td>
 | |
|         <td> </td>
 | |
|         <td> </td>
 | |
|         <td>&status.new;</td>
 | |
| 	<td>The quota subsystem handles 32-bit quantities, which
 | |
| 	  limits quotas to 2TB. Blockings of the syncer have been
 | |
| 	  observed while attempting to set quotas over that limit
 | |
| 	  (try 4000000000 KBytes as a hard limit in edquota(8) for
 | |
| 	  some uid, then create somes files owned by that uid). See
 | |
| 	  also the userland entry for quota tools.</td>
 | |
|       </tr>
 | |
|     </table>
 | |
| 
 | |
|   &footer;
 | |
|   </body>
 | |
| </html>
 |