Part1 of this blog presented an overview of the Windows storage stack.
Application programmers may use a number of interfaces to control the way their application data is cached or if they prefer, not cached and committed to disk media.
Applications open a handle to a resource such as a file or a volume using the CreateFile API. One of the parameters to the CreateFile API is the dwFlagsAndAttributes parameter, which can be any valid combination of file attributes and file flags.
To avoid caching at the file system layer, an application should specify a valid combination that includes FILE_FLAG_NO_BUFFERING in this dwFlagsAndAttributes parameter. Some notable consequences of specifying this flag include:
- Applications are expected to perform I/O that is in integer units of the volume sector size.
- FILE_FLAG_NO_BUFFERING only applies to application data – the file system may still cache file metadata. Data and metadata may be flushed to disk by using the FlushFileBuffers API
Some well known applications such as Microsoft SQL and the Microsoft JET database (ships with Windows Server SKUs) specify FILE_FLAG_NO_BUFFERING with the CreateFile API.
Alternatively, applications may call the CreateFile API making sure that the FILE_FLAG_NO_BUFFERING flag is cleared in the parameter dwFlagsAndAttributes. In this case, it is likely that the file system and cache manager will cooperate to cache the application data, but there is no guarantee the caching will actually occur. A number of factors including the file/volume open mode specified in CreateFile, the data access pattern, and the load on the system will affect whether a particular application I/O is cached or not.
Note: The FILE_FLAG_NO_BUFFERING only affects file/volume data and does not apply to file/volume metadata, which may be cached even though FILE_FLAG_NO_BUFFERING is set.
Another relevant flag in the CreateFile API is FILE_FLAG_WRITE_THROUGH. Specified by itself, this flag ensures that both file/volume data and file/volume metadata are immediately flushed to storage media. Note that this does not mean the data does not traverse the cache. Referring to Figure 1, FILE_FLAG_NO_BUFFERING may mean that the data is written to Cache Manager and then immediately flushed from there. So the I/O may still be a buffered I/O.
- Application developers favoring data integrity at the cost of reduced throughput should specify both FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH are set when invoking the CreateFile API.
- Application developers favoring speed should make sure these flags are cleared when invoking CreateFile.
- Application developers favoring a balance of throughput and data integrity need to read the section on FlushFileBuffers API.
The main issues with hardware are the FILE_FLAG_WRITE_THROUGH parameter and how an operating system can handle it.
SCSI protocols define a Force Unit Access (FUA) flag in the SCSI Request Block (SRB). Early versions of the SCSI protocol (circa 1997) defined FUA as optional, while later specifications have made it mandatory. In situations where it is imperative that data gets committed to media, ensure that the deployed storage hardware does support the Force Unit Access semantics and that this feature is not disabled.
In the enterprise world, NTFS has been deployed on a lot of non SCSI hardware which increasingly so includes SATA aka Serial ATA storage. The implementation of FUA in these devices is, at best, inconsistent. Even when implemented, the default is to turn it off, because of severe performance penalties.
Lower end PCs continue to use what is loosely termed IDE/ATAPI (ATA Parallel Interface which is ATA retroactively renamed) storage drives. Strictly speaking, this is more a family of protocols, rather than a single protocol. The ATAPI-4 specification is implemented in Windows 2000 and the ATAPI-5 specification in Windows Server 2003. Neither of these have any equivalent to the Force Unit Access semantic of SCSI. This is a long understood problem and Microsoft proposed in 2002 that the relevant standard be revised.
Note: The conclusion is that there is a possibility of data corruption due to drives caching data, especially so in ATA, IDE, ATAPI, and SATA devices, to name a few. This problem exists for other popular operating systems such as the Apple OS X and Linux as well. See the “FlushFileBuffers” section within this document. Also be sure to read the “NTFS” section further ahead in this blog.
The FlushFileBuffers API can be used to flush all the outstanding data and metadata on a single file or a whole volume. However, frequent use of this API can cause reduced throughput. Internally, Windows uses the SCSI Synchronize Cache or the IDE/ATAPI Flush cache commands.
- Application developers desiring a combination of speed and data integrity can
- Specify ~FILE_FLAG_NO_BUFFERING and ~FILE_FLAG_WRITE_THROUGH when invoking CreateFile
- Write data as needed using an appropriate API such as WriteFile
- Periodically call FlushFileBuffers to commit the data and meatadata to storage media – the exact period at which this occurs is application specific
- Application developers favoring data integrity at the cost of reduced throughput should make judicious use of this API, especially so when they specify only FILE_FLAG_NO_BUFFERING while invoking the CreateFile API. In other words, these applications are using FILE_FLAG_NO_BUFFERING to ensure data is committed to storage media and using FlushFileBuffers to ensure metadata is committed to storage media.
- Application developers favoring pure throughput at the risk of potential data corruption should never use FlushFileBuffers API.
- As reference 14 describes, FlushFileBuffers can be used to mitigate the hardware not supporting write-Through
Liberal use of the FlushFileBuffers API can severely affect system throughput. This is because at the file system layer, it is quite clear what data blocks belong to what file. So when FlushFileBuffers is invoked, it is also apparent what data buffers need to be written out to media. However, at the block storage layer – shown as “Sector I/O” in Figure 1, it is difficult to track what blocks are associated with what files. Consequently the only way to honor any FlushFileBuffers call is to make sure all data is flushed to media. Therefore, not only is more data written out than originally intended, but the larger amount of data can affect other I/O optimizations such as queuing of the writes.
There is also a bright side to this picture. While it is true that FlushFileBuffers, if handled properly by all involved layers, will flush all data and cause performance degradation, it can also help in preserving data integrity. An application that “forgets” to invoke FlushFileBuffers will still have its data committed to media due to other applications invoking this API.
FlushFileBuffers should be judiciously used as needed.
The SCSI-3 protocol defines a command SYNCHRONIZE_CACHE that commits data from the storage cache to media. Hence SCSI devices are good candidates for applications that are highly sensitive to data being committed to media. However, it is always good practice to verify that a particular SCSI hardware does implement the SYNCHRONIZE_CACHE command.
The relevant ATA/IDE specifications define an equivalent command FLUSH_CACHE. ATAPI-4 (relevant circa 2000) defines the FLUSH_CACHE command as optional; ATAPI-5 makes this command mandatory. As always, verify the exact functionality of a particular hardware and do not assume it implements the relevant standard precisely.
Client/Server application considerations
Both the Common Internet File Systems (CIFS) and the new SMB 2.0 protocols define a flush SMB command. Therefore the CIFS/SMB redirector built into Windows can easily propagate a FlushFileBuffers command over to a file server.
While data may be cached at the file system layer, it can (and does) occur at other layers such as the disk block level. For the purposes of this discussion, caching done within a disk or a disk controller is referred to as disk block caching.
Storage devices cache data to enhance throughput, but sometimes at the cost of data integrity. Some storage devices include their own battery backup to enhance data integrity.
Further increasing ambiguity is the fact that there is no standard manner to determine whether such caching is indeed occurring or not. Some storage systems have their own battery backed cache and hence under many circumstances, it may be OK to leave the data in this battery backed cached and treat it as being committed to media. Some of these storage systems ignore the command to commit data to media.
The DeviceIOControl API can be used to inspect block storage caching configurations and also potentially set these configurations. Both the caching policy inspection and setting have limitations as well.
Windows offers a number of ways for applications to programmatically affect the caching at the sector I/O level. Many of these interfaces consist of calling the DeviceIOControl API with some different function code. Windows NT 4.0 SP4 and higher versions of Windows require administrator privileges to submit SCSI pass-through requests. SCSI pass-through requests are submitted via the WIN32 DeviceIOControl API, which requires a handle to a file or a volume as a parameter. This handle is obtained via the CreateFile API and starting with Windows NT 4.0 SP4, the CreateFile function requires GENERIC_READ and GENERIC_WRITE to be specified in the dwDesiredAccess parameter of the CreateFile API.
Windows defines a number of IOCTL codes that may be used to inspect and control the disk block caching functionality. These include:
Returns information about whether the disk cache is enabled or not. The function works only where the disk returns correct information via a SCSI mode page.
-Sets disk caching functionality to be enabled or disabled. The function works only where the disk implements SCSI mode pages.
Windows Vista and higher versions of Windows support another interface to retrieve disk cache properties.
- A single interface that applications can code to – in the absence of this single interface, applications would need to understand the nuances of various devices e.g. for 1394 devices, obtain page 6 of the mode data, but for SCSI compliant devices, obtain page 8 of the mode data, etc.
- Windows implements the necessary details to retrieve the information in a transport dependent manner (SCSI/IDE/etc) to surface the information
- A richer interface that can provide information not just whether the disk implements a cache, but also what kind of cache
- Time will tell whether this interface continues to be evolve and be populated with more data
- Yet another interface that applications need to code to
- Does not work for RAID devices
- Does not work for Flash drives
- Not yet widely implemented by storage vendors
- Only available in the newer versions of Windows
- Does not (yet at least) allow setting of caching properties
For an application to retrieve information about a device’s write cache property, use the STORAGE_WRITE_CACHE_PROPERTY structure with the IOCTL_STORAGE_QUERY_PROPERTY request.
NTFS and flushing behavior
NTFS, when introduced with Windows NT 3.X depended upon the SCSI Forced Unit Access behavior to ensure its meta data is flushed to media. As described above, NTFS has been deployed on a fair number of non SCSI devices, all of which have an inconsistent implementation of FUA. Perhaps the saving grace may have been that the consumer devices do not have a cache, and hence even without FUA, the data hits the media. In any case, what is worth noting is that all NTFS versions upto and including Windows 7 and Windows 2008 R2 depend upon FUA to ensure that NTFS. NTFS in Windows 8 has switched to using the FlushFileBuffers API instead of depending upon the Forced Unit Access behavior.
This concludes an overview of the knobs an application developer can twist and turn to influence application data write caching. Part 3 of this blog will present an overview of the knobs a system administrator can twist and turn to influence write data caching.