Category Archives: Windows 8

SMB 3 NAS is preferable to DAS in a Windows environment

Microsoft is investing heavily in the Network Attached Storage (NAS) protocol SMB 3 and is clearly laying out a road map that suggests NAS is the future as opposed to Direct Attached Storage (DAS). Consider:

  • SQL Server 2012 system databases and user databases, as well as Hyper-V 2012 workloads, can be placed on NAS, provided the NAS supports SMB 3!
  • Microsoft made significant speed improvements in the SMB 3 client and server to have NAS achieve 97% of the speed of DAS, and this is without hardware acceleration.
  • Microsoft invested in SMB 3 Multi Channel, which aggregates bandwidth at the SMB 3 protocol layer by running parallel TCP channels across multiple NICs. Multi Channel is all about speed AND reliability: when one channel fails, the failed I/Os are seamlessly moved to a different TCP channel.
  • Continuing on the speed theme, Microsoft invested in RDMA support via SMB Direct, which requires not just SMB 3, but also SMB 3 Multi Channel. The maximum IOPS on a Windows system is achieved when using SMB 3 NAS with SMB Direct support, NOT with DAS!
  • Going back to the reliability theme, SMB 3 includes support for Persistent Handles, which, combined with the Witness Protocol, ensures that applications such as SQL, Exchange, and Hyper-V never see an I/O failure, and that the I/O is seamlessly moved to a different node as needed. This only works with SMB 3 NAS, and does NOT work with DAS!
  • I have been asked numerous times “But Microsoft has invested in Storage Spaces and Tiering where data is moved between SSD and spinning media to optimize performance. Does that not indicate Microsoft advocates DAS?” And my answer has always been “Storage Spaces is even more valuable when used as the storage backing a Windows Server 2012/R2 NAS!” Using Storage Spaces does not mean one has to abandon NAS.
  • Microsoft supports deduplication of VDI VMs, but the only supported configuration is with the VDI VM files residing on an SMB 3 based Windows Server 2012 R2 based NAS! (and not with DAS!)
  • To provide examples of other Microsoft efforts leveraging SMB 3, consider using the simple “copy” or “xcopy” command to copy a file that is several GB in size. Microsoft changed the CopyFileEx API to leverage all SMB 3 features including SMB 3 credits, SMB 3 Multi Channel, and SMB Direct (RDMA) to ensure the file copy is as fast as possible.
  • The Microsoft Hyper-V team re-wrote live migration in Hyper-V 2012 R2 to leverage SMB 3. While migrating a VM, Hyper-V 2012 set up its own TCP channel to copy the VM RAM. Hyper-V 2012 R2 uses SMB 3, and thereby gets the speed/reliability improvements of SMB 3 while doing the same copy.

Backup performance and SMB 3 Multi Channel

In this day and age of exploding data volumes, backup and restore is increasingly important, and is also becoming more common and taken for granted. But not all backup “target systems”, i.e. the systems to which data is backed up, are created equal. This is especially so when the system being backed up is Windows based.

  1. If your backup target system is based upon CIFS (also sometimes referred to as SMB 1), backup (and restore) is limited to 64 KB serial I/O. In other words, the backup/restore software does a 64 KB I/O, waits for the I/O to complete, and only then issues the next I/O. In fact it is worse than this. The total payload is limited to 64 KB, and hence well behaved apps that want to perform I/O in a 4 MB block size will only use a 60 KB payload (data).
  2. If your backup target system is running SMB 2.0, the I/O is 1 MB serial, which is certainly an improvement.
  3. If your backup target is SMB 2.1, the I/O is again 1 MB, but an SMB 2.1 server issues multiple credits, which means the client can issue multiple I/Os without having to wait for any one of the I/Os to complete. A typical Windows to Windows flow will show ten 1 MB I/Os on the wire at the same time. Note that this is all on a single TCP channel. So the backup/restore speed is significantly higher.
  4. Now recall that in most cases, BOTH the system being backed up AND the backup target are servers. For example, you could be backing up a file server or SQL server or Hyper-V server, and of course, the backup target also typically operates as a NAS (file server). Thus it is very likely that at least one of the two has multiple NICs. If either end (or both ends) of an SMB 3 connection has multiple NICs, and provided these NICs are 10 GbE and RSS capable (which are fairly cheap now), SMB 3 Multi Channel will kick in. SMB 3 Multi Channel establishes multiple TCP channels and engages multiple credits on each TCP channel. So with just 2 TCP channels, you could now have 20 MB of I/O in flight at any given moment.
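
The progression in the list above can be put into rough numbers. What follows is a minimal Python sketch rather than a measurement: the I/O sizes and the figure of ten outstanding I/Os per channel are taken from the text, while the table and function names are hypothetical and exist only for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

# (I/O size in bytes, I/Os outstanding per TCP channel, Multi Channel capable)
# The values mirror the list above; 10 is the "typical" credit count quoted there.
DIALECTS = {
    "SMB 1":   (64 * KIB, 1, False),
    "SMB 2.0": (1 * MIB, 1, False),
    "SMB 2.1": (1 * MIB, 10, False),
    "SMB 3":   (1 * MIB, 10, True),
}


def max_bytes_in_flight(dialect, channels=1):
    """Upper bound on data in flight for one backup stream."""
    io_size, outstanding, multichannel = DIALECTS[dialect]
    if not multichannel:
        channels = 1  # older dialects stay on a single TCP channel
    return io_size * outstanding * channels


if __name__ == "__main__":
    for name in DIALECTS:
        in_flight = max_bytes_in_flight(name, channels=2)
        print(f"{name}: {in_flight // KIB} KiB in flight with 2 NICs")
```

The point of the sketch is only that the ceiling scales with credits times channels, which is why asking a vendor specifically about Multi Channel matters.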

In short, if Windows, and especially Windows Server 2012, is part of your IT environment (or planned environment), make sure your backup target has an upgrade path to SMB 3! And don’t be fooled by just the SMB 3 label! Ask your vendor if it supports SMB 3 Multi Channel. The SMB 3 protocol allows a storage device to negotiate SMB 3, but not support SMB 3 Multi Channel!

Wishing you higher backup/restore speeds with SMB 3 Multi Channel!

Windows Write Caching – Part 3 – An Overview For System Administrators

The Windows Cache Manager (also referred to as the System Cache) acts as a single system-wide cache that holds code and data for both drivers and user mode applications. While an application can make API calls in a manner that guarantees the application data bypasses this cache, there is no way for an application to guarantee that its data WILL be cached. Because the behavior of the cache depends upon a number of factors and is very often not repeatable, the application and system administrator can only increase the likelihood that the application data will be cached. In other words, executing the same program multiple times is very likely to result in slightly different cache behavior each time. This is part of the reason why applications such as Microsoft SQL and Microsoft Exchange bypass the System Cache.

To illustrate the complexities involved, consider the seemingly simple act of copying a file from one volume to another. Some, but not all, of these complexities were originally described in References 8 and 9.

  • Either the source or the destination volume may be a local volume or a network volume
  • The access speed for the source and destination volumes may either be the same, or one may be significantly slower than the other. Further, the access speed can change depending upon a variety of factors such as network load, system load in terms of other application execution, resource usage e.g. I/O may switch from being cached to non cached and vice versa.
  • The optimum I/O size for the source and destination volumes may be either the same or significantly different
  • If both the source and destination volumes are on a Windows system, then the System Cache is involved in both reads and writes
  • The Windows team has spent a considerable amount of resources fine tuning the CopyFile and CopyFileEx APIs. Details are described in Reference 8, but the lesson to take away is the complexity of the issue and that further changes are probably forthcoming

Applications may

  • Use the CopyFile or CopyFileEx APIs and utilize the system cache
  • Use the CreateFile, ReadFile, WriteFile APIs and utilize the system cache for the source only, destination only, or both, or none.

Once you combine all of the various permutations and combinations offered by the above mentioned elements, the following situations can and do occur, when large files are being copied:

  • The system cache on the computer hosting the source file gets filled to a large extent with data from the source file. At the very least, this will affect other programs executing on that system.
  • The system cache on the computer system hosting the destination file gets filled with data for the destination file. This occurs fairly often since in the beginning, all of the destination file data is cached and thus writes appear to complete quickly. Once the destination file system cache hits a limit, disk writes (for flushing that cache) may occur slowly because the disk subsystem may be relatively slow
  • To complicate matters further, even when the data is flushed from system cache, it may be cached inside the block storage device (storage array)

When suspecting problems that may involve the System Cache, an administrator can

  • Inspect the application being used and switch to using a different application that explicitly does not use the System Cache. The Microsoft Server Performance Team Blog (Reference 7) explicitly suggests using Microsoft Exchange EseUtil as a file copy tool. The legal implications of using software shipped with Microsoft Exchange on a regular file server are beyond the scope of this document and best decided by your legal department
  • Use some other means to affect the System Cache, e.g. use some other application that will consume the System Cache, but not otherwise unduly load the system.
  • Attempt to administer the System Cache behavior utilizing built-in utilities and/or registry keys

System caching can be controlled using administrative utilities and/or a registry key.

To change the setting by editing the registry – as always beware of making registry changes and do so at your own risk – edit the registry key

HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\LargeSystemCache

By default this DWORD is set to one (enabled) on Server SKUs and to zero (disabled)  on desktop SKUs.
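
Before changing the value, it can help to simply read it. The following is a minimal Python sketch that uses the standard library `winreg` module to query the key named above. The function name is hypothetical, and the sketch returns `None` when it is not running on Windows or when the value cannot be read, rather than guessing at a default.

```python
import sys

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
VALUE_NAME = "LargeSystemCache"


def read_large_system_cache():
    """Return the LargeSystemCache DWORD, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows
    import winreg  # imported here so the sketch still loads elsewhere

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except OSError:
        return None  # value absent or access denied
    return value


if __name__ == "__main__":
    print("LargeSystemCache =", read_large_system_cache())
```

The sketch is read-only on purpose; the warning above about editing the registry still applies to any change.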

On Windows XP, Microsoft provides a GUI to make the same changes which is preferable to making these changes via registry edits. Figure 2 shows the GUI that results from launching the Control Panel System Applet and then clicking the Advanced Tab

wincachefig2

Figure 2 Windows XP Control Panel System Applet Advanced Tab

Figure 3 shows the resulting System Cache size adjustment GUI when the advanced tab is clicked in Figure 2 on a Windows XP system

WinCacheFig3

Figure 3 Windows XP Control Panel System Applet Advanced Tab to adjust System Cache Size

Note that this GUI to change the System Cache size has been removed in Windows Vista.

Windows Server 2003, Windows Vista, and Windows Server 2008 Block Storage Cache Administration

Recall the earlier explanation of the bug in previous versions of Windows that ignored application requests to ensure data/metadata got committed to storage media, and the subsequent fix made in Windows 2000 SP3 and also Windows XP SP2. To allow system administrators an informed choice, Microsoft made available a cache administration utility called DskCache.exe. This utility was only available by calling Microsoft PSS and could be obtained without incurring any monetary charge. To make it very clear that the DskCache utility should only be used in rare circumstances, Microsoft labeled it the “Power Protected Write Cache” and shipped it natively with Windows Server 2003 and higher versions of Windows. The new utility name emphasizes that it should be used only when the administrator is sure that the disk storage cache has a battery backup to ensure data integrity.

For Windows Server 2003 and higher versions of Windows, Microsoft has provided the equivalent of the DskCache.exe tool built into Windows. To use this feature:

  • Start Device Manager
  • Select the drive for which you wish to administer the caching policy
  • Select Properties
  • Click on Policies tab
  • Look for the option  “Enable write caching on the disk”  and make sure it is selected
  • And just below that, look for an option “Enable advanced performance”. This  option favors throughput/speed at the potential risk of data corruption.

The resulting GUI from following these steps is shown in Figure 4.

wincachefig4

Figure 4 – Windows Server 2003, Windows Vista & Windows Server 2008 disk caching policy administration

For Windows Server 2012, here is what the disk caching policy GUI looks like

WinCacheFig5

Figure 5 – Windows Server 2012 disk caching policy administration

Conclusion

This article described means by which application programmers can

  • Ensure that their file level data does not get cached in the Windows System Cache
  • Ensure that their file data does not get cached in the block storage layer and does get committed to storage media, given the correct hardware
  • Attempt to ensure, with no guarantee of success, that their file data does indeed get cached in the Windows System Cache

This article also describes means by which system administrators can attempt to ensure that data gets committed to storage media and does not get cached at either the System Cache or any block storage cache.

References

  1. Microsoft KB 241374 (http://support.microsoft.com/kb/241374/EN-US/) : Read and Write Access Required for SCSI Pass Through Request
  2. Microsoft KB 8373314: About Cache Manager in Windows Server 2003
  3. Microsoft KB 332023 Slow Disk Performance When Write Caching Is Disabled
  4. Nuances of Windows NT and SCSI disk performance article by Dilip Naik
  5. Force Unit Access Proposal
  6. Microsoft KB  870894 You receive a “Delayed Write Failed” error message in Windows XP Service Pack 2 or Windows XP Tablet PC Edition 2005
  7. Slow Large File Copy Issues – Microsoft Server Technical Support Performance Team Blog http://blogs.technet.com/askperf/archive/2007/05/08/slow-large-file-copy-issues.aspx
  8. Inside Vista SP1 File Copy Improvements – Mark Russinovich Blog http://blogs.technet.com/markrussinovich/
  9. Server Generates Delayed Errors Copying Very Large Files http://www.eggheadcafe.com/software/aspnet/32252624/server-generates-delayed.aspx
  10. Microsoft KB 920739  http://support.microsoft.com/kb/920739  Decreased Performance when copying files larger than 500 MB
  11. Serial ATA Program Revision 1.2 http://www.sata-io.org/documents/Interop_UnifiedTest_Rev1_2_v10_091707_000.pdf
  12. Disks, Lies, and damn disks http://perspectives.mvdirona.com/2008/04/17/DisksLiesAndDamnDisks.aspx
  13. Serial ATA in the Microsoft operating system environment http://www.microsoft.com/whdc/device/storage/serialATA_FAQ.mspx
  14. Enforcing Database Recoverability on Disks that lack Write-Through ftp://ftp.research.microsoft.com/pub/tr/TR-2008-36.pdf


Windows Write Caching – Part 2 An overview for Application Developers

Part 1 of this blog presented an overview of the Windows storage stack.

Application programmers may use a number of interfaces to control the way their application data is cached or if they prefer, not cached and committed to disk media.

CreateFile API

Applications open a handle to a resource such as a file or a volume using the CreateFile API. One of the parameters to the CreateFile API is the dwFlagsAndAttributes parameter, which can be any valid combination of file attributes and file flags.

To avoid caching at the file system layer, an application should specify a valid combination that includes FILE_FLAG_NO_BUFFERING in this dwFlagsAndAttributes parameter. Some notable consequences of specifying this flag include:

  • Applications are expected to perform I/O that is in integer units of the volume sector size.
  • FILE_FLAG_NO_BUFFERING only applies to application data – the file system may still cache file metadata. Data and metadata may be flushed to disk by using the FlushFileBuffers API
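
The sector size requirement in the first bullet can be checked with simple arithmetic. The sketch below is Python with hypothetical helper names, and it assumes a 512 byte sector purely for illustration; a real application would query the volume for its actual sector size. It also checks the file offset, since unbuffered access is expected to start on a sector boundary as well.

```python
def round_up_to_sector(length, sector_size=512):
    """Smallest multiple of sector_size that is >= length."""
    return -(-length // sector_size) * sector_size


def is_valid_unbuffered_io(offset, length, sector_size=512):
    """True if both the offset and the length are whole sectors."""
    return (
        length > 0
        and offset % sector_size == 0
        and length % sector_size == 0
    )


if __name__ == "__main__":
    print(round_up_to_sector(1000))         # padded transfer size for 1000 bytes
    print(is_valid_unbuffered_io(0, 4096))  # whole sectors at a sector boundary
    print(is_valid_unbuffered_io(1, 512))   # starts mid-sector
```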

Some well known applications such as Microsoft SQL and the Microsoft JET database (ships with Windows Server SKUs) specify FILE_FLAG_NO_BUFFERING with the CreateFile API.

Alternatively, applications may call the CreateFile API making sure that the FILE_FLAG_NO_BUFFERING flag is cleared in the parameter dwFlagsAndAttributes. In this case, it is likely that the file system and cache manager will cooperate to cache the application data, but there is no guarantee the caching will actually occur. A number of factors including the file/volume open mode specified in CreateFile, the data access pattern, and the load on the system will affect whether a particular application I/O is cached or not.

Note: The FILE_FLAG_NO_BUFFERING only affects file/volume data and does not apply to file/volume metadata, which may be cached even though FILE_FLAG_NO_BUFFERING is set.

Another relevant flag in the CreateFile API is FILE_FLAG_WRITE_THROUGH. Specified by itself, this flag ensures that both file/volume data and file/volume metadata are immediately flushed to storage media. Note that this does not mean the data does not traverse the cache. Referring to Figure 1, FILE_FLAG_WRITE_THROUGH may mean that the data is written to Cache Manager and then immediately flushed from there. So the I/O may still be a buffered I/O.

  • Application developers favoring data integrity at the cost of reduced throughput should make sure both FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH are set when invoking the CreateFile API.
  • Application developers favoring speed should make sure these flags are cleared when invoking CreateFile.
  • Application developers favoring a balance of throughput and data integrity need to read the section on FlushFileBuffers API.
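
The three choices above boil down to which bits are set in dwFlagsAndAttributes. Here is a small Python sketch of that decision. The numeric constants are the values defined in the Windows SDK headers, while the function and the goal names ("integrity", "speed", "balanced") are hypothetical labels for the three bullets and are not part of any API.

```python
# Values as defined in the Windows SDK headers.
FILE_ATTRIBUTE_NORMAL = 0x00000080
FILE_FLAG_NO_BUFFERING = 0x20000000
FILE_FLAG_WRITE_THROUGH = 0x80000000


def flags_and_attributes(goal):
    """Pick a dwFlagsAndAttributes value for one of the three goals above."""
    flags = FILE_ATTRIBUTE_NORMAL
    if goal == "integrity":
        # Bypass the system cache and write through to media.
        flags |= FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH
    elif goal in ("speed", "balanced"):
        # Both flags stay cleared; "balanced" relies on explicit flushes instead.
        pass
    else:
        raise ValueError(f"unknown goal: {goal!r}")
    return flags


if __name__ == "__main__":
    for goal in ("integrity", "speed", "balanced"):
        print(goal, hex(flags_and_attributes(goal)))
```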

Hardware Considerations

The main issues with hardware are the FILE_FLAG_WRITE_THROUGH parameter and how an operating system can handle it.

SCSI protocols define a Force Unit Access (FUA) flag in the SCSI Request Block (SRB). Early versions of the SCSI protocol (circa 1997) defined FUA as optional, while later specifications have made it mandatory. In situations where it is imperative that data gets committed to media, ensure that the deployed  storage hardware does support the Force Unit Access semantics and that this feature is not disabled.

In the enterprise world, NTFS has been deployed on a lot of non SCSI hardware, which increasingly includes SATA (Serial ATA) storage. The implementation of FUA in these devices is, at best, inconsistent. Even when implemented, the default is to turn it off, because of severe performance penalties.

Lower end PCs continue to use what is loosely termed IDE/ATAPI storage drives (parallel ATA, as the original ATA was retroactively renamed). Strictly speaking, this is more a family of protocols, rather than a single protocol. The ATAPI-4 specification is implemented in Windows 2000 and the ATAPI-5 specification in Windows Server 2003. Neither of these has any equivalent to the Force Unit Access semantic of SCSI. This is a long understood problem and Microsoft proposed in 2002 that the relevant standard be revised.

Note: The conclusion is that there is a possibility of data corruption due to drives caching data, especially so in ATA, IDE, ATAPI, and SATA devices, to name a few. This problem exists for other popular operating systems such as the Apple OS X and Linux as well. See the “FlushFileBuffers” section within this document. Also be sure to read the “NTFS” section further ahead in this blog.

FlushFileBuffers API

The FlushFileBuffers API can be used to flush all the outstanding data and metadata on a single file or a whole volume. However, frequent use of this API can cause reduced throughput. Internally, Windows uses the SCSI Synchronize Cache or the IDE/ATAPI Flush cache commands.

  • Application developers desiring a combination of speed and data integrity can
    • Leave both FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH cleared when invoking CreateFile
    • Write data as needed using an appropriate API such as WriteFile
    • Periodically call FlushFileBuffers to commit the data and metadata to storage media – the exact period at which this occurs is application specific
    • Application developers favoring data integrity at the cost of reduced throughput should make judicious use of this API, especially so when they specify only FILE_FLAG_NO_BUFFERING while invoking the CreateFile API. In other words, these applications are using FILE_FLAG_NO_BUFFERING to ensure data is committed to storage media and using FlushFileBuffers to ensure metadata is committed to storage media.
    • Application developers favoring pure throughput at the risk of potential data corruption should never use FlushFileBuffers API.
    • As Reference 14 describes, FlushFileBuffers can be used to mitigate hardware that does not support write-through
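
The "write normally, flush periodically" pattern from the list above can be sketched in a platform neutral way. The Python below is an illustration only: `os.fsync` stands in for FlushFileBuffers (the Python documentation states that on Windows it is implemented with the Microsoft `_commit` function), and the function name and the `flush_every` parameter are hypothetical.

```python
import os
import tempfile


def write_with_periodic_flush(path, records, flush_every=100):
    """Write records through the cache, committing every flush_every records.

    Returns the number of explicit flushes that were issued.
    """
    flushes = 0
    with open(path, "wb") as f:
        for count, record in enumerate(records, start=1):
            f.write(record)
            if count % flush_every == 0:
                f.flush()             # drain the language runtime's buffer
                os.fsync(f.fileno())  # ask the OS to commit to media
                flushes += 1
        # Final commit so the tail of the file is not left only in cache.
        f.flush()
        os.fsync(f.fileno())
        flushes += 1
    return flushes


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "demo.bin")
    print(write_with_periodic_flush(target, [b"x" * 10] * 250))
```

Choosing `flush_every` is the application specific trade-off the text describes: a smaller value bounds how much data can be lost, a larger value preserves throughput.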

Liberal use of the FlushFileBuffers API can severely affect system throughput. This is because at the file system layer, it is quite clear what data blocks belong to what file. So when FlushFileBuffers is invoked, it is also apparent what data buffers need to be written out to media. However, at the block storage layer – shown as “Sector I/O” in Figure 1, it is difficult to track what blocks are associated with what files. Consequently the only way to honor any FlushFileBuffers call is to make sure all data is flushed to media. Therefore, not only is more data written out than originally intended, but the larger amount of data can affect other I/O optimizations such as queuing of the writes.

There is also a bright side to this picture. While it is true that FlushFileBuffers, if handled properly by all involved layers, will flush all data and cause performance degradation, it can also help in preserving data integrity. An application that “forgets” to invoke FlushFileBuffers will still have its data committed to media due to other applications invoking this API.

FlushFileBuffers should be judiciously used as needed.

Hardware Considerations

The SCSI-3 protocol defines a command SYNCHRONIZE_CACHE that commits data from the storage cache to media. Hence SCSI devices are good candidates for applications that are highly sensitive to data being committed to media. However, it is always good practice to verify that a particular SCSI device does implement the SYNCHRONIZE_CACHE command.

The relevant ATA/IDE specifications define an equivalent command FLUSH_CACHE. ATAPI-4 (relevant circa 2000) defines the FLUSH_CACHE command as optional; ATAPI-5 makes this command mandatory. As always, verify the exact functionality of a particular hardware and do not assume it implements the relevant standard precisely.

Client/Server application considerations

Both the Common Internet File System (CIFS) and the newer SMB 2.0 protocols define a flush SMB command. Therefore the CIFS/SMB redirector built into Windows can easily propagate a FlushFileBuffers command over to a file server.

DeviceIOControl API

While data may be cached at the file system layer, caching can (and does) also occur at other layers such as the disk block level. For the purposes of this discussion, caching done within a disk or a disk controller is referred to as disk block caching.

Storage devices cache data to enhance throughput, but sometimes at the cost of data integrity. Some storage devices include their own battery backup to enhance data integrity.

Further increasing ambiguity is the fact that there is no standard manner to determine whether such caching is indeed occurring or not. Some storage systems have their own battery backed cache and hence under many circumstances, it may be OK to leave the data in this battery backed cache and treat it as being committed to media. Some of these storage systems ignore the command to commit data to media.

The DeviceIOControl API can be used to inspect block storage caching configurations and also potentially set these configurations. Both the caching policy inspection and setting have limitations as well.

Windows offers a number of ways for applications to programmatically affect the caching at the sector I/O level. Many of these interfaces consist of calling the DeviceIOControl API with some different function code. Windows NT 4.0 SP4 and higher versions of Windows require administrator privileges to submit SCSI pass-through requests. SCSI pass-through requests are submitted via the WIN32 DeviceIOControl API, which requires a handle to a file or a volume as a parameter. This handle is obtained via the CreateFile API and starting with Windows NT 4.0 SP4, the CreateFile function requires GENERIC_READ and GENERIC_WRITE to be specified in the dwDesiredAccess parameter of the CreateFile API.

Windows defines a number of IOCTL codes that may be used to inspect and control the disk block caching functionality. These include:

IOCTL_DISK_GET_CACHE

Returns information about whether the disk cache is enabled or not. The function works only where the disk returns correct information via a SCSI mode page.

IOCTL_DISK_SET_CACHE

Sets disk caching functionality to be enabled or disabled. The function works only where the disk implements SCSI mode pages.

IOCTL_STORAGE_QUERY_PROPERTY

Windows Vista and higher versions of Windows support another interface to retrieve disk cache properties.

Pros

  • A single interface that applications can code to – in the absence of this single interface, applications would need to understand the nuances of various devices e.g. for 1394 devices, obtain page 6 of the mode data, but for SCSI compliant devices, obtain page 8 of the mode data, etc.
  • Windows implements the necessary details to retrieve the information in a transport dependent manner (SCSI/IDE/etc) and surface it
  • A richer interface that can provide information not just on whether the disk implements a cache, but also on what kind of cache
  • Time will tell whether this interface continues to evolve and be populated with more data

Cons

  • Yet another interface that applications need to code to
  • Does not work for RAID devices
  • Does not work for Flash drives
  • Not yet widely implemented by storage vendors
  • Only available in the newer versions of Windows
  • Does not (yet at least) allow setting of caching properties

For an application to retrieve information about a device’s write cache property, use the STORAGE_WRITE_CACHE_PROPERTY structure with the IOCTL_STORAGE_QUERY_PROPERTY request.

NTFS and flushing behavior

NTFS, when introduced with Windows NT 3.X, depended upon the SCSI Force Unit Access behavior to ensure its metadata is flushed to media. As described above, NTFS has been deployed on a fair number of non SCSI devices, all of which have an inconsistent implementation of FUA. Perhaps the saving grace may have been that the consumer devices do not have a cache, and hence even without FUA, the data hits the media. In any case, what is worth noting is that all NTFS versions up to and including Windows 7 and Windows Server 2008 R2 depend upon FUA to ensure that NTFS metadata is flushed to media. NTFS in Windows 8 has switched to using the FlushFileBuffers API instead of depending upon the Force Unit Access behavior.

This concludes an overview of the knobs an application developer can twist and turn to influence application data write caching. Part 3 of this blog will present an overview of the knobs a system administrator can twist and turn to influence write data caching.

Windows Write caching – Part 1 Overview


Certain Windows applications, such as database applications, need to ensure their I/O is committed to media, even at the cost of reduced throughput. However, at times an administrator has faith in the hardware and is willing to accept a small risk of data corruption in favor of higher throughput by allowing caching to occur.

This is a 3 part blog that concentrates on the write caching behavior in the Windows storage stack.

  1. Part 1 presents an overview of the Windows storage stack with specific reference to write caching
  2. Part 2 presents the “knobs” an application programmer can twist and turn to affect write caching
  3. Part 3 presents the “knobs” a system administrator can twist and turn to affect write caching

Windows Storage Stack

Figure 1 Windows Storage I/O Stack

Figure 1 shows a simplified overview of the Windows Storage I/O stack. Starting from the top of Figure 1,

  • Applications make I/O requests. Figure 1 concentrates on write requests and hence the unidirectional arrows from the application towards the disk media.
  • Depending upon the nature of the I/O (decided partly by the way the application opens a file or volume), some I/O requests completely bypass the Windows Cache Manager and go straight from the file system to the Volume Manager layer. This is labeled Unbuffered I/O in Figure 1. As will be explained later, applications can ensure that their I/O is Unbuffered.
  • Alternatively, Application I/O may traverse the buffered path labeled in Figure 1. While applications may strive to ensure their I/O is buffered, in reality, there is no way to ensure this. I/O is buffered depending upon a number of factors such as nature of file open, the type of I/O, the history of the application I/O, the load on the system, etc.
  • The Volume Manager performs sector I/O. While the application may strive to ensure that there is no caching at the sector I/O level, the reality is that applications have limited success in some cases. This is discussed in more detail within the document.

Different types of Write Caching

Irrespective of whether the data is written at the file system level or at the disk block level, write caching can be broadly classified into two categories:

  • Write-through caching: where data is written to cache AND also written to non volatile media. The data integrity is high, but write performance is slower, whereas read performance is enhanced.
  • Writeback caching: where data is written to cache, the operating system write request is completed, and the data is lazily written to media at a later point in time. Writeback Caching emphasizes write performance, but at the possible loss of data integrity.
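
The difference between the two categories is easiest to see in a toy model. The Python sketch below is purely illustrative and does not model any real Windows component: the class name, the dictionaries standing in for cache and media, and the `power_loss` method are all hypothetical.

```python
class ToyCache:
    """A toy cache in front of 'media', in write-through or writeback mode."""

    def __init__(self, write_through):
        self.write_through = write_through
        self.cache = {}     # block number -> data held in volatile cache
        self.media = {}     # block number -> data on non volatile media
        self.dirty = set()  # blocks in cache that are not yet on media

    def write(self, block, data):
        self.cache[block] = data
        if self.write_through:
            self.media[block] = data  # committed before the write completes
        else:
            self.dirty.add(block)     # completed now, written lazily later

    def flush(self):
        """Lazy writer: push all dirty blocks out to media."""
        for block in sorted(self.dirty):
            self.media[block] = self.cache[block]
        self.dirty.clear()

    def power_loss(self):
        """Volatile state vanishes; only media survives."""
        self.cache.clear()
        self.dirty.clear()


if __name__ == "__main__":
    for mode in (True, False):
        c = ToyCache(write_through=mode)
        c.write(7, b"payload")
        c.power_loss()
        print("write-through" if mode else "writeback", "survived:", 7 in c.media)
```

In the writeback case the write was reported complete, yet a power loss before the lazy flush leaves nothing on media, which is the loss of data integrity the second bullet refers to.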

Part 2 of this blog will describe the APIs an application developer can use to control write caching behavior.

The perils of alignment for memory access and disk I/O

In my earlier blog, I described how Visual Studio (VS) 2012 is now a requirement for writing kernel mode drivers on both the x86/x64 Intel/AMD and the ARM versions of Windows 8. So I installed VS 2012 RC on two different laptops and was unhappy with the installation time. I must place on record my appreciation for the Visual Studio team, which has been very diligent in following up and looking into the issue. Of course, I will acknowledge that my belief of “it takes too long” could be incorrect, and I may be encountering unusual circumstances on both my systems. So with the caveat that perhaps “I am encountering a one off situation”, here we go with my analysis.
First, a couple of references are in order.

  1. To quote from MSDN “In this document we explain why you should care about data alignment, the costs if you do not, how to get your data aligned, and what to do when you cannot. You will never look at your data access the same way again.” The point is: aligned memory access in Windows is very important.
  2. It is equally important to ensure that disk writes are aligned as well. Most current disks write data in 512-byte chunks called sectors. So if you write 512 bytes at offset zero, a single write suffices. But if you write 512 bytes at offset 1, the I/O spans 2 disk sectors. So a single write becomes: read 2 disk sectors, copy over the new 512 bytes of data, and issue 2 sector writes, each of 512 bytes. In other words, a 512 byte write becomes a 1024 byte read and a 1024 byte write. Here is an MSDN blog explaining, among other things, the importance of aligned I/Os for SQL. And here is another MSDN SQL blog explaining the importance of aligned I/O.
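
The sector arithmetic in the second reference can be written down directly. This is a minimal Python sketch with hypothetical function names, using the 512 byte sector size mentioned above.

```python
def sectors_touched(offset, length, sector_size=512):
    """Number of disk sectors that a write of `length` bytes at `offset` spans."""
    if length <= 0:
        return 0
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return last - first + 1


def needs_read_modify_write(offset, length, sector_size=512):
    """True if the write starts or ends in the middle of a sector."""
    return offset % sector_size != 0 or (offset + length) % sector_size != 0


if __name__ == "__main__":
    for offset in (0, 1):
        n = sectors_touched(offset, 512)
        rmw = needs_read_modify_write(offset, 512)
        print(f"512 bytes at offset {offset}: {n} sector(s), read first: {rmw}")
```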

Now back to the topic at hand – installing Visual Studio 2012 RC and analyzing possible causes for why it takes as long as it does. So I decided to investigate further, by tracing the I/Os using Sysinternals (now part of MSDN) tool Process Monitor.

Here is a screen shot showing a small part of the I/O of the installation. Note that I randomly located this I/O pattern. I also cursorily checked that other files have similar behavior; in particular, write an odd number of bytes at offset zero, and then proceed to write the rest of the file.

Image

For file DataCollection.dll, please notice

  1. The write at offset zero for 32,447 bytes
  2. The write at offset 32,447 for 32,768 bytes
  3. The write at offset 65,215 for 16,761 bytes
  4. The total file size is 81,976 bytes and 32,447 + 32,768 + 16,761 = 81,976
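
The numbers in this list can be checked mechanically. The sketch below assumes the three writes are back to back, so each offset is the running sum of the preceding lengths, and assumes 512-byte sectors; it only reproduces arithmetic and is not derived from the trace itself.

```python
from itertools import accumulate

SECTOR = 512  # assumed sector size in bytes

# Lengths of the three writes to DataCollection.dll, as listed above.
lengths = [32_447, 32_768, 16_761]

# For back-to-back writes, each offset is the sum of the earlier lengths.
offsets = [0] + list(accumulate(lengths))[:-1]
total = sum(lengths)


def is_aligned(value, sector=SECTOR):
    """True when value is a whole multiple of the sector size."""
    return value % sector == 0


report = [(off, n, is_aligned(off), is_aligned(n))
          for off, n in zip(offsets, lengths)]

for off, n, off_ok, len_ok in report:
    print(f"offset {off:>6,}  length {n:>6,}  "
          f"offset aligned: {off_ok}  length aligned: {len_ok}")
print(f"total {total:,} bytes")
```

Only the first write starts on a sector boundary, and only the second has a length that is a multiple of 512, so every one of the three writes begins or ends in the middle of a sector.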

Now apply the logic of the references quoted – in particular, the importance of aligned memory access, and aligned disk I/O access.

At the very least, each of the 3 I/Os will consist of a 1 or 3 byte copy, a copy of some N DWORDs, followed by a 1 or 3 byte copy. This could have been completely avoided by doing 3 I/Os, each consisting of an even number of bytes. There is a penalty to be paid for the 1 byte and 3 byte memory access.

I must admit that this trace is at the file system layer. It is certain that before the I/O hits the disk, which is a block mode I/O, the Windows Cache Manager and I/O subsystem will have intervened to make the I/O aligned and an integral number of sectors. There will still be some disk I/O penalties however, when some writes get split across 2 adjacent sectors. This could be avoided. Consider the case where say part of the file has been written, and is in cache. And the I/O pattern guarantees that there will be an odd number of bytes cached, until the final odd length write arrives. Now imagine that for some reason, the cache gets flushed before the last write arrives. This could be because the file is very large, or there is memory pressure. This means that the cache manager will zero fill a buffer until the end of a sector (an odd number of bytes) and then write out that sector. When the next write arrives, this just flushed sector needs to be read, the zero filled bytes are copied over with the newly arrived data, and then the same sector is written – again!
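
The flush scenario can be put into a rough model. The sketch below assumes the worst case described above, namely that the cache is flushed after every single write, plus 512-byte sectors and contiguous, non-overlapping writes. The function `sector_io` and the "aligned" chunking are hypothetical illustrations, not a description of what the Cache Manager actually does.

```python
SECTOR = 512  # assumed sector size in bytes


def sector_io(writes, sector=SECTOR):
    """Count (sector_reads, sector_writes) when every write is flushed at once.

    A sector that was already flushed (zero filled past the data) and is
    touched again by a later write must be read back before it is rewritten.
    """
    reads = 0
    writes_out = 0
    flushed = set()  # sectors that have already gone to disk
    for offset, length in writes:
        first = offset // sector
        last = (offset + length - 1) // sector
        for s in range(first, last + 1):
            if s in flushed:
                reads += 1      # read-modify-write of a partially filled sector
            flushed.add(s)
            writes_out += 1
    return reads, writes_out


# The three writes from the trace (offsets assumed contiguous).
unaligned = [(0, 32_447), (32_447, 32_768), (65_215, 16_761)]
# Same 81,976 bytes, split on 32 KB boundaries instead.
aligned = [(0, 32_768), (32_768, 32_768), (65_536, 16_440)]

print(sector_io(unaligned))  # (2, 163)
print(sector_io(aligned))    # (0, 161)
```

Under these assumptions the unaligned pattern costs 2 sector reads and 2 extra sector writes for this one file. How much that adds up to across a whole installation depends on how many files follow the pattern and how often such flushes really occur.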

There is no perceivable advantage in making the I/O nonaligned – and significant potential harm. It is difficult to estimate how much VS 2012 installation will speed up, were the writes to be aligned.
There are other oddities as well in the trace, but I will write about those in future blogs.

I invite reader comments on whether they believe this I/O pattern is within acceptable bounds. For readers willing to trace their VS 2012 install, I would also welcome feedback as to whether they observe this pattern.

Developing kernel mode drivers for Windows 8

I have been developing Windows kernel mode drivers for 10+ years now and notice that the Windows 8/Windows Server 2012 WDK brings some changes. This blog tries to highlight the changes in the hope that other developers will benefit.

I went through the mechanics of installing the Windows 8 WDK on 2 different laptops. So with the caveat “Your mileage may vary – maybe I hit the jackpot and my experience was unique”, here we go:

  1. WDK 8 requires that you first install Visual Studio 2012. See http://msdn.microsoft.com/en-US/windows/hardware/hh852362 and the listed System Requirements section, which among other things states “Before you begin, you must first install Visual Studio Professional 2012 RC or above”
  2. For now, Visual Studio 2012 is “free” since it is not yet a released product. Presumably, two different parts of Microsoft will soon tell us a couple of important data points
    1. The Visual Studio team will tell us pricing for the various different versions/SKUs of Visual Studio 2012
    2. A different part of Microsoft will tell us which SKUs are acceptable for compiling the WDK code samples and code developers write
  3. I am not referring to the new ARM based version of Windows called Windows RT. I am referring to writing drivers for the x86/x64 platform. Even that one now requires Visual Studio 2012.
  4. Visual Studio is an excellent product for those for whom it works. A while back, as in 5 or so years ago, I abandoned it, primarily due to the long install time and the resources it consumed in terms of disk space. The only use I had for it was the compiler. I use a different editor, and I use WinDbg in standalone mode. So when a previous version of the WDK (called the DDK at that time) shipped with compilers, I abandoned Visual Studio. I don’t seem to have the same choice any more.
  5. Depending upon which version of Visual Studio you install, and depending upon what choices you make during the installation, Visual Studio will take some time to install and occupy some GBs (certainly less than 10GB) of disk space.
  6. In case you are still reading, the WDK no longer downloads sample code. My gratitude to the people who posted this fact e.g. http://boardreader.com/thread/Samples_arent_installed_along_with_the_W_u8jjs__3e9c9b67-ea9f-4225-a268-5d5ece555568.html Presumably, this makes it easier for Microsoft to release new or updated code samples without shipping the whole WDK.
  7. The code samples are available at http://code.msdn.microsoft.com/windowshardware . Presumably Microsoft will release some scripts to download the samples in some sort of collection e.g. all the samples, all the USB samples, all the storage driver samples, etc. But meanwhile, one has to download the samples one sample at a time. While this saves on disk space, the savings are minuscule compared to the added GBs occupied by VS 2012.
  8. I must acknowledge that VS 2012 now provides an ARM capable compiler, something the old WDK did not.

We will have to wait and see what the VS 2012 requirement adds in terms of software licensing costs. I guess that is just “the cost of doing business” with Windows 8.

In the meanwhile, I look forward to attending the next plug fest and testing my driver(s) for CSV 2.0 compatibility.

Windows Server 8 NIC Teaming tips

Some highly knowledgeable folks at Microsoft shared some very valuable tips during the recently concluded MVP Summit. This blog is a small sample of these tips.

Prior to Windows 8, NIC Teaming was never officially supported by Microsoft. It was a third party offering from an OEM/IHV/ISV and all support for the feature had to be provided by the third party. I personally have spent considerable time debugging situations where a system start up service I wrote had issues. It turned out that my service could not connect to the Domain Controller because the NIC team was still in the forming stage and had not yet completed its initialization.

Windows Server 8 natively supports NIC teaming. Here are the highlights and tips:

  • NIC teams can only be formed between homogeneous NICs. So two 1 Gb NICs can be teamed, or two 10 Gb NICs can be teamed, but you cannot team a 1 Gb and a 10 Gb NIC.
  • If the individual NIC members each support Receive Side Scaling (RSS), the NIC team also supports RSS. Hence it is a good idea to team NICs that support RSS. The resulting NIC team is also highly capable and does not lose any functionality.
  • If the individual NIC members each support RDMA, the resulting NIC team does NOT support RDMA. Given how Windows 8 SMB 2.2 natively supports RDMA without modifying applications, it is a bad idea to team RDMA capable NICs when the interconnect (routers, etc.) also supports RDMA.
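
The three tips above can be summed up as a small planning check. This is a toy model in Python, not a Windows API: the `Nic` class, the `evaluate_team` function and the sample adapters are all hypothetical, and the rules encoded are only the ones stated in this blog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    """Hypothetical description of one physical adapter."""
    name: str
    speed_gbps: int
    rss: bool
    rdma: bool


def evaluate_team(nics):
    """Apply the teaming tips from this blog to a proposed team."""
    if not nics:
        raise ValueError("a team needs at least one NIC")
    if len({n.speed_gbps for n in nics}) != 1:
        raise ValueError("team members must all have the same link speed")

    warnings = []
    if any(n.rdma for n in nics):
        warnings.append("teaming gives up RDMA on these NICs")

    return {
        "rss": all(n.rss for n in nics),  # the team keeps RSS only if every member has it
        "rdma": False,                    # a team never exposes RDMA
        "warnings": warnings,
    }


team = [Nic("nic1", 10, rss=True, rdma=True),
        Nic("nic2", 10, rss=True, rdma=True)]
print(evaluate_team(team))
```

A check like this could run before any team is actually created, so that RDMA capable adapters are left unteamed for SMB traffic.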

Windows 8 VHDX file instant dedupe wish list

I have been testing the Windows 8 dedupe feature, especially so for large VHDX files. But the testing has revealed a major “wish”. Hopefully somebody from the right department at Microsoft reads this and at least puts it on a feature list for the future.

Here is a scenario I exercised – and it seems to be a very common scenario

  1. Create a Windows Server VM inside a 40GB VHDX file- call it VM1.vhdx
  2. Xcopy (and yes, xcopy /J; see my previous blog “Tips for copying VHD and VHDX files”) the VM1.vhdx file to, say, VM2.vhdx. That’s 40 GB of reads and 40 GB of writes.
  3. Repeat the xcopy to a different destination file: xcopy /J VM1.vhdx to VM3.vhdx, and that’s 40 GB more reads and 40 GB more writes.
  4. Fire up VM1, enter license info, assign computer name, assign IP address, etc. Turn it into a file server
  5. Fire up VM2, enter license info, etc., install Microsoft Exchange, and turn it into an Exchange Server
  6. Fire up VM3, enter license info, etc., install SQL Server, and turn it into a SQL Server
  7. Now let the system idle, make sure it does not hibernate, wait for dedupe to read all 3 VHDX files (3 x 40 GB worth of reads, etc.) and dedupe the files.

Instead, here is an alternative sequence that would be really useful

  1. Create a Windows Server VM inside a 40GB VHDX file
  2. Run a PowerShell (PS) script that creates an instantly deduped second copy of this VHDX file, with all the associated dedupe metadata. So now I have 2 VHDX files that are identical and have been deduped. The PS script would have to invoke some custom dedupe code Microsoft could ship. Create a new file entry for say VM2.vhdx and create the dedupe metadata for VM1.vhdx and VM2.vhdx.
  3. Repeat the same PS script with different parameters and now I have 3 identical VHDX files, all deduped
  4. Repeat steps 4 through 6 from the first sequence. Step 7, the dedupe step, is not needed

This would save 100s of GBs of reads and writes, and administrator time, increasing productivity. Whether you call this instant dedupe or not is up to you.
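
To put a number on the savings, here is the arithmetic for the first sequence as a short Python sketch. It uses only the figures given above (a 40 GB VHDX, two copies, one dedupe pass over all three files) and leaves out whatever the dedupe pass itself writes, which this blog does not quantify.

```python
VHDX_GB = 40   # size of VM1.vhdx
COPIES = 2     # VM2.vhdx and VM3.vhdx

copy_reads_gb = COPIES * VHDX_GB          # each xcopy reads the source once
copy_writes_gb = COPIES * VHDX_GB         # and writes one full destination file
dedupe_reads_gb = (COPIES + 1) * VHDX_GB  # dedupe later reads all three files

total_gb = copy_reads_gb + copy_writes_gb + dedupe_reads_gb
print(f"copy reads:   {copy_reads_gb} GB")
print(f"copy writes:  {copy_writes_gb} GB")
print(f"dedupe reads: {dedupe_reads_gb} GB")
print(f"total I/O avoided by an instant dedupe copy: about {total_gb} GB")
```

That is roughly 280 GB of I/O for just three VMs, and the figure grows by about 120 GB for every additional 40 GB copy.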

In the interest of keeping the focus on the instant dedupe scenario, I have deliberately avoided the details of requiring Sysprep’ed installations. But the audience I am targeting with this blog will certainly understand the nuances of requiring Sysprep.

If you are a Microsoft MVP reading this blog, and you agree, please comment on the blog, and email your MVP lead asking for this feature.

Intel Ultrabooks, SSD based laptops, and file system needs

This blog is partly triggered by the new Resilient File System (ReFS) that Microsoft just announced for Windows 8. At least for now, the new file system appears to be aimed more at servers than at laptops or tablets, and in particular not at SSD based laptops and tablets. More about ReFS in some other blogs.

For the record, I believe Intel holds a trademark on the term Ultrabook.

I am not sure my Windows 7 and Windows 8 Developer Preview NTFS based laptops need a better metadata checksum mechanism, let alone a better user data checksum mechanism. But here is what I do believe my NTFS based laptops (Win 7 and presumably also the Win 8 based laptops) need, especially so when the hard disk is an SSD.

  • Can the OEMs please stop bundling and/or stop offering a disk defragmentation utility with SSD based systems? SSD based volumes do not need to be defragmented; indeed, defragmenting them reduces the life of the SSD! Further, maybe the Microsoft OEM division, especially so for Windows 8, as well as the Intel Ultrabook division can do something about this?
  • Microsoft, thank you for disabling the built in defrag code in Windows 7 when an SSD based NTFS volume is detected. Hopefully the same is true in Windows 8 as well.
  • SSDs have this need for the unused data blocks to be erased. That is just the nature of the physics involved. Doing so makes the writes faster. After a user buys an SSD based laptop, after a while, all of the blocks have been written and it would be advantageous for the disk firmware to know which blocks are unused as seen by the file system (NTFS) so that it can go ahead and erase them. The idea is that the disk blocks are erased and ready for me to write to when I download the latest movie, whether from iTunes, YouTube, or my DVD drive. Enter the Windows 7 “TRIM” command, where Windows 7 passes down the information as to which disk blocks it just “released” and that can be erased. The problem is that it is not clear which drive vendors make use of that TRIM command. And when a new version of the drive firmware makes use of that TRIM command, do the laptop OEMs bother to use that firmware? I understand there are profits to be made, and that goal may at least temporarily result in a situation where a Windows 7 SSD volume is simply ignoring the TRIM command. It would be interesting to get those statistics, whether from Microsoft, or a drive vendor, or a laptop OEM, or for that matter Intel for its Ultrabook branded OEMs.
  • What would be even more useful would be to have the same insight for Windows 8 SSD based laptops and tablets, whenever those are commercially available. While I will not buy a laptop or tablet simply because it makes proper use of the TRIM command, I do know that 64GB and 128 GB SSDs tend to fill very quickly, and hence the TRIM command will help. So it is certainly an important consideration. Perhaps this is one way an OEM can differentiate their offering.

More about ReFS and SSD in a new blog at a later date.