The importance of using streams

So when I started out programming I really struggled with the idea of Streams. "What’s the point," I used to think, "just use an array of bytes." At least that I can understand: I can see it in the debugger, and I can manipulate it with simple code. Then, after a while, I grokked streams with a single thought: "It’s all about the memory."

Move forward several years and I started programming in .NET. When Microsoft released .NET 1 it was very opinionated about encouraging the use of Streams. This was especially true for file access. However, in .NET 2 it seemed Microsoft bowed to the masses and added many members that allowed consumers to easily bypass the Stream approach; methods like these did not exist in .NET 1.
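
File.ReadAllBytes and File.ReadAllText, both added in .NET 2, are typical of the kind of member I mean; each pulls the entire file into memory in a single call:

byte[] bytes = File.ReadAllBytes(@"..\..\..\SmallFile"); // whole file as one byte array
string text = File.ReadAllText(@"..\..\..\SmallFile");   // whole file as one string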

Even had this not been the case, it is fairly easy to bypass the Stream approach yourself. Just search for “convert a stream to a byte array .net” and you will see many people doing just that.
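
The code you find is usually some variation on copying the stream into a MemoryStream and calling ToArray. A minimal sketch of that pattern (the ReadFully name here is purely illustrative):

public static byte[] ReadFully(Stream input)
{
    using (var memoryStream = new MemoryStream())
    {
        // Copy the whole stream into memory, then hand back one big byte array
        var buffer = new byte[4096];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            memoryStream.Write(buffer, 0, read);
        }
        return memoryStream.ToArray();
    }
}

In other words, you end up holding the full contents in memory anyway, which defeats the point of having a Stream in the first place.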

An Example

I saw one such example recently in the NuGet project's CryptoHashProvider.

The code can effectively be simplified to this:

public class HashHelper
{
    static SHA512 sha512;

    static HashHelper()
    {
        sha512 = SHA512.Create();
    }

    public static byte[] ComputeHashWithBytes(string path)
    {
        byte[] fileBytes;
        using (var stream = File.OpenRead(path))
        {
            var length = (int) stream.Length;
            var buffer = new byte[length];
            stream.Read(buffer, 0, length);
            fileBytes = buffer;
        }
        return sha512.ComputeHash(fileBytes);
    }
}

Note that the file is read into a byte array before the hash is computed. Compare this to the stream-based approach:

public static byte[] ComputeHashWithStream(string path)
{
    using (var stream = File.OpenRead(path))
    {
        return sha512.ComputeHash(stream);
    }
}
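
The stream overload stays lean because ComputeHash(Stream) never loads the whole file; it reads the stream in small chunks and feeds each chunk to the hash algorithm as it goes. Roughly speaking, it is doing something like the following sketch (I am assuming a 4KB buffer here; the framework's internal buffer size may differ):

public static byte[] ComputeHashChunked(string path)
{
    using (var sha = SHA512.Create())
    using (var stream = File.OpenRead(path))
    {
        // Only ever 4KB of file data is held in memory, regardless of the file size
        var buffer = new byte[4096];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            sha.TransformBlock(buffer, 0, read, null, 0);
        }
        sha.TransformFinalBlock(buffer, 0, 0);
        return sha.Hash;
    }
}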

The Impact

So what is the performance impact of this? Well, it is difficult to determine the real impact without standing up a NuGet server, so I am going to speculate based on the following:

  • This code is in the Core, so it can be used from the web server.
  • It is used to hash NuGet packages.
  • NuGet packages are usually small. For the sake of testing I will use the NHibernate package, which is 1.84MB.
  • There are approximately 3000 packages. I will test with a smaller loop of 1000 iterations to speed up testing.

Sequential Processing

One possible scenario is hashing all files at a given point in time, perhaps at start-up or as some kind of batch job.

For this we will use the code below:

var stopwatch = Stopwatch.StartNew();
for (var i = 0; i < 1000; i++)
{
    // swap between ComputeHashWithBytes and ComputeHashWithStream to test each approach
    HashHelper.ComputeHashWithBytes(@"..\..\..\SmallFile");
}
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);

The first thing that surprised me was that the time to process was roughly the same for both approaches. I would speculate that this is because the time taken for garbage collection of a large array of bytes is insignificant compared to the cost of computing the hash. The memory usage, however, is more interesting.

Bytes Approach Sequential

Snapshot

Total size of objects is 1.873MB.

Bytes sequential memory usage snapshot

There are two interesting points here:

  1. There is a large amount of reserved free space. This is because the memory that was used by the byte arrays is, after garbage collection, still being reserved.
  2. There is a large amount of memory taken up by byte arrays. Not really a surprise.

Timeline

Bytes sequential memory usage timeline

Here you can see the constant allocation and garbage collection caused by using byte arrays.

Stream Approach Sequential

Snapshot

Total size of objects is 82KB.

Stream sequential memory usage snapshot

Looking at the two points from above:

  1. While there is a proportionally large amount of free space reserved, it is only 836KB compared to the 1.992MB from above.
  2. Byte arrays do not even rank on the graph. Zooming in shows 4.5KB taken up by byte arrays.

Timeline

Stream sequential memory usage timeline

When compared to above, the memory usage is relatively stable.

Parallel Processing

Another possible scenario is hashing files in parallel. This could happen in scenarios similar to the above, or due to a high number of web requests.

For this we will use the code below:

var stopwatch = Stopwatch.StartNew();
// again, swap between ComputeHashWithBytes and ComputeHashWithStream to test each approach
Parallel.For(0, 1000, i => HashHelper.ComputeHashWithBytes(@"..\..\..\SmallFile"));
stopwatch.Stop();
Debug.WriteLine(stopwatch.ElapsedMilliseconds);

Again, the time to process was roughly the same for the stream-based and byte-based approaches. Moving on to the memory usage:

Bytes Approach Parallel

Snapshot

Total size of objects is 16.26MB.

Bytes parallel memory usage snapshot

Again, a large amount of space is taken up by byte arrays and reserved free space.

Timeline

Bytes parallel memory usage timeline

Here you can see the allocations made by the code competing with the garbage collector cleaning up after them.

Stream Approach Parallel

Snapshot

Total size of objects is 126.8KB.

Stream parallel memory usage snapshot

Again, significantly less memory is taken up by byte arrays and reserved free space.

Timeline

Stream parallel memory usage timeline

Again, the memory usage is relatively stable compared to the byte approach.

In Summary

Use streams unless you have a specific reason to use byte arrays. It will result in better memory usage and (usually) less code.

On a side note: I am only using the NuGet CryptoHashProvider as an example to compare byte arrays to streams. The NuGet guys may have a very good reason for their approach.

Posted by: Simon Cropp
Last revised: 20 Dec, 2011 07:39 PM
