
Wednesday, February 13, 2013

Reading InputStreams

The Java InputStream is one of the core I/O classes in the JDK and is used pretty much whenever you want to do I/O in your Java programs, e.g. reading from the disk or network. It has a variety of read methods that allow you to read a single byte or many bytes at a time, which are typically called in some sort of loop until you get a return value of -1 indicating the end of the stream. Consider the following example:
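A minimal sketch of that kind of program is below. The class and method names are placeholders, the payload is assumed to be zero-filled, and main defaults to 16 MB rather than 256 MB to keep the heap requirement modest (pass 268435456 as the first argument to use the size discussed in this post).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class; the names are illustrative only.
final class SingleReadDemo {

    // Writes `size` bytes to an in-memory stream, then tries to read them
    // all back with a single read() call. Returns what that call returned.
    static int writeThenReadOnce(int size) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192]; // zero-filled payload (an assumption)
        for (int written = 0; written < size; written += chunk.length) {
            out.write(chunk, 0, Math.min(chunk.length, size - written));
        }
        out.close();

        InputStream in = new ByteArrayInputStream(out.toByteArray());
        byte[] buffer = new byte[size];
        int n = in.read(buffer); // one call, no loop
        in.close();
        return n;
    }

    public static void main(String[] args) throws IOException {
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 16 * 1024 * 1024;
        System.out.println("Read " + writeThenReadOnce(size) + " bytes.");
    }
}
```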


Here, we write 256MB worth of data into a ByteArrayOutputStream, which just allocates a byte array to store all of those bytes, and then read it back all at once using a ByteArrayInputStream, which just wraps a byte array with the InputStream API. Unsurprisingly, this program prints the message: "Read 268435456 bytes." Simple enough, but let's see what happens when you decide to compress the bytes when writing them out and decompress them when reading them back (this is common when you have to write large amounts of easily compressible data to the disk or network).
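A sketch of the compressed variant follows, under the same assumptions as before: placeholder names, a zero-filled payload, and a 16 MB default size. The exact count returned by the single read depends on the data and on the JDK's internal buffer sizes, so the code makes no claim about the number it prints.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; the names are illustrative only.
final class CompressedSingleReadDemo {

    // Compresses `size` bytes into memory, then decompresses with a single
    // read() call. Returns what that one call returned.
    static int compressThenReadOnce(int size) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        OutputStream out = new DeflaterOutputStream(compressed);
        byte[] chunk = new byte[8192]; // zero-filled payload (an assumption)
        for (int written = 0; written < size; written += chunk.length) {
            out.write(chunk, 0, Math.min(chunk.length, size - written));
        }
        out.close(); // finishes the deflater and flushes the remaining output

        InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(compressed.toByteArray()));
        byte[] buffer = new byte[size];
        int n = in.read(buffer); // one call, no loop: this is the bug
        in.close();
        return n;
    }

    public static void main(String[] args) throws IOException {
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 16 * 1024 * 1024;
        System.out.println("Read " + compressThenReadOnce(size) + " bytes.");
    }
}
```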


Now we're wrapping the ByteArrayOutputStream with a DeflaterOutputStream, which compresses the data as it's written out, and the ByteArrayInputStream with an InflaterInputStream, which decompresses the data as it's read in. These streams do indeed invert each other correctly, but now this program prints: "Read 512132 bytes." That's strange, because we expected to get the same number of bytes back after compression followed by decompression. Digging into the contract of InputStream.read(byte[]), you can find the following statement: "Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. This method blocks until input data is available, end of file is detected, or an exception is thrown." What that means is that an InputStream does not provide any guarantee on how many bytes a single read call will return, even if, as in this case, all of the data is "available" in memory. For a non-empty buffer, the only promise is at least one byte, or -1 at the end of the stream. The InflaterInputStream is most likely designed to inflate data in chunks and be efficient regardless of the underlying InputStream it is reading from. Taking this fact into account, the final example produces the expected output:
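One way to write that loop is sketched below, again with placeholder names, a zero-filled payload, and a 16 MB default size. It simply counts bytes and keeps calling read until it sees -1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; the names are illustrative only.
final class CompressedLoopReadDemo {

    // Compresses `size` bytes into memory, then decompresses by calling
    // read() until it returns -1. Returns the total number of bytes read.
    static long compressThenReadAll(int size) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        OutputStream out = new DeflaterOutputStream(compressed);
        byte[] chunk = new byte[8192]; // zero-filled payload (an assumption)
        for (int written = 0; written < size; written += chunk.length) {
            out.write(chunk, 0, Math.min(chunk.length, size - written));
        }
        out.close(); // finishes the deflater and flushes the remaining output

        InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(compressed.toByteArray()));
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // Each call may return fewer bytes than requested; only -1 means done.
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        in.close();
        return total;
    }

    public static void main(String[] args) throws IOException {
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 16 * 1024 * 1024;
        System.out.println("Read " + compressThenReadAll(size) + " bytes.");
    }
}
```

A small fixed-size buffer is enough here because the loop only counts bytes; code that needs the data itself would copy each chunk somewhere before the next read.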


The great thing about all of this is that, if your data is small enough, the second example will actually work properly. Thus even testing may not catch the bug, which then only shows up later on larger inputs. So the lesson to be learned is to always wrap InputStream reads in a loop and only consider the read done once you see that -1!
