When working with files, we often end up holding what we read as a list and using Enum to iterate over the data. Enum is a module whose functions work on any data structure that implements the Enumerable protocol; they generally take values out of the data structure and perform some operation on them. Enumerable goes hand in hand with the Collectable protocol, which does the inverse: it collects values into a data structure. Enum functions are eagerly evaluated, and most of them run in linear time.
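To make the eager behaviour concrete, here is a minimal sketch. The variable names are only illustrative. Each Enum call finishes and returns a complete list before the next one starts, and Enum.into/2 shows the Collectable side by pushing values into a map.

```elixir
# Enum is eager: every step walks its whole input and builds a full list.
odd_squares =
  1..5
  |> Enum.map(fn n -> n * n end)              # [1, 4, 9, 16, 25]
  |> Enum.filter(fn n -> rem(n, 2) == 1 end)  # [1, 9, 25]

# Collectable is the inverse: Enum.into/2 collects values into a structure.
lookup = Enum.into([a: 1, b: 2], %{})

IO.inspect(odd_squares)
IO.inspect(lookup)
```

With five elements the intermediate lists are harmless; with millions of lines from a file, each of those intermediate lists is held in memory in full.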
We run into issues both when reading the whole file into memory and when using Enum to operate on the result: every eager step builds a complete intermediate list, so memory usage can become quite large. Fortunately, Elixir has the Stream module, which works similarly to Enum but is lazily evaluated, so it can work on very large or even infinite collections. Similarly, there is a File.Stream data structure, created with File.stream!, that reads a file line by line or a fixed number of bytes at a time. This allows us to limit memory usage when operating on very large files and lists.
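As a rough sketch of the lazy approach, the snippet below first takes three values from an infinite stream, then streams a small file line by line. The file name and its contents are made up for the example and written to the system temp directory so the script can run on its own.

```elixir
# Stream is lazy: nothing is computed until Enum.take/2 asks for values,
# so an infinite source is fine.
first_evens =
  Stream.iterate(1, &(&1 + 1))
  |> Stream.filter(&(rem(&1, 2) == 0))
  |> Enum.take(3)

# Hypothetical sample file, created only for this example.
path = Path.join(System.tmp_dir!(), "stream_example.txt")
File.write!(path, "alpha\nbeta\ngamma\n")

# File.stream!/1 reads line by line; each line passes through the whole
# pipeline before the next one is read from disk.
line_lengths =
  path
  |> File.stream!()
  |> Stream.map(&String.trim_trailing(&1, "\n"))
  |> Stream.map(&String.length/1)
  |> Enum.to_list()

File.rm!(path)

IO.inspect(first_evens)
IO.inspect(line_lengths)
```

Reading fixed-size chunks instead of lines takes a byte count argument. Its position in the File.stream! call has changed between Elixir versions, so it is worth checking the documentation for the version you are running.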
A Stream pipeline is evaluated in a single process, which limits performance on very large collections. The Flow library is built on GenStage and improves performance by splitting the work across stages that run in parallel processes and handle the data in batches.
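A minimal Flow sketch might look like the following. It assumes Elixir 1.12 or later (for Mix.install/1) and network access to fetch the package; the version requirement shown is an assumption, not something this episode specifies. Flow does not promise to emit results in input order, so the example ends with an order-independent sum.

```elixir
# Assumption: Flow is fetched on the fly; in a real project it would be
# listed as a dependency in mix.exs instead.
Mix.install([{:flow, "~> 1.2"}])

total =
  1..1_000
  |> Flow.from_enumerable()   # split the input across parallel stages
  |> Flow.map(&(&1 * 2))      # runs concurrently in those stages
  |> Enum.sum()               # consuming the flow starts the work

IO.inspect(total)
```

For a workload this small, the cost of starting and coordinating processes likely outweighs any gain; the parallelism is aimed at large inputs and more expensive per-item work.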
In the following video, we’ll cover using File.stream! to read in a very large file. We’ll compare the performance of Enum-, Stream-, and Flow-based approaches to working with large files and lists using Benchee.