Learn With Me: Elixir - ElixirLargeSort IntSort Project Part 4 (#82)
In the previous section, I created a merging mechanism in the form of a merge stream that can produce a stream of merged integers from multiple integer files. That merge stream is really useful, but we haven't done anything with it yet. Today I'm going to go over how I created the functionality that uses the merge stream to perform the file merging.
This is the point where I had to leave the stream of data that I've been building up so far (I love streams!) and use file names directly. I originally attempted to merge streams together into other streams until I was left with a single stream, which would then be written to the final output file. This did not work out because the stream pipeline has to be set up in advance when the code is created. I cannot predict how many merges will be needed, so I was unable to create a dynamic stream merging pipeline that's determined at runtime. If anyone figures out how to do this in Elixir, I'd love to hear it.
So instead the merging function receives a collection of intermediate file names, performs the merge, and returns a collection of output file names.
Creating and Merging Multiple File Groups
The main merging function is IntSort.merge_intermediate_files/4
, which can be found in lib/int_sort.ex. It takes in a collection of file names, the number of files to be merged at a time, a function that creates a merge file name (for the output file), and a callback function that we can use to get feedback on the progress of the merge.
Here's what the function looks like.
@doc """
Does a single round of merges on a collection of intermediate files.
This function only does a single round, merging groups of N intermediate
files together, where N is defined by the `merge_count` parameter. The merge
will result in ceil(N/merge_count) files containing the merged integers.
This function will likely be called multiple times until it results in
a single file.
## Parameters
- files: A collection of file names of the intermediate files to be merged
- merge_count: The number of files to be merged at once
- merge_file_name: A function that takes in the merge group number and
returns the file name to use for the merge file
- integer_merged: A function that is called when an integer is merged. This
function takes a single parameter, which is the number of integers that have
been merged during this round of merges. This function can be used to display
or measure merge progress
## Returns
A stream that emits the file names containing the merged integers from this
round
"""
@spec merge_intermediate_files(
Enum.t(),
pos_integer(),
(non_neg_integer() -> String.t()),
(non_neg_integer() -> :ok)
) :: Enum.t()
def merge_intermediate_files(
files,
merge_count,
merge_file_name,
integer_merged \\ fn _ -> :ok end
) do
files
# Convert the files to file groups
|> IntermediateFile.create_file_groups(merge_count)
# Merge each file group
|> Stream.scan({[], 0}, fn {file_group, group_num}, {_, total_count} ->
# Get the file name for this group's merged file
group_file_name = merge_file_name.(group_num + 1)
# Create the function that is called every time an integer in the file
# group is merged
group_integer_merged = fn count ->
# Transform the internal group count to an overall integer count
integer_merged.(total_count + count)
end
# Call the function to do the merging for this file group and count how many
# integers are being merged, which also has the effect of causing the stream
# processing to start running.
merge_count = merge_file_group(file_group, group_file_name, group_integer_merged)
|> Enum.count()
# Return the file name and the cumulative number of merged integers
{group_file_name, total_count + merge_count}
end)
# We now have a stream of merge file names and integer counts. Strip out the integer counts.
|> Stream.map(fn {group_file_name, _} -> group_file_name end)
end
This function is responsible for create a merge group, which is a group of input files that are merged to a single output file. If the number of files being merged is high enough, there will be multiple merge groups and multiple output files.
The function calls IntermediateFile.create_file_groups/2
to create chunks of file names for merging and then calls IntSort.merge_file_group/3
to merge that group. At the end, the stream that this function returns emits output files representing the files that were produced by the merge.
When we want to find out about the progress of the merge, there's no stream pipeline to plug a function into, so I had to provide a callback function that is called every time an integer is merged and written to an output file. The callback function gets called with the current integer count, which starts at 1 and ends at N, where N is the number of integers being merged. I will end up using this callback function to update the progress bar in the UI.
I used Stream.scan/3
in the pipeline so that I can keep track of the total merge count all throughout the merge process. Stream.scan/3
is like a reduce operation for streams, with an accumulator value being passed to each function call. Here the function passed to Stream.scan/3
gets called for each merge group. The function merging each group, IntSort.merge_file_group/3
, knows how many integers have been merged in that group and passes that to its own callback function. That count is added to the total count from previous merges to produce the current merge count.
Now that we have an overview of how file groups are created and used, let's go look at how an individual file group is merged.
Merging a Single File Group
Let's look at the code for IntSort.merge_file_group/3
to see how an individual file group is merged.
# Merges a group of integer files into a single integer file. Returns a tuple with the stream
# that emits the integers being merged
@spec merge_file_group(Enum.t(), String.t(), (non_neg_integer() -> :ok)) :: Enum.t()
defp merge_file_group(file_group, merged_file, integer_merged) do
# Open each file in the group as a file device
file_devices =
Enum.map(file_group, fn file ->
@integer_file.read_device!(file)
end)
# Create a merge stream to merge the file devices
merge_stream = IntermediateFile.merge_stream(file_devices, &@integer_file.close_device/1)
# Open a stream for the output file
output_stream = @integer_file.integer_file_stream(merged_file)
# Write the merged integers from the merge stream to the output file
@integer_file.write_integers_to_stream(merge_stream, output_stream)
|> Stream.with_index(1)
|> Stream.each(fn {_, count} -> integer_merged.(count) end)
end
The stream of file names to be merged is transformed into a stream of file devices. The file devices are passed to IntermediateFile.merge_stream/2
to create the stream of merged integers. The merge stream is piped to an output file stream and the number of merged integers is updated every time an integer is written to the output file file stream.
The concept of a merge stream makes this part rather simple. Some of the code in this part has been getting a bit more complex that I would like and it isn't always easy to follow. However, I feel that this particular function is pretty simple, straightforward, and easy to follow, even through it's doing a lot of stuff. This is what I want all my Elixir code to look like.
You can find the tests for file merging in test/int_sort_test.exs. Once again, the tests were useful. In running this functionality through a variety of scenarios, I found some conditions under which it failed. After getting the tests to pass, I'm fairly confident that this code will perform as expected once I combine it into a finished application. The tests will continue to be useful into the future as I (or someone else) maintain the code. If you haven't noticed, I really like automated tests.
Merging All The Files Over Multiple Generations
Now that I have the code to perform a single merge generation, I'm going to implement the function that performs all the merge generations. It performs a merge on all the input files to produce a set of output files. If the number of output files is 1, it's done. Otherwise, it keeps merging output files from the previous merge until only a single file remains.
There are two parts to this "total merge" function. First is the publically-accessible function that provides a shell around the recursive function. It makes the caller's job a bit easier so that they don't need to deal with the details of the recursive version of the function.
@doc """
Takes a collection of intermediate files and performs merges on those files until a single
file remains
## Parameters
- files: A collection of the paths of files to be merged
- merge_count: The number of files to merge at once as part of a single merge group. The
merge count must be at least 2
- gen_file_name: The function that creates the file path for a gen file, which is an
intermediate file associated with a particular merge generation and merge file group,
which will contain the results of the merge. The first parameter is the merge generation
number and the second parameter is the merge group number.
- merge_file_gen: The function that performs the entire merge process for each merge
generation. See the documentation on `IntSort.merge_intermediate_files/4` for details
regarding what this function received. Ideally, `IntSort.merge_intermediate_files/4` will
be passed as this parameter, but under other circumstances (such as testing) a different
function can be passed.
- remove_files: The function that will remove any intermediate files that are no longer needed.
This function receives a collection of file paths to be removed.
If you don't want to remove intermediate files, then pass in a function that does nothing.
- integer_merged: A function that is called when an integer is merged. This
function takes two parameters. The first parameter is the merge generation and the second
parameter is the number of integers merged during that particular generation. This function
can be used to display or measure merge progress
## Returns
The path of the file containing all the integers merged together
"""
@spec total_merge(
Enum.t(),
pos_integer(),
(non_neg_integer(), non_neg_integer() -> String.t()),
(Enum.t(),
pos_integer(),
(non_neg_integer() -> String.t()),
(non_neg_integer() -> :ok) ->
Enum.t()),
(Enum.t() -> :ok),
(non_neg_integer(), non_neg_integer() -> :ok)
) :: String.t()
def total_merge(
files,
merge_count,
gen_file_name,
merge_file_gen,
remove_files,
integer_merged \\ fn _, _ -> :ok end
) when merge_count > 1 do
# Do a recursive merge
[merged_file] =
total_merge(
files,
Enum.count(files),
@initial_merge_gen,
merge_count,
gen_file_name,
merge_file_gen,
remove_files,
integer_merged
)
# Take the remaining merge file and return it
merged_file
end
Yeah, that typespec is pretty ugly due to all the function parameters that the function takes.
The real work is done in the recursive version of the function, named do_total_merge/8
. I've noticed from reading Elixir source code that it's a common convention to have a recursive helper function with "do_" followed by the name of the main function.
# The recursive implementation of the total_merge function, which returns the merge files resulting from each merge iteration
@spec do_total_merge(
Enum.t(),
non_neg_integer(),
non_neg_integer(),
pos_integer(),
(non_neg_integer(), non_neg_integer() -> String.t()),
(Enum.t(),
pos_integer(),
(non_neg_integer() -> String.t()),
(non_neg_integer() -> :ok) ->
Enum.t()),
(Enum.t() -> :ok),
(non_neg_integer(), non_neg_integer() -> :ok),
(non_neg_integer(), non_neg_integer() -> :ok)
) :: Enum.t()
defp do_total_merge(files, file_count, _, _, _, _, _, _, _) when file_count <= 1 do
files
end
defp do_total_merge(
files,
_,
merge_gen,
merge_count,
gen_file_name,
merge_file_gen,
remove_files,
integer_merged,
merge_gen_completed
) do
# Create the function that creates a merge file name for this generation
merge_file_name = fn num -> gen_file_name.(merge_gen, num) end
# Create the callback function that gets called to keep track of merge progress
gen_integer_merged = fn count -> integer_merged.(merge_gen, count) end
# Perform the merge for this merge generation
merged_files =
merge_file_gen.(files, merge_count, merge_file_name, gen_integer_merged)
|> Enum.to_list()
# Call the callback to notify of the completion of the merge generation
merge_gen_completed.(merge_gen, Enum.count(merged_files))
# Remove any files that were merged
remove_files.(files)
# Do a recursive call to merge the next generation of merged files
result =
do_total_merge(
merged_files,
Enum.count(merged_files),
merge_gen + 1,
merge_count,
gen_file_name,
merge_file_gen,
remove_files,
integer_merged,
merge_gen_completed
)
result
end
As long as the number of files being passed into be merged is greater than 1, the function performs a merge generation, merging those files into a set of output files. When the files finally have been merged into a single file, the base case function clause is called and that file is returned.
I feel that this function is a bit ugly and I'm not entirely satisfied with it. The number of parameters is high and the typespec is really ugly. In addition, the Elixir code formatter makes the function very large by putting all the parameters on separate lines. This makes the function harder to maintain and understand. I'd prefer something simpler, but I don't have any ideas at this point. I suspect it's something I'll get better at with more Elixir coding experience.
In retrospect, creating some type definitions would have been really useful to simplify that typespec: that's what they're good for.
On the other hand, the fact that it's mostly parameter-driven means that testing it is a lot easier. The tests can be found in test/int_sort_test.exs, and pass in a bunch of dummy functions that record what data was passed to them. That makes it easy to verify that total_merge/6
is doing what it is supposed to be doing. I did indeed find a couple bugs when running the tests, so I'm glad that I was able to effectively test it.
Connecting the Merge Code to the UI
Once I had a pretty good idea from the unit test results that my code was working, I was able to connect it to the UI.
Doing this was pretty similar to connecting the chunking functionality to the UI. I gave it a callback that gets called every time an integer is merged so that we can display and update a progress bar. As before, I called the function to calculate an update frequency in order to avoid excessive progress bar updates.
This is the function that sits above the total_merge
function an helps coordinate the UI updates.
# Merges the chunk files and returns the path to the final merged file
@spec merge_chunks(Options.t(), Enum.t(), non_neg_integer()) :: String.t()
defp merge_chunks(options, chunk_files, num_integers) do
# Calculate the progress update frequency for integer merging
update_frequency = progress_update_frequency(num_integers, @progress_updates)
# Perform the merge
merge_status = fn _, count -> file_merge_status(count, num_integers, update_frequency) end
merged_file =
IntSort.total_merge(
chunk_files,
@merge_files,
&IntSort.gen_file_name/2,
&IntSort.merge_intermediate_files/4,
remove_files_func(not options.keep_intermediate),
merge_status,
&merge_gen_completed/2
)
merged_file
end
The file_merge_status/3
function is called every time an integer is merged and calls the update_progress_bar/3
function that I had previously created when implementing the chunking UI.
# Outputs the current status of the ongoing file merge
@spec file_merge_status(non_neg_integer(), non_neg_integer(), non_neg_integer()) :: :ok
defp file_merge_status(count, total_count, update_frequency) do
update_progress_bar(count, total_count, update_frequency)
end
While working on this, I found that it wasn't easy to figure out when the merge generation had completed and how many files had resulted without performing a bunch of separate calculations. I wanted to make sure that the UI stayed in sync with the actual merge code, so I implemented the merge_gen_completed
callback function in IntSort.total_merge
. This callback function is called whenever a merge generation has completed and passes the generation number as well as the number of files that resulted from the merge. I then created a callback function that outputs the information. I'm sure glad I had those unit tests to verify I didn't mess anything up when changing the total_merge
function.
Outputting the merge information also has the effect of making the cursor go to a different line when the merge generation has completed. So any further calls to update the progress bar will act on a new progress bar on a different line rather than the previous progress bar.
defp merge_gen_completed(gen, _) do
IO.puts("Gen #{gen - 1} files were merged into a single output file")
end
In the end, I had a nice UI showing the current progress. Here's what it looks like when chunking and merging 100 integers in chunks of 20.
> ./int_sort --input-file "../data/random_integers.txt" --chunk-size 20 --keep-intermediate "../output/sorted_integers.txt"
Determining the number of integers and chunks in the input file...
Number of Integers: 100
Number of Chunks: 5
Creating Chunk Files
|=====================================================================| 100% (5)
5 Gen 1 intermediate files were generated
Merging Gen 1 intermediate files
|===================================================================| 100% (100)
Gen 1 files were merged into a single output file
That output is very similar to the Node.js and .NET Core versions of this project.
Integration Testing
In the other versions of this project I did, I did manual integration testing by running the program with various inputs and examining the outputs. In the Elixir version of this project, I'm going to do some automated integration testing. I'm pretty confident this will work because of the unit testing I've done, but I want to do an automated integeration test to make sure. It will also help catch any issues from future changes.
Output Suppression
The output coming from the application will get all jumbled up with the test output, so I added a "--silent" flag to the command line options, so that the application will not output anything when the test are being run. That was pretty straightforward. I just added the flag to the CLI options and modified the argument parsing code in lib/cli/cli_args.ex. Then I added some tests for the flag and modified the existing tests so that they were no longer failing.
The more significant step was actually modifying the application so that it suppresses any output. Since this is a functional language, there really isn't anything external to check. The module attributes are determined at compile time and anything at runtime has to be passed as a function parameter. So I added an "output" function as a parameter to every function that writes output to the screen so that all the decisions regarding whether to actually output anything to the screen is made in one function instead of all over the place. That makes me really happy that I went to the trouble of separating my UI functions from the rest of the application.
Passing in functions as parameters like this could get really ugly when a lot of functions need to be passedin. In that case, I'd probably use a behaviour parameter and pass in a different module depending on the command line arguments. One could also pass a map containing functions or even entire modules. That would help with parameter bloat.
The progress bar code is from a library, and always outputs to the screen. So any code that updates the progress bar needs a silent
parameter, which allows us to simply not call the code to update the progress bar. If I had to add any more parameters, I'd probably just create an application-specific IO module with a behaviour and pass that in as parameter, but it's not worth it here.
Note that the only exception to the output suppression is the help text and the error text. Neither of the functions that output this data have the silent flag available to them, and they represent special situations in which text should be displayed. We won't be seeing this text in our integration tests unless we did something wrong in creating the tests.
Here's what the output function code looks like. Note that it creates and returns a function. Normally, it returns a function that writes the output to the screen, but when the --silent
option is enabled, it returns a function that does nothing.
# Returns the function used for output
@spec output_func(Options.t()) :: output_func()
defp output_func(%Options{silent: true}) do
fn _ -> :ok end
end
defp output_func(_) do
fn output -> IO.puts(output) end
end
Here's some code where the output function is used. I had to modify the code to call the output function instead of IO.puts
.
efp process({:ok, options}) do
# Retrieve the function to use for output
output = output_func(options)
# Calculate the number of integers and chunks in the input file
display_integer_counting_message(output)
{num_integers, num_chunks} =
integer_chunk_counts(options.input_file, options.chunk_size)
|> display_integer_counting_result(output)
# Create the chunk files
chunk_files = create_chunks(options, num_chunks, output, options.silent) |> Enum.to_list()
output.("#{Enum.count(chunk_files)} Gen 1 intermediate files were generated")
# Merge the chunk files
output.("Merging Gen 1 intermediate files")
merge_chunks(options, chunk_files, num_integers, output)
end
A quick (manual) test shows me that this works. The usual output is displayed when the "--silent" option is missing, and nothing is displayed when the "--silent" option is specified. Success!
The Integration Tests
The integration test can be found in test/integration_test.exs. I have a generic test function that takes in the number of integers to sort and the chunk size as parameters and then runs the test. Then I call this generic test function for a variety of scenarios.
The test creates an input file with random integers, creates a list of command line arguments (along with the "--silent" flag), calls the CLI.main/1
function with those arguments, and then examines the results to see if everything turned out correctly.
It examines the output file to see that the expected integers are there and that they are all sorted. If the intermediate files were to be removed (no "--keep-intermediate" flag), then it verifies that the input and output files are the only files in the test directory. If the intermediate were to be kept (the "--keep-intermediate" flag was specified), then it verifies that the expected intermediate files are present. It does not check the contents of all the intermediate files, since I think that aspect was well-tested enough in unit testing and doing it here would be a lot of effort and code for little benefit. As long as the final output is as-expected, I'm satisfied.
The integration tests went through a variety of scenarios varying the number of files being sorted with the chunk size, and testing various combinations such as an integer count that isn't evenly divisible by the chunk size, one where the integer count is evenly divisible, one where the number of integers is less than the chunk size, and so forth. The largest test sorts a million integers with a chunk size of 1,223. I had to avoid combinations that would result in an excessive runtime for the integration tests. The total runtime on my machine for all tests is around 30 seconds, and about half that time is spent doing integration tests.
Integration testing was generally quite successful. My unit tests seem to have prevented most issues from occurring at this stage in testing. I did find one issue, and that was sorting an empty input file. It's the empty collection/file tests that have often failed in the past, so I wasn't surprised to see this failing. It turned out that no chunk files were being created, and the merge function was expecting at least one file in the input. I fixed this so that an empty output file is created when there are no chunk files.
All done!
The Final Product
Well, I'm actually not quite all done. There's an opportunity for a little polish. All the other projects implementing this sorting functionality have measured how long it takes to complete the entire sorting process, so I want to implement that here as well so as to compare performance with those other implementations.
Measuring the Runtime
First, I dug up an Elixir timing function on Stackoverflow and put it in my code. This function measures the amount of time it takes another function to run and returns it as a number of seconds.
# Measures how long it takes for a function to complete and returns the amount
# of time in milliseconds
@spec measure(function()) :: non_neg_integer()
defp measure(function) do
function
|> :timer.tc()
|> elem(0)
|> Kernel.div(1_000)
end
I then altered it to return the number of milliseconds (it can impressively go all the way down to microseconds) and had it do integer division so that it returns an integer instead of a float.
Once I have the number of milliseconds that the function took to complete, I want to convert it into a human-readable string telling me the number of hours, minutes, seconds, and milliseconds. It turned out that the code to do this well was significant.
# Returns a text description of the ellapsed time
@spec ellapsed_time_description(non_neg_integer()) :: String.t()
defp ellapsed_time_description(0), do: "0ms"
defp ellapsed_time_description(num_ms) do
# The number of time units (hours, minutes, seconds, ms) are stored in a list
# that is built up and the remaining milliseconds are passed to the next function
# until we've built up a list of time units. We have to reverse the time units
# at the end because they've been in reverse order
{time_units, _} =
{[], num_ms}
|> hours_ms()
|> minutes_ms()
|> seconds_ms()
|> milliseconds_ms()
# Take the first two non-zero time units and assign them a unit type number
time_units =
time_units
|> Enum.reverse()
|> Enum.with_index(1)
|> Enum.filter(fn {unit_count, _} -> unit_count > 0 end)
|> Enum.take(2)
# Take the time units and unit type numbers and convert them into a description string
time_units
|> Enum.map(&time_description/1)
|> Enum.join(" ")
end
# Keeps track of the current time units
@type time_unit_list() :: list(non_neg_integer)
# Keeps track of the current time units and the remaining milliseconds
@type time_units_remaining() :: {time_unit_list(), non_neg_integer()}
# Extracts the number of milliseconds that will go evenly into a unit of time and returns number
# of time units and the remainder of milliseconds that did not fit evenly into a time unit
@spec unit_ms(time_units_remaining(), non_neg_integer()) :: time_units_remaining()
defp unit_ms({units, num_ms}, ms_in_unit) do
current_units = div(num_ms, ms_in_unit)
remainder = rem(num_ms, ms_in_unit)
{[current_units | units], remainder}
end
@ms_in_hour 3_600_000
@ms_in_minute 60_000
@ms_in_second 1000
@ms_in_ms 1
# Calculates the number of milliseconds in an hour and returns that along with the remaining
# milliseconds
@spec hours_ms(time_units_remaining()) :: time_units_remaining()
defp hours_ms(time_units), do: unit_ms(time_units, @ms_in_hour)
# Calculates the number of milliseconds in a minute and returns that along with the remaining
# milliseconds
@spec minutes_ms(time_units_remaining()) :: time_units_remaining()
defp minutes_ms(time_units), do: unit_ms(time_units, @ms_in_minute)
# Calculates the number of milliseconds in a second and returns that along with the remaining
# milliseconds
@spec seconds_ms(time_units_remaining()) :: time_units_remaining()
defp seconds_ms(time_units), do: unit_ms(time_units, @ms_in_second)
# Calculates the number of milliseconds in a millisecond and returns that along with the remaining
# milliseconds
@spec milliseconds_ms(time_units_remaining()) :: time_units_remaining()
defp milliseconds_ms(time_units), do: unit_ms(time_units, @ms_in_ms)
# Takes a time count count and a time unit type number and converts that into a time description
@spec time_description({non_neg_integer, pos_integer}) :: String.t()
defp time_description({unit_count, 1}), do: hours_description(unit_count)
defp time_description({unit_count, 2}), do: minutes_description(unit_count)
defp time_description({unit_count, 3}), do: seconds_description(unit_count)
defp time_description({unit_count, 4}), do: ms_description(unit_count)
# Produces a text description of the number of hours
@spec hours_description(time_unit_list()) :: String.t()
defp hours_description(num_hours) do
"#{num_hours}h"
end
# Produces a text description of the number of minutes
@spec minutes_description(non_neg_integer()) :: String.t()
defp minutes_description(num_minutes) do
"#{num_minutes}m"
end
# Produces a text description of the number of seconds
@spec seconds_description(non_neg_integer()) :: String.t()
defp seconds_description(num_seconds) do
"#{num_seconds}s"
end
# Produces a text description of the number of milliseconds
@spec ms_description(non_neg_integer()) :: String.t()
defp ms_description(num_ms) do
"#{num_ms}ms"
end
The ellapsed_time_description/1
function runs the number of milliseconds through a pipeline to convert it to a list of time units. So [2, 34, 0, 89]
would be 2 hours, 34 minutes, 0 seconds, and 89 milliseconds. This works using the unit_ms/2
function, which does the generic math to figure out the number of milliseconds in a generic unit of time. I then create time-unit-specific functions (hours_ms/1
, minutes_ms/1
, etc.) that call unit_ms/2
with the number of milliseconds in an hour, minute, etc. That saves me from having to repeat the calculation code.
{time_units, _} =
{[], num_ms}
|> hours_ms()
|> minutes_ms()
|> seconds_ms()
|> milliseconds_ms()
It would look odd to display a string like "2h 0m 5s 139ms" all the time, especially if some of those time units are 0. So what I do within ellapsed_time_description/1
is take the first two non-zero time units and discard the rest. This keeps 0 values out of the time and only displays the most significant units. Then I assign each time unit a time unit number, identifying which time unit this is, so that we know which two time units these are when converting to a string.
# Take the first two non-zero time units and assign them a unit type number
time_units =
time_units
|> Enum.reverse()
|> Enum.with_index(1)
|> Enum.filter(fn {unit_count, _} -> unit_count > 0 end)
|> Enum.take(2)
Then the collection of time units to be converted to a string are mapped to a string description by calling time_description/2
and the strings are joined together to create a final time description string.
Here are some examples of the data transformation from milliseconds to a human-readable time.
7,205,139 millseconds -> [2, 0, 5, 139]
-> [{2, 1}, {5, 3}]
-> "2h 5m"
12,439 milliseconds -> [0, 0, 12, 439]
-> [{12, 3}, {439, 4}]
-> "12s 439ms"
505,000 milliseconds -> [0, 8, 25, 0]
-> [{8, 2}, {25, 3}]
-> "8m 25s"
874 millseconds -> [0, 0, 0, 874]
-> [{874, 4}]
-> "874 ms"
The ellapsed_time_description/1
function also has a special clause for 0 millseconds, since that would result in an empty string in the standard way of calculating a time description. The clause just returns "0ms" and the work is done.
defp ellapsed_time_description(0), do: "0ms"
Once I finished the time description code, I realized that it would be useful in both the int_gen and int_sort applications, so I moved it the LargeSort.Shared.CLI
module (a new module) in lib/cli.ex in the largesort_shared project, so that it could be used in both projects.
Modifying the int_gen project to use the timing code was really easy. I just had to replace the code that ran IntGen.create_integer_file/3
with code that does the same thing with timing code.
# Create the integer file using the random integer stream
time_description = CLI.measure(fn ->
IntGen.create_integer_file(options.output_file, options.count, random_stream)
end)
|> CLI.ellapsed_time_description()
# Output how much time it took to generate the integers
output_runtime_description(time_description)
Running the Final Product
I've now reached the point where the project is finally completed. I have a reasonably polished UI and a comprehensive set of tests. I've written far more tests for the Elixir version than I've written for the C# or Node.js versions of this project, so I'm pretty confident it will work pretty well. There are probably a few corner cases I missed, but I expect those will be rarely encountered.
Now I can throw some large numbers of integers at it and see how it does. With the timing functionality, I can compare its performance to the C# and Node.js versions. I'm actually expecting this to not perform as well as those other versions because single-threaded processing is not what Elixir specializes in. It's more suited toward reliability and running many things concurrently.
Here's where I generated 1000 integers between -1000 and 1000 and then sorted them.
> cd int_gen\
> ./int_gen --count 1000 --lower-bound -1000 --upper-bound 1000 "../data/random_integers.txt"
|==================================================================| 100% (1000)
328ms
> cd ../int_sort
> ./int_sort --input-file "../data/random_integers.txt" --chunk-size 100 "../output/sorted_integers.txt"
Determining the number of integers and chunks in the input file...
Number of Integers: 1000
Number of Chunks: 10
Creating Chunk Files
|====================================================================| 100% (10)
10 Gen 1 intermediate files were generated
Merging Gen 1 intermediate files
|==================================================================| 100% (1000)
Gen 1 files were merged into a single output file
407ms
> ls ../output
sorted_integers.txt
> head -20 ..\output\sorted_integers.txt
-1000
-996
-994
-992
-991
-986
-985
-982
-979
-976
-974
-973
-964
-963
-959
-956
-943
-937
-936
-933
> tail -20 ..\output\sorted_integers.txt
976
976
977
978
979
979
982
983
983
986
986
987
987
990
992
993
993
993
995
999
That worked nicely, but I already knew that it would because I'd tried very similar operations while testing out the UI functionality.
Let's step it up a bit and try sorting 10,000 integers.
> cd ../int_gen
> ./int_gen --count 10000 --lower-bound -1000 --upper-bound 1000 "../data/random_integers.txt"
|=================================================================| 100% (10000)
343ms
> cd ../int_sort
> ./int_sort --input-file "../data/random_integers.txt" --chunk-size 100 "../output/sorted_integers.txt"
Determining the number of integers and chunks in the input file...
Number of Integers: 10000
Number of Chunks: 100
Creating Chunk Files
|===================================================================| 100% (100)
100 Gen 1 intermediate files were generated
Merging Gen 1 intermediate files
|=================================================================| 100% (10000)
Gen 1 files were merged into 10 Gen 2 files
Merging Gen 2 intermediate files
|=================================================================| 100% (10000)
Gen 2 files were merged into a single output file
1s 219ms
> ls -l ../output
total 44
-rw-r--r-- 1 kpeter 1049089 43995 Jan 29 16:30 sorted_integers.txt
Now let's sort a million integers in chunks of 1000
> cd ../int_gen
> ./int_gen --count 1000000 --lower-bound -1000 --upper-bound 1000 "../data/random_integers.txt"
|===============================================================| 100% (1000000)
9s 46ms
> cd ../int_sort
> ./int_sort --input-file "../data/random_integers.txt" --chunk-size 1000 "../output/sorted_integers.txt"
Determining the number of integers and chunks in the input file...
Number of Integers: 1000000
Number of Chunks: 1000
Creating Chunk Files
|==================================================================| 100% (1000)
1000 Gen 1 intermediate files were generated
Merging Gen 1 intermediate files
|===============================================================| 100% (1000000)
Gen 1 files were merged into 100 Gen 2 files
Merging Gen 2 intermediate files
|===============================================================| 100% (1000000)
Gen 2 files were merged into 10 Gen 3 files
Merging Gen 3 intermediate files
|===============================================================| 100% (1000000)
Gen 3 files were merged into a single output file
57s 422ms
This is the first time I've had to wait any significant amount of time for the integer generation or sorting operations to finish. Generating the integers took about 9 seconds and sorting the integers look almost a minute. The progress bars worked beautifully the entire time.
What I've noticed from playing around with this is that when the number of integers stays the same, the number of chunks doesn't seem to make a big difference for the amount of time each individual merge takes. However, the number of merges increases, which increases the overall amount of time. So it is clearly beneficial to make the chunks as big as possible without running into memory issues.
Now it's time for the big test: generating and sorting a billion integers. That will really stress this application and reveal the performance relative to my other implementations in C# and Node.js. It will certainly take many hours to complete. Due to single-threaded File I/O not being one of Elixir's strong areas, I'm expecting it to be slower than both the C# and Node.js implementations.
Here we go.
> cd ../int_gen
> ./int_gen --count 1000000000 --lower-bound -100000 --upper-bound 100000 "../data/random_integers.txt"
|============================================================| 100% (1000000000)
2h 31m
> ls -l ../data
total 6239164
-rw-r--r-- 1 kpeter 1049089 6388902455 Jan 31 13:44 random_integers.txt
> cd ../int_sort
> ./int_sort --input-file "../data/random_integers.txt" --chunk-size 100000 "../output/sorted_integers.txt"
Determining the number of integers and chunks in the input file...
Number of Integers: 1000000000
Number of Chunks: 10000
Creating Chunk Files
|=================================================================| 100% (10000)
10000 Gen 1 intermediate files were generated
Merging Gen 1 intermediate files
|============================================================| 100% (1000000000)
Gen 1 files were merged into 1000 Gen 2 files
Merging Gen 2 intermediate files
|============================================================| 100% (1000000000)
Gen 2 files were merged into 100 Gen 3 files
Merging Gen 3 intermediate files
|============================================================| 100% (1000000000)
Gen 3 files were merged into 10 Gen 4 files
Merging Gen 4 intermediate files
|============================================================| 100% (1000000000)
Gen 4 files were merged into a single output file
20h 19m
> ls -l ../output
total 6239164
-rw-r--r-- 1 kpeter 1049089 6388902455 Feb 1 10:32 sorted_integers.txt
It took four merge generations to merge everything down to a single file. The input and output files were both 6.3 GB, which is a lot of data sort and merge.
Integer generation time for the C# implementation was 4 minutes 24 seconds.
For the Node.js implementation, the integer generation time was about 3 hours 11 minutes. For the Elixir version, the integer generation time was 2 hours 31 minutes. That's a lot slower than the C# implementation, but somewhat faster than the Node.js implementation.
Integer sorting time for the C# implementation was 1 hour 40 minutes.
For the Node.js implementation, the sorting time was 19 hours.
For the Elixir version, the sorting time was 20 hours 19 minutes. That's about where I thought it would end up: a little slower than the Node.js implementation and a lot slower than the C# implementation. After all, like Node.js, this sort of single-threaded file-I/O-heavy processing is not what Elixir is optimized for.
I ran this on a machine with a quad core processor, and during the sorting process, the Erlang runtime consistently maxed out one of the processors, using 25% of the total processing power on the machine. Memory usage stayed between 10 MB and 20 MB the entire time, which surprised me. I was expecting a higher memory usage considering that the Erlang VM (BEAM) runtime is also in that process and that a lot of data is being allocated and deallocated quickly. Memory usage was similar to the C# implementation and way lower than the Node.js implementation.
That's it! I've finally finished. I learned a lot about writing Elixir code in the process of implementing this project. I feel like there's definitely room for improvement in the way I structured the code, but it's all part of the process of learning by doing. The more code I write, the better I'll get at it.
As usual, the implementation took a lot longer than I had originally anticipated. Next, we'll move onto some other topic and get back to learning about stuff.