In this post, we will continue from part 1 and look at some more features in Elixir. This includes the pipe operator, working with external dependencies in scripts, and anonymous and higher-order functions. We will also continue using pattern matching, which we touched on briefly in part 1. To do this, we are going to build a small text processor to extract some information from large chunks of text, in this case the book 20000 Leagues under the Sea, by Jules Verne.
This exercise was inspired by an introduction to Clojure called Clojure in a nutshell, by James Turnbull.
Our mission
Our mission here is to get the text of the book, 20000 Leagues under the Sea, and process its contents using Elixir. Once we have the book contents, we are going to do a few tasks to get some information out of it.
- Get the number of words in the book
- Get the 5 most used (interesting) words
- Get the 5 longest words
- Get the 5 longest palindromes
To accomplish this, we are going to create a module called TextProcessor, which will handle most of the processing we want to do.
If you would like to build this yourself as you go through this article, I highly recommend installing Livebook and creating a new notebook there. It is probably the easiest way to try this out interactively. Livebook bundles Elixir, so you do not need to install anything besides Livebook itself to get started.
Adding external dependencies
To get the book, we are going to fetch it from Project Gutenberg. It is available as a plain text file from the URL https://www.gutenberg.org/cache/epub/164/pg164.txt.
We are going to add an external dependency to make the HTTP request to fetch the book. A popular HTTP client library is HTTPoison, which we are going to use.
The typical tool for working with Elixir projects is the command-line tool mix. It is like the go command for Go, cargo for Rust, or bun/deno for JavaScript/TypeScript. However, we will not introduce a new command-line tool here; instead we will use the programmatic interface for Mix. This is useful if you write scripts with Elixir, or use Livebook.
For now, we just add the following to the start of our code, which will install the HTTPoison library and make it available to us:
Mix.install([
  {:httpoison, "~> 2.2"}
])
Build the text processor
We are going to create a new module called TextProcessor. In it, we will add functions to perform our tasks, step by step.
Get the book
First, let us get the book text. The HTTPoison module has a function get
which we can use. Let us try it out:
HTTPoison.get("https://www.gutenberg.org/cache/epub/164/pg164.txt")
The output from this call will look something like this:
{:ok,
 %HTTPoison.Response{
   status_code: 200,
   body: "\uFEFFThe Project Gutenberg eBook of Twenty Thousand Leagues under the Sea\r\n....",
   headers: [
     {"date", "Mon, 16 Sep 2024 18:23:47 GMT"},
     {"server", "Apache"},
     {"last-modified", "Sun, 01 Sep 2024 08:35:29 GMT"},
     {"accept-ranges", "bytes"},
     {"content-length", "638675"},
     {"x-backend", "gutenweb1"},
     {"content-type", "text/plain; charset=utf-8"}
   ],
   request_url: "https://www.gutenberg.org/cache/epub/164/pg164.txt",
   request: %HTTPoison.Request{
     method: :get,
     url: "https://www.gutenberg.org/cache/epub/164/pg164.txt",
     headers: [],
     body: "",
     params: %{},
     options: []
   }
 }}
At the top level, the result from this call is a tuple with two elements. In Elixir, curly brackets denote a tuple; for example, {1, "hello"} is a tuple comprising the integer 1 and the string "hello".
Here, the tuple comprises the atom :ok and a named structure of type HTTPoison.Response. This structure contains the status code, the body, the headers, the request URL, and the request. The field body in this response structure is what we are looking for.
This is an example of a common pattern for results from function calls in Elixir. If a call is successful, the return value is a tuple where the first element is the atom :ok and the second element is the result data. If the call fails, the return value is a tuple with the atom :error as the first element and the reason for the failure as the second.
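We will leave error handling out of our module, but as a sketch, handling both outcomes with case could look like this (url stands in for the book URL):

url = "https://www.gutenberg.org/cache/epub/164/pg164.txt"

case HTTPoison.get(url) do
  {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
    # success: the second tuple element is the response struct
    body

  {:error, %HTTPoison.Error{reason: reason}} ->
    # failure: the second tuple element carries the reason
    IO.puts("Request failed: #{inspect(reason)}")
    nil
end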
In Elixir, we can use pattern matching to extract the body easily from the call:
{:ok, %HTTPoison.Response{status_code: 200, body: body}} = HTTPoison.get("https://www.gutenberg.org/cache/epub/164/pg164.txt")
body
The pattern we write on the left side of = is the structure we want to match against the return value from the call. If the call is successful and the status code is 200, the match extracts the body field and binds it to the variable named body. The content of body will then be the string with the book text. We do not need to specify all the fields in the response structure, only the ones that matter to us.
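As a small aside, partial matching works the same way for plain maps; here is a tiny, made-up illustration:

%{name: name} = %{name: "Nemo", role: :captain}
name
# => "Nemo"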
With this as a starting point, we can write a function that fetches the book from a URL parameter and returns the book text.
defmodule TextProcessor do
  def get_text(url) do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} = HTTPoison.get(url)
    body
  end
end
We are leaving out the error handling here, and just assuming for now that the call is successful.
Split book into words
We are going to split the book text into words. The approach we are going to use is to use a regular expression to identify what a word is, and create a list of matches to that regular expression. To accomplish this, we will use the Regex.scan()
function. Let us start with a simple example of how that function works:
= "The quick brown fox jumped over the lazy dog's back!!"
example_text Regex.scan(~r/[\w+|']+/, example_text)
will give the result:
[
  ["The"],
  ["quick"],
  ["brown"],
  ["fox"],
  ["jumped"],
  ["over"],
  ["the"],
  ["lazy"],
  ["dog's"],
  ["back"]
]
The result is a list of lists, where each inner list contains one word from the text. Notice that we include apostrophes in the words, but exclude punctuation characters.
We would prefer to just have a list of the words, though, not a list of lists. We can fix that with the List.flatten()
function, i.e. the flatten
function in the List
module.
= "The quick brown fox jumped over the lazy dog's back!!"
example_text List.flatten(Regex.scan(~r/[\w+|']+/, example_text))
will give the result:
["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog's", "back"]
However, the calls do not look as nice this way. We could assign the first call to a variable and then use that variable in the second call. But that would be cumbersome and introduce some wasteful temporary variables. Fortunately, Elixir allows us to use the |>
(pipe) operator to chain calls together.
= "The quick brown fox jumped over the lazy dog's back!!"
example_text Regex.scan(~r/[\w+|']+/, example_text)
|> List.flatten()
The function call after the pipe operator will receive the return value from the first call to Regex.scan
as its first argument.
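A tiny illustration of the equivalence, on a made-up string:

String.upcase("hello")
# => "HELLO"

"hello" |> String.upcase()
# => "HELLO" (the piped value becomes the first argument)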
Using this, we can define a function to get the words from a text:
def get_words(text) do
  Regex.scan(~r/[\w|']+/, text)
  |> List.flatten()
end
We can also combine these two functions we have written:
def get_text_words(url) do
  get_text(url)
  |> get_words()
end
This is another example with the pipe operator.
Count words
Now that we have these functions, we can solve our first task, count the words:
= "https://www.gutenberg.org/cache/epub/164/pg164.txt"
book_url = TextProcessor.get_text_words(book_url)
words
IO.puts "Number of words: #{length(words)}"
The result will be:
Number of words: 112663
First task completed!
Get the most frequent words
Next up is to get the 5 most frequent words. Fortunately, we can use the function frequencies
from the Enum
module. The Enum
module is one of the workhorses of the Elixir standard library, containing many useful functions for working with various types of collections. If we use the frequencies
function, it will return a map structure (a dictionary, in Python terms), where the keys are the words and the values are the frequency counts. Let us try it out with our example text:
= "The quick brown fox jumped over the lazy dog's back!!"
example_text = TextProcessor.get_words(example_text)
example_words Enum.frequencies(example_words)
results in:
%{
  "The" => 1,
  "back" => 1,
  "brown" => 1,
  "dog's" => 1,
  "fox" => 1,
  "jumped" => 1,
  "lazy" => 1,
  "over" => 1,
  "quick" => 1,
  "the" => 1
}
This output is an example of a map/dictionary structure in Elixir, which is delimited by %{
and }
. One thing that we can notice here is that the word “the” appears as two separate entries. So the function frequencies
considers uppercase and lowercase letters to be different.
How can we fix that? An easy approach is to make all words lowercase. We can do that with the String.downcase
function, from the String
module. We need to apply that to all the words. Let us change our get_words
function to do that:
def get_words(text) do
  Regex.scan(~r/[\w|']+/, text)
  |> List.flatten()
  |> Enum.map(fn word -> String.downcase(word) end)
end
Here we are using the Enum.map
function to apply the String.downcase
function to each word in the list. The pipe operator makes it easier to chain calls together in a readable manner.
The Enum.map
function takes a list as its first argument and a function as its second. That function takes one argument, the current word in the list, and returns the transformed word; the overall result is a list of transformed words. Since the pipe operator provides the first argument to the call, the only additional argument we need to pass is the function to transform the words with.
To make the transformation here, we define an anonymous function using the fn
keyword. Right after this keyword come the arguments to the function, followed by the ->
symbol. After that comes the body of the function. The end
keyword signals the end of the function definition.
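A quick made-up illustration of defining and calling an anonymous function on its own (note the dot when calling it):

double = fn x -> x * 2 end
double.(21)
# => 42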
With this addition, we get a slightly different result from our example:
%{
  "back" => 1,
  "brown" => 1,
  "dog's" => 1,
  "fox" => 1,
  "jumped" => 1,
  "lazy" => 1,
  "over" => 1,
  "quick" => 1,
  "the" => 2
}
If we use this with the whole book, we are going to get a lot more word frequencies. Our task also restricted it to the 5 most frequent words. We can use another function from the Enum
module called take
that takes a list and a number as arguments. The take
function will return a list with the first n
elements from the list. Let us try it out:
= "The quick brown fox jumped over the lazy dog's back!!"
example_text = TextProcessor.get_words(example_text)
example_words Enum.frequencies(example_words) |> Enum.take(5)
results in:
[
  {"back", 1},
  {"dog's", 1},
  {"fox", 1},
  {"jumped", 1},
  {"lazy", 1}
]
and with the real book, we get:
[
  {"equipped", 2},
  {<<116, 104, 101, 110, 226>>, 1},
  {"appears", 3},
  {"rouse", 1},
  {"threatening", 3}
]
Since take returns a list, the map content is represented as tuples of key-value pairs. These are obviously not the most frequent words, so we need to sort the data as well. We can do that with the Enum.sort_by
function from the Enum
module. This function takes a collection, and a function which provides the data to sort by for each item in the collection. We also have a third parameter, which is the order of the sort (:asc
or :desc
). Thus, our function to sort the words can then look like this:
def get_most_frequent_words(words, n) do
  words
  |> Enum.frequencies()
  |> Enum.sort_by(fn {_, count} -> count end, :desc)
  |> Enum.take(n)
end
If we use the words from the book and call the get_most_frequent_words
function with the number 5:
TextProcessor.get_most_frequent_words(words, 5)
[{"the", 8770}, {"of", 4232}, {<<226>>, 3514}, {"and", 2680}, {"to", 2631}]
We certainly get more frequent words here, but they are boring words that are likely to be among the most common in any English text (the character code 226 is the character â). It would be better if we could filter these uninteresting words out of our list. We can do that with the Enum.filter
or the Enum.reject
function from the Enum
module. Both functions can select which elements in a collection to keep or discard.
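A quick illustration of the difference, on a made-up list of numbers:

Enum.filter([1, 2, 3, 4], fn x -> rem(x, 2) == 0 end)
# => [2, 4]  (keeps elements for which the function returns true)

Enum.reject([1, 2, 3, 4], fn x -> rem(x, 2) == 0 end)
# => [1, 3]  (discards elements for which the function returns true)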
Since we are going to decide which words to throw away, we can use the Enum.reject
function. That should make the intention clear. The Enum.reject
function takes a collection and a function that is called for each item in the collection. That function determines whether to discard the item by returning true
, or to keep it by returning false
. In our case, we want to filter out the not so interesting words.
def get_most_frequent_words(words, n) do
  words
  |> Enum.reject(&(&1 in @common_words))
  |> Enum.frequencies()
  |> Enum.sort_by(fn {_, count} -> count end, :desc)
  |> Enum.take(n)
end
We have introduced a bit of new syntax here. The expression &(&1 in @common_words)
is a shorthand version of an anonymous function declaration and is the equivalent of fn word -> word in @common_words end
. The first &
indicates that this is an anonymous function, and &1
is the first argument of the anonymous function. Since it is quite common to use various anonymous functions, this shorthand form is very handy.
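A tiny made-up illustration of the shorthand, with one and two arguments:

Enum.map([1, 2, 3], &(&1 * 10))
# => [10, 20, 30]

add = &(&1 + &2)
add.(1, 2)
# => 3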
The second new item here is @common_words
, which is an example of a module attribute. Module attributes have multiple purposes, one of which is to define compile-time constants. Here, we are defining the list of common words that we want to filter out; with a bit of help from ChatGPT, I created a list of common English words for this.
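A minimal sketch of a module attribute used as a compile-time constant (the module and values here are made up):

defmodule Circle do
  @pi 3.14159

  # @pi is expanded at compile time, like a constant
  def area(radius), do: @pi * radius * radius
end

Circle.area(2.0)
# => 12.56636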
We have done quite a few things here, so let us just recap what we have done and list the full code so far from our TextProcessor module:
defmodule TextProcessor do
  @common_words ["the", "be", "to", "of", "and", "a", "in", "that", "have", "I",
                 "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
                 "this", "but", "his", "by", "from", "they", "we", "say", "her",
                 "she", "or", "an", "will", "my", "one", "all", "would", "there",
                 "their", "what", "so", "up", "out", "if", "about", "who", "get",
                 "which", "go", "me", "when", "make", "can", "like", "time", "no",
                 "just", "him", "know", "take", "into", "your", "were", "are",
                 "good", "some", "could", "them", "see", "other", "than", "then",
                 "now", "look", "only", "come", "its", "over", "think", "also",
                 "back", "after", "use", "two", "how", "our", "work", "first", "more",
                 "well", "way", "even", "new", "want", "because", "any", "these",
                 "give", "day", "most", "us", <<226>>, "s", "i", "was", "is", "had",
                 "said", "under", "did"]

  def get_text(url) do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} = HTTPoison.get(url)
    body
  end

  def get_words(text) do
    Regex.scan(~r/[\w|']+/, text)
    |> List.flatten()
    |> Enum.map(fn word -> String.downcase(word) end)
  end

  def get_text_words(url) do
    get_text(url) |> get_words()
  end

  def get_most_frequent_words(words, n) do
    words
    |> Enum.reject(&(&1 in @common_words))
    |> Enum.frequencies()
    |> Enum.sort_by(fn {_, count} -> count end, :desc)
    |> Enum.take(n)
  end
end
And let us also show the code we are using with this module:
= "https://www.gutenberg.org/cache/epub/164/pg164.txt"
book_url = TextProcessor.get_text_words(book_url)
words
IO.puts "Number of words: #{length(words)}"
IO.inspect TextProcessor.get_most_frequent_words(words, 5)
The call to IO.inspect
is a debugging tool that prints a value to the console and returns that value unchanged. It is useful when you want to look at values and data structures without having to convert them to strings first.
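Because the value is returned unchanged, you can also drop IO.inspect into the middle of a pipeline to peek at intermediate results; a tiny made-up example:

[1, 2, 3]
|> IO.inspect(label: "before")
|> Enum.map(&(&1 * 2))
|> IO.inspect(label: "after")

The output from our code above is: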
Number of words: 112663
[
  {"captain", 613},
  {"_nautilus_", 508},
  {"sea", 359},
  {"nemo", 349},
  {"ned", 321}
]
This word list makes more sense for this book, 20000 Leagues under the Sea. We have completed task two!
Get the longest words
Next up is to get the 5 longest words. How do we get these? We can sort the list of words based on the length with the sort_by
function. Then we can get the 5 longest words with the take
function. However, with that approach we would likely get the same word several times, since many words appear multiple times in the book. So we should first eliminate any duplicate words, so that we only have one of each word before we sort them. We can do that with the Enum.uniq
function. The uniq
function removes any duplicate items from the list.
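For example, on a made-up list:

Enum.uniq(["sea", "captain", "sea", "nemo"])
# => ["sea", "captain", "nemo"]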
def get_longest_words(words, n) do
  words
  |> Enum.uniq()
  |> Enum.sort_by(&String.length/1, :desc)
  |> Enum.take(n)
end
Here, we have also introduced another way to specify a function parameter. Since the String.length
function takes one argument, the string itself, we can use the &
operator to pass an existing function as the function to call. In Elixir, a function is uniquely identified by its name and arity, which is the number of arguments it takes. The syntax function_name/arity
is used to specify this information. So the String.length
function which takes a single argument is specified as String.length/1
. To pass a function as a parameter, we use the &
operator.
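For example, on a made-up list of strings:

Enum.map(["a", "bb", "ccc"], &String.length/1)
# => [1, 2, 3]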
If we execute this new function TextProcessor.get_longest_words/2
(see what I did there?):
IO.inspect TextProcessor.get_longest_words(words, 5)
The result will be:
["_constitutionnel_", "incomprehensible", "indiscriminately",
"perpendicularity", "circumnavigation"]
This seems to be reasonable, and we can consider the third task to be completed!
Get the longest palindromes
For the final task in our task list, we want to get the longest palindromes in the book. A palindrome is a word that is the same if read backwards, for example “racecar”.
For this task, we want to reduce the number of words to only be palindromes. Then once we have the palindromes, we can get the 5 longest ones. For now, let us assume that we have a function to check if a word is a palindrome. If we have that, we can write this function this way:
def get_longest_palindromes(words, n) do
  words
  |> Enum.uniq()
  |> Enum.filter(&is_palindrome?/1)
  |> get_longest_words(n)
end
First, we want to eliminate any duplicate words. Then we use the Enum.filter
function to select only the palindromes. Then we use the get_longest_words
function to get the 5 longest palindromes.
What remains here then is to define the function is_palindrome?
. This is pretty straightforward, since we just have to check that a string is equal to its reverse value:
defp is_palindrome?(word), do: word == String.reverse(word)
If we add a call to the function TextProcessor.get_longest_palindromes/2
:
IO.inspect TextProcessor.get_longest_palindromes(words, 5)
The result will be:
["_did_", "level", "poop", "noon", "sees"]
And we have our fourth and final task completed!
Wrapping up
The main purpose of this post was to get a taste of working with the pipe operator and anonymous functions in Elixir. This is a powerful way to work with data transformations, and keep the code relatively concise and readable.
I hope you enjoyed this post, and that you want to explore more of Elixir!
The final version of the code that we developed here is:
Mix.install([
  {:httpoison, "~> 2.2"}
])

defmodule TextProcessor do
  @common_words ["the", "be", "to", "of", "and", "a", "in", "that", "have", "I",
                 "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
                 "this", "but", "his", "by", "from", "they", "we", "say", "her",
                 "she", "or", "an", "will", "my", "one", "all", "would", "there",
                 "their", "what", "so", "up", "out", "if", "about", "who", "get",
                 "which", "go", "me", "when", "make", "can", "like", "time", "no",
                 "just", "him", "know", "take", "into", "your", "were", "are",
                 "good", "some", "could", "them", "see", "other", "than", "then",
                 "now", "look", "only", "come", "its", "over", "think", "also",
                 "back", "after", "use", "two", "how", "our", "work", "first", "more",
                 "well", "way", "even", "new", "want", "because", "any", "these",
                 "give", "day", "most", "us", <<226>>, "s", "i", "was", "is", "had",
                 "said", "under", "did"]

  def get_text(url) do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} = HTTPoison.get(url)
    body
  end

  def get_words(text) do
    Regex.scan(~r/[\w|']+/, text)
    |> List.flatten()
    |> Enum.map(fn word -> String.downcase(word) end)
  end

  def get_text_words(url) do
    get_text(url) |> get_words()
  end

  def get_most_frequent_words(words, n) do
    words
    |> Enum.reject(&(&1 in @common_words))
    |> Enum.frequencies()
    |> Enum.sort_by(fn {_, count} -> count end, :desc)
    |> Enum.take(n)
  end

  def get_longest_words(words, n) do
    words
    |> Enum.uniq()
    |> Enum.sort_by(&String.length/1, :desc)
    |> Enum.take(n)
  end

  defp is_palindrome?(word), do: word == String.reverse(word)

  def get_longest_palindromes(words, n) do
    words
    |> Enum.uniq()
    |> Enum.filter(&is_palindrome?/1)
    |> get_longest_words(n)
  end
end
book_url = "https://www.gutenberg.org/cache/epub/164/pg164.txt"
words = TextProcessor.get_text_words(book_url)
IO.puts "Number of words: #{length(words)}"
IO.inspect TextProcessor.get_most_frequent_words(words, 5)
IO.inspect TextProcessor.get_longest_words(words, 5)
IO.inspect TextProcessor.get_longest_palindromes(words, 5)