In our company data is the most important resource. So, it is very important to analyse all the information we have in order to provide a better service to our customers and our helpdesk operators.
During one of our data analysis activities my team was trying to find out a way to correlate multiple bookings made by our customers in order to find out whether some of them were related to the same journey (e.g. a customer staying in different hotels during their holiday) or not. This information is very important both for a better understanding of our customer’s needs and it is also very helpful for our customer care operators that can provide seamless help to the customer when multiple bookings are involved.
We defined some basic rules that could help us find this correlation and we implemented them within an Apache Spark job in Scala, in order to validate them.
Basically you can see our rules as a function: given two bookings as input parameters the function will find out if they are related or not:
def areRelated(first: Booking)(second: Booking): Boolean = ???
Once the areRelated
function was implemented we needed to invoke it with all the possible combinations of bookings and get the list of the related Booking’s identifiers as output. After some data cleanup, filtering and aggregation processes, we had a List[Booking]
available. One simple solution would have been to match every booking in the list with all the others inside the same list (hopefully excluding itself), like:
// given: bookings = List[Booking]
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] = {
var relations: List[(BookingId, BookingId)] = List()
for (c <- candidates) {
if ((b.id != c.id) && areRelated(b)(c)) {
relations = (b.id, c.id) :: relations
}
}
relations
}
var relations: List[(BookingId, BookingId)] = List()
for (b <- bookings) {
relations = relations ++ findRelated(b, bookings)
}
Using a more functional approach we could replace this code with the following (using the Cats library for the Applicative
typeclass instance):
val relations = Applicative[List]
.product(bookings, bookings)
.filter { case (a, b) => a.id != b.id }
.filter { case (a, b) => areRelated(a)(b) }
.map { case (a, b) => (a.id, b.id) }
Here we are using the product
function of the Applicative
typeclass in order to create a list of tuples. The content of this list is made by all the combinations of elements of bookings (as we are using it for both the first and the second parameter). Given this list of tuples the subsequent steps are trivial: a filter
in order to remove the tuples where a booking is bound with itself, another filter
using our correlation rules in order to find out what tuples correspond to related bookings and finally a map
in order to extract the booking identifiers.
But we can go a step further as we know that this approach is a bit too heavy. Given that our related rules are commutative we don’t need to test both the combinations of two bookings (A, B) and (B, A); we can test instead only one of the two. This would reduce the number of correlation checks by 50% and also would remove 50% of the (duplicated) entries in our results. A first possible implementation could be the following:
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] = {
var relations: List[(BookingId, BookingId)] = List()
for (c <- candidates) {
if ((b.id != c.id) && areRelated(b)(c)) {
relations = (b.id, c.id) :: relations
}
}
relations
}
val relations: List[(BookingId, BookingId)] = List()
for (i <- bookings.indices) {
val sublist = bookings.drop(i)
relations = relations ++ findRelated(sublist.head, sublist.tail)
}
But we can do something better, bringing back immutability and the functional approach, by using a recursive function instead of a for loop:
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] =
candidates
.filter(c => areRelated(b)(c))
.map(c => (b.id, c.id))
@tailrec
def correlateAll(bookings: List[Booking],
relatedGroup: List[(BookingId, BookingId)] = List.empty) = bookings match {
case head :: tail => correlateAll(tail, relatedGroup ++ findRelated(head, tail))
case Nil => relatedGroup
}
correlateAll(bookings)
This is a good approach but actually, in functional programming, we have more tools on our belt. The FP paradigm provides some design patterns like the ones in the object oriented world, but with a big difference. While in the OOP world the patterns just tell you how to structure your code, in the FP world a pattern also provides a production-ready implementation so you don’t have to implement (and test and debug) it. I think this is a great plus of the functional programming approach.
So how we can satisfy our needs with an FP pattern like we’ve done before with the Applicative
typeclass? The answer is one: the Comonad typeclass.
Disclaimer: we are not going to make a deep explanation of the monad and comonad concepts. If you are interested in it I suggest that you read the Category Theory for Programmers book (freely available online) by Bartosz Milewski where an entire section is dedicated to the comonad concept.
The Comonad
typeclass provides some functions that could be very useful for us; in particular it provides the coflatten
function that basically, given a F[A]
, provides an F[F[A]]
; you can see it is as the dual of the flatten
function of the Monad
typeclass. The implementation of this typeclass could be trivial or not, depending on the data type we are working with, but in any case it must satisfy some specific laws as any other typeclass.
Since we are working with a list of bookings let’s see how the coflatten
function works on a list. In order to work with the Comonad typeclass we have to work with a NonEmptyList
data type (it is required in order to satisfy the Comonad definition and laws). As we know, a list can be defined as a chain of Cons
instances terminated with a Nil
value:
NonEmpytList(1,2,3,4) = Cons(1, Cons(2, Cons(3, Cons(4, Nil))))
Given this representation the coflatten
method takes a list, like the one above, as input and returns a list as output where every element of the output is a Cons
instance of the input. In this example, the Cons
inside the list are the following:
Cons(1, Cons(2, Cons(3, Cons(4, Nil))))
Cons(2, Cons(3, Cons(4, Nil)))
Cons(3, Cons(4, Nil))
Cons(4, Nil)
and so, the coflatten
will return:
coflatten(NonEmptyList(1, 2, 3, 4)) == NonEmptyList(
NonEmptyList(1, 2, 3, 4),
NonEmptyList(2, 3, 4),
NonEmptyList(3, 4),
NonEmptyList(4)
)
This is basically what we need in order to compare the bookings; we could rely on this method in order to find out the related bookings.
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] =
candidates
.filter(areRelated(b))
.map(c => (b.id, c.id))
val bs = NonEmptyList.fromListUnsafe(bookings)
Comonad[NonEmptyList]
.coflatten(bs)
.map(l => findRelated(l.head, l.tail))
.toList
.flatten
Let’s see if we can simplify this code. First of all, we are executing the sequence of operations coflatten
and map
; they can be replaced with a coflatMap
:
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] =
candidates
.filter(areRelated(b))
.map(c => (b.id, c.id))
val bs = NonEmptyList.fromListUnsafe(bookings)
Comonad[NonEmptyList]
.coflatMap(bs)(l => findRelated(l.head, l.tail))
.toList
.flatten
Given this new implementation, we can lose some typeclass constraints; we are using just the coflatMap
function on the comonad typeclass; Cats provides a less specific typeclass named CoflatMap
, that basically provides just the coflatMap
function:
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] =
candidates
.filter(areRelated(b))
.map(c => (b.id, c.id))
val bs = NonEmptyList.fromListUnsafe(bookings)
CoflatMap[NonEmptyList]
.coflatMap(bs)(l => findRelated(l.head, l.tail))
.toList
.flatten
Since the laws that describe the CoflatMap
typeclass are less strict that the Comonad
ones Cats can provide an instance also for the List
data type, so we can finally drop the NonEmptyList
annoyance:
def findRelated(b: Booking, candidates: List[Booking]): List[(BookingId, BookingId)] =
candidates
.filter(areRelated(b))
.map(c => (b.id, c.id))
CoflatMap[List]
.coflatMap(bookings)(l => findRelated(l.head, l.tail))
.flatten
Finally, but this is a matter of taste, we can inline the findRelated
function:
CoflatMap[List]
.coflatMap(bookings)(l =>
l.tail.filter(areRelated(l.head))
.map(c => (l.head.id, c.id)))
.flatten
That is it. This is the full implementation for our requirements.