Content Hub: How we made NoSQL relational

paweł suchy

lastminute.com is moving many of its services to Amazon Web Services. Let's look at the challenges the data migration itself can raise during this process, using the Content Hub application as a case study.


First of all - what is Content Hub? It is an Enterprise Digital Experience Platform based on a selection of engines, one of which hosts and allows management of the company’s upper-funnel websites. With just a few clicks, users can create beautiful web pages full of static and dynamic content, and after publishing… voilà! They are instantly visible to the outside world.

Content Hub: new page

Content Hub: last edits

Content Hub: final web page

What databases are we currently using?

Since the beginning, Content Hub (CH) has been hosted on the Google Cloud Platform (GCP). Data is stored in multiple databases - there are more, but we will focus on the ones below:

  • CMS DB: stores CMS data in a “relational” manner. Among others, this includes web page content, the URL hierarchy, image/video details, etc. Why “relational” in quotes? Well… because for this purpose we are using Google Cloud Datastore (GCD) - a clearly non-relational database. Although remarkably efficient, scalable, and very cost-efficient when used properly, it is limited in functionality and comes with lots of restrictions. We will not go any further here, as debating whether this was the best choice for this kind of data is not the subject of this post.
  • Display DB: aggregates the data stored in CMS DB, which is then consumed by the services responsible for presenting content to Internet users. In contrast to the previous database, where reads and writes occur at almost the same rate, Display DB must be optimized for fast reads. This time the choice doesn’t need justification - yes, this database is also stored in GCD.
  • Search DB: as the name implies, it is used for phrase searching across web pages and related data. In the beginning, we used Elasticsearch, one of the most popular and obvious choices for this purpose. Over time, as Content Hub became more widely adopted across the company and lots of pages were migrated from other management solutions to our platform, the pressure on CH to increase reliability grew significantly. Unfortunately, the overhead of maintaining Elasticsearch exceeded the functional requirements, so it was decided to move Search DB to MongoDB - the searching capabilities were limited to some extent, but text search was still supported.

Data flow between these databases is as follows: after changes are made in the CMS, data is altered in CMS DB. The changed data is then aggregated and composed into expanded entities - occasionally nearly hitting the GCD entity size limit of 1 MB. When applicable, in parallel to this process, the data (after some alterations) is also pushed to Search DB. All of these operations are executed asynchronously and are transparent to the CMS user, who is not expected to wait until the whole process is finished.

Dataflow diagram

OK, this was an introduction to the pre-migration state. Now let’s focus on the engines we are moving to. We are migrating from Google Cloud Platform to Amazon Web Services (AWS) and decided on the following migration path:

  • CMS DB → Amazon RDS for PostgreSQL. The relational engine that will fully support the needs of our application.
  • Display DB → Amazon DocumentDB (with MongoDB compatibility). Display DB contains aggregated and composed data, so this popular and mature NoSQL database engine is one of the natural choices.
  • Search DB → none. Yes, this is not a mistake! Thanks to the enormous possibilities of PostgreSQL and its trigram functionality, we could get rid of a database that existed only for the needs of search queries (a short sketch of this approach follows the list).
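
A sketch of how trigram-based search can replace a dedicated search engine for simple phrase queries - the table and column names below are illustrative, not our exact schema:

-- enable the trigram extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- a GIN trigram index makes ILIKE '%...%' and similarity searches fast
CREATE INDEX idx_page_title_trgm ON page USING GIN (title gin_trgm_ops);

-- substring search that can use the index
SELECT id, title FROM page WHERE title ILIKE '%city break%';

-- fuzzy search ordered by trigram similarity
SELECT id, title FROM page
WHERE title % 'city brek'
ORDER BY similarity(title, 'city brek') DESC;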

The title of this post is a bit misleading, as it may suggest that we are somehow surpassing the limits of non-relational engines. What we actually had to do was move data from the NoSQL database to the relational one with as little effort as possible, so that changes in our application could be limited to a minimum. Our use of Domain-Driven Design certainly helped here. We aren’t using DDD strictly, just a selected set of its principles - one of them being the separation of layers. In our application, the cloud and database adapters are incorporated into the infrastructure layer, so the migration to AWS is mainly limited to replacing those specific adapters. Apart from data conversion, our domain models remain intact.

Since migrating Display DB from Google Cloud Datastore to DocumentDB (so from NoSQL to NoSQL) does not present any major difficulties, we will focus only on the issues related to migrating from Datastore to PostgreSQL.

Database schema

In NoSQL, it is common to have big entities with complicated structures. We could store our extensive domain models in a NoSQL database without any structural changes and without intermediary models. Below is an example of a class from our application (one of the simpler ones) which is stored 1:1 in the database.

public class PageConfig
{
	public Wid<PageConfig> Id { get; init; }
	public HtmlConfig? Html { get; set; }
	public HeadConfig? Head { get; set; }
	public Wid<Site> SiteId { get; set; }
	public ICollection<string>? Tags { get; set; }
	// (...)

	public class HtmlConfig
	{
		public string? Lang { get; set; }
	}

	public class HeadConfig
	{
		public string? Title { get; set; }
		public ICollection<PageHeadMeta>? Metas { get; set; }
		public ICollection<PageHeadLink>? Links { get; set; }

		public class PageHeadMeta
		{
			public string? Name { get; set; }
			public string? Content { get; set; }
			// (...)
		}

		public class PageHeadLink
		{
			public string? Rel { get; set; }
			public string? Type { get; set; }
			// (...)
		}
	}
}

As you can see above, we use a strongly-typed ID (Wid<>). Built-in primitive types such as integer, string, or even the more complex GUID are interchangeable, which leads to mistakes and confusion when values belonging to different types of entities have to be used together. For example, if a function signature takes the integer IDs of two entities as arguments, it is easy to confuse their order when calling the method. With strongly-typed IDs, even if the order is mistaken, the compiler throws an error and the code does not compile. Under the hood, our strongly-typed ID is just a long integer, so there is no problem with representing this type in the database.

public interface IPageConfigRepository
{
	Task<ICollection<PageConfig>> PageConfigsWithMediaListInt(int siteId, int mediaId);
	Task<ICollection<PageConfig>> PageConfigsWithMediaList(Wid<Site> siteId, Wid<Media> mediaId);
}

public class PageConfigService
{
	private readonly IPageConfigRepository _repository;
	public PageConfigService(IPageConfigRepository repository) => _repository = repository;

	public async Task TestMethod()
	{
		var siteInt = 123;
		var mediaInt = 456;
		var results = await _repository.PageConfigsWithMediaListInt(mediaInt, siteInt); // no error (!)
		results = await _repository.PageConfigsWithMediaListInt(siteInt, mediaInt); // no error

		var siteId = Wid<Site>.Create();
		var mediaId = Wid<Media>.Create();
		results = await _repository.PageConfigsWithMediaList(mediaId, siteId); // argument type compiler error
		results = await _repository.PageConfigsWithMediaList(siteId, mediaId); // no error
	}
}
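
For illustration only, a strongly-typed ID can be as small as a thin wrapper over a long. This is a minimal sketch of the idea, not the actual Wid<> implementation used in Content Hub:

// A minimal strongly-typed ID: the generic parameter ties the value to one entity type,
// while the underlying representation stays a plain long (easy to store in any database).
public readonly record struct Wid<T>(long Value)
{
	// Create() here just generates a random value for the sake of the sketch;
	// the real implementation may generate IDs differently.
	public static Wid<T> Create() => new(Random.Shared.NextInt64());

	public override string ToString() => Value.ToString();
}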

Coming back to our database schema: what are the options for storing such a model, with nested objects and arrays of objects, in a relational database? The first that comes to mind is to create additional tables with one-to-one and one-to-many relationships, respectively. Each nested value would need a separate table, which in our case could lead to an explosion of relationships.

In PostgreSQL, there is an alternative: the JSON data type. It allows storing structured data in a single field and manipulating it with JSON-specific functions and operators. There are two JSON data types available: json and jsonb. The binary decomposition used by jsonb makes it slower to write but faster to process, as it enables indexing. With jsonb we can get an object field by key or path and an array element by index, which also means that we can filter on those values in the WHERE clause.

SELECT id, html -> 'lang' AS lang FROM page_config
WHERE head -> 'title' IS NOT NULL;

Another case is arrays of simple types, such as arrays of strings or integers. Going purely relational, we would create a table with a one-to-many relationship for each such field. PostgreSQL comes with an unusual solution here as well - it has built-in support for array columns. A column can be defined as an array of any base type and, what is more, the array may be multidimensional. We can access array or subarray elements by subscript, retrieve whole slices of an array within given subscript ranges, and get the length of an array. Note that PostgreSQL arrays are 1-based by default.

SELECT id, tags[1:2] FROM page_config
WHERE tags[1] = 'summer';

Using the above solutions, designing the database schema for Content Hub models became effortless. We created regular simple-typed columns for all fields that are at the core of a model, jsonb columns for nested objects and complex collections, and array-typed columns for collections of simple types. This is what the table schema for our model ultimately looks like:

CREATE TABLE page_config
(
	id bigint NOT NULL,
	html jsonb,
	head jsonb,
	site_id bigint NOT NULL,
	tags text[],
	/* (...) */
	CONSTRAINT pk_page_config PRIMARY KEY (id)
);
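
As mentioned above, jsonb and array columns can be indexed. A sketch of what such indexes might look like (not necessarily the exact indexes we use):

-- a GIN index on a jsonb column supports containment (@>) and key-existence (?) operators
CREATE INDEX idx_page_config_head ON page_config USING GIN (head);

-- a GIN index on an array column speeds up membership checks such as tags @> ARRAY['summer']
CREATE INDEX idx_page_config_tags ON page_config USING GIN (tags);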

Database integrity

We could have ended this post here, because we have already developed a comprehensive solution for storing all of our models in a relational database. However, the migration to PostgreSQL is also an opportunity to take care of data integrity. In Google Cloud Datastore there is no way to set constraints or any kind of validators that would ensure the accuracy and consistency of the data at the database level. So far we could only check at the application level whether insertions, modifications, and deletions would cause data to become incomplete or redundant. Now, in a relational database, we decided to make use of foreign keys and unique constraints.

Implementation of this solution revealed the dark truth: it turned out that Content Hub’s data, despite the validations at the application level, was inconsistent. There were two reasons for this:

  • We were missing proper validation in a few places in the code. These were obvious bugs that had to be fixed.
  • Concurrent requests executed in parallel at the same moment. We were using a locking mechanism for crucial data to avoid critical-section problems, and it is not difficult to guess that data inconsistencies arose in the places not covered by this technique. In daily editorial usage, such concurrent requests are very unlikely to occur. The situation is completely different when we automatically import a massive number of web pages into our system - multiple at once.

We had to clean up our database, and this process had to be done a few times. The first time was the most intense, because we had to deal with inconsistencies that had been growing over the years. During the pre-migration period, the operation was repeated a few more times, up to the very last clean-up just before the final migration of the data.

We created foreign keys on every column that contains IDs of related entities and also added some uniqueness constraints. The phrase “every column” was used on purpose. Why? Because related entities’ IDs are also included in the structurally nested fields of a model which, as mentioned before, are stored in our PostgreSQL as JSON. So what about those values? Actually, we had implemented a solution long before anyone was thinking about migrating to a relational database. To simplify data retrieval and validation in our application, at the root level of the models we add additional array-typed fields containing aggregations of the IDs used across all fields of a model - including nested objects. Below is an example of this solution:

public class PageConfig
{
	public Wid<PageConfig> Id { get; init; }
	// (...)
	public ICollection<Wid<Media>>? MediasIds => Head?.Metas?.Select(meta => meta.MediaId).ToList();
	public ICollection<Wid<Anchor>>? AnchorsIds => Head?.Links?.Select(link => link.AnchorId).ToList();

	public class HeadConfig
	{
		// (...)
		public ICollection<PageHeadMeta>? Metas { get; set; }
		public ICollection<PageHeadLink>? Links { get; set; }

		public class PageHeadMeta
		{
			// (...)
			public Wid<Media> MediaId { get; set; }
		}

		public class PageHeadLink
		{
			// (...)
			public Wid<Anchor> AnchorId { get; set; }
		}
	}
}

It is the model’s responsibility to properly fill those arrays. This way, other parts of the system that retrieve entities related to other entities don’t need to know all the fields that may contain the requested IDs - they just filter by the values in a single array field. It is a bit of redundancy in terms of data storage, but nowadays we don’t need to care about database size that much.

Unfortunately, PostgreSQL does not support foreign keys on array-typed columns. That is why this was the moment to finally make use of associative tables. The data saved in an associative table is a projection of the values stored in the field with the array of IDs - yes, another redundancy. These tables are used only to keep the data integral with the help of foreign keys. They aren’t queried anywhere in our application, except at the moment of an entity update - with changes made to the entity, we have to update the associative tables accordingly. These associative tables wouldn’t even be required if, in the future, we introduce JOIN queries (currently we are trying to support switching between cloud platforms with minimal effort) - PostgreSQL has a built-in solution for joining on array fields.

SELECT p.*, m.* FROM page_config p
LEFT JOIN media m ON m.id = ANY(p.medias_ids);
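
For completeness, such an associative table, used purely to enforce integrity via foreign keys, might look like the sketch below (table and column names are illustrative, not our exact schema):

CREATE TABLE page_config_media
(
	page_config_id bigint NOT NULL REFERENCES page_config (id),
	media_id bigint NOT NULL REFERENCES media (id),
	CONSTRAINT pk_page_config_media PRIMARY KEY (page_config_id, media_id)
);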

Let’s get back to the migration of the data itself. The cleaned-up database was ready to be imported into another engine. As mentioned before, the domain models in our application were left intact, so we could easily export and import data via JSON files without any conversion. The export was performed by an instance of our application with the old cloud infrastructure layer supporting Google Cloud, while the import was done by an instance with that layer replaced by the newly developed AWS support. While importing data into our new database we didn’t care about the order of inserted entities, so during this operation it was almost certain we would run into foreign key constraint failures. That is why all foreign keys were temporarily removed before the data import and restored at the moment of finalization. It was crucial that the imported data had full integrity, otherwise the recreation of the foreign keys would fail.
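
Conceptually, the import looked like this - the constraint and column names below are illustrative, not the exact ones from our schema:

-- before the import: temporarily drop the foreign keys
ALTER TABLE page_config DROP CONSTRAINT fk_page_config_site;

-- ... bulk import of the JSON-exported entities, in arbitrary order ...

-- after the import: recreate them; this step fails if the imported data lacks integrity
ALTER TABLE page_config
	ADD CONSTRAINT fk_page_config_site FOREIGN KEY (site_id) REFERENCES site (id);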

The last thing worth mentioning regarding database integrity is concurrency checks. Data concurrency refers to changes made to the same data in a database at the same time by multiple processes, and concurrency checks ensure that those changes do not conflict with each other. In Content Hub we use row-version concurrency control: an entity has an additional integer field containing its current version. A process requesting a data change has to provide the version of the entity it originally retrieved from the database. If, during the update, the provided version is the same as the one actually stored in the database, the changes are applied successfully and the row version is increased at the same time. If the provided version differs from what is actually stored in the database, a concurrency conflict occurs - the entity has already been altered by another process.

In typical relational and some non-relational databases, ensuring that the row version is correct is achieved by adding a condition to the WHERE clause of the UPDATE command and proceeding accordingly if no rows were affected. In Google Cloud Datastore, to validate the entity version we had to fetch the actual entity, ensure the version was correct, and replace the entity with the new values - all in a single transaction, because GCD supports transactions on the same group of entities. Removing this now-redundant query for the entity’s current version was the last change made to our application in this part of the migration.
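
For illustration, here is a sketch of the row-version check expressed as a single UPDATE in PostgreSQL (the column names and parameter placeholders are illustrative):

UPDATE page_config
SET head = @new_head, row_version = row_version + 1
WHERE id = @id AND row_version = @expected_version;
-- if no rows were affected, the entity has been changed by another process
-- and the application reports a concurrency conflict instead of overwriting the change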

Performance and scalability

Google Cloud Datastore is extremely performant in terms of data reads. It is possible to query entities by ID or construct simple filtering clauses. Every filtering clause must have its counterpart in the active indexes, otherwise the request returns an error. Single-field indexes are created automatically; multi-field indexes have to be defined by the administrator. This is favorable in terms of performance, but from the developers’ perspective it can be very bothersome, as it limits the flexibility of data filtering.

In our data-listing endpoints, we support the declaration of custom filters in a format compatible with OData. It was impossible to create indexes for every possible combination of filters, so we developed a mechanism in our application that removes part of the filtering conditions when querying Datastore, so that the sent filter matches the best-fitting index. Records returned from the database were then additionally filtered at the application level. In some cases, this operation could be very memory-hungry, so migrating to PostgreSQL and getting rid of this mechanism is a relief. Oh, it is also worth mentioning that GCD supports neither the OR clause nor the inequality operator - developers consuming our OData-enabled endpoint were repeatedly unpleasantly surprised to find this out. With all this in mind, the data structure had to be very well thought out and appropriately designed.

In terms of write performance, GCD looks much worse. It has a maximum sustained write rate to a single entity of about 1 per second. For some time (usually a second) we can also experience inconveniences related to eventual consistency - changes made to entities are not visible immediately when querying using a secondary index. For example, inserting an entity into the database and listing the whole collection during this period would not return that entity.

public class PageConfigRepository
{
	private readonly DatastoreDb _db;
	public PageConfigRepository(DatastoreDb db) => _db = db;

	public async Task TestMethod()
	{
		await _db.InsertAsync(new Entity { Key = new Key().WithElement(new Key.Types.PathElement { Kind = "page_config" })});
		var results = (await _db.RunQueryAsync(new Query("page_config"))).Entities.Count; // 0 (?!)

		await Task.Delay(TimeSpan.FromSeconds(1));
		results = (await _db.RunQueryAsync(new Query("page_config"))).Entities.Count; // 1
	}
}

It is worth mentioning that some of the drawbacks of Datastore have been addressed in its younger sibling, Firestore.

PostgreSQL writes are much more convenient; there are no such limits as the ones mentioned above. The one big issue that may occur is deadlocks during transactions. Oh, yes - transactions covering the whole database - this is also a big advantage over GCD, which only allows wrapping operations on entities from a single group in a transaction. Don’t get this wrong - a single group of entities is not equivalent to a single collection: different entities in GCD can be grouped into a tree structure, so a single transaction can cover requests on all of them.
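
A sketch of a transaction spanning two unrelated tables, which was not possible across entity groups in Datastore (the values and columns are made up for illustration):

BEGIN;
UPDATE page_config SET site_id = 42 WHERE id = 1001;
UPDATE site SET row_version = row_version + 1 WHERE id = 42;
COMMIT;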

The main caveat in terms of reads is that PostgreSQL allows only one query per connection at a time. While GCD allows executing lots of queries in parallel (the gRPC and REST protocols used by this engine are stateless and do not keep persistent connections), in PostgreSQL all queries and commands have to be executed one by one. To work around this limitation, in certain operations such as gathering the data aggregated into Display DB, we carefully harness multiple connections. Used without caution, this could ultimately lead to the exhaustion of resources on the database server as well as increased resource utilization by the application itself.
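
A rough sketch of this idea, assuming the standard Npgsql ADO.NET driver and hypothetical class and query names - the actual Content Hub implementation lives behind its database adapters and differs in detail:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Npgsql;

public class DisplayDataGatherer
{
	private readonly string _connectionString;
	public DisplayDataGatherer(string connectionString) => _connectionString = connectionString;

	// Each query gets its own connection, so the queries can run in parallel.
	// The connection pool size puts an upper bound on the resources used on both sides.
	public async Task<long[]> CountRowsAsync(IEnumerable<string> tableNames)
	{
		var tasks = tableNames.Select(async tableName =>
		{
			await using var connection = new NpgsqlConnection(_connectionString);
			await connection.OpenAsync();
			// Illustrative query; the table names here come from our own code, never from user input.
			await using var command = new NpgsqlCommand($"SELECT count(*) FROM {tableName}", connection);
			return (long)(await command.ExecuteScalarAsync() ?? 0L);
		});
		return await Task.WhenAll(tasks);
	}
}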

What’s next?

After the migration to the relational database, the next steps will be aimed at getting the most out of it. First: making proper use of transactions. As mentioned before, previously we were able to use transactions only in a limited way; from now on we can do this properly for all operations. Sooner or later an application will crash or a connection will be interrupted in the middle of altering data - without transactions, NoSQL or not, the data would be left inconsistent. What is more, without transactions, concurrent processes could access incomplete data for a while.

Secondly: making use of joined queries, subqueries, and aggregations. In Datastore, there were situations where there was no other way than to make multiple queries and combine the results at the application level, which in a relational database can easily be accomplished within a single query. Another case is, for example, the has_subcategories flag for product categories. Imagine we have a hierarchical structure of categories: subcategories have a field parent_category_id pointing to the parent entity, and the requirement is to query all categories that are empty (don’t have any subcategory). In PostgreSQL, we can achieve this with a simple subquery (see the sketch below), while in GCD it would be necessary to add a has_subcategories field to the category entity and update it accordingly whenever the collection of categories changes.
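
A sketch of such a subquery, for the hypothetical category table described above:

-- categories that do not have any subcategory
SELECT c.* FROM category c
WHERE NOT EXISTS (
	SELECT 1 FROM category sub WHERE sub.parent_category_id = c.id
);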

Last but not least: adjusting the database schema to better fit relational functionality. Although we have managed to migrate the data from NoSQL in the same form as before, there are now opportunities to improve these models - the has_subcategories flag case is one example. With PostgreSQL, completely new possibilities for data querying and data management have opened up for us. It is worth rethinking the database models and adjusting them accordingly where it makes sense. Software is in constant evolution, and the same applies to the data it is built on.

Special thanks to Marcin Korzonek and Jacek Ostrowski who bravely acted as consultants and proofreaders during the creation of this article.

