Databases: history and use cases (part 1)

Guillaume Jacquart
7 min read · Aug 14, 2018

This is part 1 of a two-part story about databases. Part 2 is coming soon.

Since the dawn of the internet era, web developers have needed to store dynamic content for their websites, in order to ease content management and to offer personalized experiences to end users.

As requirements evolved, data became the most important part of ever more complex systems, and software vendors soon took the opportunity to come up with new features for data storage.

In part 1 of this article, I’ll go through the history of content storage, from flat files to highly efficient distributed data stores, keeping in mind the evolving needs of the web ecosystem.

Studying the technological trends and issues of the past often provides new ways of thinking about a concept, which can come in handy when you’re responsible for choosing your company’s next data store.

Flat file storage

One of the first ways developers came up with to store user and dynamic content was flat files. This solution seems like the most natural thing to do, as file storage is the main way to persist data inside a computer. It is quite interesting to notice that the term “database” appeared right after the advent of disk-based storage.

Flat files have the advantage of being easy to use and of letting developers pick whatever file format best suits their needs (XML, JSON, CSV, …).

The main drawback of using a flat file to store data is that every abstraction around data handling (deleting a record, handling related data, updating a field in a record, …) requires a specific implementation.
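To make this concrete, here is a minimal sketch in Python of what “updating a field” means when your only storage is a flat file (the books.json file and its layout are made up for illustration): you re-implement the read/scan/rewrite cycle yourself.

```python
import json

BOOKS_FILE = "books.json"  # hypothetical flat file holding every book of the site

def update_book_title(book_id, new_title):
    """Update a single field: load, scan and rewrite the whole file by hand."""
    with open(BOOKS_FILE, encoding="utf-8") as f:
        books = json.load(f)            # read everything into memory

    for book in books:
        if book["id"] == book_id:
            book["title"] = new_title   # the "update" is a plain in-memory mutation

    with open(BOOKS_FILE, "w", encoding="utf-8") as f:
        json.dump(books, f, indent=2)   # write everything back, even untouched records
```

Every other operation (deletes, lookups, relations between records) needs the same kind of ad hoc code, and none of it is safe against concurrent writers.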

It can also become a bottleneck for the application once the file grows large enough, as disk I/O can prove slow.

Also, on the infrastructure side, having your data stored in a flat file on your application server is quite risky, as a server failure or a hack could jeopardize your data’s integrity.

So the first flat file storage solutions brought some insights into what would be nice to have when working with dynamic data:

  • Data storage should organize the stored data according to a specific data structure model that best represents real-life use cases
  • Data storage should provide a standardized way to query the data according to the model
  • The data storage service should be independent of the application layer, to reduce the risk of failure

Navigational databases

These databases implement models that allow users to access data by following references to other objects.

Two models implementing this sort of database emerged in the 70s: the hierarchical model, where objects are stored in a tree-like structure and accessed by querying their hierarchical path (kind of like XPath), and the network model, where objects are stored in a graph-like structure and each object is accessed by querying the relation path.
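As a rough illustration (a toy in Python, not a real hierarchical DBMS), accessing a record in a hierarchical model boils down to walking a path from the root of the tree:

```python
# Toy hierarchical "database": every record lives at a path from the root.
library = {
    "fiction": {
        "tolkien": {"the-hobbit": {"year": 1937}},
    },
    "science": {
        "darwin": {"origin-of-species": {"year": 1859}},
    },
}

def fetch(path):
    """Follow the hierarchical path segment by segment, like a very crude XPath."""
    node = library
    for segment in path.split("/"):
        node = node[segment]   # raises KeyError if the path does not exist
    return node

print(fetch("fiction/tolkien/the-hobbit"))  # {'year': 1937}
```

Note how the caller has to know the exact path to the record; that detail is what the rest of this section is about.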

These databases have the advantage of offering better performance when querying related data stored on disk.

They also came with standardization and specifications that enabled better implementations.

The main drawback of navigational databases is that querying is done sequentially, in a procedural way:

  1. Start here
  2. Go from here to here then there
  3. Do this n-times…
  4. Return that object

Programmers had to feed very specific data paths to the system, which often became unreadable and very hard to maintain. Any change to the data model would also imply a lot of side effects on the program’s queries.

This experience brought to light the need for a declarative way to query data, with the query engine acting as a black box that interprets these declarative queries.
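As a preview of the relational section below, here is what the toy lookup from the navigational sketch above looks like once it is expressed declaratively (a sketch using Python’s built-in sqlite3 module; the books table is made up): we state what we want and let the engine decide how to reach it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (author TEXT, title TEXT, year INTEGER)")
conn.execute("INSERT INTO books VALUES ('tolkien', 'the-hobbit', 1937)")

# Declarative: no path to follow, the query engine figures out the access plan.
row = conn.execute(
    "SELECT year FROM books WHERE author = ? AND title = ?",
    ("tolkien", "the-hobbit"),
).fetchone()
print(row[0])  # 1937
```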

It is funny to think of the DOM, an in-memory hierarchical procedural “database”, as the most famous example of a navigational API, used daily by every front-end developer.

A brief overview of the difference between declarative and imperative programming is available here: https://medium.com/front-end-hacking/imperative-versus-declarative-code-whats-the-difference-adc7dd6c8380

Relational databases

In the early 70’s, Edgar F. Codd put together a model to represent data stored in digital systems in terms of tuples and the relationships between them. Soon after, relational databases implementing this model appeared, and the SQL language offered a clear, declarative way of querying and mutating your data.

At its core, a relational database represents your data as a collection of tuples (sets of attributes associated with values), e.g. a book, that respect some constraints (field types, key attributes, uniqueness, …), e.g. book serial numbers must be unique, and are linked together to represent real-life relationships, e.g. a book has one and only one author, but multiple readers, who themselves can read multiple books.
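Here is a minimal sketch of that idea using Python’s built-in sqlite3 module (the table and column names are invented for the example): tuples become rows, constraints become column and key declarations, and relationships become foreign keys and join tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE authors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE books (
    serial    TEXT PRIMARY KEY,                         -- serial numbers must be unique
    title     TEXT NOT NULL,                            -- type and not-null constraints
    author_id INTEGER NOT NULL REFERENCES authors(id)   -- one and only one author
);
CREATE TABLE readings (                                 -- many-to-many: readers <-> books
    reader TEXT NOT NULL,
    serial TEXT NOT NULL REFERENCES books(serial),
    PRIMARY KEY (reader, serial)
);
""")
```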

Relational databases have been hugely popular for more than 30 years now, and one might wonder how that can be in a field as innovative as data management.

I think the SQL language brought several niceties that contributed to the popularity of relational databases:

  • It separates the logical representation of data from its actual implementation. That allowed data analyst jobs to evolve independently from the technical constraints of database system vendors
  • Combined with the normal forms established by Codd, SQL encourages schema normalization, which reduces data redundancy and has the nice side effect of increasing the consistency of your data
  • Normalization and improvements in database software made way for ACID transactions, which further increase data consistency (see the sketch below)
  • It is also extremely powerful and, all things considered, highly readable for the trained eye

Those properties make SQL databases really efficient for storing and querying data in most common web platform use cases.
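To illustrate the ACID point from the list above, here is a hedged sketch (again with sqlite3 and made-up account data) of a transaction that is either fully applied or fully rolled back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

try:
    with conn:  # opens a transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired: neither update is kept

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # balances unchanged
```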

SQL databases ruled the web for a long time, but were challenged in the late 2000s by the explosion in the quantity of data collected and used by digital systems. Some big data actors, such as social media companies, reached the limit of what vertical scaling of database servers could do. They soon started to question the architectural constraints that relational databases impose, in order to maximize the performance of their systems.

NoSQL databases

Big data systems are digital systems that store huge amounts of heterogeneous data and are expected to query that data in a timely manner. These are the three “V”s of Big Data:

  • Volume (think terabytes or bigger),
  • Velocity (updating a Facebook post’s statistics within seconds),
  • Variety (querying text as well as pictures, log files, …)

Common relational databases, in order to guarantee the consistency of your data model, prevent the partitioning of your data across multiple servers. Indeed, it is much easier to ensure a multi-step transaction succeeds when the data is gathered on a single server than when orchestrating a fleet of machines to change the state of a piece of business data.

That means relational databases are limited to the compute, memory and storage resources of a single server. Those resources cannot be expanded indefinitely on one machine for technical reasons, and costs start to increase exponentially past a certain point.

Thus, companies handling huge amounts of data started to develop partitioned systems that could handle virtually unlimited amounts of data through horizontal scaling. Google led the way with its GFS and MapReduce papers, which inspired the Hadoop ecosystem: a way to store files and run computations across a cluster of servers. These computations had to follow the Map/Reduce pattern, which basically consists in executing the computation on each data node and only grouping the results together at the end.
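Here is a hedged sketch of the Map/Reduce idea in plain, single-process Python (the classic word count; in a real Hadoop job the map and reduce phases are distributed across the cluster’s data nodes):

```python
from collections import defaultdict
from itertools import chain

# Pretend each string is the chunk of data stored on one node of the cluster.
chunks_on_nodes = [
    "the quick brown fox",
    "the lazy dog and the quick cat",
]

def map_phase(chunk):
    """Runs locally on each data node: emit (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Runs once the mapped results are shuffled together: aggregate by key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

mapped = [map_phase(chunk) for chunk in chunks_on_nodes]  # one map per node, in parallel
print(reduce_phase(chain.from_iterable(mapped)))          # {'the': 3, 'quick': 2, ...}
```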

This basic building block (data storage and compute in clustered systems) made way for new databases to come alive. These new databases traded the strong consistency constraints of relational databases for horizontal scalability. For that reason, they were called NoSQL databases.

The first versions of these databases could not offer SQL-level querying, because features such as table joins and foreign key constraints require complete knowledge of the data perimeter, which partitioned systems do not have.

These new NoSQL databases mainly differ along 2 criteria:

  • The model they use to store their data. The main models are document-based, key/value and wide-column (or column family). These models prevent each atom of data from being split across multiple nodes, and each has its matching use cases (see the sketch after this list).
  • Their position on the CAP theorem spectrum: the CAP theorem states that partitioned databases must make a compromise between consistency (all data is the same everywhere in the cluster) and availability (a query on any node of the cluster always gets a response)
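For instance, here is roughly what the book from the relational example could look like in a document store and in a key/value store (field names are purely illustrative). Keeping everything about one book in a single, possibly denormalized atom is what lets the cluster place it on a single node.

```python
# Document model: one self-contained, denormalized document per book.
book_document = {
    "serial": "978-0261103283",
    "title": "The Hobbit",
    "author": {"name": "J. R. R. Tolkien"},  # embedded instead of joined
    "readers": ["alice", "bob"],              # duplication is accepted
}

# Key/value model: the value is an opaque blob, only the key is addressable.
key_value_store = {
    "book:978-0261103283": '{"title": "The Hobbit", "author": "J. R. R. Tolkien"}',
}
```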

The difficulty of ensuring consistency in partitioned databases made way for a new set of requirements for NoSQL databases: BASE (Basically Available, Soft-state, Eventually consistent), as opposed to the classic ACID requirements of relational databases. See the Wikipedia article for more details: https://en.wikipedia.org/wiki/Eventual_consistency

Here is a list of some famous NoSQL databases, with their model and position on the CAP spectrum:

  • MongoDB: document-based, focuses on consistency (but offers high availability with failovers)
  • Cassandra: wide-column or key/value, consistency is up to the client to set for each request
  • CouchDB: document-based, focuses on availability
  • HBase: wide-column, strongly consistent
  • Voldemort: key/value, availability over consistency

Viewing these databases through the lens of the CAP theorem is indeed important, but can be quite reductive, as most of these systems provide compromises for users to adapt to their use cases, or even let the client fine-tune the consistency/availability trade-off. See this article for a precise explanation of where the CAP theorem falls short of describing real-world situations: https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html

Conclusion of part 1

This concludes part 1 of our two-part series about the evolution of databases. We took a helicopter view of the history of databases and how they adapted to the industry’s data-hungry needs and to new cluster-based architectural patterns.

In part 2, we will look at the other types of databases that appeared, such as cache and graph databases, and provide basic guidance on how to choose the right system for your use case.
