When I think of a modern development pipeline, I always find database development tooling underdeveloped compared to what code development tools offer. We have version control systems, review tools, CI/CD solutions, linters and many more. There are plenty of them, so you can automate your workflow as you like and concentrate on writing code. But what modern development tools still can't help much with is how you process and version database changes.
A struggle to build an automatic database development pipeline
I've tried many different approaches: schema comparison tools (Replicator, Redgate, DBComparer, etc.) in combination with version control hooks, writing migrations, automatic migrations using ORM frameworks, and so on. But always without much success, eventually reverting to a semi-manual approach. Sure, database schema comparison tools can do pretty well, but think of complex deployment pipelines, blue/green deployments, rollbacks and the like. This is where you have to do a lot manually to make schema comparison actually work.
I once had a project where the schema and the structure of the data were really complicated and unstable. It was a data-intensive application, and I had to change the schema and data a lot. That caused a lot of pain and eventually turned my delivery pipeline into a combination of automatic and manual steps, which made the whole process slow and exhausting.
Things are hard when messing with data schemas
So what are the common pain points when automating database development?
A problem of code and state. You have to track database changes separately from the code. That means you can't always say that a particular commit will work with the current version of the database. Tests will help. It's good if you have an ORM definition for database objects, so you can recreate the whole structure from the code. But then the problem is how you handle existing data, and you'll probably have to develop separate approaches for development, testing, and production environments. Alternatively, you can track the schema version inside the database itself and store a link to that version in the code. Combine this with migrations, and that's how most web frameworks handle it. Or you can export the whole set of data definition statements into the code.
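To make the "schema version inside the database" idea concrete, here is a minimal sketch of how such tracking could work, using Python's built-in sqlite3 module. The `schema_version` table name and the migration list are assumptions for illustration, not any particular framework's API:

```python
import sqlite3

# Hypothetical ordered list of migrations; in a real project each entry
# would live in its own versioned file.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate(conn):
    # The database itself records which version it is at.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:  # apply only migrations newer than the db
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: already-applied migrations are skipped
```

The code then only needs to know which version it expects, and deployment can refuse to start against a database at the wrong version.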
A problem of keeping up to date. You must not forget to write the migration yourself. It's tempting to skip this crucial step, especially the rollback part. It's easier if you use an ORM, as you can always diff the definition in code. If not, you have to be cautious and not miss any tiny change you made. Schema comparison tools are really helpful here.
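A toy illustration of what such comparison tools automate: diffing the columns of the same table in two databases to spot a change that was never migrated. Real tools compare much more (indexes, constraints, defaults); the table and column names below are made up:

```python
import sqlite3

def columns(conn, table):
    # PRAGMA table_info returns (cid, name, type, notnull, default, pk)
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}

dev = sqlite3.connect(":memory:")
prod = sqlite3.connect(":memory:")
dev.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
prod.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# Columns present in dev but never migrated to prod.
missing = columns(dev, "users").keys() - columns(prod, "users").keys()
print(missing)
```

A check like this in CI could fail the build whenever the code's schema drifts from what production actually has.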
A cost problem. Changes to a database are expensive. Good if you have a stable database structure before going to production. Surely you don't. Adding fields becomes more and more expensive as you acquire more data. And when you want a derivative database for analytical purposes with an ETL layer, things start falling apart and you get stuck.
A data regression problem. Changes to a database are dangerous. Are you afraid to delete a field? You should be. That's why many companies never delete fields, or have a long and complicated process for changing the schema in production. But keeping all the legacy fields makes the schema hard to maintain. Only good documentation can help with that, and I've never seen any.
A time problem. Changes to a database are slow. If you want to add a field or change an existing one, you have to test whether the code correctly handles the new schema, then move forward and test the deployment on a staging server. Good if you have a copy of the production database, but sometimes you don't. So you have to check everything in production to avoid common pitfalls with data changes, like losing data when switching a field's type or forgetting about foreign keys. Yeah, foreign keys. Tightly coupled data is always a challenge to unravel when you want to touch key fields. All this is very time-consuming.
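The "losing data when switching a field's type" pitfall has a well-known workaround, often called expand-and-backfill: add the new column alongside the old one, convert the data explicitly, and only drop the old column once nothing reads it. A minimal sketch, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount TEXT)")
conn.execute("INSERT INTO orders (amount) VALUES ('19.99'), ('5.00')")

# Expand: add the replacement column next to the old one instead of
# changing the old column's type in place.
conn.execute("ALTER TABLE orders ADD COLUMN amount_cents INTEGER")

# Backfill: convert existing values explicitly, so nothing is silently
# truncated or coerced. The old column stays until the code stops using it.
conn.execute("UPDATE orders SET amount_cents = CAST(ROUND(amount * 100) AS INTEGER)")
conn.commit()
```

The price of this safety is exactly the slowness described above: two deployments and a backfill instead of one `ALTER`.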
Dreaming up of a solution
What I would really like is a tool that takes care of the issues above. What would it look like? Just some thoughts on what it should do:
- It would integrate with a version control tool (git). When I make a commit, it snapshots the database schema. Changes should be reversible and, where possible, applicable separately. Think of a diff for a schema.
- It should take care of data when applying changes. Renaming a field, changing a field's type, or adding constraints should not degrade data. For example, deleting a field should not actually remove values from the database, but move them to something like field_a_deleted.
- A single command-line tool to handle all the tasks, one that can be easily integrated into a CI/CD pipeline. Like many, I love working in the console. Imagine checking out a revision and running something like "database revert".
- Monitoring and logs. How long did this particular migration take? Were there errors?
- A UI to review changes made to a database. Reviews are essential for code, so why not for data too? This is partially solved by diffing ORM definitions, but that gives no insight into how much data will be affected.
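The first wishlist item — snapshotting the schema on every commit — is simple enough to sketch today. The function below dumps all CREATE statements to a text file; run from a git pre-commit hook, it would make `git diff schema.sql` show schema changes alongside code changes. The database, the file name, and the hook wiring are all assumptions for illustration:

```python
import sqlite3

def dump_schema(conn, out_path):
    """Write every CREATE statement from sqlite_master to a text file."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    )
    with open(out_path, "w") as f:
        for (sql,) in rows:
            f.write(sql + ";\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
dump_schema(conn, "schema.sql")  # commit this file alongside the code
```

This gives the "diff for a schema" part; reversibility and data-preserving application are the genuinely hard pieces such a tool would still need to solve.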
Evaluating the points above, I think this may be not simply a tool but a complete database management framework, combining approaches, templates, and scaffolding for establishing a complete database development pipeline. When you start a web project, you normally start with a framework that guides you on how to organize the application. A data framework would likewise guide you on how to organize the database and changes to it. And I am not talking about an ORM.
I personally think that data schemas are getting more and more complicated as applications slowly evolve toward smart, ML-powered solutions. The ability to quickly introduce changes and test them online is essential to staying competitive today. I really feel the need for great new modern data tools.