Some relevant trends
As the amount of business data continues to increase exponentially the need for smarter, more cost-efficient Data Platforms and Analytics grows. In 2020 the Corona -phenomenon is even speeding up digitalization thus speeding up also the business data volume growth. Leading enterprises and public organisations are transforming quickly within the new business environment and putting pressure on data platform vendors to innovate more intelligent ways to process and refine huge amounts of valuable data.
Cerion has profiled itself to be in the core of the above-mentioned trends, to pragmatically help customers succeed with rapidly emerging digitalization. Cerion is a pioneer in Data Platform Automation. It delivered the first versions of fully automated Data Vaults and Enterprise Data Warehouses already in 2009. Today modern, scalable, high quality, flexible, secure and robust data platforms are core business of Cerion. Our professionals manage non-structured, semi-structured and structured data to find best business solutions for our customers.
SmartEngine and D#
Cerion’s SmartEngine is a reliable data engine that automatically generates highly scalable Data Platform structures from visual conceptual data models and meta data. Modern Best Practise methodologies, like DataOps, DevOps, Iterative development and Agile models are fully supported. High productivity is undisputable and total cost of ownership (TCO) during the life cycle is among the best.
With the term D# (read DSharp) we refer to the methodology of building and maintaining leading Data Platforms together with the customer. D# was previously called SmartDW. D# is one of the leading low-code development environments and concepts for modern Big Data and Data Vault 2.0 data business solutions. D# also takes into account hybrid data solutions and aim for seamless compatibility within the enterprise data architecture of big customers. Re-usability in data modelling, full visibility into the data assets of big organisations, and better data governance are some of the benefits. Within the D#-methodology development work can be shared with the customer in many different ways. The customer can go for a turnkey solution by outsourcing most of the work to the vendor or choose more self-service, depending on the capability of customer’s personnel. Responsibilities can be managed with e.g. a simple RACI-matrix.
High Productivity for Modern Data Platforms, case/subset Data Vault 2.0
In the context of data warehouse–based analytics and reporting, the goal is to implement a methodology and corresponding development environment in order to minimize the time between identifying a reporting need and the desired report being delivered.
The Conceptual Model
For almost 30 years of my professional life, I have designed Conceptual Models to function as the basis for both BI- as well as general software solutions. Regardless of the intended implementation of the model, the main reason for modelling the actual business concepts as opposed to directly drawing a more technical structure like a database model is that the model describes familiar phenomena using established terminology that the customer can relate to, and as such it can be used as a communications tool both internally and with third parties for a wide variety of purposes.
The model itself is implementation agnostic, it only specifies what information the possible implementation should be able to handle, not how. For example, in a DW context, a specific model could have any number of different possible implementations (different database table structures), depending on other than content related needs: a multitenant DW could have an Owner-column in all tables, a multilingual solution might have a language id tagged onto some of the tables, different normalization methods could be used for one reason or another, or a certain standard, like Data Vault 2.0, might be demanded of the implementation. All of these different structures would need to support containing and loading data that is compatible with the same Conceptual Model.
From a BI perspective, there’s a specific use case for the Conceptual Model: using the structure that the customer already knows as a window into the data contained in a DW has proved to be a winner: no matter what the actual DW implementation looks like, publishing its contents through a set of “views” that are identical to the classes in the Conceptual Model makes the DW immediately familiar to everybody who knows the Conceptual Model.
Initial Problem: Low Degree of Reusability in Data Warehouse Projects
As a programmer, having been thrown into Data Warehousing some 20+ years ago, I was surprised by the fact that there was simply zero reusability in DW and ETL implementations: no matter how many times I personally had implemented a loading mechanism that loads person data into a DW table (or project, or an organization hierarchy), none of that work could be used as a basis for the next implementation, so in the next project I would do it all over again from scratch. Something was always different. The only reusable component appeared to be the acquired skill to implement DW tables in general as well as their corresponding loading mechanisms. The basic implementation pattern was always the same (with some slight variations), and the varying part from case to case was the actual data to be loaded.
In order to achieve reusability, that general knowledge of how to construct and load a DW on a generic level would have to be described as an algorithm, and then turned into a piece of software that could treat the variant parts of the task as input in order to produce the desired DW table structure and populating SQL code as output. That piece of software would eventually become SmartEngine for Data Vault 2.0. The variant part being fed to it, i.e. the description of the data that should be loaded into the DW, is the Conceptual Model augmented with additional implementation specific parameters.
The Model as a Source for Automation
A model is a compact way to document relevant information about a specific problem domain. A model-oriented approach to BI automation focuses on producing a model from which information is extracted in order to automate a wide variety of BI-related tasks. Such tasks range from automatically generating a more detailed data model from a less detailed one, fully automating the data loading process, implementing data quality checks, describing reports in terms of modelled elements as well as generally automating as many utilitarian repetitive day-to-day tasks as possible, to the benefit of both the customer and the developer.
The Development Process
Developing a DW, there are a great number of individual tasks that need to be performed to achieve the desired end result. Many of these are traditionally manual with varying degrees of supporting commercial tools available. Starting with identifying a need and ending with a report attending to it, it is here assumed that the development time spent in-between can be as short as possible only if there are as few manual tasks as possible. Hence, from a customer’s point of view, the more automation, the less time and money spent, with an added bonus of consistent and predictable end results. From a developer’s point of view, more time can then be spent on getting to understand the data as well as making something productive with it, instead of implementing tasks that are often mechanical and tedious in nature.
Skills to Algorithms
To perform any task, a person needs to have detailed knowledge of generally applicable rules and methods pertaining to the task at hand, as well as context-specific knowledge. For example, to implement the loading mechanism of a specific set of data into a Data Vault, the person needs to know how to deal with hash calculations, datatype conversions and load date columns in general, but also how this particular set of data should behave. Such a person is most likely a certified Data Vault 2.0 professional, and so has a specific technical skill set to show for it.
The driving idea is that if a person about to implement a task can be given a set of instructions that makes it possible to successfully perform the task, then that task can be automated. Going with the optimistic outlook that a model can describe any target of implementation in sufficient detail, the focus and workload then shift from tasks requiring technical, methodological or tool-specific expertise to instead modelling the data and supplying all the relevant metadata (which is exactly the same information that a human implementer would otherwise need) in a standard format which can be processed automatically.
Tasks which produce programming code (or files or text in general) are always excellent candidates for automation. The automation process results in a functionally identical result compared to the work having been done manually. The result of an automated task is not necessarily “better” than what would be achieved manually, the job just gets done faster and most likely in a more consistent way. However, the amount of manual work needed for some specific functionality might be so vast that customers tend not to get it done, so automation may also lead to the customer getting features that otherwise would not have been realistic.
Let’s Be Real
The real-world constraint of customers wanting everything by yesterday puts pressure on solution providers to implement more efficient production methods. Automation makes complete categories of manual work disappear, but other tasks (hopefully simpler) will take their place.
Building a DW is not a random process. Depending on the properties of the data, there are only a finite number of ways how it could or should be processed. For example:
- a set of data can be identified by a real business key (a hub is generated), or it can’t (a link rather than a hub is generated)
- certain values may need to have leading zeroes either added or removed in order to generate perfectly matching values (cost centers in the transaction extract have no leading zeroes, but the list of cost centers does)
- a data set containing several time-dependent versions of the same real-world entity, say, organization structure with history, needs to be handled differently than a data set with only one instance of the same entity, say, the current organization structure
- hierarchies are typically loaded into a specific kind of structure
- denormalized data could be split/normalized.
Each of these cases (or what I call BI Design Patterns) is always implemented using the same implementation pattern. Some cases are as simple as data transformations (removing leading zeroes) that only require applying some SQL function to an individual value being loaded, but others are rooted deep in the methodology being implemented and affect either how the database structure should end up, or how the load process should be implemented (Business key implies hub, historized data should be loaded in smaller batches), or both. Automating the latter category of patterns equals implementing the skill set of a Data Vault professional, resulting in a DW Automation tool.
DW related tasks can roughly be divided into those resulting in the data being available in a DW, and those operating on this data, such as preparing it for fact/dimension tables or surfing the data during bug hunts. Here we will only consider the former category, although both categories are equal time savers once automated.
The best targets for automation in general can be found by asking questions like:
- what do I spend most of my time doing?
- which tasks take me the most amount of time to finish or to get right?
- what tasks are too easily left undone for whatever reason, for example because they are too much work, or I strongly feel the results will probably never be used anyway (hint: documentation)?
- which tasks produce the most human errors?
- which tasks most often result in person-dependent end results?
- which tasks require expert knowledge?
The mindset should be that if there is an algorithmic way to describe how to perform a task manually, the task should be considered for automation. If there isn’t, you might want to rethink how you go about implementing the task until there is. And once a task has been automated, another task becomes the one that you spend the most time implementing manually.
So in the Data Vault 2.0 case, or in any DW case for that matter, these features are the Top 2 ones (the no-brainers) that should be automated for relevant platforms:
- automatically generate the database structure, as per the official Data Vault 2.0 specifications (as well as the “whatever works for this kind of data” category)
- automatically generate the hashing and loading mechanisms that populate the database structure (for example, SQL Stored Procedures)
Getting the Data Vault Right
In order to minimize the time from first exposure to the data to when it has been loaded into the Data Vault, we don’t want to spend much time manually drawing the table structure. However, we do need to make the same design decisions (the reasons for putting a certain column in a specific Data Vault table do not change just because we automate the process). Also, because a class (say, Person) is represented by only one object in a Conceptual Model, but can be several Data Vault tables in an database design tool, there is also simply less to draw when drawing a Conceptual Model, so that’s what we should go for.
Often the source structure can be directly extracted from the information schema of the source database, or some similar metadata may be available in some other format. When we have access to the data (or metadata), we can, after whatever customer-participating process we are employing, understand it to such an extent that we are able to identify the relevant BI Design Patterns and augment the existing source metadata with additional information (having made the previously mentioned design decisions), such that our automation mechanism chooses the correct implementation patterns for these specific cases, arriving at the same table structure that we would have drawn manually, but without the mistakes.
Likewise, augmenting the Conceptual Model with appropriate class- and attribute specific metadata, and finally mapping the source columns to the target attributes, we have created all the relevant information we need to generate the final Data Vault table structure.
Implementing the Loading Mechanism
From an automation point-of-view, if each column has metadata that describes its reason for being in that table, the loading mechanism can be automatically generated by analysing the set of columns in the table and choosing the correct approach for each unique and identifiable case. For example
- if a hash value is already present in a hub or a link, no insertions should be made
- if the table type is a standard satellite, the load mechanism should only insert new rows for each hash value where the hashdiff column for the newest row is different than the incoming hashdiff value for that hash value
- if a satellite’s primary key contains other columns than the hash and loaddate columns, it is a multiactive satellite, and should be loaded differently than a satellite that is not multiactive.
And so on. That is, the standard Data Vault 2.0 rules for loading data. Plus whatever variation makes sense for a particular set of columns, combined with the kind of data being loaded. Generating the loading mechanism is trivial once it has been established which approach should be taken for each table.
The Conceptual Model, Revisited
Having implemented these basic features, the next answer to the question “what am I spending most of my time doing” will be “getting the data out from the vault”. This need combined with previous experiences with Conceptual Model based structures (3NF), gives us this third feature to be automated:
- generate a set of views such that each view corresponds to a class in the Conceptual Model and that correctly gathers and presents the data from the hub/link/satellite -structure where it resides.
With these 3 features in place, we find ourselves in a situation where:
- the Data Vault has been designed using a less detailed Conceptual Model that only defines what should be stored in the DV, not how
- the design and implementation of the actual structure of the Data Vault is fully automated, using the Conceptual Model, source mappings and other metadata to split the Conceptual Model classes into appropriate Data Vault tables
- the hash/load procedures are all fully generated from the same metadata
- all the data in the Data Vault is easily accessible through a set of 3NF views
- those views are identical to their corresponding class equivalents in the Conceptual Model.
So, from the point-of-view of the guy developing reports or retrieving data for advanced analytics purposes, he only sees the 3NF structure, and he does not need to worry about the underlying Data Vault structure at all. Behind the scenes there are hubs, links, satellites, state satellites, pit tables and bridges, all fully automated. Optimization techniques (metadata based) can be used as necessary to make the view layer smooth.
Using automation that algorithmically implements both the Data Vault structure as well as all data loading mechanisms has the following practical effects:
- Minimal implementation time between specifications being ready and the data being loaded into the DW (a prerequisite to the original goal of delivering reports and analytics quickly)
- Standard and consistent end result
- No human errors
- No person-dependency
- No need to be scared of changes.
The need for expert knowledge of a specific ETL tool is no longer necessary. Time that is freed from the otherwise mechanical and time-consuming work can now be put to better use, for example to get to know the data prior to loading it as well as implementing reports and performing analytics using it. So instead of a team of ETL guys, one skilled professional can easily and efficiently design and implement the entire chain from source to report. Combine this with an incremental and agile DataOps-approach and you have an efficient and to-the-point delivery process.
Having the data resting comfortably in the Data Vault and available in the Conceptual Model layer with minimal effort is a good start. There are still a couple of steps before it is visible in the report, we’ll get to those in another story.
About the author
Kim Johnsson, Principal Data Architect
50+ years young, wannabe-musician and music producer in his spare time, if there is such a thing. Graduated from Åbo Akademi in 1996, having written his thesis about Object Modelling Technique, one of the forefathers of UML. Really a software guy, Kim wrote his first fully Conceptual Model based application generator in the mid 1990’s, so a model based DW automation mechanism is quite a natural direction for him to take. When the going gets tough, he automates it.