While working with Erica Voss from Janssen on creating the ETL spec for SynPUF, we at Outcomes Insights used the Rabbit in a Hat (RiaH) tool to develop the ETL.

RiaH was helpful in creating the ETL documentation, but we felt that there was room for improvement. We created a series of issues to capture each individual feature we wanted to see added to RiaH, and we’ve gradually been working through that list, adding features to RiaH and giving them back to the community.

But it might be hard to discern our vision for RiaH from the tickets alone, so today we'll outline what we'd like RiaH ultimately to become.

There is a grand tradition in the startup world where new products are summed up as the pairing of two existing sites, e.g., “Hacker News for data scientists” (i.e., DataTau). Borrowing from that tradition, we see RiaH becoming “TurboTax for ETL”.

Taxes are hard. There are lots of rules from the IRS. There are many financial details to track. But tax preparation software (like TurboTax) takes users on a guided interview about their life and finances, asking increasingly detailed questions about specific tax situations based on the user’s previous responses. The software allows users to ignore irrelevant and inapplicable tax rules and focus on simple questions while providing information about the tax rules that apply to the user’s situation.

We’d like to see RiaH do the same for ETL. OHDSI’s Vocabulary v5 provides a lot of great rules for how data should be organized within CDMv5, and RiaH should take advantage of those rules to take users through a guided interview about the ETL between their source data and the targeted CDM.

Our major goal is to have WhiteRabbit/RiaH characterize the source data and determine most of the mappings that a user should consider in their ETL. Issue #45 explains this in detail, but we’ll repeat it briefly here: If we see that a source table has a column that contains ICD-9 codes, we know that source table is going to need to map to all the tables in which ICD-9 codes might be stored within the CDM, e.g., condition_occurrence, procedure_occurrence, observation, etc. There is no reason why a user should have to remember all the tables to which a given vocabulary is mapped, let alone require a user to then hand-draw 4+ arrows between tables.
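To make the idea concrete, here is a minimal sketch of that kind of column characterization. Everything in it is an assumption for illustration: the ICD-9 regex, the match threshold, and the list of candidate CDM tables are invented for this example and are not part of WhiteRabbit or RiaH.

```python
import re

# Hypothetical ICD-9 code shape: "250.00", "V70.0", "E849.0", etc.
# This pattern is an assumption for the sketch, not RiaH's actual logic.
ICD9_PATTERN = re.compile(r"^(V\d{2}|E\d{3}|\d{3})(\.\d{1,2})?$")

# CDM tables where ICD-9-coded facts might land after vocabulary mapping.
ICD9_TARGET_TABLES = [
    "condition_occurrence",
    "procedure_occurrence",
    "observation",
    "measurement",
]

def suggest_targets(column_values, threshold=0.8):
    """If most non-empty values look like ICD-9 codes, suggest the
    CDM tables this source column's table should be mapped to."""
    values = [v for v in column_values if v]
    if not values:
        return []
    matches = sum(1 for v in values if ICD9_PATTERN.match(v.strip()))
    if matches / len(values) >= threshold:
        return ICD9_TARGET_TABLES
    return []

# A column of diagnosis codes triggers all four suggested arrows at once,
# instead of a user hand-drawing them.
print(suggest_targets(["250.00", "401.9", "V70.0", "E849.0"]))
```

In a real tool the scan report would supply value frequencies rather than raw values, and the vocabulary tables themselves would drive which target tables a recognized code system implies.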

Instead, RiaH should automatically draw arrows between tables after looking at the contents of each column in the source data and determining which tables a source table will obviously need to be mapped to. Then a user can just click on arrows that still need to be filled out and fill in all the relevant information for each pre-mapped arrow, adding any additional arrows they deem necessary or even pruning arrows they know aren’t relevant to their ETL. Release 0.5.0 laid much of the groundwork to make it easy for a user to navigate and interact with arrows, making this kind of workflow very efficient.

RiaH also needs to present as much information about the source tables and target CDM as possible, so that users don't have to hop out of RiaH and fire up a browser, Acrobat, or Word to leaf through documentation about a particular column. We've already made CDM documentation appear directly in RiaH, and we have plans to add even more information. We're also hoping to find a way to supplement the WhiteRabbit scan report with documentation about the source data, so that users have as much information as possible about both the source and target databases.

Lastly, we have an even grander vision for RiaH, but it is so vague, and perhaps overly ambitious, that we haven't yet put any issues up on GitHub to discuss it.

We’re honestly frustrated that after working for hours and hours and hours on an ETL in RiaH, the end product is just a Word document that describes what the ETL should do, and that’s it. We have a great spec, but no code to run on our data. RiaH only solves half of the ETL problem. A well-specified ETL is critical, but useless without an implementation of that ETL.

We'd like to see RiaH become a tool that allows us to actually implement the ETL as it is being specified. Exactly how we'll achieve this isn't yet clear. But, naively, we hope that as we write out natural-language descriptions of the transformations the ETL must perform on parts of the data, we can also write out the code that performs those transformations.
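One naive shape this could take is pairing each documented mapping with a code template that can be expanded mechanically. The sketch below is purely hypothetical: the mapping structure, field names, and source columns are invented for illustration, and real ETL logic would need far more than a column-to-column rename.

```python
# Hypothetical mapping entry: the natural-language description RiaH already
# captures, plus enough structure to emit code. All field names are invented.
mapping = {
    "source_table": "dx",
    "target_table": "condition_occurrence",
    "description": "Map each diagnosis row to a condition_occurrence record.",
    "columns": {
        "person_id": "patient_id",
        "condition_source_value": "dx_code",
        "condition_start_date": "dx_date",
    },
}

def generate_sql(m):
    """Emit a naive INSERT ... SELECT statement from the mapping spec,
    carrying the human-readable description along as a comment."""
    targets = ", ".join(m["columns"].keys())
    sources = ", ".join(m["columns"].values())
    return (
        f"-- {m['description']}\n"
        f"INSERT INTO {m['target_table']} ({targets})\n"
        f"SELECT {sources}\n"
        f"FROM {m['source_table']};"
    )

print(generate_sql(mapping))
```

Even a generator this simple illustrates the appeal: the spec and the implementation come from the same artifact, so they can't drift apart for the straightforward parts of the ETL.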

Christian Reich (an OHDSI collaborator) is fond of saying "the perfect is the enemy of the good", meaning that trying to get something 100% right can completely kill a project that would otherwise have been incredibly useful had it solved 80% of the problem.

When we propose that RiaH allow us to generate an implementation of the ETL alongside the documentation, we imagine we could generate 80% of the code needed to perform the ETL, with the remaining 20% written by software developers. But at least that 80% of boilerplate code would already be done, freeing developers to focus on the truly sticky problems rather than tying them up with boilerplate.

Who knows if RiaH is even the proper platform for this kind of tool? Bottom line is, this is a project that is constantly churning in the back of our minds, and we’re itching to find the time and resources and innovations we need in order to solve it.