We have developed our own generalized data model, and we made it publicly available today. Why would we do this when several other models are already in use? There are a few reasons.

  1. First, we needed a way to store data such that we could easily convert it to the OMOP, Sentinel, and PCORnet data models. Our Jigsaw software can run against a variety of data models, but we have to build the back end for each one. And that requires test data sets to make sure things are working. We didn’t want to have to do a full ETL for each data model, so we created a data model that stores the data in a flexible way and makes it easy for us to move it to another data model.
  2. Second, we wanted to ensure that we could do the ETL process quickly and easily. The biggest challenge in designing software to build clinical research studies is getting the data organized. So, we wanted a data model that required as little ETL work as possible. If you can’t get the data into the data model, then you can’t use automated tools like Jigsaw.
  3. Finally, we wanted to be able to store information and relationships that were not accommodated in other data models. And we wanted the freedom to adjust the model as new use cases arose.

How do we do this? Basically, we avoid forcing the data model to have visits, which are the fundamental building block of other data models. Since we don’t have to commit to a definition of a visit, we can easily transform our data to other data models that are visit-centric, like the ones listed above.

Really, this effort is a by-product of trying to better support claims data. Claims data doesn’t actually store a single representation of a visit. It stores a representation of how visits are paid for, which often means that there are separate facility and provider records for a single visit. And within each representation of the visit, there are individual records with linked information (e.g., procedures linked with diagnoses). So, we chose to emulate this structure in the generalized data model. This approach also works for electronic health records: there is no problem storing visits when they exist; we just don’t require them.
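To make the idea concrete, here is a minimal sketch of the claims-style structure described above. The table and field names (`contexts`, `clinical_codes`, `linked_diagnosis_id`) are invented for illustration, not the published schema: facility and provider claims for the same hospital stay are kept as separate records, codes stay attached to the record they came from, and a visit is something an ETL can *derive* rather than something the model requires.

```python
# Illustrative sketch only -- hypothetical table and field names, not the
# actual generalized data model schema.
from collections import defaultdict

# Two records for one hospital stay: the facility claim and the provider claim.
contexts = [
    {"id": 1, "patient_id": 100, "type": "facility_claim",
     "start": "2016-03-01", "end": "2016-03-03"},
    {"id": 2, "patient_id": 100, "type": "provider_claim",
     "start": "2016-03-01", "end": "2016-03-01"},
]

# Codes are attached to the claim record they came from, and a procedure can
# point at the diagnosis it was billed against (the linked information above).
clinical_codes = [
    {"id": 10, "context_id": 1, "vocab": "ICD-9-CM", "code": "410.71",
     "role": "diagnosis"},
    {"id": 11, "context_id": 2, "vocab": "CPT", "code": "92980",
     "role": "procedure", "linked_diagnosis_id": 10},
]

def derive_visits(contexts):
    """Collapse a patient's claim records into one visit with the widest date
    span -- roughly the work an ETL to a visit-centric model has to do.
    (Naive: assumes all of a patient's records belong to one visit.)"""
    visits = defaultdict(lambda: {"start": None, "end": None, "context_ids": []})
    for c in contexts:
        v = visits[c["patient_id"]]
        v["start"] = min(d for d in (v["start"], c["start"]) if d is not None)
        v["end"] = max(d for d in (v["end"], c["end"]) if d is not None)
        v["context_ids"].append(c["id"])
    return dict(visits)

print(derive_visits(contexts))
# {100: {'start': '2016-03-01', 'end': '2016-03-03', 'context_ids': [1, 2]}}
```

The point of the sketch is the direction of the transformation: going from this structure to a visit-centric model is a merge step you can define per target model, whereas going the other way would require un-merging information that the visit-centric model has already thrown away.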

At any rate, we don’t want to get into all of the details here, because we have the entire thing on GitHub, where we have worked on this for the last 15 months (in our spare time). And we have a draft manuscript which we hope to post on bioRxiv so that people can read it while we pursue publication in a peer-reviewed journal.

This post would not be complete without mentioning that Marc Halperin has built some automated ETL tools that are very promising. He built them in R, but they should scale via Spark once we get things finalized. We can now design ETLs in a week or two (depending on their similarity to existing ETL specifications), compared to the much longer time required to code them by hand. So, we are now ready to take on the entire pipeline, starting from raw data.

We are happy to respond to comments and questions, and we encourage anyone who is interested to talk with us about implementation.