Cohort360 using AP-HP's stack

Cohort360 is the only Open Source part for now, but we are working on Open Sourcing all the parts. In the meantime, you can contact us at: open-source [at] cohort360.org

Example of a complete working stack at AP-HP

This is the architecture deployed at AP-HP to deliver all the features of Cohort360 for more than 12.9 million patients:

Cohort360 is the rectangle on the right, containing the front-end and the back-end. As you can see, at AP-HP it interacts with several APIs: to authenticate (Auth/JWT API), to read health records (HL7 FHIR API), and to run simple or complex queries and create cohorts of patients (QueryServer API).

All of these APIs can be replaced by your own, or used directly to reproduce exactly the same architecture. Reproducing this architecture might be your best option right now, as Cohort360 has only been tested with these APIs. But if you already have a FHIR API and an authentication API, and have some development skills, you can try to connect your own APIs and develop your own QueryServer/SparkJobServer equivalent.

Description of our APIs

  • The Auth/JWT API is simply an API that provides JWT tokens (access/refresh), which the front-end uses to log users in and to authenticate them against the other APIs. It is easy to swap out with some development work (see the token sketch after this list).

  • The HL7 FHIR API is an API that exposes medical data; FHIR is a standard for medical APIs. At AP-HP, it is developed using HAPI FHIR, a Java Spring Boot framework that simplifies the development of a FHIR API. Since FHIR is a standard defining how your API should expose medical data, it should normally be easy to replace it with your own FHIR API. But let's face it: not all FHIR APIs developed around the world expose exactly the same resources (Patient, Genomics...) and functionalities (search, filter...). So what can we do about it? We are actively working on providing an implementation guide that helps you implement what is needed in your FHIR API to make it compatible with Cohort360. We are also discussing using a capability statement, which describes the capabilities of a FHIR API, to possibly enable or disable Cohort360 functionalities automatically based on what the FHIR API says it can or cannot do (a FHIR search sketch follows this list).

  • The QueryServer is an API that receives possibly complex queries. Its role is not to run these queries, but only to translate them and forward them to the SparkJobServer. Incoming queries are written in a language combining FHIR-like criteria, but the SparkJobServer does not understand FHIR syntax; instead, it knows how to query the data directly in the database/indexer syntax (a SolR-like language). So the QueryServer translates queries from FHIR-like to SolR-like and then sends them to the SparkJobServer (a translation sketch follows this list). The QueryServer is asynchronous.

  • The SparkJobServer is an Open Source project that we use to run Spark jobs asynchronously and without the overhead of launching Spark every time (it keeps Spark sessions active). We use it by running a Spark session that receives possibly complex queries from the QueryServer, written in a language combining SolR-like queries. Each query is interpreted by our Spark code and distributed if needed; the result is a cohort of patients matching the criteria of the query. A job can specify whether the query should only return the ids of the patients matching the criteria, or whether it should persist the cohort of patients in database: the SparkJobServer can write data to Postgres and SolR. This is how Cohort360 creates cohorts of patients in database. Once the data is persisted, it is accessible via our FHIR API, and therefore in the Cohort360 application. The SparkJobServer needs a database to operate correctly; it can run with a file database like SQLite, but works better with Postgres (a job-submission sketch follows this list).
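To make the login flow concrete, here is a minimal sketch of how a client might obtain and refresh JWT tokens from an Auth/JWT API. The base URL, endpoint paths and payload fields are assumptions for illustration, not the actual AP-HP API:

```python
import requests

AUTH_BASE = "https://auth.example.org"  # hypothetical Auth/JWT API base URL

# Log in and obtain an access/refresh token pair
# (the endpoint path and payload fields are assumptions).
resp = requests.post(f"{AUTH_BASE}/auth/login",
                     json={"username": "jdoe", "password": "secret"})
resp.raise_for_status()
tokens = resp.json()  # e.g. {"access": "...", "refresh": "..."}

# When the short-lived access token expires, trade the refresh token for a new one.
resp = requests.post(f"{AUTH_BASE}/auth/refresh",
                     json={"refresh": tokens["refresh"]})
resp.raise_for_status()
access_token = resp.json()["access"]
```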
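Because FHIR standardizes the search API, reading records looks roughly like the sketch below. The base URL is a placeholder; the search parameters and the /metadata endpoint returning a CapabilityStatement are standard FHIR:

```python
import requests

FHIR_BASE = "https://fhir.example.org/fhir"  # hypothetical FHIR API base URL
access_token = "..."                         # JWT obtained from the Auth/JWT API
headers = {"Authorization": f"Bearer {access_token}"}

# Standard FHIR search: female patients, 20 per page, returned as a Bundle.
resp = requests.get(f"{FHIR_BASE}/Patient",
                    params={"gender": "female", "_count": 20},
                    headers=headers)
resp.raise_for_status()
for entry in resp.json().get("entry", []):
    print(entry["resource"]["id"])

# The CapabilityStatement (GET {base}/metadata) describes what the server
# supports, which is what could drive automatic feature toggling in Cohort360.
caps = requests.get(f"{FHIR_BASE}/metadata", headers=headers).json()
```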
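To illustrate the QueryServer's translation role, here is a purely hypothetical sketch of turning a FHIR-like criterion into a SolR-like clause. The field mapping and query grammar are invented for illustration and are not Cohort360's actual query language:

```python
# Hypothetical mapping from FHIR-like (resource, parameter) pairs
# to fields in the indexer. Invented for illustration only.
FIELD_MAP = {
    ("Patient", "gender"): "gender",
    ("Condition", "code"): "condition_code",
}

def fhir_like_to_solr_like(resource: str, param: str, value: str) -> str:
    """Translate one FHIR-like search criterion into a SolR-like clause."""
    field = FIELD_MAP[(resource, param)]
    return f"{field}:{value}"

# "Patient?gender=female" becomes something like "gender:female",
# which the SparkJobServer knows how to run against the indexer.
print(fhir_like_to_solr_like("Patient", "gender", "female"))
```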
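Assuming the job server exposes a spark-jobserver-style REST API (submit to a long-lived Spark context, then poll the job), a minimal asynchronous submission could look like this sketch; the host, appName, classPath and context values are placeholders:

```python
import requests

JOBSERVER = "http://jobserver.example.org:8090"  # placeholder job server host

# Submit a job asynchronously to an already-running Spark context.
resp = requests.post(
    f"{JOBSERVER}/jobs",
    params={"appName": "cohort-jobs",          # placeholder uploaded jar name
            "classPath": "fr.aphp.CohortJob",  # placeholder job class
            "context": "cohort-context",       # long-lived Spark context
            "sync": "false"},
    data='query = "gender:female"',            # job input as a config string
)
job_info = resp.json()  # includes a job id that can be polled via GET /jobs/<id>
```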

Description of our databases

In the AP-HP context, we use multiple databases to store medical data:

  • Postgres to store structured medical data in the OMOP format, a standard that describes how medical data should be stored in a database for analytics. At AP-HP we use OMOP CDM v6.0, but as OMOP is a storage format specialized for analytics, it does not provide as many tables and fields as a classical EHR (Electronic Health Record) database, so we extended OMOP by adding tables and fields when needed (see the SQL sketch after this list).

  • Apache SolR to store textual indexes of medical data. SolR is an Open Source alternative to Elasticsearch. It is used to index textual data so that querying any textual data is really fast, whether it is exposed via our FHIR API or via our QueryServer/SparkJobServer APIs (see the SolR query sketch after this list).

  • Apache Phoenix, which uses Apache HBase underneath (which itself uses HDFS...), is an SQL-like distributed OLTP database. It is used to store PDFs and query them instantaneously. It could easily be replaced by an S3-style object store, for example (see the phoenixdb sketch after this list).
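For illustration, here is a small sketch of querying OMOP data in Postgres with psycopg2. The connection parameters are placeholders; person and condition_occurrence are standard OMOP CDM tables:

```python
import psycopg2

# Connection parameters are placeholders for illustration.
conn = psycopg2.connect(host="omop-db.example.org", dbname="omop",
                        user="reader", password="secret")

with conn, conn.cursor() as cur:
    # Count patients having at least one recorded condition
    # (person and condition_occurrence are standard OMOP CDM tables).
    cur.execute("""
        SELECT COUNT(DISTINCT p.person_id)
        FROM person p
        JOIN condition_occurrence c ON c.person_id = p.person_id
    """)
    print(cur.fetchone()[0])
```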
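Querying SolR is plain HTTP. The sketch below assumes a hypothetical host, core name and field; the /select endpoint and its parameters are standard SolR:

```python
import requests

SOLR = "http://solr.example.org:8983/solr"  # hypothetical SolR host
core = "documents"                          # hypothetical core name

# Standard SolR select query: full-text search over indexed clinical notes.
resp = requests.get(f"{SOLR}/{core}/select",
                    params={"q": "note_text:diabetes", "rows": 10, "wt": "json"})
resp.raise_for_status()
print(resp.json()["response"]["numFound"])
```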
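Phoenix is queried with SQL, for instance through the Phoenix Query Server using the phoenixdb Python client. The URL, table and column names below are invented for illustration:

```python
import phoenixdb

# Connect through the Phoenix Query Server (URL is a placeholder).
conn = phoenixdb.connect("http://phoenix-qs.example.org:8765/", autocommit=True)

# Hypothetical table holding one PDF blob per document id.
cursor = conn.cursor()
cursor.execute("SELECT doc_id, pdf_blob FROM documents WHERE doc_id = ?",
               ["12345"])
row = cursor.fetchone()
```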

Description of our ETLs

On the schema:

  • E1 is an ETL that loads new data from our EHR into our Postgres OMOP instance

  • C1 is an ETL that loads new data from Postgres OMOP into a temporary Delta Lake database

  • C2 is an ETL that loads new data from Delta Lake into the Apache SolR indexer

  • E2 is an ETL that loads PDFs from our EHR into our Apache Phoenix database

We use Delta Lake so that only new deltas of data are loaded from Postgres OMOP into Apache SolR, instead of reloading all the data each time, as sketched below.
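As a sketch of the C1/C2 incremental idea, assuming a Spark session with the Delta Lake extensions available; the JDBC URL, table name and updated_at watermark column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("omop-to-solr-delta").getOrCreate()

# C1 (sketch): pull only the rows changed since the last run from Postgres OMOP.
# The updated_at watermark column is an assumption for illustration.
last_run = "2023-01-01 00:00:00"
new_rows = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://omop-db.example.org/omop")
            .option("dbtable",
                    f"(SELECT * FROM person WHERE updated_at > '{last_run}') AS t")
            .option("user", "reader")
            .option("password", "secret")
            .load())

# Append only this delta to the temporary Delta Lake table.
new_rows.write.format("delta").mode("append").save("/delta/person")

# C2 (sketch): read the Delta table back and push the new batch to SolR
# (the actual indexing step would use a SolR connector or plain HTTP).
batch = spark.read.format("delta").load("/delta/person")
```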
