Unifying AI-Big Data Front to Fight Covid-19

BY THOMAS M. SIEBEL

Enterprise AI is about applying data science and digital transformation to business and government processes. It is the fastest-growing segment of enterprise computing. I am confident that the largest commercial application of AI will be precision medicine: disease prediction, genome-specific medical protocols, and AI-assisted diagnosis will result in greater availability of more efficacious medical care at lower cost.

It is clear that there is a ripe opportunity to apply Enterprise AI to mitigate the global COVID-19 crisis. Promising avenues for impactful breakthroughs from AI include, but are not limited to:

Applying machine learning/AI methods to mitigate the spread of the COVID-19 pandemic
Genome-specific COVID-19 medical protocols
Biomedical informatics methods for drug design and repurposing
Modeling, simulation, and prediction of COVID-19 propagation
Efficacy of COVID-19 interventions
Broader efforts in biomedicine, infectious disease modeling, response logistics and optimization, public health efforts, tools, and methodologies around the containment of rising infectious diseases, and response to pandemics so as to be better prepared for future infectious diseases.

Absent rich data sets, it is impossible to perform meaningful AI, so it is no surprise that we are seeing many organizations publish COVID-19 related data sets in the public domain to fuel COVID-19 Enterprise AI efforts. Some notable recent efforts include:

Our analysis of the data sets being published suggests that these efforts fall into three categories:

Type 1: Lists of URLs
Type 2: Libraries of Discrete Data Sets
Type 3: Unified, Federated Data Images

All are positive contributions, but the three categories vary considerably in the potential benefit they offer.

Each is described below:

Type 1: Lists of URLs. The first type, like the CORD-19 project and MITRE Healthcare Coalition, consists of providing lists of unique URLs that point to different datasets that are stored in different locations, in different data structures and formats (e.g., text, images, numerical data, voice, etc.).

Type 2: Libraries of Discrete Data Sets. The second type, like the AWS COVID-19 Data Lake program and the Google Open Cloud Platform, has taken many of the data sources accessible through the URLs referenced above and stored them in a digital library that is a collection of unique data storage systems (Postgres, DynamoDB, Neptune, Redshift, etc.). These are located in a common “physical” storage utility like AWS S3 or Google Cloud. For those data sets that have common data structures, e.g., CSV or JSON, each unique data set is stored in a unique database in that common storage utility. In these systems, the data sets may be individually accessed through utilities available on AWS, such as Postgres and SageMaker. Access is free up to a point; a fee structure kicks in once data volumes reach a limit or the researcher wants to integrate those data sets with other externally available data sets. The Amazon Data Lake is accessible through Amazon’s data access products, and the Google data sets are available through Google Cloud utilities. The data are neither pre-integrated nor federated.
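To make concrete what "neither pre-integrated nor federated" can mean for a researcher, here is a minimal Python sketch of the reconciliation work left to the user of a library of discrete data sets. It assumes two toy data sets, one CSV and one JSON, that name the same locales under different field names and letter case. The field names, keys, and numbers are all invented for illustration and do not come from any of the data lakes named above.

```python
import csv
import io
import json

# Toy stand-ins for two discrete data sets. Every value here is made up.
CASES_CSV = """region,confirmed
US-CA,120
US-NY,300
"""

POPULATION_JSON = (
    '[{"locale": "us-ca", "population": 1000},'
    ' {"locale": "us-ny", "population": 2000}]'
)


def load_cases(text):
    """Parse the CSV data set into {normalized locale key: confirmed count}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["region"].strip().upper(): int(row["confirmed"]) for row in reader}


def load_population(text):
    """Parse the JSON data set into {normalized locale key: population}."""
    records = json.loads(text)
    return {rec["locale"].strip().upper(): int(rec["population"]) for rec in records}


def cases_per_thousand(cases, population):
    """Join the two data sets on the locale key that was reconciled by hand."""
    shared_keys = sorted(cases.keys() & population.keys())
    return {key: 1000 * cases[key] / population[key] for key in shared_keys}


rates = cases_per_thousand(load_cases(CASES_CSV), load_population(POPULATION_JSON))
print(rates)
```

Even in this tiny case the researcher has to discover that "region" and "locale" refer to the same thing and normalize the keys before any analysis can start. With dozens of sources in different storage systems, that reconciliation becomes the bulk of the effort.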

Type 3: Unified, Federated Data Images. The C3.ai COVID-19 Data Lake is unique in that we have curated those data sets that we understand to be of the most utility to researchers, including those listed above, and aggregated those data into a unified, federated, logical image that is immediately available for researchers to access through any utility that offers RESTful data access (e.g., Excel, Tableau, R, Python, etc.). We have preestablished the important linkages in those complex data sets so that researchers can easily navigate and explore the data features that may be of interest (e.g., diagnosis, age, locale, preexisting condition, etc.) and can perform sophisticated data science on those data. Just as importantly, the C3.ai COVID-19 Data Lake provides researchers an abstraction layer over the disparate, polyglot mix of structured and unstructured data, so that the researcher does not have to be aware of the physical and logical structure and associations of those data. The C3.ai COVID-19 Data Lake is also immediately extensible by the data scientist and can be easily linked with other external data sets.
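For readers less familiar with the term, a RESTful data call is simply an HTTP request that names the data wanted and gets structured data back. The Python sketch below builds such a request without sending it. The URL is a placeholder, and the entity name, filter fields, and payload layout are hypothetical, chosen only to echo the features mentioned above. It is not the C3.ai API, whose actual endpoints and request format should be taken from its documentation.

```python
import json
import urllib.request

# Placeholder endpoint for illustration only. This is NOT a real C3.ai URL.
BASE_URL = "https://example.org/api/covid-data"


def build_fetch_request(entity, filters, limit=100):
    """Build (but do not send) a JSON-over-HTTP POST request for one entity type.

    The payload layout (entity / filter / limit) is a hypothetical convention.
    """
    payload = {"entity": entity, "filter": filters, "limit": limit}
    return urllib.request.Request(
        url=f"{BASE_URL}/{entity}/fetch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Hypothetical query: diagnosis records for one locale and age band.
req = build_fetch_request("diagnosis", {"locale": "US-CA", "age_min": 60})
print(req.get_method(), req.full_url)
print(json.loads(req.data))

# Sending the request would be urllib.request.urlopen(req); it is omitted here
# because the endpoint above is a placeholder.
```

Because the request is plain HTTP with a JSON body, the same call could be issued from R, a spreadsheet add-in, or a BI tool, which is the practical meaning of access "through any utility that offers RESTful data access."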

As a result of the integration and federation of the data, we are able to provide rich knowledge graphs to assist the researcher in understanding the scope of, and connections in, the data.

Experts estimate that data scientists spend up to 90% of their time and effort “wrangling” data so that the data are in a form that is accessible for analysis. In the C3.ai COVID-19 Data Lake, we have done all of that work.

In an effort to visualize the characteristics of these three categories, consider the various forms of COVID-19 data as analogous to the contents of a university lending library, including books, medical journals, videos, transcripts, medical records, census records, historical pandemic studies, etc.

You can think of the Type 1 Data Lake as a recommended reading list.

A Type 2 Data Lake is analogous to a university library with books, photos, recordings, etc., stored in a common location that can be identified and retrieved in their original form factor through some classification system like the Dewey Decimal System.

Type 3: The C3.ai COVID-19 Data Lake is analogous to the World Wide Web, where we have used the big data equivalent of HTML to enable researchers and analysts to view and navigate all of the associations within and across the data sets from the union of the collections of all the university libraries. We are confident that this is unique in the scope and scale of the data set aggregated, in its ability to be addressed as an integrated, federated, and unified data image, and in its ability to be accessed by any software utility that can make a RESTful data call.

The C3.ai COVID-19 Data Lake will be released into the public domain on April 20, 2020, as a no-cost utility to facilitate advanced COVID-19 research. It is our intention to continually expand the corpus of the C3.ai COVID-19 Data Lake. We are in active discussions with many outside organizations, including many leading universities and the NIH, CDC, and United Nations, to accelerate this effort.

It is our hope and expectation that the C3.ai COVID-19 Data Lake will serve to accelerate the application of Enterprise AI to alleviate COVID-19.

Thomas M. Siebel is the CEO of C3.ai and the Chairman of the C3.ai Digital Transformation Institute.