Harvest: Data Discovery for Humans

Byron Ruth (The Children's Hospital of Philadelphia), Michael Italia (The Children's Hospital of Philadelphia)
Data, Python
Location: D136

While “Big Data” has been commonplace in many industries, biomedical researchers have traditionally worked with data that could be easily managed in spreadsheets and even in paper lab notebooks. The switch to electronic health records and the increasing use of genomic technologies are rendering these tools inadequate. Many business intelligence software packages exist for exploring and analyzing complex data; however, these tools fit poorly in cross-institutional academic settings, where the cost and burden of end-user training make their deployment unsustainable. Furthermore, unlike transactional data such as website analytics, financial records, or business operations data, many aspects of biomedical data are not readily summed or averaged, further limiting the utility of existing tools.

Through our experience developing many data-intensive biomedical applications, our team at The Children’s Hospital of Philadelphia (CHOP) has been researching ways to lower the barriers for researchers and clinicians to explore their data. The result of our efforts is an open source framework called Harvest. Harvest is designed to hide the complexity of intricate (and often opaque) data models and to enable interactive visual exploration of large data sets. It is composed of three primary components: Avocado, a dynamic query engine built on top of Django’s object-relational mapper (ORM); Serrano, a REST API that exposes endpoints for clients; and Cilantro, a web application written in JavaScript using HTML5 and CSS3 technologies to enable interactive query and display of data.

This talk will discuss the capabilities, architecture, and motivations for Harvest. We will address some of our core strategies for mitigating data model and query complexity as well as maintaining the user’s experience when working with large data sets. First, on data complexity, we will discuss exposing a perceivably flat data access layer to users, which makes it simple to quickly find the data they are interested in without concern for the structure of the data model. Second, we will focus on the importance of highly descriptive metadata for context and discoverability of data; researchers in particular typically use their own vocabulary for describing their data. Third, we will focus on abstraction approaches for making data robust and presentable without sacrificing the ability to query and sort discrete values.
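As a rough illustration of the first two strategies (a flat access layer and descriptive metadata), the following Python sketch maps flat, user-facing field names onto locations in a nested data model. It is not Avocado's API: the names `Field`, `CATALOG`, `get_value`, and `search_fields`, as well as the sample record, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Field:
    """Hypothetical catalog entry: a flat name plus descriptive metadata."""
    name: str          # the flat name a researcher sees
    path: tuple        # where the value lives in the nested data model
    description: str   # metadata used for context and discoverability


# Hypothetical catalog; a real system would derive this from its data model.
CATALOG = {
    f.name: f
    for f in [
        Field("age_at_visit", ("visit", "age_years"),
              "Patient age in years at the visit"),
        Field("diagnosis", ("visit", "diagnosis", "label"),
              "Primary diagnosis recorded at the visit"),
    ]
}


def get_value(record: dict, field_name: str) -> Any:
    """Resolve a flat field name against a nested record."""
    value = record
    for key in CATALOG[field_name].path:
        value = value[key]
    return value


def search_fields(term: str) -> list:
    """Find fields whose name or description mentions the term."""
    term = term.lower()
    return sorted(
        f.name
        for f in CATALOG.values()
        if term in f.name.lower() or term in f.description.lower()
    )


# Made-up sample record with a nested structure.
record = {
    "visit": {
        "age_years": 7,
        "diagnosis": {"code": "Q21.0", "label": "Ventricular septal defect"},
    }
}
```

With this catalog, a caller asks for `get_value(record, "diagnosis")` and never needs to know that the label sits two levels down, while `search_fields` lets them locate fields by the words used in the descriptions.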

Harvest also addresses data scale by gradually exposing data and statistics depending on the current context. We will discuss presenting aggregate statistics and distribution charts to inform users of what data is available before they build queries and view data.
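The idea of summarizing a field before any query is built can be sketched in a few lines of Python. This is an assumption-laden illustration, not Harvest code: the function name `summarize` and the choice of statistics are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Summarize one field's values: counts, then stats or a distribution."""
    present = [v for v in values if v is not None]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric field: report range and mean.
        summary.update(min=min(present), max=max(present), mean=mean(present))
    else:
        # Categorical (or empty) field: report value frequencies,
        # which could feed a distribution chart.
        summary["distribution"] = dict(Counter(present).most_common())
    return summary
```

For example, `summarize(["VSD", "ASD", "VSD", None])` reports three present values, one missing value, and a frequency table, which is enough to tell a user what a field contains before they filter on it.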

In addition to internal projects, Harvest powers several multi-center projects developed at CHOP including the NIDCD-funded AudGenDB and the NHLBI’s Pediatric Cardiac Genomics Consortium Data Hub.

Byron Ruth

The Children's Hospital of Philadelphia

Byron Ruth is a Senior Analyst/Programmer in the Center for Biomedical Informatics at The Children’s Hospital of Philadelphia. Byron’s skills in advanced web programming environments and content code abstraction have enabled him to lead a variety of projects at CHOP, including the development of a highly integrated audiology research database, an electronic health record-mediated clinical decision support engine for the care of premature infants, and a data management system that helps to discover relationships between genetic markers of congenital heart defects and clinical outcomes.

Michael Italia

The Children's Hospital of Philadelphia

Michael is a Lead Application Scientist in The Children’s Hospital of Philadelphia’s Center for Biomedical Informatics. His primary role is to lead, support, and advise projects with a need for integrated clinical, genomic, and imaging data to enable translational research.

Michael has over 10 years of experience building and managing complex biomedical data repositories. Prior to his work at CHOP, Michael spent 8 years designing and building several genomic data integration projects for one of the world’s largest pharmaceutical companies.

Michael has had a dual interest in biology and computer science since first discovering the field of bioinformatics as an undergraduate studying Biochemistry and Molecular Biology at The Pennsylvania State University. He also holds a master’s degree in Biotechnology/Bioinformatics from The University of Pennsylvania.
