Luke Shepard

My personal website.

Using machine learning to predict if cancer cells can repair themselves

When I first started learning about machine learning a few years ago, I was blown away by the concept. Instead of a human defining every little rule, could we let machines learn patterns from existing data to do a better job? I’ve spent the last several years building products that rely on genomic data to make predictions that can drive human results.

Tempus has the largest set of DNA and RNA paired with clinical data in cancer patients. We use that dataset to derive insights that inform clinical care. Once an algorithm has demonstrated results, we work to ship it for use in a clinical setting.

Here, I’ll share the story of one of our algorithms - the HRD (homologous recombination deficiency) detection test. This test predicts the probability of a patient’s cancer having a phenotype characterized by the inability to repair DNA breaks via the homologous recombination repair (HRR) pathway. We first launched the test using DNA data only, and last year built an enhanced version that also uses RNA. Both models were developed using machine learning. Let me first give some context about why this test is important, and then I’ll review how it works.

The Problem: how do cancer cells develop?

Healthy cells have mechanisms that help them repair damage that naturally occurs over time. When a single strand of DNA breaks, the DNA can be copied from the paired bases on the other strand of the same chromosome in order to repair the damage. However, it’s also possible for both strands of the chromosome to break, which requires specialized machinery to copy from the same gene on the other chromosome. This process is called homologous recombination repair, or HRR. The book The Gene describes this well:

When DNA is damaged by a mutagen, such as X-rays, genetic information is obviously threatened. When such damage occurs, the gene can be recopied from the “twin” copy on the paired chromosome: part of the maternal copy may be redrafted from the paternal copy, again resulting in the creation of hybrid genes.

Once again, the pairing of bases is used to build the gene back. The yin fixes the yang, the image restores the original: with DNA, as with Dorian Gray, the prototype is constantly reinvigorated by its portrait. Proteins chaperone and coordinate the entire process- guiding the damaged strand to the intact gene, copying and correcting the lost information, and stitching the breaks together – ultimately resulting in the transfer of information from the undamaged strand to the damaged strand.

When homologous recombination repair fails to work, it’s known as homologous recombination deficiency, or HRD. BRCA1 and BRCA2 are two of the better-known genes that regulate this pathway, one of the central pathways that prevent cancer from developing. Mutations in the BRCA genes, among others, can trigger HRD. HRD is particularly common in ovarian cancer, but it is present in many other cancer types as well.

If a cell presents with HRD, then the cell will have a hard time repairing a double-strand break – since the homologous repair pathway doesn’t work, the cell will likely die.

Figure: Two-hit hypothesis for HRD

Patients whose tumors have HRD have a treatment available to them: PARP inhibitors. From this article:

If your tumour is positive for HRD, it means that certain treatments are more likely to be effective. One of those treatments are PARP inhibitors.

PARP (poly ADP-ribose polymerase) is a type of enzyme involved in many functions of the cell, including the repair of DNA damage. In healthy cells, when DNA is broken or damaged, PARPs act as a repair crew to help fix the damage. This means the cell lives.

PARP inhibitors are a type of targeted therapy that block the action of the PARP enzyme in cancer cells, which means they can’t repair DNA damage.

The combination of being HRD positive (which blocks the first repair mechanism) and taking PARP inhibitors (which blocks the second repair mechanism) means that PARP inhibitors can be more effective than in people whose tumours aren’t HRD positive.

For more information, this video describes the underlying process of HRD fairly well:

The model: Using Artificial Intelligence to identify HRD

The most popular HRD test on the market was launched by Myriad in 2015. Today, a handful of companies offer these tests, each using slightly different mechanisms. The existing tests make decent predictions in the most well-studied types of cancer (for example, ovarian cancer), but they do not generalize as well to other types. Our team at Tempus saw an opportunity to improve on them by using a different detection strategy, powered by machine learning.

Tempus released a manuscript describing the model’s development (available here). The authors note that the HRD-RNA model may be able to find signal that isn’t present in the DNA alone:

While other cancer cohorts … had a lower prevalence of BRCA1/2 alterations, previous work has suggested that these cohorts may exhibit the HRD phenotype and respond to PARP inhibitors. We hypothesized that tumors from patients with these cancer types may exhibit the HRD phenotype in the absence of BRCA loss. … The HRD-RNA model, in contrast to the HRD-DNA model, has the potential to capture a dynamic HRD phenotype.

There are two modalities of genomic data that we can use for the predictions: DNA and RNA. Because HRD makes it more likely for mutations to go unrepaired, the DNA in a tumor with HRD can have “genomic scars” - evidence that there have been unrepaired mutations in the past. RNA, on the other hand, shows how the tumor is behaving. To use a computer analogy, DNA is like the source code, whereas RNA is the runtime. All cells in the body have more or less the same DNA, but they may express dramatically different proteins via RNA. In particular, a healthy cell will express different proteins than one where the homologous repair pathway is not working. We can use that to infer the health of the repair mechanism from cell behavior.

Tempus has sequenced RNA from many patients, compiling one of the world’s largest libraries of RNA data. After the initial launch of the DNA-based HRD test, our data science and bioinformatics teams set out to leverage RNA to improve the results. The data scientists pulled a training set from the Tempus data bank and used it to train a logistic model to recognize which patterns of RNA expression are most likely to indicate HRD positivity.
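To make the idea of a logistic model over expression data concrete, here is a minimal sketch in plain Python. It is not the Tempus model: the two "genes", their expression values, the labels, and the training settings are all made up for illustration, and a real model would use far more features, regularization, and held-out validation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(weights, bias, features):
    """Return the modeled probability that a sample is HRD-positive."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def train_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples = len(samples)
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(samples, labels):
            # Gradient of the log loss w.r.t. the score is (prediction - label).
            error = predict_proba(weights, bias, features) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [
            w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Synthetic expression values for two hypothetical genes (not real data).
# Columns: [gene_a, gene_b]; label 1 = HRD-positive, 0 = HRD-negative.
TOY_EXPRESSION = [
    [2.0, 0.1], [1.8, 0.3], [2.2, 0.2], [1.9, 0.0],
    [0.2, 1.9], [0.1, 2.1], [0.3, 1.8], [0.0, 2.0],
]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(TOY_EXPRESSION, TOY_LABELS)

# A new sample with high gene_a and low gene_b scores closer to 1 than to 0.
print(predict_proba(weights, bias, [2.1, 0.1]))
```

The output of such a model is a probability rather than a yes/no answer, which matches how the test is described above: it predicts the probability of the HRD phenotype.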

One common challenge when developing a biology-based model is determining the “truth”.

The data scientists who developed this model, including Benji Leibowitz and Bonnie Dougherty, followed many best practices from the industry. One

Shipping the model

Once we had an HRD model built, it was time to develop it into a product. One of the best parts of the job is working with these incredible data scientists and pathologists to make our product go live.

  • Collect and verify upstream inputs
  • Run the algorithmic analysis
  • Publish results as a report
  • Pathologist signoff
  • Doctor views results & takes action, if relevant
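One way to picture these steps is as an ordered workflow where a case cannot skip ahead. The sketch below is purely illustrative: the stage names, the `CaseWorkflow` class, and the rule that a report becomes visible to the doctor only after signoff are my assumptions for the example, not a description of the production system.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stages, in the order listed above."""
    RECEIVED = 0
    INPUTS_VERIFIED = 1
    ANALYSIS_COMPLETE = 2
    REPORT_PUBLISHED = 3
    SIGNED_OFF = 4


class CaseWorkflow:
    """Tracks one case and only allows moving to the next stage in order."""

    def __init__(self, case_id):
        self.case_id = case_id
        self.stage = Stage.RECEIVED

    def advance(self, target):
        # Refuse to skip stages, e.g. publishing before the analysis has run.
        if target != self.stage + 1:
            raise ValueError(
                f"cannot move from {self.stage.name} to {target.name}"
            )
        self.stage = target

    def visible_to_doctor(self):
        # Assumption: results are released only after pathologist signoff.
        return self.stage == Stage.SIGNED_OFF


case = CaseWorkflow("example-case")
for next_stage in (
    Stage.INPUTS_VERIFIED,
    Stage.ANALYSIS_COMPLETE,
    Stage.REPORT_PUBLISHED,
    Stage.SIGNED_OFF,
):
    case.advance(next_stage)
print(case.visible_to_doctor())
```

Encoding the order explicitly means a mistake such as releasing an unsigned report fails loudly instead of silently.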

First, we needed to match the data that was used for training against the data that is generated in real time from the lab, and insert the algorithm into the pipeline. To do that, we helped the science team build a Python package, shipped it using Poetry, and added unit tests to make it robust to changes.
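As a rough illustration of what matching live data against training data can look like, the sketch below checks that an incoming sample carries every feature the model was trained on and returns the values in training order. The feature names and the `validate_sample` helper are hypothetical, chosen only for this example.

```python
import math

# Hypothetical feature names, in the order the model was trained on.
TRAINING_FEATURES = ("gene_a", "gene_b", "gene_c")


def validate_sample(sample):
    """Return feature values in training order, or raise ValueError.

    Extra keys in the incoming sample are ignored; missing, non-numeric,
    NaN, or negative expression values are rejected.
    """
    missing = [name for name in TRAINING_FEATURES if name not in sample]
    if missing:
        raise ValueError(f"missing features: {missing}")

    ordered = []
    for name in TRAINING_FEATURES:
        value = sample[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or math.isnan(value) or value < 0:
            raise ValueError(f"invalid value for {name}: {value!r}")
        ordered.append(float(value))
    return ordered


def test_reorders_and_ignores_extra_keys():
    sample = {"gene_b": 2, "gene_a": 1.5, "gene_c": 0, "extra": 9}
    assert validate_sample(sample) == [1.5, 2.0, 0.0]


def test_rejects_missing_feature():
    try:
        validate_sample({"gene_a": 1.0, "gene_b": 2.0})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a missing feature")


test_reorders_and_ignores_extra_keys()
test_rejects_missing_feature()
```

Small tests like these are the kind of guard that catches an upstream change, such as a renamed or dropped column, before it reaches the model.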

Next, we built the report. Each patient’s results generate a report that gets delivered to customers. Our team built a generalized system for creating PDF reports and allowing clinical signout. We then ran end-to-end tests to ensure correctness.

Figure: HRD report