🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik
So we basically opened the lab, we hired a team, we got all the instruments, we started sourcing tumor samples, and there was no prior here that any of this would work. Like zero. We just started generating data, sourcing human tumors, processing them. We built this whole processing pipeline to get the tumors into these arrays and formats. So you've got these two-week runs where you're processing two slides, and we were just churning out data for months, and we couldn't even train a model yet. So we built all this, and then, let's say 18 months later: hey, I wonder, can we train a model off of it? And it wasn't obvious.
Yeah, there wasn't really anything major to go off of. I mean, there were Transformers developed for single-cell data. There just weren't really datasets out there that people had been able to develop on. We do a lot of custom model building.
Hi there, I'm RJ Honnakey and this is Brandon Anderson. We're the co-hosts of the Latent Space Science Podcast, and today we're really happy to be in the studio with some of the people from Noetik.
I'm Ron Alfa, co-founder and CEO of Noetik, physician scientist by training. My hobbies are making hot takes about AI curing cancer.
Hi, I'm Dan Bear. I'm VP of AI at Noetik. I'm a biologist by training, did a PhD in neuroscience and then moved into comp neuro, computer vision, self-supervised learning, and have been doing AI research at Noetik for the past few years.
Maybe we should start with what is Noetik, why did you found it, and what is the difference between Noetik and the other virtual cell AI companies?
Maybe I'll start with a little bit of a contrarian thesis, which is really the reason for founding Noetik. We all know the numbers: 90%, 95% of cancer drugs fail in the clinic. Why do they fail? Our thesis is they fail not because we're bad at pharmacology, not because we're bad at target selection or at making the drug. We're actually better at that process than we have ever been in the history of drug development. Most of those drugs fail, we'd argue, because we're bad at selecting which patients those drugs are going to work in. And there is no placebo effect in cancer. Oftentimes you see trials where some patients respond to these drugs, and if you have a patient that responds, that tells you something: there's some biology that's active there. But you have a problem in patient selection. So really the thesis behind Noetik is: can we build models that fundamentally understand patient biology from the very beginning and help you position molecules in the right patient population?
So you're actually using the models partly at least to select the patient cohort, not just— so you can imagine it working either way. You could design, oh, I think that this molecule will do well because I know something about the patient population. But you could also say, I think that this patient population is the match for this molecule.
And that's where the power of the models is. Like, once you've trained these models on patient data, you can use them on both sides of the equation. So you can use them for discovering new targets directly from the patient data, which people often refer to as reverse translation. So starting from humans and then trying to understand which targets to go after, and then you can use that to develop molecules, but you can also use them directly on patient data. If you have, you know, let's say a Phase 2 or Phase 3 trial, you can use these models to understand which patients or what underlying biology of the patients in the trial is a predictor of response. And we've been doing a ton of that recently.
Are you doing a lot of like rescuing trials that had a bad effect?
We are doing a lot of looking at data from phase 2 and phase 3 trials, and then using the models essentially to run inference on patient biopsies and understand whether there's underlying biology that would help us design the next trial. We haven't shared any of that yet, but you'll see this soon.
So cancer is kind of infamous in that there are many, many different types of cancers. Whenever someone says "cure cancer," that is almost a meaningless, vacuous statement. So your point is, even within cancer, if you pick a specific type of cancer, and then a subtype of that, and a subtype of that, there are a bunch of different patient populations, and each one of them will respond differently to drugs. And your point is you can figure out right now that some subpopulation will do well and respond to this drug when, generally speaking, you'd think the rest of the population would not, even though we have historically classified patients by what type of cancer, what indication, and so on.
Yeah, that's exactly right. And I would maybe even go further and say like, nobody actually knows what the subtypes are. There are cancers that originate in a certain tissue, like the lung, that, you know, have been classified into subtypes based on pathologists looking at them for, you know, more than a century. And, you know, those subtypes certainly have some connection to the real, like, carving nature at its joints. Like, what are the actual functional subtypes of disease there? But our thesis is kind of that if you look at the data, a much richer kind of data, so the multimodal data that we're generating in our lab, we're going to see that actually, you know, what people thought was one subtype of lung cancer is really 3 distinct subtypes of cancer. And that is going to be critical for figuring out which patients should get which drugs.
Yeah, maybe I'll just go back to one of your first questions. I was saying that many drugs fail in patients because we don't understand which patients they will work in, in oncology. Why do we end up in that situation? Whenever you make a new drug, you do a set of experiments in cell culture, cells in a dish. Those cells are often cell lines. These cell lines have existed for 40, 50 years, and they're immortalized. So they have genomes that allow them to persist, with abnormal numbers of chromosomes. They have gene expression patterns that don't represent any known cell in the human body, really. These are sort of Frankensteinian cells. They're cancer-derived, mostly. So you can do your experiments in these cell lines in a dish, or you can move them into animal models. In oncology, you often have a panel of different animal models with different cancer types that you'll test these in. And in doing these experiments, we sort of convince ourselves that some of these cell lines are, let's say, lung cancer cell lines or colon cancer cell lines, and even that some of them in the mouse context are colon cancer or lung cancer cell lines. Then in the mouse, we implant them under the skin and in weird places, and we treat the mice with drugs and see how they respond. But ultimately there's a big gap, because they don't translate to patient biology most of the time. Even if a cell line is derived from a colon cancer, in many cases it doesn't even have the mutations that human colon cancers have. And pharma has done this for 20, 30 years: you develop a drug, you test it against hundreds of these. It's not a hard experiment. You can send this out to any CRO.
They'll test your drug against hundreds of different cancer cell lines. And then you can sit back and say, okay, which of the 50 colon lines responded to my drug, and which of the 50 ovarian cancer lines? And you could try to map that to human biology, but the problem is these cell lines, as an abstraction, do not relate in any way to human patients. So what happens is, no matter what you do preclinically, the molecule gets into the clinic and the clinical team says: look, we don't really know how to design this trial, because none of the data that you've produced gives us any insight on which patients to enroll. So we're going to basically enroll an open-label study, we're going to enroll all tumors, all patients that are enrollable in this trial, and we're going to see where we get signal. Imagine doing that in an early-phase trial where, let's say, you have 50 patients, and you're trying to test different doses, and you don't really know the dose of the drug, and you don't know what the safety margins are, and you're also trying to figure out where your signal is. And then what if I told you that, just in lung cancer, hypothetically, there are 10 different subtypes, and you don't even know if the right indication is lung? It could be any. So this is what happens. Oftentimes you get to the end of these early-stage trials and you don't see very many responders, as you would expect statistically, and then these molecules get canceled.
So you're imagining that with your Noetik system, you help the pharmaceutical company characterize this: we expect that people with a certain genetic profile, or even transcriptomic profile, will respond to this drug. And then you go and actually sequence from the patient and you say, yes, this is a match, or no. Is that the sort of grand vision?
Yeah, I mean, I would say we are even less biased than that. We are saying, okay, let's say from lung cancers, we want the model to learn how many different therapeutically relevant subtypes of lung cancer there are, just from self-supervised learning on the data. And those subtypes could be driven by large genetic changes. They could be driven by immune changes. They could really be driven by any biology that the model is learning in the process of training. And we do see different types.
I mean, feel free to contradict this, as the actual doctor here, but the biomarkers that people have been using are biased towards simplicity. Does the patient have this particular mutation? Sometimes you stain for a single protein, or do transcriptomics to look for a particular gene signature. But there's no reason to think that biology, or the biology of cancer, is so simple that you're going to capture most of the meaningful variation with such simple biomarkers. And most of them have weak correlations with clinical success. The hypothesis here really is, again, that if you were to carve nature at its joints and figure out what's really going on, say there are these 5 subtypes, then the correlation between which patients you give a particular drug and whether you have success is much, much stronger than if you force yourself to go with these very simple biomarkers.
You mentioned the lab. You do a lot of data generation in the lab. So why do you think that is the appropriate approach, versus using existing public repositories or whatever?
Yeah, we generate all our data in the lab. Everything from sourcing the tumor samples themselves to processing them and generating the data. Maybe another hot take I have in AI and bio is that you're not at the order of magnitude of data that you are in other spaces where people are training models. And so it becomes really hard to brute force these problems just by collecting data. We have a couple of pretty good examples of where someone has designed a dataset. PDB was designed and has been built over the past 50 years or so. It's not an accident that that dataset exists. Someone decided that we are going to design this dataset, we're going to collect this data over decades and decades, with the intuition that potentially this would help solve protein folding down the road, and it did. So it's not that PDB is a bunch of random data that people have organized from the web. I think that in bio, you really need to be intentional about the data that you generate and how you generate it, and have some foresight from the very beginning around what models we're going to want to train and what those models are going to need to learn from. So that's why we've taken this approach.
Yeah. And I mean, a good comparison is to the ImageNet dataset, which kicked off the deep learning revolution in computer vision with convolutional neural networks, actually demonstrating that neural networks can do better than other methods on object categorization. ImageNet, or at least the part of it that people were developing models on, is 1.2 million images, very carefully curated. These are high-quality images, not random images from the internet or multiple datasets cobbled together.
And labeled.
Yeah. And labeled. And I think with the data that we're generating, we're around that scale right now. But you know, of course people have gone much, much larger in image datasets and language datasets, text datasets, obviously for LLMs. So we think that we need to get the data up to that scale before we can really see the meaningful progress on the algorithm side.
The scale of language data.
Yeah. Language is really the only modality where people are seeing these very impressive scaling results. Um, and you know, part of that has to be just the scale of data that's there and that the models are trained on. That can't be the only thing because, you know, there's a lot of like video data as well. People are training on like thousands of hours of video data and, you know, haven't seen kind of the scaling results that you have in language modeling. But having the right scale of data is necessary, if not sufficient, to like really make progress here.
Can I offer a contrarian take to that? Sure. So there's this whole concept of the jagged frontier of LLMs and generative AI, how in certain regions they can be really good at solving some problems and then remarkably stupid at solving nearby problems. And maybe the argument is that for a lot of these frontier models, everything is just becoming in-distribution. Like, if everything starts out OOD, and you just get more data, it now becomes in-distribution. Is it possible that for biological systems, because there are underlying physical processes here, you can make things in-distribution earlier, in that you can actually cover the space? I have some follow-ups on PDB, but I'm just curious at this point.
Yeah, I mean, I think it's a good question: how much data and what kind of diversity do you need in biology to solve, say, the drug translation problem, figuring out which drugs are going to work in which patients? My intuition from working in biology for a while is that we're still pretty far from that. Right now we're building datasets that are focused on cancer, and we have generated data from thousands of patients in a few major cancer subtypes. But there's every other disease, there's healthy tissue, there are even other species. There's a lot of biology to learn. Especially if you think about it as having to learn the spatial and functional patterns of tens of thousands of genes, tens of thousands of proteins, how their spatial arrangement contributes to the function of organs, and so forth. My hunch is that biology is pretty complex and that we still need to generate a lot more data. But yeah, I don't know.
But as a cancer company, do you think you could actually do this, hypothetically, for cancer? I mean, for at least some subclasses of cancer?
Definitely.
Yeah.
I think that we've done experiments that suggest that, you know, if we can generate data from several hundred patients in all of the major cancer indications and some of the less major indications, that that will result in a model that can generalize pretty well to kind of any type of cancer we would throw at it.
Backing up, what is the data you're collecting? My understanding is you use some pretty specialized instruments and gather very specific datasets. So how did you come to that decision about how much data, how much to spend on it, and what types of data?
I'll give a hat tip to my previous employer, Recursion. I spent 6 years at Recursion from the very beginning, and a lot of what we were doing in the early days was figuring out the things we didn't understand about the datasets and what the problems would be in the dataset: batch effects, controls, how to orient samples on plates, things like that. Fast forward to the founding of Noetik: we started the company already with some principles around how we should think about building the dataset. What are some things that we know matter? For example, over many years we learned that images are actually a really powerful modality for machine learning, for many reasons. One, they scale: we can put patient samples on slides, and on a single slide we can capture many patients' worth of biology. Beyond that, the images themselves are very rich sources of biological information. So now we have a very information-dense modality and we can decrease the cost of data generation, which means we can increase the amount of data generated over the whole dataset. That's always been a really big benefit of image-based modalities over, let's say, sequencing, where every time you do a sequencing run, you're basically running one patient's sample. So that was one way to think about it. The other was: how do we design these datasets so we can control for things that we know are going to be important, such as batch effects? For example, say we do a spatial transcriptomics run on a slide. You stain the slide, do a bunch of wet lab processing, put it into a machine, and get data out. If you do that on two different days, there are going to be different variables that impact the data. That's going to be a large source of variation in datasets. So you want to be able to control for things like batch effects.
So really you want more patients represented on multiple different slides, so you can process them in different batches. You want to be able to control for things like this so you can go downstream, look at the data, and say: okay, once we have, let's say, patient-level embeddings, do they represent patient response to immunotherapy, or do they represent staining batches?
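One way to picture the downstream check described here (do patient-level embeddings track response, or staining batch?) is a nearest-neighbor agreement score. The sketch below is only an illustration with made-up toy data; the function name `neighbor_agreement`, the 2-D embeddings, and the labels are all assumptions and not Noetik's actual method.

```python
import math

def neighbor_agreement(embeddings, labels):
    """Fraction of samples whose nearest neighbor (Euclidean) shares their label."""
    hits = 0
    for i, emb in enumerate(embeddings):
        # Index of the closest other sample.
        nearest = min(
            (k for k in range(len(embeddings)) if k != i),
            key=lambda k: math.dist(emb, embeddings[k]),
        )
        hits += labels[i] == labels[nearest]
    return hits / len(embeddings)

# Toy, hypothetical embeddings: two tight groups in 2-D.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
response_labels = ["responder", "responder", "non-responder", "non-responder"]
batch_labels = ["day1", "day2", "day1", "day2"]

response_score = neighbor_agreement(toy_embeddings, response_labels)
batch_score = neighbor_agreement(toy_embeddings, batch_labels)
print(f"agreement with response: {response_score:.2f}")
print(f"agreement with batch:    {batch_score:.2f}")
```

In this toy case the embeddings group by response and not by batch, which is the outcome the multi-slide design is meant to make checkable. A high batch score would suggest the embeddings are picking up staining artifacts.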
So you're actually taking one patient and spreading them across multiple slides so that you can get sort of a calibration across the slides.
Yes. Our data looks very different from what anyone else is generating in the histology or digital pathology space. We receive a tumor sample, and we sample it dozens of times to build these arrays. Each array has hundreds of different patient samples, randomized, and every patient is represented on multiple different arrays. So we're getting a lot of different representations of each patient that we're sending through the data processing pipeline. And that lets you, downstream, answer some of these questions and control for some of these variables.
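The array design described above (every patient on multiple arrays, positions randomized) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function `build_arrays`, the choice of 3 cores per patient, and 6 arrays are hypothetical numbers picked for the example, not figures from the conversation.

```python
import random
from collections import defaultdict

def build_arrays(patient_ids, cores_per_patient=3, n_arrays=6, seed=0):
    """Place each patient's cores on distinct arrays, then shuffle positions."""
    rng = random.Random(seed)
    arrays = defaultdict(list)
    for pid in patient_ids:
        # Sampling without replacement puts the patient on
        # `cores_per_patient` different arrays (different processing batches).
        for array_id in rng.sample(range(n_arrays), cores_per_patient):
            arrays[array_id].append(pid)
    # Randomize position within each array so location on the slide
    # is not confounded with the order patients were added.
    for cores in arrays.values():
        rng.shuffle(cores)
    return dict(arrays)

patients = [f"P{i:03d}" for i in range(20)]
layout = build_arrays(patients)
for array_id in sorted(layout):
    print(array_id, layout[array_id])
```

Because each patient appears in several independently processed arrays, variation that follows the array can be separated from variation that follows the patient.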
You mentioned some terms I just want to define for people. Spatial transcriptomics.
Yeah.
What is that? Yeah.
So, I mean, this was your first question: what are the data types? And this is not my background, in terms of spatial. Again, everything we did previously was cell biology in a dish. So say you just sat back and said, okay, I want to train a foundation model that understands human biology. What does that mean? How would you go after that problem? That was really the starting point for the company: from first principles, how would we do this? So you probably want tissue-level biology. You want to understand tissue. Cells are organized into tissues. You probably want some modality that is relevant in clinical use, so you can relate clinical data to what your models are learning. That's why we generate pathology H&E. Every patient gets a tumor removed, and then it gets this H&E stain, and that's what the pathologists...
Can you explain what H&E is?
It's basically two different dyes, hematoxylin and eosin, and it really just creates a contrast over the tissue. So you've probably seen these purplish pathology specimens. Pathologists can look at those and identify different cellular structures. And they've used those to classify tumors into the classical categories, adenocarcinomas, small cell carcinomas, things like that, based basically on cellular structures.
Okay. So there are specific patterns that show up when you add these two stains, and it is well established that you classify tumors based on...
Based on, yeah, pathology classifications. Basically every tumor that gets processed in the hospital will get this H&E stain, and it's how the pathologist typically classifies a tumor at the first level. So, okay, you want that. You probably also want to understand cell types. It's really hard to understand cell types from just that stain, because it doesn't reveal that much that a human, at least, can use to classify cell types. So you can say, well, I want to know whether there are immune cells, and different subtypes of immune cells. We want to have some layer of cell biology.
Okay.
And you want to know about immune cells because you have these cancer cells, and oftentimes the immune response dictates whether or not you have an effective treatment?
Yes, the immune environment of the tumor, we know, is a core determinant of whether a patient's going to respond or not. So, okay, you want to give the model all of this. The model's going to get this tissue-level information, but there's not enough cell-level information in there for the model to learn enough cell biology about different subtypes. So we also want to present it with some cell-level information. We use protein stains, standard immunofluorescence. You basically use antibodies against a small set of cell markers to label different T cells, B cells, the standard subtypes of cells in the tumor microenvironment.
So in this stain, just for those who aren't familiar: the antibody has a fluorescing protein attached. When you hit it with a certain frequency of light, it fluoresces. So you can tell where the antibody bound to a certain protein, because it now has a fluorescing protein attached to it.
Yep. And in terms of the data: from the tissue layer, you have an RGB image. From the next layer, you have a multi-channel image, with each channel representing, let's say, one color. So, for example, certain immune cells are each in a different channel. So you have this multi-channel image. Okay, so that's great. We've got tissue, we've got cells, but if we actually want to make drugs, we need some type of molecular information. We need to tie all of this down to what's happening in the genome. What is the cell doing? What are the mechanistic principles of the biology? So then we get spatial transcriptomics. That's spatially resolved RNA: DNA is transcribed into RNA, which is translated into proteins. So we basically get the RNA in a spatially resolved pattern for the same cells that we're seeing in all of these other layers. So now you have between 1,000 and 19,000 different genes. And again, these are all image layers, spots showing where those RNAs are and in which cells.
And this one works a little bit like what we talked about with protein, where you have a segment of RNA and then you have a fluorescing protein, and usually there's some sort of combinatorial thing. So if you see these 4 colors at this amplitude, then that means this gene, because they're right next to each other or something like that.
So for the detection method, you're basically binding a probe to each one of those RNAs, and then you're cycling it. And it takes weeks to run one of those assays. So you're cycling the machine, it'll cycle across each species, and it'll amplify, and you'll get a signal for each RNA species. At this point, you now have this very rich data layer where you have the tissue, you have the cells, and you have the molecular information. And you can use all of that to train the model. You can think of it as essentially the central dogma, if you will. And we also have DNA: we genotype, just so we understand the genomic alterations in these tumors.
All right. So you get this stack of images basically that you can train models on with understanding the expression of genes and the proteins that are being expressed at the time that the sample is taken, all in the image information. And then you can train your models with that.
Yeah. I mean, the spatial transcriptomics is like particularly dense because if you think, let's say there are 20,000 genes in the genome. Now, you know, we're running assays that are detecting nearly all of them in a single sample. So you can think of one of those data points as an image, except instead of being an RGB image that has 3 color channels, now all of a sudden it has like 20,000 color channels. So it's like a very meaty computer vision problem to try to look at those data and figure out what makes patient A different from patient B, and then go from that to which drug is going to work in which one.
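To make the "image with 20,000 color channels" framing concrete, here is a rough back-of-the-envelope sketch of the layer stack described in this conversation. The H&E channel count (3) and the upper transcriptomics figure (19,000 genes) come from the discussion; the immunofluorescence marker count (16), the 256x256 patch size, and 4 bytes per value are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    channels: int

LAYERS = [
    Layer("H&E (RGB)", 3),
    Layer("immunofluorescence", 16),          # assumed marker count
    Layer("spatial transcriptomics", 19_000),  # upper figure quoted in the text
]

def total_channels(layers):
    """Channels per pixel if all layers are stacked on the same grid."""
    return sum(layer.channels for layer in layers)

def dense_patch_bytes(layers, height, width, bytes_per_value=4):
    """Memory for one dense patch; real transcript data is sparse."""
    return height * width * total_channels(layers) * bytes_per_value

channels = total_channels(LAYERS)
patch_gb = dense_patch_bytes(LAYERS, 256, 256) / 1e9
print(f"{channels} channels per pixel")
print(f"one dense 256x256 patch: {patch_gb:.2f} GB")
```

Even a small patch is large when stored densely, which is one reason this is a different problem from ordinary three-channel computer vision.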
And so you, you have a hot take about virtual cell. Like, I want to understand how— okay, so you, you know, you have this big pile of data that every single sample has a massive dataset with it, and then you have many, many samples. So how do you turn that into useful knowledge?
Maybe just, what is a virtual cell? Everyone's always asking that question. I think there are really two ways to think about it. One is that we want to be able to simulate all the biochemical processes in a cell. So we want to have this comprehensive foundation model where we understand that, if some signal from outside the cell interacts with the cell, then here are the millions of intracellular chemical reactions that are going to happen, and you could predict them from the model. So that's one view. I think that's interesting. It's an interesting intellectual pursuit. I don't think we have all the modalities of data that you would need to solve that problem. I tend to see the virtual cell problem as something more practical. We're trying to make drugs that work in patients. So from a virtual cell perspective, really what we want to do is understand cell biology through some heuristic that's useful for making drugs. And the heuristic could be a way to understand drug targets, or a way to map your cell-level biology up to patient-level biology. So the way we've designed these first virtual cell models is really just to simulate the biology of a cell in some context, with the input being, let's say, the cell in some context, and the output being the transcriptome in that context, or the protein in that context. These types of input-output relationships allow us to essentially design experiments. So really the very simple thing that we're doing is that the model can simulate the biology of a cell, or many cells, in different contexts, and allow you to run some simulations in that regime.
Yeah, I mean, I think most of the things that people are calling virtual cell models right now are focused on single-cell gene expression, so transcriptomics data, RNA data. And they're largely geared toward the problem of predicting what's going to happen to the transcriptome, the set of genes expressed, when you hit cells with either a small molecule, a drug, or a genetic perturbation. And typically this is cells grown in vitro, either cell culture or primary cells, something like that.
I think of that genetic perturbation as being where I knock out a gene or add a gene and see how that impacts the expression of the various RNAs.
Exactly. And my view, and I think Ron shares it too, is that that may be of interest in some cases, but the problem we're really trying to solve is predicting what's going to happen in a patient. And modeling data that comes from a patient is, in my mind, much more likely to translate to what happens when you give a patient a drug than something that's happening in cell culture.
Is there other clinical data that you're pulling into the model besides the actual— so you're calling it context of the cell, just the surrounding cells, but it— is there other— this drug caused a bad reaction kind of stuff?
Yeah, I mean, we're pulling in data from the entire patient, so not just the very local neighborhood of the cell. So far, we haven't done much integration of electronic health records or other information that one could get about the patient. And that's pretty intentional. We really want these models to learn basic biology, again, like the central dogma, and not just the central dogma but the basic biology of genes, proteins, cells, and tissue, in a self-supervised way. So purely from the data that we're generating, and not biased by what the doctor wrote about that patient. Because our thesis is that most of the therapeutically predictive and important information is not contained in that very small number of patients who have been treated with a given drug, and in whatever the doctors thought was important to write down given the state of knowledge at that time. So it's much more about trying to discover what's really there in patient biology than going based on the text that people have written about it.
So you have this self-supervised model, you feed it a lot of data, and you essentially have some clusters of patients now. How do you translate those clusters of patients into making decisions? Like, you go to a pharma company and you say, we can repurpose, or we can suggest this subtype should be the focus of your phase 2 trials. What is the process for that? What data do they need to provide you, and how do you translate your models?
So it depends on what the problem is. Maybe I'll just back up. One of the more interesting aspects of these models is that they are useful for a broad array of use cases, as we were talking about at the very beginning. So you as the pharma company could say: okay, I have this molecule, the target of the molecule is X, and I want to design my clinical trial. The molecule has seen zero patients so far. All I know is the target and some biology around the target. So we can run simulations using the models and our cohorts of patients. Let's say in lung cancer, we can run simulations around the target and ask: in which sets of patients would this target be important, across a cohort of lung cancers and colon cancers and, you know, across all of oncology? And sometimes you might see that you probably don't want to put your molecule in lung cancer, because the target is not really important there. Maybe you want to put it in ovarian cancer. Yeah.
What are you simulating here? So like, are you, you say that this drug is expected to knock down this gene and therefore it will result that you want to look for clusters where knocking down this gene inhibits tumor growth rather than enhancing tumor growth?
I mean, that's certainly one way we could do it. There are other types of simulation where you might just want to ask, like, if there were an immune cell here, like a T cell, which is responsible for actually killing tumor cells, what would happen to it? Or what genes would it express? Or what proteins would it express in this particular patient's tumor microenvironment? And, you know, that's what we've called, like, these virtual cell simulations. Like, we have a model called Octo Virtual Cell that does this. And that can give quite powerful answers to the question of, are these drugs gonna work in these patients? Because you might find, actually, as Ron was saying, that the thing this drug targets is just not important in this particular patient's tumor, in that it's not gonna have any effect on the T cells or the macrophages or some other cell type there. Then, you know, there's the type of simulation you alluded to where you can ask the model, what would happen to this patient's tumor if you were to knock down this particular target gene or its protein product? And you might be looking for cases where the model predicts that removing that gene or that protein is going to have a large effect, like either increase the immune system's function, its ability to fight that tumor, or, you know, decrease the tumor's ability to grow, or some other readout that you think is correlated with clinical success. I just want to call out that maybe the simplest use case is the one where there's a company that has a drug and they've given it to some patients and we know some of those patients responded. And then it just becomes a question of, does the space of patients that the model has learned via self-supervision tell us that all of the responsive patients are in one of these clusters and not the other 9 clusters or something? So if we know that, then there's a pretty straightforward hypothesis that this is the right cluster.
So that's the scenario where you would sequence something. What would you collect about those? So you have a cohort that responded and one that didn't.
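As a rough illustration of that simplest use case, here is a minimal sketch of how one might check whether responders are concentrated in one cluster more than chance would predict. It uses a one-sided hypergeometric test; the cohort, the cluster labels, and the function names are all hypothetical, and nothing here is stated in the conversation as the method Noetik actually uses.

```python
from collections import Counter
from math import comb


def hypergeom_tail(n_total, n_responders, cluster_size, hits):
    """P(at least `hits` responders land in a cluster of `cluster_size`)
    if responders were scattered across clusters purely at random."""
    max_hits = min(cluster_size, n_responders)
    favorable = sum(
        comb(n_responders, k) * comb(n_total - n_responders, cluster_size - k)
        for k in range(hits, max_hits + 1)
    )
    return favorable / comb(n_total, cluster_size)


def enrichment_by_cluster(cluster_labels, responded):
    """Return {cluster: (responders_in_cluster, cluster_size, p_value)}."""
    n_total = len(cluster_labels)
    n_responders = sum(responded)
    sizes = Counter(cluster_labels)
    hits = Counter(c for c, r in zip(cluster_labels, responded) if r)
    return {
        c: (hits[c], size, hypergeom_tail(n_total, n_responders, size, hits[c]))
        for c, size in sizes.items()
    }


# Hypothetical toy cohort: 20 patients, 4 responders, all in cluster "A".
cluster_labels = ["A"] * 5 + ["B"] * 15
responded = [True] * 4 + [False] * 16
results = enrichment_by_cluster(cluster_labels, responded)

for cluster, (k, size, p) in sorted(results.items()):
    print(f"cluster {cluster}: {k}/{size} responders, p = {p:.4g}")
```

A small p-value for one cluster is only a hypothesis generator; with many clusters one would also want to correct for multiple comparisons.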
Yeah. So this is getting back to something Ron mentioned earlier, which is this type of data called H&E. It's a stain, the standard pathology stain that makes these, you know, pinkish and purplish looking images. Right now, what we do is we've built models that are trained on kind of all of the multimodal data we generate, but then once they're trained at inference time, all they need is an image of H&E, and that could be something that we generate in our lab, or it could just be, you know, a digital image that they have from a trial that was run years ago. And the reason that that is so powerful and flexible is again, because H&E is kind of like the lingua franca of, of pathology and especially oncology. So almost every patient who's been given a, you know, clinical stage drug is gonna have that.
You can look at the two cohorts, the responders and the non-responders, and say these H&Es live in this part of the latent space and these H&Es do not.
Yeah, exactly. And I think, you know, one way we've gone further than that even is that, given the H&E, the models can say, I predict that these genes are expressed at this location in this patient. So not only do we have these clusters, these embeddings that say, you know, all of the responders to this drug are over here, all of the non-responders are over there, but we can actually see, okay, for the responders, these are the genes that are expressed much more highly, or predicted to be expressed much more highly, in the responder cluster versus the non-responder cluster. And so that adds a major, like, level of interpretability there because, you know, we can see things like, okay, good, the responders are actually expressing the protein target of this drug. So we would be worried if that weren't the case, but, you know, we can see it is. On the other hand, we also see that, you know, the biology is very, very complicated. So that kind of explains why these simple biomarkers, like looking at a single gene or a single protein, just really don't capture, you know, what is predictive of therapeutic response.
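To make the responder versus non-responder comparison concrete, here is a minimal sketch that ranks genes by the log2 fold change of their mean predicted expression between the two groups. The patient IDs, gene names, values, and the pseudocount are illustrative assumptions, not details from the conversation.

```python
from math import log2
from statistics import mean


def rank_genes_by_log2fc(predicted, responders, non_responders, pseudocount=1.0):
    """Rank genes by log2 fold change of mean predicted expression.

    predicted: {patient_id: {gene: predicted expression}} (hypothetical layout).
    The pseudocount avoids dividing by zero for unexpressed genes.
    """
    genes = sorted(predicted[responders[0]])
    scores = {}
    for gene in genes:
        resp = mean(predicted[p][gene] for p in responders)
        non = mean(predicted[p][gene] for p in non_responders)
        scores[gene] = log2((resp + pseudocount) / (non + pseudocount))
    # Highest first: genes predicted to be higher in responders.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical predicted expression for four patients and three genes.
predicted = {
    "p1": {"GENE_A": 7, "GENE_B": 1, "GENE_C": 3},
    "p2": {"GENE_A": 7, "GENE_B": 1, "GENE_C": 3},
    "p3": {"GENE_A": 1, "GENE_B": 7, "GENE_C": 3},
    "p4": {"GENE_A": 1, "GENE_B": 7, "GENE_C": 3},
}
ranking = rank_genes_by_log2fc(predicted, ["p1", "p2"], ["p3", "p4"])
print(ranking)
```

As the speaker notes, a single top-ranked gene is rarely the whole story; a ranking like this is a way to inspect a cluster, not a replacement for the embedding.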
Yeah, so I have like a million directions I want to go here. H&E, that actually gives you a pathway to a diagnostic then as well.
Exactly.
Yeah.
Right. Yeah.
Yeah.
And so you can imagine, after the drug hopefully makes it to the market, a doctor says, oh, you have cancer. I'm very sorry. We're going to do an H&E stain of your tumor. And then we're going to put it in the model and it says, oh, you know, this one won't work for you, but this one will.
That's right. And actually today we're using the same approach to look at many different mechanisms from different collaborations that we have in place. You know, one of them we've announced is with a company called Agenus. These are all different mechanisms. The input is still H&E, and in some cases the same indication. So using H&E, we're asking whether drug A works in some sets of patients, whether drug B works in other sets of patients. And so you can take that, you know, to its natural progression and say, well, okay, if you can use that same input, just H&E, for, you know, experimental drugs, why not use it also for drugs that are on the market already? In a sense, the same assay can be very predictive across many different cancers and many different potential therapeutics.
There are lots of models out there that take H&Es and go to gene expression, open source, whatever. They do, you know, so-so. I've read on Twitter, in your Twitter feed and whatever, that you feel that you have a data moat, right? And so why is Noetik's model better? Sure.
I mean, I think, you know, the scale of data that we've trained these models on is like, you know, pretty different from a lot of what's out there. Like, the reality is there's just not that much of this kind of paired H&E plus other data modalities typically. You know, there are some datasets generated by academic labs, others where, you know, they might have maybe like 100 or a few hundred patients' worth of data with paired spatial transcriptomics. That might even be an overestimate. In comparison, we're generating these data that are, you know, multiple patients per slide, individual patients distributed across multiple slides. We've generated now, you know, more than 100 million cells of spatially resolved, um, spatial transcriptomics. That's all paired with H&E and protein as well. That's at least an order of magnitude larger than any of the other datasets that we've seen out there. And I think that makes like a pretty enormous difference. I mean, we've seen with our own models that if you drop down to 40% or 10% of that data used in training, the models get a lot worse, and they especially get worse at kind of generalizing to other types of cancer from the ones that they've been trained on. So I think that's a big piece of it. I also think that, you know, the algorithmic side of it is important. You know, we've developed custom architectures specifically for training on this multimodal data. And again, my background is in computer vision and specifically in self-supervised learning there. And so we've tried to develop, you know, self-supervised learning approaches for these data that are really adapted for solving this problem of, you know, figuring out what is different in one patient versus another and then simulating what would happen if you were to like knock down a particular gene or protein or something. So this is why we call these world models, where we're trying to build models that can simulate what's going to happen if you take a particular action.
I think that's another, another big differentiator for these models. And then again, the interpretability as well is probably a third one.
It's funny because you were just talking about how one of the other strategies people take for this is to, uh, do perturbations on cells and then watch the response. And, uh, you know, your experience plus like your strategy here is you can simulate this sort of counterfactual perturbation idea without even having to collect the data to do that. And you can see this.
Well, yeah, there's a big piece that we haven't talked about yet, which is that we actually are running perturbation experiments, except they're in vivo perturbations using a platform based in mouse. It's called PerturbMap. Ron, if you want to describe any of it, but basically it is a platform for generating highly multiplexed knockouts of individual genes. So the same kind of like CRISPR knockouts that people are doing for individual cells in vitro, except when we knock out a gene in a cancer cell, that cancer cell gets injected into a mouse. It's barcoded so we know which gene was knocked out. And it's being injected alongside like roughly 100 other cell types with different genes knocked out. So you end up with mice that have tumors that are barcoded, that have 100 different genetic perturbations in them. We can actually use that to validate our models and ask, is what the models are predicting in humans via simulation actually borne out when you do these perturbations in a mouse system? Sorry, there's a lot to go into there.
Barcode.
Yeah, so sorry, barcoding. This is a technology in which an individual gene is knocked out with CRISPR, but also this introduces a set of protein tags in that cell that get expressed. It's a combinatorial code, so gene X might have proteins A, B, and C. Gene Y, when it's knocked out, has proteins D, E, and F, and we can tag those proteins or label them with antibodies so that when we go and look in the mouse, we know exactly which gene was knocked out based on which of those protein tags were expressed.
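The combinatorial code described here can be sketched as a lookup from a set of detected protein tags to the gene that was knocked out. The tag letters and gene names below follow the speaker's own example; the choice of three tags per barcode and the capacity calculation are assumptions made for illustration, not stated details of the platform.

```python
from math import comb


def decode_knockout(codebook, detected_tags):
    """Map the set of detected protein tags to the knocked-out gene.

    Returns None when the tag combination is not a known barcode.
    """
    return codebook.get(frozenset(detected_tags))


def min_tags_needed(n_knockouts, tags_per_barcode):
    """Smallest tag pool whose k-tag combinations cover n_knockouts genes."""
    n_tags = tags_per_barcode
    while comb(n_tags, tags_per_barcode) < n_knockouts:
        n_tags += 1
    return n_tags


# Codebook mirroring the example in the conversation.
codebook = {
    frozenset({"A", "B", "C"}): "gene X",
    frozenset({"D", "E", "F"}): "gene Y",
}

print(decode_knockout(codebook, ["C", "A", "B"]))  # order of detection is irrelevant
print(decode_knockout(codebook, ["A", "B", "D"]))  # unknown combination
print(min_tags_needed(100, 3))  # tags needed for ~100 knockouts at 3 tags each
```

The point of a combinatorial scheme is that the number of distinguishable knockouts grows much faster than the number of tags that have to be stained for.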
So you knock out a gene, but you also added a gene that has the barcode proteins encoded on them.
Yeah, exactly. And I mean, the system's designed so that everything we're doing here is at the tissue level. It's all in vivo: you know, tumors that came from humans, in the form of the whole tumor tissue. And then in this mouse system, you have hundreds of tumors in the lungs of a mouse. And if you look at these images, it's a mouse lung with like literally hundreds of tumors in it. And each tumor has a distinct biology that's driven by the knockout of the gene that's being perturbed. And we can capture basically the biology of each tumor in a spatially resolved way. So what you can see is, okay, well, certain tumors in humans, let's say, don't have immune cells in them. Um, and so those tumors are very aggressive and they don't respond to immune therapies. You can generate those same tumors in this mouse system. And again, they don't have immune cells in them. And you can do it genetically. So you can start to map kind of the causative gene relationships between these different immune, or just broadly tumor, genotypes or biological profiles, if you will, to what you see in the human. And then you can treat those mice with drugs and you see how hundreds of tumors in a single mouse respond to treatment with one drug, or you can treat many different, you know, let's say 50 different knockouts across a panel of mice with 50 different drugs. And you can start to build this intersectional pharmacology and genetics experiment.
On Twitter and various places, I've heard you say Noetik is no cell lines, no mouse models. Maybe you even said that, you know, a few months ago.
And then we just said we have a mouse model. Yes.
And we, and injecting cells, like, to—
Into the lungs, into the lungs, not under the skin. So yes, so, you know, fundamentally, we think it's really important to build models that are trained on human data, and we are sourcing all these tumor— tumors to build, you know, human-centric models. So that is also— that is true. From the very beginning, we have asked this question of, you know, let's say we want to develop a drug from the very beginning. Um, and let's say the FDA, and I know things have changed a little bit with the FDA, but let's say the FDA wants you to have some data in an animal that says your new mechanism works in some animal system.
What do you do?
You're kind of stuck because you've now generated arguably the best data that you can in the human system. And then the FDA says, well, cool, but does it work in mouse? How does it work in the mouse? And so you have to back into a system that doesn't translate. And so from the very beginning of the company, this has been, you know, sort of a question. And so, you know, probably at the same time we started generating the human data, we started building this mouse platform with the aim of drawing connectivity between these two systems. And so we focused on a platform. We wanted a platform that, one, allows you to map a diversity of human tumors, because we know that if we just run a mouse model with one tumor, that tumor has no connectivity. So in the mouse system, we want to have a diversity of tumors, and we want to see a mapping of diverse tumor biology to the tumor biology that we're seeing in the human across many different mutations. So we licensed this system and, you know, have been building it so you can see many different perturbations that produce a lot of the tumor biologies, plural, that you see in the human. And then we also want to be able to get from this mouse system to biologically relevant, let's say, targets or genes in the human as well. So one of the fundamental problems in mouse systems is that we share many genes with mice, but there are a lot of genes and biological processes we don't share with mice, as is obvious. And so oftentimes you run into these when you're developing drugs. It's, okay, you have a target, you have, you know, some biology that works really well in mice, and maybe that doesn't even exist in humans, or like maybe that pathway is quite useless in humans. So one of the things we've started to develop, that we'll share more about soon, is a way to use one of these models to essentially infer human biology from the mouse directly. And so we're in silico humanizing the mouse.
So all the outputs in terms of the transcriptome from the mouse are in the form of the human genes. And so when we read out this mouse system, we are reading it out in terms of human genes.
How do you validate that? I mean, that's a pretty impressive claim if you can do it, but man, it seems like a tricky validation task.
In my experience, both here at Noetik and at my previous employer, I can say, Recursion, a lot of the approach when you're building these types of models is that you're trying to ask whether the models are recognizing biology that you know to be true. So for example, in the human context, we know that 12% of patients with lung cancer respond to immune checkpoint inhibitors. Do the models recognize those patients? Can they recover those patients without training?
Wayne Cold—
Yeah, yeah. And we see that. And then when you go look at those patients, we see that the underlying features of those patients map to what we know about those patients in, you know, the clinic. In the mouse system, we have control genes. So we ask, if you look at the mouse tumor embedding space, do the tumors that should be really cold look really cold from the human inference?
Cold in the sense that they don't have immune cells. Yeah.
Yeah. And then hot in the sense of like lots of immune cells. So we try to build systems where you have these handholds, and then, you know, the more of these examples that you know to be true that you see working, the more confidence you have. Obviously, when you're into the regime of something very new, it's still uncertain for some reasons.
So the bridge between the mouse and the human is sort of: you build a world model on the human, you build a world model on the mouse, and then you say, what are the parallel structures in the two latent spaces? Is that kind of the intuition here?
That's one thing that we're doing, but actually this is like even simpler, which is that we've trained models on human H&E, spatial transcriptomics, et cetera, and then are just inferencing them on mouse H&E, which is easy to generate. And apparently mouse H&E looks enough like human H&E that the models think it is perfectly valid H&E and make predictions about, is this like immune hot, like immune infiltrated, versus cold versus fibrotic versus some other tumor phenotype. And those predictions are accurate. So, you know, these are like some of the controls that Ron mentioned. So, you know, we know that in mice and humans and everything, if you knock down tumor cells' ability to present antigens to immune cells, you know, those tumors are very cold, like immune cells are nowhere near those tumors. And, you know, that's exactly what we see in the mouse. And that's exactly what the models, the in silico humanized models, predict. And, you know, then there are other examples where, again, we're recovering the biology that we expect to see there. And then there are findings that are novel, but also make total biological sense. For instance, we have done knockouts in the mouse of, let's say, half a dozen genes that are all in the same pathway. So you might predict that knocking down those genes is going to produce the same phenotype because they're all in the same pathway.
And that would—
What is a pathway?
Yeah, so a pathway is like protein A signals to protein B signals to protein C. And, you know, there's like a chain of events that leads to the cell having some behavior, you know, changes in its metabolism, its growth, et cetera. So these are, I don't know if you've ever seen these crazy looking protein signaling diagrams that, you know, make you want to stay away from biology, but, you know, people have worked these out a lot, and they know that these two proteins interact physically and signal to each other and so forth.
And so, you know, it's some chain of those interactions, where this protein binds to this protein and that causes it to upregulate a gene that causes this other protein to be formed, blah, blah, blah, until you get to some phenotype, meaning the cell changed the way it looks or the way it—
Exactly.
And so, you know, based on decades of biological literature doing experiments on these, there's a very strong biological prior that if you hit gene A, gene B, gene C, and they're all in the same pathway, you should get similar phenotypes. I mean, this is kind of how, like, old school genetics was done. And we see that with these in silico humanized mouse models, which is amazing to me as a biologist, that you have a model that's trained on human data, then you show it some mouse histology, and it's able to say, these 5 different tumor genotypes all look like they have the same phenotype. And lo and behold, there are, you know, 5 genes that are in the same pathway.
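One way to picture this check is to group knockouts whose predicted phenotype embeddings are similar and then see whether same-pathway genes land in the same group. The sketch below uses cosine similarity with a greedy threshold grouping; the embedding vectors, knockout names, and threshold are hypothetical, and the conversation does not describe the actual analysis at this level of detail.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_by_phenotype(embeddings, threshold=0.9):
    """Greedily group knockouts whose embedding is close to a group's first member."""
    groups = []
    for name, vector in embeddings.items():
        for group in groups:
            if cosine(vector, embeddings[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical 2-D phenotype embeddings for three knockouts.
embeddings = {
    "ko_gene_1": [1.0, 0.1],  # assumed to share a pathway with ko_gene_2
    "ko_gene_2": [0.9, 0.2],
    "ko_gene_3": [0.1, 1.0],  # assumed to be in an unrelated pathway
}
groups = group_by_phenotype(embeddings)
print(groups)
```

If genes known to share a pathway end up grouped together, that is one of the "handholds" the speakers describe: the model recovers biology that was already expected.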
So, uh, switching gears a little bit, because we want to talk about models on the Latent Space podcast. You guys recently had an interesting blog post about the, uh, Tario model. It's a, uh, transformer-based model. Do you want to talk about that?
Sure.
Yeah. So this is, um, a new model architecture that we developed after our first virtual cell model, Octo-VC. So Tario, this model is, you know, just a different transformer architecture. One major difference between it and, you know, our prior models, I guess if this is a model podcast, gets into the self-supervised learning objective. So, you know, for a while, including with Octo-VC, we were training models on what's called the masked autoencoding loss function or objective, where you have a piece of data, you chunk it up into small chunks, you mask out some of those chunks, and the training task is that the model has to predict the masked-out chunks from the revealed chunks.
Like BERT.
Yeah, exactly like BERT.
What are the chunks? Because this is multimodal, and like, I would imagine the different channels contain wildly different levels of information and that I remember seeing something like 99% masking in OctoVC, if I'm—
Yeah. Yeah.
So, and I was like, that was kind of surprising, because when you have, you know, 19,000 channels and maybe, uh, most of the signal is fairly sparse, then it seems like either there's a huge redundancy here in your data, or you really risk like just throwing the baby out with the bathwater. Yeah.
What are the chunks? That totally depends on which modalities we're talking about. So for spatial transcriptomics, one chunk or one token might be the level of expression for a particular gene at a particular spatial location. For protein images, multiplex protein images, again, it might be, you know, the image patch for that particular protein at a particular location, and so on. And, you know, for like histology images, again, those are usually just patches of the image. So pretty standard, like vision transformer style. On the masking, the maybe surprising result is that you can, and actually need to, mask out large amounts of the data to get the model to learn anything interesting. If you ran the hypothetical where you only mask out like 10% of the image, you know, maybe more like BERT, um, for instance, in language modeling, what do the models learn then? You know, they learn these kind of like boring behaviors, like how to like continue an edge a little bit, you know, between two like regions of an object or something. So they can learn that task very well, but they don't end up learning anything about sort of the holistic structure of the image data. And we found pretty early on at Noetik that the same thing was true with these multimodal, like, transformers, where if you mask out a lot of it, there are actually pretty strong correlations between where protein A is expressed and where protein B is expressed, and forcing the models to learn them is really what gives them this predictive power.
And so Tario, though, is an autoregressive model.
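The masking step described here can be sketched as randomly hiding a large fraction of tokens and asking a model to predict them from what remains. This is a generic illustration of masked-autoencoding data preparation, not Noetik's code; the token layout `(gene, x, y, expression)`, the function name, and the 90% ratio used in the example are assumptions.

```python
import random


def split_masked(tokens, mask_ratio, rng):
    """Randomly hide a fraction of tokens.

    Returns (visible, targets): lists of (index, token) pairs. A model would
    be trained to predict the target tokens from the visible ones.
    """
    n_masked = round(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked_idx]
    targets = [(i, t) for i, t in enumerate(tokens) if i in masked_idx]
    return visible, targets


# Hypothetical tokens: (gene, x, y, expression) on a 10 x 10 spatial grid.
tokens = [("GENE_A", x, y, float(x + y)) for x in range(10) for y in range(10)]

rng = random.Random(0)  # fixed seed so the split is reproducible
visible, targets = split_masked(tokens, mask_ratio=0.9, rng=rng)
print(len(visible), "visible tokens,", len(targets), "tokens to predict")
```

With a low mask ratio most neighbors of a hidden token stay visible, so local interpolation is enough; a high ratio removes that shortcut, which is the intuition the speaker gives for why heavy masking is needed.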
Yeah, exactly.
So yeah, that was going to be the tie-in. So, you know, prior models including Octo-VC were of this masked autoencoding style training objective. Tario is an autoregressive model, which if you think about it is kind of a particular choice of masked autoencoding, except instead of randomly masking out parts of the data, you're always asking the model to predict the next token in a sequence. We know that this is something that scales very well with LLMs, like training on the next token prediction task. And it's still an open question, how do you get models of other data modalities to scale the way that LLMs have scaled? Tario was not actually our first attempt, but one of our subsequent attempts to bring that autoregressive, like, next token prediction task into modeling spatial transcriptomics data. We found that when we used this architecture on this task, we started to see, you know, much better scaling behavior, where bigger models, and especially at longer context lengths, were really outperforming the smaller models at shorter context lengths, because they can see further in the image. Yeah, that's probably a big part of it. I think, you know, there's actually a pretty subtle but very interesting result in that blog post with Tario, which is that you only really see the benefits of using larger models when you're looking at longer context lengths. And here longer context really means, again, like you're seeing more tissue at once, more area at once. And I'm not like super deep into the language modeling literature, but I don't know if there's an analogous thing with language models where you only see these scaling behaviors at longer context. So it could be that what we're finding here is that with patient data, you really do need to incorporate sort of more of the patient's spatial context to really get the models to learn these more complicated nonlinear patterns in, you know, the spatial transcriptomics and take advantage of it.
Is it possible part of this is because you have some number of low expression genes, and that the behavior is driven entirely by better modeling of low expression genes?
Yeah, definitely possible that the more context you have, the more likely you are to catch kind of these low expression but highly predictive genes, um, et cetera. I would guess it's a combination of that and larger area. Like we've done some experiments just comparing models with the same amount of context, but in smaller or larger areas. And there definitely seems to be an advantage to looking at larger regions of tissue as well.
I want to hear about the big deal you did recently. You got a lot of press and, I think, have the distinction of being one of the only AI for bio tooling companies that is making money.
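The relationship between the two objectives can be sketched by building next-token training examples: at each position, everything after the current point is effectively "masked", and the model sees at most a fixed window of preceding tokens. How spatial transcriptomics tokens are actually ordered into a sequence is not described in the conversation, so the sequence, the function name, and the context lengths below are purely illustrative.

```python
def next_token_examples(sequence, context_length):
    """Build (context, target) pairs for next-token prediction.

    Each target is predicted from up to `context_length` preceding tokens;
    all later tokens are hidden, a fixed rather than random masking pattern.
    """
    examples = []
    for t in range(1, len(sequence)):
        start = max(0, t - context_length)
        examples.append((sequence[start:t], sequence[t]))
    return examples


# Hypothetical token sequence from some serialization of a tissue region.
sequence = ["t0", "t1", "t2", "t3"]

short_context = next_token_examples(sequence, context_length=2)
long_context = next_token_examples(sequence, context_length=8)

for context, target in short_context:
    print(context, "->", target)
```

In this framing, a longer context length means each prediction can draw on more of the surrounding tissue, which is the sense in which the speaker ties context length to spatial area.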
So, accidental.
No.
So could you tell whatever you can disclose about that?
We'd love to hear.
Yeah. So, um, we were really excited to announce a deal with GSK where we licensed them, um, OctoVC, which is our virtual cell foundation model. So we announced that back in January. It's a $50 million deal that includes an upfront payment and milestones, and then, separate from that, also includes an annual model licensing fee. You know, I think this was an attractive deal for both parties, for us and for GSK, because, you know, really the deal focuses on models that we've trained already on lung cancer and colon cancer. It allows us to, you know, provide them with access to the models. You know, GSK has one of the top AI teams in biopharma. So, you know, they know how to use these types of capabilities. They can use them for their internal use. They can also use them to fine-tune on their data. So that was a really big sell for GSK as well, because, you know, GSK and every pharma is sitting on mountains and mountains of so-called translational data. So these are the types of data that we're training the models on, that come from clinical trials, pathology specimens across many different therapeutics. You know, everyone's sitting on a lot of this data and it's been very hard to unlock. And so all of a sudden, you know, GSK can use our models both to do simulations and to do therapeutic discovery, but they can also fine-tune the models on their data. And in a way, the model then becomes, you know, sort of GSK's version of the model. This was super exciting. You know, it was the first, at least first announced, foundation model licensing deal in the space. And, you know, frankly, it was one we've been trying to do for a long time, even before Noetik. You know, I think a lot of companies have been trying to do these types of deals, and I think it's just been historically slow for adoption on the pharma side, and it's been slow to demonstrate like a very clear value proposition for different types of capabilities.
Uh, and so what's unique about this deal is that, you know, it doesn't look exactly like a software, you know, licensing, um, framework for, let's say, a small amount of money with a number of seats that you license. It looks like a real business development deal in the industry, where there's a very significant multimillion-dollar cash upfront near-term payment, but then the substrate of the deal is not a molecule, and it's not doing therapeutic discovery work together. The substrate is actually a model, which is what really made this pretty unique.
Why do you think there's appetite for this suddenly? It seems like almost whiplash that— yeah, it, you know, it seems like only a maybe a year or two ago that bio was dying and whatever, and now suddenly there's, um, this deal. Boltz is getting a ton of attention.
There's so much attention on Isomorphic, and people are AI-pilled to some extent, increasingly more. I mean, maybe not totally, but increasingly more people, you know, in pharma, you know, across the industry, are seeing the value of different capabilities. They're able to use some of the open source capabilities and they're able to demonstrate the value to themselves internally. And if you look at a pharma company, you know, these companies are working on dozens and dozens of programs. And so, you know, my opinion, just frankly my opinion, is I think pharma increasingly want to be able to access models, not just for one collaboration where you and I are working together on this one program. They want to be able to access the technology across the whole pipeline. And so I think that's going to create sort of a driving force for not just, you know, bespoke project-driven licensing, but actual broad licensing, where a pharma can access the technology in many different therapeutic programs.
Yeah. And I think also, you know, with the structure prediction models, protein structure prediction, binding prediction models, there is like this massive public dataset. There are increasing amounts of data. People can generate data to augment that. So, you know, there's enough data to the point where people can train very good models, but maybe not just on the data that any one biopharma company has. And I think that the same is true, but even more so, for the types of models that we are building, which are, you know, foundation models at the patient biology level. I mean, these companies may have a lot of data, but it's, you know, scattered, it's siloed, and pulling everything together to, like, train an actual foundation model may not be as easy as it sounds within a single company, whereas we have just said, you know what, we're gonna generate enough data ourselves to actually train a real foundation model. And that's the nice thing about being a startup here: we can make that bet that you actually do benefit from generating all of this data in a, you know, uniform way, like very high quality, et cetera, and then use that to develop and train the models. And my opinion is that you need to have data at that scale before you can even think about developing models that actually work. It's like you can't do the AI R&D, or build the algorithms, until you have a good enough dataset to tell you whether your favorite algorithmic idea is actually working or not. That's a major advantage for us: we have enough data to see, like, is my idea or someone else's idea about how to build a model actually leading to improvements there.
Yeah. I mean, this is a good point. I mean, so like sometimes people ask me, well, why doesn't GSK just generate your data? So we just started generating data for years. There was no model. It was like, how many years?
Like how—
Like 2 years, maybe a year and a half at least, before we had the first trained models working.
So, I mean, certainly, yeah, like the OctoVC model, like we trained in 2024. So yeah, that's like 2 years after.
Yeah.
Do you have any stories? So we do the—
So you have 4 years so far?
So this is year 4. And so we basically opened the lab, we hired a team, we got all the instruments, we started sourcing tumor samples. And there was no prior here that any of this would work. Like zero.
Big crazy bet. Like I was just going for it.
And we just started generating data, sourcing human tumors, processing. We built this whole processing pipeline to get the tumors into these arrays and the formats. And it takes literally 2 weeks for a machine to run a couple of slides on the spatial transcriptomics. So you've got these 2-week runs where you're processing 2 slides, and we're just churning data for months and we couldn't even train a model. We didn't even have enough data to train a model for at least a year and a half. And then you're building processing pipelines. You have to align all the data. You've gotta post-process it off the machine. So we sort of just built all this and then, let's say 18 months later, hey, I wonder, can we train a model off of it? And it wasn't obvious. It wasn't like, oh, we're gonna take some open-source architecture off the shelf and train it on this. Dan and the team have done a ton of work.
Yeah. There wasn't really anything major to go off of. I mean, there were Transformers developed for single-cell data, but for incorporating spatial data into that, again, there just weren't really datasets out there that people had been able to develop on. So we do a lot of custom model building, and I enjoy that. I think people enjoy that.
Josiah, hello for joining. Yeah, how to build custom model.
How many—
yeah, really unique, innovative model.
Steve.
Sorry, who are you looking for? Like, what kind of people?
Anybody excited about doing ML research on, again, this kind of alien landscape of data where you really have to figure out what's working from first principles. And obviously the work we do should have very, very large impact. So definitely not restricted to people who have a biology background, you know, people who just like tackling very challenging machine learning problems and are open to learning the minimum amount of biology necessary to make progress I think would be great candidates.
Talking to you guys reminds me a lot of Leash Bio, and I know that both of you are part of the Recursion Mafia. You know...
I'm not yet but Brady does yeah.
Yeah, we're gonna have you on the show in the future too, so we're looking forward to it. But it's interesting because both of you seem to have really similar philosophies, in that you have deep convictions that you're just going to start collecting data before you know this is going to work, and you are going to just brute force it, go, go, go. And eventually it will work. And, you know, you have signs.
I don't know.
I think that's really impressive. I wonder, is there something in the water at Recursion which has led to this sort of thinking of, we're going to commit to doing things at scale, it may not work at first, and you have to hit a certain point before it will?
I mean, we failed a lot at the beginning.
Yeah.
You mean at Recursion?
At Recursion, yeah. And I said we had to build it from first principles, and we really did. We spent many years trying to figure out what the data should look like. Ian, myself, we were all involved in platform development: how to design these datasets, how to design the experiments, iterative cycles over the years, seeing things that did work and things that didn't work. And so, coming out of Recursion, I think what a lot of folks there had was an understanding of what we need to think about, so that even if I want to design a totally different dataset today, what are the things we learned, not through mistakes exactly, but through trial and error over that many months, that we would try to insert in our new approach? I don't know that everything I've predicted at Noetic in terms of how to generate the dataset has been important, necessarily. But I know that we could start at the very beginning and say, okay, let's make sure we do these 10 things. I know every one of these 10 things was important before. Let's at least make sure we do these 10 things. I don't know that all 10 things are important for us today, but I would presume that many of them are. And it lets you leapfrog that process of trial and error a little bit. Certainly we do have trial and error still, but hopefully we're not having to solve 15 problems. Maybe we're only solving 3 or 4 problems over time.
So for small biotech startups in the AI space that are collecting their own data, building their own data moat, do you have any advice or any suggestions about how to be more successful there?
I think you sort of need to think ahead to, okay, what am I trying to do on the machine learning side? And what is the right data for solving this problem? Oftentimes I see a lot of companies say, okay, I want to generate X dataset. I'm just going to generate X dataset and I'm going to do machine learning on that. That might not be the right dataset. You might not have designed it the right way. It doesn't follow that any dataset is a machine learning dataset. It doesn't follow that that dataset solves the problem you're trying to solve. For me, even in founding Noetic, it was really, okay, what problem are we trying to solve? And then what are the data that are going to help solve that problem? Rather than going from the data directly to trying to solve it.
I also, sorry, I also had a quick piece of advice, which is like, you know, pay attention to where the technology is and, you know, where it's changing rapidly. So, you know, I finished my PhD in 2016. I did a lot of looking at spatial RNA, like via this technique called in situ hybridization, same technique that is like at the base of what we're doing. I could look at maybe 2 genes at a time on a single sample, and that took me a full week of manual work. And, you know, I came to Noetic like 5 years later, 6 years later, and all of a sudden, you know, there are platforms where you can look at 1,000 genes or 20,000 genes at once. You know, it's a single machine that can run this assay. It's expensive, but it's just like data beyond the wildest dreams of Dan Bear in 2016. And that is only improving like rapidly. So I think it's important to see what the technology of today, you know, allows and also where it's going in terms of what data to generate.
And what does that pitch look like? So I'm going to generate data for a year and a half and then I spend $50 million and then—
It wasn't $50 million, it was maybe closer to $10 million. But yeah, you have to do that if you're going into a regime where there's no data and you want to do something different. There's no shortcut to it, right? You're going to have to generate the dataset, and you're not going to know the answer until it's there. And that's why a lot of companies are not going into that space where there are no datasets, because I think it can be challenging to do that.
I mean, I think a lot of smaller biotech AI startups will try this pattern where they either start with a public open-source dataset or they try a pilot where they internally collect a small amount of data and see if something works or doesn't. And oftentimes there's almost a critical point below which you're just not going to get any signal, and you have to have conviction that you need to collect up to a certain point before you start really driving something fundamentally valuable.
Yeah.
I mean, imagine trying to train a foundation model on not enough data.
Yeah.
The kind— and then, and then that's— it's sort of your clinical trial, right?
With GPT-1, 2, and 3, there was a clear progression. With each one of them, you could see there was something which worked with scale, and there was this insight of, oh, we're gonna scale this up. With biological data, the process of collecting lots of data is just very expensive to begin with. You can't just take something off the shelf and expect that you're going to hit the threshold of, you know, GPT-3-like usefulness. So yeah, it takes some conviction.
It definitely takes conviction. I think it also takes a scientific belief that there's a lot out there that we just don't know yet, and that you're not going to capture the biology you need right now by having an agent that reads all of the biological literature. Because again, that's just a tiny slice of what's out there. I don't know if it's a great analogy or if I'm gonna botch the history here, but in astronomy it required Tycho Brahe collecting this enormous amount of astronomical data at his observatory, which then was the substrate for Kepler figuring out the first laws of motion of the planets, and then that was superseded by Newton's laws and so forth. I sometimes don't know how you even get started without this large repository of really high-quality data to begin with. And maybe there's a tragedy of the commons problem here of who's going to generate that data and who's going to capture the value of it. But I'm very glad that we're taking that bet and, you know, we're seeing it pay off.
Yeah, I mean, this is not my expertise, but hypothetically speaking, how much of PDB do you need to train?
I mean, there were some people who argued that, yeah, you can get some pretty good models with, I think, 1% of PDB. And there were people going back to the 1990s who argued that the PDB was already complete, in the sense that if you had a sufficiently smart algorithm, you could have done a pretty reasonable job at protein folding even back then.
Interesting.
So you don't need a lot to get a pretty big boost, but the community was independently collecting PDB data for quite some time without necessarily being convinced that this was going to lead to solving protein folding. But then most of those structures were also quite useful in and of themselves. So maybe that's the counterpoint: oftentimes just knowing a protein's structure was very helpful on its own.
And we did see, we did see a transition from like early data. How many samples did we get? I'm guessing probably on the order of a few hundred before there was like—
Yeah, there was definitely a moment very soon after I joined where the dataset just kind of doubled in size overnight, because there was a huge bolus, and the models immediately got a lot better at that point. And now we've run these more controlled experiments to see what happens if you train on 10% of the data versus 40% versus 100%. What happens if you hold out all of the pancreatic cancer or all of the breast cancer? So we have a much better idea of what kind of diversity and scale we need now. I guess I would say if we were sticking to cancer, maybe we're not that far off. Again, if we end up generating a few hundred patients in a bunch of major and some minor indications, which we're gonna do this year, maybe that's enough to generalize to kind of all cancer, because there is a lot of shared biology in cancer and immune cells across different tissues and different mutations and so forth. But if you think about all of the disease biology that there is for a model to learn, maybe that's another order of magnitude.
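The controlled experiments described here (training on a fraction of the data, or holding out an entire cancer type) can be sketched roughly as below. This is only an illustration of how such splits might be constructed, not Noetic's actual pipeline: the sample records, the `indication` field, and both helper functions are hypothetical, and the training and evaluation step is left as a comment.

```python
import random


def subsample_by_fraction(samples, fraction, seed=0):
    """Return a random subset holding roughly `fraction` of the samples."""
    rng = random.Random(seed)
    k = max(1, round(len(samples) * fraction))
    return rng.sample(samples, k)


def hold_out_indication(samples, indication):
    """Split samples into (train, held_out) by cancer type."""
    train = [s for s in samples if s["indication"] != indication]
    held_out = [s for s in samples if s["indication"] == indication]
    return train, held_out


# Toy, made-up metadata: 100 samples spread evenly over 4 indications.
indications = ["lung", "breast", "pancreatic", "colon"]
samples = [
    {"id": i, "indication": indications[i % len(indications)]}
    for i in range(100)
]

# Data-scaling ablation: same model recipe, different amounts of data.
for fraction in (0.1, 0.4, 1.0):
    subset = subsample_by_fraction(samples, fraction)
    # A (hypothetical) train-and-evaluate step would run on `subset` here.
    print(f"fraction={fraction}: {len(subset)} training samples")

# Generalization ablation: never train on one cancer type, then test on it.
train, held_out = hold_out_indication(samples, "pancreatic")
print(f"train={len(train)}, held out={len(held_out)}")
```

Comparing the evaluation metric across the three fractions gives a rough scaling curve, while the held-out-indication split asks whether biology learned from other tissues transfers to a cancer type the model has never seen.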
But I mean, even being able to solve all cancer biology would be pretty impressive.
Yeah, to cure cancer would be, would be great.
Well, if it solved cancer biology, I did not say cure cancer. Those are two different things.
But yeah, at least if you just take one drug, if you could look at one drug mechanism across the whole of oncology, that's incredibly powerful. I mean, imagine what Merck has done with Keytruda. Merck has run hundreds of trials with Keytruda. It might even be over 1,000 trials of Keytruda in different populations to find all these different indications: the subset of ovarian cancers, the subset of lung cancers, the subset of colon cancers. That's all been done by enrolling trials. If you can look at that biology from model embeddings and at least have a very well-defined starting point, so that if I'm going to run a trial it doesn't have to be as broad as it would need to be if I didn't have any answer, then that can be a really powerful tool for a diversity of mechanisms.
Yeah, maybe just one last point, going back to the virtual cell hot takes. If your goal is to build an actual mechanistic model of an individual cell, and then build up from one cell to an entire tissue, and then tissue to patient and so forth, you might need a lot more data and a lot more data modalities than just gene expression or something like that. But we're taking much more of a top-down approach: we're trying to first solve the problem of what is determining heterogeneity among actual patients and which of that variability is predictive of drug response. And my intuition is that you don't need to model the mechanism at the subcellular level, necessarily, to solve that problem of which patient should get which drug, or which targets are important in which patients. I saw a similar debate play out in neuroscience and computational neuroscience, where for a long time people were really trying to build these biophysical models of individual neurons, and then they were going to stitch them together into models of the brain and so forth. What actually ended up working, in terms of building computational models of the brain and behavior, is this abstraction where we just treat individual neurons as linear-nonlinear units, put them together in neural networks that are connected by linear weight matrices, stack a bunch of layers together, and then build neural network models of the brain that abstract away kind of all of the details of biophysically what a neuron is doing. And those are now by far the most predictive models of how a given neuron is going to respond to real-world stimuli in a real brain.
And I think that my bet is that the same is going to be true for these models too, is that like by modeling sort of at the level of functional tissue where you have a bunch of cells interacting in like a disease context, that that's going to get you to the problem of predicting kind of the, the patient-level behavior much faster than trying to first model a cell and then stitch a bunch of those cells together.
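To make the linear-nonlinear abstraction from the neuroscience analogy concrete, here is a minimal sketch in plain Python. It assumes a ReLU as the nonlinearity and small random weights, both chosen purely for illustration; it is not a model of any real neural or tissue data, and all function names are hypothetical.

```python
import random


def ln_unit(inputs, weights, bias):
    """One linear-nonlinear unit: weighted sum, then a ReLU nonlinearity."""
    drive = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, drive)


def ln_layer(inputs, weight_matrix, biases):
    """A layer is many LN units reading the same inputs (one weight row each)."""
    return [ln_unit(inputs, row, b) for row, b in zip(weight_matrix, biases)]


def ln_network(inputs, layers):
    """Stack layers: each layer's output is the next layer's input."""
    activity = inputs
    for weight_matrix, biases in layers:
        activity = ln_layer(activity, weight_matrix, biases)
    return activity


def random_layer(n_in, n_out, rng):
    """Illustrative random weights; a real model would learn these from data."""
    weight_matrix = [
        [rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)
    ]
    return weight_matrix, [0.0] * n_out


rng = random.Random(0)
layer_sizes = [8, 16, 4]  # stimulus dimension -> hidden units -> output units
layers = [
    random_layer(n_in, n_out, rng)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]

stimulus = [rng.gauss(0.0, 1.0) for _ in range(layer_sizes[0])]
response = ln_network(stimulus, layers)
print(len(response))
```

Nothing in this sketch represents ion channels or membrane dynamics; each unit is only a weighted sum followed by a threshold, which is the level of abstraction the analogy is pointing at.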
Yeah, that makes sense to me. It's a good analogy.
I like that.
Do you have any call to action for the listeners?
Yeah, I mean, I would say, one, everyone should be excited about biology. A lot of my hot takes on X recently are just that I feel like there's a huge amount of enthusiasm in the mainstream tech ecosystem, and people aren't really following a lot of what's happening in the biology space. But at the same time, you're hearing frontier labs saying we're going to cure cancer. And yeah, people should actually look at the folks working on curing cancer, or working on aging, or working on areas of biology. These are really exciting problems. There are real, significant ML problems in the space. One call to action is, we'd love for people to just be more stoked about learning about applications of machine learning in biological sciences and solving some of these hard problems, because I think these are the problems that are going to massively impact humanity in the next 10 years. And we're really just at the very beginning. Maybe we're in the first inkling of the ChatGPT moment for bio, but it's very much just the very beginning. So we'd like to catch you while you can.
Yeah.
Yeah. In line with that, really dig in and learn more about the details. I think a lot of the time it's presented as: we have these protein folding models, we have these binding models, we have AI-for-science agents that are reading all of the literature and automating these computational biology workflows. And I think it's important to realize that there are a lot of problems in AI for biology, AI for biochemistry, et cetera. They're very important, but solving any one of those is not going to solve the problem of how we develop better therapeutics. We're focused on a pretty particular slice of that process, which is, again, translating things that we know work well in some patients into actual successful drug trials where we know exactly which patients to give them to, and that requires building foundation models at a particular level, the patient level. But people should not be under the impression that this is all going to be solved immediately because AI agents like LLMs are gonna just read the literature and figure out what the right drug is. There's a lot more data to generate, there are a lot more ML problems to solve, and there's the need to translate those methods into actual successful drugs. And there are a lot of different places to contribute.
It's a lot to do. Yeah, I'm there. Great.
Thank you very much.
Here we are.