00:00:12: Welcome to the Bio Revolution podcast, this time around: virtual cells, decoding life's basic units.
00:00:20: Welcome, Izzy.
00:00:21: Hello.
00:00:22: And our guest today, Fabian Theis.
00:00:24: Hello.
00:00:25: Hi.
00:00:26: Fabian is the director of the Institute for Computational Biology at Helmholtz in Munich.
00:00:32: So we have a virtual podcast setup: one station in Munich, one in Berlin, one in Frankfurt.
00:00:39: Izzy, as always, we start, of course, with quotes to get closer to our topic.
00:00:51: Today, I brought two quotes from, I would say, two people who could be seen as foundational for their respective fields.
00:00:58: First is from Rudolf Virchow, a fellow Berlin practitioner who was foundational for the field of cell
00:01:06: theory, the recognition that cells are even the basic units of life.
00:01:11: So what he said is: where a cell arises, a cell must have previously existed. In Latin: omnis cellula e cellula.
00:01:19: So cells come from cells.
00:01:22: Just as an animal can spring only from an animal, a plant only from a plant, which I think is a foundational step in cell theory.
00:01:31: All cells must come from other cells.
00:01:33: This is when I teach at university, what I say in the first lecture to the students.
00:01:38: And then the other quote comes from Demis Hassabis, who just recently got the Nobel Prize in Chemistry, one of the minds behind DeepMind, including AlphaFold, for which the Nobel Prize was awarded. He said in two thousand twenty-two: one of my dreams in the next ten years is to produce a virtual cell.
00:01:58: What I mean by a virtual cell is you model the whole function of the cell with an AI system.
00:02:06: And I think this shows two different approaches to the question of the cell.
00:02:10: And I'm really happy that we have Fabian Theis here today, who is, as you just introduced, the director of the Institute for Computational Biology at Helmholtz, and also leads the Computational Health Center at Helmholtz.
00:02:22: And I think you also have a professorship for mathematical modeling at TUM.
00:02:27: And many accolades, including the Leibniz Prize. So we're really happy to have you here today, to discuss with you a little bit of all the exciting research projects that are going on in your group and that you're involved in, and to get an idea of them.
00:02:45: And I will start with this question.
00:02:47: Why do we actually need virtual cells?
00:02:49: Yeah, thanks, Luisa, first for having me.
00:02:51: Thanks, Andreas, for the kind introduction.
00:02:54: Quite excited to talk about maybe the bigger questions we have in the field.
00:02:59: And I think it's a good time to
00:03:01: talk about this at the moment, because there's sort of
00:03:04: a bunch of revolutions coming together,
00:03:06: and that's why there's this excitement, this reformulation of, I guess, an age-old question in the field, namely:
00:03:13: can we, sort of like we build cars, where we have an engineering plan and write things up, rather than just randomly hacking a few existing cars into bits and pieces to produce new ones,
00:03:25: can we build such a blueprint for cells? One that we
00:03:29: ideally understand, so that when a particular modification happens, we know it has this and that impact.
00:03:35: Why is it relevant?
00:03:36: I'm sure we're going to discuss a bit more later, but, you know, if you want to treat diseases, you want to understand where to sort of tune and do things.
00:03:44: So this idea of a virtual cell, I'm quite excited.
00:03:48: The definition, I guess, and there are very different understandings, but I think the highest-level picture, and you cited Demis as an example, would be to build a system that can simulate cellular behavior under normal growth and differentiation, as that beautiful Virchow citation was just formulating, but also of course under perturbations such as external influences, the microenvironment, and then of course also particular diseases.
00:04:17: Yeah, really, really interesting.
00:04:19: What I found quite interesting, so last year you co-authored with many important people in the field, including, I'm just quoting, I will not read all the authors, but it includes, for example, Aviv Regev, Emma Lundberg, Patrick Hsu.
00:04:33: So many, many people who have been thinking about these questions for a long time, actually, since I was still in research.
00:04:39: So I still know many of these publications and these thoughts at the earlier stages of what a virtual cell could look like.
00:04:47: Ten years ago, we had very different tools back then.
00:04:50: As you say, we have a revolution probably in data and a
00:04:53: revolution in artificial intelligence that can fuel these virtual cell models.
00:04:58: So in this publication in Cell, what I found quite interesting, and I really like this publication, it's titled "How to build the virtual cell with artificial intelligence: priorities and opportunities".
00:05:09: So this is not actually research output, but it's actually stepping back and convening the field together to ask the question, how can we actually even achieve this feat, which shows what a tremendous job it will be to model a virtual cell even with the help of AI.
00:05:26: Could you outline maybe a little bit what the challenges for a mathematician could be when it comes to building a virtual cell?
00:05:35: Yeah, absolutely.
00:05:36: And I think you really formulated this nicely: this paper is sort of a perspective, a plan type of thing, a call to action also, if you want, that actually sprang out of a symposium
00:05:46: that many of us had,
00:05:48: where these types of questions were written up. And of course some of these are already ongoing,
00:05:53: some of these are just about to start, but I think it was really good timing to do that.
00:05:59: So the challenges, I mean, there are many, right? There's always this traditional way of how we used to do cell biology, namely this divide-and-conquer type of thing:
00:06:09: try to understand one cell type, maybe one pathway, maybe one molecule, one receptor, something, and then sort of get all these details.
00:06:17: And then something like, twenty years ago, maybe a bit longer, this idea of systems biology came along, where, you know, we sort of tried to put together all these bits and pieces into the bigger picture to then understand how these sort of little entities work.
00:06:30: And then, of course, as you put them together into the human or into the particular organism you're studying.
00:06:34: So I would say the challenges are, as always in biology, that we have these very complex multicellular, multiscale, multimodal models, and those are two different things.
00:06:46: So first, we have all these different types of molecules floating around in what we computationally often assume is a well-stirred sort of batch, which obviously is not the case. Literally, there are some of us who just think a cell is a bag of RNA, and that's a tad of a limitation: if you look at cells through a microscope, there's this fantastically complex structure, right? And then there are things like localization and all these additional processes happening.
00:07:10: So even putting this complexity into a model, I don't think we know yet, even close, how we could do it in what we often call a forward fashion, where we write down a bunch of equations and see how they come together.
00:07:23: So that's sort of how to deal with the different modalities.
00:07:25: Then the second one is, of course, how to deal with the scales, both in terms of spatial scale.
00:07:30: So, you know, from the very EM, so electron microscopy type of observations of all these little details to maybe how proteins fold, how they sort of form the membrane, how you sort of put these together.
00:07:40: This, by the way, is what Demis means by a virtual cell.
00:07:43: He wants to really piece them together on that level.
00:07:46: up to the scale of, you know, the nucleus, the cytoplasm, how it interacts then with other cells and so on.
00:07:52: So bridging these and then also the temporal scales, whether you know it's sort of the microsecond type of fluctuation of one molecule sort of binding to the surface and starting the signaling pathway or this much longer range of chromatin modification, this is not really clear.
00:08:04: And I think one kind of cool and innovative new way, and I think that's why we came together, to deal with that complexity
00:08:12: is not to try to do it bottom-up, but to put this all into a more complex AI-based, maybe let's call it black-box, model, where you might not directly understand all the pieces, but which is still predictive.
00:08:24: This is an approach that worked very well in other fields.
00:08:27: We can talk about examples maybe in a minute, but I think that's maybe an idea that we could use here.
00:08:32: Yeah, really interesting.
00:08:34: I think so.
00:08:34: When I left academia in two thousand thirteen, the Karr paper had just come out,
00:08:39: so this first virtual cell model where, what was it, Mycoplasma genitalium had been modeled, the simplest bacterial organism.
00:08:49: And they tried to do it bottom up.
00:08:51: And I mean, this was really, I thought it was super fascinating, but maybe at the time not really scalable because of all the complexities that you are mentioning now.
00:09:01: So now we have AI at our hands, but we're still, of course, facing, as you say, a number of challenges of just the complexity of biology.
00:09:11: of the process, I think also the non-linearity of it.
00:09:14: I mean, taking out one gene doesn't equal taking out another gene.
00:09:18: So one might lead to everything breaking down completely.
00:09:22: The other one might have no effect whatsoever.
00:09:24: So there we have like very different levels of buffering, I think, for biological processes, which again is also quite difficult to model, I can imagine.
00:09:35: Where would you say, so this paper that I just mentioned came out last December, right, in two thousand twenty-four.
00:09:42: What do you think or what could you say happened since?
00:09:45: Are there any like progress steps being made within the field since?
00:09:51: Yeah, and if I may, not just since maybe, but also of course before, because you know this thing obviously builds up.
00:09:56: And people put stuff together.
00:09:59: And you mentioned Aviv Regev.
00:10:01: She's a good friend, one of the lighthouse figures in the field, and a fantastically integrative person.
00:10:06: She, together with Sarah Teichmann and many of us, the two of them really pushing it, has led this international consortium called the Human Cell Atlas that you might have heard about, in some parts a spiritual successor to the Human Genome Project.
00:10:19: Eric Lander was pushing for that and so on.
00:10:21: Really nice community worldwide, very integrative.
00:10:24: And that was in a sense the first type of data gathering
00:10:28: that sort of helped me to spin off this idea of a virtual cell model.
00:10:31: Because, you know, at the heart, and now
00:10:33: we agree, if we want to build an AI model, right, an AI model has what is sometimes called the holy triangle of AI. So obviously you need to have your compute, large-scale compute, in particular for these very large-scale
00:10:44: foundation models, that is very important.
00:10:46: You have to have your methods, the way you do all of that, and that's often rapidly advancing.
00:10:51: At some point, then new things work out, and then you sort of need to extend on that.
00:10:54: And then, of course, the data part.
00:10:56: Like, where do you get the data at scale to really do these things?
00:10:59: And that was the big first endeavor of this Human Cell Atlas,
00:11:02: where we really tried to gather and describe variability, in this case in any human, in each human organ at scale, and then also across variation in the population and so on.
00:11:14: And this, of course, now starts giving you so many examples that you can start building this.
00:11:18: When I started research, we had like a few transcriptomes like twenty years ago or something.
00:11:22: That was a big thing, sort of when the Human Genome Project had first finished, or a little bit after.
00:11:27: Whereas nowadays, in a single experiment, sometimes you can make more than ten million of these single cell transcriptomes.
00:11:32: This is huge data, and the type of approaches that you can now use to analyze that type of data
00:11:39: is just completely different from before.
00:11:42: So that was the first part.
00:11:43: gathering the data. In parallel, a whole bunch of methods have been developed.
00:11:56: We would love to hear about that, because I think the data part, for
00:12:01: me as a biologist and for our listeners who are maybe coming from slightly different fields, is still understandable, right?
00:12:07: I mean, the technology gets cheaper, it gets easier to use, and you just produce more and more and more of it.
00:12:14: But I think in my mind, or in the mind of someone not AI-savvy, the idea would be: okay,
00:12:21: now you have a billion data points.
00:12:23: no one can make sense of that anymore right?
00:12:25: so you can just store that in your data warehouse and hope.
00:12:29: that something will happen with it.
00:12:31: So how can we actually make sense of these massive amounts of data that we now have?
00:12:35: And that was actually one of the first, I think, big results in the whole field of single-cell genomics, if you want, where people quite quickly realized that AI is needed, at the time just called machine learning, right?
00:12:46: To make sense, exactly as you say, of these high-dimensional data sets.
00:12:50: And this could be just RNA; obviously it could be chromatin, protein, and so on.
00:12:53: There's single-cell proteomics now being a thing; it could be spatial context.
00:12:55: But these things can be stacked into big vectors, as you say.
00:12:59: A ten-thousand-dimensional space where you have a few dots that would be the cells, like:
00:13:03: what's the structure there?
00:13:04: It's a bit like, to take a metaphor from biology, this Waddington landscape picture that I'm sure you've heard of, right?
00:13:10: So you take that, let's say the Zugspitze:
00:13:12: you put your undifferentiated sort of primary cell, maybe X or something like that, on top, and you let it roll down across the various hills, and it sort of makes up different blood cells and other cell types in the body. And this somehow needs to be reflected in this high-dimensional dot space of transcriptomes and others, because that's essentially cell states transitioning.
00:13:32: So what we and others have been doing is learning that space, finding the curves, finding the submanifolds, as we call them.
00:13:40: If you have, let's say, some very high-dimensional thing, then the data typically maybe lies in a plane, and then you want to find that plane.
00:13:47: Or maybe it's something more curved and complex, and then you want to unroll that.
00:13:51: You know what I mean?
00:13:53: Or maybe even simpler, you maybe just want to find different clusters, groups of points that, because of their shared transcriptome, you could then call cell states or cell types.
00:14:03: And at one of the early HCA meetings, five or six years ago, I remember discussing quite intensely with Eric Lander about, you know, what is actually a cell type, what is a state?
00:14:11: And this old question: when you looked into Wikipedia, there were like nine cell types in the human brain.
00:14:15: You know, this has been completely revamped.
00:14:17: There are fifteen hundred different sort of transcriptome definitions of that, and you can very clearly localize them to different regions and so on.
00:14:24: So this is just sort of the descriptive part, without the effect of perturbation,
00:14:28: that has been done as a first step.
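As a concrete illustration of the "find structure in a high-dimensional dot space" step Fabian describes, here is a minimal sketch in Python using the open-source scanpy toolkit (which originated in the Theis lab). The input file name, gene counts, and parameter values are illustrative placeholders and not the actual Human Cell Atlas pipeline.

```python
# Minimal sketch: embedding and clustering single-cell transcriptomes.
# Assumes an illustrative file "pbmc_counts.h5ad" with a cells x genes count matrix.
import scanpy as sc

adata = sc.read_h5ad("pbmc_counts.h5ad")          # AnnData: cells x genes

# Standard preprocessing: normalize library size, log-transform, select variable genes.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)

# Reduce the ~10,000-dimensional gene space to a lower-dimensional manifold.
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)            # k-nearest-neighbor graph in PCA space

# Group cells with similar transcriptomes into putative cell states,
# and compute a 2D embedding for visualization.
sc.tl.leiden(adata, resolution=1.0)
sc.tl.umap(adata)
sc.pl.umap(adata, color="leiden")
```

Clustering resolution and the number of variable genes are analysis choices; the clusters found here are the "groups of points" that can then be annotated as cell states or cell types.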
00:14:31: Which I think is also really interesting, now acknowledging the complexity at the single-cell level after a very long time.
00:14:40: So when I was doing research, bulk analysis was still really the general modus operandi for systems biology. And I think this has changed, of course, with time, for transcriptomics but also for other levels of analysis like proteomics: you don't analyze bulks of cells, you analyze individual cells, and therefore you can also predict the individual fates of cells. Back in the day you would say, okay, this is a lung cell, but the whole lung of course contains cells that have very different functions, and depending on which one you hit, if you had a targeted drug, you might have a very different outcome.
00:15:18: Where it's even easier to understand, I think, is the example you just mentioned, the brain:
00:15:22: if you have a very specialized neuron and you have a cell that is like for example a microglial cell that has a completely different function.
00:15:32: you will have very different outcomes by perturbing those cells.
00:15:36: Immune system would be another example, right?
00:15:38: Where you have extremely specialized cell types.
00:15:42: And here I think understanding on a single cell level what those guys do and what we can do to potentially harness those cells and the proteins that they express as drug targets, that is of course really, really exciting type of research.
00:16:09: As a person with a non-scientific background,
00:16:13: I must ask this question.
00:16:14: I am stunned and fascinated by your talk right now.
00:16:18: But if we are tackling the Virchow quote, which basically states life comes from life, and we're moving into a, let's frame it, in silico world, if you will, a math and numbers world, from both of you: how close have we gotten,
00:16:38: with the huge amount of data that is available right now, and with modeling, to those virtual cells that might help tremendously in diagnostics and drug development and so forth?
00:16:52: You name it, maybe from math first and then from biology.
00:16:57: So your estimation, how close are we right now with all those tools at our fingertips?
00:17:03: So I think it's quite safe to say that in the first five years of computational development in these big landscapes, a key advance was really to find essentially a bunch of source cells where things started, and then we were able to connect them with trajectories and learn these branching hierarchies.
00:17:25: This was quite interesting and possible without actually us observing this in a microscope.
00:17:29: Typically, you know, for example, you look at the cell cycle.
00:17:33: You watch it, actually.
00:17:34: Sure.
00:17:34: You take RNA-seq.
00:17:35: These things are gone, right?
00:17:36: I mean, but because you have so many, you can sort of still see these sort of trajectories and you can actually learn that.
00:17:42: So that's possible.
00:17:43: The tough thing, and maybe we have time to talk about this a little bit later, is how they react to perturbations, this causal type of modeling.
00:17:51: And that's actually what you really want, right?
00:17:53: This is only emerging, I would say.
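The trajectory idea Fabian mentions, connecting source cells to differentiated states even though RNA-seq destroys the cells it measures, can be sketched with diffusion pseudotime and PAGA as implemented in scanpy. The dataset, the CD34 marker used to pick a root cell, and the assumption that a "leiden" clustering has already been computed are illustrative choices, not the specific analysis discussed in the episode.

```python
# Minimal sketch: ordering cells along differentiation trajectories from a static
# snapshot, using diffusion pseudotime and PAGA as implemented in scanpy.
import numpy as np
import scanpy as sc

# Hypothetical preprocessed dataset with a "leiden" clustering already present.
adata = sc.read_h5ad("hematopoiesis_processed.h5ad")

sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.diffmap(adata)                    # diffusion-map representation of the cell manifold

# Choose a root cell, e.g. the cell with the highest expression of a stem-cell marker.
marker = adata[:, "CD34"].to_df()["CD34"]
adata.uns["iroot"] = int(np.argmax(marker.values))

sc.tl.dpt(adata)                        # diffusion pseudotime: progression from the root
sc.tl.paga(adata, groups="leiden")      # coarse branching graph between clusters

sc.pl.paga(adata, color="dpt_pseudotime")
```

The pseudotime orders cells along the inferred process, and the PAGA graph summarizes the branching hierarchy between clusters, which is the kind of structure learned from snapshots rather than live imaging.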
00:17:54: Yeah,
00:17:55: I think we're still, even though AI helps a lot to integrate different data layers, we're still at a level where we're very reductionist.
00:18:05: And we have to be, I think, because it's not there yet to have really this complete multimodal data.
00:18:11: So all the things that happen in a cell, even at a single moment in time, are, I think, and please correct me if I'm wrong, beyond the capacity of current models.
00:18:21: And we simply also don't have the data.
00:18:22: We would have to have the data, all the right types of data, at exactly the same point in time, which is also a big challenge in the field.
00:18:31: We have
00:18:32: data sets that come from many different sources and that are not
00:18:36: necessarily comparable or integratable.
00:18:39: Therefore, I think, there are a number of initiatives: the Human Cell Atlas, but also, for example, the Arc Institute and the Chan Zuckerberg Initiative.
00:18:46: I think also at Helmholtz, there's a long initiative of data set production that is really targeted towards really fueling these models.
00:18:55: I think probably that has to come first.
00:18:57: We have to have this high quality data to be able to make this virtual copy of the cell.
00:19:03: Maybe one interesting aspect is, yeah, exactly.
00:19:06: As you say, you start with maybe one easy, or somewhat easier, to measure anchor, such as, for example, this unbiased observation of transcripts, you know, these things that are being transcribed from DNA in a cell-type-specific fashion.
00:19:18: But,
00:19:19: and that's why I sort of really love it, and it's, I think, maybe also a good way to move into this field:
00:19:23: there are always some cool biotech colleagues who come up with another new
00:19:27: way to measure this or that additional modality, often tacked on.
00:19:31: It's just fascinating.
00:19:32: It happens so fast and scales up so fast.
00:19:34: And one thing, for example, that, of course, people now can do, they can sort of add also a bunch of proteins.
00:19:40: Ideally maybe not all of them; maybe first they just label antibodies with some type of oligo, so when you then sequence the RNAs, you also get sort of protein expression.
00:19:48: Some others can sort of measure openness of chromatin, also with a bit of RNA together.
00:19:52: Of course, getting everything at once might be a tad hard,
00:19:55: but when you have a few of these paired data sets, then you can use those as a sort of bridge across modalities and piece together a bigger picture of having everything.
00:20:03: So I do think there's major steps happening to get closer to this big picture of what the current state is, which is not yet answering, of course, what happens if I do something to this state.
00:20:16: And that's where a whole lot of new sort of areas come up.
00:20:19: So basically, now that the Atlas is done, we want to do a perturbation atlas.
00:20:21: And often when people talk about what a virtual cell is supposed to do,
00:20:26: you know, it's: we want to see where my cell types are in my new data set, stuff like that.
00:20:31: But one of the key new things is I want to predict how it behaves on a set of perturbations.
00:20:36: You mentioned drugs already, Luisa; it could also be just CRISPR, you know, knockouts of particular genes, combinations of genes.
00:20:41: This is what current virtual cell challenges and so on typically compare, and that's a much more recent development.
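For the paired measurements Fabian describes earlier in this answer (antibody tags read out together with RNA, CITE-seq style), one common way to keep the modalities of the same cells together is a MuData container from the muon package. The file names are hypothetical, the normalization choices are just typical defaults, and the sketch assumes the antibody panel has more than about fifteen markers.

```python
# Minimal sketch: paired RNA and surface-protein readouts of the same cells,
# each modality normalized with its own transform, then combined in one object.
import anndata as ad
import muon as mu
import scanpy as sc

rna = ad.read_h5ad("rna_counts.h5ad")     # cells x genes (hypothetical file)
prot = ad.read_h5ad("adt_counts.h5ad")    # same cells x antibody-derived tags

# RNA: library-size normalization and log transform.
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)
# Protein: centered log-ratio transform, common for antibody counts.
mu.prot.pp.clr(prot)

# Per-modality reductions and neighbor graphs.
sc.pp.pca(rna, n_comps=50)
sc.pp.pca(prot, n_comps=15)               # antibody panels have far fewer features
sc.pp.neighbors(rna)
sc.pp.neighbors(prot)

# One container for the paired modalities, plus a joint neighbor graph that
# combines evidence from both views for downstream clustering or embedding.
mdata = mu.MuData({"rna": rna, "prot": prot})
mu.pp.neighbors(mdata)
```

The point is simply that paired measurements let the modalities be anchored to the same cells, which is what makes "piecing together a bigger picture" across data types possible.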
00:20:48: Which is also something that, as we can see, a number of biotechs, Xaira and others, are of course quite interested in, because this could be very helpful for drug development or drug discovery: just understanding and simulating how a certain compound would behave in other types of cells.
00:21:05: If I got it correctly, I think from your group there was also a different type of model for virtual cells that I found quite interesting, the Nicheformer, because, I mean, we always think about perturbations.
00:21:20: And I also come from a field where of course you think about, okay, what happens to the cell if I throw on a drug, and then it dies.
00:21:26: But in reality, the cell is of course not a single unit; it's not dissociated from its colleagues, it lives together with others, which again affects the cell, not only under pathological but also under physiological conditions.
00:21:40: So I think also having this layer of information is very relevant before we move to the perturbations.
00:21:47: Yeah, absolutely.
00:21:48: And I think for us, this was really an opportunity to ask challenging questions to the model.
00:21:54: In the end, we have different ways of how to describe the complexity of the data.
00:22:01: And the most powerful one, and this is a big learning from computer vision and large language models, is this idea of a transformer, which I'm sure you've heard about, sort of the revolutionary paper that kicked off this revolution of large-scale foundation models.
00:22:15: This is being tried in our field.
00:22:17: And I would say at the moment we are a bit beyond that first hype cycle, in the sense that a lot of people are now sobering up a bit
00:22:26: to the fact that the promises of these very flexible models are maybe a bit too much and not yet delivered.
00:22:30: And that was the question, why is that?
00:22:33: And one answer we tried with Nicheformer here was to give it more informative data, namely the neighborhood information in addition, and then learn a model that's not just RNA, but RNA plus spatial context.
00:22:45: And there we could actually show, in a whole set of ablation studies,
00:22:48: so studies where you compare with simpler models, that this particular model, which also learns in a fine-tuning stage about the local neighborhood, is able to predict, to quite some extent and better than classical models,
00:23:04: if you give it the state of a particular cell, through its gene expression profile, what its local neighborhood, its niche as we call it, would look like.
00:23:14: That's where Nicheformer comes from: niche plus transformer.
00:23:18: And I think that could be quite useful.
00:23:19: So without now measuring spatial context, because you can't always do that, right? But you also want to see how what we often call counterfactuals would behave: what would happen if I take this cell, remove it, and maybe place it next to that tumor cell, add an immune cell.
00:23:33: I want to see how that local neighborhood would change.
00:23:36: These types of questions we can now start addressing.
00:23:38: Not fully yet.
00:23:39: So I would also say here, it's to some extent early times, but the scale that we trained the model on, I think, is quite substantial.
00:23:45: So it's trained on more than a hundred million cells, which is then times ten thousand, quite a large dimension.
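To make the Nicheformer task concrete: the model is asked, from a cell's own expression profile, what the cell-type composition of its spatial neighborhood looks like. The sketch below is a deliberately simplified stand-in, a small PyTorch MLP with a composition loss, not the actual transformer architecture or training setup of Nicheformer; the dimensions and data are synthetic placeholders.

```python
# Deliberately simplified sketch of the task described above: predict the
# cell-type composition of a cell's spatial niche from its expression profile.
# A plain MLP stands in for the transformer; all tensors here are synthetic.
import torch
import torch.nn as nn

n_genes, n_cell_types = 2000, 30                  # illustrative dimensions

model = nn.Sequential(
    nn.Linear(n_genes, 256),
    nn.ReLU(),
    nn.Linear(256, n_cell_types),                 # logits over niche cell types
)

def composition_loss(logits, niche_fractions):
    # Cross-entropy between the predicted composition and the observed
    # fractions of each cell type among the cell's spatial neighbors.
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(niche_fractions * log_probs).sum(dim=-1).mean()

# x: expression profiles; y: observed neighbor-type fractions (rows sum to 1).
x = torch.randn(64, n_genes)
y = torch.softmax(torch.randn(64, n_cell_types), dim=-1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = composition_loss(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```

Framed this way, a dissociated data set without spatial measurements can still be queried for what a cell's niche would probably look like, which is the counterfactual use Fabian describes.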
00:24:03: Which, for example, when we think a bit more from an industry perspective, more about applicability: a question such as what happens if an immune cell and a cancer cell get together is of course highly relevant.
00:24:13: So we have these immunotherapies based on PD-1, PD-L1 that work very well in some tumors and not so well in others, which are then called the cold tumors, and the question is how to turn them hot.
00:24:24: So to understand, maybe from a modeling perspective, and this is something that is explored endlessly,
00:24:30: how we can actually activate these tumors to become more responsive to this immunotherapy, these kinds of questions are of course extremely relevant in this context.
00:24:39: Let me add another example.
00:24:40: So we actually work in particular in spatial context very much on that.
00:24:44: In tumor biology also, a lot of DNA data is necessary, so people, at least in clinics, often rely on whole-genome sequencing and bulk.
00:24:51: What we do, for example, in terms of cell therapy, I find it quite exciting.
00:24:56: At Helmholtz Munich, we have a whole set of partners, a very big diabetes department, where people are interested not just in finding treatments for obesity, but also in trying to develop treatments for the actual loss of beta cells.
00:25:07: And there's this exciting way that you can de-differentiate skin cells, make iPS cells out of that, and then differentiate them into, ideally, full beta cells.
00:25:16: That works to some extent, but these protocols to make beta cells, you need to make them more efficient.
00:25:21: This is where, if you have an engineered system: what screws do you need to turn to really make just those, and not all these byproduct cells, which then make things so expensive?
00:25:30: And I think there's one or two first companies that offer some of these therapies that are hugely expensive.
00:25:35: So how can you make this more efficient?
00:25:37: And then B, of course, how can you even just understand and find the right cells without, and that's always a big danger in cell therapy, making those cells forget something so that suddenly they start growing wildly,
00:25:47: and you have another problem.
00:25:49: This engineering aspect is interesting, because that's something that Jensen Huang from Nvidia said, right?
00:25:54: Now we're at the age where we can turn biology into engineering.
00:25:58: I think we're not quite there from my perspective.
00:26:01: I think the stochasticity is still too high, but there are these initial thoughts.
00:26:07: Where would you say are the biggest payoffs in the near future for these?
00:26:13: virtual cell models?
00:26:14: So how will they become useful?
00:26:16: I think there is a lot of academic groups, philanthropic groups, biotechs betting on the concept.
00:26:22: So where would you say would you bet your money for the payoff?
00:26:27: I mean I'm first now speaking as a pure scientist.
00:26:29: so of course you know I want to understand in the end biology right.
00:26:32: so make all these fun methods.
00:26:34: that is my core expertise.
00:26:35: But then, you know, we want to understand the particular biological problems, and I think you formulated a few already in the disease context.
00:26:42: We discussed a few, but of course there's something as basic as what some people call the most important time of our life, namely development,
00:26:50: you know, these very early time points.
00:26:52: Understanding that, having an executable map of it, I would just find super fascinating.
00:26:58: Like in the whole of single-cell biology, and this is maybe debatable, but I think so,
00:27:02: the most studied system could be, to some extent, hematopoiesis, you know, where there are actually treatments already in the clinic, right?
00:27:08: So this hematopoietic tree has been very well understood.
00:27:11: Having said that, just yesterday I visited colleagues in a leukemia laboratory who study this to such an extent because there are all these different ways in which that very well understood system can go wrong,
00:27:25: and then you need to find treatments, and ideally really closely navigable advice for doctors.
00:27:31: So I think for diagnostics, in practice, you know, I don't think it's realistic to do a deep analysis of every tissue of every patient coming in.
00:27:40: So one key question will be, you know, how can you take your big data set, or ideally your virtual cell, and tell me: what is the minimal set of parameters that I need to measure, ideally in a very cost-efficient fashion, to then diagnose if there's A, B, or C happening, right?
00:27:56: And these are questions I think that these models can answer quite well.
00:27:59: Of course, these need to be robust.
00:28:01: And in the past, sometimes these models worked well in one clinic, on one data set, so to speak, and then didn't generalize.
00:28:07: So this, what machine learning is often called, out of distribution generalization is absolutely key.
00:28:13: That was one aspect.
00:28:13: I think the other one that I bet on, and there's a whole bunch of startups as well as bigger pharma going into it,
00:28:19: is a problem that my lab sort of initially formulated,
00:28:23: something like six, seven years ago;
00:28:24: we called it scGen.
00:28:25: So this is generative modeling of a perturbation effect, in particular with drugs.
00:28:31: And I think this idea of just screening a big library of drugs for some effect
00:28:39: is just coming to an end, you know; some of the low-hanging fruit, and this is of course not really low-hanging, but some of these easier ones have maybe been found, and now the ones that are left you maybe want to combine, you want to go in a new direction.
00:28:50: So having a system that gives you even just a little bit of advice where to bet on, and not to make a full screen, but let me just screen, let's say, ten percent of that, but then you sort of have higher outcomes, or maybe understand how you can do combinatorics, which is such a big experimental space that you can never screen.
00:29:07: You know, this, I think, will have major impact and this is already happening now where people sort of start to optimize their systems.
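The scGen idea Fabian refers to, predicting how an unscreened cell type would respond to a drug, rests on vector arithmetic in a learned latent space. The sketch below shows only that arithmetic step on synthetic latent codes; in the real model an autoencoder is trained jointly and a decoder maps the shifted codes back to gene expression.

```python
# Schematic of scGen-style perturbation transfer: estimate the perturbation
# direction in latent space from a cell type measured in both conditions, then
# apply it to a cell type for which only the untreated state was measured.
# All latent codes here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 32
z_ctrl_B = rng.normal(size=(500, latent_dim))   # cell type B, untreated
z_drug_B = rng.normal(size=(500, latent_dim))   # cell type B, after the drug
z_ctrl_A = rng.normal(size=(400, latent_dim))   # cell type A, untreated only

# 1. Perturbation direction ("delta") learned where both conditions exist.
delta = z_drug_B.mean(axis=0) - z_ctrl_B.mean(axis=0)

# 2. Transfer the shift to the unseen cell type.
z_pred_A = z_ctrl_A + delta

# 3. In the full model, a trained decoder would map z_pred_A back to predicted
#    expression profiles, e.g. x_pred_A = decoder(z_pred_A).
print(z_pred_A.shape)
```

This is also the shape of the "screen ten percent, predict the rest" argument: once the latent geometry of responses is learned from a subset of experiments, predicted shifts can prioritize which of the remaining conditions are worth measuring.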
00:29:13: I have one question for you, Fabian, and one for you, Luisa. Fabian first, because I think not every country has those institutions.
00:29:37: How valuable, how good is it for our country to have non-university research institutions
00:29:46: like Helmholtz?
00:29:47: To be able to just do the research without teaching jobs?
00:29:52: and other than that, you know, is this giving us an edge in the further development?
00:29:57: Is this your take?
00:29:59: Yeah, thanks for raising this point.
00:30:01: I'm incredibly grateful, particularly at this moment.
00:30:04: We have a lot of fun and we can answer deep questions with like fantastic minds from all over the world, just really fantastic.
00:30:12: But I think this honor, this possibility to work, particularly at this time, in the German system, with, even though of course there are always funding discussions and so on, robust core funding, with all this additional industry collaboration and so on coming in, versus what we see in, you know, our lighthouse country for a long time, the US.
00:30:31: This is, I mean, this is just such a game changer, that we can do this here.
00:30:34: I very deliberately joined Helmholtz for its interactive type of nature, for the type of research that I do.
00:30:41: And I think a lot of upcoming research happens in connected labs.
00:30:46: You know, you need to sort of have an experimental partner next door to actually ask the right question and then sort of to put into your model.
00:30:51: And this is sort of what we do in these sort of large institutions, but this is far from decoupled from university.
00:30:57: Pretty much all of the PIs at our Computational Health Center, which, with now maybe forty PIs and five hundred people, is one of the largest places for health AI research in Europe,
00:31:06: pretty much all of them are associated with one of the two Munich universities, not only because there's a whole bunch of interactions with colleagues, but also because teaching helps bring in new students, helps educate students, and then it's also just an important interaction.
00:31:21: So I think these institutes, this type of interaction, in particular for AI, or, if you want, one level higher, data science, where these things come together, I think this ecosystem is really working out.
00:31:31: And we see that we're not only doing a very good job in sort of paper writing; there was a recent Nature Index ranking of rising institutes
00:31:41: in AI, and among the ten rising ones, I think next to Harvard and MIT,
00:31:46: there were six Chinese ones and then there were two German ones, namely Helmholtz and Max Planck.
00:31:51: So these things are to some extent working out.
00:31:55: But then of course also the transfer into industry is something we think about quite closely.
00:32:00: And there I think we can learn a bit more from the US, of course.
00:32:03: There must be something in the water in Munich probably that makes... Great minds.
00:32:08: Easy.
00:32:08: The one I have for you is because the thing that has been in my mind since we've been talking here is that more likely than not, in each and every episode, we address the deficits in biology itself, knowing about genes.
00:32:26: We talked about the dark genome, for instance, the whole episode recently.
00:32:30: I would imagine that there are deficits as well
00:32:34: in knowing, in understanding.
00:32:37: So, from the get-go, you're not even able to feed the algorithm the correct data at that point.
00:32:45: True or false?
00:32:49: No, that's absolutely true.
00:32:50: But I think, I mean, this is a general problem of the type of, or the way that scientific research is done in biology.
00:32:58: is that we, of course, use very artificial systems because we cannot study the cell in its tissue in the body because we have no way of looking inside, at least for some things.
00:33:07: We could maybe, with a microscope, look at certain stained proteins, but to really study the behavior of cells in connection is very difficult in the real scenario that they would have in tissues.
00:33:22: So in general, we culture cells in incubators, on, like, hard surfaces, which are not their natural way of growing.
00:33:31: They have oxygen and CO2 concentrations that are not natural for them.
00:33:36: They get a medium with sugar contents that might not be what they usually see.
00:33:40: So all these things, of course, influence cells.
00:33:43: That said, most cell lines that are used in laboratories, again, are pretty artificial because they have to be immortalized.
00:33:49: So if I now take out a cell from my skin and put it in a suspension, it will die.
00:33:53: So you need to somehow make these cells immortal, which means that you either often use cancer cells, which are naturally immortal.
00:34:01: So you have, for example, the HeLa cells, this famous story, cells that were extracted from a woman in the nineteen fifties, I think, and that have since been propagated and used for all kinds of research purposes.
00:34:13: Or you can immortalize primary cells or stem cells, for example.
00:34:18: But in general, by culturing cells, we're already in a slightly artificial situation.
00:34:24: So getting a more clear picture requires different types of methodologies.
00:34:29: For example, if we want to do a perturbation, it's very hard to do it in the body and then follow a single cell.
00:34:37: You can of course treat for example a whole animal but again for humans this becomes limiting.
00:34:41: So you see, there's just methodological complexity in getting as natural a picture as possible.
00:34:49: I think there is a lot being done, for example, around organoids, around growth media, all these approaches like growing organs on chips, which give you a much better idea of how cells interact with one another and what a perturbation would do.
00:35:05: But
00:35:06: what we have to be aware of, and I think this is really important: the
00:35:10: data that has been used to feed most of these models, and the data that has also been used to draw most of the conclusions with our own minds, has been produced under conditions that are not natural to the cells, often in cell types that are not relevant to the context being studied.
00:35:26: So, for example, during COVID times there were many studies using cell lines for studying infection that really had nothing to do with the cells that would see the virus, and, I mean, sometimes what you do there just doesn't make so much sense.
00:35:42: And because I put that very negatively now: I don't think that this reductionist approach per se, I don't think that the answers we get are meaningless, by far not.
00:35:54: And I think that a lot of biology is also transferable because a lot of the learnings that we have as humans, we can actually deduce even from bacteria from E. coli, we can deduce from yeast cells that are extremely simple organisms in comparison to us.
00:36:09: But there are many principles that we can deduce by studying simpler organisms, going more and more towards complexity.
00:36:17: And I think the more complexity the system has, the more we have to move towards the human and towards the real-world situation in order to make real sense of it.
00:36:28: And therefore we see, for example, studying contexts like understanding processes in the brain or in the immune system also gets more tricky because the model organisms are less likely to tell us what is really happening.
00:36:41: I mean, we cannot study our immune system in a yeast cell because it's just a single cell.
00:36:45: Makes sense.
00:36:46: You mentioned the limitations of the sort of model systems.
00:36:49: I totally agree on that.
00:36:50: And I think with organoids and so on, there's going to be some bridging things as you hinted at.
00:36:54: I really like, by the way, that you talk about limitations; it's a very important section in any paper.
00:36:58: Like whatever we do, it's going to be an approximation of reality.
00:37:01: So maybe if I can just briefly talk about the limitations of the actual models, right?
00:37:06: Because of course, these also just only go so far.
00:37:09: So ideally what I want is to have some uncertainty with that:
00:37:13: the model just tells you, you know, this thing I'm quite sure about, but this part, hey, I'm not, so you'd better look into it,
00:37:18: ideally because it has seen so many things that it knows what works here more than there. I think what models that, let's say, look into perturbations can do at the moment is predict within the convex hull, sort of the region that you already know something about.
00:37:34: So you can sort of go to approximate ones, but whether you can really go to very strange other types of drugs, settings, cell types, and so on,
00:37:43: this is a bit in question.
00:37:44: Of course, in principle, the biology as well as the physics is everywhere the same.
00:37:49: This is in different contexts, so it's like an adaptation of that.
00:37:51: It's a master gene regulatory network that describes all of development.
00:37:56: But in practice, I think the models are not yet robust enough to really learn this in general.
00:38:02: I would think that at some point, given enough data and the rapid increase we see, this interpolation will get much, much better, and not even that far in the future.
00:38:11: So the second limitation I would see sort of conceptual is, let's say once we have this black box type of model,
00:38:17: I mean,
00:38:18: is that understanding?
00:38:20: Is that a scientific theory in the sense of Popper, where, you know, you have your model, you can make the next experiment, you can falsify it?
00:38:26: That's kind of it, but you don't really understand.
00:38:28: It's not like, you know, this thing regulates that.
00:38:30: So often, as with these language models, there are now simplified models being trained on top, to then start to pin down what these things have actually learned.
00:38:37: So I think this explainable AI part, particularly if you want to then use these models to design a drug, to do a diagnosis, to do treatment, I think is becoming more important.
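One common, model-agnostic way to get the kind of "this part I'm sure about, this part I'm not" signal Fabian asks for earlier in this answer is an ensemble: train several copies of the model with different seeds and use their disagreement as a rough uncertainty and out-of-distribution flag. The sketch below only shows the aggregation step on untrained placeholder networks; it is an illustration of the general technique, not a method from the episode.

```python
# Minimal sketch: disagreement across an ensemble of predictors as a rough
# uncertainty signal. In practice each member is trained independently on the
# same data with a different random seed; here they stay untrained, purely to
# show how the predictions are aggregated.
import torch
import torch.nn as nn

def make_member(n_genes=2000, n_outputs=50):
    return nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU(), nn.Linear(128, n_outputs))

ensemble = [make_member() for _ in range(5)]

x = torch.randn(16, 2000)                              # a batch of query cells
with torch.no_grad():
    preds = torch.stack([m(x) for m in ensemble])      # (members, batch, outputs)

mean_prediction = preds.mean(dim=0)
disagreement = preds.std(dim=0).mean(dim=-1)           # one score per query cell:
                                                       # high values suggest the query
                                                       # lies far from the training data
print(disagreement)
```

Queries inside the "convex hull" of the training data tend to produce consistent predictions across members, while stranger drugs, settings, or cell types tend to pull the members apart, which is exactly the flag a user of such a model would want.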
00:38:46: This is really interesting that you mentioned it, because I think that was the big question with AlphaFold, right?
00:39:02: The protein folding problem, is it solved because the AI can approximate how it works?
00:39:07: Or is it solved if the human understand how it works?
00:39:10: So I think that was quickly get into a more philosophical debate.
00:39:15: Maybe if I can ask maybe a last question, do you think, and I think this is a question that I am being asked a lot and that I find it a little bit hard to answer.
00:39:24: Do you think that transformer based models or models of these architectures are the right ones to study a biology and do you see anything on the horizon that could work better or will complement?
00:39:37: Yeah, very important point.
00:39:38: I'm just coming back from a sort of small retreat with a bunch of AI colleagues, amongst other the elite scientist of mid AI, where we've been discussing, is that it?
00:39:50: And it could be the internal transform architecture, but it could be also the losses, namely the fact that essentially what all these things do, to some extent, is taking the data and then masking a few things and trying to predict that mask.
00:40:00: And if you sort of know how to fill sentence holes, so to speak, in biology, let's say a few genes missing, maybe need to learn everything.
00:40:08: And I think there's some, in particular, seeing now the current limitations that this might not be the best way.
00:40:13: I think for image-based data, particularly for this sort of fidelity of image where there's this natural sequence nature and sort of this locality to some extent, these things actually work beautifully well, right?
00:40:23: So image encoding as well as sequence encoding.
00:40:26: So let's say if you want to do DNA language models, this I think a very robust approach, even though the context is not long enough to sort of learn across all of DNA.
00:40:36: But I think for some others, we might need to do different things.
00:40:39: And the one big learning I think that we saw from the whole development of AI is that models need biases, inductive biases.
00:40:48: You need to put in some information to actually just do anything.
00:40:51: And early on, we put a lot of inductive biases prior to these models.
00:40:57: Maybe another field:
00:40:58: when you wanted to do language, speech synthesis or speech-to-text,
00:41:02: you initially told the models, hey, this is what vowels are, what consonants are, and you might have used... Luisa, Andreas, maybe it was Dragon Dictate or some of these old things.
00:41:10: They were kind of really bad, right?
00:41:12: Like really bad.
00:41:13: But now this stuff works, because the shift was from manual feature engineering to data-driven feature engineering: you had so many voice files that the computer could automatically learn what's important.
00:41:24: We might not be there yet.
00:41:25: So at the moment, we see in some of our models, and many others do as well, that putting a prior in, let's say a GRN, a gene regulatory network, or some information from chromatin and so on, actually helps our RNA-based models.
00:41:35: At some point, maybe we have so much data that the models would automatically find the right encodings.
00:41:40: But I think the architectures will definitely change and be updated.
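The "mask a few things and predict the mask" objective described here can be written down in a few lines. The sketch below hides a random fraction of each cell's gene values and scores the model only on the hidden positions; the dense network, the zero-out masking scheme, and the toy data are simplifying assumptions. In a real foundation model this would be a transformer over gene tokens, possibly with inductive biases such as a gene regulatory network built in, as discussed above.

```python
# Minimal sketch of a masked-prediction objective for expression data: hide a
# random subset of gene values per cell and train the model to reconstruct
# exactly those hidden entries.
import torch
import torch.nn as nn

n_genes = 2000
model = nn.Sequential(nn.Linear(n_genes, 512), nn.ReLU(), nn.Linear(512, n_genes))

def masked_loss(x, mask_fraction=0.15):
    mask = torch.rand_like(x) < mask_fraction    # True where a value is hidden
    x_input = x.masked_fill(mask, 0.0)           # crude "mask token": zero it out
    reconstruction = model(x_input)
    # Only the masked entries contribute to the objective.
    return ((reconstruction - x)[mask] ** 2).mean()

# Toy batch of log-transformed counts, standing in for real expression data.
x = torch.log1p(torch.poisson(torch.full((64, n_genes), 2.0)))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = masked_loss(x)
loss.backward()
optimizer.step()
```

Whether filling in hidden genes is the right training signal for biology, as opposed to images or text where locality and sequence structure make it work so well, is exactly the open question raised in this answer.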
00:41:43: I think we should do a second episode, most definitely, because there's so much going on.
00:41:49: Maybe we'll make a promise that we come back in this setting.
00:41:53: It was a pleasure for me, especially to listen this time and not to speak because I'm used to speaking.
00:41:59: And today it's wonderful.
00:42:08: Many thanks, Fabian.
00:42:10: Thanks.
00:42:10: Thanks, Izzy.
00:42:12: Thank you.
00:42:12: Thank you very much for having me.
00:42:13: Of course.
00:42:14: And as always, please feel free to comment on our show, to subscribe to it.
00:42:21: And further information is always at ScienceTales.com.
00:42:26: Thank you very much again, and thank you for listening or watching this episode of the Biorevolution podcast.