
Computational simulations in science

by t-------------- Thu Nov 20th, 2008 at 07:15:08 PM EST

I have said many times on ET that I don't trust most computer simulations of reality, but I never supported this claim with any argument. Considering that this is the area where I work, I think I can offer some ideas on the topic based on personal experience. My plan is to write, in the future, a robust and extensive argument on this issue. For now, I will start with a simple example, using mostly naive arguments. Feedback is highly encouraged (especially antagonistic viewpoints that are rationally grounded).

Caveat: I exclude computational chemistry and physics from all this (a preamble explaining why is due in the future). By computational chemistry and physics I mean only what theoretical chemists call ab initio and semi-empirical methods.
Caveat: I will trade rigor for clarity (in particular, the word gene will be abused).

Short bio:
Undergrad in computer science (with lots of experience in software development)
MSc in bioinformatics (area: population genetics, conservation genetics)
PhD in tropical medicine (ongoing - theoretical study of the spread of drug-resistant malaria)

For obvious reasons it is quite difficult for me to talk about the PhD work. I will just say that I have even more reasons to believe in what I believe.

Population genetics and simulated reality:

In many research studies involving population genetics, empirical data is compared with data simulated by a computer. Conclusions about the real-world data are drawn from how that data deviates from the simulated data.

An example
Imagine trying to find which genes in a certain population are under selection. Say that a certain population of fish is able to adapt to very cold waters while others aren't, and you want to know which genes are responsible for that (assuming there is a genetic basis). The procedure is more or less like this: you go to the field and collect samples (something that has DNA) from individuals in several populations. Then, in the lab, several candidate genes are genotyped for each individual. Each individual will have 2 copies of a gene, and different versions of the same gene may exist (different alleles). You are looking for genes that seem to be "different" from the majority. The underlying hypothesis is that "normal" genes are neutral while "different" genes are under selection (and hopefully related to thermal regulation, which is what you are looking for).
Imagine that you sample 2 populations (one in normal waters, another in cold waters), 10 individuals in each population and (only) 3 different genes. Imagine that you have the following distributions (I will call the alleles of the first gene a and A, those of the second gene b and B, and those of the third c and C).


Population |  a   A |  b   B |  c   C
Normal     | 10  10 |  0  20 | 10  10
Cold       | 10  10 | 20   0 | 10  10

It would seem that the first and third genes have the same distribution in both populations, while the second gene is quite "different".
This is, of course, a simple and highly skewed example, for explanation purposes.
Reality is normally much more complex, involving more genes, more populations, and possibly more alleles per gene.
So what might researchers do? They simulate a bunch of neutral ("normal") genes on a computer, calculate confidence intervals for some statistics on that simulation, and then compare each real gene (or the statistics derived from those real genes) against the simulation results. Genes that fall outside the confidence intervals from the simulation are deemed candidates for being "under selection".
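To make that procedure concrete, here is a minimal sketch of the idea, not any specific published method: the divergence statistic is a crude Fst, and the "neutral" genes are generated by simple binomial drift rather than by a proper coalescent simulator, so every number in it (population sizes, generations, cutoff) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

def fst(p1, p2):
    """Crude two-population Fst from the frequency of one allele in each sample."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2 * p_bar * (1 - p_bar)                        # pooled expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2.0  # mean within-population value
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

# Toy data from the table above: copies of the lowercase allele out of
# 20 sampled gene copies per population (10 diploid individuals each).
observed = {"gene 1 (a/A)": (10, 10), "gene 2 (b/B)": (0, 20), "gene 3 (c/C)": (10, 10)}
SAMPLE_COPIES = 20

def neutral_fst(generations=100, ne=200):
    """One neutral locus: both populations drift independently from a shared
    ancestral frequency, then 20 gene copies are sampled from each population.
    A crude stand-in for a coalescent simulator, for illustration only."""
    p = np.full(2, rng.uniform(0.1, 0.9))
    for _ in range(generations):                  # genetic drift, effective size ne
        p = rng.binomial(2 * ne, p) / (2 * ne)
    counts = rng.binomial(SAMPLE_COPIES, p)       # sampling noise in the field
    return fst(counts[0] / SAMPLE_COPIES, counts[1] / SAMPLE_COPIES)

null = np.array([neutral_fst() for _ in range(5000)])
cutoff = np.quantile(null, 0.99)                  # the "neutral envelope"

for gene, (normal, cold) in observed.items():
    value = fst(normal / SAMPLE_COPIES, cold / SAMPLE_COPIES)
    verdict = "candidate for selection" if value > cutoff else "looks neutral"
    print(f"{gene}: Fst = {value:.2f} -> {verdict}")
```

On the toy table only the second gene exceeds the neutral envelope; everything that follows is about how much the envelope itself depends on the assumptions baked into the simulator.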

So, real data is compared against simulated data to derive conclusions.

So what about the realism of that simulated data? In population genetics, most simulations (not only my selection-detection example) are based on coalescent theory. I am not going to explain the coalescent here, or this would become huge. I will just present some of the common assumptions made in coalescent simulations (note that some simulators do things differently).


  1. Random mating among individuals. The meaning of this is obvious. Or maybe not.

    1. Some simulators don't have the notion of gender, so random mating might mean truly random mating of genes.

    2. Random mating is not a very good assumption in many situations. We all know of many species where mating is not random at all. But it is said that random mating is a good approximation for humans (especially because we don't see one-male-for-all-females behavior - though the Irish might complain here (search for Niall), or Genghis Khan's descendants, by the way). So we have random mating used for species which don't mate randomly at all, and for humans, where it is seen as an acceptable approximation. I will not even go into politically dangerous terrain about random mating among humans; I will just say this: if you do some trivial math, random mating produces lots of half-siblings (where one parent is shared but the other is not) and almost no full siblings (where both parents are shared) - see the short simulation sketch after this list. Do you think this is a reasonable approximation of the reality that you know? Think especially in terms of genetic diversity (which is what we are discussing here).

    3. Random mating is actually an extreme, in the sense that it maximizes the transmission of genetic diversity. For instance, if in the real scenario only one male is responsible for the next generation's gene pool (quite typical in, say, domesticated animals), you lose the genetic diversity that all the other males might contribute. A random mating approach loses very little in comparison.

  2. Meta-population structure. Let me give an example with humans. Humans are normally studied theoretically as three populations (Africans, Asians and Europeans - forget about Native Americans to start with). These 3 populations are, you guessed it, in random mating, and individuals migrate among populations (with equal rates of migration between populations). So according to this model, a Portuguese like me (think of this simulation as modeling ages before planes or cars existed) had a much bigger probability of mating with a Finn than with a Moroccan (as a funny aside, people from Lisbon and further south, like me, are called Moors by the northern Portuguese).

  3. Hierarchical population structure. Continuing and detailing the example above, each population is normally in random mating. It is known that genetic diversity varies quite a lot within Europe (the Finns are normally presented as an example of a highly inbred nation - more so than the Icelanders). But most models take no account of this internal population structure. In most models there is normally no notion of hierarchy (i.e., the 3 human populations are not internally subdivided).

  4. Independent generations. According to most existing models we can only mate with people of the same generation. No Charles Chaplin for you (or Niall, or Genghis Khan ;) ). Actually I think this is a bigger problem with some non-human species.

  5. On a completely different front are field and lab errors: repeated sampling of the same individual? It might happen, especially with elusive species. Errors in the lab? There is ample evidence that researchers make lots of mistakes when interpreting results from sequencers. How does a certain method cope with average error rates?
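On the half-sibling claim above, here is a minimal Monte Carlo sketch of the trivial math (the population size and number of offspring are arbitrary assumptions, not taken from any study): draw both parents of every offspring uniformly at random and count which pairs of offspring share one or both parents.

```python
import random
from itertools import combinations

random.seed(1)

N_MALES = N_FEMALES = 100   # assumed breeding population, chosen for illustration
N_OFFSPRING = 200           # assumed number of offspring

# Random mating: every offspring gets an independently drawn mother and father.
offspring = [(random.randrange(N_FEMALES), random.randrange(N_MALES))
             for _ in range(N_OFFSPRING)]

full_sibs = half_sibs = 0
for (mom_a, dad_a), (mom_b, dad_b) in combinations(offspring, 2):
    shared = (mom_a == mom_b) + (dad_a == dad_b)
    if shared == 2:
        full_sibs += 1
    elif shared == 1:
        half_sibs += 1

print(f"pairs sharing both parents (full siblings):       {full_sibs}")
print(f"pairs sharing exactly one parent (half siblings): {half_sibs}")
```

With 100 potential mothers and 100 potential fathers, roughly 2% of offspring pairs end up as half siblings while only about 0.01% share both parents - nothing like the sibling structure of any real human population.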

Now that I think of it, most of the assumptions discussed above (the last being a more ambiguous case) tend to bias towards increased genetic diversity...

Notice also that I am discussing approximations that are easy to dismantle. In many cases much of the history of populations (especially non-human ones) is simply unknown: population size, population structure, mating habits, genomic structure. So you invent.

Now, people are aware of this, and the counter-argument is: "most methods are robust to changes in the assumptions". The immediate question then becomes: "Show me proof that your method is robust" (this can be done with theoretical studies where you change a certain assumption and compare the results). Normally the answer is: "I have tested for one specific change" (usually from one demographic model to another). "What about these other n differences which are also important (like, say, mating)?" The answer is invariably: "You can do it yourself, that would be nice to see".
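For concreteness, the kind of robustness study being asked for can be sketched along these lines (again a toy, not anyone's published test: drift is collapsed into the textbook effective-size formula Ne = 4*Nm*Nf/(Nm+Nf), and the "method" is just the neutral Fst envelope from the earlier sketch): calibrate the cutoff under random mating, then feed the test neutral data generated under a violated mating assumption and see how often neutral loci are wrongly flagged.

```python
import numpy as np

rng = np.random.default_rng(0)

def neutral_fst(n_males, n_females, generations=50):
    """Neutral Fst between two populations after `generations` of drift when only
    the given numbers of males and females actually breed.  Drift is modelled
    through the variance effective size Ne = 4*Nm*Nf/(Nm+Nf) instead of
    simulating individuals - a shortcut to keep the sketch short."""
    ne = 4 * n_males * n_females / (n_males + n_females)
    copies = int(round(2 * ne))
    p = np.full(2, rng.uniform(0.1, 0.9))         # shared ancestral frequency
    for _ in range(generations):
        p = rng.binomial(copies, p) / copies      # independent drift in each population
    p_bar = p.mean()
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = (2 * p * (1 - p)).mean()
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

# Step 1: build the null distribution under the usual assumption (everybody breeds).
null = np.array([neutral_fst(50, 50) for _ in range(2000)])
cutoff = np.quantile(null, 0.99)                  # nominal 1% false positive rate

# Step 2: same neutral process, but only one male actually breeds (think livestock).
violated = np.array([neutral_fst(1, 50) for _ in range(2000)])

print("nominal false positive rate: 0.010")
print(f"rate when the mating assumption is violated: {(violated > cutoff).mean():.3f}")
```

In this toy the nominally 1% test flags a large fraction of perfectly neutral loci once the mating assumption is broken; whether a real method degrades like this under realistic deviations is exactly the question that, in my experience, rarely gets answered.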

I am consciously avoiding a discussion of the economic and sociological side of this, or of explaining why people don't test for all these conditions. There are reasons why people avoid doing this, but that is not the point of this exercise (though a discussion about the sociology of science would be cool).

The only thing that I want to do is to plant some seeds of doubt about the current use of computational simulations as a proof/forecast mechanism in science, and to do that by explaining, roughly, the underlying science.

By the way, I also think that in many cases we are really not talking about science as many of these methods are falsifiable, but people seem to prefer to ignore that ("this method can still be used in many realistic scenarios", "this method is valid if the (unrealistic) assumptions hold; if you have different assumptions then it is your responsibility").

This text was, I have to admit, half-baked. Please accept my apologies for that. It is my intention to develop the argument over time, but I think I owed some explanation of why I doubt the use of computer models for making predictions so much. Please allow for some language abuse; I just want to convey the rough idea for now. Comments are very welcome.


Seems to me that the sophistication of the model directs the level of accuracy of the results. Just because real life is very complicated doesn't mean that you can't learn anything from a model...  ?
by asdf on Thu Nov 20th, 2008 at 09:24:30 PM EST
That is a very important issue (actually you address 2 fundamental issues, in my opinion). I plan to write about them separately in the future.

  1. Simple models versus complex models. I actually think that complex models are even worse than simple models, mainly because: a) complex models are difficult to understand (and I can prove beyond doubt that many simple models have gross mistakes), and b) a bigger parameter space allows many ways to get a plausible model just by the sheer number of parameters and their ability to cover the solution search space in many ways. So you might get something that resembles reality just because, by chance, a few parameter combinations happen to produce the same behavior.

  2. Is modeling any good? Sure it is. I believe simple models are a good supplement to empirical research, especially to try to find unexpected behaviors which might drive systems, and also to try to clarify and rigorously define concepts that field/lab researchers use somewhat "liberally" (in fact, one can say that our heads are full of models of reality). But I don't see how modeling can be used to forecast the future or even reconstruct the past.
by t-------------- on Fri Nov 21st, 2008 at 04:56:40 AM EST
[ Parent ]
I don't trust most computer simulations of reality....
Caveat: I exclude computational chemistry and physics from all this (a preamble explaining why is due in the future). By computational chemistry and physics I mean only what theoretical chemists call ab initio and semi-empirical methods.

When you write the explanation, please be explicit about the question that the simulation is intended to answer. Even the best of the standard ab initio methods can't be trusted if the answer is highly sensitive to small differences in energy, while crude molecular mechanics methods can be trusted if the answer is very tolerant of such differences. A statement about trustworthiness can be misleading unless one describes what trust requires.

(re. blockquote: Apologies, but in writing this very post, I found a new way to abuse HTML!)

Words and ideas I offer here may be used freely and without attribution.

by technopolitical on Thu Nov 20th, 2008 at 10:59:49 PM EST
I have a problem here, for which advice would be welcome. It is about clarity versus rigor.

In order to give a rigorous explanation of some of the subjects involved, a lot of wording and context is necessary. I sometimes trade, maybe in excess, rigor for clarity. How should complex issues be framed so that clarity is not lost while rigor is maintained? When I try to make a certain point I don't mind losing some rigor/truth as long as that lack of truth is not misleading in itself.

Imagine just having to explain ab initio versus semi-empirical versus molecular mechanics (when that is really not the focal point here)... Not to mention that I don't even feel qualified to do that...

by t-------------- on Fri Nov 21st, 2008 at 05:08:09 AM EST
[ Parent ]
I appreciate the clarity (very enjoyable and interesting read!) and think you can signal (as you have done in this essay) where complexity might cause a branching of the argument--leave a marker with maybe one or two lines about where that argument might lead, then back to your plan.

If a knowledgeable person picks up the side argument in the comments I (a non-scientist) can then see at least at what point they are diverging and hopefully follow along with some idea of how much the argument is a tweak to the main point or a direct confrontation etc...

btw, I would be interested in reading a discussion of the economics and sociology of science if anyone fancies it.

Don't fight forces, use them R. Buckminster Fuller.

by rg (leopold dot lepster at google mail dot com) on Fri Nov 21st, 2008 at 06:33:51 AM EST
[ Parent ]
I don't trust most computer simulations of reality....

Nor do I... say that's a cute trick you found there techno, good to seeya, binawhile...

but then if i were a computer i wouldn't trust the human models of reality either, lol-

'The history of public debt is full of irony. It rarely follows our ideas of order and justice.' Thomas Piketty

by melo (melometa4(at)gmail.com) on Sat Nov 22nd, 2008 at 07:28:09 PM EST
[ Parent ]
I know you said to allow for some language abuse, but you really should change this:

I also think that in many cases we are really not talking about science as many of these methods are falsifiable

To qualify as science, something must be falsifiable. It is just wrong to say something is not science because it is "falsifiable".

I guess you meant to say "many of these methods are false/wrong". In that case I would point out that being wrong does not mean it is not science. The plum pudding model is wrong, but in its day it was a reasonable proposition based on available data. New data forced a more accurate model to emerge. Despite being completely wrong, the plum pudding model played a valid scientific role. To this day it is useful as an example of how science works. With time and observation models improve. Why should the same process not work in respect of computer simulations?

by det on Fri Nov 21st, 2008 at 06:50:08 AM EST
Yep, you are correct, thanks.
by t-------------- on Fri Nov 21st, 2008 at 11:15:49 AM EST
[ Parent ]
I have at least one issue with this, depending on how broad your exclusion of "computational chemistry and physics from all this" is meant to be.

Aren't you projecting from population genetics to computer simulations in general? Is this fair? How does the existence of more or less robust theories in a field affect your claim?

In fact your case seems to me to be a discussion of how one cannot extract a theory by statistical means alone. Or how a lack of theory makes models blind. A theory lacking depth leads to models lacking depth.

Anyway. The proof of the pudding is in the eating: epistemology aside, doesn't the predictive ability of a model (where prediction is an aim) validate, at least practically, its usefulness?

Take meteorology. I don't think there is any doubt (correct me if I'm wrong) that weather forecasting has become a vastly more exact business over the past two decades at least, nearing the limit that chaos allows us to reach. Or climate prediction: as I pointed out in another thread, the track record of correspondence with reality of even quite simple climate models seems to be, if not amazing, then quite good.

The road of excess leads to the palace of wisdom - William Blake

by talos (mihalis at gmail dot com) on Fri Nov 21st, 2008 at 08:15:18 AM EST

I have at least one issue with this, depending on how broad your exclusion of "computational chemistry and physics from all this" is meant to be.

My "view" is very narrow. Stops at the protein level. Maybe at the cell membrane level.


Aren't you projecting from population genetics to computer simulations in general? Is this fair? How does the existence of more or less robust theories in a field affect your claim?

You might be correct. I take the bias from my background (which actually also includes a similar field, the spread of drug resistance). I don't want to go into social issues for now, but my argument is complexity-based: our computationally feasible models of chemical reality (sorry, I have some old background in theoretical chemistry; the physics that I know comes mostly from there) can't even simulate a large molecule reliably for more than a few seconds, so I suggest that the bigger the system, the more approximations have to be made, and the bigger the possible mistake.
By the way, what about the butterfly effect? Especially in the context of simulations that are approximate...
Anyway, your argument carries weight and I will need some time to think about it.

Anyway, people use computer models to make decisions on things from drug policies to climate change to quantitative finance (other people here are better prepared than me to talk about quants). I think there is enough practical evidence for not taking the results as gospel.


In fact your case seems to me to be a discussion of how one cannot extract a theory by statistical means alone. Or how a lack of theory makes models blind. A theory lacking depth leads to models lacking depth.

Am I inferring correctly that more depth will imply more complexity? The truth is that we are dealing with systems with a massive number of variables and a massive number of unknowns. Many things are not known (and are not knowable), and most inference methods depend on assumptions about unknown parameters. For instance, some models require a certain demography, but we really don't know the past demography of many species.

By the way, there is much modeling done in medicine (now talking about another domain I know something about) where many fundamental variables are not known: the pharmacokinetics and pharmacodynamics of drugs, plus their mechanisms of action, are very often not known, just speculated. How can you reliably model this? And even when you do know, how can you study the epidemiological behavior of a certain disease to a predictable level in real scenarios on the ground, when you sometimes have unpredictable events like wars causing massive changes in reality? Again, I am not saying that models are useless, but to predict the future?


Anyway. The proof of the pudding is in the eating: epistemology aside, doesn't the predictive ability of a model (where prediction is an aim) validate, at least practically, its usefulness?

Take meteorology. I don't think there is any doubt (correct me if I'm wrong) that weather forecasting has

About climate modeling I don't know anything at all. As for meteorology, at least in Liverpool the BBC cannot give an accurate rain forecast a few hours ahead ;) . Of this I am painfully aware.

by t-------------- on Fri Nov 21st, 2008 at 12:03:53 PM EST
[ Parent ]
Since you are working in population genetics, it would seem possible to devise straightforward experiments with inexpensive test animals in which some of the assumptions, such as randomness of mating, can be directly controlled. Then you could run the computer simulation using the standard assumptions, do the live experiment while forcing patterns that conflict with the assumptions in specific ways (such as a single dominant male performing 90% of the mating in the population, or typically monogamous mating), and do the live experiment again forcing conditions that comply with the assumptions. Do the analysis and then compare the results.

Am I totally misunderstanding the nature of the problem?  If such an empirical procedure is possible but has not been done, that would be the basis for a critique of the entire discipline.  However, caution should possibly be exercised, depending on the attitudes of your doctoral committee, until after you have your degree. --or tenure?

"It is not necessary to have hope in order to persevere."

by ARGeezer (ARGeezer a in a circle eurotrib daught com) on Sat Nov 22nd, 2008 at 01:46:06 AM EST
I think there are two issues at play here. And in population genetics, your suggestion doesn't apply.

In PopGen the issue is not mistrust of the model when the assumptions hold. I would find it completely revolutionary if results were different from the expected theoretical results in a completely controlled experiment. Actually, I think Mendel did what you want for traits (which, by sheer luck, mapped in a clear way to genes).
The problem is that in real life there are massive deviations from the models that are constructed, and a) many of those deviations are unknown (in many cases you don't know, say, the mating behavior) and b) their effects are not properly quantified. But if you know everything about your species and its history (mating, behavior, population structure), you can construct a model which should provide realistic results. The underlying genetic mechanisms are reasonably well known.

Another, different problem is when you know very little about the core properties of the system you want to model. For instance, for many drugs the pharmacological properties are not known. So any model that is created involves a much bigger amount of speculation (as the core behavior is not known).

by t-------------- on Mon Nov 24th, 2008 at 01:07:12 PM EST
[ Parent ]
So, real data is compared against simulated data to derive conclusions.
Just a quick point about the above sentence. This practice is no different in essence from comparing real data with a completely solved theoretical model. The role of the simulation (and the confidence intervals computed from it etc) should be to numerically compute theoretically interesting quantities which are otherwise impossible to calculate explicitly.

This is not the same as comparing the results of a simulation directly with reality. Rather, the simulation should be an approximation of the theoretical model (which is unnecessary if the model can be solved), and the real data should always be analysed against the theoretical model.

In this sense, the intrinsic realism of the simulation is not important, what is important is to simulate the theory correctly, because it is the theory which is falsifiable and which brings (potential) understanding.
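A minimal illustration of that distinction, in the coalescent setting used in the diary (standard neutral coalescent, nothing more): the simulation below is only a numerical stand-in for a theoretical quantity, the expected time to the most recent common ancestor of a sample, and because in this simple case the theory can also be solved exactly, the two can be checked against each other.

```python
import random

random.seed(0)

def t_mrca(n):
    """Time to the most recent common ancestor of n lineages under the standard
    neutral coalescent, in units of 2N generations."""
    t, k = 0.0, n
    while k > 1:
        pairs = k * (k - 1) / 2            # each pair of lineages coalesces at rate 1
        t += random.expovariate(pairs)     # waiting time until the next coalescence
        k -= 1
    return t

n = 10
reps = 100_000
estimate = sum(t_mrca(n) for _ in range(reps)) / reps
print(f"simulated  E[T_MRCA], n={n}: {estimate:.3f}")
print(f"closed form 2*(1 - 1/n):     {2 * (1 - 1 / n):.3f}")
```

The point being that the data would be judged against the theoretical quantity; the simulation is only there because, once the model stops being this simple, no closed form exists.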

--
$E(X_t|F_s) = X_s,\quad t > s$

by martingale on Sun Nov 23rd, 2008 at 02:16:32 AM EST
FWIW, these are crucial points to limit the boundaries of casual discussions and future research parameters, I think. Limits to hypothesis, however refined, ought to be set in order to distinguish scientific method from fantastical descriptions of phenomenology.

It is one thing to prove a model ("to simulate the theory correctly"). It is altogether another to realize and reproduce a model by material means.


Diversity is the key to economic and political evolution.

by Cat on Mon Nov 24th, 2008 at 03:33:37 AM EST
[ Parent ]
Strictly speaking the theory is OK (at least in the population genetics example).

The issue is more one of departures from assumptions and of sensitivity analysis.

There is a displacement between the model and reality. Say, we know that humans don't mate randomly, but random mating is used in most models. Are methods that rely on random mating robust to deviations from it?

Another, related issue is that we don't know some parameters. Even if we were to replace random mating, what should we put in its place? Maybe for humans we could agree on something better for some parameters (as we know quite a lot about that species), but for many species we don't know.

by t-------------- on Mon Nov 24th, 2008 at 01:17:41 PM EST
[ Parent ]
But isn't it so that, with sufficiently large samples, 'chosen' as opposed to random mating becomes statistically or actuarially irrelevant - depending on what you are studying?

The whole point, I thought, of modelling was to come up with a coarse-grained insight into what is happening to the system. There is some limit of fine-grainedness in all models: like the limits we use all the time - for instance we see people as entities, not as bunches of atoms. It helps in navigation.

I don't see the difference between the grain of models and the grain of photographs. One can overcompress and deresolve photographs such that all meaning disappears OR increase resolution until, depending on how it is going to be viewed, further increase adds nothing to clarity.

That gives you a window of resolution for displaying any photograph - and the same applies to models, of which a photograph is a subset.

That window is the area in which models can impart meaning.

You can't be me, I'm taken

by Sven Triloqvist on Mon Nov 24th, 2008 at 01:37:48 PM EST
[ Parent ]

But isn't it so that, with sufficiently large samples, 'chosen' as opposed to random mating becomes statistically or actuarially irrelevant - depending on what you are studying?

"Depending on what you are studying" is a good point here. In some cases it is relevant, in others it is not. In which cases it is relevant? We mostly don't know.


The whole point, I thought, of modelling was to come up with a coarse-grained insight into what is happening to the system. There is some limit of fine-grainedness in all models: like the limits we use all the time - for instance we see people as entities, not as bunches of atoms. It helps in navigation.

The point is not a reductionist argument. I think it is, if anything, more of the opposite.
Most of these models present themselves as a "precise and rigorous" approach when you cannot have that: rigorous computations are made on top of massive unknowns and presented as almost hard science because the method is precise, but they are based on, at best, speculative assumptions.

Taking your example, it is as if wrong and/or unknown models of atoms were used to run simulations that draw conclusions about sociology. It is reductionism based on bad models, with precise computation devices used to give the conclusions the authority of mathematics.

Or, taking your photo example, it is as if, starting from a very bad photo, one tried to use the best methods to enhance it well beyond the information that is actually there.

Your photo metaphor actually has a severe problem: when you look at a photo, you can easily see whether it is appropriate to your needs or not; when you look at some of this data, theory and these models, you really don't know whether they are appropriate for your needs, as the problem is far too complex.

I would be much more comfortable with rougher models and rougher methods. But of course, that would make things less authoritative and conclusive and would diminish the importance of people doing theoretical work.

Theoretical work that is sometimes used to make important decisions on things like drug deployment policies or how to intervene to save endangered species.

by t-------------- on Tue Nov 25th, 2008 at 05:29:34 AM EST
[ Parent ]

