Saturday, October 14, 2006

The Second X Prize (http://genomics.xprize.org/) may bring radical new "wet science" breakthroughs; but where does all that data go?

A cursory look at ftp://ftp.ncbi.nih.gov/genbank/genomes/H_sapiens/ shows that the latest data weighs in at just over a Gig, compressed! I don't know if that contains annotations, SNPs, STSs, Cytogenetic Bands, etc. (Anyone care to give me some real world figures?)

The new X Prize requires the complete sequencing of 100 humans. At just over 1GB each, that's roughly 100GB compressed. For decompression, add another 2X for temp working space, which brings the total to 300GB of data. A quick trip to Fry's gets you that for under $300 these days, and falling.

But now consider that for quality control, archival purposes, analysis, transmission, sharing, etc. you will almost certainly need multiple copies. OK, so round up and be very conservative. Assume for each batch of 100 human genomes, you will need 1TB of storage. Now we're getting a bit pricey, but still well within the reach of any individual with a few K dollars to burn.

As of today, there are ~300 million (documented) US residents. At 1TB per 100 genomes, that's 3M TB == 3 exabytes (3 x 10^18 bytes). Today, major financial datacenters top out in the petabyte range. Even ignoring the technology advances needed, because they will exist soon enough, it will be a massive outlay. If disk costs halve every 2 years, in 5 years (the anticipated completion of the X Prize), the cost per TB for off the shelf storage could be as low as $100 USD. That's still $300M USD. Certainly possible given the coffers of a large Fortune company, a wealthy institution, or the US Government. Note that this is just for the raw storage; no labor, no electricity, no cooling, no repairs, no computing power, etc. OK, round up to $1B and it is still a plausible US Government effort.
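For anyone who wants to poke at these numbers, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, and the $1,000/TB starting price is my own assumption, extrapolated from the "300GB for under $300" figure above; everything else comes straight from the estimates in this post.

```python
# Back-of-the-envelope storage and cost estimate for the figures in this post.
# All names here are hypothetical helpers, not part of any library.

def storage_tb(population, tb_per_100_genomes=1.0):
    """Total storage in TB, using the post's 1 TB per batch of 100 genomes."""
    return population / 100 * tb_per_100_genomes

def projected_cost_per_tb(cost_today, years, halving_period_years=2.0):
    """Cost per TB after `years`, if prices halve every `halving_period_years`."""
    return cost_today / 2 ** (years / halving_period_years)

US_POPULATION = 300_000_000      # ~300 million residents (from the post)
COST_PER_TB_TODAY = 1000.0       # ASSUMPTION: ~$1/GB, from "300GB for under $300"

total_tb = storage_tb(US_POPULATION)           # 3,000,000 TB
total_bytes = total_tb * 10**12                # 3 x 10^18 bytes = 3 exabytes

cost_in_5_years = projected_cost_per_tb(COST_PER_TB_TODAY, years=5)
outlay_at_100_per_tb = total_tb * 100.0        # the post's optimistic $100/TB

print(f"Total storage: {total_tb:,.0f} TB ({total_bytes:.1e} bytes)")
print(f"Projected cost per TB in 5 years: ${cost_in_5_years:,.0f}")
print(f"Raw storage outlay at $100/TB: ${outlay_at_100_per_tb:,.0f}")
```

Under a strict halving every 2 years, $1,000/TB works out to something closer to $177/TB after 5 years, so the $100 figure assumes prices fall somewhat faster than that. Either way the raw storage lands in the hundreds of millions of dollars, which doesn't change the conclusion.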

Now, for those of you who (like me) foresee no end of medical advances when we have such a trove of genetic data readily available...wait! readily available?...how are you going to disseminate all this data to the large number of academic and corporate scientists who could benefit from the data? Ever try downloading 1GB of data, or 1TB? Surely we can burn it to optical media and transport whole copies to various local repositories (at $1B a pop, give or take); but even then, you will need database tools to analyze it, software to crunch the data, graphics to visualize the data, etc. etc.
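To put some numbers on the download question, here is a quick sketch of idealized transfer times. The link speeds are assumptions I picked for illustration (a 1.5 Mbit/s DSL-class line, a 100 Mbit/s LAN-class link, and a 1 Gbit/s link), and the calculation ignores protocol overhead, congestion, and retries, so real transfers would be slower.

```python
# Idealized transfer-time estimate: size in bytes over a link in bits/second.
# Link speeds below are ASSUMED example values, not measurements.

def transfer_seconds(size_bytes, bits_per_second):
    """Seconds to move `size_bytes` over a link, with no overhead at all."""
    return size_bytes * 8 / bits_per_second

SIZES = {
    "1 GB": 10**9,
    "1 TB": 10**12,
    "3 EB": 3 * 10**18,   # the whole-US estimate from this post
}

LINKS = {
    "1.5 Mbit/s": 1.5e6,
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
}

SECONDS_PER_DAY = 86_400

for size_name, size_bytes in SIZES.items():
    for link_name, bps in LINKS.items():
        days = transfer_seconds(size_bytes, bps) / SECONDS_PER_DAY
        print(f"{size_name} over {link_name}: {days:,.2f} days")
```

Even on the fastest of these hypothetical links, the full 3 exabytes would take centuries to move, which is why shipping physical media to local repositories looks like the only realistic option.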

What's the point? There needs to be an effort to match the X Prize effort: a "D" prize for storing, transporting, and using the data that results from the X Prize.

Now, what happens when the technology is available to instantly (in relative terms) sequence every Homo sapiens in the US? Who polices the use of this data? In California, they routinely (by law) take a blood sample of every child born, which is maintained in a bank under somewhat vague authority. I looked into it briefly when my daughter was born. There wasn't much publicly available info. I'm no conspiracy theorist, nor am I paranoid, but to me, the threat of losing one's private genetic data is of the utmost concern. It calls for genuine consideration, not panic.

If one thing is certain, this data will be gathered, and there will be infinitely wonderful changes in human existence because of it. However, there will also be criminal abuses of the technology. What laws and enforcement policies do we need in place to deal with the inevitable evils? Or maybe we'll find the genes responsible for such malevolence and eradicate them? Hmmm...I predict a new occupation: genomic attorney!
