Saturday, October 14, 2006

The Second X Prize (http://genomics.xprize.org/) may bring radical new "wet science" breakthroughs; but where does all that data go?

A cursory look at ftp://ftp.ncbi.nih.gov/genbank/genomes/H_sapiens/ shows that the latest data weighs in at just over a Gig, compressed! I don't know if that contains annotations, SNPs, STSs, Cytogenetic Bands, etc. (Anyone care to give me some real world figures?)

The new X Prize requires the complete sequencing of 100 humans. At roughly 1GB per genome, that's about 100GB compressed. For decompression, add another 2X for temp working space, which brings the total to 300GB of data. A quick trip to Fry's gets you that for under $300 these days, and falling.

But now consider that for quality control, archival purposes, analysis, transmission, sharing, etc. you will almost certainly need multiple copies. OK, so round up and be very conservative. Assume for each batch of 100 human genomes, you will need 1TB of storage. Now we're getting a bit pricey, but still well within the reach of any individual with a few K dollars to burn.

As of today, there are ~300 million (documented) US residents. At 1TB per 100 genomes, that's 3M TB == 3 exabytes (3 x 10^18 bytes). Today, major financial datacenters top out in the petabyte range. Still, ignoring the technology advances needed, because they will exist soon enough, it will be a massive outlay. If disk costs halve every 2 years, in 5 years (the anticipated completion of the X Prize), the cost per TB for off the shelf storage could be as low as $100 USD. That's still $300M USD. Certainly possible given the coffers of a large Fortune company, a wealthy institution, or the US Government. Note that this is just for the raw storage; no labor, no electricity, no cooling, no repairs, computing power, etc. OK, round up to $1B and it is still a plausible US Government effort.
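For anyone who wants to check the arithmetic, here is a rough Python sketch of the estimate above. The starting price of $1,000 per TB is my own rounding of "300GB for under $300", and the strict halving formula is an assumption, not a forecast. Under strict halving every 2 years the 5-year price comes out nearer $177 per TB than $100, so the raw storage bill lands somewhere between $300M and roughly $530M. Either way it stays under the $1B round-up.

```python
# Back-of-envelope storage estimate using the figures quoted in the post.
# All constants are rough; COST_PER_TB_TODAY is an assumed rounding
# of "300GB for under $300" (about $1 per GB).

US_POPULATION = 300_000_000      # ~300 million residents
GENOMES_PER_TB = 100             # the post's conservative 1 TB per 100 genomes
COST_PER_TB_TODAY = 1000.0       # USD, assumed
HALVING_PERIOD_YEARS = 2         # "disk costs halve every 2 years"
YEARS_AHEAD = 5                  # anticipated completion of the X Prize
OPTIMISTIC_COST_PER_TB = 100.0   # the post's "as low as $100" figure


def projected_cost_per_tb(cost_today, years, halving_period):
    """Price per TB after `years`, if the price halves every `halving_period` years."""
    return cost_today * 0.5 ** (years / halving_period)


total_tb = US_POPULATION / GENOMES_PER_TB   # terabytes needed
total_eb = total_tb / 1_000_000             # 1 exabyte = 10^6 terabytes (decimal)

halving_cost_per_tb = projected_cost_per_tb(
    COST_PER_TB_TODAY, YEARS_AHEAD, HALVING_PERIOD_YEARS
)

optimistic_total = total_tb * OPTIMISTIC_COST_PER_TB
halving_total = total_tb * halving_cost_per_tb

if __name__ == "__main__":
    print(f"Storage needed: {total_tb:,.0f} TB ({total_eb:.1f} EB)")
    print(f"Cost per TB after {YEARS_AHEAD} years (strict halving): ${halving_cost_per_tb:,.2f}")
    print(f"Total at $100/TB:      ${optimistic_total:,.0f}")
    print(f"Total, strict halving: ${halving_total:,.0f}")
```

Change the constants to taste; the conclusion is not very sensitive to them, since even a 2X error in either direction leaves the total in the hundreds of millions.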

Now, for those of you who (like me) foresee no end of medical advances when we have such a trove of genetic data readily available...wait! readily available?...how are you going to disseminate all this data to the large number of academic and corporate scientists who could benefit from the data? Ever try downloading 1GB of data, or 1TB? Surely we can burn it to optical media and transport whole copies to various local repositories (at $1B a pop, give or take); but even then, you will need database tools to analyze it, software to crunch the data, graphics to visualize the data, etc. etc.
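To put a number on the download question, here is a small sketch that estimates transfer times. The link speeds are hypothetical examples I picked for illustration (a DSL-class line, fast Ethernet, gigabit), and the calculation ignores protocol overhead, congestion, and retries, so real transfers would be slower.

```python
# Idealized transfer-time estimate: size in bytes over a link rated in bits/second.
# The link speeds below are illustrative assumptions, not measurements.

GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes

ASSUMED_LINKS_BPS = {
    "1.5 Mbps (DSL-class)": 1.5e6,
    "100 Mbps": 100e6,
    "1 Gbps": 1e9,
}


def transfer_seconds(size_bytes, link_bits_per_second):
    """Seconds to move size_bytes over a link, assuming the full rated speed is sustained."""
    return size_bytes * 8 / link_bits_per_second


def humanize(seconds):
    """Render a duration in the largest sensible unit."""
    if seconds >= 86400:
        return f"{seconds / 86400:.1f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for label, size in (("1 GB", GB), ("1 TB", TB)):
        for link_name, bps in ASSUMED_LINKS_BPS.items():
            print(f"{label} over {link_name}: {humanize(transfer_seconds(size, bps))}")
```

Even on the fastest link assumed here, a single terabyte takes hours, and the full national data set is millions of terabytes.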

What's the point? There needs to be an effort to match the X Prize effort --- a "D" prize for storing, transporting, and using the data that results from the X Prize.

Now, what happens when the technology is available to instantly (in relative terms) sequence every Homo sapiens in the US? Who polices the use of this data? In California, they routinely (by law) take a blood sample of every child born, which is maintained in a bank under somewhat vague authority. I looked into it briefly when my daughter was born. There wasn't much publicly available info. I'm no conspiracy theorist, nor am I paranoid, but to me, the threat of losing one's private genetic data is of the utmost concern. Not panic, but genuine consideration.

If one thing is certain, this data will be gathered, and there will be infinitely wonderful changes in human existence because of it. However, there will also be criminal abuses of the technology. What laws and enforcement policies do we need in place to deal with the inevitable evils? Or maybe we'll find the genes responsible for such malevolence and eradicate them? Hmmm...I predict a new occupation: genomic attorney!

Saturday, October 07, 2006

Yahoo Mail APIs: Cathedral or Bazaar?

As best I can tell from what little info is available (noting that I was not a participant in Hack Day, or in any way part of the Hack cognoscenti), a primary difference between the free and premium API access will be that the free account APIs will not allow full email content access. I can only assume that this is to protect the revenue stream for premium features derived from full content access, e.g. email archive.

I can only assume that, in the short term, premium users represent so large a revenue source for Yahoo that it is willing to risk stifling long-term creativity. To be fair, it would be a difficult pro forma analysis: maximize the feature set available to developers who want to expand the utility of the platform vs. cannibalize current revenue streams. In a corporate setting, this is what we call a CLM (career-limiting move).

Yet, from the sidelines, I would assert that if a user has already made a choice to use free accounts, it seems unlikely that they would switch to premium for any reason relevant to the APIs. (Ah, something my microeconomics prof lectured about is buzzing in my head...but I think I was asleep that day...) Wouldn't it be more likely that users from other mail services would migrate to Yahoo --- both free and premium --- if more creative and innovative services were available? If anyone from Yahoo would care to send me the data, I would gladly develop a predictive model, gratis. (Not holding my breath.)

It certainly may be that some of the premium services such as archival could be re-implemented by 3rd parties; I would assert that rather than eroding Yahoo's business, it would open new --- and most likely unexpected --- services that increase the attractiveness of the platform. Admittedly, this is only an opinion and would benefit from some number crunching. If I have a point here, it is this: has anyone at Yahoo crunched those numbers, or is it just that no one wants to speak out in the cathedral down in Sunnyvale?

"If you have the right attitude, interesting problems will find you." -Eric Steven Raymond