I’m trying to upload HAPMAP LD data into MySQL. The table is named CEU after the population group. Here is the description:
+------------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+-------------+------+-----+---------+-------+ | chromosome | varchar(2) | | | 0 | | | marker1 | int(11) | | | 0 | | | marker2 | int(11) | | | 0 | | | population | varchar(5) | | PRI | | | | rsmarker1 | varchar(12) | | PRI | | | | rsmarker2 | varchar(12) | | PRI | | | | dprime | double | | | 0 | | | rsquare | double | | | 0 | | | LOD | double | | | 0 | | | fbin | int(11) | | | 0 | | +------------+-------------+------+-----+---------+-------+
One file for each of twenty-two chromosomes, X and Y are not needed for this exercise. Chromosome one holds nearly 50 million rows of data in a 2.5GB file. This collection of files creates such a large and onerous footprint that there are very few desktop machines able to decompress the files, let alone manipulate them in some useful and timely way. You are invited to try this at home. (Data available here.) Problems like these call for appropriate machines and databases. Our database server is an eight processor, 32GB/RAM Sun Microsystems v880 with a fiber-channel attached 4.5TB Sun StorEdge array. It should be up to the task. But is the database?
The database trick will be to complete the import without blowing off the end of MySQL table limits. According to documentation there is a 4.2B row limit on MYISAM table types. But there is a 4GB table size limit I think we’ll run into first. Table status currently reads:
mysql> show table status like 'ceu' \G
*************************** 1. row ***************************
Name: ceu
Engine: MyISAM
Version: 9
Row_format: Dynamic
Rows: 11021586
Avg_row_length: 70
Data_length: 776406388
Max_data_length: 4294967295
Index_length: 124839936
Data_free: 0
Auto_increment: NULL
Create_time: 2006-08-31 15:26:56
Update_time: 2006-08-31 21:07:54
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
One possible fix might be to alter the table with the following command:
mysql> alter table ceu max_rows = 200000000000 avg_row_length = 60 ;
Query OK, 11192144 rows affected (24 min 40.44 sec)
Records: 11192144 Duplicates: 0 Warnings: 0
mysql> show table status like 'ceu' \G
*************************** 1. row ***************************
Name: ceu
Engine: MyISAM
Version: 9
Row_format: Dynamic
Rows: 11249635
Avg_row_length: 70
Data_length: 792599672
Max_data_length: 1099511627775
Index_length: 139523072
Data_free: 0
Auto_increment: NULL
Create_time: 2006-08-31 21:11:00
Update_time: 2006-08-31 21:36:50
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options: max_rows=2431504384 avg_row_length=60
Comment:
1 row in set (0.01 sec)
This done I’m letting the import run over night. It may take some tweaking to get right. Once the data is loaded we’ll need to build indexes to speed the query process. I can already think of a number of things I would do differently if this were loaded into Oracle. I may end up in Oracle before this exercise is over. Regardless, it is an interesting test of databases at their limits. So far:
Turns out I don’t think I have enough room for all this data on the current partition. If I extrapolate from the currently installed data I will run out some time tomorrow afternoon.
New space has been found and I’m trying again. If this database gets larger than half a terabyte we’ll have to rethink using MySQL as the partition for MySQL is limited.
Here are the chromosomes currently or partially loaded:
mysql> select distinct(chromosome) from ceu order by chromosome; +------------+ | chromosome | +------------+ | 1 | | 2 | | 3 | | 4 | | 5 | | 6 | | 8 | | 9 | | 10 | | 11 | | 12 | | 13 | | 14 | | 15 | | 16 | | 17 | | 18 | | 19 | | 20 | | 21 | | 22 | +------------+ 21 rows in set (1 hour 15 min 39.11 sec)
Only one more chromosome to go. These imports should complete some time Tuesday.
Finished. This oversized table is complete:
mysql> select count(*) from ceu; +-----------+ | count(*) | +-----------+ | 696238513 | +-----------+ 1 row in set (0.01 sec)
Queries may look something like this:
mysql> SELECT count(*) FROM suzanne.ceu c where ( c.rsquare > 0.8 and c.rsmarker1 = "rs10473282" or c.rsmarker2 = "rs10473282" ); +----------+ | count(*) | +----------+ | 179 | +----------+ 1 row in set (1 hour 12 min 8.17 sec)
In an effort to speed things up a bit, I’ve added:
ALTER TABLE `suzanne`.`ceu` ADD INDEX `rsquare`(`rsquare`), ADD INDEX `dprime`(`dprime`);
It will take some time to build these indexes and then we’ll run the query again. Reading about other methods of modifying the MySQL database engine to bette accommodate data-sets of this size.
A natural partition of this data might be along the chromosome divide but that doesn’t support the type of questions we need to ask. The real utility of a database like this will be having all the parts in one place.
Where are X and Y? Not sure of the biological reasons for their omission.