TY - JOUR
T1 - ProteomeGRID
T2 - Towards a high-throughput proteomics pipeline through opportunistic cluster image computing for two-dimensional gel electrophoresis
AU - Dowsey, Andrew W.
AU - Dunn, Michael J.
AU - Yang, Guang Zhong
PY - 2004/12
Y1 - 2004/12
N2 - The quest for high-throughput proteomics has revealed a number of critical issues. Whilst improved two-dimensional gel electrophoresis (2-DE) sample preparation, staining and imaging issues are being actively pursued by industry, reliable high-throughput spot matching and quantification remains a significant bottleneck in the bioinformatics pipeline, thus restricting the flow of data to mass spectrometry through robotic spot excision and protein digestion. To this end, it is important to establish a full multi-site Grid infrastructure for the processing, archival, standardisation and retrieval of proteomic data and metadata. Particular emphasis needs to be placed on large-scale image mining and statistical cross-validation for reliable, fully automated differential expression analysis, and the development of a statistical 2-DE object model and ontology that underpins the emerging HUPO PSI GPS (Human Proteome Organization Proteomics Standards Initiative General Proteomics Standards). The first step towards this goal is to overcome the computational and communications burden entailed by the image analysis of 2-DE gels with Grid-enabled cluster computing. This paper presents the proTurbo framework as part of the ProteomeGRID, which utilises Condor cluster management combined with CORBA communications and JPEG-LS lossless image compression for task farming. A novel probabilistic eager scheduler has been developed to minimise make-span, where tasks are duplicated in response to the likelihood of the Condor machines' owners evicting them. A 60-gel experiment was pair-wise image registered (3540 tasks) on a 40-machine Linux cluster. Real-world performance and network overhead were gauged, and Poisson-distributed worker evictions were simulated. Our results show a 4:1 lossless and 9:1 near-lossless image compression ratio, and so network overhead did not affect other users. With 40 workers a 32-fold speed-up was seen (80% resource efficiency), and the eager scheduler reduced the impact of evictions by 58%.
AB - The quest for high-throughput proteomics has revealed a number of critical issues. Whilst improved two-dimensional gel electrophoresis (2-DE) sample preparation, staining and imaging issues are being actively pursued by industry, reliable high-throughput spot matching and quantification remains a significant bottleneck in the bioinformatics pipeline, thus restricting the flow of data to mass spectrometry through robotic spot excision and protein digestion. To this end, it is important to establish a full multi-site Grid infrastructure for the processing, archival, standardisation and retrieval of proteomic data and metadata. Particular emphasis needs to be placed on large-scale image mining and statistical cross-validation for reliable, fully automated differential expression analysis, and the development of a statistical 2-DE object model and ontology that underpins the emerging HUPO PSI GPS (Human Proteome Organization Proteomics Standards Initiative General Proteomics Standards). The first step towards this goal is to overcome the computational and communications burden entailed by the image analysis of 2-DE gels with Grid-enabled cluster computing. This paper presents the proTurbo framework as part of the ProteomeGRID, which utilises Condor cluster management combined with CORBA communications and JPEG-LS lossless image compression for task farming. A novel probabilistic eager scheduler has been developed to minimise make-span, where tasks are duplicated in response to the likelihood of the Condor machines' owners evicting them. A 60-gel experiment was pair-wise image registered (3540 tasks) on a 40-machine Linux cluster. Real-world performance and network overhead were gauged, and Poisson-distributed worker evictions were simulated. Our results show a 4:1 lossless and 9:1 near-lossless image compression ratio, and so network overhead did not affect other users. With 40 workers a 32-fold speed-up was seen (80% resource efficiency), and the eager scheduler reduced the impact of evictions by 58%.
KW - Automated pipeline
KW - Bioinformatics
KW - GRID
KW - Lossless image compression
KW - Task farming cluster computing
UR - http://www.scopus.com/inward/record.url?scp=10644290213&partnerID=8YFLogxK
U2 - 10.1002/pmic.200300894
DO - 10.1002/pmic.200300894
M3 - Article (Academic Journal)
C2 - 15478217
AN - SCOPUS:10644290213
SN - 1615-9853
VL - 4
SP - 3800
EP - 3812
JO - Proteomics
JF - Proteomics
IS - 12
ER -