| 17th International Workshop on Advanced Computing and Analysis Techniques in Physics Research | |
| Automated Finite State Workflow for Distributed Data Production | |
| Physics; Computer Science | |
| Hajdu, L.^1 ; Didenko, L.^1 ; Lauret, J.^1 ; Amol, J.^2,3 ; Betts, W.^1 ; Jang, H.J.^2 ; Noh, S.Y.^2,3 | |
| Software and Computing Group, RHIC/STAR Experiment, Brookhaven National Lab, PO Box 5000, Upton, NY 11973-5000, United States^1 | |
| Korea Institute of Science and Technology Information, 245 Daehangno, Yuseong, Daejeon 305-806, Korea, Republic of^2 | |
| Korea University of Science and Technology, Yuseong, Daejeon 305-350, Korea, Republic of^3 | |
| Keywords: Brookhaven national laboratory; Computing capacity; Distributed data; High-efficiency; Raw data files; Research programs; Software stacks; Statistical errors | |
| Others : https://iopscience.iop.org/article/10.1088/1742-6596/762/1/012006/pdf DOI : 10.1088/1742-6596/762/1/012006 | |
| Subject classification: Computer Science (General) | |
| Source: IOP | |
【 Abstract 】
In statistically hungry science domains, data deluges can be both a blessing and a curse. They allow the narrowing of statistical errors from known measurements, and open the door to new scientific opportunities as research programs mature. They are also a testament to the efficiency of experimental operations. However, growing data samples may need to be processed with little or no opportunity for huge increases in computing capacity. A standard strategy has thus been to share resources across multiple experiments at a given facility. Another has been to use middleware that "glues" resources across the world so they are able to locally run the experimental software stack (either natively or virtually). We describe a framework STAR has successfully used to reconstruct a ∼400 TB dataset consisting of over 100,000 jobs submitted to a remote site in Korea from STAR's Tier 0 facility at the Brookhaven National Laboratory. The framework automates the full workflow, taking raw data files from tape and writing Physics-ready output back to tape without operator or remote site intervention. Through hardening we have demonstrated 97(±2)% efficiency, over a period of 7 months of operation. The high efficiency is attributed to finite state checking with retries to encourage resilience in the system over capricious and fallible infrastructure.
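The idea of finite state checking with retries can be illustrated with a minimal Python sketch. It is not the STAR framework's code: the state names, the `run_job` function, the `flaky` helper, and the retry limit are all hypothetical, chosen only to mirror the workflow the abstract describes (raw data from tape, transfer to the remote site, reconstruction, return of output, archival to tape). Each job advances one state at a time, and a step that fails is retried a bounded number of times before the job is marked as failed.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class State(Enum):
    """Hypothetical job states; the real framework's states may differ."""
    NEW = 0
    STAGED = 1         # raw file restored from tape
    TRANSFERRED = 2    # raw file copied to the remote site
    RECONSTRUCTED = 3  # reconstruction job finished remotely
    RETURNED = 4       # output copied back to the home site
    ARCHIVED = 5       # output written to tape
    FAILED = 6


# Linear order in which a healthy job moves through the states.
ORDER = [State.NEW, State.STAGED, State.TRANSFERRED,
         State.RECONSTRUCTED, State.RETURNED, State.ARCHIVED]

# A step takes a job id and reports whether the target state was reached.
Step = Callable[[str], bool]


def run_job(job_id: str, steps: Dict[State, Step],
            max_retries: int = 3) -> Tuple[State, List[State]]:
    """Advance one job through ORDER, retrying each step up to max_retries."""
    state = State.NEW
    history = [state]
    for target in ORDER[1:]:
        for _attempt in range(max_retries):
            if steps[target](job_id):
                state = target
                history.append(state)
                break
        else:
            # Every attempt failed: record the failure and stop this job.
            history.append(State.FAILED)
            return State.FAILED, history
    return state, history


def flaky(fail_times: int) -> Step:
    """Build a step that fails `fail_times` times, then succeeds (test aid)."""
    calls = {"n": 0}

    def step(job_id: str) -> bool:
        calls["n"] += 1
        return calls["n"] > fail_times

    return step


# Usage: every step succeeds at once except the transfer, which fails twice.
steps: Dict[State, Step] = {s: (lambda job_id: True) for s in ORDER[1:]}
steps[State.TRANSFERRED] = flaky(2)
final, history = run_job("job-0001", steps)
print(final.name, [s.name for s in history])
```

In this sketch a transient fault (here, two failed transfers) is absorbed by the retry loop, while a step that keeps failing leaves the job in an explicit `FAILED` state rather than an undefined one.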