Conference Paper Details
20th International Conference on Computing in High Energy and Nuclear Physics
ECFS: A decentralized, distributed and fault-tolerant FUSE filesystem for the LHCb online farm
Physics; Computer Science
Rybczynski, Tomasz^1,2 ; Bonaccorsi, Enrico^2 ; Neufeld, Niko^2
AGH University of Science and Technology, Cracow, Poland^1
CERN, Geneva, Switzerland^2
Keywords: Distributed file systems; Encoding algorithms; Fault-tolerant; File replication; Multiple servers; Proof of concept; Proton collisions; Write-once-read-many
Others  :  https://iopscience.iop.org/article/10.1088/1742-6596/513/4/042038/pdf
DOI  :  10.1088/1742-6596/513/4/042038
Subject classification: Computer Science (General)
Source: IOP
【 Abstract 】

The LHCb experiment records millions of proton collisions every second, but only a fraction of them are useful for LHCb physics. To filter out the «bad events», a large farm of x86 servers (∼2000 nodes) has been put in place. These servers boot and run from NFS, but they use their local disks to temporarily store data that cannot be processed in real time («data-deferring»). These events are processed later, when no live data is coming in. The effective CPU power is thus greatly increased. This gain in CPU power depends critically on the availability of the local disks. For cost and power reasons, mirroring (RAID-1) is not used, which leads to considerable operational headaches from failing disks, disk errors, and server failures induced by faulty disks. To mitigate these problems and increase the reliability of the LHCb farm, while at the same time keeping cost and power consumption low, existing highly available and distributed file systems were studied extensively. While many distributed file systems provide reliability through «file replication», none of the evaluated ones supports erasure algorithms. A decentralised, distributed and fault-tolerant «write once read many» file system has been designed and implemented as a proof of concept. Its main goals are to provide fault tolerance without file replication techniques, which are expensive in terms of disk space, and to provide a unique namespace. This paper describes the design and the implementation of the Erasure Codes File System (ECFS) and presents the specialised FUSE interface for Linux. Depending on the encoding algorithm, ECFS will use a certain number of target directories as a backend to store the segments that compose the encoded data. When the target directories are mounted via NFS/autofs, ECFS will act as a file system over a network/block-level RAID spanning multiple servers.
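
The idea of storing encoded segments instead of full replicas can be illustrated with a minimal sketch. The abstract does not name the encoding algorithms ECFS uses, so the code below assumes the simplest possible erasure code, a single XOR parity segment (as in RAID-4/5). The function names `encode` and `decode` are hypothetical and are not part of ECFS. With k data segments plus one parity segment, each of which could be written to a separate target directory, the storage overhead is (k+1)/k rather than the factor of 2 required by mirroring.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size segments and append one XOR parity segment.

    Assumption: single-parity coding, chosen only for illustration.
    """
    seg_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(seg_len * k, b"\0")  # zero-pad the last segment
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    parity = reduce(xor_bytes, segments)
    return segments + [parity]


def decode(segments: list[bytes | None], length: int) -> bytes:
    """Rebuild the original data when at most one segment is missing (None)."""
    missing = [i for i, s in enumerate(segments) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost segment")
    if missing:
        # XOR of all surviving segments reproduces the lost one.
        present = [s for s in segments if s is not None]
        segments = list(segments)
        segments[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity segment and the zero padding.
    return b"".join(segments[:-1])[:length]


payload = b"LHCb deferred event data"
stored = encode(payload, k=4)  # 5 segments, one per hypothetical target directory
stored[2] = None               # simulate one failed disk
assert decode(stored, len(payload)) == payload
```

A single parity segment tolerates only one lost segment. Surviving several simultaneous disk failures would need a stronger code such as Reed-Solomon, which is outside the scope of this sketch.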
