The HOSS Project Abstract

The Heterogeneous Operating System Software (HOSS) project is building operating system software for high-performance, heterogeneous clusters. The intent of this research is to develop mechanisms that enable high-level software such as applications, message-passing libraries, and runtime systems to better manage resources through the operating system kernel. The resulting operating system kernel will provide management functions that allow middleware to predict the performance ramifications of policy choices and allow runtime systems to make quality-of-service guarantees to application programs.

The advent of multicomputers composed of cheap, commodity workstations connected by high-speed networks has opened a new era in high-performance computing. Unfortunately, while programming environments such as MPI, SIMPLE, and Legion are foci of current software development, operating system research in such environments has lagged. Although these clustered environments are capable of sustaining computation rates that rival or surpass those of conventional supercomputers, the performance delivered to scientific applications is only a fraction of this maximum.

The key to understanding this disappointing performance lies in examining the layered software that resides between the application and the hardware comprising the cluster. At the lowest level, the operating system runs on the bare hardware. Middleware, such as runtime systems and message-passing libraries, runs over the operating system, and applications run in the highest layer. Poorly or naively implemented software at any of these levels can contribute to software overhead, and unforeseen interactions between the layers can compound this problem. We intend to overcome this problem by providing kernel mechanisms that describe operating system behavior to the higher layers.

This description will include the costs of expanding or contracting address spaces, adding threads to programs, and migrating threads. The result of this work will be an operating system that gives middleware and applications more flexible resource control, thereby delivering a higher percentage of the underlying system performance.