SERT: Scale-free, Energy-aware, Resilient and Transparent Adaptation of CSE Applications to Mega-core Systems

Title
SERT: Scale-free, Energy-aware, Resilient and Transparent Adaptation of CSE Applications to Mega-core Systems

CoPED ID
88f04be2-1229-493c-b899-e00134c08e7d

Status
Closed

Funders

Value
£1,927,856

Start Date
March 31, 2015

End Date
September 29, 2018

Description


Moore's Law and Dennard scaling have led to dramatic performance increases in microprocessors, the basis of modern supercomputers, which consist of clusters of nodes that include microprocessors and memory. This design is deeply embedded in parallel programming languages, the runtime systems that orchestrate parallel execution, and computational science applications.

Some deviations from this simple, symmetric design have occurred over the years, but transistor scaling has now been pushed so far that simplicity is giving way to complex architectures. The breakdown of Dennard scaling, which has not held for about a decade, and the atomic dimensions of transistors have profound implications for the architecture of current and future supercomputers.

Scalability limitations will arise from insufficient data access locality. Exascale systems will have up to 100x more cores and commensurately less memory space and bandwidth per core. However, in-situ data analysis, motivated by decreasing file system bandwidths, will increase the memory footprints of scientific applications. Thus, we must improve per-core data access locality and reduce contention and interference for shared resources.

Energy constraints will fundamentally limit the performance and reliability of future large-scale systems. These constraints lead many to predict a phenomenon of "dark silicon" in which half or more of the transistors on each chip must be powered down for safe operation. Low-power processor technologies based on sub-threshold or near-threshold voltage operation are a viable alternative. However, these techniques dramatically decrease the mean time to failure at scale and, thus, require new paradigms to sustain throughput and correctness.

Non-deterministic performance variation will arise from design process variation that leads to asymmetric performance and power consumption in architecturally symmetric hardware components. The manifestations of the asymmetries are non-deterministic and can vary with small changes to system components or software. This performance variation produces non-deterministic, non-algorithmic load imbalance.
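The load imbalance described above can be illustrated with a minimal sketch. It assumes a hypothetical set of equal-cost tasks and a hypothetical list of relative core speeds, and compares a static block partition with a simple "idle core takes the next task" policy. The function names and numbers are illustrative only and are not taken from the project's software.

```python
import heapq

def static_makespan(task_costs, core_speeds):
    """Give each core one contiguous block of tasks; return the slowest finish time."""
    n_cores = len(core_speeds)
    block = -(-len(task_costs) // n_cores)  # ceiling division
    finish_times = []
    for i, speed in enumerate(core_speeds):
        work = sum(task_costs[i * block:(i + 1) * block])
        finish_times.append(work / speed)
    return max(finish_times)

def dynamic_makespan(task_costs, core_speeds):
    """Hand each task to whichever core becomes idle first; return the last finish time."""
    # Min-heap of (time at which the core becomes idle, core index).
    idle = [(0.0, i) for i in range(len(core_speeds))]
    heapq.heapify(idle)
    end = 0.0
    for cost in task_costs:
        t, i = heapq.heappop(idle)
        t += cost / core_speeds[i]
        end = max(end, t)
        heapq.heappush(idle, (t, i))
    return end

# Hypothetical example: four nominally identical cores, one of which runs at half speed.
tasks = [1.0] * 8
speeds = [1.0, 1.0, 1.0, 0.5]
print(static_makespan(tasks, speeds))   # 4.0: every core waits for the slow one
print(dynamic_makespan(tasks, speeds))  # 3.0: fast cores absorb the extra work
```

In this toy case the algorithm itself is perfectly balanced, so the entire gap between the two results comes from the hardware asymmetry, which is what makes the imbalance non-algorithmic.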

Reliability limitations will stem from the massive number of system components, which proportionally reduces the mean time to failure, but also from component wear and from low-voltage operation, which introduces timing errors. Infrastructure-level power capping may also compromise application reliability or create severe load imbalances.
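The proportional reduction in mean time to failure can be made concrete with a short calculation. The sketch below assumes that components fail independently with exponentially distributed lifetimes, in which case the system-level mean time to failure is the component value divided by the component count. The function name and the figures are hypothetical.

```python
def system_mttf_hours(component_mttf_hours, n_components):
    """System MTTF under the assumption of independent, exponentially distributed failures."""
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return component_mttf_hours / n_components

# Hypothetical figures: components that each fail once per million hours on average.
print(system_mttf_hours(1_000_000.0, 1_000))      # 1000.0 hours
print(system_mttf_hours(1_000_000.0, 1_000_000))  # 1.0 hour
```

Under this assumption, growing a system from a thousand to a million components turns a failure every few weeks into a failure every hour, before wear and low-voltage timing errors are taken into account.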

The impact of these changes on technology will travel as a shockwave throughout the software stack. For decades, we have designed computational science applications based on very strict assumptions that performance is uniform and processors are reliable. In the future, hardware will behave unpredictably, at times erratically. Software must compensate for this behavior.

Our research anticipates this future hardware landscape. Our ecosystem will combine binary adaptation, code refactoring, and approximate computation to prepare CSE applications. We will provide them with scale-freedom - the ability to run well at scale under dynamic execution conditions - with at most limited, platform-agnostic code refactoring. Our software will provide automatic load balancing and concurrency throttling to tame non-deterministic performance variations. Finally, our new form of user-controlled approximate computation will enable execution of CSE applications on hardware with low supply voltages, or any form of faulty hardware, by selectively dropping or tolerating erroneous computation that arises from unreliable execution, thus saving energy. Cumulatively, these tools will enable non-intrusive reengineering of major computational science libraries and applications (2DRMP, Code_Saturne, DL_POLY, LB3D) and prepare them for the next generation of UK supercomputers. The project partners with NAG, a leading UK HPC software and service provider.
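One way to picture "selectively dropping erroneous computation" is a reduction that discards partial results failing a user-supplied check, and gives up if too many are discarded. The sketch below is an illustration under that assumption only. It is not the project's Approximator tool, and the names tolerant_mean, is_valid and max_drop_fraction are hypothetical.

```python
import math

def tolerant_mean(partial_results, is_valid=math.isfinite, max_drop_fraction=0.1):
    """Average partial results, dropping those that fail the user-supplied validity check.

    Returns (mean of the kept results, number of results dropped).
    Raises RuntimeError if nothing is left or if more than the allowed fraction is dropped.
    """
    kept = [r for r in partial_results if is_valid(r)]
    dropped = len(partial_results) - len(kept)
    if not kept or dropped > max_drop_fraction * len(partial_results):
        raise RuntimeError("too many erroneous results to approximate safely")
    return sum(kept) / len(kept), dropped

# Hypothetical example: one of four partial results was corrupted into NaN.
mean, dropped = tolerant_mean([1.0, 2.0, float("nan"), 3.0], max_drop_fraction=0.25)
print(mean, dropped)  # 2.0 1
```

The drop threshold is what makes the approximation user-controlled in this sketch: the application states how much erroneous work it is willing to lose before the result is no longer acceptable.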


More Information

Potential Impact:
The project will achieve commercial impact through the development of production-level Computational Science and Engineering software that will catalyse performance and productivity in applications within the EPSRC remit; industrial engagement with UK and international stakeholders, in particular through membership of project partners in the European Technology Platform for HPC (ETP4HPC); exploration of the potential to receive follow-on funding and create spin-out companies with instruments such as the Impact Acceleration Account at Queen's Belfast; and the organisation of an industrial workshop. The project will achieve further economic impact through better utilisation and reduction of the total cost of ownership of the major UK supercomputing infrastructures and improved productivity in sectors of the UK high-technology economy that depend on HPC.

The project will achieve academic impact by publishing results in the very best journals and conferences across the areas of high performance computing, computational science, scientific computing, programming languages and computer architecture. All publications will follow Green or Gold open access routes, the former leveraging institutional publication repositories and the latter institutional funding. All software developed in the project will be open-sourced, with associated training provision in the form of tutorials and short modules. Further academic impact will be achieved via exchange visits and demonstration sessions with project partner NAG, ClusterVision, and other HPC vendors and groups in the UK.

Societal impact will be achieved through prominent presence in social media (Web 2.0, LinkedIn, Twitter and YouTube Channels) to disseminate the results to professionals and the general public. Further societal presence will be achieved through distribution of news articles, press releases, and video presentations. The project will develop software technologies for emerging many-core systems, a skill which is highly marketable.

The project follows a comprehensive software management plan. It will produce three software outputs (Adaptor, RightSizer, Approximator), licensed under the GPL. The tools will be developed, tested and maintained in a GitLab software repository, with the associated Git revision control system hosted by Queen's Belfast and shared between the project partners. The software will run at user level and will not require interventions to the host operating system, since such interventions would prevent its deployment on the target systems (ARCHER, BlueJoule, NextScale, Titan). It will be based on the GNU stack for maximum portability across current and future platforms. The software will support and be compatible with widely used parallel programming models (MPI, OpenMP, OpenCL) and libraries (MAGMA, PLASMA, ATLAS). Source code changes in MPI, OpenMP and OpenCL, where needed, will be made feasible by adopting open-source implementations (e.g. Open MPI, PoCL, GOMP).

The software will be released to and hosted for the public by Queen's Belfast during the course of the project, and later by STFC for production use on the targeted supercomputers. The GitLab repository that will house the software at QUB is well tested and already provides support for code development, maintenance, revision control and testing in nine large-scale software development projects (EPSRC, FP7/H2020, and industry-led), involving 28 research groups in the UK, Germany, Switzerland, Sweden, Greece, Austria, Ireland and the US, and totalling hundreds of KLOC in C/C++ parallel code. We will use Doxygen for formal code documentation, DokuWiki for informal documentation and discussion among developers, and Bugzilla for bug tracking. We will use nightly builds and regression tests. A permanent research engineer funded from Queen's will undertake the role of software maintenance and quality control manager and will be responsible for maintaining the highest coding and documentation

Subjects by relevance
  1. Computer programmes
  2. Information technology
  3. Programming
  4. Microprocessors
  5. Computers

Extracted key phrases
  1. Scale software development project
  2. Scale system
  3. Sert
  4. UK hpc software
  5. Comprehensive software management plan
  6. Software technology
  7. Core system
  8. Software repository
  9. Deterministic performance variation
  10. Software stack
  11. Software maintenance
  12. Software output
  13. GIT revision control system
  14. Core datum access locality
  15. Project partner NAG
