EUFRATE Hardening techniques for commercial-off-the-shelf FPGA for digital telecommunication payloads

  • Status
    Ongoing
  • Status date
    2024-10-22
  • Activity Code
    5C.416
Objectives

The EUFRATE project focuses on leveraging Field Programmable Gate Arrays (FPGAs) to develop adaptable, reliable, and cost-effective solutions for mission-critical applications, particularly in space missions and satellite communications. The project emphasizes several key objectives:

  • Flexibility and Reconfigurability: Utilizing reconfigurable FPGAs, the project aims to enable in-orbit updates and error correction through dynamic partial reconfiguration (DPR), essential for adapting to evolving mission needs without physical access to hardware.

  • Dependability: By employing commercial-off-the-shelf (COTS) FPGAs, the project seeks to ensure system reliability in harsh environments, using innovative error correction and fault tolerance techniques to maintain operations even in the event of hardware failures.

  • Cost-Effectiveness: The project balances custom FPGA designs with the practicality of COTS FPGAs, delivering high-performance systems that are both efficient and economical.

  • Scalability: Developing a scalable FPGA-based computing cluster, EUFRATE targets large-scale computing applications, enabling flexible communication and efficient task distribution within the cluster.

  • Satellite Telecommunications: Advancements in satellite communications are pursued through Software-Defined Radios (SDRs) on FPGAs, aiming to create adaptable and cost-effective communication systems.

Overall, EUFRATE aims to address current and future challenges in space missions and satellite communications through innovative FPGA-based solutions.

Challenges

Key challenges encompassed both technical and organizational aspects. Technically, implementing Dynamic Partial Reconfiguration, ensuring seamless communication within the four-board cluster, and achieving the required number of Programmable Functional Units to meet stringent performance criteria were significant hurdles. Organizationally, coordinating efforts across three distinct workgroups and facilitating efficient information exchange were demanding. Both sets of challenges were successfully managed, contributing to the overall success of the project.

Benefits

The proposed solution offers significant advantages over existing systems by leveraging the flexibility and power of Commercial-Off-The-Shelf (COTS) FPGAs, specifically in the challenging environment of Geostationary Earth Orbit (GEO). Traditional rad-hard systems often suffer from limited performance and high costs. In contrast, this solution integrates multiple FPGAs into a coordinated cluster, allowing functionalities to be distributed across different devices, which enhances redundancy and fault tolerance. Radiation-hardening techniques, such as Triple Modular Redundancy (TMR) and memory scrubbing, are applied at the board level, improving reliability without the need for specialized radiation-hardened components. This approach not only mitigates the risks associated with Single Event Effects (SEEs) but also keeps the system operational in the presence of radiation-induced faults. The use of high-speed protocols for interconnecting FPGAs reduces latency and overhead, while a built-in system for continuous device health monitoring ensures that faulty elements are quickly replaced by spares, maintaining uninterrupted service. The scalability of this FPGA cluster architecture allows it to be adapted for increasingly complex tasks, making it a versatile and cost-effective solution for the high-performance demands of modern space missions, especially in telecommunications.
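The idea behind TMR and scrubbing can be illustrated with a minimal sketch. It is not the project's implementation: the word-level granularity, the function names majority_vote and scrub, and the injected bit flip are all assumptions made for illustration. Three copies of a value are kept, a bitwise two-out-of-three vote masks a single upset, and scrubbing rewrites the corrupted copy so that errors do not accumulate.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three replica words."""
    return (a & b) | (a & c) | (b & c)


def scrub(replicas: list[int]) -> tuple[int, list[int]]:
    """Vote on three replicas and rewrite any copy that disagrees.

    Returns the voted word and the indices of the copies that were repaired.
    """
    voted = majority_vote(*replicas)
    repaired = [i for i, word in enumerate(replicas) if word != voted]
    for i in repaired:
        replicas[i] = voted  # overwrite the corrupted copy with the voted value
    return voted, repaired


# Hypothetical example: one replica suffers a single-bit upset.
replicas = [0b1011_0010, 0b1011_0010, 0b1011_0010]
replicas[1] ^= 0b0000_1000  # inject the upset into the second copy
voted, repaired = scrub(replicas)
print(bin(voted), repaired)
```

The vote alone masks the fault; the scrub step matters because a second upset in another copy before repair would defeat the majority.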

Features

The product leverages advanced clustering and dynamic reconfiguration techniques to ensure high performance and resilience. Its core lies in the clustered FPGA architecture, optimizing computational efficiency through parallel processing. This design not only accelerates application processing but also seamlessly redistributes tasks if a node fails, ensuring continuous operation. Dynamic Partial Reconfiguration (DPR) further enhances reliability, enabling quick recovery from errors by reallocating tasks across the FPGA network. Triple Modular Redundancy (TMR) adds protection by tripling critical blocks, reducing the risk of failures. Full node reboot capabilities allow rapid system restoration in severe fault scenarios by redistributing tasks and reinitializing affected nodes.
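Task redistribution after a node failure can be sketched as follows. The least-loaded placement policy, the function name redistribute, and the node and task names are assumptions for illustration only; the source does not describe the scheduling policy actually used.

```python
def redistribute(assignment: dict[str, list[str]], failed: str) -> dict[str, list[str]]:
    """Return a new task assignment with the failed node's tasks moved
    to the surviving nodes, always choosing the least-loaded survivor."""
    survivors = {node: list(tasks) for node, tasks in assignment.items() if node != failed}
    if not survivors:
        raise RuntimeError("no healthy nodes left to take over the tasks")
    for task in assignment.get(failed, []):
        # Ties are broken by dictionary insertion order.
        target = min(survivors, key=lambda node: len(survivors[node]))
        survivors[target].append(task)
    return survivors


# Hypothetical four-node cluster in which node "n0" fails.
cluster = {"n0": ["fft", "fir"], "n1": ["demod"], "n2": ["decode"], "n3": ["mux"]}
after_failure = redistribute(cluster, "n0")
print(after_failure)
```

In a real system the reassigned tasks would be loaded onto the target nodes (for example through partial reconfiguration) before traffic is redirected; that step is outside this sketch.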

Communication within the system is handled by the Aurora protocol, which provides high-speed, reliable data transfer across the FPGA nodes, both within and between tiles. This protocol’s scalability and efficiency are crucial for maintaining high data rates and ensuring seamless communication, even as the system scales up. Overall, the product offers a powerful, resilient, and scalable solution, well-suited for mission-critical applications where reliability and performance are paramount.

System Architecture

The system architecture is centred on a highly scalable and resilient FPGA-based cluster that uses a mixed-mesh topology to optimize communication and performance. Figure 1 shows the setup used for testing the system, with one tile implemented for the tests together with the Spacecraft emulator.

Figure 1 Testbed setup

Within each tile, nodes are interconnected using a wormhole router with an XY-routing algorithm, facilitating efficient data transfer through packets. The Aurora protocol, chosen for its scalability and high data rate, handles communication within and between tiles, connecting a maximum of four FPGAs per node to minimize link requirements while maximizing the cluster’s scalability.
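XY routing, also known as dimension-order routing, can be sketched in a few lines. The (x, y) mesh coordinates and the function name xy_route are assumptions, since the source does not describe the addressing scheme. A packet first travels along the X dimension until its column matches the destination, then along Y. Because it never turns from Y back to X, the routing is deadlock-free on a 2D mesh.

```python
Coord = tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> list[Coord]:
    """Return the sequence of mesh nodes visited from src to dst under XY routing."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # resolve the X offset first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then resolve the Y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

With wormhole switching, only the header flit of a packet needs to make this routing decision at each hop; the remaining flits follow the path it reserves.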

The architecture also includes a robust control system based on the I2C protocol, in which each node can operate as either Master or Slave, sending and receiving data as needed. This setup supports over 100 nodes, ensuring scalability and efficient management of the cluster.

Key components include the Blazes and BaByloN Processing System, featuring microprocessors dedicated to cluster management and application-specific computations, and the TMR Beacon Controller, which monitors the health of the nodes and triggers dynamic reconfiguration when necessary. The architecture also incorporates an array of PFU-based accelerators for handling parallel and compute-intensive tasks, and a DDR4 Memory Controller for managing external memory access. The combination of these elements ensures a robust, high-performance system capable of meeting the project’s demanding requirements.
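The health-monitoring role described for the beacon controller can be illustrated with a simple heartbeat model. This is a sketch under stated assumptions, not the TMR Beacon Controller itself: the class BeaconMonitor, the tick-based timeout, and the spare-allocation rule are hypothetical. Each node reports periodically, and a node that stays silent longer than the timeout is declared faulty and replaced by a spare, if one is available.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BeaconMonitor:
    timeout: int  # ticks of silence after which a node is declared faulty
    spares: list[str] = field(default_factory=list)
    last_seen: dict[str, int] = field(default_factory=dict)

    def beacon(self, node: str, now: int) -> None:
        """Record a beacon received from a node at time 'now'."""
        self.last_seen[node] = now

    def check(self, now: int) -> dict[str, Optional[str]]:
        """Return {faulty_node: replacement_spare or None} for silent nodes."""
        replacements: dict[str, Optional[str]] = {}
        for node, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                spare = self.spares.pop(0) if self.spares else None
                replacements[node] = spare
                del self.last_seen[node]  # stop tracking the faulty node
                if spare is not None:
                    self.last_seen[spare] = now  # start tracking the spare
        return replacements


# Hypothetical run: node1 stops sending beacons after tick 0.
monitor = BeaconMonitor(timeout=3, spares=["spare0"])
monitor.beacon("node0", 0)
monitor.beacon("node1", 0)
monitor.beacon("node0", 4)
result = monitor.check(5)
print(result)
```

In the architecture described above, the replacement step would correspond to triggering dynamic reconfiguration of the affected node or a spare.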

Figure 2 reports the high-level block diagram implemented for the testbed setup.

Figure 2 Testbed block diagram
Plan

The project started on May 31st, 2021, with milestones covering the key phases over 24 months. After the Kick-Off (KO), the project progressed through a series of reviews: the System Requirements Review (SRR), Preliminary Design Review (PDR), Critical Design Review/Test Readiness Review (CDR/TRR), and the Test Review Boards (TRB). The project concludes with a Final Review (FR) and Final Presentation (FP).

Current status

The project has been successfully completed. The Final Review and Final Presentation have been concluded, marking the end of all planned activities. The design has been finalized and validated through comprehensive radiation tests, which yielded positive results and demonstrated the effectiveness of the proposed solutions. All milestones have been achieved, and the outcomes align with the initial goals.

Prime Contractor