THRUST 2. SECURE ARCHITECTURES
Leveraging the physical protection techniques
investigated in thrust 1, secure architectures
with solidly grounded assumptions on physical security will be explored
in thrust 2. Nowadays, energy efficiency and low power consumption
are major market differentiators for all scales, ranging from MCUs to
SoCs. This motivates the focus on low-overhead solutions for
secure architectures in thrust 2. To this aim, the physical
security countermeasures in thrust 1 and the architecture
features are synergistically exploited to maximize energy and
area efficiency. Firstly, the architecture investigated in this
thrust introduces selective physical protection in essential
components that maintain the security state, while using architectural
security approaches (i.e., access control,
encrypted on-chip communication) to provide security
guarantees on the physically unprotected areas of the chip. In this
way, data can be transmitted over long distances inside the
chip, while avoiding the large cost of physical security
protection. Secondly, a key attribute of the architecture security
component is the physical separation of secure and
insecure components, as well as strict security control orchestrated by
a trusted CPU with respect to isolation (i.e., NoC access
to prevent information leakage) and access protection (to verify
correct operation). The trusted CPU is the gatekeeper in the
interaction between the secure and the insecure layers. Thirdly,
the trusted CPU is protected against protocol- and
software-level attacks via innovative low-overhead,
hardware-based monitoring. Fourthly, an across-level “vertical” approach is
adopted that embraces the circuit (for key generation and physical
protection in thrust 1), architectural and intra-chip
communication protocol level, as opposed to more conventional approaches
focusing on one or a few levels of abstraction.
Thrust 2 will leverage
the unique expertise and capabilities of the SOCure team, spanning
from Networks on Chip to secure communication
protocols and architectures. A team member from NUS (previously
with MIT) has been leading the field of NoCs for almost two decades [PEH],
and has developed several NoC design methodologies that will be leveraged
in the architectural exploration in this thrust. Another team member
has wide expertise in SoC architectures and unique
experience with the development of high-level simulations [TRV] such
as the popular SNIPER [SNP], as needed for the architectural exploration
and security-overhead tradeoff analysis in this thrust. Another team
member is an expert in lightweight protocols and anomaly detection for
distributed networks, as needed for the NoC-centric approach discussed
below [BPS]. Another team member has expertise in architecture-level
security and Trojan detection [AC]. Various industrial partners with
strong expertise in the domain covered by thrust 2 are in the process of
joining the team. Our international partners from the US (Princeton)
[RBL] and the UK (Cambridge) [SMN] are world-renowned experts in secure
architectures.
Regarding the state of the
art in secure architectures, a variety of secure systems have been
proposed, each being restricted to a specific class of system (general-purpose
vs. embedded) or threat target. However, proposals at the system and the
Network-on-Chip (NoC) level tend to be complex and hence power and area
hungry, or do not provide adequate whole-platform security guarantees. For
example, ARM TrustZone [A09] and Intel SGX [I14] are commercially
available general-purpose systems that aim to increase security and
isolation of the applications. TrustZone focuses on secure/insecure access
being controlled by the main processor, and assumes that the confidential
on-chip data cannot be accessed externally (which is actually not true,
even under non-invasive attacks). Intel SGX encrypts off-chip data at a
large hardware overhead, as it creates specialized hardware enclaves below
the hypervisor level for software isolation, mostly targeting cloud providers.
These threat models clearly target application-level isolation and
external-chip data protection, but not the protection of application data from
inside the chip. Embedded solutions like Fulmine [CSS17] continue
this trend but only enable secure storage outside of the chip. As
many-core chips interconnected with NoCs (e.g., Intel Xeon
Phi) emerged in the server and datacenter domain, the state of the art in
secure NoCs naturally targeted high-performance chips, with
highly sophisticated NoC designs comprising many virtual channels and
buffers. For instance, SurfNoC [WGO13] ensures non-interference between
different domains, by partitioning and scheduling virtual channels of
links across domains with a large number of virtual channels per
input port (e.g., 16 to 32)
and per NoC router (e.g., 80 to 160). Such highly
complex NoCs entail a very large overhead that can be justified only in
massively parallel architectures for servers, and cannot be
efficiently scaled down to lower classes of computing. For example, in a
secure NoC counteracting NoC-level information
leakage (i.e., Fort-NoCs [ACR14]), a router with 5 virtual channels
per port and 25 virtual
channels per router consumes 706mW in a 45nm TSMC process,
which exceeds by orders of magnitude the power budget of secure MCUs, and
is comparable to the entire power consumption of smartphone processors.
In SOCure, we will investigate ultra-lightweight secure
NoCs that provide secure connectivity across the SoC, while ensuring very low
power and area overhead. In general, NoCs comprise datapath (wires
and switches for actually transporting the bits), control (for handling
datapath sharing between packets), and buffering (for temporary
storage to deal with contention by other packets, e.g.
virtual channel queues). Control and buffering are essentially overheads
in NoCs, and there have been prior NoC designs that reduce them, down to
NoCs that are completely buffer-less. Without buffering in the NoC, flits
that contend with others can no longer be temporarily stored in the
routers. As a compromise, several proposed NoCs deflect the
contending flits to other ports, thus misrouting them [MM09], [DPC16],
drop the contending flits [HJL09], introduce a regular NoC as
backup [LMJ16], or buffer and retransmit the dropped flits at the
network interface [HJL09]. As for the control, this can be offloaded
to software, so the compiler or scheduler determines a contention-free
schedule [TKM02], [KMM17], or to an ultra-lightweight control network
[DPC17]. Nevertheless, these prior works in ultra-lightweight NoCs do not
handle security. In SOCure, we will leverage
our broad experience in prior ultra-lightweight NoC designs
[DPC16], [KMM17], [DPC17] to introduce low-overhead mechanisms for
security. Further details on our proposed NoC architecture are provided
below.
Regarding the state of the art in secure NoCs, existing security
mechanisms primarily focus on access control, based on monitoring and
analyzing data transfers, although at a large area/energy penalty. Starting with
the Security Enhanced Communication Architecture (SECA) [CRR05], [PGS11],
such security mechanisms use stateful and stateless policies based on
addresses and/or values to determine access rights. Such approaches have
high computational complexity and do not scale well with the addition of
IP blocks. Enhancements to policy-based access control mechanisms for
NoC security have been proposed in the form of encryption techniques for
confidentiality and integrity [CCG11], [CCG12]. The main drawbacks of
existing approaches in this direction are the area, energy, and latency
overheads. The encryption and integrity monitoring techniques proposed in
[WB11], [CL10] are aimed at code and data in main memory, and do not
address the problems of access control or attacks on availability. In the
context of bus-type architectures, mechanisms that exploit the broadcast
nature of the bus to detect non-conformant data transfers have also been
proposed [KV11]. These techniques do not extend to NoC-based architectures
such as those considered in this proposal. Finally, there is little
attention in existing literature to authentication of IP identities during
run-time.
Regarding malicious RTL modifications and bugs, numerous policy- and
runtime-based approaches to detect them have been
proposed [BBT17], [WS10], [HSK15], although they introduce
additional hardware complexity and overhead. In addition, many
previous hardware processor bugs are the result of incorrect privilege
escalation [HSK15], which can occur when repurposing
general-purpose processors for security-related tasks. To address this
challenge, pre-silicon software tools checking for
security-related bugs or backdoors will be adopted in the digital design
flow, leveraging the unique capabilities of one of our industrial
partners [SIC]. Also, SOCure completely eliminates the need
for escalation bug checking, as the privileged processor only runs
secure software, as explained in the following. Traditional secure architectures
either combine trusted and untrusted software on custom hardware [A09], or
create large trusted IP regions that can be as large as the
entire chip [CSS17], which would require extensive physical
hardware security to prevent eavesdropping and tampering. The
approach introduced in SOCure is to minimize the surface area of the
secure and insecure worlds, and to isolate untrusted IPs (Fig. 15, blue
boxes) from others (both trusted and untrusted). The resulting
architecture minimizes attack vectors from untrusted IPs and software, and
provides a low-overhead and clear security framework that controls access
to the different components of this system. By isolating privileged
operations on a separate small and efficient core, we avoid the potential
for security-privilege escalation bugs that arises when CPUs or IPs
operate in both secure and insecure modes [HSK15]. Examples of trusted
CPU access control include router configuration (interference control) and
the restriction of access to shared resources, such as external memory or
internal sensors. The secure network interfaces encrypt
all the data leaving the trusted IPs, and can directly access the
outside world (SRAM, Flash) without the need to reprocess or encrypt the
data. In other words, secure intra-chip communications automatically
assure data security also off chip.
As for the detection of untrusted IPs, existing work on the detection and
mitigation of threats from hardware Trojans in untrusted IPs is primarily
based on evaluating their design and activation characteristics [HFK10],
[TK10], [CNB09], [WS11]. Techniques such as those proposed in
[WMS13], [BS10], [ZT11] and [RCK15] are based on static validation of the
IP cores, with emphasis on the detection of suspicious regions, nodes or
unused circuits. While these techniques may be effective in certain
scenarios, they have high time complexity and cost, and frequently exhibit
significant false negatives/positives depending on the choice of test
sets, threshold values, design type etc. Run-time monitors for detecting
hardware Trojans have been proposed in [WS10], [DDK13]. However,
these solutions are specific to microprocessor cores and are not
applicable to scenarios with arbitrary IPs interconnected in a SoC.
The above challenges and limitations of prior art are
addressed in SOCure by adopting the architecture illustrated in Fig. 15. As shown in this
figure, the SOCure architecture comprises trusted and
untrusted entities, and the NoC handles traffic within the trusted
region, between trusted region and untrusted IPs, as well as with
off-chip memories through untrusted memory controllers. Such security
properties are supported by an ultra-lightweight NoC
design, given the tight power and area constraints at the low end of
the computing scale spectrum. These tradeoffs prompt us to propose a
bare-metal buffer-less NoC architecture where scheduling and control
are offloaded to the compiler and the OS scheduler running on the trusted
CPUs, which lie within the trusted region where applications’
communications are known in advance. The software-scheduled NoC will essentially
be composed of just the data path (wires and crossbar switches), with
switch settings configured for each set of applications by the scheduler.
This allows for a NoC that can be pushed to maximum throughput by the
schedule, yet remain buffer-less and without control logic. The setup of
the NoC switches is initiated by the trusted CPUs and the data
is encrypted at the NoC interfaces and wrappers by a lightweight block
cipher engine with single-cycle latency and ultra-low power
consumption (see below). Accordingly, the data path
remains encrypted throughout the NoC transmission. The data path wires
and switches can be readily partitioned in the floorplan, and are
physically protected to prevent temperature snooping, eavesdropping
and tampering.
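As an illustration of the software-scheduled approach described above, the following is a minimal sketch of how a compiler or scheduler might derive a contention-free injection schedule for a buffer-less NoC. It is not the SOCure scheduler: the 2D mesh topology, the XY routing, the greedy slot search, the flow names and the schedule period are all assumptions made for the example, and contention at the local ejection ports is not modeled.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]      # (x, y) coordinate of a router in an assumed 2D mesh
Link = Tuple[Node, Node]    # directed link between neighbouring routers


def xy_route(src: Node, dst: Node) -> List[Link]:
    """Return the links traversed by dimension-ordered (X, then Y) routing."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def schedule_flows(flows: Dict[str, Tuple[Node, Node]], period: int) -> Dict[str, int]:
    """Greedily pick an injection slot per flow so that no link is shared in a cycle.

    A flit injected at slot s occupies the i-th link of its route at cycle
    (s + i) mod period, because a buffer-less NoC cannot stall it on the way.
    """
    busy = set()                      # (link, cycle) pairs already reserved
    start: Dict[str, int] = {}
    for name, (src, dst) in flows.items():
        route = xy_route(src, dst)
        for s in range(period):
            slots = [(link, (s + i) % period) for i, link in enumerate(route)]
            if not any(slot in busy for slot in slots):
                busy.update(slots)
                start[name] = s
                break
        else:
            raise ValueError(f"no contention-free slot for flow {name!r}")
    return start


# Hypothetical application flows, known at compile time.
flows = {
    "cpu_to_mem": ((0, 0), (2, 0)),
    "dsp_to_mem": ((1, 0), (2, 0)),
    "cpu_to_dsp": ((0, 0), (1, 0)),
}
schedule = schedule_flows(flows, period=4)
# {'cpu_to_mem': 0, 'dsp_to_mem': 0, 'cpu_to_dsp': 1}
```

In this example the third flow is delayed by one slot because it shares its first link with the first flow. The resulting table is what the trusted CPU would translate into switch settings, so the routers themselves need neither buffers nor arbitration logic.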
The encryption across the NoC and
for off-chip communication needs to be
performed with ultra-lightweight crypto-engines, as they are present
in each NoC router and always sit between the sender and the
receiver in any on-chip transaction. To this aim, we will use
the recently proposed Simon block cipher [BSS13], which is relatively
simple and can be easily accommodated within a single clock cycle for
all practical architectures. Its area and energy efficiency are
substantially better than those of AES [DR02]. Based on our recent estimates
in CMOS 40nm under a set of novel energy reduction techniques,
an energy of 0.1pJ/bit is achievable, leading to a power
consumption in the order of 1mW at 500MHz, which is 10-100x
lower than the power target for the low-power trusted processor.
This methodology will enable large numbers of connected IPs without
a significant power overhead.
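To make the simplicity of the cipher and the power estimate above concrete, the following is a minimal software sketch, not the proposed hardware engine. It assumes the smallest Simon variant (Simon32/64: 16-bit words, 32 rounds, 64-bit key), and the power figure assumes that one 32-bit block is processed per clock cycle; the text does not state either choice. The key and data values are arbitrary.

```python
# Round-constant sequence z0 used by the Simon32/64 key schedule.
Z0 = "11111010001001010110000111001101111101000100101011000011100110"
N = 16                    # word size in bits (assumed variant: Simon32/64)
MASK = (1 << N) - 1
ROUNDS = 32


def rol(x: int, r: int) -> int:
    """Rotate an N-bit word left by r positions."""
    return ((x << r) | (x >> (N - r))) & MASK


def ror(x: int, r: int) -> int:
    """Rotate an N-bit word right by r positions."""
    return ((x >> r) | (x << (N - r))) & MASK


def expand_key(key_words):
    """Expand the key words [k0, k1, k2, k3] into the 32 round keys."""
    k = list(key_words)
    for i in range(4, ROUNDS):
        tmp = ror(k[i - 1], 3) ^ k[i - 3]
        tmp ^= ror(tmp, 1)
        k.append((~k[i - 4] & MASK) ^ tmp ^ int(Z0[(i - 4) % 62]) ^ 3)
    return k


def f(x: int) -> int:
    """Simon round function: two rotations, one AND, one XOR."""
    return (rol(x, 1) & rol(x, 8)) ^ rol(x, 2)


def encrypt(x: int, y: int, round_keys):
    for k in round_keys:
        x, y = y ^ f(x) ^ k, x
    return x, y


def decrypt(x: int, y: int, round_keys):
    for k in reversed(round_keys):
        x, y = y, x ^ f(y) ^ k
    return x, y


def cipher_power_watts(energy_per_bit_j: float, bits_per_cycle: int, clock_hz: float) -> float:
    """Average power = energy per bit x bits processed per cycle x clock frequency."""
    return energy_per_bit_j * bits_per_cycle * clock_hz


round_keys = expand_key([0x0100, 0x0908, 0x1110, 0x1918])   # arbitrary key, k0 first
ciphertext = encrypt(0x6565, 0x6877, round_keys)
recovered = decrypt(*ciphertext, round_keys)                # (0x6565, 0x6877)

# 0.1 pJ/bit at 500 MHz with an assumed 32 bits per cycle:
power = cipher_power_watts(0.1e-12, 32, 500e6)              # about 1.6e-3 W
```

Each round uses only rotations (free in hardware, as they are wiring), one AND and a few XORs per bit, which is why unrolling all rounds into a single cycle is plausible. Under the stated assumption on throughput, the energy figure quoted above yields a power on the order of 1mW.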
While applications are typically known beforehand on a
secure platform, and
thus the application communication flow can be pre-characterized,
interactions between IP blocks and dynamic aspects of off-chip memory
traffic will lead to some portion of on-chip traffic that is unpredictable
at compile time. We will thus explore a buffer-less control network where
ordering and dependencies between traffic flows can be captured as tokens,
with the tokens triggering switch configuration across the NoC. The
control network will similarly be tamper-proofed, and will leverage our
prior work on buffer-less ordering NoCs [DPC17]. In addition to system-level
protection with encryption and lightweight access control, the security of
the lightweight trusted processors will be
hardened through lightweight monitoring, access protection, and
authenticated encrypted software. As they act as the interface between
the trusted and untrusted components, additional measures are needed to
protect against replay attacks, buffer overflows, and other
software vulnerabilities. Hardware-based monitoring of software is
continuous, providing cycle-level protection, and is much more
efficient than software-only techniques. Also, the integrity
of the NoC and the trusted processors after manufacturing will be
checked through the common practice of reverse engineering, i.e. by
delayering and imaging the chip to verify the perfect correspondence to
the netlist of the original design. The cost of reverse engineering is now
relatively low, by virtue of the availability of low-cost SEM microscopes
(see “Landscape, trends and motivation” section), and is typically in
the order of very few tens of k$/mm2 or lower.
In SOCure, robustness against attacks from untrusted on-chip IPs is
based on the assumption that keys are generated locally in each
router (using a physically secure PUF, see thrust 1), and the trusted CPU
is able to securely manage the
exchange of temporary session keys among IPs at run-time or
chip boot time. This requires the adoption of secure communication
protocols over the non-secured NoC, as discussed in the following. As in Fig. 15, the
trusted CPU determines the trust in the presence of untrusted IPs.
The trusted CPU manages keys, facilitates secure communication
between IPs, implements security policies, and also serves as a detection
agent for attacks launched by the IPs (e.g. DoS attacks on the NoC or an
IP). To assist the trusted CPU in its operations, trusted NoC routers are
adopted with encryption capability at the interface between each IP and
the switch connecting it to the NoC. Since IPs may be untrusted, the
security functionalities such as encryption, key exchange etc. for each IP
will be handled by the trusted router in the NoC switch to which the
IP is connected. The routers are connected to the trusted CPU through
point-to-point links (secure NoC in Fig. 15 in red line) that are
physically secure. Since changes in the control policy are
infrequent, these links may be just serial to keep their
area/energy cost insignificant. The functions of the architecture and
the related protocol are as follows:
1. KEY INITIALIZATION AND EXCHANGE.
All cryptographic keys are handled only
by the trusted routers and CPU(s). The initial key exchange between
the trusted CPU and each trusted router will be done at
testing time. Each router is equipped with a one-time readable PUF (see
thrust 1) that is read by the trusted CPU to set up a challenge-response
pair (CRP) associated with the router. This initial CRP will be used by
the router to facilitate the setup or update of cryptographic
keys during the operational phase. When two IPs wish to communicate
during the operational stage, a session key will be set up between them
with the help of the trusted CPU. The IP initiating the communication will
request the trusted CPU to set up a session key (through the trusted
router that it is connected to). Then, the trusted CPU proceeds with
the request, based on the privileges and security policies.
2. DATA CONFIDENTIALITY.
All intra-chip communications are encrypted
using a lightweight Simon crypto core, ensuring the confidentiality of the
messages and counteracting eavesdropping, man-in-the-middle and replay
attacks. In addition, time- and space-based partitioning on the secure NoC
will ensure that IPs that are not party to an ongoing message exchange
will not have access to any contents of the messages, including the
headers.
3. ACCESS CONTROL.
Access to resources (e.g. registers, memory
locations) requested by any IP will be routed through the trusted CPU to
ensure conformance with security policies. The trusted CPU also sets up
the policies for routing tables in the NoC in order to provide isolation
to data transfers.
4. REAL-TIME PROTECTION.
The trusted
routers and CPU(s) have features to facilitate the
monitoring of the network activities due to each IP, in order to detect
attacks and policy violations. For example, the routers monitor and
report traffic metrics (e.g., delays experienced by
packets) to the CPU to detect and counteract DoS attacks.
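The key initialization and access control functions listed above can be sketched as follows. This is only an illustrative model under stated assumptions, not the SOCure protocol: the class and function names are hypothetical, random bytes stand in for the PUF responses read at testing time, and SHA-256/HMAC from the Python standard library stand in for the lightweight primitives that would be used on chip.

```python
import hashlib
import hmac
import os


def wrap(router_key: bytes, session_key: bytes) -> bytes:
    """Encrypt-then-MAC a session key under a router's long-term key."""
    nonce = os.urandom(16)
    pad = hmac.new(router_key, b"enc" + nonce, hashlib.sha256).digest()
    body = bytes(a ^ b for a, b in zip(session_key, pad))
    tag = hmac.new(router_key, b"mac" + nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unwrap(router_key: bytes, blob: bytes) -> bytes:
    """Router side: authenticate the message, then recover the session key."""
    nonce, body, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(router_key, b"mac" + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("session key message failed authentication")
    pad = hmac.new(router_key, b"enc" + nonce, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(body, pad))


class TrustedCPU:
    """Hypothetical model of the trusted CPU as key manager and policy gatekeeper."""

    def __init__(self, policy):
        self.policy = policy        # set of (initiator, target) pairs allowed to talk
        self.router_keys = {}       # router id -> long-term key enrolled at test time

    def enroll_router(self, router_id: str, puf_response: bytes) -> None:
        # Testing time: derive a long-term key from the router's PUF response.
        self.router_keys[router_id] = hashlib.sha256(puf_response).digest()

    def request_session(self, initiator: str, target: str) -> dict:
        # Operational phase: check the policy, then distribute a fresh session key.
        if (initiator, target) not in self.policy:
            raise PermissionError(f"{initiator} may not talk to {target}")
        session_key = os.urandom(16)
        return {rid: wrap(self.router_keys[rid], session_key)
                for rid in (initiator, target)}


# Stand-ins for the PUF responses of the routers attached to two IPs.
puf = {"ip_a": os.urandom(32), "ip_b": os.urandom(32)}
cpu = TrustedCPU(policy={("ip_a", "ip_b")})
for rid, response in puf.items():
    cpu.enroll_router(rid, response)

blobs = cpu.request_session("ip_a", "ip_b")
key_a = unwrap(hashlib.sha256(puf["ip_a"]).digest(), blobs["ip_a"])
key_b = unwrap(hashlib.sha256(puf["ip_b"]).digest(), blobs["ip_b"])
# Both routers now hold the same session key; the IPs themselves never see it.
```

The point of the sketch is the division of roles: the untrusted IPs only issue requests, the trusted routers hold the long-term secrets, and the trusted CPU is the single place where the security policy is evaluated.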
The fundamental novelty of the proposed intra-chip
communication scheme is in the distributed nature of hardware
security primitives coupled with a
software-defined centrally controlled communication
architecture, to ensure security in the presence of untrusted IPs. As a
second element of novelty of this communication scheme, the proposed
architecture always operates in the secure mode (i.e., all packets are
encrypted and turning encryption off is not an option), unlike existing solutions
where the operation of the SoC may switch between secure and unsecure
modes. Thirdly, the physically-based authentication and communication
mechanism can be modified during the lifecycle of the SoC, thus
allowing the interesting property of upgrade-ability. In other words,
if a hardware vulnerability is discovered, or some security policy is
discovered to be too restrictive, the security policy defined in the
trusted CPU can be modified over time.
The trusted CPU in Fig. 15 is the other pillar of the
proposed architecture, and its security assurance is a major
challenge owing to various forms of vulnerabilities that a system can be
exposed to, across design layers. In state-of-the-art designs,
the trusted CPU is usually built around a Trusted Execution Environment
(TEE), which is a hardened, tightly controlled and usually limited
execution environment in the processor designed to run critical secure services
and protect critical assets. The TEE protects the confidentiality and the integrity
of code and data loaded into it, so that the applications running in the
Rich Execution Environment (REE) will not be able to tamper with it. The
hardening of the TEE and its separation from the REE is a daunting task, given
the multiple applications requesting cryptographic services
and the inevitable sharing of resources that this entails. Existing solutions for
TEE can be classified into three categories. First, the application
runs in an encrypted enclave (e.g., Intel SGX) sharing the
secure hardware with the insecure applications. Second, virtual machines run
in an encrypted memory (e.g., AMD SEV). Third, a
virtual CPU is used to clearly separate the operations between a secure
CPU and a normal CPU (e.g., ARM TrustZone). This hard separation
advocated in the last approach is clearly advantageous, but also
stops short of absolute security due to the semantic gap between the
two modes of operations. Besides, to cater to the lightweight design
segments, the ARM TrustZone does not include any built-in
cryptographic capabilities and secure non-volatile memory, although they
would be required for services such as secure boot, key and data
sealing and remote attestation. Ideally, the trusted CPU needs a clear semantic
translation or, even better, should support the same
Instruction-Set-Architecture (ISA) to run both the TEE and the REE.
Secondly, full support of cryptographic acceleration with minimal
performance/power overhead is very important. Thirdly, the TEE system
needs to be connected to the root of trust through a secure and
robust chain of trusted operations. Fourthly, the trusted CPU operation needs
to be protected against passive and active side channel attacks.
The above four capabilities will be pursued for the trusted CPU in
SOCure by introducing new methods. In particular, an open-source
design based on an open architecture (e.g., RISC-V ISA) will be used
as a test vehicle, due to its widespread adoption in recent years and the
strong interest of industry. An example of the system-level view of the
trusted CPU operations is shown in Fig. 16. Besides the side-channel
attack-resistance (which is addressed in thrust 1), two
directions will be explored in this thrust, as discussed in
the following. First, new techniques to protect the trusted CPU
against malicious activity arising from the debug interface will be
investigated, as the related ports are backdoors that allow intrusion
during the lifetime of the device. This will be achieved through a novel protocol
involving authenticated debugger and built-in key management schemes.
Second, the trusted CPU will provide a low-level resistance against
malware/ransomware by utilizing the hardware performance counters (HPCs)
of the REE and TEE. The dataset recorded from the HPCs will be used to
train an artificial neural network (ANN) under the normal application
execution scenario, which will then be used to identify
anomalies and malicious applications at execution time. Depending on the
results of the analysis of benchmarks, on-chip acceleration will be
considered if the performance overhead exceeds the expectations
(e.g., in the order of percentage points of the nominal throughput).
The overall aim of this exploration is to achieve a targeted level of
trustworthiness, with the constraint of minimizing silicon area and the
performance/energy overhead. The design will be benchmarked against
the targeted attack scenarios (for security) and the baseline design (for
overhead quantification).
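The HPC-based detection flow described above (learn the normal behaviour, then flag deviations at execution time) can be illustrated with a deliberately simplified stand-in. The sketch below replaces the ANN with a per-counter z-score baseline so that it fits in a few lines; the choice of counters, the sample values and the threshold are all hypothetical and are not taken from [ABM18].

```python
import statistics


def fit_baseline(samples):
    """Learn per-counter mean and spread from HPC vectors of normal execution."""
    columns = list(zip(*samples))
    means = [statistics.mean(c) for c in columns]
    # Guard against a zero spread for counters that never vary.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stdevs


def anomaly_score(sample, baseline):
    """Largest deviation of any counter from its mean, in standard deviations."""
    means, stdevs = baseline
    return max(abs(v - m) / s for v, m, s in zip(sample, means, stdevs))


def is_anomalous(sample, baseline, threshold=4.0):
    return anomaly_score(sample, baseline) > threshold


# Hypothetical counters per sampling window:
# (cache misses, branch mispredictions, retired instructions)
normal_windows = [
    (1000, 50, 20000),
    (1100, 55, 21000),
    (900, 45, 19000),
    (1050, 52, 20500),
]
baseline = fit_baseline(normal_windows)

benign = is_anomalous((1020, 51, 20200), baseline)       # False
suspicious = is_anomalous((5000, 60, 20000), baseline)   # True: cache misses far off
```

An ANN would replace `anomaly_score` with a learned model that also captures correlations between counters, at a higher inference cost; this is the overhead that the on-chip acceleration mentioned above would address.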
The design of the trusted CPU brings forth the following novel
propositions:
• Protecting the scan chain and debug interface from
malicious attackers through cryptographic protocols.
• Utilizing hardware
performance counters for malware/ransomware detection
has been recently proposed by our team [ABM18], with excellent
preliminary results. This new idea will be widely explored in SOCure,
in the context of HPC-trained ANN
structures in lightweight processors.
• The entire chain-of-trust through the
secure boot operation will be holistically investigated and
designed, along with the assurance of memory integrity
and confidentiality. Accordingly, a comprehensive analysis
of the entire protocol will be performed, instead of
mainly relying on a root-of-trust provision through
key management. This will leverage the synergy between the creation
of the root of trust in thrust 1, and the architecture-protocol innovation
in thrust 2.
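As an illustration of the first proposition above (authenticated access to the debug interface), the following is a minimal sketch of a challenge-response unlock. The protocol to be developed in SOCure is not specified in this section, so the class name, the key provisioning and the use of HMAC-SHA256 are assumptions made for the example.

```python
import hashlib
import hmac
import os


class DebugPort:
    """Hypothetical debug port that stays locked until the debugger authenticates."""

    def __init__(self, debug_key: bytes):
        self._key = debug_key        # provisioned on chip, e.g. derived from the root of trust
        self._challenge = None
        self.unlocked = False

    def new_challenge(self) -> bytes:
        """Issue a fresh random challenge for the next unlock attempt."""
        self._challenge = os.urandom(16)
        return self._challenge

    def unlock(self, response: bytes) -> bool:
        """Unlock only if the response is the keyed MAC of the pending challenge."""
        if self._challenge is None:
            return False
        expected = hmac.new(self._key, self._challenge, hashlib.sha256).digest()
        self._challenge = None       # each challenge can be used only once
        self.unlocked = hmac.compare_digest(response, expected)
        return self.unlocked


def debugger_response(debug_key: bytes, challenge: bytes) -> bytes:
    """Computed by an authorized debugger that holds the shared key."""
    return hmac.new(debug_key, challenge, hashlib.sha256).digest()


shared_key = os.urandom(32)
port = DebugPort(shared_key)

challenge = port.new_challenge()
granted = port.unlock(debugger_response(shared_key, challenge))        # True

challenge = port.new_challenge()
refused = port.unlock(debugger_response(os.urandom(32), challenge))    # False: wrong key
```

Because every attempt consumes a fresh challenge, a response captured on the debug pins cannot simply be replayed later in the lifetime of the device.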
Finally, architectural support for run-time hardware-based memory
isolation enforcement will also be investigated, in order to
prevent software side-channel attacks on on-chip memories, which is known
as a software-level threat that requires hardware solutions [P05],
[OST06], [B05]. These attacks aim to retrieve confidential data from
an area of the memory that a malicious app is not supposed to have access
to (as enabled by fault attacks, among others). As the main research
direction, low-power Content-addressable memories (CAMs) will be explored to
introduce spatial randomization of on-chip memory access, as well as SRAMs empowered
with single-cycle flush capability (i.e., erasure of entire banks in a
single cycle). The first capability makes it possible to break
the deterministic relation between data and the physical address, thus
preventing attackers from locating sensitive information in the memory.
The second capability makes it possible to quickly release portions of unused
memories, so that malicious SW applications cannot read data coming from
applications that have been previously executed, while avoiding the
prohibitive time penalty of sequentially erasing traditional
memories. These capabilities prevent the attacker from locating sensitive
memory data and from accessing confidential data previously generated by
other applications, as required to counteract SW side-channel attacks.
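The two memory capabilities described above can be modeled behaviourally as follows. This is a software sketch under stated assumptions, not a circuit description: a dictionary plays the role of the CAM that maps logical addresses to randomly chosen physical locations, the operating system's random source stands in for an on-chip random number generator, and `flush` models the single-cycle erasure of a bank.

```python
import random


class RandomizedMemory:
    """Behavioural model of a memory bank with randomized placement and flush."""

    def __init__(self, size: int):
        self._rng = random.SystemRandom()   # stand-in for an on-chip random source
        self._size = size
        self._reset()

    def _reset(self) -> None:
        self._cells = [0] * self._size
        self._map = {}                      # logical address -> physical location (CAM role)
        self._free = list(range(self._size))
        self._rng.shuffle(self._free)       # physical locations are handed out in random order

    def write(self, logical: int, value: int) -> None:
        if logical not in self._map:
            if not self._free:
                raise MemoryError("no free physical location")
            self._map[logical] = self._free.pop()
        self._cells[self._map[logical]] = value

    def read(self, logical: int) -> int:
        # Raises KeyError for addresses never written since the last flush.
        return self._cells[self._map[logical]]

    def flush(self) -> None:
        """Erase the whole bank at once and draw a new random placement."""
        self._reset()


mem = RandomizedMemory(size=8)
mem.write(0x10, 0xDEAD)
mem.write(0x11, 0xBEEF)
secret = mem.read(0x10)      # 0xDEAD, wherever it was physically placed

mem.flush()                  # a later application sees neither the data nor the old mapping
```

After the flush, the previous contents are gone and the next application receives a different placement, which corresponds to the two properties argued for in the text.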
Regarding the deliverables (see details in Section 5), thrust 2
pursues the exploration, design, refinement and evaluation of the
architectural system design through cycle-level simulations, from the
block level to the entire system level. This is a routine approach that
is followed in the validation of architectural-level security, as it
offers the ability to simulate large systems in a reasonable time (e.g.,
at an equivalent clock frequency of 50MHz) by leveraging massively parallel
over-the-cloud computing services (e.g., Amazon AWS). In
particular, simulation on servers and FPGAs will be the primary means to
understand the system properties of interest, including the robustness
against the various types of attacks at the architectural and
protocol level. Traditional servers can provide performance, power
and area results and are good for use while in development. For the
evaluation phase, as well as software development, the FPGA hardware is
significantly faster and allows for rapid development and evaluation of
the system. As detailed in Section 5, simulations will be performed by
describing the system with varying levels of accuracy, initially
at the architecture level, and progressively down to cycle-level and
then to the production of an RTL design, which is then usable in the chip
design flow. With feedback from lower-level simulations in the chip design
environment, the timing and power models of both the
non-secure and secure designs will be evaluated, in order to
quantify the overhead imposed by the various proposed solutions, and
explore the related tradeoffs for each of them. This thrust will
also contribute RTL designs for chip-level demonstrators, and in
particular for critical blocks that need to be experimentally characterized
on silicon (i.e., NoC, crypto-engine modules that are robust against
side-channel attacks, a core such as MSP430 or lowRISC, core
robust against Trojan injection).
Regarding the collaboration with RISE, the research work related to
thrust 2 will be focused on re-engineering hardware fundamentals of IoT
processor design. To this aim, we will exploit prior DARPA-funded work by
Cambridge/SRI/Arm on CHERI for 64-bit cores. Scaling down to 32-bit
processors, we will explore how fine-grained CHERI memory protection
composes with microcontroller-facing Memory Protection Units (MPUs).
This contrasts with 64-bit CHERI + Memory Management Units (MMUs) in
application-class processors that support complex virtualisation of
the address space (relocating accesses as well as controlling their use)
not used in microcontrollers. Throughout this work, we will collaborate
closely with IoT-facing vendors including ARM Research based in Cambridge.
In regard to the collaboration with Technion, the joint research activity
in this thrust will be centered around using debug/testing interfaces as a side
channel. Reverse engineering of a VLSI device is a complex task that
traditionally requires tedious work and expensive equipment. The ultimate goal
of the reverse engineering process is to discover the underlying algorithm of the device. The
scope of this joint research addresses the extraction of the circuit from the
physical device, through steps such as removing the package, cross-sectioning,
delayering, and nanoscale imaging. Then, techniques will be explored in the
context of IP theft prevention, detecting HTH, malware and ransomware, and
using scan side channel and machine learning techniques to detect unique
signatures due to malicious circuit activities.