x86_64 System Calls

Published 2025-09-19, updated 2025-10-07, category: note

Notes on x86_64 system calls in the operating system, with source code analysis

System calls

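This post focuses on system calls on the x86_64 architecture. On 32-bit x86, system calls traditionally go through the int 0x80 software interrupt; see arch/x86/entry/entry_32.S for details. On x86_64 the kernel entry point lives in entry_64.S and is reached directly with the syscall instruction. The main points of comparison are:

syscall saves far less state. int 0x80 goes through the full interrupt mechanism: the CPU looks up the IDT, switches stacks and pushes SS:ESP, EFLAGS and CS:EIP onto the kernel stack. syscall touches no memory at all: it only copies RIP into RCX and RFLAGS into R11, which cuts the cost of the user/kernel transition and avoids the interrupt-handling overhead.
Arguments are passed in different registers. x86: up to six arguments are passed in ebx, ecx, edx, esi, edi and ebp. x86_64: up to six arguments are passed in rdi, rsi, rdx, r10, r8 and r9. The fourth argument uses r10 instead of rcx (which the regular function-call ABI would use), because syscall overwrites rcx with the return address (and r11 with RFLAGS), so argument passing has to avoid it.
Return values work the same way on both. The result comes back in eax/rax, and a failure is reported as a negative errno value in the range -4095 to -1. Turning that into a positive errno and a -1 return value is done by the C library in user space, not by the kernel.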
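The return-value convention can be sketched in C. This is a minimal illustration, not kernel or libc source: the names MAX_ERRNO_DEMO, raw_ret_is_error and raw_ret_to_errno are made up for this example. It shows how a user-space wrapper could tell a failure from a success in the raw value that comes back in rax.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: the largest errno value the kernel encodes. */
#define MAX_ERRNO_DEMO 4095

/* A raw syscall return value is an error iff it lies in [-4095, -1]. */
bool raw_ret_is_error(long ret)
{
	return (unsigned long)ret >= (unsigned long)-MAX_ERRNO_DEMO;
}

/* Returns the positive errno carried by a failed raw return value. */
int raw_ret_to_errno(long ret)
{
	assert(raw_ret_is_error(ret));
	return (int)-ret;
}
```

For example, a raw return value of -2 means the call failed with errno 2 (ENOENT), while 0 or a positive value is a success.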

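System call initialization

Early in boot, during the CPU init stage, the kernel initializes a number of MSRs on each CPU. The system-call related part of that setup is done in syscall_init().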

// file: arch/x86/kernel/cpu/common.c

void syscall_init(void)
{
	/* The default user and kernel segments */
	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);

	/*
	 * Except the IA32_STAR MSR, there is NO need to setup SYSCALL and
	 * SYSENTER MSRs for FRED, because FRED uses the ring 3 FRED
	 * entrypoint for SYSCALL and SYSENTER, and ERETU is the only legit
	 * instruction to return to ring 3 (both sysexit and sysret cause
	 * #UD when FRED is enabled).
	 */
	if (!cpu_feature_enabled(X86_FEATURE_FRED))
		idt_syscall_init();
}

static inline void idt_syscall_init(void)
{
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

	if (ia32_enabled()) {
		wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
		/*
		 * This only works on Intel CPUs.
		 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
		 * This does not cause SYSENTER to jump to the wrong location, because
		 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
		 */
		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
		wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
			    (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
	} else {
		wrmsrl_cstar((unsigned long)entry_SYSCALL32_ignore);
		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
		wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
	}

	/*
	 * Flags to clear on syscall; clear as much as possible
	 * to minimize user space-kernel interference.
	 */
	wrmsrl(MSR_SYSCALL_MASK,
	       X86_EFLAGS_CF|X86_EFLAGS_PF|X86_EFLAGS_AF|
	       X86_EFLAGS_ZF|X86_EFLAGS_SF|X86_EFLAGS_TF|
	       X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
	       X86_EFLAGS_AC|X86_EFLAGS_ID);
}

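The first statement of syscall_init() writes the user and kernel code segment selectors into MSR_STAR, the first of the model-specific registers involved. Bits 63:48 of MSR_STAR hold the base selector for user code: sysret derives the CS and SS selectors from it when returning to user mode at the matching privilege level (in 64-bit mode, CS = value + 16 and SS = value + 8). Bits 47:32 hold the kernel code segment selector: when a user-space program executes syscall, the CPU loads CS from bits 47:32 and SS from that value + 8. These are selectors, not base addresses.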

	/* The default user and kernel segments */
	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
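To make the selector arithmetic concrete, here is a small sketch that packs MSR_STAR the same way and then derives the selectors the CPU would load. The selector values (0x10 for the kernel CS, 0x23 for the 32-bit user CS) are assumptions based on the usual 64-bit Linux GDT layout, and all demo_ names are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed selector values for a 64-bit Linux GDT: index * 8 + RPL. */
#define DEMO_KERNEL_CS  (2 * 8)      /* 0x10 */
#define DEMO_USER32_CS  (4 * 8 + 3)  /* 0x23 */

/* Mirrors wrmsr(MSR_STAR, 0, (user32_cs << 16) | kernel_cs): low half is 0. */
uint64_t demo_make_star(uint16_t user32_cs, uint16_t kernel_cs)
{
	uint32_t high;

	assert((kernel_cs & 3) == 0);  /* the kernel selector has RPL 0 */
	high = ((uint32_t)user32_cs << 16) | kernel_cs;
	return (uint64_t)high << 32;
}

/* Selectors loaded by syscall: CS = STAR[47:32], SS = CS + 8. */
uint16_t demo_syscall_cs(uint64_t star) { return (uint16_t)(star >> 32); }
uint16_t demo_syscall_ss(uint64_t star) { return (uint16_t)(demo_syscall_cs(star) + 8); }

/* Selectors loaded by a 64-bit sysret: STAR[63:48] + 16 / + 8, RPL forced to 3. */
uint16_t demo_sysret64_cs(uint64_t star) { return (uint16_t)(((star >> 48) + 16) | 3); }
uint16_t demo_sysret_ss(uint64_t star)   { return (uint16_t)(((star >> 48) + 8) | 3); }
```

With these assumed values, syscall enters the kernel with CS = 0x10 and SS = 0x18, and a 64-bit sysret returns with CS = 0x33 and SS = 0x2b.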

Next, idt_syscall_init() writes the address of entry_SYSCALL_64, the kernel entry point, into MSR_LSTAR.

	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

With the 64-bit entry point in place, the following MSRs are set up so that a 64-bit CPU can also run 32-bit programs:

MSR_CSTAR - target rip for the compatibility mode callers;
MSR_IA32_SYSENTER_CS - target cs for the sysenter instruction;
MSR_IA32_SYSENTER_ESP - target esp for the sysenter instruction;
MSR_IA32_SYSENTER_EIP - target eip for the sysenter instruction.

if (ia32_enabled()) {
		wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
		/*
		 * This only works on Intel CPUs.
		 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
		 * This does not cause SYSENTER to jump to the wrong location, because
		 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
		 */
		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
		wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
			    (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
	} else {
		wrmsrl_cstar((unsigned long)entry_SYSCALL32_ignore);
		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
		wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
	}
	wrmsrl(MSR_SYSCALL_MASK,
	       X86_EFLAGS_CF|X86_EFLAGS_PF|X86_EFLAGS_AF|
	       X86_EFLAGS_ZF|X86_EFLAGS_SF|X86_EFLAGS_TF|
	       X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
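	       X86_EFLAGS_AC|X86_EFLAGS_ID);  // flags the CPU clears on syscall entry; the most important one is IF, so the entry code starts with interrupts disabled (they are re-enabled later on the C side of the entry path)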
}

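A quick reminder of what a few registers do:

esp - current top of the stack
eip - address of the next instruction
ebp - base of the current stack frame
cs - code segment selector

The code here is fairly straightforward: it programs the MSRs used by 32-bit programs (the compat syscall entry and the sysenter MSRs), or points them at harmless values when IA32 emulation is disabled. Since this part is only about running 32-bit programs on a 64-bit kernel, we skip it and go straight to the real entry function, entry_SYSCALL_64. Its address was written into MSR_LSTAR, so as soon as user space executes syscall the CPU jumps there.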

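The syscall instruction does the following:

It stores the address of the instruction after syscall (the return address) in %rcx, then loads %rip from the IA32_LSTAR MSR.
It saves rflags in %r11, then clears every rflags bit that is set in the IA32_FMASK MSR.
It loads CS from bits 47:32 of the IA32_STAR MSR, and SS from that value plus 8.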
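These steps can be written down as a tiny software model. This is only a sketch for following the register flow: struct demo_cpu and demo_syscall are hypothetical names, and the model ignores everything else the hardware does (privilege level change, segment descriptor caches and so on). Note that rsp is not touched, which is why the entry code has to switch stacks by itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the CPU state touched by syscall. */
struct demo_cpu {
	uint64_t rip, rcx, r11, rflags, rsp;
	uint16_t cs, ss;
	uint64_t msr_star, msr_lstar, msr_fmask;
};

/* Applies the effects of a syscall instruction that is insn_len bytes long. */
void demo_syscall(struct demo_cpu *cpu, uint64_t insn_len)
{
	assert(cpu != 0);
	cpu->rcx = cpu->rip + insn_len;   /* return address -> rcx */
	cpu->rip = cpu->msr_lstar;        /* jump to the kernel entry point */
	cpu->r11 = cpu->rflags;           /* old rflags -> r11 */
	cpu->rflags &= ~cpu->msr_fmask;   /* clear the bits set in FMASK */
	cpu->cs = (uint16_t)(cpu->msr_star >> 32);
	cpu->ss = (uint16_t)(cpu->cs + 8);
	/* rsp is deliberately left untouched */
}
```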

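The actual system call

Preparation before running kernel code (switching the GS base, CR3 and the stack)

Now let's look at the assembly of the real kernel entry point.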

SYM_CODE_START(entry_SYSCALL_64)
	UNWIND_HINT_ENTRY
	ENDBR
	
	swapgs 
	/* tss.sp2 is scratch space. */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR

	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

	/* IRQs are off. */
	movq	%rsp, %rdi
	/* Sign extend the lower 32bit as syscall numbers are treated as int */
	movslq	%eax, %rsi

	/* clobbers %rax, make sure it is after saving the syscall nr */
	IBRS_ENTER
	UNTRAIN_RET
	CLEAR_BRANCH_HISTORY

	call	do_syscall_64		/* returns with IRQs disabled */

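Start with the first two lines. UNWIND_HINT_ENTRY is an annotation for objtool and the ORC stack unwinder; it only carries stack-unwinding metadata, so we skip it. ENDBR is the landing-pad instruction that CET indirect branch tracking (IBT) expects at this entry point; without it the CPU would raise a fault once IBT is enabled. We don't care about it here either.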

	UNWIND_HINT_ENTRY
	ENDBR

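Next, swapgs switches the GS base from the user value to the kernel value. x86-64 Linux keeps a per-CPU data area that is addressed through GS: in kernel mode the GS base points at the current CPU's per_cpu data, while in user mode it holds whatever user space put there (x86-64 user space normally uses FS, not GS, for thread-local storage). swapgs exchanges the current GS base with the value stored in the MSR_KERNEL_GS_BASE MSR. PS: this is not just a leftover of segmentation. In 64-bit mode the bases of CS, DS, ES and SS are indeed treated as 0 (the flat memory model), and those segments differ only by their index in the GDT: on 64-bit kernels, 2 is the kernel code segment, 3 the kernel data segment and 6 the default user CS (see the comments in arch/x86/include/asm/segment.h; the 12/13/14 numbering in that file belongs to the 32-bit layout). FS and GS are the exception: their bases are still honored, which is exactly what makes per-CPU addressing through GS work.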

swapgs
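As a mental model, swapgs is nothing more than an exchange of two 64-bit values. The sketch below uses a hypothetical struct demo_gs to stand for the active GS base and the value parked in MSR_KERNEL_GS_BASE; it is an illustration of the semantics, not how the hardware is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: the active GS base and the MSR_KERNEL_GS_BASE value. */
struct demo_gs {
	uint64_t gs_base;         /* what %gs-relative accesses currently use */
	uint64_t kernel_gs_base;  /* the inactive value parked in the MSR */
};

/* swapgs exchanges the two values; it takes no operands. */
void demo_swapgs(struct demo_gs *g)
{
	uint64_t tmp;

	assert(g != 0);
	tmp = g->gs_base;
	g->gs_base = g->kernel_gs_base;
	g->kernel_gs_base = tmp;
}
```

Because the operation is symmetric, the kernel must execute it exactly once on entry and once on exit; a second swapgs on the exit path puts the user value back.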

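Then the current user stack pointer %rsp is stashed in the TSS. As the comment says, the sp2 field is free to use as scratch space (Linux never runs code in ring 2, so the hardware never reads it).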

	/* tss.sp2 is scratch space. */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

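Next comes the switch to the kernel page tables, i.e. a write to CR3. The macro uses rsp as a scratch register, which is safe because its value has just been saved. The first line of the macro is an ALTERNATIVE, which is patched at boot depending on a feature flag: if PTI (Page Table Isolation) is not enabled, the jmp stays and skips to the end label .Lend_\@; if PTI is enabled, the jmp is replaced by the empty string "" and execution falls through into the following instructions.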

SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
.Lend_\@:
.endm

.macro ADJUST_KERNEL_CR3 reg:req
	ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
	/* Clear PCID and "MITIGATION_PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
.endm

#define PTI_USER_PGTABLE_BIT		PAGE_SHIFT
#define PTI_USER_PGTABLE_MASK		(1 << PTI_USER_PGTABLE_BIT)
#define PTI_USER_PCID_BIT		X86_CR3_PTI_PCID_USER_BIT
#define PTI_USER_PCID_MASK		(1 << PTI_USER_PCID_BIT)
#define PTI_USER_PGTABLE_AND_PCID_MASK  (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)

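The macro then reads CR3 into the scratch register (rsp here) and invokes another macro. That one checks whether PCID (Process-Context Identifiers) is in use; if so it sets the NOFLUSH bit, so that writing CR3 does not flush the TLB. It then clears PTI_USER_PCID_MASK and PTI_USER_PGTABLE_MASK in the user CR3 value, and the result is written back to CR3. The key point is that the user CR3 and the kernel CR3 differ by exactly one page: clearing PTI_USER_PGTABLE_MASK, i.e. bit 12 (1<<12), turns the user CR3 into the kernel CR3. This is the KPTI mechanism, introduced mainly as a mitigation for Meltdown.

[Figure: the user and kernel CR3 values, one page apart]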
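The bit manipulation done by ADJUST_KERNEL_CR3 can be replayed in C. This is a sketch under stated assumptions: the bit positions (12 for the page-table bit, 11 for the user PCID bit, 63 for the no-flush bit) are what I believe the x86_64 defines expand to, and demo_kernel_cr3 is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions on x86_64. */
#define DEMO_PGTABLE_BIT  12  /* PAGE_SHIFT */
#define DEMO_PCID_BIT     11  /* X86_CR3_PTI_PCID_USER_BIT */
#define DEMO_NOFLUSH_BIT  63  /* CR3 bit that suppresses the TLB flush */

/* Turns a user CR3 value into the matching kernel CR3 value. */
uint64_t demo_kernel_cr3(uint64_t user_cr3, bool have_pcid)
{
	uint64_t mask = (1ULL << DEMO_PGTABLE_BIT) | (1ULL << DEMO_PCID_BIT);
	uint64_t cr3 = user_cr3;

	/* A user CR3 points at the second page of the page-table pair. */
	assert((user_cr3 >> DEMO_PGTABLE_BIT) & 1);

	if (have_pcid)
		cr3 |= 1ULL << DEMO_NOFLUSH_BIT;  /* SET_NOFLUSH_BIT */
	return cr3 & ~mask;                       /* the andq in the macro */
}
```

For a user CR3 of 0x12345805 the kernel CR3 comes out as 0x12344005: the page-table root moves down by one page and the user PCID bit is dropped.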

Finally we switch to the kernel stack by loading the per-CPU top-of-stack value into RSP.

	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp

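Saving the user-mode registers

With the kernel stack and kernel page tables in place, the entry code starts saving registers. Let's go through it line by line.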

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR # marks this address as intentionally having no endbr instruction; it does nothing important at run time

	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
	
.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 clear_bp=1 unwind_hint=1
	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret unwind_hint=\unwind_hint
	CLEAR_REGS clear_bp=\clear_bp
.endm

.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 unwind_hint=1
	.if \save_ret
	pushq	%rsi		/* pt_regs->si */  # with save_ret=1 a return address (pushed by call) already sits on top of the stack, where pt_regs->di belongs, so it is moved into %rsi first
	movq	8(%rsp), %rsi	/* temporarily store the return address in %rsi */
	movq	%rdi, 8(%rsp)	/* pt_regs->di (overwriting original return address) */
	.else
	pushq   %rdi		/* pt_regs->di */
	pushq   %rsi		/* pt_regs->si */
	.endif
	pushq	\rdx		/* pt_regs->dx */
	pushq   \rcx		/* pt_regs->cx */
	pushq   \rax		/* pt_regs->ax */
	pushq   %r8		/* pt_regs->r8 */
	pushq   %r9		/* pt_regs->r9 */
	pushq   %r10		/* pt_regs->r10 */
	pushq   %r11		/* pt_regs->r11 */
	pushq	%rbx		/* pt_regs->rbx */
	pushq	%rbp		/* pt_regs->rbp */
	pushq	%r12		/* pt_regs->r12 */
	pushq	%r13		/* pt_regs->r13 */
	pushq	%r14		/* pt_regs->r14 */
	pushq	%r15		/* pt_regs->r15 */

	.if \unwind_hint
	UNWIND_HINT_REGS
	.endif

	.if \save_ret
	pushq	%rsi		/* return address on top of stack */ # finally push %rsi, which holds the return address, back on top
	.endif
.endm

.macro CLEAR_REGS clear_bp=1
	/*
	 * Sanitize registers of values that a speculation attack might
	 * otherwise want to exploit. The lower registers are likely clobbered
	 * well before they could be put to use in a speculative execution
	 * gadget.
	 */
	xorl	%esi,  %esi	/* nospec si  */
	xorl	%edx,  %edx	/* nospec dx  */
	xorl	%ecx,  %ecx	/* nospec cx  */
	xorl	%r8d,  %r8d	/* nospec r8  */
	xorl	%r9d,  %r9d	/* nospec r9  */
	xorl	%r10d, %r10d	/* nospec r10 */
	xorl	%r11d, %r11d	/* nospec r11 */
	xorl	%ebx,  %ebx	/* nospec rbx */
	.if \clear_bp
	xorl	%ebp,  %ebp	/* nospec rbp */
	.endif
	xorl	%r12d, %r12d	/* nospec r12 */
	xorl	%r13d, %r13d	/* nospec r13 */
	xorl	%r14d, %r14d	/* nospec r14 */
	xorl	%r15d, %r15d	/* nospec r15 */

.endm

Then the registers are pushed to form the pt_regs structure:

	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss: user data segment selector */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp: user rsp stashed in the TSS earlier */
	pushq	%r11					/* pt_regs->flags: rflags, saved in r11 by the syscall instruction */
	pushq	$__USER_CS				/* pt_regs->cs: user code segment selector */
	pushq	%rcx					/* pt_regs->ip: next user instruction, saved in rcx by the CPU when syscall executed */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
	pushq	%rax					/* pt_regs->orig_ax: rax, i.e. the system call number */

Then comes a macro invocation that pushes the remaining registers in order:

PUSH_AND_CLEAR_REGS rax=$-ENOSYS

It helps to read this part side by side with struct pt_regs, because the registers are pushed in the reverse order of the struct's fields.

struct pt_regs {
	/*
	 * C ABI says these regs are callee-preserved. They aren't saved on
	 * kernel entry unless syscall needs a complete, fully filled
	 * "struct pt_regs".
	 */
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;

	/* These regs are callee-clobbered. Always saved on kernel entry. */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;

	/*
	 * orig_ax is used on entry for:
	 * - the syscall number (syscall, sysenter, int80)
	 * - error_code stored by the CPU on traps and exceptions
	 * - the interrupt number for device interrupts
	 *
	 * A FRED stack frame starts here:
	 *   1) It _always_ includes an error code;
	 *
	 *   2) The return frame for ERET[US] starts here, but
	 *      the content of orig_ax is ignored.
	 */
	unsigned long orig_ax;

	/* The IRETQ return frame starts here */
	unsigned long ip;

	union {
		/* CS selector */
		u16		cs;
		/* The extended 64-bit data slot containing CS */
		u64		csx;
		/* The FRED CS extension */
		struct fred_cs	fred_cs;
	};

	unsigned long flags;
	unsigned long sp;

	union {
		/* SS selector */
		u16		ss;
		/* The extended 64-bit data slot containing SS */
		u64		ssx;
		/* The FRED SS extension */
		struct fred_ss	fred_ss;
	};

	/*
	 * Top of stack on IDT systems, while FRED systems have extra fields
	 * defined above for storing exception related information, e.g. CR2 or
	 * DR6.
	 */
};

Putting the two together gives this picture:

[Figure: the kernel stack after the pushes, mapped onto struct pt_regs]
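The same layout can be checked with a few lines of C. The struct below is a simplified stand-in for struct pt_regs (the cs and ss unions are flattened to plain 64-bit slots), and DEMO_OFF and demo_push_count are hypothetical helpers for this example. The offsets are measured from the final %rsp, which points at r15, the last register pushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct pt_regs: unions flattened to plain slots. */
struct demo_pt_regs {
	uint64_t r15, r14, r13, r12, bp, bx;  /* pushed last, lowest addresses */
	uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
	uint64_t orig_ax;
	uint64_t ip, cs, flags, sp, ss;       /* pushed first, highest addresses */
};

/* Byte offset of a field from the final %rsp, i.e. from &regs->r15. */
#define DEMO_OFF(field) offsetof(struct demo_pt_regs, field)

/* Number of 8-byte pushes needed to build the whole frame. */
size_t demo_push_count(void)
{
	assert(sizeof(struct demo_pt_regs) % sizeof(uint64_t) == 0);
	return sizeof(struct demo_pt_regs) / sizeof(uint64_t);
}
```

The frame takes 21 pushes in total: five for the hardware-style frame (ss, sp, flags, cs, ip), one for orig_ax and fifteen from PUSH_REGS. The saved di sits at offset 14*8 and the saved user sp at 19*8.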

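Looking closely, rax is pushed twice:

orig_ax is used by restart_syscall / ptrace / syscall tracing: it records which syscall the user originally invoked.
ax is the regular register slot, saved like the other general-purpose registers to complete pt_regs. Note that the two values are not the same here: because of rax=$-ENOSYS, the ax slot is initialized to -ENOSYS rather than to the syscall number. It is the placeholder for the return value, and -ENOSYS is what user space sees if no handler overwrites it.

Also note that CLEAR_REGS does not clear %rdi. It doesn't need to: the very next instruction overwrites it with the pt_regs pointer, the first argument of do_syscall_64. (%rax is not cleared either, since it still holds the syscall number.)

OK, at this point all register values are saved on the kernel stack, and the real system call can be invoked.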

/* IRQs are off. */
	movq	%rsp, %rdi
	/* Sign extend the lower 32bit as syscall numbers are treated as int */
	movslq	%eax, %rsi

	/* clobbers %rax, make sure it is after saving the syscall nr */
	IBRS_ENTER  # restrict indirect branch speculation
	UNTRAIN_RET # untrain the return predictor
	CLEAR_BRANCH_HISTORY # clear the branch history to prevent speculative leaks
	
	call	do_syscall_64		/* returns with IRQs disabled */
	
	
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)

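The function takes two arguments: the address of pt_regs and the system call number. So the assembly copies rsp, which is the address of pt_regs, into rdi, the first argument register of the C calling convention, and the sign-extended eax, the system call number, into rsi, the second argument register. It then applies a few Spectre mitigations. (We won't go into those here; for background on the Spectre and Meltdown vulnerabilities see Understanding-Spectre-v2-and-How-the-Vulnerability-Impact-the-Cloud-Security_Gavin-Guo.pdf.)

do_syscall_64

At last we leave assembly and arrive at a C function that is a bit easier to read. Here is the complete code first; we'll then go through it piece by piece.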

__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
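	add_random_kstack_offset(); // add a random offset to the kernel stack
	nr = syscall_enter_from_user_mode(regs, nr); // entry work such as syscall filtering/restrictions (seccomp etc.); this may change the syscall number, and the returned nr is the one actually dispatched

	instrumentation_begin(); // start of the region where instrumentation (kprobes, ftrace, ...) is allowed inside this noinstr function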

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

	instrumentation_end();
	syscall_exit_to_user_mode(regs);

	/*
	 * Check that the register state is valid for using SYSRET to exit
	 * to userspace.  Otherwise use the slower but fully capable IRET
	 * exit path.
	 */

	/* XEN PV guests always use the IRET path */
	if (cpu_feature_enabled(X86_FEATURE_XENPV))
		return false;

	/* SYSRET requires RCX == RIP and R11 == EFLAGS */
	if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
		return false;

	/* CS and SS must match the values set in MSR_STAR */
	if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
		return false;

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * TASK_SIZE_MAX covers all user-accessible addresses other than
	 * the deprecated vsyscall page.
	 */
	if (unlikely(regs->ip >= TASK_SIZE_MAX))
		return false;

	/*
	 * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
	 * restoring TF results in a trap from userspace immediately after
	 * SYSRET.
	 */
	if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
		return false;

	/* Use SYSRET to exit to userspace */
	return true;
}

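The key part, where the system call is actually dispatched, is the fragment below. It first tries the native 64-bit table, then the x32 ABI table (x32 is the 64-bit ABI with 32-bit pointers, not 32-bit compat mode). If both fail, it falls back to the ni (not implemented) system call, so an invalid number cannot crash the kernel.

do_syscall_x64(regs, nr) → try to dispatch a native 64-bit syscall
do_syscall_x32(regs, nr) → try to dispatch an x32 ABI syscall
If both fail and the syscall number is not -1 (nr != -1):
- call __x64_sys_ni_syscall() → the "not implemented" syscall
- store its return value in regs->ax, so user space sees the correct result (-ENOSYS)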

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
	/*
	 * Convert negative numbers to very high and thus out of range
	 * numbers for comparisons.
	 */
	unsigned int unr = nr;

	if (likely(unr < NR_syscalls)) {
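		unr = array_index_nospec(unr, NR_syscalls); // Spectre v1 hardening: clamp the index even under speculation (a negative nr such as -1 was already rejected above, because it becomes a huge value once converted to unsigned)
		regs->ax = x64_sys_call(regs, unr); // call the handler for syscall number unr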
		return true;
	}
	return false;
}

static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
{
	/*
	 * Adjust the starting offset of the table, and convert numbers
	 * < __X32_SYSCALL_BIT to very high and thus out of range
	 * numbers for comparisons.
	 */
	unsigned int xnr = nr - __X32_SYSCALL_BIT;

	if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
		xnr = array_index_nospec(xnr, X32_NR_syscalls);
		regs->ax = x32_sys_call(regs, xnr);
		return true;
	}
	return false;
}

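x64_sys_call is neat as well: it is just a plain switch, whose cases come from an included header that is generated automatically from the syscall table. For the table itself see arch/x86/entry/syscalls/syscall_64.tbl.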

long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
	switch (nr) {
	#include <asm/syscalls_64.h>
	default: return __x64_sys_ni_syscall(regs);
	}
};
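The "switch filled in from a generated header" trick can be reproduced in miniature with an X-macro. Everything here is hypothetical: DEMO_SYSCALL_TABLE stands in for the generated header, the two handlers are dummies, and the sketch leaves out array_index_nospec. It only shows the shape of the dispatch, including why a negative nr is rejected by a single unsigned comparison.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_ENOSYS 38

/* Stand-in for the generated table: X(number, handler). */
#define DEMO_SYSCALL_TABLE(X) \
	X(0, demo_sys_zero) \
	X(1, demo_sys_one)

#define DEMO_NR_SYSCALLS 2

long demo_sys_zero(void) { return 100; }
long demo_sys_one(void)  { return 101; }

/* The switch body is generated by expanding the table, one case per entry. */
long demo_sys_call(unsigned int nr)
{
	switch (nr) {
#define DEMO_CASE(n, fn) case n: return fn();
	DEMO_SYSCALL_TABLE(DEMO_CASE)
#undef DEMO_CASE
	default: return -DEMO_ENOSYS;
	}
}

/* Same shape as do_syscall_x64: a negative nr becomes a huge unsigned value. */
bool demo_do_syscall(int nr, long *ret)
{
	unsigned int unr = nr;

	assert(ret != 0);
	if (unr < DEMO_NR_SYSCALLS) {
		*ret = demo_sys_call(unr);
		return true;
	}
	return false;
}
```

Adding a syscall in this model means adding one line to the table; the switch never has to be edited by hand.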

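OK, by now it should be roughly clear how a system call gets dispatched and how it reaches the right entry in the table.

Returning from a system call

Let's look at the full code path first. When do_syscall_64 is done, the result of the syscall has been written to pt_regs->ax. If do_syscall_64 itself returns true, the kernel returns with sysret, the fast path; if it returns false, it takes the slower iret path.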

	call	do_syscall_64		/* returns with IRQs disabled */

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.  If we're not,
	 * go to the slow exit path.
	 * In the Xen PV case we must use iret anyway.
	 */

	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
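	IBRS_EXIT # undo the IBRS setting made on entry
	POP_REGS pop_rdi=0 # pop the saved registers from pt_regs, except %rdi: it stays on the stack because rdi is needed as a scratch register for the stack switch below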

	/*
	 * Now all regs are restored except RSP and RDI.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi # keep the current stack pointer (%rsp) in rdi
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp  # switch rsp to the trampoline stack, a small fixed stack
	UNWIND_HINT_END_OF_STACK

	pushq	RSP-RDI(%rdi)	/* RSP */  
	pushq	(%rdi)		/* RDI */

	/*
	 * We are on the trampoline stack.  All regs except RDI are live.
	 * We can do future final exit work right here.
	 */
	STACKLEAK_ERASE_NOCLOBBER

	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

	popq	%rdi
	popq	%rsp
SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR
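	swapgs # swap the GS base from the kernel per-cpu base back to the user-mode value; this must happen before the final sysretq, because user space must see its own GS base again
	CLEAR_CPU_BUFFERS
	sysretq # return to user mode; nothing after this instruction is executed. sysretq is the real fast return: RIP comes from RCX, RFLAGS from R11, CS/SS are derived from MSR_STAR, and RSP has already been restored to the user RSP. The CPU continues in user mode right after it.
SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR
	int3  # int3 is the breakpoint instruction. On a correct system sysretq never falls through to the next instruction; this is only a safety net ("should not reach") that traps if control flow ever gets here.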
SYM_CODE_END(entry_SYSCALL_64)

Two lines in there look a bit like magic:

	pushq	RSP-RDI(%rdi)	/* RSP */  
	pushq	(%rdi)		/* RDI */
        
/* arch/x86/entry/calling.h: offsets into struct pt_regs */
#define RDI		14*8
#define RSP		19*8

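At this point we are on the trampoline stack, but the old kernel stack still holds a few pt_regs slots that have not been popped: after POP_REGS pop_rdi=0, %rdi points at the saved RDI slot, which is followed by orig_ax, ip, cs, flags, sp and ss. RSP-RDI is the distance between the saved RSP and saved RDI slots (5*8 = 40 bytes), so these two lines copy the saved user RSP and RDI from the old kernel stack onto the trampoline stack, where they are popped right before returning. But what about the remaining slots, are they never popped? This is where iret and sysret differ: iret needs ss, sp and friends on the stack and pops them itself, while sysret does not. The leftovers simply stay on the kernel stack, which is harmless because the next entry overwrites them.

In more detail:

iretq requires a strict frame on the stack (SS, RSP, RFLAGS, CS, RIP), and the CPU really pops those values.
sysretq is lighter: it takes the return address from %rcx and RFLAGS from %r11 and jumps back to user mode. It ignores whatever is left on the stack.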