30 1 2021

c++ note

1 basic

1.1 virtual

在构造函数和析构函数中调用虚的成员函数不会有虚函数的效果, 即还是调用父类的成员函数, 可以使用工厂模式来解决。
类中有虚函数时, 析构函数要为虚或者 protected

1.2 lvalue and rvalue

rvalue
- xvalue(Expiring Value)
- prvalue(Pure Right Value)
Return Value Optimization, 编译器减少函数返回时生成临时值（对象）的个数, 编译器默认使用, 编译时使用 -fno-elide-constructors 来禁用
Named Return Value Optimization
std::forward 完美转发, defined in header <utility>, 能保留const, 可以被static_cast替代, 能转发：
- const T &
- T &
- const T &&
- T &&

1.3 cast

const_cast: 可 cast const 的对象, 除 function pointers
reinterpret_cast: 将内存里的值重新解释
dynamic_cast: 一般地讲, 能用虚函数解决的问题就不要用 dynamic_cast, 能够用 dynamic_cast 解决的就不要用 typeid

1.4 inheritance

return type

If the function Derived::f overrides a function Base::f, their return types must either be the same or be covariant. Two types are covariant if they satisfy all of the following requirements:

both types are pointers or references (lvalue or rvalue) to classes. Multi-level pointers or references are not allowed. the referenced/pointed-to class in the return type of Base::f() must be a unambiguous and accessible direct or indirect base class of the referenced/pointed-to class of the return type of Derived::f(). the return type of Derived::f() must be equally or less cv-qualified than the return type of Base::f().
强制要求派生类显式地使用接口, 又为派生类准备一个可调用的默认实现：为纯虚函数提供定义, 这种定义不会存入虚表, 因此基类本身仍然是抽象类, 并且派生类仍然需要提供一个显式实现
```
class Base
{
public:
  virtual void doSomething() = 0;

};

void Base::doSomething() { std::cout << "haha" << std::endl; };

class Derived : public Base
{
public:
  virtual void doSomething()
  {
    Base::doSomething();
  }
};
```

虚继承可以解决菱形继承问题

class VBase {
  virtual void f();
};

class Middle1 : virtual VBase {};

class Middle2 : virtual VBase {};

class Derived : virtual Middle1, virtual Middle2 {};

int main() {
  return sizeof(Derived);
}

-Xclang -fdump-record-layouts -std=c++11

1.5 time

Function	Type	Precision	Remark
time(2)	time_t	second
ftime(3)	struct timeb	millisecond	obsolete
gettimeofday(2)	struct timeval	microsecond
clock_gettime(2)	struct timespec	nanosecond
Time Stamp Counter	64-bit register	CPU related

time/gettimeofday/clock_gettime 都可以使用 vdso, 可以不陷入内核粗粒度地获取时间, clock_gettime 需要指定 CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE
```
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
unsigned long long millisecs = ts.tv_sec * 1000 + ts.tv_nsec / (1000 * 1000);
```

tm struct

#include <time.h>
	   struct tm {
	     int tm_sec;    /* Seconds (0-60) */
	     int tm_min;    /* Minutes (0-59) */
	     int tm_hour;   /* Hours (0-23) */
	     int tm_mday;   /* Day of the month (1-31) */
	     int tm_mon;    /* Month (0-11) */
	     int tm_year;   /* Year - 1900 */
	     int tm_wday;   /* Day of the week (0-6, Sunday = 0) */
	     int tm_yday;   /* Day in the year (0-365, 1 Jan = 0) */
	     int tm_isdst;  /* Daylight saving time */
	 };

timeval struct

#include <sys/time.h>
	   struct timeval {
	     time_t      tv_sec;     /* seconds */
	     suseconds_t tv_usec;    /* microseconds */
	 };

timezone struct

  struct timezone {
    int tz_minuteswest;     /* minutes west of Greenwich */
    int tz_dsttime;         /* type of DST correction */
};

timespec struct

  struct timespec {
	time_t          tv_sec;
	long            tv_nsec;
};

Convert tm structure to time_t time_t mktime (struct tm * timeptr);

Convert time_t to tm as local time struct tm * localtime (const time_t * timer);

struct tm *localtime_r(const time_t *timep, struct tm *result);

setenv("TZ", "/usr/share/zoneinfo/America/Los_Angeles", 1); // POSIX-specific
setenv("TZ", "/usr/share/zoneinfo/Asia/Shanghai", 1); // POSIX-specific

Convert time_t to tm as UTC time struct tm * gmtime (const time_t * timer);

1.6 smart pointer

enable_shared_from_this

是一种 CRTP
shared_from_this: returns a shared_ptr which shares ownership of *this
weak_from_this: returns the weak_ptr which shares ownership of *this

#include <memory>
#include <iostream>

struct Good: std::enable_shared_from_this<Good> // note: public inheritance
{
    std::shared_ptr<Good> getptr() {
	return shared_from_this();
    }
};

struct Bad
{
    std::shared_ptr<Bad> getptr() {
	return std::shared_ptr<Bad>(this);
    }
    ~Bad() { std::cout << "Bad::~Bad() called\n"; }
};

int main()
{
    // Good: the two shared_ptr's share the same object
    std::shared_ptr<Good> gp1 = std::make_shared<Good>();
    std::shared_ptr<Good> gp2 = gp1->getptr();
    std::cout << "gp2.use_count() = " << gp2.use_count() << '\n';

    // Bad: shared_from_this is called without having std::shared_ptr owning the caller 
    try {
	Good not_so_good;
	std::shared_ptr<Good> gp1 = not_so_good.getptr();
    } catch(std::bad_weak_ptr& e) {
	// undefined behavior (until C++17) and std::bad_weak_ptr thrown (since C++17)
	std::cout << e.what() << '\n';    
    }

    // Bad, each shared_ptr thinks it's the only owner of the object
    std::shared_ptr<Bad> bp1 = std::make_shared<Bad>();
    std::shared_ptr<Bad> bp2 = bp1->getptr();
    std::cout << "bp2.use_count() = " << bp2.use_count() << '\n';
} // UB: double-delete of Bad

weak_ptr
- std::weak_ptr models temporary ownership: when an object needs to be accessed only if it exists, and it may be deleted at any time by someone else, std::weak_ptr is used to track the object, and it is converted to std::shared_ptr to assume temporary ownership. If the original std::shared_ptr is destroyed at this time, the object's lifetime is extended until the temporary std::shared_ptr is destroyed as well.
- Another use for std::weak_ptr is to break reference cycles formed by objects managed by std::shared_ptr. If such cycle is orphaned (i,e. there are no outside shared pointers into the cycle), the shared_ptr reference counts cannot reach zero and the memory is leaked. To prevent this, one of the pointers in the cycle can be made weak.
the trade-offs between make_shared and shared_ptr+new
- std::shared_ptr<T>(new T(args…)) performs at least two allocations (one for the object T and one for the control block of the shared pointer), while std::make_shared<T> typically performs only one allocation.
- If any std::weak_ptr references the control block created by std::make_shared after the lifetime of all shared owners ended, the memory occupied by T persists until all weak owners get destroyed as well, which may be undesirable if sizeof(T) is large.
- std::shared_ptr<T>(new T(args…)) may call a non-public constructor of T if executed in context where it is accessible, while std::make_shared requires public access to the selected constructor.
- Unlike the std::shared_ptr constructors, std::make_shared does not allow a custom deleter.
- std::make_shared uses ::new, so if any special behavior has been set up using a class-specific operator new, it will differ from std::shared_ptr<T>(new T(args…)).

1.7 cmake

添加编译选项: add_compile_options(-g -O0 -std=c++17 -D_DEBUG -D__LINUX__ -Wno-enum-compare)
获取父目录: get_filename_component(PARENT_DIR ${PROJECT_SOURCE_DIR} DIRECTORY)
打印信息: message(STATUS "parent directory is: " ${PARENT_DIR})
指定可执行文件存放目录: set(EXECUTABLE_OUTPUT_PATH ${PARENT_DIR}/bin)
添加 protobuf 的头文件路径和链接库:
- find_package(Protobuf)
- include_directories(${Protobuf_INCLUDE_DIRS})
- target_link_libraries(a.out ${Protobuf_LIBRARIES} pthread)

1.8 RAII

资源获取即初始化（Resource Acquisition Is Initialization）, 或称 RAII, 它将必须在使用前请求的资源的生命周期与一个对象的生命周期相绑定。

拥有 open()/close()、lock()/unlock(), 或 init()/copyFrom()/destroy() 成员函数的类是非 RAII 类的典型的例子

1.9 value semantics

对一个具有值语义的原始变量变量赋值可以转换成内存的 bit-wise-copy
如果一个type X 具有值语义, 则：
- X 的 size 在编译时可以确定
- 将 X 的变量 x, 赋值与另一个变量 y, 无须专门的 = operator, 简单的 bit-wise-copy 即可
- 当上述赋值发生后, x 和 y 脱离关系：x 和 y 可以独立销毁, 其内存也可以独立释放

1.10 CRTP

Curiously Recurring Template Pattern 奇异递归模板模式, 更一般地被称作 F-bound polymorphism

派生类继承自模板类, 派生类将自身作为参数传给模板类
基类转换成派生类用的是 static_cast 而不是 dynamic_cast, 降低了继承带来的虚函数表查询开销
enable_shared_from_this 和 ranges::view_interface 属于 CRTP

1.11 network I/O basic

1.11.1 socket

setsockopt
- SO_REUSEADDR enables local address reuse 对 time-wait 链接, 确保 server 重启成功
- SO_REUSEPORT enables duplicate address and port bindings 可解决 thundering herd problem
  
  SO_REUSEPORT (since Linux 3.9) Permits multiple AF_INET or AF_INET6 sockets to be bound to an identical socket address. This option must be set on each socket (including the first socket) prior to calling bind(2) on the socket. To prevent port hijacking, all of the processes binding to the same address must have the same effective UID. This option can be employed with both TCP and UDP sockets.
  
  For TCP sockets, this option allows accept(2) load distribution in a multi- threaded server to be improved by using a distinct listener socket for each thread. This provides improved load distribution as compared to traditional techniques such using a single accept(2)ing thread that distributes connec‐ tions, or having multiple threads that compete to accept(2) from the same socket.
  
  For UDP sockets, the use of this option can provide better distribution of incoming datagrams to multiple processes (or threads) as compared to the tra‐ ditional technique of having multiple processes compete to receive datagrams on the same socket.

1.11.2 select

当前进程使用 O(N) 时间轮询 bitmap 中准备就绪的 socket, 并将其保存进 socket 的等待队列中（作为等待者）, 如果没有准备就绪的 socket, 该进程将会被挂起, 等有数据传输完毕触发中断程序（kernel 也可以启动 interrupt coalescing 机制, 让网卡做中断合并）使 cpu 让出时间片去将数据导入 socket 的读缓冲区并将等待者移入工作队列

1.11.3 poll

用数组代替 select 中的 bitmap 使其监听的 socket 数可以大于1024
signal

SIGPIPE
client close socket, server call write, then server will receive a RST segment, if server do a write again, then it will cause SIGPIPE

SIGCHLD
zombie child process
usage
- ignore SIGPIPE and SIGCHLD
- non-blocking sokcet + I/O multiplexing
- create listenfd: socket_nonblocking, sokcet_cloexec (or call fcntl f_setfl(o_nonblock), f_setfd(fd_cloexec))
```
listenfd = socket(PF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
//or
flags = fcntl (listenfd, F_GETFL, 0);
flags |= O_NONBLOCK;
fcntl (listenfd, F_SETFL, flags);
```
- &*pollfds.begin() 等同于 pollfds.data()
- 每次调用传一个 struct pollfd 的数组给内核, 内核使用链表存储
- specifying a negative value in timeout means an infinite timeout.
- specifying a timeout of zero causes poll() to return immediately, even if no file descriptors are ready.
- accept(2) return EMFILE 处理（太多文件）
```
idlefd = open("dev/null", O_RDONLY | O_CLOEXEC);
if (errno == EMFILE)
  {
    close(idlefd);
    idlefd = accept(listenfd, nullptr, nullptr);
    close(idlefd);
    idlefd = open("dev/null", O_RDONLY | O_CLOEXEC);
    continue;
```

1.11.4 epoll

file operations 中的 poll 函数的作用

// file_operations Struct Reference 
#include <fs.h>
unsigned int(* 	poll )(struct file *, struct poll_table_struct *);

typedef struct poll_table_struct {
  poll_queue_proc _qproc;
  unsigned long _key;
 } poll_table;

将当前进程的 task_struct 加入设备驱动的等待队列, 并设置回调函数
检查已发生的事件 POLLIN, POLLOUT, POLLERR

等待队列

等待队列包含队列头(wait_queue_head_t)和队列项(wait_queue_t)

#include <wait.h>
struct list_head{
    struct list_head *next, &prev;
}; // 双向链表

typedef struct __wait_queue_head wait_queue_head_t;
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};

typedef struct __wait_queue wait_queue_t;
struct __wait_queue {
    unsigned int flags;
#define WQ_FLAG_EXCLUSIVE   0x01
    void *private;		/* 指向 task_struct */
    wait_queue_func_t func;	/* callback function */
    struct list_head task_list;
};

epoll_ctl(epfd, EPOLL_ADD_CTL, fd, &event)
- copy_from_user 将用户的 event 拷贝到内核 epds
- 得到 epfd 的 file 结构体 named file, fd 的结构体 named tfile
- tfile 需要支持 poll: tfile->f_op->poll
- epfd 不能监听自己: file == tfile
- 获取 eventpoll 结构体 (每创建一个 epollfd, 内核就会分配一个 eventpoll 与之对应, 可以说是内核态的 epollfd), 来自与 epoll_create1() 中的分配: ep = file->private_data
- 加锁: mutex_lock(&ep->mtx)
- 对于每一个监听的 fd, 内核都有分配一个 epitem 结构体, epi = ep_find(ep, tfile, fd), rbtree 查找, O(lgn) 的时间复杂度
- if !epi 表示可插入, 复制到内核的 event: epds 添加事件 epds.events |= POLLERR | POLLHUP
- call ep_insert(ep, &epds, tfile, fd)
  - 从 slab 中分配一个 epitem: epi = kmem_***_alloc(epi_***, GFP_KERNEL)
  - 初始化 epi->ep = ep, ep_set_ffd(&epi->ffd, tfile, fd), epi->event = *event …
  - struct ep_pqueue epq: epq.epi = epi
  - 初始化 epq 的 poll_table, 指定调用 poll_wait 时的回调函数为 ep_ptable_queue_proc: init_poll_funcptr(&epq.pt, ep_ptable_queue_proc)
  - revents = tfile->f_op->poll(tfile, &epq.pt): f_op->poll(), sock_poll(), udp/tcp_poll(), datagram_poll(), sock_poll_wait(), ep_ptable_queue_proc()
    - static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt)
      - 根据 poll table 获取 epitem: struct epitem *epi = ep_item_from_epqueue(pt)
      - 创建 struct eppoll_entry * pwq = kmem_***_alloc(pwq_***, GFP_KERNEL)
      - 指定 ep_poll_callback 为唤醒时的回调函数: init_waitqueue_func_entry(&pwq->wait, ep_poll_callback)
      - pwq->base = epi
      - 将 pwq 挂入 sock->wq->wait (等待队列): add_wait_queue(whead, &pwq->wait)
    - static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
      - 通过等待队列获取 epitem 和 eventpoll: struct epitem *epi = ep_item_from_wait(wait), struct eventpoll *ep = epi->ep
      - 检查有没有我们关心的事件: if (key && !((unsigned long) key & epi->event.events))
  - 文件监听自己的 epitem 链起来: list_add_tail(&epi->fllink, &tfile->f_ep_links)
  - epitem 插入 eventpoll 的 rbtree 中: ep_rbtree_insert(ep, epi)
  - 监听数加1: atomic_inc(&ep->user->epoll_watches)
- 解锁: mutex_unlock(&ep->mtx)
epoll_wait(epfd, events, maxevents, timeout)
- 验证内存空间是否有效: access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))
- 获取 epfd 在内核对应的 struct event_poll *ep = fget(epfd)->private_data
- call ep_poll(ep, events, maxevents, timeout)
  - 检查 ready list (rdllist) 是否为空: list_empty(&ep->rdllist)
  - 初始化等待队列: init_waitqueue_entry(&wait, current), current 是当前进程的 task_struct*
  - 挂载到 ep 的等待队列: __add_wait_queue_exclusive(&ep->wq, &wait)
  - sleep
  - set_current_state(TASK_INTERRUPTIBLE): if the ep_poll_callback() sends us a wakeup in between, we can wake
  - wake 后: __remove_wait_queue(&ep->wq, &wait), set_current_state(TASK_RUNNING)
  - call ep_send_events(ep, events, maxevents)
    - call ep_scan_ready_list(ep, ep_send_events_proc, &esed)
      - 监听到 events 的 epitem 都链到 rdllist 上了
      - 将 rdllist 所有的 epitem 都转移到了 txlist 上, 清空 rdllist: list_splice_init(&ep->rdllist, &txlist)
      - 清空 ovflist: ep->ovflist = NULL
      - call the callback function: (*sproc)(ep, &txlist, priv), 即 ep_send_events_proc(ep, &txlist, priv)
        
        priv: Private opaque data passed to the @sproc callback
        
        扫描 txlist, 依次取出 epitem: list_first_entry(head, struct epitem, rdllink)
        
        事件和用户传入的数据都 copy 给用户空间: __put_user(revents, &uevent->events), __put_user(epi->event.data, &uevent->data)
        
        判断 trigger mode (LT or ET): if (!(epi->event.events & EPOLLET))
        
        如果是 LT, 重新插入到 rdllist: list_add_tail(&epi->rdllink, &ep->rdllist)
      - all the events that happens during that period of time are chained in ep->ovflist and requeued later on
      - 将 ovflist 中的 epi 加到 rdllist
      - 上一次没有处理完的 epitem, 重新插入到 rdllist: list_splice(&txlist, &ep->rdllist)
      - 如果 rdllist 不为空, 唤醒

用户态与内核态间的数据交换

user	kernel
int epfd	struct event_poll *ep = fget(epfd)->private_data
int fd (listenfd/connfd)	struct epitem *epi: alloc from slab, ep_set_ffd
struct epoll_event event	struct epoll_event epds: copy_from_user
events 数组	copy to user by __put_user function

LT and ET

在 LT 模式下, ready list 上取出的 epitem 上报完事件后会重新加回 ready list

// static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
if (!(epi->event.events & EPOLLET)) {
	    list_add_tail(&epi->rdllink, &ep->rdllist);
	}

如果 ready list 不为空, 且此时有进程阻塞在同一个 event_poll 睡眠队列上, 则唤醒它

// static int ep_scan_ready_list(struct eventpoll *ep, int (*sproc)(struct eventpoll *, struct list_head *, void *), void *priv)

if (!list_empty(&ep->rdllist)) {
    /*
     * Wake up (if active) both the eventpoll wait list and
     * the ->poll() wait list (delayed after we release the lock).
     */
    if (waitqueue_active(&ep->wq))
	wake_up_locked(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
	pwake++;
}

usage

使用 epoll_create1(EPOLL_CLOEXEC) 创建 epollfd

使用 epoll_ctl(epollfd, EPOLL_CTL_ADD, listenfd, &event) 添加 event

// The struct epoll_event is defined as:

    typedef union epoll_data {
	void    *ptr;
	int      fd;
	uint32_t u32;
	uint64_t u64;
    } epoll_data_t;

    struct epoll_event {
	uint32_t     events;    /* Epoll events */
	epoll_data_t data;      /* User data variable */
    };

    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = listenfd;
    if (mode == 1)
      event.events |= EPOLLET;
    else if (mode == 2)
      event.events |= EPOLLONESHOT;
    epoll_ctl(epollfd, EPOLL_CTL_ADD, listenfd, &event);

调用 epoll_wait(epollfd, epoll_event 数组, epoll_event 数组的 size), 将就绪 event 的传入 epoll_event 数组
```
nready = epoll_wait(epollfd, events.data(), static_cast<int>(events.size()), -1);
```

thundering herd
- EPOLLEXCLUSIVE: 只适用于多个线程/进程拥有各自的 epfd, 然后监听同一 listenfd
- SO_REUSEPORT: 只适用于多个 listenfd 绑定同一端口

1.12 malloc

The malloc(), calloc(), valloc(), realloc(), and reallocf() functions allocate memory. The allocated memory is aligned such that it can be used for any data type, including AltiVec- and SSE-related types. The aligned_alloc() function allocates memory with the requested alignment. The free() function frees allocations that were created via the preceding allocation functions.

The malloc() function allocates size bytes of memory and returns a pointer to the allocated memory.

The calloc() function contiguously allocates enough space for count objects that are size bytes of memory each and returns a pointer to the allocated memory. The allocated memory is filled with bytes of value zero.

The valloc() function allocates size bytes of memory and returns a pointer to the allocated memory. The allocated memory is aligned on a page boundary.

The aligned_alloc() function allocates size bytes of memory with an alignment specified by alignment and returns a pointer to the allocated memory.

The realloc() function tries to change the size of the allocation pointed to by ptr to size, and returns ptr. If there is not enough room to enlarge the memory allo- cation pointed to by ptr, realloc() creates a new allocation, copies as much of the old data pointed to by ptr as will fit to the new allocation, frees the old allo- cation, and returns a pointer to the allocated memory. If ptr is NULL, realloc() is identical to a call to malloc() for size bytes. If size is zero and ptr is not NULL, a new, minimum sized object is allocated and the original object is freed. When extending a region allocated with calloc(3), realloc(3) does not guarantee that the additional memory is also zero-filled.

The reallocf() function is identical to the realloc() function, except that it will free the passed pointer when the requested memory cannot be allocated. This is a FreeBSD specific API designed to ease the problems with traditional coding styles for realloc causing memory leaks in libraries.

The free() function deallocates the memory allocation pointed to by ptr. If ptr is a NULL pointer, no operation is performed.

dlmalloc – General purpose allocator
ptmalloc2 - glibc
- ptmalloc2 was forked from dlmalloc
- 每个 chunk 至少需要 8 个字节的 overhead
- ptmalloc 将相似大小的 chunk 用双向链表链接起来, 这样的一个链表被称为一个 bin, Ptmalloc 一共维护了 128 个 bin, 并使用一个数组来存储这些 bin
- 大内存采用mmap(), 小内存使用brk()
- 有一个主分配区 (main arena), 多个非主分配区, 非主分配区只能使用 mmap 申请虚拟内存
- per thread arena: maintain separate heap and freelist data structures for each thread
- application’s arena limit is based on number of cores present in the system.
  - For 32 bit systems: Number of arena = 2 * number of cores.
  - For 64 bit systems: Number of arena = 8 * number of cores.
- A single thread arena can have multiple heaps (non contiguous region, created by mmap)
- Heap Header: heap_info (Main arena dont have multiple heaps and hence no heap_info structure)
- Arena Header: malloc_state (contains information about bins, top chunk, last remainder chunk…) (Unlike thread arena, main arena’s arena header isnt part of sbrk’d heap segment. Its a global variable and hence its found in libc.so’s data segment)
- Chunk Header: malloc_chunk
- chunk 可分为:
  Allocated chunk
  - prev_size: If the previous chunk is free, this field contains the size of previous chunk. Else if previous chunk is allocated, this field contains previous chunk’s user data.
  - size: This field contains the size of this allocated chunk. Last 3 bits of this field contains flag information.
  - PREV_INUSE (P) – This bit is set when previous chunk is allocated.
  - IS_MMAPPED (M) – This bit is set when chunk is mmap’d.
  - NON_MAIN_ARENA (N) – This bit is set when this chunk belongs to a thread arena.
  Free chunk
  - prev_size: No two free chunks can be adjacent together. When both the chunks are free, its gets combined into one single free chunk. Hence always previous chunk to this freed chunk would be allocated and therefore prev_size contains previous chunk’s user data.
  - size: This field contains the size of this free chunk.
  - fd: Forward pointer – Points to next chunk in the same bin (and NOT to the next chunk present in physical memory).
  - bk: Backward pointer – Points to previous chunk in the same bin (and NOT to the previous chunk present in physical memory).
  Top chunk
  - Chunk which is at the top border of an arena is called top chunk. It doesnt belong to any bin.
  Last Remainder chunk
  - Last remainder chunk helps to improve locality of reference ie) consecutive malloc request of small chunks might end up being allocated close to each other.
- Bins: Bins are the freelist datastructures. They are used to hold free chunks.
  - Fast Bin: Chunks of size 16 to 80 bytes (Number of bins – 10) (addition and deletion happens at the front end of the list – LIFO)
  - Unsorted Bin: When small or large chunk gets freed instead of adding them in to their respective bins, its gets added into unsorted bin. (Number of bins – 1)
  - Small Bin: Chunks of size less than 512 bytes (Number of bins – 62)
  - Large Bin: Chunks of size greater than equal to 512 (Number of bins – 63)

jemalloc – FreeBSD and Firefox
tcmalloc – Google
- 小对象 (<=32K) 从 ThreadCache 分配, 大对象从 CentralCache 分配
libumem – Solaris

1.13 thread-Safe, async-signal-safe and reentrant

Reentrant Function

A function whose effect, when called by two or more threads, is guaranteed to be as if the threads each executed the function one after another in an undefined order, even if the actual execution is interleaved.

Thread-Safe

A function that may be safely invoked concurrently by multiple threads. Each function defined in the System Interfaces volume of IEEE Std 1003.1-2001 is thread-safe unless explicitly stated otherwise.

Async-Signal-Safe Function

A function that may be invoked, without restriction, from signal-catching functions. No function is async-signal-safe unless explicitly described as such.

可重入函数必然是线程安全函数和异步信号安全函数

2 stl container

2.1 std::array

template<typename T, size_t N>

内存分配在栈（stack）上, 不会重新分配, 随机访问元素
swap: 交换每一个元素
fill: 对所有元素赋值

2.2 std::vector

template<typename T, typename Allocator = allocator<T> >

assign: 赋值
capacity: 容量
reserve: 预先分配内存
shrink_to_fit: resize到合适的内存大小
push_back, emplace_back: 尾部插入
insert, emplace: 插入
vector 的元素不能为 bool, vector<bool> 是按 bit 存储

2.3 std::deque

acronym of double-ended queue 双端队列

push_front, emplace_front: 头部插入

2.4 std::list

Doubly linked list 双向列表(循环)

remove, remove_if: 删除
reverse: 反转
sort: 排序
merge: 合并已排序的list
unique: 已排序的list去重
splice: 接合

2.5 std::forward_list

Single linked list 单向列表

before_begin: begin的前一个迭代器
erase_after: 删除下一个元素, 返回 void
insert_after: 插入
splice_after: 接合

2.6 std::set

template<typename T, typename Compare = less<T>, typename Allocator = allocator<T>>

通常为红黑树
std::multiset 允许元素重复, std::set 不允许
count: 查找元素个数
find: 查找元素
lower_bound: 第一个可插入点
upper_bound: 最后一个可插入点
equal_range: pair(lower_bound, upper_bound)
insert: 插入, 返回值为 pair<Iterator, bool>
std::find: 根据 operator== 查找；而 std::set::find 根据 Compare 查找
std::set::find 比 std::find 快

2.7 std::map

template<typename Key, typename T, typename Compare = less<Key>, typename Allocator = allocator<pair<const Key, T> > >

find: 返回 pair<const Key, T>
emplace_hint: 推荐插入
[] 和 .at(), [const Key] 不存在时插入pair, 返回pair.second, .at(const Key) 不存在时不插入, 返回一个异常

2.8 std::unordered_map

template<typename Key, typename T, typename Hash = hash<Key>, typename EqPred = equal_to<Key>, typename Allocator = allocator<pair<const Key, T> > >

需要使用模板类的偏化定义键的 hash 函数, 如果有两个值, 可以使用 boost 库的 hash_combine

  // Key class 's hash function
namespaece std
  {
    template<> struct hash<Key>	// Template Specialization 
    {
      size_t operator()(const Key &k) const
      {
	return k.value;
      }
    };
  }

 // if we need combine 2 values
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
  std::hash<T> hasher;
  seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

namespaece std
  {
    template<> struct hash<Key>	// Template Specialization 
    {
      size_t operator()(const Key &k) const
      {
      auto seed = hash<int>()(k.v1);
      hash_combine(seed, k.v2);
      return seed;
      }
    };
  }

2.9 std::remove

STL 中 remove() 只是将待删除元素之后的元素移动到容器的前端, 而不是删除。若要真正移除, 需要搭配使用 erase()

std::string str2 = "Text\n with\tsome \t  whitespaces\n\n";
str2.erase(std::remove_if(str2.begin(), 
			  str2.end(),
			  [](unsigned char x){return std::isspace(x);}),
	   str2.end());

3 concurrency

3.1 basic

std::thread::hardware_concurrency 硬件支持的线程数
std::this_thread::yield() 让出时间片
Binary semaphore 与 Mutex 的不同: mutex 一定要由获得锁的进程释放, 而 semaphore 可以由其它进程释放, 因此 semaphore 可以用於進程間同步
spinlock 与 semaphore 的不同: spinlock 是 busy waiting, 而 semaphore 是 sleep。只有多 CPU 的內核态非进程空间, 才会用到 spinlock。 Linux kernel 的 spinlock 在非 SMP 的情況下，只是关 irq，沒有別的操作，用於確保該段程序的運行不會被打斷。而 spinlock 也一般沒有必要用於可以 sleep 的進程空間。
用户空间的 spinlock 可以通过 atomic_flag 的 clear/test_and_set 来实现 unlock/lock/try_lock
std::bind 和 std::thread 必须显式通过 std::ref 来绑定引用进行传参, 否则, 形参的引用声明是无效的
```
std::thread t1(transfer, std::ref(my_account), std::ref(your_account), 10);
```

3.2 atomic

Compare & Swap: 看一看内存 *reg 里的值是不是 oldval, 如果是的话, 则对其赋值 newval

int compare_and_swap (int* reg, int oldval, int newval)
{
  int old_reg_val = *reg;
  if (old_reg_val == oldval) {
    *reg = newval;
  }
  return old_reg_val;
}
/* gcc CAS */
bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval, ...);
type __sync_val_compare_and_swap (type *ptr, type oldval type newval, ...);
/* c++ 11 CAS */
template< class T >
bool atomic_compare_exchange_weak( std::atomic* obj, T* expected, T desired );
template< class T >
bool atomic_compare_exchange_weak( volatile std::atomic* obj, T* expected, T desired );

operators

原子指令 (x均为std::atomic<int>)	作用
x.load()	返回x的值。
x.store(n)	把x设为n, 什么都不返回。
x.exchange(n)	把x设为n, 返回设定之前的值。
x.compare_exchange_strong(expected_ref, desired)	若x等于expected_ref, 则设为desired；否则把最新值写入expected_ref。
x.compare_exchange_weak(expected_ref, desired)	相比compare_exchange_strong可能有spurious wakeup
x.fetch_add(n), x.fetch_sub(n)	原子地做x += n, x-= n, 返回修改之前的值。

memory order

memory order	作用
memory_order_relaxed	没有fencing作用
memory_order_consume	后面依赖此原子变量的访存指令勿重排至此条指令之前
memory_order_acquire	后面访存指令勿重排至此条指令之前
memory_order_release	前面访存指令勿重排至此条指令之后。当此条指令的结果对其他线程可见后, 之前的所有指令都可见
memory_order_acq_rel	acquire + release语意
memory_order_seq_cst	acq_rel语意外加所有使用seq_cst的指令有严格地全序关系

限制重排

Release-Acquire ordering: 在 store() 之前的所有读写操作, 不允许被移动到这个 store() 的后面。在 load() 之后的所有读写操作, 不允许被移动到这个 load() 的前面。假设 Thread-1 store() 的那个值, 成功被 Thread-2 load() 到了, 那么 Thread-1 在store()之前对内存的所有写入操作, 此时对 Thread-2 来说, 都是可见的。

atomic_flag

可于用户空间用 atomic_flag 实现自旋互斥, 互斥锁是是一种 sleep-waiting 的锁, 自旋锁是一种 busy-waiting 的锁

3.3 mutex

mutex 类似于 binary semaphore, 不同的是只能由上锁的进程/线程来解锁

一般设为 mutable 使得 const member function 可以使用

  class ThreadsafeCounter {
  mutable std::mutex m; // The "M&M rule": mutable and mutex go together
  int data = 0;
 public:
  int get() const {
    std::lock_guard<std::mutex> lk(m);
    return data;
  }
  void inc() {
    std::lock_guard<std::mutex> lk(m);
    ++data;
  }
};

使用 std::lock_guard<std::mutex> 这种 RAII 防止出现异常导致 mutex 没有 unlock, 配合 std::lock 和 std::adopt_lock 可以防止死锁
lock_guard, unique_lock and scoped_lock

std::scoped_lock lock(e1.m, e2.m);

// 等价代码 1 （用 std::lock 和 std::lock_guard ）
// std::lock(e1.m, e2.m);
// std::lock_guard<std::mutex> lk1(e1.m, std::adopt_lock);
// std::lock_guard<std::mutex> lk2(e2.m, std::adopt_lock);

// 等价代码 2 （若需要 unique_lock , 例如对于条件变量）
// std::unique_lock<std::mutex> lk1(e1.m, std::defer_lock);
// std::unique_lock<std::mutex> lk2(e2.m, std::defer_lock);
// std::lock(lk1, lk2);

递归锁 recursive mutex 和非递归锁 non-recursive mutex

pthread_mutex_t 默认是非递归的, 可以通过设置 PTHREAD_MUTEX_RECURSIVE 属性, 将 pthread_mutex_t 锁设置为递归锁

//create recursive attribute
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);

//set recursive attribute
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);

pthread_mutex_init(&g_mutex, &attr);

shared_mutex 用于读写锁
- std::unique_lock<std::shared_mutex> write lock
  - 只有一个线程/写者能写
- std::shared_lock<std::shared_mutex> read lock
  - 多个线程/读者能同时读

3.4 condition variable

notify_one(): notifies one waiting thread
notify_all(): notifies all waiting threads

wait_for()

    /* template< class Rep, class Period, class Predicate > */
    /* bool wait_for( std::unique_lock<std::mutex>& lock, */
    /*              const std::chrono::duration<Rep, Period>& rel_time, */
    /*              Predicate pred); */

/* Equivalent to */ 
wait_until(lock, std::chrono::steady_clock::now() + rel_time, std::move(pred));

4 asio

4.1 timer

boost::asio::deadline_timer 使用的计量时间是系统时间 (posix_time), 因此修改系统时间会影响deadline_timer的行为
基于 std::chrono::steady_clock 的 boost::asio::steady_timer 是一个不会受系统时间影响的定时器
boost::asio::strand 基于 mutex 实现, 保证 callback 的顺序, 使用 strand.post/wrap 包装非线程安全的操作

4.2 daytime

client synchronous
- resolver(io_context)
- endpoint = resolver.resolve(ip address, "daytime"): result of resolver
- socket(io_context): connect to an endpoint
- buffer: a buffer of a boost array
- socket.read_some(buffer, error): return length
server synchronous
- acceptor(io_context, endpoint(tcp4, "daytime"))
- socket(io_context)
- acceptor.accept(socket)
- buffer: a buffer of a string
- write(socket, buffer, error)
server asynchronous
- class tcp_connection: a CRTP
  - static create(io_context)
  - start(): call async_write
  - handle_write(error, bytes_transferred): do nothing
  - member variables: socket and message
- class tcp_server
  - start_accept(): call acceptor.async_accept
  - handle_accept(std::shared_ptr<tcp_connection>, error): call tcp_connection::start and start_accept
  - member variables: io_context and acceptor

4.3 chat

char_message
- data = header + body
- decode/encode header: use strncat/sprintf/memcpy
char_server
- participant: has a virtual function deliver(msg)
- room: has a set of participants and a deque of messages, can join/leave participant, can let all participant in the room deliver(msg)
- session: a CRTP inherits from participant, has a socket, a reference of a room, a read message and a deque of write messages
  - constructor: socket and reference of room
  - start: let room join shared_from_this and begin read header
  - deliver(&msg): push msg to write messages, call do_write
  - do_read_header: async read, call do_read_body
  - do_read_body: async read, let room deliver(read message), call do_read_header
  - do_write: async write the front of write messages utils they're empty
- server: has a acceptor and a room
  - constructor: io_context and endpoint
  - do_accept: async accept, let session start, call do_accept itself
char_client: io_context run in a new thread, client should be closed before this thread join
- client: has a reference of io_context, a socket, a read message and a deque of write messages
  - constructor: io_context and endpoint
  - write(&msg): post, push message to write messages, call do_write
  - close: post, let socket close
  - do_connect(&endpoint): async connect, call do_read_header
  - do_read_header: async read, call do_read_body
  - do_read_body: async read, let room deliver(read message), call do_read_header
  - do_write: async write the front of write messages utils they're empty

5 muduo

5.1 book

对象构造的线程安全: 构造期间不要泄露 this 指针
空悬指针 dangling pointer 指向已销毁对象或已回收地址, 野指针 wild pointer 未经初始化的指针
fork() 只克隆当前线程的 thread of control
时间戳精确到微秒, 通过 gettimeofday(2) 获取时间, 不会陷入内核

5.2 timestamp

boost::less_than_comparable<>: 要求实现 operator < , 自动实现 >, <=, >=, 模板元
BOOST_STATIC_ASSERT: 编译时断言, assert: 运行时断言, 在头文件 <boost/static_assert.hpp>

gmtime_r: 秒数转结构体

time_t seconds = static_cast<time_t>(....);
struct tm tm_time;
gmtime_r(&seconds, &tm_time);

PRId64: 跨 32/64 位平台的 lld/ld

#define __STDC_FORMAT_MACROS
#include <inttypes.h>
#undef __STDC_FORMAT_MACROS
printf("%" PRId64 "\n", value)
printf("%lld", value);		/* 32 bit OS*/
printf("%ld", value);		/* 64 bit OS */

5.3 exception

backtrace: 栈回溯, 保存各个栈帧的地址到 buffer, backtrace_symbols: 根据地址, 转成相应的函数符号, 使用了 malloc(3), 需要 free 返回的 char **

#include <execinfo.h>

int backtrace(void **buffer, int size);

char **backtrace_symbols(void *const *buffer, int size);

void backtrace_symbols_fd(void *const *buffer, int size, int fd);

demangle 解构, 还原函数, 把不利于阅读的符号转为利于人类阅读的符号

5.4 thread

__thread 修饰的变量是线程局部存储的, 只能修饰 POD (plain old data), 初始化只能是编译期常量

__thread int t_cacheTid = 0;
__thread string a("aaa");	/* error 不是 POD */
__thread string *b = new string; /* error */
__thread string *c = nullptr;	 /* ok */

tsd 线程特定数据 thread-specific data 也称 tls (thread-local storage)
- pthread_key_create
- pthread_key_delete
- pthread_getspecific
- pthread_setspecific

获取线程 id

#include <unistd.h>
#include <sys/syscall.h>
#include <thread>
std::thread::id this_id = std::this_thread::get_id();
auto pthread_id = pthread_self(); /* same as get_id */
auto tid = static_cast<pid_t>(::syscall(SYS_gettid));

pthread_atfork : 调用 fork() 在创建进程前, 在父进程调用 prepare, 创建完进程后再父进程中调用 parent, 在子进程中调用 child

/* The pthread_atfork() function registers fork handlers that are to
   be executed when fork(2) is called by this thread */
#include <pthread.h>

int pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void));

无界队列和有界队列
- 生产者消费者模型, 使用 semaphore 或 condition variable 实现
- BlockingQueue: mutex, condition_variable(notEmpty), deque<T>
  - put: lock_guard 保护 deque push back, notEmpty notify
  - take: while deque is empty notEmpty wait, deque pop front
- BoundedBlockingQueue: mutex, condition_variable(notEmpty, notFull), boost::circular_buffer<T>
  - put: while queue is full notFull wait
  - take: after pop front, notFull notify

6 unix

6.1 baisc

int / iret
- 调用 sys_open:
```
mov 0x05 ,eax       /* 设置系统调用号 05: sys_open */
int 0x80
```
- 调用 int 0x80 后, 查找中断描述符表(IDT, Interrupt Descriptor Table), 进行特权级检查(DPL = CPL = 3), 在 GDT / LDT 中找到对应的段描述符
- 段寄存器 DPL >= CPL 才能访问内核段的内存空间(通过 set_system_intr_gate 来设置)
- Linux 只为每个 CPU 维护一个 TSS, 通过 TSS(Task State Segment) 来切换到内核栈
- 系统调用库(glibc) 中, int 0x80 只有在硬件不支持快速系统调用(sysenter / syscall)的时候才会调用
sysenter / sysexit
- 没有特权级别检查(CPL, DPL), 也没有压栈的操作
syscall / sysret
- 64 位
exit / _exit exit() 定义在 stdlib.h 中, _exit() 定义在 unistd.h 中, _exit() 是一个 sys_exit 系统调用, 而 exit() 先调用执行各终止处理函数, 关闭所有标准IO, 然后调用sys_exit

The function _exit() terminates the calling process "immediately". Any open file descriptors belonging to the process are closed; any children of the process are inherited by process 1, init, and the process’s par- ent is sent a SIGCHLD signal.

The exit() function causes normal process termination and the value of status & 0377 is returned to the parent (see wait(2)).

All functions registered with atexit(3) and on_exit(3) are called, in the reverse order of their registration. (It is possible for one of these functions to use atexit(3) or on_exit(3) to register an addi- tional function to be executed during exit processing; the new regis- tration is added to the front of the list of functions that remain to be called.) If one of these functions does not return (e.g., it calls _exit(2), or kills itself with a signal), then none of the remaining functions is called, and further exit processing (in particular, flush- ing of stdio(3) streams) is abandoned. If a function has been regis- tered multiple times using atexit(3) or on_exit(3), then it is called as many times as it was registered.

All open stdio(3) streams are flushed and closed. Files created by tmpfile(3) are removed.

The C standard specifies two constants, EXIT_SUCCESS and EXIT_FAILURE, that may be passed to exit() to indicate successful or unsuccessful termination, respectively.

6.2 man sections

Executable programs or shell commands
System calls (functions provided by the kernel)
Library calls (functions within program libraries)
Special files (usually found in /dev)
File formats and conventions eg /etc/passwd
Games
Miscellaneous (including macro packages and conventions), e.g. man(7)
System administration commands (usually only for root)
Kernel routines [Non standard]

6.3 ctrl

ctrl-c: = kill -s INT [PID] (kill foreground process) 发送 SIGINT 信号给前台进程组中的所有进程, 强制终止程序的执行
ctrl-z: (suspend foreground process) 发送 SIGTSTP 信号给前台进程组中的所有进程, 挂起一个进程, 使用 fg/bg 操作恢复执行前台或后台的进程
ctrl-d: (terminate input, or exit shell) 一个特殊的二进制值, 表示 EOF, 作用相当于在终端中输入 exit 后回车
ctrl-/: 发送 SIGQUIT 信号给前台进程组中的所有进程, 终止前台进程并生成 core 文件
ctrl-s: 中断控制台输出
ctrl-q: 恢复控制台输出

6.4 *nix commands

pkg-config - Return metainformation about installed libraries
```
pkg-config --cflags --libs yaml-cpp 
```

查看动态 link directory

# mac 下查看 shared library(.dylib)
otool -L a.out			
# linux 下查看 shared library(.so)
ldd a.out

查看某进程的虚拟内存分配情况

cat /proc/[PID]/maps
pmap [PID]

file descriptor

#!/bin/sh
exec 6< a.txt	 # 创建一个 fd 对 a.txt 读操作
exec 7> a.txt	 # 创建一个fd 对 a.txt 写操作
exec 8<> a.txt # 0u stdin, 1u stdout, 2u stderr, 6r read only, 7w write only, 8u 

echo "hhh" >& 7			# 往 a.txt 里边写数据
read a 0<&  6 # 读取第一行数据到 a, 每个文件描述符代表的数据结构中都有自己的偏移量

lsof -op $$			# 查看当前进程正在使用的文件的描述符

cd /proc/$$/fd			# $$代表当前进程的ID号

readelf 读取 ELF 文件信息
nslookup: query internet name servers, explore properties of DNS mappings
dig: DNS lookup utility
telnet [host] [port]: like a echo client
objdump: readelf -s ~= objdump -t 查看 symbol table, objdump -d 反汇编, objdump -s 查看每段的内容, objdump -h 查看每段的大小

nc (netcat)

nc -zv localhost 20-30		# 获取开放的端口
nc -l 9999			# Listen for an incoming connection

tcpdump 抓包

nc -l 9999
tcpdump -i any 'port 9999' -XX -nn -vv -S
nc -v 9999

kill [options] <pid> […] - send a signal to a process
- the default signal for kill is TERM
- kill -l 查看可发送的所有信号
- a PID of -1 is special: it indicates all processes except the kill process itself and init
- kill -9 -1: kill all processes you can kill, 8) SIGFPE 9) SIGKILL 10) SIGUSR1
strace - trace system calls and signals
```
strace -e trace=write ./a.out
```

7 Primer C++

7.1 string, vector and array

老的编译器需要区分 >> 和 > >

"Some compilers may require the old-style declarations for a vector of vectors, for example, vector<vector<int> >."

vector 不能通过下标操作符进行插入

"The subscript operator on vector (and string) fetches an existing element; it does not add an element."

built-in 数组通过两个函数获得头尾指针

int ia[] = {0,1,2,3,4,5,6,7,8,9}; // ia is an array of ten ints
int *beg = begin(ia); // pointer to the first element in ia
int *last = end(ia); // pointer one past the last element in ia

"arrays are not class types, so these functions are not member functions. Instead, they take an argument that is an array"

两个指针相减的类型为 ptrdiff_t

"The result of subtracting two pointers is a library type named ptrdiff_t. Like size_t, the ptrdiff_t type is a machine-specific type and is defined in the cstddef header. Because subtraction might yield a negative distance, ptrdiff_t is a signed integral type."

built-in 数组可以取负数作为下标

int *p = &ia[2]; // p points to the element indexed by 2
int j = p[1]; // p[1] is equivalent to *(p + 1),
		  // p[1] is the same element as ia[3]
int k = p[-2]; // p[-2] is the same element as ia[0]

"The library types force the index used with a subscript to be an unsigned value. The built-in subscript operator does not. The index used with the built-in subscript operator can be a negative value."

initialize a C-style character string from a library string

string s("Hello World");  // s holds Hello World
char *str = s; // error: can't initialize a char* from a string
const char *str = s.c_str(); // ok

多维数组的 range for 要使用引用, avoid the normal array to pointer conversion

"To use a multidimensional array in a range for, the loop control variable for all but the innermost array must be references."

7.2 expression

使用 decltype 时会区分 lvalue 和 rvalue

"When we apply decltype to an expression (other than a variable), the result is a reference type if the expression yields oan lvalue. As an example, assume p is an int*. Because dereference yields an lvalue, decltype(* p) is int&. On the other hand, because the address of operator yields an rvalue, decltype(&p) is int**, that is, a pointer to a pointer to type int."

在一条表达式中如有未定义执行顺序的 operators （如int i = f1() * f2();）, 我们不能确定 f1() 和 f2() 哪个先执行, 会造成 has undefined behavior

"For operators that do not specify evaluation order, it is an error for an expression to refer to and change the same object. Expressions that do so have undefined behavior. As a simple example, the << operator makes no guarantees about when or how its operands are evaluated. As a result, the following output expression is undefined:"

int i = 0;
cout << i << " " << ++i << endl; // undefined

"There are four operators that do guarantee the order in which operands are evaluated. The logical AND (&&) operator guarantees that its left-hand operand is evaluated first. Moreover, we are also guaranteed that the right-hand operand is evaluated only if the left-hand operand is true. The only other operators that guarantee the order in which operands are evaluated are the logical OR (||) operator, the conditional (? :) operator, and the comma (,) operator."

• The right side of an && is evaluated if and only if the left side is true. • The right side of an || is evaluated if and only if the left side is false.

bool 不应该用于计算

bool b = true;
bool b2 = -b; // b2 is true! (-1 is true)

"bool values should not be used for computation. The result of -b is a good example of what we had in mind"

const_cast 只用于修改 constness

"If the object was originally not a const, using a cast to obtain write access is legal. However, using a const_cast in order to write to a const object is undefined."

Only a const_cast may be used to change the constness of an expression. Trying to change whether an expression is const with any of the other forms of named cast is a compile-time error. Similarly, we cannot use a const_cast to change the type of an expression

const char *cp;

char *q = static_cast<char*>(cp); // error: static_cast can't cast away const
static_cast<string>(cp); // ok: converts string literal to string
const_cast<string>(cp); // error: const_cast only changes constness

8 CSAPP

8.1 basic

registers
- arguments 1, 2, 3, 4, 5, 6 分别放在 rdi, rsi, rdx, rcx, r8, r9
- 浮点类型的参数是由另外一组寄存器传递的
- return value 放在 rax
overflow buffer
- randomize stack position
- make the stack not executable
- use stack canary
rep; ret
- 汇编中用 rep 后面跟 ret 的组合来避免使 ret 指令成为条件跳转指令的目标
- 这里的 rep 指令就是作为一种空操作, 因此作为跳转目的插入它, 除了能使代码在 AMD 上运行得更快之外, 不会改变代码的其他行为
process
- process = process context + code, data and stack
- process context = program context(data registers, condition codes, stack pointer, program counter) + kernel context(vm structures, descriptor table, brk pointer)
- process = thread + code, data and kernel context
- each process's context is described by a task_struct structure
- The task_struct holds data such as the scheduling policy, scheduler priority, real time priority, processor allowed time counter, processor registers, file handles (files_struct), virtual memory (mm_struct).
thread
- threads associated with process form a pool of peers, unlike processes which form a tree hierarchy
- single core processor: simulate parallelism by time slicing
- multi-core processor: can have true parallelism
- one thread can read and write the stack of any other thread
thread vs. process
- similar
  - each has its own logical control flow
  - each can run concurrently with others
  - each is context switched
- different
  - threads share all code and data (except loacl stacks)
  - threads are less expensive than processes, ~20k cycles to create and reap a process, ~10k cycles for a thread

8.2 optimization

conditional move: 使用 conditional moves 能避免 branch prediction
unrolling and accumulating: 不使用流水线的话最好能优化到 latency bound（单位 clock cycles per element）, 使用流水线可以达到 throughout bound
ymm register: 使用 SIMD (Single instruction, multiple data) operations 加速运算（vectorizing后）
branch misprediction invalidation: 寄存器有多个副本, 当分支预测错误时还原寄存器的值, reload pipline

8.3 memory

bus interface standard
- PCIe 属于全双工模式, 而 SATA 是半双工模式,
- NVMe 与AHCI 相比使用多队列, 所以 NVMe + PCIe 比 AHCI + SATA 快
non-volatile memory
- 根据浮置栅存储的位的多少, 闪存可分为 SLC (Single Level Cell Multi Level Cell), MLC, TLC and QLC
volatile memory
- SRAM: cache memory
- DRAM: main memory, frame buffer
locality
- 程序需考虑 temporal locality and spatial locality
- matrix multiplication 通过分块（blocking）增加 temporal locality, 通过改变循环的顺序改变 spatial locality（i*k 与 k*j 的矩阵相乘, 最佳顺序为 kij）

8.4 linking

linker symbol

global symbols
non-static global functions/variables

external symbols
global symbols referenced but defined by other module

local symbols
static global functions/variables
relocation entry: complier 告诉 linker 去填充 symbols 所在的地址
可重定位目标文件
- .bss: 未初始化的全局和静态 C 变量, 以及所有被初始化为 0 的全局或静态变量。在目标文件中这个节不占据实际的空间, 它仅仅是一个占位符。目标文件格式区分已初始化和未初始化变量是为了空间效率, 在目标文件中, 未初始化变量不需要占据任何实际的磁盘空间。运行时, 在内存中分配这些变量, 初始值为 0
- .symtab: 一个符号表, 它存放在程序中定义和引用的函数和全局变量的信息, 和编译器中的符号表不同, .symtab 符号表不包含局部变量的条目
GCC -fno-common: 告诉链接器, 在遇到多重定义的全局符号时, 触发一个错误
static libraries
- ar – create and maintain library archives
- 使用多个 .o 文件创建 .a static library
shared libraries
- 解决了 static libraries 的 potential duplication
- 可以多个进程共享
library interpositioning

complie time
macro-expand

link time
linker trick to have special name resolution (gcc -Wl)

load/run time
修改 LD_PRELOAD
link directory

# mac 下查看 shared library(.dylib)
otool -L a.out
# linux 下查看 shared library(.so)
ldd a.out
# 查看第三方库路径
pkg-config --cflags --libs yaml-cpp

8.5 exceptional control flow

asynchronous exceptions (interrputs)
- cause by events external to the processor, such as timer interrput
synchronous exceptions

traps
international, example: system calls

faults
uninternational but possibly recoverable, example: page faults

aborts
uninternational and unrecoverable
process
context
address space + registers

states
running, stopped, terminated

exit
called once, never returns

fork
called once, returns twice (to parent and child)

wait
parent reap a child, synchronizing with child

waitpid
waiting for specific process
```
#include <sys/types.h>
#include <sys/wait.h>
pid_t waitpid(pid_t pid, int *wstatus, int options); 
/* The value of options is an OR of zero or more of the following constants:
 wait()就是 pid = -1、options = 0 的 waitpid()
 WNOHANG     return 0 immediately if no child has exited.
 */

 /* wait3() 和 wait4() 函数除了可以获得子进程状态信息外, 还可以获得子进程的资源使用信息, 这些信息是通过参数 rusage 得到的 */
 pid_t wait3(int *status,int options,struct rusage *rusage);
 pid_t wait4(pid_t pid,int *status,int options,struct rusage *rusage);
```
The pid parameter specifies the set of child processes for which to wait. If pid is -1, the call waits for any child process. If pid is 0, the call waits for any child process in the process group of the caller. If pid is greater than zero, the call waits for the process with process id pid. If pid is less than -1, the call waits for any process whose process group id equals the absolute value of pid.
reap
如果 parent 没有 reap child 进程, init process (pid = 1) 这个进程会去 reap zombie child process

execve
loading and running programs, called once, nerver returns

exec
execute a file
- l: 可变参数 const char *arg
  execl("/bin/ls", "ls", "-l", NULL);
- v: 参数列表 char *const argv[]
  int ret; char *argv[] = {"ls", "-l", NULL}; ret = execvp("ls",argv);
- e: 传递环境变量 char *const envp[]
  extern char **environ;
- p: 第一个参数 path 不用输入完整路径, 只有给出命令名即可, 它会在环境变量 PATH 当中查找命令
  execlp("ls", "ls", "-l", NULL);
shell
- fg(foreground) 和 bg(background) 的区别在于 fg 调用了 waitpid(pid, &status, 0)
signal
- pause: wait for the receipt of a signal
- kill: send signal to a process
- 每个信号类型都有一个预定义的默认行为, 是下面中的一种: • 进程终止。 • 进程终止并转储内存。 • 进程停止(挂起)直到被 SIGCONT 信号重启。 • 进程忽略该信号。
- pnb (pending nonblocked signals) = pending & ~blocked
- sighandler_t signal(int signum, sighandler_t handler)
  - 如果 handler 是 SIG_IGN, 那么忽略类型为 signum 的信号。
  - 如果 handler 是 SIG_DFL, 那么类型为 signum 的信号行为恢复为默认行为。
- a signal handler is a separate logical flow (not process) that runs concurrently with the main program
- SIGSTOP 和 SIGKILL, 它们的默认行为是不能修改的
- sigprocmask: explicit blocking and unblocking mechanism 只为单线程定义的, pthread_sigmasks 可以在多线程中使用
```
sigset_t mask, prev_mask;

sigfillset(&mask);			   /* all sigs */
sigprocmask(SIG_BLOCK, &mask, &prev_mask);  /* Block sigs */

sigprocmask(SIG_SETMASK, &prev_mask, NULL); /* Restore sigs */

/* ---------------------------------------- */

sigemptyset(&mask);
sigaddset(&mask, SIGCHLD);
sigprocmask(SIG_BLOCK, &mask, &prev_mask);  /* Block SIGCHLD */

sigprocmask(SIG_SETMASK, &prev_mask, NULL); /* Restore sigs */
```
- guideline for writing safe handler
  - as simple as possible
  - call only async-signal-safe functions
  - save and restore errno on entry and exit
  - protect shared data by temporarily blocking all signals
  - declare global variables as volatile (sig_atomic_t), 不能被加载到寄存器上
- async-signal-safety
  - "man 7 signal": show async-signal-safe functions
  - write is the only async-signal-safe output function
  - printf, malloc, exit are not async-signal-safe, will cause deadlock

8.6 system level I/O

End of line (EOL) indicators in different systems
- Linux and Mac Os: '\n'(0xa) - line feed
- Windows and Internet protocols: '\r\n'(0xd 0xa) - carriage return and line feed
strace: 追踪程序调用的系统命令和信号
- example: strace -e trace=write ./cpstdin
内核维护的 3 个数据结构
- 进程级的文件描述符表 descriptor table
- 系统级的打开文件描述符表 open file table
- 文件系统的 i-node 表 i-node table
descriptor table [one table per process]
- 不同的 file descriptor 可以指向相同的 file, 指向不同的 open file table, 其中的 file position 可能不同
- child process 会复制一份 parent process 的 descriptor table
open file table [shared by all process]
- file pos
- reference count
v-node table [shared by all process]
- informations in stat struct
- difference between inode and vnode
  
  The vnode structure ("virtual node") is an essential part of the virtual file system (VFS) support in Linux.
  
  The file system dependent/independent split was done just above the UNIX-kernel inode layer. This was an obvious choice, as the inode was the main object for file manipulation in the kernel. […] The file system dependent inode was renamed vnode (virtual node). All file manipulation is done with a vnode object. Similarly, file systems are manipulated through an object called a vfs (virtual file system). The vfs is the analog to the old mount-table entry. The file system independent layer is generally referred to a the vnode layer.
dup and dup2 - duplicate a file descriptor
- int dup(int oldfd); int dup2(int oldfd, int newfd);
- The dup() system call creates a copy of the file descriptor oldfd, using the lowest-numbered unused file descriptor for the new descriptor.
- After a successful return, the old and new file descriptors may be used interchangeably. They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the file descriptors, the offset is also changed for the other. The two file descriptors do not share file descriptor flags (the close-on-exec flag). The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off.
- The dup2() system call performs the same task as dup(), but instead of using the lowest-numbered unused file descriptor, it uses the file descriptor number specified in newfd. If the file descriptor newfd was previously open, it is silently closed before being reused.
- The steps of closing and reusing the file descriptor newfd are performed atomically.
- If oldfd is not a valid file descriptor, then the call fails, and newfd is not closed.
- If oldfd is a valid file descriptor, and newfd has the same value as oldfd, then dup2() does nothing, and returns newfd.
- most common use: I/O redirection (ls > a.txt)
fcntl - manipulate file descriptor
- int fcntl(int fd, int cmd, … * arg * );
- when cmd = F_DUPFD (int)
  - Duplicate the file descriptor fd using the lowest-numbered available file descriptor greater than or equal to arg. This is different from dup2(2), which uses exactly the file descriptor specified. On success, the new file descriptor is returned.
standard I/O
- 先 write/read 到 internal buffer, 再 transfers bytes from an internal buffer to a user buffer (flush)
- 遇到换行符会自动调用 fflush

8.7 virtual memory

cache
- write-back rather than write-through: try to defer writing back to disk
- each process has its own page table (an array of page table entries (PTEs) that maps virtual pages to physical pages) in DRAM
- page miss cause page fault (an exception)
- allocating pages: call sbrk, 会改变 program break 的位置 (heap 的结束地址)
memory management
- execve allocate virtual pages for .text and .data section & create PTEs marked as invalid
- .text and .data section are copied, page by page
memory protection
- extend PTEs with permission bits (sup, read, write, exec)
- 64位的地址只使用了低48位, 高位全为1的是用于内核, 高位全为0的是用于用户
Page Table Base Register: CR3 寄存器保存着当前进程页目录的物理地址, 切换进程就会改变 CR3 的值 (part of the process' context)
multi-level page tables
- PGD：page global directory (47-39), 页全局目录, 查看大小: getconf PAGE_SIZE = 4096
- PUD：page upper directory (38-30), 页上级目录
- PMD：page middle directory (29-21), 页中间目录
- PTE：page table entry (20-12), 页表项
TLB (Translation Lookaside Buffer)
- TLB miss 后才会去 multi-level page tables 中寻找 PPN
components of the virtual address (VA)
- TLBI: TLB index
- TLBT: TLB tag
- VPO: virtual page offset
- VPN: virtual page number
components of the physical address (PA)
- PPO: physical page offset
- PPN: physical page number
- CO: byte offset within cache line
- CI: cache index
- CT: cache tag
Linux VM organization
- task_struct 进程描述符: 有指向内存描述符的指针 struct mm_struct *mm, *active_mm
- mm_struct 内存描述符: 有指向线性区对象的链表头 struct vm_area_struct *mmap , 有指向线性区对象的红黑树 struct rb_root mm_rb, 有 pgd_t * pgd 指向第一级页表(页全局目录)的基址, 当内核运行这个进程时, 就将 pgd 存放在 CR3 控制寄存器中
- vm_area_truct: 描述了虚拟地址空间的一个区间
  - vm_prot: read/write premission for this area
  - vm_flags: pages shared with other processes or private for this process
page fault
- 段错误: 不在任何一个 vm_area_truct 的 vm_start 到 vm_end 之间
- 终止: 没有访问权限
- 选择一个牺牲页面, 如果这个牺牲页面被修改过, 那么就将它交换出去, 换入新的页面并更新页表
private copy-on-write objects: make a copy when write instead of it reflecting changes to disk
user-level memory mapping
- void *mmp(void *start, int len, int prot, int flags, int fd, int offset)
- map len bytes starting at offset of the file specified by file description fd, preferably at address start
  - start: may be 0 for "pick an address"
  - prot: PROT_READ, PROT_WRITE, PROT_EXEC, …
  - flags: MAP_ANON (anonymous, get a demand 0 page), MAP_PRIVATE, MAP_SHARED, …

8.8 dynamic memory allocation

performance goals
- throughput: number of completed requests per unit time
- peak memory utilization = max_playload / heap_size
- placement policy, splitting policy, coalescing policy
placement policy
- 首次适配从头开始搜索空闲链表, 选择第一个合适的空闲块。
- 下一次适配和首次适配很相似, 只不过不是从链表的起始处开始每次搜索, 而是从上一次查询结束的地方开始。
- 最佳适配检查每个空闲块, 选择适合所需请求大小的最小空闲块。
keeping track of free blocks
- method 1: implicit list using length – links all blocks
- method 2: explicit list among the free blocks using pointers
- method 3: segregated free list
- method 4: blocks sorted by size
implicit list
- alignment: 16 bytes (2 words)
- header (1 word = size + previous allocation status + allocation status) + payload and padding + boundary tag (footer 1 word, only on free blocks)
- we dont need boundary tag on allocated blocks
explicit free list
- allocated blocks are the same as implicit list
- free blocks store forward/back pointers
- how to put a newly freed block? - LIFO policy and address-ordered policy
- allocate is linear time in number of free blocks instead of all blocks
implicit memory management - garbage collection
- must make some assumptions:
  - memory manager can distinguish pointers from non-pointers
  - all pointers point to the start of a block
  - cannot hide pointers
- GC algorithms:
  - mark-and-sweep collection: view memory as a directed graph, use extra mark bit in the head of each block
  - reference counting
  - copying collection
  - generational collectors
  - mark-compact collection
  - conservation collection
  - incremental collection
  - 基于引用计数只需要局部信息, 基于 trace 需要全局信息, 引用计数缺少全局信息, 无法处理循环引用, 可使用 弱引用 解决
dealing with memory-related perils and pitfalls
- debugger: gdb
- data structure consistency checker
- binary translator: valgrind
- glibc malloc contains checking code: setenv MALLOC_CHECK_ 3

8.9 network programming

DNS is multi-multi mapping
struct sockaddr:
- sa_family(2 bytes - uint16_t) tells which protocol family it is
- sa_data(14 bytes) is the address data

struct sockaddr_in:

sin_family(2 bytes) protocol family always AF_INET
sin_port(2 bytes) port number in network bytes order(big endian order)
sin_addr(4 bytes) IP address in network bytes order
sin_zero(8 bytes) pad to sizeof(struct sockaddr)
must cast sockaddr_in * to sockaddr * for functions

//ex:
// _in 后缀是互联网络 (internet) 的缩写
struct sockaddr_in serverAddr;
memset(&serverAddr, 0, sizeof(serverAddr)); // memset in <string.h>
serverAddr.sin_family = AF_INET;
serverAddr.sin_port = htons(5188); // htons in <arpa/inet.h>
serverAddr.sin_addr.s_addr = htonl(INADDR_ANY);

bind(listenfd, (sockaddr*)&serverAddr, sizeof(serverAddr))

struct sockaddr_in peerAddr;

connfd = accept4(listenfd, (sockaddr*)&peerAddr, &peerLength, SOCK_NONBLOCK | SOCK_CLOEXEC);

std::cout << "ip = " << inet_ntoa(peerAddr.sin_addr)
	  << "port = " << ntohs(peerAddr.sin_port) << std::endl;

AF = Address Family, PF = Protocol Family

server: socket, setsockopt, bind, listen, accept, session(read/write)

socket

#include <sys/types.h>
#include <sys/socket.h>
int socket(int domain, int type, int protocol);

// SOCK_STREAM indicates that the socket will be the end point of a connection
listenfd = socket(PF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, IPPROTO_TCP);

bind

int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

bind(listenfd, (struct sockaddr*)&serverAddr, sizeof(serverAddr));

listen

int listen(int sockfd, int backlog);
// backlog 参数暗示了内核在开始拒绝连接请求之前, 队列中要排队的未完成的连接请求的数量, 一般为 128
listen(listenfd, SOMAXCONN);

int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
int accept4(int sockfd, struct sockaddr *addr, socklen_t *addrlen, int flags);

connfd = accept4(listenfd, (sockaddr*)&peerAddr, &peerLength, SOCK_NONBLOCK | SOCK_CLOEXEC);

client: socket, connect, getsockname, session(read/write)

socket

clientfd = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);

connect

int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

struct sockaddr_in serverAddr;
memset(&serverAddr, 0, sizeof(serverAddr));
serverAddr.sin_family = AF_INET;
serverAddr.sin_port = htons(5188);
serverAddr.sin_addr.s_addr = inet_addr("0.0.0.0");

connect(clientfd, (sockaddr*)&serverAddr, sizeof(serverAddr));

getaddrinfo

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int getaddrinfo(const char *node, const char *service, const struct addrinfo *hints, struct addrinfo **res);
void freeaddrinfo(struct addrinfo *res);
const char *gai_strerror(int errcode);

struct addrinfo {
	   int ai_flags;           /* input flags */
	   int ai_family;          /* protocol family for socket */
	   int ai_socktype;        /* socket type */
	   int ai_protocol;        /* protocol for socket */
	   socklen_t ai_addrlen;   /* length of socket-address */
	   struct sockaddr *ai_addr; /* socket-address for socket */
	   char *ai_canonname;     /* canonical name for service location */
	   struct addrinfo *ai_next; /* pointer to next in list */
   };

getnameinfo

int getnameinfo(const struct sockaddr *addr, socklen_t addrlen,
		char *host, socklen_t hostlen,
		char *serv, socklen_t servlen, int flags);

URI and URL: URI = Universal Resource Identifier 统一资源标志符 URL = Universal Resource Locator 统一资源定位符 URN = Universal Resource Name 统一资源名称 url 是 uri 的子集: URI = URL + URN
CGI (common gateway interface)
- in environment variable QUERY_STRING, the server pass arguments to the child
- the server capture the content produced by the child: the child generates it output on stdout, server uses dup2 to redirect stdout to its connected socket

8.10 concurrent programming

approches for writing concurrent servers

process-based
kernel automatically interleaves multiple logical flows, each flow has its own private address space.

event-based
programmer manually interleaves multiple logical flow, all flows share the same address space, use technique called I/O multiplexing.

thread-based
kernel automatically interleaves multiple logical flows, all flows share the same address space, hybrid of process-based and event-based
process-based
- must reap zombie children
```
void sigchld_handler(int sig) 
{
while (waitpid(-1, 0, WNOHANG) > 0)
    ;
return;
}

signal(SIGCHLD, sigchld_handler);
```
- parent must close its copy connfd, after fork, refcnt(connfd) = 2, connection will not be closed until refcnt(connfd) = 0
event-based
- no process or thread control overhead, design for high-performance web servers and search engiens, e.g., Node.js, nginx, Tornado
- hard to provide fine-grained concurrency (very coarse-grained)
- cannot take advantage of multi-core
thread-based
- malloc connfdp, and free it in the thread
- pthread_detach(pthread_self()): 主线程与子线程分离, 状态改为 unjoinable 状态
- must be careful to avoid unintended sharing
- all functions called by a thread must be thread-safe

8.11 synchronization

semaphores(信号量): non-negative global integer synchronization variable. Manipulated by P and V operations.
```
#include <semaphore.h>

int sem_init(sem_t *s, 0, unsigned int val); /* s = val */
int sem_wait(sem_t *s);			     /* P(s) */
int sem_post(sem_t *s);			     /* V(s) */
```
- P(s) proberen (测试)
  - if s is non-zero, then decrement s by 1 and return immediately. (test and decrement operations occur atomically)
  - if s is zero, then suspend thread until s become non-zero and the thread is restarted by a V operation.
  - after restarting, the P operation decrements s and return control to the caller.
- V(s) Verhogen (增加)
  - increment by 1. (increment operation occurs atomically)
  - if there are any threads blocked in a P operation waiting for s to become non-zero, then restart exaclty one of the threads, which then completes operation by decrementing s.
- semaphore invariant: s >= 0
- mutex: binary semaphore (only 0 and 1) used for mutual exclusion, P lock the mutex, V release the mutex.
- using semaphores to coordinate access to shared resources
  - producer-consumer problem:
    - 有三个信号量 mutex, slots, items
    - insert 时 P(slots), V(items)
    - remove 时 P(items), V(slots)
  - readers-writers problem:
    - guaranteeing each thread mutually exclusive access to its critical section
    - favors readers: writer could be starved sort of indefinitely waiting for all these readers to finish
    - favors writers: writer could starve out readers
    - pthread_once: pthread_once() 指定的函数执行且仅执行一次
      #include <pthread.h> int pthread_once(pthread_once_t *once_control, void (*init_routine)(void)); static pthread_once_t once = PTHREAD_ONCE_INIT; pthread_once(&once, init_routine);
thread-unsafe functions
- class 1: failing to protect shared varibles
- class 2: relying on persistent state across multiple function invocations
  - lib.c rand() function relies on static state (next)
- class 3: returning a pointer to a static variable
  - use lock-and-copy wrap lib.c ctime() function
- class 4: calling thread-unsafe functions

reentrant functions
- def: a function is reentrant iff it accesses no shared variables when called by multiple threads
- a subset of thread-safe functions
- gethostbyname, gethostbyaddr 和 inet_ntoa 函数是已弃用的网络编程函数, 已经分别被可重入的 getaddrinfo, getnameinfo 和 inet_ntop 函数取代
thread-safe library functions
- all functions in standard c library, such as malloc, free, printf, scanf
- most unix system calls, with a few exceptions:
  - asctime, ctime, gethostbyname, gethostbyaddr, inet_ntoa, localtime, rand
  - reentrant version denoted by _r: asctime_r, rand_r

8.12 thread level parallelism

out-of-order processor structure
memory consistency
- sequential consistency: overall effect consistent with each individual thread, arbitrary interleaving between threads
- non-coherent cache senario: may violate sequential consistency
- snoopy caches: tagged cahce line in main memory with its state (shared, exclusive), snoopy caches will get the correct copy of other threads' cache

9 Introduction to Computer Networking

9.1 intro

layers

application
bi-directional reliable byte stream between two applications, using application-specific semantics (eg. http, bit-torrent).

transport
gurantees correct, in-order delivery of data end-to-end. controls congestion.

network
delivers datagrams end-to-end. best-effort delivery - no gurantees. must use the Internet protocol (IP).

link
delivers data over a single link between an end host and router, or between routers.
encapsulation

Virtual Private Network (VPN)
http inside tcp inside ip inside tls inside tcp inside ip inside ethernet
endianness
- helper functions for convert network byte order (include <arpa/inet.h>):
- htons(), ntohs(), htonl(), ntohl(): htons means "host to network short", ntohl means "network to host long"
- inet_pton / inet_ntop: dotted decimal string (presentation) <—> IP address in network byte order (network)
计算机上网需要的4个参数
- 本机的 IP 地址 (静态或动态)
- 子网掩码

　- 网关的 IP 地址　- DNS 的 IP 地址

DHCP 属于 application layer, transport layer 使用 UDP
- UDP header 中设置发出方的端口和接收方的端口: 这一部分是 DHCP 协议规定好的, 发出方是 68 端口, 接收方是 67 端口
- 用于分配动态 IP 地址
- 中继代理 DHCP relay agent: 网关充当中继代理的角色
- 虚拟局域网技术 VLAN
- 配置 DHCP snooping 可以解决 DHCP 欺骗问题和 ARP 欺骗问题
ARP
- 两台主机不在同一个子网络, 只能把数据包传送到两个子网络连接处的 gateway, 让网关去处理
- 两台主机在同一个子网络, 可以用 ARP 协议, 得到对方的 MAC 地址
- ARP 协议也是发出一个数据包, 其中包含它所要查询主机的 IP 地址，在对方的 MAC 地址这一栏, 填的是 FF:FF:FF:FF:FF:FF, 表示这是一个"广播"地址它所在子网络的每一台主机, 都会收到这个数据包, 从中取出 IP 地址, 与自身的 IP 地址进行比较。如果两者相同, 都做出回复, 向对方报告自己的 MAC 地址, 否则就丢弃这个包

访问网站
- DNS 进程检查 cache, 如果没有找到, 检查本地 host 文件, 如果没有找到, 通过 DNS 服务器获取域名对应的 ip address
- 判断这个 IP 地址是不是在同一个子网络, 这就要用到子网掩码
- 是在同一个子网络: 通过广播 ARP 获取 ip 对应的 mac 地址, 不是同一子网络, MAC 地址将是网关的 MAC 地址
- 构建 http 数据包, 嵌在 TCP 中, 嵌在 IP 中, 嵌入以太网中, 以太网数据包的数据部分, 最大长度为1500字节

9.2 tcp

stream of bytes
reliable delivery
- ackownledgments indicate correct delivery.
- checksums detect corrupted data.
- squence numbers detect missing data.
- flow-control prevents overrunning receiver.
in-squence
congestion control
connection setup
- sequence number, acknowledgment number, ack bit and syn bit are used
- 3-way handshake
  - syn, sequence number $S_a$ , turns to syn sent state
  - syn/ack, ack number $S_{a+1}$ and sequence number $S_p$ , turns to syn received state
  - ack, ack number $S_{p+1}$ and sequence number $S_{a+1}$ , turns to established state
- support "simultaneous open": 4-way handshake
connection teardown
- sequence number, acknowledgment number, ack bit and fin bit are used
- 4-way wave hand
  - fin/ack, ack number $S_b$ and sequence number $S_a$ , turns to fin wait 1 state
  - ack, ack number $S_{a+1}$ , turns to close wait state
  - fin/ack, ack number $S_{a+1}$ and sequence number $S_b$ , turns to last ack state
  - ack, ack number $S_{b+1}$ , turns to time wait state
- active closer goes into TIME WAIT: sending fin before receiving one, keep socket around 2 MSL(maximun segment lifetime)
- 尽可能在服务器端避免 TIME WAIT

9.3 ucp

very few applications use UDP, examples: DNS, DHCP

9.4 ICMP

Internet Control Message Protocol (ICMP)

self-contained message reporting error.
new datagram: first 8 byte payload + header + type + code
ping uses ICMP:
- 构造 ICMP 数据包 –> 构造 IP 数据包 –> 构造以太网数据帧 -—> 物理传输到目标主机 -—> 获取以太网数据帧 –> 解析出 IP 数据包 –> 解析出 ICMP 数据包 –> 发送回送应答报文
- echo request (type 8, code 0)
- echo reply (type 0, code 0)
traceroute uses ICMP: find the routers from A to B
- A sends a UDP message, TTL set to 1
- the first router sends the TTL expired (type 11) ICMP message to A
- A sends a UDP message, TTL set to 2
- the second router sends the TTL expired (type 11) ICMP message to A
- ……
- when messages reach B, B sends the port unreachable ICMP message to A

9.5 error detection

checksum
- IP, TCP
- fast but not robust
cyclic redundancy codes (CRC)
- Ethernet
- more expensive than checksum
- n 位 CRC = (message << n) 除以 generator polynomial 得到的余数
- check: (message << n + CRC) / generator 的 remainder 为0
message authentication code (MAC)
- Secure Socket Layer (SSL)/Transport Layer Security (TLS) – https
- robust to malicious modifications
- can't guarantee detecting any error

9.6 sliding window

sender
- every segment has squence number
- maintain 3 variables
  - send window size (SWS)
  - last acknowledgment received (LAR)
  - last segment sent (LSS)
- maintain invariant: LSS - LAR <= SWS
- window stalling: can't move past the first unacknowledgement piece of data
receiver
- maintain 3 variables
  - receive window size (RWS)
  - last acceptable segment (LAS)
  - last segment received (LSR)
- maintain invariant: LAS - LSR <= RWS
- if received packet is < LAS, send acknowledgement
  - send cumulative acks: if received 1, 2, 3, 5, acknowledge 3
generally need SWS + RWS squenece numbers
- RWS packets in unkown state (ack may/may not be lost)
- SWS packets in flight must not overflow squence number space
tcp flow control: tcp uses sliding window protocol for flow control
go-back-N and selective repeat
- go-back-N: one loss will lead to entire window retransmitting
- selective repeat: one loss will lead to only that packet retransmitting

9.7 packet switching

end-to-end delay
- propagation delay: the time it takes a single bit to travel over a link at propagation speed c
- packetization delay: the time from when the first to the last bit of a packet is transmitted
- queuing delay: wait in the routers' packet buffers
generic packet switch
- lookup address: forwarding table (destination address -> egress link)
  - lookup Ehternet address using exact match
  - lookup IP address using longest prefix match (binary tries) (ternary content addressable memory TCAM)
- update header
- queue packet (buffer memory)

10 Computer Networking: A Top-Down Approach

packet switch (分组交换机) 主要有两种: router (路由器), link-layer switch (链路层交换机)
- 每台分组交换机有多条链路与之相连, 对于每条链路, 该分组交换机有一个输出缓存 (output buffer), 也叫输出队列 (output queue)
- 每台路由器有一个转发表 (forwarding table)
时延的类型: 处理时延, 排队时延, 传输时延, 传播时延
layered network
- ethernet frame
- IP packet/datagram
- TCP segment
- Application message
port should in net.ipv4.ip_local_port_range (sysctl -A | grep port_range)
因为 http 服务器不保存关于客户的任何信息, 所以 http 是一个无状态协议 (stateless protocol)
cookie 可以用于识别一个用户, 可以在无状态的 http 之上建立一个用户会话层
conditional GET 方法: 使的缓存器 (Web cache, 也叫代理服务器 proxy server)证实它的对象是最新的
- 报文中包含提个 "if-Modified-Since: " 首部行的 get 请求即为 conditional get request
- 响应报文中状态行为 304 Not Modified 表示 proxy 可以使用缓存的对象
传输层的多路复用和多路分解:
- 多路复用: 从不同的 socket 中收集数据块, 并为每个数据块封装上首部信息, 从而生成报文段, 然后将报文段传递到网络层
- 多路分解: 将传输层报文段的数据交付到正确的 socket
最大传输单元 Maximum Transmission Unit, MTU 通常为 1500 字节, 最大报文段长度 Maximum Segment Size, MSS 通常为 1460 字节
一条 TCP 连接的双方均可随机地选择初始序号, 这样做可以减少将旧连接 (使用同样的端口号 SO_REUSEADDR) 发送的报文误认为是新连接的
TimeoutInterval = EstimatedRTT + 4 * DevRTT, RTT 估计值和偏差值通过指数加权移动平均 (Exponential Weighted Moving Averge, EWMA) 计算
duplicate ACK 冗余 ACK: 接收方对已经接收到的最后一个按序字节数据进行重复确认产生, 发送方一旦接受到 3 个冗余 ACK, TCP 就执行快速重传 (fast retransmit)
TCP 的差错恢复机制为 go-back-N 和 selective repeat 的混合, 接收方有选择地确认失序报文段
丢包事件的定义: 出现超时或者收到来自接收方的 3 个冗余 ACK, 出现过度的拥塞时, 路径上的一台或多台路由器的缓存会溢出, 导致丢包
慢启动
- TCP 连接开始时, cwnd 初始值为一个 MSS, 每接收到一个 ACK 就增加一个 MSS, 这导致每过一个 RTT 就翻倍
- 当遇到一个 timeout 丢包事件, 设置慢启动阈值 ssthresh 为 cwnd/2, 将 cwnd 设置为 1, 重新开始慢启动过程, 当检测到 cwnd 到达 sshthresh 时, 结束慢启动并转移到拥塞避免模式
- 当遇到一个冗余 ACK 为 3 的丢包事件, 执行一次快速重传, cwnd 减半, 结束慢启动并进入快速恢复模式
拥塞避免
- 每个 RTT 增加一个 MSS, 可通过每个 ACK 增加 1/n 个 MSS 实现
- 出现 timeout: ssthresh 设为 cwnd/2, cwnd 设为 1, 进入慢启动
- 出现 3 duplicate ack: ssthresh 设为 cwnd/2, cwnd 减半, 进入快速恢复
快速恢复
- 对于每个冗余 ACK, 增加一个 MSS, 进入拥塞避免
- 出现 timeout: ssthresh 设为 cwnd/2, cwnd 设为 1, 进入慢启动
TCP 拥塞控制被称为加性增, 乘性减 Additive-Increase, Multiplicative-Decrease AIMD
转发和路由选择
- 转发 forwarding 是指分组从一个输入链路接口转移到适当的输出链路接口的路由器本地动作
- 路由选择 routing 是指确定分组从源到目的地所采取的的端到端路径的网络范围

11 Modern Operating Systems

errno is thread-safe, on Linux, the global errno variable is thread-specific 每个线程都有属于它自己的局部 errno

/* in /usr/include/bits/errno.h: */
/* Function to get address of global `errno' variable.  */
extern int *__errno_location (void) __THROW __attribute__ ((__const__));

#  if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value.  */
#   define errno (*__errno_location ())
#  endif

只有在有理由认为等待时间是非常短的情况下, 才使用忙等待, 用于忙等待的锁, 称为自旋锁(spin lock)
futex - fast user-space locking futex 通过用户态和内核配合, 可以减小开销, 并且线程灵活可控是否要进入睡眠等待还是 spin 等待
- 用户线程通过 CAS 类原子指令尝试获取锁, 如果成功, 则获取锁成功, 这种情况下无需调用系统调用, 不需要进入内核, 开销很小
- 如果 CAS 失败, 可以选择 spin 重试, 也可以选择挂起自己等待唤醒。这里即调用系统调用, 让内核操作挂起, 为了保证锁原语, 调用者将 futex word 的当前状态(锁定状态)作为参数传入内核, 内核进行检查如果与 futex word 的当前一致, 则挂起线程。否则返回失败。
- 为了唤醒等待线程, 获取锁的线程在释放锁后, 需要调用系统调用, 来通知锁释放, 内核会唤醒等待者进行竞争锁
- futex 应用于 pthread_mutex_t 和 sem_t
```
/* pthread_mutex_lock: */
atomic_dec(pthread_mutex_t.value);
if (pthread_mutex_t.value != 0)
  futex(WAIT);

/* pthread_mutex_unlock: */
atomic_inc(pthread_mutex_t.value);
if(pthread_mutex_t.value != 1)
  futex(WAKEUP);

sem_wait(sem_t *sem)
{
  for (;;) {
    if (atomic_decrement_if_positive(sem->count))
       break;
    futex_wait(&sem->count, 0);
   }
}

sem_post(sem_t *sem)
{
  n = atomic_increment(sem->count);
  futex_wake(&sem->count, n + 1);
}
```
second chance 二次机会算法: 寻找一个在最近时间间隔没有被访问过的页面, 如果所有的页面都被访问过了, 该算法就简化为纯粹的 FIFO 算法
局部分配策略和全局分配策略: 页面置换时考虑所有进程的页面还是只考虑本进程的页面
- 一种方法是 PFF (Page Fault Frequency, 缺页中断率) 算法, 指出何时增加或减少分配给一个进程的页面
缺页中断处理:
- 陷入内核, 保存进程上下文
- 检查地址是否有效, 检查存取与保护是否一致, 如果不一致向进程发送一个信号
- 检查是否有空闲的页框, 如果没有, 执行页面置换算法寻找一个页面淘汰, 如果选择的页面脏了, 安排写进磁盘, 挂起进程直到传输结束
- 恢复进程上下文
虚拟内存的核心是两张表: LDT (local Descriptor Table 局部描述符表), GDT (global Descriptor Table 全局描述符表)
- 每个程序都有自己的 LDT, 共享同一个 GDT, LDT 描述局部于每个程序的段
内核将所有进程的 task_struct 组织成一个双向链表, 不需要遍历这个链表来访问进程描述符(PID), PID 可以直接被映射成进程的 task_struct 所在的地址
linux 将线程分为三类: 实时先入先出, 实时轮转, 分时, 实时的优先级为 0 ~ 99, 非实时的优先级为 100 ~ 139
处理非实时任务的默认调度器: 公平调度器 (Completely Fair Scheduler, CFS)
- 根据任务在 CPU 上运行时间的长短 (虚拟运行时间 vruntime) 排列在一个红黑树上, 优先调度 vruntime 小的任务, 优先级较低的任务 vruntime 会增加的更快
- 调度器只考虑可以运行的任务, 其他任务被放入等待队列中, 等待队列包含了一个自旋锁
内核使用 page 结构体表示一个 4k 大小的物理页, 也叫页框
linux 把空闲的页框分组为 11 个块链表, 在第 i 个链表中每个页框块包含 2 的 i 次方个连续的页框

页框操作 alloc_pages(), page_address()

/* 分配 2^order 个连续的物理页, 返回一个指针指向第一个物理页的 page 结构体 */
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
/* 返回 page 所映射的虚拟地址 */
void *page_address(const struct page *page)

slab
- 每种对象类型对应着一个高速缓存
- 每个高速缓存由多个 slab 组成, 每个 slab 由一个或多个物理上连续的页组成
- 减少伙伴算法在分配小块连续内存时产生的内部碎片
- 将频繁使用的对象缓存起来
- 通过着色技术调整对象以更好的使用硬件高速缓存
- slob / slub

12 source code

12.1 epoll

初始化

static int __init eventpoll_init(void)
{
    mutex_init(&epmutex);

    /* Initialize the structure used to perform safe poll wait head wake ups */
    ep_poll_safewake_init(&psw);

    /* Allocates slab cache used to allocate "struct epitem" items */
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
	    0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC,
	    NULL);

    /* Allocates slab cache used to allocate "struct eppoll_entry" */
    pwq_cache = kmem_cache_create("eventpoll_pwq",
	    sizeof(struct eppoll_entry), 0,
	    EPI_SLAB_DEBUG|SLAB_PANIC, NULL);

    return 0;
}
fs_initcall(eventpoll_init);

结构体

/* 每创建一个 epollfd, 内核就会分配一个 eventpoll 与之对应, 可以说是
 * 内核态的 epollfd. */
struct eventpoll {
    spinlock_t lock;

    struct mutex mtx;     /* 防止使用时被删除 */

    wait_queue_head_t wq;     /* sys_epoll_wait() 使用的等待队列 */

    wait_queue_head_t poll_wait;     /* file->epoll() 使用的等待队列 */

    struct list_head rdllist;     /* List of ready file descriptors */

    struct rb_root rbr; /* RB tree root used to store monitored fd structs */

    struct epitem *ovflist; /* 一个单链表链接着所有的 struct epitem 当 event 转移到用户空间时 */

    struct user_struct *user;
};

/* epitem 表示一个被监听的 fd */
struct epitem {
    struct rb_node rbn;     /* RB tree node used to link this structure to the eventpoll RB tree */

    struct list_head rdllink;     /* List header used to link this structure to the eventpoll ready list */

    struct epitem *next;

    struct epoll_filefd ffd;     /* The file descriptor information this item refers to */

    int nwait;     /* Number of active wait queue attached to poll operations */

    struct list_head pwqlist;     /* List containing poll wait queues */

    struct eventpoll *ep;     /* 当前 epitem 属于哪个 eventpoll */

    struct list_head fllink;

    struct epoll_event event;     /* 当前的 epitem 关系哪些 events, 这个数据是调用 epoll_ctl 时从用户态传递过来 */
};

struct epoll_filefd {
    struct file *file;
    int fd;
};

/* poll 所用到的钩子 Wait structure used by the poll hooks */
struct eppoll_entry {

    struct list_head llink;     /* List header used to link this structure to the "struct epitem" */

    struct epitem *base; /* The "base" pointer is set to the container "struct epitem" */

    wait_queue_t wait;

    wait_queue_head_t *whead;
};

/* Wrapper struct used by poll queueing */
struct ep_pqueue {
    poll_table pt;
    struct epitem *epi;
};

/* Used by the ep_send_events() function as callback private data */
struct ep_send_events_data {
    int maxevents;
    struct epoll_event __user *events;
};

函数

create

从 slab 缓存中创建一个 eventpoll 对象,并且创建一个匿名的 fd 跟 fd 对应的 file 对象, 而 eventpoll 对象保存在 struct file 结构的 private 指针中,并且返回, 该 fd 对应的 file operations 实现了 poll 跟 release 操作

SYSCALL_DEFINE1(epoll_create, int, size)
{
	if (size <= 0)
		return -EINVAL;
	return sys_epoll_create1(0);
}


SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error;
    struct eventpoll *ep = NULL;  // 主描述符

    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

    /* 对于 epoll 来讲, 目前唯一有效的 FLAG 就是 CLOEXEC */
    if (flags & ~EPOLL_CLOEXEC)
	return -EINVAL;

    error = ep_alloc(&ep);     /* 分配一个struct eventpoll */

    if (error < 0)
	return error;
    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     * 推荐阅读 <Linux device driver 3rd>
     */
    error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
		 O_RDWR | (flags & O_CLOEXEC));
    if (error < 0)
	ep_free(ep);
    return error;
}

control

将 epoll_event 结构拷贝到内核空间中

插入操作: 创建一个与 fd 对应的 epitem 结构指定了调用 poll_wait 时的回调函数用于数据就绪时唤醒进程, 初始化设备的等待队列, 将该进程注册到等待队列, 完成这一步, epitem 就跟这个 socket 关联起来了, 当它有状态变化时, 会通过 ep_poll_callback() 来通知. 最后调用加入的 fd 的 file operation->poll 函数(最后会调用 poll_wait 操作)用于完成注册操作, 最后将epitem结构添加到红黑树中

// 使用范例
struct epoll_event event;
event.data.fd = listenfd;
event.events = EPOLLIN;
epoll_ctl(epollfd, EPOLL_CTL_ADD, listenfd, &event);

// epoll_ctl
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
	struct epoll_event __user *, event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    error = -EFAULT;

    /*
     * 错误处理以及从用户空间将 epoll_event 结构的 event copy 到内核空间.
     */
    if (ep_op_has_event(op) &&
	copy_from_user(&epds, event, sizeof(struct epoll_event)))
	goto error_return;

    /* Get the "struct file *" for the eventpoll file 
     * 这个结构在 epoll_create1() 中, 由函数 anon_inode_getfd() 分配 */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
	goto error_return;

    /* Get the "struct file *" for the target file */
    /* 我们需要监听的 fd, 也有个 struct file 结构 */
    tfile = fget(fd);
    if (!tfile)
	goto error_fput;

    /* The target file descriptor must support poll */
    /* 监听的文件需要支持 poll */
    error = -EPERM;
    if (!tfile->f_op || !tfile->f_op->poll)
	goto error_tgt_fput;

    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    /* epoll 不能自己监听自己 */
    error = -EINVAL;
    if (file == tfile || !is_file_epoll(file))
	goto error_tgt_fput;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* 获取 eventpoll 结构, 来自与 epoll_create1() 中的分配 */
    ep = file->private_data;

    /* 接下来的操作有可能修改数据结构内容, 加锁 */
    mutex_lock(&ep->mtx);

    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    /* 对于每一个监听的 fd, 内核都有分配一个 epitem 结构,
     * epoll 不允许重复添加 fd, 首先查找该 fd 是不是已经存在,
     * ep_find() rbtree 查找, O(lgn) 的时间复杂度
     */
    epi = ep_find(ep, tfile, fd);
    error = -EINVAL;

    switch (op) {
	/* 添加 */
    case EPOLL_CTL_ADD:
	if (!epi) {
	    /* 之前的 find 没有找到有效的 epitem, 证明是第一次插入, 接受 */
	    epds.events |= POLLERR | POLLHUP;
	    /* rbtree 插入 */
	    error = ep_insert(ep, &epds, tfile, fd);
	} else
	    /* 重复添加 */
	    error = -EEXIST;
	break;

	/* 删除 */
    case EPOLL_CTL_DEL:
	if (epi)
	    error = ep_remove(ep, epi);
	else
	    error = -ENOENT;
	break;

	/* 修改 */
    case EPOLL_CTL_MOD:
	if (epi) {
	    epds.events |= POLLERR | POLLHUP;
	    error = ep_modify(ep, epi, &epds);
	} else
	    error = -ENOENT;
	break;
    }

    mutex_unlock(&ep->mtx);

error_tgt_fput:
    fput(tfile);
error_fput:
    fput(file);
error_return:
    return error;
}

/* 分配一个 eventpoll 结构 */
static int ep_alloc(struct eventpoll **pep);


/* 从 slab 中分配一个 epitem */
/* UDP socket 的流程: 
 * f_op->poll(), sock_poll(), udp_poll(), datagram_poll(), sock_poll_wait(), ep_ptable_queue_proc()
 * epitem 跟这个 socket 关联起来后, 当它有状态变化时, 会通过 ep_poll_callback() 来通知
 * 后将 epitem 插入到对应的 eventpoll 中
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
		     struct file *tfile, int fd);


/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
/*
 * 该函数在调用 f_op->poll() 时会被调用
 * epoll 主动 poll 某个 fd 时, 用来将 epitem 与指定的 fd 关联起来
 * 关联的办法是使用等待队列(waitqueue)
 * 指定 ep_poll_callback 为唤醒时的回调函数
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt);


/*
 * This is the callback that is passed to the wait queue wakeup
 * machanism. It is called by the stored file descriptors when they
 * have events to report.
 */
/*
 * 这个是关键性的回调函数, 当我们监听的 fd 发生状态改变时, 它会被调用
 * 参数 key 被当作一个 unsigned long 整数使用, 携带的是 events
 * 将当前的 epitem 放入 ready list
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key);

wait

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
// Usage:
// #include <sys/epoll.h>

// int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

// nready = epoll_wait(epollfd, events.data(), static_cast<int>(events.size()), -1);
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
	int, maxevents, int, timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;
    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
	return -EINVAL;
    /* Verify that the area passed by the user is writeable */
    /* 内核对应用程序采取的策略是"绝对不信任",
     * 所以内核跟应用程序之间的数据交互大都是 copy
     * epoll_wait() 需要内核返回数据给用户空间, 内存由用户程序提供,
     * 所以内核会用一些手段来验证这一段内存空间是不是有效的
     */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
	error = -EFAULT;
	goto error_return;
    }
    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;

    file = fget(epfd);
    if (!file)
	goto error_return;
    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;

    if (!is_file_epoll(file))
	goto error_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    ep = file->private_data;
    /* Time to fish for events ... */
    error = ep_poll(ep, events, maxevents, timeout);
error_fput:
    fput(file);
error_return:
    return error;
}




/* 这个函数真正将执行 epoll_wait 的进程带入睡眠状态 */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
	   int maxevents, long timeout)
{
    int res, eavail;
    unsigned long flags;
    long jtimeout;
    wait_queue_t wait;//等待队列
    /*
     * Calculate the timeout by checking for the "infinite" value (-1)
     * and the overflow condition. The passed timeout is in milliseconds,
     * that why (t * HZ) / 1000.
     */
    /* 计算睡觉时间, 毫秒要转换为 HZ */
    jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
	MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
retry:
    spin_lock_irqsave(&ep->lock, flags);
    res = 0;
    /* 如果 ready list 不为空, 就不睡了, 直接干活 */
    if (list_empty(&ep->rdllist)) {
	/*
	 * We don't have any available event to return to the caller.
	 * We need to sleep here, and we will be wake up by
	 * ep_poll_callback() when events will become available.
	 */
	/* 初始化一个等待队列, 准备直接把自己挂起,
	 * current 是一个宏, 代表当前进程 */
	init_waitqueue_entry(&wait, current); //初始化等待队列, wait 表示当前进程
	__add_wait_queue_exclusive(&ep->wq, &wait); //挂载到 ep 结构的等待队列
	for (;;) {
	    /*
	     * We don't want to sleep if the ep_poll_callback() sends us
	     * a wakeup in between. That's why we set the task state
	     * to TASK_INTERRUPTIBLE before doing the checks.
	     */

	    set_current_state(TASK_INTERRUPTIBLE);

	    if (!list_empty(&ep->rdllist) || !jtimeout)
		break;

	    /* 如果有信号产生, 也起床 */
	    if (signal_pending(current)) {
		res = -EINTR;
		break;
	    }

	    spin_unlock_irqrestore(&ep->lock, flags);

	    jtimeout = schedule_timeout(jtimeout); //睡觉
	    spin_lock_irqsave(&ep->lock, flags);
	}
	__remove_wait_queue(&ep->wq, &wait);
	/* 醒来 */
	set_current_state(TASK_RUNNING);
    }
    /* Is it worth to try to dig for events ? */
    eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
    spin_unlock_irqrestore(&ep->lock, flags);
    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    /* 如果一切正常, 有 event 发生, 就开始准备数据 copy 给用户空间了 */
    if (!res && eavail &&
	!(res = ep_send_events(ep, events, maxevents)) && jtimeout)
	goto retry;
    return res;
}

static int ep_send_events(struct eventpoll *ep,
	      struct epoll_event __user *events, int maxevents)
{
    struct ep_send_events_data esed;
    esed.maxevents = maxevents;
    esed.events = events;
    return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
}

/**
 * ep_scan_ready_list - Scans the ready list in a way that makes possible for
 *                      the scan code, to call f_op->poll(). Also allows for
 *                      O(NumReady) performance.
 *
 * @ep: Pointer to the epoll private data structure.
 * @sproc: Pointer to the scan callback.
 * @priv: Private opaque data passed to the @sproc callback.
 *
 * Returns: The same integer error code returned by the @sproc callback.
 */
static int ep_scan_ready_list(struct eventpoll *ep,
		  int (*sproc)(struct eventpoll *,
		       struct list_head *, void *),
		  void *priv)
{
    int error, pwake = 0;
    unsigned long flags;
    struct epitem *epi, *nepi;
    LIST_HEAD(txlist);
    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() and epoll_ctl().
     */
    mutex_lock(&ep->mtx);
    /*
     * Steal the ready list, and re-init the original one to the
     * empty list. Also, set ep->ovflist to NULL so that events
     * happening while looping w/out locks, are not lost. We cannot
     * have the poll callback to queue directly on ep->rdllist,
     * because we want the "sproc" callback to be able to do it
     * in a lockless way.
     */
    spin_lock_irqsave(&ep->lock, flags);
    /* 所有监听到 events 的 epitem 都链到 rdllist 上了,
     * 但是这一步之后, 所有的 epitem 都转移到了 txlist 上, 而 rdllist 被清空了,
     */
    list_splice_init(&ep->rdllist, &txlist);

    ep->ovflist = NULL;
    spin_unlock_irqrestore(&ep->lock, flags);
    /*
     * Now call the callback function.
     */
    error = (*sproc)(ep, &txlist, priv);
    spin_lock_irqsave(&ep->lock, flags);
    /*
     * During the time we spent inside the "sproc" callback, some
     * other events might have been queued by the poll callback.
     * We re-insert them inside the main ready-list here.
     */
    /* 现在我们来处理ovflist, 这些 epitem 都是我们在传递数据给用户空间时
     * 监听到了事件 All the events that happens during that period of time are
     * chained in ep->ovflist and requeued later on
     */
    for (nepi = ep->ovflist; (epi = nepi) != NULL;
	 nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
	/*
	 * We need to check if the item is already in the list.
	 * During the "sproc" callback execution time, items are
	 * queued into ->ovflist but the "txlist" might already
	 * contain them, and the list_splice() below takes care of them.
	 */
	/* 将这些直接放入 ready list */
	if (!ep_is_linked(&epi->rdllink))
	    list_add_tail(&epi->rdllink, &ep->rdllist);
    }
    /*
     * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
     * releasing the lock, events will be queued in the normal way inside
     * ep->rdllist.
     */
    ep->ovflist = EP_UNACTIVE_PTR;
    /*
     * Quickly re-inject items left on "txlist".
     */
    /* 上一次没有处理完的 epitem, 重新插入到 ready list */
    list_splice(&txlist, &ep->rdllist);
    /* ready list不为空, 直接唤醒... */
    if (!list_empty(&ep->rdllist)) {
	/*
	 * Wake up (if active) both the eventpoll wait list and
	 * the ->poll() wait list (delayed after we release the lock).
	 */
	if (waitqueue_active(&ep->wq))
	    wake_up_locked(&ep->wq);
	if (waitqueue_active(&ep->poll_wait))
	    pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    mutex_unlock(&ep->mtx);
    /* We have to call this outside the lock */
    if (pwake)
	ep_poll_safewake(&ep->poll_wait);
    return error;
}


/* 该函数作为 callback 在 ep_scan_ready_list() 中被调用
 * head 是一个链表, 包含了已经 ready 的 epitem,
 * 这个不是 eventpoll 里面的 ready list, 而是上面函数中的 txlist
 */
static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
		   void *priv)
{
    struct ep_send_events_data *esed = priv;
    int eventcnt;
    unsigned int revents;
    struct epitem *epi;
    struct epoll_event __user *uevent;
    /*
     * We can loop without lock because we are passed a task private list.
     * Items cannot vanish during the loop because ep_scan_ready_list() is
     * holding "mtx" during this call.
     */
    /* 扫描整个链表 */
    for (eventcnt = 0, uevent = esed->events;
	 !list_empty(head) && eventcnt < esed->maxevents;) {
	/* 取出第一个成员 */
	epi = list_first_entry(head, struct epitem, rdllink);
	/* 然后从链表里面移除 */
	list_del_init(&epi->rdllink);
	/* 读取events */
	revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
	    epi->event.events;
	if (revents) {
	    /* 将当前的事件和用户传入的数据都 copy 给用户空间 */
	    if (__put_user(revents, &uevent->events) ||
		__put_user(epi->event.data, &uevent->data)) {
		list_add(&epi->rdllink, head);
		return eventcnt ? eventcnt : -EFAULT;
	    }
	    eventcnt++;
	    uevent++;
	    if (epi->event.events & EPOLLONESHOT)
		epi->event.events &= EP_PRIVATE_BITS;
	    else if (!(epi->event.events & EPOLLET)) {
		/* EPOLL ET 和非 ET 的区别:
		 *
		 * 如果是 ET, epitem 是不会再进入到 readly list,
		 * 除非 fd 再次发生了状态改变, ep_poll_callback 被调用.
		 *
		 * 如果是非 ET, 不管有没有有效的事件或者数据,
		 * 都会被重新插入到 ready list, 再下一次 epoll_wait
		 * 时, 会立即返回, 并通知给用户空间, 如果这个
		 * 被监听的 fd 确实没事件也没数据了, epoll_wait会返回一个0,
		 * 空转一次.
		 */
		list_add_tail(&epi->rdllink, &ep->rdllist);
	    }
	}
    }
    return eventcnt;
}


/* ep_free 在 epollfd 被 close 时调用 */ 
static void ep_free(struct eventpoll *ep)
{
    struct rb_node *rbp;
    struct epitem *epi;
    /* We need to release all tasks waiting for these file */
    if (waitqueue_active(&ep->poll_wait))
	ep_poll_safewake(&ep->poll_wait);
    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() while we're freeing the "struct eventpoll".
     * We do not need to hold "ep->mtx" here because the epoll file
     * is on the way to be removed and no one has references to it
     * anymore. The only hit might come from eventpoll_release_file() but
     * holding "epmutex" is sufficent here.
     */
    mutex_lock(&epmutex);
    /*
     * Walks through the whole tree by unregistering poll callbacks.
     */
    for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
	epi = rb_entry(rbp, struct epitem, rbn);
	ep_unregister_pollwait(ep, epi);
    }
    /*
     * Walks through the whole tree by freeing each "struct epitem". At this
     * point we are sure no poll callbacks will be lingering around, and also by
     * holding "epmutex" we can be sure that no file cleanup code will hit
     * us during this operation. So we can avoid the lock on "ep->lock".
     */

    while ((rbp = rb_first(&ep->rbr)) != NULL) {
	epi = rb_entry(rbp, struct epitem, rbn);
	ep_remove(ep, epi);
    }
    mutex_unlock(&epmutex);
    mutex_destroy(&ep->mtx);
    free_uid(ep->user);
    kfree(ep);
}
/* File callbacks that implement the eventpoll file behaviour */
static const struct file_operations eventpoll_fops = {
    .release    = ep_eventpoll_release,
    .poll       = ep_eventpoll_poll
};
/* Fast test to see if the file is an evenpoll file */
static inline int is_file_epoll(struct file *f)
{
    return f->f_op == &eventpoll_fops;
}

13 miscellaneous

FD_CLOEXEC 实现 close-on-exec, 关闭子进程无用文件描述符
```
int flags = fcntl(fd, F_GETFD);  
flags |= FD_CLOEXEC;  
fcntl(fd, F_SETFD, flags);
```
- open 函数中 flags 参数可以传入 O_CLOEXEC
- listenfd = socket(PF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, IPPROTO_TCP)
- connfd = accept4(listenfd, (sockaddr*)&peerAddr, &peerLength, SOCK_NONBLOCK | SOCK_CLOEXEC);
链接时, 依赖其他库的库一定要放到被依赖库的前面
tcp backlog
- 第二次握手: 服务端发送完 SYN/ACK 后在内存中建立 SYN-RECEIVED 的连接，将连接放进 incomplete connection queue, 最大长度为 /proc/sys/net/ipv4/tcp_max_syn_backlog
```
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
/* 1024 */
```
  - 关闭 syncookies(net.ipv4.tcp_syncookies = 0), 当队列满时, 不再接受新的连接
  - 开启 syncookies(net.ipv4.tcp_syncookies = 1), 当队列满时, 不受影响
- 第三次握手: 服务端收到 ACK 后, TCP 连接进入 ESTABLISHED 状态, 将连接放进 complete connection queue, 等待应用程序进行 accept, 其最大长度为 listen 函数的参数 backlog
```
int listen(int sockfd, int backlog);
// backlog 参数暗示了内核在开始拒绝连接请求之前, 队列中要排队的未完成的连接请求的数量, 一般为 128
listen(listenfd, SOMAXCONN);    
```
  - 当 sysctl_tcp_abort_on_overflow 为 0 时, Linux 内核只丢弃客户端的 ACK 包, 然后什么都不做
  - 当 sysctl_tcp_abort_on_overflow 非 0 时, Linux 内核会返回 RST 包, reset TCP 连接
- ss
```
# 查看进程的 backlog 数量
ss -anp | grep nginx
# 未完成队列的
ss -anp | grep nginx | grep -c SYN_RECV
```
__builtin_popcount() 用于计算一个 32 位无符号整数有多少个位为 1
long __builtin_expect (long EXP, long C) 语义为 "期望 exp 表达式的值等于常量 c ", 增加条件分支预测的准确性