Memory Management of Linux kernel

宋宝华 Memory Management of Linux kernel Barry Song 宋宝华 MM Background Address binding §  An address used in an instruction can point anywhere in the virtual address space of the process, it still must be bound to a physical memory address §  Programs are made of modules. Compilers or assemblers that are translating a module, don’t know where the module will be loaded in the physical memory §  PC contains a virtual address that must be translated to a physical address before an instruction is fetched. Addresses of operands are also virtual addresses that need to be translated before the operand is brought from the main memory (or cache) Registers (CPU) Address translation (MMU - Memory Management Unit) Main Memory Disk Storage Virtual Address Real Address I/O bus Hardware Cache Memory Bus Converting the linear address Linear address conversion in the architecture-independent memory model Linear address Paging Hardware With TLB Performance Characteristics of TLB §  Typical TLB –  Size: 8 - 4,096 entries –  Hit time: 0.5 - 1 clock cycle –  Miss penalty: 10 - 100 clock cycles –  Miss rate: 0.01 - 10% §  If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate for page mapping –  1*0.99 + (1+30)X0.01=1.30 –  1.30 clock cycles per memory access MMU: access permissions for privileged (kernel) and user modes §  it can have different access permissions for privileged (kernel) and user modes. The access permission bits control access to the corresponding memory region. If an access is made to an area of memory without the required permissions, then a permission fault is raised. Page fault §  A page fault is a trap to the software raised by the hardware when a program accesses a page that is not loaded in physical memory. -  Minor(soft): the page is loaded in memory at the time the fault is generated, but is not marked in MMU as being loaded in memory. The page fault handler merely needs to make the entry for that page in MMU point to the page in memory and indicate that the page is loaded in memory; it does not need to read the page into memory. -  Major(hard): the page is not loaded in memory at the time of the fault. The page fault handler in the OS needs to find a free location: either a page in memory, or another non-free page in memory. Once the space has been made available, the OS can read the data for the new page into memory, add an entry to its location in the memory management unit, and indicate that the page is loaded. -  Invalid: If a page fault occurs for a reference to an address that's not part of the virtual address space, meaning there can't be a page in memory corresponding to it, then it is called an invalid page fault. The page fault handler in the operating system will then generally pass a segmentation fault to the offending process Kernel and userspace memory §  In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens: memory overcommit §  Most OSes allow memory overcommit -  Allocate more virtual memory than physical memory §  How does this work? -  Physical pages allocated on demand only -  If free space is low… • OS frees some pages non-critical pages (e.g., cache) • Worst case, page some stuff out to disk Linux Memory Management §  How to know if a memory region unallocated vs. unloaded? – Virtual memory areas (VMAs) §  How to manage physical memory allocation? – Page descriptors – Page allocators (e.g., buddy algorithm, SLOB, SLUB, SLAB) §  Where to read a demand fetched page from? –Radix trees (page_tree) §  How to identify which PTEs map a physical page when evicting? –Reverse mappings –anon vmas(anon_vma), and radix priority trees (i_mmap) §  How to unify file accesses and swapping? –Page Cache 宋宝华 Kernel Memory Management Page & zone §  The system’s physical memory is made up of fixed-size blocks (called ‘page frames’) §  The Linux kernel uses a data-structure to keep track of each physical page frame. This data-structure is called a ‘struct page’ §  Pages aggregated into memory zones Page struct §  Each physical page has a page descriptor associated with it §  Contains reference count for the page §  Contains a pointer to the reverse map (struct address space or struct anon_vma) §  Contains pointers to lru lists (to evict the page) §  Descriptor to address: void * page_address(struct page *page) Kernel page table §  pagetable_init(): initialise the page tables necessary to reference all physical memory in ZONE_DMA and ZONE_NORMAL. §  that high memory in ZONE_HIGHMEM cannot be directly referenced and mappings are set up for it temporarily §  Once pagetable_init() returns, the page tables for kernel space are now full initialised so the static PGD (swapper_pg_dir) is loaded into the CR3 register so that the static table is now being used by the paging unit. virt_to_phys - map virtual addresses to physical §  unsigned long virt_to_phys (volatile void * address); The returned physical address is the physical (CPU) mapping for the memory address given. It is only valid to use this function on addresses directly mapped or allocated via kmalloc. §  void * phys_to_virt (unsigned long address); The returned virtual address is a current CPU mapping for the memory address given. It is only valid to use this function on addresses that have a kernel mapping The Buddy System Algorithm §  All free page frames are grouped into Order-n freelists (per-zone) §  The physical address of the first frame of a given block (of frames) is a multiple of the size of that block. The Buddy System Algorithm §  When a request for a contiguous block comes in, the OS searches the list for the smallest block bigger than the request. If nothing is available, it searches the next larger block list and continues this process iteratively until it either finds a block to allocate or discovers that the request cannot be granted. §  If a block is bigger than the request, it is split into pieces which are put into their appropriate lists. The Buddy System Algorithm (Deallocation) f f f f u f u f u f u u u u u u f = used = free u We’re going to deallocate this one The Buddy System Algorithm (Deallocation) f f f f u f f f u f u u u u u u f = used = free u New contiguous block Next, deallocate this block The Buddy System Algorithm (Deallocation) f f f f f f f f u f u u u u u u f = used = free f New contiguous block The Buddy System Algorithm (Deallocation) f f f f f f f f u f u u u u u u f = used = free f New contiguous block Alloc pages §  get_zeroed_page(unsigned int flags); Returns a pointer to a new page and fills the page with zeros. §  _ _get_free_page(unsigned int flags); Similar to get_zeroed_page, but doesn't clear the page. §  _ _get_free_pages(unsigned int flags, unsigned int order); §  struct page *alloc_pages(unsigned int flags, unsigned int order); Internal Fragmentation: slab allocation §  Often, memory allocation requests are for the same type of objects such as process control blocks, file descriptors, etc. §  The kernel can use this fact by storing memory that has been used and released (called a slab) in a cache where, when a similar request comes in for the same type of object, the kernel does not have to reallocate that memory, it just returns a previously-used slab. 宋宝华 slab §  Each cache maintains blocks of contiguous pages in memory called slabs which are carved up into small chunks for the data structures and objects the cache manages. The relationship between these different structures Slab APIs §  kmem_cache_t *kmem_cache_create(const char *name, size_t size, size_t offset, unsigned long flags, void (*constructor)(void *, kmem_cache_t *, unsigned long flags), void (*destructor)(void *, kmem_cache_t *, unsigned long flags)); §  void *kmem_cache_alloc(kmem_cache_t *cache, int flags); §  void kmem_cache_free(kmem_cache_t *cache, const void *obj); 宋宝华 /proc/slabinfo §  For each slab cache, the cache name, the number of currently active objects, the total number of available objects, the size of each object in bytes, the number of pages with at least one active object, the total number of allocated pages, and the number of pages per slab are given. Kmalloc & Slab §  kmalloc() returns memory from one of a series of slab caches (see below) with names like "size-32", ..., "size-131072". void *kmalloc(size_t size, int flags); void kfree(void *obj); vmalloc §  you are allocating memory for a large sequential buffer that exists only in software. §  It’s important to note that vmalloc has more overhead than __get_free_pages(), because it must both retrieve the memory and build the page tables. Therefore, it doesn’t make sense to call vmalloc() to allocate just one page void *vmalloc(unsigned long size); void vfree(void * addr); Page fault on vmalloc'ed memory §  On x86, it is hardware that reads the Linux page tables, and it always reads the page tables for the current process. This means that the kernel part of each process's page tables needs to contain copies of all the kernel PTEs on x86. To avoid having to set a PTE in every process's page tables when doing a vmalloc or ioremap, x86 just creates them in swapper_pg_dir and then creates them on demand in each process's page tables as needed. §  On PowerPC Linux always read the kernel page tables rooted at swapper_pg_dir to get translations for kernel addresses. PowerPC can do that because it is kernel code (software) that reads the Linux page tables. So there is no page-fault handling of address in kernel space. 宋宝华 Out Of Memory §  An OOM (Out Of Memory) error is what happens when the kernel runs out of memory in its own internal pools and is unable to reclaim memory from any other sources. It basically starts killing random processes, and spits a lot of logging into dmesg. §  very few OOM events are genuine kernel bugs. Virtually all of them are user applications which are behaving badly. 宋宝华 OOM example #include #include #define MEGABYTE 1024*1024 int main(int argc, char *argv[]) { void *myblock = NULL; int count = 0; while(1) { myblock = (void *) malloc(MEGABYTE); if (!myblock) break; memset(myblock,1, MEGABYTE); printf("Currently allocating %d MB\n",++count); } exit(0); } §  Program ends because of the Linux kernel’s OOM killer. 宋宝华 OOM killer criteria §  The function badness() in mm/oom_kill.c give a score to each existing processes. The one with highest score will be the victim: –  VM size. the sum of the size of all VMAs owned by the process. The bigger the VM size, the higher the score. –  Related to #1, the VM size of the process's children are important too. The VM size is cumulative if a process has one or more children. –  Processes with task priorities smaller than zero (niced processes) get more points. –  Superuser processes are important, by assumption; thus they have their scores reduced. –  Process runtime. The longer it runs, the lower the score. –  Processes that perform direct hardware access are more immune. –  The swapper (pid 0) and init (pid 1) processes, as well as any kernel threads immune from the list of potential victims. 宋宝华 Process Address Space virtual memory area mm_struct §  The process address space is described by the mm_struct struct meaning that only one exists for each process and is shared between userspace threads. In fact, threads are identified in the task list by finding all task_structs which have pointers to the same mm_struct. virtual memory area §  The full address space of a process is rarely used, only sparse regions are. Each region is represented by a vm_area_struct which never overlap and represent a set of addresses with the same protection and purpose Page fault §  This states that the page is only fetched from backing storage when the hardware raises a page fault exception which the operating system traps and allocates a page. The characteristics of backing storage imply that some sort of page prefetching policy would result in less page faults Page fault §  Major page faults occur when data has to be read from disk which is an expensive operation, else the fault is referred to as a minor, or soft page fault. 宋宝华 /proc/pid/maps §  The topmost segment in the process address space is the stack §  Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. §  Speaking of the heap, it comes next in our plunge into address space. §  Finally, we get to the lowest segments of memory: BSS, data, and program text. Linux Process Memory Binaries and Libraries Memory Mapping of Multi-processes VSS vs. RSS 宋宝华 smem – process memory $ adb pull /data/smem/memdata.tar . smem -S memdata.tar smem -S memdata.tar --pie=command $ ./smemcap >memdata.tar address_space 宋宝华 Why page cache §  Two serious problems must be solved by the OS when it comes to files. -  The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. -  The second is the need to load file contents in physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you’ll see there are ~15MB worth of common DLLs loaded in every process. if Windows box right now is running 100 processes, without sharing it’ll use up to ~1.5 GB of physical RAM just for common DLLs. 宋宝华 Page cache example §  conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block. 宋宝华 Page cache example -> mmap §  After 12KB have been read, render‘s heap and the relevant page frames look thus: mmap §  mmap is a POSIX-compliant Unix system call that maps files or devices into memory. It is a method of memory- mapped file I/O. It naturally implements demand paging, because initially file contents are not entirely read from disk and do not use physical RAM at all. The actual reads from disk are performed in a "lazy" manner, after a specific location is accessed. §  File-backed mapping maps an area of the process's virtual memory to files §  Anonymous mapping maps an area of the process's virtual memory not backed by any file Mapping functions § C library provides 3 functions in sys/mman.h –  caddr_t mmap(caddr_t addr, size_t len, int prot, int flags, int fd, off_t off); –  int munmap(caddr_t addr, size_t len); –  int mprotect(caddr_t addr, size_t len, int prot); –  int msync; mmap & page cache §  When you use file mapping, the kernel maps your program’s virtual pages directly onto the page cache. §  When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. Private and share mapping §  in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. §  Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. Private mapping §  In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Shared Pages Example §  a shared mapping is simply mapped onto the page cache and that’s it. Updates are visible to other processes and end up in the disk. Address spaces from two running instances of one program §  Dynamically loaded libraries are brought into your program’s address space via file mapping. two running instances of the file-mapping “render program”: 宋宝华 Page cache writeback §  Up to and including the 2.6.31 version of the Linux kernel, the pdflush threads ensured that dirty pages were periodically written to the underlying storage device. §  Jens Axboe developed a new, more effective writeback mechanism for Linux Kernel version 2.6.32. This approach provides threads for each device root@pc:~# ls -l /dev/sda brw-rw---- 1 root disk 8, 0 2011-09-01 10:36 /dev/sda root@pc:~# ls -l /dev/sdb brw-rw---- 1 root disk 8, 16 2011-09-01 10:36 /dev/sdb root@pc:~# ps -eaf | grep -i flush root 935 2 0 10:36 ? 00:00:00 [flush-8:0] root 936 2 0 10:36 ? 00:00:00 [flush-8:16] 宋宝华 Drop caches §  To free pagecache: # echo 1 > /proc/sys/vm/drop_caches §  To free dentries and inodes: # echo 2 > /proc/sys/vm/drop_caches §  To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches Swapping §  anonymous pages (those without a disk file to serve as backing storage - pages of malloc()'d memory, for example) are assigned an entry in the system swapfile, and those pages are maintained in the swap cache. Swap & page table §  When a page is swapped out, Linux uses the corresponding PTE to store enough information to locate the page on disk again. Swap cache §  The swap cache is purely conceptual as it is simply a specialisation of the page cache. –  pages in the swap cache rather than the page cache is that pages in the swap cache always use swapper_space as their address_space in page→mapping. –  pages are added to the swap cache with add_to_swap_cache() instead of add_to_page_cache(). 宋宝华 page frame reclaiming §  Low on memory reclaiming: The kernel detects a "low on memory" condition. §  Hibernation reclaiming: The kernel must free memory because it is entering in the suspend-to-disk state (we don't further discuss this case). §  Periodic reclaiming: A kernel thread is activated periodically to perform memory reclaiming, if necessary. 宋宝华 page frame reclaiming Linux Page Replacement §  Page replacement strategy applies to all page cache and anonymous pages §  Linux uses a variation of the clock algorithm to approximate an LRU page-replacement strategy §  The memory manager uses two linked lists –  Active list •  Contains active pages •  Most-recently used pages are near head of active list –  Inactive list •  Contains inactive pages •  Least-recently used pages near tail of inactive list –  Only pages in the inactive list are replaced Linux Page Replacement Moving pages across the LRU lists §  The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list and double the number of "missing accesses" required to move a page from the active list to the inactive list §  How to know when an inactive page is accessed: Remove PTE_P bit §  How to know when a page is not accessed for a while: Use PTE_Accessed bit Reverse Mappings §  Problem: how to swap out a shared mapping? –  Many PTEs may point to it –  But, we know only identity of physical page §  Could maintain reverse PTE i.e., for every page, list of PTEs that point to it –  Could get large. Very inefficient. –  Solution: reverse maps •  Anonymous reverse maps: anon_vma •  Idea: maintain one reverse mapping per vma(logical object) §  rather than one reverse mapping per page –  Based on observation most pages in VMA or other logical object (e.g., file) have the same set of mappers –  rmap contains VMAs that may map a page mlock §  mlock() and mlockall() respectively lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. int mlock(const void *addr, size_t len); int munlock(const void *addr, size_t len); int mlockall(int flags); –  MCL_CURRENT Lock all pages which are currently mapped into the address space of the process. –  MCL_FUTURE Lock all pages which will become mapped into the address space of the process in the future. These could be for instance new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions. glibc ptmalloc §  glibc uses an allocator named ptmalloc. Wolfram Gloger created it as a modified version of the original malloc library created by Doug Lea. §  The allocator uses two functions to get a chunk of memory from the kernel: –  brk() sets the end of the process's data segment. –  mmap() creates a new VMA and passes it to the allocator. §  If the request is equal or larger than M_MMAP_THRESHOLD, the allocator uses mmap(). If it is smaller, the allocator calls brk(). By default, M_MMAP_THRESHOLD is 128KB, but you may freely change it by using mallopt(). 宋宝华 Memory Profiling 宋宝华 /proc/meminfo §  This file reports statistics about memory usage on the system. It is used by free to report the amount of free and used memory (both physical and swap) on the system as well as the shared memory and buffers used by the kernel. 宋宝华 vmstat $ vmstat 5 procs -----------memory---------- ---swap--- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 3 0 833704 54824 25196 328672 10 0 343 18 510 1382 96 4 0 0 6 0 833704 54556 25092 324584 0 0 333 22 504 1180 93 7 0 0 4 0 833704 51516 25112 320856 33 0 315 19 508 1234 95 5 0 0 3 0 833704 54836 24984 314404 6 0 223 27 498 1191 95 5 0 0 3 0 833704 53072 24944 307844 4 0 216 22 518 1375 96 4 0 0 5 0 833704 53928 24888 304076 6 0 262 18 548 1665 94 6 0 0 3 4 843964 50192 184 58064 16 2416 16 2464 570 1451 78 22 0 0 3 7 908244 48756 224 47760 118 13645 149 13664 730 1245 76 16 0 8 3 2 922064 54280 340 49228 1470 2838 1817 2865 711 1481 88 12 0 0 4 2 932644 54068 424 52204 1972 2195 2596 2211 678 1388 90 10 0 0 2 3 944012 56304 492 52292 2986 2591 3063 2615 735 1562 89 11 0 0 2 4 957304 54604 572 51964 4042 3414 4096 3438 852 1808 88 12 0 0 ... §  We can monitor the amount of data going in and out of swap with the vmstat command 宋宝华 /proc/sys/vm/ §  Raising the value in min_free_kbytes will cause the system to start reclaiming memory at an earlier time than it would have before. §  There are scenarios where raising the cache in dirty_background_ratio and dirty_ratio dramatically has positive effects on performance. §  Swappiness is a property of the Linux kernel that changes the balance between swapping out runtime memory, as opposed to dropping pages from the system page cache. 宋宝华 Linux-ftools §  These are tools designed for working with modern linux system calls including, mincore, fallocate, fadvise, etc. 宋宝华 free §  To display free memory size in MB (megabytes): $ free –m total used free shared buffers cached Mem: 750 625 125 0 35 335 -/+ buffers/cache: 254 496 Swap: 956 0 956 宋宝华 Static memory §  A compiled binrary will allocate static memory to store two kinds of symbols: code symbols (a.k.a text) and data symbols (a.k.a data). 宋宝华 getrusage() §  int getrusage(int who, struct rusage *usage); getrusage() returns resource usage measures struct rusage { struct timeval ru_utime; /* user CPU time used */ struct timeval ru_stime; /* system CPU time used */ long ru_maxrss; /* maximum resident set size */ long ru_ixrss; /* integral shared memory size */ long ru_idrss; /* integral unshared data size */ long ru_isrss; /* integral unshared stack size */ long ru_minflt; /* page reclaims (soft page faults) */ long ru_majflt; /* page faults (hard page faults) */ long ru_nswap; /* swaps */ long ru_inblock; /* block input operations */ long ru_oublock; /* block output operations */ long ru_msgsnd; /* IPC messages sent */ long ru_msgrcv; /* IPC messages received */ long ru_nsignals; /* signals received */ long ru_nvcsw; /* voluntary context switches */ long ru_nivcsw; /* involuntary context switches */ }; 宋宝华




需要 8 金币 [ 分享pdf获得金币 ] 1 人已下载





下载需要 8 金币 [金币充值 ]
亲,您也可以通过 分享原创pdf 来获得金币奖励!