• 1. Fedor G. Pikus: Where Did My Performance Go? (Scaling Visualization in Concurrent C++ Programs)
Mentor Graphics, Design2Silicon Division
CPPCon, September 2014
• 2. What Is This Talk About?
How to get a detailed and useful profile (timeline of events) of a concurrent program:
- Why do we need it? What problems do we want to solve?
- What new problems did we create? How to solve them?
- Some useful techniques and tricks
- Some clever code (a bit simplified)
- Some simpler-than-expected code (warning: this may be rather anticlimactic)
General assumptions and limitations:
- Reasonably current Linux/Unix
- 64-bit address space (applying this to 32 bits is possible but requires some careful coding)
- I've only tested this on x86/64 hardware
Ask questions (or I will!)
• 3. Why measure and profile performance?
Performance is important:
- C++ is a performance-oriented language
- C++ is often used when performance is a requirement
Measurements are important:
- Programmers are notoriously bad at intuiting performance characteristics
- Performance often depends on subtle details of the program, data, libraries, OS, and hardware
- Performance of concurrent programs (more) often depends on more and subtler details of the program, data, (more) libraries, OS, and (more) hardware
- Programmers are notoriously worse at intuiting the behavior of concurrent programs
• 4. Why write concurrent programs?
It is the only way to take advantage of newer processors:
- Single CPU cores are not getting much faster
- CPUs are growing in size and transistor count, most of which goes toward more cores (more generally, more parallel compute units)
• 5. How to profile the performance of concurrent programs?
Traditional profilers remain useful:
- They identify the parts of the code that deserve attention (and the parts that can be ignored, performance-wise)
- Many problems become obvious with a detailed enough profile
The key difference between profiling a sequential and a concurrent program:
- The order of execution of a sequential program is known; the only question is how long each step took
- The order of execution of a concurrent program is (somewhat) random, but performance may greatly depend on the order
• 6. High-Performance Program + Concurrency = ?
Program 1:
- 1 thread: real time 1 s, CPU time 1 s
- 2 threads: real time 0.6 s, CPU time 1.1 s
- 4 threads: real time 0.9 s, CPU time 2 s
- 64 threads: real time 10 s, CPU time 100 s
Program 2:
- 1 thread: real time 1 s, CPU time 1 s
- 2 threads: real time 1 s, CPU time 1 s
- 4 threads: real time 1 s, CPU time 1 s
Program 3:
- 1 thread: real time 1 s, CPU time 1 s
- 2 threads: real time 1 s, CPU time 2 s
- 4 threads: real time 1 s, CPU time 4 s
• 7. High-Performance Program + Concurrency = ...
Program 1:
- 1 thread: real time 1 s, CPU time 1 s
- 2 threads: real time 0.6 s, CPU time 1.1 s
- 4 threads: real time 0.9 s, CPU time 2 s
- 64 threads: real time 10 s, CPU time 100 s
(Timeline chart: computation vs. time per thread, with large stretches spent waiting on a lock.)
Diagnosis: Lock Contention
• 8. Profiling concurrent programs: Who, What, When?
Peek into the execution of threads: a "thread microscope", or "ThreadScope"?
We want to know:
- Which thread did what computation?
- How long did it take?
- What were the other threads doing at that time?
- Did other events happen before, after, or simultaneously?
- How long was a wait on which lock? Which thread was holding the lock at that time?
- How many threads contend for a lock (or for a result, in lock-free programs)?
A picture is worth a thousand words: visualizing the timeline of relevant events can really help to diagnose performance problems.
• 9. High-Performance Program + Concurrency = ...
Program 2:
- 1 thread: real time 1 s, CPU time 1 s
- 2 threads: real time 1 s, CPU time 1 s
- 4 threads: real time 1 s, CPU time 1 s
Diagnosis: Complete Serialization
• 10. High-Performance Program + Concurrency = ...
Program 3:
- 1 thread: real time 1 s, CPU time 1 s
- 2 threads: real time 1 s, CPU time 2 s
- 4 threads: real time 1 s, CPU time 4 s
Diagnosis: a memory-bound program?
• 11. Memory Bandwidth and Memory-Bound
(Chart slide: no recoverable text.)
• 12. Memory Bandwidth and Memory-Bound
(Chart: series for 1 thread, 2 threads, and 4 threads.)
• 13. Memory Bandwidth and Memory-Bound
(Chart: series for 1 thread, 2 threads (scaled *2), and 4 threads (scaled *4).)
• 14. Bonus Example
From Herb Sutter's talk "Lock-Free Programming", CPPCon 2014, Monday Sept 8th:
LowLockQueue with several producer and consumer threads.
• 15. Example 1
Baseline code. first and last point to the before-the-first and the last nodes; divider points to the boundary between producer and consumer. (The angle-bracketed template arguments below were lost in the slide export and are restored from the comments and usage.)

    template <typename T>
    struct LowLockQueue {
      struct Node {
        Node( T val ) : value(val), next(nullptr) { }
        T value;                    // objects are held by value
        atomic<Node*> next;
      };
      Node *first, *last;           // for producer only
      atomic<Node*> divider;        // shared: P/C boundary
      atomic<bool> producerLock;    // shared by producers
      atomic<bool> consumerLock;    // shared by consumers
• 16. Example 1 (continued)
Consume returns the value in the first unconsumed node. Note: the entire body of Consume is inside the critical section, so we get no concurrency among consumers in this code.

    bool Consume( T& result ) {
      while( consumerLock.exchange(true) ) { }  // acquire exclusivity
      if( divider->next != nullptr ) {          // if queue is nonempty
        result = divider->next->value;          // copy it back to the caller
        divider = divider->next;                // publish that we took an item
        consumerLock = false;                   // release exclusivity
        return true;                            // and report success
      }
      consumerLock = false;                     // release exclusivity
      return false;                             // queue was empty
    }
• 17. Example 1 (continued)
Produce adds a new node to the tail, then lazily cleans up any consumed nodes at the front of the list. Note: not all of the body of Produce is inside the critical section, so there is some concurrency among producers in this code.

    bool Produce( const T& t ) {
      Node* tmp = new Node( t );                // do work off to the side
      while( producerLock.exchange(true) ) { }  // acquire exclusivity
      last = last->next = tmp;                  // publish the new item
      while( first != divider ) {               // lazy cleanup
        Node* tmp = first;
        first = first->next;
        delete tmp;
      }
      producerLock = false;                     // release exclusivity
      return true;
    }
    };  // end of LowLockQueue
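To exercise the queue as described, a small hypothetical driver; note that the slides omit the queue's constructor (Sutter's original initializes first = divider = last with a dummy node, which is assumed here):

    #include <thread>
    #include <vector>

    void demo(LowLockQueue<int>& q) {           // q assumed properly constructed
      std::vector<std::thread> threads;
      for (int p = 0; p < 2; ++p)               // two producers
        threads.emplace_back([&q] {
          for (int i = 0; i < 1000; ++i) q.Produce(i);
        });
      for (int c = 0; c < 2; ++c)               // two consumers
        threads.emplace_back([&q] {
          int v;
          for (int got = 0; got < 1000; )       // spin until this thread's share arrives
            if (q.Consume(v)) ++got;
        });
      for (auto& t : threads) t.join();
    }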
• 18. Bonus Example
From Herb Sutter's talk "Lock-Free Programming", CPPCon 2014, Monday Sept 8th: LowLockQueue with several producer and consumer threads.
- The initial implementation yielded "negative scaling": 2 consumer threads give lower total throughput than 1 thread
- The consumer lock was shown to be the main problem: the consumer workload (memory copy) was inside the critical section
- Removing cleanup from the producer was the next big improvement; again, work inside the critical section
What would the timeline show?
• 19.-20. Example 1 (continued): the Consume and Produce code from slides 16-17, shown again.
• 21. LowLockQueue Timeline (best guess)
(Timeline diagram: consumer threads cycle through Lock / copy / consume on consumerLock; producer threads cycle through new Node / Lock / produce / cleanup on producerLock.)
- Doing work inside a critical section means keeping other threads waiting
- A timeline of events (computations, locks) visualizes many problems
- Not all problems, though: memory contention is not directly visible and has to be inferred, which takes practice
• 22. How to get the timeline of a program?
Instrument the program to mark the "interesting" events:
- start and end of a function call or another computation
- requesting and acquiring a lock
- start and end of a message or I/O
Run the program, collect the data, visualize the data.
There are profilers that collect similar data (VTune, Sun collector, Visual Studio):
- If you have one and it works, use it first
- They don't know what the "interesting" events are, so they often collect too much data, or not enough
- A profiler is often the best source of hints about which code to instrument!
• 23. What are the challenges?
- Any measurement disturbs the program being measured and affects the results; more so if measurements require synchronization (locks)
- Collecting data from multiple threads risks race conditions; more so if data collection is done without synchronization
- The total volume of data can be large; adding this much I/O to a program can affect its performance
• 24. What are the solutions?
- Minimize the overhead added to computing threads
- Do most profiling-related work on separate threads (this takes CPUs/cores away from the program)
- Reduce the resident memory footprint of the profiling
- Run I/O on separate threads
(Diagram: work threads feed profile-processing threads, which produce the Profile Results.)
• 25. What data may we want to collect?
- What was computed, or what did we wait for?
- Real time
- CPU time, or CPU load, for the thread
- CPU time, or CPU load, for the process
- Parameters or other user data
- Stack trace
- Nesting depth
Collect only what you need; use expensive annotations less frequently.
(Diagram: example timeline with lock L1 and threads T1, T2 running tasks f1(1), f1(2), f1(3), f2(), f3() and g1(1), g1(2), g1(3), g2(2), g2(3).)
• 26. How to collect the data
- Minimize the work done on compute threads
- Minimize the synchronization introduced into compute threads
- Collect only what is necessary
- Collect efficiently
- Make instrumenting the program as simple as possible:

    void f1(int x) {
      Collector C("f1", x);  // Start of f1(x)
      ... do some work ...
      result1 = g1(x);
      result2 = g2(x);
      ... do more work ...
    }  // End of f1(x): ~Collector() collects the final data

    void g1(int x) {
      Collector C("g1", x);  // Nested task
      ... even more work ...
    }
• 27. Collect only what is necessary... What is necessary?
- Real time
- Task ID
- Thread ID
- Nesting depth
- CPU time, per thread
- CPU time, per process
- User data
- Stack trace
- Other state?
These range from always needed, through almost always and sometimes needed, down to special cases. We need several data collection classes.
• 28. Collect only what is necessary... Collect where?
The same list, from always needed to special cases: real time, task ID, thread ID, nesting depth; CPU time per thread; CPU time per process; user data; stack trace; other state.
Collected data needs to be stored in memory, which raises: Allocation! Synchronization! Data Races! Overhead!
• 30. High-resolution Timers on x86/Linux
clock_gettime() supports several clocks:
- CLOCK_MONOTONIC: real time
- CLOCK_PROCESS_CPUTIME_ID: CPU time per process
- CLOCK_THREAD_CPUTIME_ID: CPU time per thread
Resolution is 1 ns, BUT the timer itself takes about 50 ns (real time) to 150 ns (CPU time) on a fast processor.
We can wrap these library calls into classes HighResThreadTimer and HighResProcessTimer (a sketch follows below):

    HighResThreadTimer T;
    ... do some work ...
    uint64_t time_ns = T.Time();  // nanoseconds
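A minimal sketch of such a wrapper, assuming only the class name and Time() interface shown on the slide (error handling elided):

    #include <time.h>
    #include <stdint.h>

    class HighResThreadTimer {
      timespec start_;
    public:
      HighResThreadTimer() { clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start_); }
      // Thread CPU time, in nanoseconds, since construction.
      uint64_t Time() const {
        timespec now;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &now);
        return uint64_t(now.tv_sec - start_.tv_sec) * 1000000000ull
             + uint64_t(now.tv_nsec - start_.tv_nsec);
      }
    };

HighResProcessTimer would be identical with CLOCK_PROCESS_CPUTIME_ID.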
• 31. Memory allocation
We want:
- Thread-safe allocation
- Low overhead
- Minimally disruptive synchronization
So does everybody else; what's special here? We don't need general allocation patterns:
- Memory is allocated when new data is added by a work thread
- Memory is no longer needed when the data is saved to disk or processed by one of the service threads
- We don't need general deallocation: deallocation is entirely controlled by our profiler
From the point of view of the work threads, we need a memory pool.
• 32. Memory Pool
A memory pool is just a (conceptually) contiguous region of memory with a pointer (or offset). Allocation from the memory pool is simple:

    void* pool_allocate(size_t s) {
      void* p = top;
      increment(top, s);
      return p;
    }

(Diagram: Used region, newly Allocated block, Empty region; top advances to top'.)
• 33. Thread-safe Memory Pool
Thread-safe allocation from the memory pool is simple, too:

    void* pool_allocate(size_t s) {
      return atomic_increment(top, s);  // returns old value
    }

After the atomic increment, the allocated region belongs to the calling work thread: no more data races. (A C++11 sketch follows below.)
(Diagram: same pool picture as on the previous slide.)
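A minimal C++11 sketch of the same allocator, with the slide's atomic_increment() shorthand replaced by std::atomic::fetch_add; the pool base pointer and the absence of bounds checking are assumptions:

    #include <atomic>
    #include <cstddef>

    char* pool;                   // base of the (conceptually) contiguous region
    std::atomic<size_t> top{0};   // offset of the first free byte

    // Bump allocation: fetch_add returns the old offset, so each calling
    // thread receives a disjoint [old, old + s) slice of the pool.
    void* pool_allocate(size_t s) {
      size_t old = top.fetch_add(s, std::memory_order_relaxed);
      return pool + old;          // assumes the caller never overruns the pool
    }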
• 34. Simple Data Collector
This is what we want to store in memory:

    struct TaskRecord {
      uint32_t depth;       // Record nesting depth
      uint64_t tid;         // Thread id
      uint64_t tag;         // User-given identifier ("f1")
      uint64_t start_time;  // nanoseconds
      uint64_t stop_time;
      uint64_t cpu_time;
      TaskRecord(int depth, size_t tag, uint64_t start_time)
        : depth(depth), tid(pthread_self()), tag(tag),
          start_time(start_time), stop_time(0), cpu_time(0) {}
    };
• 35. Simple Data Collector
This is the collector class:

    class Collector {
      int* depth_;
      TaskRecord* record_;
      HighResThreadTimer timer_;
    public:
      Collector(size_t tag);
      ~Collector();
    };

- The measured "task" is the lifetime of the Collector
- HighResThreadTimer measures CPU time since its creation
- "Depth" counts how many Collector objects are alive on the current thread
• 36. Simple Data Collector
This is what we do at the start of the measurement...

    Collector::Collector(size_t tag)
      : depth_(GetDepth()),
        record_(new(pool_allocate(sizeof(TaskRecord)))
                  TaskRecord((*depth_)++, tag, GetRealTime())) {}

... and at the end of the measurement:

    Collector::~Collector() {
      record_->stop_time = GetRealTime();
      record_->cpu_time = timer_.Time();
      --*depth_;
    }

pool_allocate() is the atomic_increment() allocation from the pool; once it returns, the thread owns that memory, so the stores in the destructor need no synchronization.
• 38. Thread Depth
Thread depth counts the number of Collector objects currently alive on the calling thread. If there is only one thread, depth is a global (static) counter:
- No race conditions on depth: only one thread!
- No ambiguity as long as all Collectors are stack variables

    struct Collector {
      int* depth_;
      Collector()  { ++*depth_; }
      ~Collector() { --*depth_; }
    };

(Diagram: nested Collector objects sharing the single depth counter.)
• 39. Thread Depth
If there are many threads, we need a per-thread static counter:
- No race conditions on depth: one value per thread!
- Per-thread static == Thread Local Storage
- GCC: static __thread int depth;
(Diagram: threads 1 and 2, each with its own depth counter and its own stack of Collector objects. A sketch of the lookup follows below.)
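A minimal sketch of the depth lookup; the GetDepth() name comes from slide 36, and using C++11 thread_local instead of GCC's __thread is an assumption:

    // Each thread gets its own counter; returning a pointer lets the
    // Collector cache it and avoid a second TLS lookup in the destructor.
    int* GetDepth() {
      static thread_local int depth = 0;
      return &depth;
    }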
• 40. We're half done!
The picture so far is complete, as far as the work threads see it.
At the beginning of a measurement:
- Allocate memory from the pool
- Store real time
- Increment and store depth
At the end of the measurement:
- Store real time
- Decrement depth
- Store CPU time
And never see that memory again! But somebody else has to...
• 41. Thread-safe Memory Pool: the Profiler's View
- The profiler has to save the data to disk after it stops changing
- The profiler has to release the memory, eventually
- The profiler has to grow the memory pool, too
Work threads see the abstraction of a contiguous memory region; somebody has to provide this abstraction.
All problems would be solved if we had infinite memory.
(Diagram: pool regions marked Empty / Ready / Not ready.)
• 42. How much memory do we really have?
• 43. Virtual Memory and Physical Memory
- Virtual memory is the logically addressable memory; this is what your program sees
- Physical memory is what the hardware provides
- Not all physical memory belongs to your process
- "Your" physical memory may be non-contiguous
- The same logical address in different processes maps to different physical locations
(Diagram: address space of a process from 0x00…00 to 0xff…ff, with alternating Yours / Not Yours regions; example address 0x00271828.)
• 44. Virtual Memory and Memory Maps
- Virtual memory is mapped onto physical memory
- Virtual and physical memory are organized into pages of fixed size (usually 4K on Linux, but don't hard-code this!)
- Mapping is done by the OS using a PageTable and some help from the hardware (TLB)
- The PageTable maps process ID and logical page number to physical page number
• 45. PageTable, conceptually
(Diagram: the logical address spaces of processes 1 and 2 map through their respective PageTables onto physical memory.)
• 46. Virtual Memory and Memory Maps
The PageTable maps process ID and logical page number to physical page number, very efficiently! A logical page can map to:
- a physical memory page (including shared memory)
- a swap page
- a disk file page (swap is just a kind of disk file)
- nothing
• 47. Managing Memory Maps
Most of the address space of a process maps to nothing: what fraction of the 128TB are you really using? The PageTable is really efficient when mapping pages to nothing.
- How can you map a page to physical memory? Access it (read or write). The very first access yields a zero-filled page. (Not so for pages you get from malloc(): your access is not the first.)
- How can you map a page back to nothing? madvise(address, length, MADV_DONTNEED); address and length have to be aligned on the page size. (How to get the page size? sysconf(_SC_PAGESIZE).) Don't do this on a page you got from malloc()!
- How can you map a page to disk? mmap(…, length, …, file_descriptor, …)
(A sketch of the aligned madvise() call follows below.)
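A minimal sketch of the aligned release; the helper name and the rounding logic are assumptions, while madvise() and sysconf() are used as described on the slide:

    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdint.h>

    // Release the physical pages fully covered by [addr, addr + len);
    // the logical mapping stays valid and can be touched again later.
    void release_pages(char* addr, size_t len) {
      const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
      uintptr_t begin = ((uintptr_t)addr + page - 1) & ~(page - 1);  // round up
      uintptr_t end   = ((uintptr_t)addr + len) & ~(page - 1);       // round down
      if (end > begin)
        madvise((void*)begin, end - begin, MADV_DONTNEED);
    }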
• 48. Creating Memory Maps
- mmap() is used to map a range of logical addresses onto a disk file, a shared memory region, or a region of physical memory
- Accessing the memory creates "proxy" pages in physical memory
- Files created by mmap are usually sparse (file size > used space)
(Diagram: a range of logical address space mapped onto disk space and physical RAM.)
• 49. Creating Memory Maps
- Pages mapped by mmap() are not "real" until they are accessed: they don't count against the resident set
- Physical memory pages can be released by madvise(); we are not giving up the logical addresses and can access the pages again
- If the map is backed by a disk file, the data on disk is not erased
- If a disk-backed logical page is accessed again, a new physical page is mapped and initialized from disk
- munmap() gives up the range of logical addresses
• 50. Simple Memory Pool Mirrored to Disk

    class DiskPool {
      char* addr_;  // Initialized by mmap()
      size_t len_;  // Given by the user
      size_t top_;  // Offset from addr_
    public:
      DiskPool(size_t l, const char* file) : len_(l), top_(0) {
        ... addr_ = mmap(...); ...
      }
      ~DiskPool() { ... munmap(addr_, len_); }
      void* allocate(size_t s) {            // the work thread's view
        size_t t = atomic_increment(top_, s);
        return addr_ + t;
      }
      void flush();                         // the profiler's view: flush to disk,
                                            // release physical pages
    };
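The slide elides the mmap() call in the constructor; a plausible version, as an assumption (MAP_SHARED chosen so the pool is actually mirrored to the file; error handling elided):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Inside DiskPool's constructor: map len_ bytes of a (sparse) disk file.
    int fd = open(file, O_RDWR | O_CREAT, 0644);
    ftruncate(fd, len_);          // sets the file size; the file stays sparse
    addr_ = static_cast<char*>(
        mmap(nullptr, len_, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    close(fd);                    // the mapping keeps the file referenced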
• 51. Asynchronous I/O, the hard way
Because the easy way can't possibly work...
The profiler needs a thread (or threads!) that:
- monitors the pool
- detects when pages are no longer used for storing data (ready)
- flushes those pages to disk
- releases the corresponding physical pages
A few days of lock-free coding and some razor blades should do it.
Just out of curiosity, what was the easy way?
(Diagram: pool regions Empty / Ready / Not ready.)
• 52. Asynchronous I/O, the easy way
What would happen if we released all physical memory mapped to the pool?

    void DiskPool::flush() { madvise(addr_, len_, MADV_DONTNEED); }

- All changes still in memory are written to disk (on some OSes, an msync() call is needed for that)
- All physical memory mapped to the pool addresses is released to the OS and removed from our resident set
- The mapping of the virtual address space to the disk file remains intact
- If the program accesses (reads or writes) one of the logical pages, a physical page is re-created from the disk data
It behaves the right way, but at what cost?
• 53. Don't Guess, Measure!

    class DiskPool {
      void flush() { madvise(addr_, len_, MADV_DONTNEED); }
    };

The I/O thread is really a "flush thread":

    atomic<int> run(1);   // Global atomic flag

    // Flush Thread (running alongside the work threads):
    while ( run.load(memory_order_acquire) ) {
      sleep(1);           // 1 second
      disk_pool.flush();
    }

    // Main Thread:
    ... all data is written ...
    run.store(0, memory_order_release);
• 54. Simple Memory Pool Mirrored to Disk

    class DiskPool {
      char* addr_;  // Initialized by mmap()
      size_t len_;  // Given by the user
      size_t top_;  // Offset from addr_
    public:
      void* allocate(size_t s) {
        size_t t = atomic_increment(top_, s);
        return addr_ + t;
      }
      void flush() { madvise(addr_, len_, MADV_DONTNEED); }
    };

- With 32 threads (on 32 cores), we can sustain 1 event per ~100 microseconds on each thread
- With more than 32 threads we need to throttle it down a bit
- The peak rate can be ~1 event/1 usec as long as the average rate is OK
- The PageTable is amazingly efficient!
• 55. What happens when the pool fills up?
The easy way: the program dies.
- Do you really want terabytes of profiling data?
- Practical observation: 32 threads, 12 hours of run time, all the instrumentation needed to explain the program's behavior in detail, and we did not even come close (<100GB)
The harder ways (also the 32-bit way):
- Roll over the end of the memory range and start writing from the beginning (must save the disk content first)
- Freeze the whole system and map new memory ranges
In any case, this is a very rare event, so it can be expensive.
• 56. ThreadScope Annotations: Example
- Annotations record when something begins and ends, and how much CPU time was consumed
- All annotations implicitly create an object and measure its lifetime
- Process records measure per-process CPU time:

    THREADSCOPE_TIMED_PROCESS("Simulation");
    THREADSCOPE_TIMED_IPROCESS("tile", tile_number);
    THREADSCOPE_TIMED_SPROCESS("model", model->name());
    THREADSCOPE_TIMED_SPRINTF_PROCESS("step %s(%d)", model->name(), tile_number);

- The name of the annotation and the additional information will be displayed by the viewer
- Process records can be nested, and will display a parent-child or "contains" relation
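These macros belong to the author's ThreadScope library and their real definitions are not shown; as a hedged sketch of the likely shape, such a scoped-object macro could look like this (the concatenation helpers and the reuse of the Collector from slide 26 are assumptions):

    // Two-step concatenation so that __LINE__ expands before pasting.
    #define TS_CONCAT2(a, b) a##b
    #define TS_CONCAT(a, b)  TS_CONCAT2(a, b)

    // Hypothetical: declare a scoped collector whose lifetime is the
    // annotated region; the line number keeps local names unique.
    #define THREADSCOPE_TIMED_PROCESS(tag) \
        Collector TS_CONCAT(threadscope_collector_, __LINE__)(tag)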
• 57. More ThreadScope Annotations
- Task records measure per-thread CPU time; task records can be nested too
- Process records are more expensive than task records

    THREADSCOPE_TIMED_TASK("Processing tile");
    THREADSCOPE_TIMED_ITASK("tile", tile_number);

- Lock and Wait records are the only records with an explicit "end":

    THREADSCOPE_LOCK_BEGIN(L1, "Global lock");  // L1 is a unique name
    ScopeLock L(&global_lock);
    THREADSCOPE_LOCK_END(L1);                   // same name

- An Event record has no duration:

    THREADSCOPE_EVENT("Started processing");       // event happens on this line
    THREADSCOPE_PENDING_EVENT("Done processing");  // event happens at the end of the scope
• 58. Lazy UI: Visualize using CSS/HTML
(Screenshot of the viewer.) The view shows:
- a time axis, with a thread ID per row
- each bar is a record of some measurement
- nested records, or "sub-records"
- a mouse-over pop-up with details
- CPU utilization and memory use
- results can be viewed "live"
• 59. But I really want to do things the hard way...
Some measurements require more processing:
- more frequent measurements
- measurements involving stack traces
These measurements require some sort of accumulation. 1 event per microsecond? Even if you could collect that much raw data, what would you do with it?
Example: profiling memory allocation.
- We can collect a stack trace at every allocation and deallocation
- We can't write this much data to disk
- We need to process (accumulate) the data, for example, count the total number of allocations for each stack trace
• 60. How to accumulate data during profiling
- Work threads still generate events and write into the pool, but the pool is not backed by disk
- A processing thread sweeps the pool, reads the data, does the processing, releases the physical memory, and writes the results into another (disk) pool
- The profiler thread needs to know which data is ready
- The flush thread flushes the disk pool and releases its physical memory
- The processing thread will page in some memory as needed; the flush thread will release these pages again(?)
(Diagram: pool regions Empty / Ready / Not ready / Released(?).)
• 61. Anonymous Memory Pool
- mmap() can be called without a file: pass -1 for the file descriptor
- The result is a chunk of reserved virtual address space; no physical memory is allocated yet, and all logical pages are mapped to nothing
- Your malloc() probably does that, deep inside
- When a page is accessed, a physical page is mapped; the new page is zero-filled
- When a page is released with madvise(), its content is gone: it will be all zeroes if it's read again
- Reserved address space counts toward the virtual memory size but not toward the resident set (the Linux kernel has some settings that can (try to) change that)
(A sketch of the call follows below.)
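What the anonymous reservation might look like, as a sketch; the flags follow the mmap(2) man page, and the 1 GB size is an arbitrary assumption:

    #include <sys/mman.h>
    #include <cstddef>

    // Reserve 1 GB of virtual address space with no file behind it;
    // physical pages materialize, zero-filled, on first access.
    const size_t len = size_t(1) << 30;
    void* addr = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
    // Later: madvise(addr, len, MADV_DONTNEED) releases the physical pages;
    // munmap(addr, len) gives the whole address range back.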
• 62. Profiling With Stack Traces
Example: counting memory allocations per stack trace. Work threads must extract the stack trace and store it in the pool (together with the allocation size, thread id, etc.):

    struct TaskRecord {
      uint32_t count;   // Count of stack frames
      void* stack[1];   // "struct hack"
    };

The processing thread sweeps the pool until it finds the first uninitialized record (count == 0).
But, but... data races?! Make the count atomic:

    struct TaskRecord { atomic<uint32_t> count; ... };

    // Work Thread:
    ... save stack trace ...
    release_store()

    // Processing Thread:
    while (acquire_load()) { ... process and release ... }

(A C++11 sketch of this handshake follows below.)
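A minimal sketch of that handshake in C++11 atomics; the release_store()/acquire_load() on the slide are shorthand, and the fixed-size stack array here replaces the slide's struct hack for simplicity:

    #include <atomic>
    #include <cstdint>

    struct StackRecord {
      std::atomic<uint32_t> count{0};  // 0 means "not initialized yet"
      void* stack[64];                 // assumption: fixed maximum depth
    };

    // Work thread: fill the record, then publish it with a release store.
    void publish(StackRecord* r, void* const* frames, uint32_t n) {
      for (uint32_t i = 0; i < n; ++i) r->stack[i] = frames[i];
      r->count.store(n, std::memory_order_release);
    }

    // Sweep thread: an acquire load of a nonzero count guarantees that the
    // frames written before the matching release store are visible.
    bool try_consume(StackRecord* r) {
      uint32_t n = r->count.load(std::memory_order_acquire);
      if (n == 0) return false;        // record not ready yet
      // ... look up / count this stack, then release the record ...
      return true;
    }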
• 63. Processing in Background

    // Sweep Thread:
    while (n = acquire_load(p->count)) {
      read_stack(p);
      if (look_up_stack() == false) new_stack();
      update_stack_count();
      discard(begin, p++);
    }

(Diagram: work threads allocate and write into the memory pool (regions Empty / Ready / Not ready / Released); the sweep thread reads at p, writes results into the disk pool, and discards processed pages; the flush thread discards disk-pool pages; "stop" ends the sweep.)
• 64. If one thread is good, two must be better
We have one consumer (sweep) thread trying to keep up with multiple producer (work) threads. If there are enough producer threads, the consumer thread can become saturated: the sweep falls behind, memory accumulates, memory runs out.
- Add another consumer thread: more complex synchronization (atomic count updates, atomic stack lookups/additions)
- Pipeline the processing: the first thread counts stacks, the second thread converts stack pointers to symbols; use another memory pool to pass data between the pipeline stages
• 65. What have we learned?
- Writing high-performance concurrent programs is hard; harder still for memory-bound programs
- A timeline of concurrent execution makes debugging performance problems easier
- Low overhead is key: low-overhead memory allocation, low-overhead synchronization
- Lock-free programming is easy if we have infinite memory; sometimes we can have infinite memory
- Thread-local data is like per-thread static
- Pipelines are an alternative to data sharing