How to Write Fast Code
18-645 (CMU, ECE)

 Basic Information
											Course number: 18-645, 12 units
											
											
											
											
											
											
											Spring 2008, MW: 4:30--6:00pm, HH B131
											
											
											
											
											
											
											
Instructor: Markus Püschel (PH B16, pueschel at ece, 8-4259)
TAs: Srinivas Chellappa (PH B10, schellap at andrew, 8-7104) and Frédéric de Mesmay (PH B10, fdemesma at andrew, 8-7104)
Admin: Carol Patterson (PH B15, carol at ece, 8-7286)
Office hours:
 T 3:00-4:00pm, Srinivas, PH B10
 R 11:30am-12:30pm, Frédéric, PH B10
 F 11:30am-12:30pm, Markus, PH B16
Requirements: solid C programming skills, senior undergraduate or graduate student
										
										
										
										
										
										
										
										
										
										 Course Description
The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly hard to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture. This interdisciplinary course introduces the student to the foundations and state-of-the-art techniques in high performance software development using important functionality such as linear algebra kernels, transforms, filters, and others as examples. The course will explain how to optimize for the memory hierarchy, take advantage of special instruction sets, and write multithreaded code for multicore platforms, based on state-of-the-art research. Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning. The course builds upon and extends the version taught in Spring 2005.

										 Topics Covered
											Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), O-calculus, algorithms in publishing
											
											
											
											
											
											
											Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets, multicore platforms
											
											
											
											
											
											
											
											Compilers: strengths, limitations, how to use
											
											
											
											
											
											
											
											Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy, using vector instructions, writing multithreaded code); these techniques are studied using the examples in the next bullet
											
											
											
											
											
											
											
											Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, sorting, others, your research project
											
											
											
											
											
											
											State-of-the-art research in Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others
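
As an illustration of the "optimization for the memory hierarchy" topic listed above, the following is a minimal sketch of loop blocking (tiling) for matrix-matrix multiplication (MMM). It is not taken from the course materials. The function names `mmm_naive` and `mmm_blocked` and the block size `BLOCK = 32` are assumptions chosen for this example; a tuned implementation would pick the block size to match the cache of the target platform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size; in practice it would be tuned to the cache. */
#define BLOCK 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Straightforward triple loop: C = C + A * B for n x n row-major matrices. */
void mmm_naive(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Blocked version: the same arithmetic, reorganized so that each step
 * works on tiles of at most BLOCK x BLOCK elements, which are reused
 * while they are likely to still be in cache. */
void mmm_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK) {
                /* Clamp tile bounds so n need not be a multiple of BLOCK. */
                size_t i_end = min_size(ii + BLOCK, n);
                size_t j_end = min_size(jj + BLOCK, n);
                size_t k_end = min_size(kk + BLOCK, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++) {
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < k_end; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
            }
}
```

Both functions perform the same 2n^3 floating-point operations; only the order in which memory is accessed differs. Whether the blocked version is actually faster depends on the matrix size, the compiler, and the platform, so it has to be measured.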
										
										
										
										
										
										
										
										 Goals of this Course
											Learn a guideline for writing fast numerical code and apply it in the homework and your research project
											Understand the connection between algorithms, implementations, and computer architecture 
											
											
											
											
											
											
											
											Learn some fundamental numerical algorithms
											
											
											
											
											
											
											
											Learn how to analyze numerical algorithms
										
										
										
										
										
										
										
										
										 Textbook
There is no textbook for this class. Some of the material follows this tutorial. The foundational part (algorithms, computer architecture, etc.) will be compiled from several standard books. The core part, which analyzes cutting-edge implementations for numerical problems, is compiled from research papers and the instructor's own experience.

										 Grading
											35% research project
												
													Topic: Very fast, ideally adaptive implementation of a numerical problem
													
													
													
													
													
													
													Team up in pairs
													
													
													
													
													
													
													28. January: Suggest a problem to me, or I will give you one
													
													
													
													
													
													
													Show "milestones" during semester
													
													
													
													
													
													
													
													
													Write a 4-page standard conference paper (template will be provided)
													
													
													
													
													
													
													
													
													Give a short presentation at the end of the semester
												
												
												
												
												
												
												15% midterm
												
													Mostly about algorithm analysis
													
													
													
													
													
													
													
													
													Some multiple choice
												
												
												
												
												
												
												
												
												40% homework
												
													Exercises on algorithm analysis
													
													
													
													
													
													
													
													
													Implementation exercises. Purpose: study the effect of program optimizations, compilers, special instructions, etc. Tasks: writing and submitting C code & creating runtime/performance plots
													
													
													
													
													
													
													
													
													Some templates will be provided
												
												
												
												
												
												
												
												
												10% class participation
												
												
													It is important to attend (many things I teach cannot be found in books)
													
													
													
													
													
													
													
													
													I encourage you to ask questions
													
													
													
													
													
													
													
													
													I will provide an anonymous feedback mechanism
												
												
												
												
												
												
												
												
												 Final Exam

												 Homework
											Homework 1: due Thursday, Jan 31st, 6pm. Solutions.
											
											
											
											
											
											
											Homework 2: due Thursday, Feb 7th, 6pm. Solutions.
											
											
											
											
											
											
											Homework 3 (code templates): due Thursday, Feb 14th, 6pm.
											
											
											
											
											
											
											
											Homework 4 (start research project): due Thursday, Feb 21st, 6pm.
											
											
											
											
											
											
											Homework 5 (paper, stream benchmark): due Friday, Feb 29th, 6pm.
											
											
											
											
											
											
											
											Homework 6 (code): 1(a) due Thursday, Mar 6th; rest due Friday, Mar 21st.
										Resources for SSE intrinsics: Intel manual, go to section on Streaming SIMD Extensions; Microsoft manual
Homework 7 (continue research project): due Friday, Mar 28th.
										
											
											
											
											
											
											
											
											Homework 8: due Friday, Apr 4th, 6pm.
										
											
											
											
											
											
											
											
											Homework 9 (continue research project): due Friday, Apr 11th. 
										
											
											
											
											
											
											
											
											From now on: work on research project; see timeline and instructions below.
										
										
										
										
										
										
										
										 Midterm
Wednesday, 05. March. Solutions.

										 Research Project
											How it works:
												
													You select a numerical problem and create a correct (verified) implementation in C and measure the performance/runtime you achieve
													
													
													
													
													
													
													
													You analyze the implementation and apply various optimization techniques (as explained in class)
													
													
													
													
													
													
													
													You write a paper about your work and give a poster presentation
												
													
													
													
													
													
													
													
													Each problem has a supervisor (shown below in parentheses)
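
The first step above (a verified implementation whose runtime and performance are measured) can be sketched as follows. This is only an illustration, not a course-provided template. The kernel `axpy` is a hypothetical stand-in for the numerical problem, the helper names are invented for this example, and the standard `clock()` function is assumed to be an acceptable timer. Its resolution is coarse, so real measurements would need enough repetitions or a finer timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Largest absolute difference between two vectors of length n.
 * Used to verify a result against a trusted reference. */
double max_abs_diff(size_t n, const double *x, const double *y)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Kernel under test (hypothetical example): y = y + alpha * x.
 * It performs 2n floating-point operations. */
void axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Average runtime in seconds of one axpy call, over reps repetitions.
 * Note that y is updated by every repetition. */
double time_axpy(size_t n, double alpha, const double *x, double *y, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        axpy(n, alpha, x, y);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}

/* Performance in floating-point operations per second. */
double flops_per_second(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds;
}
```

A typical use would first check `max_abs_diff` between the kernel's output and a reference result, and only then report the performance as `flops_per_second(2.0 * n, runtime)`. Runtime and performance are different quantities: an optimization that reduces the operation count can lower the runtime without raising the flop rate.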
												
												
												
												
												
												
												
												Template for 4 page paper:
												
											Poster presentation
												
													Buy a piece of cardboard of at least 2.5 x 3.5 feet (e.g., at Kinko's)
															
													
													
													
													
													
													
													
													You can make and print a poster or use a collection of slides (probably between 9 and 12)
													
													
													
													
													
													
													
													Poster template and more instructions
												Timeline:
												
												
													25. Apr., 6pm: First version of paper due
														
															use template and instructions above
															
															
															
															
															
															
															
															
															
															put a printout into the usual box
															
															
															
															
															
															
															
															
															
															paper is complete except for some final code optimization or performance line
															
															
															
															
															
															
															
															
															
															do a good job; otherwise we will ask you to fix things
														
														
														
														
														
														
														
														
														
														30. Apr., 5:30pm - 8:30pm, Scaife Hall: Poster presentations
														
													7. May, 6pm: Final paper and code due. Instructions:
														
															Put all your code into a .zip file, named 18645-project-userid.zip where userid is the user id of any one of the project members. 
															
															
															
															
															
															
															
															Inside this .zip file, include a README file in plain text that describes (briefly, in about 20 lines) how to compile and run your code. 
															
															
															
															
															
															
															
															Email your .zip file and a .pdf of your paper as two separate attachments to schellap+18645-project@andrew.cmu.edu. (Send only one email per group).
														
														
														
														
														
														
														
														Projects: 
												
													Mike Glazer and Kenny Stauffer: Singular-value decomposition (Fred)
												
													
													
													
													
													
													
													Dan Dancescu, Xunnan (Karl) Fu, and Joshua Primero: Eigenvalues (Fred)
												
													
													
													
													
													
													
													Teck Hua Lee, Brian Loo and Tze Chang Ng: Matrix inversion (Fred)
													Sheethal Bhat and Shreyas Venugopalan: Mean shift algorithm for segmentation (Vas)
												
													
													
													
													
													
													
													Theodoros Strigkos and Evangelos Vlachos: Stencil computations (Franz)
												
													
													
													
													
													
													
													Abhay M. Mavalankar and Anupama Suryanarayanan: Displacement based algorithms for Toeplitz matrices (Markus)
												
													
													
													
													
													
													
													Mukta Gore, Aruna Manjunatha, and Deepak M. Rangaraj: Motion estimation (Markus)
												
													
													
													
													
													
													
													Ramu Bhagavatula and Adam Hartman: Multiresolution classifier (Markus)
												
													
													
													
													
													
													
													Andrew Moyer and Panchalam S. Ramanujan: Kalman filter (Markus)
												
													
													
													
													
													
													
													Saagar Patel and Dmitriy Solomonov: Seam carving images (Vas)
												
													
													
													
													
													
													
													Hung-Chih Lai and Derrick B. Losli: Object detection (Franz)
												
													
													
													
													
													
													
													
													Shang-Wei Wang and William Wong: IIR filters (Markus)
												
													
													
													
													
													
													
													
													Atul Talesara and Vishal Mhatre: Arithmetic for large numbers (Markus)
												
													
													
													
													
													
													
													
													Syed W. Haider: Optimal binary search organization (Vas)
												
													
													
													
													
													
													
													Maladau Mou: 2-D correlation (Markus)
												
													
													
													
													
													
													
													Farhan Mohamed Ali and Chris Thayer: MMM on GPU (Franz)
										
										
												
												
												
												
												
												
												 Lectures (including pdfs, paper links may need CMU IP)
											1. Lecture (14. Jan.): Technicalities, overview and motivation (slides)
											
											
											
											
											
											
											
											2. Lecture (16. Jan.): Problem, algorithms, asymptotic analysis, divide-and-conquer algorithms (slides, notes)
										
											
											
											
											
											
											
											
											3. Lecture (23. Jan.): Asymptotic analysis (multiple variables), cost analysis, solving recurrences (slides, notes)
											
											
											
											
											
											
											
											4. Lecture (28. Jan.): Architecture, microarchitecture, cache (slides, notes)
											
											
											
											
											
											
											
											5. Lecture (30. Jan.): Runtime and performance, cache behavior of code (slides, notes)
										
											
											
											
											
											
											
											
											6. Lecture (04. Feb.): Linear algebra software, blocking, MMM (slides, notes)
										
											
											
											
											
											
											
											
											7. Lecture (06. Feb.): Optimizing MMM for the memory hierarchy, ATLAS (slides, notes)
										
											
											
											
											
											
											
											
											8. Lecture (11. Feb.): Model-based ATLAS (slides, notes, more notes, paper)
										
											
											
											
											
											
											
											
											9. Lecture (13. Feb.): Gauss elimination, LU factorization (slides, notes)
											
											
											
											
											
											
											
											10. Lecture (18. Feb.): LU factorization (cont'd), matrix inversion, determinant
										
											
											
											
											
											
											
											
											11. Lecture (20. Feb.): Sparse MVM, Sparsity/BeBOP (slides, notes)
										
											
											
											
											
											
											
											
											12. Lecture (25. Feb.): cancelled, replaced by one-on-one meetings
											
											
											
											
											
											
											
											13. Lecture (27. Feb.): SIMD vector instructions, part I
											
											
											
											
											
											
											14. Lecture (03. Mar.): SIMD vector instructions, part II (slides, notes)
											15. Lecture (05. Mar.): Midterm exam
										
											
											
											
											
											
											
											
											16. Lecture (17. Mar.): Small guide to benchmarking, small guide to making nice plots, linear transforms (slides)
										
											
											
											
											
											
											
											
											17. Lecture (19. Mar.): Transforms, structured matrices, FFT (notes)
										
											
											
											
											
											
											
											
											18. Lecture (24. Mar.): From structured matrices to code, complex arithmetic, recursive and iterative FFT (slides, notes)
										
											
											
											
											
											
											
											
											19. Lecture (26. Mar.): Fast DFT, FFTW (slides, notes, website)
											
											
											
											
											
											
											20. Lecture (31. Mar.): Parallelism is the future (slides, notes)
											
											
											
											
											
											
											21. Lecture (02. Apr.): Shared memory parallelism, OpenMP (slides, notes)
											
											
											
											
											
											
											22. Lecture (07. Apr.): cancelled, replaced by one-on-one meetings 
											
											
											
											
											
											
											
											23. Lecture (09. Apr.): Spiral, library generator for transforms (slides, website)
											
											
											
											
											
											
											24. Lecture (14. Apr.): Matlab, how it works, profiling and short performance guide, including C code (slides)
											
											
											
											
											
											
											
											25. Lecture (16. Apr.): Filtering and convolution (slides, notes)
										
											
											
											
											
											
											
											
											26. Lecture (21. Apr.): Sorting (slides, notes)
											
											
											
											
											
											
											
											27. Lecture (23. Apr.): Optimized and adaptive sorting (slides, notes, paper1, paper2)
											
											
											
											
											
											
											
											28. Lecture (28. Apr.): cancelled
											
											
											
											
											
											
											
											29. Lecture (30. Apr.): Poster presentation, 5:30pm-8:30pm, Scaife Hall (instructions above)