Real-Time Digital Signal Processing


Real-Time Digital Signal Processing
Implementations and Applications
Second Edition

Sen M. Kuo, Northern Illinois University, USA
Bob H. Lee, Ingenient Technologies Inc., USA
Wenshun Tian, UTStarcom Inc., USA

Copyright © 2006 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England. All rights reserved.

Library of Congress Cataloging-in-Publication Data: Kuo, Sen M. (Sen-Maw). Real-time digital signal processing: implementations, applications and experiments with the TMS320C55x / Sen M. Kuo, Bob H. Lee, Wenshun Tian. 2nd ed. Includes bibliographical references and index. ISBN 0-470-01495-4 (cloth). 1. Signal processing - Digital techniques. 2. Texas Instruments TMS320 series microprocessors. I. Lee, Bob H. II. Tian, Wenshun. III. Title.
British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library. ISBN-13 978-0-470-01495-0; ISBN-10 0-470-01495-4.

Contents

Preface

1 Introduction to Real-Time Digital Signal Processing
1.1 Basic Elements of Real-Time DSP Systems
1.2 Analog Interface
1.3 DSP Hardware
1.4 DSP System Design
1.5 Introduction to DSP Development Tools
1.6 Experiments and Program Examples

2 Introduction to TMS320C55x Digital Signal Processor
2.1 Introduction
2.2 TMS320C55x Architecture
2.3 TMS320C55x Peripherals
2.4 TMS320C55x Addressing Modes
2.5 Pipeline and Parallelism
2.6 TMS320C55x Instruction Set
2.7 TMS320C55x Assembly Language Programming
2.8 C Language Programming for TMS320C55x
2.9 Mixed C-and-Assembly Language Programming
2.10 Experiments and Program Examples
3 DSP Fundamentals and Implementation Considerations
3.1 Digital Signals and Systems
3.2 System Concepts
3.3 Introduction to Random Variables
3.4 Fixed-Point Representations and Quantization Effects
3.5 Overflow and Solutions
3.6 Experiments and Program Examples

4 Design and Implementation of FIR Filters
4.1 Introduction to FIR Filters
4.2 Design of FIR Filters
4.3 Implementation Considerations
4.4 Applications: Interpolation and Decimation Filters
4.5 Experiments and Program Examples

5 Design and Implementation of IIR Filters
5.1 Introduction
5.2 Design of IIR Filters
5.3 Realization of IIR Filters
5.4 Design of IIR Filters Using MATLAB
5.5 Implementation Considerations
5.6 Practical Applications
5.7 Experiments and Program Examples
6 Frequency Analysis and Fast Fourier Transform
6.1 Fourier Series and Transform
6.2 Discrete Fourier Transform
6.3 Fast Fourier Transforms
6.4 Implementation Considerations
6.5 Practical Applications
6.6 Experiments and Program Examples

7 Adaptive Filtering
7.1 Introduction to Random Processes
7.2 Adaptive Filters
7.3 Performance Analysis
7.4 Implementation Considerations
7.5 Practical Applications
7.6 Experiments and Program Examples

8 Digital Signal Generators
8.1 Sinewave Generators
8.2 Noise Generators
8.3 Practical Applications
8.4 Experiments and Program Examples
9 Dual-Tone Multifrequency Detection
9.1 Introduction
9.2 DTMF Tone Detection
9.3 Internet Application Issues and Solutions
9.4 Experiments and Program Examples

10 Adaptive Echo Cancelation
10.1 Introduction to Line Echoes
10.2 Adaptive Echo Canceler
10.3 Practical Considerations
10.4 Double-Talk Effects and Solutions
10.5 Nonlinear Processor
10.6 Acoustic Echo Cancelation
10.7 Experiments and Program Examples

11 Speech-Coding Techniques
11.1 Introduction to Speech-Coding
11.2 Overview of CELP Vocoders
11.3 Overview of Some Popular CODECs
11.4 Voice over Internet Protocol Applications
11.5 Experiments and Program Examples

12 Speech Enhancement Techniques
12.1 Introduction to Noise Reduction Techniques
12.2 Spectral Subtraction Techniques
12.3 Voice Activity Detection
12.4 Implementation Considerations
12.5 Combination of Acoustic Echo Cancelation with NR
12.6 Voice Enhancement and Automatic Level Control
12.7 Experiments and Program Examples
13 Audio Signal Processing
13.1 Introduction
13.2 Basic Principles of Audio Coding
13.3 Multichannel Audio Coding
13.4 Connectivity Processing
13.5 Experiments and Program Examples

14 Channel Coding Techniques
14.1 Introduction
14.2 Block Codes
14.3 Convolutional Codes
14.4 Experiments and Program Examples

15 Introduction to Digital Image Processing
15.1 Digital Images and Systems
15.2 RGB Color Spaces and Color Filter Array Interpolation
15.3 Color Spaces
15.4 YCbCr Subsampled Color Spaces
15.5 Color Balance and Correction
15.6 Image Histogram
15.7 Image Filtering
15.8 Image Filtering Using Fast Convolution
15.9 Practical Applications
15.10 Experiments and Program Examples

Appendix A Some Useful Formulas and Definitions
Appendix B Software Organization and List of Experiments

Index

Preface

In recent years, digital signal processing (DSP) has expanded beyond filtering, frequency analysis, and signal generation. More and more markets are opening up to DSP applications, where, in the past, real-time signal processing was not feasible or was too expensive.
Real-time signal processing using general-purpose DSP processors provides an effective way to design and implement DSP algorithms for real-world applications, but it remains challenging work in today's engineering fields. With DSP penetrating many practical applications, the demand for high-performance digital signal processors has expanded rapidly in recent years. Many industrial companies are currently engaged in real-time DSP research and development. It therefore becomes increasingly important for today's students, practicing engineers, and development researchers to master not only the theory of DSP, but also the skills of real-time DSP system design and implementation.

This book provides fundamental real-time DSP principles and uses a hands-on approach to introduce DSP algorithms, system design, real-time implementation considerations, and many practical applications. It contains many useful examples, such as hands-on experiment software and DSP programs using MATLAB, Simulink, C, and DSP assembly languages. Also included are various exercises for further exploring extensions of the examples and experiments. The book uses Texas Instruments' Code Composer Studio (CCS) with the Spectrum Digital TMS320VC5510 DSP starter kit (DSK) development tool for real-time experiments and applications.

This book emphasizes real-time DSP applications and is intended as a text for senior/graduate-level college students. The prerequisites of this book are signals and systems concepts, microprocessor architecture and programming, and basic C programming knowledge. These topics are covered at the sophomore and junior levels of electrical and computer engineering, computer science, and other related engineering curricula. This book can also serve as a desktop reference for DSP engineers, algorithm developers, and embedded system programmers to learn DSP concepts and to develop real-time DSP applications on the job. We use a practical approach that avoids numerous theoretical derivations. A list of DSP textbooks with mathematical proofs is given at the end of each chapter. Also helpful are the manuals and application notes for the TMS320C55x DSP processors from Texas Instruments at www.ti.com, and for MATLAB and Simulink from The MathWorks at www.mathworks.com.

This is the second edition of the book titled 'Real-Time Digital Signal Processing: Implementations, Applications and Experiments with the TMS320C55x' by Kuo and Lee, published by John Wiley & Sons, Ltd in 2001. The major changes included in this revision are:

1. To utilize an effective software development process that begins with algorithm design and verification using MATLAB and floating-point C, and proceeds to finite-wordlength analysis, fixed-point C implementation, and code optimization using intrinsics, assembly routines, and mixed C-and-assembly programming on fixed-point DSP processors. This step-by-step software development and optimization process is applied to finite-impulse response (FIR) filtering, infinite-impulse response (IIR) filtering, adaptive filtering, the fast Fourier transform, and many real-life applications in Chapters 8-15.
2. To add several widely used DSP applications such as speech coding, channel coding, audio coding, image processing, signal generation and detection, echo cancelation, and noise reduction, by expanding Chapter 9 of the first edition into eight new chapters with the necessary background to perform the experiments using the optimized software development process.

3. To design and analyze DSP algorithms using the most effective MATLAB graphical user interface (GUI) tools, such as the Signal Processing Tool (SPTool) and the Filter Design and Analysis Tool (FDATool). These tools are powerful for filter design, analysis, quantization, testing, and implementation.

4. To add step-by-step experiments to create CCS DSP/BIOS applications, configure the TMS320VC5510 DSK for real-time audio applications, and utilize MATLAB's Link for CCS feature to improve the efficiency of DSP development, debugging, analysis, and testing.

5. To update the experiments to include new sets of hands-on exercises and applications, and to update all programs using the most recent version of the software and the TMS320C5510 DSK board for real-time experiments.

There are many existing DSP algorithms and applications available as MATLAB and floating-point C programs. This book provides a systematic software development process for converting these programs to fixed-point C and optimizing them for implementation on commercially available fixed-point DSP processors. To effectively illustrate real-time DSP concepts and applications, MATLAB is used for analysis and filter design, C programs are used for implementing DSP algorithms, and CCS is integrated into the TMS320C55x experiments and applications. To efficiently utilize the advanced DSP architecture for fast software development and maintenance, the mixing of C and assembly programs is emphasized.

This book is organized into two parts: DSP implementation and DSP application. Part I, DSP implementation (Chapters 1-7), discusses real-time DSP principles, architectures, algorithms, and implementation considerations. Chapter 1 reviews the fundamentals of real-time DSP functional blocks, DSP hardware options, fixed- and floating-point DSP devices, real-time constraints, algorithm development, selection of DSP chips, and software development. Chapter 2 introduces the architecture and assembly programming of the TMS320C55x DSP processor. Chapter 3 presents fundamental DSP concepts and practical considerations for the implementation of digital filters and algorithms on DSP hardware. Chapter 4 focuses on the design, implementation, and application of FIR filters. Digital IIR filters are covered in Chapter 5, and adaptive filters are presented in Chapter 7. The development, implementation, and application of FFT algorithms are introduced in Chapter 6.

Part II, DSP application (Chapters 8-15), introduces several popular real-world signal processing applications that have played important roles in the realization of DSP systems. These selected applications include signal (sinewave, noise, and multitone) generation in Chapter 8, dual-tone multifrequency detection in Chapter 9, adaptive echo cancelation in Chapter 10, speech-coding algorithms in Chapter 11, speech enhancement techniques in Chapter 12, audio coding methods in Chapter 13, error correction coding techniques in Chapter 14, and image processing fundamentals in Chapter 15.
As with any book attempting to capture the state of the art at a given time, there will certainly be updates that are necessitated by the rapidly evolving developments in this dynamic field. We are certain that this book will serve as a guide for what has already come and as an inspiration for what will follow.

Software Availability

This text utilizes various MATLAB, floating-point and fixed-point C, DSP assembly, and mixed C-and-assembly programs for the examples, experiments, and applications. These programs, along with many other programs and real-world data files, are available on the companion CD. The directory structure and the subdirectory names are explained in Appendix B. The software will assist in gaining insight into the understanding and implementation of DSP algorithms, and it is required for doing the experiments in the last section of each chapter. Some of these experiments involve minor modifications of the example code. By examining, studying, and modifying the example code, the software can also be used as a prototype for other practical applications. Every attempt has been made to ensure the correctness of the code. We would appreciate readers bringing to our attention (kuo@ceet.niu.edu) any coding errors so that we can correct, update, and post them on the website http://www.ceet.niu.edu/faculty/kuo.

Acknowledgments

We are grateful to Cathy Wicks and Gene Frantz of Texas Instruments, and to Naomi Fernandes and Courtney Esposito of The MathWorks for providing us with the support needed to write this book. We would like to thank several individuals at Wiley for their support on this project: Simone Taylor, Executive Commissioning Editor; Emily Bone, Assistant Editor; and Lucy Bryan, Executive Project Editor. We also thank the staff at Wiley for the final preparation of this book. Finally, we thank our families for the endless love, encouragement, patience, and understanding they have shown throughout this period.

Sen M. Kuo, Bob H. Lee, and Wenshun Tian

1 Introduction to Real-Time Digital Signal Processing

Signals can be divided into three categories: continuous-time (analog) signals, discrete-time signals, and digital signals. The signals that we encounter daily are mostly analog signals. These signals are defined continuously in time, have an infinite range of amplitude values, and can be processed using analog electronics containing both active and passive circuit elements. Discrete-time signals are defined only at a particular set of time instants. Therefore, they can be represented as a sequence of numbers that have a continuous range of values. Digital signals have discrete values in both time and amplitude; thus, they can be processed by computers or microprocessors. In this book, we will present the design, implementation, and applications of digital systems for processing digital signals using digital hardware. However, the analysis usually uses discrete-time signals and systems for mathematical convenience. Therefore, we use the terms 'discrete-time' and 'digital' interchangeably.

Digital signal processing (DSP) is concerned with the digital representation of signals and the use of digital systems to analyze, modify, store, or extract information from these signals. Much research has been conducted to develop DSP algorithms and systems for real-world applications.
In recent years, the rapid advancement in digital technologies has supported the implementation of sophisticated DSP algorithms for real-time applications. DSP is now used not only in areas where analog methods were used previously, but also in areas where applying analog techniques is very difficult or impossible.

There are many advantages in using digital techniques for signal processing rather than traditional analog devices such as amplifiers, modulators, and filters. Some of the advantages of a DSP system over analog circuitry are summarized as follows:

1. Flexibility: Functions of a DSP system can be easily modified and upgraded with software that implements the specific applications. One can design a DSP system that can be programmed to perform a wide variety of tasks by executing different software modules. A digital electronic device can be easily upgraded in the field through onboard memory devices (e.g., flash memory) to meet new requirements or improve its features.

2. Reproducibility: The performance of a DSP system can be repeated precisely from one unit to another. In addition, by using DSP techniques, digital signals such as audio and video streams can be stored, transferred, or reproduced many times without degrading the quality. By contrast, analog circuits will not have the same characteristics even if they are built following identical specifications, due to analog component tolerances.

3. Reliability: The memory and logic of DSP hardware do not deteriorate with age. Therefore, the field performance of DSP systems will not drift with changing environmental conditions or aged electronic components as their analog counterparts do.

4. Complexity: DSP allows sophisticated applications such as speech recognition and image compression to be implemented with lightweight and low-power portable devices. Furthermore, there are some important signal processing algorithms, such as error-correcting codes, data transmission and storage, and data compression, which can only be performed using DSP systems.

With the rapid evolution in semiconductor technologies, DSP systems have a lower overall cost compared to analog systems for most applications. DSP algorithms can be developed, analyzed, and simulated using high-level languages and software tools such as C/C++ and MATLAB (matrix laboratory). The performance of the algorithms can be verified using a low-cost, general-purpose computer. Therefore, a DSP system is relatively easy to design, develop, analyze, simulate, test, and maintain.

There are some limitations associated with DSP. For instance, the bandwidth of a DSP system is limited by the sampling rate and hardware peripherals. Also, DSP algorithms are implemented using a fixed number of bits with a limited precision and dynamic range (the ratio between the largest and smallest numbers that can be represented), which results in quantization and arithmetic errors. Thus, the system performance might be different from the theoretical expectation.

1.1 Basic Elements of Real-Time DSP Systems

There are two types of DSP applications: non-real-time and real-time. Non-real-time signal processing involves manipulating signals that have already been collected in digital form.
This may or may not represent a current action, and the requirement for the processing result is not a function of real time. Real-time signal processing places stringent demands on DSP hardware and software designs to complete predefined tasks within a certain time frame. This chapter reviews the fundamental functional blocks of real-time DSP systems.

The basic functional blocks of DSP systems are illustrated in Figure 1.1, where a real-world analog signal is converted to a digital signal, processed by DSP hardware, and converted back into an analog signal.

Figure 1.1 Basic functional block diagram of a real-time DSP system: the input x(t) is amplified to x′(t), passed through an antialiasing filter and ADC to produce x(n); the DSP hardware produces y(n), which passes through a DAC, reconstruction filter, and amplifier to form the output y(t); the digital input and output channels can also connect to other digital systems.

Each of the functional blocks in Figure 1.1 will be introduced in the subsequent sections. For some applications, the input signal may already be in digital form and/or the output data may not need to be converted to an analog signal. For example, the processed digital information may be stored in computer memory for later use, or it may be displayed graphically. In other applications, the DSP system may be required to generate signals digitally, such as speech synthesis used for computerized services or pseudo-random number generators for CDMA (code division multiple access) wireless communication systems.

1.2 Analog Interface

In this book, a time-domain signal is denoted with a lowercase letter. For example, x(t) in Figure 1.1 is used to name an analog signal x which is a function of time t. The time variable t and the amplitude of x(t) take on a continuum of values between −∞ and ∞. For this reason we say x(t) is a continuous-time signal. The signals x(n) and y(n) in Figure 1.1 depict digital signals, which are only meaningful at time instant n. In this section, we first discuss how to convert analog signals into digital signals so that they can be processed using DSP hardware. The process of converting an analog signal to a digital signal is called analog-to-digital conversion, usually performed by an analog-to-digital converter (ADC).

The purpose of signal conversion is to prepare real-world analog signals for processing by digital hardware. As shown in Figure 1.1, the analog signal x(t) is picked up by an appropriate electronic sensor that converts pressure, temperature, or sound into an electrical signal. For example, a microphone can be used to collect sound signals. The sensor signal x(t) is amplified by an amplifier with gain value g. The amplified signal is

x′(t) = g x(t).   (1.1)

The gain value g is determined such that x′(t) has a dynamic range that matches the ADC used by the system. If the peak-to-peak voltage range of the ADC is ±5 V, then g may be set so that the amplitude of the signal x′(t) presented to the ADC is within ±5 V. In practice, it is very difficult to set an appropriate fixed gain because the level of x(t) may be unknown and changing with time, especially for signals with a larger dynamic range such as human speech.

Once the input digital signal has been processed by the DSP hardware, the result y(n) is still in digital form. In many DSP applications, we need to reconstruct the analog signal after the completion of digital processing. We must convert the digital signal y(n) back to the analog signal y(t) before it is applied to an appropriate analog device.
This process is called digital-to-analog conversion, typically performed by a digital-to-analog converter (DAC). One example would be audio CD (compact disc) players, for which the audio music signals are stored in digital form on CDs. A CD player reads the encoded digital audio signals from the disc and reconstructs the corresponding analog waveform for playback via loudspeakers.

The system shown in Figure 1.1 is a real-time system if the signal to the ADC is continuously sampled and the ADC presents a new sample to the DSP hardware at the same rate. In order to maintain real-time operation, the DSP hardware must perform all required operations within the fixed time period, and present an output sample to the DAC before the arrival of the next sample from the ADC.

1.2.1 Sampling

As shown in Figure 1.1, the ADC converts the analog signal x(t) into the digital signal x(n). Analog-to-digital conversion, commonly referred to as digitization, consists of the sampling (digitization in time) and quantization (digitization in amplitude) processes, as illustrated in Figure 1.2. The sampling process depicts an analog signal as a sequence of values. The basic sampling function can be carried out with an ideal 'sample-and-hold' circuit, which maintains the sampled signal level until the next sample is taken.

Figure 1.2 Block diagram of an ADC: an ideal sampler converts x(t) to x(nT), followed by a quantizer that produces x(n)

The quantization process approximates a waveform by assigning a number to each sample. Therefore, analog-to-digital conversion performs the following steps:

1. The bandlimited signal x(t) is sampled at uniformly spaced instants of time nT, where n is a positive integer and T is the sampling period in seconds. This sampling process converts an analog signal into a discrete-time signal x(nT) with continuous amplitude value.

2. The amplitude of each discrete-time sample is quantized into one of 2^B levels, where B is the number of bits the ADC uses to represent each sample. The discrete amplitude levels are represented (or encoded) as distinct binary words x(n) with a fixed wordlength B.

The reason for making this distinction is that the two processes introduce different distortions. The sampling process brings in aliasing or folding distortion, while the encoding process results in quantization noise. As shown in Figure 1.2, the sampler and quantizer are integrated on the same chip. However, high-speed ADCs typically require an external sample-and-hold device.

An ideal sampler can be considered as a switch that periodically opens and closes every T s (seconds). The sampling period is defined as

T = 1/fs,   (1.2)

where fs is the sampling frequency (or sampling rate) in hertz (or cycles per second). The intermediate signal x(nT) is a discrete-time signal with a continuous value (a number with infinite precision) at discrete time nT, n = 0, 1, ..., ∞, as illustrated in Figure 1.3. The analog signal x(t) is continuous in both time and amplitude. The sampled discrete-time signal x(nT) is continuous in amplitude, but is defined only at discrete sampling instants t = nT.

Figure 1.3 Example of analog signal x(t) and discrete-time signal x(nT) (samples taken at t = 0, T, 2T, 3T, 4T)
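The real-time constraint described earlier in this section (each input sample must be processed and the output delivered to the DAC within one sampling period T = 1/fs) can be illustrated with a short C skeleton. This is only a conceptual sketch: the ADC and DAC are replaced by stand-in routines so that the program is self-contained, and the 'algorithm' is just a simple gain.

/* Conceptual real-time processing skeleton: one input sample in, one output
   sample out, every sampling period T = 1/fs.  The ADC and DAC are simulated
   here by stand-in routines so that the sketch is self-contained.            */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846
#define FS 8000.0                        /* sampling frequency in Hz           */

static double read_adc_sample(int n)     /* simulated ADC: a 1 kHz test tone   */
{
    return sin(2.0 * PI * 1000.0 * n / FS);
}

static void write_dac_sample(double y)   /* simulated DAC: print the sample    */
{
    printf("%f\n", y);
}

static double process_sample(double x)   /* the DSP algorithm goes here; it    */
{                                        /* must finish within T = 1/fs seconds */
    return 0.5 * x;                      /* simple gain for illustration       */
}

int main(void)
{
    for (int n = 0; n < 16; n++) {       /* a real system loops indefinitely   */
        double x = read_adc_sample(n);   /* a new input sample every T seconds */
        double y = process_sample(x);
        write_dac_sample(y);             /* must be ready before the next sample */
    }
    return 0;
}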
In order to represent an analog signal x(t) by a discrete-time signal x(nT) accurately, the sampling frequency fs must be at least twice the maximum frequency component (fM) in the analog signal x(t). That is,

fs ≥ 2 fM,   (1.3)

where fM is also called the bandwidth of the signal x(t). This is Shannon's sampling theorem, which states that when the sampling frequency is greater than twice the highest frequency component contained in the analog signal, the original signal x(t) can be perfectly reconstructed from the corresponding discrete-time signal x(nT). The minimum sampling rate fs = 2 fM is called the Nyquist rate. The frequency fN = fs/2 is called the Nyquist frequency or folding frequency. The frequency interval [−fs/2, fs/2] is called the Nyquist interval.

When an analog signal is sampled at fs, frequency components higher than fs/2 fold back into the frequency range [0, fs/2]. The folded-back frequency components overlap with the original frequency components in the same range. Therefore, the original analog signal cannot be recovered from the sampled data. This undesired effect is known as aliasing.

Example 1.1: Consider two sinewaves of frequencies f1 = 1 Hz and f2 = 5 Hz that are sampled at fs = 4 Hz, rather than at 10 Hz or higher as required by the sampling theorem. The analog waveforms are illustrated in Figure 1.4(a), while their digital samples and reconstructed waveforms are illustrated in Figure 1.4(b). As shown in the figures, we can reconstruct the original waveform from the digital samples for the sinewave of frequency f1 = 1 Hz. However, for the original sinewave of frequency f2 = 5 Hz, the reconstructed signal is identical to the sinewave of frequency 1 Hz. Therefore, f1 and f2 are said to be aliased to one another, i.e., they cannot be distinguished by their discrete-time samples.

Figure 1.4 Example of the aliasing phenomenon: (a) original analog waveforms and digital samples for f1 = 1 Hz and f2 = 5 Hz; (b) digital samples of f1 = 1 Hz and f2 = 5 Hz and reconstructed waveforms

Note that the sampling theorem assumes that the signal is bandlimited. For most practical applications, the analog signal x(t) may have significant energies outside the highest frequency of interest, or may contain noise with a wider bandwidth. In some cases, the sampling rate is predetermined by a given application. For example, most voice communication systems use an 8 kHz sampling rate. Unfortunately, the frequency components in a speech signal can be much higher than 4 kHz. To guarantee that the sampling theorem defined in Equation (1.3) can be fulfilled, we must block the frequency components that are above the Nyquist frequency. This can be done by using an antialiasing filter, which is an analog lowpass filter with cutoff frequency

fc ≤ fs/2.   (1.4)

Ideally, an antialiasing filter should remove all frequency components above the Nyquist frequency.
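Example 1.1 can be verified numerically. The short C program below (an illustrative sketch, not part of the book's experiment software) computes the samples of both sinewaves at fs = 4 Hz; the two sample sequences agree to within floating-point rounding, which is exactly the aliasing described above.

/* Demonstrates Example 1.1 numerically: a 1 Hz and a 5 Hz sinewave sampled at
   fs = 4 Hz produce the same sample sequence (up to floating-point rounding),
   i.e., the two frequencies alias to one another.                             */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

int main(void)
{
    const double fs = 4.0;                        /* sampling frequency in Hz   */
    const double f1 = 1.0, f2 = 5.0;              /* sinewave frequencies in Hz */

    printf("  n    x1(n) = sin(2*pi*1*n/4)    x2(n) = sin(2*pi*5*n/4)\n");
    for (int n = 0; n < 8; n++) {
        double x1 = sin(2.0 * PI * f1 * n / fs);  /* samples of the 1 Hz tone   */
        double x2 = sin(2.0 * PI * f2 * n / fs);  /* samples of the 5 Hz tone   */
        printf("%3d   %22.6f    %22.6f\n", n, x1, x2);
    }
    return 0;    /* the two columns agree to within floating-point rounding     */
}

Compiling and running this program with any C compiler and the math library (e.g., cc alias.c -lm) shows that the two columns are numerically indistinguishable.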
In many practical systems, a bandpass filter is preferred in order to remove frequency components above the Nyquist frequency, as well as to prevent undesired DC offset, 60 Hz hum, or other low-frequency noise. For example, a bandpass filter with a passband from 300 to 3200 Hz can often be found in telecommunication systems.

Since antialiasing filters used in real-world applications are not ideal filters, they cannot completely remove all frequency components outside the Nyquist interval. In addition, since the phase response of the analog filter may not be linear, the phase of the signal will not be shifted by amounts proportional to frequency. In general, a lowpass (or bandpass) filter with steeper roll-off will introduce more phase distortion. Higher sampling rates allow simple, low-cost antialiasing filters with minimum phase distortion to be used. This technique is known as oversampling, and it is widely used in audio applications.

Example 1.2: The range of sampling rates required by DSP systems is large, from approximately 1 GHz in radar down to 1 Hz in instrumentation. Given a sampling rate for a specific application, the sampling period can be determined by (1.2). Some real-world applications use the following sampling frequencies and periods:

1. In International Telecommunication Union (ITU) speech compression standards, the sampling rate of ITU-T G.729 and G.723.1 is fs = 8 kHz, thus the sampling period T = 1/8000 s = 125 μs. Note that 1 μs = 10^-6 s.

2. Wideband telecommunication systems, such as ITU-T G.722, use a sampling rate of fs = 16 kHz, thus T = 1/16 000 s = 62.5 μs.

3. In audio CDs, the sampling rate is fs = 44.1 kHz, thus T = 1/44 100 s = 22.676 μs.

4. High-fidelity audio systems, such as the MPEG-2 (moving picture experts group) AAC (advanced audio coding) standard, the MP3 (MPEG layer 3) audio compression standard, and Dolby AC-3, have a sampling rate of fs = 48 kHz, and thus T = 1/48 000 s = 20.833 μs. The sampling rate for MPEG-2 AAC can be as high as 96 kHz.

The speech compression algorithms will be discussed in Chapter 11 and the audio coding techniques will be introduced in Chapter 13.

1.2.2 Quantization and Encoding

In the previous sections, we assumed that the sample values x(nT) are represented exactly with an infinite number of bits (i.e., B → ∞). We now discuss a method of representing the sampled discrete-time signal x(nT) as a binary number with a finite number of bits. This is the quantization and encoding process. If the wordlength of an ADC is B bits, there are 2^B different values (levels) that can be used to represent a sample. If x(n) lies between two quantization levels, it will be either rounded or truncated. Rounding replaces x(n) by the value of the nearest quantization level, while truncation replaces x(n) by the value of the level below it. Since rounding produces a less biased representation of the analog values, it is widely used by ADCs. Therefore, quantization is a process that represents an analog-valued sample x(nT) with its nearest level, which corresponds to the digital signal x(n).

We can use 2 bits to define four equally spaced levels (00, 01, 10, and 11) to classify the signal into four subranges, as illustrated in Figure 1.5. In this figure, the symbol 'o' represents the discrete-time signal x(nT), and the symbol '•' represents the digital signal x(n). The spacing between two consecutive quantization levels is called the quantization width, step, or resolution.
If the spacing between these levels is the same, then we have a uniform quantizer. For uniform quantization, the resolution is given by dividing the full-scale range by the number of quantization levels, 2^B. In Figure 1.5, the difference between the quantized number and the original value is defined as the quantization error, which appears as noise in the converter output. It is also called quantization noise, and is assumed to be a random variable that is uniformly distributed. If a B-bit quantizer is used, the signal-to-quantization-noise ratio (SQNR) is approximated by (this will be derived in Chapter 3)

SQNR ≈ 6B dB.   (1.5)

This is a theoretical maximum. In practice, the achievable SQNR will be less than this value due to imperfections in the fabrication of converters. However, Equation (1.5) still provides a simple guideline for determining the required number of bits for a given application. For each additional bit, a digital signal gains about 6 dB in SQNR. The problems of quantization and their solutions will be further discussed in Chapter 3.

Figure 1.5 Digital samples using a 2-bit quantizer (quantization levels 00, 01, 10, and 11 versus time, with the quantization errors indicated)

Example 1.3: If the input signal varies between 0 and 5 V, we have the following resolutions and SQNRs for commonly used data converters:

1. An 8-bit ADC with 256 (2^8) levels can only provide 19.5 mV resolution and 48 dB SQNR.

2. A 12-bit ADC has 4096 (2^12) levels with 1.22 mV resolution, and provides 72 dB SQNR.

3. A 16-bit ADC has 65 536 (2^16) levels, and thus provides 76.294 μV resolution with 96 dB SQNR.

Obviously, with more quantization levels, one can represent analog signals more accurately.

The dynamic range of speech signals is very large. If the uniform quantization scheme shown in Figure 1.5 can adequately represent loud sounds, most of the softer sounds may be pushed into the same small value. This means that soft sounds may not be distinguishable. To solve this problem, a quantizer whose quantization level varies according to the signal amplitude can be used. In practice, the nonuniform quantizer uses uniform levels, but the input signal is compressed first using a logarithmic function. That is, the logarithm-scaled signal, rather than the original input signal itself, is quantized. The compressed signal can be reconstructed by expanding it. The process of compression and expansion is called companding (compressing and expanding). For example, the ITU-T G.711 μ-law (used in North America and parts of Northeast Asia) and A-law (used in Europe and most of the rest of the world) companding schemes are used in most digital telecommunication systems. The A-law companding scheme gives slightly better performance at high signal levels, while the μ-law is better at low levels.

As shown in Figure 1.1, the input signal to the DSP hardware may be a digital signal from other DSP systems. In this case, the sampling rate of the digital signals from the other digital systems must be known. The signal processing techniques called interpolation and decimation can be used to increase or decrease the sampling rates of existing digital signals. Sampling rate changes may be required in many multirate DSP systems, for example when interconnecting DSP systems that operate at different rates.
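The resolution and SQNR figures in Example 1.3, as well as the rounding operation of a uniform quantizer, can be checked with a short C program. This is an illustrative sketch only; the SQNR values printed are simply the 6B dB approximation of Equation (1.5), not measured results.

/* Uniform quantization sketch: resolution and approximate SQNR for a B-bit
   converter with a 0 to 5 V input range, plus rounding of one analog sample. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double full_scale = 5.0;                     /* input range in volts    */
    const int    bits[] = { 8, 12, 16 };

    for (int i = 0; i < 3; i++) {
        int    B      = bits[i];
        double levels = pow(2.0, B);                   /* 2^B quantization levels */
        double step   = full_scale / levels;           /* resolution (step size)  */
        double sqnr   = 6.0 * B;                       /* SQNR approx. 6B dB (1.5) */
        printf("B = %2d: %8.0f levels, resolution = %10.6f V, SQNR = %5.1f dB\n",
               B, levels, step, sqnr);
    }

    /* Quantize one sample by rounding to the nearest 12-bit level. */
    double x     = 1.2345;                             /* analog value in volts   */
    double delta = full_scale / 4096.0;                /* 12-bit step size        */
    double xq    = delta * floor(x / delta + 0.5);     /* rounded quantized value */
    printf("x = %.6f V quantized to %.6f V (error %.6f V)\n", x, xq, x - xq);
    return 0;
}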
1.2.3 Smoothing Filters

Most commercial DACs are zero-order-hold devices, meaning they convert the input binary number to the corresponding voltage level and then hold that value for T s until the next sampling instant. Therefore, the DAC produces a staircase-shaped analog waveform y′(t), as shown by the solid line in Figure 1.6, which is a rectangular waveform with amplitude equal to the input value and duration of T s. Obviously, this staircase output contains some high-frequency components due to the abrupt changes in signal level. The reconstruction or smoothing filter shown in Figure 1.1 smooths the staircase-like analog signal generated by the DAC. This lowpass filtering has the effect of rounding off the corners (high-frequency components) of the staircase signal and making it smoother, as shown by the dotted line in Figure 1.6. This analog lowpass filter may have the same specifications as the antialiasing filter, with cutoff frequency fc ≤ fs/2.

Figure 1.6 Staircase waveform y′(t) generated by a DAC, together with the smoothed output signal

High-quality DSP applications, such as professional digital audio, require the use of reconstruction filters with very stringent specifications. To reduce the cost of using high-quality analog filters, the oversampling technique can be adopted to allow the use of low-cost filters with slower roll-off.

1.2.4 Data Converters

There are two schemes for connecting ADCs and DACs to DSP processors: serial and parallel. A parallel converter receives or transmits all B bits in one pass, while a serial converter receives or transmits the B bits in a serial bit stream. Parallel converters must be attached to the DSP processor's external address and data buses, which are also attached to many different types of devices. Serial converters can be connected directly to the built-in serial ports of DSP processors. This is why many practical DSP systems use serial ADCs and DACs.

Many applications use a single-chip device called an analog interface chip (AIC) or coder/decoder (CODEC), which integrates an antialiasing filter, an ADC, a DAC, and a reconstruction filter all on a single piece of silicon. In this book, we will use Texas Instruments' TLV320AIC23 (AIC23) chip on the DSP starter kit (DSK) for real-time experiments. Typical applications using a CODEC include modems, speech systems, audio systems, and industrial controllers. Many standards that specify the nature of the CODEC have evolved for the purposes of switching and transmission. Some CODECs use a logarithmic quantizer, i.e., A-law or μ-law, which must be converted into a linear format for processing. DSP processors implement the required format conversion (compression or expansion) either in hardware, or in software by using a lookup table or calculation.

The most popular commercially available ADCs are successive approximation, dual slope, flash, and sigma-delta. The successive-approximation ADC produces a B-bit output in B clock cycles by comparing the input waveform with the output of an internal DAC. This device uses a successive-approximation register to split the voltage range in half at each step in order to determine where the input signal lies. According to the comparator result, 1 bit will be set or reset each time. This process proceeds from the most significant bit to the least significant bit. The successive-approximation type of ADC is generally accurate and fast at a relatively low cost.
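The bit-by-bit search performed by the successive-approximation register can be modeled with a few lines of C. The sketch below assumes an ideal internal DAC and a unipolar 0 to VREF input range; it is a conceptual model of the algorithm only, not the behavior of any particular converter.

/* Conceptual model of a B-bit successive-approximation ADC.  Starting from the
   most significant bit, each trial bit is kept only if the ideal internal DAC
   output does not exceed the (held) input voltage.                            */
#include <stdio.h>

#define B    12                          /* converter resolution in bits        */
#define VREF 5.0                         /* full-scale reference voltage        */

static unsigned sar_convert(double vin)
{
    unsigned code = 0;
    for (int bit = B - 1; bit >= 0; bit--) {         /* MSB first, B iterations  */
        unsigned trial = code | (1u << bit);         /* set the trial bit        */
        double   vdac  = VREF * trial / (1u << B);   /* ideal internal DAC value */
        if (vdac <= vin)                             /* comparator decision      */
            code = trial;                            /* keep the bit ...         */
    }                                                /* ... otherwise reset it   */
    return code;
}

int main(void)
{
    double vin = 1.2345;                             /* sampled-and-held input   */
    unsigned code = sar_convert(vin);
    printf("vin = %.4f V -> code = %u (%.4f V)\n",
           vin, code, VREF * code / (1u << B));
    return 0;
}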
However, the ability of a successive-approximation ADC to follow changes in the input signal is limited by its internal clock rate, so it may be slow to respond to sudden changes in the input signal.

The dual-slope ADC uses an integrator connected to the input voltage and a reference voltage. The integrator starts at the zero condition and is charged for a limited time. The integrator is then switched to a known negative reference voltage and charged in the opposite direction until it reaches zero volts again. Simultaneously, a digital counter records the clock cycles. The number of counts required for the integrator output voltage to return to zero is directly proportional to the input voltage. This technique is very precise and can produce ADCs with high resolution. Since the same integrator is used for the input and reference voltages, any small variations in temperature and aging of components have little or no effect on these types of converters. However, they are very slow and generally cost more than successive-approximation ADCs.

A voltage divider made of resistors is used to set the reference voltages at the flash ADC inputs. The major advantage of a flash ADC is its speed of conversion, which is simply the propagation delay of the comparators. Unfortunately, a B-bit ADC requires (2^B − 1) expensive comparators and laser-trimmed resistors. Therefore, commercially available flash ADCs usually have lower resolution.

Sigma-delta ADCs use oversampling and quantization noise shaping to trade quantizer resolution for sampling rate. The block diagram of a sigma-delta ADC is illustrated in Figure 1.7; it uses a 1-bit quantizer with a very high sampling rate. Thus, the requirements on the antialiasing filter are significantly relaxed (i.e., a lower roll-off rate is acceptable). A low-order antialiasing filter requires simple, low-cost analog circuitry and is much easier to build and maintain. In the process of quantization, the resulting noise power is spread evenly over the entire spectrum. The quantization noise beyond the required spectral range can be filtered out using an appropriate digital lowpass filter. As a result, the noise power within the frequency band of interest is lower. In order to match the sampling frequency with the system and increase its resolution, a decimator is used. The advantages of sigma-delta ADCs are high resolution and good noise characteristics at a competitive price, because they use digital filters.

Figure 1.7 A conceptual sigma-delta ADC block diagram (a 1-bit ADC inside a feedback loop with an integrator and a 1-bit DAC, followed by a digital decimator that produces the B-bit output)

Example 1.4: In this book, we use the TMS320VC5510 DSK for real-time experiments. The C5510 DSK uses an AIC23 stereo CODEC for input and output of audio signals. The ADCs and DACs within the AIC23 use multi-bit sigma-delta technology with integrated oversampling digital interpolation filters. It supports data wordlengths of 16, 20, 24, and 32 bits, with sampling rates from 8 to 96 kHz, including the CD standard of 44.1 kHz. Integrated analog features consist of stereo line inputs and a stereo headphone amplifier with analog volume control. Its power management allows selective shutdown of CODEC functions, thus extending battery life in portable applications such as portable audio and video players and digital recorders.

1.3 DSP Hardware

DSP systems are required to perform intensive arithmetic operations such as multiplication and addition.
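The dominant such operation in most DSP algorithms is the multiply-accumulate (MAC): a sum of products, as found in FIR filtering, correlation, and dot products. A generic C version of this inner loop is sketched below for illustration; later chapters discuss how DSP processors are designed to execute this kind of loop efficiently.

/* Generic multiply-accumulate (MAC) kernel: y = sum of b[k] * x[k].  This sum
   of products is the inner loop of FIR filtering, correlation, and many other
   DSP algorithms, and is the operation DSP hardware is optimized to perform.  */
#include <stdio.h>

#define NTAPS 4

int main(void)
{
    double b[NTAPS] = { 0.25, 0.25, 0.25, 0.25 };   /* example coefficients       */
    double x[NTAPS] = { 1.0, 2.0, 3.0, 4.0 };       /* most recent input samples  */
    double y = 0.0;

    for (int k = 0; k < NTAPS; k++)
        y += b[k] * x[k];                           /* one MAC per coefficient    */

    printf("y = %f\n", y);                          /* prints 2.5 (the average)   */
    return 0;
}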
These tasks may be implemented on microprocessors, microcontrollers, digital signal processors, or custom integrated circuits. The selection of appropriate hardware is determined by the applications, cost, or a combination of both. This section introduces different digital hardware implementations for DSP applications. 1.3.1 DSP Hardware Options As shown in Figure 1.1, the processing of the digital signal x(n) is performed using the DSP hardware. Although it is possible to implement DSP algorithms on any digital computer, the real applications determine the optimum hardware platform. Five hardware platforms are widely used for DSP systems: 1. special-purpose (custom) chips such as application-specific integrated circuits (ASIC); 2. field-programmable gate arrays (FPGA); 3. general-purpose microprocessors or microcontrollers (μP/μC); 4. general-purpose digital signal processors (DSP processors); and 5. DSP processors with application-specific hardware (HW) accelerators. The hardware characteristics of these options are summarized in Table 1.1.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 DSP HARDWARE 11 Table 1.1 Summary of DSP hardware implementations DSP processors with ASIC FPGA μP/μC DSP processor HW accelerators Flexibility None Limited High High Medium Design time Long Medium Short Short Short Power consumption Low LowÐmedium MediumÐhigh LowÐmedium LowÐmedium Performance High High LowÐmedium MediumÐhigh High Development cost High Medium Low Low Low Production cost Low LowÐmedium MediumÐhigh LowÐmedium Medium ASIC devices are usually designed for specific tasks that require a lot of computations such as digital subscriber loop (DSL) modems, or high-volume products that use mature algorithms such as fast Fourier transform and ReedÐSolomon codes. These devices are able to perform their limited functions much faster than general-purpose processors because of their dedicated architecture. These application-specific products enable the use of high-speed functions optimized in hardware, but they are deficient in the programmability to modify the algorithms and functions. They are suitable for implementing well- defined and well-tested DSP algorithms for high-volume products, or applications demanding extremely high speeds that can be achieved only by ASICs. Recently, the availability of core modules for some common DSP functions has simplified the ASIC design tasks, but the cost of prototyping an ASIC device, a longer design cycle, and the lack of standard development tools support and reprogramming flexibility sometimes outweigh their benefits. FPGAs have been used in DSP applications for years as glue logics, bus bridges, and peripherals for re- ducing system costs and affording a higher level of system integration. Recently, FPGAs have been gaining considerable attention in high-performance DSP applications, and are emerging as coprocessors for stan- dard DSP processors that need specific accelerators. In these cases, FPGAs work in conjunction with DSP processors for integrating pre- and postprocessing functions. FPGAs provide tremendous computational power by using highly parallel architectures for very high performance. These devices are hardware re- configurable, thus allowing the system designer to optimize the hardware architectures for implementing algorithms that require higher performance and lower production cost. 
In addition, the designer can implement high-performance complex DSP functions in a small fraction of the total device, and use the rest to implement system logic or interface functions, resulting in both lower costs and higher system integration.

Example 1.5: There are four major FPGA families that are targeted for DSP systems: Cyclone and Stratix from Altera, and Virtex and Spartan from Xilinx. The Xilinx Spartan-3 FPGA family (introduced in 2003) uses a 90-nm manufacturing process to achieve low silicon die costs. To support DSP functions in an area-efficient manner, Spartan-3 includes embedded 18 × 18 multipliers, distributed RAM for local storage of DSP coefficients, 16-bit shift registers for capturing high-speed data, and large block RAM for buffers. The current Spartan-3 family includes the XC3S50, S200, S400, S1000, and S1500 devices. With the aid of Xilinx System Generator for DSP, a tool used to port MATLAB Simulink models to Xilinx hardware models, a system designer can model, simulate, and verify DSP algorithms on the target hardware under the Simulink environment.

Figure 1.8 Different memory architectures: (a) Harvard architecture; (b) von Neumann architecture

General-purpose μP/μC devices have become faster and increasingly able to handle some DSP applications. Many electronic products are currently designed using these processors. For example, automotive controllers use microcontrollers for engine, brake, and suspension control. If a DSP application is added to an existing product that already contains a μP/μC, it is desirable to add the new functions in software without requiring an additional DSP processor. For example, Intel has adopted a native signal processing initiative that uses the host processor in computers to perform audio coding and decoding, sound synthesis, and so on. Software development tools for μP/μC devices are generally more sophisticated and powerful than those available for DSP processors, thus easing development for some applications that are less demanding on the performance and power consumption of processors.

General architectures of μP/μC devices fall into two categories: Harvard architecture and von Neumann architecture. As illustrated in Figure 1.8(a), the Harvard architecture has separate memory spaces for the program and the data, so that both memories can be accessed simultaneously. The von Neumann architecture assumes that there is no intrinsic difference between the instructions and the data, as illustrated in Figure 1.8(b). Operations such as add, move, and subtract are easy to perform on μPs/μCs. However, complex operations such as multiplication and division are slow since they need a series of shift, addition, or subtraction operations. These devices do not have the architecture or the on-chip facilities required for efficient DSP operations. Their real-time DSP performance does not compare well with even the cheaper general-purpose DSP processors, and they would not be a cost-effective or power-efficient solution for many DSP applications.
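The cost of multiplication on a processor without a hardware multiplier can be seen from the classic shift-and-add method sketched below in C. A 16 × 16-bit product takes on the order of 16 conditional add and shift steps, whereas a DSP processor (or a μP with a hardware multiplier) completes it in a single cycle. This is a generic illustration, not code for any particular microcontroller.

/* 16-bit x 16-bit unsigned multiply by repeated shift and add,
 * emulating what a processor without a hardware multiplier must do. */
unsigned long shift_add_multiply(unsigned short a, unsigned short b)
{
    unsigned long product = 0;
    unsigned long addend  = a;      /* multiplicand, shifted left each step */
    int i;
    for (i = 0; i < 16; i++)        /* one iteration per multiplier bit     */
    {
        if (b & 1u)                 /* test the current multiplier bit      */
            product += addend;      /* conditional add                      */
        addend <<= 1;               /* shift the multiplicand               */
        b >>= 1;                    /* move to the next bit                 */
    }
    return product;                 /* roughly 16 add/shift steps vs. one MAC cycle */
}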
Example 1.6: Microcontrollers such as Intel 8081 and Freescale 68HC11 are typically used in in- dustrial process control applications, in which I/O capability (serial/parallel interfaces, timers, and interrupts) and control are more important than speed of performing functions such as multiplica- tion and addition. Microprocessors such as Pentium, PowerPC, and ARM are basically single-chip processors that require additional circuitry to improve the computation capability. Microprocessor instruction sets can be either complex instruction set computer (CISC) such as Pentium or reduced instruction set computer (RISC) such as ARM. The CISC processor includes instructions for basic processor operations, plus some highly sophisticated instructions for specific functions. The RISC processor uses hardwired simpler instructions such as LOAD and STORE to execute in a single clock cycle.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 DSP HARDWARE 13 It is important to note that some microprocessors such as Pentium add multimedia exten- sion (MMX) and streaming single-instruction, multiple-data (SIMD) extension to support DSP operations. They can run in high speed (>3 GHz), provide single-cycle multiplication and arith- metic operations, have good memory bandwidth, and have many supporting tools and software available for easing development. A DSP processor is basically a microprocessor optimized for processing repetitive numerically inten- sive operations at high rates. DSP processors with architectures and instruction sets specifically designed for DSP applications are manufactured by Texas Instruments, Freescale, Agere, Analog Devices, and many others. The rapid growth and the exploitation of DSP technology is not a surprise, considering the commercial advantages in terms of the fast, flexible, low power consumption, and potentially low-cost design capabilities offered by these devices. In comparison to ASIC and FPGA solutions, DSP processors have advantages in easing development and being reprogrammable in the field to allow a product feature upgrade or bug fix. They are often more cost-effective than custom hardware such as ASIC and FPGA, especially for low-volume applications. In comparison to the general-purpose μP/μC, DSP processors have better speed, better energy efficiency, and lower cost. In many practical applications, designers are facing challenges of implementing complex algorithms that require more processing power than the DSP processors in use are capable of providing. For exam- ple, multimedia on wireless and portable devices requires efficient multimedia compression algorithms. The study of most prevalent imaging coding/decoding algorithms shows some DSP functions used for multimedia compression algorithms that account for approximately 80 % of the processing load. These common functions are discrete cosine transform (DCT), inverse DCT, pixel interpolation, motion es- timation, and quantization, etc. The hardware extension or accelerator lets the DSP processor achieve high-bandwidth performance for applications such as streaming video and interactive gaming on a sin- gle device. The TMS320C5510 DSP used by this book consists of the hardware extensions that are specifically designed to support multimedia applications. In addition, Altera has also added the hardware accelerator into its FPGA as coprocessors to enhance the DSP processing abilities. 
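As an illustration of the kind of kernel that these hardware accelerators target, the following C function is a direct (unoptimized) 8-point DCT-II, the transform at the heart of many image and video coders. A real codec would use a fast factorization, fixed-point arithmetic, or a hardware extension rather than this O(N^2) floating-point reference form; the function name and scaling convention are ours, chosen only for illustration.

#include <math.h>

#define N 8

/* Direct-form 8-point DCT-II (reference implementation only).
 * x[] is the input block and X[] receives the transform coefficients. */
void dct8(const double x[N], double X[N])
{
    const double pi = 3.14159265358979323846;
    int k, n;
    for (k = 0; k < N; k++)
    {
        double sum = 0.0;
        for (n = 0; n < N; n++)
            sum += x[n] * cos(pi * (2.0 * n + 1.0) * k / (2.0 * N));
        /* orthonormal scaling: sqrt(1/N) for k = 0, sqrt(2/N) otherwise */
        X[k] = sum * ((k == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N));
    }
}

An 8 × 8 image block requires 16 such one-dimensional transforms (one per row and one per column), which is why offloading the DCT to dedicated hardware relieves the processor of a large share of the load.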
Today, DSP processors have become the foundation of many new markets beyond the traditional signal processing areas for technologies and innovations in motor and motion control, automotive systems, home appliances, consumer electronics, and vast range of communication systems and devices. These general- purpose-programmable DSP processors are supported by integrated software development tools that include C compilers, assemblers, optimizers, linkers, debuggers, simulators, and emulators. In this book, we use Texas Instruments’ TMS320C55x for hands-on experiments. This high-performance and ultralow power consumption DSP processor will be introduced in Chapter 2. In the following section, we will briefly introduce some widely used DSP processors. 1.3.2 DSP Processors In 1979, Intel introduced the 2920, a 25-bit integer processor with a 400 ns instruction cycle and a 25-bit arithmetic-logic unit (ALU) for DSP applications. In 1982, Texas Instruments introduced the TMS32010, a 16-bit fixed-point processor with a 16 × 16 hardware multiplier and a 32-bit ALU and accumulator. This first commercially successful DSP processor was followed by the development of faster products and floating-point processors. The performance and price range among DSP processors vary widely. Today, dozens of DSP processor families are commercially available. Table 1.2 summarizes some of the most popular DSP processors. In the low-end and low-cost group are Texas Instruments’ TMS320C2000 (C24x and C28x) family, Analog Devices’ ADSP-218x family, and Freescale’s DSP568xx family. These conventional DSP pro- cessors include hardware multiplier and shifters, execute one instruction per clock cycle, and use the complex instructions that perform multiple operations such as multiply, accumulate, and update addressJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 14 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Table 1.2 Current commercially available DSP processors Vendor Family Arithmetic type Clock speed TMS320C24x Fixed-point 40 MHz TMS320C28x Fixed-point 150 MHz TMS320C54x Fixed-point 160 MHz Texas instruments TMS320C55x Fixed-point 300 MHz TMS320C62x Fixed-point 300 MHz TMS320C64x Fixed-point 1 GHz TMS320C67x Floating-point 300 MHz ADSP-218x Fixed-point 80 MHz ADSP-219x Fixed-point 160 MHz Analog devices ADSP-2126x Floating-point 200 MHz ADSP-2136x Floating-point 333 MHz ADSP-BF5xx Fixed-point 750 MHz ADSP-TS20x Fixed/Floating 600 MHz DSP56300 Fixed, 24-bit 275 MHz DSP568xx Fixed-point 40 MHz Freescale DSP5685x Fixed-point 120 MHz MSC71xx Fixed-point 200 MHz MSC81xx Fixed-point 400 MHz Agere DSP1641x Fixed-point 285 MHz Source: Adapted from [11] pointers. They provide good performance with modest power consumption and memory usage, thus are widely used in automotives, appliances, hard disk drives, modems, and consumer electronics. For exam- ple, the TMS320C2000 and DSP568xx families are optimized for control applications, such as motor and automobile control, by integrating many microcontroller features and peripherals on the chip. The midrange processor group includes Texas Instruments’ TMS320C5000 (C54x and C55x), Analog Devices’ ADSP219x and ADSP-BF5xx, and Freescale’s DSP563xx. These enhanced processors achieve higher performance through a combination of increased clock rates and more advanced architectures. 
These families often include deeper pipelines, instruction cache, complex instruction words, multiple data buses (to access several data words per clock cycle), additional hardware accelerators, and parallel execution units to allow more operations to be executed in parallel. For example, the TMS320C55x has two multiplyÐaccumulate (MAC) units. These midrange processors provide better performance with lower power consumption, thus are typically used in portable applications such as cellular phones and wireless devices, digital cameras, audio and video players, and digital hearing aids. These conventional and enhanced DSP processors have the following features for common DSP algorithms such as filtering: r Fast MAC units Ð The multiplyÐadd or multiplyÐaccumulate operation is required in most DSP functions including filtering, fast Fourier transform, and correlation. To perform the MAC operation efficiently, DSP processors integrate the multiplier and accumulator into the same data path to complete the MAC operation in single instruction cycle. r Multiple memory accesses Ð Most DSP processors adopted modified Harvard architectures that keep the program memory and data memory separate to allow simultaneous fetching of instruction and data. In order to support simultaneous access of multiple data words, the DSP processors provide multiple on-chip buses, independent memory banks, and on-chip dual-access data memory.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 DSP HARDWARE 15 r Special addressing modes Ð DSP processors often incorporate dedicated data-address generation units for generating data addresses in parallel with the execution of instruction. These units usually support circular addressing and bit-reversed addressing for some specific algorithms. r Special program control Ð Most DSP processors provide zero-overhead looping, which allows the programmer to implement a loop without extra clock cycles for updating and testing loop counters, or branching back to the top of loop. r Optimize instruction set Ð DSP processors provide special instructions that support the computa- tional intensive DSP algorithms. For example, the TMS320C5000 processors support compare-select instructions for fast Viterbi decoding, which will be discussed in Chapter 14. r Effective peripheral interface Ð DSP processors usually incorporate high-performance serial and parallel input/output (I/O) interfaces to other devices such as ADC and DAC. They provide streamlined I/O handling mechanisms such as buffered serial ports, direct memory access (DMA) controllers, and low-overhead interrupt to transfer data with little or no intervention from the processor’s computational units. These DSP processors use specialized hardware and complex instructions for allowing more operations to be executed in every instruction cycle. However, they are difficult to program in assembly language and also difficult to design efficient C compilers in terms of speed and memory usage for supporting these complex-instruction architectures. With the goals of achieving high performance and creating architecture that supports efficient C compilers, some DSP processors, such as the TMS320C6000 (C62x, C64x, and C67x), use very simple instructions. These processors achieve a high level of parallelism by issuing and executing multiple simple instructions in parallel at higher clock rates. 
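The special addressing support listed above can be appreciated by looking at what circular buffering costs in plain C. The sketch below maintains a delay line of the N most recent samples with an explicit wrap-around test; on a DSP processor, the dedicated address-generation unit performs this modulo update in parallel with the MAC, at no extra cycle cost. The buffer and function names are ours, and the fragment is a generic illustration rather than C55x-specific code.

#define N 64                         /* delay-line length                 */

static short delay_line[N];          /* circular buffer of recent samples */
static int   write_idx = 0;          /* current write position            */

/* Insert a new sample into the circular delay line.  The explicit
 * wrap-around below is what circular-addressing hardware performs
 * automatically, in parallel with the arithmetic unit.             */
void put_sample(short x)
{
    delay_line[write_idx] = x;
    write_idx++;
    if (write_idx >= N)              /* software modulo update */
        write_idx = 0;
}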
As an example of this simple-instruction approach, the TMS320C6000 uses a very long instruction word (VLIW) architecture that provides eight execution units to execute four to eight instructions per clock cycle. These instructions have few restrictions on register usage and addressing modes, thus improving the efficiency of C compilers. However, the disadvantage of using simple instructions is that the VLIW processors need more instructions to perform a given task, thus requiring relatively high program memory usage and power consumption. These high-performance DSP processors are typically used in high-end video and radar systems, communication infrastructures, wireless base stations, and high-quality real-time video encoding systems.

1.3.3 Fixed- and Floating-Point Processors

A basic distinction between DSP processors is the arithmetic format: fixed point or floating point. This is the most important factor for system designers in determining the suitability of a DSP processor for a chosen application. The fixed-point representation of signals and arithmetic will be discussed in Chapter 3. Fixed-point DSP processors are either 16-bit or 24-bit devices, while floating-point processors are usually 32-bit devices. A typical 16-bit fixed-point processor, such as the TMS320C55x, stores numbers in a 16-bit integer or fraction format in a fixed range. Although coefficients and signals are only stored with 16-bit precision, intermediate values (products) may be kept at 32-bit precision within the internal 40-bit accumulators in order to reduce cumulative rounding errors. Fixed-point DSP devices are usually cheaper and faster than their floating-point counterparts because they use less silicon, have lower power consumption, and require fewer external pins. Most high-volume, low-cost embedded applications, such as appliance control, cellular phones, hard disk drives, modems, audio players, and digital cameras, use fixed-point processors.

Floating-point arithmetic greatly expands the dynamic range of numbers. A typical 32-bit floating-point DSP processor, such as the TMS320C67x, represents numbers with a 24-bit mantissa and an 8-bit exponent. The mantissa represents a fraction in the range −1.0 to +1.0, while the exponent is an integer that represents the number of places that the binary point must be shifted left or right in order to obtain the true value. A 32-bit floating-point format covers a large dynamic range, thus the data dynamic range restrictions may be virtually ignored in a design using floating-point DSP processors. This is in contrast to fixed-point designs, where the designer has to apply scaling factors and other techniques to prevent arithmetic overflow, which are very difficult and time-consuming processes. As a result, floating-point DSP processors are generally easy to program and use, but are usually more expensive and have higher power consumption.

Example 1.7: The precision and dynamic range of commonly used 16-bit fixed-point processors are summarized in the following table:

Format               Precision    Dynamic range
Unsigned integer     1            0 ≤ x ≤ 65 535
Signed integer       1            −32 768 ≤ x ≤ 32 767
Unsigned fraction    2^−16        0 ≤ x ≤ (1 − 2^−16)
Signed fraction      2^−15        −1 ≤ x ≤ (1 − 2^−15)

The precision of 32-bit floating-point DSP processors is 2^−23 since there are 24 mantissa bits. The dynamic range is 1.18 × 10^−38 ≤ x ≤ 3.4 × 10^38. System designers have to determine the dynamic range and precision needed for the applications.
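The fixed-point arithmetic just described can be sketched in portable C. The fragment below computes an FIR filter sum on Q15 (signed fractional) data, keeping the products in a wider accumulator, much as the 40-bit accumulators of a 16-bit DSP processor do, and then rounding and saturating the result back to 16 bits. It is an illustration of the idea under our own naming, not the TMS320C55x implementation, which would use the MAC hardware through intrinsics or assembly code discussed in later chapters.

/* FIR filter sum in Q15 fixed-point arithmetic (illustrative C model).
 * x[] and h[] hold Q15 fractions (real values scaled by 2^15).
 * The 32-bit accumulator mimics keeping products at full precision;
 * a real 40-bit accumulator also has guard bits to absorb growth
 * over many taps, which this simple model ignores.                   */
short fir_q15(const short *x, const short *h, int ntaps)
{
    long acc = 0;                           /* wide accumulator            */
    int  i;
    for (i = 0; i < ntaps; i++)
        acc += (long)x[i] * (long)h[i];     /* Q15 x Q15 -> Q30 product    */

    acc += 1L << 14;                        /* round before the shift      */
    acc >>= 15;                             /* convert Q30 back to Q15     */

    if (acc > 32767L)  acc = 32767L;        /* saturate to the 16-bit range, */
    if (acc < -32768L) acc = -32768L;       /* as the DSP hardware does      */
    return (short)acc;
}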
Floating-point processors may be needed in applications where coefficients vary in time, signals and coefficients require a large dynamic range and high precision, or where large memory structures are required, such as in image processing. Floating-point DSP processors also allow for the efficient use of high-level C compilers, thus reducing the cost of development and maintenance. The faster development cycle for a floating-point processor may easily outweigh the extra cost of the DSP processor itself. Therefore, floating-point processors can also be justified for applications where development costs are high and production volumes are low.

1.3.4 Real-Time Constraints

A limitation of DSP systems for real-time applications is that the bandwidth of the system is limited by the sampling rate. The processing speed determines the maximum rate at which the analog signal can be sampled. For example, with sample-by-sample processing, one output sample is generated when one input sample is presented to the system. Therefore, the delay between the input and the output for sample-by-sample processing is at most one sample interval (Ts). A real-time DSP system demands that the signal processing time, tp, must be less than the sampling period, T, in order to complete the processing task before the new sample comes in. That is,

tp + to < T,     (1.6)

where to is the overhead of I/O operations. This hard real-time constraint limits the highest frequency signal that can be processed by DSP systems using the sample-by-sample processing approach. This limit on the real-time bandwidth fM is given as

fM ≤ fs/2 < 1/[2(tp + to)].     (1.7)

It is clear that the longer the processing time tp, the lower the signal bandwidth that can be handled by a given processor. For example, if tp + to = 25 μs per sample, the sampling period must be longer than 25 μs, so the sampling rate is below 40 kHz and the real-time bandwidth is limited to less than 20 kHz. Although new and faster DSP processors have continuously been introduced, there is still a limit to the processing that can be done in real time. This limit becomes even more apparent when system cost is taken into consideration. Generally, the real-time bandwidth can be increased by using faster DSP processors, simplified DSP algorithms, optimized DSP programs, and parallel processing using multiple DSP processors, etc. However, there is still a trade-off between system cost and performance.

Equation (1.7) also shows that the real-time bandwidth can be increased by reducing the overhead of I/O operations. This can be achieved by using a block-by-block processing approach. With block processing methods, the I/O operations are usually handled by a DMA controller, which places data samples in a memory buffer. The DMA controller interrupts the processor when the input buffer is full, and a block of signal samples will be processed at a time. For example, for a real-time N-point fast Fourier transform (which will be discussed in Chapter 6), the N input samples have to be buffered by the DMA controller. The block of N samples is processed after the buffer is full. The block computation must be completed before the next block of N samples arrives. Therefore, the delay between input and output in block processing depends on the block size N, and this may cause a problem for some applications.

1.4 DSP System Design

A generalized DSP system design process is illustrated in Figure 1.9.
For a given application, the theoret- ical aspects of DSP system specifications such as system requirements, signal analysis, resource analysis, and configuration analysis are first performed to define system requirements. H A R D W A R E S O F T W A R E System requirements specifications Algorithm development and simulation Select DSP processor Software architecture Coding and debugging Hardware schematic System integration and debug System testing and release Application Hardware prototype Figure 1.9 Simplified DSP system design flowJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 18 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING 1.4.1 Algorithm Development DSP systems are often characterized by the embedded algorithm, which specifies the arithmetic operations to be performed. The algorithm for a given application is initially described using difference equations or signal-flow block diagrams with symbolic names for the inputs and outputs. In documenting an algorithm, it is sometimes helpful to further clarify which inputs and outputs are involved by means of a data-flow diagram. The next stage of the development process is to provide more details on the sequence of operations that must be performed in order to derive the output. There are two methods of characterizing the sequence of operations in a program: flowcharts or structured descriptions. At the algorithm development stage, we most likely work with high-level language DSP tools (such as MATLAB, Simulink, or C/C++) that are capable of algorithmic-level system simulations. We then implement the algorithm using software, hardware, or both, depending on specific needs. A DSP algorithm can be simulated using a general-purpose computer so that its performance can be tested and analyzed. A block diagram of general-purpose computer implementation is illustrated in Figure 1.10. The test signals may be internally generated by signal generators or digitized from a real environment based on the given application or received from other computers via the networks. The simulation program uses the signal samples stored in data file(s) as input(s) to produce output signals that will be saved in data file(s) for further analysis. Advantages of developing DSP algorithms using a general-purpose computer are: 1. Using high-level languages such as MATLAB,Simulink, C/C++, or other DSP software packages on computers can significantly save algorithm development time. In addition, the prototype C programs used for algorithm evaluation can be ported to different DSP hardware platforms. 2. It is easy to debug and modify high-level language programs on computers using integrated software development tools. 3. Input/output operations based on disk files are simple to implement and the behaviors of the system are easy to analyze. 4. Floating-point data format and arithmetic can be used for computer simulations, thus easing devel- opment. 5. We can easily obtain bit-true simulations of the developed algorithms using MATLAB or Simulink for fixed-point DSP implementation. Analysis MATLAB or C/C++ ADC Other computers DAC Other computers Signal generators DSP algorithms DSP software Data files Data files Figure 1.10 DSP software developments using a general-purpose computerJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 DSP SYSTEM DESIGN 19 1.4.2 Selection of DSP Processors As discussed earlier, DSP processors are used in a wide range of applications from high-performance radar systems to low-cost consumer electronics. 
As shown in Table 1.2, semiconductor vendors have responded to this demand by producing a variety of DSP processors. DSP system designers require a full understanding of the application requirements in order to select the right DSP processor for a given application. The objective is to choose the processor that meets the project’s requirements with the most cost-effective solution. Some decisions can be made at an early stage based on arith- metic format, performance, price, power consumption, ease of development, and integration, etc. In real-time DSP applications, the efficiency of data flow into and out of the processor is also criti- cal. However, these criteria will probably still leave a number of candidate processors for further analysis. Example 1.8: There are a number of ways to measure a processor’s execution speed. They include: r MIPS Ð millions of instructions per second; r MOPS Ð millions of operations per second; r MFLOPS Ð millions of floating-point operations per second; r MHz Ð clock rate; and r MMACS Ð millions of multiplyÐaccumulate operations. In addition, there are other metrics such as milliwatts for measuring power consumption, MIPS per mw, or MIPS per dollar. These numbers provide only the sketchiest indication about perfor- mance, power, and price for a given application. They cannot predict exactly how the processor will measure up in the target system. For high-volume applications, processor cost and product manufacture integration are important fac- tors. For portable, battery-powered products such as cellular phones, digital cameras, and personal mul- timedia players, power consumption is more critical. For low- to medium-volume applications, there will be trade-offs among development time, cost of development tools, and the cost of the DSP processor itself. The likelihood of having higher performance processors with upward-compatible software in the future is also an important factor. For high-performance, low-volume applications such as communica- tion infrastructures and wireless base stations, the performance, ease of development, and multiprocessor configurations are paramount. Example 1.9: A number of DSP applications along with the relative importance for performance, price, and power consumption are listed in Table 1.3. This table shows that the designer of a handheld device has extreme concerns about power efficiency, but the main criterion of DSP selection for the communications infrastructures is its performance. When processing speed is at a premium, the only valid comparison between processors is on an algorithm-implementation basis. Optimum code must be written for all candidates and then the execution time must be compared. Other important factors are memory usage and on-chip peripheral devices, such as on-chip converters and I/O interfaces.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 20 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Table 1.3 Some DSP applications with the relative importance rating Application Performance Price Power consumption Audio receiver 1 2 3 DSP hearing aid 2 3 1 MP3 player 3 1 2 Portable video recorder 2 1 3 Desktop computer 1 2 3 Notebook computer 3 2 1 Cell phone handset 3 1 2 Cellular base station 1 2 3 Source: Adapted from [12] Note: Rating Ð 1Ð3, with 1 being the most important In addition, a full set of development tools and supports are important for DSP processor selection, including: r Software development tools such as C compilers, assemblers, linkers, debuggers, and simulators. 
r Commercially available DSP boards for software development and testing before the target DSP hardware is available. r Hardware testing tools such as in-circuit emulators and logic analyzers. r Development assistance such as application notes, DSP function libraries, application libraries, data books, and low-cost prototyping, etc. 1.4.3 Software Development The four common measures of good DSP software are reliability, maintainability, extensibility, and efficiency. A reliable program is one that seldom (or never) fails. Since most programs will occasionally fail, a maintainable program is one that is easily correctable. A truly maintainable program is one that can be fixed by someone other than the original programmers. In order for a program to be truly maintainable, it must be portable on more than one type of hardware. An extensible program is one that can be easily modified when the requirements change. A program is usually tested in a finite number of ways much smaller than the number of input data conditions. This means that a program can be considered reliable only after years of bug-free use in many different environments. A good DSP program often contains many small functions with only one purpose, which can be easily reused by other programs for different purposes. Programming tricks should be avoided at all costs, as they will often not be reliable and will almost always be difficult for someone else to understand even with lots of comments. In addition, the use of variable names should be meaningful in the context of the program. As shown in Figure 1.9, the hardware and software design can be conducted at the same time for a given DSP application. Since there are a lot of interdependent factors between hardware and software, an ideal DSP designer will be a true ‘system’ engineer, capable of understanding issues with both hardware and software. The cost of hardware has gone down dramatically in recent years, thus the majority of the cost of a DSP solution now resides in software. The software life cycle involves the completion of a software project: the project definition, the detailed specification, coding and modular testing, integration, system testing, and maintenance. SoftwareJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 DSP SYSTEM DESIGN 21 maintenance is a significant part of the cost for a DSP system. Maintenance includes enhancing the software functions, fixing errors identified as the software is used, and modifying the software to work with new hardware and software. It is essential to document programs thoroughly with titles and comment statements because this greatly simplifies the task of software maintenance. As discussed earlier, good programming techniques play an essential role in successful DSP ap- plications. A structured and well-documented approach to programming should be initiated from the beginning. It is important to develop an overall specification for signal processing tasks prior to writing any program. The specification includes the basic algorithm and task description, memory requirements, constraints on the program size, execution time, and so on. A thoroughly reviewed specification can catch mistakes even before code has been written and prevent potential code changes at the system integration stage. A flow diagram would be a very helpful design tool to adopt at this stage. Writing and testing DSP code is a highly interactive process. 
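As a small example of the 'one function, one purpose' style recommended above, a reusable block-processing routine might look like the sketch below: short, documented, and free of hidden side effects, so it can be moved between projects unchanged. The function and argument names are ours and purely illustrative.

/* apply_gain_q15() - scale a block of Q15 samples by a Q15 gain factor.
 * in   : input sample buffer
 * out  : output sample buffer (may be the same as in for in-place use)
 * gain : Q15 gain, where 0x7FFF represents approximately +1.0
 * n    : number of samples to process                                   */
void apply_gain_q15(const short *in, short *out, short gain, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (short)(((long)in[i] * gain) >> 15);
}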
With the use of integrated software de- velopment tools that include simulators or evaluation boards, code may be tested regularly as it is written. Writing code in modules or sections can help this process, as each module can be tested individually, thus increasing the chance of the entire system working at the system integration stage. There are two commonly used methods in developing software for DSP devices: using assembly program or C/C++ program. Assembly language is similar to the machine code actually used by the processor. Programming in assembly language gives the engineers full control of processor functions and resources, thus resulting in the most efficient program for mapping the algorithm by hand. However, this is a very time-consuming and laborious task, especially for today’s highly paralleled DSP architectures. A C program, on the other hand, is easier for software development, upgrade, and maintenance. However, the machine code generated by a C compiler is inefficient in both processing speed and memory usage. Recently, DSP manufacturers have improved C compiler efficiency dramatically, especially with the DSP processors that use simple instructions and general register files. Often the ideal solution is to work with a mixture of C and assembly code. The overall program is controlled and written by C code, but the run-time critical inner loops and modules are written in assembly language. In a mixed programming environment, an assembly routine may be called as a function or intrinsics, or in-line coded into the C program. A library of hand-optimized functions may be built up and brought into the code when required. The assembly programming for the TMS320C55x will be discussed in Chapter 2. 1.4.4 High-Level Software Development Tools Software tools are computer programs that have been written to perform specific operations. Most DSP operations can be categorized as being either analysis tasks or filtering tasks. Signal analysis deals with the measurement of signal properties. MATLAB is a powerful environment for signal analysis and visualization, which are critical components in understanding and developing a DSP system. C programming is an efficient tool for performing signal processing and is portable over different DSP platforms. MATLAB is an interactive, technical computing environment for scientific and engineering numerical analysis, computation, and visualization. Its strength lies in the fact that complex numerical problems can be solved easily in a fraction of the time required with a programming language such as C. By using its relatively simple programming capability, MATLAB can be easily extended to create new functions, and is further enhanced by numerous toolboxes such as the Signal Processing Toolbox and Filter Design Toolbox. In addition, MATLAB provides many graphical user interface (GUI) tools such as Filter Design and Analysis Tool (FDATool). The purpose of a programming language is to solve a problem involving the manipulation of informa- tion. The purpose of a DSP program is to manipulate signals to solve a specific signal processing problem. High-level languages such as C and C++ are computer languages that have English-like commands andJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 22 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING C program (Source) Machine code (Object) Linker/loader Execution Program output Libraries Data C compiler Figure 1.11 Program compilation, linking, and execution flow instructions. 
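Returning briefly to the mixed C-and-assembly approach of Section 1.4.3, the C side of such an interface might look like the sketch below. The routine name and argument list are invented for illustration; the actual TMS320C55x calling conventions (including the leading underscore the compiler typically adds to C symbol names in assembly) are covered in Chapter 2.

/* Prototype for a time-critical routine implemented in hand-optimized
 * assembly.  The name and arguments here are hypothetical.             */
extern long dot_product_asm(const short *x, const short *y, int n);

/* The surrounding control code stays in easy-to-maintain C and simply
 * calls the assembly routine like any other function.                  */
long run_filter_block(const short *input, const short *coeffs, int ntaps)
{
    return dot_product_asm(input, coeffs, ntaps);
}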
High-level language programs are usually portable, so they can be recompiled and run on many different computers. Although C/C++ is categorized as a high-level language, it can also be written for low-level device drivers. In addition, a C compiler is available for most modern DSP processors such as the TMS320C55x. Thus C programming is the most commonly used high-level language for DSP applications. C has become the language of choice for many DSP software development engineers not only because it has powerful commands and data structures but also because it can easily be ported on different DSP processors and platforms. The processes of compilation, linking/loading, and execution are outlined in Figure 1.11. C compilers are available for a wide range of computers and DSP processors, thus making the C program the most portable software for DSP applications. Many C programming environments include GUI debugger programs, which are useful in identifying errors in a source program. Debugger programs allow us to see values stored in variables at different points in a program, and to step through the program line by line. 1.5 Introduction to DSP Development Tools The manufacturers of DSP processors typically provide a set of software tools for the user to develop efficient DSP software. The basic software development tools include C compiler, assembler, linker, and simulator. In order to execute the designed DSP tasks on the target system, the C or assembly programs must be translated into machine code and then linked together to form an executable code. This code conversion process is carried out using software development tools illustrated in Figure 1.12. The TMS320C55x software development tools include a C compiler, an assembler, a linker, an archiver, a hex conversion utility, a cross-reference utility, and an absolute lister. The C55x C compiler generates assembly source code from the C source files. The assembler translates assembly source files, either hand-coded by DSP programmers or generated by the C compiler, into machine language object files. The assembly tools use the common object file format (COFF) to facilitate modular programming. Using COFF allows the programmer to define the system’s memory map at link time. This maximizes performance by enabling the programmer to link the code and data objects into specific memory locations. The archiver allows users to collect a group of files into a single archived file. The linker combines object files and libraries into a single executable COFF object module. The hex conversion utility converts a COFF object file into a format that can be downloaded to an EPROM programmer or a flash memory program utility. In this section, we will briefly describe the C compiler, assembler, and linker. A full description of these tools can be found in the user’s guides [13, 14]. 1.5.1 C Compiler C language is the most popular high-level tool for evaluating algorithms and developing real-time soft- ware for DSP applications. The C compiler can generate either a mnemonic assembly code or an algebraic assembly code. In this book, we use the mnemonic assembly (ASM) language. The C compiler pack- age includes a shell program, code optimizer, and C-to-ASM interlister. 
The shell program supportsJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 INTRODUCTION TO DSP DEVELOPMENT TOOLS 23 Macro source files C source files C compiler Archiver Archiver Library of object files Hex- converter EPROM programmer Linker COFF executable file COFF object files TMS320C55x Target Absolute lister ×-reference lister Debugger tools Run-time support libraries Library-build utility Macro library Assembly source files Assembler Figure 1.12 TMS320C55x software development flow and tools automatically compiled, assembled, and linked modules. The optimizer improves run-time and code density efficiency of the C source file. The C-to-ASM interlister inserts the original comments in C source code into the compiler’s output assembly code so users can view the corresponding assembly instructions for each C statement generated by the compiler. The C55x C compiler supports American National Standards Institute (ANSI) C and its run-time support library. The run-time support library rts55.lib (or rts55x.lib for large memory model) includes functions to support string operation, memory allocation, data conversion, trigonometry, and exponential manipulations. C language lacks specific features of DSP,especially those fixed-point data operations that are necessary for many DSP algorithms. To improve compiler efficiency for DSP applications, the C55x C compiler supports in-line assembly language for C programs. This allows adding highly efficient assembly code directly into the C program. Intrinsics are another improvement for substituting DSP arithmetic operation with DSP assembly intrinsic operators. We will introduce more compiler features in Chapter 2 and subsequent chapters. 1.5.2 Assembler The assembler translates processor-specific assembly language source files (in ASCII format) into binary COFF object files. Source files can contain assembler directives, macro directives, and instructions.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 24 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Assembler directives are used to control various aspects of the assembly process, such as the source file listing format, data alignment, section content, etc. Binary object files contain separate blocks (called sections) of code or data that can be loaded into memory space. Once the DSP algorithm has been written in assembly, it is necessary to add important assembly directives to the source code. Assembler directives are used to control the assembly process and enter data into the program. Assembly directives can be used to initialize memory, define global variables, set conditional assembly blocks, and reserve memory space for code and data. 1.5.3 Linker The linker combines multiple binary object files and libraries into a single executable program for the target DSP hardware. It resolves external references and performs code relocation to create the executable mod- ule. The C55x linker handles various requirements of different object files and libraries, as well as targets system memory configurations. For a specific hardware configuration, the system designers need to pro- vide the memory mapping specification to the linker. This task can be accomplished by using a linker com- mand file. The visual linker is also a very useful tool that provides a visualized memory usage map directly. The linker commands support expression assignment and evaluation, and provides MEMORY and SECTION directives. Using these directives, we can define the memory model for the given target system. 
We can also combine object file sections, allocate sections into specific memory areas, and define or redefine global symbols at link time. An example linker command file is listed in Table 1.4. The first portion uses the MEMORY directive to identify the range of memory blocks that physically exist in the target hardware. These memory blocks Table 1.4 Example of linker command file used by TMS320C55x /* Specify the system memory map */ MEMORY { RAM (RWIX):o=0x000100, l = 0x00feff /* Data memory */ RAM0 (RWIX):o=0x010000, l = 0x008000 /* Data memory */ RAM1 (RWIX):o=0x018000, l = 0x008000 /* Data memory */ RAM2 (RWIX):o=0x040100, l = 0x040000 /* Program memory */ ROM (RIX) : o = 0x020100, l = 0x020000 /* Program memory */ VECS (RIX) : o = 0xffff00, l = 0x000100 /* Reset vector */ } /* Specify the sections allocation into memory */ SECTIONS { vectors > VECS /* Interrupt vector table */ .text > ROM /* Code */ .switch > RAM /* Switch table info */ .const > RAM /* Constant data */ .cinit > RAM2 /* Initialization tables */ .data > RAM /* Initialized data */ .bss > RAM /* Global & static vars */ .stack > RAM /* Primary system stack */ .sysstack > RAM /* Secondary system stack */ expdata0 > RAM0 /* Global & static vars */ expdata1 > RAM1 /* Global & static vars */ }JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 25 are available for the software to use. Each memory block has its name, starting address, and the length of the block. The address and length are given in bytes for C55x processors and in words for C54x processors. For example, the data memory block called RAM starts at the byte address 0x100, and it has a size of 0xFEFF bytes. Note that the prefix 0x indicates the following number is represented in hexadecimal (hex) form. The SECTIONS directive provides different code section names for the linker to allocate the program and data within each memory block. For example, the program can be loaded into the .text section, and the uninitialized global variables are in the .bss section. The attributes inside the parentheses are optional to set memory access restrictions. These attributes are: R Ð Memory space can be read. W Ð Memory space can be written. X Ð Memory space contains executable code. I Ð Memory space can be initialized. Several additional options used to initialize the memory can be found in [13]. 1.5.4 Other Development Tools Archiver is used to group files into a single archived file, that is, to build a library. The archiver can also be used to modify a library by deleting, replacing, extracting, or adding members. Hex-converter converts a COFF object file into an ASCII hex format file. The converted hex format files are often used to program EPROM and flash memory devices. Absolute lister takes linked object files to create the .abs files. These .abs files can be assembled together to produce a listing file that contains absolute addresses of the entire system program. Cross-reference lister takes all the object files to produce a cross-reference listing file. The cross-reference listing file includes symbols, definitions, and references in the linked source files. The DSP development tools also include simulator, EVM, XDS, and DSK. A simulator is the soft- ware simulation tool that does not require any hardware support. The simulator can be used for code development and testing. 
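Before turning to the hardware tools, it is worth noting how C source code ties into the section names defined in the linker command file of Table 1.4. The TI compiler provides a DATA_SECTION pragma for this purpose; the sketch below shows its general use, but the buffer names are ours and the exact pragma syntax should be confirmed in the compiler user's guide [13].

/* Place large experiment buffers into the named sections expdata0 and
 * expdata1, which the SECTIONS directive in Table 1.4 maps onto the
 * memory blocks RAM0 and RAM1.                                         */
#pragma DATA_SECTION(inBuffer,  "expdata0")
#pragma DATA_SECTION(outBuffer, "expdata1")
short inBuffer[128];
short outBuffer[128];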
The EVM is a hardware evaluation module including I/O capabilities to allow developers to evaluate the DSP algorithms for the specific DSP processor in real time. EVM is usually a computer board to be connected with a host computer for evaluating the DSP tasks. The XDS usually includes in-circuit emulation and boundary scan for system development and debug. The XDS is an external stand-alone hardware device connected to a host computer and a DSP board. The DSK is a low-cost development board for the user to develop and evaluate DSP algorithms under a Windows operation system environment. In this book, we will use the Spectrum Digital’s TMS320VC5510 DSK for real-time experiments. The DSK works under the Code Composer Studio (CCS) development environment. The DSK package includes a special version of the CCS [15]. The DSK communicates with CCS via its onboard universal serial bus (USB) JTAG emulator. The C5510 DSK uses a 200 MHz TMS320VC5510 DSP processor, an AIC23 stereo CODEC, 8 Mbytes synchronous DRAM, and 512 Kbytes flash memory. 1.6 Experiments and Program Examples Texas Instruments’ CCS Integrated Development Environment (IDE) is a DSP development tool that allows users to create, edit, build, debug, and analyze DSP programs. For building applications, the CCS provides a project manager to handle the programming project. For debugging purposes, it providesJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 26 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING breakpoints, variable watch windows, memory/register/stack viewing windows, probe points to stream data to and from the target, graphical analysis, execution profile, and the capability to display mixed disassembled and C instructions. Another important feature of the CCS is its ability to create and manage large projects from a GUI environment. In this section, we will use a simple sinewave example to introduce the basic editing features, key IDE components, and the use of the C55x DSP development tools. We also demonstrate simple approaches to software development and the debug process using the TMS320C55x simulator. Finally, we will use the C5510 DSK to demonstrate an audio loop-back example in real time. 1.6.1 Experiments of Using CCS and DSK After installing the DSK or CCS simulator, we can start the CCS IDE. Figure 1.13 shows the CCS running on the DSK. The IDE consists of the standard toolbar, project toolbar, edit toolbar, and debug toolbar. Some basic functions are summarized and listed in Figure 1.13. Table 1.5 briefly describes the files used in this experiment. Procedures of the experiment are listed as follows: 1. Create a project for the CCS: Choose Project→New to create a new project file and save it as useCCS.pjt to the directory ..\experiments\exp1.6.1_CCSandDSK. The CCS uses the project to operate its built-in utilities to create a full-build application. Figure 1.13 CCS IDEJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 27 Table 1.5 File listing for experiment exp1.6.1CCSandDSK Files Description useCCS.c C file for testing experiment useCCS.h C header file useCCS.pjt DSP project file useCCS.cmd DSP linker command file 2. Create C program files using the CCS editor: Choose File→New to create a new file, type in the example C code listed in Tables 1.6 and 1.7. Save C code listed in Table 1.6 as useCCS.c to ..\experiments\exp1.6.1_CCSandDSK\src, and save C code listed in Table 1.7 as useCCS.h to the directory ..\experiments\exp1.6.1_CCSandDSK\inc. 
This example reads precalculated sine values from a data table, negates them, and stores the results in a reversed order to an output buffer. The programs useCCS.c and useCCS.h are included in the companion CD. However, it is recommended that we create them using the editor to become familiar with the CCS editing functions. 3. Create a linker command file for the simulator: Choose File→New to create another new file, and type in the linker command file as listed in Table 1.4. Save this file as useCCS.cmd to the directory ..\experiments\exp1.6.1_CCSandDSK. The command file is used by the linker to map different program segments into a prepartitioned system memory space. 4. Setting up the project: Add useCCS.c and useCCS.cmd to the project by choosing Project→Add Files to Project, then select files useCCS.c and useCCS.cmd. Before build- ing a project, the search paths of the included files and libraries should be setup for C com- piler, assembler, and linker. To setup options for C compiler, assembler, and linker choose Project→Build Options. We need to add search paths to include files and libraries that are not included in the C55x DSP tools directories, such as the libraries and included files we created Table 1.6 Program example, useCCS.c #include "useCCS.h" short outBuffer[BUF_SIZE]; void main() { short i, j; j=0; while (1) { for (i=BUF_SIZE-1; i>= 0;i--) { outBuffer [j++]=0-sineTable[i]; // <- Set breakpoint if (j >= BUF_SIZE) j=0; } j++; } }JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 28 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Table 1.7 Program example header file, useCCS.h #define BUF_SIZE 40 const short sineTable[BUF_SIZE]= {0x0000, 0x000f, 0x001e, 0x002d, 0x003a, 0x0046, 0x0050, 0x0059, 0x005f, 0x0062, 0x0063, 0x0062, 0x005f, 0x0059, 0x0050, 0x0046, 0x003a, 0x002d, 0x001e, 0x000f, 0x0000, 0xfff1, 0xffe2, 0xffd3, 0xffc6, 0xffba, 0xffb0, 0xffa7, 0xffa1, 0xff9e, 0xff9d, 0xff9e, 0xffa1, 0xffa7, 0xffb0, 0xffba, 0xffc6, 0xffd3, 0xffe2, 0xfff1}; in the working directory. Programs written in C language require the use of the run-time support library, either rts55.lib or rts55x.lib, for system initialization. This can be done by selecting the compiler and linker dialog box and entering the C55x run-time support library, rts55.lib, and adding the header file path related to the source file directory. We can also specify different directories to store the output executable file and map file. Figure 1.14 shows an example of how to set the search paths for compiler, assembler, and linker. 5. Build and run the program: Use Project→Rebuild All command to build the project. If there are no errors, the CCS will generate the executable output file, useCCS.out. Be- fore we can run the program, we need to load the executable output file to the C55x DSK or the simulator. To do so, use File→Load Program menu and select the useCCS.out in ..\expriments\exp1.6.1_CCSandDSK\Debug directory and load it. Execute this program by choosing Debug→Run. The processor status at the bottom-left-hand corner of the CCS will change from CPU HALTED to CPU RUNNING. The running process can be stopped by the Debug→Halt command. We can continue the program by reissuing the Run command or exiting the DSK or the simulator by choosing File→Exit menu. (a) Setting the include file searching path. (b) Setting the run-time support library. 
Figure 1.14 Setup search paths for C compiler, assembler, and linker: (a) setting the include file searching path; (b) setting the run-time support libraryJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 29 1.6.2 Debugging Program Using CCS and DSK The CCS IDE has extended traditional DSP code generation tools by integrating a set of editing, emulating, debugging, and analyzing capabilities in one entity. In this section, we will introduce some program building steps and software debugging capabilities of the CCS. The standard toolbar in Figure 1.13 allows users to create and open files, cut, copy, and paste text within and between files. It also has undo and redo capabilities to aid file editing. Finding text can be done within the same file or in different files. The CCS built-in context-sensitive help menu is also located in the standard toolbar menu. More advanced editing features are in the edit toolbar menu, including mark to, mark next, find match, and find next open parenthesis for C programs. The features of out-indent and in-indent can be used to move a selected block of text horizontally. There are four bookmarks that allow users to create, remove, edit, and search bookmarks. The project environment contains C compiler, assembler, and linker. The project toolbar menu (see Figure 1.13) gives users different choices while working on projects. The compile only, incremental build, and build all features allow users to build the DSP projects efficiently. Breakpoints permit users to set software breakpoints in the program and halt the processor whenever the program executes at those breakpoint locations. Probe points are used to transfer data files in and out of the programs. The profiler can be used to measure the execution time of given functions or code segments, which can be used to analyze and identify critical run-time blocks of the programs. The debug toolbar menu illustrated in Figure 1.13 contains several stepping operations: step-into-a- function, step-over-a-function, and step-out-off-a-function. It can also perform the run-to-cursor-position operation, which is a very convenient feature, allowing users to step through the code. The next three hot buttons in the debug toolbar are run, halt, and animate. They allow users to execute, stop, and animate the DSP programs. The watch windows are used to monitor variable contents. CPU registers and data memory viewing windows provide additional information for ease of debugging programs. More custom options are available from the pull-down menus, such as graphing data directly from the processor memory. We often need to check the changing values of variables during program execution for developing and testing programs. This can be accomplished with debugging settings such as breakpoints, step commands, and watch windows, which are illustrated in the following experiment. Procedures of the experiment are listed as follows: 1. Add and remove breakpoints: Start with Project→Open, select useCCS.pjt from the directory ..\experiments\exp1.6.2_CCSandDSK. Build and load the example project useCCS.out. Dou- ble click the C file, useCCS.c, in the project viewing window to open it in the editing window. To add a breakpoint, move the cursor to the line where we want to set a breakpoint. The command to enable a breakpoint can be given either from the Toggle Breakpoint hot button on the project toolbar or by clicking the mouse button on the line of interest. 
The function key is a shortcut that can be used to toggle a breakpoint. Once a breakpoint is enabled, a red dot will appear on the left to indicate where the breakpoint is set. The program will run up to that line without passing it. To remove breakpoints, we can either toggle breakpoints one by one or select the Remove All Breakpoints hot button from the debug toolbar to clear all the breakpoints at once. Now load the useCCS.out and open the source code window with source code useCCS.c, and put the cursor on the line: outBuffer[j++]=0-sineTable[i]; // <- set breakpoint Click the Toggle Breakpoint button (or press ) to set the breakpoint. The breakpoint will be set as shown in Figure 1.15.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 30 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Figure 1.15 CCS screen snapshot of the example using CCS 2. Set up viewing windows: CCS IDE provides many useful windows to ease code development and the debugging process. The following are some of the most often used windows: CPU register viewing window: On the standard tool menu bar, click View→Registers→ CPU Registers to open the CPU registers window. We can edit the contents of any CPU register by double clicking it. If we right click the CPU Register Window and select Allow Docking,we can move the window around and resize it. As an example, try to change the temporary register T0 and accumulator AC0 to new values of T0 = 0x1234 and AC0 = 0x56789ABC. Command window: From the CCS menu bar, click Tools→Command Window to add the command window. We can resize and dock it as well. The command window will appear each time when we rebuild the project. Disassembly window: Click View→Disassembly on the menu bar to see the disassembly window. Every time we reload an executable out file, the disassembly window will appear automatically. 3. Workspace feature: We can customize the CCS display and settings using the workspace feature. To save a workspace, click File→Workspace→Save Workspace and give the workspace a name and path where the workspace will be stored. When we restart CCS, we can reload the workspace by clicking File→Workspace→Load Workspace and use a workspace from previous work. Now save the workspace for your current CCS settings then exit the CCS. Restart CCS and reload the workspace. After the workspace is reloaded, you should have the identical settings restored. 4. Using the single-step features: When using C programs, the C55x system uses a function called boot from the run-time support library rts55.lib to initialize the system. After we load the useCCS.out,JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 31 the program counter (PC) will be at the start of the boot function (in assembly code boot.asm). This code should be displayed in the disassembly window. For a project starting with C programs, there is a function called main( ) from which the C program begins to execute. We can issue the command Go Main from the Debug menu to start the C program after loading the useCCS.out. After the Go Main command is executed, the processor will be initialized for boot.asm and then halted at the location where the function main( ) is. Hit the key or click the single-step button on the debug toolbar repeatedly to single step through the program useCCS.c, and watch the values of the CPU registers change. Move the cursor to different locations in the code and try the Run to Cursor command (hold down the and keys simultaneously). 5. 
Resource monitoring: CCS provides several resource viewing windows to aid software development and the debugging process. Watch windows: From View→Watch Window, open the watch window. The watch window can be used to show the values of listed variables. Type the output buffer name, outBuffer, into the expression box and click OK. Expand the outBuffer to view each individual element of the buffer. Memory viewing: From View→Memory, open a memory window and enter the starting address of the outBuffer in the data page to view the output buffer data. Since global variables are defined globally, we can use the variable name as its address for memory viewing. Is memory viewing showing the same data values as the watch window in previous step? Graphics viewing: From View→Graph→Time/Frequency, open the graphic property dialog. Set the display parameters as shown in Figure 1.16. The CCS allows the user to plot data directly from memory by specifying the memory location and its length. Set a breakpoint on the line of the following C statement: outBuffer[j++]=0-sineTable[i]; // <- set breakpoint Figure 1.16 Graphics display settingsJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 32 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Start animation execution ( hot key), and view DSP CPU registers and outBuffer data in watch window, memory window, and graphical plot window. Figure 1.15 shows one instant snapshot of the animation. The yellow arrow represents the current program counter’s location, and the red dot shows where the breakpoint is set. The data and register values in red color are the ones that have just been updated. 1.6.3 File I/O Using Probe Point Probe point is a useful tool for algorithm development, such as simulating the real-time input and output operations with predigitized data in files. When a probe point is reached, the CCS can either read the selected amount of data samples from a file of the host computer to the target processor memory or write the processed data samples from the target processor to the host computer as an output file for analysis. In the following example, we will learn how to setup probe points to transfer data between the example program probePoint.c and a host computer. In the example, the input data is read into inBuffer[ ] via probe point before the for loop, and the output data in outBuffer[ ] is written to the host computer at the end of the program. The program probePoint.c is listed in Figure 1.17. Write the C program as shown in Figure 1.17. This program reads in 128 words of data from a file, and adds each data value with the loop counter, i. The result is saved in outBuffer[ ], and written out to the host computer at the end of the program. Save the program in ..\experiments \exp1.6.3_probePoint\src and create the linker command file probePoint.cmd based on the Figure 1.17 CCS screen snapshot of example of using probe point: (a) set up probe point address and length for output data buffer; (b) set up probe point address and length for input data buffer; and (c) connect probe points with filesJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 33 Table 1.8 File listing for experiment exp1.6.3_probePoint Files Description probePoint.c C file for testing probe point experiment probePoint.h C header file probePoint.pjt DSP project file probePoint.cmd DSP linker command file previous example useCCS. The sections expdata0 and expdata1 are defined for the input and out- put data buffers. 
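Since Figure 1.17 is only a screen snapshot of the CCS session, a minimal sketch of what probePoint.c may look like is given below. The buffer names, the 128-word length, and the section names follow the description above; the actual listing on the companion CD may differ in detail.

#define BUF_SIZE 128    // 128 words, matching the probe point transfer length

// Place the I/O buffers in the sections defined in probePoint.cmd
#pragma DATA_SECTION(inBuffer, "expdata0")
#pragma DATA_SECTION(outBuffer, "expdata1")
short inBuffer[BUF_SIZE];
short outBuffer[BUF_SIZE];

void main()
{
    short i;

    for (i = 0; i < BUF_SIZE; i++)       // <- Probe point: read inBuffer from the host file
    {
        outBuffer[i] = inBuffer[i] + i;  // Add the loop counter to each input sample
    }
}                                        // <- Probe point: write outBuffer to the host file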
The pragma keyword in the C code will be discussed in Chapter 2. Table 1.8 gives a brief description of the files used for CCS probe point experiment. Procedures of the experiment are listed as follows: 1. Set probe point position: To set a probe point, put the cursor on the line where the probe point will be set and click the Toggle Probe Point hot button. A blue dot to the left indicates that the probe point is set (see Figure 1.17). The first probe point on the line of the for loop reads data into the inBuffer[ ], while the second probe points at the end of the main program and writes data from the outBuffer[ ] to the host computer. 2. Connect probe points: From File→File I/O, open the file I/O dialog box and select File Out- put tab. From the Add File tab, enter probePointOut.dat as the filename from the directory ..\experiments\exp1.6.3_probePoint\data and select *.dat (Hex) as the file type and then click Open tab. Use the output buffer name outBuffer as the address and 128 as the length of the data block for transferring 128 data to the host computer from the output buffer when the probe point is reached as shown in Figure 1.18(a). Also connect the input data probe point to the DSP processor. Select the File Input tab from the File I/O dialog box and click Add File tab. Navigate to the (a) Set up probe point address and length for output data buffer. Figure 1.18 Connect probe pointsJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 34 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING (b) Set up probe point address and length for input data buffer. (c) Connect probe points with files. Figure 1.18 (Continued )JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 35 Table 1.9 Data input file used by CCS probe point 1651 1 c000 1 80 0x0000 0x0001 0x0002 0x0003 0x0004 ... folder ..\experiments\exp1.6.3_probePoint\data and choose probePointIn.dat file. In the Address box, enter inBuffer for the input data buffer and set length to 128 (see Figure 1.18(b)). Now select Add Probe Point tab to connect the probe point with the output file probePointOut.dat and input data file probePointIn.dat. A new dialog box, Break/Probe Points, as shown in Figure 1.18(c), will pop up. From this window, highlight the probe point and click the Connect pull-down tab to select the output data file probePointOut.dat for the output data file, and select the input file probePointIn.dat for the input data file. Fi- nally, click the Replace button to connect the input and output probe points to these two files. After closing the Break/Probe Points dialog box, the File I/O dialog box will be updated to show that the probe point has been connected. Restart the program and run the program. After execution, view the data file probePointOut.dat using the built-in editor by issuing File → Open command. If there is a need to view or edit the data file using other editors/viewers, exit the CCS or disconnect the file from the File I/O. 3. Probe point results: Input data file for experiment is shown in Table 1.9, and the output data file is listed in Table 1.10. The first line contains the header information in hexadecimal format, which uses the syntax illustrated in Figure 1.19. For this example, the data shown in Tables 1.9 and 1.10 are in hexadecimal format, with the address of inBuffer at 0xC000 and outBuffer at 0x8000; both are at the data page, and each block contains 128 (0x80) data samples. 
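To make the header fields concrete, the following host-side C fragment writes a header line in this format. It is an illustrative helper only, not one of the experiment files.

#include <stdio.h>

// Write a CCS .dat file header: MagicNumber Format StartAddress PageNumber Length
// Format: 1 = hex, 2 = integer, 3 = long integer, 4 = floating-point
// Page:   1 = data memory, 2 = program memory
void writeDatHeader(FILE *fp, int format, unsigned long addr,
                    int page, unsigned long length)
{
    fprintf(fp, "1651 %d %lx %d %lx\n", format, addr, page, length);
}

For example, writeDatHeader(fp, 1, 0xC000, 1, 0x80) reproduces the header line 1651 1 c000 1 80 shown in Table 1.9.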
Table 1.10 Data output file saved by CCS probe point

1651 1 8000 1 80
0x0000
0x0002
0x0004
0x0006
0x0008
...

Figure 1.19 CCS file header format: a magic number fixed at 1651, the data format (1 = hex, 2 = integer, 3 = long integer, 4 = floating-point), the starting address of the memory block where the data has been saved, the page number of that block (1 = data, 2 = program), and the length (the number of data values in the block).

1.6.4 File I/O Using C File System Functions

As shown in Figure 1.17, the probe point can be used to connect data files to the C55x system via the CCS. However, the CCS probe point supports only the ASCII file format, while binary files are more compact for storage on computers, and in real-world applications many data files are digitized and stored in binary rather than ASCII format. In this section, we introduce the C file I/O functions. The CCS supports the standard C library I/O functions, including fopen( ), fclose( ), fread( ), and fwrite( ). These functions not only operate on different file formats, but can also be used directly on host computers. Compared with the probe point introduced in the previous section, these file I/O functions are portable to other development environments.

Table 1.11 shows an example C program that uses the fopen( ), fclose( ), fread( ), and fwrite( ) functions. The input is a stereo data file in linear PCM WAV format (the Microsoft file format for pulse code modulation audio data). In this WAV file, a dialog is carried out between the left and right stereo channels. The stereo audio data is arranged such that the even samples are the left-channel data and the odd samples are the right-channel data. The example program reads the audio samples in binary form from the input WAV file and writes the left and right channels separately into two binary data files. The output files are written in linear PCM WAV format with the same sampling rate as the input file, so they can be played back via the Windows Media Player.

In this example, the binary files are read and written in byte units as sizeof(char), which is the only data format supported by the CCS file I/O for the C55x. For data units larger than a byte, such as the 16-bit short data type, the read and write must be done in multiple byte accesses; in the example, the 16-bit linear PCM data is read in and written out 2 bytes at a time. When running this program on a computer, the data access can be changed to the native data type using sizeof(short). The files used are in linear PCM WAV format. The WAV format supports several different file types and sampling frequencies; different WAV file formats are given as an exercise in this chapter for readers to explore further, and the detailed WAV file format can be found in references [18–20]. The files used for this experiment are given in Table 1.12 with brief descriptions.

Procedures of the experiment are listed as follows:

1. Create the project fileIO.pjt and save it in the directory ..\experiments\exp1.6.4_fileIO. Copy the linker command file from the previous experiment and rename it fileIO.cmd. Write the experiment program fileIO.c as shown in Table 1.11 and save it to the directory ..\experiments\exp1.6.4_fileIO\src.
Write the WAV file header as shown in Table 1.13 and save it as fileIO.h in the directory ..\experiments\exp1.6.4_fileIO\inc. The input data file inStereo.wav is included in the companion CD and located in the directory ..\experiments\exp1.6.4_fileIO\data.

2. Build the fileIO project and test the program.

3. Listen to the output WAV files generated by the experiment using a computer audio player such as the Windows Media Player, and compare the output WAV files with the input WAV file.

Table 1.11 Program of using C file system, fileIO.c

#include <stdio.h>
#include <stdlib.h>
#include "fileIO.h"

void main()
{
    FILE *inFile;        // File pointer of input signal
    FILE *outLeftFile;   // File pointer of left channel output signal
    FILE *outRightFile;  // File pointer of right channel output signal
    short x[4];
    char wavHd[44];

    inFile = fopen("..\\data\\inStereo.wav", "rb");
    if (inFile == NULL)
    {
        printf("Can't open inStereo.wav");
        exit(0);
    }
    outLeftFile = fopen("..\\data\\outLeftCh.wav", "wb");
    outRightFile = fopen("..\\data\\outRightCh.wav", "wb");

    // Skip input wav file header
    fread(wavHd, sizeof(char), 44, inFile);

    // Add wav header to left and right channel output files
    fwrite(wavHeader, sizeof(char), 44, outLeftFile);
    fwrite(wavHeader, sizeof(char), 44, outRightFile);

    // Read stereo input and write to left/right channels
    while (fread(x, sizeof(char), 4, inFile) == 4)
    {
        fwrite(&x[0], sizeof(char), 2, outLeftFile);
        fwrite(&x[2], sizeof(char), 2, outRightFile);
    }

    fclose(inFile);
    fclose(outLeftFile);
    fclose(outRightFile);
}

Table 1.12 File listing for experiment exp1.6.4_fileIO

Files         Description
fileIO.c      C file for testing the file I/O experiment
fileIO.h      C header file
fileIO.pjt    DSP project file
fileIO.cmd    DSP linker command file

Table 1.13 Program example of header file, fileIO.h

// This wav file header is precalculated
// It can only be used for this experiment
short wavHeader[44]={
    0x52, 0x49, 0x46, 0x46,  // RIFF
    0x2E, 0x8D, 0x01, 0x00,  // 101678 (36 bytes + 101642 bytes data)
    0x57, 0x41, 0x56, 0x45,  // WAVE
    0x66, 0x6D, 0x74, 0x20,  // Formatted
    0x10, 0x00, 0x00, 0x00,  // PCM audio
    0x01, 0x00, 0x01, 0x00,  // Linear PCM, 1-channel
    0x40, 0x1F, 0x00, 0x00,  // 8 kHz sampling
    0x80, 0x3E, 0x00, 0x00,  // Byte rate = 16000
    0x02, 0x00, 0x10, 0x00,  // Block align = 2, 16-bit/sample
    0x64, 0x61, 0x74, 0x61,  // Data
    0x0A, 0x8D, 0x01, 0x00}; // 101642 data bytes

1.6.5 Code Efficiency Analysis Using Profiler

The CCS profiler measures the execution status of specific segments of a project and is a very useful tool for analyzing and optimizing large and complex DSP projects. In this experiment, we use the CCS profiler to obtain statistics on the execution time of DSP functions. The files used for this experiment are listed in Table 1.14.

Table 1.14 File listing for experiment exp1.6.5_profile

Files         Description
profile.c     C file for testing the DSP profile experiment
profile.h     C header file
profile.pjt   DSP project file
profile.cmd   DSP linker command file

Procedures of the experiment are listed as follows:

1. Creating the DSP project: Create a new project profile.pjt and write a C program profile.c as shown in Table 1.15. Build the project and load the program. For demonstration purposes, we will profile a function and a segment of code in the program.
This program calls the sine function from the C math library to generate a 1000 Hz tone at an 8000 Hz sampling rate. The generated 16-bit integer data is saved on the computer in WAV file format. As an example, we will use the CCS profile feature to profile the function sinewave( ) and the code segment in main( ) that calls the sinewave( ) function.

Table 1.15 Program example using CCS profile features, profile.c

#include <stdio.h>
#include <math.h>
#include "profile.h"

short sinewave(short n);  // Function prototype

void main()
{
    FILE *outFile;        // File pointer of output file
    short x[2],i;

    outFile = fopen("..\\data\\output.wav", "wb");

    // Add wav header to the output file
    fwrite(wavHeader, sizeof(char), 44, outFile);

    // Generate 1 second of 1 kHz sine wave at 8 kHz sampling rate
    for (i=0; i<8000; i++)
    {
        x[0] = sinewave(i);     // <- Profile range start
        x[1] = (x[0]>>8)&0xFF;  // <- Profile range stop
        x[0] = x[0]&0xFF;
        fwrite(x, sizeof(char), 2, outFile);
    }
    fclose(outFile);
}

// Integer sine-wave generator
short sinewave(short n)         // <- Profile function
{
    return( (short)(sin(TWOPI_f_F*(float)n)*16384.0) );
}

2. Set up profile points: Build and load the program profile.out, and open the source code profile.c. From the Profile Point menu, select Start New Session; this opens the profile window, where we can give the profile session a name. Select the Functions tab, click the function name sinewave( ) (the function definition, not the call inside main( )), and drag it into the profile session window. This enables the CCS profiler to profile the function sinewave( ). We can also profile a segment of the source code by choosing the Ranges tab instead of the Functions tab: highlight the two lines of source code in main( ) where the sine function is called, and drag them into the profiler session window. This enables profiling of the two-line code segment. Finally, run the program and record the cycle counts shown in the profile status window.

The profiler runs with the assistance of breakpoints, so it is not suitable for real-time analysis, but it does provide useful analysis using digitized data files. We use the profiler to identify the critical run-time code segments and functions, which can then be optimized to improve their real-time performance. Figure 1.20 shows the example profile results. The sine function in the C library uses an average of over 4000 cycles to generate one data sample. This is very inefficient because the sine function is calculated with floating-point arithmetic; since the TMS320C55x is a fixed-point DSP processor, the processing speed is dramatically slowed by emulating floating-point arithmetic. We will discuss the implementation of fixed-point arithmetic for the C55x processors in Chapter 3. We will also present more efficient ways to generate a variety of digital signals, including sinewaves, in the following chapters.

1.6.6 Real-Time Experiments Using DSK

The programs presented in the previous sections can be used either on a C5510 DSK or on a C55x simulator. In this section, we focus on the DSK for real-time experiments. The DSK is a low-cost DSP development and evaluation hardware platform that uses a USB interface to connect to the host computer. A DSK can be used either for DSP program development and debugging or for real-time demonstration and evaluation. We will introduce detailed DSK functions and features in Chapter 2 along with the C5510 peripherals.
The DSK has several example programs included in its package. We modified an audio loop-back demo that takes an audio input through the line-in jack, and plays back via the headphone output in real time. The photo of the C5510 DSK is shown in Figure 1.21. For the audio demo example, we connect DSK with an audio player as the input audio source and a headphone (or loudspeaker) as the audio output. The demo program is listed in Table 1.16.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 40 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Figure 1.20 Profile window of DSP profile status Figure 1.21 TMSVC 5510 DSKJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 Table 1.16 Program example of DSK audio loop-back, loopback.c #include "loopbackcfg.h" #include "dsk5510.h" #include "dsk5510_aic23.h" /* Codec configuration settings */ DSK5510_AIC23_Config config = { \ 0x0017, /* 0 DSK5510_AIC23_LEFTINVOL Left line input channel volume */ \ 0x0017, /* 1 DSK5510_AIC23_RIGHTINVOL Right line input channel volume */\ 0x01f9, /* 2 DSK5510_AIC23_LEFTHPVOL Left channel headphone volume */ \ 0x01f9, /* 3 DSK5510_AIC23_RIGHTHPVOL Right channel headphone volume */ \ 0x0011, /* 4 DSK5510_AIC23_ANAPATH Analog audio path control */ \ 0x0000, /* 5 DSK5510_AIC23_DIGPATH Digital audio path control */ \ 0x0000, /* 6 DSK5510_AIC23_POWERDOWN Power down control */ \ 0x0043, /* 7 DSK5510_AIC23_DIGIF Digital audio interface format */ \ 0x0081, /* 8 DSK5510_AIC23_SAMPLERATE Sample rate control */ \ 0x0001, /* 9 DSK5510_AIC23_DIGACT Digital interface activation */ \ }; void main() { DSK5510_AIC23_CodecHandle hCodec; Int16 i,j,left,right; /* Initialize the board support library, must be called first */ DSK5510_init(); /* Start the codec */ hCodec = DSK5510_AIC23_openCodec(0, &config); /* Loop back line-in audio for 30 seconds at 48 kHz sampling rate */ for (i = 0; i < 30; i++) { for (j = 0; j < 48000; j++) { /* Read a sample from the left input channel */ while (!DSK5510_AIC23_read16(hCodec, &left)); /* Write a sample to the left output channel */ while (!DSK5510_AIC23_write16(hCodec, left)); /* Read a sample from the right input channel */ while (!DSK5510_AIC23_read16(hCodec, &right)); /* Write a sample to the right output channel */ while (!DSK5510_AIC23_write16(hCodec, right)); } } /* Close the codec */ DSK5510_AIC23_closeCodec(hCodec); } 41JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 42 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING Table 1.17 File listing for experiment exp1.6.6_loopback Files Description loopback.c C file for testing DSK real-time loopback experiment Loopback.cdb DSP BIOS configuration file loopbackcfg.h C header file loopback.pjt DSP project file loopbackcfg.cmd DSP linker command file desertSun.wav Test data file fools8k.wav Test data file This experimental program first initializes the DSK board and the AIC23 CODEC. It starts the audio loopback at 48 kHz sampling rate for 1 min. Finally, it stops and closes down the AIC23. The settings of the AIC23 will be presented in detail in Chapter 2. This example is included in the companion CD and can be loaded into DSK directly. In the subsequent chapters, we will continue to modify this program for use in other audio signal processing experiments. As we have discussed in this chapter, the signal processing can be either in a sample-by-sample or in a block-by-block method. This audio loopback experiment is implemented in the sample-by-sample method. It is not very efficient in terms of processing I/O overhead. 
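To preview the block-by-block idea, the following sketch reorganizes the inner loop of Table 1.16 so that a block of samples is collected, processed, and then written back. BLOCK_SIZE and the processing step are illustrative assumptions; the BSL types and calls are the same ones used in Table 1.16, and a real implementation would double buffer the collect and write phases so that no input samples are missed.

#define BLOCK_SIZE 128   // Illustrative block length

Int16 leftBuf[BLOCK_SIZE], rightBuf[BLOCK_SIZE];

void loopbackOneBlock(DSK5510_AIC23_CodecHandle hCodec)
{
    Int16 k;

    /* Collect one block of stereo samples from the codec */
    for (k = 0; k < BLOCK_SIZE; k++)
    {
        while (!DSK5510_AIC23_read16(hCodec, &leftBuf[k]));
        while (!DSK5510_AIC23_read16(hCodec, &rightBuf[k]));
    }

    /* Process the whole block here (e.g., filtering) before output */

    /* Send the block back to the codec */
    for (k = 0; k < BLOCK_SIZE; k++)
    {
        while (!DSK5510_AIC23_write16(hCodec, leftBuf[k]));
        while (!DSK5510_AIC23_write16(hCodec, rightBuf[k]));
    }
}

Combined with the DMA controller described in Chapter 2, such block transfers can proceed while the CPU processes the previous block.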
We will introduce block-by-block method to reduce the I/O overhead in the following chapters. The files used for this experiment are listed in Table 1.17. In addition, there are many built-in header files automatically included by CCS. Procedures of experiment are listed as follows: 1. Play the WAVfile desertSun.wav in the directory ..\experiments\exp1.6.6_loopback\data using media player in loop mode on a host computer, or use an audio player as audio source. 2. Connect one end of a stereo cable to the computer’s audio output jack, and the other end to the DSK’s line-in jack. Connect a headphone to the DSK headphone output jack. 3. Start CCS and open loopback.pjt from the directory ..\experiments\exp1.6.6_loopback, and then build and load the loopback.out. 4. Play the audio by the host computer in loop mode and run the program on the DSK. The DSK will acquire the signal from the line-in input, and send it out to the DSK headphone output. 1.6.7 Sampling Theory Aliasing is caused by using sampling frequency incorrectly. A chirp signal (will be discussed in Chapter 8) is a sinusoid function with changing frequency, which is good for observing aliasing. This experiment uses audible and visual results from the MATLAB to illustrate the aliasing phenomenon. Table 1.18 lists the MATLAB code for experiment. In the program given in Table 1.18, fl and fh are the low and high frequencies of the chirp signals, respectively. The sampling frequency fs is set to 800 Hz. This experiment program generates 1 s of chirp signal. The experiment uses MATLAB function sound( ) as the audio tool for listening to the chirp signal and uses the plot( ) function as a visual aid to illustrate the aliasing result.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 43 Table 1.18 MATLAB code to demonstrate aliasing fl = 0; % Low frequency fh = 200; % High frequency fs = 800; % Sampling frequency n = 0:1/fs:1; % 1 seconds of data phi = 2*pi*(fl*n + (fh-fl)*n.*n/2); y = 0.5*sin(phi); sound(y, fs); plot(y) Procedures of the experiment are listed as follows: 1. Start MATLAB and set MATLAB path to the experiment directory. 2. Type samplingTheory in the MATLAB command window to start the experiment. 3. When the sampling frequency is set to 800 Hz, the code will generate a chirp signal sweeping from 0 to 200 Hz. The MATLAB uses the function sound to play the continuous chirp signal and plot the entire signal as shown in Figure 1.22(a). 4. Now change the sampling frequency to fs = 200 Hz. Because the sampling frequency fs does not meet the sampling theorem, the chirp signal generated will have aliasing. The result is audible and can be viewed from a MATLAB plot. The sweeping frequency folded at 100 Hz is shown in Figure 1.22(b). 0.5 0 0 0 20 40 60 80 100 120 140 160 180 200 100 200 300 400 (a) 800 Hz sampling rate (b) 200 Hz sampling rate 500 600 700 800 −0.5 0.5 0 −0.5 Figure 1.22 Sampling theory experiment using chirp signal: (a) 800 Hz sampling rate; (b) 200 Hz sampling rateJWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 44 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING 5. Now use the chirp experiment as reference, and write a signal generator using the sine function available in MATLAB. Set the sampling frequency to 800 Hz. Start with sine function frequency 200 Hz, generate 2 s of audio, and plot the signal. Repeat this experiment five times by incrementing the sine function frequency by 100 Hz each time. 
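The folding behavior can also be checked numerically. The short C sketch below computes the apparent frequency of a sampled sinusoid; it is an illustrative helper, not part of the experiment files.

#include <stdio.h>
#include <math.h>

/* Apparent (aliased) frequency of a sinusoid of frequency f sampled at fs. */
/* Components above fs/2 fold back into the 0 to fs/2 range.                */
double aliasedFreq(double f, double fs)
{
    double r = fmod(f, fs);    /* remove whole multiples of the sampling rate */
    if (r > fs / 2.0)
        r = fs - r;            /* fold around fs/2 */
    return r;
}

int main(void)
{
    printf("%.1f Hz\n", aliasedFreq(150.0, 200.0));  /* prints 50.0: 150 Hz aliases to 50 Hz at fs = 200 Hz */
    printf("%.1f Hz\n", aliasedFreq(150.0, 800.0));  /* prints 150.0: no aliasing at fs = 800 Hz */
    return 0;
}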
1.6.8 Quantization in ADCs Quantization is an important factor when designing a DSP system. We will discuss quantization in Chapter 3. For this experiment, we use MATLAB to show that different ADC wordlengths have different quantization errors. Table 1.19 shows a portion of the MATLAB code for this experiment. Procedures of the experiment are listed as follows: 1. Start MATLAB and set the MATLAB path to the experiment directory. 2. Type ADCQuantization in the MATLAB command window to start the experiment. 3. When MATLAB prompts for input, enter the desired ADC peak voltage and wordlength. 4. Enter an input voltage to the ADC to compute the digital output and error. This experiment will calcu- late the ADC resolution and compute the error in million volts; it will also display the corresponding hexadecimal numbers that will be generated by ADC for the given voltage. An example of output is listed in Table 1.20. Table 1.19 MATLAB code for experiment of ADC quantization peak = input('Enter the ADC peak voltage (0 - 5) = '); bits = input('Enter the ADC wordlength (4 - 12) = '); volt = input('Enter the analog voltage = '); % Calculate resolution resolution = peak / power(2, bits); % Find digital output digital = round(volt/resolution); % Calculate error error = volt - digital*resolution; Table 1.20 Output of ADC quantization >> ADCQuantization Enter the ADC peak voltage (0 - 5) = 5 Enter the ADC wordlength (4 - 12) = 12 Enter the analog voltage (less than or equal to peak voltage) = 3.445 ADC resolution (mv) = 1.2207 ADC corresponding output (HEX) = B06 ADC quantization error (mv)= 0.1758JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 Exercises 45 References [1] ITU Recommendation G.729, Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP), Mar. 1996. [2] ITU Recommendation G.723.1, Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s, Mar. 1996. [3] ITU Recommendation G.722, 7 kHz Audio-Coding within 64 kbit/s, Nov. 1988. [4] 3GPP TS 26.190, AMR Wideband Speech Codec: Transcoding Functions, 3GPP Technical Specification, Mar. 2002. [5] ISO/IEC 13818-7, MPEG-2 Generic Coding of Moving Pictures and Associated Audio Information, Oct. 2000. [6] ISO/IEC 11172-3, Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s - Part 3: Audio, Nov. 1992. [7] ITU Recommendation G.711, Pulse Code Modulation (PCM) of Voice Frequencies, Nov. 1988. [8] Texas Instruments, TLV320AIC23B Data Manual, Literature no. SLWS106H, 2004. [9] Spectrum Digital, Inc., TMS320VC5510 DSK Technical Reference, 2002. [10] S. Zack and S. Dhanani, ‘DSP co-processing in FPGA: Embedding high-performance, low-cost DSP functions,’ Xilinx White Paper, WP212, 2004. [11] Berkeley Design Technology, Inc., ‘Choosing a DSP processor,’ White-Paper, 2000. [12] G. Frantz and L. Adams, ‘The three Ps of value in selecting DSPs,’ Embedded System Programming, Oct. 2004. [13] Texas Instruments, Inc., TMS320C55x Optimizing C Compiler User’s Guide, Literature no. SPRU281E, Revised 2003. [14] Texas Instruments, Inc., TMS320C55x Assembly Language Tools User’s Guide, Literature no. SPRU280G, Revised 2003. [15] Spectrum Digital, Inc., TMS320C5000 DSP Platform Code Composer Studio DSK v2 IDE, DSK Tools for C5510 Version 1.0, Nov. 2002. [16] Texas Instruments, Inc., Code Composer Studio User’s Guide (Rev B), Literature no. SPRU328B, Mar. 2000. 
[17] Texas Instruments, Inc., Code Composer Studio v3.0 Getting Start Guide, Literature no. SPRU509E, Sept. 2004. [18] IBM and Microsoft, Multimedia Programming Interface and Data Specification 1.0, Aug. 1991. [19] Microsoft, New Multimedia Data Types and Data Techniques, Rev 1.3, Aug. 1994. [20] Microsoft, Multiple Channel Audio Data and WAVE Files, Nov. 2002. [21] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1989. [22] S. J. Orfanidis, Introduction to Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1996. [23] J. G. Proakis and D. G. Manolakis, Digital Signal Processing Ð Principles, Algorithms, and Applications, 3rd Ed., Englewood Cliffs, NJ: Prentice Hall, 1996. [24] A. Bateman and W. Yates, Digital Signal Processing Design, New York: Computer Science Press, 1989. [25] S. M. Kuo and D. R. Morgan, Active Noise Control Systems Ð Algorithms and DSP Implementations, New York: John Wiley & Sons, Inc., 1996. [26] J. H. McClellan, R. W. Schafer, and M. A. Yoder, DSP First: A Multimedia Approach, 2nd Ed., Englewood Cliffs, NJ: Prentice Hall, 1998. [27] S. M. Kuo and W. S. Gan, Digital Signal Processors Ð Architectures, Implementations, and Applications, Upper Saddle River, NJ: Prentice Hall, 2005. [28] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, Piscataway, NJ: IEEE Press, 1997. [29] Berkeley Design Technology, Inc., ‘The evolution of DSP processor,’ A BDTi White Paper, 2000. Exercises 1. Given an analog audio signal with frequencies up to 10 kHz. (a) What is the minimum required sampling frequency that allows a perfect reconstruction of the signal from its samples? (b) What will happen if a sampling frequency of 8 kHz is used?JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 46 INTRODUCTION TO REAL-TIME DIGITAL SIGNAL PROCESSING (c) What will happen if the sampling frequency is 50 kHz? (d) When sampled at 50 kHz, if only taking every other samples (this is a decimation by 2), what is the frequency of the new signal? Is this causing aliasing? 2. Refer to Example 1.1, assuming that we have to store 50 ms (1 ms = 10−3 s) of digitized signals. How many samples are needed for (a) narrowband telecommunication systems with fs = 8 kHz, (b) wideband telecommu- nication systems with fs = 16 kHz, (c) audio CDs with fs = 44.1 kHz, and (d) professional audio systems with fs = 48 kHz. 3. Assume that the dynamic range of the human ear is about 100 dB, and the highest frequency the human ear can hear is 20 kHz. If you are a high-end digital audio system designer, what size of converters and sampling rate are needed? If your design uses single-channel 16-bit converter and 44.1 kHz sampling rate, how many bits are needed to be stored for 1 min of music? 4. Given a discrete time sinusoidal signal of x(n) = 5sin(nπ/100) V. (a) Find its peak-to-peak range? (b) What are the quantization resolutions of (i) 8-bit, (ii) 12-bit, (iii) 16-bit, and (iv) 24-bit ADCs for this signal? (c) In order to obtain the quantization resolution of below 1 mV, how many bits are required in the ADC? 5. A speech file (timit_1.asc) was sampled using 16-bit ADC with one of the following sampling rates: 8 kHz, 12 kHz, 16 kHz, 24 kHz, or 32 kHz. We can use MATLAB to play it and find the correct sampling rate. Try to run exercise1_5.m under the exercise directory. This script plays the file at 8 kHz, 12 kHz, 16 kHz, 24 kHz, and 32 kHz. Press the Enter key to continue if the program is paused. 
What is the correct sampling rate? 6. From the Option menu, set the CCS for automatically loading the program after the project has been built. 7. To reduce the number of mouse clicks, many pull-down menu items have been mapped to the hot buttons for the standard and advanced edit, project management, and debug toolbars. There are still some functions, however, that do not associate with any hot buttons. Use the Option menu to create shortcut keys for the following menu items: (a) map Go Main in the debug menu to Alt + M( and keys); (b) map Reset in the debug menu to Alt + R; (c) map Restart in the debug menu to Alt + S; and (d) map Reload Program in the file menu to Ctrl + R. 8. After loading the program into the simulator and enabling Source/ASM mixed display mode from View → Mixed Source/ASM, what is shown in the CCS source display window besides the C source code? 9. How do you change the format of displayed data in the watch window to hex, long, and floating-point format from the integer format? 10. What does File → Workspace do? Try the save and reload workspace commands. 11. Besides using file I/O with the probe point, data values in a block of memory space can also be stored to a file. Try the File → Data → Save and File → Data → Load commands. 12. Using Edit→Memory command we can manipulate (edit, copy, and fill) system memory of the useCCS.pjt in section 1.6.1 with the following tasks: (a) open memory window to view outBuffer;JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 EXERCISES 47 (b) fill outBuffer with data 0x5555; and (c) copy the constant sineTable[] to outBuffer. 13. Use the CCS context-sensitive online help menu to find the TMS320C55x CUP diagram, and name all the buses and processing units. 14. We have introduced probe point for connecting files in and out of the DSP program. Create a project that will read in 16-bit, 32-bit, and floating-point data files into the DSP program. Perform multiplication of two data and write the results out via probe point. 15. Create a project to use fgetc( ) and fputc( ) to get data from the host computer to the DSP processor and write out the data back to the computer. 16. Use probe point to read the unknown 16-bit speech data file (timit_1.asc) and write it out in binary format (timit_1.bin). 17. Study the WAV file format and write a program that can create a WAV file using the PCM binary file (timit_1.bin) from above experiment. Play the created WAV file (timit_1.wav) on a personal com- puter’s Windows Media Player. 18. Getting familiar with the DSK examples is very helpful. The DSK software package includes many DSP examples for the DSK. Use the DSK to run some of these examples and observe what these examples do.JWBK080-01 JWBK080-Kuo March 8, 2006 19:8 Char Count= 0 48JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 2 Introduction to TMS320C55x Digital Signal Processor To efficiently design and implement digital signal processing (DSP) systems, we must have a sound knowledge of DSP algorithms as well as DSP processors. In this chapter, we will introduce the architecture and programming of the Texas Instruments’ TMS320C55x fixed-point processors. 2.1 Introduction As introduced in Chapter 1, the TMS320 fixed-point processor family consists of C1x, C2x, C5x, C2xx, C54x, C55x, C62x, and C64x. In recent years, Texas Instruments have also introduced application-specific DSP-based processors including DSC2x, DM2xx, DM3xx, DM6xxxx, and OMAP (open multimedia application platform). 
DSC2x targets low-end digital cameras market. The digital medial processors aim at the rapid developing digital media markets such as portable media players, media centers, digital satellite broadcasting, simultaneously streaming video and audio, high-definition TVs, surveillance systems, as well as high-end digital cameras. The OMAP family is primarily used in wireless and portable devices such as new generation of cell phones and portable multimedia devices. Each generation of the TMS320 family has its own unique central processing unit (CPU) with variety of memory and peripheral configurations. The widely used TMS320C55x family includes C5501, C5502, C5503, C5509, C5510, and so on. In this book, we use the TMS320C5510 as an example for real-time DSP implementations, experiments, and applications. The C55x processor is designed for low power consumption, optimum performance, and high code density. Its dual multiply-and-accumulate (MAC) architecture provides twice the cycle efficiency for computing vector products, and its scaleable instruction length significantly improves the code density. Some important features of the C55x processors are: r 64-byte instruction buffer queue that works as an on-chip program cache to support implementation of block-repeat operations efficiently. r Two 17-bit by17-bit MAC units can execute dual MAC operations in a single cycle. r A 40-bit arithmetic-and-logic unit (ALU) performs high precision arithmetic and logic operations with an additional 16-bit ALU to perform simple arithmetic operations in parallel to the main ALU. Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 49JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 50 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR r Four 40-bit accumulators for storing intermediate computational results in order to reduce memory access. r Eight extended auxiliary registers (XARs) for data addressing plus four temporary data registers to ease data processing requirements. r Circular addressing mode supports up to five circular buffers. r Single-instruction repeat and block-repeat operations of program for supporting zero-overhead looping. r Multiple data variable and coefficient accesses in single instruction. Detailed information of the TMS320C55x can be found in the references listed at the end of this chapter. 2.2 TMS320C55x Architecture The C55x CPU consists of four processing units: instruction buffer unit (IU), program flow unit (PU), address-data flow unit (AU), and data computation unit (DU). These units are connected to 12 different address and data buses as shown in Figure 2.1. 2.2.1 Architecture Overview IU fetches instructions from the memory into the CPU. The C55x instructions have different lengths for optimum code density. Simple instructions use only 8 bits (1 byte), while complicated instruc- tions may contain as many as 48 bits (6 bytes). For each clock cycle, the IU fetches 4 bytes of in- struction code via its 32-bit program-read data bus (PB) and places them into the 64-byte instruc- tion buffer. At the same time, the instruction decoder decodes an instruction as shown in Figure 2.2. The decoded instruction is passed to the PU, AU, or DU. The IU improves the program execution by maintaining instruction flow between the four units within the CPU. If the IU is able to hold a complete segment of loop code, the program execution can be repeated many times without fetching code from memory. 
Such capability not only improves the efficiency of loop execution, but also saves the power consumption by reducing memory accesses. The instruction buffer that can hold multiple instructions in conjunction with conditional program flow control is another advantage. This can minimize the overhead caused by program flow discontinuities such as conditional calls and branches. PU controls program execution. As illustrated in Figure 2.3, the PU consists of a program counter (PC), four status registers, a program address generator, and a pipeline protection unit. The PC tracks the program execution every clock cycle. The program address generator produces a 24-bit address that covers 16 Mbytes of memory space. Since most instructions will be executed sequentially, the C55x utilizes pipeline structure to improve its execution efficiency. However, instructions such as branch, call, return, conditional execution, and interrupt will cause a nonsequential program execution. The dedicated pipeline protection unit prevents program flow from any pipeline vulnerabilities caused by a nonsequential execution. AU serves as the data access manager. The block diagram illustrated in Figure 2.4 shows that the AU generates the data space addresses for data read and data write. The AU consists of eight 23-bit XARS (XAR0ÐXAR7), four 16-bit temporary registers (T0ÐT3), a 23-bit coefficient pointer (XCDP), and a 23-bit extended stack pointer (XSP). It consists of an additional 16-bit ALU that can be used for simple arithmetic operations. The temporary registers can be utilized to expand compiler efficiency by minimizing the need for memory access. The AU allows two address registers and a coefficient pointerJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ARCHITECTURE 51 24-bit program-read address bus (PAB) 32-bit program-read data bus (PB) Three 24-bit data-read address buses (BAB, CAB, DAB) Three 16-bit data-read data buses (BB, CB, DB) Two 24-bit data-write address buses (EAB, FAB) Two 16-bit data-write data buses (EB, FB) 32bits CB DB BB CB DB Instruction buffer unit IU Data computation unit DU Program flow unit PU Address-data flow unit AU C55x CPU Figure 2.1 Block diagram of TMS320C55x CPU Program-read data bus (PB) 32 (4-byte opcode fetch) Instruction decoder Instruction buffer queue (64bytes) PU AU DU(1−6bytes opcode) 48 IU Figure 2.2 Simplified block diagram of the C55x IUJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 Program-read address bus (PAB) 24-bit PU Program counter (PC) Status registers (ST0, ST1, ST2, ST3) Address generator Pipeline protection unit Figure 2.3 Simplified block diagram of the C55x PU FB EB FAB EAB BAB CAB DAB C DB D A T A M E M O R Y S P A C E XAR0 XAR1 XAR2 XAR3 XAR4 XAR5 XAR6 XAR7 XCDP XSP 16-bit ALU T0 T1 T2 T3 16-bit 23-bit AU Data- address generator unit (24-bit) Figure 2.4 Simplified block diagram of the C55x AU AC0 AC1 AC2 AC3 MAC MAC ALU (40-bit) Barrel shifter Overflow and saturation EB FB 16-bit 16-bit BB C DB 16-bit 16-bit 16-bit DU Figure 2.5 Simplified block diagram of the C55x DU 52JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ARCHITECTURE 53 to be used together for some instructions to access two data samples and one coefficient in a single clock cycle. The AU also supports up to five circular buffers, which will be discussed later. DU handles intensive computation for C55x applications. 
As illustrated in Figure 2.5, the DU consists of a pair of MAC units, a 40-bit ALU, four 40-bit accumulators (AC0, AC1, AC2, and AC3), a barrel shifter, and rounding and saturation control logic. There are three data-read data buses that allow two data paths and a coefficient path to be connected to the dual MAC units simultaneously. In a single cycle, each MAC unit can perform a 17-bit by 17-bit multiplication and a 40-bit addition (or subtraction) with saturation option. The ALU can perform 40-bit arithmetic, logic, rounding, and saturation operations using the accumulators. It can also be used to achieve two 16-bit arithmetic operations in both the upper and lower portions of an accumulator at the same time. The ALU can accept immediate values from the IU as data and communicate with other AU and PU registers. The barrel shifter may be used to perform data shift in the range of 2−32 (shift right 32 bits) to 231 (shift left 31 bits). 2.2.2 Buses As illustrated in Figure 2.1, the TMS320C55x has one program data bus, five data buses, and six address buses. The C55x architecture is built around these 12 buses. The program buses carry the instruction code and immediate operands from program memory, while the data buses connect various units. This architecture maximizes the processing power by maintaining separate memory bus structures for full- speed execution. The program buses include a 32-bit PB and a 24-bit program-read address bus (PAB). The PAB carries the program memory address in order to read the code from the program space. The PB transfers 4 bytes of code to the IU at each clock cycle. The unit of program address used by the C55x processors is byte. Thus, the addressable program space is in the range of 0x000000Ð0xFFFFFF. (The prefix 0x indicates that the following numbers are in hexadecimal format.) The data buses consist of three 16-bit data-read data buses (BB, CB, and DB) and three 24-bit data-read address buses (BAB, CAB, and DAB). This architecture supports three simultaneous data reads from data memory or I/O space. The CB and DB can send data to the PU, AU, and DU, while the BB can only work with the DU. The primary function of the BB is to connect memory to the dual MAC, so some specific operations can fetch two data and one coefficient simultaneously. The data-write operations use two 16-bit data-write data buses (EB and FB) and two 24-bit data-write address buses (EAB and FAB). For a single 16-bit data write, only the EB is used. A 32-bit data write will use both the the EB and the FB in one cycle. The data-write address buses (EAB and FAB) have the same 24-bit addressing range. The data memory space is 23-bit word addressable from address 0x000000 to 0x7FFFFF. 2.2.3 On-Chip Memories The C55x uses unified program and data memory configurations with separated I/O space. All 16 Mbytes of memory space are available for program and data. The program memory space is used for program code, which is stored in byte units. The data memory space is used for data storage. The memory mapped registers (MMRs) also reside in data memory space. When the processor fetches instructions from the program memory space, the C55x address generator uses the 24-bit PAB. When the processor accesses data memory space, the C55x address generator masks off the least significant bit (LSB) of the data address line to ensure that the data is stored in memory in 16-bit word entity. The 16 Mbytes memory map is shown in Figure 2.6. 
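Because program space is byte addressable while data space is word addressable, the same physical location has two different addresses. The small helper functions below illustrate the relationship; they are for illustration only and are not part of the C55x tools.

/* A 16-bit data word at word address W occupies program (byte) addresses 2*W and 2*W + 1. */
unsigned long wordToByteAddr(unsigned long wordAddr)
{
    return wordAddr << 1;
}

/* Going the other way, the LSB of the byte address is masked off, as the C55x */
/* data-address generator does, before converting to a word address.           */
unsigned long byteToWordAddr(unsigned long byteAddr)
{
    return (byteAddr & 0xFFFFFEUL) >> 1;
}

For example, the data word at word address 0x8000 corresponds to byte addresses 0x10000 and 0x10001.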
Data space is divided into 128 data pages (0Ð127), and each page has 64 K words. The C55x on-chip memory from addresses 0x0000 to 0xFFFF uses the dual access RAM (DARAM). The DARAM is divided into eight blocks of 8 Kbytes each, see Table 2.1. Within each block, C55x can perform two accesses (two reads, two writes, or one read and one write) per cycle. The on-chip DARAM can be accessed via the internal program bus, data bus, or direct memory access (DMA) buses. The DARAM is often used for frequently accessed data.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 54 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR MMRs 00 0000−00 005F 00 0000−00 00BF Reserved 00 0060 00 FFFF 00 00C0 01 FFFF 01 0000 01 FFFF 02 0000 03 FFFF 02 0000 02 FFFF 04 0000 05 FFFF 7F 0000 7F FFFF FE 0000 FF FFFF Page 0 Page 1 Page 2 Page 127 Program space addresses (byte in hexadecimal) C55x memory program/data space Data space addresses (word in hexadecimal) ⎨ ⎨ ⎨ ⎨ Figure 2.6 TMS320C55x program space and data space memory map The C55x on-chip memory also includes single-access RAM (SARAM). The SARAM location starts from the byte address 0x10000 to 0x4FFFF. It consists of 32 blocks of 8 Kbytes each (see Table 2.2). Each access (one read or one write) will take one cycle. The C55x on-chip SARAM can be accessed by the internal program, data, or DMA buses. The C55x contains an on-chip read-only memory (ROM) in a single 32 K-byte block. It starts from the byte address 0xFF8000 to 0xFFFFFF. Table 2.3 shows the addresses and contents of ROM in C5510. The bootloader provides multiple methods to load the program at power up or hardware reset. The bootloader uses vector table for placing interrupts. The 256-value sine lookup table can be used to generate sine function. Table 2.1 C5510 DARAM blocks and addresses DARAM byte address range DARAM memory blocks 0x0000−0x1FFF DARAM 0 0x2000−0x3FFF DARAM 1 0x4000−0x5FFF DARAM 2 0x6000−0x7FFF DARAM 3 0x8000−0x9FFF DARAM 4 0xA000−0xBFFF DARAM 5 0xC000−0xDFFF DARAM 6 0xE000−0xFFFF DARAM 7JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ARCHITECTURE 55 Table 2.2 C5510 DARAM blocks and addresses SARAM byte address range SARAM memory blocks 0x10000−0x11FFF SARAM 0 0x12000−0x13FFF SARAM 1 0x14000−0x15FFF SARAM 2 ···· ···· ···· 0x4C000−0x4DFFF SARAM 30 0x4E000−0x4FFFF SARAM 31 2.2.4 Memory Mapped Registers The C55x processor has MMRs for internal managing, controlling, and monitoring. These MMRs are located at the reserved RAM block from 0x00000 to 0x0005F. Table 2.4 lists all the CPU registers of C5510. The accumulators AC0, AC1, AC2, and AC3 are 40-bit registers. They are formed by two 16-bit and one 8-bit registers as shown in Figure 2.7. The guard bits, AG, are used to hold data result of more than 32 bits to prevent overflow during accumulation. The temporary data registers, T0, T1, T2, and T3, are 16-bit registers. They are used to hold data results less or equal to 16 bits. There are eight auxiliary registers, AR0-AR7, which can be used for several purposes. Auxiliary registers can be used as data pointers for indirect addressing mode and circular addressing mode. The coefficient data pointer (CDP) is a unique addressing register used for accessing coefficients via coefficient data bus during multiple data access operations. Stack pointer tracks the data memory address position at the top of the stack. The stack must be set with sufficient locations at reset to ensure that the system works correctly. 
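To see why the guard bits matter, consider accumulating many 32-bit products. The small host-side sketch below is illustrative only; it simply shows how the running sum grows past 32 bits and into the range covered by the AG field.

#include <stdio.h>

int main(void)
{
    /* Accumulate 256 worst-case products of two 16-bit values (0x7FFF * 0x7FFF). */
    long long acc = 0;                      /* models the 40-bit accumulator */
    int i;

    for (i = 0; i < 256; i++)
        acc += (long long)0x7FFF * 0x7FFF;  /* each product fits in 32 bits */

    /* The sum needs more than 32 bits; bits 39-32 correspond to the AG guard bits. */
    printf("acc = 0x%llX, guard bits = 0x%02X\n", acc, (unsigned)((acc >> 32) & 0xFF));
    return 0;
}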
Auxiliary registers, CDP register, and stack pointer register are all 23-bit registers. These 23-bit registers are formed by combining two independent registers (see Figure 2.8). The data in lower 16-bit portion will not carry into higher 7-bit portion of the register. The C55x processor has four system status registers: ST0 C55, ST1 C55, ST2 C55, and ST3 C55. These registers contain system control bits and flag bits. The control bits directly affect the C55x operation conditions. The flag bits report the processor current status or results. These bits are shown in Figure 2.9, and see C55x reference guides for details. 2.2.5 Interrupts and Interrupt Vector The C5510 has an interrupt vector that serves all the internal and external interrupts. Interrupt vector given in Table 2.5 lists the priorities for all internal and external interrupts. The addresses of the interrupts are the offsets from the interrupt vector pointer. Table 2.3 C5510 ROM block addresses and contents SARAM byte address range SARAM memory blocks 0xFF8000−0xFF8FFF Bootloader 0xFF9000−0xFFF9FF Reserved 0xFFFA00−0xFFFBFF Sine lookup table 0xFFFC00−0xFFFEFF Factory test code 0xFFFF00−0xFFFEFB Vector table 0xFFFFFC−0xFFFEFF ID codeJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 56 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.4 C5510 MMRs Reg. Addr. Function description Reg. Addr. Function description IER0 0x00 Interrupt mask register 0 DPH 0x2B Extended data-page pointer IFR0 0x01 Interrupt flag register 0 0x2C Reserved ST0 55 0x02 Status register 0 for C55x 0x2D Reserved ST1 55 0x03 Status register 1 for C55x DP 0x2E Memory data-page start address ST3 55 0x04 Status register 3 for C55x PDP 0x2F Peripheral data-page start address 0x05 Reserved BK47 0x30 Circular buffer size register for AR[4Ð7] ST0 0x06 ST0 (for 54x compatibility) BKC 0x31 Circular buffer size register for CDP ST1 0x07 ST1 (for 54x compatibility) BSA01 0x32 Circular buffer start addr. reg. for AR[0Ð1] AC0L 0x08 Accumulator 0 [15 0] BSA23 0x33 Circular buffer start addr. reg. for AR[2Ð3] AC0H 0x09 Accumulator 0 [31 16] BSA45 0x34 Circular buffer start addr. reg. for AR[4Ð5] AC0G 0x0A Accumulator 0 [39 32] BSA67 0x35 Circular buffer start addr. reg. for AR[6Ð7] AC1L 0x0B Accumulator 1 [15 0] BSAC 0x36 Circular buffer coefficient start addr. reg. 
AC1H 0x0C Accumulator 1 [31 16] BIOS 0x37 Data page ptr storage for 128-word data table AC1G 0x0D Accumulator 1 [39 32] TRN1 0x38 Transition register 1 T3 0x0E Temporary register 3 BRC1 0x39 Block-repeat counter 1 TRN0 0x0F Transition register BRS1 0x3A Block-repeat save 1 AR0 0x10 Auxiliary register 0 CSR 0x3B Computed single repeat AR1 0x11 Auxiliary register 1 RSA0H 0x3C Repeat start address 0 high AR2 0x12 Auxiliary register 2 RSA0L 0x3D Repeat start address 0 low AR3 0x13 Auxiliary register 3 REA0H 0x3E Repeat end address 0 high AR4 0x14 Auxiliary register 4 REA0L 0x3F Repeat end address 0 low AR5 0x15 Auxiliary register 5 RSA1H 0x40 Repeat start address 1 high AR6 0x16 Auxiliary register 6 RSA1L 0x41 Repeat start address 1 low AR7 0x17 Auxiliary register 7 REA1H 0x42 Repeat end address 1 high SP 0x18 Stack pointer register REA1L 0x43 Repeat end address 1 low BK03 0x19 Circular buffer size register RPTC 0x44 Repeat counter BRC0 0x1A Block-repeat counter IER1 0x45 Interrupt mask register 1 RSA0L 0x1B Block-repeat start address IFR1 0x46 Interrupt flag register 1 REA0L 0x1C Block-repeat end address DBIER0 0x47 Debug IER0 PMST 0x1D Processor mode status register DBIER1 0x48 Debug IER1 XPC 0x1E Program counter extension register IVPD 0x49 Interrupt vector pointer, DSP 0x1F Reserved IVPH 0x4A Interrupt vector pointer, HOST T0 0x20 Temporary data register 0 ST2 55 0x4B Status register 2 for C55x T1 0x21 Temporary data register 1 SSP 0x4C System stack pointer T2 0x22 Temporary data register 2 SP 0x4D User stack pointer T3 0x23 Temporary data register 3 SPH 0x4E Extended data-page pointer for SP and SSP AC2L 0x24 Accumulator 2 [15 0] CDPH 0x4F Main data-page pointer for the CDP AC2H 0x25 Accumulator 2 [31 16] AC2G 0x26 ccumulator 2 [39 32] CDP 0x27 Coefficient data pointer AC3L 0x28 Accumulator 3 [15 0] AC3H 0x29 Accumulator 3 [31 16] AC3G 0x2A Accumulator 3 [39 32]JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ARCHITECTURE 57 AG AH AL 39 32 31 16 15 0 Figure 2.7 TMS320C55x accumulator structure ARH CDPH SPH SP CDP ARn XARn XCDP XSP 22 16 15 0 Figure 2.8 TMS320C55x 23-bit MMRs ACOV2 ACOV3 TC1 TC2 CARRY ACOV0 ACOV1 DP [8:0] 15 14 13 12 11 10 9 ST0_55: BRAF CPL XF HM INTM M40 SATD 15 14 13 12 11 10 9 ST1_55: SXMD 8 C16 FRCT C54CM ASM [4:0] 765 ARMS RESERVED [14:13] DBGM EALLOW RDM RESERVE 15 12 11 10 9 CDPLC 8 AR7LC AR6LC AR5LC AR4LC AR3LC AR2LC AR1LC 7654 321 AR0LC 0 ST2_55: CAFRZ HINT RESERVED [11:8] 15 14 13 12 CBERR MPNMC SATA RESERVED [4:3] CLKOFF SMUL 765 21 SST 0 ST3_55: CAEN CACLR Figure 2.9 TME320C55x status registersJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 58 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.5 C5510 interrupt vector Name Offset Priority Function description RESET 0x00 0 Reset (hardware and software) MNI 0x08 1 Nonmaskable interrupt INT0 0x10 3 External interrupt #0 INT2 0x18 5 External interrupt #2 TINT0 0x20 6 Timer #0 interrupt RINT0 0x28 7 McBSP #0 receive interrupt RINT1 0x30 9 McBSP #1 receive interrupt XINT1 0x38 10 McBSP #1 transmit interrupt SINT8 0x40 11 Software interrupt #8 DMAC1 0x48 13 DMA channel #1 interrupt DSPINT 0x50 14 Interrupt from host INT3 0x58 15 External interrupt #3 RINT2 0x60 17 McBSP #2 receive interrupt XINT2 0x68 18 McBSP #2 transmit interrupt DMAC4 0x70 21 DMA channel #4 interrupt DMAC5 0x78 22 DMA channel #5 interrupt INT1 0x80 4 External interrupt #1 XINT0 0x88 8 McBSP #0 transmit interrupt DMAC0 0x90 12 DMA channel #0 interrupt INT4 0x98 16 External interrupt #4 DMAC2 0xA0 19 DMA channel 
#2 interrupt DMAC3 0xA8 20 DMA channel #3 interrupt TINT1 0xB0 23 Timer #1 interrupt INT5 0xB8 24 External interrupt #5 BERR 0xC0 2 Bus error interrupt DLOG 0xC8 25 Data log interrupt RTOS 0xD0 26 Real-time operating system interrupt SINT27 0xD8 27 Software interrupt #27 SINT28 0xE0 28 Software interrupt #28 SINT29 0xE8 29 Software interrupt #29 SINT30 0xF0 30 Software interrupt #30 SINT31 0xF8 31 Software interrupt #31 These interrupts can be enabled or disabled (masked) by the interrupt enable registers, IER0 and IER1. The interrupt flag registers, IFR0 and IFR1, indicate if an interrupt has occurred. The interrupt enable bits and flag bits assignments are given in Figure 2.10. When a flag bit of the IFR is set to 1, it indicates an interrupt has happened and that interrupt is pending to be served. 2.3 TMS320C55x Peripherals In this section, we use the C5510 as an example to introduce some commonly used peripherals of the C55x processors. The C5510 consists of the following peripherals and the functional block diagram is shown in Figure 2.11. r an external memory interface (EMIF); r a six-channel DMA controller; r a 16-bit parallel enhanced host-port interface (EHPI);JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X PERIPHERALS 59 DMAC5 RINT2 INT3 DSPINT DMAC1 15 12 11 10 9 RESERV 8 XINT1 RINT1 RINT0 TINT0 INT2 INT0 RESERVED [1:0] 765432 IFR0/IER DMAC4 XINT2 14 13 RESERVED [15:11] RTOS DLOG 10 9 BER 8 INT5 TINT1 DMAC3 DMAC2 INT4 DMAC0 XINT0 7654321 INT1 0 IFR1/IER:1 Figure 2.10 TMS320C55x interrupt enable and flag registers EHPI C55X CPU DMA controller EMIF Peripheral controller Power management ClockGPIO Timer McBSP Figure 2.11 TMS320C55x functional blocksJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 60 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR r a digital phase-locked loop (DPLL) clock generator; r two timers; r three multichannel buffered serial ports (McBSP); and r eight configurable general-purpose I/O (GPIO) pins; 2.3.1 External Memory Interface The C55x EMIF connects the processor with external memory devices. The memory devices can be ROM, Flash, SRAM, synchronous burst SRAM (SBSRAM), and synchronous DRAM (SDRAM). The EMIF supports the program and data memory accesses at 32, 16, or 8 bit. The C55x external memory is divided into four spaces according to chip enable (CE) settings (see Figure 2.12). The highest address block, 0xFF8000Ð0xFFFFFF, can be configured either as a continuous CE3 space or shared by internal processor ROM. The configuration depends upon the C55x status register MPNMC bit selection. A memory device must be physically connected to the proper CE pin of the EMIF. For example, an SDRAM memory’s chip select pin must be connected to EMIF CE1 pin in order to be used in the CE1 memory space. The EMIF is managed by EMIF registers. The C5510 EMIF registers are given in Table 2.6. Each CE space can support either asynchronous or synchronous memory. For asynchronous memory, it can be a 32-, 16-, or 8-bit device. For synchronous memory, it supports SDRAM and SBSDRAM. More detailed description can be found in reference [7]. 2.3.2 Direct Memory Access The DMA is used to transfer data between the internal memory, external memory, and peripherals. Since the DMA data transfer is independent of the CPU, the C55x processor can simultaneously perform processing tasks at foreground while DMA transfers data at background. 
There are six DMA channels on External memory CE1 space CE0 space 0x05 0000−0x3F FFFF CE3 space CE2 space 0x40 0000−0x7F FFFF 0x80 0000−0xBF FFFF 0xC0 0000−0xFF 7FFF 0xFF 8000−0xFF FFFF Byte addresses Figure 2.12 TMS320C55x EMIFJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X PERIPHERALS 61 Table 2.6 C5510 EMIF registers Register Address Function description EGCR 0x0800 Global control register EMI RST 0x0801 Global reset register EMI BE 0x0802 Bus error status register CE0 1 0x0803 CE0 space control register 1 CE0 2 0x0804 CE0 space control register 2 CE0 3 0x0805 CE0 space control register 3 CE1 1 0x0806 CE1 space control register 1 CE1 2 0x0807 CE1 space control register 2 CE1 3 0x0808 CE1 space control register 3 CE2 1 0x0809 CE2 space control register 1 CE2 2 0x080A CE2 space control register 2 CE2 3 0x080B CE2 space control register 3 CE3 1 0x080C CE3 space control register 1 CE3 2 0x080D CE3 space control register 2 CE3 3 0x080E CE3 space control register 3 SDC1 0x080F SDRAM control register SDPER 0x0810 SDRAM period register SDCNT 0x0811 SDRAM counter register INIT 0x0812 SDRAM init register SDC2 0x0813 SDRAM control register 2 the C55x processors, thus allowing up to six different operations. Each DMA channel has its own interrupt associated for event control. The DMA uses four standard ports for DARAM, SARAM, peripherals, and external memory. Each DMA channel’s priority can be programmed independently. The data transfer source and destination addresses are programmable to provide more flexibility. Table 2.7 lists the DMA synchronization events. Event sync is determined by the SYNC field in DMA channel control register, DMA CCR. Channel 1Ð5 configuration registers are in the same order as channel 0. The DMA global registers and channel 0 configuration registers are given in Table 2.8. 2.3.3 Enhanced Host-Port Interface The C5510 has an EHPI that allows a host processor to access C55x’s internal DARAM and SARAM as well as portions of the external memory within its 20-bit address range. 
The range of 16-bit data access Table 2.7 C5510 DMA synchronization events SYNC field SYNC event SYNC field SYNC event 0x00 No sync event 0x0B Reserved 0x01 McBSP0 receive event REVT0 0x0C Reserved 0x02 McBSP0 transmit event XEVT0 0x0D Timer0 event 0x03 Reserved 0x0E Timer1 event 0x04 Reserved 0x0F External interrupt 0 0x05 McBSP1 receive event REVT0 0x10 External interrupt 1 0x06 McBSP1 transmit event XEVT0 0x11 External interrupt 2 0x07 Reserved 0x12 External interrupt 3 0x08 Reserved 0x13 External interrupt 4 0x09 McBSP2 receive event REVT0 0x14 External interrupt 5 0x0A McBSP2 transmit event XEVT0 ReservedJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 62 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.8 C5510 DMA configuration registers (channel 0 only) Register Address Function description Global Register DMA GCR 0x0E00 DMA global control register DMA GSCR 0x0E02 EMIF bus error status register Channel #0 Registers DMA CSDP0 0x0C00 DMA channel 0 source/destination parameters register DMA CCR0 0x0C01 DMA channel 0 control register DMA CICR0 0x0C02 DMA channel 0 interrupt control register DMA CSR0 0x0C03 DMA channel 0 status register DMA CSSA L0 0x0C04 DMA channel 0 source start address register (low bits) DMA CSSA U0 0x0C05 DMA channel 0 source start address register (up bits) DMA CDSA L0 0x0C06 DMA channel 0 source destination address register (low bits) DMA CDSA U0 0x0C07 DMA channel 0 source destination address register (up bits) DMA CEN0 0x0C08 DMA channel 0 element number register DMA CFN0 0x0C09 DMA channel 0 frame number register DMA CSFI0 0x0C0A DMA channel 0 source frame index register DMA CSEI0 0x0C0B DMA channel 0 source element index register DMA CSAC0 0x0C0C DMA channel 0 source address counter DMA CDAC0 0x0C0D DMA channel 0 destination address counter DMA CDEI0 0x0C0E DMA channel 0 destination element index register DMA CDFI0 0x0C0F DMA channel 0 destination frame index register starts from word address 0x00030 to 0xFFFFF, except the spaces for MMRs and peripheral registers. The address auto-increment capability improves the data transfer efficiency. The EHPI provides a 16-bit parallel data access between the host processor and the DSP processor. The data transfer is handled by the DMA controller. There are two configurations: nonmultiplexed mode and multiplexed mode. For the nonmultiplexed mode, the EHPI uses separated address and data buses while the multiplexed mode shares the same bus for both address and data. In order to pass data between C55x’s peripherals (or MMRs) and host processor, the data must be first transferred to a shared memory that can be accessed by both the host processor and the DSP processor. 2.3.4 Multichannel Buffered Serial Ports TMS320C55x processors use McBSP for direct serial interface with other serial devices connected to the system. The McBSP has the following key features: r full-duplex communication; r double-buffered transmission and triple-buffered reception; r independent clocking and framing for receiving and transmitting; r support external clock generation and sync frame signal; r programmable sampling rate for internal clock generation and sync frame signal;JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X PERIPHERALS 63 r support data size of 8, 12, 16, 20, 24, and 32 bits; and r ability of performing μ-law and A-law companding. The McBSP functional block diagram is shown in Figure 2.13. In the receive path, the incoming data is triple buffered. 
This allows one buffer to receive the data while other two buffers to be used by the processor. In the transmit path, double-buffer scheme is used. This allows one buffer of data to be transmitted out while the other buffer to be filled with new data for transmission. If the data width is 16-bit or less, only one 16-bit register will be used at each stage. These registers are DDR1, RBR1, and RSR1 in the receiving path, and DXR1 and XSR1 in the transmit path. When the data size is greater than 16 bits, two registers will be used at each stage of the data transfer. We will use the most commonly used 16-bit data transfer as an example to explain the functions of McBSP. When a receive data bit arrives at C55x’s DR pin, it will be shifted into the receive shift register (RSR). After all 16 receive bits are shifted into RSR, the whole word will be copied to the receive buffer register (RBR) if the previous data word in the RBR has already been copied. After the previous data in the data receive register (DRR) has been read, the RBR will copy its data to DRR for the processor or DMA to read. The data transmit process starts by the processor (or DMA controller) writing a data word to the data transmit register (DXR). After the last bit in the transmit shift register (XSR) is shifted out through the DX pin, the data in DXR will be copied into XSR. Using the McBSP’s hardware companding feature, the linear data word can be compressed into 8-bit byte according to either μ-law or A-law standard while the received μ-law or A-law 8-bit data can be expanded to 16-bit linear data. The companding algorithm follows ITU G.711 recommendation. XSR[1, 2] RSR[1, 2] RBR[1, 2] Compand DXR[1, 2] DRR[1, 2] 2 MCRs 8 RCERs 8 XCERs 2 SPCRs 2 RCRs 2 XCRs 2 SRGRs PCR Registers for multichannel control and monitoring Registers for data, clock, and frame sync control and monitoring XINT RINT XEVT REVT XEVTA REVTA DX DR CLKX CLKR FSX FSR CLKS Interrupt Events To DMA To CPU Figure 2.13 TMS320C55x McBSP functional blockJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 64 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR As shown in Figure 2.13, the McBSP will send interrupt notification to the processor via XINT or RINT interrupt, and send important events to the DMA controller via REVT, XEVT, REVTA, and XEVTA. These pins are summarized as follows: RINTÐreceive interrupt: The McBSP can send receive interrupt request to C55x according to a preselected condition in the receiver of the McBSP. XINT Ð transmit interrupt: The McBSP can send transmit interrupt request to C55x according to a preselected condition in the transmitter of the McBSP. REVT Ð receive synchronization event: This signal is sent to the DMA controller when a data is received in the DRR. XEVT Ð transmit synchronization event: This signal is sent to the DMA controller when the DXR is ready to accept the next serial word data. The C55x has three McBSPs: McBSP0, McBSP1, and McBSP2. Each McBSP has 31 registers. Table 2.9 lists the registers for McBSP0 as an example. 
Table 2.9 Registers and addresses for the McBSP0 Register Address Function description DRR2 0 0x2800 McBSP0 data receive register 2 DRR1 0 0x2801 McBSP0 data receive register 1 DXR2 0 0x2802 McBSP0 data transmit register 2 DXR1 0 0x2803 McBSP0 data transmit register 1 SPCR2 0 0x2804 McBSP0 serial port control register 2 SPCR1 0 0x2805 McBSP0 serial port control register 1 RCR2 0 0x2806 McBSP0 receive control register 2 RCR1 0 0x2807 McBSP0 receive control register 1 XCR2 0 0x2808 McBSP0 transmit control register 2 XCR1 0 0x2809 McBSP0 transmit control register 1 SRGR2 0 0x280A McBSP0 sample rate generator register 2 SRGR1 0 0x280B McBSP0 sample rate generator register 1 MCR2 0 0x280C McBSP0 multichannel register 2 MCR1 0 0x280D McBSP0 multichannel register 1 RCERA 0 0x280E McBSP0 receive channel enable register partition A RCERB 0 0x280F McBSP0 receive channel enable register partition B XCERA 0 0x2810 McBSP0 transmit channel enable register partition A XCERB 0 0x2811 McBSP0 transmit channel enable register partition B PCR0 0x2812 McBSP0 pin control register RCERC 0 0x2813 McBSP0 receive channel enable register partition C RCERD 0 0x2814 McBSP0 receive channel enable register partition D XCERC 0 0x2815 McBSP0 transmit channel enable register partition C XCERD 0 0x2816 McBSP0 transmit channel enable register partition D RCERE 0 0x2817 McBSP0 receive channel enable register partition E RCERF 0 0x2818 McBSP0 receive channel enable register partition F XCERE 0 0x2819 McBSP0 transmit channel enable register partition E XCERF 0 0x281A McBSP0 transmit channel enable register partition F RCERG 0 0x281B McBSP0 receive channel enable register partition G RCERH 0 0x281C McBSP0 receive channel enable register partition H XCERG 0 0x281D McBSP0 transmit channel enable register partition G XCERH 0 0x281E McBSP0 transmit channel enable register partition HJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ADDRESSING MODES 65 Table 2.10 Registers and addresses for clock generator and timers Register Address Function description Clock Generator Register CLKMD 0x1C00 DMA global control register Timer Registers TIM0 0x1000 Timer0 count register PRD0 0x1001 Timer0 period register TCR0 0x1002 Timer0 timer control register PRSC0 0x1003 Timer0 timer prescale register TIM1 0x2400 Timer1 count register PRD1 0x2401 Timer1 period register TCR1 0x2402 Timer1 timer control register PRS10 0x2403 Timer1 timer prescale register 2.3.5 Clock Generator and Timers The C5510 has a clock generator and two general-purpose timers. The clock generator takes an input clock signal from the CLKIN pin, and modifies this signal to generate the output clock signal for processor, peripherals, and other modules inside the C55x. The output clock signal is called the DSP (CPU) clock, which can be sent out via the CLKOUT pin. The clock generator consists of a DPLL circuit for high precision clock signal. An important feature of the clock generator is its idle mode for power conservation applications. The TMS320C55x has two general-purpose timers. Each timer has a dynamic range of 20 bits. The registers for clock generator and timers are listed in Table 2.10. 2.3.6 General Purpose Input/Output Port TMS320C55x has a GPIO port, which consists of two I/O port registers, an I/O direction register, and an I/O data register. The I/O direction register controls the direction of a particular I/O pin. The C55x has eight I/O pins. Each can be independently configured as input or output. At power up, all the I/O pins are set as inputs. 
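To make the GPIO configuration concrete, the following short C fragment is a minimal sketch and is not taken from the book's experiment code. It assumes the IODIR (0x3400) and IODATA (0x3401) register addresses from Table 2.11, the TI C55x compiler's ioport qualifier for I/O-space access, and that writing a 1 to an IODIR bit configures the corresponding pin as an output.

#define IODIR_ADDR   0x3400    /* GPIO direction register (Table 2.11) */
#define IODATA_ADDR  0x3401    /* GPIO data register (Table 2.11)      */

ioport volatile unsigned short *iodir  = (ioport volatile unsigned short *)IODIR_ADDR;
ioport volatile unsigned short *iodata = (ioport volatile unsigned short *)IODATA_ADDR;

void gpioToggleBit0(void)
{
    *iodir  |= 0x0001;         /* assumed polarity: 1 = output for pin 0 */
    *iodata ^= 0x0001;         /* toggle the level driven on pin 0       */
}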
The I/O registers are listed in Table 2.11. 2.4 TMS320C55x Addressing Modes The TMS320C55x can address 16 Mbytes of memory space using the following addressing modes: r direct addressing mode; r indirect addressing mode; r absolute addressing mode; Table 2.11 C5510 GPIO registers Register Address Function description IODIR 0x3400 GPIO direction register IODATA 0x3401 GPIO data registerJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 66 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.12 C55x mov instruction with different operand forms Instruction Description 1. mov #k,dst Load the 16-bit signed constant k to the destination register dst 2. mov src,dst Load the content of source register src to the destination register dst 3. mov Smem,dst Load the content of memory location Smem to the destination register dst 4. mov Xmem,Ymem,ACx The content of Xmem is loaded into the lower part of ACx, while the content of Ymem is sign extended and loaded into upper part of ACx 5. mov dbl(Lmem),pair(TAx) Load upper 16-bit data and lower 16-bit data from Lmem to the TAx and TA(x+1), respectively 6. amov #k23,xdst Load the effective address of k23 (23-bit constant) into extended destination register (xdst) r MMR addressing mode; r register bits addressing mode; and r circular addressing mode. To explain these different addressing modes, Table 2.12 lists the move (mov) instruction with different syntaxes. As illustrated in Table 2.12, each addressing mode uses one or more operands. Some of the operand types are explained as follows: r Smem means a short data word (16-bit) from data memory, I/O memory, or MMRs. r Lmem means a long data word (32-bit) from either data memory space or MMRs. r Xmem and Ymem are used by an instruction to perform two 16-bit data memory accesses simul- taneously. r src and dst are source and destination registers, respectively. r #k is a signed immediate constant; for example, #k16 is a 16-bit constant ranging from Ð32768 to 32767. r dbl is a memory qualifier for memory access for a long data word. r xdst is an extended register (23-bit). 2.4.1 Direct Addressing Modes There are four types of direct addressing modes: data-page pointer (DP) direct, stack pointer (SP) direct, register-bit direct, and peripheral data-page pointer (PDP) direct.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ADDRESSING MODES 67 DP (16bits) @x (7bits) DPH (7bits) DP direct address (23bits) + XDP Figure 2.14 Using the DP-direct addressing mode to access variable x The DP-direct addressing mode uses the main data page specified by the 23-bit extended data-page pointer (XDP). Figure 2.14 shows a generation of DP-direct address. The upper 7-bit DPH determines the main data page (0-127), and the lower 16-bit DP defines the starting address in the data page selected by the DPH. The instruction contains a 7-bit offset in the data page (@x) that directly points to the variable x(Smem). The data-page registers DPH, DP, and XDP can be loaded by the mov instruction as mov #k7,DPH ; Load DPH with a 7-bit constant k7 mov #k16,DP ; Load DP with a 16-bit constant k16 The first instruction loads the high portion of the extended data-page pointer, DPH, with a 7-bit constant k7 to set up the main data page. The second instruction initializes the starting address of the DP. Example 2.1 shows how to initialize the DPH and DP pointers. 
Example 2.1: Instruction mov #0x3,DPH mov #0x0100,DP DPH 0 DPH 03 DP 0000 DP 0100 Before instruction After instruction The XDP also can be initialized in one instruction using a 23-bit constant as amov #k23,XDP ; Load XDP with a 23-bit constant The syntax used in the assembly code is amov #k23,xdst, where #k23 is a 23-bit address, the destination xdst is an extended register. Example 2.2 initializes the XDP to data page 1 with starting address 0x4000. Example 2.2: Instruction amov #0x14000, XDP DPH 0 DPH 01 DP 0000 DP 4000 Before instruction After instructionJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 68 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR PDP Upper (9bits) Lower (7bits) PDP-direct address (16bits) @x (7bits)+ Figure 2.15 Using PDP-direct addressing mode to access variable x The following code shows how to use DP-direct addressing mode: X .set 0x1FFEF mov# 0x1,DPH ; Load DPH with 1 mov# 0x0FFEF,DP ; Load DP with starting address .dp X mov# 0x5555,@X ; Store 0x5555 to memory location X mov# 0xFFFF,@(X+5) ; Store 0xFFFF to memory location X+5 In this example, the symbol @ tells the assembler that this instruction uses the direct addressing mode. The directive .dp indicates the base address of the variable X without using memory space. The SP-direct addressing mode is similar to the DP-direct addressing mode. The 23-bit address can be formed with the XSP in the same way as XDP. The upper 7 bits (SPH) select the main data page and the lower 16 bits (SP) determine the starting address. The 7-bit stack offset is contained in the instruction. When SPH = 0 (main page 0), the stack must not use the reserved memory space for MMRs from address 0 to 0x5F. The I/O space addressing mode only has 16-bit addressing range. The 512 peripheral data pages are selected by the upper 9 bits of the PDP register. The 7-bit offset in the lower portion of the PDP register determines the location inside the selected peripheral data page as illustrated in Figure 2.15. 2.4.2 Indirect Addressing Modes There are four types of indirect addressing modes. The AR-indirect mode uses one of the eight aux- iliary registers as a pointer to data memory, I/O space, and MMRs. The dual AR indirect mode uses two auxiliary registers for dual data memory accesses. The CDP indirect mode uses the CDP for point- ing to coefficients in data memory space. The coefficient-dual-AR indirect mode uses the CDP and the dual AR indirect modes for generating three addresses. The indirect addressing is the most fre- quently used addressing mode. It provides powerful pointer update and modification schemes as listed in Table 2.13. The AR-indirect addressing mode uses auxiliary registers (AR0ÐAR7) to point to data memory space. The upper 7 bits of the XAR point to the main data page while the lower 16 bits point to a data location in that page. Since the I/O-space address is limited to a 16-bit range, the upper portion of the XAR must be set to zero when accessing I/O space. The maximum block size (32 K words) of the indirect addressing mode is limited by using 16-bit auxiliary registers. 
Example 2.3 uses the indirect addressing mode to copy the data stored in data memory, pointed by AR0, to the destination register AC0.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ADDRESSING MODES 69 Table 2.13 AR and CDP indirect addressing pointer modification schemes Operand ARn/CDP pointer modifications *ARn or *CDP ARn (or CDP) is not modified *ARn± or *CDP± ARn (or CDP) is modified after the operation by: ±1 for 16-bit operation (ARn=ARn ±1) ±2 for 32-bit operation (ARn=ARn±2) *ARn(#k16) or *CDP(#k16) ARn (or CDP) is not modified The signed 16-bit constant k16 is used as the offset from the base pointer ARn (or CDP) *+ARn(#k16) ARn (or CDP) is modified before the operation or *+CDP(#k16) The signed 16-bit constant k16 is added as the offset to the base pointer ARn (or CDP) before generating new address *(ARn±T0/T1) ARn is modified after the operation by ±16-bit content in T0 or T1,(ARn=ARn±T0/T1) *ARn(T0/T1) ARn is not modified T0 or T1 is used as the offset for the base pointer ARn Example 2.3: Instruction mov *AR0, AC0 AC0 00 0FAB 8678 AC0 00 0000 12AB AR0 0100 AR0 0100 Data memory Data memory 0x100 12AB 0x100 12AB Before instruction After instruction The dual AR indirect addressing mode allows two data memory accesses through the auxiliary registers. It can access two 16-bit data in memory using the syntax mov Xmem,Ymem,ACx given in Table 2.12. Example 2.4 performs two 16-bit data loads with AR2 and AR3 as the data pointers to Xmem and Ymem, respectively. The data pointed at by AR3 is sign extended to 24 bits, loaded into the upper portion of the destination accumulator AC0(39:16), and the data pointed at by AR2 is loaded into the lower portion of AC0(15:0). The data pointers AR2 and AR3 are also updated. Example 2.4: Instruction mov *AR2+, *AR3-, AC0 AC0 FF FFAB 8678 AC0 00 3333 5555 AR2 0100 AR2 0101 AR3 0300 AR3 02FF Data memory Data memory 0x100 5555 0x100 5555 0x300 3333 0x300 3333 Before instruction After instruction The extended coefficient data pointer (XCDP) is the concatenation of the CDPH (the upper 7 bits) and the CDP (the lower 16 bits). The CDP-indirect addressing mode uses the upper 7 bits to define theJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 70 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR main data page and the lower 16 bits to point to the memory location within the specified data page. Example 2.5 uses the CDP-indirect addressing mode where CDP contains the address of the coefficient in data memory. This instruction first increases the CDP pointer by 2, then loads a coefficient pointed by the updated coefficient pointer to the destination register AC3. Example 2.5: Instruction mov *+CDP (#2), AC3 AC3 00 0FAB EF45 AC3 00 0000 5631 CDP 0400 CDP 0402 Data memory Data memory 0x402 5631 0x402 5631 Before instruction After instruction 2.4.3 Absolute Addressing Modes The memory can also be addressed using either k16 or k23 absolute addressing mode. The k23 absolute mode specifies an address as a 23-bit unsigned constant. Example 2.6 shows an example of loading the data content at address 0x1234 on main data page 1 into the temporary register T2, where the symbol *( ) represents the absolute addressing mode. Example 2.6: Instruction mov *(#x011234), T2 T2 0000 T2 FFFF Data memory Data memory 0x01 1234 FFFF 0x01 1234 FFFF Before instruction After instruction The k16 absolute addressing mode uses the operand *abs(#k16), where k16 is a 16-bit unsigned constant. 
In this mode, the DPH (7-bit) is forced to zero and concatenated with the unsigned constant k16 to form a 23-bit data space memory address. The I/O absolute addressing mode uses the operand port(#k16). The absolute address can also be the variable name such as the variable x in the following example: mov *(x),AC0 This instruction loads the accumulator AC0 with a content of variable x. When using absolute addressing mode, we do not need to worry about the DP. The drawback is that it needs more code space to represent the 23-bit addresses. 2.4.4 Memory Mapped Register Addressing Mode The absolute, direct, and indirect addressing modes can be used to address MMRs, which are located in the data memory from address 0x0 to 0x5F on the main data page 0 as shown in Figure 2.6. To access MMRs using the k16 absolute operand, the DPH must be set to zero. Example 2.7 uses the absolute addressing mode to load the 16-bit content of AR2 into the temporary register T2.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X ADDRESSING MODES 71 Example 2.7: Instruction mov *abs16(#AR2), T2 AR2 1357 AR2 1357 T2 0000 T2 1357 Before instruction After instruction For the MMR-direct addressing mode, the DP-direct addressing mode must be selected. Example 2.8 uses direct addressing mode to load the content of the lower portion of the accumulator AC0 (15:0) into the temporary register T0. When the mmap() qualifier is used for the MMR-direct addressing mode, it forces the data-address generator to access the main data page 0. That is, XDP = 0. Example 2.8: Instruction mov mmap16(@AC0L), T0 AC0 00 12DF 0202 AC0 00 12DF 0202 T0 0000 T0 0202 Before instruction After instruction Accessing the MMRs using indirect addressing mode is the same as addressing the data memory space. Since the MMRs are located in data page 0, the XAR and XCDP must be initialized to page 0 by setting the upper 7 bits to zero. The following instructions load the content of AC0 into T1 and T2 registers: amov #AC0H,XAR6 mov *AR6-,T2 mov *AR6+,T1 In this example, the first instruction loads the effective address of the upper portion of the accumulator AC0 (AC0H, located at address 0x9 on page 0) to XAR6. That is, XAR6 = 0x000009. The second instruction uses AR6 as a pointer to copy the content of AC0H into the T2 register, and then the pointer was decremented by 1 to point to the lower portion of AC0 (AC0L, located at address 0x8). The third instruction copies the content of AC0L into the register T1 and modifies AR6 to point to AC0H again. 2.4.5 Register Bits Addressing Mode Both direct and indirect addressing modes can be used to address a bit or a pair of bits in a specific register. The direct addressing mode uses a bit offset to access a particular register’s bit. The offset is the number of bits counting from the LSB. The instruction of register-bit direct addressing mode is shown in Example 2.9. The bit test instruction btstp will update the test condition bits (TC1 and TC2) of the status register ST0. 
Example 2.9: Instruction btstp @30, AC1 AC1 00 7ADF 3D05 AC1 00 7ADF 3D05 TC1 0 TC1 1 TC2 0 TC2 0 Before instruction After instructionJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 72 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR We can also use the indirect addressing modes to specify register bit(s) as follows: mov# 2,AR4 ; AR4 contains the bit offset 2 bset* AR4,AC3 ; Set the AC3 bit pointed by AR4 to 1 btstp* AR4,AC1 ; Test AC1 bit-pair pointed by AR4 The register bits addressing mode supports only the bit test, bit set, bit clear, and bit complement instruc- tions in conjunction with the accumulators (AC0ÐAC3), auxiliary registers (AR0ÐAR7), and temporary registers (T0ÐT3). 2.4.6 Circular Addressing Mode Circular addressing mode updates data pointers in modulo fashion for accessing data buffers continuously without resetting the pointers. When the pointer reaches the end of the buffer, it will wrap back to the beginning of the buffer for the next iteration. Auxiliary registers (AR0ÐAR7) and the CDP can be used as circular pointers in indirect addressing mode. The following steps are used to set up circular buffers: 1. Initialize the most significant 7 bits of XAR (ARnH or CDPH) to select the main data page for a circular buffer. For example, mov #k7,AR2H. 2. Initialize the 16-bit circular pointer (ARn or CDP). The pointer can point to any memory location within the buffer. For example, mov #k16,AR2. The initialization of the address pointers in the examples of steps 1 and 2 can be combined using the single instruction: amov #k23,XAR2. 3. Initialize the 16-bit circular buffer starting address register (BSA01, BSA23, BSA45, BSA67, or BSAC) associated with the auxiliary registers. For example, if AR2 (or AR3) is used as the circu- lar pointer, we have to use BSA23 and initialize it using mov #k16,BSA23. The main data page concatenated with the content of this register defines the 23-bit starting address of the circular buffer. 4. Initialize the data buffer size register (BK03, BK47, or BKC). When using AR0ÐAR3 (or AR4ÐAR7) as the circular pointer, BK03 (or BK47) should be initialized. The instruction mov #16,BK03 sets up a circular buffer of 16 elements for the auxiliary registers AR0ÐAR3. 5. Enable the circular buffer by setting the appropriate bit in the status register ST2. For example, the instruction bset AR2LC enables AR2 for circular addressing. The following example demonstrates how to initialize a circular buffer COEFF[4] with four integers, and how to use the circular addressing mode to access data in the buffer: amov#COEFF,XAR2 ; Main data page for COEFF[4] mov#COEFF,BSA23 ; Buffer base address is COEFF[0] mov#0x4,BK03 ; Set buffer size of 4 words mov#2,AR2 ; AR2 points to COEFF[2] bset AR2LC ; AR2 is configured as circular pointer mov*AR2+,T0 ; T0 is loaded with COEFF[2] mov*AR2+,T1 ; T1 is loaded with COEFF[3] mov*AR2+,T2 ; T2 is loaded with COEFF[0] mov*AR2+,T3 ; T3 is loaded with COEFF[1]JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 PIPELINE AND PARALLELISM 73 Since the circular addressing uses the indirect addressing modes, the circular pointers can be updated using the modifications listed in Table 2.13. The applications of using circular buffers for FIR filtering will be introduced in Chapter 4. 2.5 Pipeline and Parallelism The pipeline technique has been widely used to improve DSP processors’ performance. 
The pipeline execution breaks a sequence of operations into smaller segments and efficiently executes these smaller pieces in parallel to reduce the overall execution time. 2.5.1 TMS320C55x Pipeline The C55x has two independent pipelines as illustrated in Figure 2.16: the program fetch pipeline and the program execution pipeline. The numbers on the top of the diagram represent the CPU clock cycle. The program fetch pipeline consists of the following three stages: PA (program address): Instruction unit places the program address on the PAB. PM (program memory address stable): The C55x requires one clock cycle for its program memory address bus to be stabilized before that memory can be read. PB (program fetch from program data bus): Four bytes of the program code are fetched from the program memory via the 32-bit PB. The code is placed into the instruction buffer queue (IBQ). At the same time, the seven-stage execution pipeline independently performs the sequence of fetch, decode, address, access, read, and execution. The C55x execution pipeline stages are summarized as follows: F (fetch): An instruction is fetched from the IBQ. The size of the instruction varies from 1 byte up to 6 bytes. D (decode): Decode logic decodes these bytes as an instruction or a parallel instruction pair. The decode logic will dispatch the instruction to the PU, AU, or DU. IBQ F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X F D AD AC1AC2 R X PA PM PB PA PM PB PA PM PB PA PM PB 12345 12345 6 7 8 9 10 11 12 13 14 15 Execution pipeline: F −Fetch from IBQ D−Decode AD−Address AC1−Access 1 AC2−Access 2 R−Read X−Execute Fetch pipeline: PA−P-address PM−P-memory PB−Fetch to IBQ 64x8 1−6bytes4bytes Figure 2.16 The C55x fetch and execution pipelinesJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 74 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR AD (address): AU calculates data addresses using its data-address generation unit, modifies pointers if required, and computes the program space address for PC-relative branching instructions. AC (access cycle 1 and 2): The first cycle is used to send the addresses to the data-read address buses (BAB, CAB, and DAB) for read operations, or transfer an operand to the processor via the CB. The second cycle is inserted to allow the address lines to be stabilized before the memory is read. R (read): Data and operands are transferred to the processor via the CB for the Ymem operand, the BB for the Cmem operand, and the DB for the Smem or Xmem operands. For reading the Lmem operand, both the CB and the DB are used. The AU will generate the address for the operand write and send the address to the data-write address buses (EAB and FAB). X (execute): Most data processing operations are done in this stage. The ALU inside the AU as well as the ALU and dual MAC inside the DU perform data processing, store an operand via the FB, or store a long operand via the EB and FB. Figure 2.16 shows that the execution pipeline will be full after seven cycles, and every cycle that follows will complete the execution of one instruction. If the pipeline is always full, this technique increases the processing speed seven times. However, when a disturbing execution such as a branch instruction occurs, it breaks the sequential pipeline. Under such circumstances, the pipeline will be flushed and will need to be refilled. This is called pipeline breakdown. 
The use of IBQ can minimize the impact of pipeline breakdown. Proper use of conditional execution instructions to replace branch instructions can also reduce the pipeline breakdown. 2.5.2 Parallel Execution The TMS320C55x uses multiple-bus architecture, dual MAC units, and separated PU, AU and DU for parallel execution. The C55x supports two types of parallel processing: implied (built-in) and explicit (user-built). The implied parallel instructions use the parallel columns symbol ‘::’ to separate the pair of instructions that will be processed in parallel. The explicit parallel instructions use the parallel bar symbol ‘||’ to indicate the pair of parallel instructions. These two types of parallel instructions can be used together to form a combined parallel instruction. The following examples show the user-built, built-in, and combined parallel instructions that can be carried out in just one clock cycle. User-built: mpym *AR1+,*AR2+,AC0 ; User-built parallel instruction || and AR4,T1 ; using DU and AU Built-in: mac *AR2-,*CDP-,AC0 ; Built-in parallel instruction :: mac *AR3+,*CDP-,AC1 ; using dual-MAC units Built-in and user-built combination: mpy *AR2+,*CDP+,AC0 ; Combined parallel instruction :: mpy *AR3+,*CDP+,AC1 ; using dual-MAC units and PU || rpt #15JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 PIPELINE AND PARALLELISM 75 Table 2.14 Partial list of the C55x registers and buses PU registers/buses AU registers/buses DU registers/buses RPTC T0, T1, T2, T3 AC0, AC1, AC2, AC3 BRC0, BRC1 AR0, AR1, AR2, AR3, TRN0, TRN1 RSA0, RSA1 AR4, AR5, AR6, AR7 REA0, REA1 CDP BSA01, BSA23, BSA45, BSA67 BK01, BK23, BK45, BK67 Read buses: CB, DB Read buses: CB, DB Read buses: BB, CB, DB Write buses: EB, FB Write buses: EB, FB Write buses: EB, FB Some of the restrictions for using parallel instructions are summarized as follows: r For either the user-built or the built-in parallelism, only two instructions can be executed in parallel, and these two instructions must not exceed 6 bytes. r Not all instructions can be used for parallel operations. r When addressing memory space, only the indirect addressing mode is allowed. r Parallelism is allowed between and within execution units, but there cannot be any hardware resources conflicts between units, buses, or within the unit itself. There are several restrictions that define the parallelism within each unit when applying parallel operations in assembly code. The detailed descriptions are given in the TMS320C55x DSP Mnemonic Instruction Set Reference Guide. The PU, AU, and DU can be involved in parallel operations. Understanding the register files and buses in each of these units will help to be aware of the potential conflicts when using the parallel instructions. Table 2.14 lists some of the registers and buses in PU, AU, and DU. The parallel instructions used in the following example are incorrect because the second instruction uses the direct addressing mode: mov *AR2,AC0 || mov T1,@x We can correct this problem by replacing the direct addressing mode, @x, with an indirect addressing mode, *AR1, so both memory accesses are using indirect addressing mode as follows: mov *AR2,AC0 || mov T1,*AR1 Consider the following example where the first instruction loads the content of AC0 that resides inside the DU to the auxiliary register AR2 inside the AU. The second instruction attempts to use the content of AC3 as the program address for a function call. 
Because there is only one link between AU and DU, when both instructions try to access the accumulators in the DU via the single link, it creates a conflict. mov AC0,AR2 || call AC3JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 76 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR To solve this problem, we can change the subroutine call from call by accumulator to call by address as follows: mov AC0,AR2 || call my_func This is because the instruction call my_func uses only the PU. The coefficient-dual-AR indirect addressing mode is used to perform operations with dual-AR indirect addressing mode. The coefficient indirect addressing mode supports three simultaneous memory accesses (Xmem, Ymem, and Cmem). The FIR filter (will be introduced in Chapter 4) is one of the applications that can effectively use coefficient indirect addressing mode. The following code is an example of using the coefficient indirect addressing mode: mpy *AR2+,*CDP+,AC2 ; AR1 pointer to data x1 :: mpy *AR3+,*CDP+,AC3 ; AR2 pointer to data x2 ||rpt #6 ; Repeat the following 7 times mac *AR2+,*CDP+,AC2 ; AC2 has accumulated result :: mac *AR3+,*CDP+,AC3 ; AC3 has another result In this example, the memory buffers Xmem and Ymem are pointed at by AR2 and AR3, respectively, while the coefficient array is pointed at by CDP. The multiplication results are added with the contents in the accumulators AC2 and AC3. 2.6 TMS320C55x Instruction Set In this section, we will introduce more C55x instructions for DSP applications. In general, we can divide the instruction set into four categories: arithmetic, logic and bit manipulation, move (load and store), and program flow control instructions. 2.6.1 Arithmetic Instructions Arithmetic instructions include addition (add), subtraction (sub), and multiplication (mpy). The combina- tion of these basic operations produces powerful subset of instructions such as the multiplyÐaccumulation (mac) and multiplyÐsubtraction (mas) instructions. Most arithmetic operations can be executed condition- ally. The C55x also supports extended precision arithmetic such as add-with-carry, subtract-with-borrow, signed/signed, signed/unsigned, and unsigned/unsigned instructions. In Example 2.10, the instruction mpym multiplies the data pointed by AR1 and CDP, stores the product in the accumulator AC0, and updates AR1 and CDP after the multiplication. Example 2.10: Instruction mpym *AR1+, *CDP−, AC0 AC0 FF FFFF FF00 AC0 00 0000 0020 FRCT 0 FRCT 0 AR1 02E0 AR1 02E1 CDP 0400 CDP 03FF Data memory Data memory 0x2E0 0002 0x2E0 0002 0x400 0010 0x400 0010 Before instruction After instructionJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 TMS320C55X INSTRUCTION SET 77 In Example 2.11, the macmr40 instruction performs MAC operation using AR1 and AR2 as data pointers. At the same time, the instruction also carries out the following operations: r The keyword ‘r’ produces a rounded result in the higher portion of the accumulator AC3. After rounding, the lower portion of AC3(15:0) is cleared. r 40-bit overflow detection is enabled by the keyword ‘40’. If overflow occurs, the result in accumulator AC3 will be saturated to a 40-bit maximum value. r The option ‘T3=*AR1+’ loads the data pointed at by AR1 into T3. r Finally, AR1 and AR2 are incremented by 1 to point to the next data memory location. 
Example 2.11: Instruction macmr40 T3=*AR1+, *AR2+, AC3 AC3 00 0000 0020 AC3 00 235B 0000 FRCT 1 FRCT 1 T3 FFF0 T3 3456 AR1 0200 AR1 0201 AR2 0380 AR2 0381 Data memory Data memory 0x200 3456 0x200 3456 0x380 5678 0x380 5678 Before instruction After instruction 2.6.2 Logic and Bit Manipulation Instructions Logic operation instructions such as and, or, not, and xor (exclusive-OR) on data values are widely used in decision-making and execution-flow control. They are also found in applications such as error correc- tion coding in data communications, which will be introduced in Chapter 14. For example, the instruction and #0xf,AC0 clears all upper bits in the accumulator AC0 but not the four LSBs. Example 2.12: Instruction and #0xf,AC0 AC0 00 1234 5678 AC0 00 0000 0008 Before instruction After instruction The bit manipulation instructions act on an individual bit or a pair of bits of a register or data memory. These instructions include bit clear, bit set, and bit test to a specified bit or a bit pair. Similar to logic opera- tions, the bit manipulation instructions are often used with logic operations in supporting decision-making processes. In Example 2.13, the bit-clear instruction clears the carry bit (bit 11) of the status register ST0. Example 2.13: Instruction blcr #11, ST0 ST0 0800 ST0 0000 Before instruction After instructionJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 78 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR 2.6.3 Move Instruction The move instruction copies data values between registers, memory locations, register to memory, or memory to register. Example 2.14 initializes the upper 16 bits of accumulator AC1 with a constant and clears the lower portion of the AC1. We can use the instruction mov #k<<16,AC1 where the constant k is shifted left by 16 bits first and then loaded into the upper portion of the accumulator AC1 (31:16), and the lower portion of the accumulator AC1 (15:0) is zero filled. The 16-bit constant that follows the # can be any 16-bit signed number. Example 2.14: Instruction mov #5<<16,AC1 AC1 00 0011 0800 AC1 00 0005 0000 Before instruction After instruction A more complicated instruction given in Example 2.15 completes the following operations in one clock cycle: r The unsigned data content in AC0 is shifted left according to the content in T2. r The upper portion of the AC0 (31:16) is rounded. r The data value in AC0 may be saturated if the left-shift or the rounding process causes the result in AC0 to overflow. r The final result, after left shifting, rounding, and possible saturation, is stored into the data memory pointed at by the pointer AR1 as an unsigned value. r Pointer AR1 is automatically incremented by 1. Example 2.15: Instruction mov uns (rnd(HI(satuate(AC0< VECS PAGE 0 /* Interrupt vectors */ .text > SARAM PAGE 0 /* Code */ .data > SARAM PAGE 0 /* Initialized variables */ .bss > DARAM PAGE 0 /* Global & static variables */ .const > DARAM PAGE 0 /* Constant data */ .sysmem > SARAM PAGE 0 /* Dynamic memory (malloc) */ .stack > SARAM PAGE 0 /* Primary system stack */ .sysstack > SARAM PAGE 0 /* Secondary system stack */ .cio > SARAM PAGE 0 /* C I/O buffers */ .switch > SARAM PAGE 0 /* Switch statement tables */ .cinit > SARAM PAGE 0 /* Auto-initialization tables */ .pinit > SARAM PAGE 0 /* Initialization fn tables */ .ioport > IOPORT PAGE 2 /* Global&static IO variables */ } runs much faster. Table 2.27 lists the run-time benchmark comparison of this demo program built with and without C compiler optimization option. 
Table 2.28 lists the files used for this experiment. Although we have included all the files for the experiment, the readers are strongly encouraged to create this experiment step by step. Table 2.26 Program example of timer, timerTest.c #include #include "timer.h" #pragma CODE_SECTION(main, ".text:example:timer0"); #pragma CODE_SECTION(application, ".text:example:timer0"); static unsigned long application(unsigned long cnt, short lp); void main() { short k,loop; unsigned long sampleCnt; asm(" MOV #0x01,mmap(IVPD)");// Set up C55x interrupt vector pointer asm(" MOV #0x01,mmap(IVPH)");// Set up HOST interrupt vector pointer initCLKMD(); // Initialize CLKMD register initTimer(); // Initialize timer loop = 100; for (k=0; k<9; k++) { time.us = 0; // Reset time variables continues overleafJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 102 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.26 (continued ) time.ms = 0; time.s = 0; startTimer(); // Start timer sampleCnt = application(0, loop); stopTimer(); // Stop timer loop -= 10; printf("samples processed = %10ld\ttime used = %d(s)%d(ms)%d(us)\ n", sampleCnt,time.s,time.ms,time.us); } } // Example of DSP application static unsigned long application(unsigned long cnt, short lp) { short i,j,k,n; for (i=0; i SDRAM PAGE 0 /* SDRAM memory: external memory */ } The files used for this experiment are listed in Table 2.29. Procedures of the experiment are listed as follows: 1. Create the project emif.pjt, and add files listed in Table 2.29 and the run-time support library rts55x.lib to the project. Set the Build Option to use large memory mode. 2. Build the project and load it to DSK. Run the experiment and view the SDRAM memory using CCS graphic display tool to verify RGB data write to SDRAM. Do you see the data pattern as shown in Figure 2.20? 2.10.5 Programming Flash Memory Devices Many DSP products use flash memories to store application program and data. The programmable flash memory offers the cost-effective and reliable read/write nonvolatile random accesses. The C5510 DSK consists of 256K words (4 Mbits) external flash memory. The flash memory is mapped to the C55x EMIF CE1 memory space. In this experiment, we will demonstrate several basic functions of flash programming. The C5510 DSK uses 4 Mbits of flash memory. The starting address of the flash memory is at word address 0x200000. The flash memory device is operated by writing specific data sequences into the flash memory command registers. Writing to incorrect address or using invalid data will cause the flash memory to reset. After resets, the flash device is in read-only state. The flash memory reset is performed by writing reset command word 0xF0 to any valid flash memory location. The flash memory must be erased before writing new data. The erase operation includes both chip erase and sector erase. The chip erase will erase all the data contained in the entire flash memory device while the sector erase will erase the data contained only in the specific sector. The erase command sequence is writing the specific data pattern 0xAA-0x55-0x80-0xAA-0x55-0x10 to flash memory address offset 0x555 and 0x2AA. 
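As a concrete illustration of this command sequence, the following C sketch issues the chip-erase commands and then polls the status bits described next. It is not the experiment's flashErase.c: the 0x555/0x2AA offsets, the DQ7 and DQ5 status bits, and the flash base word address 0x200000 are taken from the text, while the simple pointer access assumes the large memory model used by the DSK projects.

#define FLASH_BASE ((volatile unsigned short *)0x200000) /* DSK flash word address */
#define DQ7  0x0080   /* data polling bit: reads 1 when the erase has completed    */
#define DQ5  0x0020   /* exceeded-timing bit: reads 1 when the operation timed out */

int flashChipErase(void)
{
    FLASH_BASE[0x555] = 0xAA;   /* first unlock write    */
    FLASH_BASE[0x2AA] = 0x55;   /* second unlock write   */
    FLASH_BASE[0x555] = 0x80;   /* erase set-up command  */
    FLASH_BASE[0x555] = 0xAA;   /* unlock again          */
    FLASH_BASE[0x2AA] = 0x55;
    FLASH_BASE[0x555] = 0x10;   /* chip-erase command    */

    for (;;)                    /* simplified status poll */
    {
        unsigned short status = FLASH_BASE[0];
        if (status & DQ7) return 0;    /* erase completed  */
        if (status & DQ5) return -1;   /* timeout error    */
    }
}

A production driver would follow the flash data sheet more closely, for example rechecking DQ7 after DQ5 is found set and writing the 0xF0 reset command to return the device to read mode after an error.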
After the command sequence has been completed, we check the flash memory statusJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 106 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.30 File listing for experiment exp2.10.5_flashProgram Files Description flashTest.c C function for testing flash programming experiment emifInit.c C function initializes C55x EMIF module flashErase.c C function erases flash memory data flashID.c C function obtains flash device IDs flashReset.c C function resets flash memory flashWrite.c C function writes data to flash memory emif.h C header file used for EMIF settings flash.h C header file for flash programming experiment flash.pjt DSP project file flash.cmd DSP linker command file dtmf18005551234.dat Data file in ASCII format bit DQ7 to detect if the erase has completed. The bit DQ5 indicates if the operation has been timed out. If this bit is set, it indicates a timeout error. When programming flash memory, the write sequence must be issued for each data at each flash memory address. The data can be a 16-bit word or an 8-bit byte. This experiment uses 16-bit flash memory write program. The write command sequence is 0xAA-0x55-0xA0-addr-data. After each data has been written, the processor checks the data ready bit, DQ7, before writing another data. The program first initializes the EMIF to map the flash device to CE1. It then resets the flash device and uses the function flashID( ) to obtain the manufacture ID and device ID. This important step allows system to determine exactly which flash memory device has being used, and thus enables flexible programming for using different programming algorithms for different flash memory devices. The pro- gram then erases the entire flash memory and reprograms it. Finally, the program reads data back from the flash memory and compares with the original data. In this experiment, we introduce several basic programming functions and demonstrate flash-erase and flash-write procedures. We also identify specific flash memory devices. Table 2.30 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Create the project flash.pjt, and add the files listed in Table 2.30 to the project. 2. Build the project and load it to DSK. Run the experiment and view the flash memory data pattern with CCS memory window after flash programming is completed. 3. Reload the flash program to DSK. Use Go Main command to start the experiment. Single step through the program to view the flash memory manufacture ID and chip ID. Use CCS debug window to view flash memory (what is the correct starting address for flash memory?) before calling the flashErase() function. Single step into the erase function and view the flash memory using CCS. Finally, run the code to write the data into the flash memory. 2.10.6 Using McBSP The C5510 DSK uses McBSP1 and McBSP2 for connecting the analog interface chip TLV320AIC23, where McBSP1 is used as control channel and McBSP2 is used as data channel. In this experiment, we will introduce the basic programming of the McBSP and use CCS to build the McBSP DSP library.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 107 Every McBSP consists of two serial port control registers (SPCR) in I/O memory spaces. These control registers are used for controlling digital loopback, sign extension, clock stop, and interrupt modes. 
There are also several signal pins that allow user to check the current status of transmit and receive operations. Each McBSP port has two transmit control registers (XCR) and two receive control registers (RCR). These I/O MMRs allow users to specify transmit and receive frame phases, wordlength, and the number of the words to transfer. There are two sample rate generator registers (SRGR) for each McBSP port. Sample rate generator can generate frame sync and clock (CLKG) signals. The SRGRs allow user to choose input clock source (CLKSM), divide output clock via a divide counter (CLKGDV), and set the frame sync pulse width and period. Every McBSP port has eight transmit channel enable registers and eight receive channel enable registers. These registers are used only when the transmittmer and/or receiver are configured to allow individual enabling and disabling of the channels; that is, TMCM = 0 and/or RMCM = 1. In this experiment, we initialize the registers of the McBSP ports. Since the DSK uses McBSP1 in master mode for AIC23 control channel, McBSP1 must be initialized before it can send configuration commands and parameters to the control registers of the AIC23. The McBSP2 is used as data port for AIC23, which can be configured by the function mcbsp2Init( ). After the McBSP2 is reset, the initialization sets the transmitter and receiver control registers. The pin control register is configured for the proper clock polarities. Finally, the sample rate generator is enabled, and then McBSP transmitter and receiver are enabled. This experiment uses two functions: mcbsp1CtlTx( ) and mcbsp2DatTx( ). The function mcbsp1CtlTx( ) checks McBSP transmit data ready bit. When this bit is set, the control parame- ter regValue will be written to the AIC23 control registers via the McBSP1. The lower 9 bits of the regValue contain the control parameters for setting the AIC register, while the upper 6 bits are used to identify the AIC23 control registers. Similarly, the function mcbsp2DatTx( ) checks if it is time to write the data via the McBSP2 transmitter. If the XRDY bit is set, the data is copied to the McBSP transmit buffer for transmission. The process of using CCS to create a DSP library for the TMS32C55x processor is similar to create a COFF executable program. First, we create a new project using library (.lib) as the target instead of choosing the COFF executable file type. The library must use the same memory model as the application program that will use the library. As an example, Figure 2.21 shows that the project type is the library Figure 2.21 The project creation for McBSP libraryJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 108 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Figure 2.22 The CCS project configuration for McBSP library, mcbsp.lib when we create the new project. The building option in Figure 2.22 shows the project will create a library mcbsp.lib. When creating libraries, only the library functions are included in the project. The application program, test programs, and other functions that are not related to the library shall not be included. We choose optimization option level-2 when building the library, so the library functions will be compiled with optimization option turned on. We also disabled features of generating the debug information when building the library. These settings give us an efficient library. 
As shown in Figure 2.22, we used two copy statements in the Final build steps Window to copy the C header file and library to the working directories. In practice, the library functions are individually tested, debugged, and verified before being used to create the library. The example mcbsp.lib is made with the large memory model and will be used by the next experiment. In Figure 2.22, we also show how to add special commands in the CCS build option to copy the mcbsp.lib from the current build directory to the destination. Table 2.31 lists the files used to build the McBSP library for this experiment. Table 2.31 File listing for experiment exp2.10.6_mcbsp Files Description mcbsp1CtlTx.c C function sends command and data via C55x McBSP1 mcbsp1Init.c C function initializes C55x McBSP1 registers mcbsp2DatTx.c C function writes data to McBSP2 mcbsp2Init.c C function initializes C55x McBSP2 registers mcbspReset.c C function resets C55x McBSP mcbsp.h C header file for McBSP experiment mcbsp.pjt DSP project fileJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 109 Procedures of the experiment are listed as follows: 1. Create the project mcbsp.pjt, and add the following files to the project: mcbsp1CtlTx.c, mcbsp1Init.c, mcbsp2DatTx.c, mcbsp2Init.c, and mcbspReset.c. 2. Set up the search path for Include File. Build the project to create the library. 3. The C55x DSP code generation tools are located in the directory ..\c5500\cgtools\bin. Open a command window from the host computer by going to the Windows Start Menu and select Run. In the Run dialogue window, type cmd and click OK. 4. When the command window appears on the computer, we will show how to use the archiver tool ar55.exe located in ..\c5500\cgtools\bin directory. Assuming that the DSP tools are in- stalled in the C:\ti directory, type C:\ti\c5500\cgtools\bin\ar55 -h from the command win- dow to see the archiver’s help information. The following is an example of the archiver help menu: Syntax: ar55 [arxdt][quvse] archive files ... Commands: (only one may be selected) a - Add file r - Replace file x - Extract file d - Delete file t - Print table of contents Options: q - Quiet mode - Normal status messages suppressed u - Update with newer files (use with 'r' command) s - Print symbol table contents v - Verbose 5. The archiver ar55.exe allows us to view (-t) the file list of a library, remove (-d) files from the library, add (-a) and replace (-r) files to existing library, and extract the library files. Use these commands to view and extract the McBSP library that we built for this experiment. 2.10.7 AIC23 Configurations The C5510 DSK analog inputs include a microphone and a stereo line-in; the analog outputs include a stereo line-out and a stereo headphone. The AIC23 uses McBSP1 for control channel with 16-bit control signal. The lower 9 bits contain the command value that will be written to the specified register, while the upper 7 bits specify the AIC23 control register. The McBSP2 is set as the bidirectional data channel for passing audio samples in and out of the DSK. The AIC23 supports several data formats and can be configured for different sampling frequencies as described in its data manual [12]. In this experiment, we will configure the C55x McBSP to interface with the AIC23 for real-time audio playback. The experiment program AIC23Demo( ) configures the AIC23 for stereo output. The digital samples stored in the DSK flash memory will be played at 8 kHz rate. 
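Because each AIC23 command is simply a register address packed together with a register value, a small helper function makes the format explicit. The function below is a hypothetical illustration rather than part of aic23.h; it assumes the 7-bit address/9-bit value split described above, and the commented usage line assumes the mcbsp1CtlTx( ) control-channel function from the previous experiment.

unsigned short aic23ControlWord(unsigned short regAddr, unsigned short value)
{
    /* upper 7 bits select the AIC23 control register, lower 9 bits carry its value */
    return (unsigned short)(((regAddr & 0x007F) << 9) | (value & 0x01FF));
}

/* Example usage with an illustrative volume code for control register 2: */
/* mcbsp1CtlTx(aic23ControlWord(2, leftHeadphoneVolume));                  */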
The AIC23 has 11 control registers that must be initialized to satisfy different application requirements. The initialization values are listed in the C header file aic23.h. The AIC23 uses sigma–delta technology with a built-in headphone amplifier to provide up to 30 mW output level for a 32 Ω load and 40 mW for a 16 Ω load. It supports sampling rates from 8 to 96 kHz, and data wordlengths of 16, 20, 24, and 32 bits. The AIC23 also includes flexible gain and volume controls. Figure 2.23 shows the functional block diagram of the AIC23. The serial clock is connected to SCLK. The data word is latched by the CS signal. The 16-bit control word is latched on the rising edge of CS with the MSB first.

Figure 2.23 Block diagram of connecting the AIC23 using the McBSPs (McBSP1 drives the AIC23 control pins CS, SDIN, SCLK, and MODE; McBSP2 connects to the digital audio pins LRCIN, DIN, LRCOUT, DOUT, and BCLK; the analog side provides MICIN, LLINEIN, and RLINEIN inputs and LHPOUT, RHPOUT, LLINEOUT, and RLINEOUT outputs; the AIC23 control registers are 0 left input volume, 1 right input volume, 2 left headphone volume, 3 right headphone volume, 4 analog audio path control, 5 digital audio path control, 6 power-down control, 7 digital audio interface format, 8 sample rate control, 9 digital interface activation, and 15 reset)

The AIC23 supports stereo audio channels. The left and right channels can be simultaneously locked together or individually controlled. The program listed in Table 2.32 initializes the DSK and the AIC23, reads the data samples from flash memory, and plays this audio signal at an 8-kHz sampling rate via the DSK headphone output. The files used for this AIC23 experiment are listed in Table 2.33.

Procedures of the experiment are listed as follows:

1. Use the flash program experiment in Section 2.10.5 to initialize the flash memory with the data file, dtmf18005551234.dat. You can also use the same data file from SRAM as shown in aic23Test.c. In this case, the data file is included as a header file.

2. Create mcbsp.lib using the previous experiment given in Section 2.10.6.

3. Create the project aic23.pjt, add all the source files listed in Table 2.33 and the mcbsp.lib to the project, and build the project (pay attention to memory mode).

4.
Connect a headphone (or loudspeaker) to the DSK headphone output and run the DSK experiment to listen to the playback.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 111 Table 2.32 Program of audio playback, aic23Test.c #include #include "mcbsp.h" #include "aic23.h" #define DATASIZE 12320 #if 1 // Flash memory data #define flashData 0x200000 // Data has been programmed by flash example #else // Local memory data short flashData[DATASIZE]={ // Data is in SRAM #include "dtmf18005551234.dat" }; #endif #pragma CODE_SECTION(main, ".text:example:main"); void main() { short i,data; unsigned short *flashPtr; // Initialize McBSP1 as AIC23 control channel mcbsp1Init(); // Initialize the AIC23 aic23Init(); // Initialize McBSP2 as AIC23 data channel mcbsp2Init(); flashPtr = (unsigned short *)flashData; // Playback data via AIC23 for (i=0; i #include "dma.h" #include "emif.h" #define N 128 // Transfer data elements #define M 16 // Transfer data frames #define DMA_CHANNEL 3 // DMA channel // SRC is in DARAM and DST is in SDRAM // Force SRC and DST to align at 32-bit boundary #pragma DATA_SECTION(src, ".daram:example:dmaDemo") #pragma DATA_SECTION(dst, ".sdram:example:dmaDemo") #pragma DATA_ALIGN(src, 2); #pragma DATA_ALIGN(dst, 2); unsigned short src[N*M]; unsigned short dst[N*M];JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 113 Table 2.34 (continued ) #pragma CODE_SECTION(main, ".text:example:main"); void main(void) { unsigned short i,frame,err; short dmaInitParm[DMA_REGS]; unsigned long srcAddr,dstAddr; // Initialize EMIF emifInit(); // Initialize source and destination memory for testing for (i = 0; i < (N*M); i++) { dst[i] = 0; src[i]=i+1; } // Convert word address to byte address, DMA uses byte address srcAddr = (unsigned long)src; dstAddr = (unsigned long)dst; srcAddr <<= 1; dstAddr <<= 1; // Setup DMA initialization values dmaInitParm[0] = DMACSDP_INIT_VAL; dmaInitParm[1] = DMACCR_INIT_VAL; dmaInitParm[2] = DMACICR_INIT_VAL; dmaInitParm[3] = DMACSR_INIT_VAL; dmaInitParm[4] = (short)(srcAddr & 0xFFFF); dmaInitParm[5] = (short)(srcAddr >> 16); dmaInitParm[6] = (short)(dstAddr & 0xFFFF); dmaInitParm[7] = (short)(dstAddr >> 16); dmaInitParm[8] = N; dmaInitParm[9] = M; dmaInitParm[10] = DMACSFI_INIT_VAL; dmaInitParm[11] = DMACSEI_INIT_VAL; dmaInitParm[12] = 0; dmaInitParm[13] = 0; dmaInitParm[14] = DMACDEI_INIT_VAL; dmaInitParm[15] = DMACDFI_INIT_VAL; // Initialize DMA channel dmaInit(DMA_CHANNEL, dmaInitParm); // Enable DMA channel and begin data transfer dmaEnable(DMA_CHANNEL); // DMA transfer data at background frame = M; while (frame>0) { if (dmaFrameStat(DMA_CHANNEL) != 0) continues overleafJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 114 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR Table 2.34 (continued ) { frame--; } } // Close DMA channel dmaReset(DMA_CHANNEL); // Check data transfer is correct or not err=0; for (i = 0; i <(N*M); i++) { if (dst[i] != src[i]) { err++; } } printf("DMA Demo: error found = %d\ n", err); } The DMA initialization function dmaInit( ) uses the argument dmaNum to select a DMA channel and initialize all 16 registers with the context passed in via dmaInitParm[]. The DMA channel is disabled by the initialization function. To begin a DMA transfer, the DMA channel must be enabled. The DMA status register bits indicate the DMA status for the given channel. The DMA demo program sets a frame-synchronization-based DMA transfer. 
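As described next, the program detects frame completion by polling a status bit in the channel's DMACSR register. The following sketch shows one plausible shape for a dmaFrameStat( )-style check; it is not the experiment's actual code, the register pointer is supplied by the caller, and the FRAME bit mask is only a placeholder that must be confirmed against the DMA status register description in the peripherals guide [7].

#define DMA_FRAME_MASK 0x0008    /* placeholder for the FRAME status bit position */

/* Return nonzero when the addressed DMA channel status register (DMACSR)
   reports that another frame has completed. On the C55x the status bits
   are typically cleared when the register is read (check [7]), so each
   nonzero return should be counted exactly once. */
unsigned short dmaFrameDone(volatile unsigned short *dmaCSR)
{
    return (unsigned short)(*dmaCSR & DMA_FRAME_MASK);
}

The demo program's while loop follows exactly this pattern, decrementing its frame counter each time the status check returns nonzero.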
After each frame of data has been transferred, the frame sync status bit of the DMA CSR register will be set. The program checks the frame sync status bit to monitor the data transfer. It is a good practice to disable the DMA channel when data transfer has been completed and the DMA channel will no longer be needed. This can prevent unpredictable behavior. The function dmaReset( ) will disable the given DMA channel and reset the DMA registers. Table 2.35 lists the files used for this DMA experiment. Procedures of the experiment are listed as follows: 1. Create the project dma.pjt, and add all the files listed in Table 2.35 and the run-time support library rts55x.lib to the project. 2. Build the project and run the program. What memory mode should be used for this experiment? Table 2.35 File listing for experiment exp2.10.8_dma Files Description dmaTest.c C function for testing DMA experiment dmaEnable.c C function enables DMA dmaFrameStat.c C function checks DMA status bit dmaInit.c C function initializes DMA dmaReset.c C function resets DMA emifInit.c C function initializes EMIF for DMA experiment emif.h C header file for using EMIF initialization function dma.h C header file for DMA experiment dma.pjt DSP project file dma.cmd DSP linker command fileJWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 EXERCISES 115 References [1] Texas Instruments, Inc., TMS320C55x DSP CPU Reference Guide, Literature no. SPRU371F, 2004. [2] Texas Instruments, Inc., TMS320C55x Assembly Language Tools User’s Guide, Literature no. SPRU280G, 2003. [3] Texas Instruments, Inc., TMS320C55x Optimizing C Compiler User’s Guide, Literature no. SPRU281E, 2003. [4] Texas Instruments, Inc., TMS320C55x DSP Mnemonic Instruction Set Reference Guide, Literature no. SPRU374G, 2002. [5] Texas Instruments, Inc.,TMS320C55x DSP Algebraic Instruction Set Reference Guide, Literature no. SPRU375G, 2002. [6] Texas Instruments, Inc., TMS320C55x Programmer’s Reference Guide, Literature no. SPRU376A, 2001. [7] Texas Instruments, Inc., TMS320C55x DSP Peripherals Reference Guide, Literature no. SPRU317G, 2004. [8] ITU Recommendation G.711, ‘Pulse code modulation (PCM) of voice frequencies,’ CCITT Series G Recom- mendations, 1988. [9] Texas Instruments, Inc., TMS320VC5510 DSP External Memory Interface (EMIF) Reference Guide, Literature no. SPRU590, Aug. 2004. [10] Micron, Technology, Inc., Synchronous DRAM MT48LC2M32B2 Ð 512K x 32 x 4 Banks, Specification, Literature no. Advanced Micro Devices, Inc. [11] Advance Micro Devices, Inc., Am29LV400B 4 Megabit (512k x 8-Bit/256K x 16-Bit) COMS 3.0 Volt-only Boot Sector FLASH Memory Data Sheet, July 2003. [12] Texas Instruments, Inc., TLV320AIC23B Stereo Audio Codec, 8-96 kHz, With Integrated Headphone Amplifier Data Manual, Literature no. SLWS106F, 2004. Exercises 1. Check the following examples to determine if these are correct parallel instructions. If not, correct the problems: (a) mov *AR1+,AC1 :: add @x,AR2 (b) mov AC0,dbl(*AR2+) :: mov dbl(*AR1+T0),AC2 (c) mpy *AR1+,*AR2+,AC0 :: mpy *AR3+,*AR2+,AC1 || rpt #127 2. Given a memory block, XAR0, XDP, and T0 as shown in Figure 2.24. 
Determine the contents of AC0, AR0, and T0 after the execution of the following instructions: (a) mov *(#x+2),AC0 (b) mov @(x-x+1),AC0 (c) mov @(x-x+0x80),AC0 (d) mov *AR0+,AC0 (e) mov *(AR0+T0),AC0 (f) mov *AR0(T0),AC0 (g) mov *AR0+T0,AC0 (h) mov *AR0(#-1),AC0 (i) mov *AR0(#2),AC0 (j) mov *AR0(#0x80),AC0JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 116 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR 0xFFFF 0x0000 0x1111 0x2222 0x3333 0x4444 0x00FFFF x=0x010000 0x010001 0x010002 0x010003 0x010004 Data memoryAddress: 0x80800x010080 : : : : 0x0004 0x010000XDP T0 0x010000XAR0 Figure 2.24 Contents of data memory and registers 3. Use Table 2.36 to show how the C compiler passes parameters for the following C functions: (a) short func_a(long, short, short, short, short, short, short *, short *, short *, short *); var = func_a(0xD32E0E1D, 0, 1, 2, 3, 4,pa, pb, pc, pd); (b) short func_b(long, long, long, short, short, short, short *, short *, short *, short *); var = func_b(0x12344321, 0, 1, 2, 3, 4,pa, pb, pc, pd); (c) long func_c(short, short *, short *, short, short, long, short *, short *, long, long); var = func_c(0x2468ABCD, p0, p1, 1, 2, 0x1001, p2, p3, 0x98765432, 0x0); Table 2.36 List of parameters passed by the C55x C compiler T0 T1 T2 T3 AC0 AC1 AC2 AC3 XAR0 XAR1 XAR2 XAR3 XAR4 XAR6 XAR6 XAR7 SP(−3) SP(−2) SP(−1) SP(0) SP(1) SP(2) SP(3) var 4. The complex vector multiplication can be represented as z = x · y = (a + jb) · (c + jd). The following C-callable assembly routine is written to compute the complex vector multiplication. Identify potential pro- gramming errors (bugs) in this assembly routine. Correct the errors and test it using CCS.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 EXERCISES 117 .data x .word 1,2 ; Complex vector x = a+jb = 1+2j y .word 7,8 ; Complex vector y = c+jd = 7+8j .bss z,2,1,1 ; For storing complex vector multiplication .global complexVectMult .text complexVectMult mov #x, AR0 mov #Y, CDP mov #Z, AR1 mov #1, T0 mpy *AR0,*CDP+,AC0 ; AC0 = a*c :: mpy *AR0(T0),*CDP+,AC1 ; AC1 = b*c mas *AR0(T0),*CDP+,AC0 ; AC0 = a*c-b*d :: mac *AR0,*CDP+,AC1 ; AC1 = b*c+a*d mov pair(LO(AC0)),*AR1+ ; Store the result (a+jb)(c+jd)=-9+22j .end 5. Some applications require the use of extended precision arithmetic. For example, the 32-bit by 32-bit integer multiplication will have a 64-bit result. The implementation of the double-precision multiplication can be described by the following figure: YH YL XH XL x XL YL XH YL XL YH XH YH+ Z4 Z3 Z2 Z1= The following assembly routine is written for computing the 32-bit by 32-bit integer multiplication. Identify potential programming errors (bugs) within this assembly routine. Correct the errors and test it using CCS. .data x .long 0x13579bdf ; 32-bit data y .long 0x2468ace0 ; 32-bit data .bss z,4,1,1 ; 64-bit result .global _mult32x32JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 118 INTRODUCTION TO TMS320C55X DIGITAL SIGNAL PROCESSOR .text _mult32x32 amov #x, XAR0 amov #y, XAR1 amov #z, XAR2 amar *AR0+ || amar *AR1+ add #3,AR2 || mpym *AR0-, *AR1,AC0 ; AC0 = XL*YL mov AC0,*AR2- ; Save Z1 macm *AR0+,*AR1-,AC0>>16,AC0 ; AC0 = (XH*YL) + (XL*YL)>>16 macm *AR0-,*AR1,AC0 ; AC0 += (XL*YH) mov AC0,*AR2- ; Save Z2 macm *AR0,*AR1,AC0>>16,AC0 ; AC0 = (XH*YH) + (AC0)>>16 mov AC0,*AR2- ; Save Z3 mov HI(AC0),*AR2 ; Save Z4 ret .end 6. Based on the previous experiment on interfacing C with assembly code, write a C-callable assembly function to compute d = a · b · c, where a = 0x400, b = 0x600, and c = 0x4000. 
The assembly function should pass the three variables a, b, and c into the assembly routine, and return the result to the C function. Check the result and explain why.

7. Refer to the previous experiment on addressing modes using assembly programming to write an assembly routine that uses the indirect addressing mode to compute the 8 × 8 matrix product B = A · X.

8. Write an assembly routine computing the following matrix operation to obtain Y, Cr, and Cb from given R, G, and B data. The values of R, G, and B are 8-bit integers.

   [  65  129   25 ] [ R ]   [  16 ]   [ Y  ]
   [ −38  −74  112 ] [ G ] + [ 128 ] = [ Cr ]
   [ 112  −94  −18 ] [ B ]   [ 128 ]   [ Cb ]

9. Write a C-callable assembly function that performs the following functions:
(a) Find the maximum and minimum values of the data file 'dtmf18005551234.dat', which is used by the previous experiment on programming flash memory devices.
(b) Calculate the average (mean) value of the data file 'dtmf18005551234.dat'.

10. Write a C-callable assembly function that sorts the data file 'dtmf18005551234.dat' and writes the sorted result into memory in descending order, starting from the maximum. Using the CCS graphic feature, plot both 'dtmf18005551234.dat' and the sorted result.

11. Based on the experiment program examples given in Sections 2.10.5 and 2.10.7, write a program that writes the data file 'dtmf18005551234.dat' into flash memory and plays back the data from flash memory using the C5510 DSK. Add a timer that generates an interrupt every 10 s, and use this timer to automatically play back the data file stored in flash memory every 10 s. In this experiment, we will learn how to set up C55x timers and create a timer interrupt for a given rate.

12. We introduced flash memory programming in Section 2.10.5. The flash erase and programming operations are done for the whole chip. Refer to the flash memory datasheet to develop a flash program that can erase sections instead of the whole chip and program memory in selected sections. This experiment is intended to create a flash memory update utility.

13. Based on the AIC23 experiment given in Section 2.10.7, develop an audio loopback program. The sampling rate of the AIC23 is 8 kHz in 16-bit data format. The DSK audio input is the stereo line-in and the DSK output is the stereo headphone output. Connect an audio source such as a CD player to the DSK audio input and play back the audio through the DSK headphone output. Adjust the AIC23 control registers to set proper gains for the input and output signal levels. In this experiment, we will learn the detailed control register settings for the AIC23, and will be able to adjust the AIC23 to meet the application requirements.

14. Modify the above real-time DSP program to perform the audio loopback using the McBSP interrupt event and an ISR instead of polling the McBSP status bit. From this experiment, we will learn how the interrupt and ISR work in conjunction with the McBSP.

15. Modify the audio loopback program to use DMA channels, so the audio samples will be buffered into 80 samples for the transmitter and receiver. The interrupt to the transmitter and receiver should occur every 80 samples. In this experiment, the audio is processed in blocks of 80 samples. This is a challenging experiment that requires knowledge of the DMA, McBSP, flash memory, AIC23, and the DSK system. The DMA event and McBSP interrupt along with audio sample management are all involved.

16. Create an experiment that configures the DSK for multichannel DMA data transfer.
For the first DMA data transfer, the data source is in SRAM and the destination is SDRAM. The transfer data size is 8192 bytes. For the second DMA data transfer, the data source is in SDRAM and the destination is DARAM. The transfer data size is 4096 bytes. Write the program such that the data transfers will be performed at the same time for both DMA paths.JWBK080-02 JWBK080-Kuo March 9, 2006 20:47 Char Count= 0 120JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 3 DSP Fundamentals and Implementation Considerations This chapter presents fundamental DSP concepts and practical implementation considerations for the digital filters and algorithms. DSP implementations, especially using fixed-point processors, require special attention due to the quantization and arithmetic errors. 3.1 Digital Signals and Systems In this section, we will introduce some widely used digital signals and simple DSP systems. 3.1.1 Elementary Digital Signals Signals can be classified as deterministic or random. Deterministic signals are used for test purposes and can be described mathematically. Random signals are information-bearing signals such as speech. Some deterministic signals will be introduced in this section, while random signals will be discussed in Section 3.3. A digital signal is a sequence of numbers x(n), −∞ < n < ∞, where n is the time index. The unit- impulse sequence, with only one nonzero value at n = 0, is defined as δ(n) = 1, n = 0 0, n = 0 , (3.1) where δ(n) is also called the Kronecker delta function. This unit-impulse sequence is very useful for testing and analyzing the characteristics of DSP systems. The unit-step sequence is defined as u(n) = 1, n ≥ 0 0, n < 0 . (3.2) Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 121JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 122 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS This function is very convenient for describing causal signals, which are the most commonly encountered signals in real-time DSP systems. Sinusoidal signals (sinusoids, tones, or sinewaves) can be expressed in a simple mathematical formula. An analog sinewave can be expressed as x(t) = A sin (t + φ) = A sin (2π ft+ φ) , (3.3) where A is the amplitude of the sinewave.  = 2π f (3.4) is the frequency in radians per second (rad/s), f is the frequency in cycles per second (Hz), and φ is the phase in radians. The digital signal corresponding to the analog sinewave defined in Equation (3.3) can be expressed as x(n) = A sin (nT + φ) = A sin (2πfnT + φ) , (3.5) where T is the sampling period in seconds. This digital sequence can also be expressed as x(n) = A sin (ωn + φ) = A sin (πFn + φ) , (3.6) where ω = T = 2π f fs (3.7) is the digital frequency in radians per sample, and F = ω π = f ( fs/2) (3.8) is the normalized digital frequency in cycles per sample. The units, relationships, and ranges of these analog and digital frequency variables are summarized in Table 3.1. Sampling of analog signals implies a mapping of an infinite range of analog frequency variable f (or ) into a finite range of digital frequency variable F (or ω). The highest frequency in a digital signal is F = 1 (or ω = π) based on the sampling theorem defined in Equation (1.3). Therefore, the spectrum of digital signals is restricted to a limited range as shown in Table 3.1. Example 3.1: Generate 32 samples of a sinewave with A = 2, f = 1000 Hz, and fs = 8 kHz using MATLAB program. 
Since F = f/(fs/2) = 0.25, we have ω = πF = 0.25π. From Equation (3.6), we can express the generated sinewave as x(n) = 2 sin(ωn), n = 0, 1, ..., 31. The generated sinewave samples are plotted (shown in Figure 3.1) and saved in a data file (sine.dat) in ASCII format using the following MATLAB script (example3_1.m):

n = [0:31];              % Time index n
omega = 0.25*pi;         % Digital frequency
xn = 2*sin(omega*n);     % Sinewave generation
plot(n, xn, '-o');       % Samples are marked by 'o'
xlabel('Time index, n');
ylabel('Amplitude');
axis([0 31 -2 2]);
save sine.dat xn -ascii;

Figure 3.1 An example of sinewave with A = 2 and ω = 0.25π

Note that F = 0.25 means there are four samples from 0 to π, resulting in eight samples per period of sinewave, which is clearly indicated in Figure 3.1.

Table 3.1 Units, relationships, and ranges of four frequency variables

Variable   Unit                     Relationship               Range
Ω          Radians per second       Ω = 2πf                    −∞ < Ω < ∞
f          Cycles per second (Hz)   f = F fs/2 = ω/(2πT)       −∞ < f < ∞
ω          Radians per sample       ω = ΩT = πF                −π ≤ ω ≤ π
F          Cycles per sample        F = f/(fs/2)               −1 ≤ F ≤ 1

3.1.2 Block Diagram Representation of Digital Systems

A DSP system performs prescribed operations on signals. The processing of digital signals can be described as combinations of certain basic operations including addition (or subtraction), multiplication, and time shift (or delay). Thus, a DSP system consists of the interconnection of three basic elements: adders, multipliers, and delay units.

Two signals, x1(n) and x2(n), can be added as illustrated in Figure 3.2, where the adder output is expressed as

y(n) = x1(n) + x2(n). (3.9)

The adder could be drawn as a multi-input adder with more than two inputs, but the additions are typically performed with two inputs at a time.

Figure 3.2 Block diagram of an adder

The addition operation of Equation (3.9) can be implemented as the following C55x code using direct addressing mode:

mov @x1n,AC0     ; AC0 = x1(n)
add @x2n,AC0     ; AC0 = x1(n)+x2(n)
mov AC0,@yn      ; y = x1(n)+x2(n)

A given signal can be multiplied by a scalar, α, as illustrated in Figure 3.3, where x(n) is the multiplier input and the multiplier's output is

y(n) = αx(n). (3.10)

Multiplication of a sequence by a scalar, α, results in a sequence that is scaled by α. The output signal is amplified if |α| > 1, or attenuated if |α| < 1. The multiply operation of Equation (3.10) can be implemented as the following C55x code using indirect addressing mode:

amov #alpha,XAR1     ; AR1 points to alpha (α)
amov #xn,XAR2        ; AR2 points to x(n)
amov #yn,XAR3        ; AR3 points to y(n)
mpy *AR1,*AR2,AC0    ; AC0 = α*x(n)
mov AC0,*AR3         ; y = α*x(n)

The sequence x(n) can be delayed in time by one sampling period, T, as illustrated in Figure 3.4, where the box labeled z−1 represents the unit delay, x(n) is the input signal, and the output signal is

y(n) = x(n − 1). (3.11)

In fact, the signal x(n − 1) is actually the previously stored signal in memory before the current time n. Therefore, the delay unit is very easy to realize in a digital system with memory, but is difficult to implement in an analog system. A delay by more than one unit can be implemented by cascading several delay units in a row.
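In C, a delay line of several units is simply a buffer that is shifted (or indexed circularly) once per sample. The sketch below is an illustration only; it performs the same per-sample update as the C55x delay instruction and the rpt/mov loop used in the assembly examples that follow.

/* Shift a tapped-delay line of length L by one sample:
   x[L-1] <- x[L-2], ..., x[1] <- x[0], then x[0] <- new input,
   so every stored sample ages by one position per call. */
void delayLineUpdate(short x[], int L, short xnew)
{
    int i;
    for (i = L - 1; i > 0; i--)
        x[i] = x[i - 1];     /* what was x(n-i) is now treated as x(n-i-1) */
    x[0] = xnew;             /* store the newest sample x(n) */
}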
Therefore, an L-unit delay requires L memory locations configured as a first-in first-out buffer (tapped-delay line or simply delay line) in memory. There are several methods to implement delay operations on the TMS320C55x. The following code uses a delay instruction to move the contents of the addressed data memory location into the next higher address location: amov #xn,XAR1 ; AR1 points to x(n) delay *AR1 ; Contents of x(n) is copied to x(n-1) x(n) x(n)y(n) y(n)α α or Figure 3.3 Block diagram of a multiplierJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 DIGITAL SIGNALS AND SYSTEMS 125 x(n) y(n) = x(n − 1) z−1 Figure 3.4 Block diagram of a unit delay Example 3.2: Consider a simple DSP system described by the difference equation y(n) = αx(n) + αx(n − 1). (3.12) The block diagram of the system using the three basic building blocks is sketched in Figure 3.5(a), which shows that the output signal y(n) is computed using two multiplications and one addition. A simple algebraic simplification may be used to reduce computational requirements. For example, Equation (3.12) can be rewritten as y(n) = α [x(n) + x(n − 1)] . (3.13) The implementation of this difference equation is illustrated in Figure 3.5(b), where only one multiplication is required. This example shows that with careful design (or optimization), the complexity of the system (or algorithm) can be further reduced. Example 3.3: In practice, the complexity of algorithm also depends on the architecture and instruction set of the DSP processor. For example, the C55x implementation of Equation (3.13) can be written as amov #alpha,XAR1 ; AR1 points to α amov #temp,XAR2 ; AR2 points to temp amov #yn,XAR4 ; AR4 points to yn mov *(x1n),AC0 ; AC0 = x1(n) add *(x2n),AC0 ; AC0 = x1(n)+x2(n) x(n) x(n) x(n − 1) x(n − 1) z−1 y(n) y(n) z−1 αα α ++ Σ ++ Σ (b) (a) Figure 3.5 Block diagrams of DSP systems: (a) direct realization described in (3.12); (b) simplified implementation given in (3.13)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 126 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS mov AC0,*AR2 ; temp = x1(n)+x2(n), pointed by AR2 mpy *AR1,*AR2,AC1 ; AC1 = α *[x1(n)+x2(n)] mov AC1,*AR4 ; yn = α *[x1(n)+x2(n)] Equation (3.12) can be implemented as amov #x1n,XAR1 ; AR1 points to x1(n) amov #x2n,XAR2 ; AR2 points to x2(n) amov #alpha,XAR3 ; AR3 points to α amov #yn,XAR4 ; AR4 points to yn mpy *AR1,*AR3,AC1 ; AC1 = α *x1(n) mac *AR2,*AR3,AC1 ; AC1 = α *x1(n)+ α *x2(n) mov AC1,*AR4 ; yn = α *x1(n)+ α *x2(n) This example shows Equation (3.12) is more efficient for implementation on the TMS320C55x because its architecture is optimized for the sum of products operation. Therefore, the complexity of DSP algorithm cannot be simply measured by the number of required multiplications. When the multiplier coefficient α is a number with a base of 2 such as 0.25 (1/4), we can use shift operation instead of multiplication. The following example uses the absolute addressing mode: mov *(x1n)<<#-2,AC0 ; AC0 = 0.25*x1(n) add *(x2n)<<#-2,AC0 ; AC0 = 0.25*x1(n)+0.25*x2(n) where the right-shift option, <<#-2, shifts the contents of x1n and x2n to the right by 2 bits. This is equivalent to dividing the number by 4. 3.2 System Concepts In this section, we introduce several techniques for describing and analyzing the linear time-invariant (LTI) digital systems. 
3.2.1 Linear Time-Invariant Systems If the input signal to an LTI system is the unit-impulse sequence δ(n) defined in Equation (3.1), then the output signal is called the impulse response of the system, h(n). Example 3.4: Consider a digital system with the I/O equation y(n) = b0x(n) + b1x(n − 1) + b2x(n − 2). (3.14) Applying the unit-impulse sequence δ(n) to the input of the system, the outputs are the impulse response coefficients and can be computed as follows: h(0) = y(0) = b0 · 1 + b1 · 0 + b2 · 0 = b0 h(1) = y(1) = b0 · 0 + b1 · 1 + b2 · 0 = b1 h(2) = y(2) = b0 · 0 + b1 · 0 + b2 · 1 = b2 h(3) = y(3) = b0 · 0 + b1 · 0 + b2 · 0 = 0 ... Therefore, the impulse response of the system defined in Equation (3.14) is {b0, b1, b2,0,0,...}.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 127 x(n) x(n − 1) x(n − L + 1) bL−1b1b0 y(n) z−1 z−1 + ++ Σ Figure 3.6 Detailed signal-flow diagram of an FIR filter The I/O equation given in (3.14) can be generalized with L coefficients, expressed as y(n) = b0x(n) + b1x(n − 1) +···+bL−1x(n − L + 1) = L−1 l=0 bl x(n − l). (3.15) Substituting x(n) = δ(n) into Equation (3.15), the output is the impulse response expressed as h(n) = L−1 l=0 bl δ(n − l) = bn n = 0, 1,...,L − 1 0 otherwise . (3.16) Therefore, the length of the impulse response is L for the system defined in Equation (3.15). Such a system is called a finite impulse response (FIR) filter. The coefficients, bl , l = 0,1,...,L − 1, are called filter coefficients (also called as weights or taps). For FIR filters, the filter coefficients are identical to the impulse response coefficients. The signal-flow diagram of the system described by the I/O equation (3.15) is illustrated in Figure 3.6. The string of z−1 units is called a tapped-delay line. The parameter, L, is the length of the FIR filter. Note that the order of filter is L − 1 for the FIR filter with length L since they are L − 1 zeros. The design and implementation of FIR filters will be further discussed in Chapter 4. The moving (running) average filter is a simple example of FIR filter. Consider an L-point moving-average filter defined as y(n) = 1 L [x(n) + x(n − 1) +···+x(n − L + 1)] = 1 L L−1 l=0 x(n − l), (3.17) where each output signal is the average of L consecutive input samples. Implementation of Equation (3.17) requires L − 1 additions and L memory locations for storing signal samples x(n), x(n − 1),..., x(n − L + 1) in a memory buffer. Note that the division by a constant L can be implemented by multi- plication of constant α, where α = 1/L. As illustrated in Figure 3.6, the signal samples used to compute the output signal are L samples included in the window at time n. These samples are almost the same as those samples used for the previous window at time n− 1 to compute y(n− 1), except that the oldest sample x(n − L) of the window at time n − 1is replaced by the newest sample x(n) of the window at time n. The concept of moving window is illustratedJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 128 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Window at time n Window at time n − 1 n − 1 n − L + 1 n − L Timen Figure 3.7 Time windows at current time n and previous time n − 1 in Figure 3.7. Therefore, the averaged signal, y(n), can be computed recursively as y(n) = y(n − 1) + 1 L [x(n) − x(n − L)] . (3.18) This recursive equation can be realized by using only two additions. However, we still need L + 1 memory locations for keeping L + 1 signal samples [x(n)x(n − 1) ...x(n − L)]. 
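Before the C55x assembly version in Example 3.5, it may help to see Equation (3.18) in plain C. The sketch below is only an illustration (floating-point, with a circular buffer holding the most recent L inputs); the book's fixed-point implementation follows in Example 3.5, where the division by L = 8 becomes a right shift by 3 bits.

#define L 8     /* moving-average length, matching Example 3.5 */

/* One output of the recursive moving average, Equation (3.18):
   y(n) = y(n-1) + (1/L)*[x(n) - x(n-L)].
   xbuf[] holds the previous L input samples; *index points to the
   oldest one, x(n-L), which is then overwritten by the new sample x(n). */
float movingAverage(float xnew, float *y, float xbuf[], int *index)
{
    float oldest = xbuf[*index];          /* x(n-L) */
    *y += (xnew - oldest) / (float)L;     /* recursive update of y(n) */
    xbuf[*index] = xnew;                  /* x(n) replaces x(n-L) */
    *index = (*index + 1) % L;            /* advance the circular index */
    return *y;
}

Only two additions (and one multiply or shift) are needed per output, regardless of L, which is exactly the saving that motivates the recursive form.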
Example 3.5: The following C55x assembly code illustrates the implementation of the moving- average filter of L = 8 based on Equation (3.18): L .set 8 ; Length of filter xin .usect "indata",1 xbuffer .usect "indata",L ; Length of buffer y .usect "outdata",2,1,1 ; Long-word format amov #xbuffer+L-1,XAR3 ; AR3 points to end of x buffer amov #xbuffer+L-2,XAR2 ; AR2 points to next sample mov dbl(*(y)),AC1 ; AC1 = y(n-1) in long format mov *(xin),AC0 ; AC0 = x(n) sub *AR3,AC0 ; AC0 = x(n)-x(n-L) add AC0,#-3,AC1 ; AC0 = y(n-1)+1/L[x(n)-x(n-L)] mov AC1,dbl(*(y)) ; AC1 = y(n) rpt #(L-2) ; Update the tapped-delay-line mov *AR2-,*AR3- ; x(n-1) = x(n) mov *(xin),AC0 ; Update the newest sample x(n) mov AC0,*AR3 ; x(n) = input xin Consider an LTI system illustrated in Figure 3.8, the output of the system can be expressed as y(n) = x(n) ∗ h(n) = h(n) ∗ x(n) = ∞ l=−∞ x(l)h(n − l) = ∞ l=−∞ h(l)x(n − l), (3.19) where * denotes the linear convolution. The exact internal structure of the system is either unknown or ignored. The only way to interact with the system is by using its input and output terminals as shown in Figure 3.8. This ‘black box’ representation is a very effective way to depict complicated DSP systems. h(n) y(n) = x(n)*h(n)x(n) Figure 3.8 An LTI system expressed in time domainJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 129 A digital system is called the causal system if and only if h(n) = 0, n < 0. (3.20) A causal system does not provide a zero-state response prior to input application; that is, the output depends only on the present and previous samples of the input. This is an obvious property for real-time DSP systems since we simply do not have future data. However, if the data is recorded and processed later, the algorithm operating on this data set does not need to be causal. For a causal system, the limits on the summation of Equation (3.19) can be modified to reflect this restriction as y(n) = ∞ l=0 h(l)x(n − l). (3.21) Example 3.6: Consider the I/O equation of the digital system expressed as y(n) = bx(n) − ay(n − 1), (3.22) where each output signal y(n) is dependent on the current input signal x(n) and the previous output signal y(n − 1). Assuming that the system is causal, i.e., y(n) = 0 for n < 0 and let x(n) = δ(n). The output signals are computed as y(0) = bx(0) − ay(−1) = b y(1) = bx(1) − ay(0) =−ay(0) =−ab y(2) = bx(2) − ay(1) =−ay(1) = a2b ... In general, we have y(n) = (−1)nanb, n = 0, 1, 2,...,∞. This system has infinite impulse response h(n) if the coefficients a and b are nonzero. A digital filter can be classified as either an FIR filter or an infinite impulse response (IIR) filter, depending on whether or not the impulse response of the filter is of finite or infinite duration. The system defined in Equation (3.22) is an IIR system (or filter) since it has infinite impulse response as shown in Example 3.6. The I/O equation of the IIR system can be generalized as y(n) = b0x(n) + b1x(n − 1) +···+bL−1x(n − L + 1) − a1 y(n − 1) −···−aM y(n − M) = L−1 l=0 bl x(n − l) − M m=1 am y(n − m). (3.23) This IIR system is represented by a set of feedforward coefficients {bl ,l = 0, 1, ...,L − 1} and a set of feedback coefficients {am, m = 1, 2, ...,M}. Since the outputs are fed back and combined with the weighted inputs, IIR systems are feedback systems. Note that when all am are zero, Equation (3.23) is identical to Equation (3.15). 
Therefore, an FIR filter is a special case of an IIR filter without feedback coefficients.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 130 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Example 3.7: The IIR filters given in Equation (3.23) can be implemented using the MATLAB function filter as follows: yn = filter(b, a, xn); The vector b contains feedforward coefficients {bl , l = 0, 1,...,L − 1} and the vector a contains feedback coefficients {am, m = 0, 1, 2,...,M, where a0 = 1}. The signal vectors xn and yn are the input and output buffers of the system, respectively. The FIR filter defined in Equation (3.15) can be implemented as follows: yn = filter(b, 1, xn); This is because all am are zero except a0 = 1 for an FIR filter. Example 3.8: Assume that L is large enough so that the oldest sample x(n − L) can be ap- proximated by its average y(n − 1). The moving-average filter defined in Equation (3.18) can be approximated as y(n) ∼= 1 − 1 L y(n − 1) + 1 L x(n) = (1 − α) y(n − 1) + αx(n), (3.24) where α = 1/L. This is a simple first-order IIR filter. Compared with Equation (3.18), we need two multiplications instead of one, but only need two memory locations instead of L + 1. Thus, Equation (3.24) is the most efficient way of approximating a moving-average filtering. 3.2.2 The z-Transform Continuous-time systems are commonly analyzed using the Laplace transform. For discrete-time systems, the transform corresponding to the Laplace transform is the z-transform. The z-transform (ZT[.]) of a digital signal, x(n), −∞ < n < ∞, is defined as the power series: X(z) = ∞ n=−∞ x(n)z−n, (3.25) where X(z) represents the z-transform of x(n). The variable z is a complex variable, and can be expressed in polar form as z = rejθ , (3.26) where r is the magnitude (radius) of z and θ is the angle of z. When r = 1, |z|=1 is called the unit circle on the z-plane. Since the z-transform involves an infinite power series, it exists only for those values of z where the power series defined in Equation (3.25) converges. The region on the complex z-plane in which the power series converges is called the region of convergence. For causal signals, the two-sided z-transform defined in Equation (3.25) becomes a one-sided z-transform expressed as X(z) = ∞ n=0 x(n)z−n. (3.27)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 131 Example 3.9: Consider the exponential function x(n) = anu(n). The z-transform can be computed as X(z) = ∞ n=−∞ an z−nu(n) = ∞ n=0 az−1 n. Using the infinite geometric series given in Appendix A, we have X(z) = 1 1 − az−1 = z z − a if az−1 < 1. The equivalent condition for convergence is |z| > |a| , which is the region outside the circle with radius a. The properties of the z-transform are extremely useful for the analysis of discrete-time LTI systems. These properties are summarized as follows: 1. Linearity (superposition): The z-transform of the sum of two sequences is the sum of the z-transforms of the individual sequences. That is, ZT [a1x1(n) + a2x2(n)] = a1 ZT [x1(n)] + a2 ZT [x2(n)] = a1 X1(z) + a2 X2(z), (3.28) where a1 and a2 are constants. 2. Time shifting: The z-transform of the shifted (delayed) signal y(n) = x(n − k)is Y(z) = ZT [x(n − k)] = z−k X(z). (3.29) Thus, the effect of delaying a signal by k samples is equivalent to multiplying its z-transform by a factor of z−k. For example, ZT [x(n − 1)] = z−1 X(z). The unit delay z−1 corresponds to a time shift of one sample in the time domain. 3. 
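Before turning to the analytical example that follows, the definition can also be checked numerically. The C99 sketch below (a host-side illustration, not from the book) sums Equation (3.27) for a truncated exponential sequence and compares the result with the closed form z/(z − a) that Example 3.9 derives.

#include <stdio.h>
#include <complex.h>

/* Evaluate X(z) = sum_{n=0}^{N-1} x(n) z^{-n} for a finite causal sequence. */
double complex ztransform(const double *x, int N, double complex z)
{
    double complex X = 0.0;
    double complex zpow = 1.0;            /* z^{-n}, starting with n = 0 */
    int n;

    for (n = 0; n < N; n++) {
        X += x[n] * zpow;
        zpow /= z;                        /* next power of z^{-1} */
    }
    return X;
}

int main(void)
{
    double x[16];                         /* x(n) = a^n u(n), truncated to 16 terms */
    double a = 0.5;
    int n;

    x[0] = 1.0;
    for (n = 1; n < 16; n++)
        x[n] = a * x[n - 1];

    printf("X(2) = %.6f, closed form z/(z-a) = %.6f\n",
           creal(ztransform(x, 16, 2.0)), 2.0 / (2.0 - a));
    return 0;
}

At z = 2, which lies well inside the region of convergence, the truncated sum already agrees with the closed form to about five decimal places.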
Convolution: Consider the signal x(n) = x1(n) ∗ x2(n), (3.30) we have X(z) = X1(z)X2(z). (3.31) The z-transform converts the convolution in time domain to the multiplication in z domain.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 132 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS The inverse z-transform is defined as x(n) = ZT−1[X(z)] = 1 2π j C X(z)zn−1dz, (3.32) where C denotes the closed contour of X(z) taken in a counterclockwise direction. Several methods are available for finding the inverse z-transform: long division, partial-fraction expansion, and residue method. A limitation of the long-division method is that it does not lead to a closed form solution. However, it is simple and lends itself to software implementation. Both the partial-fraction-expansion and the residue methods lead to closed form solutions. The main disadvantage is the need to factorize the denominator polynomial, which is difficult if the order of X(z) is high. 3.2.3 Transfer Functions Consider the LTI system illustrated in Figure 3.8. Using the convolution property, we have Y(z) = X(z)H(z), (3.33) where X(z) = ZT[x(n)], Y(z) = ZT[y(n)], and H(z) = ZT[h(n)]. The combination of time- and frequency-domain representations of LTI system is illustrated in Figure 3.9. This diagram shows that we can replace the time-domain convolution by the z-domain multiplication. The transfer function of an LTI system is defined in terms of the system’s input and output. From Equation (3.33), we have H(z) = Y(z) X(z) . (3.34) The z-transform can be used in creating alternative filters that have exactly the same inputÐoutput behavior. An important example is the cascade or parallel connection of two or more systems, as illustrated in Figure 3.10. In the cascade (series) interconnection shown in Figure 3.10(a), we have Y1(z) = X(z)H1(z) and Y(z) = Y1(z)H2(z). Thus, Y(z) = X(z)H1(z)H2(z). h(n) y(n) = x(n)*h(n)x(n) H(z) Y(z) = X(z)H(z)X(z) ZT ZT ZT−1 Figure 3.9 A block diagram of LTI system in both time domain and z domainJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 133 H(z) H1(z) H2(z) x(n) y(n) y2(n) y1(n) H(z) H1(z) H2(z) x(n) X(z) y(n) Y(z) y1(n) Y1(z) (a) (b) Figure 3.10 Interconnect of digital systems: (a) cascade form; (b) parallel form Therefore, the overall transfer function of the cascade of the two systems is H(z) = H1(z)H2(z) = H2(z)H1(z). (3.35) Since multiplication is commutative, the two systems can be cascaded in either order to obtain the same overall system. The overall impulse response of the system is h(n) = h1(n) ∗ h2(n) = h2(n) ∗ h1(n). (3.36) Similarly, the overall impulse response and transfer function of the parallel connection of two LTI systems shown in Figure 3.10(b) are given by h(n) = h1(n) + h2(n) (3.37) and H(z) = H1(z) + H2(z). (3.38) If we can multiply several transfer functions to get a higher-order system, we can also factor polynomials to break down a large system into smaller sections. The concept of parallel and cascade implementation will be further discussed in the realization of IIR filters in Chapter 5. Example 3.10: The LTI system with transfer function H(z) = 1 − 2z−1 + z−3 can be factored as H(z) = 1 − z−1 1 − z−1 − z−2 = H1(z)H2(z). 
Thus, the overall system H(z) can be realized as the cascade of the first-order system H1(z) = 1 − z−1 and the second-order system H2(z) = 1 − z−1 − z−2.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 134 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS The I/O equation of an FIR filter is given in Equation (3.15). Taking the z-transform of both sides, we have Y(z) = b0 X(z) + b1z−1 X(z) +···+bL−1z−(L−1) X(z) = b0 + b1z−1 +···+bL−1z−(L−1) X(z). (3.39) Therefore, the transfer function of the FIR filter is expressed as H(z) = b0 + b1z−1 +···+bL−1z−(L−1) = L−1 l=0 bl z−l . (3.40) Similarly, taking the z-transform of both sides of the IIR filter defined in Equation (3.23) yields Y(z) = b0 X(z) + b1z−1 X(z) +···+bL−1z−L+1 X(z) − a1z−1Y(z) −···−aM z−M Y(z) = L−1 l=0 bl z−l X(z) − M m=1 am z−m Y(z). (3.41) By rearranging the terms, we can derive the transfer function of the IIR filter as H(z) = L−1 l=0 bl z−l 1 + M m=1 am z−m = L−1 l=0 bl z−l M m=0 am z−m , (3.42) where a0 = 1. A detailed block diagram of an IIR filter is illustrated in Figure 3.11 for M = L − 1. Example 3.11: Consider the moving-average filter given in Equation (3.17). Taking the z-transform of both sides, we have Y(z) = 1 L L−1 l=0 z−l X(z). y(n) z−1z−1 z−1 z−1 y(n − 1) y(n − 2) y(n − M) −a1 −a2 −aMbL − 1 b2 b1 b0x(n) x(n−L+1) Figure 3.11 Detailed signal-flow diagram of an IIR filterJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 135 Using the geometric series defined in Appendix A, the transfer function of the filter can be expressed as H(z) = 1 L L−1 l=0 z−l = 1 L 1 − z−L 1 − z−1 = Y(z) X(z) . (3.43) This equation can be rearranged as Y(z) = z−1Y(z) + 1 L X(z) − z−L X(z) . Taking the inverse z-transform of both sides, we obtain y(n) = y(n − 1) + 1 L [x(n) − x(n − L)] . This is an effective way of deriving Equation (3.18) from (3.17). 3.2.4 Poles and Zeros Factoring the numerator and denominator polynomials of H(z), Equation (3.42) can be expressed as the following rational function: H(z) = b0 L−1 l=1 (z − zl ) M m=1 (z − pm) = b0(z − z1)(z − z2) ···(z − zL−1) (z − p1)(z − p2) ···(z − pM ) . (3.44) The roots of the numerator polynomial are the zeros of the transfer function H(z) since they are the values of z for which H(z) = 0. Thus, H(z) given in Equation (3.44) has (L − 1) zeros at z = z1, z2,...,zL−1. The roots of the denominator polynomial are the poles since they are the values of z such that H(z) =∞, and there are M poles at z = p1, p2,...,pM . The LTI system described in Equation (3.44) is a pole-zero system, while the system described in Equation (3.40) is an all-zero system. Example 3.12: The roots of the numerator polynomial defined in Equation (3.43) determine the zeros of H(z), i.e., zL − 1 = 0. Using the complex arithmetic given in Appendix A, we have zl = e j(2π/L)l , l = 0, 1,...,L − 1. (3.45) Therefore, there are L equally spaced zeros on the unit circle |z|=1. Similarly, the poles of H(z) are determined by the roots of the denominator zL−1(z − 1). Thus, there are L − 1 poles at the origin z = 0 and one pole at z = 1. A pole-zero diagram of H(z) for L = 8 on the complex z-plane is illustrated in Figure 3.12. The pole-zero diagram provides an insight into the properties of an LTI system. To find poles and zeros of a rational function H(z), we can use the MATLAB function roots on both the numerator and denominator polynomials. 
Another useful MATLAB function for analyzing transfer function is zplane(b,a), which displays the pole-zero diagram of H(z).JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 136 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Zero Re[z] Pole |z| =1 Im[z] 7 Figure 3.12 Pole-zero diagram of the moving-average filter, L = 8 Example 3.13: Consider the IIR filter with the transfer function H(z) = 1 1 − z−1 + 0.9z−2 . We can plot the pole-zero diagrams using the following MATLAB script (example3_13a.m): b=[1]; a=[1, -1, 0.9]; zplane(b,a); Similarly, we can plot (Figure 3.13) the pole-zero diagram of moving-average filter using the following MATLAB script (example3_13b.m) for L = 8: b=[1, 0, 0, 0, 0, 0, 0, 0, -1]; a=[1, -1]; zplane(b,a); As shown in Figure 3.13, the moving-average filter has a single pole at z = 1, which is canceled by the zero at z = 1. In this case, the pole-zero cancellation occurs in the system transfer function itself. The portion of the output y(n) that is due to the poles of X(z) is called the forced response of the system. The portion of the output that is due to the poles of H(z) is called the natural response. If a system has all its poles within the unit circle, its natural response decays to zero as n →∞, and this is called the transient response. If the input to such a system is a sinusoidal signal, the corresponding forced response is called the sinusoidal steady-state response. Example 3.14: Consider the recursive moving-window filter given in Equation (3.24). Taking the z-transform of both sides and rearranging terms, we obtain the transfer function H(z) = α 1 − (1 − α) z−1 . (3.46) This is a simple first-order IIR filter with a zero at the origin and a pole at z = 1 − α. A pole-zero plot of H(z) given in Equation (3.46) is illustrated in Figure 3.14. Note that α = 1 L results inJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 137 −1 −1 −0.8 −0.6 −0.4 −0.2 0.2 0.4 0.6 0.8 0.1 0 −0.5 0 7 10.5 Real part Imaginary part Figure 3.13 A pole-zero diagram generated by MATLAB 1 − α = (L − 1) L, which is slightly less than 1. For a longer window, L is large; the value of 1 − α closes to 1, and the pole is closer to the unit circle. An LTI system H(z) is stable if and only if all the poles are inside the unit circle. That is, |pm| < 1, for all m. (3.47) In this case, limn→∞ {h(n)} = 0. A system is unstable if H(z) has pole(s) outside the unit circle or multiple- order pole(s) on the unit circle. For example, if H(z) = z/(z − 1)2, then h(n) = n, which is unstable. A system is marginally stable, or oscillatory bounded, if H(z) has first-order pole(s) that lie on the unit circle. For example, if H(z) = z/(z + 1), then h(n) = (−1)n, n ≥ 0. Zero Re[z] Pole z = 1 Im[z] Figure 3.14 Pole-zero diagram of the recursive first-order IIR filterJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 138 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Example 3.15: Given an LTI system with transfer function H(z) = z z − a . There is a zero at the origin z = 0 and a pole at z = a. From Example 3.9, we have h(n) = an, n ≥ 0. When |a| > 1, i.e., the pole at z = a is outside the unit circle, we have limn→∞ h(n) →∞, that is an unstable system. However, when |a| < 1, i.e., the pole is inside the unit circle, we have limn→∞ h(n) → 0, which is a stable system. 
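Example 3.15 can be verified numerically. The short C program below (an illustration only) prints the first few samples of h(n) = a^n for one pole inside and one pole outside the unit circle, showing the decaying and growing behavior that distinguishes a stable system from an unstable one.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double a[2] = { 0.9, 1.1 };   /* pole inside (stable) and outside (unstable) the unit circle */
    int i, n;

    for (i = 0; i < 2; i++) {
        printf("a = %.1f: h(n) =", a[i]);
        for (n = 0; n < 10; n++)
            printf(" %.3f", pow(a[i], (double)n));
        printf("\n");   /* decays toward zero for a = 0.9; grows without bound for a = 1.1 */
    }
    return 0;
}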
3.2.5 Frequency Responses The frequency response of a digital system can be readily obtained from its transfer function H(z)by setting z = e jω and obtain H (ω) = H (z) |z=e jω = ∞ n=−∞ h(n)z−n |z=e jω = ∞ n=−∞ h(n)e− jωn. (3.48) Thus, the frequency response H(ω) of the system is obtained by evaluating the transfer function on the unit circle |z| = e jω = 1. As summarized in Table 3.1, the digital frequency is in the range of −π ≤ ω ≤ π. The characteristics of the system can be described using the frequency response. In general, H(ω)is a complex-valued function expressed in polar form as H(ω) = |H(ω)| e jφ(ω), (3.49) where |H(ω)| is the magnitude (or amplitude) response and φ(ω) is the phase response. The magnitude response |H(ω)| is an even function of ω, and the phase response φ(ω) is an odd function. Thus, we only need to evaluate these functions in the frequency region 0 ≤ ω ≤ π. |H(ω)|2 is the squared-magnitude response, and |H(ω0)| is the system gain at frequency ω0. Example 3.16: The moving-average filter expressed as y(n) = 1 2 [x(n) + x(n − 1)] , n ≥ 0JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 139 is a simple first-order FIR filter. Taking the z-transform of both sides and rearranging the terms, we obtain H(z) = 1 2 1 + z−1 . From Equation (3.48), we have H(ω) = 1 2 1 + e− jω = 1 2 (1 + cos ω − j sin ω) , |H(ω)|2 = {Re[H(ω)]}2 + {Im [H(ω)]}2 = 1 2 (1 + cos ω) , φ(ω) = tan−1 Im[H(ω)] Re[H(ω)] = tan−1 − sin ω 1 + cos ω . From Appendix A, sin ω = 2 sin ω 2 cos ω 2 and cos ω = 2 cos2 ω 2 − 1. Therefore, the phase response is φ(ω) = tan−1 − tan ω 2 =−ω 2 . For a given transfer function H(z) expressed in Equation (3.42), the frequency response can be analyzed using the MATLAB function [H,w]=freqz(b,a,N); which returns the N-point frequency vector w and the complex frequency response vector H. Example 3.17: Consider the IIR filter defined as y(n) = x(n) + y(n − 1) − 0.9y(n − 2). The transfer function is H(z) = 1 1 − z−1 + 0.9z−2 . The MATLAB script (example3_17a.m) for analyzing the magnitude and phase responses of this IIR filter is listed as follows: b=[1]; a=[1, -1, 0.9]; freqz(b,a); Similarly, we can plot the magnitude and phase responses (shown in Figure 3.15) of the moving- average filter for L = 8 using the following script (example3_17b.m): b=[1, 0, 0, 0, 0, 0, 0, 0, -1]; a=[1, -1]; freqz(b,a);JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 140 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS 0 −40 −20 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 −200 −100 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Normalized frequency (xπ rad/sample) Normalized frequency (xπ rad/sample) Magnitude (dB) Phase (degrees) Figure 3.15 Magnitude (top) and phase responses of a moving-average filter, L = 8 A useful method of obtaining the brief frequency response of an LTI system is based on the geometric evaluation of its poles and zeros. For example, consider a second-order IIR filter expressed as H(z) = b0 + b1z−1 + b2z−2 1 + a1z−1 + a2z−2 . (3.50) The roots of the characteristic equation z2 + a1z + a2 = 0 (3.51) are the poles of the filter, which may be either real or complex. Complex poles can be expressed as p1 = rejθ and p2 = re− jθ , (3.52) where r is radius of the pole and θ is the angle of the pole. Therefore, (3.51) becomes z − rejθ z − re− jθ = z2 − 2r cos θ + r 2 = 0. (3.53) Comparing this equation with (3.51), we have r = √ a2 and θ = cos−1 −a1 2r . 
(3.54) The system with a pair of complex-conjugated poles as given in Equation (3.52) is illustrated in Figure 3.16. The filter behaves as a digital resonator for r close to unity. The digital resonator is a bandpass filter with its passband centered at the resonant frequency θ.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 SYSTEM CONCEPTS 141 Re[z] z = 1 Im[z] r r θ θ Figure 3.16 A second-order IIR filter with complex-conjugated poles Similarly, we can obtain two zeros, z1 and z2, by evaluating b0z2 + b1z + b2 = 0. Thus, the transfer function defined in Equation (3.50) can be expressed as H(z) = b0 (z − z1)(z − z2) (z − p1)(z − p2) . (3.55) In this case, the frequency response is given by H(ω) = b0 e jω − z1 e jω − z2 e jω − p1 e jω − p2 . (3.56) The magnitude response can be obtained by evaluating |H (ω)|as the point z moves in counterclockwise direction from z = 0toz =−1(π) on the unit circle. As the point z moves closer to the pole p1, the magnitude response increases. The closer r is to the unity, the sharper the peak. On the other hand, as the point z moves closer to the zero z1, the magnitude response decreases. The magnitude response exhibits a peak at the pole angle (or frequency), whereas the magnitude response falls to the valley at the angle of zero. 3.2.6 Discrete Fourier Transform Toperform frequency analysis of x(n), we can convert the time-domain signal into frequency domain using the z-transform defined in Equation (3.27), and the frequency analysis can be performed by substituting z = e jω as shown in Equation (3.48). However, X(ω) is a continuous function of continuous frequency ω, and it also requires an infinite number of x(n) samples for calculation. Therefore, it is difficult to compute X(ω) using digital hardware. The discrete Fourier transform (DFT) of N-point signals {x(0), x(1), x(2), ..., x(N−1)} can be obtained by sampling X(ω) on the unit circle at N equally-spaced samples at frequencies ωk = 2πk/N, k = 0,1,..., N − 1. From Equation (3.48), we have X(k) = X(ω)|ω=2πk/N = N−1 n=0 x(n)e− j 2πk N n, k = 0, 1,...,N − 1, (3.57)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 142 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS where n is the time index, k is the frequency index, and X(k)isthekth DFT coefficient. The DFT can be manipulated to obtain a very efficient computing algorithm called the fast Fourier transform (FFT). The derivation, implementation, and application of DFT and FFT will be further discussed in Chapter 6. MATLAB provides the function fft(x) to compute the DFT of the signal vector x. The function fft(x,N) performs N-point FFT. If the length of x is less than N, then x is padded with zeros at the end. If the length of x is greater than N, function fft(x,N) truncates the sequence x and performs DFT of the first N samples only. DFT generates N coefficients X(k) for k = 0, 1,. . . N − 1. The frequency resolution of the N-point DFT is = fs N . (3.58) The frequency fk (in Hz) corresponding to the index k can be computed by fk = k = kfs N , k = 0, 1,...,N − 1. (3.59) The Nyquist frequency ( fs/2) corresponds to the frequency index k = N/2. Since the magnitude |X(k)| is an even function of k, we only need to display the spectrum for 0 ≤ k ≤ N/2 (or 0 ≤ ωk ≤ π). Example 3.18: Similar to Example 3.1, we can generate 100 samples of sinewave with A = 1, f = 1 kHz, and sampling rate of 10 kHz. 
The magnitude response of signal can be computed and plotted (Figure 3.17) using the following MATLAB script (example3_18.m): N=100; f = 1000; fs = 10000; n=[0:N-1]; k=[0:N-1]; omega=2*pi*f/fs; xn=sin(omega*n); Xk=fft(xn,N); % Perform DFT magXk=20*log10(abs(Xk)); % Compute magnitude spectrum plot(k, magXk); axis([0, N/2, -inf, inf]); % Plot from 0 to pi xlabel('Frequency index, k'); ylabel('Magnitude in dB'); From Equation (3.58), frequency resolution is 100 Hz. The peak spectrum shown in Figure 3.17 is located at the frequency index k = 10, which corresponds to 1000 Hz as indicated by Equa- tion (3.59). 3.3 Introduction to Random Variables The signals encountered in practice are often random signals such as speech and music. In this section, we will briefly introduce the basic concepts of random variables. 3.3.1 Review of Random Variables An experiment that has at least two possible outcomes is fundamental to the concept of probability. The set of all possible outcomes in any given experiment is called the sample space S. A random variable, x,JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 INTRODUCTION TO RANDOM VARIABLES 143 0 −300 −250 −200 −150 −100 −50 0 5 101520253035404550 Frequency index, k Magnitude, dB Figure 3.17 Magnitude spectrum of sinewave is defined as a function that maps all elements from the sample space S into points on the real line. Thus, a random variable is a number whose value depends on the outcome of an experiment. For example, considering the outcomes of rolling of a fair die N times, we obtain a discrete random variable that can be any one of the discrete values from 1 through 6. The cumulative probability distribution function of a random variable x is defined as F(X) = P(x ≤ X), (3.60) where X is a real number ranging from −∞ to ∞, and P(x ≤ X) is the probability of {x ≤ X}. The probability density function of a random variable x is defined as f (X) = dF (X) dX (3.61) if the derivative exists. Two important properties of f (X) are summarized as follows: ∞ −∞ f (X) dX = 1 (3.62) P (X1 < x ≤ X2) = F (X2) − F (X1) = X2 X1 f (X) dX. (3.63) If x is a discrete random variable that can take on any one of the discrete values Xi , i = 1, 2,...as the result of an experiment, we define pi = P (x = Xi ) . (3.64)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 144 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Example 3.19: Consider a random variable x that has a probability density function f (X) = 0, a, x < X1 or x > X2 X1 ≤ x ≤ X2 , which is uniformly distributed between X1 and X2. The constant value a can be computed using Equation (3.62). That is, ∞ −∞ f (X)dX = X2 X1 a · dX = a (X2 − X1) = 1. Thus, a = 1 X2 − X1 . If a random variable x is equally likely to take on any value between the two limits X1 and X2, and cannot assume any value outside that range, it is uniformly distributed in the range [X1, X2]. As shown in Figure 3.18, a uniform density function is defined as f (X) = 1 X2−X1 , X1 ≤ x ≤ X2 0, otherwise . (3.65) 3.3.2 Operations of Random Variables The statistics associated with random variables is often more meaningful from a physical viewpoint than the probability density function. The mean (expected value) of a random variable x is defined as mx = E [x] = ∞ −∞ Xf(X)dX, continuous-time case = i Xi pi , discrete-time case, (3.66) where E[.] denotes the expectation operation (or ensemble averaging). The mean mx defines the level about which the random process x fluctuates. The expectation is a linear operation. 
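The discrete probabilities in Equation (3.64) can be estimated by simulation. The C program below (an illustration only) rolls a simulated fair die many times and prints the relative frequency of each outcome, which approaches pi = 1/6 as the number of trials grows.

#include <stdio.h>
#include <stdlib.h>

#define NROLLS 600000L    /* number of simulated rolls */

int main(void)
{
    long count[6] = { 0, 0, 0, 0, 0, 0 };
    long n;

    srand(12345);                         /* fixed seed so the run is repeatable */
    for (n = 0; n < NROLLS; n++)
        count[rand() % 6]++;              /* map outcomes X = 1,...,6 to indices 0,...,5 */

    for (n = 0; n < 6; n++)
        printf("estimated P(x = %ld) = %.4f\n", n + 1, (double)count[n] / NROLLS);
    return 0;
}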
Two useful properties of the expectation operation are E [α] = α and E [αx] = αE [x], where α is a constant. If E[x] = 0, x is the zero-mean random variable. The XX2X1 X2 − X1 0 1 f(X) Figure 3.18 A uniform density functionJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 INTRODUCTION TO RANDOM VARIABLES 145 MATLAB function mean calculates the mean value. For example, the statement mx = mean(x) com- putes the mean mx of the elements in the vector x. Example 3.20: Considering the rolling of a fair die N times (N →∞), the probability of out- comes is listed as follows: Xi 1 2 3 4 5 6 pi 1/6 1/6 1/6 1/6 1/6 1/6 The mean of outcomes can be computed as mx = 6 i=1 pi Xi = 1 6 (1 + 2 + 3 + 4 + 5 + 6) = 3.5. The variance is a measure of the spread about the mean, and is defined as σ 2 x = E (x − mx )2 = ∞ −∞ (X − mx )2 f (X)dX, continuous-time case = i pi (Xi − mx )2, discrete-time case, (3.67) where (x − mx ) is the deviation of x from the mean value mx . The positive square root of variance is called the standard deviation σx . The MATLAB function std calculates standard deviation of the elements in the vector. The variance defined in Equation (3.67) can be expressed as σ 2 x = E (x − mx )2 = E x2 − 2xmx + m2 x = E x2 − 2mx E (x) + m2 x = E x2 − m2 x . (3.68) We call E x2 the mean-square value of x. Thus, the variance is the difference between the mean-square value and the square of the mean value. If the mean value is equal to zero, then the variance is equal to the mean-square value. For a zero-mean random variable x, i.e., mx = 0, we have σ 2 x = E x2 = Px , (3.69) which is the power of x. Consider the uniform density function defined in Equation (3.65). The mean of the function can be computed by mx = E [x] = ∞ −∞ Xf(X)dX = 1 X2 − X1 X2 X1 XdX = X2 − X1 2 . (3.70)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 146 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS The variance of the function is σ 2 x = E x2 − m2 x = ∞ −∞ X 2 f (X)dX − m2 x = 1 X2 − X1 X2 X1 X 2 dX − m2 x = 1 X2 − X1 · X 3 2 − X 3 1 3 − m2 x = (X2 − X1)2 12 . (3.71) In general, if x is a uniformly distributed random variable in the interval (− , ), we have mx = 0 and σ 2 x = 2 3. (3.72) Example 3.21: The MATLAB function rand generates pseudo-random numbers uniformly dis- tributed in the interval [0, 1]. From Equation (3.70), the mean of the generated pseudo-random numbers is 0.5. From Equation (3.71), the variance is 1/12. To generate zero-mean random numbers, we subtract 0.5 from every generated random number. The numbers are now distributed in the interval [−0.5, 0.5]. To make these pseudo-random numbers with unit variance, i.e., σ 2 x = 2 3 = 1, the generated numbers must be equally distributed in the interval [− √ 3, √ 3]. Therefore, we have to multiply 2 √ 3 to every generated number that was subtracted by 0.5. The following MATLAB statement can be used to generate the uniformly distributed random numbers with mean 0 and variance 1: xn = 2*sqrt(3)*(rand-0.5); The waveform of zero-mean, unit-variance (σ 2 x = 1) white noise generated by MATLAB code (example3_21.m) is shown in Figure 3.19. A sinewave corrupted by white noise v(n) can be expressed as x(n) = A sin(ωn) + v(n). (3.73) When a signal s(n) with power Ps is corrupted by a noise v(n) with power Pv, the signal-to-noise ratio (SNR) in dB is defined as SNR = 10 log10 Ps Pv . (3.74) From Equation (3.69), the power of sinewave defined in Equation (3.6) can be computed as Ps = E A2 sin2(ωn) = A2/2. 
(3.75) Example 3.22: If we want to generate signal x(n) expressed in Equation (3.73), where v(n)is a zero-mean, unit-variance white noise. As shown in Equation (3.74), SNR is determined by the power of sinewave. As shown in Equation (3.75), when the sinewave amplitude A = √ 2, the power is equal to 1. From Equation (3.74), the SNR is 0 dB.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 FIXED-POINT REPRESENTATIONS AND QUANTIZATION EFFECTS 147 2 1.5 1 0.5 0 Amplitude −0.5 −1 −1.5 −20 50 100 150 Time index, n 200 250 Figure 3.19 A zero-mean, unit-variance white noise We can generate a sinewave corrupted by the zero-mean, unit-variance white noise with SNR = 0 dB using MATLAB script example3_22.m. Example 3.23: We can compute the DFT of signal x(n) to obtain X(k). The magnitude spectrum in dB scale can be calculate as 20 log10 |X(k)| for k = 0, 1,...,N/2. Using the signal x(n) generated in Example 3.22, magnitude spectrum can be computed and displayed using the MATLAB code example3_23.m. The noisy spectrum is shown in Figure 3.20. Comparing this figure with Figure 3.17, we show that the power of white noise is uniformly distributed from 0 to π, while the power of sinewave is concentrated at its frequency 0.2π. 3.4 Fixed-Point Representations and Quantization Effects The basic element in digital hardware is the binary device that contains one bit of information. A register (or memory unit) containing B bits of information is called a B-bit word. There are several different methods for representing numbers and carrying out arithmetic operations. In this book, we focus on widely used fixed-point implementations. 3.4.1 Fixed-Point Formats The most commonly used fixed-point representation of a fractional number x is illustrated in Figure 3.21. The wordlength is B(= M + 1) bits, i.e., M magnitude bits and one sign bit. The most significant bit (MSB) is the sign bit, which represents the sign of the number as follows: b0 = 0, 1, x ≥ 0 x < 0 (positive number) (negative number) . (3.76)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 148 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS 40 Spectrum of noisy sinewave 35 30 25 Magnitude, dB 20 15 10 5 0204060 Frequency index, k 80 100 120 Figure 3.20 Spectrum of sinewave corrupted by white noise, SNR = 0dB The remaining M bits give the magnitude of the number. The rightmost bit bM is called the least significant bit (LSB), which represents precision of the number. As shown in Figure 3.21, the decimal value of a positive (b0 = 0) binary fractional number x can be expressed as (x)10 = b1 · 2−1 + b2 · 2−2 +···+bM · 2−M = M m=1 bm2−m. (3.77) Example 3.24: The largest (positive) 16-bit fractional number in binary format is x = 0111 1111 1111 1111b (the letter ‘b’ denotes that the number is in binary representation). The decimal value of this number can be obtained as (x)10 = 15 m=1 2−m = 2−1 + 2−2 +···+2−15 = 1 − 2−15 ≈ 0.999969. Binary point Sign bit x = b0 . b1 b2 ... bM−1 bM Figure 3.21 Fixed-point representation of binary fractional numbersJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 FIXED-POINT REPRESENTATIONS AND QUANTIZATION EFFECTS 149 The smallest nonzero positive number is x = 0000 0000 0000 0001b. The decimal value of this number is (x)10 = 2−15 = 0.000030518. The negative numbers (b0 = 1) can be represented using three different formats: the sign-magnitude, the 1’s complement, and the 2’s complement. 
Fixed-point DSP processors usually use the 2’s com- plement format to represent negative numbers because it allows the processor to perform addition and subtraction using the same hardware. With the 2’s complement format, a negative number is obtained by complementing all the bits of the positive binary number and then adding 1 to the LSB. In general, the decimal value of a B-bit binary fractional number can be calculated as (x)10 =−b0 + 15 m=1 bm2−m. (3.78) For example, the smallest (negative) 16-bit fractional number in binary format is x = 1000 0000 0000 0000b. From Equation (3.78), its decimal value is −1. Therefore, the range of fractional binary numbers is −1 ≤ x ≤ 1 − 2−M . (3.79) For a 16-bit fractional number x, the decimal value range is −1 ≤ x ≤ 1 − 2−15. Example 3.25: 4-bit binary numbers represent both integers and fractional numbers using the 2’s complement format and their corresponding decimal values are listed in Table 3.2. Example 3.26: If we want to initialize a 16-bit data x with the constant decimal value 0.625, we can use the binary form x = 0101 0000 0000 0000b, the hexidecimal form x = 0x5000, or the decimal integer x = 214 + 212 = 20480. Table 3.2 4-bit binary numbers in 2’s complement format and their corresponding decimal values Binary numbers Integers (sxxx.) Fractions (s.xxx) 0000 0 0.000 0001 1 0.125 0010 2 0.250 0011 3 0.375 0100 4 0.500 0101 5 0.675 0110 6 0.750 0111 7 0.875 1000 −8 −1.000 1001 −7 −0.875 1010 −6 −0.750 1011 −5 −0.675 1100 −4 −0.500 1101 −3 −0.375 1110 −2 −0.250 1111 −1 −0.125JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 150 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS As shown in Figure 3.21, the easiest way to convert a normalized 16-bit fractional number into the integer that can be used by the C55x assembler is to move the binary point to the right by 15 bits (at the right of bM ). Since shifting the binary point 1 bit right is equivalent to multiply the fractional number by 2, this can be done by multiplying the decimal value by 215 = 32768. For example, 0.625 × 32 768 = 20 480. It is important to note that we use an implied binary point to represent the binary fractional number. It will affect the accuracy (dynamic range and precision) of the number. The binary point is purely a programmer’s convention and has no relationship with the hardware. The programmer needs to keep track of the binary point when manipulating fractional numbers in assembly language programming. Different notations can be used to represent different fractional formats. Similar to Figure 3.21, a more general fractional format Qnm is illustrated in Figure 3.22 where n + m = M = B − 1. There are n bits at the left of binary point that represent integer portion, while m bits at the right represent fractional values. The most popular used fractional number representation shown in Figure 3.21 is called the Q0.15 format (n = 0 and m = 15), which is simply also called Q15 format since there are 15 fractional bits. Note that the Qnm format is represented in MATLABas [Bm]. For example, Q15 format is represented as [16 15]. Example 3.27: The decimal value of a 16-bit binary number x = 0100 1000 0001 1000b depends on which Q format is used by the programmer. Some examples are given as follows: Q0.15, x = 2−1 + 2−4 + 2−11 + 2−12 = 0.56323 Q2.13, x = 21 + 2−2 + 2−9 + 2−10 = 2.25293 Q5.10, x = 24 + 21 + 2−6 + 2−7 = 18.02344 Example 3.28: As introduced in Chapter 2, the TMS320 assembly directives .set and .equ assign a value to a symbolic name. 
The directives .word and .short (or .int) initialize memory locations with particular data values represented in binary, hexidecimal, or integer format. Each data is treated as a 16-bit value and is separated by a comma. Some examples of the Q15 format data used for C55x are given as follows: ONE .set 32767 ; 1-2−15 ≈ 0.999969 in integer ONE_HALF .set 0x4000 ; 0.5 in hexadecimal ONE_EIGHTH .equ 1000h ; 1/8 in hexadecimal MINUS_ONE .equ 0xffff ; -1.0 in hexadecimal COEFF .short 0ff00h ; -2−7 = -0.0078125 in hexadecimal ARRAY .word 2048,-2048 ; ARRAY[0.0625, -0.625] As discussed in Chapter 1, fixed-point arithmetic is often used with DSP hardware for real-time processing because it offers fast operation and relatively economical implementation. Its drawbacks Integer Binary point Fraction Sing bit x = b0b1b2 ... bn b1b2 ...bm Figure 3.22 A general binary fractional numbersJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 FIXED-POINT REPRESENTATIONS AND QUANTIZATION EFFECTS 151 include a small dynamic range and low resolution. These problems will be discussed in details in the following sections. 3.4.2 Quantization Errors As discussed in Section 3.4.1, numbers are represented by a finite number of bits. The errors between the desired and actual values are called the finite-wordlength (finite-precision, or numerical) effects. In general, finite-precision effects can be broadly categorized into the following classes. 1. Quantization errors: (a) signal quantization (b) coefficient quantization 2. Arithmetic errors: (a) roundoff (or truncation) (b) overflow The limit cycle oscillation is another phenomenon that may occur when implementing a feedback system such as an IIR filter with finite-precision arithmetic. The output of the system may continue to oscillate indefinitely while the input remains zero. 3.4.3 Signal Quantization The analog-to-digital converter (ADC) converts an analog signal x(t) into a digital signal x(n). The input signal is first sampled to obtain the discrete-time signal x(nT) with infinite precision. Each x(nT) value is then encoded using B-bit wordlength to obtain the digital signal x(n). We assume that the signal x(n) is interpreted as the Q15 fractional number shown in Figure 3.21 such that −1 ≤ x(n) < 1. Thus, the dynamic range of fractional numbers is 2. Since the quantizer employs B bits, the number of quantization levels available is 2B. The spacing between two successive quantization levels is = 2 2B = 2−B+1 = 2−M , (3.80) which is called the quantization step (interval, width, or resolution). For example, the output of a 4-bit converter with quantization interval = 2−3 = 0.125 is summarized in Table 3.2. As discussed in Chapter 1, we use rounding (instead of truncating) for quantization in this book. The input value x(nT) is rounded to the nearest level as illustrated in Figure 3.23 for a 3-bit ADC. We assume there is a line exactly between two quantization levels. The signal value above this line will be assigned to the higher quantization level, while the signal value below this line is assigned to the lower level. For example, the discrete-time signal x(T ) in Figure 3.23 is rounded to 010b since the real value is below the middle line between 010b and 011b, while x(2T ) is rounded to 011b since the value is above the middle line. The quantization error (or noise) e(n) is the difference between the discrete-time signal x(nT) and the quantized digital signal x(n), and is be expressed as e(n) = x(n) − x(nT). (3.81) Figure 3.23 clearly shows that |e(n)| ≤ 2 . 
(3.82)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 152 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Quantization level Time, t 011 010 001 0000 T 2T e(n) Δ/2 Δ x(t) Figure 3.23 Quantization process related to a 3-bit ADC Thus, the quantization noise generated by an ADC depends on the quantization interval. The presence of more bits results in a smaller quantization step, a lower quantization noise. From Equation (3.81), we can express the ADC output as the sum of the quantizer input x(nT) and the error e(n). That is, x(n) = Q [x(nT)] = x(nT) + e(n), (3.83) where Q[.] denotes the quantization operation. Therefore, the nonlinear operation of the quantizer is modeled as a linear process that introduces an additive noise e(n) to the digital signal x(n). For an arbitrary signal with fine quantization (B is large), the quantization error e(n) is assumed to be uncorrelated with x(n), and is a random noise that is uniformly distributed in the interval [− /2, /2]. From Equation (3.70), we have E[e(n)] = − /2 + /2 2 = 0. (3.84) Thus, the quantization noise e(n) has zero mean. From Equation (3.72), the variance σ 2 e = 2 12 = 2−2B 3 . (3.85) Therefore, the larger wordlength results in smaller input quantization error. The SQNR can be expressed as SQNR = σ 2 x σ 2 e = 3 · 22Bσ 2 x , (3.86) where σ 2 x denotes the variance of the signal, x(n). Usually, the SQNR is expressed in dB as SQNR = 10 log10 σ 2 x σ 2 e = 10 log10 3 · 22Bσ 2 x = 10 log10 3 + 20B log10 2 + 10 log10 σ 2 x = 4.77 + 6.02B + 10 log10 σ 2 x . (3.87) This equation indicates that for each additional bit used in the ADC, the converter provides about 6-dB gain. When using a 16-bit ADC (B = 16), the maximum SQNR is about 98.1 dB if the input signal is aJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 FIXED-POINT REPRESENTATIONS AND QUANTIZATION EFFECTS 153 sinewave. This is because the maximum sinewave having amplitude 1.0 in decimal makes 10 log10(σ 2 x ) = 10 log10(1/2) =−3, and Equation (3.87) becomes 4.77 + 6.02 × 16 − 3.0 = 98.09. Another important fact about Equation (3.87) is that the SQNR is proportional to σ 2 x . Therefore, we want to keep the power of signal as large as possible. This is an important consideration when we discuss scaling issues in Section 3.5. Example 3.29: Effects of signal quantization may be subjectively evaluated by observing and listening to the quantized speech. The speech file timit1.asc was digitized with fs = 8 kHz and B = 16. This speech file can be viewed and played using the MATLAB script (example3_29.m): load timit1.asc; plot(timit1); soundsc(timit1, 8000, 16); where the MATLAB function soundsc autoscales and plays the vector as sound. We can simulate the quantization of data with 8-bit wordlength by qx = round(timit1/256); where the function (round) rounds the real number to the nearest integer. We then evaluate the quantization effects by plot(qx); soundsc(qx, 8000, 16); By comparing the graph and sound of timit1 and qx, the signal quantization effects may be understood. 3.4.4 Coefficient Quantization The filter coefficients, bl and am, of the digital filter determined by a filter design package such as MATLAB are usually represented using the floating-point format. When implementing a digital filter, the filter coefficients have to be quantized for a given fixed-point processor. Therefore, the performance of the fixed-point digital filter will be different from its design specification. 
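To see what this quantization step does in practice, the short C sketch below rounds a few double-precision coefficients to 16-bit Q15 integers, converts them back, and prints the per-coefficient error; the helper name and the sample coefficient values are illustrative assumptions rather than code from a particular filter design.

#include <stdio.h>
#include <math.h>

/* Quantize a double-precision coefficient to a 16-bit Q15 integer by
   rounding and saturating (hypothetical helper for illustration only) */
static short quantizeQ15(double c)
{
    long q = (long)floor(c * 32768.0 + 0.5); /* shift binary point right by 15 bits, round */
    if (q >  32767) q =  32767;              /* saturate to the 16-bit range */
    if (q < -32768) q = -32768;
    return (short)q;
}

int main(void)
{
    double b[3] = {0.0072, 0.0287, -0.8766}; /* illustrative coefficient values */
    int i;
    for (i = 0; i < 3; i++)
    {
        short  q    = quantizeQ15(b[i]);
        double back = (double)q / 32768.0;   /* coefficient value actually implemented */
        printf("b[%d]=%9.6f  Q15=%6d  error=%e\n", i, b[i], q, b[i] - back);
    }
    return 0;
}

The difference between each original coefficient and its quantized value is the coefficient quantization error discussed in the remainder of this section.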
The coefficient quantization effects become more significant when tighter specifications are used, especially for IIR filters. Coefficient quantization can cause serious problems if the poles of designed IIR filters are too close to the unit circle. This is because those poles may move outside the unit circle due to coefficient quantization, resulting in an unstable implementation. Such undesirable effects are far more pronounced in high-order systems. The coefficient quantization is also affected by the structures used for the implementation of dig- ital filters. For example, the direct-form implementation of IIR filters is more sensitive to coefficient quantization than the cascade structure consisting of sections of first- or second-order IIR filters. 3.4.5 Roundoff Noise As shown in Equation (3.10), we may need to compute the product y(n) = αx(n) in a DSP system. Assuming the wordlength associated with α and x(n)isB bits, the multiplication yields 2B-bit prod- uct y(n). In most applications, this product may have to be stored in memory or output as a B-bitJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 154 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS word. The 2B-bit product can be either truncated or rounded to B bits. Since truncation causes an undesired bias effect, we should restrict our attention to the rounding. Example 3.30: In C programming, rounding a real number to an integer number can be imple- mented by adding 0.5 to the real number and then truncating the fractional part. The following C statement y = (short)(x + 0.5); rounds the real number x to the nearest integer y. As shown in Example 3.29, MATLAB provides the function round for rounding a real number. In TMS320C55x implementation, the CPU rounds the operands enclosed by the rnd() expres- sion qualifier as mov rnd(HI(AC0)),*AR1 This instruction will round the content of the high portion of AC0(31:16) and the rounded 16-bit value is stored in the memory location pointed at by AR1. Another key word R (or r) also performs rounding operation on the operands. The following instruction mpyr AC0,AC1 multiplies and stores the rounded product in the upper portion of the accumulator AC1(31:16) and clears the lower portion of the accumulator AC1(15:0). The process of rounding a 2B-bit product to B bits is similar to that of quantizing discrete-time signal using a B-bit quantizer. Similar to Equation (3.83), the nonlinear roundoff operation can be modeled as the linear process expressed as y(n) = Q [αx(n)] = αx(n) + e(n), (3.88) where αx(n) is the 2B-bit product and e(n) is the roundoff noise due to rounding 2B-bit product to B-bit product. The roundoff noise is a uniformly distributed random process defined in Equation (3.82). Thus, it has a zero mean and its power is defined in Equation (3.85). It is important to note that most commercially available fixed-point DSP processors, such as the TMS320C55x, have double-precision accumulator(s). As long as the program is carefully written, it is possible to ensure that rounding occurs only at the final stage of calculation. For example, consider the computation of FIR filter output given in Equation (3.15). We can keep the sum of all temporary products, bl x(n − l), in the double-precision accumulator. Rounding is performed only when the final sum is saved to memory with B-bit wordlength. 3.4.6 Fixed-Point Toolbox The MATLABFixed-Point Toolboxprovides fixed-point data types and arithmetic for enabling fixed-point algorithm development. 
This toolbox has the following features: r defining fixed-point data types, scaling, rounding, and overflow methods in the MATLAB workspace; r bit-true real and complex simulation;JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 FIXED-POINT REPRESENTATIONS AND QUANTIZATION EFFECTS 155 r fixed-point arithmetic; r relational, logical, and bitwise operators; and r conversions between binary, hex, double, and built-in integers. This toolbox provides the function quantizer to construct a quantizer object. For example, q = quantizer('PropertyName1',PropertyValue1,... ) creates a quantizer object q that uses property name/property value pairs that are summarized in Table 3.3. We also can use the following syntax q = quantizer to create a quantizer object q with properties set to the following default values: mode = 'fixed'; roundmode = 'floor'; overflowmode = 'saturate'; format = [16 15]; Note that [16 15] is equivalent to Q15 format. After we have constructed a quantizer object, we can apply it to data using the quantize function with the following syntax: y = quantize(q, x) The command y = quantize(q, x) uses the quantizer object q to quantize x. When x is a numeric array, each element of x is quantized. Table 3.3 List of quantizer property name/property value pairs Property name Property value Description mode 'double' Double-precision mode 'float' Custom-precision floating-point mode 'fixed' Signed fixed-point mode 'single' Single-precision mode 'ufixed' Unsigned fixed-point mode roundmode 'ceil' Round toward negative infinity 'convergent' Convergent rounding 'fix' Round toward zero 'floor' Round toward positive infinity 'round' Round toward nearest overflowmode 'saturate' Saturate on overflow 'wrap' Wrap on overflow format [B m] Format for fixed or ufixed mode, B is wordlength, m is number of fractional bitsJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 156 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Quantized samples, Q15 marked o, Q3 marked x 0.8 0.6 0.4 0.2 0 −0.2 −0.4 Amplitude 0 5 10 15 Time index, n e(n) Figure 3.24 Quantization using Q15 and Q3 formats and the difference e(n) Example 3.31: Similar to Example 3.21, we generate a zero-mean white noise using MATLAB function rand, which uses double-precision, floating-point format. Wethen construct two quantizer objects and quantize the white noise to Q15 and Q3 (4-bit) representations. We plot the quantized noise in Q15 and Q3 formats and the difference between these two is shown in Figure 3.24 using the following MATLAB script (example3_31.m): N=16; n=[0:N-1]; xn = sqrt(3)*(rand(1,N)-0.5); % Generate zero-mean white noise q15 = quantizer('fixed', 'convergent', 'wrap', [16 15]); % Q15 q3 = quantizer('fixed', 'convergent', 'wrap', [4 3]); % Q3 y15 = quantize(q15,xn); % Quantization using Q15 format y3 = quantize(q3,xn); % Quantization using Q3 format en = y15-y3, % Difference between Q15 and Q3 plot(n,y15,'-o',n,y3,'-x',n,en); MATLAB Fixed-Point Toolbox also provides several radix conversion functions which are summarized in Table 3.4. For example, y = num2int(q,x) uses q.format to convert a number x to an integer y. Example 3.32: For testing some programs using fixed-point C programs with CCS and DSK, we may need to generate input data files for simulations. 
As shown in Example 3.31, we use MATLABJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 OVERFLOW AND SOLUTIONS 157 Table 3.4 List of radix conversion functions using a quantizer object Function Description bin2num Convert a 2’s complement binary string to a number hex2num Convert hexadecimal string to a number num2bin Convert a number to a binary string num2hex Convert a number to its hexadecimal equivalent num2int Convert a number to a signed integer to generate signal and construct a quantizer object. In order to save the Q15 data in integer format, we use the function num2int in the following MATLAB script (example3_32.m): N=16; n=[0:N-1]; xn = sqrt(3)*(rand(1,N)-0.5); % Generate zero-mean white noise q15 = quantizer('fixed', 'convergent', 'wrap', [16 15]); % Q15 Q15int = num2int(q15,xn); 3.5 Overflow and Solutions Assuming that the signals and filter coefficients have been properly normalized in the range of −1to1 for fixed-point arithmetic, the sum of two B-bit numbers may fall outside the range of −1 to 1. The term overflow is a condition in which the result of an arithmetic operation exceeds the capacity of the register used to hold that result. When using a fixed-point processor, the range of numbers must be carefully examined and adjusted in order to avoid overflow. This may be achieved by using different Qn.m formats with desired dynamic ranges. Example 3.33: Assume that a 4-bit fixed-point hardware uses the fractional 2’s complement format (see Table 3.2). If x1 = 0.875 (0111b) and x2 = 0.125 (0001b), the binary sum of x1 + x2 is 1000b. The decimal value of this signed binary number is −1, not the correct answer +1. That is, when the result exceeds the dynamic range of the register, overflow occurs and unacceptable error is produced. Similarly, if x3 =−0.5 (1100b) and x4 = 0.625(0101b). x3 − x4 = 0110b, which is +0.875, and not the correct answer −1.125. Therefore, subtraction may also result in underflow. For the FIR filter defined in Equation (3.15), this overflow will result in the severe distortion of the output y(n). For the IIR filter defined in Equation (3.23), the overflow effect is much more serious because the errors are fed back. The problem of overflow may be eliminated using saturation arithmetic and proper scaling (or constraining) signals at each node within the filter to maintain the magnitude of the signal. 3.5.1 Saturation Arithmetic Most commercially available DSP processors have mechanisms that protect against overflow and auto- matically indicate the overflow if it occurs. Saturation arithmetic prevents overflow by keeping the result at a maximum value. Saturation logic is illustrated in Figure 3.25 and can be expressed as y = ⎧ ⎨ ⎩ 1 − 2−M , x ≥ 1 − 2−M x, −1 ≤ x < 1 −1, x < −1 , (3.89)JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 158 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS x y 1 − 2−M 1 − 2−M −1 −1 Figure 3.25 Characteristics of saturation arithmetic where x is the original addition result and y is the saturated adder output. If the adder is under saturation mode, the undesired overflow can be avoided since the 32-bit accumulator fills to its maximum (or minimum) value, but does not roll over. Similar to Example 3.31, when 4-bit hardware with saturation arithmetic is used, the addition result of x1 + x2 is 0111b, or 0.875 in decimal value. Compared with the correct answer 1, there is an error of 0.125. This result is much better than the hardware without saturation arithmetic. 
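A minimal C sketch of the saturation logic in Equation (3.89), applied to the 16-bit Q15 addition of Example 3.33 (0.875 + 0.125), is given below; the helper function only mimics the behavior of the saturating adder and is not a C55x intrinsic.

#include <stdio.h>

/* Saturating addition of two Q15 numbers, mimicking Equation (3.89)
   (illustrative helper, not a C55x intrinsic)                        */
static short addQ15Sat(short x1, short x2)
{
    long sum = (long)x1 + (long)x2;   /* 32-bit intermediate cannot overflow */
    if (sum >  32767) sum =  32767;   /* clip at  1 - 2^-15 */
    if (sum < -32768) sum = -32768;   /* clip at -1         */
    return (short)sum;
}

int main(void)
{
    short x1 = 28672;                 /* 0.875 in Q15 */
    short x2 =  4096;                 /* 0.125 in Q15 */
    short wrap = (short)(x1 + x2);    /* plain 16-bit add wraps to -1.0 on 2's complement hardware */
    short sat  = addQ15Sat(x1, x2);   /* saturated add clips to 1 - 2^-15 */
    printf("wrap = %f, saturated = %f\n", wrap / 32768.0, sat / 32768.0);
    return 0;
}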
Saturation arithmetic has a similar effect of ‘clipping’ the desired waveform. This is a nonlinear operation that will add undesired nonlinear components into the signal. Therefore, saturation arithmetic can only be used to guarantee that overflow will not occur. It should not be the only solution for solving overflow problems. 3.5.2 Overflow Handling As mentioned earlier, the C55x supports the saturation logic to prevent overflow. The logic is enabled when the overflow mode bit (SATD) in status register ST1 is set (SATD = 1). When this mode is set, the accumulators are loaded with either the largest positive 32-bit value (0x00 7FFF FFFF) or the smallest negative 32-bit value (0xFF 8000 0000) if the result overflows. The C55x overflow mode bit can be set with the instruction bset SATD and reset (disabled) with the instruction bclr SATD The TMS320C55x provides overflow flags that indicate whether or not an arithmetic operation has overflowed. The overflow flag ACOVn,(n = 0, 1, 2, or 3) is set to 1 when an overflow occurs in the corresponding accumulator ACn. This flag will remain set until a reset is performed or when a status bit clear instruction is implemented. If a conditional instruction (such as a branch, return, call, or conditional execution) that tests overflow status is executed, the overflow flag will be cleared. 3.5.3 Scaling of Signals The most effective technique in preventing overflow is by scaling down the signal. For example, consider the simple FIR filter illustrated in Figure 3.26 without the scaling factor β (or β = 1). Let x(n) = 0.8JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 159 0.8 z−1 0.9 β y(n) x(n − 1)x(n) ∑+ + Figure 3.26 Block diagram of simple FIR filters with scaling factor β and x(n − 1) = 0.6, the filter output y(n) = 1.2. When this filter is implemented on a fixed-point DSP processor using Q15 format without saturation arithmetic, undesired overflow occurs. As illustrated in Figure 3.26, the scaling factor β<1 can be used to scale down the input signal. For example, when β = 0.5, we have x(n) = 0.4 and x(n − 1) = 0.3, and the result y(n) = 0.6 without overflow. If the signal x(n) is scaled by β, the corresponding signal variance changes to β2σ 2 x . Thus, the SQNR in dB given in Equation (3.87) changes to SQNR = 10 log10 β2σ 2 x σ 2 e = 4.77 + 6.02B + 10 log10 σ 2 x + 20 log10 β. (3.90) Since we perform fractional arithmetic, β<1 is used to scale down the input signal. The last term 20 log10 β has negative value. Thus, scaling down the signal reduces the SQNR. For example, when β = 0.5, 20 log10 β =−6.02 dB, thus reducing the SQNR of the input signal by about 6 dB. This is equivalent to losing 1 bit in representing the signal. 3.5.4 Guard Bits The TMS320C55x provides four 40-bit accumulators as introduced in Chapter 2. Each accumulator is split into three parts as illustrated in Figure 3.27. These guard bits are used as a head-margin for preventing overflow in iterative computations such as the FIR filtering defined in Equation (3.15). Because of the potential overflow in a fixed-point implementation, engineers need to be concerned with the dynamic range of numbers. This usually demands greater coding and testing efforts. In general, the optimum solution is combining of scaling factors, guard bits, and saturation arithmetic. The scaling factors (smaller than 1) are set as large as possible so that there maybe only some occasional overflows which can be avoided by using guard bits and saturation arithmetic. 
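Before moving to the experiments, the sketch below puts these three ingredients together for the two-tap FIR filter of Figure 3.26: the input is scaled by β, the Q15 products are accumulated at full precision (the role the guard bits play on the C55x), and saturation is applied only once when the 16-bit result is stored. The helper names are assumptions made for illustration.

#include <stdio.h>

/* Two-tap Q15 FIR filter of Figure 3.26 with input scaling, a wide
   accumulator standing in for the guard bits, and a single saturation
   when the result is stored (illustrative sketch)                      */
static short fir2Q15(short x0, short x1, short beta)
{
    const short b0 = 26214, b1 = 29491;      /* 0.8 and 0.9 in Q15 */
    long xs0 = ((long)x0 * beta) >> 15;      /* scale inputs by beta */
    long xs1 = ((long)x1 * beta) >> 15;
    long acc = xs0 * b0 + xs1 * b1;          /* Q15 x Q15 products kept at full precision */
    acc >>= 15;                              /* back to Q15 */
    if (acc >  32767) acc =  32767;          /* saturate only at the final store */
    if (acc < -32768) acc = -32768;
    return (short)acc;
}

int main(void)
{
    short x0 = 26214, x1 = 19661;            /* x(n) = 0.8, x(n-1) = 0.6 */
    short y1 = fir2Q15(x0, x1, 32767);       /* beta = 1.0 */
    short y2 = fir2Q15(x0, x1, 16384);       /* beta = 0.5 */
    printf("beta=1.0: y=%f  beta=0.5: y=%f\n", y1 / 32768.0, y2 / 32768.0);
    return 0;
}

With β = 1 the stored result saturates just below 1, while β = 0.5 gives about 0.59, consistent with the discussion of Figure 3.26 above.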
3.6 Experiments and Program Examples In this section, the first half of the experiments is used to demonstrate quantization effects, overflow and saturation arithmetic, and to determine the proper fixed-point representations. The rest of experiments emphasize on the hands-on DSP programming and implementation using the C5510 DSK. b39–b32 b31–b16 b15–b0 G Guard bits High-order bits Low-order bits HL Figure 3.27 Configuration of the TMS320C55x accumulatorsJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 160 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.5 C program for quantizing a sinusoid, quantSine.c #define BUF_SIZE 40 const short sineTable[BUF_SIZE]= { 0x0000,0x01E0,0x03C0,0x05A0,0x0740,0x08C0,0x0A00,0x0B20, 0x0BE0,0x0C40,0x0C60,0x0C40,0x0BE0,0x0B20,0x0A00,0x08C0, 0x0740,0x05A0,0x03C0,0x01E0,0x0000,0xFE20,0xFC40,0xFA60, 0xF8C0,0xF740,0xF600,0xF4E0,0xF420,0xF3C0,0xF3A0,0xF3C0, 0xF420,0xF4E0,0xF600,0xF740,0xF8C0,0xFA60,0xFC40,0x0000}; short out16[BUF_SIZE]; /* 16 bits output sample buffer */ short out12[BUF_SIZE]; /* 12 bits output sample buffer */ short out8[BUF_SIZE]; /* 8 bits output sample buffer */ short out6[BUF_SIZE]; /* 6 bits output sample buffer */ void main() { short i; for (i = 0; i < BUF_SIZE; i++) { out16[i] = sineTable[i]; /* 16-bit data */ out12[i] = sineTable[i]&0xfff0; /* Mask off 4-bit */ out8[i] = sineTable[i]&0xff00; /* Mask off 8-bit */ out6[i] = sineTable[i]&0xfc00; /* Mask off 10-bit */ } } 3.6.1 Quantization of Sinusoidal Signals The C program listed in Table 3.5 simulates an ADC with different wordlengths. Instead of shifting off the bits, we mask out the least significant 4, 8, or 10 bits of each sample, resulting in the 12, 8, or 6 bits of data samples that have the comparable amplitude to the original 16-bit data. Table 3.6 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Load the project quantSine.pjt, rebuild, and load the program to the DSK or C55x simulator. 2. Use the CCS graphic display to plot four output buffers: out16, out12, out8, and out6, as shown in Figure 3.28. Compare and describe the graphical results of each output waveforms represented by different wordlengths. 3. Find the mean and variance of quantization noise for the 12-, 8-, and 6-bit ADCs. Table 3.6 File listing for experiment exp3.6.1_quantSine Files Description quantSine.c C function for implementing quantization quantSine.pjt DSP project file quantSine.cmd DSP linker command fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 161 Figure 3.28 Quantizing 16-bit data (top-left) into 12-bit (bottom-left), 8-bit (top-right), and 6-bit (bottom-right) 3.6.2 Quantization of Audio Signals To evaluate the quantization effects of audio signals, we use the DSK for real-time experiment. The program that emulates the quantizer is listed in Table 3.7. During the real-time audio playback, the masked variable quant will be changed to emulate the quantization effects. Table 3.8 lists the files used for this experiment. This experiment uses the program given in Section 1.6.6, which is modified from the C5510 DSK audio example. The program reads audio samples, applies quantizer to the samples, and plays the quantized samples using DSK headphone output. Procedures of the experiment are listed as follows: 1. Load the project quantAudio.pjt, rebuild, and load the program to DSK. 2. Use an audio source (CD player or radio) as the audio input to the DSK. 
The included wave files can be used with Windows media player as audio sources. 3. Listen to the quantization effects of representing audio samples with different wordlengths using a headphone (or loudspeaker) connected to the headphone output of the DSK. 4. Compare and describe the quantization effects of speech and music samples. Table 3.7 Listing of audio signal quantization program, quantAudio.c short quantAudio(short indata, short quant) { return(indata&quant); /* Quantization by masking the data sample */ }JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 162 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.8 File listing for experiment exp3.6.2_quantAudio Files Description quantAudioTest.c C function for testing experiment quantAudio.c C function for audio quantization quantAudio.pjt DSP project file quantAudiocfg.cmd DSP linker command file quantAudio.cdb DSP BIOS configuration file desertSun.wav Wave file fools8k.wav Wave file 3.6.3 Quantization of Coefficients Since filter design and implementation will be discussed in Chapters 4 and 5, we will only briefly describe the fourth-order IIR filter used in this experiment. Table 3.9 lists an assembly program that implements a fourth-order IIR filter. This lowpass filter is designed as fc/fs = 0.225, where fc is the cutoff frequency. The signal components with frequencies below the cutoff frequency will pass through the lowpass filter, Table 3.9 List of IIR filtering program, IIR4.asm .def _IIR4 .def _initIIR4 ; ; Original coefficients of 4th-order IIR lowpass filter ; with fc/fs = 0.225 ; ; short b[5]={ 0.0072, 0.00287, 0.0431, 0.0287, 0.0072}; ; short a[5]={ 1.0000, -2.16860,2.0097,-0.8766, 0.1505}; ; .data ; Q13 formatted coefficients iirCoeff .word 0x003B, 0x00EB ; b0, b1, .word 0x0161, 0x00EB, 0x003B ; b2, b3, b4 .word 0x4564, -0x404F ; -a1, -a2, .word 0x1C0D, -0x04D1 ; -a3, -a4 .bss x,5 ; x buffer .bss y,4 ; y buffer .bss coeff,9 ; Filter coefficients .text ; ; 4th-order IIR filter initialization routine ; Entry T0 = mask for filter coefficients ; _initIIR4 amov #x,XAR0 ; Zero x buffer rpt #4 mov #0,*AR0+ amov #y,XAR0 ; Zero y buffer rpt #3 mov #0,*AR0+ mov #8,BRC0 ; Mask off bits of coefficientsJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 163 Table 3.9 (continued ) amov #iirCoeff,XAR0 amov #coeff,XAR1 rptb maskCoefLoop-1 mov *AR0+,AC0 and T0,AC0 mov AC0,*AR1+ maskCoefLoop ret ; ; 4-th-order IIR filtering ; Entry T0 = sample ; Exit T0 = filtered sample ; _IIR4 bset SATD bset SXM amov #x,XAR0 amov #y,XAR1 amov #coeff,XCDP bset FRCT || mov T0,*AR0 ; x[0] = indata ; ; Perform IIR filtering ; mpym *AR0+,*CDP+,AC0 ; AC0=x[0]*bn[0] || rpt #3 ; i=1,2,3,4 macm *AR0+,*CDP+,AC0 ; AC0+=x[i]*bn[i] rpt #3 ; i=0,1,2,3 macm *AR1+,*CDP+,AC0 ; AC0+=y[i]*an[i] amov #y+2,XAR0 amov #y+3,XAR1 sfts AC0,#2 ; Scale to Q15 format || rpt #2 mov *AR0-,*AR1- ; Update y[] mov hi(AC0),*AR1 || mov hi(AC0),T0 ; Return y[0] in T0 amov #x+3,XAR0 amov #x+4,XAR1 bclr FRCT || rpt #3 mov *AR0-,*AR1- ; Update x[] bclr SXM bclr SATD ret .end while the higher frequency components will be attenuated. The assembly routine, _initIIR4, initializes the memory locations of x and y buffers to zero. In our experiment, the coefficients are masked during the initialization to 16, 12, 8, and 4 bits. The IIR filter assembly routine, _IIR4, performs the filtering operation to the input data samples. 
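As a reading aid for the assembly listing, a rough fixed-point C counterpart of the filtering operation is sketched below; it assumes the same Q13 coefficients (with a1 to a4 already negated) and Q15 samples as Table 3.9, and it is not the experiment code itself.

/* Rough fixed-point C counterpart of the filtering performed by _IIR4
   (reading aid only, not the experiment code)                          */
static short iir4Q15(short in, const short b[5], const short na[4],
                     short x[5], short y[4])
{
    long acc = 0;                            /* the 40-bit C55x accumulator provides
                                                guard bits for this sum; a long is
                                                adequate here for well-scaled signals */
    int i;
    x[0] = in;                               /* newest input sample */
    for (i = 0; i < 5; i++)
        acc += (long)b[i] * x[i];            /* Q13 x Q15 = Q28 products */
    for (i = 0; i < 4; i++)
        acc += (long)na[i] * y[i];           /* feedback terms -a[i]*y(n-i) */
    acc >>= 13;                              /* scale Q28 back to Q15 */
    if (acc >  32767) acc =  32767;          /* emulate the SATD saturation mode */
    if (acc < -32768) acc = -32768;
    for (i = 4; i > 0; i--) x[i] = x[i-1];   /* update the x delay line */
    for (i = 3; i > 0; i--) y[i] = y[i-1];   /* update the y delay line */
    y[0] = (short)acc;
    return y[0];
}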
The initialization is performed only once, while the IIR routine will be called to perform the filter operation for every incoming sample. The coefficient data pointer (CDP) is used to address the filter coefficients. The auxiliary registers, AR0 and AR1, are pointing to the x and yJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 164 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.10 List of files used for experiment exp3.6.3_quantFiltCoef Files Description quantFiltCoefTest.c C function for testing filter quantization quantFiltCoef.asm Assembly IIR function for quantized filter quantFiltCoef.pjt DSP project file quantFiltCoefcfg.cmd DSP linker command file quantFiltCoef.cdb DSP BIOS configuration file desertSun.wav Wave file fools8k.wav Wave file data buffers, respectively. After each sample is processed, both the x and y buffers are updated by shifting the data in the buffers, which will be further discused in Chapter 4. The files used for this experiment are listed in Table 3.10. The experiment program reads audio samples, applies an IIR filter to the samples, and plays the filter results via DSK headphone jack. Procedures of the experiment are listed as follows: 1. Load the project quantFiltCoef.pjt, rebuild, and load the program to the DSK. 2. Connect an audio source to the line-in of the DSK and connect a headphone to the headphone output of the DSK and play the audio signals. The included wave files can be used as audio sources for Windows media player. 3. Listen to the audio output and compare the left channel with the right channel. In this experiment, the left-channel audio input samples are sent directly to the output while the right-channel input samples are filtered by the IIR filter. Describe the quantization effects due to the use of limited wordlength for representing filter coefficients. 3.6.4 Overflow and Saturation Arithmetic As discussed in Section 3.5, overflow may occur when DSP processors perform fixed-point accumulation such as FIR filtering. Overflow may occur when data is transferred to memory because the C55x accu- mulators (AC0ÐAC3) have 40 bits, while the memory space is usually defined as a 16-bit word. In this experiment, we use an assembly routine ovf_sat.asm to evaluate the results with and without overflow protection. Table 3.11 lists a portion of the assembly code used for this experiment. In the assembly program, the following code repeatedly adds the constant 0x140 to AC0: rptblocal add_loop_end-1 add #0x140<<#16,AC0 mov hi(AC0),*AR2+ add_loop_end The updated value is stored at the buffer pointed at by AR2. The content of AC0 will grow larger and larger and eventually this accumulator will overflow. When the overflow occurs, a positive number in AC0 suddenly turns into negative. However, when the C55x saturation mode is set, the overflowed positive number will be limited to 0x7FFFFFFF. The second half of the code stores the left-shifted sinewave values to data memory locations. 
Without saturation protection, this shift will cause some of the shifted values to overflow.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 165 Table 3.11 Program for experiment of overflow and saturation .def _ovftest .def _buff,_buff1 .bss _buff,(0x100) .bss _buff1,(0x100) ; ; Code start ; _ovftest bclr SATD ; Clear saturation bit if set xcc start,T0!=#0 ; If T0!=0, set saturation bit bset SATD start mov #0,AC0 amov #_buff,XAR2 ; Set buffer pointer rpt #0x100-1 ; Clear buffer mov AC0,*AR2+ amov #_buff1,XAR2 ; Set buffer pointer rpt #0x100-1 ; Clear buffer1 mov AC0,*AR2+ mov #0x80-1,BRC0 ; Initialize loop counts for addition amov #_buff+0x80,XAR2 ; Initialize buffer pointer rptblocal add_loop_end-1 add #0x140<<#16,AC0 ; Use upper AC0 as a ramp up counter mov hi(AC0),*AR2+ ; Save the counter to buffer add_loop_end mov #0x80-1,BRC0 ; Initialize loop counts for subtraction mov #0,AC0 amov #_buff+0x7f,XAR2 ; Initialize buffer pointer rptblocal sub_loop_end-1 sub #0x140<<#16,AC0 ; Use upper AC0 as a ramp down counter mov hi(AC0),*AR2- ; Save the counter to buffer sub_loop_end mov #0x100-1,BRC0 ; Initialize loop counts for sinewave amov #_buff1,XAR2 ; Initialize buffer pointer mov mmap(@AR0),BSA01 ; Initialize base register mov #40,BK03 ; Set buffer as size 40 mov #20,AR0 ; Start with an offset of 20 samples bset AR0LC ; Active circular buffer rptblocal sine_loop_end-1 mov *ar0+<<#16,AC0 ; Get sine value into high AC0 sfts AC0,#9 ; Scale the sine value mov hi(AC0),*AR2+ ; Save scaled value sine_loop_end mov #0,T0 ; Return 0 if no overflow xcc set_ovf_flag,overflow(AC0) mov #1,T0 ; Return 1 if overflow detected set_ovf_flag bclr AR0LC ; Reset circular buffer bit bclr SATD ; Reset saturation bit ret .endJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 166 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.12 List of files for experiment exp3.6.4_overflow Files Description overflowTest.c C function for testing overflow experiment ovf_sat.asm Assembly function showing overflow overflow.pjt DSP project file overflow.cmd DSP linker command file The following segment of code sets up and uses the circular addressing mode: mov #sineTable,BSA01 ; Initialize base register mov #40,BK03 ; Set buffer size to 40 mov #20,AR0 ; Start with an offset of 20 bset AR0LC ; Activate circular buffer The first instruction sets up the circular buffer base register (BSA01). The second instruction initializes the size of the circular buffer. The third instruction initializes the offset from the base as the starting point. In this case, the offset is set to 20 words from the base of sineTable[]. The last instruction enables AR0 as the circular pointer. Table 3.12 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Load the project overflow.pjt, rebuild, and load the program to DSK or CCS. 2. Use the graphic function to display the results buff1 (top) and the buff (bottom) as shown in Figure 3.29. 3. Turn off overflow protection and repeat the experiment. Display and compare the results as shown in Figure 3.29(a) without saturation protection, and Figure 3.29(b) with saturation protection. 
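For reference, the circular addressing configured above by BSA01, BK03, and AR0 behaves like the modulo indexing in the following host-side C analogy (an illustration only, not part of the experiment code).

#define BUF_SIZE 40                        /* matches BK03 = 40 in the listing */

/* Host-side analogy of the circular addressing: the base register (BSA01),
   buffer size (BK03), and starting offset (AR0) become a base pointer, a
   length, and an index that wraps modulo the length (illustrative only)    */
static short readCircular(const short *base, unsigned *index)
{
    short value = base[*index];            /* *ar0+ : read, then post-increment */
    *index = (*index + 1) % BUF_SIZE;      /* wrap around like the AR0LC mode   */
    return value;
}

Starting the index at 20 corresponds to the 20-sample offset loaded into AR0.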
Figure 3.29 Fixed-point implementation showing overflow and saturation: (a) without saturation protection; (b) with saturation protectionJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 167 3.6.5 Function Approximations This experiment uses polynomial approximation of sinusoidal functions to show a typical DSP algorithm design and implementation process. The DSP algorithm development usually starts with MATLAB or a floating-point C simulation, changes to fixed-point C, optimizes the code to improve its efficiency, and uses assembly language if necessary. The cosine and sine functions can be expressed as the infinite power (Taylor) series expansion as follows: cos(θ) = 1 − 1 2! θ 2 + 1 4! θ 4 − 1 6! θ 6 +···, (3.91a) sin(θ) = θ − 1 3! θ 3 + 1 5! θ 5 − 1 7! θ 7 +···, (3.91b) where θ is in radians and ‘!’ represents the factorial operation. The accuracy of the approximation depends on the number of terms used in the series. Usually more terms are needed for larger values of θ. However, only a limited number of terms can be used in real-time DSP applications. Floating-point C implementation In this experiment, we implement the cosine function approximation in Equation (3.91a) using the C program listed in Table 3.13. In the function fCos1( ), 12 multiplications are required. The C55x compiler has a built-in run-time support library for floating-point arithmetic operations. These floating- point functions are very inefficient for real-time applications. For example, the program fCos1() needs over 2300 clock cycles to compute one sine value. We can improve the computation efficiency by reducing the multiplication from 14 to 4. The modified program is listed in Table 3.14. This improved program reduces the clock cycles from 2300 to less than 1100. To further improve the efficiency, we will use the fixed-point C and assembly language programs. The files used for this experiment are listed in Table 3.15. This experiment can be run on a DSK or a simulator. Table 3.13 Floating-point C Program for cosine approximation // Coefficients for cosine function approximation double fcosCoef[4]={ 1.0, -(1.0/2.0), (1.0/(2.0*3.0*4.0)), -(1.0/(2.0*3.0*4.0*5.0*6.0)) }; // Direct implementation of function approximation double fCos1(double x) { double cosine; cosine = fcosCoef[0]; cosine += fcosCoef[1]*x*x; cosine += fcosCoef[2]*x*x*x*x; cosine += fcosCoef[3]*x*x*x*x*x*x; return(cosine); }JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 168 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.14 Improved floating-point C program for cosine approximation // More efficient implementation of function approximation double fCos2(double x) { double cosine,x2; x2=x*x; cosine = fcosCoef[3] * x2; cosine = (cosine+fcosCoef[2]) * x2; cosine = (cosine+fcosCoef[1]) * x2; cosine = cosine + fcosCoef[0]; return(cosine); } Procedures of the experiment are listed as follows: 1. Load the project floatingPointC.pjt, rebuild, and load the program to the C5510 DSK or C55x simulator. 2. Run the program and verify the results. 3. Profile and record the cycles needed for approximating the function cos(x). Fixed-point C implementation Since the values of a cosine function are between +1.0 and −1.0, the fixed-point C uses Q15 format as shown in Table 3.16. This fixed-point C requires only 80 cycles, a significant improvement as compared with the floating-point C that requires 1100 cycles. 
The fixed-point C implementation has effectively reduced the computation to 80 clock cycles per function call. This performance can be further improved by examining the program carefully. The CCS in mixed mode shows that the C multiplication uses the run-time support library function I$$LMPY and MPYM instruction as follows: cosine = (long)icosCoef[3] * x2; cosine = cosine >> 13; // Scale back to Q15 010013 D3B706 mpym *AR5(short(#3)),T2,AC0 010016 100533 sfts AC0,#-13,AC0 cosine = (cosine + (long)icosCoef[2]) * x2; cosine = cosine >> 13; // Scale back to Q15 010019 D6B500 add *AR5(short(#2)),AC0,AC0 01001C 6C010542 call I$$LMPY 010020 100533 sfts AC0,#-13,AC0 Table 3.15 List of files for experiment exp3.6.5.1_using floating-pointC Files Description fcos.c Floating-point C function approximation floatingPointC.pjt DSP project file funcAppro.cmd DSP linker command fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 169 Table 3.16 Fixed-point C program for function approximation #define UNITQ15 0x7FFF // Coefficients for cosine function approximation short icosCoef[4]={ (short)(UNITQ15), (short)(-(UNITQ15/2.0)), (short)(UNITQ15/(2.0*3.0*4.0)), (short)(-(UNITQ15/(2.0*3.0*4.0*5.0*6.0))) }; // Fixed-point implementation of function approximation short iCos1(short x) { long cosine,z; short x2; z = (long)x * x; x2 = (short)(z>>15); // x2 has x(Q14)*x(Q14) cosine = (long)icosCoef[3] * x2; cosine = cosine >> 13; // Scale back to Q15 cosine = (cosine + (long)icosCoef[2]) * x2; cosine = cosine >> 13; // Scale back to Q15 cosine = (cosine + (long)icosCoef[1]) * x2; cosine = cosine >> 13; // Scale back to Q15 cosine = cosine + icosCoef[0]; return((short)cosine); } As introduced in Chapter 2, the correct way of writing fixed-point C multiplication is short b,c; long a; a = (long)b * (long)c; The following changes ensure that the C55x compiler will generate the efficient instructions: cosine = (short)(cosine + (long)icosCoef[2]) * (long)x2; cosine = cosine >> 13; // Scale back to Q15 010015 D67590 add *AR3(short(#2)),AC0,AR1 010018 2251 mov T1,AC1 01001A 5290 mov AR1,HI(AC0) 01001C 5804 mpy T1,AC0,AC0 01001E 100533 sfts AC0,#-13,AC0 The modified program (listed in Table 3.17) needs only 33 cycles. We write the fixed-point C code to mimic the instructions of DSP processor. Thus, this stage produces a ‘practical’ DSP program that can be run on the target DSP system, and used as reference for assembly programming. Table 3.18 briefly describes the files used for this experiment.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 170 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.17 Improved fixed-point C implementation // Fixed-point C implementation that simulates assembly programming short iCos(short T0) { long AC0; short *ptr; ptr = &icosCoef[3]; AC0 = (long)T0 * T0; T0 = (short)(AC0>>15); // AC0 has T0(Q14)*T0(Q14) AC0 = (long)T0 * *ptr--; AC0 = AC0 >> 13; // Scale back to Q15 AC0 = (short)(AC0 + *ptr--) * (long)T0; AC0 = AC0 >> 13; // Scale back to Q15 AC0 = (short)(AC0 + *ptr--) * (long)T0; AC0 = AC0 >> 13; // Scale back to Q15 AC0 = AC0 + *ptr; return((short)AC0); } Procedures of the experiment are listed as follows: 1. Load the project fixedPointC.pjt, rebuild, and load the program to the DSK. 2. Run the program and compare the results of cos(x) function with that obtained by floating-point C implementation. 3. Profile the cycles needed for running the function cos(x). 
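One way to carry out step 2 off-line is to sweep a range of small angles, call the fixed-point routine of Table 3.17, and compare the results against the C library cosine, as in the hedged sketch below; the Q14 input scaling is an assumption based on the comments in Table 3.16.

#include <stdio.h>
#include <math.h>

extern short iCos(short x);   /* fixed-point routine of Table 3.17, Q15 result */

/* Off-line accuracy check of the fixed-point cosine approximation
   (the Q14 angle scaling below is an assumption, not stated in the code) */
int main(void)
{
    double theta, maxErr = 0.0;
    for (theta = -1.0; theta <= 1.0; theta += 0.01)
    {
        short  xq14   = (short)(theta * 16384.0);  /* angle in Q14 */
        double approx = iCos(xq14) / 32768.0;      /* Q15 back to double */
        double err    = fabs(approx - cos(theta));
        if (err > maxErr) maxErr = err;
    }
    printf("maximum absolute error = %e\n", maxErr);
    return 0;
}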
C55x assembly implementation In many real-world applications, the DSP algorithms are written in assembly language or mixed C- and-assembly programs. The assembly implementation can be verified by comparing the output of the assembly program against the fixed-point C code. In this experiment, we write the cosine program in assembly language as shown in Table 3.19. This assembly function needs 19 cycles to compute a cosine value. With the overhead of function call setup and return in the mixed C-and-assembly environment, this program requires 30 cycles to generate a cosine value. Table 3.18 List of files for experiment exp3.6.5.2_using fixed-pointC Files Description icos.c Fixed-point C function approximation fixedPointC.pjt DSP project file funcAppro.cmd DSP linker command fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 171 Table 3.19 C55x assembly program for cosine function approximation .data _icosCoef ; [1 (-1/2!) (1/4!) (-1/6!)] .word 32767,-16383,1365,-45 .sect ".text" .def _cosine _cosine: amov #(_icosCoef+3),XAR3 ; ptr = &icosCoef[3]; amov #AR1,AR2 ; AR1 is used as temp register || mov T0,HI(AC0) sqr AC0 ; AC0 = (long)T0 * T0; sfts AC0,#-15 ; T0 = (short)(AC0>>15); mov AC0,T0 mpym *AR3-,T0,AC0 ; AC0 = (long)T0 * *ptr--; sfts AC0,#-13 ; AC0 = AC0 >> 13; add *AR3-,AC0,AR1 ; AC0 = (short)(AC0 + *ptr--) * (long)T0; mpym *AR2,T0,AC0 sfts AC0,#-13 ; AC0 = AC0 >> 13; add *AR3-,AC0,AR1 ; AC0 = (short)(AC0 + *ptr--) * (long)T0; mpym *AR2,T0,AC0 sfts AC0,#-13 ; AC0 = AC0 >> 13; || mov *AR3,T0 add AC0,T0 ; AC0 = AC0 + *ptr; ret ; Return((short)AC0); .end The real-time evaluation and test can be done using the hardware such as a DSK. The real-time experiments can verify system control and interrupt handling issues. To summarize the software design approach used in this experiment, we list the profile results of different implementations in Table 3.20. The files used for this experiment are listed in Table 3.21. Procedures of the experiment are listed as follows: 1. Load the project c55xCos.pjt, rebuild, and load the program to the DSK. 2. Run the program and compare the results of assembly routine with those obtained by floating-point C implementation. 3. Profile the clock cycles needed for the assembly routine and compare with C implementations. Table 3.20 Profile results of cosine approximation for different implementations Arithmetic Function Implementation details Profile (cycles/call) Floating-point C fCos1( ) Direct implementation, 12 multiplications 2315 fCos2( ) Reduced multiplications, 4 multiplications 1097 Fixed-point C iCos1( ) Using fixed-point arithmetic 88 iCos( ) Using single multiplication instruction 33 Assembly language cosine( ) Hand-code assembly routine 30JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 172 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.21 List of files for experiment exp3.6.5.3_using c55x assembly language Files Description c55CosineTest.c C function for testing function approximation cos.asm Assembly routine for cosine approximation c55xCos.pjt DSP project file funcAppro.cmd DSP linker command file Practical applications Since the input arguments to cosine function are in the range of −π to π, we must map the data range of −π to π to the linear 16-bit data variables as shown in Figure 3.30. Using 16-bit wordlength, we map 0 to 0x0000, π to 0x7FFF, and −π to 0x8000 for representing the radius arguments. 
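A small host-side helper that performs this mapping might look like the following; the function name is hypothetical and introduced only for illustration.

#define PI 3.1415926

/* Map an angle in radians, -pi <= theta <= pi, onto the signed 16-bit
   representation of Figure 3.30(b): 0 -> 0x0000, pi -> 0x7FFF, -pi -> 0x8000
   (hypothetical helper, not part of the experiment code)                     */
static short angleToInt16(double theta)
{
    double scaled = theta / PI * 32768.0;     /* full scale corresponds to pi */
    if (scaled >  32767.0) scaled =  32767.0;
    if (scaled < -32768.0) scaled = -32768.0;
    return (short)scaled;
}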
Therefore, the function approximation given in Equation (3.91) is no longer the best choice, and different function approximation should be considered. Using the Chebyshev approximation, cos(θ) and sin(θ) can be computed as cos(θ) = 1 − 0.001922θ − 4.9001474θ2 − 0.264892θ 3 + 5.04541θ 4 + 1.800293θ 5, (3.92a) sin(θ) = 3.140625θ + 0.02026367θ2 − 5.325196θ 3 + 0.5446788θ 4 + 1.800293θ 5, (3.92b) where the value of θ is defined in the first quadrant, 0 ≤θ < π/2. For other quadrants, the following properties can be used to transfer it to the first quadrant: sin(180◦ − θ) = sin(θ), cos(180◦ − θ) =−cos(θ) (3.93) sin(−180◦ + θ) =−sin(θ), cos(−180◦ + θ) =−cos(θ) (3.94) and sin(−θ) =−sin(θ), cos(−θ) = cos(θ). (3.95) The C55x assembly routine (listed in Table 3.22) synthesizes the sine and cosine functions, which can be used to calculate the angle θ from −180◦ to 180◦. 0x3FFF = 90° 0xBFFF = −90° 0x7FFF = 180° 0x8000 = −180° 0xFFFF = 360° 0x0000 = 0° (b) s.xxxxxxxxxxxxxxx siii.xxxxxxxxxxxx Q12 format (a) Q15 format Figure 3.30 Scaled fixed-point number representation: (a) Q formats; (b) map angle value to 16-bit signed integerJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 173 Table 3.22 The C55x program for approximation of sine and cosine functions .def _sine_cos ; ; Approximation coefficients in Q12 (4096) format ; .data coeff ; Sine approximation coefficients .word 0x3240 ; c1 = 3.140625 .word 0x0053 ; c2 = 0.02026367 .word 0xaacc ; c3 = -5.325196 .word 0x08b7 ; c4 = 0.54467780 .word 0x1cce ; c5 = 1.80029300 ; Cosine approximation coefficients .word 0x1000 ; d0 = 1.0000 .word 0xfff8 ; d1 = -0.001922133 .word 0xb199 ; d2 = -4.90014738 .word 0xfbc3 ; d3 = -0.2648921 .word 0x50ba ; d4 = 5.0454103 .word 0xe332 ; d5 = -1.800293 ; ; Function starts ; .text _sine_cos amov #14,AR2 btstp AR2,T0 ; Test bit 15 and 14 nop ; ; Start cos(x) ; amov #coeff+10,XAR2 ; Pointer to the end of coefficients xcc _neg_x,TC1 neg T0 ; Negate if bit 14 is set _neg_x and #0x7fff,T0 ; Mask out sign bit mov *AR2-<<#16,AC0 ; AC0 = d5 || bset SATD ; Set Saturate bit mov *AR2-<<#16,AC1 ; AC1 = d4 || bset FRCT ; Set up fractional bit mac AC0,T0,AC1 ; AC1 = (d5*x+d4) || mov *AR2-<<#16,AC0 ; AC0 = d3 mac AC1,T0,AC0 ; AC0 = (d5*x^2+d4*x+d3) || mov *AR2-<<#16,AC1 ; AC1 = d2 mac AC0,T0,AC1 ; AC1 = (d5*x^3+d4*x^2+d3*x+d2) || mov *AR2-<<#16,AC0 ; AC0 = d1 mac AC1,T0,AC0 ; AC0 = (d5*x^4+d4*x^3+d3*x^2+d2*x+d1) || mov *AR2-<<#16,AC1 ; AC1 = d0 macr AC0,T0,AC1 ; AC1 = (d5*x^4+d4*x^3+d3*x^2+d2*x+d1)*x+d0 || xcc _neg_result1,TC2 neg AC1 _neg_result1 mov *AR2-<<#16,AC0 ; AC0 = c5 continues overleafJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 174 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.22 (continued ) || xcc _neg_result2,TC1 neg AC1 _neg_result2 mov hi(saturate(AC1<<#3)),*AR0+ ; Return cos(x) in Q15 ; ; Start sin(x) computation ; mov *AR2-<<#16,AC1 ; AC1 = c4 mac AC0,T0,AC1 ; AC1 = (c5*x+c4) || mov *AR2-<<#16,AC0 ; AC0 = c3 mac AC1,T0,AC0 ; AC0 = (c5*x^2+c4*x+c3) || mov *AR2-<<#16,AC1 ; AC1 = c2 mac AC0,T0,AC1 ; AC1 = (c5*x^3+c4*x^2+c3*x+c2) || mov *AR2-<<#16,AC0 ; AC0 = c1 mac AC1,T0,AC0 ; AC0 = (c5*x^4+c4*x^3+c3*x^2+c2*x+c1) mpyr T0,AC0,AC1 ; AC1 = (c5*x^4+c4*x^3+c3*x^2+c2*x+c1)*x || xcc _neg_result3,TC2 neg AC1 _neg_result3 mov hi(saturate(AC1<<#3)),*AR0- ; Return sin(x) in Q15 || bclr FRCT ; Reset fractional bit bclr SATD ; Reset saturate bit ret .end Since the absolute value of the largest coefficient given in this experiment is 5.325196, we must scale the coefficients 
or use a different Q format as shown in Figure 3.22. We can achieve this by using the Q3.12 format, which has one sign bit, three integer bits, and 12 fraction bits to cover the range (−8, 8), as illustrated in Figure 3.30(a). In the example, we use Q3.12 format for all the coefficients, and map the angle −π ≤ θ ≤ π to a signed 16-bit number (0x8000 ≤ x ≤ 0x7FFF) as shown in Figure 3.30(b). When the assembly subroutine sine_cos is called, a 16-bit mapped angle (function argument) is passed to the assembly routine using register T0 (see C calling conversion described in Chapter 2). The quadrant information is tested and stored in TC1 and TC2. If TC1 (bit 14) is set, the angle is located in either quadrant II or quadrant IV. We use the 2’s complement to convert the angle to the first or third quadrant. We mask out the sign bit to calculate the third quadrant angle in the first quadrant, and the negation changes the fourth quadrant angle to the first quadrant. Therefore, the angle to be calculated is always located in the first quadrant. Because we use Q3.12 format for coefficients, the computed result needs to be left-shifted 3 bits to scale back to Q15 format. The files used for this experiment are listed in Table 3.23. Table 3.23 List of files for experiment exp3.6.5.4_using assembly routine Files Description sinCosTest.c C function for testing function approximation sine_cos.asm Assembly routine for sine and cosine approximation sin_cos.pjt DSP project file funcAppro.cmd DSP linker command fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 175 Procedures of the experiment are listed as follows: 1. Load the project sine_cos.pjt, rebuild, and load the program to the DSK or CCS simulator. 2. Calculate the angles in the following table, and run the experiment to read the approximation results and compare the differences. θ 30◦ 45◦ 60◦ 90◦ 120◦ 135◦ 150◦ 180◦ cos(θ) sin(θ) θ −150◦ −135◦ −120◦ −90◦ −60◦ −45◦ −30◦ 0◦ cos(θ) sin(θ) 3. Explain the tasks of following C55x instructions: (a) bset FRCT, (b) bset SATD, (c) bset SMUL 4. Remove the assembly instruction bset SATD and rerun the experiment. Observe the difference of approximation results. 3.6.6 Real-Time Digital Signal Generation Using DSK In this section, we will generate tones and random numbers using C5510 DSK. The generated signals will be played back in real time via AIC23 on the C5510 DSK. Tone generation using floating-point C In this experiment, we will generate and play a tone embedded in random noise using the C5510 DSK. Table 3.24 shows the functions that are used to generate a tone and random noise. The function cos(x) uses 4682 cycles per call and the function rand() uses only 87 cycles. Table 3.25 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Load the project floatPointSigGen.pjt, rebuild, and load the program to the DSK. 2. Connect a headphone to the headphone output of the DSK and start audio payback. 3. Listen to the audio output. Use a scope to observe the generated waveform. Tone generation using fixed-point C Refer to the experiment given above in section 3.6.5, we write a cosine function in C55x assembly language similar to Table 3.22. This assembly routine uses only 58 cycles per function call. 
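For readers who want to verify the approximation on a host computer before running the DSK experiment, the following C sketch mirrors the logic of the assembly routine in Table 3.22: the 16-bit mapped angle is folded into the first quadrant using the identities in Equations (3.93)-(3.95), the Q3.12 coefficients are applied with nested multiply-accumulate operations, and the Q3.12 result is shifted left by 3 bits back to Q15. This is only an illustrative sketch, not code from the experiment files; the function name sine_cos_c, the explicit saturation, and the test values in main are our own choices, and an arithmetic right shift is assumed for the Q15 multiplications.

#include <stdint.h>
#include <stdio.h>

/* Q3.12 coefficients copied from Table 3.22 (c1..c5 for sine, d0..d5 for cosine) */
static const int16_t sin_c[5] = { 0x3240, 0x0053, (int16_t)0xAACC, 0x08B7, 0x1CCE };
static const int16_t cos_c[6] = { 0x1000, (int16_t)0xFFF8, (int16_t)0xB199,
                                  (int16_t)0xFBC3, 0x50BA, (int16_t)0xE332 };

/* x is the angle mapped to a 16-bit signed number (-180 deg = 0x8000, 90 deg = 0x4000);
   the sine and cosine results are returned in Q15 format                              */
static void sine_cos_c(int16_t x, int16_t *sine, int16_t *cosine)
{
    int32_t sin_sign = (x < 0) ? -1 : 1;      /* sin(-t) = -sin(t), Eq. (3.95)      */
    int32_t cos_sign = 1;                     /* cos(-t) =  cos(t)                  */
    int32_t t = (x < 0) ? -(int32_t)x : (int32_t)x;
    int32_t s, c, i;

    if (t > 0x4000) {                         /* fold quadrant II into quadrant I   */
        t = 0x8000 - t;                       /* work with 180 deg - t              */
        cos_sign = -1;                        /* cos(180 - t) = -cos(t), Eq. (3.93) */
    }
    /* t is now a Q15 fraction of 180 deg in [0, 0.5]; evaluate the polynomials
       with nested multiplications, keeping the accumulators in Q3.12 (arithmetic
       right shift assumed)                                                        */
    s = sin_c[4];
    for (i = 3; i >= 0; i--) s = ((s * t) >> 15) + sin_c[i];
    s = (s * t) >> 15;                        /* the sine series has no constant term */
    c = cos_c[5];
    for (i = 4; i >= 0; i--) c = ((c * t) >> 15) + cos_c[i];

    s <<= 3;  c <<= 3;                        /* Q3.12 -> Q15, as in the assembly     */
    if (s >  32767) s =  32767; else if (s < -32768) s = -32768;  /* explicit         */
    if (c >  32767) c =  32767; else if (c < -32768) c = -32768;  /* saturation       */
    *sine   = (int16_t)(sin_sign * s);
    *cosine = (int16_t)(cos_sign * c);
}

int main(void)
{
    int16_t s, c;
    sine_cos_c(0x2000, &s, &c);               /*  +45 degrees                         */
    printf("  45 deg: sin = %6d  cos = %6d  (32767 ~ 1.0)\n", s, c);
    sine_cos_c((int16_t)0xA000, &s, &c);      /* -135 degrees                         */
    printf("-135 deg: sin = %6d  cos = %6d\n", s, c);
    return 0;
}

The printed values can be compared against the table filled in step 2 of the experiment procedure.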
Table 3.26 lists the files used for this experiment.JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 176 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.24 Floating-point C program for tone and noise generation #define UINTQ14 0x3FFF #define PI 3.1415926 // Variable definition static unsigned short n; static float twoPI_f_Fs; void initFTone(unsigned short f, unsigned short Fs) { n=0; twoPI_f_Fs = 2.0*PI*(float)f/(float)Fs; } short fTone(unsigned short Fs) { n++; if (n >= Fs) n=0; return( (short)(cos(twoPI_f_Fs*(float)n)*UINTQ14)); } void initRand(unsigned short seed) { srand(seed); } short randNoise(void) { return((rand()-RAND_MAX/2)>>1); } Table 3.25 List of files for experiment exp3.6.6.1_using floating-PointC Files Description floatSigGenTest.c C function for testing experiment ftone.c Floating-point C function for tone generation randNoise.c C function for generating random numbers floatPointSigGen.pjt DSP project file floatPointSigGencfg.cmd DSP linker command file floatPointSigGen.cdb DSP BIOS configuration file Table 3.26 List of files for experiment exp3.6.6.2_of tone generation Files Description toneTest.c C function for testing experiment tone.c C function controls tone generation cos.asm Assembly routine computes cosine values toneGen.pjt DSP project file toneGencfg.cmd DSP linker command file toneGen.cdb DSP BIOS configuration fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 177 Procedures of the experiment are listed as follows: 1. Load the project toneGen.pjt, rebuild, and load the program to the DSK. 2. Connect a headphone to the headphone output of the DSK and start playback of the tone. 3. Listen to the DSK output. Use a scope to observe the generated waveform. Random number generation using fixed-point C The linear congruential sequence method (will be further discussed in Chapter 8) is widely used because of its simplicity. The random number generation can be expressed as x(n) = [ax(n − 1) + b]mod M , (3.96) where the modulo operation (mod) returns the remainder after division by M. For this experiment, we select M = 220 = 0x100000, a = 2045, and x(0) = 12 357. The C program for the random number generation is given in Table 3.27, where seed=x(0)=12357. Floating-point multiplication and division are very slow on fixed-point DSP processors. We have learned in Chapter 2 that we can use a mask instead of modulo operation for a power-of-2 number. We improve the run-time efficiency by rewriting the program as listed in Table 3.28. The profile shows that the function randNumber2( ) needs only 48 cycles while the original function randNumber1( ) uses 427 cycles. The files used for this experiment are listed in Table 3.29. Procedures of the experiment are listed as follows: 1. Load the project randGenC.pjt, rebuild, and load the program to the DSK. 2. Connect a headphone to the headphone output of the DSK and start the random signal generation. 3. Listen to the DSK output. Use a scope to observe the generated waveform. 
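To see why the bitwise AND in Table 3.28 is a legitimate replacement for the modulo operation in Equation (3.96), note that for M = 2^20 and any non-negative y, y mod M equals y & (M - 1). The short host-side sketch below checks this numerically for a = 2045, b = 1 (the increment used in Table 3.27), and seed x(0) = 12357; the function names and the final scaling to a 16-bit sample are our own illustrative choices, not the code used in the randGenC.pjt project.

#include <stdio.h>

#define M_MOD  0x100000L          /* M = 2^20, as chosen in the text              */
#define A_MUL  2045L              /* multiplier a                                 */
#define B_ADD  1L                 /* increment b = 1, as in Table 3.27            */

/* Linear congruential update using the % operator */
static long lcg_mod(long x)  { return (A_MUL * x + B_ADD) % M_MOD; }

/* Same update using a bitwise AND; valid because M is a power of 2,
   so (y mod M) == (y & (M - 1)) for non-negative y                  */
static long lcg_mask(long x) { return (A_MUL * x + B_ADD) & (M_MOD - 1); }

int main(void)
{
    long x1 = 12357, x2 = 12357;              /* seed x(0) from the text           */
    for (int n = 0; n < 100000; n++) {
        x1 = lcg_mod(x1);
        x2 = lcg_mask(x2);
        if (x1 != x2) {
            printf("mismatch at n = %d\n", n);
            return 1;
        }
    }
    /* One way to form a roughly zero-mean 16-bit sample from the 20-bit state
       (an illustrative choice only)                                             */
    short sample = (short)((x1 >> 5) - 16384);
    printf("sequences agree; example output sample = %d\n", sample);
    return 0;
}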
Table 3.27 C program for random number generation // Variable definition static volatile long n; static short a; void initRand(long seed) { n = (long)seed; a = 2045; } short randNumber1(void) { short ran; n=a*n+1; n=n-(long)((float)(n*0x100000)/(float)0x100000); ran = (n+1)/0x100001; return (ran); }JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 178 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS Table 3.28 C program that uses mask for modulo operation short randNumber2(void) { short ran; n = a*n; n = n&0xFFFFF000; ran = (short)(n>>20); return (ran); } Random number generation using C55x assembly language To further improve the performance, we use assembly language for random number generation. The assembly routine is listed in Table 3.30, which reduces the run-time clock cycles by 50%. Table 3.31 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Load the project randGen.pjt, rebuild, and load the program to the DSK. 2. Connect a headphone to the headphone output of the DSK and start the random signal generation. 3. Listen to the DSK output. Use a scope to observe the generated waveform. Signal generation using C55x assembly language This experiment combines the tone and random number generators for generating random noise, tone, and tone with additive random noise. The files used for this experiment are listed in Table 3.32. Procedures of the experiment are listed as follows: 1. Load the project signalGen.pjt, rebuild, and load the program to the DSK. 2. Connect a headphone to the headphone output of the DSK and start signal generation. 3. Listen to the DSK output. Use a scope to observe the generated waveform. Table 3.29 List of files for experiment exp3.6.6.3_of random number generation Files Description randTest.c C function for testing experiment rand.c C function generates random numbers randGenC.pjt DSP project file randGencfg.cmd DSP linker command file randGen.cdb DSP BIOS configuration fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 179 Table 3.30 C55x assembly program of random number generator .bss _n,2,0,2 ; long n .bss _a,1,0,0 ; short a .def _initRand .def _randNumber .sect ".text" _initRand: mov AC0,dbl(*(#_n)) ; n = (long)seed; mov #2045,*(#_a) ; a = 2045; ret _randNumber: amov #_n,XAR0 mov *(#_a),T0 mpym *AR0+,T0,AC0 ; n = a*n; mpymu *AR0-,T0,AC1 ; This is an 32x16 integer multiply sfts AC0,#16 add AC1,AC0 || mov #0xFFFF<<#16,AC2 ; n = n&0xFFFFF000; or #0xF000,AC2 and AC0,AC2 mov AC2,dbl(*AR0) || sfts AC2,#-20,AC0 ; ran = (short)(n>>20); mov AC0,T0 ; Return (ran); ret .end Table 3.31 List of files for experiment exp3.6.6.4_using assembly routine Files Description randTest.c C function for testing experiment rand.asm Assembly routine generates random numbers randGen.pjt DSP project file randGencfg.cmd DSP linker command file randGen.cdb DSP BIOS configuration file Table 3.32 List of files for experiment exp3.6.6.5_of signal generation Files Description sigGenTest.c C function for testing experiment tone.c C function controls tone generation cos.asm Assembly routine computes cosine values rand.asm Assembly routine generates random numbers signalGen.pjt DSP project file signalGencfg.cmd DSP linker command file signalGen.cdb DSP BIOS configuration fileJWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 180 DSP FUNDAMENTALS AND IMPLEMENTATION CONSIDERATIONS References [1] N. Ahmed and T. 
Natarajan, Discrete-Time Signals and Systems, Englewood Cliffs, NJ: Prentice-Hall, 1983. [2] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1989. [3] S. J. Orfanidis, Introduction to Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1996. [4] J. G. Proakis and D. G. Manolakis, Digital Signal Processing Ð Principles, Algorithms, and Applications, 3rd Ed., Englewood Cliffs, NJ: Prentice Hall, 1996. [5] P. Peebles, Probability, Random Variables, and Random Signal Principles, New York, NY: McGraw-Hill, 1980. [6] A Bateman and W. Yates, Digital Signal Processing Design, New York: Computer Science Press, 1989. [7] S. M. Kuo and D. R. Morgan, Active Noise Control Systems Ð Algorithms and DSP Implementations, New York, NY: John Wiley & Sons, Inc., 1996. [8] C. Marven and G. Ewers, A Simple Approach to Digital Signal Processing, New York: John Wiley & Sons, Inc., 1996. [9] J. H. McClellan, R. W. Schafer, and M. A. Yoder, DSP First: A Multimedia Approach, 2nd Ed., Englewood Cliffs, NJ: Prentice Hall, 1998. [10] D. Grover and J. R. Deller, Digital Signal Processing and the Microcontroller, Upper Saddle River, NJ: Prentice Hall, 1999. [11] S. M. Kuo and W. S. Gan, Digital Signal Processors Ð Architectures, Implementations, and Applications, Upper Saddle River, NJ: Prentice Hall, 2005. [12] MathWorks, Inc.,Using MATLAB, Version 6, 2000. [13] MathWorks, Inc., Signal Processing Toolbox User’s Guide, Version 6, 2004. [14] MathWorks, Inc., Filter Design Toolbox User’s Guide, Version 3, 2004. [15] MathWorks, Inc., Fixed-Point Toolbox User’s Guide, Version 1, 2004. Exercises 1. The all-digital touch-tone phones use the sum of two sinewaves for signaling. Frequencies of these sinewaves are defined as 697, 770, 852, 941, 1209, 1336, 1477, and 1633 Hz. The sampling rate used by the telecommunications is 8 kHz, converts those 8 frequencies in terms of radians per sample and cycles per sample. 2. Compute the impulse response h(n) for n = 0, 1, 2, 3, 4 of the digital systems defined by the following I/O equations: (a) y(n) = x(n) = 0.75y(n − 1); (b) y(n) − 0.3y(n − 1) − 0.4y(n − 2) = x(n) − 2x(n − 1); and (c) y(n) = 2x(n) − 2x(n − 1) + 0.5x(n − 2). 3. Construct detailed signal-flow diagrams for the digital systems defined in Problem 2. 4. Similar to the signal-flow diagram for the IIR filter as shown in Figure 3.11, construct a general signal-flow diagram for the IIR filter defined in Equation (3.42) for M = L−1. 5. Find the transfer functions for the three digital systems defined in Problem 2. 6. Find the zero(s) and/or pole(s) of the digital systems given in Problem 2. Discuss the stability of these systems. 7. For a second-order IIR filter defined in Equation (3.42) with two complex poles defined in (3.52), the radius r = 0.9 and the angle θ = 0.25π. Find the transfer function and I/O equation of this filter. 8. A 2 kHz sinewave is sampled with 10-kHz sampling rate, what is the sampling period? What is the digital frequency in terms of ω and F? If we have 100 samples, how many cycles of sinewave are covered?JWBK080-03 JWBK080-Kuo March 8, 2006 19:12 Char Count= 0 EXERCISES 181 9. For the digital sinewave given in Problem 8, if we compute the DFT with N = 100, what is the frequency resolution? If we display the magnitude spectrum as shown in Figure 3.17, what is the value of k corresponding to the peak spectrum? What happens if the frequencies of sinewave are 1.5 and 1.05 kHz? 10. Similar to Table 3.2, construct a new table for 5-bit binary numbers. 
11. Find the fixed-point 2’s complement binary representation with B = 6 for the decimal numbers 0.5703125 and −0.640625. Also, find the hexadecimal representation of these two numbers. Round the binary numbers to 6 bits and compute the corresponding roundoff errors. 12. Similar to Example 3.26, represent the two fractional numbers in Problem 11 in integer format for the C55x assembly programs. 13. Represent the 16-bit number given in Example 3.27 in Q1.14, Q3.12, and Q15.0 formats. 14. If the quantization process uses truncating instead of rounding, show that the truncation error e(n) = x(n) − x(nT) will be in the interval − ωc are referred to as the passband and stopband, respectively, and the frequency ωc that separates H (w) ω 1 0 π ω 1 0 π ω 1 0 π ω 1 0 π (a) Lowpass filter (b) Highpass filter (c) Bandpass filter (d) Bandstop filter wc wa wb wa wb wc H (w) H (w)H (w) Figure 4.2 Magnitude response of ideal filters: (a) lowpass; (b) highpass; (c) bandpass; and (d) bandstopJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 188 DESIGN AND IMPLEMENTATION OF FIR FILTERS the passband and stopband is called the cutoff frequency. An ideal lowpass filter has magnitude response |H (ω)| = 1 in the frequency range 0 ≤ ω ≤ ωc, and |H (ω)| = 0 for ω>ωc. Thus, a lowpass filter passes low-frequency components below the cutoff frequency and attenuates high-frequency components above ωc. The magnitude response of an ideal highpass filter is illustrated in Figure 4.2(b). A highpass filter passes high-frequency components above the cutoff frequency ωc and attenuates low-frequency components below ωc. In practice, highpass filters can be used to eliminate DC offset, 60 Hz hum, and other low- frequency noises. The magnitude response of an ideal bandpass filter is illustrated in Figure 4.2(c). The frequencies ωa and ωb are called the lower cutoff frequency and the upper cutoff frequency, respectively. A bandpass filter passes frequency components between the two cutoff frequencies ωa and ωb, and attenuates frequency components below the frequency ωa and above the frequency ωb. The magnitude response of an ideal bandstop (or band-reject) filter is illustrated in Figure 4.2(d). A filter with a very narrow stopband is also called a notch filter. For example, a power line generates a 60 Hz sinusoidal noise called 60 Hz hum, which can be removed by a notch filter with notch frequency at 60 Hz. In addition to these frequency-selective filters, an allpass filter provides frequency response |H (ω)| = 1 for allω. The principal use of allpass filters is to correct the phase distortion introduced by physical systems and/or other filters. A very special case of the allpass filter is the ideal Hilbert transformer, which produces a90◦ phase shift to input signals. A multiband filter has more than one passband and stopband. A special case of the multiband filter is the comb filter. A comb filter has equally spaced zeros with the shape of the magnitude response resembling a comb. The difference equation of the comb filter is given as y(n) = x(n) − x(n − L), (4.3) where the number of delay L is an integer. The transfer function of this FIR filter is H(z) = 1 − z−L = zL − 1 zL . (4.4) Thus, the comb filter has L poles at the origin and L zeros equally spaced on the unit circle at zl = e j(2π/L)l , l = 0, 1,...,L − 1. (4.5) Example 4.2: A comb filter with L = 8 has eight zeros at zl = 1, eπ/4, eπ/2, e3π/4, eπ =−1, e5π/4, e3π/2, e7π/4. 
The frequency response can be computed and plotted in Figure 4.3 using the following MATLAB script (example4_2.m) for L = 8: b=[10000000-1]; a=[1]; freqz(b, a); Figure 4.3 shows that the comb filter can be used as a crude bandstop filter to remove harmonics at frequencies ωl = 2πl/L, l = 0, 1,...,L/2 − 1. (4.6) The center of the passband lies halfway between the zeros of the response; that is, at frequencies (2l+1)π L , l = 0, 1,...,L/2 − 1.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 INTRODUCTION TO FIR FILTERS 189 0 10 5 −5 −10 −15 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Normalized frequency (xπ rad/sample) Magnitude (dB) 0 100 50 −50 −100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Normalized frequency (xπ rad/sample) Phase (degrees) Figure 4.3 Magnitude and phase responses of a comb filter Comb filters are useful for passing or eliminating specific frequencies and their harmonics. Using comb filters for attenuating periodic signals with harmonic related components is more efficient than having individual filters for each harmonic. For example, the humming sound produced by large transformers located in electric utility substations is composed with even-numbered harmonics (120, 240, 360 Hz, etc.) of the 60 Hz power-line frequency. When a desired signal is corrupted by the transformer noise, the comb filter with notches at the multiples of 120 Hz can be used to eliminate those undesired harmonic components. Example 4.3: We can selectively cancel zeros in a comb filter with corresponding poles. Canceling the zero provides a passband, while the remaining zeros provide attenuation for stopband. If we add a pole at z = 1, the transfer function given in Equation (4.4) changes to H(z) = 1 − z−L 1 − z−1 . (4.7) The pole at z = 1 is canceled by the zero at z = 1, resulting in a lowpass filter with passband centered at z = 1. The system defined by Equation (4.7) is still an FIR filter because the pole is canceled. Applying the scaling factor 1/L to Equation (4.7), the transfer function becomes H(z) = 1 L 1 − z−L 1 − z−1 . (4.8) This is the moving-average filter introduced in Chapter 3. Note that canceling the zero at z = 1 produces a lowpass filter and canceling the zero at z =−1 produces a highpass filter. This is because that z = 1 is corresponding to ω = 0 and z =−1 is corresponding to ω = π. 4.1.3 Filter Specifications The characteristics of digital filters are often specified in the frequency domain, and thus the design is based on magnitude-response specifications. In practice, we cannot achieve the infinitely sharp cutoff asJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 190 DESIGN AND IMPLEMENTATION OF FIR FILTERS H (w) 1 + dP 1 − dP 1 As AP Passband Stopband 0 Transition band Ideal filter Actual filter ds wp wc ws p w Figure 4.4 Magnitude response and performance measurement of a lowpass filter the ideal filters given in Figure 4.2. We must accept a more gradual cutoff with a transition band between the passband and the stopband. The specifications are often given in the form of tolerance (or ripple) schemes, and a transition band is specified to permit the smooth magnitude roll-off. A typical magnitude response of lowpass filter is illustrated in Figure 4.4. The dotted horizontal lines in the figure indicate the tolerance limits. The magnitude response has a peak deviation δp in the passband, and a maximum deviation δs in the stopband. The frequencies ωp and ωs are the passband edge (cutoff) frequency and the stopband edge frequency, respectively. 
As shown in Figure 4.4, the magnitude of passband (0 ≤ ω ≤ ωp) approximates unity with an error of ±δp. That is, 1 − δp ≤ |H(ω)| ≤ 1 + δp, 0 ≤ ω ≤ ωp. (4.9) The passband ripple δp is the allowed variation in magnitude response in the passband. Note that the gain of the magnitude response is normalized to 1 (0 dB). In the stopband, the magnitude response approximates zero with an error δs. That is, |H(ω)| ≤ δs,ωs ≤ ω ≤ π. (4.10) The stopband ripple (or attenuation) δs describes the minimum attenuation for signal components above the ωs. Passband and stopband deviations are usually expressed in decibels. The peak passband ripple and the minimum stopband attenuation in decibels are defined as Ap = 20 log10 1 + δp 1 − δp dB (4.11) and As =−20 log10 δs dB. (4.12)JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 INTRODUCTION TO FIR FILTERS 191 Example 4.4: Consider a filter has passband ripples within ±0.01; that is, δp = 0.01. From Equation (4.11), we have Ap = 20 log10 1.01 0.99 = 0.1737 dB. When the minimum stopband attenuation is given as δs = 0.01, we have As =−20 log10(0.01) = 40 dB. The transition band is the area between the passband edge frequency ωp and the stopband edge frequency ωs. The magnitude response decreases monotonically from the passband to the stopband in this region. The width of the transition band determines how sharp the filter is. Generally, a higher order filter is needed for smaller δp and δs, and narrower transition band. 4.1.4 Linear-Phase FIR Filters The signal-flow diagram of the FIR filter is shown in Figure 3.6, and the I/O equation is defined in Equation (3.15). If L is an odd number, we define M = (L − 1)/2. Equation (3.15) can be written as B(z) = 2M l=0 bl z−l = M l=−M bl+M z−(l+M) = z−M M l=−M hl z−l = z−M H(z), (4.13) where H(z) = M l=−M hl z−l . (4.14) Let hl have the symmetry property as hl = h−l , l = 0, 1,...,M. (4.15) From Equation (4.13), the frequency response B(ω) can be written as B (ω) = B(z)|z=e jω = e− jωM H (ω) = e− jωM M l=−M hl e− jωl = e− jωM h0 + M l=1 hl e jωl + e− jωl = e− jωM h0 + 2 M l=1 hl cos (ωl) . (4.16) If L is an even integer and M = L/2, the derivation of Equation (4.16) has to be modified slightly. If hl is real valued, H(ω) is a real function of ω.IfH(ω) ≥ 0, then the phase of B(ω) is equal to φ (ω) =−ωM, (4.17)JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 192 DESIGN AND IMPLEMENTATION OF FIR FILTERS which is a linear function of ω. However, if H(ω) < 0, then the phase of B(ω) is equal to π − ωM. Thus, there are sign changes in H(ω) corresponding to 180◦ phase shifts in B(ω), and B(ω) is only piecewise linear as shown in Figure 4.3. If hl has the antisymmetry property as hl =−h−l , l = 0, 1,...,M, (4.18) this implies h(0) = 0. Following the derivation of Equation (4.16), we can also show that the phase of B(z) is a linear function of ω. In conclusion, an FIR filter has linear phase if its coefficients satisfy the positive symmetric condition bl = bL−1−l , l = 0, 1,...,L − 1, (4.19) or the antisymmetric condition (negative symmetry) bl =−bL−1−l , l = 0, 1,...,L − 1. (4.20) The group delay of a symmetric (or antisymmetric) FIR filter is Td (ω) = (L − 1)/2, which corresponds to the midpoint of the FIR filter. Depending on whether L is even or odd and whether bl has positive or negative symmetry, there are four types of linear-phase FIR filters as illustrated in Figure 4.5. 
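The decibel conversions in Equations (4.11) and (4.12) can be checked with a few lines of code. The sketch below is only a numerical illustration, not one of the book's experiment programs; it reproduces the values of Example 4.4, where δp = 0.01 gives a peak passband ripple of about 0.1737 dB and δs = 0.01 gives 40 dB of stopband attenuation.

#include <math.h>
#include <stdio.h>

/* Peak passband ripple in dB, Equation (4.11): Ap = 20*log10((1+dp)/(1-dp)) */
static double passband_ripple_dB(double dp)
{
    return 20.0 * log10((1.0 + dp) / (1.0 - dp));
}

/* Minimum stopband attenuation in dB, Equation (4.12): As = -20*log10(ds) */
static double stopband_atten_dB(double ds)
{
    return -20.0 * log10(ds);
}

int main(void)
{
    double dp = 0.01, ds = 0.01;                         /* tolerances from Example 4.4 */
    printf("Ap = %.4f dB\n", passband_ripple_dB(dp));    /* about 0.1737 dB             */
    printf("As = %.4f dB\n", stopband_atten_dB(ds));     /* 40 dB                       */
    return 0;
}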
The symmetry (or antisymmetry) property of a linear-phase FIR filter can be exploited to reduce the total number of multiplications required by filtering. Consider the FIR filter with even length L and positive symmetric as defined in Equation (4.19), Equation (3.40) can be combined as H(z) = b0 1 + z−L+1 + b1 z−1 + z−L+2 +···+bL/2−1 z−L/2+1 + z−L/2 . (4.21) A realization of H(z) defined in Equation (4.21) is illustrated in Figure 4.6 with the I/O equation expressed as y(n) = b0 [x(n) + x(n − L + 1)] + b1 [x(n − 1) + x(n − L + 2)] + ···+bL/2−1 [x(n − L/2 + 1) + x(n − L/2)] = L/2−1 l=0 bl [x(n − l) + x(n − L + 1 + l)]. (4.22) For an antisymmetric FIR filter, the addition of two signals is replaced by subtraction. That is, y(n) = L/2−1 l=0 bl [x(n − l) − x(n − L + 1 + l)]. (4.23) As shown in Equation (4.22) and Figure 4.6, the number of multiplications is cut in half by adding the pairs of samples, and then multiplying the sum by the corresponding coefficient. The trade-off is that we need two address pointers that point at both x(n − l) and x(n − L + 1 + l) instead of accessing data linearly through the same buffer with a single pointer. The TMS320C55x provides two special instructions firsadd and firssub for implementing the symmetric and antisymmetric FIR filters, respectively. In Section 4.5, we will demonstrate how to use the symmetric FIR instructions for experiments.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 INTRODUCTION TO FIR FILTERS 193 l Center of symmetry Center of symmetry Center of symmetry Center of symmetry (a) Type I: L even (L = 8), positive symmetry. l (b) Type II: L odd (L = 7), positive symmetry. l (c) Type III: L even (L = 8), negative symmetry. l (d) Type IV: L odd (L = 7), negative symmetry. Figure 4.5 Coefficients of the four types of linear-phase FIR filters: (a) type I: L even (L = 8), positive symmetry; (b) type II: L odd (L = 7), positive symmetry; (c) type III: L even (L = 8), negative symmetry; and (d) type IV: L odd (L = 7), negative symmetry x(n) x(n − L/2) x(n − L/2 + 1) x(n − L + 2) x(n − L + 1) x(n − 1) bL/2−1b1b0 y(n) z−1 z−1 z−1 z−1 z−1 z−1 z−1 + +++ Figure 4.6 Signal-flow diagram of symmetric FIR filter, L is evenJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 194 DESIGN AND IMPLEMENTATION OF FIR FILTERS 4.1.5 Realization of FIR Filters An FIR filter can be operated on either a block basis or a sample-by-sample basis. In the block processing, the input samples are segmented into multiple data blocks. Filtering is performed one block at a time, and the resulting output blocks are recombined to form the overall output. The filtering of each data block can be implemented using the linear convolution or fast convolution, which will be introduced in Chapter 6. In the sample-by-sample processing, the input samples are processed at every sampling period after the current input x(n) becomes available. As discussed in Section 3.2.1, the output of an LTI system is the input samples convoluted with the impulse response coefficients of the system. Assuming that the filter is casual, the output at time n is given as y(n) = ∞ l=0 h(l)x(n − l). (4.24) The process of computing the linear convolution involves the following four steps: 1. Folding: Fold x(l) about l = 0 to obtain x(−l). 2. Shifting: Shift x(−l)byn samples to the right to obtain x(n − l). 3. Multiplication: Multiply h(l)byx(n − l) to obtain the products of h(l) × (n − l) for all l. 4. Summation: Sum all the products to obtain the output y(n) at time n. 
Repeat Steps 2 through 4 in computing the output of the system at other time instants n. Note that the convolution of the length M input signal with the length L impulse response results in length L + M− 1 output signal. Example 4.5: Considering an FIR filter that consists of four coefficients b0, b1, b2, and b3,we have y(n) = 3 l=0 bl x(n − l), n ≥ 0. The linear convolution yields n = 0, y(0) = b0x(0) n = 1, y(1) = b0x(1) + b1x(0) n = 2, y(2) = b0x(2) + b1x(1) + b2x(0) n = 3, y(3) = b0x(3) + b1x(2) + b2x(1) + b3x(0) In general, we have y(n) = b0x(n) + b1x(n − 1) + b2x(n − 2) + b3x(n − 3), n ≥ 3. The graphical interpretation is illustrated in Figure 4.7. As shown in Figure 4.7, the input sequence is flipped around (folded) and then shifted to the right to overlap with the filter coefficients. At each time instant, the output value is the sum of products of overlapped coefficients with the corresponding input data aligned below it. This flip-and-slide form ofJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 INTRODUCTION TO FIR FILTERS 195 b0 x(0) b0 x(0) b0 x(1) b0x(2) b0x(n) b1x(n − 1) b3x(n − 3) b2x(n − 2) b1x(1) b2x(0) b1 x(0)x(1)x(0) x(2)x(1) x(0) x(n) x(n − 3) x(n − 2)x(n − 1) b1 b2 b3 n = 0: n = 1: n = 2: n ≥ 3: Figure 4.7 Graphical interpretation of linear convolution, L = 4 linear convolution can be illustrated in Figure 4.8. Note that shifting x(−l) to the right is equivalent to shifting bl to the left 1 unit at each sampling period. At time n = 0, the input sequence is extended by padding L − 1 zeros to its right. The only nonzero product comes from b0 multiplied with x(0), which is time aligned. It takes the filter L − 1 iterations before it is completely overlapped with the input sequence. Therefore, the first L − 1 outputs correspond to the transient of the FIR filtering. After n ≥ L − 1, the signal buffer of the FIR filter is full and the filter is in the steady state. In FIR filtering, the coefficients are constants, but the data in the signal buffer (or tapped delay line) changes every sampling period, T . The signal buffer is refreshed in the fashion illustrated in Figure 4.9, where the oldest sample x(n − L + 1) is discarded and the rest samples are shifted one location to the right in the buffer. A new sample (from ADC in real-time applications) is inserted to the memory location labeled as x(n). This x(n) at time n will become x(n − 1) in the next sampling period, then x(n − 2), etc., until it simply drops out off the end of the delay chain. The process of refreshing the signal buffer b1 b2 b3b0 b1 b2 b3b0 x(n − 3)x(n − 2) x(n − 1)x(n) x(0) y(0) 000 y(n) Figure 4.8 Flip-and-slide process of linear convolutionJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 196 DESIGN AND IMPLEMENTATION OF FIR FILTERS x(n) x(n − 1) x(n − 2) x(n − L + 2) x(n − L + 1)... Current time, n Next time, n + 1 DiscardedNew data x(n) x(n − 1) x(n − 2) x(n − L + 1)... Figure 4.9 Signal buffer refreshing for FIR filtering shown in Figure 4.9 requires intensive processing time if the data-move operations are not implemented by hardware. The most efficient method for refreshing a signal buffer is to arrange the signal samples in a circular fashion as illustrated in Figure 4.10(a). Instead of shifting the data samples forward while holding the start address of buffer fixed as shown in Figure 4.9, the data samples in the circular buffer do not move but the buffer start address is updated backward (counterclockwise). 
The beginning of the signal sample, x(n), is pointed by start-address pointer, and the previous samples are already loaded sequentially from that point in a clockwise direction. As we receive a new sample, it is placed at the position x(n) and our filtering operation is performed. After calculating the output y(n), the start pointer is moved counterclockwise one position to x(n − L + 1) and we wait for the next input signal. The next input at time n + 1 is written to the x(n − L + 1) position and is referred as x(n) for the next iteration. The circular buffer is very efficient because the update is carried out by adjusting the start-address pointer without physically shifting any data samples in memory. Figure 4.10(b) shows a circular buffer for FIR filter coefficients. Circular buffer allows the coefficient pointer to wrap around when it reaches to the end of the coefficient buffer. That is, the pointer moves from bL−1 to b0 such that the filtering will always start at the first coefficient. 4.2 Design of FIR Filters The objective of designing FIR filter is to determine a set of filter coefficients that satisfies the given specifications. A variety of techniques have been developed for designing FIR filters. The Fourier series (a) Circular buffer for signals (b) Circular buffer for coefficients Signal buffer pointer for next x(n) Signal buffer pointer at time n Coefficient buffer pointer x(n)x(n − L + 1) x(n − L + 2) x(n − 1) x(n − 2) x(n − 3) b0 b1 b3 b2 bL − 1 bL − 2 Figure 4.10 Circular buffers for FIR filter: (a) circular buffer for holding the signals. The start pointer to x(n)is updated in the counterclockwise direction; (b) circular buffer for FIR filter coefficients, the pointer always pointing to b0 at the beginning of filteringJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 197 method offers a simple way of computing FIR filter coefficients, thus can be used to explain the principles of FIR filter design. 4.2.1 Fourier Series Method Fourier series method designs an FIR filter by calculating the impulse response of a filter that approximates the desired frequency response. Thus, it can be expanded in a Fourier series as H(ω) = ∞ n=−∞ h(n)e− jωn, (4.25) where h(n) = 1 2π π −π H(ω)e jωn dω, −∞ ≤ n ≤∞. (4.26) Equation (4.26) shows that the impulse response h(n) is double sided and has infinite length. For a desired frequency response H(ω), the corresponding impulse response h(n) (same as filter coefficients) can be calculated by evaluating the integral defined in Equation (4.26). A finite-duration impulse response {h(n)} can be simply obtained by truncating the ideal infinite-length impulse response defined in Equation (4.26). That is, h(n) = h(n), −M ≤ n ≤ M 0, otherwise . (4.27) Note that in this definition, we assume L to be an odd number. A causal FIR filter can be derived by shifting the h(n) sequence to the right by M samples and reindexing the coefficients as b l = h(l − M), l = 0, 1,...,2M. (4.28) This FIR filter has L(= 2M + 1) coefficients b l ,l = 0, 1,...,L − 1. The impulse response is symmetric about b M due to the fact that h(−n) = h(n) is given in Equation (4.26). Therefore, the transfer function B (z) has a linear phase and a constant group delay. Example 4.6: The ideal lowpass filter given in Figure 4.2(a) has frequency response H(ω) = 1, |ω| ≤ ωc 0, otherwise . 
(4.29) The corresponding impulse response can be computed using Equation (4.26) as h(n) = 1 2π π −π H(ω)e jωn dω = 1 2π ωc −ωc e jωn dω = 1 2π e jωn jn ωc −ωc = 1 2π e jωcn − e− jωcn jn = sin (ωcn) πn = ωc π sinc ωcn π , (4.30)JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 198 DESIGN AND IMPLEMENTATION OF FIR FILTERS where the sinc function is defined as sinc (x) = sin (πx) πx . By setting all impulse response coefficients outside the range −M ≤ n ≤ M to zero, we obtain an FIR filter with the symmetry property. By shifting M units to the right, we obtain a causal FIR filter of finite length L with coefficients b l = ωc π sinc ωc(l−M) π , 0 ≤ l ≤ L − 1 0, otherwise . (4.31) Example 4.7: Design a lowpass FIR filter with the frequency response H ( f ) = 1, 0, 0 ≤ f ≤ 1 kHz 17 kHz < f ≤ 4 kHz , where the sampling rate is 8 kHz. The impulse response is limited to 2.5 ms. Since 2MT = 0.0025 s and T = 0.000125 s, we need M = 10. Thus, the actual filter has 21 (L = 2M + 1) coefficients, and 1 kHz corresponds to ωc = 0.25π. From Equation (4.31), we have b l = 0.25sinc [0.25 (l − 10)] , l = 0, 1,...,20. Example 4.8: Design a lowpass filter of cutoff frequency ωc = 0.4π with filter length L = 41 and L = 61. When L = 41, M = (L − 1)/2 = 20. From Equation (4.31), the designed coefficients are b l = 0.4sinc [0.4(l − 20)] ,l = 0, 1,...,20. When L = 61, M = (L − 1)/2 = 30. The coefficients become b l = 0.4sinc [0.4(l − 30)] ,l = 0, 1,...,30. These coefficients are computed and plotted in Figure 4.11 using the MATLAB script example4_8.m. 4.2.2 Gibbs Phenomenon As shown in Figure 4.11, the FIR filter obtained by simply truncating the impulse response of the desired filter exhibits an oscillatory behavior (or ripples) in its magnitude response. As the length of the filter increases, the number of ripples in both passband and stopband increases, and the width of the ripple de- creases. The largest ripple occurs near the transition discontinuity and their amplitude is independent of L. The truncation operation described in Equation (4.27) can be considered as multiplication of the infinite-length sequence h(n) by the rectangular window w(n). That is, h(n) = h(n)w(n), −∞ ≤ n ≤∞, (4.32) where the rectangular window w(n) is defined as w(n) = 1, 0, −M ≤ n ≤ M otherwise . (4.33)JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 199 1 0.8 0.6 0.4 0.2 0 0123−3 −2 −1 Frequency Magnitude Magnitude response 1 0.8 0.6 0.4 0.2 0 −3 −2 −10123 Frequency Magnitude Magnitude response Figure 4.11 Magnitude responses of lowpass filters designed by Fourier series method: (top) L = 41; (bottom) L = 61JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 200 DESIGN AND IMPLEMENTATION OF FIR FILTERS 40 Magnitude Response, M = 8 Magnitude (dB) 20 0 0123 −20 −40 −3 −2 −1 40 Magnitude Response, M = 20 Frequency Magnitude (dB) 20 0 0123 −20 −40 −3 −2 −1 Figure 4.12 Magnitude responses of the rectangular window for M = 8 and 20 In order to approximate H(ω) very closely, we need the window function with infinite length. Example 4.9: The oscillatory behavior of a truncated Fourier series representation of FIR filter, observed in Figure 4.11, can be explained by the frequency response of the rectangular window defined in Equation (4.33). It can be expressed as W(ω) = M n=−M e− jωn = sin [(2M + 1)ω/2] sin(ω/2) . (4.34) Magnitude responses of W(ω) for M = 8 and 20 are generated by MATLAB script exam- ple4_9.m. As illustrated in Figure 4.12, the magnitude response has a mainlobe centered at ω = 0. 
All the other ripples are called the sidelobes. The magnitude response has the first zero at ω = 2π/(2M + 1). Therefore, the width of the mainlobe is 4π/(2M + 1). From Equation (4.34), it is easy to show that the magnitude of mainlobe is |W (0)| = 2M + 1. The first sidelobe is approxi- mately located at frequency ω1 = 3π/(2M + 1) with the magnitude of |W(ω1)| ≈ 2(2M + 1)/3π for M>> 1. The ratio of the mainlobe magnitude to the first sidelobe magnitude is W (0) W (ω1) ≈ 3π 2 = 13.5dB. As ω increases toward π, the denominator grows larger. This results in a damped function shown in Figure 4.12. As M increases, the width of the mainlobe decreases. The rectangular window has an abrupt transition to zero outside the range −M ≤ n ≤ M, which causes the Gibbs phenomenon in the magnitude response. The Gibbs phenomenon can be reduced eitherJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 201 by using a window that tapers smoothly to zero at each end or by providing a smooth transition from the passband to the stopband. A tapered window will reduce the height of the sidelobes and increase the width of the mainlobe, resulting in a wider transition at the discontinuity. This phenomenon is often referred to as leakage or smearing. 4.2.3 Window Functions A large number of tapered windows have been developed and optimized for different applications. In this section, we restrict our discussion to four commonly used windows of length L = 2M + 1. That is, w(n), where n = 0, 1,...,L − 1 and is symmetric about its middle, n = M. Two parameters that predict the performance of the window in FIR filter design are its mainlobe width and the relative sidelobe level. To ensure a fast transition from the passband to the stopband, the window should have a small mainlobe width. On the other hand, to reduce the passband and stopband ripples, the area under the sidelobes should be small. Unfortunately, there is a trade-off between these two requirements. The Hann (Hanning) window function is one period of the raised cosine function defined as w(n) = 0.5 1 − cos 2πn L − 1 , n = 0, 1,...,L − 1. (4.35) The window coefficients can be generated by the MATLAB function w = hanning(L); which returns the L-point Hanning window function in array w. The MATLAB script hanWindow.m gen- erates window coefficients w(n), n = 1,...,L. For a large L, the peak-to-sidelobe ratio is approximately 31 dB, an improvement of 17.5 dB over the rectangular window. The Hamming window function is defined as w(n) = 0.54 − 0.46 cos 2πn L − 1 , n = 0, 1,...,L − 1, (4.36) which also corresponds to a raised cosine, but with different weights for the constant and cosine terms. The Hamming function tapers the end values to 0.08. MATLAB provides the Hamming window function w = hamming(L); The Hamming window function and its magnitude response generated by MATLAB script hamWin- dow.m are shown in Figure 4.13. The mainlobe width is about the same as the Hanning window, but this window has an additional 10 dB of stopband attenuation (41 dB). The Hamming window provides low ripple over the passband and good stopband attenuation, and it is usually more appropriate for a lowpass filter design. Example 4.10: Design a lowpass FIR filter of cutoff frequency ωc = 0.4π and order L = 61 using the Hamming window. Using the MATLAB script (example4_10.m) similar to the one used in Example 4.8, we plot the magnitude responses of designed filters in Figure 4.14 using both rectangular and Hamming windows. 
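The window design of Example 4.10 can also be reproduced outside MATLAB. The C sketch below is our own illustration (the book's version is the MATLAB script example4_10.m): the causal truncated ideal lowpass response of Equation (4.31) with ωc = 0.4π and L = 61 is multiplied point by point by the Hamming window of Equation (4.36). Because both factors are symmetric about the midpoint M = (L - 1)/2, the resulting coefficients satisfy the linear-phase condition bl = bL-1-l of Equation (4.19).

#include <math.h>
#include <stdio.h>

#define L  61                               /* filter length (Example 4.10)          */
#define PI 3.14159265358979

/* sinc(x) = sin(pi*x)/(pi*x), with sinc(0) = 1 */
static double sinc(double x)
{
    return (x == 0.0) ? 1.0 : sin(PI * x) / (PI * x);
}

int main(void)
{
    double wc = 0.4 * PI;                   /* cutoff frequency                      */
    int    M  = (L - 1) / 2;                /* shift that makes the filter causal    */
    double b[L];

    for (int l = 0; l < L; l++) {
        /* ideal (truncated, shifted) lowpass coefficients, Equation (4.31)          */
        double ideal = (wc / PI) * sinc((wc / PI) * (l - M));
        /* Hamming window, Equation (4.36)                                           */
        double w = 0.54 - 0.46 * cos(2.0 * PI * l / (L - 1));
        b[l] = ideal * w;                   /* apply the window to the ideal response */
    }

    /* the coefficients are symmetric: b[l] == b[L-1-l] (linear phase)               */
    for (int l = 0; l < L; l++)
        printf("b[%2d] = %9.6f\n", l, b[l]);
    return 0;
}

The printed coefficients can be compared with the impulse response plotted by example4_10.m; the wider transition band relative to the rectangular-window design comes entirely from the window, not from the ideal response.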
We observe that the ripples produced by the rectangular window design are virtually eliminated from the Hamming window design. The trade-off of eliminating the ripples is increasing transition width.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 202 DESIGN AND IMPLEMENTATION OF FIR FILTERS Hamming window, L = 41 1 0.8 0.6 0.4 0.2 0 40 20 0 0 0 5 10 15 20 25 30 35 40 123 −20 −40 −60 −80 −3 −2 −1 Magnitude response Frequency Magnitude (dB) Amplitude Figure 4.13 Hamming window function (top) and its magnitude response (bottom), L = 41 Magnitude response 1 0.8 0.6 0.4 0.2 0 Frequency 0−1−2−3 123 Magnitude Rectangular window Hamming window Figure 4.14 Magnitude response of lowpass filter using Hamming window, L = 61JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 203 The Blackman window function is defined as w(n) = 0.42 − 0.5 cos 2πn L − 1 + 0.08 cos 4πn L − 1 , n = 0, 1,...,L − 1. (4.37) This function is also supported by the MATLAB function w = blackman(L); This window can be generated and its magnitude response can be plotted by MATLAB script black- manWindow.m. The addition of the second cosine term in Equation (4.37) has the effect of increasing the width of the mainlobe (50%), but at the same time improving the peak-to-sidelobe ratio to about 57 dB. The Blackman window provides 74 dB of stopband attenuation, but with a transition width six times that of the rectangular window. The Kaiser window is defined as w(n) = I0 β 1 − (n − M)2 /M2 I0(β) , n = 0, 1,...,L − 1, (4.38) where β is an adjustable (shape) parameter and I0(β) = ∞ k=0 (β/2)k k! 2 (4.39) is the zero-order modified Bessel function of the first kind. MATLAB provides Kaiser window kaiser(L,beta); The window function and its magnitude response for L = 41 and β = 8 can be displayed using the MATLAB script kaiserWindow.m. The Kaiser window is nearly optimum in the sense of having the largest energy in the mainlobe for a given peak sidelobe level. Providing a large mainlobe width for the given stopband attenuation implies the sharpness transition width. This window can provide dif- ferent transition widths for the same L by choosing the parameter β to determine the trade-off between the mainlobe width and the peak sidelobe level. The procedure of designing FIR filters using Fourier series and windows is summarized as follows: 1. Determine the window type that will satisfy the stopband attenuation requirements. 2. Determine the window size L based on the given transition width. 3. Calculate the window coefficients w(l), l = 0,1,...,L− 1. 4. Generate the ideal impulse response h(n) using Equation (4.26) for the desired filter. 5. Truncate the ideal impulse response of infinite length using Equation (4.27) to obtain h(n), −M ≤ n ≤ M. 6. Make the filter causal by shifting the result M units to the right using Equation (4.28) to obtain b l , l = 0, 1,...,L − 1.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 204 DESIGN AND IMPLEMENTATION OF FIR FILTERS 7. Multiply the window coefficients obtained in Step 3 and the impulse response coefficients obtained in Step 6. That is, bl = b l w(l), l = 0, 1,...,L − 1. (4.40) Applying a window to an FIR filter’s impulse response has the effect of smoothing the resulting filter’s magnitude response. A symmetric window will preserve a symmetric FIR filter’s linear-phase response. MATLAB provides a GUI tool called Window Design & Analysis Tool (WinTool) that allows users to design and analyze windows. 
It can be activated by entering the following command in MATLAB command window: wintool It opens with a default 64-point Hamming window as shown in Figure 4.15. WinTool has three pan- els: Window Viewer, Window List, and Current Window Information. Window Viewer displays the time-domain (left) and frequency-domain (right) representations of the selected window(s). Three Figure 4.15 Default window for WinToolJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 205 measurements are displayed under the time-domain and frequency-domain plots: (1) Leakage factor indicates the ratio of power in the sidelobes to the total window power. (2) Relative sidelobe attenuation shows the difference in height from the mainlobe peak to the highest sidelobe peak. (3) Mainlobe width (−3 dB) shows the width of the mainlobe at 3 dB below the mainlobe peak. Window List panel lists the windows available for display in the Window Viewer. Highlight one or more windows to display them. There are four Window List buttons: (1) Add a New Window, (2) Copy Window, (3) Save to Workspace, and (4) Delete. Each window is defined by the parameters in the Current Window Information panel. We can change the current window’s characteristics by changing its parameters. From the Type pull-down menu, we can choose different windows available in the Signal Processing Toolbox. From the Length box, we can specify number of samples. With this tool, we can evaluate different windows. For example, we can click Add a New Window button and then select a new window Hann in the Type pull-down menu. We repeat this process for Blackman and Kaiser windows. We then highlight all four (including the default Hamming) windows in the Select Windows to Display box. As shown in Figure 4.16, we have four window functions and magnitude responses displayed in the same graph for comparison. Figure 4.16 Comparison of Hamming, Hann, Blackman, and Kaiser windowsJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 206 DESIGN AND IMPLEMENTATION OF FIR FILTERS 4.2.4 Design of FIR Filters Using MATLAB Filter design algorithms use iterative optimization techniques to minimize the error between the desired and actual frequency responses. The most widely used algorithm is the ParksÐMcClellan algorithm for designing the optimum linear-phase FIR filter. This algorithm spreads out the error to produce equal- magnitude ripples. In this section, we consider only the design methods and filter functions available in MATLAB Signal Processing Toolbox, which are summarized in Table 4.1, and the MATLAB Filter Design Toolbox provides more advanced FIR filter design methods. As an example, fir1 and fir2 functions design FIR filters using windowed Fourier series method. The function fir1 designs FIR filters using the Hamming window as b = fir1(L, Wn); where Wn is the normalized cutoff frequency between 0 and 1. The function fir2 designs an FIR filter with arbitrarily shaped magnitude response as b = fir2(L, f, m); where the frequency response is specified by vectors f and m that contain the frequency and magnitude, respectively. The frequencies in f must be between 0 < f <1 in increasing order. A more efficient Remez algorithm designs the optimum linear-phase FIR filters based on the ParksÐ McClellan algorithm. This algorithm uses the Remez exchange and Chebyshev approximation theory to design a filter with an optimum fit between the desired and actual frequency responses. 
This remez function has syntax as follows: b = remez(L, f, m); Example 4.11: Design a linear-phase FIR bandpass filter of length 18 with a passband from normalized frequency 0.4Ð0.6. This filter can be designed and displayed using the following MATLAB script (example4_11.m): f = [0 0.3 0.4 0.6 0.7 1]; m=[001100]; b = remez(17, f, m); [h, omega] = freqz(b, 1, 512); plot(f, m, omega/pi, abs(h)); The desired and obtained magnitude responses are shown in Figure 4.17. Table 4.1 List of FIR filter design methods and functions available in MATLAB Design method Filter function Description Windowing fir1, fir2, kaiserord Truncated Fourier series with windowing methods Multiband with transition bands firls, firpm, firpmord Equiripple or least squares approach Constrained least squares fircls, fircls1 Minimize squared integral error over entire frequency range Arbitrary response cfirpm Arbitrary responsesJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 207 Magnitude response 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10 Normalized Frequency Magnitude Actual filter Ideal filter Figure 4.17 Magnitude responses of the desired and actual FIR filters 4.2.5 Design of FIR Filters Using FDATool The Filter Design and Analysis Tool (FDATool) is a graphical user interface (GUI) for designing, quan- tizing, and analyzing digital filters. It includes a number of advanced filter design techniques and supports all the filter design methods in the Signal Processing Toolbox. This tool has the following functions: 1. designing filters by setting filter specifications; 2. analyzing designed filters; 3. converting filters to different structures; and 4. quantizing and analyzing quantized filters. Note that the last feature is available only with the Filter Design Toolbox. In this section, we introduce the FDATool for designing and quantizing FIR filters. We can open the FDATool by typing fdatool at the MATLAB command window. The Filter Design & Analysis Tool window is shown in Figure 4.18. We can choose from several response types: Lowpass, Highpass, Bandpass, Bandstop, and Differen- tiator. For example, to design a bandpass filter, select the Radio button next to Bandpass in the Response Type region on the GUI. It has multiple options for Lowpass, Highpass, and Differentiator types. More response types are available with the Filter Design Toolbox.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 208 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.18 FDATool window It is important to compare the Filter Specifications region in Figure 4.18 with Figure 4.4. The param- eters Fpass, Fstop, Apass, and Astop are corresponding to ωp,ωs, Ap, and As, respectively. These parameters can be entered in the Frequency Specifications and Magnitude Specifications regions. The frequency units are Hz (default), kHz, MHz, or GHz, and the magnitude options are dB (default) or Linear. Example 4.12: Design a lowpass FIR filter with the following specifications: sampling frequency ( fs) = 8 kHz; passband cutoff frequency (ωp) = 2 kHz; stopband cutoff frequency (ωs) = 2.5 kHz; passband ripple (Ap) = 1 dB; and stopband attenuation (As) = 60 dB. We can easily design this filter by entering parameters in Frequency Specifications and Mag- nitude Specifications regions as shown in Figure 4.19. Pressing Design Filter button computes the filter coefficients. The Filter Specifications region will show the Magnitude Response (dB) (see Figure 4.20). 
We can analyze different characteristics of the designed filter by clicking the Analysis menu. For example, selecting the Impulse Response available in the menu opens a new Impulse Response window to display the designed FIR filter coefficients as shown in Figure 4.21.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 209 Figure 4.19 Frequency and magnitude specifications for a lowpass filter 50 0 0 0.5 1 1.5 2 Frequency (kHz) Magnitude response (dB) Magnitude (dB) 2.5 3 3.5 −50 −100 −150 Figure 4.20 Magnitude response of the designed lowpass filter 0.6 0.4 0 0.5 1 1.5 2 Time (ms) Impulse response Amplitude 2.5 3 3.5 0.2 0 −0.2 Figure 4.21 Impulse responses (filter coefficients) of the designed filterJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 210 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.22 Setting fixed-point quantization parameters in the FDATool We have two options for determining the filter order: we can specify the filter order by Specify Order, or use the default Minimum Order. In Example 4.12, we use the default minimum order, and the order (31) is shown in the Current Filter Information region. Note that order = 31 means the length of FIR filter is L = 32, which is shown in Figure 4.21 with 32 coefficients. Once the filter has been designed (using 64-bit double-precision floating-point arithmetic and represen- tation) and verified, we can turn on the quantization mode by clicking the Set Quantization Parameters button on the side bar shown in Figure 4.18. The bottom-half of the FDATool window will change to a new pane with the default Double-Precision Floating-Point as shown in the Filter Arithmetic menu. The Filter Arithmetic option allows users to quantize the designed filter and analyze the effects with different quantization settings. When the user has chosen an arithmetic setting (single-precision floating-point or fixed-point), FDATool quantizes the current filter according to the selection and updates the information displayed in the analysis area. For example, to enable the fixed-point quantization settings in the FDATool, select Fixed-Point from the pull-down menu. The quantization options appear in the lower pane of the FDATool window as shown in Figure 4.22. As shown in Figure 4.22, there are three tabs in the dialog window for user to select quantization tasks from the FDATool: 1. Coefficients tab defines the coefficient quantization. 2. Input/Output tab quantizes the input and output signals for the filter. 3. Filter Internals tab sets a variety of options for the arithmetic. After setting the proper options for the desired filter, click Apply to start the quantization processes. The Coefficients tab is the default active pane. The filter type and structure determine the available options. Numerator Wordlength sets the wordlength used to represent coefficients of FIR filters. Note that the Best-Precision Fraction Lengths box is also checked and the Numerator Wordlength box is set to 16 by default. We can uncheck the Best-Precision Fraction Lengths box to specify Numerator Frac. Length or Numerator Range (+/−). The Filter Internals tab as shown in Figure 4.23 specifies how the quantized filter performs arithmetic operations. Round towards options,Ceiling (round up), Nearest, Nearest (convergent), Zero,orFloor (round down), set a rounding mode that the filter will be used to quantize the numeric values. Overflow Mode options, Wrap and Saturate, set to overflow conditions in fixed-point arithmetic. 
Product modeJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 DESIGN OF FIR FILTERS 211 Figure 4.23 Setting filter arithmetic operations in the FDATool options, Full precision, Keep MSB, Keep LSB,orSpecify all (set the fraction length), determine how the filter handles the output of the multiplication operations. The Accum. mode option determines how the accumulator stores its output values. Example 4.13: Design a bandpass FIR filter for a 16-bit fixed-point DSP processor with the following specifications: Sampling frequency = 8000 Hz. Lower stopband cutoff frequency (Fstop1) = 1200 Hz. Lower passband cutoff frequency (Fpass1) = 1400 Hz. Upper passband cutoff frequency (Fpass2) = 1600 Hz. Upper stopband cutoff frequency (Fstop2) = 1800 Hz. Passband ripple = 1 dB. Stopband (both lower and upper) attenuation = 60 dB. After entering these parameters in the Frequency Specifications and Magnitude Specifications regions and clicking Design Filter, Figure 4.24 will be displayed. Click the Set Quantization Parameters button to switch to quantization mode and open the quantization panel. Selecting the Fixed-point option from the Filter arithmetic pull-down menu, the analysis areas will show the magnitude responses for both the designed filter and the fixed-point quantized filter. The default settings in Coefficients, Input/Output, and Filter Internals Taps are used. We can export filter coefficients to MATLAB workspace to a coefficient file or MAT-file. To save the quantized filter coefficients as a text file, select Export from the File menu on the toolbar. When the Export dialog box appears, select Coefficient File (ASCII) from the Export to menu and choose Decimal, Hexadecimal,orBinary from the Format options. After clicking the OK button, the Export Filter Coefficients to .FCF File dialog box will appear. Enter a filename and click the Save button. To create a C header file containing filter coefficients, select Generate C header from the Targets menu. For an FIR filter, variable used in C header file are for numerator name and length. We can use the default variable names B and BL as shown in Figure 4.25 in the C program, or change them to match the variable names defined in the C program that will include this header file. As shown in the figure, we can use the default Signed 16-Bit Integer with 16-Bit Fractional Length, or select Export as and choose the desired data type. Clicking Generate button opens Generate C Header dialog box. Enter the filename and click Save to save the file.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 212 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.24 FDATool window for designing a bandpass filter Figure 4.25 Generate C header dialog boxJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 213 Example 4.14:We continue the Example 4.13 by saving the quantized 16-bit FIR filter coefficients in a file named as Bandpass1500FIR.h. 
The parameters and filter coefficients saved in the header file are shown as follows: const int BL = 80; const int 16_T B[80] = { 79, -48, -126, -86, 71, 155, 34, -148, -149, 28, 135, 59, -24, 23, 20, -188, -296, 101, 674, 492, -614, -1321, -315, 1563, 1806, -480, -2725, 1719, 1886, 3635, 784, -3559, -3782, 931, 4906, 2884, -2965, -5350, -1080, 4634, 4634, -1080, -5350, -2965, 2884, 4906, 931, -3782, -3559, 784, 3635, 1886, -1719, -2725, -480, 1806, 1563, -315, -1321, -614, 492, 674, 101, -296, -188, 20, 23, -24, 59, 135, 28, -149, -148, 34, 155, 71, -86, -126, -48, 79 }; If the TMS320C5000 CCS is also installed on the computer, the Targets pull-down menu has additional option called Composer Studio (tm) IDE. Selecting this option, the Export to Code Composer Studio (R) IDE dialog box appears as shown in Figure 4.26. Comparing with Figure 4.25, we have additional options in Export mode: C Header File or Write Directly to Memory. In addition, we can select target DSP board such as the C5510 DSK. The MATLAB connects with DSK via MATLAB Link for CCS. This useful feature can simplify the DSP development and testing procedures by combining MATLAB functions with DSP processors. The MATLAB Link for CCS will be introduced in Chapter 9. 4.3 Implementation Considerations In this section, we will consider finite-wordlength effects of digital FIR filters, and discuss the software implementation using MATLAB and C to illustrate some important issues. 4.3.1 Quantization Effects in FIR Filters Consider the FIR filter given in Equation (3.22). The filter coefficients, bl , are determined by a filter design package such as MATLAB. These coefficients are usually represented by double-precision floating-point numbers and have to be quantized for implementation on a fixed-point processor. The filter coefficients are quantized and analyzed during the design process. If it no longer meets the given specifications, we shall optimize, redesign, restructure, and/or use more bits to satisfy the specifications. Let b l denote the quantized values corresponding to bl . As discussed in Chapter 3, the nonlinear quantization can be modeled as a linear operation expressed as b l = Q[bl ] = bl + e(l), (4.41) where e(l) is the quantization error and can be assumed as a uniformly distributed random noise of zero mean. The frequency response of the actual FIR filter with quantized coefficients b l can be expressed as B (ω) = B (ω) + E (ω) , (4.42)JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 214 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.26 FDATool exports to CCS: (a) FDATool exports to CCS dialog box; (b) link for CCS target selection dialog box; (c) C5510 DSK linked with MATLABJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 215 Figure 4.26 (Continued) where E (ω) = L−1 l=0 e(l)e− jωl (4.43) represents the error in the desired frequency response B(ω). The error spectrum is bounded by |E (ω)| = L−1 l=0 e(l)e− jωl ≤ L−1 l=0 |e (l)| e− jωl ≤ L−1 l=0 |e (l)|. (4.44) As shown in Equation (3.82), |e (l)| ≤  2 = 2−B. (4.45) Thus, Equation (4.44) becomes |E (ω)| ≤ L · 2−B. (4.46) This bound is too conservative because it can only be reached if all errors, e(l), are of the same sign and have the maximum value in the range. A more realistic bound can be derived assuming e(l) is statistically independent random variable. Example 4.15: We first use a least-square method to design the FIR filter coefficients. 
To convert it to the fixed-point FIR filter, we use the filter construction function dfilt and change the arithmeticJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 216 DESIGN AND IMPLEMENTATION OF FIR FILTERS Magnitude response (dB) Magnitude (dB) Normalized frequency (xπ rad/sample) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 20 0 −20 −40 −60 −80 −100 −120 −140 −160 −180 12-bit FIR filter Figure 4.27 Magnitude responses of 12-bit and 16-bit FIR filters setting for the filter to fixed-point arithmetic as follows: hd = dfilt.dffir(b); % Create the direct-form FIR filter set(hd,'Arithmetic','fixed'); The first function returns a digital filter object hd of type dffir (direct-form FIR filter). The second function set(hd,'PropertyName',PropertyValue) sets the value of the specified property for the graphics object with handle hd. We can use FVTool to plot the magnitude responses for both the quantized filter and the corresponding reference filter. The fixed-point filter object hd uses 16 bits to represent the coefficients. We can make several copies of the filter for different wordlengths. For example, we can use h1 = copy(hd); % Copy hd to h1 set(h1,'CoeffWordLength',12); % Use 12 bits for coefficients The MATLAB script is given in example4 15.m, and the magnitude responses of FIR filters with 16-bit and 12-bit coefficients are shown in Figure 4.27. 4.3.2 MATLAB Implementations For simulation purposes, it is convenient to use a powerful software package such as MATLAB for software implementation of digital filter. MATLAB provides the function filter for FIR and IIRJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 217 Figure 4.28 Export window from FDATool filtering. The basic form of this function is y = filter(b, a, x) For FIR filtering, a = 1 and filter coefficients bl are contained in the vector b. The input vector is x while the output vector generated by the filter is y. Example 4.16: A 1.5 kHz sinewave with sampling rate 8 kHz is corrupted by white noise. This noisy signal can be generated, saved in file xn int.dat, and plotted by MATLAB script example4 16.m. Note that we normalized the floating-point numbers and saved them in Q15 integer format using the following MATLAB commands: xn_int = round(32767*in./max(abs(in)));% Normalize to 16-bit integer fid = fopen('xn_int.dat','w'); % Save signal to xn_int.dat fprintf(fid,'%4.0f\ n',xn_ int); % Save in integer format Using the bandpass filter designed in Example 4.13, we export FIR filter coefficients to current MATLABworkplace by selecting File→Export. From the pop-up dialog box Export shown in Fig- ure 4.28, type b in the Numerator box, and click OK. This saves the filter coefficients in vector b, which is available for use in current MATLAB directory. Now, we can perform FIR filtering using the MATLAB function filter by the command: y = filter(b, 1, xn_int); The filter output is saved in y vector of workspace, which can be plotted to compare with the input waveform. Example 4.17: This example evaluates the accuracy of the fixed-point filter when compared to a double-precision floating-point version using random data as input signal. 
We create a quantizer to generate uniformly distributed white-noise data using 16-bit wordlength as

rand('state',0);                      % Initialize the random number generator
q = quantizer([16,15],'RoundMode','round');
xq = randquant(q,256,1);              % 256 samples
xin = fi(xq,true,16,15);

Now xin is an array of 256 samples represented as a fixed-point object (a fi object). We then perform the actual fixed-point filtering as follows:

y = filter(hd,xin);

The complete MATLAB program is given in example4_17.m.

4.3.3 Floating-Point C Implementations

The FIR filtering implementation usually begins with floating-point C, migrates to fixed-point C, and then to assembly language programs.

Example 4.18: The input data is denoted as x and the filter output as y. The filter coefficients are stored in the coefficient array h[ ]. The filter delay line (signal vector) w[ ] keeps the past data samples. The sample-by-sample floating-point C program is listed as follows:

void floatPointFir(float *x, float *h, short order, float *y, float *w)
{
    short i;
    float sum;

    w[0] = *x++;                   // Get current data to delay line
    for (sum=0, i=0; i<order; i++) // FIR filtering
    {
        sum += h[i]*w[i];
    }
    *y++ = sum;                    // Save filter output
    for (i=order-1; i>0; i--)      // Update signal buffer
    {
        w[i] = w[i-1];
    }
}

The signal buffer w[ ] is updated every sampling period as shown in Figure 4.9. For each update, the oldest sample at the end of the signal buffer is discarded and the remaining samples are shifted one location down in the buffer. The most recent data sample x(n) is inserted at the top location w[0].

It is more efficient to implement DSP algorithms using a block-processing technique. For many practical applications such as wireless communications, speech processing, and video compression, the signal samples are usually grouped into packets or frames. An FIR filter that processes data frame by frame instead of sample by sample is called a block-FIR filter. With the circular addressing mode available on most DSP processors, the shifting of data in the signal buffer can be replaced by a circular buffer.

Example 4.19: The block-FIR filtering function processes one block of data samples for each function call. The input samples are stored in the array x[ ] and the filtered output samples are stored in the array y[ ]. In the following C program, the block size is denoted as blkSize, and the circular-buffer index of the signal buffer w[ ] is carried between calls in *index:

void floatPointBlockFir(float *x, short blkSize, float *h, short order,
                        float *y, float *w, short *index)
{
    short i,j,k;
    float sum;
    float *c;

    k = *index;
    for (j=0; j<blkSize; j++)          // Block processing
    {
        w[k] = *x++;                   // Get current data to delay line
        c = h;
        for (sum=0, i=0; i<order; i++) // FIR filtering
        {
            sum += *c++ * w[k++];
            if (k == order)            // Simulate circular buffer
            {
                k = 0;
            }
        }
        *y++ = sum;                    // Save filter output
        if (k-- <= 0)                  // Update index for next time
        {
            k = order-1;
        }
    }
    *index = k;                        // Update circular buffer index
}

4.4 Applications: Interpolation and Decimation Filters

In many applications such as interconnecting DSP systems operating at different sampling rates, sampling frequency changes are necessary. The process of converting a digital signal to a different sampling rate is called sampling-rate conversion. The key processing for sampling-rate conversion is lowpass FIR filtering. Increasing the sampling rate by an integer factor U is called interpolation, while decreasing it by an integer factor D is called decimation. The combination of interpolation and decimation allows a digital system to change the sampling rate by any rational ratio. One of the main applications of decimation is to eliminate the need for high-quality analog antialiasing filters.
In an audio system that uses oversampling and decimation, the analog input is first filtered by a simple analog antialiasing filter and then sampled at a higher rate. The decimation filter then reduces the bandwidth of the sampled digital signal. The digital decimation filter provides high-quality lowpass filtering and reduces the cost of using expensive analog filters.

4.4.1 Interpolation

Interpolation is the process of inserting additional samples between successive samples of the original low-rate signal, and filtering the interpolated samples with an interpolation filter. For a 1:U interpolator, the process inserts (U − 1) zeros between successive samples of the original signal x(n) of sampling rate fs, thus the sampling rate is increased to Ufs, or the sampling period is reduced to T/U. This intermediate signal, x′(n), is then filtered by a lowpass filter to produce the final interpolated signal y(n). The simplest lowpass filter is a linear-phase FIR filter. The FDATool introduced in Section 4.2.5 can be used for designing this interpolation filter. The interpolating filter B(z) operates at the high rate f′s = Ufs with the frequency response

B(ω) = U,  0 ≤ ω ≤ ωc
       0,  ωc < ω ≤ π,                    (4.47)

where the cutoff frequency is determined as

ωc = π/U  or  fc = f′s/(2U) = fs/2.       (4.48)

Since the insertion of (U − 1) zeros spreads the energy of each signal sample over U output samples, the gain U compensates for the energy loss of the up-sampling process. The interpolation increases the sampling rate, while the bandwidth (fs/2) of the interpolated signal is still the same as that of the original signal.

Because the interpolation introduces (U − 1) zeros between successive samples of the input signal, only one out of every U input samples sent to the interpolation filter is nonzero. To efficiently implement this filter, the required filtering operations may be rearranged to operate only on the nonzero samples. Suppose at time n, these nonzero samples are multiplied by the corresponding FIR filter coefficients b0, bU, b2U, ..., bL−U. At the following time n + 1, the nonzero samples are multiplied by the coinciding filter coefficients b1, bU+1, b2U+1, ..., bL−U+1. This can be accomplished by replacing the high-rate FIR filter of length L with U shorter polyphase filters Bm(z), m = 0, 1, ..., U − 1, of length I = L/U operating at the low rate fs. The computational efficiency of the polyphase filter structure comes from dividing the single L-point FIR filter into a set of smaller filters of length L/U, each of which operates at the lower sampling rate fs. Furthermore, these U polyphase filters share a single signal buffer of size L/U.

Example 4.21: Given the signal file wn20db.dat, which is sampled at 8 kHz, we can use the MATLAB script (example4_21.m) to interpolate it to 48 kHz. Figure 4.29 shows the spectra of the original signal, interpolated by 6 before and after lowpass filtering. This example shows that the lowpass filtering defined by Equation (4.47) removes all folded image spectra. Some useful MATLAB functions used in this example are presented in Section 4.4.4.

4.4.2 Decimation

Decimation of a high-rate signal with sampling rate f′s by a factor D results in the lower rate fs = f′s/D. The down-sampling by a factor of D may be done simply by discarding the (D − 1) samples that lie between the retained low-rate samples.
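As a preview of the decimation filter discussed below, the following C sketch combines an FIR lowpass filter with the down-sampling step and computes only every Dth output, so the discarded samples are never calculated. The routine and its name firDecimate are our own illustration, not code from the experiments.

// Sketch of FIR decimation by an integer factor D: the lowpass filter
// output is computed only for the samples that are kept.
// x[] : nSamples high-rate input samples
// h[] : decimation (lowpass) FIR filter coefficients, length order
// w[] : delay line of length order (zero-initialized by the caller)
// y[] : low-rate output, nSamples/D samples
void firDecimate(const float *x, int nSamples, const float *h, int order,
                 int D, float *w, float *y)
{
    int   n, i, j;
    float sum;

    for (n = 0; n + D <= nSamples; n += D)
    {
        for (j = 0; j < D; j++)        // Shift D new input samples into the delay line
        {
            for (i = order - 1; i > 0; i--)
                w[i] = w[i - 1];
            w[0] = x[n + j];
        }
        sum = 0.0f;                    // Compute only the retained (every Dth) output
        for (i = 0; i < order; i++)
            sum += h[i] * w[i];
        *y++ = sum;
    }
}

The reason the lowpass filter is needed at all is explained next.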
However, decreasing the sampling rate by a factor D reduces the bandwidth by the same factor D. Thus, if the original high-rate signal has frequency components outside the new bandwidth, aliasing will occur. Lowpass filtering the original signal x(n) prior to the decimation process solves the aliasing problem. The cutoff frequency of the lowpass filter is given as

fc = f′s/(2D) = fs/2.                    (4.49)

This lowpass filter is called the decimation filter. The high-rate filter output y′(n) is down sampled to obtain the desired low-rate decimated signal y(n) by discarding (D − 1) samples for every D samples of the filtered signal y′(n). The decimation filter operates at the high rate f′s. Because only every Dth output of the filter is needed, it is unnecessary to compute output samples that will be discarded. Therefore, the overall computation is reduced by a factor of D.

Example 4.22: Given the signal file wn20dba.dat, which is sampled at 48 kHz, we can use the MATLAB script (example4_22.m) to decimate it to 8 kHz. Figure 4.30 shows the spectra of the original signal, decimated by 6 with and without lowpass filtering. This example shows that lowpass filtering before decimation reduces the aliasing. The spectrum in Figure 4.30(c) is essentially the 0–4000 Hz portion of the spectrum in Figure 4.30(a), while the spectrum in Figure 4.30(b) is distorted, especially in the low-magnitude segments.

4.4.3 Sampling-Rate Conversion

Sampling-rate conversion by a rational factor U/D can be done entirely in the digital domain with proper interpolation and decimation factors. We can achieve this digital sampling-rate conversion by first performing interpolation by a factor U, and then decimating the signal by a factor D. For example, we can convert digital audio signals for broadcasting (32 kHz) to professional audio (48 kHz) using a factor of U/D = 3/2. That is, we interpolate the 32 kHz signal with U = 3, then decimate the resulting 96 kHz signal with D = 2 to obtain the desired 48 kHz. It is very important to perform the interpolation before the decimation in order to preserve the desired spectral characteristics. Otherwise, the decimation may remove some of the desired frequency components that cannot be recovered by interpolation.

Figure 4.29 Interpolation by an integer factor: (a) original signal spectrum; (b) interpolation by 6 before lowpass filtering; and (c) interpolation by 6 after lowpass filtering

The interpolation filter must have the cutoff frequency given in Equation (4.48), and the cutoff frequency of the decimation filter is given in Equation (4.49). The frequency response of the combined filter must incorporate the filtering operations for both interpolation and decimation, and hence it should ideally have the cutoff frequency

fc = (1/2) min( fs, f′s ).
(4.50)JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 APPLICATIONS: INTERPOLATION AND DECIMATION FILTERS 223 (a) Original spectrum (b) Decimation by 6 without lowpass filter (c) Decimation by 6 with lowpass filter Frequency (Hz) Frequency (Hz)Frequency (Hz) 0 5000 1000 10002000 20003000 30004000 40000 0 10000 15000 20000 24000 40 40 40 30 30 30 20 20 20 10 10 10 Magnitude (dB) Magnitude (dB) Magnitude (dB) Distorted Figure 4.30 Decimation operation: (a) original signal spectrum; (b) decimation by 6 without lowpass filter; and (c) decimation by 6 with lowpass filter Example 4.23: Convert a sinewave from 48 to 44.1 kHz using the following MATLAB script example4_23.m (adapted from the MATLAB Help menu for upfirdn). Some useful MATLAB functions used in this example are presented in Section 4.4.4. g = gcd(48000, 44100); % Greatest common divisor,g=300 U = 44100/g; % Up sample factor, U=147 D = 48000/g; % Down sample factor,D=160 N = 24*D; b = fir1(N,1/D,kaiser(N+1,7.8562)); % Design FIR filter in b b = U*b; % Passband gain = U Fs = 48000; % Original sampling frequency: 48 kHz n = 0:10239; % 10240 samples, 0.213 seconds long x = sin(2*pi*1000/Fs*n); % Original signal, sinusoid at 1 kHz y = upfirdn(x,b,U,D); % 9408 samples, still 0.213 seconds % Overlay original (48 kHz) with resampled signal (44.1 kHz) in red stem(n(1:49)/Fs,x(1:49)); hold on stem(n(1:45)/(Fs*U/D),y(13:57),'r','filled'); xlabel('Time (seconds)'); ylabel('Signal value'); The original 48 kHz sinewave and the converted 44.1 kHz signal are shown in Figure 4.31.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 224 DESIGN AND IMPLEMENTATION OF FIR FILTERS 0 –1 –0.5 0Signal value 0.5 1 1.5 0.1 0.2 0.3 0.4 Time (s) × 10–3 0.5 0.6 0.7 0.8 0.9 1 Figure 4.31 Sampling-rate conversion from 48 to 44.1 kHz 4.4.4 MATLAB Implementations The interpolation introduced in Section 4.4.1 can be implemented by the MATLAB function interp with the following syntax: y = interp(x, U); The interpolated vector y is U times longer than the original input vector x. The decimation for decreasing the sampling rate of a given sequence can be implemented by the MATLAB function decimate with the following syntax: y = decimate(x, D); This function uses an eighth-order lowpass Chebyshev type-I filter by default. We can employ FIR filter by using the following syntax: y = decimate(x, D, 'fir'); This command uses a 30-order FIR filter generated by fir1(30, 1/D) to filter the data. We can also specify the FIR filter order L by using y = decimate(x, D, L, 'fir'). Example 4.24: Given the speech file timit 4.asc, which is sampled by a 16-bit ADC with sampling rate 16 kHz. We can use the following MATLAB script (example4 24.m) to decimateJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 225 it to 4 kHz: load timit_4.asc -ascii; % Load speech file soundsc(timit_4, 16000) % Play at 16 kHz timit4 = decimate(timit_4,4,60,'fir'); % Decimation by 4 soundsc(timit4, 4000) % Play the decimated speech We can tell the sound quality (bandwidth) difference by listening to timit 4 with 16 kHz bandwidth and timit4 with 2 kHz bandwidth. For sampling-rate conversion, we can use the MATLABfunction gcd to find the conversion factorU/D. 
For example, to convert an audio signal from CD (44.1 kHz) for transmission using telecommunication channels (8 kHz), we can use the following commands: g = gcd(8000, 44100); % Find the greatest common divisor U = 8000/g; % Up sample factor D = 44100/g; % Down sample factor In this example, we obtain U = 80 and D = 441 since g = 100. The sampling-rate conversion algorithm is supported by the function upfirdn in the Signal Processing Toolbox. This function implements the efficient polyphase filtering technique. For example, we can use the following command for sampling-rate conversion: y = upfindn(x, b, U, D); This function first interpolates the signal in vector x with factor U, filters the intermediate resulting signal by the FIR filter given in coefficient vector b, and finally decimates the intermediate result using the factor D to obtain the final output vector y. The quality of the sampling-rate conversion result depends on the quality of the FIR filter. Another function that performs sampling-rate conversion is resample. For example, y = resample(x, U, D); This function converts the sequence in vector x to the sequence in vector y with the sampling ratio U/D. It designs the FIR lowpass filter using firls with a Kaiser window. MATLAB provides the function intfilt for designing interpolation (and decimation) FIR filters. For example, b = intfilt(U, L, alpha); designs a linear-phase FIR filter with the interpolation ratio 1:U and saves the coefficients in vector b. The bandwidth of filter is alpha times the Nyquist frequency. Finally, we can use FDATool designing an interpolation filter by selecting Lowpass and Interpolated FIR as shown in Figure 4.32. For the Options, we can enter U in Interp. Factor box. We can also specify other parameters as introduced in Section 4.2.5. 4.5 Experiments and Program Examples In this section, we will present FIR filter implementation using fixed-point C, assembly programming, and use the C55x DSK for real-time application.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 226 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.32 Design an interpolation filter using FDATool 4.5.1 Implementation of FIR Filters Using Fixed-Point C This experiment uses the block-FIR filtering example presented in Section 4.3.4. The 16-bit test data is sampled at 8000 Hz and has three sinusoidal components at frequencies 800, 1800, and 3300 Hz. The 48-tap bandpass filter is designed using the following MATLAB script: f = [0 0.3 0.4 0.5 0.6 1]; m=[001100]; b = remez(47, f, m); This bandpass filter will attenuate the input sinusoids of frequencies 800 and 3300 Hz. Figure 4.33 shows the CCS plots of the input and output waveforms along with their spectra. We use the file I/O method (introduced in Section 1.6.4) for reading and storing data files. The files used for this experiment are listed in Table 4.2. Procedures of the experiment are listed as follows: 1. Open the project fixedPoint BlockFIR.pjt and rebuild the project. 2. Load and run the program to filter the input data file input.pcm. 3. Validate the output result using CCS plots to show that 800 and 3300 Hz components are removed. 4. Profile the FIR filter performance. 4.5.2 Implementation of FIR Filter Using C55x Assembly Language The TMS320C55x has MAC instructions, circular addressing modes, and zero-overhead nested loops to efficiently support FIR filtering. 
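As a reference model for the assembly routine shown next, a fixed-point (Q15) version of the block-FIR filter of Example 4.19 might look as follows in C. This is our own sketch, not the listing shipped with the experiments; it keeps the product sum in a 32-bit accumulator and rounds the result back to Q15, which is what the MAC and rounding instructions below do in hardware (saturation is omitted for brevity).

// Sketch of a Q15 block-FIR filter with a simulated circular buffer.
// x[], y[] : 16-bit input and output blocks of blkSize samples
// h[]      : Q15 filter coefficients, length order
// w[]      : 16-bit signal buffer of length order
// *index   : circular-buffer index carried between calls
void fixedPointBlockFir(const short *x, short blkSize, const short *h,
                        short order, short *y, short *w, short *index)
{
    short i, j, k;
    long  sum;                          // 32-bit accumulator

    k = *index;
    for (j = 0; j < blkSize; j++)
    {
        w[k] = *x++;                    // Newest sample into the delay line
        for (sum = 0, i = 0; i < order; i++)
        {
            sum += (long)h[i] * w[k++]; // Q15 x Q15 -> Q30 product
            if (k == order)             // Simulate circular addressing
                k = 0;
        }
        *y++ = (short)((sum + 0x4000) >> 15);  // Round and convert back to Q15
        if (k-- <= 0)                   // Next sample goes one slot lower
            k = order - 1;
    }
    *index = k;                         // Save index for the next block
}

Profiling such a C routine gives the fixed-point baseline against which the assembly, symmetric, and dual-MAC versions in the following experiments are compared.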
In this experiment, we use the same filter and input data as the previousJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 227 Figure 4.33 Input and output of the FIR filter. Input waveform (top left) and its spectrum (top right), and output waveform (bottom left) and its spectrum (bottom right) experiment to realize the FIR filter using the following C55x assembly language: rptblocal sample_loop-1 ; Start the outer loop mov *AR0+,*AR3 ; Put the new sample to signal buffer mpym *AR3+,*AR1+,AC0 ; Do the 1st operation || rpt CSR ; Start the inner loop macm *AR3+,*AR1+,AC0 macmr *AR3,*AR1+,AC0 ; Do the last operation with rounding mov hi(AC0),*AR2+ ; Save Q15 filtered value sample_loop The filtering loop counter is CSR and the block FIR loop counter is BRC0. AR0 points to the input buffer x[ ]. The signal buffer w[ ] is pointed by AR3. The coefficient array h[ ] is pointed by AR1. A new Table 4.2 File listing for experiment exp4.5.1 fixedPoint BlockFIR Files Description fixedPointBlockFirTest.c C function for testing block FIR filter fixedPointBlockFir.c C function for fixed-point block FIR filter fixedPointFir.h C header file for block FIR experiment firCoef.h FIR filter coefficients file fixedPoint BlockFIR.pjt DSP project file fixedPoint BlockFIR.cmd DSP linker command file input.pcm Data fileJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 228 DESIGN AND IMPLEMENTATION OF FIR FILTERS Table 4.3 File listing for experiment exp4.5.2 asm BlockFIR Files Description blockFirTest.c C function for testing block FIR filter blockFir.asm Assembly implementation of block FIR filter blockFir.h C header file for block FIR experiment blockFirCoef.h FIR filter coefficients file asm_BlockFIR.pjt DSP project file asm_BlockFIR.cmd DSP linker command file input.pcm Data file sample is placed in the signal buffer, and the inner loop repeats the MAC instructions. The intermediate results are kept in AC0. When the filtering operation is completed, the output y(n) is rounded in Q15 format and stored in the output buffer out[ ], which is pointed at by AR2. Both AR1 and AR3 are configured as circular pointers. The circular addressing mode is set as follows: mov mmap(AR1),BSA01 ; AR1=base address for coefficients mov mmap(T1),BK03 ; Set coefficient array size (order) mov mmap(AR3),BSA23 ; AR3=base address for signal buffer or #0xA,mmap(ST2_55) ; AR1 & AR3 as circular pointers mov #0,AR1 ; Coefficient start from h[0] mov *AR4,AR3 ; Signal buffer start from w[index] The circular addressing mode for signal and coefficient buffers is configured by setting the base address register BSA01 for AR1 and BSA23 for AR3. The length of the circular buffers is determined by BK03. The starting address of the circular buffer for the coefficient vector h is always h[0]. For the signal buffer, the circular buffer starting address depends upon the previous iteration, which is passed by AR4. At the end of computation, the signal buffer pointer AR3 will point at the oldest sample, w(n − L + 1). This offset is kept as shown in Figure 4.10. In this experiment, we set C55x FRCT bit to automatically compensate for the Q15 multiplication. The SMUL and SATD bits are set to handle the saturation of the fractional integer operation. The SXMD bit sets the sign-extension mode. The C55x assembly language implementation of FIR filtering takes order+3 clock cycles to process each input sample. Thus, the 48-tap FIR filter needs 51 cycles, excluding the overhead. 
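Both the C and assembly versions are exercised from a small test program that reads the raw 16-bit file input.pcm, filters it block by block, and writes the result for plotting, in the spirit of the file I/O method of Section 1.6.4. The harness below is a hypothetical, simplified sketch rather than the *Test.c files listed in the tables; it calls the fixed-point C sketch given earlier, and the coefficient array h is assumed to come from a header such as firCoef.h.

#include <stdio.h>

#define BLK_SIZE 128
#define ORDER     48

extern const short h[ORDER];            // Filter coefficients (e.g., from firCoef.h)
void fixedPointBlockFir(const short *x, short blkSize, const short *h,
                        short order, short *y, short *w, short *index);

int main(void)
{
    short  x[BLK_SIZE], y[BLK_SIZE], w[ORDER] = {0};
    short  index = 0;
    size_t n;
    FILE  *fin  = fopen("input.pcm",  "rb");   // 16-bit raw input samples
    FILE  *fout = fopen("output.pcm", "wb");   // Filtered output samples

    if (fin == NULL || fout == NULL)
        return 1;

    while ((n = fread(x, sizeof(short), BLK_SIZE, fin)) > 0)
    {
        fixedPointBlockFir(x, (short)n, h, ORDER, y, w, &index);
        fwrite(y, sizeof(short), n, fout);
    }
    fclose(fin);
    fclose(fout);
    return 0;
}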
Table 4.3 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Open the project asm BlockFIR.pjt and rebuild it. 2. Load the FIR filter project and run the program to filter the input data. 3. Validate the output data to ensure that the 800 and 3300 Hz components are attenuated. 4. Profile the FIR filter performance and compare the result with the fixed-point C implementation. 4.5.3 Optimization for Symmetric FIR Filters The TMS320C55x has two special instructions firsadd and firssub to implement the symmetric and antisymmetric FIR filters, respectively. The syntax of instructions is firsadd Xmem,Ymem,Cmem,ACx,ACyJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 229 where Xmem and Ymem are the signal buffers of {x(n), x(n−1), . . . x(n − L/2 + 1)} and {x(n − L/2), ...x(n − L+1)}, respectively, and Cmem is the coefficient buffer. The firsadd instruction is equivalent to the following parallel instructions: macm *CDP+,ACx,ACy ; bl [x(n − l)+x(n+l − L+1)] || add *ARx+,*ARy+,ACx ; x(n − l+1)+x(n+l − L+2) The macm instruction carries out the multiplyÐaccumulate portion of the symmetric filter operation, and the add instruction adds a pair of samples for the next iteration. The implementation of symmetric FIR filter using the C55x assembly program is listed as follows: rptblocal sample_loop-1 ; To prevent overflow in addition, mov #0,AC0 ; input is scaled to Q14 format || mov AC1<<#-1,*AR3 ; Put input to signal buffer in Q14 add *AR3+,*AR1-,AC1 ; AC1=[x(n)+x(n-L+1)]<<16 || rpt CSR ; Do order/2-2 iterations firsadd *AR3+,*AR1-,*CDP+,AC1,AC0 firsadd *(AR3-T0),*(AR1+T1),*CDP+,AC1,AC0 macm *CDP+,AC1,AC0 ; Finish the last macm instruction mov rnd(hi(AC0<<#1)),*AR2+ ; Store the rounded & scaled result || mov *AR0+,AC1 ; Get next sample sample-loop We need to store only the first half of the symmetric FIR filter coefficients. The inner-repeat loop is set to L/2 − 2 since each multiplyÐaccumulate operation accounts for a pair of samples. In order to use firsadd instruction inside a repeat loop, we add the first pair of filter samples using the dual memory add instruction add *AR3+,*AR1-,AC1 We also place the following instructions outside the repeat loop for the final calculation: firsadd *(AR3-T0),*(AR1+T1),*CDP+,AC1,AC0 macm *CDP+,AC1,AC0 We use two data pointers AR1 and AR3 to address the signal buffer. AR3 points at the newest sample in the buffer, and AR1 points at the oldest sample in the buffer. Temporary registers, T1 and T0, are used as the offsets for updating circular buffer pointers. The offsets are initialized to T0 = L/2 and T1 = L/2 − 2. Figure 4.34 illustrates these two circular buffer pointers for a symmetric FIR filtering. The firsadd instruction accesses three data buses simultaneously. Two implementation issues should be considered. First, the instruction firsadd adds two correspond- ing samples, which may cause an undesired overflow. Second, the firsadd instruction accesses three read operations in the same cycle, which may cause data bus contention. The first problem can be resolved by scaling the input to Q14 format, and scaling the filter output back to Q15. The second problem can be resolved by placing the coefficient buffer and the signal buffer in different memory blocks. The C55x assembly language implementation of symmetric FIR filter takes (order/2) + 4 clock cycles to process each input data. Thus, this 48-tap FIR filter needs 28 cycles for each sample, excluding the overhead. 
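The same fold-and-multiply idea can be expressed in portable C, which is convenient when checking the assembly routine. The sketch below is our own illustration, not symFir.asm; it adds each pair of samples that share a coefficient before the single multiplication, halving the number of multiplies just as firsadd does.

// Sketch of a symmetric FIR filter: h[i] == h[L-1-i], so each coefficient
// multiplies the sum of the two samples that share it.
// w[] holds the delay line w[0] = x(n), ..., w[L-1] = x(n-L+1).
float symmetricFir(const float *h, const float *w, int L)
{
    int   i;
    float sum = 0.0f;

    for (i = 0; i < L / 2; i++)                 // One multiply per coefficient pair
        sum += h[i] * (w[i] + w[L - 1 - i]);

    if (L & 1)                                  // Middle tap when L is odd
        sum += h[L / 2] * w[L / 2];

    return sum;
}

Only the first L/2 coefficients are referenced, which is why the experiment stores just half of the 48 filter coefficients.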
Table 4.4 lists the files used for this experiment.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 230 DESIGN AND IMPLEMENTATION OF FIR FILTERS x(n – L + 2) x(n)x(n – L + 1) x(n – 1) x(n – 2) x(n – 3) AR3 at time nAR1 at time n (a) Circular buffer for a symmetric FIR filter at time n x(n – L + 2) x(n)x(n – L + 1) x(n – 1) x(n – 2) x(n – 3)AR1 for next x(n – L + 1) AR3 for next x(n) (b) Circular buffer for a symmetric FIR filter at time n + 1 Figure 4.34 Circular buffer for a symmetric FIR filtering. The pointers to x(n) and x(n − L + 1) are updated at the counterclockwise direction: (a) circular buffer for a symmetric FIR filter at time n; (b) circular buffer for a symmetric FIR filter at time n + 1 Procedures of the experiment are listed as follows: 1. Open the symmetric BlockFIR.pjt and rebuild the project. 2. Load and run the program to filter the input data. 3. Validate the output data to ensure that the 800 and 3300 Hz components are removed. 4. Profile the FIR filter performance and compare the result with previous C55x assembly language implementation. 4.5.4 Optimization Using Dual MAC Architecture Dual MAC improves the processing speed by generating two outputs, y(n) and y(n + 1), in parallel. For example, the following parallel instructions use dual MAC architecture: rpt CSR mac *ARx+,*CDP+,ACx ; ACx=bl*x(n) :: mac *ARy+,*CDP+,ACy ; ACy=bl*x(n+1) Table 4.4 File listing for experiment exp4.5.3 symmetric BlockFIR Files Description symFirTest.c C function for testing symmetric FIR filter symFir.asm Assembly routine of symmetric FIR filter symFir.h C header file for symmetric FIR experiment symFirCoef.h FIR filter coefficients file symmetric_BlockFIR.pjt DSP project file symmetric_BlockFIR.cmd DSP linker command file input.pcm Data fileJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 231 In this example, ARx and ARy are data pointers to x(n) and x(n + 1), and CDP is the coefficient pointer. The repeat loop produces two filter outputs y(n) and y(n + 1). After execution, pointers CDP, ARx, and ARy are increased by 1. The following example shows the C55x assembly implementation using the dual MAC and circular buffer for a block-FIR filter: rptblocal sample_loop-1 mov *AR0+,*AR1 ; Put new sample to signal buffer x[n] mov *AR0+,*AR3 ; Put next new sample to location x[n+1] mpy *AR1+,*CDP+,AC0 ; The first operation :: mpy *AR3+,*CDP+,AC1 || rpt CSR mac *AR1+,*CDP+,AC0 ; The rest MAC iterations :: mac *AR3+,*CDP+,AC1 macr *AR1,*CDP+,AC0 :: macr *AR3,*CDP+,AC1 ; The last MAC operation mov pair(hi(AC0)),dbl(*AR2+); Store two output data sample-loop There are three implementation issues to be considered when using the dual MAC architecture: (1) We must increase the length of the signal buffer by 1 to accommodate an extra memory location required for computing two signals in parallel. With an additional space in the buffer, we can form two sequences in the signal buffer, one pointed by AR1 and the other by AR3. (2) Dual MAC implementation of the FIR filtering needs three memory reads (two data samples and one filter coefficient) simultaneously. To avoid memory bus contention, we shall place the signal buffer and the coefficient buffer in different memory blocks. (3) The results are kept in two accumulators, thus requires two store instructions to save two output samples. It is more efficient to use the following dual-memory-store instruction mov pair(hi(AC0)),dbl(*AR2+) to save both outputs to the data memory in 1 cycle. 
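In C, the dual-MAC idea corresponds to unrolling the output loop by two so that each coefficient is loaded once and used for both y(n) and y(n + 1). The sketch below is our own illustration, not dualMacFir.asm, and the function name dualOutputFir is hypothetical.

// Sketch of the dual-MAC idea: two outputs per pass through the coefficients.
// w[] is the delay line with w[0] = x(n+1), w[1] = x(n), w[2] = x(n-1), ...
// (note the delay line needs order+1 entries, matching the extra buffer
// location discussed for the assembly implementation).
void dualOutputFir(const short *h, const short *w, int order,
                   short *y0, short *y1)
{
    int  i;
    long acc0 = 0, acc1 = 0;              // Two accumulators, like AC0 and AC1

    for (i = 0; i < order; i++)
    {
        acc0 += (long)h[i] * w[i + 1];    // Contribution to y(n)
        acc1 += (long)h[i] * w[i];        // Contribution to y(n+1)
    }
    *y0 = (short)((acc0 + 0x4000) >> 15); // Round both results back to Q15
    *y1 = (short)((acc1 + 0x4000) >> 15);
}

The two results are then written back together, which is what the dual-memory-store above accomplishes in a single cycle.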
However, the dual-memory-store instruction requires the data to be aligned on an even word (32-bit) boundary. This alignment can be set using the key word align 4 in the linker command file as output : {} > RAM0 align 4 /* word boundary alignment */ and using the DATA SECTION pragma directive to tell the linker where to place the output sequence. Another method to set data alignment is to use DATA ALIGN pragma directive as #pragma DATA_ALIGN(y,2); /* Alignment for dual accumulator store */ The C55x implementation of FIR filter using dual MAC needs (order+3)/2 clock cycles to process each input data. Thus, it needs 26 cycles for each sample excluding the overhead. The files used for this experiment are listed in Table 4.5. Table 4.5 File listing for experiment exp4.5.4 dualMAC BlockFIR Files Description dualMacFirTest.c C function for testing dual MAC FIR filter dualMacFir.asm Assembly routine of dual MAC FIR filter dualMacFir.h C header file for dual MAC FIR experiment dualMacFirCoef.h FIR filter coefficients file dualMAC_BlockFIR.pjt DSP project file dualMAC_BlockFIR.cmd DSP linker command file input.pcm Data fileJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 232 DESIGN AND IMPLEMENTATION OF FIR FILTERS Procedures of the experiment are listed as follows: 1. Open the dual BlockFIR.pjt and rebuild the project. 2. Load the FIR filter project and run the program to filter the input data. 3. Validate the output data to ensure that the 800 and 3300 Hz components are removed. 4. Profile the FIR filter performance and compare the result with previous C55x assembly language implementations. 4.5.5 Implementation of Decimation The implementation of a decimator must consider multistage filter if the decimation factor can be formed by common multiply factors. In this experiment, we will implement the 6:1 decimator using two FIR filters of 2:1 and 3:1 decimation ratios. The two-stage decimator uses the input, output, and temporary buffers. The input buffer size is equal to the frame size multiplied by the decimation factor. For example, when the frame size is chosen as 80, the 48 to 8 kHz decimation will require the input buffer size of 480 (80 * 6). The temporary buffer size (240) is determined as the input buffer size (480) divided by the first decimation factor 2. The offset, D − 1, is preloaded to the temporary register T0. After reading two input data samples to the signal buffer, the address pointers AR1 and AR3 are incremented by D − 1. The decimation FIR filter uses the dual MAC instruction with loop unrolling. The last instruction mov pair(hi(AC0)),dbl(*AR2+) requires the output address pointer to be aligned with even-word boundary. Table 4.6 lists the files used for this experiment. 
|| rptblocal sample_loop-1 mov *(AR0+T0),*AR3 ; Put new sample to signal buffer x[n] mov *(AR0+T0),*AR1 ; Put next new sample to location x[n+1] mpy *AR1+,*CDP+,AC0 ; The first operation :: mpy *AR3+,*CDP+,AC1 || rpt CSR mac *AR1+,*CDP+,AC0 ; The rest MAC iterations :: mac *AR3+,*CDP+,AC1 macr *AR1,*CDP+,AC0 :: macr *AR3,*CDP+,AC1 ; The last MAC operation Table 4.6 File listing for experiment exp4.5.5 decimation Files Description decimationTest.c C function for testing decimation experiment decimate.asm Assembly routine of decimation filter decimation.h C header file for decimation experiment coef48to24.h FIR filter coefficients for 2:1 decimation coef24to8.h FIR filter coefficients for 3:1 decimation decimation.pjt DSP project file decimation.cmd DSP linker command file tone1k_48000.pcm Data file 1 kHz tone at 48 kHz sampling rateJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 233 mov pair(hi(AC0)),dbl(*AR2+); Store two output data sample-loop Procedures of the experiment are listed as follows: 1. Open the decimation.pjt and rebuild the project. 2. Load and run the program to obtain the output data using the input data given in the data folder. 3. The 1000 Hz sinewave at 48 kHz sampling rate will have 48 samples per cycle. Validate the output data of the 1000 Hz sinewave. At 8 kHz sampling rate, each cycle should have eight samples, see Figure 4.35. Figure 4.35 Decimation of the 48 kHz sampling-rate signal to 8 kHz. 1 kHz tone sampled at 48 kHz (top left) and its spectrum (top right). Decimation output of 8 kHz sampling rate (bottoom left) and its spectrum (bottom right) 4. Use MATLAB to plot the spectrum of decimation output to verify that it is 1000 Hz sinewave. 4.5.6 Implementation of Interpolation In this experiment, we interpolate the 8 kHz sampling data to 48 kHz. We will use two interpolation filters with interpolation factors of 2 and 3. The interpolation filter is implemented using fixed-point C program that mimics circular addressing mode. The circular buffer index is kept by the variable index. The coefficient array is h[ ], and the signal buffer is w[ ]. Since we do not have to filter the data samples with zero values, the coefficient array pointer is offset with interpolation factor. Table 4.7 lists the files used for this experiment. The C code is listed as follows: k = *index; for (j=0; j>14); // Save filter output c++; k=m; } k--; if (k<0) // Update index for next time k += order; } Table 4.7 File listing for experiment exp4.5.6_interpolation Files Description interpolateTest.c C function for testing interpolation experiment interpolate.c C function for interpolation filter interpolation.h C header file for interpolation experiment coef8to16.h FIR filter coefficients for 1:2 interpolation coef16to48.h FIR filter coefficients for 1:3 interpolation interpolation.pjt DSP project file interpolation.cmd DSP linker command file tone1k_8000.pcm Data file Ð 1 kHz tone at 8 kHz sampling rate Procedures of the experiment are listed as follows: 1. Open the interpolation.pjt and rebuild the project. 2. Load and run the program to obtain the output data using the input data given in the data folder. 3. The 1000 Hz sinewave input data sampled at 8 kHz will have eight samples in each cycle. Validate the output 1000 Hz sinewave data at 48 kHz sampling rate that each cycle should have 48 samples. 4. Use MATLAB to plot the spectrum of interpolator output to verify that it is a 1000 Hz tone. 
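For reference, the skip-the-zeros structure used by interpolate.c can be sketched in portable C as follows. This is our own illustration and the function name firInterpolate is hypothetical; for each low-rate input sample it produces U high-rate outputs, stepping through the coefficient array with stride U so that the inserted zero-valued samples are never multiplied.

// Sketch of a 1:U FIR interpolator that never multiplies the inserted zeros.
// x[] : nIn low-rate input samples
// h[] : interpolation filter coefficients, length order (order divisible by U)
// w[] : low-rate delay line of length order/U
// y[] : nIn*U high-rate output samples
void firInterpolate(const short *x, int nIn, const short *h, int order,
                    int U, short *w, short *y)
{
    int  n, m, i;
    long sum;

    for (n = 0; n < nIn; n++)
    {
        for (i = order / U - 1; i > 0; i--)   // Shift the low-rate delay line
            w[i] = w[i - 1];
        w[0] = x[n];

        for (m = 0; m < U; m++)               // U polyphase outputs per input sample
        {
            sum = 0;
            for (i = 0; i < order / U; i++)   // Coefficients h[m], h[m+U], h[m+2U], ...
                sum += (long)h[m + i * U] * w[i];
            *y++ = (short)((sum + 0x2000) >> 14); // Assuming Q14 scaling, matching the
                                                  // '>>14' used in the experiment's listing
        }
    }
}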
4.5.7 Sample Rate Conversion In this experiment, we will convert the sampling rate from 48 to 32 kHz. We first interpolate the signal sampled at 48 kHz to 96 kHz, and then decimate it to 32 kHz. The files used for this experiment are listed in Table 4.8. Figure 4.36 illustrates the procedures of sampling-rate conversion from 48 to 32 kHz.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 235 Table 4.8 File listing for experiment exp4.5.7_SRC Files Description srcTest.c C function for testing sample rate conversion interpolate.c C function for interpolation filter decimate.asm Assembly routine for decimation filter interpolation.h C header file for interpolation decimation.h C header file for decimation coef96to32.h FIR filter coefficients for 3:1 decimation SRC.pjt DSP project file SRC.cmd DSP linker command file tone1k_48000.pcm Data file Ð 1 kHz tone at 48 kHz sampling rate The first lowpass filter with cutoff frequency 48 kHz (π/2) may not be necessary in this case since a decimation lowpass filter with narrower cutoff frequency is immediately followed. Procedures of the experiment are listed as follows: 1. Open the SRC.pjt and rebuild the project. 2. Load and run the program to obtain the output data using the input data given in the folder. 3. The input signal sampled at 48 kHz will be converted to 32 kHz at the output. For each period, the output should have 32 samples. 4. Use MATLAB to plot the output spectrum to verify that it is 1000 Hz. 4.5.8 Real-Time Sample Rate Conversion Using DSP/BIOS and DSK In this experiment, we create a DSP/BIOS application that uses C5510 DSK to capture and play back audio samples for real-time sample rate conversion using C5510 DSK. The DSP/BIOS is a small kernel included in the CCS for real-time synchronization, host-target communication, and scheduling. It provides multithreading, real-time analysis, and configuration capabilities to greatly reduce the development effort when hardware and other processor resources are involved. Step 1: Create a DSP/BIOS configuration file A DSP/BIOS program needs a configuration file, which is a window interface that determines application parameters and sets up modules including interrupts and I/Os. To create configuration file for the C55x DSK, we start from CCS menu File→New→DSP Configuration to select dsk5510.cdb and click OK. When the new configuration file is opened, save it as dspbios.cdb. Similar to the previous experiments, create and save the DSP/BIOS project, DSPBIOS.pjt, and add the configuration file to the project. ↓ 3 Decimation by 3Lowpass filter48 kHz 96 kHz 96 kHz 32 kHz 32 kHz ↑ 2 Interpolate by 2 Lowpass fiter π/2 π/2 rr Figure 4.36 Sampling-rate conversionJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 236 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.37 The DSP/BIOS configuration file Double click the configuration file to open it as shown in Figure 4.37. Left click the + sign in front of an item on the left window will open the property of that item on the right window. To change the parameters listed by the configuration file, right click the item and select Properties. The configuration has six items: System, Instrumentation, Scheduling, Synchronization, Input/Output, and Chip Support Library. Under the System item, users can change and modify the processor global settings and adjust memory blocks size and allocation. To make changes, select the item and right click to bring up the Properties of that item. 
In this experiment, we will use the default global settings. Step 2: Create a software interrupt object Open the Scheduling and click the + sign in front of SWI to open the submenu, right click SWI and select Insert SWI to insert a new software interrupt object. Rename the newly inserted SWI0 to swiAudioProcess. Right click swiAudioProcess and select Properties again to open the dialog box, enter new function name _audioProcess to the Function box, and set the Priority to 2 and Mailbox to 3 as shown in Figure 4.38. Step 3: Set up pipe input and output We now connect the input and output of the DSK with DSP/BIOS through the configuration file. Click the + sign in front of the Input/Output to open the submenu, right click the PIP Buffer Pipe Manager and select Insert PIP to insert two new PIPs. Rename one to pipRx and the other to pipTx. This adds two ping-pong data buffers through the DMA for connecting input and output. Right click pipRx and select Properties to open the dialog box. For the experiment, we configure pipRx to receive audio samples and pipTx to transmit audio samples. First, we align the buffer in even-word boundary by setting the bufalign to 2, and we change the buffer size to 480 by modifying the frame- size. We then move to Notify Functions window and change the notifyWriter from _FXN_F_nop to _PLIO_rxPrime, and change the notifyReader from _FXN_F_nop to _SWI_andnHook. In the function _PLIO_rxPrime, we enter _plioRx to the nwarg0 field, and 0 to the nwarg1 field. In the function _SWI_andnHook, we enter _swiAudioProcess to the nrarg0 field, and 1 to the nrarg1 field. We configure pipTx to work with pipRx in a similar way. Note that the notification monitor forJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 237 Figure 4.38 Setting up SWI pipRx is reader, while pipTx uses writer. The settings of pipRx and pipTx are shown in Figures 4.39 and 4.40, respectively. Step 4: Configure the DMA This step connects the input/output of the DSK using the DMA controller. From the DSP/BIOS configuration file dialog window, select Chip Support Library and open DMA→Direct Memory Access Controller. From the DMA Configuration Manager, insert two new dmaCfg objects and rename them as dmaCfgReceive and dmaCfgTransmit. Open dmaCfgReceive and from the Frame tab configure the frame as follows: Data Type = 16-bit. Number of Element (CEN) = 256. Figure 4.39 The settings of pipRx in DSP/BIOS buffered pipe managerJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 238 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.40 The settings of pipTx in DSP/BIOS buffered pipe manager Number of Frames (CFN) = 1. Frame Index (CFI) = 0. Element Index (CEI) = 0. In the Source tab of the dmaCfgReceive, set the source configuration as follows: Burst Enable (SRC BEN) = Single Access (No Burst). Packing (SRC PACK) = No Packing Access. Source Space = Data Space. Source Address Format = Numeric. Start Address (CSSA) = 0x006002. Address Mode (SRC AMODE) = Constant. Transfer Source (SRC) = Peripheral Bus. In the Destination tab of the dmaCfgReceive, set the destination configuration as follows: Burst Enable (DST BEN) = Single Access (No Burst). Packing (DST PACK) = No Packing Access. Destination Space = Data Space. Destination Address Format = Numeric. Start Address (CDSA) = 0x000000. Address Mode (DST AMODE) = Post-incremented. Transfer Destination (DST) = DARAM. 
In the Control tab of the dmaCfgReceive, set the control configuration as follows: Sync Event (SYNC) = McBSP 2 Receive Event (REVT2). Repetitive Operations (REPEAT) = Only if END PROG = 1. End of Programmation (END PROG) = Delay re-initialization. Frame Synchronization (FS) = Disabled. Channel Priority (PRIO) = High. Channel Enable (EN) = Disabled. Auto-initialization (Auto INIT) = Disabled. In the Interrupts tab of the dmaCfgReceive, set the interrupt configuration as follows: Timeout (TIMEOUT IE) = Disabled. Synchronization Event drop (DROP IE) = Disabled.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 239 Half Frame (FALF IE) = Disabled. Frame Complete (FRAME IE) = Enabled. Last Frame (LAST IE) = Disabled. End Block (BLOCK IE) = Disabled. The DMA configuration management also needs to be configured for transmit. Open dmaCfgTransmit and from the Frame tab configure the frame to: Data Type = 16-bit. Number of Element (CEN) = 256. Number of Frames (CFN) = 1. Frame Index (CFI) = 0. Element Index (CEI) = 0. In the Source tab of the dmaCfgTransmit, set the source configuration as follows: Burst Enable (SRC BEN) = Single Access (No Burst). Packing (SRC PACK) = No Packing Access. Source Space = Data Space. Source Address Format = Numeric. Start Address (CSSA) = 0x000000. Address Mode (SRC AMODE) = Post-increment. Transfer Source (SRC) = DARAM. In the Destination tab of the dmaCfgTransmit, set the destination configuration as follows: Burst Enable (DST BEN) = Single Access (No Burst). Packing (DST PACK) = No Packing Access. Destination Space = Data Space. Destination Address Format = Numeric. Start Address (CDSA) = 0x006006. Address Mode (DST AMODE) = Constant. Transfer Destination (DST) = Peripheral Bus. In the Control tab of the dmaCfgTransmit, set the control configurations as follows: Sync Event (SYNC) = McBSP 2 Transmit Event (XEVT2). Repetitive Operations (REPEAT) = Only if END PROG = 1. End of Programmation (END PROG) = Delay re-initialization. Frame Synchronization (FS) = Disabled. Channel Priority (PRIO) = High. Channel Enable (EN) = Disabled. Auto-initialization (Auto INIT) = Disabled. In the Interrupts tab of the dmaCfgTransmit, set the interrupt configuration as follows: Timeout (TIMEOUT IE) = Disabled. Synchronization Event Drop (DROP IE) = Disabled. Half Frame (FALF IE) = Disabled. Frame Complete (FRAME IE) = Enabled. Last Frame (LAST IE) = Disabled. End Block (BLOCK IE) = Disabled. From the DSP/BIOS configuration file dialog window, select Chip Support Library and open DMA→Direct Memory Access Controller. There are six DMA channels in the DMA Resource Man- ager. We set DMA4 for receiving and DMA5 for transmitting for C5510 DSK. Open the DMA4 dialog box by right clicking it and selecting its Properties. Enable the Open Handle to DMA box and specify the DMA handle name as C55XX_DMA_MCBSP_hDmaRx and select dmaCfgReceive as shown in Figure 4.41. Open the DMA5 dialog box, enable Open Handle to DMA box and specify the handle name as C55XX_DMA_MCBSP_hDmaTx and select dmaCfgTransmit as shown in Figure 4.42.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 240 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.41 The settings of DMA4 in DSP/BIOS DMA manager Step 5: McBSP configuration The command and data transfer control between the processor and the AIC23 is via the serial ports as discussed in Chapter 2. The C55x chip support library also provides the McBSP functions through the DSP/BIOS configuration file. 
From the DSP/BIOS configuration file dialog window, select Chip Support Library and open McBSP→Multichannel Buffered Serial Port. Add two new objects to McBSP Configuration Manager and rename them as mcbspCfg1 and mcbspCfg2. Open mcbspCfg1 from the General tab to: Only check the box of Configure DX, PSX, and CLKX as Serial Pins. Uncheck the box of Configure DR, FSR, and CLKX as Serial Pins if checked. Breakpoint Emulation = Stop After Current Word. SPI Mode (CLKSTP) = Falling Edge w/o Delay. Digital Loop Back (DLB) = Disabled. Figure 4.42 The settings of DMA5 in DSP/BIOS DMA managerJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 241 In the Transmit Modes tab of the mcbspCfg1, set the configurations as follows: SPI Clock Mode (CLKXM) = Master. Frame-Sync Polarity (FSXP) = Active Low. DX Pin Delay (DXENA) = Disabled. Transmit Delay (XDATDLY) = 0-bit. Detect Sync Error (XSYNCERR) = Disabled. Interrupt Mode (XINTM) = XRDY. Early Frame Sync Response (XFIG) = Restart Transfer. Companding (XCOMPAND) = No Companding-MSB First. Transmit Frame-Sync Source = DXR(1/2)-to-XSR(1/2) Copy. In the Transmit Lengths tab of the mcbspCfg1, set the configurations as follows: Phase (XPHASE) = Single-phase. Word Length Phase1 (XWDLEN1) = 16-bit. Words/Frame Phase1 (XFRLEN1) = 1. In the Transmit Multichannel tab of the mcbspCfg1, set the configuration as: TX Channel Enable = All 128 Channels. In the Sample-Rate Gen tab of the mcbspCfg1, set the configurations as follows: SRG Clock Source (CLKSM) = CPU Clock. Transmit Frame-Sync Mode (FSXM=1)(FSGM) = Disabled. Frame Width (1-256)(FWID) = 1. Clock Divider (1-256)(CLKGDV) = 100. Frame Period (1-4096)(FRER) = 20. In the GPIO tab of the mcbspCfg1, set the configurations as follows: Select CLKR Pin as = Input. Select FSR Pin as = Input. The McBSP 1 is used for command control and McBSP 2 is used for data transfer. McBSP 2 is configured as bidirectional. Open mcbspCfg2 from the General tab to: Check the box of Configure DX, PSX, and CLKX as Serial Pins. Check the Configure DR, FSR, and CLKX as Serial Pins if checked. Breakpoint Emulation = Stop After Current Word. SPI Mode (CLKSTP) = Disabled. Digital Loop Back (DLB) = Disabled. In the Transmit Modes tab of the mcbspCfg2, set the configurations as follows: Clock Mode (CLKXM) = External. Clock Polarity (CLKXP) = Falling Edge. Frame-Sync Polarity (FSXP) = Active High. DX Pin Delay (DXENA) = Disabled. Transmit Delay (XDATDLY) = 0-bit. Detect Sync Error (XSYNCERR) = Disabled. Interrupt Mode (XINTM) = XRDY. Early Frame Sync Response (XFIG) = Restart Transfer. Companding (XCOMPAND) = No Companding-MSB First. Transmit Frame-Sync Source = External. In the Transmit Lengths tab of the mcbspCfg2, set the configurations as follows: Phase (XPHASE) = Single-phase. Word Length Phase1 (XWDLEN1) = 16-bit. Words/Frame Phase1 (XFRLEN1) = 2. In the Transmit Multichannel tab of the mcbspCfg1, set the configuration as follows: TX Channel Enable = All 128 Channels. In the Receive Modes tab of the mcbspCfg2, set the configurations as follows:JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 242 DESIGN AND IMPLEMENTATION OF FIR FILTERS Clock Mode (CLKXM) = External. Clock Polarity (CLKXP) = Rising Edge. Frame-Sync Polarity (FSXP) = Active High. Receive Delay (RDATDLY) = 0-bit. Detect Sync Error (RSYNCERR) = Disabled. Interrupt Mode (RINTM) = RRDY. Frame-Sync Mode (FSRM) = External. Early Frame Sync Response (RFIG) = Restart Transfer. Sign-Ext and Justification (RJUST) = Right-justify/zero-fill. 
Companding (XCOMPAND) = No Companding-MSB First. In the Receive Lengths tab of the mcbspCfg2, set the configurations as follows: Phase (RPHASE) = Single-phase. Word Length Phase1 (RWDLEN1) = 16-bit. Words/Frame Phase1 (RFRLEN1) = 2. In the Receive Multichannel tab of the mcbspCfg2, set the configuration as follows: RX Channel Enable = All 128 Channels. In the Sample-Rate Gen tab of the mcbspCfg2, set the configurations as follows: SRG Clock Source (CLKSM) = CLKS Pin. Clock Synchronization with CLKS Pin (GSYNC) = Disabled. CLKS Polarity Clock Edge (From CLKS Pin) (CLKSP) = Rising Edge of CLKS. Frame Width (1Ð256)(FWID) = 1. Clock Divider (1Ð256)(CLKGDV) = 1. Frame Period (1Ð4096) (FRER) = 1. From the DSP/BIOS configuration file dialog window, select Chip Support Library and open McBSP→Multichannel Buffered Serial Port. From McBSP Resource Manager modify hMCBSP1 and hMCBSP2 as shown in Figures 4.43 and 4.44, respectively. Step 6: Configuration of hardware interrupts of the DSP/BIOS Open the HWI under the Scheduling from the DSP/BIOS configuration file to connect the interrupts to DSK. Hardware interrupts 14 and 15 are used by the DSK as receive and trans- mit interrupts. Modify HWI INT14 by adding the receive function _C55XX_DMA_MCBSP_rxIsr Figure 4.43 The settings of McBSP 1 in DSP/BIOS McBSP resource managerJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 243 Figure 4.44 The settings of McBSP 2 in DSP/BIOS McBSP resource manager into the Function box and check Use Dispatcher box under the Dispatch tab. Also, mod- ify HWI INT15 by adding the transmit function _C55XX_DMA_MCBSP_txIsr into the Func- tion box and check Use Dispatcher box under the Dispatch tab as shown in Figures 4.45 and 4.46. Step 7: Build and run the real-time DSP/BIOS experiment Open the CCS project, set the project to use large memory (-ml option) and add CHIP 5510PG2 2 in Compiler-Preprocessor-defined symbol field. We also add the DSK board support library dsk5510bslx.lib to the Linker Include Libraries search path. When we create the configura- tion file, the CCS will generate a command file, dspbioscfg.cmd. We must use this linker command file. The files used for this experiment are listed in Table 4.9 with brief descriptions. Add the command file dspbioscfg.cmd and C and assembly source files listed in Table 4.9, and build this DSP/BIOS project. Figure 4.45 The settings of receive interrupt in HWI INT14 in DSP/BIOSJWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 244 DESIGN AND IMPLEMENTATION OF FIR FILTERS Figure 4.46 The settings of transmit interrupt in HWI INT15 in DSP/BIOS Procedures of the experiment are listed as follows: 1. Create a DSP/BIOS configuration file and configure the DSK for real-time audio processing appli- cation. 2. Create the DSP project and rebuild the project. 3. Connect input and output audio cables to audio source and headphone (or loudspeaker). 4. Load the project and run the program to validate the DSP/BIOS project. 
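The processing thread itself is ordinary C code. The following sketch shows what the audioProcess function attached to swiAudioProcess might look like; it is our own illustration built on the standard DSP/BIOS PIP calls rather than the listing in realtime_SRCTest.c, and srcBlock() is a hypothetical stand-in for the interpolate/decimate chain used in this experiment.

#include <std.h>
#include <pip.h>

extern PIP_Obj pipRx;                  // Pipes configured in dspbios.cdb
extern PIP_Obj pipTx;

void srcBlock(const short *in, short *out, unsigned nWords);  // Hypothetical SRC chain

// Software interrupt function posted when pipRx has a full frame and
// pipTx has an empty one (the Mailbox value 3 set in the SWI properties).
void audioProcess(void)
{
    short    *src, *dst;
    unsigned  nWords;

    if (PIP_getReaderNumFrames(&pipRx) <= 0 ||
        PIP_getWriterNumFrames(&pipTx) <= 0)
        return;                        // Nothing to do yet

    PIP_get(&pipRx);                   // Claim a full input frame
    PIP_alloc(&pipTx);                 // Claim an empty output frame

    src    = (short *)PIP_getReaderAddr(&pipRx);
    dst    = (short *)PIP_getWriterAddr(&pipTx);
    nWords = PIP_getReaderSize(&pipRx);

    srcBlock(src, dst, nWords);        // Sample-rate conversion on one frame

    PIP_setWriterSize(&pipTx, nWords); // For true 48-to-32 kHz conversion the
                                       // output frame would be smaller
    PIP_put(&pipTx);                   // Hand the frame to the codec driver
    PIP_free(&pipRx);                  // Recycle the input frame
}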
Table 4.9 File listing for experiment exp4.5.8_realtime_SRC
Files                  Description
realtime_SRCTest.c     C function for testing sample rate conversion
plio.c                 Interface for PIP functions with low-level I/O
interpolate.c          C function for interpolation filter
decimate.asm           Assembly routine for decimation filter
interpolation.h        C header file for interpolation
decimation.h           C header file for decimation
coef8to16.h            FIR filter coefficients for 1:2 interpolation
coef16to48.h           FIR filter coefficients for 1:3 interpolation
coef48to24.h           FIR filter coefficients for 2:1 decimation
coef24to8.h            FIR filter coefficients for 3:1 decimation
lio.h                  Header file for low-level I/O
plio.h                 Header file for PIP to connect with low-level I/O
DSPBIOS.pjt            DSP project file
dspbios.cdb            DSP/BIOS configuration file
dspbioscfg.cmd         DSP/BIOS linker command file

Exercises
1. Consider the moving-average filter given in Example 4.1. What is the 3-dB bandwidth of this filter if the sampling rate is 8 kHz?
2. Consider the FIR filter with the impulse response h(n) = {1, 1, 1}. Calculate the magnitude and phase responses, and verify that the filter has linear phase.
3. Consider the comb filter designed in Example 4.2 with sampling rate 8 kHz. If a periodic signal with fundamental frequency 500 Hz and harmonics at 1, 1.5, . . . , 4 kHz is filtered by this comb filter, determine which harmonics will be attenuated and why.
4. Using the graphical interpretation of linear convolution given in Figure 4.7, compute the linear convolution of h(n) = {1, 2, 1} and x(n), n = 0, 1, 2, defined as follows:
(a) x(n) = {1, −1, 2}
(b) x(n) = {1, 2, −1}
(c) x(n) = {1, 3, 1}
5.
The comb filter can also be described as y(n) = x(n) + x(n − L). Find the transfer function, zeros, and the magnitude response of this filter using MATLAB and compare the results with Figure 4.3 (assume L = 8).JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 246 DESIGN AND IMPLEMENTATION OF FIR FILTERS 6. Assuming h(n) has the symmetry property h(n) = h(−n) for n = 0,1,...,M, verify that H(ω) can be expressed as H(ω) = h(0) + M n=1 2h(n) cos(ωn). 7. The simplest digital approximation to a continuous-time differentiator is the first-order operation defined as y(n) = 1 T [x(n) − x(n − 1)] . Find the transfer function H(z), the frequency response H(ω), and the phase response of the differentiator. 8. Redraw the signal-flow diagram shown in Figure 4.6 and modify Equations (4.22) and (4.23) in the case that L is an odd number. 9. Design a lowpass FIR filter of length L = 5 with a linear phase to approximate the ideal lowpass filter of cutoff frequency 1.5 kHz with the sampling rate 8 kHz. 10. Consider the FIR filters with the following impulse responses: (a) h(n) ={−4, 1, −1, −2, 5, 0, −5, 2, 1, −1, 4} (b) h(n) ={−4, 1, −1, −2, 5, 6, 5, −2, −1, 1, −4} Use MATLAB to plot magnitude responses, phase responses, and locations of zeros for both filters. 11. Show the frequency response of the lowpass filter given in Equation (4.8) for L = 8 and compare the result with Figure 4.3. 12. Use Examples 4.6 and 4.7 to design and plot the magnitude response of a linear-phase FIR highpass filter of cutoff frequency ωc = 0.6π by truncating the impulse response of the ideal highpass filter to length L = 2M + 1 for M = 32 and 64. 13. Repeat Problem 12 using Hamming and Blackman window functions. Show that oscillatory behavior is reduced using the windowed Fourier series method. 14. Design a bandpass filter H ( f ) = 1, 0, 1.6 kHz ≤ f ≤ 2 kHz otherwise with the sampling rate 8 kHz and the duration of impulse response 50 ms using Fourier series method; that is, using MATLAB functions fir1. Plot the magnitude and phase responses. 15. Repeat Problem 14 using the FDATool using different design methods, and compare results with Problem 14. 16. Redo Example 4.15, quantize the designed coefficients using Q15 format, and save in C header file. Write a floating-point C program to implement this FIR filter and test the result by comparing both input and output signals in terms of time-domain waveforms and frequency-domain spectra. 17. Redo Problem 16 using a fixed-point C, and also use circular buffer. 18. Redo Example 4.12 with different cutoff frequencies and ripples, and summarize their relationship with the required filter order. 19. List the window functions supported by the MATLAB WinTool. Also, use this tool to study the Kaiser window with different L and β.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 EXERCISES 247 20. Write a C (or MATLAB)program that implements a comb filter of L = 8. The program must have the input/output capability. Test the filter using the sinusoidal signals of frequencies ω1 = π/4 and ω2 = 3π/8. Explain the results based on the distribution of the zeros of the filter. 21. Rewrite above program using a circular buffer. 22. Design a 24th-order bandpass FIR filter using MATLAB. The filter has passband frequencies of 1300Ð2100 Hz. Implement this filter using the C55x assembly routines blockFir.asm, symFir.asm, and dualMac- Fir.asm. The test data, input.pcm, is sampled at 8 kHz. Plot the filter results in both the time domain and the frequency domain using the CCS graphics. 23. 
When designing highpass or bandstop FIR filter using MATLAB, the number of filter coefficients is an odd number. This ensures the unit gain at the half-sampling frequency. Design a highpass FIR filter, such that its cutoff frequency is 3000 Hz. Implement this filter using the dual MAC block-FIR filter. Plot the results in both the time domain and the frequency domain. (Hint: Modify the assembly routine dualMacFir.asm to handle the odd numbered coefficients.) 24. Design an antisymmetric bandpass FIR filter to allow only the middle frequency of the tri-frequency input signal (input.pcm) to pass. Use firssub instruction to implement the FIR filter and plot the filter results in both the time domain and the frequency domain using the CCS graphics. 25. Use symmetric instruction to implement the decimation function of the experiment exp4.5.5_decimation. Compare the run-time efficiency of the function using symmetric instruction implementation and using dual MAC implementation. 26. The assembly routine, asmIntpFir.asm, is written for implementing signal interpolation function. However, there are some bugs in the code so it does not work, yet. Debug this assembly program and fix the problems. Test the routine using exp4.5.6_interpolation. 27. Implementing a dual MAC assembly routine interpolation function for the experiment exp4.5.7_SRC, mea- sure the performance improvement over C function in number of clock cycles. 28. Design a converter to change the 32 kHz sampling rate to 48 kHz. 29. For an experiment given in Section 4.5.7, the approach is to interpolate the 48 kHz signal to 96 kHz and then decimate the 96 kHz signal to 32 kHz. Another approach is to decimate the 48 kHz signal to 16 kHz first and then interpolate the 16 kHz signal to 32 kHz. Will these approaches provide the same result or performance, why? Design an experiment to support your claim. 30. Design an interpolator that converts the 44.1 kHz sampling rate to 48 kHz. 31. Use the TMS320C5510 DSK for the following real-time tasks:r Set the TMS320C5510 DSK to 8 kHz sampling rate.r Connect the signal source to the audio input of the DSK.r Write an interrupt service routine to handle input samples or use DSP/BIOS.r Process signal in blocks with 128 samples per block, and apply lowpass filter, highpass filter, and bandpass filter to input signals. 32. Use the TMS320C5510 DSK for the following real-time SRC:r Set the TMS320C55x DSK to 16 kHz sampling rate.r Connect the signal source to the audio input of the DSK.r Write an interrupt service routine to handle input samples or use DSP/BIOS.r Process signal in blocks with 160 samples per block.r Verify the result using an oscilloscope or spectrum analyzer.JWBK080-04 JWBK080-Kuo March 8, 2006 11:40 Char Count= 0 248JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 5 Design and Implementation of IIR Filters In this chapter, we focus on the design, realization, implementation, and applications of digital IIR filters. We will use experiments to demonstrate the implementation of IIR filters in different forms using fixed-point processors. 5.1 Introduction Designing a digital IIR filter usually begins with the designing of an analog filter, and applies a mapping technique to transform it from the s-plane into the z-plane. Therefore, we will briefly review the Laplace transform, analog filters, mapping properties, and frequency transformation. 
5.1.1 Analog Systems Given a positive time function x(t) = 0 for t < 0, the one-sided Laplace transform is defined as X(s) = ∞ 0 x(t)e−st dt, (5.1) where s is a complex variable defined as s = σ + j, (5.2) and σ is a real number. The inverse Laplace transform is expressed as x(t) = 1 2π j σ+ j∞ σ− j∞ X(s)est ds. (5.3) Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 249JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 250 DESIGN AND IMPLEMENTATION OF IIR FILTERS The integral is evaluated along the straight line σ + j in the complex plane from  =−∞to  =∞, which is parallel to the imaginary axis j at a distance σ from it. Example 5.1: Find the Laplace transform of function x(t) = e−atu(t), where a is a real number. From Equation (5.1), we have X(s) = ∞ 0 e−ate−st dt = ∞ 0 e−(s+a)t dt =− 1 s + a e−(s+a)t ∞ 0 = 1 s + a , Re[s] > −a. Equation (5.2) clearly shows a complex s-plane with a real axis σ and an imaginary axis j. For values of s along the j-axis, i.e., σ = 0, we have X(s)|s= j = ∞ 0 x(t)e− jt dt, (5.4) which is the Fourier transform of the causal signal x(t). Therefore, given a function X(s), we can find its frequency characteristics by substituting s = j. If Y(s), X(s), and H(s) are the one-sided Laplace transforms of y(t), x(t), and h(t), respectively, and y(t) = x(t) ∗ h(t) = ∞ 0 x(τ)h(t − τ)dτ = ∞ 0 h(τ)x(t − τ)dτ, (5.5) we have Y(s) = H(s)X(s). (5.6) Thus, linear convolution in the time domain is equivalent to multiplication in the Laplace (or frequency) domain. In Equation (5.6), the transfer function of a casual system is defined as H(s) = Y(s) X(s) = ∞ 0 h(t)e−st dt, (5.7) where h(t) is the impulse response of the system. The general form of a system transfer function can be expressed as H(s) = b0 + b1s +···+bL−1sL−1 a0 + a1s +···+aM s M = N(s) D(s) . (5.8) The roots of N(s) are the zeros of H(s), while the roots of D(s) are the poles of the system. MATLAB provides the function freqs to compute the frequency response H() of an analog system H(s).JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 INTRODUCTION 251 Example 5.2: The input signal x(t) = e−2t u(t) is applied to an LTI system, and the output of the system is given as y(t) = e−t + e−2t − e−3t u(t). Find the system’s transfer function H(s) and the impulse response h(t). From Example 5.1 for different values of a,wehave X(s) = 1 s + 2 and Y(s) = 1 s + 1 + 1 s + 2 − 1 s + 3 . From Equation (5.7), we obtain H(s) = Y(s) X(s) = s2 + 6s + 7 (s + 1)(s + 3) = 1 + 1 s + 1 + 1 s + 3 . Taking the inverse Laplace transform, we have h(t) = δ(t) + e−t + e−3t u(t). The stability condition of an analog system can be represented in terms of its impulse response h(t) or its transfer function H(s). A system is stable if limt→∞ h(t) = 0. (5.9) This condition requires that all the poles of H(s) must lie in the left-half of the s-plane, i.e., σ<0. If limt→∞ h(t) →∞, the system is unstable. This condition is equivalent to the system that has one or more poles in the right-half of the s-plane, or has multiple-order pole(s) on the j-axis. Example 5.3: Consider the system with impulse response h(t) = e−atu(t). This function satisfies Equation (5.9), thus the system is stable for a > 0. From Example 5.1, the transfer function of this system is H(s) = 1 s + a , a > 0, which has the pole at s =−a. Thus, the system is stable since the pole is located at the left-hand side of s-plane. 
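As a quick numerical companion to Examples 5.1 and 5.3, the frequency response and pole location of H(s) = 1/(s + a) can be checked with the MATLAB functions freqs and roots mentioned above. This is only a sketch; the value a = 2 is an arbitrary choice:

a   = 2;                           % example system H(s) = 1/(s + a)
num = 1;  den = [1 a];
w   = logspace(-1, 3, 200);        % analog frequencies in rad/s
H   = freqs(num, den, w);          % frequency response H(jw)
loglog(w, abs(H)); grid on;        % magnitude rolls off above w = a
p   = roots(den)                   % pole at s = -2, in the left-half plane, so stable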
This example shows we can evaluate the stability of system from the impulse response h(t), or from the transfer function H(s). 5.1.2 Mapping Properties The z-transform can be viewed as the Laplace transform of the sampled function x(nT) by changing of variable z = esT . (5.10) This relationship represents the mapping of a region in the s-plane to the z-plane because both s and z are complex variables. Since s = σ + j,wehave z = eσ T e jT =|z|e jω, (5.11)JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 252 DESIGN AND IMPLEMENTATION OF IIR FILTERS |z| = 1 Im z Re z z-plane ω = π/2 ω = 0 ω = 3π/2 ω = π s-plane σ = 0 σ σ > 0σ < 0 −π/T π/T jΩ Figure 5.1 Mapping between the s-plane and the z-plane where the magnitude |z|=eσ T (5.12) and the angle ω = T. (5.13) When σ = 0 (the j-axis on the s-plane), the amplitude given in Equation (5.12) is |z|=1 (the unit circle on the z-plane), and Equation (5.11) is simplified to z = e jT . It is apparent that the portion of the j-axis between  =−π/T and  = π/T in the s-plane is mapped onto the unit circle in the z-plane from Ðπ to π as illustrated in Figure 5.1. As  increases from π/T to 3π/T , it results in another counterclockwise encirclement of the unit circle. Thus, as  varies from 0 to ∞, there are infinite numbers of encirclements of the unit circle in the counterclockwise direction. Similarly, there are infinite numbers of encirclements of the unit circle in the clockwise direction as  varies from 0 to −∞. From Equation (5.12), |z| < 1 when σ<0. Thus, each strip of width 2π/T in the left-half of the s-plane is mapped inside the unit circle. This mapping occurs in the form of concentric circles in the z-plane as σ varies from 0 to Ð∞. Equation (5.12) also implies that |z| > 1ifσ>0. Thus, each strip of width 2π/T in the right-half of the s-plane is mapped outside of the unit circle. This mapping also occurs in concentric circles in the z-plane as σ varies from 0 to ∞. In conclusion, the mapping from the s-plane to the z-plane is not one to one since there are many points in the s-plane that correspond to a single point in the z-plane. This issue will be discussed later when we design a digital filter H(z) from a given analog filter H(s). 5.1.3 Characteristics of Analog Filters The ideal lowpass filter prototype is obtained by finding a polynomial approximation to the squared magnitude |H ()|2, and then converting this polynomial into a rational function. The approximations of the ideal prototype will be discussed briefly based on Butterworth filters, Chebyshev type I and type II filters, elliptic filters, and Bessel filters. The Butterworth lowpass filter is an all-pole approximation to the ideal filter, which is characterized by the squared-magnitude response |H()|2 = 1 1 +  p 2L , (5.14)JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 INTRODUCTION 253 H(W) WP WWs 1 1− dP ds Figure 5.2 Magnitude response of Butterworth lowpass filter where L is the order of the filter, which determines how closely the Butterworth approximates the ideal filter. Equation (5.14) shows that |H(0)|=1 and |H(p)|=1/ √ 2 (or 20 log10 |H(p)|=−3 dB) for all values of L. Thus, p is called the 3-dB cutoff frequency. The magnitude response of a typical Butterworth lowpass filter is monotonically decreasing in both the passband and the stopband as illustrated in Figure 5.2. The Butterworth filter has a flat magnitude response over the passband and stopband, and thus is often referred to as the ‘maximally flat’ filter. 
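A short numerical check of Equation (5.14) illustrates both properties: the gain is always 3 dB down at Ω = Ωp, and the roll-off beyond Ωp steepens as L increases. This is a sketch only; the cutoff frequency and the orders are arbitrary choices:

Wp = 1000;                              % 3-dB cutoff frequency in rad/s
W  = linspace(0, 4000, 400);            % analog frequency axis
for L = [2 4 8]
  H2 = 1 ./ (1 + (W/Wp).^(2*L));        % squared magnitude, Equation (5.14)
  plot(W, 10*log10(H2)); hold on;       % -3 dB at W = Wp for every order L
end
grid on; xlabel('\Omega (rad/s)'); ylabel('|H(\Omega)|^2 (dB)');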
This flat passband is achieved at the expense of slow roll-off in the transition region from p to s. Although the Butterworth filter is easy to design, the rate at which its magnitude decreases in the frequency range  ≥ p is rather slow for a small L. Therefore, for a given transition band, the order of the Butterworth filter is often higher than that of other types of filters. We can improve the roll-off by increasing the filter order L. Chebyshev filters permit a certain amount of ripples, but have a steeper roll-off near the cutoff frequency than the Butterworth filters. There are two types of Chebyshev filters. Type I Chebyshev filters are all-pole filters that exhibit equiripple behavior in the passband and a monotonic characteristic in the stopband (see the top plot of Figure 5.3). Type II Chebyshev filters contain both poles and zeros, and exhibit a monotonic behavior in the passband and an equiripple behavior in the stopband as shown in bottom plot |H(W)| WP Ws W 1 − dp ds 1 |H(W)| WP Ws W 1 − dp ds 1 Figure 5.3 Magnitude responses of type I (top) and type II Chebyshev lowpass filtersJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 254 DESIGN AND IMPLEMENTATION OF IIR FILTERS |H(W)| WP Ws W 1 − dp ds 1 Figure 5.4 Magnitude response of elliptic lowpass filter of Figure 5.3. In general, a Chebyshev filter meets the specifications with a fewer number of poles than the corresponding Butterworth filter and improves the roll-off; however, it has a poorer phase response. The sharpest transition from passband to stopband for any given δp,δs, and L can be achieved using the elliptic filter design. As shown in Figure 5.4, elliptic filters exhibit equiripple behavior in both the passband and the stopband. In addition, the phase response of elliptic filter is extremely nonlinear in the passband, especially near the cutoff frequency. Therefore, we can only use the elliptic design where the phase is not an important design parameter. In summary, the Butterworth filter has a monotonic magnitude response at both passband and stopband with slow roll-off. By allowing ripples in the passband for type I and in the stopband for type II, the Chebyshev filter can achieve sharper cutoff with the same number of poles. An elliptic filter has even sharper cutoffs than the Chebyshev filter for the same order, but it results in both passband and stopband ripples. The design of these filters strives to achieve the ideal magnitude response with trade-offs in phase response. Bessel filters are all-pole filters that approximate linear phase in the sense of maximally flat group delay in the passband. However, we must sacrifice steepness in the transition region. 5.1.4 Frequency Transforms We have discussed the design of prototype lowpass filters with cutoff frequency p. Although the same procedure can be applied to design highpass, bandpass, or bandstop filters, it is easier to obtain these filters from the lowpass filter using frequency transformations. In addition, most classical filter design techniques generate lowpass filters only. A highpass filter Hhp(s) can be obtained from the lowpass filter H(s)by Hhp(s) = H(s)|s= 1 s = H 1 s . (5.15) For example, we have Butterworth H(s) = 1/(s + 1) for L = 1. From Equation (5.15), we obtain Hhp(s) = 1 s + 1 s= 1 s = s s + 1 . 
(5.16) This shows that Hhp(s) has identical pole as the lowpass prototype, but with an additional zero at the origin.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 DESIGN OF IIR FILTERS 255 Bandpass filters can be obtained from the lowpass prototypes by replacing s with (s2 + 2 m)/BW. That is, Hbp(s) = H(s)| s= s2+2m BW , (5.17) where m is the center frequency of the bandpass filter defined as m = ab , (5.18) where a and b are the lower and upper cutoff frequencies, respectively. The filter bandwidth BW is defined as BW = b − a. (5.19) For example, considering L = 1, we have Hbp(s) = 1 s + 1 s= s2+2m BW = BWs s2 + BWs + 2 m . (5.20) For an Lth-order lowpass filter, we obtain a bandpass filter of order 2L. Bandstop filter transfer functions can be obtained from the corresponding highpass filters by Hbs(s) = Hhp(s) s= s2+2m BW . (5.21) 5.2 Design of IIR Filters The transfer function of the IIR filter is defined in Equation (3.42) as H(z) = L−1 l=0 bl z−l 1 + M m=1 am z−m . (5.22) The design problem is to find the coefficients bl and am so that H(z) satisfies the given specifications. The IIR filter can be realized by the I/O equation y(n) = L−1 l=0 bl x(n − l) − M m=1 am y(n − m). (5.23) The problem of designing IIR filters is to determine a digital filter H(z) which approximates the prototype filter H(s) designed by one of the analog filter design methods. There are two methods that can map the analog filter into an equivalent digital filter: the impulse-invariant and the bilinear transform. The impulse-invariant method preserves the impulse response of the original analog filter by digitizing its impulse response, but has inherent aliasing problem. The bilinear transform will preserve the magnitude response characteristics of the analog filters, and thus is better for designing frequency-selective IIR filters.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 256 DESIGN AND IMPLEMENTATION OF IIR FILTERS Digital filter specifications Bilinear transform Bilinear transform W→ω W←ω Analog filter specifications Analog filter H(s) Digital filter H(z) Analog filter design Figure 5.5 Digital IIR filter design using the bilinear transform 5.2.1 Bilinear Transform The procedure of digital filter design using bilinear transform is illustrated in Figure 5.5. This method maps the digital filter specifications to an equivalent analog filter. The designed analog filter is then mapped back to obtain the desired digital filter using the bilinear transform. The bilinear transform is defined as s = 2 T z − 1 z + 1 = 2 T 1 − z−1 1 + z−1 . (5.24) This is called the bilinear transform due to the linear functions of z in both the numerator and the denominator. Because the j-axis maps onto the unit circle (z = e jω), there is a direct relationship between the s-plane frequency  and the z-plane frequency ω. Substituting s = j and z = e jω into Equation (5.24), we have j = 2 T e jω − 1 e jω + 1 . (5.25) It can be easily shown that the corresponding mapping of frequencies is obtained as  = 2 T tan ω 2 , (5.26) or equivalently, ω = 2 tan−1 T 2 . (5.27) Thus, the entire j-axis is compressed into the interval [−π/T,π/T ] for ω in a one-to-one manner. The portion of 0 →∞in the s-plane is mapped onto the 0 → π portion of the unit circle, while the 0 →−∞portion in the s-plane is mapped onto the 0 →−π portion of the unit circle. Each point in the s-plane is uniquely mapped onto the z-plane. The relationship between the frequency variables  and ω is illustrated in Figure 5.6. 
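The warping described by Equations (5.26) and (5.27) is easy to reproduce numerically. The following sketch (assuming an 8 kHz sampling rate purely for illustration) maps a grid of digital frequencies to their prewarped analog values and back, tracing the curve of Figure 5.6:

fs = 8000;  T = 1/fs;
w  = pi*(0.05:0.05:0.95);           % digital frequencies in rad/sample
W  = (2/T)*tan(w/2);                % prewarped analog frequencies, Equation (5.26)
wb = 2*atan(W*T/2);                 % inverse mapping, Equation (5.27); equals w
plot(W*T, w/pi); grid on;           % compare with Figure 5.6
xlabel('\Omega T'); ylabel('\omega/\pi');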
The bilinear transform provides a one-to-one mapping of the points along the j-axis onto the unit circle, or onto the Nyquist band |ω|≤π. However, the mapping is highly nonlinear. The point  = 0 is mapped to ω = 0 (or z = 1), and the point  =∞is mapped to ω = π (or z =−1). The entire band T ≥ 1is compressed onto π/2 ≤ ω ≤ π. This frequency compression effect is known as frequency warping, andJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 DESIGN OF IIR FILTERS 257 01 WT ω π 2 π −π Figure 5.6 Plot of transformation given in Equation (5.27) must be taken into consideration for digital filter design using the bilinear transform. The solution is to prewarp the critical frequencies according to Equation (5.26). 5.2.2 Filter Design Using Bilinear Transform The bilinear transform of an analog filter H(s) is obtained by simply replacing s with z using Equation (5.24). The filter specifications will be in terms of the critical frequencies of the digital filter. For example, the critical frequency ω for a lowpass filter is the bandwidth of the filter. Three steps involved in the IIR filter design using bilinear transform are summarized as follows: 1. Prewarp the critical frequency ωc of the digital filter using Equation (5.26) to obtain the corresponding analog filter’s frequency c. 2. Scale the analog filter H(s) with c to obtain the scaled transfer function ˆH(s) = H(s)|s=s/c = H s c . (5.28) 3. Replace s using Equation (5.24) to obtain desired digital filter H(z). That is H(z) = ˆH(s)|s=2(z−1)/(z+1)T . (5.29) Example 5.4: Using the simple lowpass filter H(s) = 1/(s + 1) and the bilinear transform method to design a digital lowpass filter with the bandwidth 1000 Hz and the sampling frequency 8000 Hz. The critical frequency for the lowpass filter is the bandwidth ωc = 2π(1000/8000) = 0.25π, and T = 1/8000 s. Step 1: Prewarp the critical frequency as c = 2 T tan ωc 2 = 2 T tan (0.125π) = 0.8284 T .JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 258 DESIGN AND IMPLEMENTATION OF IIR FILTERS Step 2: Use frequency scaling to obtain ˆH(s) = H(s) s=s/(0.8284/T ) = 0.8284 sT + 0.8284 . Step 3: Using bilinear transform in Equation (5.29) yields the desired transfer function H(z) = ˆH(s) s=2(z−1)/(z+1)T = 0.2929 1 + z−1 1 − 0.4142z−1 . MATLABSignal Processing Toolboxprovides impinvar and bilinear functions to support impulse- invariant and bilinear transform methods, respectively. For example, we can use numerator and denomi- nator polynomials as follows: [NUMd,DENd] = bilinear(NUM,DEN,Fs, Fp); where NUMd and DENd are digital filter coefficients obtained from the bilinear function. NUM and DEN are row vectors containing numerator and denominator coefficients in descending powers of s, respectively, Fs is the sampling frequency in Hz, and Fp is prewarping frequency. Example 5.5: In order to design a digital IIR filtering using the bilinear transform, the transfer function of the analog prototype filter is first determined. The numerator and denominator poly- nomials of the prototype filter are then mapped to the polynomials for the digital filter using the bilinear transform. 
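The hand calculation of Example 5.4 can be reproduced with the bilinear function by carrying out the same three steps, prewarping and scaling the prototype before applying the transform. This is a sketch of that check, not a new design:

Fs = 8000;  T = 1/Fs;
wc = 2*pi*1000/Fs;                  % digital cutoff frequency, 0.25*pi
Wc = (2/T)*tan(wc/2);               % Step 1: prewarp, about 6627 rad/s
b  = Wc;  a = [1 Wc];               % Step 2: scaled prototype Wc/(s + Wc)
[bz, az] = bilinear(b, a, Fs)       % Step 3: bz = [0.2929 0.2929], az = [1 -0.4142]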
The following MATLAB script (example5_5.m) designs a lowpass filter: Fs = 2000; % Sampling frequency Wn = 300; % Edge frequency Fc = 2*pi*Wn % Edge frequency in rad/s n = 4; % Order of analog filter [b, a] = butter(n, Fc, 's'); % Design an analog filter [bz, az] = bilinear(b, a, Fs, Wn); % Determine digital filter [Hz,Wz] = freqz(bz,az,512,Fs); % Display magnitude & phase 5.3 Realization of IIR Filters An IIR filter can be realized in different forms or structures. In this section, we will discuss direct-form I, direct-form II, cascade, and parallel realizations of IIR filters. These realizations are equivalent math- ematically, but may have different performance in practical implementation due to the finite wordlength effects. 5.3.1 Direct Forms The direct-form I realization is defined by the I/O equation (5.23). This filter has (L + M) coefficients and needs (L + M + 1) memory locations to store {x(n − l), l = 0, 1,..., L − 1} and {y(n − m), m = 0, 1,..., M}. It also requires (L + M) multiplications and (L + M − 1) additions. The detailed signal-flow diagram for L = M + 1 is illustrated in Figure 3.11.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 REALIZATION OF IIR FILTERS 259 H1 (z) H2 (z) z−1 z−1 z−1z−1 x(n) b2 b1 b0 −a1 −a2 y(n) y(n − 1) y(n − 2) x(n − 1) x(n − 2) Figure 5.7 Direct-form I realization of second-order IIR filter Example 5.6: Consider a second-order IIR filter H(z) = b0 + b1z−1 + b2z−2 1 + a1z−1 + a2z−2 . (5.30) The I/O equation of the direct-form I realization is described as y(n) = b0x(n) + b1x(n − 1) + b2x(n − 2) − a1 y(n − 1) − a2 y(n − 2). (5.31) The signal-flow diagram is illustrated in Figure 5.7. As shown in Figure 5.7, the IIR filter H(z) can be interpreted as the cascade of two transfer functions H1(z) and H2(z). That is, H(z) = H1(z)H2(z), (5.32) where H1(z) = b0 + b1z−1 + b2z−2 and H2(z) = 1/ 1 + a1z−1 + a2z−2 . Since multiplication is com- mutative, we have H(z) = H2(z)H1(z). Therefore, Figure 5.7 can be redrawn by exchanging the order of H1(z) and H2(z), and combining two signal buffers into one as illustrated in Figure 5.8. This efficient realization of a second-order IIR filter is called direct-form II (or biquad), which requires three memory x(n) w(n) w(n − 1) w(n − 2) y(n)b0 z−1 z−1 −a1 −a2 b2 b1 Figure 5.8 Direct-form II realization of second-order IIR filterJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 260 DESIGN AND IMPLEMENTATION OF IIR FILTERS x(n) w(n) w(n − 1) w(n − 2) y(n)b0 z−1 z−1 −a1 −a2 b2 w(n − L − 1) −aM bL−1 b1 Figure 5.9 Direct-form II realization of general IIR filter, L = M + 1 locations as opposed to six memory locations required for the direct-form I given in Figure 5.7. Therefore, the direct-form II is called the canonical form since it needs the minimum numbers of memory. The direct-form II second-order IIR filter can be implemented as y(n) = b0w(n) + b1w(n − 1) + b2w(n − 2), (5.33) where w(n) = x(n) − a1w(n − 1) − a2w(n − 2). (5.34) This realization can be expanded as Figure 5.9 to realize the IIR filter defined in Equation (5.23) with M = L− 1 using the direct-form II structure. 5.3.2 Cascade Forms By factoring the numerator and the denominator polynomials of the transfer function H(z), an IIR filter can be realized as a cascade of second-order IIR filter sections. 
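Before turning to cascade structures, the direct-form II update of Equations (5.33) and (5.34) can be written directly as a sample-by-sample loop. The sketch below uses arbitrary example coefficients and keeps w(n − 1) and w(n − 2) in the variables w1 and w2; its output matches the built-in filter function to round-off:

b = [0.1 0.2 0.1];  a = [1 -1.1 0.5];    % example biquad coefficients
x = randn(1, 256);  y = zeros(size(x));  % test input and output buffer
w1 = 0;  w2 = 0;                         % signal buffer holding w(n-1), w(n-2)
for n = 1:length(x)
  w0   = x(n) - a(2)*w1 - a(3)*w2;       % Equation (5.34)
  y(n) = b(1)*w0 + b(2)*w1 + b(3)*w2;    % Equation (5.33)
  w2 = w1;  w1 = w0;                     % update the delay line
end
max(abs(y - filter(b, a, x)))            % agrees with filter() to round-off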
Consider the transfer function H(z)given in Equation (5.22), it can be expressed as H(z) = b0 H1(z)H2(z) ···HK (z) = b0 K k=1 Hk(z), (5.35) where K is the total number of sections, and Hk(z) is a second-order filter expressed as Hk(z) = (z − z1k)(z − z2k) (z − p1k)(z − p2k) = 1 + b1k z−1 + b2k z−2 1 + a1k z−1 + a2k z−2 . (5.36)JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 REALIZATION OF IIR FILTERS 261 y(n) HK (z)H2 (z)H1 (z) b0x(n) Figure 5.10 Cascade realization of digital filter If the order is an odd number, one of the Hk(z) is a first-order IIR filter expressed as Hk(z) = z − z1k z − p1k = 1 + b1k z−1 1 + a1k z−1 . (5.37) The realization of Equation (5.35) in cascade form is illustrated in Figure 5.10. In this form, any complex-conjugated roots must be grouped into the same section to guarantee that the coefficients of Hk(z) are all real-valued numbers. Assuming that every Hk(z) is a second-order IIR filter described by Equation (5.36), the I/O equations describing the cascade realization are wk(n) = xk(n) − a1kwk(n − 1) − a2kwk(n − 2), (5.38) yk(n) = wk(n) + b1kwk(n − 1) + b2kwk(n − 2), (5.39) xk+1(n) = yk(n), (5.40) for k = 1,2,...,K where x1(n) = b0x(n) and y(n) = yK (n). It is possible to obtain many different cascade realizations for the same transfer function H(z)by different ordering and pairing. Ordering means the order of connecting Hk(z), and pairing means the grouping of poles and zeros of H(z) to form Hk(z). In theory, these different cascade realizations are equivalent; however, they may be different due to the finite-wordlength effects. In DSP implementation, each section will generate a certain amount of roundoff error, which is propagated to the next section. The total roundoff noise at the final output will depend on the particular pairing/ordering. In the direct-form realization shown in Figure 5.9, the variation of one parameter will affect all the poles of H(z). In the cascade realization, the variation of one parameter will only affect pole(s) in that section. Therefore, the cascade realization is preferred in practical implementation because it is less sensitive to parameter variation due to quantization effects. Example 5.7: Consider the second-order IIR filter H(z) = 0.5(z2 − 0.36) z2 + 0.1z − 0.72 . By factoring the numerator and denominator polynomials of H(z), we obtain H(z) = 0.5(1 + 0.6z−1)(1 − 0.6z−1) (1 + 0.9z−1)(1 − 0.8z−1) . By different pairings of poles and zeros, there are four possible realizations of H(z) in terms of first-order sections. For example, we may choose H1(z) = 1 + 0.6z−1 1 + 0.9z−1 and H2(z) = 1 − 0.6z−1 1 − 0.8z−1 .JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 262 DESIGN AND IMPLEMENTATION OF IIR FILTERS The IIR filter can be realized by the cascade form expressed as H(z) = 0.5H1(z)H2(z). 5.3.3 Parallel Forms The expression of H(z) in a partial-fraction expansion leads to another canonical structure called the parallel form expressed as H(z) = c + H1(z) + H2(z) +···+HK (z), (5.41) where c is a constant, and Hk(z) is a second-order IIR filter expressed as Hk(z) = b0k + b1k z−1 1 + a1k z−1 + a2k z−2 , (5.42) or a first-order filter expressed as Hk(z) = b0k 1 + a1k z−1 . (5.43) The realization of Equation (5.41) in parallel form is illustrated in Figure 5.11. Each second-order section can be implemented as direct-form II shown in Figure 5.8. 
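As a numerical check of the cascade factorization in Example 5.7, the two first-order sections can simply be chained with repeated calls to filter; the result matches the direct-form implementation of the original H(z). This is a small sketch with a random test signal:

x  = randn(1, 128);                              % test signal
yd = filter([0.5 0 -0.18], [1 0.1 -0.72], x);    % direct form of H(z)
y1 = filter(0.5*[1 0.6], [1 0.9], x);            % b0*H1(z)
yc = filter([1 -0.6], [1 -0.8], y1);             % cascade with H2(z)
max(abs(yc - yd))                                % difference at round-off level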
Example 5.8: Considering the transfer function H(z) given in Example 5.7, we can express it as H (z) = H(z) z = 0.5 1 + 0.6z−1 1 − 0.6z−1 z 1 + 0.9z−1 1 − 0.8z−1 = A z + B z + 0.9 + C z − 0.8 , where A = zH (z)|z=0 = 0.25 B = (z + 0.9)H (z)|z=−0.9 = 0.147 C = (z − 0.8)H (z)|z=0.8 = 0.103. HK (z) H2 (z) H1 (z) x(n) y(n) c0 Figure 5.11 A parallel realization of digital IIR filterJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 REALIZATION OF IIR FILTERS 263 Therefore, we obtain H(z) = 0.25 + 0.147 1 + 0.9z−1 + 0.103 1 − 0.8z−1 . 5.3.4 Realization of IIR Filters Using MATLAB The cascade realization of an IIR filter involves its factorization. This can be done in MATLAB using the function roots. For example, the statement r = roots(b); returns the roots of the numerator vector b in the output vector r. Similarly, we can obtain the roots of the denominator vector a. The coefficients of each section can be determined by pole-zero pairings. The function tf2zp available in the Signal Processing Toolbox finds the zeros, poles, and gain of systems. For example, the statement [z, p, c] = tf2zp(b, a); will return the zero locations in z, the pole locations in p, and the gain in c. Similarly, the function [b, a] = zp2tf(z,p,k); forms the transfer function H(z) given a set of zero locations in vector z, a set of pole locations in vector p, and a gain in scalar k. Example 5.9: The zeros, poles, and gain of the system defined in Example 5.7 can be obtained using the MATLAB script (example5_9.m) as follows: b = [0.5, 0, -0.18]; a = [1, 0.1, -0.72]; [z, p, c] = tf2zp(b,a) Runing the program, we obtain z = 0.6, −0.6, p =−0.9, 0.8, and c = 0.5. These results verify the derivation obtained in Example 5.7. Signal Processing Toolbox also provides a useful function zp2sos to convert a zero-pole-gain repre- sentation to an equivalent representation of second-order sections. The function [sos, G] = zp2sos(z, p, c); finds the overall gain G and a matrix sos containing the coefficients of each second-order section deter- mined from its zero-pole form. The matrix sos is a K× 6 matrix as sos = ⎡ ⎢⎢⎢⎣ b01 b11 b21 1 a11 a21 b02 b12 b22 1 a12 a22 ... ... ... ... ... ... b0K b1K b2K 1 a1K a2K ⎤ ⎥⎥⎥⎦ , (5.44)JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 264 DESIGN AND IMPLEMENTATION OF IIR FILTERS where each row contains the numerator and denominator coefficients, bik and aik,ofthekth second-order section Hk(z). The overall transfer function is expressed as H(z) = G K k=1 Hk(z) = G K k=1 b0k + b1k z−1 + b2k z−2 1 + a1k z−1 + a2k z−2 , (5.45) where G is a scalar which accounts for the overall gain of the system. Similarly, the function [sos, G] = tf2sos(b, a) finds a matrix sos and a gain G. In addition, we can use [sos, G] = tf2sos(b, a, dir_flag, scale); to specify the ordering of the second-order sections. If dir_flag is UP, the first row will contain the poles closest to the origin, and the last row will contain the poles closest to the unit circle. If dir_flag is DOWN, the sections are ordered in the opposite direction. The input parameter scale specifies the desired scaling of the gain and the numerator coefficients of all second-order sections. The parallel realizations discussed in Section 5.3.3 can be developed in MATLAB using the function residuez in the Signal Processing Toolbox. This function converts the transfer function expressed as Equation (5.22) to the partial-fraction-expansion (or residue) form as Equation (5.41). 
The function
[r, p, c] = residuez(b, a);
returns the column vector r containing the residues, the vector p containing the pole locations, and c containing the direct terms.
5.4 Design of IIR Filters Using MATLAB
MATLAB can be used to evaluate the IIR filter design methods, to realize and analyze the designed filters, and to quantize filter coefficients for fixed-point implementations.
5.4.1 Filter Design Using MATLAB
The Signal Processing Toolbox provides a variety of functions for designing IIR filters. This toolbox supports the design of Butterworth, Chebyshev type I, Chebyshev type II, elliptic, and Bessel IIR filters in four different types: lowpass, highpass, bandpass, and bandstop. The direct filter design function yulewalk finds a filter whose magnitude response approximates a desired function, which supports the design of a bandpass filter with multiple passbands. The filter design methods and functions available in the Signal Processing Toolbox are summarized in Table 5.1.
Table 5.1 List of IIR filter design methods and functions
Order estimation: buttord, cheb1ord, cheb2ord, ellipord
Design function: besself, butter, cheby1, cheby2, ellip
    (These two steps design a digital filter through frequency transformation and bilinear transform using an analog lowpass prototype filter.)
Direct design: yulewalk (design directly by approximating a magnitude response)
Generalized design: maxflat (design lowpass Butterworth filters with more zeros than poles)
Additional IIR filter design methods are supported by the MATLAB Filter Design Toolbox, summarized as follows:
iircomb - IIR comb notching or peaking digital filter design;
iirgrpdelay - allpass filter design given a group delay;
iirlpnorm - least P-norm optimal IIR filter design;
iirlpnormc - constrained least P-norm IIR filter design;
iirnotch - second-order IIR notch digital filter design; and
iirpeak - second-order IIR peaking (resonator) digital filter design.
We will use iirpeak in Section 5.6 for a practical application.
As indicated in Table 5.1, the IIR filter design requires two steps. First, compute the minimum filter order N and the frequency-scaling factor Wn from the given specifications. Second, calculate the filter coefficients using these two parameters. In the first step, the following MATLAB functions are used for estimating the filter order:
[N, Wn] = buttord(Wp, Ws, Rp, Rs);  % Butterworth filter
[N, Wn] = cheb1ord(Wp, Ws, Rp, Rs); % Chebyshev type I filter
[N, Wn] = cheb2ord(Wp, Ws, Rp, Rs); % Chebyshev type II filter
[N, Wn] = ellipord(Wp, Ws, Rp, Rs); % Elliptic filter
The parameters Wp and Ws are the normalized passband and stopband edge frequencies, respectively. The ranges of Wp and Ws are between 0 and 1, where 1 corresponds to the Nyquist frequency (fs/2). The parameters Rp and Rs are the passband ripple and the minimum stopband attenuation specified in dB, respectively. These four functions return the order N and the frequency-scaling factor Wn, which are needed in the second step of IIR filter design. In the second step, the Signal Processing Toolbox provides the following functions:
[b, a] = butter(N, Wn);
[b, a] = cheby1(N, Rp, Wn);
[b, a] = cheby2(N, Rs, Wn);
[b, a] = ellip(N, Rp, Rs, Wn);
[b, a] = besself(N, Wn);
These functions return the filter coefficients in row vectors b and a. We can use butter(N,Wn,'high') to design a highpass filter.
If Wn is a two-element vector, Wn = [W1 W2], butter returns an order 2N bandpass filter with passband in between W1 and W2, and butter(N,Wn,'stop') designs a bandstop filter. Example 5.10: Design a lowpass Butterworth filter with less than 1.0 dB of ripple from 0 to 800 Hz, and at least 20 dB of stopband attenuation from 1600 Hz to the Nyquist frequency 4000 Hz. The MATLAB script (example5_10.m) for designing the filter is listed as follows: Wp = 800/4000; Ws= 1600/4000; Rp = 1.0; Rs = 20.0; [N, Wn] = buttord(Wp, Ws, Rp, Rs); % First stage [b, a] = butter(N, Wn); % Second stage freqz(b, a, 512, 8000); % Display frequency responsesJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 266 DESIGN AND IMPLEMENTATION OF IIR FILTERS Figure 5.12 Filter visualization tool window Instead of using freqz for display magnitude and phase responses, we can use a graphical user interface (GUI) tool called the Filter Visualization Tool (FVTool) to analyze digital filters. The following command fvtool(b,a) launches the FVTool and computes the magnitude response for the filter defined by numerator and denominator coefficients in vectors b and a, respectively. For example, after execution of example5_10.m, when you type in fvtool(b,a) in the MATLAB command window, Figure 5.12 is displayed. From the Analysis menu, we can further analyze the designed filter. Example 5.11: Design a bandpass filter with passband of 100Ð200 Hz, and the sampling rate is 1 kHz. The passband ripple is less than 3 dB and the stopband attenuation is at least 30 dB by 50 Hz out on both sides of the passband. The MATLAB script (example5_11.m) for designing and evaluating filter is listed as follows: Wp = [100 200]/500; Ws = [50 250]/500; Rp=3; Rs = 30; [N, Wn] = buttord(Wp, Ws, Rp, Rs); [b, a] = butter(N, Wn); % Design a Butterworth filter fvtool(b, a); % Analyze the designed IIR filterJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 DESIGN OF IIR FILTERS USING MATLAB 267 50 −50 −150 −250 −350 −4500 0.1 0.2 0.3 0.4 0.5 Normalized frequency (×π rad/sample) Magnitude (dB) and Phase responses Magnitude (dB) Magnitude Response Phase response 0.6 0.7 0.8 0.9 −1000 −760 −520 Phase (degrees) −280 −40 200 Figure 5.13 Magnitude and phase responses of the bandpass filter From the Analysis menu in the FVTool window, we select the Magnitude and Phase Responses. The magnitude and phase responses of the designed bandpass filter are shown in Figure 5.13. 5.4.2 Frequency Transforms Using MATLAB The Signal Processing Toolbox provides functions lp2hp, lp2bp, and lp2bs for converting the prototype lowpass filters to highpass, bandpass, and bandstop filters, respectively. For example, the following command [numt,dent] = lp2hp(num,den,wo); transforms the lowpass filter prototype num/den with unity cutoff frequency to a highpass filter with cutoff frequency wo. 
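The transformations of Equations (5.15) and (5.17) can be verified numerically for the first-order prototype H(s) = 1/(s + 1). In this sketch the center frequency and bandwidth passed to lp2bp are arbitrary choices:

b = 1;  a = [1 1];                  % prototype H(s) = 1/(s + 1)
[bh, ah] = lp2hp(b, a, 1)           % returns s/(s + 1), Equation (5.16)
Wm = 2*pi*1000;  BW = 2*pi*200;     % center frequency and bandwidth in rad/s
[bb, ab] = lp2bp(b, a, Wm, BW)      % returns BW*s/(s^2 + BW*s + Wm^2), Equation (5.20)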
The Filter Design Toolbox provides additional frequency transformations via numerator and denomi- nator functions that are listed as follows: iirbpc2bpc Ð complex bandpass to complex bandpass; iirlp2bp Ð real lowpass to real bandpass; iirlp2bpc Ð real lowpass to complex bandpass; iirlp2bs Ð real lowpass to real bandstop; iirlp2bsc Ð real lowpass to complex bandstop; iirlp2hp Ð real lowpass to real highpass; iirlp2lp Ð real lowpass to real lowpass; iirlp2mb Ð real lowpass to real multiband; and iirlp2mbc Ð real lowpass to complex multiband.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 268 DESIGN AND IMPLEMENTATION OF IIR FILTERS 0 −140 −120 −100 −80 −60 −40 −20 0 20 0.1 0.2 0.3 0.4 0.5 Normalized frequency (×π rad/sample) Magnitude response (dB) Magnitude (dB) 0.6 0.7 0.8 0.9 Lowpass Bandpass Figure 5.14 Magnitude responses of the lowpass and bandpass filters Example 5.12: The function iirlp2bp converts an IIR lowpass to an bandpass filter with the following syntax: [Num,Den,AllpassNum,AllpassDen] = iirlp2bp(b,a,Wo,Wt); This functions returns numerator and denominator vectors, Num and Den of the transformed lowpass digital filter. It also returns the numerator AllpassNum and the denominator AllpassDen of the allpass mapping filter. The prototype lowpass filter is specified by numerator b and denominator a, Wo is the center frequency value to be transformed from the prototype filter, and Wt is the desired frequency in the transformed filter. Frequencies must be normalized to be between 0 and 1. The following MATLAB script (example5_12.m, adapted from the Help menu) converts a lowpass filter to a bandpass filter and analyzes it using FVTool as shown in Figure 5.14: [b,a] = ellip(6,0.1,60,0.209); % Lowpass filter [num,den] = iirlp2bp(b,a,0.5,[0.25,0.75]); % Convert to bandpass fvtool(b,a,num,den); % Display both lowpass & bandpass filters 5.4.3 Design and Realization Using FDATool In this section, we use the FDATool shown in Figure 4.18 for designing, realizing, and quantizing IIR filters. To design an IIR filter, select the radio button next to IIR in the Design Method region on the GUI. There are seven options (from the pull-down menu) for Lowpass types, and several different filter design methods are available for different response types.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 DESIGN OF IIR FILTERS USING MATLAB 269 Figure 5.15 GUI of designing an elliptic IIR lowpass filter Example 5.13: Similar to Example 4.12, design a lowpass IIR filter with the following specifi- cations: sampling frequency fs = 8 kHz, passband cutoff frequency ωp = 2 kHz, stopband cutoff frequency ωs = 2.5 kHz, passband ripple Ap = 1 dB, and stopband attenuation As = 60 dB. We can design an elliptic filter by clicking the radio button next to IIR in the Design Method region and selecting Elliptic from the pull-down menu. We then enter parameters in Frequency Specifications and Magnitude Specifications regions as shown in Figure 5.15. After pressing Design Filter button to compute the filter coefficients, the Filter Specifications region changed to a Magnitude Response (dB) as shown in Figure 5.15. We can specify filter order by clicking the radio button Specify Order and entering the filter order in a text box, or choose the default Minimum Order. The order of the designed filter is 6, which is stated in Current Filter Information region (top-left) as shown in Figure 5.15. 
By default, the designed IIR filter was realized by cascading of second-order IIR sections using the direct-form II biquads shown in Figure 5.8. We can change this default setting from Edit→Convert Structure, the dialog window shown in Figure 5.16 displayed for selecting different structures. We can reorder and scale second-order sections by selecting Edit→Reorder and Scale Second-Order Sections. Once the filter has been designed and verified as shown in Figure 5.15, we can turn on the quantization mode by clicking the Set Quantization Parameters button . The bottom-half of the FDATool windowJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 270 DESIGN AND IMPLEMENTATION OF IIR FILTERS Figure 5.16 Convert filter structure window will change to a new pane with the Filter Arithmetic option allowing the user to quantize the designed filter and analyzing the effects of changing quantization settings. To enable the fixed-point quantization, select Fixed-Point from the Filter Arithmetic pull-down menu. See Section 4.2.5 for details of those options and settings. Example 5.14: Design a quantized bandpass IIR filter for a 16-bit fixed-point DSP processor with the following specifications: sampling frequency = 8000 Hz, lower stopband cutoff frequency Fstop1 = 1200 Hz, lower passband cutoff frequency Fpass1 = 1400 Hz, upper passband cutoff fre- quency Fpass2 = 1600 Hz, upper stopband cutoff frequency Fstop2 = 1800 Hz, passband ripple = 1 dB, and stopband (both lower and upper) attenuation = 60 dB. Start FDATool and enter the appropriate parameters in the Frequency Specifications and Mag- nitude Specifications regions, select elliptic IIR filter type, and click Design Filter. The order of designed filter is 16 with eight second-order sections. Click the Set Quantization Parameters button, and select the Fixed-Point option from the pull-down menu of Filter Arithmetic and use default settings. After designing and quantizing the filter, select the Magnitude Response Esti- mate option on the Analysis menu for estimating the frequency response for quantized filter. The magnitude response of the quantized filter is displayed in the analysis area as shown in Figure 5.17. We observe that quantizing the coefficients has satisfactory filter magnitude response, primarily because FDATool implements the filter in cascade second-order sections, which is more resistant to the effects of coefficient quantization. We also select Filter Coefficients from the Analysis menu, and display it in Figure 5.18. It shows both the quantized coefficients (top) with Q15 format and the original coefficients (bottom) with double- precision floating-point format. We can save the designed filter coefficients in a C header file by selecting Generate C header from the Targets menu. The Generate C Header dialog box appears as shown in Figure 5.19. For an IIR filter, variable names in the C header file are numerator (NUM), numerator length (NL), denominator (DEN), denominator length (DL), and number of sections (NS). We can use the default variable names as shown in Figure 5.19, or change them to match the names used in the C program that will include this header file. Click Generate, and the Generate C Header dialog box appears. 
Enter the filename and click Save to save the file.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 271 Figure 5.17 FDATool window for a quantized 16-bit bandpass filter 5.5 Implementation Considerations This section discusses important considerations for implementing IIR filters, including stability and finite wordlength effects. 5.5.1 Stability The IIR filter defined by the transfer function given in Equation (3.44) is stable if all the poles lie within the unit circle. That is, |pm| < 1, m = 1, 2,...,M. (5.46) Figure 5.18 Filter coefficientsJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 272 DESIGN AND IMPLEMENTATION OF IIR FILTERS Figure 5.19 Generate C header dialog box In this case, we can show that limn→∞ h(n) = 0. If |pm| > 1 for any m, then the IIR filter is unstable since limn→∞ h(n) →∞. In addition, an IIR filter is unstable if H(z) has multiple-order pole(s) on the unit circle. Example 5.15: Considering the IIR filter with transfer function H(z) = 1 1 − az−1 , the impulse response of the system is h(n) = an, n ≥ 0. If the pole is inside the unit circle, i.e., |a| < 1, the impulse response limn→∞ h(n) = limn→∞ an → 0. Thus, the IIR filter is stable. However, the IIR filter is unstable for |a| > 1 since the pole is outside the unit circle and limn→∞ h(n) = limn→∞ an →∞if |a| > 1. Example 5.16: Considering the system with transfer function H(z) = z (z − 1)2 , there is a second-order pole at z = 1. The impulse response of the system is h(n) = n, which is an unstable system. An IIR filter is marginally stable (or oscillatory bounded) if limn→∞ h(n) = c, (5.47) where c is a nonzero constant. For example, if H(z) = 1/1 + z−1, there is a first-order pole on the unit circle. It is easy to show that the impulse response oscillates between ±1 since h(n) = (−1)n, n ≥ 0.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 273 a2 a1 a2 = 1 a1 = 1 + a2 − a1 = 1 + a2 1 −1 −22 Figure 5.20 Region of coefficient values for a stable second-order IIR filter Consider the second-order IIR filter defined by Equation (5.30). The denominator can be factored as 1 + a1z−1 + a2z−2 = 1 − p1z−1 1 − p2z−1 , (5.48) where a1 =−(p1 + p2) and a2 = p1 p2. (5.49) The poles must lie inside the unit circle for stability; that is, |p1| < 1 and |p2| < 1. From Equation (5.49), we need |a2| = |p1 p2| < 1 (5.50) for a stable system. The corresponding condition on a1 can be derived from the SchurÐCohn stability test as |a1| < 1 + a2. (5.51) Stability conditions in Equations (5.50) and (5.51) are illustrated in Figure 5.20, which shows the resulting stability triangle in the a1 − a2 plane. That is, the second-order IIR filter is stable if and only if the coefficients define a point (a1, a2 ) that lies inside the stability triangle. 5.5.2 Finite-Precision Effects and Solutions In practical applications, the coefficients obtained from filter design are quantized to a finite number of bits for implementation. The filter coefficients, bl and am, obtained by MATLAB are represented using double-precision floating-point format. Let b l and a m denote the quantized values corresponding to bl and am, respectively. The transfer function of quantized IIR filter is expressed as H (z) = L−1 l=0 b l z−l 1 + M m=1 a m z−m . (5.52)JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 274 DESIGN AND IMPLEMENTATION OF IIR FILTERS If the wordlength is not sufficiently large, some undesirable effects will occur. 
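A small numerical illustration of these effects is sketched below: the denominator of a narrowband second-order section (an arbitrary example) is rounded to 4 fractional bits, the pole radii are compared before and after quantization, and the stability-triangle conditions of Equations (5.50) and (5.51) are tested:

a  = [1 -1.845 0.875];                   % complex poles of radius 0.935
aq = round(a*2^4)/2^4                    % 4 fractional bits: [1 -1.875 0.875]
abs(roots(a))                            % both radii 0.935, inside the unit circle
abs(roots(aq))                           % real poles at 1.0 and 0.875: no longer stable
[abs(a(3))<1,  abs(a(2))<1+a(3);         % stability triangle, Equations (5.50)-(5.51)
 abs(aq(3))<1, abs(aq(2))<1+aq(3)]       % the quantized coefficients fail the test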
For example, the magnitude and phase responses of H (z) may be different from those of H(z). If the poles of H(z) are close to the unit circle, the pole(s) of H (z) may move outside the unit circle after coefficient quantization, resulting in an unstable implementation. These undesired effects are more serious when higher order IIR filters are implemented using the direct-form realization. Therefore, the cascade and parallel realizations are preferred in practical DSP implementations with each Hk(z) be a first- or second-order section. The cascade form is recommended for the implementation of high-order narrowband IIR filters that have closely clustered poles. Example 5.17: Consider the IIR filter with transfer function H(z) = 1 1 − 0.85z−1 + 0.18z−2 , with the poles located at z = 0.4 and z = 0.45. This filter can be realized in the cascade form as H(z) = H1(z)H2(z), where H1(z) = 1 (1−0.4z−1) and H2(z) = 1 (1−0.45z−1) . If this IIR filter is implemented on a 4-bit (a sign bit plus three data bits; see Table 3.2) fixed-point hardware, 0.85 and 0.18 are quantized to 0.875 and 0.125, respectively. Therefore, the direct-form realization is described as H (z) = 1 1 − 0.875z−1 + 0.125z−2 . The poles of the direct-form H (z) become z = 0.1798 and z = 0.6952, which are significantly different than the original 0.4 and 0.45. For cascade realization, the poles 0.4 and 0.45 are quantized to 0.375 and 0.5, respectively. The quantized cascade filter is expressed as H (z) = 1 1 − 0.375z−1 · 1 1 − 0.5z−1 . The poles of H (z) are z = 0.375 and z = 0.5. Therefore, the poles of cascade realization are closer to the desired H(z)atz = 0.4 and z = 0.45. Rounding of 2B-bit product to B bits introduces the roundoff noise. The order of cascade sections influences the output noise power due to roundoff. In addition, when digital filters are implemented using fixed-point processors, we have to optimize the ratio of signal power to the power of the quantization noise. This involves a trade-off with the probability of arithmetic overflow. The most effective technique in preventing overflow is to use scaling factors at various nodes within the filter sections. The optimization is achieved by keeping the signal level as high as possible at each section without getting overflown. Example 5.18: Consider the first-order IIR filter with scaling factor α described by H(z) = α 1 − az−1 , where stability requires that |a| < 1. The goal of including the scaling factor α is to ensure that the values of y(n) will not exceed 1 in magnitude. Suppose that x(n) is a sinusoidal signal of frequencyJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 275 ω0, the amplitude of the output is a factor of |H(ω0)|. For such signals, the gain of H(z)is maxω |H(ω)| = α 1 −|a|. Thus, if the signals being considered are sinusoidal, a suitable scaling factor is given byα<1 − |a|. 5.5.3 MATLAB Implementations The MATLAB function filter implements the IIR filter defined by Equation (5.23). The basic forms of this function are y = filter(b, a, x); y = filter(b, a, x, zi); The first element of vector a, the first coefficient a(1), is assumed to be 1. The input vector is x, and the filter output vector is y. At the beginning, the initial conditions (data in the signal buffers) are set to zero. However, they can be specified in the vector zi to reduce transients. Example 5.19: Given a signal consists of sinewave (150 Hz) corrupted by white noise with SNR = 0 dB, and the sampling rate is 1000 Hz. 
To enhance the sinewave, we need a bandpass filter with passband centered at 150 Hz. Similar to Example 5.11, we design a bandpass filter with the following MATLAB functions: Wp = [130 170]/500; % Passband edge frequencies Ws = [100 200]/500; % Stopband edge frequencies Rp = 3; % Passband ripple Rs = 40; % Stopband ripple [N, Wn] = buttord(Wp, Ws, Rp, Rs); % Find the filter order [b, a] = butter(N, Wn); % Design an IIR filter We implement the designed filter using the following function: y = filter(b, a, xn); % IIR filtering We then plot the input and output signals in the xn and y vectors, which are displayed in Figure 5.21. The complete MATLAB script for this example is given in example5_19.m. MATLAB Signal Processing Toolbox also provides the second-order (biquad) IIR filtering function with the following syntax: y = sosfilt(sos,x) This function applies the IIR filter H(z) with second-order sections sos as defined in Equation (5.44) to the vector x. Example 5.20: In Example 5.19, we design a bandpass filter and implement the direct-form IIR filter using the function filter. In this example, we convert the direct-form filter to cascade of second-order sections using the following function: sos = tf2sos(b,a);JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 276 DESIGN AND IMPLEMENTATION OF IIR FILTERS 4 2 0 0 50 100 150 200 250 300 Amplitude −2 −4 2 1 0 0 50 100 150 Time index n 200 250 300 Amplitude −1 −2 Figure 5.21 Input (top) and output (bottom) signals of bandpass filter The sos matrix is shown as follows: sos = 0.0000 0.0000 0.0000 1.0000 -0.9893 0.7590 1.0000 2.0000 1.0000 1.0000 -1.0991 0.7701 1.0000 1.9965 0.9965 1.0000 -0.9196 0.8119 1.0000 -2.0032 1.0032 1.0000 -1.2221 0.8350 1.0000 -1.9968 0.9968 1.0000 -0.9142 0.9257 1.0000 -2.0000 1.0000 1.0000 -1.3363 0.9384 We then perform the IIR filtering using the following function: y = sosfilt(sos,xn); The complete MATLAB program for this example is example5_20.m. The Signal Processing Tool (SPTool) supports the user to analyze signals, design and analyze filters, perform filtering, and analyze the spectra of signals. We can open this tool by typing sptool in the MATLAB command window. The SPTool main window is shown in Figure 5.22.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 277 Figure 5.22 SPTool window There are four windows that can be accessed within the SPTool: 1. The Signal Browser is used to view the input signals. Signals from the workspace or a file can be loaded into the SPTool by clicking File → Import. The Import to SPTool window allows users to select the data from either a file or a workspace. For example, after we execute example5_19.m, our workspace contains noisy sinewave in vector xn with sampling rate 1000 Hz. We import it by entering appropriate parameters in the dialog box. To view the signal, simply highlight the signal, and click View. The Signal Browser window is shown in Figure 5.23, which allows the user to zoom-in the signal, read the data values via markers, display format, and even play the selected signal using the computer’s speakers. 2. The Filter Designer is used to design filters. Users can click the New icon to start a new filter, or the Edit icon to open an existing filter. We can design filters using different filter design algorithms. For example, we design an IIR filter displayed in Figure 5.24 that uses the same specifications as Example 5.19. 
In addition, we can also design a filter using the Pole/Zero Editor to graphically place the poles and zeros in the z-plane. 3. Once the filter has been designed, the frequency specification and other filter characteristics can be verified using the Filter Viewer. Selecting the name of the designed filter, and clicking the View icon under the Filter column will open the Filter Viewer window. We can analyze the filter in terms of its magnitude response, phase response, group delay, zero-pole plot, impulse response, and step response. After the filter characteristics have been verified, we can perform the filtering operation of the selected input signal. Click the Apply button, the Apply Filter window will be displayed, which allows the user to specify the file name of the output signal.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 278 DESIGN AND IMPLEMENTATION OF IIR FILTERS Figure 5.23 Signal browser window Figure 5.24 Design of bandpass filterJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 PRACTICAL APPLICATIONS 279 Figure 5.25 Spectrum viewer window for both input and output signals 4. We can compute the spectrum by selecting the signal, and then clicking the Create button in the Spectra column. Figure 5.25 is the display of the Spectrum Viewer. At the left-bottom corner of the window, click Apply to generate the spectrum of the selected signal. We repeat this process for both input and output signals. To view the spectra of input and output signals, select both spect1 (spectrum of input) and spect2 (spectrum of output), and click the View button in the Spectra column to display them (see Figure 5.25). 5.6 Practical Applications In this section, we briefly introduce the application of IIR filtering for signal generation and audio equalization. 5.6.1 Recursive Resonators Consider a simple second-order filter whose frequency response is dominated by a single peak at frequency ω0. To make a peak at frequency ω = ω0, we place a pair of complex-conjugated poles at pi = rpe± jω0 , (5.53) where the radius 0 < rp < 1. The transfer function of this IIR filter can be expressed as H(z) = A 1 − rpe jω0 z−1 1 − rpe− jω0 z−1 = A 1 − 2rp cos (ω0) z−1 + r 2 p z−2 = A 1 + a1z−1 + a2z−2 , (5.54)JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 280 DESIGN AND IMPLEMENTATION OF IIR FILTERS x(n) A z−1 z−1 2rp cosω0 y(n) −rp 2 Figure 5.26 Signal-flow graph of second-order resonator filter where A is a fixed gain used to normalize the filter to unity at ω0 such that |H (ω0)| = 1. The direct-form realization is shown in Figure 5.26. The magnitude response of this normalized filter is given by |H(ω0)|z=e− jω0 = A |(1 − rpe jω0 e− jω0 )(1 − rpe− jω0 e− jω0 )| = 1. (5.55) This condition can be used to obtain the gain A =|(1 − rp)(1 − rpe−2 jω0 )|=(1 − rp)  1 − 2rp cos(2ω0) + r 2 p . (5.56) The 3-dB bandwidth of the filter is equivalent to |H (ω)|2 = 1 2 |H (ω0)|2 = 1 2 . (5.57) There are two solutions on both sides of ω0, and the bandwidth is the difference between these two frequencies. When the poles are close to the unit circle, the BW is approximated as BW ∼= 2(1 − rp). (5.58) This design criterion determines the value of rp for a given BW. The closer rp is to 1, the sharper the peak. From Equation (5.54), the I/O equation of resonator is given by y(n) = Ax(n) − a1 y(n − 1) − a2 y(n − 2), (5.59) where a1 =−2rp cos ω0 and a2 = r 2 p . (5.60) This recursive oscillator is very useful for generating sinusoidal waveforms. 
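As a concrete reference for Equations (5.59) and (5.60), the resonator recursion can be sketched in floating-point C as below; the peak frequency, pole radius, and impulse-input test are our illustrative choices, not part of the text's experiments.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

int main(void)
{
    double w0 = 2.0 * PI * 1000.0 / 10000.0;   /* peak at 1 kHz, fs = 10 kHz (our choice) */
    double rp = 0.95;                          /* pole radius, 0 < rp < 1 */
    double a1 = -2.0 * rp * cos(w0);           /* Equation (5.60) */
    double a2 = rp * rp;
    double A  = (1.0 - rp) * sqrt(1.0 - 2.0 * rp * cos(2.0 * w0) + rp * rp);  /* Eq. (5.56) */
    double y1 = 0.0, y2 = 0.0;                 /* y(n-1), y(n-2) */

    for (int n = 0; n < 20; n++) {
        double x = (n == 0) ? 1.0 : 0.0;       /* unit impulse input as a simple test */
        double y = A * x - a1 * y1 - a2 * y2;  /* Equation (5.59) */
        y2 = y1;
        y1 = y;
        printf("%2d  %f\n", n, y);
    }
    return 0;
}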
This method uses a marginally stable two-pole resonator where the complex-conjugated poles lie on the unit circle (rp = 1). This recursive oscillator is the most efficient way for generating a sinusoidal waveform, particularly if the quadrature signals (sine and cosine signals) are required.

The Filter Design Toolbox provides the function iirpeak for designing an IIR peaking filter with the following syntax:

[NUM, DEN] = iirpeak(Wo, BW);

This function designs a second-order resonator with the peak at frequency Wo and a 3-dB bandwidth BW. In addition, we can use [NUM,DEN] = iirpeak(Wo,BW,Ab) to design a peaking filter with a bandwidth of BW at a level Ab in decibels.

Example 5.21: Design resonators operating at a sampling rate of 10 kHz having peaks at 1 and 2.5 kHz, and a 3-dB bandwidth of 500 and 200 Hz, respectively. These filters can be designed using the following MATLAB script (example5_21.m, adapted from the Help menu):

Fs  = 10000;        % Sampling rate
Wo  = 1000/(Fs/2);  % First filter peak frequency
BW  = 500/(Fs/2);   % First filter bandwidth
W1  = 2500/(Fs/2);  % Second filter peak frequency
BW1 = 200/(Fs/2);   % Second filter bandwidth
[b,a]   = iirpeak(Wo,BW);   % Design first filter
[b1,a1] = iirpeak(W1,BW1);  % Design second filter
fvtool(b,a,b1,a1);          % Analyze both filters

The magnitude responses of both filters are shown in Figure 5.27. In the FVTool window, we select Analysis -> Pole/Zero Plot to display the poles and zeros of both filters, which are shown in Figure 5.28. It is clearly shown that the second filter (peak at 2500 Hz) has a narrower bandwidth (200 Hz), and thus its poles are closer to the unit circle.

Figure 5.27 Magnitude responses of resonators

Figure 5.28 Pole/zero plot of resonators

5.6.2 Recursive Quadrature Oscillators

Consider two causal impulse responses
$h_c(n) = \cos(\omega_0 n)\, u(n)$   (5.61a)
and
$h_s(n) = \sin(\omega_0 n)\, u(n)$,   (5.61b)
where u(n) is the unit step function. The corresponding system transfer functions are
$H_c(z) = \dfrac{1 - \cos(\omega_0) z^{-1}}{1 - 2\cos(\omega_0) z^{-1} + z^{-2}}$   (5.62a)
and
$H_s(z) = \dfrac{\sin(\omega_0) z^{-1}}{1 - 2\cos(\omega_0) z^{-1} + z^{-2}}$.   (5.62b)

A two-output recursive structure with these system transfer functions is illustrated in Figure 5.29. The implementation requires just two data memory locations and two multiplications per sample. The output equations are
$y_c(n) = w(n) - \cos(\omega_0)\, w(n-1)$   (5.63a)
and
$y_s(n) = \sin(\omega_0)\, w(n-1)$,   (5.63b)
where w(n) is an internal state variable that is updated as
$w(n) = 2\cos(\omega_0)\, w(n-1) - w(n-2)$.   (5.64)

Figure 5.29 Recursive quadrature oscillators

An impulse signal Aδ(n) is applied to excite the oscillator, which is equivalent to presetting the following initial conditions:
$w(-2) = -A$ and $w(-1) = 0$.   (5.65)
The waveform accuracy is limited primarily by the DSP processor wordlength. The quantization of the coefficient cos(ω0) will cause the actual output frequency to differ slightly from the ideal frequency ω0. For some applications, only a sinewave is required.
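Before the sine-only oscillator is derived next, the quadrature recursion of Equations (5.63)-(5.65) can be sketched directly in C; the frequency, amplitude, and variable names below are our own illustrative choices.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

int main(void)
{
    double w0 = 2.0 * PI * 1000.0 / 8000.0;  /* 1 kHz at fs = 8 kHz (our choice) */
    double c  = cos(w0), s = sin(w0);
    double A  = 1.0;                         /* oscillation amplitude */
    double w1 = 0.0, w2 = -A;                /* w(n-1) = 0, w(n-2) = -A, Equation (5.65) */

    for (int n = 0; n < 16; n++) {
        double w  = 2.0 * c * w1 - w2;       /* state update, Equation (5.64) */
        double yc = w - c * w1;              /* cosine output, Equation (5.63a) */
        double ys = s * w1;                  /* sine output, Equation (5.63b) */
        w2 = w1;
        w1 = w;
        printf("%2d  cos: %+f  sin: %+f\n", n, yc, ys);
    }
    return 0;
}

Only the state update costs a multiplication per sample; the two outputs reuse w(n) and w(n-1), which is why the structure needs just two data memory locations.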
From Equations (5.59) and (5.60) using the conditions that x(n) = Aδ(n) and rp = 1, we can obtain the sinusoidal function ys(n) = Ax(n) − a1 ys(n − 1) − a2 ys(n − 2) = 2 cos(ω0)ys(n − 1) − ys(n − 2) (5.66) with the initial conditions ys(−2) =−A sin(ω0) and ys(−1) = 0. (5.67) The oscillating frequency defined by Equation (5.66) is determined from its coefficient a1 and its sampling frequency fs, and can be expressed as f = cos−1 |a1| 2 fs 2π Hz, (5.68) where the coefficient |a1| ≤ 2.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 284 DESIGN AND IMPLEMENTATION OF IIR FILTERS Example 5.22: The sinewave generator using resonator can be realized from the recursive compu- tation given in Equation (5.66). The implementation using the TMS320C55x assembly language is listed as follows: mov cos_w,T1 mpym *AR1+,T1,AC0 ; AC0=cos(w)*y[n-1] sub *AR1-<<#16,AC0,AC1 ; AC1=cos(w)*y[n-1]-y[n-2] add AC0,AC1 ; AC1=2*cos(w)*y[n-1]-y[n-2] || delay *AR1 ; y[n-2]=y[n-1] mov rnd(hi(AC1)),*AR1 ; y[n-1]=y[n] mov rnd(hi(AC1)),*AR0+ ; y[n]=2*cos(w)*y[n-1]-y[n-2] || mpym *AR1+,T1,AC0 ; AC0=cos(w)*y[n-1] In the program, AR1 is the pointer for the signal buffer. The output sinewave samples are stored in the output buffer pointed by AR0. Due to the limited wordlength, the quantization error of fixed- point DSP processors such as the TMSC320C55x could be severe for the recursive computation. 5.6.3 Parametric Equalizers A simple parametric equalizer filter can be designed from a resonator given in Equation (5.54) by adding a pair of zeros near the poles at the same angles as the poles; that is, placing the complex-conjugated poles at zi = rze± jω0 , (5.69) where 0 < rz < 1. Thus, the transfer function given in Equation (5.54) becomes H(z) = 1 − rze jω0 z−1 1 − rze− jω0 z−1 1 − rpe jω0 z−1 1 − rpe− jω0 z−1 = 1 − 2rz cos (ω0) z−1 + r 2 z z−2 1 − 2rp cos (ω0) z−1 + r 2 p z−2 = 1 + b1z−1 + b2z−2 1 + a1z−1 + a2z−2 . (5.70) When rz < rp, the pole dominates over the zero because it is closer to the unit circle than the zero does. Thus, it generates a peak in the frequency response at ω = ω0. When rz > rp, the zero dominates over the pole, thus providing a dip in the frequency response. When the pole and zero are very close to each other, the effects of the poles and zeros are reduced, resulting in a flat response. Therefore, Equation (5.70) provides a boost ifrz < rp, or a cut ifrz > rp. The amount of gain and attenuation is controlled by the differ- ence betweenrp andrz. The distance fromrp to the unit circle will determine the bandwidth of the equalizer. Example 5.23: Design a parametric equalizer with a peak at frequency 1500 Hz, and the sampling rate is 10 kHz. The parameters rz = 0.8 and rp = 0.9. 
The MATLAB script (example5_23.m)is listed as follows: rz=0.8; rp=0.9; b=[1, -2*rz*cos(w0), rz*rz]; a=[1, -2*rp*cos(w0), rp*rp]; Since rz < rp, this filter provides a boost.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 285 Table 5.2 List of C function for implementing a floating-point, direct-form I IIR filter void floatPoint_IIR(double in, double *x, double *y, double *b, short nb, double *a, short na) { double z1,z2; short i; for(i=nb-1; i>0; i--) // Update the buffer x[] x[i] = x[i-1]; x[0] = in; // Insert new data to x[0] for(z1=0, i=0; i0; i--) // Update y buffer y[i] = y[i-1]; for(z2=0, i=1; i0; i--) // Update the buffer x[] x[i] = x[i-1]; x[0] = in; // Insert new data to x[0] for(z1=0, i=0; i>11); } for(i=na-1; i>0; i--) // Update y[] buffer y[i] = y[i-1]; for(z2=0, i=1; i>11); } y[0] = (short)(z1 - z2); // Place the result into y[0] }JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 287 Table 5.5 File listing for experiment exp5.7.2_fixedPoint_directIIR Files Description fixPoint_directIIRTest.c C function for testing fixed-point IIR filter experiment fixPoint_directIIR.c C function for fixed-point IIR filter fixPointIIR.h C header file for IIR experiment fixedPoint_direcIIR.pjt DSP project file floatPoint_direcIIR.cmd DSP linker command file input.pcm Data file 3. Validate the output signal to ensure that the 800 and 3300 Hz sinusoidal components are reduced by 60 dB. 4. Compare the output signal with previous floating-point output signal to check the performance difference. 5. Profile the fixed-point IIR filter performance. 5.7.3 Fixed-Point Direct-Form II Cascade IIR Filter The cascade structure shown in Figure 5.8 can be expressed as w(n) = x(n) − a1w(n − 1) − a2w(n − 2). y(n) = b0w(n) + b1w(n − 1) + b2w(n − 2) (5.71) The C implementation of cascading K second-order sections is given as follows: temp = input[n]; for (k=0; k>15); // Save in Q15 format w_0 = (long)temp16 * *(coef+j); j++; w_0 <<= 1; temp32 = (long)*(w+l) * *(coef+j); j++; l=(l+Ns)%k; w_0 += temp32<<1; temp32 = (long)*(w+l) * *(coef+j); j=(j+1)%m; l=(l+1)%k; w_0 += temp32<<1; w_0 += 0x800; // Rounding } y[n] = (short)(w_0>>12); // Output in Q15 format } } The coefficient and signal buffers are configured as circular buffers shown in Figure 5.30. The signal buffer contains two elements, wk(n − 1) and wk(n − 2), for each second-order section. The pointer address is initialized pointing at the first sample w1(n − 1) in the buffer. The coefficient vector is arranged with five coefficients (a1k, a2k, b2k, b0k, and b1k) per section with the coefficient pointer initialized to point at the first coefficient, a11. The circular pointers are updated by j=(j+1)%m and l=(l+1)%k, where m and k are the sizes of the coefficient and signal buffers, respectively. The test function reads in the filter coefficients header file, fdacoefsMATLAB.h, generated by the FDATool directly. Table 5.7 lists the files used for this experiment, where the test input data file in.pcm consists of three frequencies, 800, 1500, and 3300 Hz with the 8 kHz sampling rate. Procedures of the experiment are listed as follows: 1. Open the project file fixedPoint_cascadeIIR.pjt and rebuild the project. 2. 
Run the cascade filter experiment to filter the input signal in the data directory.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 289 Coefficient buffer C[ ] Signal buffer w[ ] Offset = Number of sections Section 1 coefficients Section 2 coefficients Section K coefficients a11 a21 b21 b01 b11 a12 a22 b22 b02 b12 a1K a2K b2K b0K b1K w1(n − 1) w2(n − 1) wK(n − 1) wK(n − 2) w1(n − 2) w2(n − 2) : : : : : : Figure 5.30 IIR filter coefficient and signal buffers configuration 3. Validate the output data to ensure that the 800 and 3300 Hz sinusoidal components are reduced by the 60 dB. 4. Profile the fixed-point direct-form II cascade IIR filter performance. 5.7.4 Implementation Using DSP Intrinsics The C55x C intrinsics can be used as any C function and they produce assembly language statements directly in compile time. The intrinsics are specified with a leading underscore and can be accessed Table 5.7 File listing for experiment exp5.7.3_fixedPoint_cascadeIIR Files Description fixPoint_cascadeIIRTest.c C function for testing cascade IIR filter experiment fixPoint_cascadetIIR.c C function for fixed-point second-order IIR filter cascadeIIR.h C header file for cascade IIR experiment fdacoefsMATLAB.h FDATool generated C header file tmwtypes.h Data type definition file for MATLAB C header file fixedPoint_cascadeIIR.pjt DSP project file fixedPoint_cascadeIIR.cmd DSP linker command file in.pcm Data fileJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 290 DESIGN AND IMPLEMENTATION OF IIR FILTERS by calling them as C functions. For example, the multiplyÐaccumulation operation, z+=x*y, can be implemented by the following intrinsic: short x,y; long z; z = _smac(z,x,y); // Perform signed z=z+x*y This intrinsic performs the following assembly instruction: macm Xmem,Ymem,Acx; Perform signed z=z+x*y Table 5.8 lists the intrinsics supported by the TMS320C55x C compiler. We will modify the previous fixed-point C function for cascade IIR filter using C intrinsics. Table 5.9 lists the implementation of the fixed-point IIR filter with coefficients in Q14 format. For the modulo operation, we replaced the sections, k, with an and(&) operation since the number k is a power-of-2 number. The test function reads in the filter coefficients from the C header file fdacoefsMATLAB.h generated by the FDATool. Table 5.10 lists the files used for this experiment, where the input data file in.pcm consists of three frequencies, 800, 1500, and 3300 Hz with the 8 kHz sampling rate. Procedures of the experiment are listed as follows: 1. Open the project file intrisics_implementation.pjt and rebuild the project. 2. Run the experiment to filter the test signal in the data directory. 3. Validate the output signal to ensure that the 800 and 3300 Hz sinusoidal components are attenuated by 60 dB. 4. Profile the code and compare the result with the performance obtained in previous experiment. 5.7.5 Implementation Using Assembly Language The fixed-point C implementation of an IIR filter can be more efficient using the C55x multiplyÐ accumulator instruction with circular buffers. 
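As a plain-C reference for the assembly routine that follows, one direct-form II second-order section of Equation (5.71) can be sketched as below. This is a simplified floating-point illustration with our own names; the experiment code instead uses Q14 fixed-point arithmetic and circular buffers.

#include <stdio.h>

/* One direct-form II second-order section, Equation (5.71):
 *   w(n) = x(n) - a1*w(n-1) - a2*w(n-2)
 *   y(n) = b0*w(n) + b1*w(n-1) + b2*w(n-2)
 * The two-element array w holds w(n-1) and w(n-2). */
static double biquad_df2(double x, double *w,
                         double b0, double b1, double b2,
                         double a1, double a2)
{
    double w0 = x - a1 * w[0] - a2 * w[1];
    double y  = b0 * w0 + b1 * w[0] + b2 * w[1];
    w[1] = w[0];                 /* w(n-2) <- w(n-1) */
    w[0] = w0;                   /* w(n-1) <- w(n)   */
    return y;
}

int main(void)
{
    /* Illustrative coefficients only, not the experiment's filter. */
    double b0 = 0.2, b1 = 0.4, b2 = 0.2, a1 = -0.5, a2 = 0.3;
    double w[2] = { 0.0, 0.0 };

    for (int n = 0; n < 8; n++) {
        double x = (n == 0) ? 1.0 : 0.0;     /* impulse input as a simple test */
        printf("%d  %f\n", n, biquad_df2(x, w, b0, b1, b2, a1, a2));
    }
    return 0;
}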
The C55x assembly implementation of the second-order, direct-form II IIR filter given in Equation (5.71) can be written as mov *AR0+<<#12,AC0 ; AC0 = x(n) with scale down masm *AR3+,*AR7+,AC0 ; AC0=AC0-a1*wi(n-1) masm T3=*AR3,*AR7+,AC0 ; AC0=AC0-a2*wi(n-2) mov hi(AC0),*AR3- ; wi(n-2)=wi(n) mpym *AR7+,T3,AC0 ; AC0=b2*wi(n-2) macm *AR3+,*AR7+,AC0 ; AC0=AC0+bi0*wi(n-1) macm *AR3,*AR7+,AC0 ; AC0=AC0+bi1*wi(n) mov hi(AC0),*AR1+ ; Store filter result The assembly program contains three data pointers and a coefficient pointer. The auxiliary register AR0 is the input buffer pointer pointing to the input sample. The filtered sample is rounded and stored in the output buffer pointed by AR1. The signal buffer wi (n) is pointed by AR3. The filter coefficients pointer AR7 and signal pointer AR3 can be efficiently implemented using circular addressing mode.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 291 Table 5.8 Intrinsics supported by the TMS320C55x C compiler C compiler intrinsics Description short _sadd(short src1, short src2); Adds two 16-bit integers with SATA set, producing a saturated 16-bit result long _lsadd(long src1, long src2); Adds two 32-bit integers with SATD set, producing a saturated 32-bit result short _ssub(short src1, short src2); Subtracts src2 from src1 with SATA set, producing a saturated 16-bit result long _lssub(long src1, long src2); Subtracts src2 from src1 with SATD set, producing a saturated 32-bit result short _smpy(short src1, short src2); Multiplies src1 and src2 and shifts the result left by 1. Produces a saturated 16-bit result. (SATD and FRCT are set.) long _lsmpy(short src1, short src2); Multiplies src1 and src2 and shifts the result left by 1. Produces a saturated 32-bit result. (SATD and FRCT are set.) long _smac(long src, short op1, short op2); Multiplies op1 and op2, shifts the result left by 1, and adds it to src. Produces a saturated 32-bit result. (SATD, SMUL, and FRCT are set.) long _smas(long src, short op1, short op2); Multiplies op1 and op2, shifts the result left by 1, and subtracts it from src. Produces a 32-bit result. (SATD, SMUL, and FRCT are set.) short _abss(short src); Creates a saturated 16-bit absolute value. _abss(0x8000) => 0x7FFF (SATA set) long _labss(long src); Creates a saturated 32-bit absolute value. _labss(0x8000000) => 0x7FFFFFFF (SATD set) short _sneg(short src); Negates the 16-bit value with saturation _sneg(0xffff8000) => 0x00007FFF long _lsneg(long src); Negates the 32-bit value with saturation. _lsneg(0x80000000) => 0x7FFFFFFF short _smpyr(short src1, short src2); Multiplies src1 and src2, shifts the result left by 1, and rounds by adding 215 to the result. (SATD and FRCT are set.) short _smacr(long src, short op1, short op2); Multiplies op1 and op2, shifts the result left by 1, adds the result to src, and then rounds the result by adding 215. (SATD, SMUL, and FRCT are set) short _smasr(long src, short op1, short op2); Multiplies op1 and op2, shifts the result left by 1, subtracts the result from src, and then rounds the result by adding 215. (SATD, SMUL, and FRCT set.) short _norm(short src); Produces the number of left shifts needed to normalize src. short _lnorm(long src); Produces the number of left shifts needed to normalize src. short _rnd(long src); Rounds src by adding 215. Produces a 16-bit saturated result. 
(SATD set) continues overleafJWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 292 DESIGN AND IMPLEMENTATION OF IIR FILTERS Table 5.8 (continued) C compiler intrinsics Description short _sshl(short src1, short src2); Shifts src1 left by src2 and produces a 16-bit result. The result is saturated if src2 is less than or equal to 8. (SATD set) long _lsshl(long src1, short src2); Shifts src1 left by src2 and produces a 32-bit result. The result is saturated if src2 is less than or equal to 8. (SATD set) short _shrs(short src1, short src2); Shifts src1 right by src2 and produces a 16-bit result. Produces a saturated 16-bit result. (SATD set) long _lshrs(long src1, short src2); Shifts src1 right by src2 and produces a 32-bit result. Produces a saturated 32-bit result. (SATD set) short _addc(short src1, short src2); Adds src1, src2, and carry bit and produces a 16-bit result. long _laddc(long src1, short src2); Adds src1, src2, and carry bit and produces a 32-bit result. This IIR filtering code can be easily modified for performing either a sample-by-sample or block processing. When the IIR filter function is called, the temporary register T0 contains the number of input samples to be filtered, and T1 contains the number of second-order sections. The IIR filter sections are implemented by the inner loop, and the outer loop is used for processing samples in blocks. The Table 5.9 Fixed-point implementation of direct-form II IIR filter using intrinsics void intrinsics_IIR(short *x, short Nx, short *y, short *coef, short Ns, short *w) { short i,j,n,m,k,l; short temp16; long w_0; m=Ns*5; // Setup circular buffer coef[] k=Ns*2-1; // Setup circular buffer w[] for (j=0,l=0,n=0; n>15); // Save in Q15 format w_0 = _lsmpy(temp16,*(coef+j)); j++; w_0 = _smac(w_0,*(w+l),*(coef+j)); j++; l=(l+Ns)&k; w_0 = _smac(w_0,*(w+l),*(coef+j)); j=(j+1)%m; l=(l+1)&k; } y[n] = (short)(w_0>>12); // Output in Q15 format } }JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 293 Table 5.10 File listing for experiment exp5.7.4_intrisics_implementation Files Description intrinsics_IIRTest.c C function for testing IIR filter intrinsics experiment intrinsics_IIR.c Intrinsics implementation of second-order IIR filter intrinsics_IIR.h C header file for intrinsics IIR experiment fdacoefsMATLAB.h FDATool generated C header file tmwtypes.h Data type definition file for MATLAB C header file intrisics_implementation.pjt DSP project file intrisics_implementation.cmd DSP linker command file in.pcm Data file IIR filter coefficients are represented using Q14 format. To prevent the overflow, the input sample is scaled down as well. To compensate the Q14 formatted coefficients and scaled down input samples, the filter result y(n) is scaled up to form the Q15 format and stored with rounding. Temporary register T3 is used to hold the second element wi (n − 2) of the signal buffer when the buffer update is taking place. For a K-section IIR filter, the signal buffer elements are arranged in such a way that two elements of each section are separated by KÐ 1 elements. The filter coefficients and the signal samples are arranged for circular buffer as shown in Figure 5.30. The complete assembly language implementation of the IIR filter in cascade second-order sections is listed in Table 5.11. Table 5.12 lists the files used for this experiment. The test function reads in the filter coefficients generated by the FDATool, which are saved in C header file fdacoefsMATLAB.h. 
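The Q14/Q15 bookkeeping described above can be checked with a small portable C fragment; saturation is omitted and the numbers are our own, so this is only a sketch of the scaling idea, not the experiment code.

#include <stdio.h>

#define Q15(x) ((short)((x) * 32768.0 + ((x) >= 0 ? 0.5 : -0.5)))   /* double -> Q15 */
#define Q14(x) ((short)((x) * 16384.0 + ((x) >= 0 ? 0.5 : -0.5)))   /* double -> Q14 */

int main(void)
{
    short x = Q15(0.5);       /* sample in Q15 */
    short c = Q14(0.85);      /* coefficient in Q14 */

    long  acc = (long)x * c;  /* Q15 * Q14 = Q29 product held in a 32-bit accumulator */
    short y   = (short)(acc >> 14);  /* back to Q15 (truncation shown here) */

    printf("x=%d c=%d acc=%ld y=%d (%f)\n", x, c, acc, y, (double)y / 32768.0);
    return 0;
}

In the experiments the same idea is carried out with the _lsmpy/_smac intrinsics or in assembly, combined with input scaling, rounding, and saturation.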
Procedures of the experiment are listed as follows: 1. Open the project file asm_implementation.pjt and rebuild the project. 2. Run the experiment to filter the input signal in data directory. 3. Validate the output signal to ensure that the 800 and 3300 Hz sinusoidal components are attenuated by 60 dB. 4. Profile this experiment and compare the result with previous experiments. 5.7.6 Real-Time Experiments Using DSP/BIOS In Chapter 4, we have used DSP/BIOS for real-time FIR filtering. In this experiment, we will apply the same process to create a new DSP/BIOS project for IIR filtering. The DSP/BIOS provides addi- tional tools for code development and debug. One of the useful tools is the CPU load graph. The CPU loading can be plotted in real time to monitor the real-time performance of DSP system. This fea- ture is especially useful for multithread system when multiple threads sharing the CPU concurrently. Another useful graphical tool is the DSP execution graph. The DSP execution graph shows several DSP/BIOS tasks, including hardware interrupts (HWI), software interrupts (SWI), tasks (TSK), and semaphores (SEM). The IIR filtering experiment execution graph is shown in Figure 5.31. Our exper- iment has one task Ð swiAudioProcess. This task is software interrupt based and has the highest priority.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 294 DESIGN AND IMPLEMENTATION OF IIR FILTERS Table 5.11 Assembly language implementation of direct-form II IIR filter .global _asmIIR .sect ".text:iir_code" _asmIIR pshm ST1_55 ; Save ST1, ST2, ST3 pshm ST2_55 pshm ST3_55 psh T3 ; Save T3 pshboth XAR7 ; Save AR7 or #0x340,mmap(ST1_55) ; Set FRCT, SXMD, SATD bset SMUL ; Set SMUL sub #1,T0 ; Number of samples - 1 mov T0,BRC0 ; Set up outer loop counter sub #1,T1,T0 ; Number of sections -1 mov T0,BRC1 ; Set up inner loop counter mov T1,T0 ; Set up circular buffer sizes sfts T0,#1 mov mmap(T0),BK03 ; BK03=2*number of sections sfts T0,#1 add T1,T0 mov mmap(T0),BK47 ; BK47=5*number of sections mov mmap(AR3),BSA23 ; Initial signal buffer base mov mmap(AR2),BSA67 ; Initial coefficient base amov #0,AR3 ; Initial signal buffer entry amov #0,AR7 ; Initial coefficient entry or #0x88,mmap(ST2_55) mov #1,T0 ; Used for shift left || rptblocal sample_loop-1 ; Start IIR filtering loop mov *AR0+ <<#12,AC0 ; AC0 = x(n)/8 (i.e. Q12) || rptblocal filter_loop-1 ; Loop for each section masm *(AR3+T1),*AR7+,AC0 ; AC0-=ai1*wi(n-1) masm T3=*AR3,*AR7+,AC0 ; AC0-=ai2*di(n-2) mov rnd(hi(AC0< 0; and (c) a2 1/4 + a2 = 0. 7. A first-order allpass filter has the transfer function H(z) = z−1 − a 1 − az−1 . (a) Draw the direct-form I and II realizations. (b) Show that |H(ω)|=1 for all ω. (c) Sketch the phase response of this filter. 8. Given a six-order IIR transfer function H(z) = 6 + 17z−1 + 33z−2 + 25z−3 + 20z−4 − 5z−5 + 8z−6 1 + 2z−1 + 3z−2 + z−3 + 0.2z−4 − 0.3z−5 − 0.2z−6 , find the factored form of the IIR transfer function in terms of second-order sections using MATLAB. 9. Given a fourth-order IIR transfer function H(z) = 12 − 2z−1 + 3z−2 + 20z−4 6 − 12z−1 + 11z−2 − 5z−3 + z−4 . (a) Use MATLAB to express H(z) in factored form. (b) Develop two different cascade realizations. (c) Develop two different parallel realizations.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 EXERCISES 301 10. 
Design and plot the magnitude response of an elliptic IIR lowpass filter with the following specifications using MATLAB: passband edge at 1600 Hz, stopband edge at 2000 Hz, passband ripple of 0.5 dB, and minimum stopband attenuation of 40 dB with sampling rate of 8 kHz. Analyze the design filter using the FVTool. 11. Use FDATool to design an IIR filter specified in Problem 10 using: (a) Butterworth; (b) Chebyshev type-I; and (c) Chebyshev type-II and Bessel methods. Show both magnitude and phase responses of the designed filters and indicate the required filter order. 12. Redo Problem 10 using the FDATool, compare the results with Problem 10, and design a quantized filter for 16-bit fixed-point DSP processors. 13. Redo Problem 12 for designing an 8-bit fixed-point filter. Show the differences with the 16-bit filter designed in Problem 12. 14. Design an IIR Butterworth bandpass filter with the following specifications: passband edges at 450 and 650 Hz, stopband edges at 300 and 750 Hz, passband ripple of 1 dB, minimum stopband attenuation of 60 dB, and sampling rate of 8 kHz. Analyze the design filter using the FVTool. 15. Redo Problem 14 using the FDATool, compare the results with Problem 14, and design a quantized filter for 16-bit fixed-point DSP processors. 16. Design a type-I Chebyshev IIR highpass filter with passband edge at 700 Hz, stopband edge at 500 Hz, passband ripple of 1 dB, and minimum stopband attenuation of 32 dB. The sampling frequency is 2 kHz. Analyze the design filter using the FVTool. 17. Redo Problem 16 using FDATool, compare the results with Problem 16, and design a quantized filter for 16-bit fixed-point DSP processors. 18. Given an IIR lowpass filter with transfer function H(z) = 0.0662 1 + 3z−1 + 3z−2 + z−3 1 − 0.9356z−1 + 0.5671z−2 − 0.1016z−3 , plot the impulse response using an appropriate MATLAB function and compare the result using the FVTool. 19. It is interesting to examine the frequency response of the second-order resonator filter as the radius rp and the pole angle ω0 are varied. Use the MATLAB to compute and plot the magnitude response for ω0 = π/2 and various values of rp. Also, plot the magnitude response for rp = 0.95 and various values of ω0. 20. Use MATLAB FDATool to design an 8 kHz sampling rate highpass filter with the passband starting at 3000 Hz and at least 45-dB attenuation in the stopband. Write a direct-form I IIR filter function in fixed-point C. The test data file is given in the companion CD. 21. Rewrite the highpass filter implementation in Problem 20 using C intrinsics. 22. Use MATLAB FDATool to design an 8 kHz sampling rate lowpass filter with the stopband beginning at 1000 Hz with at least 60-dB attenuation. Write the direct-form I IIR filter in C55x assembly language. The test data file is given in the companion CD. 23. Create a real-time experiment using DSP/BIOS for the direct-form I IIR filter designed by Problem 22.JWBK080-05 JWBK080-Kuo March 8, 2006 11:47 Char Count= 0 302 DESIGN AND IMPLEMENTATION OF IIR FILTERS 24. The cascade IIR filter implementation in this problem has some issues that need to be corrected. Identify the problems and make corrections. Run the test and compare the result with the experiment given in Section 5.7.3. The software for this exercise is included in the companion CD. 25. Write the IIR filter function using intrinsics for Problem 24. Compare the profile result against the experiment given in Section 5.7.4. 26. 
Write the program for IIR filter function defined in Problem 24 using C55x assembly language. Compare the profile result against the experiment given in Section 5.7.5. 27. Can the experiment given in Section 5.7.5 be optimized even further? Try to improve the run-time efficiency of the experiment given in Section 5.7.5. 28. Use MATLAB FDATool to design two 16-kHz sampling rate filters. A lowpass filter with stopband at 800 Hz and attenuation at least 40 dB in the stopband and a highpass filter with the passband starting from 1200 Hz and at least 40-dB attenuation in its stopband. Modify the experiment given in Section 5.7.6 such that the DSK uses stereo line-in at 16 kHz sampling rate and PIP frame size set to 40. Place the lowpass filter on the left channel of the audio path and the highpass filter on the right channel of the audio path. Use a high-fidelity CD as audio input and listen to the filter output of the left and right channels. 29. Using Equation (5.70), design a three-band equalizer with normalized resonate frequencies of 0.05, 0.25, and 0.5. The equalizer must have a dynamic range of at least ± 9 dB at 1 dB step with: (a) 8 kHz sampling rate; (b) 48 kHz sampling rate. 30. Write a fixed-point C program to verify the three-band equalizer performance from Problem 29. Implement this real-time three-band equalizer using DSK with: (a) 8 kHz sampling rate; (b) 48 kHz sampling rate.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 6 Frequency Analysis and Fast Fourier Transform This chapter introduces the properties, applications, and implementations of the discrete Fourier transform (DFT). Because of the development of the fast Fourier transform algorithms, the DFT is now widely used for spectral analysis and fast convolution. 6.1 Fourier Series and Transform In this section, we will introduce the representation of analog periodic signals using Fourier series and the analysis of finite-energy signals using Fourier transform. 6.1.1 Fourier Series A periodic signal can be represented as the sum of an infinite number of harmonic-related sinusoids and complex exponentials. The representation of periodic signal x(t) with period T0 is the Fourier series defined as x(t) = ∞ k=−∞ cke jk0t , (6.1) where ck is the Fourier series coefficient, 0 = 2π/T0 is the fundamental frequency, and k0 is the frequency of the kth harmonic. The kth Fourier coefficient ck is expressed as ck = 1 T0 T0 x(t)e− jk0t dt. (6.2) For an odd function x(t), it is easier to calculate the interval from 0 to T0. For an even function, integration from −T0/2toT0/2 is commonly used. The term c0 = 1 T0 T0 x(t)dt is called the DC component because it equals the average value of x(t) over one period. Example 6.1: A rectangular pulse train is a periodic signal with period T0 and can be expressed as x(t) = A, 0, kT0 − τ/2 ≤ t ≤ kT0 + τ/2 otherwise , (6.3) Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 303JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 304 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM where k = 0, ±1, ±2,...,andτ 0 and u(t) is the unit-step function. From Equation (6.7), we have X() = ∞ −∞ e−atu(t)e− jt dt = ∞ 0 e−(a+ j)t dt = 1 a + j. For a function x(t) defined over a finite interval T0, i.e., x(t) = 0 for |t| > T0/2, the Fourier series coefficients ck can be expressed in terms of X() using Equations (6.2) and (6.7) as ck = 1 T0 X(k0). 
(6.9) Therefore, the Fourier transform X() of a finite interval function at a set of equally spaced points on the -axis is specified exactly by the Fourier series coefficients ck. 6.2 Discrete Fourier Transform In this section, we introduce the discrete-time Fourier transform and discrete Fourier transform of digital signals. 6.2.1 Discrete-Time Fourier Transform The discrete-time Fourier transform (DTFT) of a discrete-time signal x(nT ) is defined as X(ω) = ∞ n=−∞ x(nT)e− jωnT . (6.10) It shows that X(ω) is a periodic function with period 2π. Thus, the frequency range of a discrete-time signal is unique over the range (−π, π) or (0, 2π).JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 306 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM The DTFT of x(nT ) can also be defined in terms of normalized frequency as X(F) = ∞ n=−∞ x(nT)e− j2π Fn. (6.11) Comparing this equation with Equation (6.8), the periodic sampling imposes a relationship between the independent variables t and n as t = nT = n/fs. It can be shown that X(F) = 1 T ∞ k=−∞ X( f − kfs). (6.12) This equation states that X(F) is the sum of an infinite number of X( f ), scaled by 1/T , and then frequency shifted to kfs. It also states that X(F) is a periodic function with period T = 1/fs. Example 6.4: Assume that a continuous-time signal x(t) is bandlimited to fM, i.e., |X( f )| = 0 for | f | ≥ fM, where fM is the bandwidth of signal x(t). The spectrum is zero for | f | ≥ fM as shown in Figure 6.1(a). As shown in Equation (6.12), sampling extends the spectrum X( f ) repeatedly on both sides of the f-axis. When the sampling rate fs is greater than 2 fM, i.e., fM ≤ fs/2, the spectrum X( f )is preserved in X(F) as shown in Figure 6.1(b). In this case, there is no aliasing because the spectrum X ( f ) 0 fM−fM f (a) Spectrum of an analog signal. X( f/fs ) X( f/fs ) 0 fM fs−fs fs 2 −fs 2 −fM f (b) Spectrum of discrete-time signal when the sampling theorem is satisfied. 2 −fs fs 2 0 fM fs−fs −fM f (c) Spectrum of discrete-time signal when the sampling theorem is violated. Figure 6.1 Spectrum replication caused by sampling: (a) spectrum of analog bandlimited signal x(t); (b) sampling theorem is satisfied; and (c) overlap of spectral componentsJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 DISCRETE FOURIER TRANSFORM 307 of the discrete-time signal is identical (except the scaling factor 1/T ) to the spectrum of the analog signal within the frequency range | f | ≤ fs/2or|F| ≤ 1. The analog signal x(t) can be recovered from the discrete-time signal x(nT) by passing it through an ideal lowpass filter with bandwidth fM and gain T . This verifies the sampling theorem defined in Equation (1.3). However, if the sampling rate fs < 2 fM, the shifted replicas of X( f ) will overlap as shown in Figure 6.1(c). This phenomenon is called aliasing since the frequency components in the overlapped region are corrupted. The DTFT X(ω) is a continuous function of frequency ω and the computation requires an infinite-length sequence x(n). We have defined DFT in Section 3.2.6 for N samples of x(n)atN discrete frequencies. Therefore, DFT is a numerically computable transform. 6.2.2 Discrete Fourier Transform The DFT of a finite-duration sequence x(n) of length N is defined as X(k) = N−1 n=0 x(n)e− j(2π/N)kn, k = 0, 1,...,N − 1, (6.13) where X(k) is the kth DFT coefficient and the upper and lower indices in the summation reflect the fact that x(n) = 0 outside the range 0 ≤ n ≤ N − 1. 
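Coded directly from Equation (6.13), the DFT of a real N-point sequence takes the following form in C. This O(N^2) sketch (the naming is ours) is intended only as a reference point before the FFT algorithms of Section 6.3.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

/* Direct DFT of a real N-point sequence, straight from Equation (6.13). */
static void dft_real(const double *x, double *Xre, double *Xim, int N)
{
    for (int k = 0; k < N; k++) {
        Xre[k] = 0.0;
        Xim[k] = 0.0;
        for (int n = 0; n < N; n++) {
            double ang = 2.0 * PI * k * n / N;
            Xre[k] += x[n] * cos(ang);      /* real part of x(n)e^{-j2*pi*k*n/N} */
            Xim[k] -= x[n] * sin(ang);      /* imaginary part */
        }
    }
}

int main(void)
{
    double x[4] = { 1.0, 1.0, 0.0, 0.0 };   /* 4-point test sequence */
    double Xre[4], Xim[4];

    dft_real(x, Xre, Xim, 4);
    for (int k = 0; k < 4; k++)
        printf("X(%d) = %+f %+fj\n", k, Xre[k], Xim[k]);
    return 0;
}

For this input the program prints X = {2, 1 - j, 0, 1 + j}, which matches the matrix computation of Example 6.7 below.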
The DFT is equivalent to taking N samples of DTFT X(ω) over the interval 0 ≤ ω<2π at N discrete frequencies ωk = 2πk/N, where k = 0, 1,...,N − 1. The spacing between two successive X(k)is2π/N rad (or fs/N Hz). Example 6.5: If the signal {x(n)} is real valued and N is an even number, we can show that X(0) = N−1 n=0 x(n) and X(N/2) = N−1 n=0 e− jπn x(n) = N−1 n=0 (−1)n x(n). Therefore, the DFT coefficients X(0) and X(N/2) are real values. The DFT defined in Equation (6.13) can also be written as X(k) = N−1 n=0 x(n)W kn N , k = 0, 1,...,N − 1, (6.14) where W kn N = e− j 2π N kn = cos  2πkn N  − j sin  2πkn N  , 0 ≤ k, n ≤ N − 1. (6.15)JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 308 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM W8 6 = −W 8 2 W8 7 = −W8 3 W8 0 = 1 W8 3 W8 2 W8 1 W8 5 = −W 8 1 W8 4 = −W8 0 = −1 Figure 6.2 Twiddle factors for DFT, N = 8 The parameter W kn N is called the twiddle factors of the DFT. Because W N N = e− j2π = 1 = W 0 N , W k N , k = 0, 1,...,N − 1 are the N roots of unity in clockwise direction on the unit circle. It can be shown that W N/2 N = e− jπ =−1. The twiddle factors have the symmetry property W k+N/2 N =−W k N , 0 ≤ k ≤ N/2 − 1, (6.16) and the periodicity property W k+N N = W k N . (6.17) Figure 6.2 illustrates the cyclic property of the twiddle factors for an 8-point DFT. Example 6.6: Consider the finite-length signal x(n) = an, n = 0, 1,...,N − 1, where 0 < a < 1. The DFT of x(n) is computed as X(k) = N−1 n=0 ane− j(2πk/N)n = N−1 n=0  ae− j2πk/N n = 1 −  ae− j2πk/N N 1 − ae− j2πk/N = 1 − aN 1 − ae− j2πk/N , k = 0, 1,...,N − 1. The inverse discrete Fourier transform (IDFT) is used to transform the frequency domain X(k) back into the time-domain signal x(n). The IDFT is defined as x(n) = 1 N N−1 k=0 X(k)e j(2π/N)kn = 1 N N−1 k=0 X(k)W −kn N , n = 0, 1,...,N − 1. (6.18) This is identical to the DFT with the exception of the normalizing factor 1/N and the opposite sign of the exponent of the twiddle factors.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 DISCRETE FOURIER TRANSFORM 309 The DFT and IDFT defined in Equations (6.14) and (6.18), respectively, can be expressed in matrix- vector form as X = Wx (6.19) and x = 1 N W∗X, (6.20) where x = [x(0) x(1) ...x(N − 1)]T is the signal vector, the complex vector X = [X(0)X(1) ... X(N − 1)]T contains the DFT coefficients, and the NxN twiddle-factor matrix (or DFT matrix) W is defined by W =  W kn N  0≤k,n≤N−1 = ⎡ ⎢⎢⎢⎣ 11··· 1 1 W 1 N ··· W N−1 N ... ... ... ... 1 W N−1 N ··· W (N−1)2 N ⎤ ⎥⎥⎥⎦ , (6.21) and W∗ is the complex conjugate of the matrix W. Since W is a symmetric matrix, the inverse matrix W−1 = 1 N W∗ was used to derive Equation (6.20). Example 6.7: Given x(n) ={1, 1, 0, 0}, the DFT of this 4-point sequence can be computed using the matrix formulation as X = ⎡ ⎢⎢⎣ 1111 1 W 1 4 W 2 4 W 3 4 1 W 2 4 W 4 4 W 6 4 1 W 3 4 W 6 4 W 9 4 ⎤ ⎥⎥⎦ x = ⎡ ⎢⎢⎣ 1111 1 − j −1 j 1 −11−1 1 j −1 − j ⎤ ⎥⎥⎦ ⎡ ⎢⎢⎣ 1 1 0 0 ⎤ ⎥⎥⎦ = ⎡ ⎢⎢⎣ 2 1 − j 0 1 + j ⎤ ⎥⎥⎦ , where we used symmetry and periodicity properties given in Equations (6.16) and (6.17) to obtain W 0 4 = W 4 4 = 1, W 1 4 = W 9 4 =−j, W 2 4 = W 6 4 =−1, and W 3 4 = j. The IDFT can be computed as x = 1 4 ⎡ ⎢⎢⎣ 1111 1 W −1 4 W −2 4 W −3 4 1 W −2 4 W −4 4 W −6 4 1 W −3 4 W −6 4 W −9 4 ⎤ ⎥⎥⎦ X = 1 4 ⎡ ⎢⎢⎣ 1111 1 j −1 − j 1 −11−1 1 − j −1 j ⎤ ⎥⎥⎦ ⎡ ⎢⎢⎣ 2 1 − j 0 1 + j ⎤ ⎥⎥⎦ = ⎡ ⎢⎢⎣ 1 1 0 0 ⎤ ⎥⎥⎦ . The DFT coefficients are equally spaced on the unit circle with frequency intervals of fs/N (or 2π/N). 
Therefore, the frequency resolution of the DFT is  = fs/N. The frequency sample X(k) represents discrete frequency fk = k fs N , for k = 0, 1,...,N − 1. (6.22) Since the DFT coefficient X(k) is a complex variable, it can be expressed in polar form as X(k) =|X(k)|e jφ(k), (6.23)JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 310 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM where the DFT magnitude spectrum is defined as |X(k)|=  {Re[X(k)]}2 +{Im[X(k)]}2 (6.24) and the phase spectrum is defined as φ(k) = tan−1 Im[X(k)] Re[X(k)]  . (6.25) Example 6.8: Consider a finite-length DC signal x(n) = c, where n = 0, 1,...,N − 1. From Equation (6.14), we obtain X(k) = c N−1 n=0 W kn N = c 1 − W kN N 1 − W k N . Since W kN N = e− j 2π N kN = 1 for all k, and W k N = 1 for k = iN,wehaveX(k) = 0 for k = 1, 2,...,N − 1. For k = 0, N−1 n=0 W kn N = N. Therefore, we obtain X(k) = cNδ(k), k = 0, 1,...,N − 1. 6.2.3 Important Properties This section introduces several important properties of DFT that are useful for analyzing digital signals and systems. Linearity:If{x(n)} and {y(n)} are digital sequences of the same length, DFT[ax(n) + by(n)] = aDFT[x(n)] + bDFT[y(n)] = aX(k) + bY(k), (6.26) where a and b are arbitrary constants. Linearity allows us to analyze complex signals and systems by evaluating their individual components. The overall response is the combination of individual results evaluated at every frequency component. Complex conjugate: If the sequence {x(n), 0 ≤ n ≤ N − 1} is real valued, then X(−k) = X ∗(k) = X(N − k), 0 ≤ k ≤ N − 1, (6.27) where X ∗(k) is the complex conjugate of X(k). Or equivalently, X(M + k) = X ∗(M − k), 0 ≤ k ≤ M, (6.28) where M = N/2ifN is even, or M = (N − 1)/2ifN is odd. This property shows that only the first (M + 1) DFT coefficients from k = 0toM are independent as illustrated in Figure 6.3. For complex signals, however, all N complex outputs carry useful information. From the symmetry property, we obtain |X(k)|=|X(N − k)|, k = 1, 2,...,M − 1 (6.29)JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 DISCRETE FOURIER TRANSFORM 311 X(0) X(1) … X(M − 2) X(M − 1) X(M ) X(M + 1) X(M + 2) … X(N − 1) Complex conjugate Real Real Figure 6.3 Complex-conjugate property, N is an even number and φ(k) =−φ(N − k), k = 1, 2,...,M − 1. (6.30) Circular shifts: Let y(n) be a circular-shifted sequence defined as y(n) = x(n − m) mod N , (6.31) where m is the number of samples by which x(n) is shifted to the right and the modulo operation 0 ≤ (n − m) mod N = (n − m ± iN) < N. (6.32) For example, if m = 1, x(0) shifts to x(1), x(1) shifts to x(2), ..., x(N − 2) shifts to x(N − 1), and x(N − 1) shifts back to x(0). Thus, a circular shift of an N-point sequence is equivalent to a linear shift of its periodic extension. Considering the y(n) defined in Equation (6.31), we have Y(k) = e− j(2πk/N)m X(k) = W mk N X(k). (6.33) DFT and z-transform: DFT is equal to the z-transform of a sequence x(n) of length N, evaluated on the unit circle at N equally spaced frequencies ωk = 2πk/N, where k = 0, 1,...,N − 1. That is, X(k) = X(z)| z=e j( 2π N )k , k = 0, 1,...,N − 1. (6.34) Circular convolution:Ifx(n) and h(n) are real-valued N-periodic sequences, y(n) is the circular convo- lution of x(n) and h(n) defined as y(n) = x(n) ⊗ h(n), n = 0, 1,...,N − 1, (6.35) where ⊗ denotes circular convolution. The circular convolution in time domain is equivalent to mul- tiplication in the frequency domain expressed as Y(k) = X(k)H(k), k = 0, 1,...,N − 1. 
(6.36) Note that the shorter sequence must be padded with zeros in order to have the same length for computing circular convolution.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 312 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM x(n) x(n − 1) x(n − 2) x(n − N + 1) h(n) h(n − 1) h(n − 2) h(n − N + 1) Figure 6.4 Circular convolution of two sequences using the concentric circle approach Figure 6.4 illustrates the cyclic property of circular convolution using two concentric circles. Toperform circular convolution, N samples of x(n) are equally spaced around the outer circle in the clockwise direction, and N samples of h(n) are displayed on the inner circle in the counterclockwise direction starting at the same point. Corresponding samples on the two circles are multiplied, and the products are summed to form an output. The successive value of the circular convolution is obtained by rotating the inner circle of one sample in the clockwise direction, and repeating the operation of computing the sum of corresponding products. This process is repeated until the first sample of inner circle lines up with the first sample of the exterior circle again. Example 6.9: Given two 4-point sequences x(n) ={1, 2, 3, 4} and h(n) ={1, 0, 1, 1}. Using the circular convolution method illustrated in Figure 6.4, we can obtain n = 0, y(0) = 1 × 1 + 1 × 2 + 1 × 3 + 0 × 4 = 6 n = 1, y(1) = 0 × 1 + 1 × 2 + 1 × 3 + 1 × 4 = 9 n = 2, y(2) = 1 × 1 + 0 × 2 + 1 × 3 + 1 × 4 = 8 n = 3, y(3) = 1 × 1 + 1 × 2 + 0 × 3 + 1 × 4 = 7 Therefore, we obtain y(n) = x(n) ⊗ h(n) ={6, 9, 8, 7}. Note that the linear convolution of sequences x(n) and h(n) results in y(n) = x(n) ∗ h(n) ={1, 2, 4, 7, 5, 7, 4}, which is also implemented in MATLAB script example6_9.m. To eliminate the circular effect and ensure that the DFT method results in a linear convolution, the signals must be zero-padded. Since the linear convolution of two sequences of lengths L and M will resultJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 FAST FOURIER TRANSFORMS 313 in a sequence of length L + M − 1, the two sequences must be extended to the length of L + M − 1or greater by zero-padding. That is, append the sequence of length L with at least M − 1 zeros, and pad the sequence of length M with at least L − 1 zeros. Example 6.10: Consider the same sequences h(n) and x(n) given in Example 6.9. If those 4-point sequences are zero-padded to 8 points as x(n) ={1, 2, 3, 4, 0, 0, 0, 0} and h(n) = {1, 0, 1, 1, 0, 0, 0, 0}, the resulting circular convolution is n = 0, y(0) = 1 × 1 + 0 × 2 + 0 × 3 + 0 × 4 + 0 × 0 + 1 × 0 + 1 × 0 + 0 × 0 = 1 n = 1, y(1) = 0 × 1 + 1 × 2 + 0 × 3 + 0 × 4 + 0 × 0 + 0 × 0 + 1 × 0 + 1 × 0 = 2 n = 2, y(2) = 1 × 1 + 0 × 2 + 1 × 3 + 0 × 4 + 0 × 0 + 0 × 0 + 0 × 0 + 1 × 0 = 4 ... We finally have y(n) = x(n) ⊗ h(n) ={1, 2, 4, 7, 5, 7, 4, 0}. This result is identical to the linear convolution of the two sequences as given in Example 6.9. Thus, the linear convolution can be realized by the circular convolution with proper zero-padding. MATLAB script example6_10.m implements the circular convolution of zero-padded sequences using DFT. Zero-padding can be implemented using the MATLAB function zeros. For example, the 4-point sequence x(n) given in Example 6.9 can be zero-padded to 8 points with the following command, x = [1, 2, 3, 4, zeros(1, 4)]; where the MATLAB function zeros(1, N) generates a row vector of N zeros. 6.3 Fast Fourier Transforms The drawback of using DFT for practical applications is its intensive computational requirement. 
To compute each X(k) defined in Equation (6.14), we need approximately N complex multiplications and additions. For computing N samples of X(k) for k = 0, 1, ..., N - 1, approximately $N^2$ complex multiplications and $(N^2 - N)$ complex additions are required. Since a complex multiplication requires four real multiplications and two real additions, the total number of arithmetic operations required for computing the N-point DFT is proportional to $4N^2$, which becomes huge for large N. The twiddle factor $W_N^{kn}$ is a periodic function with a limited number of distinct values since
$W_N^{kn} = W_N^{(kn) \bmod N}$, for $kn > N$   (6.37)
and $W_N^N = 1$. Therefore, different powers of $W_N^{kn}$ have the same value as shown in Equation (6.37). In addition, some twiddle factors have either real or imaginary parts equal to 1 or 0. By reducing these redundancies, a very efficient algorithm called the fast Fourier transform (FFT) can be derived, which requires only $N \log_2 N$ operations instead of $N^2$ operations. If N = 1024, the FFT requires about $10^4$ operations instead of $10^6$ operations for the DFT.

Figure 6.5 Decomposition of an N-point DFT into two N/2 DFTs, N = 8

The generic term FFT covers many different algorithms with different features, advantages, and disadvantages. Each FFT algorithm has different strengths and makes different trade-offs in terms of code complexity, memory usage, and computation requirements. In this section, we introduce two classes of FFT algorithms: decimation-in-time and decimation-in-frequency.

6.3.1 Decimation-in-Time

For the decimation-in-time algorithms, the sequence {x(n), n = 0, 1, ..., N - 1} is first divided into two shorter interwoven sequences: the even-numbered sequence
$x_1(m) = x(2m)$, m = 0, 1, ..., (N/2) - 1   (6.38)
and the odd-numbered sequence
$x_2(m) = x(2m + 1)$, m = 0, 1, ..., (N/2) - 1.   (6.39)
Apply the DFT defined in Equation (6.14) to these two sequences of length N/2, and combine the resulting N/2-point X1(k) and X2(k) to produce the final N-point DFT. This procedure is illustrated in Figure 6.5 for N = 8. The structure shown on the right side of Figure 6.5 is called the butterfly network because of its crisscross appearance, which can be generalized as in Figure 6.6. Each butterfly involves just a single complex multiplication by a twiddle factor $W_N^k$, one addition, and one subtraction.

Figure 6.6 Flow graph for a butterfly computation

Figure 6.7 Flow graph illustrating the second step of the N-point DFT, N = 8

Since N is a power of 2, N/2 is an even number. Each of these N/2-point DFTs can be computed by two smaller N/4-point DFTs. This second step is illustrated in Figure 6.7. By repeating the same process, we will finally obtain a set of 2-point DFTs since N is a power of 2. For example, the N/4-point DFT becomes a 2-point DFT in Figure 6.7 for N = 8.
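The butterfly of Figure 6.6 amounts to one complex multiplication by $W_N^k$ followed by an addition and a subtraction; in C it can be sketched as follows (the in-place pointer interface and the test values are our own choices).

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

/* One radix-2 decimation-in-time butterfly (Figure 6.6):
 *   A' = A + W*B,   B' = A - W*B,   where W = W_N^k = e^{-j2*pi*k/N}. */
static void butterfly(double *are, double *aim, double *bre, double *bim,
                      double wre, double wim)
{
    double tre = wre * *bre - wim * *bim;   /* W*B, real part */
    double tim = wre * *bim + wim * *bre;   /* W*B, imaginary part */
    *bre = *are - tre;                      /* lower output */
    *bim = *aim - tim;
    *are = *are + tre;                      /* upper output */
    *aim = *aim + tim;
}

int main(void)
{
    int N = 8, k = 1;
    double wre = cos(2.0 * PI * k / N), wim = -sin(2.0 * PI * k / N);
    double are = 1.0, aim = 0.0, bre = 0.5, bim = 0.0;   /* arbitrary test inputs */

    butterfly(&are, &aim, &bre, &bim, wre, wim);
    printf("A' = %+f%+fj,  B' = %+f%+fj\n", are, aim, bre, bim);
    return 0;
}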
Since the first stage uses the twiddle factor W 0 N = 1, the 2-point butterfly network illustrated in Figure 6.8 requires only one addition and one subtraction. Example 6.11: Consider the 2-point FFT algorithm which has two input samples x(0) and x(1). The DFT output samples X(0) and X(1) can be computed as X(k) = 1 n=0 x(n)W nk 2 , k = 0, 1. Since W 0 2 = 1 and W 1 2 = e−π =−1, we have X(0) = x(0) + x(1) and X(1) = x(0) − x(1). The computation is identical to the signal-flow graph shown in Figure 6.8. As shown in Figure 6.7, the input sequence is arranged as if each index was written in binary form and then the order of binary digits was reversed. This bit-reversal process is illustrated in Table 6.1 for the case of N = 8. The input sample indices in decimal are first converted to their binary representations, the binary bit streams are reversed, and then the reversed binary numbers are converted back to decimal −1 Figure 6.8 Flow graph of 2-point DFTJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 316 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM Table 6.1 Example of bit-reversal process, N = 8 (3 bits) Input sample index Bit-reversed sample index Decimal Binary Binary Decimal 0 000 000 0 1 001 100 4 2 010 010 2 3 011 110 6 4 100 001 1 5 101 101 5 6 110 011 3 7 111 111 7 values to give the reordered time indices. Most modern DSP processors (such as the TMS320C55x) provide the bit-reversal addressing mode to efficiently support this process. For the FFT algorithm shown in Figure 6.7, the input values are no longer needed after the computation of output values at a particular stage. Thus, the memory locations used for the FFT outputs can be the same locations used for storing the input data. This observation supports the in-place FFT algorithms that use the same memory locations for both the input and output numbers. 6.3.2 Decimation-in-Frequency The development of the decimation-in-frequency FFT algorithm is similar to the decimation-in-time algorithm presented in the previous section. The first step is to divide the data sequence into two halves, each of N/2 samples. The next step is to separate the frequency terms X(k) into even and odd samples of k. Figure 6.9 illustrates the first decomposition of an N-point DFT into two N/2-point DFTs. Continue the process of decomposition until the last stage consists of 2-point DFTs. The decomposition and symmetry relationships are reversed from the decimation-in-time algorithm. The bit reversal occurs at the output instead of the input and the order of the output samples X(k) will be rearranged as Table 6.1. Figure 6.10 illustrates the butterfly representation for the decimation-in-frequency FFT algorithm. x(0) x(1) x(2) x(3) x(4) x(5) x(6) x(7) N/2-point DFT N/2-point DFT x1(0) x1(1) x1(2) x1(3) x2(0) x2(1) x2(2) x2(3) W 8 0 W 8 1 W 8 2 W 8 3−1 −1 −1 −1 X(0) X(2) X(4) X(6) X(1) X(3) X(5) X(7) Figure 6.9 Decomposition of an N-point DFT into two N/2 DFTsJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 317 −1 WN k (m −1)th stage mth stage Figure 6.10 Butterfly network for decimation-in-frequency FFT algorithm The FFT algorithms introduced in this chapter are based on two-input, two-output butterfly compu- tations, and are classified as radix-2 FFT algorithms. It is possible to use other radix values to develop FFT algorithms. However, these algorithms only work well for some specific FFT lengths. 
In addition, these algorithms are more complicated than the radix-2 FFT algorithms and the programs for real-time implementation are not widely available for DSP processors. 6.3.3 Inverse Fast Fourier Transform The FFT algorithms introduced in the previous sections can be modified to efficiently compute the inverse FFT (IFFT). By complex conjugating both sides of Equation (6.18), we have x∗(n) = 1 N N−1 k=0 X ∗(k)W kn N , n = 0, 1,...,N − 1. (6.40) This equation shows we can use an FFT algorithm to compute the IFFT by first conjugating the DFT coefficients X(k) to obtain X ∗(k), computing the DFT of X ∗(k) using an FFT algorithm, scaling the results by 1/N to obtain x∗(n), and then complex conjugating x∗(n) to obtain the output sequence x(n). If the signal is real valued, the final conjugation operation is not required. 6.4 Implementation Considerations Many FFT routines are available in C and assembly programs for some specific DSP processors; however, it is important to understand the implementation issues in order to use FFT properly. 6.4.1 Computational Issues The FFT routines accept complex-valued inputs; therefore, the number of memory locations required is 2N for N-point FFT. To use the available complex FFT program for real-valued signals, we have to set the imaginary parts to zero. The complex multiplication has the form (a + jb)(c + jd) = (ac + bd) + j(bc + ad), which requires four real multiplications and two real additions. The number of multiplication and the storage requirements can be reduced if the signal has special properties. For example, if x(n) is real, only N/2 samples from X(0) to X(N/2) need to be computed as shown by complex-conjugate property. In most FFT programs developed for general-purpose computers, the computation of twiddle factors W kn N defined in Equation (6.15) is embedded in the program. However, the twiddle factors only need to be computed once during the program initialization stage. In the implementation of FFT algorithm on DSP processors, it is preferable to tabulate the values of twiddle factors so that they can be looked up during the computation of FFT.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 318 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM The complexity of FFT algorithms is usually measured by the required number of arithmetic oper- ations (multiplications and additions). In practical real-time implementations with DSP processors, the architecture, instruction set, data structures, and memory organizations of the processors are critical fac- tors. For example, modern DSP processors usually provide bit-reversal addressing and a high degree of instruction parallelism to implement FFT algorithms. 6.4.2 Finite-Precision Effects From the signal-flow graph of the FFT algorithm shown in Figure 6.7, X(k) will be computed by a series of butterfly computations with a single complex multiplication per butterfly network. Note that some butterfly networks with coefficients ±1 (such as 2-point FFT in the first stage) do not require multiplication. Figure 6.7 also shows that the computation of N-point FFT requires M = log2 N stages. There are N/2 butterflies in the first stage, N/4 in the second stage, and so on. Thus, the total number of butterflies required is N 2 + N 4 +···+2 + 1 = N − 1. (6.41) The quantization errors introduced at the mth stage are multiplied by the twiddle factors at each sub- sequent stage. 
Since the magnitude of the twiddle factor is always unity, the variances of the quantization errors do not change while propagating to the output. The definition of DFT given in Equation (6.14) shows that we can scale the input sequence with the condition |x(n)| < 1 N (6.42) to prevent the overflow at the output because |e− j(2π/N)kn|=1. For example, in a 1024-point FFT, the input data must be shifted right by 10 bits (1024 = 210). If the original data is 16 bits, the effective wordlength of the input data is reduced to only 6 bits after scaling. This worst-case scaling substantially reduces the resolution of the FFT results. Instead of scaling the input samples by 1/N at the beginning of the FFT, we can scale the signals at each stage. Figure 6.6 shows that we can avoid overflow within the FFT by scaling the input at each stage by 0.5 because the outputs of each butterfly involve the addition of two numbers. This scaling process provides a better accuracy than the scaling of input by 1/N. An alternative conditional scaling method examines the results of each FFT stage to determine whether to scale the inputs of that stage. If all of the results in a particular stage have magnitude less than 1, no scaling is necessary at that stage. Otherwise, scale the inputs to that stage by 0.5. This conditional scaling technique achieves much better accuracy, however, at the cost of increasing software complexity. 6.4.3 MATLAB Implementations As introduced in Section 3.2.6, MATLAB provides the function fft with syntax y = fft(x); to compute the DFT of x(n) in the vector x. If the length of x is a power of 2, the fft function employs an efficient radix-2 FFT algorithm. Otherwise, it uses a slower mixed-radix FFT algorithm or even a DFT. An alternative way of using fft function is y = fft(x, N);JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 319 60 50 40 30 Magnitude 20 10 10 20 30 Frequency index, k Spectrum of 50 Hz sinewave 40 50 600 Figure 6.11 Spectrum of 50 Hz sinewave to specify N-point FFT. If the length of x is less than N, the vector x is padded with trailing zeros to length N. If the length of x is greater than N, the fft function only performs the FFT of the first N samples. The execution time of the fft function depends on the input data type and the sequence length. If the input data is real valued, it computes a real power-of-2 FFT algorithm that is faster than a complex FFT of the same length. The execution is fastest if the sequence length is exactly a power of 2. For example, if the length of x is 511, the function y = fft(x, 512) will be computed faster than fft(x) which performs 511-point DFT. It is important to note that the vectors in MATLAB are indexed from 1 to N instead of from 0 to N − 1 as given in the DFT and IDFT definitions. Example 6.12: Consider a sinewave of frequency f = 50 Hz expressed as x(n) = sin(2π fn/fs), n = 0, 1,...,127, where the sampling rate fs = 256 Hz. We analyze this sinewave using a 128-point FFT given in the MATLAB script (example6_12.m), and display the magnitude spectrum in Figure 6.11. It shows the frequency index k = 25 corresponding to the spectrum peak. Substituting the associated parameters into Equation (6.22), we verified that the line spectrum is corresponding to 50 Hz. 
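Before turning to the MATLAB inverse transform, the conjugation procedure of Equation (6.40) can be written directly in C. The sketch below operates on the complex_t type used earlier and assumes a forward FFT routine fft(X, N) such as the ones developed in Section 6.6; the routine name, its argument list, and ifft_via_fft are illustrative:

extern void fft(complex_t X[], int N);   /* assumed forward FFT routine */

/* Compute the IFFT of X[] in place using the forward FFT and
   Equation (6.40): conjugate, transform, scale by 1/N, conjugate. */
void ifft_via_fft(complex_t X[], int N)
{
    int n;

    for (n = 0; n < N; n++)              /* form X*(k)                   */
        X[n].im = -X[n].im;

    fft(X, N);                           /* FFT of X*(k) gives N*x*(n)   */

    for (n = 0; n < N; n++)              /* scale by 1/N and conjugate   */
    {
        X[n].re =  X[n].re / N;
        X[n].im = -X[n].im / N;
    }
}

If the time-domain signal is known to be real, the final conjugation can be skipped, as noted in Section 6.3.3.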
The MATLAB function ifft implements the IFFT algorithm as y = ifft(x); or y = ifft(x,N); The characteristics and usage of ifft are the same as those for fft.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 320 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM 6.4.4 Fixed-Point Implementation Using MATLAB MATLAB provides a function qfft for quantizing an FFT object to support fixed-point implementation. For example, the following command, F = qfft constructs a quantized FFT object F with default values. We can change the default settings by F = qfft('Property1',Value1, 'Property2',Value2, ...) to create a quantized FFT object with specific property/value pairs. Example 6.13: We can change the default 16-point FFT to 128-point FFT using the following command: F = qfft('length',128) We then obtain the following quantized FFT object in the command window: F= Radix = 2 Length = 128 CoefficientFormat = quantizer('fixed', 'round', 'saturate', [16 15]) InputFormat = quantizer('fixed', 'floor', 'saturate', [16 15]) OutputFormat = quantizer('fixed', 'floor', 'saturate', [16 15]) MultiplicandFormat = quantizer('fixed', 'floor', 'saturate', [16 15]) ProductFormat = quantizer('fixed', 'floor', 'saturate', [32 30]) SumFormat = quantizer('fixed', 'floor', 'saturate', [32 30]) NumberOfSections = 7 ScaleValues = [1] This shows that the quantized FFT is a 128-point radix-2 FFT for the fixed-point data and arithmetic. The coefficients, input, output, and multiplicands are represented using Q15 format [16 15], while the product and sum use Q30 format [32 30]. There are seven stages for N = 128, and no scaling is applied to the input at each stage by the default setting ScaleValues = [1]. We can set a scaling factor 0.5 at the input of each stage as follows: F.ScaleValues = [0.5 0.5 0.5 0.5 0.5 0.5 0.5]; Or, set different values at specific stages using different scaling factors. Example 6.14: Similar to Example 6.12, we used a quantized FFT to analyze the spectrum of sinewave. In example6_14a.m, we first generate the same sinewave as in Example 6.12, then use the following functions to compute the fixed-point FFT with Q15 format: FXk = qfft('length',128); % Create quantized FFT object qXk = fft(FXk, xn); % Compute Q15 FFT in xn vectorJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 321 When we run the MATLAB script, we receive the following warning messages reported in MATLAB command window: Warning: 1135 overflows in quantized fft. Max Min NOverflows NUnderflows NOperations Coefficient 1 -1 7 6 254 Input 0.9999 -0.9999 0 0 128 Output 2 -2 16 32 256 Multiplicand 2 -2 1063 91 3584 Product 1 -1 0 0 3584 Sum 2.414 -2.414 56 0 4480 Without proper scaling, the FFT has 1135 overflows, and thus the FFT results are wrong. We can modify the code by setting the scaling factor 0.5 at each stage as follows (see exam- ple6_14b.m): FXk = qfft('length',128); % Create quantized FFT object FXk.ScaleValues = [0.5 0.5 0.5 0.5 0.5 0.5 0.5]; % Set scaling factors qXk = fft(FXk, xn); % Compute Q15 FFT of xn vector When we run the modified program (example6_14b.m), there are no warnings or errors. The spectrum plot displayed in Figure 6.12 shows that we can perform FFT properly using 16-bit processors with adequate scaling factor at each stage. 
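Outside MATLAB, the conditional scaling method described in Section 6.4.2 amounts to a simple magnitude test before each butterfly stage. The following fixed-point C sketch assumes Q15 complex data; the complex16_t type, the half-full-scale threshold, and the function name are illustrative design choices rather than the book's experiment code:

#include <stdlib.h>                             /* for abs() */

typedef struct { short re, im; } complex16_t;   /* Q15 complex sample */

/* Test one FFT stage of N complex Q15 samples.  If any sample has
   reached half of full scale (0.5 in Q15), the additions of the next
   butterflies could overflow, so the whole stage is scaled by 0.5.
   The return value lets the caller count the total number of shifts
   so that the final spectrum can be rescaled. */
static int scale_stage_if_needed(complex16_t X[], int N)
{
    int n, need_scale = 0;

    for (n = 0; n < N; n++)
        if (abs(X[n].re) >= 0x4000 || abs(X[n].im) >= 0x4000)
        {
            need_scale = 1;
            break;
        }

    if (need_scale)
        for (n = 0; n < N; n++)
        {
            X[n].re >>= 1;                      /* scale by 0.5 */
            X[n].im >>= 1;
        }

    return need_scale;
}

Calling this test before every stage keeps the signal as large as possible, which is why conditional scaling preserves more precision than scaling every stage unconditionally.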
60 50 40 30 Magnitude 20 10 10 20 30 Frequency index, k Spectrum stimation using quantized FFT 40 50 600 Figure 6.12 Sinewave spectrum computed using the quantized 16-bit FFTJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 322 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM 6.5 Practical Applications In this section, we will introduce two important FFT applications: spectral analysis and fast convolution. 6.5.1 Spectral Analysis The inherent properties of the DFT directly affect its performance on spectral analysis. The spectrum estimated from a finite number of samples is correct only if the signal is periodic and the sample set exactly covers one or multiple period of signal. In practice, we may have to break up a long sequence into smaller segments and analyze each segment individually using the DFT. As discussed in Section 6.2, the frequency resolution of N-point DFT is fs/N. The DFT coefficients X(k) represent frequency components that are equally spaced at frequencies fk as defined in Equation (6.22). One cannot properly represent a signal component that falls between two adjacent samples in the spectrum, because its energy will spread to neighboring bins and distort their spectral amplitude. Example 6.15: In Example 6.12, the frequency resolution ( fs/N) is 2 Hz using a 128-point FFT and sampling rate 256 Hz. The line component at 50 Hz can be represented by X(k)atk = 25 as shown in Figure 6.11. Consider the case of adding another sinewave at frequency 61 Hz (see example6_15.m). Figure 6.13 shows both spectral components at 50 and 61 Hz. However, the frequency component at 61 Hz (between k = 30 and k = 31) does not show a line component because its energy spreads into adjacent frequency bins. 60 50 40 30 Magnitude 20 10 10 20 30 Frequency index, k Spectra of 50 and 61 Hz sinewaves 40 50 600 50Hz 61Hz Figure 6.13 Spectra of sinewaves at 50 and 61 HzJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 PRACTICAL APPLICATIONS 323 A solution to this spectral leakage problem is to have a finer resolution fs/N by using a larger FFT size N. If the number of data samples is not sufficiently large, the sequence may be expanded to length N by adding zeros to the tail of true data. This process is simply the interpolation of the spectral curve between adjacent frequency components. Other problems relate to the FFT-based spectral analysis including aliasing, finite data length, spectral leakage, and spectral smearing. These issues will be discussed in the following section. 6.5.2 Spectral Leakage and Resolution The data set that represents the signal of finite length N can be obtained by multiplying the signal with a rectangular window expressed as xN (n) = w(n)x(n) = x(n), 0 ≤ n ≤ N − 1 0, otherwise , (6.43) where the rectangular function w(n) is defined in Equation (4.33). As the length of the window increases, the windowed signal xN (n) becomes a better approximation of x(n), and thus X(k) becomes a better approximation of the DTFT X(ω). The time-domain multiplication given in Equation (6.43) is equivalent to the convolution in the fre- quency domain. Thus, the DFT of xN (n) can be expressed as X N (k) = W(k) ∗ X(k) = N l=−N W(k − l)X(k), (6.44) where W(k) is the DFT of the window function w(n), and X(k) is the true DFT of the signal x(n). Equation (6.44) shows that the computed spectrum X N (k) consists of the true spectrum X(k) convoluted with the window function’s spectrum W(k). 
Therefore, the computed spectrum of the finite-length signal is corrupted by the rectangular window’s spectrum. As discussed in Section 4.2, the magnitude response of the rectangular window consists of a mainlobe and several smaller sidelobes. The frequency components that lie under the sidelobes represent the sharp transition of w(n) at the endpoints. The sidelobes introduce spurious peaks into the computed spectrum, or to cancel true peaks in the original spectrum. This phenomenon is known as spectral leakage. To avoid spectral leakage, it is necessary to use different windows as introduced in Section 4.2.3 to reduce the sidelobe effects. Example 6.16: If the signal x(n) consists of a single sinusoid cos(ω0n), the spectrum of the infinite-length sampled signal is X(ω) = 2πδ(ω ± ω0), −π ≤ ω ≤ π, (6.45) which consists of two line components at frequencies ±ω0. However, the spectrum of the windowed sinusoid can be obtained as X N (ω) = 1 2[W(ω − ω0) + W(ω + ω0)], (6.46) where W(ω) is the spectrum of the window function.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 324 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM Equation (6.46) shows that the windowing process has the effect of smearing the original sharp spectral line δ(ω − ω0) at frequency ω0 and replacing it with W(ω − ω0). Thus, the power has been spread into the entire frequency range by the windowing operation. This undesired effect is called spectral smearing. Thus, windowing not only distorted the spectrum due to leakage effects, but also reduced spectral resolution. Example 6.17: Consider a signal consisting of two sinusoidal components expressed as x(n) = cos(ω1n) + cos(ω2n). The spectrum of the windowed signal is X N (ω) = 1 2[W(ω − ω1) + W(ω + ω1) + W(ω − ω2) + W(ω + ω2)], (6.47) which shows that the sharp spectral lines are replaced with their smeared versions. If the frequency separation, ω =|ω1 − ω2|, of the two sinusoids is ω ≤ 2π N (6.48) or f ≤ fs N , (6.49) the mainlobe of the two window functions W(ω − ω1) and W(ω − ω2) overlap. Thus, the two spectral lines in X N (ω) are not distinguishable. MATLAB script example6_17.m uses 128-point FFT for signal with sampling rate 256 Hz. This example shows that two sinewaves of frequencies 60 and 61 Hz are mixed. From Equation (6.49), the frequency separation 1 Hz is less than the frequency resolution 2 Hz, thus these two sinewaves are overlapped as shown in Figure 6.14. 70 50 60 40 30Magnitude 20 10 10 20 30 Frequency index, k Spectra of mixing 60 and 61 Hz sinewaves 40 50 600 Figure 6.14 Spectra of mixing sinewaves at 60 and 61 HzJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 PRACTICAL APPLICATIONS 325 To guarantee that two sinusoids appear as two distinct ones, their frequency separation must satisfy the condition ω > 2π N or f > fs N . (6.50) Thus, the minimum DFT length to achieve a desired frequency resolution is given as N > fs f = 2π ω . (6.51) In summary, the mainlobe width determines the frequency resolution of the windowed spectrum. The sidelobes determine the amount of undesired frequency leakage. The optimum window used for spectral analysis must have narrow mainlobe and small sidelobes. The amount of leakage can be substantially reduced using nonrectangular window functions introduced in Section 4.2.3 at the cost of decreased spectral resolution. For a given window length N, windows such as rectangular, Hanning, and Hamming have relatively narrow mainlobe compared with Blackman and Kaiser windows. 
Unfortunately, the first three windows have relatively high sidelobes, thus having more leakage. There is a trade-off between frequency resolution and spectral leakage in choosing windows for a given application. Example 6.18: Consider the 61 Hz sinewave in Example 6.15. We can apply the Kaiser window with N = 128 and β = 8.96 to the signal using the following commands: beta = 8.96; wn = (kaiser(N,beta))'; % Kaiser window x1n = xn.*wn; % Generate windowed sinewave The magnitude spectra of sinewaves with the rectangular and Kaiser windows are shown in Figure 6.15 by the MATLAB script example6_18.m. This shows that the Kaiser window can effectively reduce the spectral leakage. Note that the gain for using Kaiser window has been scaled up by 2.4431 in order to compensate the energy loss compared with using rectangular window. The time- and frequency-domain plots of the Kaiser window with length N = 128 and β = 8.96 are shown in Figure 6.16 using WinTool. For a given window, increasing the length of the window reduces the width of the mainlobe, which leads to better frequency resolution. However, if the signal changes frequency content over time, the window cannot be too long in order to provide a meaningful spectrum. 6.5.3 Power Spectrum Density Consider a sequence x(n) of length N whose DFT is X(k), the Parseval’s theorem can be expressed as E = N−1 n=0 |x(n)|2 = 1 N N−1 k=0 |X(k)|2. (6.52) The term |X(k)|2 is called the power spectrum that measures the power of signal at frequency fk. Therefore, squaring the DFT magnitude spectrum |X(k)| produces a power spectrum, which is also called the periodogram. The power spectrum density (PSD) (power density spectrum or simply power spectrum) characterizes stationary random processes. The PSD is very useful in the analysis of random signals since it providesJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 326 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM 60 50 40 30 Magnitude 20 10 10 20 30 Frequency index, k Spectrum of 61 Hz sinewave 40 50 600 Kaiser window Rectangular window Figure 6.15 Spectra obtained using rectangular and Kaiser windows a meaningful measure for the distribution of the average power in such signals. There are different techniques for estimating the PSD. Since the periodogram is not a consistent estimate of the true PSD, the averaging method can reduce statistical variation of the computed spectra. One way of computing the PSD is to decompose x(n) into M segments, xm(n), of N samples each. These signal segments are spaced N/2 samples apart; i.e., there is 50 % overlap between successive segments. In order to reduce spectral leakage, each xm(n) is multiplied by a nonrectangular window Time domain Frequency domain 20 40 60 Samples Leakage Factor: 0 % Mainlobe width (−3dB): 0.025391Relative sidelobe attenuation: −66 dB 80 100 120 1 0.8 0.6 0.4Amplitude 0.2 0 40 20 0 −20 −40 Magnitude (dB) −60 −80 −100 −1200 0.2 0.4 Normalized frequency (×π rad/sample) 0.6 0.8 Figure 6.16 Kaiser window of N = 128 and β = 8.96JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 PRACTICAL APPLICATIONS 327 (such as Hamming) function w(n) of length N. The PSD is a weighted sum of the periodograms of the individual overlapped segment. The MATALAB Signal Processing Toolbox provides the function psd to estimate the PSD of the signal given in the vector x using the following statement: h = spectrum.periodogram; % Create a periodogram object psd(h,x,'Fs',Fs); % Plots the two-sided PSD by default where Fs is the sampling frequency. 
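As a concrete illustration of Equation (6.52), the periodogram can be computed from the FFT output and checked against the time-domain energy. The sketch below reuses the complex_t type from the earlier sketches; the function name and the use of printf for the check are illustrative:

#include <stdio.h>

/* Periodogram P(k) = |X(k)|^2 from the N-point DFT X[] of x[], with a
   check of Parseval's theorem: sum|x(n)|^2 = (1/N) * sum|X(k)|^2. */
void periodogram(const double x[], const complex_t X[], int N, double P[])
{
    double Et = 0.0, Ef = 0.0;
    int k, n;

    for (k = 0; k < N; k++)
    {
        P[k] = X[k].re * X[k].re + X[k].im * X[k].im;   /* power at bin k     */
        Ef += P[k];
    }
    for (n = 0; n < N; n++)
        Et += x[n] * x[n];                              /* time-domain energy */

    printf("time-domain energy = %g, (1/N) x frequency-domain energy = %g\n",
           Et, Ef / N);
}

Averaging such periodograms over the 50 % overlapped, windowed segments described above gives the smoother PSD estimate produced by the psd function.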
Example 6.19: Consider the signal x(n) which consists of two sinusoids (140 and 150 Hz) and noise. This noisy signal is generated by example6_19.m adapted from the MATLAB Help menu. The PSD can be computed by creating the following periodogram object: Hs=spectrum.periodogram; The psd function can also display the PSD (Figure 6.17) as follows: psd(Hs,xn,'Fs',fs,'NFFT',1024) Note that we can specify the window used for computing PSD. For example, we can use Hamming window as follows: Hs = spectrum.periodogram('Hamming'); 0 −80 −70 −60 −50 −40 Power/frequency (dB/Hz) −30 −20 −10 0 0.05 0.1 0.15 0.2 0.25 Frequency (kHz) Power spectral density estimate via periodogram 0.3 0.35 0.4 0.45 0.5 140Hz 150Hz Figure 6.17 PSD of two sinewaves embedded in noiseJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 328 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM For a time-varying signal, it is more meaningful to compute a local spectrum that measures spectral contents over a short-time interval. We use a sliding window to break up a long sequence into several short blocks x m(n)ofN samples, and then perform the FFT to obtain the time-dependent frequency spectrum at each segment m as follows: Xm(k) = N−1 n=0 x m(n)W kn N , k = 0, 1,...,N − 1. (6.53) This process is repeated for the next block of N samples. This technique is called the short-term Fourier transform, since Xm(k) is just the spectrum of the short segment of xm(n) that lies inside the sliding window w(n). This form of time-dependent Fourier transform has several applications in speech, sonar, and radar signal processing. Equation (6.53) shows that Xm(k) is a two-dimensional sequence. The index k represents frequency, and the block index m represents segment (or time). Since the result is a function of both time and frequency, a three-dimensional graphical display is needed. This is done by plotting |Xm(k)| using gray- scale (or color) images as a function of both k and m. The resulting three-dimensional graphic is called the spectrogram. The spectrogram uses the x-axis to represent time and the y-axis to represent frequency. The gray level (or color) at point (m, k) is proportional to |Xm(k)|. The Signal Processing Toolbox provides a function spectrogram to compute spectrogram. This MATLAB function has the form B = spectrogram(a,window,noverlap,nfft,Fs); where B is a matrix containing the complex spectrogram values |Xm(k)|, and other arguments are defined in the function psd. More overlapped samples make the spectrum move smoother from block to block. It is common to pick the overlap to be around 25 %. The spectrogram function with no output arguments displays the scaled logarithm of the spectrogram in the current graphic window. Example 6.20: The MATLAB program example6_20.m loads the speech file timit2.asc, plays it using the function soundsc, and displays the spectrogram as shown in Figure 6.18. The color bar on the right side indicates the signal strength in dB. The color corresponding to the lower power in the figure represents the silence and the color corresponding to the higher power represents the speech signals. 6.5.4 Fast Convolution As discussed in Chapter 4, FIR filtering is a linear convolution of filter impulse response h(n) with the input sequence x(n). If the FIR filter has L coefficients, we need L real multiplications and L − 1 real additions to compute each output y(n). To obtain L output samples, the number of operations (multiplication and addition) needed is proportional to L2. 
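A rough, order-of-magnitude comparison makes this cost visible. The short program below contrasts the L-squared count of direct FIR filtering with the per-block cost of the FFT route, using the N log2 N figure quoted in Section 6.3 for each transform; the filter length, FFT size, and the simple counting formulas are only indicative, not exact cycle counts:

#include <stdio.h>
#include <math.h>

/* Indicative operation counts for producing roughly L output samples:
   direct convolution with an L-tap FIR filter versus FFT-based fast
   convolution with a precomputed H(k) (one FFT, N complex products,
   one IFFT). */
int main(void)
{
    int L = 1024;                          /* filter length (illustrative)   */
    int N = 2048;                          /* FFT size, N >= L + M - 1       */
    double direct = (double)L * L;         /* ~L^2 multiply-adds             */
    double fast   = 2.0 * N * log2((double)N) + N;   /* two transforms + products */

    printf("direct FIR filtering : about %.0f operations\n", direct);
    printf("FFT-based filtering  : about %.0f operations\n", fast);
    return 0;
}

For these illustrative values the FFT route needs roughly twenty times fewer operations, which is the saving exploited by the fast convolution method described next.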
To take advantage of efficient FFT and IFFT algorithms, we can use the fast convolution algorithm illustrated in Figure 6.19 for FIR filtering. Fast convolution provides a significant reduction in computational requirements for higher order FIR filters, thus it is often used to implement FIR filtering in applications having a large number of data samples. It is important to note that the fast convolution shown in Figure 6.19 produces the circular convolution discussed in Section 6.2.3. In order to produce a linear convolution, it is necessary to append zeros to the signals as shown in Example 6.10. If the data sequence x(n) has finite duration M, the first step is to pad data sequence and coefficients with zeros to a length corresponding to an allowable FFT size NJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 PRACTICAL APPLICATIONS 329 Figure 6.18 Spectrogram of speech signal (≥ L + M − 1), where L is the length of h(n). The FFT is computed for both sequences to obtain X(k) and H(k), the corresponding complex products Y(k) = X(k)H(k) are calculated, and the IFFT of Y(k) is used to obtain y(n). The desired linear convolution is contained in the first L + M − 1 terms of these results. Since the filter impulse response h(n) is known as a priori, the FFT of h(n) can be precalculated and stored as fixed coefficients. For many applications, the input sequence is very long as compared to the FIR filter length L. This is especially true in real-time applications, such as in audio signal processing where the FIR filter order is extremely high due to high-sampling rate and input data is very long. In order to use the efficient FFT and IFFT algorithms, the input sequence must be partitioned into segments of N (N > L and N is a size supported by the FFT algorithm) samples, process each segment using the FFT, and finally assemble the output sequence from the outputs of each segment. This procedure is called the block-processing operation. The cost of using this efficient block processing is the buffering delay. More complicated x(n) h(n) H(k) X(k) Y(k) y(n) FFT IFFT FFT Figure 6.19 Fast convolution algorithmJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 330 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM xm−1 (n) N − 1 ym+1 (n) ym−1 (n) xm+1 (n) xm (n) ym (n) m − 1 m − 1 m + 1 m + 1 m m L L discarded 0 Figure 6.20 Overlap data segments for the overlap-save technique algorithms have been devised to have both zero latency as direct FIR filtering and computational efficiency [10]. There are two techniques for the segmentation and recombination of the data: the overlap-save and overlap-add algorithms. Overlap-Save Technique The overlap-save process overlaps Linput samples on each segment. The output segments are truncated to be nonoverlapping and then concatenated. The following steps describe the process illustrated in Figure 6.20: 1. Apply N-point FFT to the expanded (zero-padded) impulse response sequence to obtain H (k), where k = 0, 1,...,N − 1. This process can be precalculated off-line and stored in memory. 2. Select N signal samples xm(n) (where m is the segment index) from the input sequence x(n) based on the overlap illustrated in Figure 6.20, and then use N-point FFT to obtain Xm(k). 3. Multiply the stored H (k) (obtained in Step 1) by the Xm(k) (obtained in Step 2) to get Ym(k) = H (k)Xm(k), k = 0, 1,...,N − 1. (6.54) 4. Perform N-point IFFT of Ym(k) to obtain ym(n) for n = 0, 1,...,N − 1. 5. Discard the first L samples from each IFFT output. 
The resulting segments of (N − L) samples are concatenated to produce y(n). Overlap-Add Technique The overlap-add process divides the input sequence x(n) into nonoverlapping segments of length (N − L). Each segment is zero-padded to produce xm(n) of length N. Follow the Steps 2Ð4 of the overlap-save method to obtain N-point segment ym(n). Since the convolution is the linear operation, the output sequence y(n) is the summation of all segments.JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 OVERLAP-ADD TECHNIQUE 331 MATLAB implements this efficient FIR filtering using the overlap-add technique as y = fftfilt(b, x); The fftfilt function filters the input signal in the vector x with the FIR filter described by the coefficient vector b. The function chooses an FFT and a data block length that automatically guarantees efficient execution time. However, we can specify the FFT length N by using y = fftfilt(b, x, N) Example 6.21: The speech data timit2.asc (used in Example 6.20) is corrupted by a tonal noise at frequency 1000 Hz. We design a bandstop FIR filter with edge frequencies of 900 and 1100 Hz, and filter the noisy speech using the following MATLAB script (example6_21.m): Wn = [900 1100]/4000; % Edge frequencies b = fir1(128, Wn, 'stop'); % Design bandstop filter yn = fftfilt(b, xn); % FIR filtering using fast convolution soundsc(yn, fs, 16); % Listen to the filtered signal spectrogram(yn,kaiser(256,5),200,256,fs,'yaxis') % Spectrogram MATLAB program example6_21.m plays the original speech first, plays the noisy speech that is corrupted by the 1 kHz tone, and then shows the spectrogram with the noise component (in red) at 1000 Hz. In order to attenuate that tonal noise, a bandstop FIR filter is designed (using the function fir1) to filter the noisy speech using the function fftfilt. Finally, the filter output is played and its spectrogram is shown in Figure 6.21 with the 1000 Hz tonal noise being attenuated. Figure 6.21 Spectrogram of bandstop filter outputJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 332 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM 6.6 Experiments and Program Examples In the section, we will implement the DFT and FFT algorithms for DSP applications. The computation of the DFT and FFT involves nested loops, complex multiplication, and complex twiddle-factor generation. 6.6.1 Floating-Point C Implementation of DFT For multiplying a complex data sample x(n) = xr(n) + jxi(n) and a complex twiddle factor W kn N = cos (2πkn/N) − j sin (2πkn/N) = Wr − jWi defined in Equation (6.15), the product can be expressed as x(n)W kn N = xr(n)Wr + xi(n)Wi + j[xi(n)Wr − xr(n)Wi], (6.55) where the subscripts r and i denote the real and imaginary parts of complex variable. Equation (6.55) can be rewritten as X(n) = Xr(n) + jXi(n), (6.56) where Xr(n) = xr(n)Wr + xi(n)Wi (6.57a) Xi(n) = xi(n)Wr − xr(n)Wi. (6.57b) The C program listed in Table 6.2 uses two arrays, Xin[2*N] and Xout[2*N], to hold the complex input and output samples. The twiddle factors are computed at run time. Since most of real-world applications contain only real data, it is necessary to compose a complex data set from the given real data. The simplest way is to zero out the imaginary part before calling the DFT function. This experiment computes 128-point DFT of signal given in file input.dat, and displays the spectrum using the CCS graphics. Table 6.3 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. 
Open the project file, float_dft128.pjt, and rebuild the project. 2. Run the DFT experiment using the input data file input.dat. 3. Examine the results saved in the data array spectrum[ ] using CCS graphics as shown in Figure 6.22. The magnitude spectrum shows the normalized frequencies at 0.125, 0.25, and 0.5. 4. Profile the code and record the required cycles per data sample for the floating-point implementation of DFT. 6.6.2 C55x Assembly Implementation of DFT We write assembly routines based on the C program listed in Table 6.4 to implement DFT on TMS320C55x. The sine and cosine generators for experiments given in Chapter 3 can be used to gener- ate the twiddle factors. The assembly function sine_cos.asm (see section ‘Practical Applications’ in Chapter 3) is a C-callable function that follows the C55x C-calling convention. This function has two arguments: angle and Wn. The first argument contains the input angle in radians and is passed to theJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 333 Table 6.2 List of floating-point C function for DFT #include #define PI 3.1415926536 void floating_point_dft(float Xin[], float Xout[]) { short i,n,k,j; float angle; float Xr[N],Xi[N]; float W[2]; for (i=0,k=0;k>6 ; 2*PI/N, N=128 .bss Wn,2 ; Wn[0]=Wr, Wn[1]=Wi .bss angle,1 ; Angle for sine-cosine function .text _dft_128 pshboth XAR5 ; Save AR5 bset SATD mov #N-1,BRC0 ; Repeat counter for outer loop mov #N-1,BRC1 ; Repeat counter for inner loop mov XAR0,XAR5 ; AR5 pointer to sample buffer mov XAR0,XAR3 mov #0,T2 ;k=T2=0 rptb outer_loop-1 ; for(k=0;k>1; /* Number of butterflies in sub-DFT */ U.re = 1.0; U.im = 0.; for (j=0; j>1; complex temp; /* Temporary storage of the complex variable */ for (j=0,i=1;i>=1; } j+=k; if (i>1; /* Number of butterflies in sub DFT */ U.re = 0x7fff; U.im = 0; for (j=0; j>SFT16); temp.re = _sadd(temp.re, 1)>>scale; /* Rounding & scale */ ltemp.im = _lsmpy(X[id].im, U.re); temp.im = (_smac(ltemp.im, X[id].re, U.im)>>SFT16);JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 339 Table 6.9 (continued ) temp.im = _sadd(temp.im, 1)>>scale; /* Rounding & scale */ X[id].re = _ssub(X[i].re>>scale, temp.re); X[id].im = _ssub(X[i].im>>scale, temp.im); X[i].re = _sadd(X[i].re>>scale, temp.re); X[i].im = _sadd(X[i].im>>scale, temp.im); } /* Recursive compute W^k as W*W^(k-1) */ ltemp.re = _lsmpy(U.re, W[L-1].re); ltemp.re = _smas(ltemp.re, U.im, W[L-1].im); ltemp.im = _lsmpy(U.re, W[L-1].im); ltemp.im = _smac(ltemp.im, U.im, W[L-1].re); U.re = ltemp.re>>SFT16; U.im = ltemp.im>>SFT16; } } Procedures of the experiment are listed as follows: 1. Open the project file, intrinsic_fft.pjt, and rebuild the project. 2. Run the FFT experiment using the data file input_i.dat. 3. Examine the results saved in spectrum[ ] using CCS graphics. The spectrum plot shows the normalized line frequency at 0.25. 4. Profile the intrinsics implementation of the FFT and compare the required cycles per data sample with the floating-point C FFT experiment result obtained in previous section. 6.6.5 Assembly Implementation of FFT and Inverse FFT In this experiment, we use the C55x assembly routines for computing the same radix-2 FFT algorithm implemented by the fixed-point C with intrinsics given in the previous experiment. The C55x FFT assembly routine listed in Table 6.11 follows the C55x C-calling convention. For readability, the assembly code mimics the C function closely. 
It optimizes the memory usage but not the run-time efficiency. The execution speed can be further improved by unrolling the loop and taking advantage of the FFT butterfly characteristics, but with the expense of the memory. Table 6.10 File listing for experiment exp6.6.4_intrinsics_FFT Files Description intrinsic_fftTest.c C function for testing intrinsics FFT experiment intrinsic_fft.c C function for intrinsics FFT ibit_rev.c C function performs fixed-point bit reversal intrinsic_fft.h C header file for fixed-point FFT experiment icomplex.h C header file defines fixed-point complex data type intrinsic_fft.pjt DSP project file intrinsic_fft.cmd DSP linker command file input_i.dat Data fileJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 340 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM The assembly routine defines local variables as a structure using the stack-relative addressing mode. The last memory location contains the return address of the caller function. Since the status registers ST1 and ST3 will be modified by the assembly routine, we use two stack locations to store the contents of these registers at entry, and they will be restored upon returning to the caller function. The complex temporary variable is stored in two consecutive memory locations by using a bracket with the numerical number to indicate the number of memory locations for the integer data type. Table 6.11 List of C55x assembly implementation of FFT algorithm .global _fft ARGS .set 0 ; Number of variables passed via stack FFT_var .struct ; Define local variable structure d_temp .short (2) ; Temporary variables (Re, Im) d_L .short d_N .short d_T2 .short ; Used to save content of T2 d_ST1 .short ; Used to save content of ST1 d_ST3 .short ; Used to save content of ST3 d_AR5 .short ; Used to save content of ar5 dummy .short ; Used to align stack pointer return_addr .short ; Space for routine return address Size .endstruct fft .set 0 fft .tag FFT_var .sect ".text:fft_code" _fft: aadd #(ARGS-Size+1),SP ; Adjust stack for local variables mov mmap(ST1_55),AR2 ; Save ST1,ST3 mov mmap(ST3_55),AR3 mov AR2,fft.d_ST1 mov AR3,fft.d_ST3 mov AR5,(fft.d_AR5) ; Protect AR5 btst @#0,T1,TC1 ; Check SCALE flag set mov #0x6340,mmap(ST1_55) ; Set CPL,XF,SATD,SXAM,FRCT (SCALE=1) mov #0x1f22,mmap(ST3_55) ; Set: HINT,SATA,SMUL xcc do_scale,TC1 mov #0x6300,mmap(ST1_55) ; Set CPL,XF,SATD,SXAM (SCALE=2) do_scale mov T2,fft.d_T2 ; Save T2 || mov #1,AC0 mov AC0,fft.d_L ; Initialize L=1 || sfts AC0,T0 ; T0=EXP mov AC0,fft.d_N ; N=1<>1 || sfts AC0,#-1 sub #1,AC0 ; Init mid_loop counter mov mmap(AC0L),BRC0 ; BRC0=LE1-1 sub #1,AC1 ; Initialize inner loop counter mov mmap(AC1L),BRC1 ; BRC1=(N>>L)-1 add AR1,AR0 mov #0,T2 ; j=0 || rptblocal mid_loop-1 ; for (j=0; j>#1,dual(*AR3) ; Scale X[i] by 1/SCALE mov dbl(*AR3),AC2 scale add T0,AR2 || sub dual(*AR4),AC2,AC1 ; X[id].re=X[i].re/SCALE-temp.re mov AC1,dbl(*(AR5+T0)) ; X[id].im=X[i].im/SCALE-temp.im || add dual(*AR4),AC2 ; X[i].re=X[i].re/SCALE+temp.re mov AC2,dbl(*(AR3+T0)) ; X[i].im=X[i].im/SCALE+temp.im inner_loop ; End of inner loop amar *CDP+ amar *CDP+ ; Update k for pointer to U[k] || add #2,T2 ; Update j mid_loop ; End of mid-loop sub #1,T1 add #1,fft.d_L ; Update L bcc outer_loop,T1>0 ; End of outer loop mov fft.d_ST1,AR2 ; Restore ST1,ST3,T2 mov fft.d_ST3,AR3 mov AR2,mmap(ST1_55) mov AR3,mmap(ST3_55) mov (fft.d_AR5),AR5 mov fft.d_T2,T2 aadd #(Size-ARGS-1),SP ; Reset SP retJWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 342 FREQUENCY ANALYSIS AND FAST FOURIER TRANSFORM We also 
write the bit-reversal function using C55x assembly language for improving run-time effi- ciency. Table 6.12 lists the assembly implementation of bit-reversal function. To reduce the computation of the FFT algorithm, we precalculate the twiddle factors using C function w_table.c during the setup process. In order to use the same FFT routine for the IFFT calculation, two simple changes are made. First, the conjugating twiddle factors imply the sign change of the imaginary portion of the complex sam- ples; that is, X[i].im = -X[i].im. Second, the normalization of 1/N is handled in the FFT routine by setting the scale flag to zero. Table 6.12 List of assembly implementation of bit-reversal function .global _bit_rev .sect ".text:fft_code" _bit_rev psh mmap(ST2_55) ; Save ST2 bclr ARMS ; Reset ARMS bit mov #1,AC0 sfts AC0,T0 ; T0=EXP, AC0=N=2EXP mov AC0,T0 ; T0=N mov T0,T1 add T0,T1 mov mmap(T1),BK03 ; Circular buffer size=2N mov mmap(AR0),BSA01 ; Init circular buffer base sub #2,AC0 mov mmap(AC0L),BRC0 ; Initialize repeat counter to N-1 mov #0,AR0 ; Set buffer start address mov #0,AR1 ; as offset = 0 bset AR0LC ; Enable AR0 and AR1 as bset AR1LC ; circular pointers || rptblocal loop_end-1 ; Start bit reversal loop mov dbl(*AR0),AC0 ; Get a pair of sample || amov AR1,T1 mov dbl(*AR1),AC1 ; Get another pair || asub AR0,T1 xccpart swap1,T1>=#0 || mov AC1,dbl(*AR0+) ; Swap samples if j>=i swap1 xccpart loop_end,T1>=#0 || mov AC0,dbl(*(AR1+T0B)) loop_end ; End bit reversal loop pop mmap(ST2_55) ; Restore ST2 ret The experiment computes 128-point FFT, inverse FFT, and the error between the input and the output of inverse FFT. The files used for this experiment are listed in Table 6.13. Procedures of the experiment are listed as follows: 1. Open the project file, asm_fft.pjt, and rebuild the project. 2. Run the FFT experiment using the input data file input.dat. 3. Examine the FFT and IFFT input and output, and check the input and output differences stored in the array error[ ].JWBK080-06 JWBK080-Kuo March 8, 2006 11:49 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 343 Table 6.13 File listing for experiment exp6.6.5_asm_FFT Files Description asm_fftTest.c C function for testing assembly FFT experiment fft.asm Assembly function for FFT bit_rev.asm Assembly function performs bit reversal w_table.c C function generates twiddle factors asm_fft.h C header file for fixed-point FFT experiment icomplex.h C header file defines fixed-point complex data type asm_fft.pjt DSP project file asm_fft.cmd DSP linker command file input.dat Data file 6.6.6 Implementation of Fast Convolution This experiment uses the overlap-add technique with the following steps: r Pad M (N − L) zeros to the FIR filter impulse response of length L where N > L, and process the sequence using an N-point FFT. Store the results in the complex buffer H[N]. r Segment the input sequence of length M with L − 1 zeros padded at the end. r Process each segment of data samples with an N-point FFT to obtain the complex array X[N]. r Multiply H and X in frequency domain to obtain Y. r Perform N-point IFFT to get the time-domain filtered sequence. r Add the first L samples overlapped with the previous segment to form the output. Combine all the resulting segments to obtainy(n). The C program implementation of fast convolution using FFT and IFFT is listed in Table 6.14. The files used for this experiment are listed in Table 6.15. Table 6.14 C program section for fast convolution for (i=0; i 0 0, e(n) = 0 −1, e(n) < 0 . 
(7.35) This sign operation of error signal is equivalent to a very harsh quantization of e(n). If μ is a negative power of 2, μ x(n) can be computed with a right shift of x(n). In DSP implementations, however, the conditional tests require more instruction cycles than the multiplications needed by the LMS algorithm. The sign operation can be performed on data x(n) instead of error e(n), and it results in the sign-data LMS algorithm expressed as w(n + 1) = w(n) + μ e(n)sgn [x(n)] . (7.36) Since L branch (IF-ELSE) instructions are required inside the adaptation loop to determine the signs of x(n − i), i = 0,1,...,L− 1, slower throughput than the sign-error LMS algorithm is expected. Finally, the sign operation can be applied to both e(n) and x(n), and it results in the sign-sign LMS algorithm expressed as w(n + 1) = w(n) + μ sgn [e(n)] sgn [x(n)] . (7.37) This algorithm requires no multiplication, and is designed for VLSI or ASIC implementation to save multiplications. It is used in the adaptive differential pulse code modulation (ADPCM) for speech compression. Some practical applications such as modems and frequency-domain adaptive filtering require complex operations for maintaining their phase relationships. The complex adaptive filter uses the complex input vector x(n) and complex coefficient vector w(n) expressed as x(n) = xr(n) + jxi(n) (7.38) and w(n) = wr(n) + jwi(n), (7.39) where the subscripts r and i denote the real and imaginary, respectively.JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 362 ADAPTIVE FILTERING The complex output y(n) is computed as y(n) = wT (n)x(n), (7.40) where all multiplications and additions are complex operations. The complex LMS algorithm adapts the real and imaginary parts of w(n) simultaneously, and is expressed as w(n + 1) = w(n) + μe(n)x∗(n), (7.41) where ∗ denotes a complex conjugate such that x∗(n) = xr(n) − j xi(n). An example of decomposing complex calculations into real-number operations can be found in Section 7.6.7 for adaptive channel equalizer. 7.3 Performance Analysis In this section, we briefly discuss important properties of the LMS algorithm such as stability, convergence rate, and excess MSE due to gradient estimation. 7.3.1 Stability Constraint As shown in Figure 7.5, the LMS algorithm involves the presence of feedback. Thus, the algorithm is subject to the possibility of becoming unstable. From Equation (7.30), we observe that the parameter μ determines the step size of correction applied to the weight vector. The convergence of the LMS algorithm must satisfy 0 <μ< 2 λmax , (7.42) where λmax is the largest eigenvalue of the autocorrelation matrix R defined in Equation (7.22). The computation of λmax is difficult when L is large. In practical applications, it is desirable to estimate λmax using a simple method. From Equation (7.22), we have λmax ≤ L−1 l=0 λl = Lrxx(0) = LPx , (7.43) where Px ≡ rxx(0) = E  x2(n)  (7.44) denotes the power of x(n). Therefore, setting 0 <μ< 2 LPx (7.45) assures that Equation (7.42) is satisfied.JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 PERFORMANCE ANALYSIS 363 Equation (7.45) provides important information on selecting μ: 1. The upper bound on μ is inversely proportional to filter length L, thus a small μ is used for a higher order filter. 2. Since μ is inversely proportional to the input signal power, low-power signals can use larger μ.We can normalize μ with respect to Px for choosing step size that is independent of signal power. 
The resulting algorithm is called the normalized LMS (NLMS) algorithm, which will be discussed later. 7.3.2 Convergence Speed Convergence of the weight vector w(n) from w(0) to w◦ corresponds to the convergence of the MSE from ξ(0) to ξmin. Therefore, convergence of the MSE toward its minimum value is a commonly used performance measurement in adaptive systems because of its simplicity. A plot of the MSE versus time n is referred to as the learning curve. Since the MSE is the performance criterion of the LMS algorithms, the learning curve is a natural way to describe the transient behavior. Each adaptive mode has its own time constant, which is determined by μ and the eigenvalue λl associated with that mode. Thus, the overall convergence is clearly limited by the slowest mode, and can be approximated as τmse ∼= 1 μλmin , (7.46) where λmin is the minimum eigenvalue of the R matrix. Because τmse is inversely proportional to μ,we have a large τmse (slow convergence) when μ is small. The maximum time constant τmse = 1/μλmin is a conservative estimate in practical applications since only large eigenvalues will exert significant influence on the convergence time. If λmax is very large, only a small μ can satisfy the stability constraint. If λmin is very small, the time constant can be very large, resulting in very slow convergence. The slowest convergence occurs for μ = 1/λmax. Substituting this smallest step size into Equation (7.46) results in τmse ≤ λmax λmin . (7.47) Therefore, the speed of convergence is dependent on the ratio of the maximum to minimum eigenvalues of the matrix R. The eigenvalues λmax and λmin are very difficult to compute if the order of filter is high. An efficient way is to approximate the eigenvalue spread by the spectral dynamic range expressed as λmax λmin ≤ max |X(ω)|2 min |X(ω)|2 , (7.48) where X(ω) is DTFT of x(n). Therefore, an input signal with a flat spectrum such as a white noise will have the fast convergence speed. 7.3.3 Excess Mean-Square Error The steepest-descent algorithm defined in Equation (7.26) requires knowledge of the true gradient ∇ξ(n), which must be estimated for each iteration. After the algorithm converges, the gradient ∇ξ(n) = 0;JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 364 ADAPTIVE FILTERING however, the gradient estimator ∇ ˆξ(n) = 0. As indicated by Equation (7.26), this will cause w(n) to vary randomly around w◦, thus producing excess noise at the filter output. The excess MSE, which is caused by random noise in the weight vector after convergence, can be approximated as ξexcess ≈ μ 2 LPx ξmin. (7.49) This approximation shows that the excess MSE is directly proportional to μ. The larger step size μ results in faster convergence at the cost of steady-state performance. Therefore, there is a design trade-off between the excess MSE and the speed of convergence for determining μ. The optimal step size μ is difficult to determine. Improper selection of μ might make the convergence speed unnecessarily slow or introduce more excess MSE in steady state. If the signal is nonstationary and real-time tracking capability is crucial for a given application, we may choose a larger μ. If the signal is stationary and convergence speed is not important, we can use a smaller μ to achieve bet- ter steady-state performance. In some practical applications, we can use a larger μ at the beginning of the operation for faster convergence, and then change to smaller μ to achieve better steady-state performance. 
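In code, this guidance reduces to estimating the input power and backing the step size away from the bound of Equation (7.45). A minimal floating-point sketch, assuming a block estimate of Px and an illustrative safety factor:

/* Choose a step size below the bound 0 < mu < 2/(L*Px) of Equation
   (7.45).  SAFETY < 1 keeps mu well inside the stable region; the
   small constant guards against a near-zero power estimate. */
#define SAFETY  0.1

double estimate_power(const double x[], int M)     /* Px ~ E[x^2(n)] */
{
    double px = 0.0;
    int n;

    for (n = 0; n < M; n++)
        px += x[n] * x[n];
    return px / M;
}

double choose_step_size(const double x[], int M, int L)
{
    double Px = estimate_power(x, M);
    return SAFETY * 2.0 / (L * Px + 1e-12);
}

A larger safety factor can be used during initial convergence and a smaller one afterwards, following the two-stage strategy described above.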
The excess MSE, ξexcess, expressed in Equation (7.49) is also proportional to the filter length L, which means that a larger L results in higher algorithm noise. From Equation (7.45), a larger L implies that a smaller μ is required, thus resulting in slower convergence. On the other hand, a large L also implies better filter characteristics. Again, there exists an optimum filter length L for a given application. 7.3.4 Normalized LMS Algorithm The stability, convergence speed, and fluctuation of the LMS algorithm are governed by the step size μ and the input signal power. As shown in Equation (7.45), the maximum stable step size μ is inversely proportional to the filter length L and the signal power. One important technique to optimize the speed of convergence while maintaining the desired steady-state performance is the NLMS: w(n + 1) = w(n) + μ(n)x(n)e(n), (7.50) where μ(n) is a normalized step size that is computed as μ(n) = α LˆPx (n) , (7.51) where ˆPx (n) is an estimate of the power of x(n) at time n, and 0 <α<2 is a constant. Some useful implementation considerations are given as follows: 1. Choose ˆPx (0) as the best a priori estimate of the input signal power. 2. A software constraint is required to ensure that μ(n) is bounded if ˆPx (n) is very small when the signal is absent. 7.4 Implementation Considerations In many real-world applications, adaptive filters are implemented on fixed-point processors. It is important to understand the finite wordlength effects of adaptive filters in meeting design specifications.JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 IMPLEMENTATION CONSIDERATIONS 365 7.4.1 Computational Issues The coefficient update defined in Equation (7.33) requires L+ 1 multiplications and L additions if we multiply μ*e(n) outside the loop. Given the input vector x(n) stored in the array x[ ], the error signal en, the weight vector w[ ], the step size mu, and the filter length L, Equation (7.33) can be implemented in C language as follows: uen=mu*en; // u*e(n) outside the loop for (l=0; lorder; w = &lms->w[0]; x = &lms->x[0]; // Update signal buffer for(j=n-1; j>0; j--) { x[j] = x[j-1]; } x[0] = lms->in; // Compute filter output - Equation (7.31) y = 0.0; for(j=0; jout = v; // Compute error signal - Equation (7.32) lms->err = lms->des - y; // Coefficients update - Equation (7.33) ue = lms->err * lms->mu; for(j=0; jorder; w = &lms->w[0]; x = &lms->x[0]; // Update data delay line for(j=n-1; j>0; j--) { x[j] = x[j-1]; } x[0] = lms->in; // Get adaptive filter output - Equation (7.31) temp32 = (long)w[0] * x[0]; for(j=1; jout = (short)((temp32+ROUND)>>15); // Compute error signal - Equation (7.32) lms->err = lms->des - lms->out; // Coefficients update - Equation (7.56) ue = (long)(((lms->err * (long)lms->mu)+ROUND)>>15); for(j=0; jleaky * w[j])+ROUND)>>15; w[j] = (short)temp32 + (short)(((ue * x[j])+ROUND)>>15); } } ETSI functions by mapping them directly to its intrinsics. Table 7.6 lists the ETSI operators and their corresponding intrinsics for the C55x. The C55x implementation of the NLMS algorithm using ETSI operators is listed in Table 7.7. Table 7.8 lists the files used for this experiment. 
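Before mapping the algorithm to intrinsics, it is helpful to see the NLMS update of Equations (7.50) and (7.51) in plain floating-point C. In the sketch below the power estimate is smoothed recursively, as in the experiment code; the constants ALPHA, BETA, and DELTA are illustrative design values:

#define ALPHA  0.5            /* normalized step size, 0 < alpha < 2       */
#define BETA   0.9            /* smoothing factor for the power estimate   */
#define DELTA  1e-6           /* keeps mu bounded when the input is absent */

/* One NLMS update: x[0] is the newest input sample, x[L-1] the oldest,
   e is the current error e(n), and *Px holds the running power estimate. */
void nlms_update(double w[], const double x[], int L, double e, double *Px)
{
    double mu;
    int j;

    *Px = BETA * (*Px) + (1.0 - BETA) * x[0] * x[0];   /* power estimate  */
    mu  = ALPHA / (L * (*Px) + DELTA);                 /* Equation (7.51) */

    for (j = 0; j < L; j++)                            /* Equation (7.50) */
        w[j] += mu * e * x[j];
}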
Procedures of the experiment are listed as follows: Table 7.5 File listing for experiment exp7.6.2_fixPoint_LeakyLMS Files Description fixPoint_leaky_lmsTest.c C function for testing leaky LMS experiment fixPoint_leaky_lms.c C function for fixed-point leaky LMS algorithm fixPoint_leaky_lms.h C header file fixPoint_leaky_lms.pjt DSP project file fixPoint_leaky_lms.cmd DSP linker command file input.pcm Input signal file desired.pcm Desired signal fileJWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 382 ADAPTIVE FILTERING Table 7.6 C55x ETSI functions and corresponding intrinsic functions ETSI function Intrinsics representation Description L_add(a,b) _lsadd((a),(b)) Add two 32-bit integers with SATD set, producing a saturated 32-bit result. L_sub(a,b) _lssub((a),(b)) Subtract b from a with SATD set, producing a saturated 32-bit result. L_negate(a) _lsneg(a) Negate the 32-bit value with saturation._lsneg (0x80000000)=> 0x7FFFFFFF L_deposite_h(a) (long)(a<<16) Deposit the 16-bit a into the 16 MSB of a 32-bit output and the 16 LSB of the output are zeros. L_deposite_l(a) (long)a Deposit the 16-bit a into the 16 LSB of a 32-bit output and the 16 MSB of the output are sign extended. L_abs(a) _labss((a)) Create a saturated 32-bit absolute value._labss(0x8000000)=> 0x7FFFFFFF (SATD is set.) L_mult(a,b) _lsmpy((a),(b)) Multiply a and b and shift the result left by 1. Produce a saturated 32-bit result. (SATD and FRCT are set.) L_mac(a,b,c) _smac((a),(b),(c)) Multiply b and c, shift the result left by 1, and add it to a. Produce a saturated 32-bit result. (SATD, SMUL, and FRCT are set.) L_macNs(a,b,c) L_add_c((a),L_mult((b),(c))) Multiply b and c, shift the result left by 1, add the 32 bit result to a without saturation L_msu(a,b,c) _smas((a),(b),(c)) Multiply b and c, shift the result left by 1, and subtract it from a. Produce a 32-bit result. (SATD, SMUL, and FRCT are set.) L_msuNs(a,b,c) L_sub_c((a),L_mult((b),(c))) Multiply b and c, shift the result left by 1, and subtract it from a without saturation. L_shl(a,b) _lsshl((a),(b)) Shift a to left by b and produce a 32-bit result. The result is saturated if b is less than or equal to 8. (SATD is set.) L_shr(a,b) _lshrs((a),(b)) Shift a to right by b and produce a 32-bit result. Produce a saturated 32-bit result. (SATD is set.) L_shr_r(a,b) L_crshft_r((a),(b)) Same as L_shr(a,b) but with rounding. abs_s(a) _abss((a)) Create a saturated 16-bit absolute value._abss (0x8000)=> 0x7 FFF (SATA is set.) add(a,b) _sadd((a),(b)) Add two 16-bit integers with SATA set, producing a saturated 16-bit result. sub(a,b) _ssub((a),(b)) Subtract b from a with SATA set, producing a saturated 16-bit result. extract_h(a) (unsigned short)((a)>>16) Extract the upper 16-bit of the 32-bit a. extract_l(a) (short)a Extract the lower 16-bit of the 32-bit a. round(a) (short)_rnd(a)>>16 Round a by adding 215. Produce a 16-bit saturated result. (SATD is set.) mac_r(a,b,c) (short)(_smacr((a), (b),(c))>>16) Multiply b and c, shift the result left by 1, add the result to a, and then round the result by adding 215. (SATD, SMUL, and FRCT are set.)JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 383 Table 7.6 (continued) ETSI function Intrinsics representation Description msu_r(a,b,c) (short)(_smasr((a),(b),(c))>>16) Multiply b and c, shift the result left by 1, subtract the result from a, and then round the result by adding 215. (SATD, SMUL, and FRCT are set.) 
mult_r(a,b) (short)(_smpyr((a),(b))>>16) Multiply a and b, shift the result left by 1, and round by adding 215 to the result. (SATD and FRCT are set.) mult(a,b) _smpy((a),(b)) Multiply a and b and shift the result left by 1. Produce a saturated 16-bit result. (SATD and FRCT are set.) norm_l(a) _lnorm(a) Produce the number of left shifts needed to normalize a. norm_s(a) _norm(a) Produce the number of left shifts needed to normalize a. negate(a) _sneg(a) Negate the 16-bit value with saturation. _sneg (0xffff8000)=> 0x00007FFF shl(a,b) _sshl((a),(b)) Shift a to left by b and produce a 16-bit result. The result is saturated if b is less than or equal to 8. (SATD is set.) shr(a,b) _shrs((a),(b)) Shift a to right by b and produce a 16-bit result. Produce a saturated 16-bit result. (SATD is set.) shr_r(a,b) crshft((a),(b)) Same as shr(a,b)but with rounding. shift_r(a,b) shr_r((a),-(b)) Same as shl(a,b)but with rounding. div_s(a,b) divs((a),(b)) Produces a truncated positive 16-bit result which is the fractional integer division of a by b, a and b must be positive and b ≥ a. 1. Open the project file, ETSI_nlms.pjt, and rebuild the project. 2. Run the experiment using data files input.pcm and desired.pcm. 3. Compare the results of ETSI (intrinsics) implementation with the fixed-point C implementation in terms of convergence speed and steady-state MSE. 4. Profile the ETSI (intrinsics) implementation of the NLMS algorithm. How many cycles per data sample are needed using the intrinsics? 7.6.4 Assembly Language Implementation of Delayed LMS Algorithm The TMS320C55x has a powerful assembly instruction, LMS, for implementing the delayed LMS algo- rithm. This instruction utilizes the high parallelism of the C55x architecture to perform the followingJWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 384 ADAPTIVE FILTERING Table 7.7 Implementation of NLMS algorithm using C55x intrinsics void intrinsic_nlms(LMS *lmsObj) { LMS *lms=(LMS *)lmsObj; long temp32; short j,n,mu,ue,*x,*w; n = lms->order; w = &lms->w[0]; x = &lms->x[0]; // Update signal buffer for(j=n-1; j>0; j--) { x[j] = x[j-1]; } x[0] = lms->in; // Compute normalized mu temp32 = mult_r(lms->x[0],lms->x[0]); temp32 = mult_r((short)temp32, ONE_MINUS_BETA); lms->power = mult_r(lms->power, BETA); temp32 = add(lms->power, (short)temp32); temp32 = add(lms->c, (short)temp32); mu = lms->alpha / (short)temp32; // Compute filter output - Equation (7.31) temp32 = L_mult(w[0], x[0]); for(j=1; jout = round(temp32); // Compute error signal - Equation (7.32) lms->err = sub(lms->des, lms->out); // Coefficients update - Equation (7.50) ue = mult_r(lms->err, mu); for(j=0; j0; i--) // Update signal buffer x1[i] = x1[i-1]; // of unknown system // Adaptive system identification operation x[0]=input; // Get input signal x(n) y = 0.0; for (i=0; i0; i--) // Update signal buffer x[i] = x[i-1]; // of adaptive filter The unknown system for this experiment is an FIR filter with the filter coefficients given in plant[ ]. The input x(n) is a zero-mean white noise. The unknown system’s output d(n) is used as the desired signal for the adaptive filter, and the adaptive filter coefficients in w[i] will match closely to the unknown system response after the convergence of adaptive filter. The adaptive LMS algorithm used for system identification is listed in Table 7.11. First, the signal and coefficient buffers are initialized to zero. The random signal used for the experiment is generated in Ns samples per block. 
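Putting these pieces together, the block-processing loop of the system identification experiment can be sketched in plain C. This is an illustrative rewrite of the structure described above, not the project code of Tables 7.10 and 7.11; the argument names follow the text (plant, w, x, x1):

/* Adaptive system identification: the same input drives the unknown
   FIR plant (coefficients plant[], length N1) and the adaptive FIR
   filter (coefficients w[], length L).  The plant output d(n) is the
   desired signal; the LMS update drives w[] toward the plant. */
void sys_identify(const double input[], int Ns,
                  const double plant[], int N1, double x1[],
                  double w[], int L, double x[],
                  double mu, double err[])
{
    int n, i;
    double d, y, ue;

    for (n = 0; n < Ns; n++)
    {
        for (i = N1 - 1; i > 0; i--)  x1[i] = x1[i-1];  /* plant buffer   */
        for (i = L - 1;  i > 0; i--)  x[i]  = x[i-1];   /* filter buffer  */
        x1[0] = x[0] = input[n];

        d = 0.0;                                /* desired = plant output */
        for (i = 0; i < N1; i++)  d += plant[i] * x1[i];

        y = 0.0;                                /* adaptive filter output */
        for (i = 0; i < L; i++)   y += w[i] * x[i];

        err[n] = d - y;                         /* error signal e(n)      */
        ue = mu * err[n];
        for (i = 0; i < L; i++)   w[i] += ue * x[i];    /* LMS update     */
    }
}

After convergence the coefficients in w[] approximate the plant response, which is what the experiment compares against the unknown plant coefficients.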
The adaptive filter of the system identification program uses the unknown plant output d(n) as the desired signal to calculate the error signal. After convergence, the adaptive filter with N coefficients in w[ ] models the unknown system in the form of an FIR filter. The files used for this experiment are listed in Table 7.12. This experiment is implemented using block processing. Procedures of the experiment are listed as follows:

1. Open the project file, sysIdentify.pjt, and rebuild the project.

2. Run the system identification experiment using the data file x.pcm. The experiment will write the result to the text file result.txt in the data directory.

3. Compare the system identification result in result.txt with the unknown plant given by unknow_plant.dat.

4. Use MATLAB to design a bandpass FIR filter and rerun the system identification experiment using this bandpass FIR filter as the unknown plant.

5. Increase the adaptive filter length L to L = 2N1, where N1 is the length of the unknown system. Build the project and run the program again. Check the experiment results. Will the adaptive model identify the unknown plant?

6. Reduce the adaptive filter length to L = N1/2, where N1 is the length of the unknown system. Build the project and run the program again. Check the experiment results. Will the adaptive model identify the unknown plant?

7. Use MATLAB to design a second-order bandpass IIR filter and rerun the system identification experiment using this bandpass IIR filter as the unknown plant. What is the system identification result for the IIR unknown plant?

7.6.6 Adaptive Prediction and Noise Cancelation

As shown in Figure 7.9, the primary signal x(n) consists of the broadband component v(n) and the narrowband component s(n). The output of the adaptive filter is the narrowband signal y(n) ≈ s(n). For applications such as spread spectrum communications, the narrowband interference can be tracked and removed by the adaptive filter. The error signal e(n) ≈ v(n) contains the desired broadband signal.
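As a companion to this description, a minimal floating-point sketch of the adaptive line enhancer of Figure 7.9 is given below. It assumes an FIR predictor of length L, a decorrelation delay D, and the standard LMS update; the buffer layout and function name are illustrative, not taken from the experiment code.

/* Hedged sketch of an adaptive line enhancer: the adaptive filter predicts
   x(n) from the delayed samples x(n-D), ..., x(n-D-L+1), so y(n) tracks the
   narrowband component and e(n) keeps the broadband component.            */
float ale_sample(float xn, float *buf, float *w, int L, int D, float mu,
                 float *y_out)
{
    int i;
    float y = 0.0f, e;

    /* Shift the delay line holding x(n), x(n-1), ..., x(n-D-L+1). */
    for (i = L + D - 1; i > 0; i--)
        buf[i] = buf[i - 1];
    buf[0] = xn;

    /* Predictor output y(n) = sum of w[i] * x(n-D-i). */
    for (i = 0; i < L; i++)
        y += w[i] * buf[D + i];

    e = xn - y;                       /* e(n) ~ broadband component v(n)   */
    for (i = 0; i < L; i++)           /* LMS coefficient update            */
        w[i] += mu * e * buf[D + i];

    *y_out = y;                       /* y(n) ~ narrowband component s(n)  */
    return e;
}

The delay D decorrelates the broadband noise between the primary and delayed paths, so the predictor can only reproduce the narrowband (periodic) component; this is why e(n) retains the broadband signal.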
WeJWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 389 Table 7.11 List of C55x assembly code for adaptive system identification _sysIdentification: pshm ST1_55 ; Save ST1, ST2, and ST3 pshm ST2_55 pshm ST3_55 mov dbl(*AR0(#2)),XAR1 ; AR1 is desired signal pointer mov dbl(*AR0(#4)),XAR2 ; AR2 is signal buffer pointer mov dbl(*AR0(#6)),XAR3 ; AR3 is coefficient buffer pointer mov *AR0(#8),T0 ; T0 number of samples in input buffer mov *AR0(#9),T1 ; T1 adaptive filter length mov mmap(AR3),BSA45 mov mmap(T1),BK47 mov mmap(AR2),BSA23 mov mmap(T1),BK03 mov *AR0(#10),AR3 ; AR3 -> x[] as circular buffer mov #0,AR4 ; AR4 -> w[] as circular buffer mov dbl(*AR0),XAR0 ; AR0 is input pointer or #0x340,mmap(ST1_55 ; Set FRCT,SXMD,SATD or #0x18,mmap(ST2_55) ; Enable circular addressing mode bset SATA ; Set SATA sub #1,T0 mov mmap(T0),BRC0 ; Set sample block loop counter sub #2,T1 mov mmap(T1),BRC1 ; Counter for LMS update loop mov mmap(T1),CSR ; Counter for FIR filter loop rptblocal loop-1 ; for (n=0; n x[] as circular buffer mov #0,AR4 ; AR4 -> w[] as circular buffer mov dbl(*AR0),XAR0 ; AR0 point to in[] mov mmap(ST1_55),AC0 ; Save ST1, ST2, and ST3 mov AC0,ale.d_ST1 mov mmap(ST2_55),AC0 mov AC0,ale.d_ST2 mov mmap(ST3_55),AC0 mov AC0,ale.d_ST3 or #0x340,mmap(ST1_55) ; Set FRCT,SXMD,SATD or #0x18,mmap(ST2_55) ; Enable circular addressing mode bset SATA ; Set SATA sub #1,T0 mov mmap(T0),BRC0 ; Set sample block loop counter sub #2,T1 mov mmap(T1),BRC1 ; Counter for LMS update loop mov mmap(T1),CSR ; Counter for FIR filter loop mov #ALPHA,T0 ; T0=leaky alpha || rptblocal loop-1 ; for (n=0; n0; j--) { x[j] = x[j-1]; } x[0] = *rx; // Compute normalized mu from I-symbol temp32 = (((long)x[0].re * x[0].re)+0x4000)>>15; temp32 = ((temp32 * ONE_MINUS_BETA)+0x4000)>>15; power.re = (short)(((power.re * (long)BETA)+0x4000)>>15); temp32 += (power.re+C); temp32 >>= 5; mu.re = ALPHA / (short)temp32; // Compute normalized mu from Q-symbol temp32 = (((long)x[0].im * x[0].im)+0x4000)>>15; temp32 = ((temp32 * ONE_MINUS_BETA)+0x4000)>>15; power.im = (short)(((power.im * (long)BETA)+0x4000)>>15); temp32 += (power.im+C); temp32 >>= 5; mu.im = ALPHA / (short)temp32; // Get the real adaptive filter output from complex symbols temp32 = (long)w[0].re * x[0].re; temp32 -= (long)w[0].im * x[0].im; for(j=1; j>15); // Get the image adaptive filter output from complex symbols temp32 = (long)w[0].im * x[0].re; temp32 += (long)w[0].re * x[0].im; for(j=1; j>15); // Compute error term from complex data err.re = rxDesire[txCnt].re - y.re; err.im = rxDesire[txCnt++].im - y.im; // Coefficients update - using complex error and data urer = (long)(((err.re * (long)mu.re)+ROUND)>>15); urei = (long)(((err.im * (long)mu.re)+ROUND)>>15); uier = (long)(((err.re * (long)mu.im)+ROUND)>>15); uiei = (long)(((err.im * (long)mu.im)+ROUND)>>15); for(j=0; j>15); w[j].re -= (short)temp32; } for(j=0; j>15); w[j].im -= (short)temp32; } // Return the output and error *error = err; *out = y; } Procedures of the experiment are listed as follows: 1. Open the project file, channel_equalizer.pjt, and rebuild the project. 2. Run the adaptive channel equalizer experiment. This experiment will output the error signal to file error.bin in the data directory. 3. Plot the error signal to verify the convergence of the equalizer. 4. Change the filter length and observe the behavior of adaptive equalizer. 
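Because the loop bounds and several intermediate terms of the equalizer listing above were lost in typesetting, the sketch below shows a generic floating-point complex LMS tap update, w[j] ← w[j] + μ e x*[j], for reference only. The fixed-point listing additionally keeps separate normalized step sizes for the I and Q branches and rounds every product to Q15; those refinements, and its exact sign convention, are omitted here.

/* Hedged sketch of a generic complex LMS coefficient update.
   Types and names are illustrative.                                       */
typedef struct { float re, im; } cplx;

void complex_lms_update(cplx *w, const cplx *x, int L, cplx e, float mu)
{
    int j;
    for (j = 0; j < L; j++) {
        float gr = e.re * x[j].re + e.im * x[j].im;  /* Re{ e * conj(x[j]) } */
        float gi = e.im * x[j].re - e.re * x[j].im;  /* Im{ e * conj(x[j]) } */
        w[j].re += mu * gr;
        w[j].im += mu * gi;
    }
}

The filter output in the listing is formed in the same complex manner: the real part accumulates w[j].re*x[j].re - w[j].im*x[j].im and the imaginary part accumulates w[j].im*x[j].re + w[j].re*x[j].im, as the visible fragments of the code show.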
7.6.8 Real-Time Adaptive Line Enhancer Using DSK In this experiment, we will port the adaptive predictor experiment in Section 7.6.6 to the C5510 DSK to examine the real-time behavior. There are two signal files: one is a single tone corrupted by white noise, and the other consists of repeated telephone digits corrupted by white noise. The input data files can be played back via an audio player that supports WAV file format. The DSK takes the input, processes it, and sends the output to a headphone or loudspeaker for play back. Figure 7.21 shows the spectrum of the input signal, and Figure 7.22 is the output captured in real time by an audio sound card. It can be seen Table 7.16 File listing for exp7.6.7_channel_equalizer Files Description eqTest.c C function for testing adaptive equalizer experiment adaptiveEQ.c C function for implementing adaptive equalizer channel.c C function simulates communication channel signalGen.c C function generates training sequence complexEQ.h C header file channel_equalizer.pjt DSP project file channel_equalizer.cmd DSP linker command fileJWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 395 Figure 7.21 Spectrum of the signal corrupted by broadband noise Figure 7.22 Spectrum of the adaptive line predictor output. The broadband noise has been reducedJWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 396 ADAPTIVE FILTERING Table 7.17 File listing for exp7.6.8_realtime_predictor Files Description rt_realtime_predictor.c C function for testing line enhancer experiment adaptivePredictor.asm Assembly function for adaptive line enhancer plio.c C function interfaces PIP with low-level I/O functions adaptive_predictor.h C header file for experiment plio.h C header file for PIP driver lio.h C header file for interfacing PIP with low-level drivers rt_adaptivePredictor.pjt DSP project file rt_adaptivePredictorcfg.cmd DSP linker command file rt_adaptivePredictor.cdb DSP/BIOS configuration file tone_1khz_8khz_noise.wav Data file Ð tone with noise multitone_noise_8khz.wav Data file Ð multitone with noise from Figure 7.22 that the wideband noise has been greatly reduced by the 128-tap adaptive line enhancer. The files used for this experiment are listed in Table 7.17. Procedures of the experiment are listed as follows: 1. Open the project file, rt_adaptivePredictor.pjt, and rebuild the project. 2. Run the adaptive line enhancer using the C5510 DSK. Connect the audio player output to the DSK line-in. Use a headphone to listen to the result from the DSK headphone output. 3. Change the adaptive filter length, step size, and evaluate the behavior changes of the adaptive line enhancer. 4. Capture the input and output signals using a digital scope and evaluate the adaptive filter performance by examining the time-domain waveform and frequency-domain noise level before and after applying the adaptive line enhancer. References [1] S. T. Alexander, Adaptive Signal Processing, New York: Springer-Verlag, 1986. [2] M. Bellanger, Adaptive Digital Filters and Signal Analysis, New York: Marcel Dekker, 1987. [3] P. M. Clarkson, Optimal and Adaptive Signal Processing, Boca Raton, FL: CRC Press, 1993. [4] C. F. N. Cowan and P. M. Grant, Adaptive Filters, Englewood Cliffs, NJ: Prentice Hall, 1985. [5] J. R. Glover, Jr., ‘Adaptive noise canceling applied to sinusoidal interferences,’ IEEE Trans. Acoust., ASSP-25, pp. 484Ð491, Dec. 1977. [6] S. Haykin, Adaptive Filter Theory, 2nd Ed., Englewood Cliffs, NJ: Prentice Hall, 1991. [7] S. M. Kuo and C. 
Chen, ‘Implementation of adaptive filters with the TMS320C25 or the TMS320C30,’ in Digital Signal Processing Applications with the TMS320 Family, vol. 3, P. Papamichalis, Ed., Englewood Cliffs, NJ: Prentice Hall, 1990, pp. 191Ð271, Chap. 7. [8] S. M. Kuo and D. R. Morgan, Active Noise Control Systems Ð Algorithms and DSP Implementations, New York: John Wiley & Sons, Inc., 1996. [9] L. Ljung, System Identification: Theory for the User, Englewood Cliffs, NJ: Prentice Hall, 1987. [10] J. Makhoul, ‘Linear prediction: A tutorial review,’ Proc. IEEE, vol. 63, pp. 561Ð580, Apr. 1975.JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 EXERCISES 397 [11] J. R. Treichler, C. R. Johnson, Jr., and M. G. Larimore, Theory and Design of Adaptive Filters, New York: John Wiley & Sons, Inc., 1987. [12] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hern, J. R. Zeidler, E. Dong, and R. C. Goodlin, ‘Adaptive noise canceling: Principles and applications,’ Proc. IEEE, vol. 63, pp. 1692Ð1716, Dec. 1975. [13] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1985. [14] M. L. Honig and D. G. Messerschmitt, Adaptive Filters: Structures, Algorithms, and Applications, Boston, MA: Kluwer Academic Publishers, 1986. [15] MathWorks, Inc., Using MATLAB, Version 6, 2000. [16] MathWorks, Inc., Signal Processing Toolbox User’s Guide, Version 6, 2004. [17] MathWorks, Inc., Filter Design Toolbox User’s Guide, Version 3, 2004. [18] MathWorks, Inc., Fixed-Point Toolbox User’s Guide, Version 1, 2004. [19] MathWorks, Inc., Communications Toolbox User’s Guide, Version 3, 2005. [20] ITU Recommendation V.29, 9600 Bits Per Second Modem Standardized for Use on Point-to-Point 4-Wire Leased Telephone-Type Circuits, Nov. 1988. Exercises 1. Determine the autocorrelation function of the following signals: (a) x(n) = A sin(2πn/N), (b) y(n) = A cos(2πn/N). 2. Find the crosscorrelation functions rxy(k) and ryx(k), where x(n) and y(n) are defined in the Prob- lem 1. 3. Let x(n) and y(n) be two independent zero-mean WSS random signals. The random signal w(n)is obtained by using w(n) = ax(n) + by(n), where a and b are constants. Express rww(k), rwx (k), and rwy(k) in terms of rxx(k) and ryy(k). 4. Similar to Example 7.7, the desired signal d(n) is the output of the FIR filter with coefficients 0.2, 0.5, and 0.3 when the input x(n) is zero-mean, unit-variance white noise. This white noise is also used as the input signal for the adaptive FIR filter with L = 3 using the LMS algorithm. Compute R, p, wo, and minimum MSE. 5. Consider a second-order autoregressive (AR) process defined by d(n) = v(n) − a1d(n − 1) − a2d(n − 2), where v(n) is a white noise of zero mean and variance σ 2v . This AR process is generated by filtering v(n) using the second-order IIR filter H(z). (a) Derive the IIR filter transfer function H(z). (b) Consider a second-order optimum FIR filter shown in Figure 7.3. If the desired signal is d(n), the primary input x(n) = d(n− 1). Find the optimum weight vector w◦ and the minimum MSE ξmin. 6. Given the two finite-length sequences: x(n) = {13−212−1442}, y(n) = {2 −141−23}.JWBK080-07 JWBK080-Kuo March 8, 2006 19:14 Char Count= 0 398 ADAPTIVE FILTERING Using MATLAB function xcorr, compute and plot the crosscorrelation function rxy(k) and the autocorrelation function rxx(k). 7. 
Write a MATLAB script to generate the length-1024 signal defined as x(n) = 0.8 sin(ω0n) + v(n), where ω0 = 0.1π and v(n) is zero-mean random noise with variance σv² = 1 (see Section 3.3 for details). Compute and plot rxx(k), where k = 0, 1, . . . , 127, using MATLAB. Explain this simulation result using the theoretical derivations given in Examples 7.1 and 7.3.

8. Redo Example 7.7 by using x(n) as input to the adaptive FIR filter (L = 2) with the LMS algorithm. Implement this adaptive filter using MATLAB or C. Plot the error signal e(n), and show that the adaptive weights converge to the derived optimum values.

9. Implement the adaptive system identification technique illustrated in Figure 7.7 using a MATLAB or C program. The input signal is zero-mean, unit-variance white noise. The unknown system is the IIR filter defined in Problem 5. Evaluate different filter lengths L and step sizes μ, and plot e(n) for these parameters. Find the optimum values that result in fast convergence and low excess MSE.

10. Implement the adaptive line enhancer illustrated in Figure 7.9 using a MATLAB or C program. The desired signal is given by x(n) = √2 sin(ωn) + v(n), where the frequency ω = 0.2π and v(n) is zero-mean white noise with unit variance. The decorrelation delay Δ = 1. Plot both e(n) and y(n). Evaluate the convergence speed and steady-state MSE for different parameters L and μ.

11. Implement the adaptive noise cancelation illustrated in Figure 7.11 using a MATLAB or C program. The primary signal is given by d(n) = sin(ωn) + 0.8v(n) + 1.2v(n − 1) + 0.25v(n − 2), where v(n) is defined in Problem 5. The reference signal is v(n). Plot e(n) for different values of L and μ.

12. Implement the single-frequency adaptive notch filter illustrated in Figure 7.14 using a MATLAB or C program. The desired signal d(n) is given in Problem 11, and x(n) is given by x(n) = √2 sin(ωn). Plot e(n) and the magnitude response of the second-order FIR filter after convergence.

13. Use MATLAB to generate the primary input signal x(n) = 0.25 cos(2πnf1/fs) + 0.25 sin(2πnf2/fs) and the reference signal d(n) = 0.125 cos(2πnf2/fs), where fs is the sampling frequency, and f1 and f2 are the frequencies of the desired signal and the interference, respectively. Implement an adaptive noise canceler that removes the interference signal.

14. Port the functions developed in Problem 13 to the DSK. Create a real-time experiment by connecting the primary input and reference input signals to the DSK stereo line-in. The left channel is the primary input with interference and the right channel contains only the interference signal. Test the adaptive noise canceler in real time with the DSK.

15. Create a real-time adaptive notch filter experiment using the DSK.

16. The system identification experiment is implemented for the large memory model. Modify the program given in Table 7.11 so that this assembly program can be used with both the large and small memory models.

8 Digital Signal Generators

Signal generation is useful for algorithm design, analysis, and real-world DSP applications. In this chapter, we will introduce different methods for the generation of digital signals and their applications.

8.1 Sinewave Generators

There are several characteristics that should be considered when designing algorithms for generating sinewaves.
These issues include total harmonic distortion, frequency and phase control, memory usage, computational cost, and accuracy. Some trigonometric functions can be approximated by polynomials, for example, the cosine and sine approximations given by Equations (3.90a) and (3.90b). Because polynomial approximations are realized with multiplications and additions, they can be efficiently implemented on DSP processors. Sinewave generation using polynomial approximation is presented in Section 3.6.5, and generation using a resonator is introduced in Chapter 5. Therefore, this section discusses only the lookup-table method for sinewave generation.

8.1.1 Lookup-Table Method

The lookup-table (or table-lookup) method is probably the most flexible technique for generating periodic waveforms. This technique involves reading a series of stored data values that represent the waveform. These values can be obtained either by sampling analog signals or by computing mathematical expressions. Usually only one period of the waveform is stored in the table. A sinewave table containing one period of the waveform can be obtained by computing the following function:

x(n) = sin(2πn/N), n = 0, 1, . . . , N − 1. (8.1)

These samples are represented in binary form; thus, the accuracy is determined by the wordlength. The desired sinewave can be generated by reading these stored values from the table at a constant step Δ. The data pointer wraps around at the end of the table. The frequency of the generated sinewave depends on the sampling period T, the table length N, and the table address increment Δ as

f = Δ/(NT) Hz. (8.2)

For a given sinewave table of length N, a sinewave with frequency f and sampling rate fs can be generated using the pointer address increment

Δ = Nf/fs (8.3)

with the following constraint to avoid aliasing:

Δ ≤ N/2. (8.4)

To generate an L-sample sinewave x(l), where l = 0, 1, . . . , L − 1, we use a circular pointer k such that

k = (m + lΔ) mod N, (8.5)

where m determines the initial phase of the sinewave. It is important to note that the step Δ given in Equation (8.3) may not be an integer; thus, (m + lΔ) in Equation (8.5) makes k a real number. The values between neighboring entries can be estimated using the existing table values. An easy solution is to round the noninteger index k to the nearest integer. A better but more complex method is to interpolate the value based on the adjacent samples. The following two errors will cause harmonic distortion:

1. Amplitude quantization errors due to the use of finite wordlength to represent values in the table.

2. Time-quantization errors from synthesizing data values between table entries.

Increasing the table size can reduce the time-quantization errors. To reduce the memory requirement, we can take advantage of the symmetry property, since the absolute values of a sinewave repeat four times in each period; thus, only one-fourth of the period needs to be stored. However, a more complex algorithm is needed to track which quadrant of the waveform is being generated. To decrease the harmonic distortion for a given table size, an interpolation technique can be used to compute the values between table entries. Simple linear interpolation assumes that a value between two consecutive table entries lies on a straight line between these two values.
Suppose the integer part of the pointer k is i (0 ≤ i < N) and the fractional part is f (0 < f < 1), the interpolated value will be computed as x(n) = s(i) + f [s(i + 1) − s(i)] , (8.6) where [s(i + 1) − s(i)] is the slope of the line between successive table entries s(i) and s(i + 1). Example 8.1: We use the MATLAB program example8_1.m for generating one period of 200 Hz sinewave with sampling rate 4000 Hz as shown in Figure 8.1. These 20 samples are stored in a table for generating sinewave with fs = 4 kHz. From Equation (8.3),  = 1 will be used for generating 200 Hz sinewave and  = 2 for 400 Hz. But,  = 1.5 should be needed for generating 300 Hz. From Figure 8.1, when we access the lookup table with  = 1.5, we get the first value which is the first entry in the table as shown by arrow. However, the second value is not available in the table since it is in between the second and third entries. Therefore, the linear interpolation results in theJWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 SINEWAVE GENERATORS 403 1 0.8 0.6 0.4 0.2 −0.2 −0.4 −0.6 −0.8 −1 0 Amplitude, A 02468101214161820 Time index, n 200 Hz sinewave sampled at 4000Hz Figure 8.1 One period of sinewave, where sinewave samples are marked by ‘o’ average of these two entries. To generate 250 Hz sinewave,  = 1.25, and we can use Equation (8.6) for computing sample values with noninteger index. Example 8.2: A cosine/sine function generator using table-lookup method with 1024-point cosine table can be implemented using the following TMS320C55x assembly code (cos_sin.asm): ; cos_sin.asm - Table lookup sinewave generator with ; 1024-point cosine table range (0, π) ; ; Prototype: void cos_sin(short, short *, short *) ; Entry: arg0: T0 - alpha ; arg1: AR0 - pointer to cosine ; arg2: AR1 - pointer to sine .def _cos_sin .ref tab_0_PI .sect "cos_sin" _cos_sin mov T0,AC0 ; T0=a sfts AC0,#11 ; Size of lookup table mov #tab_0_PI, T0 ; Table based address || mov hi(AC0),AR2 mov AR2,AR3JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 404 DIGITAL SIGNAL GENERATORS abs AR2 ; cos(-a) = cos(a) add #0x200,AR3 ; 90 degree offset for sine and #0x7ff,AR3 ; Modulo 0x800 for 11-bit sub #0x400,AR3 ; Offset 180 degree for sine abs AR3 ; sin(-a) = sin(a) || mov *AR2(T0),*AR0 ; *AR0=cos(a) mov *AR3(T0),*AR1 ; *AR1=sin(a) ret .end In this example, we use a one-half period table (0 − π) to reduce memory usage. Obviously, a sine function generator using a full table (0 − 2π) can be easily implemented with only a few lines of assembly code, while a function generator using a one-fourth table (0 − π/2) will be more complicated. The implementation of sinewave generator for the C5510 DSK using the table-lookup technique will be presented in Section 8.4. 8.1.2 Linear Chirp Signal A linear chirp signal is a waveform whose instantaneous frequency changes linearly with time between two specified frequencies. It is a waveform with the lowest possible peak to root-mean-square amplitude ratio in the desired frequency band. The digital chirp waveform is expressed as c(n) = A sin[φ(n)], (8.7) where A is a constant amplitude and φ(n) is a quadratic phase in the form of φ(n) = 2π fLn + fU − fL 2(N − 1) n2 + α, 0 ≤ n ≤ N − 1, (8.8) where N is the total number of points in a single chirp. In Equation (8.8), α is an arbitrary constant phase factor, and fL and fU are the normalized lower and upper frequency limits, respectively. The waveform periodically repeats with φ(n + kN) = φ(n), k = 1, 2,.... 
(8.9) The instantaneous normalized frequency is defined as f (n) = fL + fU − fL N − 1 n, 0 ≤ n ≤ N − 1. (8.10) This expression shows that the instantaneous frequency goes from f (0) = fL at timen = 0to f (N − 1) = fU at time n = N − 1. Because of the complexity of the linear chirp signal generator, it is more convenient to generate a chirp sequence by computer and store it in a lookup table for real-time applications. An alternative solution is to generate the table during DSP system initialization process. The lookup-table method introduced in Section 8.1.1 can be used to generate the desired signal using the stored table. MATLAB Signal Processing Toolbox provides the function y = chirp(t, f0, t1, f1) for gen- erating linear chirp signal at the time instances defined in array t, where f0 is the frequency at time 0 and f1 is the frequency at time t1. Variables f0 and f1 are in Hz.JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 NOISE GENERATORS 405 Figure 8.2 Spectrogram of chirp signal from 0 to 300 Hz Example 8.3: Compute the spectrogram of a chirp signal with the sampling rate 1000 Hz. The signal sweeps from 0 to 150 Hz in 1 s. The MATLAB code is listed as follows (example8_3.m, adapted from MATLAB Help menu): Fs = 1000; % Define variables T = 1/Fs; t = 0:T:2; % 2 seconds at 1 kHz sample rate y = chirp(t,0,1,150); % Start at DC, cross 150 Hz at t=1 second spectrogram(y,256,250,256,1E3,'yaxis') The spectrogram of generated chirp signal is illustrated in Figure 8.2. 8.2 Noise Generators Random numbers are used in many practical applications for simulating noises. Although we cannot produce perfect random numbers by using digital hardware, it is possible to generate a sequence of numbers that are unrelated to each other. Such numbers are called pseudo-random numbers. In this section, we will introduce random number generation algorithms. 8.2.1 Linear Congruential Sequence Generator The linear congruential method is widely used by random number generators, and can be expressed as x(n) = [ax(n − 1) + b]mod M , (8.11)JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 406 DIGITAL SIGNAL GENERATORS Table 8.1 C program for generating linear congruential sequence /* * URAN - Generation of floating-point pseudo-random numbers */ static long n=(long)12357; // Seed x(0) = 12357 float uran() { float ran; // Random noise r(n) n=(long)2045*n+1L; // x(n)=2045*x(n-1)+1 n-=(n/1048576L)*1048576L;//x(n)=x(n)-INT[x(n)/1048576]*1048576 ran=(float)(n+1L)/(float)1048577; //r(n)=FLOAT[x(n)+1]/1048577 return(ran); // Return r(n) to the main function } where the modulo operation (mod) returns the remainder after division by M. The constants a, b, and M can be chosen as a = 4K + 1, (8.12) where K is an odd number such that a is less than M, and M = 2L (8.13) is a power of 2, and b can be any odd number. Equations (8.12) and (8.13) guarantee that the period of the sequence given by Equation (8.11) has full-length M. A good choice of these parameters are M = 220 = 1 048 576, a = 4(511) + 1 = 2045, and x(0) = 12 357. Since a random number generator usually produces samples between 0 and 1, we can normalize the nth random sample as r(n) = x(n) + 1 M + 1 (8.14) so that the random samples are greater than 0 and less than 1. A floating-point C function (uran.c) that implements the random number generator defined by Equations (8.11) and (8.14) is listed in Table 8.1. A fixed-point C function (rand.c) that is more efficient for a fixed-point DSP processor was provided in Section 3.6.6. 
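As a bridge to the assembly version in Example 8.4 below, the same recursion with M = 2^16 can be modeled in portable C as sketched here. This is a host-side illustration, not the rand.c routine from Section 3.6.6; the multiplier 0x6255 and increment 0x3619 are of the form required by Equations (8.12) and (8.13), and the final offset makes the output roughly zero-mean.

#include <stdint.h>

/* Hedged sketch: x(n) = [0x6255*x(n-1) + 0x3619] mod 2^16,
   output x(n) - 0x4000 reinterpreted as a signed 16-bit value.           */
static uint16_t seed16 = 12357u;        /* seed x(0); any value may be used */

int16_t rand16_c(void)
{
    uint16_t u;
    seed16 = (uint16_t)(0x6255u * seed16 + 0x3619u);  /* wraps modulo 2^16 */
    u = (uint16_t)(seed16 - 0x4000u);                 /* zero-mean offset  */
    /* Explicit two's-complement reinterpretation of the 16-bit pattern.   */
    return (u & 0x8000u) ? (int16_t)((int32_t)u - 0x10000L) : (int16_t)u;
}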
Example 8.4: The following TMS320C55x assembly code (rand_gen.asm) implements an M = 216 (65 536) random number generator: ; rand16_gen.asm - 16-bit zero-mean random number generator ; ; Prototype: int rand16_gen(int *) ; ; Entry: arg0 - AR0 pointer to seed value ; Return: T0 - Random numberJWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 NOISE GENERATORS 407 b15 b14 b13 b12 b11 b10 x2 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 x1 x XOR XOR XOR Figure 8.3 16-bit pseudo-random number generator C1 .equ 0x6255 C2 .equ 0x3619 .def _rand16_gen .sect "rand_gen" _rand16_gen mov #C1,T0 mpym *AR0,T0,AC0 ; Seed=(C1*seed+C2) add #C2,AC0 and #0xffff,AC0 ; Seed%=0x10000 mov AC0,*AR0 sub #0x4000,AC0 ; Zero-mean random number mov AC0,T0 ret .end 8.2.2 Pseudo-Random Binary Sequence Generator A shift register with feedback from specific bit locations can also generate a repetitive pseudo-random sequence. The schematic of a 16-bit generator is shown in Figure 8.3, where the functional operator labeled ‘XOR’ performs the exclusive-OR operation of its two binary inputs. The sequence itself is determined by the position of the feedback bits on the shift register. In Figure 8.3, x1 is the output of b0 XOR with b2, x2 is the output of b11 XOR with b15, and x is the output of x1 XOR with x2. Each output from the sequence generator is the entire 16-bit of the register. After the random number is generated, every bit in the register is shifted left by 1 bit (b15 is lost), and then x is shifted into b0 position. A shift register of length 16 bits can readily be accommodated by a single word on 16-bit DSP processors. It is important to recognize, however, that sequential words formed by this process will be correlated. The maximum sequence length before repetition is L = 2M − 1, (8.15) where M is the number of bits of the shift register. Example 8.5: The pseudo-random number generator given in Table 8.2 (pn_sequence.c) re- quires at least 11 C statements to complete the computation. The following TMS320C55x assemblyJWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 408 DIGITAL SIGNAL GENERATORS Table 8.2 C program for generating pseudo-random sequence // // Pseudo-random sequence generator // static short shift_reg; short pn_sequence(short *sreg) { short b2,b11,b15; short x1,x2; /* x2 also used for x */ b15 = *sreg >>15; b11 = *sreg >>11; x2 = b15^b11; /* First b15 XOR b11 */ b2 = *sreg >>2; x1 = *sreg ^b2; /* Second b2 XOR b0 */ x2 = x1^x2; /* Final x1 XOR x2 */ x2 &= 1; *sreg = *sreg <<1; *sreg = *sreg | x2; /* Update the shift register */ x2 = *sreg-0x4000; /* Zero-mean random number */ return x2; } program (pn_gen.asm) computes the same sequence in 11 cycles: ; pn_gen.asm - 16-bit pseudo-random sequence generator ; ; Prototype: int pn_gen(int *) ; ; Entry: arg0 - AR0 pointer to the shift register ; Return: T0 - Random number BIT15 .equ 0x8000 ; b15 BIT11 .equ 0x0800 ; b11 BIT2 .equ 0x0004 ; b2 BIT0 .equ 0x0001 ; b0 .def _pn_gen .sect "rand_gen" _pn_gen mov *AR0,AC0 ; Get register value bfxtr #(BIT15| BIT2),AC0,T0 ; Get b15 and b2 bfxtr #(BIT11| BIT0),AC0,T1 ; Get b11 and b0 sfts AC0,#1 || xor T0,T1 ; XOR all 4 bits mov T1,T0 sfts T1,#-1 xor T0,T1 ; Final XOR and #1,T1 or T1,AC0 mov AC0,*AR0 ; Update register sub #0x4000,AC0,T0 ; Zero-mean random number || ret .endJWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 PRACTICAL APPLICATIONS 409 8.3 Practical Applications In this section, we will introduce some real-world applications that are related to the sinewave and random number generators. 
8.3.1 Siren Generators An interesting application of chirp signal generator is to generate sirens. The electronic sirens are often produced by a generator inside the vehicle compartment. This generator drives either a 60- or 100-W loudspeaker in a light bar mounted on the vehicle roof. The actual siren characteristics (bandwidth and duration) vary slightly from manufacturers. The wail type of siren sweeps between 800 and 1700 Hz with a sweep period of approximately 4.92 s. The yelp siren has similar characteristics to the wail but with a period of 0.32 s. Example 8.6: We modify the chirp signal generator given in Example 8.3 for generating sirens. The MATLAB code example8_6.m generates wail type of siren and plays it using soundsc function. 8.3.2 White Gaussian Noise The MATLAB Communication Toolbox provides wgn function for generating white Gaussian noise (WGN) that is widely used for modeling communication channels. We can specify the power of the noise in dBW (decibels relative to 1-watt), dBm, or linear units. We can generate either real or complex noise. For example, the command below generates a vector of length 50 containing real-valued WGN whose power is 2 dBW: y1 = wgn(50,1,2); The function assumes that the load impedance is 1 . Example 8.7: A WGN channel adds white Gaussian noise to the signal that passes through it. To model a WGN channel, use the awgn function as follows: y = awgn(x,snr) This command adds white Gaussian noise to the vector signal x. The scalar snr specifies the signal-to-noise ratio in dB. If x is complex, then awgn adds complex noise. This syntax assumes that the power of x is 0 dBW. The following MATLABscript (example8_7.m, adapted from MATAB Help menu) adds white Gaussian noise to a square wave signal. It then plots the original and noisy signals in Figure 8.4: t = 0:.1:20; x = square(t); % Create square signal y = awgn(x,10,'measured'); % Add white Gaussian noise plot(t,x,t,y) % Plot both signals legend('Original signal','Signal with AWGN');JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 410 DIGITAL SIGNAL GENERATORS 2 1.5 1 0.5 0 −0.5 −1 −1.5 −2 50101520 Time Amplitude Original signal Signal with AWGN Figure 8.4 A square wave corrupted by white Gaussian noise Note that in the code, square(t) generates a square wave with period 2π for the elements of time vector t with peaks of +1to−1 instead of a sinewave. 8.3.3 Dual-Tone Multifrequency Tone Generator A common application of sinewave generator is the touch-tone telephones and cellular phones that use the dual-tone multifrequency (DTMF) transmitter and receiver. DTMF also finds widespread uses in electronic mail systems and automated telephone servicing systems in which the user can select options from a menu by sending DTMF signals from a telephone. Each key-press on the telephone keypad generates the sum of two tones expressed as x(n) = cos (2π fLnT) + cos (2π fHnT) , (8.16) where T is the sampling period, and the two frequencies fL and fH uniquely define the key that was pressed. Figure 8.5 shows the matrix of the frequencies used to encode the 16 DTMF symbols defined by ITU Recommendation Q.23. The values of these eight frequencies have been chosen carefully so that they do not interfere with speech. The low-frequency group (697, 770, 852, and 941 Hz) selects the row frequencies of the 4 × 4 keypad, and the high-frequency group (1209, 1336, 1477, and 1633 Hz) selects the column frequencies. 
A pair of sinusoidal signals with fL from the low-frequency group and fH from the high-frequency group willJWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 PRACTICAL APPLICATIONS 411 123A 456B 7 89C *0#D 1209 1336 1477 1633Hz 941Hz 852Hz 770Hz 697Hz Figure 8.5 Telephone keypad matrix represent a particular key. For example, the digit ‘3’ is represented by two sinewaves at frequencies 697 and 1477 Hz. The generation of dual tones can be implemented using two sinewave generators connected in parallel. The DTMF signal must meet timing requirements for duration and spacing of digit tones. Digits are required to be transmitted at a rate of less than 10 per second. A minimum spacing of 50 ms between tones is required, and the tones must be presented for a minimum of 40 ms. A tone-detection scheme used as a DTMF receiver must have sufficient time resolution to verify correct digit timing. The issues of tone detection will be discussed later in Chapter 9. 8.3.4 Comfort Noise in Voice Communication Systems In voice communication systems, the complete suppression of a signal using residual echo suppressor (will be discussed later in Section 10.5) has an adverse subjective effect. This problem can be solved by adding a low-level comfort noise. As illustrated in Figure 8.6, the output of residual echo suppressor is expressed as y(n) = αv(n), |x(n)| ≤ β x(n), |x(n)| >β, (8.17) x(n) v(n) y(n) α Noise power estimator Noise generator > βx(n) ≤ βx(n) Figure 8.6 Injection of comfort noise with active center clipperJWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 412 DIGITAL SIGNAL GENERATORS Table 8.3 File listing for experiment exp8.4.1_signalGenerator Files Description tone.c C function for testing experiment tone.cdb CCS configuration file for experiment tonecfg.cmd DSP linker command file signalGenerator.pjt DSP project file 55xdspx.lib Large memory mode DSK library dsk5510bslx.lib Large memory mode DSK board support library where v(n) is an internally generated zero-mean pseudo-random noise, x(n) is the input applied to the center clipper, and β is the clipping threshold. In echo cancelation applications, the characteristics of the comfort noise should match the background noise when neither talker is active. In speech coding applications, the characteristics of the comfort noise should match the background noise during the silence. In both cases, the algorithm shown in Figure 8.6 is a process of estimating the power of the background noise in x(n) and generating the comfort noise with same power to replace signals suppressed by the center clipper. Detailed information on residual echo suppressor and comfort noise generation will be presented in Chapter 10. 8.4 Experiments and Program Examples This section presents several hands-on experiments including real-time signal generation using the C5510 DSK and DTMF generation using MATLAB. 8.4.1 Sinewave Generator Using C5510 DSK The objective of this experiment is to use the C5510 DSK with its associated CCS, BSL (board support library), and AIC23 for generating sinusoidal signals. We will develop our programs based on the tone example project that is available in the C5510 DSK folder ..\examples\dsk5510\bsl\tone. In this experiment, we will modify the C program and build the project using CCS for real-time execution on the C5510 DSK. Table 8.3 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. 
Create a working folder and copy the following files from the DSK folder ..\examples\dsk5510\ bsl\tone into the new folder. In addition, also copy the DSPLIB 55xdspx.lib from the DSK folder ..\c5500\dsplib and dsk5510bslx.lib from the DSK folder ..\c5500\dsk5510\lib into the new folder. 2. Start CCS and create a new project in the new folder. Add tone.c, tone.cdb and tonecfg.cmd into the project. In addition, also add the 55xdspx.lib and dsk5510bslx.lib into the project. We will need DSPLIB functions to generate sine and random signals. Choose the large memory model and build the project. 3. Connect a headphone (or a loudspeaker) to the headphone output of the C5510 DSK and run the program.JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 413 Table 8.4 Code section to generate random signal #define SINE_TABLE_SIZE 8 // No. of samples short sinetable[SINE_TABLE_SIZE]; // Vector for random samples ... for (msec = 0; msec < 5000; msec++) { rand16(sinetable, SINE_TABLE_SIZE); for (sample = 0; sample < SINE_TABLE_SIZE; sample++) { /* Send a sample to the left channel */ while (!DSK5510_AIC23_write16(hCodec, sinetable[sample])); /* Send a sample to the right channel */ while (!DSK5510_AIC23_write16(hCodec, sinetable[sample])); } } In the C source code tone.c, the array sinetable contains 48 samples (which cover exactly one period) of a precalculated sinewave in Q15 data format shown below: Int16 sinetable[SINE_TABLE_SIZE] = { 0x0000, 0x10b4, 0x2120, 0x30fb, 0x3fff, 0x4dea, 0x5a81, 0x658b, 0x6ed8, 0x763f, 0x7ba1, 0x7ee5, 0x7ffd, 0x7ee5, 0x7ba1, 0x76ef, 0x6ed8, 0x658b, 0x5a81, 0x4dea, 0x3fff, 0x30fb, 0x2120, 0x10b4, 0x0000, 0xef4c, 0xdee0, 0xcf06, 0xc002, 0xb216, 0xa57f, 0x9a75, 0x9128, 0x89c1, 0x845f, 0x811b, 0x8002, 0x811b, 0x845f, 0x89c1, 0x9128, 0x9a76, 0xa57f, 0xb216, 0xc002, 0xcf06, 0xdee0, 0xef4c }; The sampling rate of CODEC is default at 48 kHz, thus the CODEC outputs 48 000 samples per second. Since the time interval between two consecutive samples is T = 1/48 000 s, each period of sinewave contains 48 samples, and the period of sinewave is 48/48 000 = 1/1000 s = 1 ms. Therefore, the frequency of the generated sinewave is 1000 Hz. Since each period of sinewave is 1/1000 s, the program generates 5000 periods, and it lasts for 5 s. 8.4.2 White Noise Generator Using C5510 DSK In this experiment, we use the C55x DSPLIB function rand16 to generate eight samples of random signal for 8 kHz sampling rate (or 48 samples if the sampling rate is 48 kHz). Instead of writing a new program, we will modify tone.c from the previous experiment and rename it as noise.c. Partial of the modified C code that uses the array sinetable[ ] for storing random numbers is listed in Table 8.4. The files used for this experiment are listed in Table 8.5. Procedures of the experiment are listed as follows: 1. Create a DSP project for the noise experiment. 2. Run the experiment and listen to the noise generated by the C5510 DSK.JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 414 DIGITAL SIGNAL GENERATORS Table 8.5 File listing for experiment exp8.4.2_noiseGenerator Files Description noise.c C function for testing experiment tone.cdb CCS configuration file for experiment tonecfg.cmd DSP linker command file noiseGeneration.pjt DSP project file 55xdspx.lib Large memory mode DSK library dsk5510bslx.lib Large memory mode DSK board support library 3. Modify the experiment such that the noise generated will be sampled at 8 kHz. 4. 
Modify the experiment to generate 2 s of random noise at 8 kHz sampling rate. 8.4.3 Wail Siren Generator Using C5510 DSK In this experiment, we will use the table-lookup method to implement a wail siren using the C5510 DSK. The modified C code using the array sirentable[ ] for storing siren data values is listed in Table 8.6. There is a limitation for this experiment running on the C5510 DSK. In Example 8.6, the sweeping of data numbers from 800 to 1700 Hz at 8 kHz sampling rate requires a table of 39 360 entries. We will not be able to access the complete table because the addressing range of 16-bit C55x is limited to 32 767. To Table 8.6 Code section for siren generator #define SIREN_TABLE_SIZE 19680 /* Length of siren table */ Int16 sirentable[SIREN_TABLE_SIZE]={ #include "wailSiren.h" }; /* Generate 10-sweep of siren wave */ for (i=0; i<10; i++) { for (sample = 0; sample < SIREN_TABLE_SIZE; sample++) { data = sirentable[sample]; // Get two samples each time /* Send first sample to the left channel */ while (!DSK5510_AIC23_write16(hCodec, (data&0xff)<<8)); /* Send first sample to the right channel */ while (!DSK5510_AIC23_write16(hCodec, (data&0xff)<<8)); /* Send second sample to the left channel */ while (!DSK5510_AIC23_write16(hCodec, data&0xff00)); /* Send second sample to the right channel */ while (!DSK5510_AIC23_write16(hCodec, data&0xff00)); } }JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 415 Table 8.7 File listing for experiment exp8.4.3_sirenGenerator Files Description siren.c C function for testing experiment tone.cdb CCS configuration file for experiment tonecfg.cmd DSP linker command file sirenGenerator.pjt DSP project file 55xdspx.lib Large memory mode DSK library dsk5510bslx.lib Large memory mode DSK board support library resolve this problem, we generate 8-bit siren data and pack two 8-bit data into one 16-bit word. In this way, we can use a table of 19 680 entries for the 4.92 s of wail siren. The demo program is modified, so each data read will take two 8-bit data and they are unpacked and played by the DSK. Table 8.7 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Create a DSP project for the siren experiment. 2. Write a C or MATLAB program to generate siren lookup table in 8-bit data and two 8-bit data packed formats. 3. Set the AIC23 sampling rate to 8 kHz. 4. Run the experiment and listen to the siren signal generated. 8.4.4 DTMF Generator Using C5510 DSK In this experiment, we will implement DTMF signal generation using the C5510 DSK. We modify the previous table-lookup experiment to create a DTMF generator with 8-kHz sampling frequency. The ITU Q23 recommendation defines the DTMF signaling with eight frequencies, four lower frequencies for the rows and four high frequencies for the columns as shown in Figure 8.5. The ITU Q.24 recommendation specifies the duration of the DTMF signal and silence interval between DTMF signals. We generate eight tables for eight DTMF frequencies. Each table has 800 entries for 100-ms duration. The following C code can be used to generate the sinewave tables: w = 2.0*PI*f/Fs; for(n=0; n<800; n++) { cosine[n] = (short)(cos(w*n)*16383); // Q14 format } In the code, f is the DTMF frequency and Fs is the sampling frequency. Table 8.8 lists the partial code for DTMF tone generation. This experiment can generate a series of DTMF signals from a given digit string. The DTMF tones are separated by 60 ms of silence. 
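Since Table 8.8 shows only part of the generation loop, a hedged C sketch of the tone-pair output is given below. Here rowtab and coltab stand for two of the eight 800-entry Q14 cosine tables described above, and write_sample() is a hypothetical stand-in for the AIC23 output call used in the experiment, not a function from the DSK library.

#define DTMF_SAMPLES  800                /* 100 ms of tone at 8 kHz        */
#define PAUSE_SAMPLES 480                /* 60 ms of silence at 8 kHz      */

extern void write_sample(short sample);  /* hypothetical codec output call */

/* Hedged sketch: one DTMF digit is the sum of a low-group and a high-group
   cosine table entry, per Equation (8.16), followed by a silence interval. */
void send_dtmf_digit(const short *rowtab, const short *coltab)
{
    int n;
    for (n = 0; n < DTMF_SAMPLES; n++) {
        /* Two Q14 samples sum to at most +/-32766, so no overflow occurs. */
        short sample = (short)(rowtab[n] + coltab[n]);
        write_sample(sample);
    }
    for (n = 0; n < PAUSE_SAMPLES; n++)  /* inter-digit silence            */
        write_sample(0);
}

A digit string is then sent by mapping each key to its (row, column) frequency pair from Figure 8.5 and calling this routine once per key.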
The files used for this experiment are listed in Table 8.9.JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 416 DIGITAL SIGNAL GENERATORS Table 8.8 Code section of DTMF signal generation for (sample = 0; sample 48) to cover one period of sinewave. 6. The yelp siren has similar characteristics as the wail siren but its period is 0.32 s. Use the wail siren experiment as reference to create a yelp siren generator using the table-lookup method. 7. ITU Q.24 recommendation specifies that the DTMF frequency offsets for North America must be no more than 1.5 %. Develop a method to examine all 16 waveform tables used for DTMF generation given in Section 8.4.4. Are these DTMF tones all within the specified tolerance? If not, how to correct the problem? 8. The DTMF signal generation uses eight tables of 800 entries each. By packing two 8-bit bytes in one 16-bit word can save half of the data memory used for tables. Compress these eight tables into byte format and rerun the DTMF experiments. 9. The ITU Q24 allows the high-frequency component of the DTMF tone level to be higher than the low-frequency component. Redesign the experiment given in Section 8.4.4 such that the level of the high-frequency component of the DTMF tone generated is 3 dB higher than the low frequency. 10. Add two graph windows to the DTMF GUI in Section 8.4.5. One of these windows is used to display the time-domain DTMF signal waveform, and the other is used to plot the spectrum of the generated DTMF signal.JWBK080-08 JWBK080-Kuo March 8, 2006 11:58 Char Count= 0 420JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 9 Dual-Tone Multifrequency Detection Dual-tone multifrequency (DTMF) generation and detection are widely used in telephone signaling and interactive control applications through telephone and cellular networks. In this chapter, we will focus on the DTMF detection and applications. 9.1 Introduction DTMF signaling was developed initially for telephony signaling such as dialing and automatic redial. Modems use DTMF for dialing stored numbers to connect with network service providers. DTMF has also been used in interactive remote access control with computerized automatic response systems such as airline’s information systems, remote voice mailboxes, electronic banking systems, as well as many semiautomatic services via telephone networks. DTMF signaling scheme, reception, testing, and implementation requirements are defined in ITU Recommendations Q.23 and Q.24. DTMF generation is based on a 4 × 4 grid matrix shown in Figure 8.5. This matrix represents 16 DTMF signals including numbers 0Ð9, special keys ∗ and #, and four letters AÐD. The letters AÐD are assigned to unique functions for special communication systems such as the military telephony systems. As discussed in Chapter 8, the DTMF signals are based on eight specific frequencies defined by two mutually exclusive groups. Each DTMF signal consists of two tones that must be generated simultaneously. One is chosen from the low-frequency group to represent the row index, and the other is chosen from the high-frequency group for the column index. A DTMF decoder must able to accurately detect the presence of these tones specified by ITU Q.23. The decoder must detect the DTMF signals under various conditions such as frequency offsets, power level variations, DTMF reception timing inconsistencies, etc. DTMF decoder implementation requirements are detailed in ITU-T Q.24 recommendation. 
An application of using DTMF signaling for remote access control between individual users and bank automated electronic database is illustrated in Figure 9.1. In this example, user follows the prerecorded voice commands to key-in the corresponding information, such as the account number and user authen- tication, using a touch-tone telephone keypad. User’s inputs are converted to a series of DTMF signals. The reception end processes these DTMF tones to reconstruct the digits for the remote access control. The banking system sends the queries, responses, and confirmation messages via voice channel to the user during the remote access process. Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 421JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 422 DUAL-TONE MULTIFREQUENCY DETECTION Voice channel DTMF detection Voice in Voice command Bank account access Voice out Voice channel Voice inVoice out BankDatabase User Figure 9.1 A simplified DTMF application used for remote access control For voice over IP (VoIP) applications, a challenge for DTMF signaling is to pass through the VoIP networks via speech coders and decoders. When DTMF signaling is used with VoIPnetworks, the DTMF signaling events can be sent in data packet types. The procedure of how to carry the DTMF signaling and other telephony events in real-time transport protocol (RTP) packet is defined by Internet engineering task force RFC2833 specification. Besides DTMF tones, there are many other multifrequency tones used in communications. For example, the call progress tones include dial tone, busy tone, ringing-back tone, and modem and fax tones. The basic tone detection algorithm and implementation techniques are similar. In this chapter, we will concentrate on the DTMF detection. 9.2 DTMF Tone Detection This section introduces methods for detecting DTMF tones used in communication networks. The correct detection of a DTMF digit requires both a valid tone pair and the correct timing intervals. Since the DTMF signaling may be used to set up a call and to control functions such as call forwarding, it is necessary to detect DTMF signaling in the presence of speech. 9.2.1 DTMF Decode Specifications The implementation of DTMF decoder involves the detection of the DTMF tones, and determination of the correct silence between the tones. In addition, it is necessary to perform additional assessments to ensure that the decoder can accurately distinguish DTMF signals in the presence of speech. For North America, DTMF decoders are required to detect frequencies with a tolerance of ±1.5%. The frequencies that are offset by ±3.5 % or greater must not be recognized as DTMF signals. For Japan, the detection of frequencies has a tolerance of ±1.8 %, and the tone offset is limited to ±3.0 %. This requirement prevents the detector from falsely detecting speech and other signals as valid DTMF signals. The receiver must work under the worst-case signal-to-noise ratio of 15 dB with a dynamic range of 25 dB for North America (or 24 dB for Japan). The ITU-T requirements for North America are listed in the Table 9.1. Another requirement is the ability to detect DTMF signals when two tones are received at different levels. This level difference is called twist. The high-frequency tone may be received at a lower level than the low-frequency tone due to the magnitude response of the communication channel, and this situation is described as a forward (or standard) twist. 
Reverse twist occurs when the received low-frequency tone has lower level than the high-frequency tone. The receiver must operate with a maximum 8 dB of forward twist and 4 dB of reverse twist. The final requirement is that the receiver must avoid incorrectly identifying the speech signal as valid DTMF tones. This is referred as talk-off performance.JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 DTMF TONE DETECTION 423 Table 9.1 Requirements of DTMF specified in ITU-T Q.24 Signal frequencies Low group 697, 770, 852, 941 Hz High group 1209, 1336, 1477, 1633 Hz Frequency tolerance Operation ≤ 1.5% Nonoperation ≥ 3.5% Signal duration Operation 40 ms min Nonoperation 23 ms max Twist Forward 8 dB max Reverse 4 dB max Signal power Operation 0 to −25 dBm Nonoperation −55 dBm max Interference by echoes Echoes Should tolerate echoes delayed up to 20 ms and at least 10 dB down 9.2.2 Goertzel Algorithm The basic principle of DTMF detection is to examine the energy of the received signal and determine whether a valid DTMF tone pair has been received. The detection algorithm can be implemented using a DFT or a filterbank. For example, an FFT can calculate the energies of N evenly spaced frequencies. To achieve the required frequency resolution to detect the DTMF frequencies within ±1.5 %, a 256-point FFT is needed for 8 kHz sample rate. Since the DTMF detection considers only eight frequencies, it is more efficient to use a filterbank that consists of eight IIR bandpass filters. In this chapter, we will introduce the modified Goertzel algorithm as filterbank for DTMF detection. The DFT can be used to compute eight different X(k) that correspond to the DTMF frequencies as X(k) = N−1 n=0 x(n)W kn N . (9.1) Using the modified Goertzel algorithm, the DTMF decoder can be implemented as a matched filter for each frequency index k as illustrated in Figure 9.2, where x(n) is the input signal, Hk(z) is the transfer function of the kth filter, and X(k) is the corresponding filter output. From Equation (7.4), we have W −kN N = e j(2π/N)kN = e j2πk = 1. (9.2) X (N − 1)HN−1(z) x (n) Hk (z) H0 (z) X (k) X (0) Figure 9.2 Block diagram of Goertzel filterbankJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 424 DUAL-TONE MULTIFREQUENCY DETECTION Multiplying the right-hand side of Equation (9.1) by W −kN N , we have X(k) = W −kN N N−1 n=0 x(n)W kn N = N−1 n=0 x(n)W −k(N−n) N . (9.3) Define the sequence yk(n) = N−1 m=0 x(m)W −k(n−m) N . (9.4) This equation can be interpreted as a convolution between the finite-duration sequence x(n) and the sequence W −kn N u(n) for 0 ≤ n ≤ N − 1. Consequently, yk(n) can be viewed as the output of a filter with impulse response W −kn N u(n). That is, a filter with impulse response hk(n) = W −kn N u(n). (9.5) Thus, Equation (9.4) can be expressed as yk(n) = x(n) ∗ W −kn N u(n). (9.6) From Equations (9.3) and (9.4), and the fact that x(n) = 0 for n < 0 and n ≥ N, we can show that X(k) = yk(n)|n=N−1. (9.7) That is, X(k) is the output of filter Hk(z) at time n = N − 1. Taking the z-transform of Equation (9.6), we obtain Yk(z) = X(z) 1 1 − W −k N z−1 . (9.8) The transfer function of the kth Goertzel filter is defined as Hk(z) = Yk(z) X(z) = 1 1 − W −k N z−1 , k = 0, 1,...,N − 1. (9.9) This filter has a pole on the unit circle at the frequency ωk = 2πk/N. Thus, the DFT can be computed by filtering a block of input data using N filters in parallel as defined by Equation (9.9). Each filter has a pole at the corresponding frequency of the DFT. 
The parameter N must be chosen to ensure that X(k) is the result representing to the DTMF at frequency fk that meets the requirement of frequency tolerance given by Table 9.1. The DTMF detection accuracy can be ensured only if we choose the N such that the following approximation is satisfied: 2 fk fs ∼= k N . (9.10) A block diagram of the transfer function Hk(z) for recursive computation of X(k) is depicted in Figure 9.3. Since the coefficients W −k N are complex valued, the computation of each new value of yk(n) requires four multiplications and additions. All the intermediate values, yk(0), yk(1),...,andyk(N − 1),JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 DTMF TONE DETECTION 425 x(n) WN −k yk (n) Hk (z) z−1 Figure 9.3 Block diagram of recursive computation of X(k) must be computed in order to obtain the final output yk(N − 1) = X(k). Therefore, the computation of X(k) given in Figure 9.3 requires 4N complex multiplications and additions for each frequency index k. We can avoid the complex multiplications and additions by combining the pairs of filters that have complex-conjugated poles. By multiplying both the numerator and the denominator of Hk(z) given in Equation (9.9) by the factor (1 − W k N z−1), we have Hk(z) = 1 − W k N z−1 (1 − W −k N z−1)(1 − W k N z−1) = 1 − e j2πk/N z−1 1 − 2 cos(2πk/N)z−1 + z−2 . (9.11) This transfer function can be represented as signal-flow diagram shown in Figure 9.4 using the direct- form II IIR filter. The recursive part of the filter is on the left side, and the nonrecursive part is on the right side. Since the output yk(n) is required only at time N − 1, we just need to compute the nonrecursive part of the filter at the (N − 1)th iteration. The recursive part of algorithm can be expressed as wk(n) = x(n) + 2 cos(2π fk/fs)wk(n − 1) − wk(n − 2), (9.12) while the nonrecursive calculation of yk(N − 1) is expressed as X(k) = yk(N − 1) = wk(N − 1) − e− j2π fk /fs wk(N − 2). (9.13) Hk (z) wk (n)x(n) 2 cos(2πfk / fs) yk (n) wk (n − 2) wk (n − 1) −1 z−1 z−1 −e−j2πfk / fs Figure 9.4 Detailed signal-flow diagram of Goertzel algorithmJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 426 DUAL-TONE MULTIFREQUENCY DETECTION The algorithm can be further simplified by realizing that only the squared X(k) (magnitude) is needed for tone detections. From Equation (9.13), the squared magnitude of X(k) is computed as |X(k)|2 = w2 k (N − 1) − 2 cos(2π fk/fs)wk(N − 1)wk(N − 2) + w2 k (N − 2). (9.14) This avoids the complex arithmetic given in Equation (9.13), and requires only one coefficient, 2 cos(2π fk/fs), for computing each |X(k)|2. Since there are eight possible tones, the detector needs eight filters as described by Equations (9.12) and (9.14). Each filter is tuned to one of the eight frequen- cies. Note that Equation (9.12) is computed for n = 0, 1,...,N − 1, but Equation (9.14) is computed only once at time n = N − 1. 9.2.3 Other DTMF Detection Methods Goertzel algorithm is very efficient for DTMF signal detection. However, some real-world applications may already have other DSP modules that can be used for DTMF detection. For example, some noise reduction applications use FFT algorithm to analyze the spectrum of noise, and some speech-coding algorithms use the linear prediction coding (LPC). In these cases, the FFT or the LPC coefficients can be used for DTMF detection. 
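Before turning to these alternatives, it is worth seeing how compact the Goertzel computation itself is. The floating-point sketch below illustrates Equations (9.12) and (9.14) for a single tone; the frame length N and frequency fk are assumed to satisfy Equation (9.10), and a fixed-point DSK implementation would replace the double arithmetic with Q15 operations.

#include <math.h>

#define TWO_PI 6.283185307179586

/* Hedged sketch of the per-tone Goertzel computation: run the recursive
   part of Equation (9.12) over one frame x[0..N-1], then evaluate the
   squared magnitude |X(k)|^2 once at n = N-1 using Equation (9.14).      */
double goertzel_mag2(const double *x, int N, double fk, double fs)
{
    double coef = 2.0 * cos(TWO_PI * fk / fs);
    double w1 = 0.0, w2 = 0.0;           /* w_k(n-1) and w_k(n-2)          */
    int n;

    for (n = 0; n < N; n++) {
        double w0 = x[n] + coef * w1 - w2;
        w2 = w1;
        w1 = w0;
    }
    return w1 * w1 - coef * w1 * w2 + w2 * w2;
}

A DTMF detector calls this routine eight times per frame, once for each row and column frequency, and then applies the validity tests described in Section 9.2.4.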
DTMF detection embedded in noise cancelation In noise reduction systems that use spectrum subtraction method (will be introduced in Chapter 12), the time-domain signal is transformed to frequency domain using the FFT algorithm. Therefore, the FFT results can be used for DTMF detection as shown in Figure 9.5. The system shown in Figure 9.5 shares the FFT results for noise cancelation and DTMF detection. Since frequency information is available from the FFT block, the DTMF detection can be simplified. All-pole modeling using LPC coefficients Chapter 11 will introduce many speech-coding algorithms using an all-pole LPC synthesis filter. The synthesis filter is defined as 1 A(z) = 1 1 − p i=1 ai z−i , (9.15) FFT Noise reduction and IFFT DTMF detection Noise suppressed speechNoise speech Speech coding Mux DTMF information To channel Encoded bitNoise cancellation Figure 9.5 DTMF detection embedded in a noise cancelation systemJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 DTMF TONE DETECTION 427 where ai is the short-term LPC coefficient and p is the LPC filter order. The calculation of LPC coefficients can be found in Section 11.4. This all-pole filter can be further decomposed with several second-order sections. If the LPC order p is an even number, it can be written as 1 A(z) = 1 (1 + a11z−1 + a12z−2)(1 + a21z−1 + a22z−2) ···(1 + aq1z−1 + aq2z−2) (9.16) with q = p/2. If p is an odd number with q = (p − 1)/2, the first-order component (1 + aq+1z−1)is used and Equation (9.16) can be modified as 1 A(z) = 1 (1 + a11z−1 + a12z−2) ···(1 + aq1z−1 + aq2z−2)(1 + aq+1z−1) . (9.17) We assume that we have LPC coefficients and they are shared between a speech coder and a DTMF detector. Example 9.1:Compare the similarity of the FFT spectrum of the DTMF digit ‘5’ and the frequency response of a 10th-order LPC synthesis filter. The frequencies used for DTMF digit ‘5’ are fL = 770 Hz and fH = 1336 Hz at sampling rate 8000 Hz, and the DTMF signal can be generated by MATLAB as x(1 : N) = sin (2π fL(1 : N)) + sin (2π fH(1 : N)) . Using MATLABfunction levinson, we can compute the LPC coefficients from its autocorrelation function based on Equation (9.15) as follows: lpcOrder=10; % LPC order w=hamming(N); % Generate hamming window x=x.*w'; % Windowing m=0; while (m<=lpcOrder); % Calculation of auto-correlation r(m+1)=sum(x(1:(N-m)).*x((1+m):N)); m=m+1; end; a=levinson(r,lpcOrder); % Levinson algorithms The generated LCP coefficients are listed as follows: a[0] = 1.0000, a[1] = -1.5797, a[2] = 1.4570, a[3] = -0.0021, a[4] = -0.1805, a[5] = 0.1195, a[6] = 0.3082, a[7] = 0.2145, a[8] = 0.0230, a[9] = -0.0556, a[10] = 0.1797 Figure 9.6 shows the spectrum of DTMF tones for digit ‘5’ and the spectrum from the LPC coef- ficients estimation. This example demonstrates that the roots of an all-pole filter, which represents the dual frequencies of DTMF tones, can be closely located using the LPC modeling. Example 9.2: The roots of LPC synthesis filter coefficients can be computed using MATLAB function roots(a). The angles of these roots can be converted from frequency in radian to Hz using MATLAB function freq=((angle(r)*8000/(2*pi)). 
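For readers implementing the same check outside MATLAB, a small C companion to Examples 9.1 and 9.2 is sketched below: it converts one complex root of the synthesis filter into its amplitude and frequency in Hz and tests it against the eight DTMF frequencies. The thresholds play the same role as the AmpThreahold (0.98) and AngThreahold (5 Hz) values used later in the Section 9.4.4 experiment; the function itself is an illustration, not part of the experiment code.

#include <math.h>

/* Map one LPC root (re + j*im) to amplitude and frequency (Hz), and return the
   index of the matching DTMF tone, or -1 if the root is not a DTMF candidate. */
static const double dtmf_freqs[8] = {697, 770, 852, 941, 1209, 1336, 1477, 1633};

int match_dtmf_root(double re, double im, double fs,
                    double amp_min, double freq_tol_hz)
{
    double amp  = sqrt(re * re + im * im);                 /* distance from origin */
    double freq = fabs(atan2(im, re)) * fs / 6.283185307179586;  /* angle -> Hz   */
    int i;

    if (amp < amp_min)          /* only roots close to the unit circle dominate   */
        return -1;              /* the magnitude response                          */

    for (i = 0; i < 8; i++)
        if (fabs(freq - dtmf_freqs[i]) <= freq_tol_hz)
            return i;           /* index into the eight DTMF frequencies          */

    return -1;
}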
The roots and angles are listed in Table 9.2, and the roots are also plotted in Figure 9.7.

Figure 9.6 Comparison of spectrum estimated by LPC (original signal spectrum and LPC all-pole synthesis filter frequency response; magnitude in dB versus frequency in Hz)

The roots from Example 9.2 occur in complex-conjugate pairs. These roots represent the five real second-order sections in Equation (9.16). The third and fourth pairs of roots are the most important since they are very close to the unit circle and their frequencies are comparable to the two frequencies used for digit '5'. The amplitudes of the other roots are smaller since they are located well inside the unit circle, and their frequencies are not within the DTMF frequency ranges. The roots with amplitudes close to unity dominate the magnitude response. For the example using DTMF digit '5', the relative differences in amplitude estimation are 0.0873 % for 770 Hz and 0.1274 % for 1336 Hz. The estimated frequency differences are 0.1799 % for 770 Hz and 0.1223 % for 1336 Hz. Examples 9.1 and 9.2 show that the DTMF frequencies estimated from the LPC coefficients are very close to the DTMF frequencies defined by the ITU-T Q.23 recommendation. Thus, the LPC coefficients available in speech coders can be used for DTMF detection.

Table 9.2 Roots and angles of 10th-order LPC synthesis filter
  No.  Complex roots          Amplitude   Frequency (Hz)
  1    -0.6752 ± j0.2510      0.7204      ±3546.8
  2    -0.3481 ± j0.6506      0.7378      ±2625.5
  3     0.8225 ± j0.5672      0.9991      ± 768.6
  4     0.4964 ± j0.8666      0.9987      ±1337.6
  5     0.4942 ± j0.6283      0.7993      ±1151.4

Figure 9.7 Plot of the roots of the 10th-order LPC synthesis filter (the conjugate root pairs at 768.6 Hz and 1337.6 Hz lie close to the unit circle)

9.2.4 Implementation Considerations

The flow chart of the DTMF tone detection algorithm is illustrated in Figure 9.8. At the beginning of each frame, the state variables x(n), w_k(n), w_k(n − 1), w_k(n − 2), and y_k(n) for each of the eight Goertzel filters, as well as the energy, are set to zero. For each new sample, the recursive part of the filtering defined in Equation (9.12) is executed. At the end of each frame, i.e., at n = N − 1, the squared magnitude |X(k)|^2 for each DTMF frequency is computed based on Equation (9.14). Six tests then determine whether a valid DTMF digit has been detected:

Magnitude test: According to ITU-T Q.24, the maximum signal level transmitted to the public network shall not exceed −9 dBm. The average voice level then ranges from about −35 dBm for a very weak long-distance call to −10 dBm for a local call. A DTMF receiver is expected to operate over an average range of −29 to +1 dBm. Thus, the largest magnitude in each band must be greater than a threshold corresponding to −29 dBm; otherwise, no DTMF signal is detected. For the magnitude test, the squared magnitude |X(k)|^2 defined in Equation (9.14) is computed for each DTMF frequency, and the largest magnitude in each group is obtained.

Twist test: The tones may be attenuated according to the telephone system's gains at the tonal frequencies. Therefore, we do not expect the received tones to have the same amplitude, even though they may be transmitted with the same strength. Twist is defined as the difference, in decibels, between the low- and high-frequency tone levels; a simple sketch of these two checks is given below.
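The following is a hedged sketch of the magnitude and twist tests only; the remaining four tests are omitted, and the threshold arguments are placeholders rather than the book's calibrated values. Elow and Ehigh hold the |X(k)|^2 outputs of the low- and high-group Goertzel filters for one frame.

#include <math.h>

/* Returns -1 if the candidate pair fails the magnitude or twist test, otherwise
   a keypad index formed from the low-group (row) and high-group (column)
   indices.  The twist sign convention follows the text: reverse twist means the
   low-group tone is weaker than the high-group tone. */
int magnitude_and_twist_test(const double Elow[4], const double Ehigh[4],
                             double energy_threshold,   /* level near -29 dBm */
                             double fwd_twist_dB,       /* e.g. 8 dB maximum  */
                             double rev_twist_dB)       /* e.g. 4 dB maximum  */
{
    int i, kl = 0, kh = 0;

    for (i = 1; i < 4; i++) {            /* largest magnitude in each group */
        if (Elow[i]  > Elow[kl])  kl = i;
        if (Ehigh[i] > Ehigh[kh]) kh = i;
    }
    if (Elow[kl] < energy_threshold || Ehigh[kh] < energy_threshold)
        return -1;                        /* magnitude test failed */

    /* level difference in dB from the squared-magnitude (power) ratio */
    double twist = 10.0 * log10(Elow[kl] / Ehigh[kh]);
    if (twist > fwd_twist_dB || -twist > rev_twist_dB)
        return -1;                        /* twist test failed */

    return kl * 4 + kh;                   /* candidate row/column pair */
}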
In practice, the DTMF digits are generated with forward twist to compensate for greater losses at higher frequency within a long telephone cable. For example, Australia allows 10 dB of forward twist, Japan allows only 5 dB, and North America recommends not more than 8 dB of forward twist and 4 dB of reverse twist. Frequency-offset test: This test prevents some broadband signals from being detected as DTMF tones. If the effective DTMF tones are present, the power levels at those two frequencies should be much higher than the power levels at the other frequencies. To perform this test, the largest magnitude in each group is compared to the magnitudes of other frequencies in that group. The difference must be greater than the predetermined threshold in each group.JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 430 DUAL-TONE MULTIFREQUENCY DETECTION Initialization Get 8 kHz input sample Compute the recursive part of the Goertzel filter for the eight frequencies n = N − 1? Compute the nonrecursive part of the Goertzel filter for the eight frequencies Yes No Magnitude > threshold? Twist normal? Does frequency offset pass? Total-energy test pass? Second harmonic signal too strong? Output digit N N N N Y N N Y Y Y Y N Y Y D(m)=D(m − 2)? D(m)=D(m − 1)? Figure 9.8 Flow chart for the DTMF tone detector Total-energy test: Similar to the frequency-offset test, the goal of total-energy test is to reject some broadband signals to further improve the robustness of a DTMF decoder. To perform this test, three different constants c1, c2, and c3 are used. The energy of the detected tone in the low-frequency group is weighted by c1, the energy of the detected tone in the high-frequency group is weighted by c2, and the sum of the two energies is weighted by c3. Each of these terms must be greater than the summation of the energy from the rest of the filter outputs.JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 INTERNET APPLICATION ISSUES AND SOLUTIONS 431 Second harmonic test: The objective of this test is to reject speech that has harmonics close to fk so that they might be falsely detected as DTMF tones. Since DTMF tones are pure sinusoids, they contain very little second harmonic energy. Speech, on the other hand, contains a significant amount of second harmonic. To test the level of second harmonic, the detector must evaluate the second harmonic frequencies of all eight DTMF tones. These second harmonic frequencies (1394, 1540, 1704, 1882, 2418, 2672, 2954, and 3266 Hz) can also be detected using the Goertzel algorithm. Digit decoder: Finally, if all five tests are passed, the tone pair is decoded and mapped to one of the 16 keys on a telephone touch-tone keypad. This decoded digit is placed in a memory location designated D(m). If any of the tests fail, then ‘−1’ representing ‘no detection’ is placed in D(m). For a new valid digit to be declared, the current D(m) must be the same in three successive frames, i.e., D(m − 2) = D(m − 1) = D(m). There are two reasons for checking three successive digits at each pass. First, the check eliminates the need to generate hits every time a tone is present. As long as the tone is present, it can be ignored until it changes. Second, comparing digits D(m − 2), D(m − 1), and D(m) improves noise and speech immunity. 9.3 Internet Application Issues and Solutions Voicecoders have been designed for transmission of low-bit-rate signals while retaining reasonable audio quality and robustness over the networks. 
However, most of the vocoders overlook the importance of passing in-band tonal signals including DTMF. The approach for DTMF signaling over the networks can either use vocoders that are capable of passing in-band tones or use out-of-band signaling. RFC2833 is a document for carrying DTMF signals and telephony events using RTP packets over the Internet. A separate RTP payload format is used due to the concern of low-rate vocoders that may not guarantee to reproduce the tone signals accurately for automatic recognition. The separate RTP payload format can be considered as the ‘out-of-band’ channel. Using separate payload formats also permits higher redundancy while maintaining a low-bit rate. Figure 9.9 shows an example for Internet applications using DTMF generator and detector. The end user can use an automated computer system that recognizes the DTMF signaling and controls the access to the database, such as electronic bank accounts, voicemail systems, and automatic billing systems. Using the separated RTP payload allows the receiving side to recognize the in-band DTMF tones and regenerate these tones locally if necessary. RFC2833 also covers fax tones, modem tones, country-specific subscriber line tones, and trunk events. The RTP payload type for vocoders will be discussed in Chapter 11. The DTMF typically uses dynamic payload type. The dynamic payload type means the type is negotiated using session description protocol between the two sides defined in RFC3551. Table 9.3 gives an example of the DTMF packet. The first 12 bytes are RTP header and the last 4 bytes are DTMF event. In the DTMF event data, the first byte Encoder Signal in Signal out DTMF detector Decoder DTMF generator DTMF IP network Voice packet Figure 9.9 An example of DTMF detection and generation for InternetJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 432 DUAL-TONE MULTIFREQUENCY DETECTION Table 9.3 Example of RTP packet of DTMF digit ‘1’ 80 62 f4 62 00 24 cb ac ac 24 a8 7a 01 08 00 a0 Packet summary lines ; Data Real-Time Transport Protocol ; 10.. .... = Version: RFC 1889 Version (2) ; 80 ..0. .... = Padding: False ; ...0 .... = Extension: False ; .... 0000 = Contributing source identifiers count: 0 ; 0... .... = Marker: False ; 62 .110 0010 = Payload type: Unknown (98) ; Sequence number: 62562 ; f4 62 Timestamp: 2411436 ; 00 24 cb ac Synchronization source identifier: 2888083578 ; ac 24 a8 7a RFC 2833 RTP event Event ID: DTMF One 1 (1) ; 01 0... .... = End of event: False ; 08 .0.. .... = Reserved: False ; ..00 1000 = Volume: 8 ; Event duration: 160 ; 00 a0 0x01 represents the digit ‘1’, the last 6 bits of second byte, 0x08, represent the volume of −8 dBm0, and the third and fourth bytes, 0x00a0, represent the duration of 160 ms. 9.4 Experiments and Program Examples In this section, we will use the MATLAB’s CCS Link to connect MATLAB with the C5510 DSK for experiments on DTMF tone detection. 9.4.1 Implementation of Goertzel Algorithm Using Fixed-Point C The Goertzel algorithm can be separated into two paths. The recursive path is defined by Equation (9.12). Table 9.4 lists the implementation of the recursive path in fixed-point C program. In the code, the pointer d points to the filter’s delay-line buffer. The input is passed to the function by variable in. The variable coef is the filter coefficient. To prevent overflow, input data is scaled down by 7 bits. 
Note that the Table 9.4 C implementation of Goertzel filter’s recursive path void gFilter (short *d, short in, short coef) { long AC0; d[0] = in >> 7; // Get input with scale down by 7 bit AC0 = (long) d[1] * coef; AC0 >>= 14; AC0 -= d[2]; d[0] += (short)AC0; d[2] = d[1]; // Update delay-line buffer d[1] = d[0]; }JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 433 Table 9.5 C implementation of nonrecursive path of Goertzel filter short computeOutput(short *d, short coef) { long AC0, AC1; AC0 = (long) d[1] * d[1]; // Square d[1] AC0 += (long) d[2] * d[2]; // Add square d[2] AC1 = (long) d[1] * coef; AC1 >>= 14; AC1 = AC1 * d[2]; AC0 -= AC1; d[1] = 0; d[2] = 0; return ((short)(AC0 >> 14)); } fixed-point C implementation requires the data type conversion to be cast in long for multiplication, and the 14-bit shift is used to align the multiplication product to be stored in Q15 format. The recursive path calculation is carried out for every data sample. The nonrecursive path of Goertzel filter described by Equation (9.14) is used once for every other N sample. The calculation of the final Goertzel filter output is implemented in C as shown in Table 9.5. Table 9.6 lists the files used for this experiment. The test program DTMFdecodeTest.c opens the test parameter file param.txt to get the DTMF data filenames. This experiment analyzes the input data file and reports the detected DTMF digits in the output file DTMFKEY.txt. Procedures of the experiment are listed as follows: 1. Open C55 project fixedPoint_DTMF.pjt, build, load, and run the program. 2. Examine the DTMF detection results to validate the correctness of the decoding process. 3. Modify DTMF generation experiment given in Section 8.4.4 to generate DTMF signals that can be used for testing DTMF magnitude level and frequency offset. Table 9.6 File listing for experiment exp9.4.1_fixedPointDTMF Files Description DTMFdecodeTest.c C function for testing experiment gFilter.c C function computes recursive path computeOutput.c C function computes nonrecursive path dtmfFreq.c C function calculates all frequencies gFreqDetect.c C function maps frequencies to keypad row and column checkKey.c C function reports DTMF keys init.c C function for initialization dtmf.h C header file fixedPoint_DTMF.pjt DSP project file fixedPoint_DTMF.cmd DSP linker command file param.txt Parameter file DTMF16digits.pcm Data file containing 16 digits DTMF_with_noise.pcm Data file with noiseJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 434 DUAL-TONE MULTIFREQUENCY DETECTION 4. Modify the program to perform the magnitude test as described in Section 9.2.4. 5. Modify the program to perform the frequency test as described in Section 9.2.4. 9.4.2 Implementation of Goertzel Algorithm Using C55x Assembly Language The efficient implementation of DTMF detection has been a design challenge for multichannel real-time applications. This experiment presents the implementation of Goertzel algorithm using the TMS320C55x assembly language. Table 9.7 lists the assembly routine that implements the Goertzel algorithm. The input data sample is passed in by register T0. The right-shifted 7 bits scale the input signal to prevent possible overflow during the filtering process. The Goertzel filter coefficient is stored in register T1. The auxiliary register AR0 is the pointer to d[3] in the delay-line buffer. The Goertzel filtering routine gFilter is called for every data sample. 
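For orientation, a hypothetical frame-level driver is sketched below; it is not one of the experiment files listed in Table 9.6, and the array names are assumptions. It shows how the routines of Tables 9.4 and 9.5 might be combined: gFilter() runs per sample for each of the eight tones, and computeOutput() runs once per frame to produce |X(k)|^2 (it also clears the delay line for the next frame, as shown in Table 9.5).

#define NUM_TONES 8

void  gFilter(short *d, short in, short coef);      /* Table 9.4 */
short computeOutput(short *d, short coef);          /* Table 9.5 */

void dtmf_frame(const short *x, int N,
                const short coef[NUM_TONES],        /* Q14 values of 2*cos(2*pi*fk/fs) */
                short delay[NUM_TONES][3],          /* per-tone delay-line buffers     */
                short energy[NUM_TONES])            /* |X(k)|^2 for each tone          */
{
    int n, k;

    for (n = 0; n < N; n++)                         /* recursive part, Eq. (9.12)    */
        for (k = 0; k < NUM_TONES; k++)
            gFilter(delay[k], x[n], coef[k]);

    for (k = 0; k < NUM_TONES; k++)                 /* nonrecursive part, Eq. (9.14) */
        energy[k] = computeOutput(delay[k], coef[k]);
}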
After the recursive path has processed N − 1 samples, the final Goertzel filtering result will be com- puted at the Nth iteration for the nonrecursive path. Table 9.8 shows the assembly routine that computes the final Goertzel filter output. The caller passes two arguments, the auxiliary register AR0 is the pointer to the delay line d[3], and register T0 contains the Goertzel filter coefficient of the given frequency. The final Goertzel filter output is returned to the caller via register T0 at the end of the routine. Table 9.9 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Open the project asm_DTMF.pjt, build, load, and run the program. 2. Examine the DTMF detection results to validate the correctness of the decoding process. 3. Modify DTMF generation experiment given in Section 8.4.4 to generate DTMF signals that can be used for testing DTMF twist level and the second harmonics. Table 9.7 Assembly language implementation of Goertzel filter .global _gFilter _gFilter: mov T0, AC0 sfts AC0, #-7 ; d[0] = in >> 7 mov AC0, *AR0+ mov *AR0+, AR1 mov AR1, HI(AC1) mpy T1, AC1 ; AC0 = (long) d[1] * coef sfts AC1, #-14 ; AC0 >>= 14 sub *AR0-, AC1, AR2 ; AC0 -= d[2] amar *AR0- || add AC0, AR2 mov AR2, *AR0+ ; d[0] += (short)AC0 mov AR2, *AR0+ ; d[1] = d[0] mov AR1, *AR0 ; d[2] = d[1] retJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 435 Table 9.8 Assembly routine to compute the Goertzel filter output .global _computeOutput _computeOutput: amar *AR0+ mpym *AR0+, T0, AC1 ; AC1 = (long) d[1] * coef sfts AC1, #-14 ; AC1 >>= 14; mov AC1, T0 mpym *AR0-, T0, AC0 ; AC1 = AC1 * d[2] sqrm *AR0+, AC1 ; AC0 = (long) d[1] * d[1]; sqam *AR0-, AC1 ; AC0 += (long) d[2] * d[2]; sub AC0, AC1 ; AC0 -= AC1 || mov #0, *AR0+ ; d[1] = 0 sfts AC1, #-14, AC0 || mov #0, *AR0 ; d[2] = 0 mov AC0, T0 ; out = (short)(AC0 >> 14); ret 4. Use CCS profile tool to compare the number of cycles used between this assembly implementation and fixed-point C implementation given in Section 9.4.1. 9.4.3 DTMF Detection Using C5510 DSK In this experiment, we will use MATLAB Link for CCS with the C5510 DSK to conduct the DTMF detection experiment. The flow of experiment is shown in Figure 9.10. Some of the frequently used CCS control commands that MATLAB supports are listed in Table 9.10. We modified the MATLAB script of DTMF generator given in Chapter 8 for this experiment. Go through the following steps to create a new GUI for DTMF experiment: 1. Start MATLABand set path to ..\experiments\exp9.4.3_MATLABCCSLink. Copy the MATLAB files DTMFGenerator.m and DTMFGenerator.fig from Chapter 8 to current folder. 
Table 9.9 File listing for experiment exp9.4.2_asmDTMF Files Description DTMFdecodeTest.c C function for testing experiment gFilter.asm Assembly function computes recursive path computeOutput.asm Assembly function computes nonrecursive path dtmfFreq.asm Assembly function calculates all frequencies gFreqDetect.c C function maps frequencies to keypad checkKey.c C function reports DTMF keys init.c C function for initialization dtmf.h C header file asm_DTMF.pjt DSP project file asm_DTMF.cmd DSP linker command file param.txt Parameter file DTMF16digits.pcm Data file containing 16 digits DTMF_with_noise.pcm Data file with noiseJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 436 DUAL-TONE MULTIFREQUENCY DETECTION MATLAB: Creates DTMF data files using GUI for DSP experiment MATLAB: Open DSP project, build, and load DSP code for the experiment C55x DSK: Reads in DTMF data file and decode DTMF signal MATLAB: Reads experiment output and display the result Figure 9.10 MATLAB Link for CCS experiment flow 2. Enter the command guide at MATLABcommand window to start Guide Quick Start menu. Choose Open Exist GUI tab to open file DTMFGenerator.fig. Add a new button key named Decode DTMF as shown in Figure 9.11. 3. Save the GUI as DTMF.fig and a new file DTMF.m. The file DTMF.m is the same as DTMFGenerator.m except a new callback function is appended at the end. Table 9.11 shows partial code of the DTMF.m. The MATLAB files DTMF.fig and DTMF.m are used to control the project to conduct the DSK experiment. 4. Modify the function dtmfGen( ) in DTMF.m to include the capability of saving keypress into a PCM file as the DTMF signal. The modified DTMF generator is listed as follows: % --- DTMF signal generation function x=dtmfGen(fl, fh) fs = 8000; N = [0:1/fs:0.1]; x = 0.5*(cos(2*pi*fl*N)+cos(2*pi*fh*N)); sound(x,fs) x = int16(x*16383); fid=fopen('.\\DTMF\\data\\data.pcm', 'ab'); % Write DTMF data fwrite(fid, x, 'int16'); x = zeros(1, 400); fwrite(fid, x, 'int16'); fclose(fid); Table 9.10 MATLAB Link for CCS functions MATLAB function CCS command and status build Build the active project in CCS IDE ccsboardinfo Obtain information about the boards and simulators ccsdsp Create the link to CCS IDE clear Clear the link to CCS IDE halt Stop execution on the target board or simulator isrunning Check if the DSP processor is running load Load executable program file to target processor read Read global variables linked with CCS Link object reset Reset the target processor restart Place program counter to the entry point of the program run Execute program loaded on the target board or simulator visible Control the visibility for CCS IDE window write Write data to global variables linked with CCS Link objectJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 437 Figure 9.11 MATLAB GUI for DTMF detection experiment This modified function records each DTMF signal to a PCM file data.pcm. The duration of each DTMF signal is 100 ms followed by 50 ms of silence. 5. Add MATLAB Link for CCS code to the DTMF.m. Table 9.12 lists the MATLAB script. This MATLAB script controls the execution of DSK. The function ccsboardinfo checks the DSP development system and returns the information regarding the board and processor that it has identified. 
Table 9.11 MATLAB script DTMF.m generated by GUI editor % --- DTMF signal generation function x=dtmfGen(fl, fh) fs = 8000; N = [0:1/fs:0.1]; x = 0.5*(cos(2*pi*fl*N)+cos(2*pi*fh*N)); sound(x,fs) % --- Executes on button press in pushbutton17 function pushbutton17_Callback(hObject, eventdata, handles)JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 438 DUAL-TONE MULTIFREQUENCY DETECTION Table 9.12 MATLAB script using Link for CCS to command the C5510 DSK board = ccsboardinfo; % Get DSP board & processor information dsp = ccsdsp('boardnum',... % Link DSP with CCS board.number, 'procnum', board.proc(1,1).number); set(dsp,'timeout',100); % Set CCS timeout value to 100(s) visible(dsp,1); % Force CCS to be visible on desktop open(dsp,'DTMF\\ccsLink.pjt'); % Open project file build(dsp,1500); % Build the project if necessary load(dsp, '.\\DTMF\\Debug\\ccsLink.out',300); % Load project with timeout 300(s) reset(dsp); % Reset the DSP processor restart(dsp); % Restart the program run(dsp); % Start execution or wait new command cpurunstatus = isrunning(dsp); while cpurunstatus == 1, % Wait until processor completes task cpurunstatus = isrunning(dsp); end The ccsdsp function creates the link object to CCS using the information obtained from the function call to ccsboardinfo. The functions open, build, load, reset, restart, and run are the standard CCS commands that control the CCS IDE functions and status. The function run consists of several options. The option main is the same as CCS command Go Main. The option tohalt will start DSP processor and run until the program reaches a breakpoint or it is halted. The option tofunc will start and run the DSP processor until the program reaches the given function. The build function also has multiple options. The default build function makes an incremental build, while the option all will perform CCS command Rebuild All. In this experiment, the function isrunning is used to check if the DSK processing is completed. The software for DTMF decoder using MATLAB Link for CCS includes the DSP project, source files, and MATLAB script files. Table 9.13 lists the files used for this experiment. Table 9.13 File listing for experiment exp9.4.3_MATLABCCSLink Files Description DTMF.m MATLAB script for testing experiment DTMF.fig MATLAB GUI DTMFdecodeTest.c DTMF experiment test file gFilter.asm Assembly function computes recursive path computeOutput.asm Assembly function computes nonrecursive path dtmfFreq.asm Assembly function calculates all frequencies gFreqDetect.c C function maps frequencies to keypad checkKey.c C function reports DTMF keys Init.c C function for initialization dtmf.h C header file ccsLink.pjt DSP project file ccsLink.cmd DSP linker command file dtmfGen.m MATLAB function for DTMF tone generation dspDTMf.m MATLAB function for commanding C55xCCSJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 439 In this experiment, MATLAB command window will show each key that is pressed and display the DTMF detection result. Procedures of the experiment are listed as follows: 1. Connect DSK to the computer and power on the DSK. 2. Create and build the DSP project for the experiment. If no errors, exit CCS. 3. Start MATLAB and set the path to the directory ..\exp9.4.3_MATLABCCSLink. 4. Type DTMF at MATLAB command window to display the DTMF experiment GUI. 5. Press several DTMF keys to generate a DTMF sequence. 6. Press the Decode DTMF key on the GUI to start CCS, build the DSP project, and then run the DTMF decoder. 7. 
Multichannel DTMF detection is widely used in industries. Modify the experiment such that it per- forms two-channel DTMF detection in parallel. The input data for the two-channel DTMF detection can be generated in time-division method. Since this experiment reads the input data from a PCM data file, we can create two DTMF signaling files and read both of them in for the two-channel experiment. 9.4.4 DTMF Detection Using All-Pole Modeling In this experiment, we will present the MATLAB script to show the basic concept of DTMF detection using the LPC all-pole modeling. Table 9.14 lists the files used for this experiment. Procedures of the experiment are listed as follows: 1. Copy the MATLAB files DTMF.fig and DTMF.m from the previous experiment to the directory ..\ exp9.4.4_LPC. 2. Modify DTMF.m to replace the Link for CCS function by the code listed in Table 9.15. This function reads the DTMF data. The all-pole function is implemented using the Levinson algorithm to avoid direct matrix inversion in computing the autocorrelation and LPC coefficients. The roots of LPC coefficients are calculated using the MATLAB function roots. Finally, the amplitude and angles are analyzed and compared to detect DTMF tones. 3. The user interface is the same as the previous experiment. Press DTMF keys to generate a PCM file. Press the Decode DTMF key to start the decoder. Table 9.14 File listing for experiment exp9.4.4_LPC Files Description DTMF.m MATLAB script for testing experiment DTMF.fig MATLAB GUIJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 440 DUAL-TONE MULTIFREQUENCY DETECTION Table 9.15 MATLAB code for LPC all-pole modeling N=256; % Length of FFT fs=8000; % Sampling frequency f=fs*(0:(N/2-1))/N; % Frequency scale for display KEY = 1:16; % Keypad map lookup table index % col| row KEY(1+1) =0016+0001; KEY(1+2) =0032+0001; KEY(1+3) =0064+0001; KEY(1+10)=0128+0001; KEY(1+4) =0016+0002; KEY(1+5) =0032+0002; KEY(1+6) =0064+0002; KEY(1+11)=0128+0002; KEY(1+7) =0016+0004; KEY(1+8) =0032+0004; KEY(1+9) =0064+0004; KEY(1+12)=0128+0004; KEY(1+14)=0016+0008; KEY(1+0) =0032+0008; KEY(1+15)=0064+0008; KEY(1+13)=0128+0008; % Table lookup for Keys DIGIT =['0','1','2','3','4','5','6','7','8','9','A','B','C','D','*','#']; freq = [697 770 852 941 1209 1336 1477 1633]; digi=[1248163264128]; lpcOrder=10; % LPC order w=hamming(N); % Generate Hamming window fid=fopen('.\\data\\data.pcm','r'); % Open the PCM data file prevDigit = 0; while ∼ feof(fid) x = fread(fid,N,'short'); if size(x) ∼ =N continue; end % Check energy if sum(abs(x)) <= 200000 prevDigit = 0; else % Compute autocorrelation x=x.*w; % Windowing m=0; while (m<=lpcOrder); r(m+1)=sum(x(1:(N-m)).*x((1+m):N)); m=m+1; end; a=levinson(r,lpcOrder); % Levinson algorithm % Calculate root r=roots(a); % Find roots amp=abs(r); % Get amplitudes ang=(angle(r)*fs/pi/2); % Get angles % Compare with the tableJWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 REFERENCES 441 Table 9.15 (continued ) AmpThreahold = 0.98; % 0.02% AngThreahold = 5; % 5 Hz dtmf =0; for i=1:2:(lpcOrder) if abs(amp(i)) >= AmpThreahold for j = 1:8 if (abs(ang(i)) <= (freq(j)+AngThreahold)) if (abs(ang(i)) >= (freq(j)-AngThreahold)) dtmf = dtmf + digi(j); end end end end end % Check if dtmf detected dtmfDet=0; for i=1:16 if dtmf == KEY(i) dtmfDet =i; end end % Display result if dtmfDet ∼ =0 if (DIGIT(dtmfDet) ∼ = prevDigit) disp(sprintf('%s is detected', DIGIT(dtmfDet))); prevDigit = DIGIT(dtmfDet); end else prevDigit = 0; end end end fclose(fid); 4. 
Validate the DTMF digits displayed on MATLAB command window are correctly decoded. 5. If the all-pole filter order is 4, is it possible to find the root of the filter that matches the DTMF frequency? Modify the experiment to validate your claim. References [1] ITU-T Recommendation Q.23, Technical Features of Push-Button Telephone Sets, 1993. [2] ITU-T Recommendation Q.24, Multifrequency Push-Button Signal Reception, 1993. [3] 3GPP TR-T12-26.975 V6.0.0, Performance Characterization of the Adaptive Multi-Rate (AMR) Speech Codec (Release 6), Dec. 2004. [4] TI Application Report, DTMF Tone Generation and Detection Ð An Implementation Using the TMS320C54x, SPRA 096A, May 2000. [5] W. Tian and Y. Lu, ‘System and method for DTMF detection using likelihood ratios,’ US Patent no. 6,873,701, Mar. 2005.JWBK080-09 JWBK080-Kuo March 8, 2006 12:0 Char Count= 0 442 DUAL-TONE MULTIFREQUENCY DETECTION [6] Y. Lu and W. Tian, ‘DTMF detection based on LPC coefficients,’ US Patent no. 6,590,972, July 2003. [7] F. F. Tzeng, ‘Dual-tone multifrequency (DTMF) signaling transparency for low-data-rate vocoder,’ US Patent no. 5,459,784, Oct. 1995. [8] R. Rabipour and M. Beyrouti, ‘LPC-based DTMF receiver for secondary signaling,’ US Patent no. 4,853,958, Aug. 1989. [9] C. Lee and D.Y. Wong, ‘Digital tone decoder and method of decoding tones using linear prediction coding,’ US Patent no. 4,689,760, Aug. 1987. [10] N. Ahmed and T. Natarajan, Discrete-Time Signals and Systems, Englewood Cliffs, NJ: Prentice-Hall, 1983. [11] MATLAB, Version 7.0.1 Release 14, Sep. 2004. [12] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1989. [13] S. J. Orfanidis, Introduction to Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1996. [14] J. G. Proakis and D. G. Manolakis, Digital Signal Processing Ð Principles, Algorithms, and Applications, 3rd Ed., Englewood Cliffs, NJ: Prentice Hall, 1996. [15] A Bateman and W. Yates, Digital Signal Processing Design, New York: Computer Science Press, 1989. [16] J. Hartung, S. L. Gay, and G. L. Smith, Dual-tone Multifrequency Receiver Using the WE DSP16 Digital Signal Processor, Application Note, AT&T, 1988. [17] Analog Devices, Digital Signal Processing Applications Using the ADSP-2100 Family, Englewood Cliffs, NJ: Prentice Hall, 1990. [18] P. Mock, Add DTMF Generation and Decoding to DSP-uP Designs, Digital Signal Processing Applications with the TMS320 Family, Texas Instruments, 1986, Chap. 19. [19] J. S. Lim and A. V. Oppenheim, ‘Enhancement and bandwidth compression of noisy speech,’ Proc. of the IEEE, vol. 67, Dec. 1979, pp. 1586Ð1604. [20] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, New York: MacMillan, 1993. [21] H. Schulzrinne and S. Petrack, RTP Payload for DTMF Digits, Telephony Tones and Telephony Signals, Request for Comments 2833 (RFC2833), May 2000. [22] H. Schulzrinne and S. Casner, RTP Profile for Audio and Video Conferences with Minimal Control, IETF RFC3551, July 2003. Exercises 1. According to Table 9.1, DTMF frequency tolerance for operation is ≤1.5 % and nonoperation is ≥3.5 %. Calculate the operation and nonoperation frequency boundaries of all eight frequencies. 2. For N-point DFT, the frequency resolution is fs 2N = 8000 2N at 8000 Hz sampling rate. In Goertzel algorithm, the signal frequency fk is approximated by fs k 2N .IfN is not properly selected, the signal frequency fk may be off more than 1.5 % due to the DFT algorithm. 
By using the frequency operation tolerance 1.5 %, calculate N ∈ [180, 256] which makes all eight frequencies meet the requirement. 3. A female voice contains the strong harmonic with pitch frequency of 210 Hz. Which digit may be falsely registered? Explain why? Assume this DTMF detector meets the frequency tolerance requirement. 4. Besides Goertzel algorithm, an IIR filterbank can be used for DTMF detection. Design a new experiment that uses eight fourth-order IIR filters for DTMF detection. Profile the performance and compare it with the decoder that uses the Goertzel algorithm. 5. DTMF digits can also be used for low-rate data communications. One digit can be treaded as a 4-bit symbol. Assuming each DTMF digit is sent every 80 ms, calculate the bit rate per second.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 10 Adaptive Echo Cancelation Adaptive echo cancelation is an important application of adaptive filtering for attenuating undesired echoes. In addition to canceling the voice echo in long-distance links and acoustic echo in hands-free speakerphones, adaptive echo cancelers are also widely used in full-duplex data transmission over two- wire circuits, such as high-speed modems. This chapter focuses on voice echo cancelers for long-distance networks, VoIP (voice over Internet protocol) applications, and hands-free speakerphones. 10.1 Introduction to Line Echoes One of the main problems associated with telephone communications is the echo due to impedance mismatches at various points in the networks. Such echoes are called line (or network) echoes. If the time delay between the original speech and the echo is short, the echo may not be noticeable. The deleterious effects of echoes depend upon their loudness, spectral distortion, and delay. In general, longer delay requires more echo attenuation. A simplified telecommunication network is illustrated in Figure 10.1, where the local telephone is connected to a central office by a two-wire line in which both directions of transmission are carried on a single wire pair. The connection between two central offices uses the four-wire facility, which physically segregates the transmission by two facilities. This is because long-distance transmission requires ampli- fication that is a one-way function. The four-wire transmission path may include various equipments, including switches, cross-connects, and multiplexers. A hybrid (H) located in the central office makes the conversion between the two-wire and four-wire facilities. An ideal hybrid is a bridge circuit with the balancing impedance that is exactly equal to the impedance of the two-wire circuit. Therefore, it will couple all energy on the incoming branch of the four-wire circuit into the two-wire circuit. In practice, the hybrid may be connected to any of the two-wire loops served by the central office. Thus, the balancing network can provide only a fixed and compromise impedance match. As a result, some of the incoming signals from the four-wire circuit leak into the outgoing four- wire circuit, and return to the source as an echo shown in Figure 10.1. This echo requires special treatment if the round-trip delay exceeds 40 ms. Example 10.1: For Internet protocol (IP) trunk applications that use IP packets to relay the time division multiplex (TDM) traffic, the round-trip delay can easily exceed 40 ms. Figure 10.2 shows an example of VoIP applications using a gateway in which the voice is converted from the TDM circuits to the IP packets. 
Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 443JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 444 ADAPTIVE ECHO CANCELATION TelephoneTelephone H Four-wire facility Two-wire facility Two-wire facility Echo Echo H Figure 10.1 Long-distance telecommunication networks The delay includes vocoders, jitter compensation algorithms, and the network delay. The ITU- T G.729 CODEC (will be introduced in Chapter 11) is widely used for VoIP applications for its good performance and low delay. The G.729 algorithm delay is 15 ms. If using 10-ms frame real-time protocol packet and 10-ms jitter compensation, the round-trip delay of G.729 will be at least 2 × (15 + 10) = 50 ms without counting the IP network delay and the processing delay. Such long delay is the reason why adaptive echo cancelation is required for VoIP applications if one or two ends are connected by TDM circuit. 10.2 Adaptive Echo Canceler For telecommunication network using echo cancelation, the echo canceler is located in the four-wire section of the network near the origin of the echo source. The principle of the adaptive echo cancelation is illustrated in Figure 10.3. We show only one echo canceler located at the left end of network. To overcome the echoes in a full-duplex communication network, it is desirable to cancel the echoes in both directions of the trunk. The reason for showing a telephone and two-wire line is to indicate that this side is called the near-end, while the other side is referred to as the far-end. Telephone H Decoder d(n) Near-end Far-end Jitter buffer Encoder IP network RTP packet Round-trip delay Figure 10.2 Example of round-trip delay for VoIP applicationsJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 ADAPTIVE ECHO CANCELER 445 Telephone H x(n) e(n) LMSW(z) d(n) y(n) Near-end Far-end + − Σ Figure 10.3 Block diagram of an adaptive echo canceler 10.2.1 Principles of Adaptive Echo Cancelation To explain the principle of the adaptive echo cancelation in details, the function of the hybrid shown in Figure 10.3 can be illustrated in Figure 10.4, where the far-end signal x(n) passing through the echo path P(z) results in echo r(n). The primary signal d(n) is a combination of echo r(n), near-end signal u(n), and noise v(n). The adaptive filter W(z) models the echo path P(z) using the far-end speech x(n) as an exci- tation signal. The echo replica y(n) is generated by W(z), and is subtracted from the primary signal d(n) to yield the error signal e(n). Ideally, y(n) ≈ r(n) and the residual error e(n) is substantially free of echo. The impulse response of an echo path is shown in Figure 10.5. The time span over the hybrid is typically about 4 ms and is called the dispersive delay. Because of the four-wire circuit located between the location of the echo canceler and the hybrid, the impulse response of echo path has a flat delay. The flat delay depends on the transmission delay between the echo canceler and the hybrid, and the delay through the sharp filters associated with frequency division multiplex equipment. The sum of the flat delay and the dispersive delay is the tail delay. 
Assuming that the echo path P(z) is linear, time invariant, and has an infinite impulse response p(n), n = 0, 1, ..., ∞, the primary signal d(n) can be expressed as

    $d(n) = r(n) + u(n) + v(n) = \sum_{l=0}^{\infty} p(l)\, x(n-l) + u(n) + v(n)$,    (10.1)

where the additive noise v(n) is assumed to be uncorrelated with the near-end speech u(n) and the echo r(n).

Figure 10.4 An echo canceler diagram with details of the hybrid function (x(n) drives both the echo path P(z), producing r(n), and the adaptive filter W(z), producing y(n); d(n) = r(n) + u(n) + v(n), and e(n) = d(n) − y(n) drives the LMS update)

Figure 10.5 Impulse response of an echo path (p(n) versus time n: a flat delay followed by a dispersive delay; their sum is the tail delay)

The adaptive FIR filter W(z) generates an echo estimate

    $y(n) = \sum_{l=0}^{L-1} w_l(n)\, x(n-l)$,    (10.2)

which is used to cancel the echo. The error signal can be expressed as

    $e(n) = d(n) - y(n) = \sum_{l=0}^{L-1} \left[p(l) - w_l(n)\right] x(n-l) + \sum_{l=L}^{\infty} p(l)\, x(n-l) + u(n) + v(n)$.    (10.3)

The adaptive filter W(z) adjusts its weights w_l(n) to mimic the first L samples of the echo-path impulse response during the process of echo cancelation. The normalized LMS algorithm introduced in Section 7.3.4 is commonly used for adaptive echo cancelation applications. Assuming that the disturbance v(n) and the near-end voice u(n) are uncorrelated with x(n), W(z) will converge to P(z), i.e., w_l(n) ≈ p(l). As shown in Equation (10.3), the residual error after W(z) has converged can be expressed as

    $e(n) \approx \sum_{l=L}^{\infty} p(l)\, x(n-l) + u(n) + v(n)$.    (10.4)

By making the length of W(z) sufficiently long, the residual echo can be minimized. However, as discussed in Section 7.3.3, the excess mean-square error (MSE) produced by the adaptive algorithm and the finite-precision errors are also proportional to the filter length. Therefore, there is an optimum length L for echo cancelation. The number of coefficients required for the FIR filter is determined by the tail delay. As mentioned earlier, the impulse response of the hybrid (the dispersive delay) is relatively short. However, the flat delay from the echo canceler to the hybrid depends on the physical location of the echo canceler and on the processing delay of the transmission equipment.

10.2.2 Performance Evaluation

The effectiveness of an echo canceler is usually measured by the echo return loss enhancement (ERLE), defined as

    $\mathrm{ERLE} = 10 \log \frac{E\left[d^2(n)\right]}{E\left[e^2(n)\right]}$.    (10.5)

For a given application, the ERLE depends on the step size μ, the filter length L, the signal-to-noise ratio, and the nature of the signal in terms of power and spectral content. A larger step size provides faster initial convergence, but the final ERLE is smaller due to the excess MSE and quantization errors. If the filter length is already long enough to cover the echo tail, increasing L further will reduce the ERLE. The ERLE achieved by an echo canceler is limited by many practical factors. Detailed requirements for an adaptive echo canceler are defined by ITU-T Recommendations G.165 and G.168, including the maximum residual echo level, the echo suppression effect on the hybrid, the convergence time, the initial setup time, and the degradation in a double-talk situation. The first special-purpose chip for echo cancelation implemented a single 128-tap adaptive echo canceler [5]. In the past, adaptive echo cancelers were implemented using customized devices in order to handle the heavy computation in real-time applications.
Disadvantages of VLSI (very large scale integrated circuit) implementation are long development time, high development cost, and lack of flexibility to meet application-specific requirements and improvements such as the standard changed from G.165 to G.168. Therefore, recent activities in the design and implementation of adaptive echo cancelers for real-world applications are focus on programmable DSP processors. 10.3 Practical Considerations This section discusses two practical techniques in designing adaptive echo canceler: prewhitening and delay detection. 10.3.1 Prewhitening of Signals As discussed in Chapter 7, convergence time of adaptive FIR filter with the LMS algorithm is proportional to spectral ratio λmax/λmin. Thus, an input signal with flat spectrum such as white noise has faster convergence rate. Since speech signal is highly correlated with nonflat spectrum, the convergence speed is slow. The decorrelation (whitening) of input signal will improve the convergence speed. Figure 10.6 shows a typical prewhitening structure for input signals. The whitening filter FW (z) is used for both the far-end and near-end signals. This fixed filter could be obtained using the reversed statistical or temporal averaged spectrum values. One example is the anti-tile filter used to lift up the high-frequency components since most speech signal’s power is concentrated at the low-frequency region. Echo path W(z) W(z) Update FW(z) d(n) e′(n) FW(z) FW(z) − e(n) x(n) y(n) NLP + Figure 10.6 Block diagram of a signal prewhitening structureJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 448 ADAPTIVE ECHO CANCELATION W(z) Update FW(z) 1/FW(z) NLP d(n) x(n) e(n) FW(z) Echo Figure 10.7 Block diagram of an adaptive prewhitening structure The whitening filter can be adaptive based on the far-end signal x(n). In this case, the filter coefficients update is synchronized for both branches. An equivalent structure of adaptive prewhitening is shown in Figure 10.7. The adaptation of the whitening filter coefficients is similar to a perceptive weighting filter, which will be discussed in Chapter 11. We can use the LevinsonÐDurbin algorithm to estimate the input signal spectrum envelope and use the reversed function to filter the signal. The calculation of transfer function FW (z) is similar to the adaptive channel equalization discussed in Chapter 7. However, two conditions must be satisfied: the processing should be linear and the filter FW (z) must be reversible. 10.3.2 Delay Detection As discussed in Section 10.2.1, the initial part of the impulse response of an echo path represents a transmission delay between the echo canceler and the hybrid. To take advantage of this flat delay, the structure illustrated in Figure 10.8 was developed. Here,  represents the number of flat-delay samples. By estimating the number of zero coefficients needed to cover the flat delay, the echo canceler W(z) length can be shortened by . This technique effectively reduces the computational requirements. However, there are three major difficulties: the existence of multiple echoes, the difficulty to estimate the flat delay, and the delay variation during a call. Telephone H x(n) e(n) LMSW(z) d(n) y(n) + − Σ z−Δ x(n − Δ) Figure 10.8 Adaptive echo canceler with effective flat-delay compensationJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 PRACTICAL CONSIDERATIONS 449 The crosscorrelation function between the far-end signal x(n) and the near-end signal d(n) can be used to estimate the delay. 
The normalized crosscorrelation function at time n with lag k can be estimated as

    $\rho(n, k) = \frac{r_{xd}(n, k)}{\sqrt{r_{xx}(n)\, r_{dd}(n)}}$,    (10.6)

where r_{xd}(n, k) is the crosscorrelation function defined as

    $r_{xd}(n, k) = \frac{1}{L} \sum_{l=0}^{L-1} x\big(n - (l + k)\big)\, d(n - l)$,    (10.7)

and the autocorrelations of x(n) and d(n) are defined as

    $r_{xx}(n) = \frac{1}{L} \sum_{l=0}^{L-1} x(n - l)\, x(n - l)$,    (10.8)

    $r_{dd}(n) = \frac{1}{L} \sum_{l=0}^{L-1} d(n - l)\, d(n - l)$.    (10.9)

The typical value of the length L is between 128 and 256 for an 8 kHz sampling rate. The flat delay is the lag k that maximizes the normalized crosscorrelation function defined by Equation (10.6). Unfortunately, this method may perform poorly for speech signals, as shown in Figure 10.9, although it performs well for signals with flat spectra such as white noise.

To improve the performance of the crosscorrelation method, a bandpass filter covering two or three formants in its passband can be considered. This makes the crosscorrelation technique more reliable by whitening the input signals over the passband. Multirate filtering (introduced in Section 4.4) can be used to further reduce the computational load. The normalized crosscorrelation function ρ(n, k) is then computed using the subband signals u(n) and v(n). With a properly chosen bandpass filter and downsampling factor D, the downsampled subband signals u(n) and v(n) are closer to white noise. Figure 10.10 shows the performance improvement using a 16-band filterbank. The third subband signal is decimated by a factor of 16, and the downsampled signal is used to calculate the crosscorrelation function defined in Equation (10.7).

Because of the downsampling operations, the flat-delay estimation is divided into two steps. The first step finds T_0 in the downsampled domain; this delay has a resolution of the downsampling factor D, so the exact delay lies between T_0 − D/2 and T_0 + D/2. The second step refines the estimate to T_1 using the original signal, again by maximizing ρ(n, k). This two-step approach requires less computation since the first step works at the lower sampling rate and the second step performs only a limited fine search; it requires about 1/D of the crosscorrelation computation (refer to [4] for details).

Example 10.2: For the typical impulse response of an echo path shown in Figure 10.5, calculate the required number of FIR filter coefficients given that the flat delay is 15 ms, the dispersive segment is 10 ms, and the sampling rate is 8 kHz. For this specific example, the adaptive echo canceler is located at a tandem switch as shown in Figure 10.11. Since the flat delay is a pure delay, it can be implemented as a tapped delay line z^{-Δ} with Δ = 120 samples (15 ms at 8 kHz). The actual adaptive filter length is 80 coefficients to cover the 10 ms dispersive segment. In this case, the FIR filter coefficients need to compensate only for the dispersive delay of the hybrid rather than the flat delay between the hybrid and the echo canceler.
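To make the lag search of Equations (10.6) and (10.7) concrete, a minimal full-band, single-step sketch is given below; the two-step subband refinement described above is omitted, and the function name and buffer layout are assumptions rather than the book's experiment code. The buffers are assumed to hold at least n + 1 samples with n ≥ L − 1 + max_lag.

#include <math.h>

/* Return the lag (in samples) that maximizes the normalized crosscorrelation
   between the far-end signal x and the near-end signal d at time index n. */
int estimate_flat_delay(const double *x, const double *d, int n,
                        int L, int max_lag)
{
    double rxx = 0.0, rdd = 0.0, best_rho = -1.0;
    int l, k, best_lag = 0;

    for (l = 0; l < L; l++) {      /* Eqs. (10.8)-(10.9); the 1/L factors cancel in Eq. (10.6) */
        rxx += x[n - l] * x[n - l];
        rdd += d[n - l] * d[n - l];
    }
    if (rxx <= 0.0 || rdd <= 0.0)
        return 0;

    for (k = 0; k <= max_lag; k++) {
        double rxd = 0.0;
        for (l = 0; l < L; l++)
            rxd += x[n - (l + k)] * d[n - l];     /* Eq. (10.7) */
        double rho = rxd / sqrt(rxx * rdd);       /* Eq. (10.6) */
        if (rho > best_rho) {
            best_rho = rho;
            best_lag = k;
        }
    }
    return best_lag;
}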
The delay estimation becomes very important since this flat delay may change for different connections.

Figure 10.9 Crosscorrelation function of a voiced speech (crosscorrelation between the far-end and near-end speech versus time at the 8000 Hz sampling rate; the peak appears around lag 512)

10.4 Double-Talk Effects and Solutions

An extremely important issue in designing adaptive echo cancelers is handling double talk, which occurs when the far-end and near-end talkers speak simultaneously. In this case, the signal d(n) consists of both the echo r(n) and the near-end speech u(n), as shown in Figure 10.3. During double-talk periods, the error signal e(n) described in Equation (10.4) contains the residual echo, the uncorrelated noise v(n), and the near-end speech u(n). To correctly identify the characteristics of P(z), d(n) must originate solely from its input signal x(n). In theory, the far-end signal x(n) is uncorrelated with the near-end speech u(n) and thus will not affect the asymptotic mean value of the adaptive filter coefficients. However, the variation of the filter coefficients about this mean will increase substantially in the presence of near-end speech, so the echo cancelation performance is degraded. An unprotected algorithm may exhibit unacceptable behavior during double-talk periods.

An effective solution is to detect the occurrence of double talk and then disable the adaptation of W(z) during the double-talk periods. Note that only the coefficient adaptation illustrated in Figure 10.12 is disabled. If the echo path does not change during the double-talk periods, the echo can still be canceled by the previously converged W(z), whose coefficients are frozen during double talk. As shown in Figure 10.12, the speech detection/control block is used to control the adaptation of the adaptive filter W(z) and the nonlinear processor (NLP) that is used for reducing the residual echo.

Figure 10.10 Improved resolution of the crosscorrelation peaks (decimated subband signals; a single peak appears around lag 32 on the time axis at the 500 Hz sampling rate)

The double-talk detector (DTD), which detects the presence of near-end speech when the far-end speech is present, is a very critical element in echo cancelers. The conventional DTD is based on the echo return loss (ERL), or hybrid loss, which can be expressed as

    $\rho = 20 \log_{10} \frac{E\left[|x(n)|\right]}{E\left[|d(n)|\right]}$.    (10.10)

Figure 10.11 Configuration of echo cancelation with flat delay (echo cancelers at the tandem switch, separated from the local-switch hybrids and local loops by the Tx and Rx delays of a long-delay channel)

Figure 10.12 Adaptive echo canceler with speech detectors and nonlinear processor

In several adaptive echo cancelers, such as those defined by the ITU standards, the ERL value is assumed to be 6 dB. Based on this assumption, the near-end speech is present if

    $|d(n)| > \tfrac{1}{2}\, |x(n)|$.    (10.11)

However, we cannot simply use the instantaneous absolute values |d(n)| and |x(n)| under noisy conditions.
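A hedged sketch of the adaptation control described above follows: the echo replica is always computed and subtracted, but the coefficient update of W(z) is skipped while near-end speech is declared. Only the simple 6-dB test of Equation (10.11) is used here, and the hangover counter that keeps adaptation frozen briefly after double talk ends is a common practical addition, not something specified in the text.

#include <math.h>

typedef struct { int hangover; } dtd_state_t;

/* Returns 1 when the coefficients of W(z) may be updated, 0 when adaptation
   should be frozen because near-end speech has (recently) been detected. */
int allow_adaptation(dtd_state_t *s, double x_n, double d_n, int hangover_samples)
{
    if (fabs(d_n) > 0.5 * fabs(x_n))      /* Eq. (10.11): near-end speech present */
        s->hangover = hangover_samples;   /* e.g. a few hundred samples at 8 kHz  */
    else if (s->hangover > 0)
        s->hangover--;

    return s->hangover == 0;
}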
A modified near-end speech detection algorithm declares the presence of near-end speech if

    $|d(n)| > \tfrac{1}{2} \max \left\{ |x(n)|, \ldots, |x(n - L + 1)| \right\}$.    (10.12)

Equation (10.12) compares the instantaneous absolute value |d(n)| with the maximum absolute value of x(n) over a time window spanning the echo path. The advantage of using the instantaneous value of d(n) is its fast response to near-end speech; however, it increases the probability of false triggering when noise exists in the network. A more robust version of the speech detector replaces the instantaneous values |x(n)| and |d(n)| with the short-term power estimates P_x(n) and P_d(n). These short-term power estimates are implemented by first-order IIR filters as

    $P_x(n) = (1 - \alpha) P_x(n-1) + \alpha\, |x(n)|$    (10.13)

and

    $P_d(n) = (1 - \alpha) P_d(n-1) + \alpha\, |d(n)|$,    (10.14)

where 0 < α ≪ 1. Using a smaller α results in a more robust detector; however, it also causes a slower response to the presence of near-end speech. With the modified short-term power estimates, the near-end speech is detected if

    $P_d(n) > \tfrac{1}{2} \max \left\{ P_x(n), P_x(n-1), \ldots, P_x(n - L + 1) \right\}$.    (10.15)

It is important to note that a portion of the initial break-in near-end speech u(n) may not be detected by this detector, so adaptation would proceed briefly in the presence of double talk. Furthermore, the requirement of maintaining a buffer to store L power estimates increases the memory requirement and the complexity of the algorithm.

The assumption that the ERL is a constant (6 dB) is not always correct. If the ERL is higher than 6 dB, it will take longer to detect the presence of near-end speech. If the ERL is lower than 6 dB, most far-end speech will be falsely detected as near-end speech. For practical applications, it is better to estimate the time-varying threshold ρ dynamically by observing the signal levels of x(n) and d(n) when the near-end speech u(n) is absent.

10.5 Nonlinear Processor

The residual echo can be further reduced using an NLP realized as a center clipper. Comfort noise is inserted to minimize the adverse effects of the NLP.

10.5.1 Center Clipper

Nonlinearities in the echo path, noise in the circuits, and uncorrelated near-end speech limit the amount of cancelation achievable by a typical adaptive echo canceler. The NLP shown in Figure 10.12 removes the last vestiges of the remaining echo. The most widely used NLP is a center clipper with the input–output characteristic illustrated in Figure 10.13. This nonlinear operation can be expressed as

    $y(n) = \begin{cases} 0, & |x(n)| \le \beta \\ x(n), & |x(n)| > \beta, \end{cases}$    (10.16)

where β is the clipping threshold. The center clipper completely eliminates signals below the clipping threshold β but leaves signals greater than the threshold unaffected. A large value of β suppresses all the residual echo but also deteriorates the quality of the near-end speech. Usually the threshold is chosen to equal or exceed the peak amplitude of the return echo.

10.5.2 Comfort Noise

The NLP completely eliminates the residual echo and circuit noise, thus making the connection sound less 'real'. For example, if the near-end subscriber stops talking, the noise level will suddenly drop to zero since it has been clipped by the NLP.
If the difference is significant, the far-end subscriber may think the call has x(n) y(n) −β 0 −β β β Figure 10.13 InputÐoutput relationship of center clipperJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 454 ADAPTIVE ECHO CANCELATION H x(n) e(n) LMS W(z) d(n) y(n) Near-end Far-end + − Σ Noise update Comfort noise v(n) Figure 10.14 Implementation of G.168 with comfort noise insertion been disconnected. Therefore, the complete suppression of a signal using NLP has an undesired effect. This problem can be solved by injecting a low-level comfort noise when the residual echo is suppressed. As specified by Test 9 of G.168, the comfort noise must match the signal level and frequency contents of background noise. In order to match the spectrum, the comfort noise insertion is implemented in frequency domain by capturing the frequency characteristic of background noise. An alternate approach uses the linear predictive coding (LPC) coefficients to model the spectral information. In this case, the comfort noise is synthesized using a pth-order LPC all-pole filter, where the order p is between 6 and 10. The LPC coefficients are computed during the silence segments. The ITU-T G.168 recommends the level of comfort noise within ±2 dB of the near-end noise. An effective way of implementing NLP with comfort noise is shown in Figure 10.14, where the generated comfort noise v(n) or echo canceler output e(n) is selected as the output according to the control logic. The background noise is generated with a matched level and spectrum, heard by the far- end subscriber remaining constant during the call connection, and thus significantly contributing to the high-grade perceptive speech quality. 10.6 Acoustic Echo Cancelation There has been a growing interest in applying acoustic echo cancelation for hands-free cellular phones in mobile environments and speakerphones in teleconferencing. Acoustic echoes consist of three major components: (1) acoustic energy coupling between the loudspeaker and the microphone; (2) multiple- path sound reflections of far-end speech; and (3) the sound reflections of the near-end speech signal. In this section, we focus on the cancelation of the first two echo components. 10.6.1 Acoustic Echoes Speakerphone has become important office equipment because it provides the convenience of hands-free conversation. For reference purposes, the person using the speakerphone is the near-end talker and the person at the other end is the far-end talker. In Figure 10.15, the far-end speech is broadcasted through one or more loudspeakers inside the room. Unfortunately, the far-end speech played by the loudspeaker is also picked up by the microphone inside the room, and this acoustic echo is returned to the far end. The basic concept of acoustic echo cancelation is similar to the line echo cancelation; however, the adaptive filter of acoustic echo canceler models the loudspeaker-room-microphone system instead of the hybrid. Thus, the acoustic echo canceler needs to cancel a long echo tail using a much high-orderJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 ACOUSTIC ECHO CANCELATION 455 Far-end signal Power amplifier Preamplifier Reflection Direct coupling Room Acoustic echo Near-end talker Figure 10.15 Acoustic echo generated by a speakerphone in a room adaptive filter. One effective technique is the subband acoustic echo canceler, which splits the full-band signal into several overlapped subbands and uses an individual low-order filter for each subband. 
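Putting the center clipper of Equation (10.16) together with the output selection of Figure 10.14 gives the sketch below. This is only an outline: the gating decision and the white-noise generator are simplifying assumptions, since a G.168-compliant implementation matches the comfort noise to the level and spectrum of the measured background noise (for example, with the LPC approach described above).

#include <stdlib.h>

/* Center clipper with comfort-noise substitution (sketch).             */
/* beta     - clipping threshold, chosen near the residual-echo peak    */
/* noiseAmp - comfort-noise amplitude matched to the background level   */
float nlp_output(float e_n, float beta, float noiseAmp)
{
    if (e_n > -beta && e_n < beta)
    {
        /* Residual echo suppressed: substitute low-level comfort noise */
        /* instead of sending pure silence (white noise here; G.168     */
        /* implementations shape it to the background spectrum).        */
        return noiseAmp * ((float)rand() / (float)RAND_MAX - 0.5f) * 2.0f;
    }
    return e_n;   /* signal above the threshold passes unchanged, Eq. (10.16) */
}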
Example 10.3: To evaluate an acoustic echo path, the impulse responses of a rectangular room (246 × 143 × 111 in3) were measured. The original data is sampled at 48 kHz, which is then bandlimited to 400 Hz and decimated to 1 kHz for display purpose. The room impulse response is stored in the file imp.dat and is shown in Figure 10.16 (example10_3.m): load imp.dat; % Room impulse response plot(imp(1:1000)); % Display samples from 1 to 1000 0 100 200 300 400 500 Room impulse response Time Amplitude 600 700 800 900 1000 −1.5 −1 −0.5 0 0.5 1 1.5 × 10−4 Figure 10.16 An example of room impulse responseJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 456 ADAPTIVE ECHO CANCELATION There are three major factors making the acoustic echo cancelation far more challenging than the line echo cancelation for real-world applications: 1. The reverberation of a room causes a very long acoustic echo tail. The duration of the acoustic echo path is usually 400 ms in a typical conference room. For example, 3200 taps are needed to cancel 400 ms of echo at sampling rate 8 kHz. 2. The acoustic echo path may change rapidly due to the motion of people in the room, the change in position of the microphone, and some other factors like doors and/or windows opened or closed, etc. The acoustic echo canceler requires a faster convergence algorithm to track these fast changes. 3. The double-talk detection is much more difficult since we cannot assume the 6-dB acoustic loss as the hybrid loss in line echo canceler. Therefore, acoustic echo cancelers require more computation power, faster convergence speed, and more sophisticated double-talk detector. 10.6.2 Acoustic Echo Canceler The block diagram of an acoustic echo canceler is illustrated in Figure 10.17. The acoustic echo path P(z) includes the transfer functions of the A/D and D/A converters, smoothing and antialiasing lowpass filters, speaker power amplifier, loudspeaker, microphone, microphone preamplifier, and the room transfer function from the loudspeaker to the microphone. The adaptive filter W(z) models the acoustic echo path P(z) and yields an echo replica y(n) to cancel acoustic echo components in d(n). The adaptive filter W(z) generates a replica of the echo as y(n) = L−1 l=0 wl (n)x(n − l). (10.17) This replica is then subtracted from the microphone signal d(n) to generate e(n). The coefficients of the W(z) filter are updated by the normalized LMS algorithm as wl (n + 1) = wl (n) + μ(n)e(n)x(n − l), l = 0, 1,...,L − 1, (10.18) where μ(n) is the normalized step size by the power estimation of x(n). x(n) e(n) LMSW(z) d(n) y(n) + − Σ Acoustic echo path P(z) Far-end talker NLP Figure 10.17 Block diagram of an acoustic echo cancelerJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 ACOUSTIC ECHO CANCELATION 457 10.6.3 Subband Implementations Subband and frequency-domain adaptive filtering techniques have been developed to cancel long acoustic echoes. The advantages of using subband acoustic echo cancelers are (1) the decimation of subband signals reduces computational requirements, and (2) the signal whitening using normalized step size at each subband results in fast convergence. A typical structure of subband echo canceler is shown in Figure 10.18, where Am(z) and Sm(z) are analysis and synthesis filters, respectively. The number of subbands is M, and the decimation factor can be a number equal to or less than M. There are M adaptive FIR filtersWm(z), one for each channel. 
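Before turning to the subband structure in more detail, the full-band filtering of Equation (10.17) and the normalized LMS update of Equation (10.18) can be sketched in C as follows. This is an illustrative sketch only (the floating-point experiment code of Section 10.7.2 is the working reference); the filter length, the power estimator, and the regularization constant are assumptions.

#define L_AEC 512                      /* adaptive filter length (assumption) */

/* One sample of echo cancelation with NLMS adaptation.  The caller      */
/* must keep x[0..L_AEC-1] = x(n), x(n-1), ..., x(n-L_AEC+1) up to date. */
float aec_nlms_sample(float d_n, const float x[], float w[],
                      float mu0, float *powX)
{
    float y = 0.0f, e, mu;
    int   l;

    for (l = 0; l < L_AEC; l++)        /* y(n) = sum_l w_l(n) x(n-l), Eq. (10.17) */
        y += w[l] * x[l];
    e = d_n - y;                       /* error signal e(n) = d(n) - y(n)         */

    /* Running power estimate of x(n) and normalized step size mu(n).    */
    *powX = 0.99f * (*powX) + 0.01f * x[0] * x[0];
    mu    = mu0 / ((float)L_AEC * (*powX) + 1.0e-3f);   /* regularized (assumption) */

    for (l = 0; l < L_AEC; l++)        /* w_l(n+1) = w_l(n) + mu(n) e(n) x(n-l), Eq. (10.18) */
        w[l] += mu * e * x[l];

    return e;
}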
Usually, these filter coefficients are in complex form with much lower order than the full-band adaptive filter W(z) shown in Figure 10.17. The filterbank design with complex coefficients reduces the filter length due to the relaxed antialias- ing requirement. The drawback is the increased computation load because one complex multiplication requires four real multiplications. However, complex filterbank is still commonly used because of the difficulties to design a real coefficient bandpass filter with sharp cutoff and strict antialiasing requirements for adaptive echo cancelation. An example of designing a 16-band filterbank with complex coefficients is highlighted as follows: 1. Using the MATLAB to design a prototype lowpass FIR filter with coefficients h(n), n = 0,1,..., N − 1, which meets the requirement of the 3-dB bandwidth at π/2M, where M = 16. The magnitude response of the prototype filter is shown in Figure 10.19(a), and the impulse response is given in Figure 10.19(b). x(n) A1(z) ↓M Am(z) ↓M AM(z) ↓M + A1(z) ↓M Am(z) ↓M AM(z) ↓M ↑M ↑M ↑M Sm(z) SM(z) S1(z)W1(z) Wm(z) − +− +− DTD P(z) d(n) WM(z) Figure 10.18 Block diagram of a subband echo cancelerJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 458 ADAPTIVE ECHO CANCELATION 0 500 1000 1500 2000 2500 3000 3500 4000 −120 −100 −80 0 20 −40 −20 −60 0 20 40 60 80 100 120 −0.01 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 (a) Frequency (Hz) Time in sample Frequency (Hz) Frequency (Hz) Prototype filter for 16-band frequency response Prototype filter impulse response Full band frequency response16 band overall anylasis and synthesis filter frequency response (b) 0 500 1000 1500 2000 2500 3000 3500 4000 −120 −100 −80 −60 −40 −20 0 0 500 1000 1500 2000 2500 3000 3500 4000 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 (c) Magnitude (dB) Magnitude (dB) Magnitude (dB) Amplitude (d) Figure 10.19 Example of filterbank with 16 complex subbands 2. Applying cos π m−1/2 M n − N+1 2 and sin π m−1/2 M n − N+1 2 to modulate the prototype filter to produce the complex-coefficient bandpass filters, Am(z), m = 0,1,..., M − 1, as shown in Figure 10.18. The overall filter’s magnitude response is shown in Figure 10.19(c). In this example, the synthesis filterbank is identical to the analysis filterbank; i.e., Sm(z) = Am(z) for m = 0, 1, . . . , M − 1. 3. Decimating filterbank outputs by M to produce the low-rate signals sm(n), m = 0,1,...,M − 1, for the far-end and dm(n) for the near-end. 4. Performing the adaptation and echo cancelation for each individual subband with 1/M sampling rate. This produces error signals em(n), m = 0,1,..., M − 1. 5. The error signals at these M bands are synthesized back to the full-band signal using the bandpass filters Sm(z). Figure 10.19(d) shows the filterbank performance. Example 10.4: For the same tail length, compare the computational load between the adaptive echo cancelers of two subbands (assume real coefficients) and full band. More specifically, given that the tail length is 32 ms (256 samples at 8 kHz sampling rate), estimate the required number of multiplyÐadd operations.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 ACOUSTIC ECHO CANCELATION 459 Subband implementation requires 2 ×128 multiplications and additions for updating coeffi- cients at half of the sampling rate. In comparison, the full-band adaptive filter needs 256 multi- plications and additions at sampling rate. This means subband implementation needs only half of the computations required for a full-band implementation. 
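This comparison generalizes to M subbands. Counting, as above, one multiply–add per coefficient per update, a rough estimate is
$$C_{\text{full}} \approx L \ \text{MACs per sample}, \qquad C_{\text{sub}} \approx \frac{M \cdot (L/M)}{M} = \frac{L}{M}\ \text{MACs per full-band sample},$$
so the saving factor is roughly M; with M = 2 and L = 256 this reproduces the factor of 2 found above.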
In this comparison, the computation load of splitting filter is not counted since this computation load is very small as compared to the coefficients update using the adaptive algorithm. 10.6.4 Delay-Free Structures The inherent disadvantage of subband implementations is the extra delay introduced by the filterbank, which splits the full-band signal into multiple subbands and also synthesizes the processed subband signals into a full-band signal. Figure 10.20 shows the algorithm delay of subband adaptive echo canceler. A delay-free subband acoustic echo canceler can be implemented by adding an additional short full- band adaptive FIR filters W0(z), which covers the first part of the echo path and its length is equal to the total delay introduced by the analysis/synthesis filters plus the block-processing size. The subband adaptive filters model the rest of the echo path. Figure 10.21 illustrates the structure of delay-free subband acoustic echo cancelation. Example 10.5: For a 16-band subband acoustic echo canceler with delay-free structure, calculate the minimum filter length of the first FIR filter. Given that the filterbank is a linear phase FIR filter with 128 taps. The filterbank (analysis and synthesis) delay is 128 samples and the processing block delay is 16 samples. Therefore, the total delay due to filterbank is 128 + 16 = 144 samples. In this case, the length of the first FIR filter W0(z) is at least 144. 10.6.5 Implementation Considerations As shown in Figure 10.8, an effective technique to reduce filter length is to introduce a delay buffer of  samples at the input of adaptive filter. This buffer compensates for delay in the echo path caused by the propagation delay from the loudspeaker to the microphone. This technique saves computation since it effectively covers  impulse response samples without using adaptive filter coefficients. For example, Far-end P(z) d(n)Near-end Synthesis filterbank Echo cancelation in subbands Full-band signal Algorithm delay + − Analysis filterbank Analysis filterbank Figure 10.20 Illustration of algorithm delay due to filterbankJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 460 ADAPTIVE ECHO CANCELATION Far-end Analysis filterbank P(z) d(n) Analysis filterbank Synthesis filterbank Echo cancelation in subbands Full-band signal … … … Full-band cancelation Near-end + − − − + + Figure 10.21 Structure of delay-free subband acoustic echo canceler if the distance between the loudspeaker and the microphone is 1.5 m, the measured time delay in the system is about 4.526 ms based on the sound speed traveling at 331.4 m/s, which corresponds to  = 36 at 8 kHz sampling rate. As discussed in Chapter 7, if a fixed-point DSP processor is used for implementation and μ is suf- ficiently small, the excess MSE increases with a larger L, and the numerical errors (due to coefficient quantization and roundoff) increase with a larger L, too. Furthermore, roundoff error causes early termi- nation of the adaptation if a small μ is used. In order to alleviate these problems, a larger dynamic range is required which can be achieved by using floating-point arithmetic. However, floating-point solution requires a more expensive hardware for implementation. As mentioned earlier, the adaptation of coefficients must be temporarily stopped when the near-end talker is speaking. Most double-talk detectors for adaptive line echo cancelers are based on ERL. 
For acoustic echo cancelers, the echo return (or acoustic) loss is very small or may be even a gain because of the use of amplifiers in the system. Therefore, the higher level of acoustic echo makes detection of weak near-end speech very difficult. 10.6.6 Testing Standards ITU G.167 specifies the procedure for evaluating the performance of an acoustic echo canceler. As shown in Figure 10.22, the echo canceler is tested with the input far-end signal Rin and near-end signal Sin, and the output near-end signal Rout and far-end signal Sout. The performance of echo cancelation is evaluated based on these signals. Some G.167 requirements are listed as follows: r Initial convergence time: For all the applications, the attenuation of the echo shall be at least 20 dB after 1 s. This test evaluates the convergence time of adaptive filter. The filter structure, adaptation algorithm, step size μ, type of input signal, and prewhitening technique may affect this test. r Weighted terminal coupling loss during single talk: For teleconferencing systems and hands-free communication, its value shall be at least 40 dB on both sides. The value is the difference betweenJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 461 Rin e(n) LMSW(z) y(n) + − Σ NLP Rout Sin Sout Gain control Figure 10.22 Simplified diagram for G.167 testing the signal level (in Sout) without echo cancelation and the signal level with echo canceler in steady state. Test signal is applied to Rin and no other speech signal other than the acoustic return from the loudspeaker(s) is applied to the microphone. r Weighted terminal coupling loss during double talk: For teleconferencing systems and hands-free communication, its value shall be at least 25 dB on both sides. After the echo canceler reaches the steady state, a near-end speech is applied at the Sin for 2 s. The adaptive filter coefficients are frozen and then the near-end speech is removed. This test evaluates how fast and accurate is the DTD to stop the coefficient update during the double talk. r Recovery time after echo path variation: For all the applications, the attenuation of the echo should be at least 20 dB after 1 s. This test evaluates the echo canceler after the double talk; the coefficients may be affected but the system should not take more than 1 s to update to the optimum level. One of the interesting observations for the tests specified by ITU-T G. 167 is that the test vectors are artificial voices according to ITU P.50 standard. These artificial voices are composed using the speech synthesis model. This makes the test easier with reduced limitation of human resources. 10.7 Experiments and Program Examples This section presents some echo cancelation modules using MATLAB, C, or C55x programs to further illustrate the algorithms and examine the performance. 10.7.1 MATLAB Implementation of AEC This experiment is modified from the lms demo available in the MATLAB Signal Processing Blockset. Procedures of the experiment are listed as follows: 1. Start MATLAB and change to directory ..\experiments\exp10.7.1_matAec. 2. Run the experiment by typing lms_aec in the MATLAB command window. This starts Simulink and creates a window that contains the model as shown in Figure 10.23.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 462 ADAPTIVE ECHO CANCELATION Figure 10.23 Acoustic echo cancelation demo using Simulink 3. 
In the lms_aec window shown in Figure 10.23, make sure that: (a) nLMS module is connected to the Enable1’s position 1; (b) Manual switch is in silence position; and (c) nLMS module is connected to the Reset1’s position 0. If any of the connections is not met, double click the connection to change it. 4. Open the Simulation pull-down menu and click Start to start Simulink. 5. After the algorithm reaches steady state, disable adaptation and freeze the coefficients by changing the ‘Enable1’ switch to position 0. The adaptive filter coefficients are shown in Figure 10.24(a). In this experiment, the echo path is simulated by a 128-tap FIR lowpass filter with normalized cutoff frequency of 0.5. The coefficients of echo canceler in steady state approximate the coefficients of lowpass filter. Figure 10.24(b) shows the magnitude response of the converged adaptive filter. 6. Figure 10.25(a) shows the near-end signal, error signal, and echo (signal + noise). The differences between the near-end signal and the error signal indicate the performance of echo cancelation. 7. Switch from silence to near-end signal to add the near-end signal (a sinewave). 8. Figure 10.25(b) shows the output of the echo canceler, which is very close to the near-end signal.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 463 Figure 10.24 Adaptive filter in steady state: (a) impulse response; (b) frequency response This experiment can be repeated with different parameters. For example, on double clicking the nLMS module shown in Figure 10.23, a function parameter configuration window will be displayed as shown in Figure 10.26. From this window, we can modify the step size, filter length, as well as leaky factor. Figure 10.25 Signal waveforms generated by Simulink model: (a) echo only; (b) during double talkJWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 464 ADAPTIVE ECHO CANCELATION Figure 10.26 Configuration of the LMS adaptive filter 9. Try the following configurations and verify their performance. Explain why the echo cancelation performance becomes worse or better?r Change the echo path from an FIR filter to an IIR filter.r Change the switch from the silence to near-end signal position before disabling the adaptation with the LMS algorithm.r Change the step size μ from 1.5 to 0.1 and 4, and observe the results. 10. Select the FIR filter length of 256 and the LMS filter length of 64. Explain why the coefficients cannot match the echo path? 10.7.2 Acoustic Echo Cancelation Using Floating-Point C An acoustic echo canceler implemented using floating-point C is presented in this experiment. The files used for this experiment are listed in Table 10.1. The data files used for experiment are captured using a PC sound card at 8-kHz sampling rate. The conversation is carried out in a room of size 11 × 13 × 9ft3. The far-end speech file rtfar.pcm and the near-end speech file rtmic.pcm are captured simultaneously. 
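A minimal sample-by-sample harness for these two recordings might look like the sketch below. It assumes 16-bit little-endian PCM and a hypothetical per-sample entry point aec_process(); it is not the book's AecTest.c, and the output file name simply mirrors the one used by the fixed-point experiment in Section 10.7.3.

#include <stdio.h>

/* Hypothetical per-sample entry point of the echo canceler. */
extern short aec_process(short farEndIn, short microphoneIn);

int main(void)
{
    FILE *fpFar = fopen("rtfar.pcm", "rb");   /* far-end (loudspeaker) signal */
    FILE *fpMic = fopen("rtmic.pcm", "rb");   /* near-end (microphone) signal */
    FILE *fpOut = fopen("aecout.pcm", "wb");  /* canceler output              */
    short x, d, e;

    if (!fpFar || !fpMic || !fpOut)
        return 1;

    /* Process the recordings sample by sample at 8 kHz. */
    while (fread(&x, sizeof(short), 1, fpFar) == 1 &&
           fread(&d, sizeof(short), 1, fpMic) == 1)
    {
        e = aec_process(x, d);
        fwrite(&e, sizeof(short), 1, fpOut);
    }

    fclose(fpFar); fclose(fpMic); fclose(fpOut);
    return 0;
}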
The near-end signal picked up by a microphone consists of the near-end speech and acoustic echoes generated from the far-end speech.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 465 Table 10.1 File listing for experiment exp10.7.2_floatingPointAec Files Description AecTest.c Program for testing acoustic echo canceler AecInit.c Initialization function AecUtil.c Echo canceler utility functions AecCalc.c Main module for echo canceler Aec.h C header file floatPoint_aec.pjt DSP project file floatPoint_aec.cmd DSP linker command file rtfar.pcm Far-end data file rtmic.pcm Near-end data file The adaptive echo canceler operates in four different modes based on the power of far-end and near-end signals. These four operating modes are defined as follows: 1. Receive mode: Only the far-end speaker is talking. 2. Transmit mode: Only the near-end speaker is talking. 3. Idle mode: Both ends are silence. 4. Double-talk mode: Both ends are talking. Different operations are required for different modes. For example, the adaptive filter coefficients will be updated only at the receive mode. Typical operations at different modes are coded in Table 10.2. Figure 10.27 illustrates the performance of acoustic echo canceler: (a) the far-end speech signal that is played via a loudspeaker in the room; (b) the near-end signal picked up by a microphone, which consists of the near-end speech as well as the echoes in the room generated from playing the far-end speech; and (c) the acoustic echo canceler output to be transmitted to the far-end. It clearly shows that the echo canceler output contains only the near-end speech. In this experiment, the double talk is not present. The echo canceler reduces the echo by more than 20 dB. More experiments can be conducted by using different parameters. Procedures of the experiment are listed as follows: 1. Use an audio player or MATLAB to play the data files. The rtfar.pcm is transmitted from the far-end and played by a loudspeaker, which will generate acoustic echo in a room. The rtmic.pcm is the near-end signal captured by a microphone. We can clearly hear both the near-end speech and the echo generated from the far-end speech. 2. Open and build the experiment project. 3. Load and run the experiment using the provided data files. Verify the performance of acoustic echo canceler for removing the echo. 4. Open the C source file AecInit.c, adjust the following adaptive echo canceler parameters, and rerun the experiment to observe changing behavior: (a) echo tail length aec->AECorder (up to 1024); (b) leaky factor aec->leaky (1.0 disables leaky function); and (c) step size aec->mu.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 466 ADAPTIVE ECHO CANCELATION Table 10.2 Partial C code for acoustic echo cancelation if (farFlag == 1) // There is far-end speech { if ((nearFlag == 0) || (trainTime > 0)) // Receive mode { /* Receive mode operations */ if (trainTime > 0) // Counter is no expire yet { trainTime--; // Decrement the counter if (txGain > 0.25) txGain -= rampDown; // Ramp down farEndOut = (float)(txGain*errorAEC); // Attenuate by 12 dB } if (errorAECpowM 18 dB { // Enable center clipper farEndOut = comfortNoise; // and inject comfort noise } else // If ERLE < 18 dB { if (txGain > 0.25)txGain -= rampDown; // Ramp down farEndOut = (float)(txGain*errorAEC); // Disable center clipper } // Attenuated by 12 dB if (farInPowM < 16000.) 
// Signal farEndIn is reasonable { /* Update AEC coefficients, otherwise skip adaptation*/ temp = (float)((mu*errorAEC) /(spkOutPowM+saveMargin)); // Normalize step size for (k=0; k 0.5) txGain -= rampDown; // Ramp down if (txGain < 0.5) txGain += rampUp; // Ramp up farEndOut = (float)(txGain*errorAEC); // Attenuate 6 dB } } else // No far-end speech { // Transmit mode operation if (nearFlag == 1) { if (txGain < 1) txGain += rampUp; farEndOut = txGain*microphoneIn; // Full gain at trans- mit path } else // Idle mode operation { if (txGain > 0.5) txGain -= rampDown; // Ramp down if (txGain < 0.5) txGain += rampUp; // Ramp up farEndOut = (float)(txGain*microphoneIn); // Attenuate 6 dB } }JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 467 2 0 −2 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 (a) Far-end speech signal × 104 × 104 2 0 −2 0 0.5 1 1.5 2 2.5 3 3.5 Residual echo Near-end speech 4 4.5 5 (c) Acoustic echo canceler output × 104 × 104 2 0 −2 0 0.5 1 1.5 2 2.5 3 3.5 Echo from far-end speechEcho from far-end speech Near-end speech 4 4.5 5 (b) Near-end mic input × 104 × 104 Figure 10.27 Experiment results of acoustic echo cancelation: (a) far-end speech signal; (b) near-end mic input; and (c) acoustic echo canceler output 5. Using the knowledge learned from previous experiments, write an assembly program to replace the adaptive filtering function used by this experiment. 6. Convert the rest of the experiment to fixed-point C implementation. Pay special attention on data type conversion and fixed-point implementation for C55x processors.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 468 ADAPTIVE ECHO CANCELATION Table 10.3 File listing for experiment exp10.7.3_intrinsicAec Files Description intrinsic_aec.pjt C55x project file intrinsic_aec.cmd C55x linker command file fixPoint_leaky_lmsTest.c Main program fixPoint_aec_init.c Initialization function fixPoint_double_talk.c Double-talk detection function fixPoint_leaky_lms.c Major module for LMS update and filtering fixPoint_nlp.c NLP function utility.c Utility function of long division fixPoint_leaky_lms.h Header file gsm.h Header file for using intrinsics linkage.h Header file needed for intrinsics rtfar.pcm Data file of far-end signal rtmic.pcm Data file of near-end signal 10.7.3 Acoustic Echo Canceler Using C55x Intrinsics This experiment shows the implementation of a fixed-point acoustic echo canceler. We use the normalized LMS algorithm presented in Chapter 7. In addition, we add an NLP function to further attenuate the residue echoes. The files used for this experiment are listed in Table 10.3. Fixed-point C implementation of leaky NLMS algorithm using intrinsic functions has been discussed in Section 7.6.3. Using the same technique, the DTD can be implemented in fixed-point C using the C55x intrinsics. Table 10.4 lists portion of the C program for far-end speech detection. In the program, the function aec_power_estimate( ) is used to estimate the signal power. The variable dt->nfFar is the noise floor of the far-end signal. If the signal power dt->nfFar is higher than the noise floor, the speech is detected. 
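The helper aec_power_estimate() referenced above is not listed in Table 10.4. A plausible body, implementing the first-order recursion of Equations (10.13) and (10.14) with α = 2^(-shift) so that the multiplication reduces to a shift, is sketched below in plain C; the book's version uses the C55x fractional intrinsics, so this is an assumption about its behavior rather than a copy of it.

typedef long Word32;                  /* 32-bit accumulator (assumption) */

/* First-order IIR power estimate, Equations (10.13)-(10.14):            */
/*   P(n) = (1 - alpha)*P(n-1) + alpha*|x(n)|, with alpha = 2^(-shift).  */
/* Both arguments are magnitudes scaled into the 32-bit range; an        */
/* arithmetic right shift is assumed.                                    */
Word32 aec_power_estimate(Word32 power, Word32 newAbs, int shift)
{
    /* A small shift gives a fast (short) window, a large shift a slow   */
    /* (long) window, matching ALPHA_SHIFT_SHORT/ALPHA_SHIFT_MEDIUM.     */
    return power + ((newAbs - power) >> shift);
}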
Table 10.4 Partial fixed-point C code for far-end signal detection // Update noise floor estimate of receiving far-end signal // temp = |farEndIn|, estimate far-end signal power temp32a = L_deposit_h(lms->in); dt->farInPowS = aec_power_estimate( dt->farInPowS,temp32a,ALPHA_SHIFT_SHORT); if (dt->nfFar < dt->farInPowS) { // Onset of speech, slow update using long window dt->nfFar = aec_power_estimate( dt->nfFar,temp32a,ALPHA_SHIFT_MEDIUM); } else { dt->nfFar = aec_power_estimate( dt->nfFar,temp32a,ALPHA_SHIFT_SHORT); } // Threshold for far-end speech detector temp32b = L_mult(extract_h(dt->nfFar),VAD_POWER_THRESH); temp32b = L_add(temp32b,dt->nfFar); temp32b = L_add(temp32b,L_deposit_h(SAFE_MARGIN));JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 EXPERIMENTS AND PROGRAM EXAMPLES 469 Table 10.4 (continued) if(temp32b <= L_deposit_h(200)) temp32b = L_deposit_h(200); // Detect speech activity at far end if(dt->farInPowS > temp32b) // temp32b = thresFar { // Declare far-end speech dt->farFlag = 1; // Set hangover time counter dt->farHangCount = HANGOVER_TIME; } else { if (dt->farHangCount-- < 0) // Decrement hangover counter { dt->farFlag = 0; dt->farHangCount = 0; // Hangover counter expired } } Procedures of the experiment are listed as follows: 1. Build, load, and run the experiment program. 2. The acoustic echo canceler output is saved in the file named aecout.pcm. 3. Use the CCS graph tool to plot the adaptive filter coefficients, w, of length 512, as shown in Figure 10.28. 4. With the same inputs as shown in Figure 10.27(a) and (b), the processed output by this fixed-point acoustic echo canceler is shown in Figure 10.29. Further experiments include writing assembly programs to replace the intrinsics used in this experiment, and modifying the fixed-point C code to create an adaptive echo canceler using assembly program. 10.7.4 Experiment of Delay Estimation This experiment uses the MATLAB scripts exp10_7_4.m to find the echo delay based on the crosscor- relation method. The program is listed in Table 10.5. The MATLAB function xcorr(x,y, 'biased') is used to calculate the crosscorrelation between the vectors x and y. In this experiment, we use auco8khz.txt as the far-end data (y vector), delay it by 200 samples, and copy it as the near-end data (x vector). The crosscorrelation between the vectors x and y is returned to crossxy. The MATLAB function max( ) is used to find the maximum value m in the array, which represents the delay between the far-end and near-end signals. The files used for this experiment are listed in Table 10.6, and procedures of the experiment are listed as follows: 1. Running the script, the delay value is estimated and printed as The maximum corssXY(m). This simple technique works well for estimating a pure delay in noise-free environment. 2. In real applications with noises and multiple echoes, more complicated methods discussed in Section 10.3.2 are needed. 
The crosscorrelation function is shown in Figure 10.30.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 Figure 10.28 Adaptive filter coefficients in steady state Output of fixed-point AEC Time Amplitude 2 230 0 −2 145 × 104 × 104 Figure 10.29 The error signal of fixed-point AEC output Table 10.5 Crosscorrelation method for estimating the delay % Open data files fid1 = fopen('.//data//rtfar.pcm', 'rb'); fid2 = fopen('.//data//rtmic.pcm', 'rb'); % Read data files x = fread(fid1, 'int16'); y = fread(fid2, 'int16'); % crossxy(m) = cxy(m-N), m=1, ..., 2N-1 crossxy = xcorr(x(1:800),y(1:800),'biased'); len=size(crossxy); % Only half = cxy(m-N), m=1, ... 470JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 Table 10.5 (continued) xy = abs(crossxy(((len-1)/2+1):len)); % Find max in xy [ampxy,posxy]=max(xy); plot(xy),; title('Crosscorelation between x and y'); xlabel('Time at 8000 Hz sampling rate'); ylabel('Crosscorrelation'); text(posxy-1,ampxy,... '\bullet\leftarrow\fontname{times} CorossXY(m) = MAXIMUM', 'FontSize',12) disp(sprintf('The maximum corssXY(m) found at %d with value =%d \n', posxy-1,ampxy)); fclose(fid1); fclose(fid2); Table 10.6 File listing for experiment exp10.7.4_delayDetect Files Description delayDetect.m MATLAB experiment program rtfar.pcm Data file for far-end signal rtmic.pcm Data file for near-end signal Crosscorrelation between x and y CrossXY(m) = MAXIMUM 2500 2000 1500 1000 500 0 0 100 200 300 400 500 600 700 800 Time at 8000 Hz sampling rate Crosscorrelation Figure 10.30 Crosscorrelation function to find a flat delay 471JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 472 ADAPTIVE ECHO CANCELATION References [1] S. M. Kuo and D. R. Morgan, Active Noise Control Systems Ð Algorithms and DSP Implementations, New York: John Wiley & Sons, Inc., 1996. [2] W.Tian and A. Alvarez, ‘Echo canceller and method of canceling echo,’WorldIntellectual Property Organization, Patent WO 02/093774 A1, Nov. 2002. [3] W. Tian and Y. Lu, ‘System and method for comfort noise generation,’ US Patent no. 6 766 020 B1, July 2004. [4] Y. Lu, R. Fowler, W. Tian, and L. Thompson, ‘Enhancing echo cancellation via estimation of delay,’ IEEE Trans. Signal. Process., vol. 53, no. 11, pp. 4159Ð4168, Nov. 2005. [5] D. L. Duttweiler, ‘A twelve-channel digital echo canceller,’ IEEE Trans. Comm., vol. COM-26, pp. 647Ð653, May 1978. [6] D. L. Duttweiler and Y. S. Chen, ‘A single-chip VLSI echo canceller,’ Bell Sys. Tech. J., vol. 59, pp. 149Ð160, Feb. 1980. [7] K. Eneman and M. Moonen, ‘Filterbank constrains for subband and frequency-domain adaptive filters,’ Proc. IEEE ASSP Workshop, New Paltz, NY: Mohonk Mountain House, Oct. 1997. [8] Math Works, Inc.,Using MATLAB, Version 6, 2000. [9] Math Works, Inc.,MATLAB Reference Guide, 1992. [10] Analog Devices, Digital Signal Processing Applications Using the ADSP-2100 Family, Englewood Cliffs, NJ: Prentice Hall, 1990. [11] C. W. K. Gritton and D. W. Lin, ‘Echo cancellation algorithms’ IEEE ASSP Mag., pp. 30Ð38, Apr. 1984. [12] CCITT Recommendation G.165, Echo Cancellers, 1984. [13] M. M. Sondhi and D. A. Berkley, ‘Silencing echoes on the telephone network,’ Proc. IEEE, vol. 68, pp. 948Ð963, Aug. 1980. [14] M. M. Sondhi and W. Kellermann, ‘Adaptive echo cancellation for speech signals,’ in Advances in Speech Signal Processing, S. Furui and M. Sondhi, Eds., New York: Marcel Dekker, 1992, Chap. 11. [15] Texas Instruments, Inc., Acoustic Echo Cancellation Software for Hands-Free Wireless Systems, Literature no. 
SPRA162, 1997 [16] Texas Instruments, Inc., Echo Cancellation S/W for TMS320C54x, Literature no. BPRA054, 1997 [17] Texas Instruments, Inc., Implementing a Line-Echo Canceller Using Block Update & NLMS Algorithms- ’C54x, Literature no. SPRA188, 1997 [18] ITU-T Recommendation G.167, Acoustic Echo Controllers, Mar. 1993. [19] ITU-T Recommendation G.168, Digital Network Echo Cancellers, 2000. Exercises 1. What are the first things to check if the adaptive filter is diverged during the designing of an adaptive echo canceler? 2. Assuming a full-band adaptive FIR filter is used and the sampling frequency is 8 kHz, calculate the following: (a) the number of taps needed to cover an echo tail of 128 ms; (b) the number of multiplications needed for coefficient adaptation; and (c) the number of taps if the sampling frequency is 16 kHz. 3. In Problem 2, the full-band signal is sampled at 8 kHz. If using 32 subbands and the subband signal is critically sampled, answer the following questions: (a) What is the sampling rate for each subband? (b) What is the minimum number of taps for each subband in order to cover the echo tail length of 128 ms? (c) What is the total number of taps and is this number the same as that of Problem 2? (d) At 8 kHz sampling rate, how many multiplications are needed for coefficient adaptation in each sampling period? You should see the savings in computations over Problem 2.JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 EXERCISES 473 4. The C55x LMS instruction, LMS Xmem, Ymem, ACx, ACy, is very efficient to perform two parallel LMS operations in one cycle. Write a C55x code using this instruction to convert the fixed-point C code in the experiment given in Section 10.7.3. 5. For VoIP applications, if a conventional landline telephone user A calls an IP phone user B via the VoIP gateway, draw a diagram to show which side will hear line echo and which side needs a line echo canceler. If both sides are using IP phones, do you think we still need a line echo canceler? 6. In the experiment given in Section 10.7.2, the signal flow has been classified into transmit, receive, double talk, and idle mode. In each of these modes, summarize which processes, comfort noise insertion, ramping up, ramping down, or attenuation, are applied?JWBK080-10 JWBK080-Kuo March 2, 2006 16:3 Char Count= 0 474JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 11 Speech-Coding Techniques Communication infrastructures and services have been changed dramatically in recent years to include data and images. However, speech is still the most important and common service in the telecommu- nication networks. This chapter introduces speech-coding techniques to achieve the spectral efficiency, security, and easy storage. 11.1 Introduction to Speech-Coding Speech-coding techniques compress the speech signals to achieve the efficiency in storage and transmis- sion, and to decompress the digital codes to reconstruct the speech signals with satisfactory qualities. In order to preserve the best speech quality while reducing the bit rate, it uses sophisticated speech-coding algorithms that need more memory and computational load. The trade-offs between bit rate, speech quality, coding delay, and algorithm complexity are the main concerns for the system designers. The simplest method to encode the speech is to quantize the time-domain waveform for the digital representation of speech, which is known as pulse code modulation (PCM). 
This linear quantization requires at least 12 bits per sample to maintain a satisfactory speech quality. Since most telecommunication systems use 8 kHz sampling rate, PCM coding requires a bit rate of 96 kbps. As briefly introduced in Chapter 1, lower bit rate can be achieved by using logarithmic quantization such as the μ-law or A-law companding, which compresses speech to 8 bits per sample and reduces the bit rate to 64 kbps. Further bit-rate reduction at 32 kbps can be achieved using the adaptive differential PCM (ADPCM), which uses adaptive predictor and quantizer to track the input speech signal. AnalysisÐsynthesis coding methods can achieve higher compression rate by analyzing the spectral parameters that represent the speech production model, and transmit these parameters to the receiver for synthesizing the speech. This type of coding algorithm is called vocoder (voice coder) since it uses an explicit speech production model. The most widely used vocoder uses the linear predictive coding (LPC) technique, which will be focused in this chapter. The LPC method is based on the speech production model including excitation input, gain, and vocal- tract filter. It is necessary to determine a given segment or frame (usually in the range of 5Ð30 ms) in voiced or unvoiced speech. Segmentation is formed by multiplying the speech signal by a Hamming window. The successive windows are overlapped. For a voiced speech, the pitch period is estimated and used to generate the periodic excitation input. For an unvoiced speech, a random noise will be used as the excitation input. The vocal tract is modeled as an all-pole digital filter. The filter coefficients can be estimated by the LevinsonÐDurbin recursive algorithm, which will be introduced in Section 11.2. Real-Time Digital Signal Processing: Implementations and Applications S.M. Kuo, B.H. Lee, and W. Tian C 2006 John Wiley & Sons, Ltd 475JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 476 SPEECH-CODING TECHNIQUES In recent years, many LPC-based speech CODECs, especially code-excited linear predictive (CELP) at bit rate of 8 kbps or lower have been developed for wireless and network applications. The CELP- type speech CODECs are widely used in applications including wireless mobile and IP telephony communications, streaming media services, audio and video conferencing, and digital radio broadcast- ings. These speech CODECs include the 5.3 to 6.3-kbps ITU-T G.723.1 for multimedia communications, the low-delay G.728 at 16 kbps, the G.729 at 8 kbps, and the ISO (International Organization for Stan- dardization) MPEG-4 CELP coding. In addition, there are regional standards that include Pan-European digital cellular radio (GSM) standard at 13 kbps, and GSM adaptive multirate (AMR) for third generation (3G) digital cellular telecommunication systems. 11.2 Overview of CELP Vocoders CELP algorithms use an LPC approach. The coded parameters are analyzed to minimize the perceptually weighted error via a closed-loop optimization procedure. All CELP algorithms share the same basic func- tions including short-term synthesis filter, long-term prediction synthesis filter (or adaptive codebook), perceptual weighted error minimization procedure, and fixed-codebook excitation. The basic structure of the CELP coding system is illustrated in Figure 11.1. The following three components can be optimized to obtain good synthesized speech: 1. 
time-varying filters, including short-term LPC synthesis filter 1/A(z), long-term pitch synthesis filter P(z) (adaptive codebook), and post filter F(z); 2. perceptually based error minimization procedure related to the perceptual weighting filter W(z); and 3. fixed-codebook excitation signal eu(n), including excitation signal shape and gain. H(z) Input speech xin(n) Minimum MSE W(z) P(z) 1/A(z) W(z) eu(n) eu(n) 1/A(z) F(z) xout(n)x(n) xw(n) ev(n) e(n) xw(n) ew(n) e(n) Pitch analysis Output speech P(z) Encoder Decoder Encoded bit stream Excitation generator LPC analysis ˆ ˆ Figure 11.1 Block diagram of LPC coding scheme. The top portion of the diagram is the encoder, and the bottom portion is the decoderJWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 OVERVIEW OF CELP VOCODERS 477 In the encoder, the LPC and pitch analysis modules analyze speech to obtain the initial parameters for the speech synthesis model. Following these two modules, speech synthesis is conducted to minimize the weighted error. To develop an efficient search procedure, the number of operations can be reduced by moving the weighting filter into two branches before the error signal as shown in Figure 11.1. In the encoder, xin(n) is the input speech, xw(n) is the original speech weighted by the perceptual weighting filter W(z), ˆxw(n) is the weighted reconstructed speech by passing excitation signal e(n) through the combined filter H(z), eu(n) is the excitation from the codebook, ev(n) is the output of pitch predictor P(z), and ew(n) is the weighted error. Parameters including excitation index, quantized LPC coefficients, and pitch predictor coefficients are encoded and transmitted. At the receiver, these parameters are used to synthesize the speech. The filter W(z) is used only for minimizing the mean-square error loop, and its coefficients are not encoded. The coefficients of the post filter F(z) are derived from the LPC coefficients and/or from the reconstructed speech. In the decoder, the excitation signal e(n) is first passed through the long-term pitch synthesis filter P(z) and then the short-term LPC synthesis filter 1/A(z). The reconstructed signal ˆx(n) is sent to the post filter F(z), which emphasizes speech formants and attenuates the spectral valleys between formants. 11.2.1 Synthesis Filter The time-varying short-term synthesis filter 1/A(z) and the long-term synthesis filter P(z) are updated frame by frame using the LevinsonÐDurbin recursive algorithm. The synthesis filter 1/A(z) is expressed as 1/A(z) = 1 1 − p i=1 ai z−i , (11.1) where ai is the short-term LPC coefficient and p is the filter order. The most popular method to calculate the LPC coefficients is the autocorrelation method. Due to the characteristics of speeches, we apply windows to calculate the autocorrelation coefficients as follows: Rn( j) = N−1− j m=0 sn(m)sn(m + j), j = 0, 1, 2,..,p, (11.2) where N is the window (or frame) size, n is the frame index, and m is the sample index in the frame. We need to solve the following matrix equation to derive the prediction filter coefficients ai : ⎡ ⎢⎢⎢⎣ Rn(0) Rn(1) ··· Rn(p − 1) Rn(1) Rn(0) ··· Rn(p − 2) ... ... ... ... Rn(p − 1) Rn(p − 2) ··· Rn(0) ⎤ ⎥⎥⎥⎦ ⎡ ⎢⎢⎢⎣ a1 a2 ... ap ⎤ ⎥⎥⎥⎦ = ⎡ ⎢⎢⎢⎣ Rn(1) Rn(2) ... Rn(p) ⎤ ⎥⎥⎥⎦ . (11.3) The left-hand side matrix is symmetric, all the elements on its main diagonal are equal, and the elements on any other diagonal parallel to the main diagonal are also equal. This square matrix is Toeplitz. 
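A compact C sketch of the autocorrelation computation in Equation (11.2) and of solving the Toeplitz system (11.3) by the Levinson–Durbin recursion summarized next is given below. It assumes the frame has already been windowed; the frame size and order are illustrative. The sign convention follows Equation (11.3), i.e., A(z) = 1 - sum a_i z^(-i).

#define LPC_ORDER 10
#define FRAME     256

/* Autocorrelation of a windowed frame, Equation (11.2). */
void autocorr(const float s[FRAME], float R[LPC_ORDER + 1])
{
    int j, m;
    for (j = 0; j <= LPC_ORDER; j++)
    {
        R[j] = 0.0f;
        for (m = 0; m < FRAME - j; m++)
            R[j] += s[m] * s[m + j];
    }
}

/* Levinson-Durbin recursion solving Equation (11.3); fills a[1..p]     */
/* (a[0] = 1 by convention) and returns the prediction error energy.    */
float levinson_durbin(const float R[LPC_ORDER + 1], float a[LPC_ORDER + 1])
{
    float E = R[0], k, tmp[LPC_ORDER + 1];
    int   i, j;

    a[0] = 1.0f;
    for (i = 1; i <= LPC_ORDER; i++)
    {
        k = R[i];                         /* reflection coefficient k_i */
        for (j = 1; j < i; j++)
            k -= a[j] * R[i - j];
        k /= E;

        a[i] = k;
        for (j = 1; j < i; j++)           /* update a_1 ... a_(i-1)     */
            tmp[j] = a[j] - k * a[i - j];
        for (j = 1; j < i; j++)
            a[j] = tmp[j];

        E *= (1.0f - k * k);              /* prediction error energy    */
    }
    return E;
}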
Several efficient recursive algorithms have been derived for solving Equation (11.3). The most widely usedJWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 478 SPEECH-CODING TECHNIQUES algorithm is the LevinsonÐDurbin recursion summarized as follows: E(0) n = R(0) n (11.4) ki = Rn(i) − i−1 j=1 a(i−1) j Rn(|i − j|) E(i−1) n (11.5) a(i) i = ki (11.6) a(i) j = a(i−1) j − ki a(i−1) i− j 1 ≤ j ≤ i − 1 (11.7) E(i) n = 1 − k2 i E(i−1) n . (11.8) After solving these equations recursively for i =1,2,..., p, the parameters ai are given by a j = a(p) j 1 ≤ j ≤ p. (11.9) Example 11.1: Consider the order p = 3 and given autocorrelation coefficients Rn( j), j = 0, 1, 2, 3, for a frame of speech signal. Calculate the LPC coefficients. We need to solve the following matrix equation: ⎡ ⎢⎣ Rn(0) Rn(1) Rn(2) Rn(1) Rn(0) Rn(1) Rn(2) Rn(1) Rn(0) ⎤ ⎥⎦ ⎡ ⎢⎣ a1 a2 a3 ⎤ ⎥⎦ = ⎡ ⎢⎣ Rn(1) Rn(2) Rn(3) ⎤ ⎥⎦ . This matrix equation can be solved recursively as follows: For i = 1: E(0) n = R(0) n k1 = Rn(1) E(0) n = Rn(1) Rn(0) a(1) 1 = k1 = Rn(1) Rn(0) E(1) n = 1 − k2 1 Rn(0) =  1 − R2 n(1) R2 n(0)  Rn(0). For i = 2, E(1) n and a(1) 1 are available from i = 1. Thus, we have k2 = Rn(2) − a(1) 1 Rn(1) E(1) n = Rn(0)Rn(2) − R2 n(1) R2 n(0) − R2 n(1) a(2) 2 = k2 a(2) 1 = a(1) 1 − k2a(1) 1 = (1 − k2) a(1) 1 E(2) n = 1 − k2 2 E(1) n .JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 OVERVIEW OF CELP VOCODERS 479 For i = 3, E(2) n , a(2) 1 , and a(2) 2 are available from i = 2. Thus, we get k3 = Rn(3) −  a(2) 1 Rn(2) + a(2) 2 Rn(1)  E(2) n a(3) 3 = k3 a(3) 1 = a(2) 1 − k3a(2) 2 a(3) 2 = a(2) 2 − k3a(2) 1 . Finally, we have a0 = 1, a1 = a(3) 1 , a2 = a(3) 2 , a3 = a(3) 3 . We can use MATLAB functions provided in the Signal Processing Toolbox to calculate LPC co- efficients. For example, the LPC coefficients can be calculated using the LevinsonÐDurbin recursion as [a,e] = levinson(r,p) The parameter r is a deterministic autocorrelation sequence (vector), p is the order of denominator polynomial A(z), a = [a(1) a(2) ··· a(p + 1)] where a(1) = 1, and the prediction error e. The function lpc(x,p) determines the coefficients of forward linear predictor by minimizing the prediction error in the least-square sense. The command [a,g] = lpc(x,p) finds the coefficients of a linear predictor that predicts the current value of the real-valued time series x based on past samples. This function returns prediction coefficients a and error variances g. Example 11.2: Given a speech file voice4.pcm, calculate the LPC coefficients and spectral response using the function levinson( ). Also, compare the contours of speech spectrum with the synthesis filter’s frequency response. Assume that the LPC order is 10 and Hamming window size is 256. The partial MATLAB code to calculate the LPC coefficients is listed in Table 11.1. The complete MATLAB program is given in example11_2.m. The magnitude response of the synthesis filter shows the envelope of the speech spectrum as shown in Figure 11.2. Example 11.3: Calculate the LPC coefficients as in Example 11.2 using the function lpc instead of levinson. In this case, g and e are identical if using the same order, and the LPC coefficients are identical to Example 11.2. Using a high-order synthesis filter, the frequency response is closer to the original speech spectrum. 
Figure 11.3 shows the use of high order (42) to calculate the LPC coefficients using lpc.JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 480 SPEECH-CODING TECHNIQUES Table 11.1 MATLAB code to calculate LPC coefficients and frequency response fid=fopen('voice4.pcm','r'); % Open the pcm data file b= fread(fid,20000,'short'); % Read 20000 samples to b % Windowing w=hamming(frame); % Generate Hamming window x1=b((start+1):(start+frame)); x=x1.*w; % Windowing y=fft(x,fftL); % FFT of the specified block py=10*log10(y.*conj(y)/fftL); % Magnitude response fclose(fid); % Close the file % Calculation of autocorrelation m=0; while (m<=lpcOrder); r(m+1)=sum(x((m+1):(frame)).*x(1:frame-m)); r(m+1) = r(m+1)/frame; m=m+1; end; % Levinson algorithm [a,e]=levinson(r,lpcOrder); 0 500 1000 1500 2000 2500 3000 3500 400010 20 30 40 50 60 70 80 LPC envelope FFT spectrum Frequency (Hz) Original speech spectrum and its LPC envelope Magnitude (dB) Figure 11.2 Spectral envelope of speech spectrum derived from the synthesis filterJWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 OVERVIEW OF CELP VOCODERS 481 0 500 1000 1500 2000 2500 3000 3500 400010 20 30 40 50 60 70 80 Synthesis filter frequency response FFT spectrum Frequency (Hz) Original speech spectrum and its LPC envelope Magnitude (dB) Figure 11.3 Magnitude response of synthesis filter with higher order 11.2.2 Long-Term Prediction Filter The long-term prediction filter P(z) models the long-term correlation in speech to provide fine spectral structure, and it has the following general form: P(z) = I i=−I bi z−(Lopt+i), (11.10) where Lopt is the optimum pitch period, and bi are the coefficients. Typically, I = 0 for one tap, I = 1 for three taps, and I = 2 for 5-tap pitch filters. In some cases, the long-term prediction filter is also called an adaptive codebook since the excitation signals are adaptively updated. An example is given in the ITU-T G.723.1, which uses a fifth-order long-term prediction filter. 11.2.3 Perceptual Based Minimization Procedure A reduction in perceived distortion is possible if the noise spectrum is shaped to place the majority of the error in the formant (high-energy) regions where human ears are relatively insensitive because of the auditory masking. On the other hand, more subjectively disturbing noise in the formant nulls must be reduced. The synthesis filter in Equation (11.1) is used to construct the perceptual weighting filter. TheJWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 482 SPEECH-CODING TECHNIQUES formant perceptual filter has the following transfer function: W(z) = 1 − p i=1 ai z−i γ i 1 1 − p i=1 ai z−i γ i 2 = A(z/γ2) A(z/γ1) , (11.11) where 0 <γ <1 is a bandwidth expansion factor with typical values γ1 = 0.9 and γ2 = 0.5. The synthesis filter 1/A(z) and perceptual weighting filter W(z) can be combined to form H(z) = W(z)/A(z). (11.12) The impulse response of a combined or cascaded filter is denoted by {h(n), n = 0, 1,..., Nsub − 1}, where Nsub is the subframe length. 11.2.4 Excitation Signal For efficient temporal analysis, a speech frame is usually divided into a number of subframes. For example, there are four subframes defined in the G.723.1. For each subframe, the excitation signal is generated and the error is minimized to find the optimum excitation. The excitation varies between the pulse train and the random noise. 
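The impulse response h(n) of the combined filter H(z) = W(z)/A(z) in Equation (11.12), which the excitation search below relies on, can be obtained by bandwidth-expanding the LPC coefficients and filtering a unit impulse. The sketch below follows the common CELP convention W(z) = A(z/γ1)/A(z/γ2); treat the orientation of the two factors as an assumption where the typography of Equation (11.11) is ambiguous.

#define P    10          /* LPC order                        */
#define NSUB 60          /* subframe length, as in G.723.1   */

/* Impulse response h(n) of H(z) = A(z/g1) / [A(z/g2) A(z)], where      */
/* a[1..P] are the coefficients of A(z) = 1 - sum a_i z^-i.             */
void weighted_impulse_response(const float a[P + 1], float g1, float g2,
                               float h[NSUB])
{
    float an[P + 1], ad[P + 1];          /* bandwidth-expanded coefficients */
    float u[NSUB], v[NSUB];
    float gp1 = 1.0f, gp2 = 1.0f;
    int   n, i;

    for (i = 1; i <= P; i++)
    {
        gp1 *= g1;  gp2 *= g2;
        an[i] = a[i] * gp1;              /* numerator   A(z/g1)          */
        ad[i] = a[i] * gp2;              /* denominator A(z/g2)          */
    }

    for (n = 0; n < NSUB; n++)           /* FIR part: A(z/g1) on an impulse */
        u[n] = (n == 0 ? 1.0f : 0.0f) - ((n >= 1 && n <= P) ? an[n] : 0.0f);

    for (n = 0; n < NSUB; n++)           /* all-pole 1/A(z/g2)           */
    {
        v[n] = u[n];
        for (i = 1; i <= P && i <= n; i++)
            v[n] += ad[i] * v[n - i];
    }

    for (n = 0; n < NSUB; n++)           /* all-pole 1/A(z), Eq. (11.12) */
    {
        h[n] = v[n];
        for (i = 1; i <= P && i <= n; i++)
            h[n] += a[i] * h[n - i];
    }
}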
A general form of the excitation signal e(n) shown in Figure 11.1 can be expressed as e(n) = ev(n) + eu(n), 0 ≤ n ≤ Nsub − 1, (11.13) where eu(n) is the excitation from a fixed or secondary codebook given by eu(n) = Guck(n), 0 ≤ n ≤ Nsub − 1, (11.14) where Gu is the gain, Nsub is the length of the excitation vector (or the subframe), and ck(n)isthe nth-element of kth-vector in the codebook. In Equation (11.13), ev(n) is the excitation from the long- term prediction filter expressed as ev (n) = I j=−I e(n + j − Lopt)b( j), n = 0, 1,...,Nsub − 1. (11.15) Passing e(n) through the combined filter H(z), we have perceptually weighted synthesis speech given by ˆxw(n) = v(n) + u(n) = n j=0 ev( j)h(n − j) + n j=0 eu( j)h(n − j), n = 0, 1,...,Nsub − 1, (11.16) where the first term v(n) = n j=0 ev( j)h(n − j) is from the long-term predictor and the second term u(n) = n j=0 eu( j)h(n − j) is from the secondary codebook. Therefore, the weighted error ew(n) can be described as ew(n) = xw(n) − ˆxw(n). (11.17)JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 OVERVIEW OF CELP VOCODERS 483 The squared error of ew(n)isgivenby Ew = Nsub−1 n=0 e2 w(n). (11.18) By computing the above equations for all possible parameters including the pitch predictor coefficients bi (pitch gain), lag Lopt (delay), optimum secondary excitation code vector ck with elements ck(n), n = 0,..., Nsub − 1, and gain Gu, we can find the minimum Ew,min as Ew,min = min {Ew} = min  Nsub−1 n=0 e2 w(n)  . (11.19) This minimization procedure is a joint optimization between the pitch prediction (adaptive codebook) and the secondary (fixed) codebook excitations. The joint optimization of all the excitation parameters, including Lopt, bi , Gu, and ck, is possible but it is computationally intensive. A significant simplification can be achieved by assuming that the pitch prediction parameters are optimized independently from the secondary codebook excitation parameters. We call this a separate optimization procedure. In a separate optimization, the excitation e(n) given in Equation (11.13) contains only ev(n) because eu(n) = 0. The optimized pitch lag and pitch gain are first found using historical excitations. The contribu- tion of the pitch prediction v(n) can be subtracted from the target signal xw(n) to form a new target signal. The second round of minimization is conducted by approximating the secondary codebook contribution to this new target signal. An example of separate optimization procedure can be found in G.723.1. 11.2.5 Algebraic CELP The algebraic CELP (ACELP) implies that the structure of the codebook is used to select the excitation codebook vector. The codebook vector consists of a set of interleaved permutation codes containing few nonzero elements [3, 4]. The ACELP fixed-codebook structures have been used in G.729 and G.723.1 low-bit rate at 5.3 kbps, and WCDMA AMR. The fixed-codebook structure used in G.729 is shown in Table 11.2. In Table 11.2, mk is the pulse position, k is the pulse number, the interleaving depth is 5. In this codebook, each codebook vector contains four nonzero pulses indexed by ik. Each pulse can have either the amplitudes of +1or−1, and can assume the positions given by Table 11.2. The codebook vector, ck, is determined by placing four unit pulses at the locations mk multiplied with their signs (±1) as follows: ck(n) = s0δ(n − m0) + s1δ(n − m1) + s2δ(n − m2) + s3δ(n − m3), n = 0, 1,...,39, (11.20) where δ(n) is a unit pulse. 
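A short C sketch of constructing the codevector of Equation (11.20) from four track positions and signs, and of packing each pulse into the 'sign + position' form used in the example that follows, is given below. The track layout follows Table 11.2; the packing is illustrative of Table 11.3 rather than the exact G.729 bitstream format.

#define SUBFRAME 40

/* Build the ACELP codevector of Equation (11.20) from the four pulse   */
/* positions m[0..3] (taken from Table 11.2) and signs s[0..3] = +1/-1. */
void build_acelp_codevector(const int m[4], const int s[4],
                            float ck[SUBFRAME])
{
    int n, k;
    for (n = 0; n < SUBFRAME; n++)
        ck[n] = 0.0f;
    for (k = 0; k < 4; k++)
        ck[m[k]] += (float)s[k];          /* four signed unit pulses     */
}

/* Position index within a track: tracks 0-2 hold positions t, t+5,     */
/* ..., t+35 (index = position/5); track 3 holds 3, 8, ..., 38 followed */
/* by 4, 9, ..., 39 (Table 11.2).                                       */
int track_position_index(int track, int position)
{
    if (track < 3)
        return position / 5;
    return (position % 5 == 3) ? position / 5 : 8 + position / 5;
}

/* "Sign + position" code in the style of Table 11.3: the sign bit      */
/* (0 for +1, 1 for -1) is the MSB above 3 position bits for tracks     */
/* 0-2 and above 4 position bits for track 3.                           */
unsigned pack_pulse(int track, int position, int sign)
{
    int      posBits = (track < 3) ? 3 : 4;
    unsigned signBit = (sign < 0) ? 1u : 0u;
    return (signBit << posBits) | (unsigned)track_position_index(track, position);
}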
Table 11.2 G.729 ACELP codebook Number of bits to code Pulse ik Sign sk Position mk (sign + position) i0 ±1 0, 5, 10, 15, 20, 25, 30, 35 1 + 3 i1 ±1 1, 6, 11, 16, 21, 26, 31, 36 1 + 3 i2 ±1 2, 7, 12, 17, 22, 27, 32, 37 1 + 3 i3 ±1 3, 8, 13, 18, 23, 28, 33, 38, 4, 9, 14, 19, 24, 29, 34, 39 1 + 4JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 484 SPEECH-CODING TECHNIQUES 0 1 2345 10 15 20 22 25 30 34 39 Figure 11.4 Four pulse locations in a 40-sample frame Example 11.4: Assuming that all four pulses in a frame are located as shown in Figure 11.4, these pulse locations and signs are found by minimizing the squared error defined in Equation (11.18). These pulse positions are confined within certain positions with interleaving depth 5 as defined by Table 11.2. The pulse positions and signs can be coded into codeword as shown in Table 11.3. We assume that the positive sign is encoded with 0 and the negative sign is encoded with 1. The sign bit is the MSB. This example shows how to encode ACELP excitations. We need 4 + 4 + 4 + 5 = 17 bits to code this ACELP information. 11.3 Overview of Some Popular CODECs Different vocoders are used for different applications that depend on the bit rate, robustness to channel errors, algorithm delay, complexity, and sampling rate. Two most popular algorithms G.729 and G.723.1 are widely used in real-time communications over the Internet due to their low-bit rates and high qualities. The AMR is mainly used for the GSM and the WCDMA wireless systems for its flexible rate adaptation to error conditions of wireless channels. 11.3.1 Overview of G.723.1 One example of a CELP CODEC is the ITU-T G.723.1 [1], whose basic structure is illustrated in Figure 11.5. The predictor coefficients are updated using the forward adaptation method. The encoder operates on a frame of 240 samples, i.e., 30 ms at the 8 kHz sampling rate. The input speech is first buffered into frames of 240 samples, filtered by a highpass filter to remove the DC component, and then divided into four subframes of 60 samples each. For every subframe m, a 10th-order LPC filter is computed using the highpass filtered signal xm(n). For each subframe, a window of 180 samples is centered on the current subframe as shown in Figure 11.6. A Hamming window is applied to these samples. Eleven autocorrelation coefficients are computed from the windowed signal. The linear predictive coefficients are found using the LevinsonÐDurbin algorithm. For every input frame, four LPC sets will be computed, one for each subframe. These LPC coefficient sets are used to construct the short-term perceptual weighting filter. Table 11.3 An example of coding the ACELP pulses into codeword Pulse ik Sign sk Position mk Sign + position = encoded code i0 +1 5 (k=1) 0<<3+1=0001 i1 -1 1 (k=0) 1<<3+0=1000 i2 -1 22 (k=4) 1<<3+4=1100 i3 +1 34 (k=14) 0<<4 + 14 = 01110JWBK080-11 JWBK080-Kuo March 2, 2006 16:5 Char Count= 0 OVERVIEW OF SOME POPULAR CODECS 485 Formant perceptual weighting Highpass filter Framer Harmonic noise shaping Pitch predictor Impulse response calculator LPC analysis Secondary excitation decoder LSP quantizer em(n) y(n) xm(n) Am(z) Wm(z) Lolp,m Hm(z) fm(n) tm(n) vm(n) rm(n) hm(n) Pm(z) zm(n) ev,m(n) Pitch estimator MP-MLQ/ ACELP Zero input response Memory update LSP decoder LSP interpolator Pitch prediction decoder eu,m(n) Simulated decoder MSE Lopt,m, bopt,m Am(z)~ − − Figure 11.5 Block diagram of G.723.1 CODEC Calculating the LPC coefficients requires the past, current, and future subframes. 
The LPC synthesis filter is defined as

    \frac{1}{A_m(z)} = \frac{1}{1 - \sum_{i=1}^{10} a_{i,m} z^{-i}},    (11.21)

where m = 0, 1, 2, 3 is the subframe index. As shown in Figure 11.5, the LPC filter coefficients of the last subframe are converted to LSP (line spectral pair) coefficients. The reason for performing the LPC-to-LSP conversion is to take advantage of two properties of LSP coefficients: they make it easy to verify the stability of the filter, and they have higher correlation among subframes. The first property is used to keep the synthesis filter stable after quantization, and the second is used to further remove redundancy.

Figure 11.6  G.723.1 LPC analysis windows vs. subframes (each 180-sample analysis window is centered on its 60-sample subframe, so it spans past samples and look-ahead within the 240-sample frame)

The LPC coefficient calculation and quantization can be further explained with Table 11.4 and the following procedure:

Table 11.4  Procedure of LPC coefficient calculation and reconstruction

  Seq.  Computed parameters                    Subframe 0           Subframe 1           Subframe 2           Subframe 3
  1     LPC coefficients                       {a_{i,0}}            {a_{i,1}}            {a_{i,2}}            {a_{i,3}}
  2     LSP coefficients                       -                    -                    -                    {\beta_{i,3}}
  3     LSP quantization and dequantization    -                    -                    -                    {\hat{\beta}_{i,3}}
  4     Interpolated LSP coefficients          {\hat{\beta}_{i,0}}  {\hat{\beta}_{i,1}}  {\hat{\beta}_{i,2}}  {\hat{\beta}_{i,3}}
  5     Reconstructed LPC coefficients         {\hat{a}_{i,0}}      {\hat{a}_{i,1}}      {\hat{a}_{i,2}}      {\hat{a}_{i,3}}

1. Compute {a_{i,m}}, i = 1, ..., 10, for subframes m = 0, 1, 2, 3; these 10th-order LPC coefficients define Equation (11.21). The unquantized LPC coefficients are used to construct the short-term perceptual weighting filter W_m(z), which filters the entire frame to obtain the perceptually weighted speech signal.

2. Convert the LPC coefficients of the last subframe, {a_{i,3}}, to LSP coefficients {\beta_{i,3}}.

3. Quantize the 10 LSP coefficients with a vector quantizer to produce the LSP index for transmission, and dequantize them to obtain {\hat{\beta}_{i,3}}.

4. Interpolate the LSP coefficients {\hat{\beta}_{i,m}} of each subframe from {\hat{\beta}_{i,3}} of the current frame and of the previous frame.

5. Convert the interpolated LSP coefficients {\hat{\beta}_{i,m}} back to LPC coefficients {\hat{a}_{i,m}} and construct the synthesis filter 1/\tilde{A}_m(z) for each subframe. The encoder also performs this step so that the encoder and decoder use the same set of synthesis filters; the decoder never has the unquantized LPC coefficients.

The open-loop pitch estimation is performed on two adjacent subframes (120 samples), and the pitch period is searched in the range of 18 to 142 samples. Using the estimated open-loop pitch period L_{olp,m}, a harmonic noise-shaping filter P_m(z) is constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter, and the harmonic noise-shaping filter, H_m(z) = W_m(z) P_m(z) / \tilde{A}_m(z), is used to create an impulse response h_m(n), n = 0, ..., 59. The adaptive excitation signal e_{v,m}(n) and the secondary excitation signal e_{u,m}(n) are filtered by this combined filter to produce the zero-state responses v_m(n) and u_m(n), respectively.

For the adaptive codebook excitation, a fifth-order pitch predictor is used. The optimum pitch periods L_{opt,m} of subframes 0 and 2 are computed via a closed-loop search around the open-loop pitch estimate L_{olp,m}. The optimum pitch periods of subframes 1 and 3 are searched as differential values around the optimum pitch periods of subframes 0 and 2, respectively.
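Because W_m(z), P_m(z), and 1/\tilde{A}_m(z) are all folded into the single impulse response h_m(n), one way to compute it is to pass a unit impulse through the three filters in cascade with zero initial states. The sketch below assumes the coefficients of each direct-form section have already been derived for the current subframe; the structure and names are illustrative, not the bit-exact G.723.1 routine.

/* Sketch: impulse response h_m(n), n = 0,...,59, of the combined filter
 * H_m(z) = W_m(z) P_m(z) / A~_m(z), obtained by passing a unit impulse
 * through the three sections in cascade with zero initial states.
 */
#include <string.h>

#define SUBFRAME 60

typedef struct {                /* generic direct-form filter section      */
    const double *b; int nb;    /* numerator b[0..nb]                      */
    const double *a; int na;    /* denominator a[1..na], with a[0] = 1     */
} Section;

/* y(n) = sum_i b[i] x(n-i) - sum_i a[i] y(n-i), zero initial states. */
static void filter_zero_state(const Section *s, const double *x, double *y, int len)
{
    for (int n = 0; n < len; n++) {
        double acc = 0.0;
        for (int i = 0; i <= s->nb && i <= n; i++)
            acc += s->b[i] * x[n - i];
        for (int i = 1; i <= s->na && i <= n; i++)
            acc -= s->a[i] * y[n - i];
        y[n] = acc;
    }
}

/* h_m(n) of H_m(z) = W_m(z) P_m(z) / A~_m(z). */
static void combined_impulse_response(const Section *w, const Section *p,
                                      const Section *syn, double h[SUBFRAME])
{
    double u[SUBFRAME], v[SUBFRAME];
    memset(u, 0, sizeof(u));
    u[0] = 1.0;                               /* unit impulse delta(n)      */
    filter_zero_state(w,   u, v, SUBFRAME);   /* formant weighting W_m(z)   */
    filter_zero_state(p,   v, u, SUBFRAME);   /* harmonic shaping P_m(z)    */
    filter_zero_state(syn, u, h, SUBFRAME);   /* synthesis 1/A~_m(z)        */
}

This h_m(n) is exactly what the zero-state convolutions in the adaptive and fixed codebook searches operate on.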
The pitch periods of subframes 0 and 2, and the differential values for subframes 1 and 3, are transmitted to the decoder.

Let b_k, k = 0, ..., N_{ltp} - 1, denote the kth gain vector with elements b_k(n), n = 0, 1, 2, 3, 4, where N_{ltp} is the size of the pitch gain codebook. The adaptive codebook excitation is formed by

    e_{v,m}(n) = \sum_{j=0}^{4} b_k(j) e_m(n + j - L_m - 2), \quad 0 \le n \le 59, \; 0 \le k \le N_{ltp} - 1,    (11.22)

where the candidate lag L_m takes the values L_{olp,m} - 1, L_{olp,m}, and L_{olp,m} + 1 for subframes 0 and 2, or L_{opt,m-1} - 1, L_{opt,m-1}, L_{opt,m-1} + 1, and L_{opt,m-1} + 2 for subframes 1 and 3, where L_{opt,m-1} is the optimum pitch lag of the previous subframe (0 or 2). Using closed-loop quantization, the adaptive codebook contribution is computed as

    v_m(n) = e_{v,m}(n) * h_m(n) = \sum_{j=0}^{n} e_{v,m}(j) h_m(n - j), \quad 0 \le n \le 59.    (11.23)

The optimization procedure minimizes the mean-squared error E_{ac} given by

    E_{ac} = \sum_{n=0}^{59} [t_m(n) - v_m(n)]^2, \quad 0 \le k \le N_{ltp} - 1.    (11.24)

The best lag L_{opt,m} and tap vector b_{opt,m} that minimize E_{ac} are identified in the current subframe as the adaptive codebook's lag value and gain vector, respectively. The optimum reconstructed adaptive codebook contribution is expressed as

    v_m(n) = \sum_{j=0}^{4} b_{opt,m}(j) \sum_{k=0}^{n} e_m(k - L_{opt,m} - 2 + j) h_m(n - k), \quad 0 \le n \le 59.    (11.25)

The contribution of the adaptive codebook ...
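The closed-loop search of Equations (11.22)-(11.24) can be sketched in C as an exhaustive loop over the candidate lags and the pitch gain codebook. The function signature, the lag-candidate list, and the brute-force zero-state convolution are illustrative assumptions; a real encoder precomputes filtered basis vectors and handles lags shorter than the subframe by periodically extending the past excitation.

/* Sketch of the closed-loop adaptive codebook search of Equations
 * (11.22)-(11.24): for each candidate lag L and each gain vector b_k,
 * build e_v(n) from past excitation samples, filter it by h(n) with
 * zero initial states, and compare the result with the target t(n).
 */
#define SUBFRAME 60          /* G.723.1 subframe length       */
#define TAPS      5          /* fifth-order pitch predictor   */

/* exc points at e_m(0) of the current subframe; negative indices reach into
 * the previously reconstructed excitation (the adaptive codebook memory),
 * assumed here to be periodically extended for short lags. */
static float adaptive_search(const float *exc,
                             const float h[SUBFRAME], const float t[SUBFRAME],
                             const float (*b)[TAPS], int n_ltp,
                             const int *lags, int n_lags,
                             int *best_lag, int *best_k)
{
    float best_E = 1e30f;
    for (int l = 0; l < n_lags; l++) {
        for (int k = 0; k < n_ltp; k++) {
            float E = 0.0f;
            for (int n = 0; n < SUBFRAME; n++) {
                float v = 0.0f;                      /* Eq. (11.23)          */
                for (int i = 0; i <= n; i++) {
                    float ev = 0.0f;                 /* Eq. (11.22)          */
                    for (int j = 0; j < TAPS; j++)
                        ev += b[k][j] * exc[i + j - lags[l] - 2];
                    v += ev * h[n - i];
                }
                float d = t[n] - v;
                E += d * d;                          /* Eq. (11.24)          */
            }
            if (E < best_E) {                        /* keep the best pair   */
                best_E = E;
                *best_lag = lags[l];
                *best_k = k;
            }
        }
    }
    return best_E;
}

For subframes 0 and 2, the lag list would contain the three candidates around L_{olp,m}; for subframes 1 and 3, the four differential candidates around L_{opt,m-1} described above.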