professional git


Table of Contents Cover Title Page Introduction How this Book is Unique Target Audience Structure and Content Reader Value Next Steps PART I: UNDERSTANDING GIT CONCEPTS Chapter 1: What Is Git? History of Git Industry-Standard Tooling The Git Ecosystem Git's Advantages and Challenges Summary Chapter 2: Key Concepts Design Concepts: User-Facing Design Concepts: Internal Repository Design Considerations Summary Chapter 3: The Git Promotion Model The Levels of Git Summary Connected Lab 1: Installing Git Installing Git for Windows Steps Installing Git on Mac OS X Installing Git on Linux PART II: USING GIT Chapter 4: Configuration and Setup Executing Commands in Git Configuring Git Initializing a Repository Advanced Topics Summary Chapter 5: Getting Productive Getting Help The Multiple Repositories Model Adding Content to Track—Add Finalizing Changes—Commit Putting It All Together Advanced Topics Summary Connected Lab 2: Creating and Exploring a Git Repository and Managing Content Prerequisites Optional Advanced Deep-Dive into the Repository Structure Steps Chapter 6: Tracking Changes Git Status Git Diff Summary Connected Lab 3: Tracking Content through the File Status Life Cycle Prerequisites Steps Chapter 7: Working with Changes over Time and Using Tags The Log Command Git Blame Seeing History Visually Tags Undoing Changes in History Advanced Topics Summary Connected Lab 4: Using Git History, Aliases, and Tags Prerequisites Steps Chapter 8: Working with Local Branches What Is a Branch? 
Advanced Topics Summary Connected Lab 5: Working with Branches Prerequisites Steps Chapter 9: Merging Content The Basics of Merging Dealing with Conflicts Visual Merging Advanced Topics Summary Connected Lab 6: Practicing with Merging Prerequisites Steps Chapter 10: Supporting Files in Git The Git Attributes File The Git Ignore File Summary Chapter 11: Doing More with Git Modifying the Layout of Files and Directories in Your Local Environment Commands for Searching Working with Patches and Archives for Changes Commands for Cleaning Up Advanced Topics Summary Connected Lab 7: Deleting, Renaming, and Stashing Prerequisites Steps Chapter 12: Understanding Remotes—Branches and Operations Remotes Summary Connected Lab 8: Setting Up a GitHub Account and Cloning a Repository Prerequisites Steps Chapter 13: Understanding Remotes—Workflows for Changes The Basic Conflict and Merge Resolution Workflow in Git Hosted Repositories Summary Connected Lab 9: Using the Overall Workflow with a Remote Repository Prerequisites Steps Chapter 14: Working with Trees and Modules in Git Worktrees Submodules Subtrees Summary About Connected Labs 10–12 Connected Lab 10: Working with Worktrees Prerequisites Steps Connected Lab 11: Working with Submodules Prerequisites Steps Connected Lab 12: Working with Subtrees Prerequisites Steps Chapter 15: Extending Git Functionality with Git Hooks Installing Hooks Updating Hooks Common Hook Attributes Hook Descriptions Other Hooks Hooks Quick Reference Summary End User License Agreement List of Illustrations Chapter 1: What Is Git? 
Figure 1.1 Example GitHub page Figure 1.2 GitLab project screen Figure 1.3 Examples of GUIs available for Git (from git-scm.org) Figure 1.4 Example Gerrit screen Chapter 2: Key Concepts Figure 2.1 A traditional centralized version control model Figure 2.2 A distributed version control model Figure 2.3 Disconnected development Figure 2.4 The delta storage model Figure 2.5 The snapshot storage model Figure 2.6 A representation of Git's packing behavior to optimize content size Chapter 3: The Git Promotion Model Figure 3.1 A simple dev-test-prod environment Figure 3.2 The levels of a Git system Figure 3.3 The local versus remote environments Figure 3.4 Git in one picture Chapter 4: Configuration and Setup Figure 4.1 Understanding the scopes of Git configuration files Figure 4.2 Tree listing of a .git directory (local repository) Figure 4.3 Mapping files and directories to Git repositories Chapter 5: Getting Productive Figure 5.1 Abbreviated version of help invoked with the -h option Figure 5.2 Git browser-based man page Figure 5.3 Working with multiple repositories Figure 5.4 Overlaying configuration files on your model Figure 5.5 Where adding and staging fit in Figure 5.6 An edit session for a hunk Figure 5.7 Where commit fits in Figure 5.8 The basic workflow for multiple commits Figure 5.9 Workflow for an amended commit Figure 5.10 The editor session for a commit message using a template file and the --verbose --verbose options Chapter 6: Tracking Changes Figure 6.1 Empty local environment levels. Figure 6.2 File created in working directory Figure 6.3 Version a of the file is staged. Figure 6.4 Update made to working directory version Figure 6.5 Version b staged Figure 6.6 The file is committed. 
Figure 6.7 Starting point for diffing—working directory clean Figure 6.8 Workflow of git diff between working directory and Git (checking the staging area) Figure 6.9 Workflow of git diff between working directory and Git (checking the local repository) Figure 6.10 Local version updated to b Figure 6.11 Diff between modified local version and Git Figure 6.12 Diffing further up the chain Figure 6.13 Diffing from the working directory with a version in the staging area Figure 6.14 Diffing starting at the staging area Figure 6.15 Diffing directly against a SHA1 (HEAD) Figure 6.16 Vimdiff Figure 6.17 WinMerge Figure 6.18 Meld Figure 6.19 KDiff3 Chapter 7: Working with Changes over Time and Using Tags Figure 7.1 Using the gitk tool to browse local history Figure 7.2 Tagging a commit Figure 7.3 Starting repository contents Figure 7.4 Resetting back to an absolute SHA1 Figure 7.5 Resetting relative to a tag Figure 7.6 Resetting for revert Figure 7.7 Local environment after the revert Chapter 8: Working with Local Branches Figure 8.1 Progression of chain of commits Figure 8.2 Your starting chain of commits Figure 8.3 After the creation of a testing branch Figure 8.4 After checking out the testing branch Figure 8.5 The current branch pointer is moved to indicate that the newest commit is the latest content on that branch. 
Figure 8.6 Local repository—active branch: master Figure 8.7 Git checkout master Figure 8.8 Git checkout testing Figure 8.9 Git checkout master (again) Figure 8.10 Local repository with two branches Figure 8.11 After deleting the testing branch Figure 8.12 The master-as-production model Figure 8.13 The master-to-release model Figure 8.14 The master-as-integration model Figure 8.15 The parallel model Figure 8.16 Repository before checkout of fc28c0d Figure 8.17 Repository after checkout of fc28c0d Figure 8.18 Repository state after the new commit Figure 8.19 Repository after you switch back to feature1 Figure 8.20 After creating a new branch off of your commit Figure 8.21 After a checkout of experimental Figure 8.22 The two paths of your two branches Chapter 9: Merging Content Figure 9.1 Setup for the fast-forward example Figure 9.2 The fast-forward merge Figure 9.3 Setup for the three-way merge example—not eligible for fast-forward Figure 9.4 The three points considered for the three-way merge Figure 9.5 The three-way merge process Figure 9.6 The new merge commit after the three-way merge Figure 9.7 Setup for the rebase example Figure 9.8 Identifying a common ancestor Figure 9.9 Computing deltas from the source branch Figure 9.10 Applying deltas on the destination tip Figure 9.11 Completed rebase of a feature on master Figure 9.12 Setup for the cherry-pick example Figure 9.13 End result of the cherry-pick Figure 9.14 The merge process in the local environment Figure 9.15 Master branch with three topic branches Figure 9.16 After a merge of the three topic branches Figure 9.17 The earlier cherry-pick example Figure 9.18 C5 cannot be cherry-picked due to a conflict. 
Figure 9.19 The choices for options to pick one version Figure 9.20 Completed cherry-pick with C5 from feature Figure 9.21 After the octopus merge Figure 9.22 Merging with vimdiff Figure 9.23 Merging with WinMerge Figure 9.24 Merging with Meld Figure 9.25 Merging with KDiff3 Figure 9.26 Setup for an advanced rebase Figure 9.27 Topic's chain of commits Figure 9.28 Computing the deltas to rebase Figure 9.29 Applying the deltas to master Figure 9.30 The completed rebase Figure 9.31 Topic merged into master Figure 9.32 Beginning state of your branch Figure 9.33 Temporary file created for scripting the rebase actions Figure 9.34 Edited interactive rebase to-do script Figure 9.35 Screen to enter commit message for squashed commits Figure 9.36 Adding a new commit message for the squashed commits Figure 9.37 Your chains of commits after the interactive rebase is completed Chapter 10: Supporting Files in Git Figure 10.1 The Git model with smudge and clean filters Chapter 11: Doing More with Git Figure 11.1 Local environment with an uncommitted change Figure 11.2 After the initial stash Figure 11.3 Another change in your local environment with an untracked file Figure 11.4 After stashing, including the untracked file Figure 11.5 Another change in your local environment Figure 11.6 The third element on the queue Figure 11.7 Queue and local environment after an apply and pop from the stash Figure 11.8 Changing the format of a patch received in e-mail Figure 11.9 Starting state for bisect Figure 11.10 Checking for a good version Figure 11.11 Initial bisect trial Figure 11.12 Bisecting—the next steps Figure 11.13 Narrowing in on the first bad commit Figure 11.14 The first bad commit is found Figure 11.15 gitk view of a bisect Chapter 12: Understanding Remotes—Branches and Operations Figure 12.1 Arrangement of local versus remote environments Figure 12.2 Login access (top) versus SSH access (bottom) Figure 12.3 Start and end of a cloning operation Figure 12.4 A way to think about 
cloning multi-level paths Figure 12.5 Initial changes in the local repository Figure 12.6 After a push to the remote repository Figure 12.7 Remote tracking branch created in the local repository Figure 12.8 After a commit into the local repository Figure 12.9 Before and after a fetch operation Figure 12.10 The local repository before and after the merge Figure 12.11 Before and after a pull operation Chapter 13: Understanding Remotes—Workflows for Changes Figure 13.1 File granularity corresponding to delta changes Figure 13.2 Commits are a snapshot of files and directories. Figure 13.3 Two users with the same cloned contents Figure 13.4 User 1 successfully pushes their changes. Figure 13.5 User 2 attempts to push their changes and is rejected. Figure 13.6 User 2 pulls the latest changes to merge updates locally. Figure 13.7 Merged content is pushed back into the remote. Figure 13.8 Forking a repository Figure 13.9 The typical Git lifecycle on a forked repository Figure 13.10 Sending a pull request to the owner Figure 13.11 Repository owner pulls changes. 
Figure 13.12 A workflow model for making and incorporating changes Chapter 14: Working with Trees and Modules in Git Figure 14.1 Illustration of multiple working trees Figure 14.2 Illustration of how submodules work Figure 14.3 Illustration of a subtree layout List of Tables Chapter 3: The Git Promotion Model Table 3.1 Core Commands for Moving Content between Levels in Git Chapter 4: Configuration and Setup Table 4.1 Components of a Git Command Line Invocation Table 4.2 Porcelain Commands in Git Table 4.3 Plumbing Commands in Git Chapter 6: Tracking Changes Table 6.1 Git Status Codes for Short Options Chapter 10: Supporting Files in Git Table 10.1 The File Scope for Git Attributes Table 10.2 Options for Specifying Attributes Table 10.3 Scopes and Precedence for Git Ignore Files Chapter 12: Understanding Remotes—Branches and Operations Table 12.1 Summarizing the Types of Branches in Git Chapter 15: Extending Git Functionality with Git Hooks Table 15.1 List of Git Hooks by Operation PROFESSIONAL Git® Brent Laster Introduction Welcome. If your job or interests involve designing, creating, or testing software, or managing any part of a software development lifecycle, chances are that you’ve heard of Git and, at some level, have tried to use and understand it. This book will help you reach that goal. To put it simply, Professional Git is intended to help you understand and use Git to get your job done, whether that job is a personal project or a professional requirement. In the process, it will also make Git part of your professional comfort zone. Throughout the book, I’ve provided the background and concepts that you need to know (and understand) to make sense of Git, while you learn how to interact with it. This section will provide you with a quick introduction to the book. It will explain how this book is unique from other books about Git, the intended target audience, the book’s overall structure and content, and some of the value it offers you. 
I encourage you to take a few minutes and read through this section. Then, you can dive into the material at your own pace, and build your skills and understanding of Git through the text and the included hands-on labs. Or, if you’d like to quickly see additional information about the range of content, you can browse the table of contents. Thanks for taking a look at Professional Git.

HOW THIS BOOK IS UNIQUE

While many books about Git are already on the market, most focus solely on the technical usage of the application. Professional Git will provide you with that, but it will also provide you with an understanding of Git in terms of concepts that you probably already know.

Most books also do not provide practical ways to integrate the concepts they describe. Learning is most effective when you have actual examples to work through so you can internalize the concepts and gain proficiency at your own pace. Professional Git includes Connected Labs that you can work through to absorb what you’ve just read. I’ve included simple, clear illustrations to help you visualize key ideas and workflows. I’ve also included Advanced Topics sections at the end of many chapters. These sections provide additional explanations of how to use some lesser-known features of Git as well as how to go beyond the standard Git features to gain extra value.

It is easy to have a rough transition from another source management system to Git if you don’t understand Git. To be most effective, you need to comprehend the Git model and workflow. You should also know what to watch out for as you make the transition and why it’s important to consider not only the commands and workflow, but also the structure and scope of its underlying repositories. I cover all of this in Professional Git.

TARGET AUDIENCE

This book is based on my years of training people on Git; these people worked at all levels and came from many different backgrounds: developers, testers, project managers, people managers, documentation specialists, and so on. I have presented the basic materials outlined in this book through many workshops at industry conferences and corporate training sessions. I’ve presented them at locations across the United States, as well as internationally. I’ve been successful in helping people to walk away with a newfound confidence in using Git.

I only make one assumption in this book: that you have experience with at least one source management system. It doesn’t matter which one: CVS, Subversion, Mercurial; any will do. I just assume that you have a basic awareness of what a source management system does as well as fundamental concepts such as checking in and checking out code and branching. Beyond that, you do not require any prior knowledge or experience. And even if you have significant experience with Git or another system, you’ll find something of benefit here.

In fact, if you’re reading this, then you probably fall into one of the following categories:

You are new to Git and know that you need to learn it.
You have used Git but have been trying to use it the same way you used your previous source control system.
You have used Git and feel that you know “just enough to be dangerous.”
You are getting by with Git, but really want to understand why it works the way it does and how to really use it as intended.
You work with, or manage, people who either use Git or need to learn it. Given that association, you need to know about Git and to understand the fundamental concepts.
You’ve heard about the potential benefits of Git, and so you are curious about it and about what it can do for you and the organization you work with.

You may actually see yourself in more than one of these categories.
However, you probably just want to be able to get your job done (whether that job is a personal or professional goal). This book was built on that premise. Git requires a mind shift. In fact, it requires a series of mind shifts. However, each shift is easy to understand once you can relate it to something you already know. Understanding each of these shifts will, in turn, allow you to be more productive and to harness the features of this powerful tool, and that’s what this book is about.

STRUCTURE AND CONTENT

This book is organized as a series of chapters that present Git from the ground up, teaching you what you need to know to become proficient and then building on it with new concepts.

In the first three chapters, I cover the foundational concepts of Git: how it’s different from other systems, the ecosystem that’s been built around it, its advantages and challenges, and the model that allows you to understand its workflow and manage content effectively with it. These chapters will provide you with a basic understanding of the ideas, goals, and essential terminology of Git.

In the remaining chapters of the book, I cover the usage and features of Git, from performing basic operations to create repositories and commit changes into them, to creating branches, doing merges, and working with content in public repositories.

Notice that I don’t have you using Git right away. (If you want to do that, feel free to jump ahead to Chapter 4, which quickly enables you to start getting hands-on with Git.) However, I highly recommend reading the first three chapters. If you’re new to Git (or it’s been a while), the background reading, especially in Chapters 2 and 3, will provide the foundation you need to understand the remaining chapters. And even if you’ve used Git before, reading these chapters may clear up questions that you’ve had about Git, give you a better mental model to work from, and form a basis to understand some of the more advanced concepts.

READER VALUE

Throughout the book, you’ll find examples and guidance on the commands and workflows you need to be productive with Git. Each chapter includes ways to relate concepts to what you already know and understand. In addition to the text, you’ll find many illustrations to help you understand concepts visually.

As I’ve already mentioned, this book also adds a feature that allows you to get hands-on experience with Git, via Connected Labs interspersed throughout the chapters. These labs are designed to reinforce the concepts presented in the text of the preceding chapter(s) and to get you actively involved in the learning process, allowing you to better grasp the concepts. To get the most out of the book, you should take the time to complete each lab (usually only a few minutes). You’ll find that these simple steps will greatly increase your overall understanding and confidence when using Git.

As well, take a look at the Advanced Topics sections, located at the end of some chapters. You’ll likely find explanations and ideas to leverage Git functionality in ways you may not have considered before, or you may find out how to use that feature you’ve always wondered about.

For the later labs, custom Git repositories with example content are provided for the user at http://github.com/professional-git. In addition, downloadable copies of the code for the hooks from the last chapter are available in http://github.com/professional-git/hooks. In the event that GitHub is not available, you can find the needed files at www.wrox.com/go/professionalgit.

NEXT STEPS

If this sounds like the book for you, then I encourage you to keep reading and to start making the connections and mind shifts that will help you succeed with Git. As you progress through the book, you’ll find many ideas, insights, and “a-ha” moments that will serve you well. And with that knowledge, you’ll soon be working at the level of “Professional Git.”

Part I Understanding Git Concepts
CHAPTER 1: What Is Git?
CHAPTER 2: Key Concepts
CHAPTER 3: The Git Promotion Model

Chapter 1 What Is Git?

WHAT'S IN THIS CHAPTER?

A brief introduction to Git and its history
The different ways to find and access Git
Types of applications that incorporate Git
The advantages of using Git
The challenges of using Git

In this chapter, you'll be introduced to Git and will learn about it from a product perspective: what it is, why it's used, the different kinds of interfaces you can use with it, and the good parts and challenging parts of working with it. This will provide an important foundation for understanding the technical details that follow in the subsequent chapters.

If I were to summarize what Git is in one paragraph, it would go something like this: Git is a popular and widely used source management system that greatly simplifies the development cycle. It enables users to create, use, and switch between branches for content development as easily as people create and switch between files in their daily workflow. It is implemented using a fast, efficient architecture that allows for ease of experimentation and refinement of local changes in an isolated environment before sharing them with others. In short, it allows everyday users to focus on getting the content right instead of worrying about source management, while providing more advanced users with the ability to record, edit, and share changes at any level of detail.

Simply put, Git is different. Really. When you're experienced with using Git and understand it, that difference will make you feel empowered and productive. When you're new to Git, and trying to understand it, you will encounter a model that will lead you to think differently about managing content in source control. To illustrate, there's an old saying that “when all you have is a hammer, everything looks like a nail.” When all you have is a traditional centralized source management system, everything looks like a file-by-file change that is expensive to branch. Not so with Git.
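
To give a feel for how lightweight branching is, here is a minimal sketch of creating, switching to, merging, and deleting a branch. It assumes Git is already installed; the scratch directory, the file name notes.txt, the branch name experiment, and the placeholder identity are all hypothetical choices for illustration, and the commands themselves are covered properly in later chapters.

```shell
#!/bin/sh
# Sketch only: a throwaway repository used to try out a branch.
set -e

repo=$(mktemp -d)                          # hypothetical scratch location
cd "$repo"
git init -q .
git config user.name "Example User"        # placeholder identity for this demo
git config user.email "user@example.com"

echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"

# Remember the default branch name (master or main, depending on configuration).
main_branch=$(git symbolic-ref --short HEAD)

# Create a branch and switch to it in one step.
git checkout -q -b experiment
echo "second line" >> notes.txt
git commit -q -a -m "Try an idea"

# Switch back, merge the work, and delete the branch.
git checkout -q "$main_branch"
git merge -q experiment
git branch -q -d experiment

# Only the original branch remains, now containing both commits.
git branch --list
```

Each branch operation here is a single short command, which is the point of the comparison with creating and switching between files.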

Git is one of those nice tools that actually allows users to focus on developing content and simplifying workflows. It's not just another tool in the toolbox; it is the toolbox. It contains all of the tools you need to track anything from a few files for a single user to projects spanning hundreds of users and a huge scope, such as the Linux kernel.

Today, many large companies use Git. It's free, it's powerful, it scales, and its model works when used as designed. Git also has a certain “feel” that's appealing to many people. Git is structured more like a series of individual utilities that you can run against your content, similar to how users work with operating systems. However, it doesn't try to be the system; it gives users ultimate control over their content, even to the point of being able to update history if needed.

Git manages basic units that equate to directory structures rather than individual files, so content that extends across file and directory boundaries can be managed together. Git simplifies branching, to a point where creating, merging, or deleting branches becomes nearly as quick and easy as creating, merging, or deleting files. It also provides a local environment with full source management control that can be updated independently of the shared, public environment.

Given that it is different from other source code management (SCM) systems, it's useful to understand how Git originated. The following section includes some of its history.

HISTORY OF GIT

Git has its roots in the development environment for the Linux kernel. In the early 2000s, the team working on the kernel began using a proprietary distributed source control system called BitKeeper (sometimes abbreviated as BK). The team was initially allowed to use this system for free. Over time, differences of opinion developed around the use of BK to the point that the owner of that system revoked the free use of the product.
At that time (in 2005), Linus Torvalds, the creator of Linux, set out to create a new system that maintained the distributed ideal, but also incorporated several additional concepts he had been working with. Perhaps most importantly, he wanted it to provide the fast performance that a project on the scope of the Linux kernel would need. Thus the motivation and ideas for what became Git came into being. Development began in early April of 2005, and an initial release was ready by July.

Originally, the idea was to position Git as a toolkit on top of which other systems could be implemented. However, over time, it has been made into a full-fledged SCM in its own right.

If you're wondering about the name, there are multiple definitions for the word Git, but all of them carry a negative connotation about a person. Git was given its name by its creator. Linus jokingly stated that he named all his projects after himself.

For those interested in learning more about this phase of Git development, detailed historical information is available on the Internet.

INDUSTRY-STANDARD TOOLING

From these early beginnings, Git has grown to become an industry-standard tool. Of course, industry standard is a relative term. Nevertheless, by nearly any criterion, Git fits. It is used across all levels of industry. Huge projects, such as the Linux kernel, are managed in it and mandate its use (see the following list). It is a key component of many continuous integration/continuous delivery pipelines. Demand for knowledge about it is ever increasing. Commercial and open-source projects and applications recognize that if they require source management services, they have to integrate with Git.

Projects and companies using Git include

Google
Facebook
Microsoft
Twitter
LinkedIn
Netflix
O'Reilly
PostgreSQL
Android
Linux
Eclipse

As with any sufficiently successful open-source technology, an entire ecosystem has sprung up around Git.
This point is worth discussing for a moment. The basic tool that is Git has given rise to a seemingly endless number of applications to further help users who want to work with it, most named with some wordplay based on git. If you start discussing Git with someone, you may hear such names as GitHub, Gitolite, Easy Git, Git Extensions, EGit, and so on. To the uninitiated, it can be challenging to understand how each one of these names relates to the original Git tooling. To help clarify some of the confusion, I'll give you an overview of how the different offerings are categorized.

THE GIT ECOSYSTEM

Broadly, you can break down the Git-based offerings into a few categories: core Git, Git-hosting sites, self-hosting packages, ease-of-use packages, plug-ins, tools that incorporate Git, and Git libraries.

Core Git

In the core Git category, you have the basic Git executables, configuration files, and repository management tooling that you can install and use through the command line interface. (These can be installed from https://git-scm.com/downloads.) In addition to the basic pieces, the distributions usually include some supporting tools such as a simple GUI (git gui), a history visualization tool (gitk), and in some cases, an alternate interface such as a Bash shell that runs on Windows.

The distribution for Windows is now called Git for Windows. Similarly, there is a ported version of Git for OS X. This version can be installed directly from the git-scm.com site, installed via the Homebrew package manager, or built via the MacPorts application.

When installing on Linux systems, the recommended method is to use the preferred package manager for your distribution. Example commands are shown in the following list.

Debian/Ubuntu
$ apt-get install git

Fedora (up to 21)
$ yum install git

Fedora (22 and beyond)
$ dnf install git

FreeBSD
$ cd /usr/ports/devel/git
$ make install

Gentoo
$ emerge --ask --verbose dev-vcs/git

OpenBSD
$ pkg_add git

Solaris 11 Express
$ pkg install developer/versioning/git

Git-Hosting Sites

Git-hosting sites are websites that provide hosting services for Git repositories, both for personal and shared projects. Customers may be individuals, open-source collaborators, or businesses. Many open-source projects have their Git repositories hosted on these sites.

In addition to the basic hosting services, these sites offer added value in the form of custom browsing features, easy web interfaces to Git commands, integrated bug tracking, and the ability to easily set up and share access among teams or groups of individuals.

These sites typically provide a workflow intended to allow users to contribute back to projects on the site. At a high level, this usually involves getting a copy of another user's repository, making changes in the copy, and then requesting that the original user review and incorporate the changes; this is sometimes known as the fork and pull model. (This model is explained in more detail in Chapter 13.)

For hosting, there is a pricing model that depends on the level of access, number of users, number of repositories, or features needed. For example, if a repository is intended to be public (with open access to anyone), it may be hosted for free. If access to a repository needs to be limited or it needs a higher level of service, then there may be a charge. In addition, the hosting site may offer services such as consulting or training to generate revenue.

Examples of these types of sites include GitHub and Bitbucket. Figure 1.1 shows an example of a GitHub repository page.
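
The fork and pull model described in this section can be imitated on a single machine, which may help make the three steps concrete. The sketch below is only an approximation under stated assumptions: on a real hosting site the fork and the pull request are features of the website, whereas here two local directories (owner and fork, both hypothetical names) stand in for the two users' repositories, and the owner's review is reduced to looking at a diff summary.

```shell
#!/bin/sh
# Sketch only: local directories stand in for repositories on a hosting site.
set -e

base=$(mktemp -d)                    # hypothetical scratch location
cd "$base"

# The original user's repository, with one commit.
git init -q owner
cd owner
git config user.name "Owner"                  # placeholder identity
git config user.email "owner@example.com"
echo "hello" > README
git add README
git commit -q -m "Initial commit"
cd "$base"

# Step 1 (fork): the contributor gets their own copy of the repository.
git clone -q owner fork

# Step 2: the contributor makes a change on a branch in that copy.
cd fork
git config user.name "Contributor"            # placeholder identity
git config user.email "contributor@example.com"
git checkout -q -b greeting
echo "hello, world" > README
git commit -q -a -m "Improve greeting"
cd "$base"

# Step 3 (pull): the owner fetches the proposed change, reviews it,
# and incorporates it into the original repository.
cd owner
git fetch -q "$base/fork" greeting
git --no-pager diff --stat HEAD FETCH_HEAD    # a stand-in for the review
git merge -q FETCH_HEAD
```

After the last step, the owner's copy of README contains the contributor's change, even though the contributor never wrote to the owner's repository directly.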

Figure 1.1 Example GitHub page

Self-Hosting Packages

Based on the success of the model and usage of the hosting sites, several packages have been developed to provide a similar functionality and experience for users and groups without having to rely on an external service. For some, this is their primary target market (GitLab), while others are stand-alone (also known as on-premises) versions of the popular web-hosting sites (such as GitHub Enterprise).

These packages are more palatable to businesses that don't want to host their code externally (on someone else's servers), but still want the collaborative features and control that are provided with the model. The cost structure usually depends on factors relating to the scale of use, such as the number of users or repositories. Figure 1.2 shows an example of a GitLab project screen.

Figure 1.2 GitLab project screen

Ease-of-Use Packages

The ease-of-use category encompasses applications that sit on top of the basic Git tooling with the intention of simplifying user interaction with Git. Typically, this means they provide GUIs for working with repositories and may support GUI-based conventions such as drag-and-drop to move content between levels. In the same way, they often provide graphical tools for labor-intensive operations such as merging. Examples include SourceTree, SmartGit, TortoiseGit, and Git Extensions. Typically, these packages are free for non-commercial use. You can see a more comprehensive list at https://git-scm.com/downloads/guis. Figure 1.3 shows some examples of available packages.

Figure 1.3 Examples of GUIs available for Git (from git-scm.org)

CHOOSING AN INTERFACE

One of the questions that frequently comes up when using Git is which stand-alone interface is best. There is no right answer here, but as a good default, the command line provides the most value for a number of reasons. Although a large number and variety of GUIs are available to use with Git, there is no accepted standard.
GUIs come and go, and vary widely in their degree of functionality, completeness, and utility. The command line is consistent and universally applicable. Not all functionality is exposed through any one GUI for Git. However, all functionality available to users is exposed through the command line. If you need to do something that isn't available through a GUI, you can always drop back to the command line to accomplish it. In addition, Git includes man pages for all command line usage, so help is readily available for that interface. If you understand the command line operations and options, it's generally easy to translate and map them to the corresponding items in a GUI. Once you understand the basic command line operation, you'll have more insight into what you want and need to do with a GUI. You'll also be in a better position to choose one if desired. As a side note, one of the main advantages of having a graphical interface with Git is having a graphical merge tool. Git also allows you to configure a third-party tool for merges from the command line interface. We'll explore configuring merge tools in Chapter 9.

Plug-ins

Plug-ins are software components that add interfaces for working with Git to existing applications. Common plug-ins that users may deal with are those for popular IDEs such as Eclipse, IntelliJ, or Visual Studio, or those that integrate with workflow tools such as Jenkins or TeamCity. It is now becoming more common for applications to include a Git plug-in by default, or, in some cases, to just build it in directly.

Tools That Incorporate Git

Over the past few years, tooling has emerged that directly incorporates and uses Git as part of its model. One example is Gerrit, a tool designed primarily to do code reviews on changes targeted for Git remote repositories. At its core, Gerrit manages Git repositories and inserts itself into the Git workflow.
It wraps Git repositories in a project structure with access controls, a code review workflow and tooling, and the ability to configure other validations and checks on the code. Figure 1.4 shows an example of a Gerrit screen.

Figure 1.4 Example Gerrit screen

Git Libraries

For interfacing with some programming languages, developers have implemented libraries that either wrap Git for those languages or re-implement the Git functionality. One of the best-known examples of this is JGit. JGit is a Java library that re-implements Git and is used by a number of applications such as Gerrit (mentioned in the previous section). These implementations make interfacing with Git programmatically much more direct. However, there is sometimes a cost in terms of waiting, when new features or bug fixes that are implemented in the core Git tooling have to be re-implemented in these libraries.

GIT'S ADVANTAGES AND CHALLENGES

Everyone has opinions, and anyone who's tried Git has an opinion about it. These usually vary from believing it's the greatest thing since sliced bread to wondering how they could ever effectively use it. In this section, you'll look at some of the advantages and challenges that Git offers (in no particular order). Granted, these lists are subjective, but themes in each area seem to consistently emerge.

The Advantages

Git is popular for many reasons. There are some things it just does better (faster, easier) than other source management systems and some things that it takes a totally different approach on. Learning about and leveraging the aspects outlined here will allow you to get the most out of this tool.

Disconnected Development

The Git model provides a local environment where you can work with a local copy of a server-side repository (this server-side repository is known as the remote in Git terminology). This copy resides within your workspace.
When you are satisfied with your changes in this local repository, you then sync the local repository's contents up with the remote side. All of the source management commands that you need to make changes can be run in this local environment. There's no need to access the remote repository until you're ready to sync content. Because of this, you do not need a connection to the remote repository to conduct source management. You just work against the local copy. Because you can perform source management tasks in your local environment without needing a connection to the remote-server side, you can work disconnected from the remote and even disconnected from a network. This is what disconnected development means. One important factor to keep in mind is that until you sync up with the remote, all of your changes and data are only in the local environment on your system. This is usually the local disk on your machine.

Fast Performance

Git stores a lot of information. (I'll describe its internal storage model in the next chapter.) However, it is efficient both in the way it stores content and in the way it retrieves it. Internally, Git packs together similar objects. Externally, it uses a good compression model to send significant amounts of data efficiently through a network. Of course, this network performance may be reduced by limiting factors such as network latency, but as a general rule, wait times for Git operations from the server are not a factor. For changes in the local environment, Git is as fast as its commands can be executed on your disk. Because it only has to interact with a local repository (in most cases not going across a network connection), the performance is equivalent to operating system commands. Another factor that aids Git's performance is that it is designed to manage multiple smaller repositories, rather than larger aggregate ones that may be present in traditional source control systems.
For example, consider how you might store the source code for a large Java project. In a traditional source control management (SCM) system, you might have a single large Java repository with all of the source code in subdirectories for the different JARs. However, in Git you would typically have a separate repository for the source code for each JAR. This granularity contributes to the smaller amount of content that has to be moved around in Git, and thus to a faster operation. Finally, branching is extremely fast in Git. I'll explain why in Chapter 8, but essentially, as fast as you can create a file on your OS, you can create a branch in Git. This means there is no more waiting for extended periods while the source management system branches your content. Deleting branches is just as quick. Merging is generally quick as well, assuming there are no conflicts. Ease of Use There's a paradigm shift that is required when learning to use Git. And a prerequisite to thinking that Git is easy to use is understanding it. However, once you grasp the concepts and start to use this tool regularly, it becomes both easy to use and powerful. There are simple default forms of commands and options. As your proficiency grows, there are extended forms that can allow you to do nearly anything you need to do with your content. In addition, almost everything about Git settings is configurable so that you can customize your working environment. (Git configuration is discussed in detail in Chapter 4.) The primary mistake that most new Git users make is trying to use it in the same way that they've always used their traditional source management system. Usually this means that they are trying to map commands and workflow concepts from the previous system to Git's commands. However, trying to adhere too strictly to this approach with Git will actually make the learning curve steeper. 
A better approach is to consider what sort of source management outcome is needed (files in the repository, viewing history, and so on), and then take the time to learn how that workflow is done with Git. (The Connected Labs included throughout this book will aid this process significantly by providing hands-on experience with Git.)

SHA1s

The strange-looking name SHA1 is an acronym for Secure Hash Algorithm 1. In short, it's a checksum. (It has its roots in the MD5 implementation if you're familiar with that.) Git computes SHA1s internally as keys for everything it stores in its repositories. This means that every change in Git has a unique identifier and that it's not possible to change content that Git manages without Git knowing about it, because the checksum would change. In Git, SHA1s represent a direct way to identify and specify the exact change that you want to work with.

Ability to Rewrite History

One aspect of Git that is different from most other source management systems is the ability to rewrite or redo previous versions of content stored in the repository (that is, history). Git provides functionality that allows you to traverse previous versions, edit and update them, and place the updated versions back in the same sequence of changes stored in the repository. This is a powerful feature of the tool, but it can also be dangerous (see the section, “The Challenges: Ability to Rewrite History,” later in this chapter). When content that you're working on in your local environment hasn't yet been synched to the remote side, this is a safe operation. And when you need it, it can be very beneficial. For example, consider a case where you forget to include a file with a change, or even just need to do something as simple as modify the message associated with the change. Git provides an amend option that allows you to update or replace the last change made in the local repository.
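The amend case just described can be tried out in a throwaway repository. The following is a minimal sketch, not a prescribed workflow: the file names, commit messages, and demo identity are placeholders chosen for illustration, and everything happens in a temporary directory.

```shell
#!/bin/sh
set -eu

# Work in a temporary directory so no real repository is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q .

# Placeholder identity for the demo commits (hypothetical values).
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "main code" > app.txt
echo "helper code" > helper.txt

# First attempt: helper.txt is forgotten and the message has a typo.
git add app.txt
git commit -q -m "Add featre"
before=$(git rev-parse HEAD)

# Stage the forgotten file, then replace the last commit with a corrected one.
git add helper.txt
git commit -q --amend -m "Add feature"
after=$(git rev-parse HEAD)

# The history still holds a single commit, now with the fixed message.
git log --oneline

# The identifier changes because the amended commit is a new object.
echo "before amend: $before"
echo "after amend:  $after"
```

Because the amended commit gets a different SHA1 than the one it replaces, this is best kept to changes that have not yet been synched to the remote side.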
Additional functionality makes it possible to take selected changes from one branch and incorporate them directly into the line of changes in another branch. Beyond that are levels of functionality for doing editing throughout the history of one or more branches. An example case would be removing a hard-coded password that was accidentally introduced into the history months ago from all affected versions. Staging Area Git includes an intermediate level between the directory where content is created and edited, and the repository where content is committed. New users typically don't see this extra level as a positive, due to the perceived inconvenience of having to move content through another level. However, it does provide a separate area for use in some of Git's advanced operations, such as the amend option discussed previously. It also simplifies some status tracking. I'll cover the staging area in detail in Chapter 3. Strong Support for Branching Using branches is a core concept of Git. Earlier, I mentioned the speed with which users can create, delete, and manipulate branches. However, beyond that, Git provides capabilities for changing branch points and reproducing changes from one branch onto another branch—a feature referred to as rebasing. This ease in working with and manipulating branches forms the basis for a development model with Git. In this model, branches are managed as easily as files are in some other systems. Later in the book, I devote entire chapters to branching concepts. One Working Area, Many Branches It is rare these days for source management users to only be concerned with one release of content. Even when products are managed via a continuous delivery process, in a user's local environment, there are typically multiple changes underway, for new features, bug fixes, and so on. 
Traditionally, the best way to develop these multiple changes in parallel has been in separate workspaces, and, depending on the scope and ease of use of the source management application, in separate branches. With legacy SCM systems, maintaining these multiple workspaces, switching contexts between them, and ensuring they are up to date with the correct source code is a multi-step process that requires tracking and coordination by the user. In Git, this is a single-step process managed by Git. Git allows you to work in one workspace for a repository, regardless of how many branches you may have or need to use. It manages updating the content in the workspace to ensure it is consistent with whichever branch is active. You never need to leave that workspace. Also, while working in one branch, you still have the expected access to view, merge, or create other branches. WORKING IN MULTIPLE BRANCHES SIMULTANEOUSLY WITH GIT If you do find yourself needing to work in multiple branches at the same time, recent versions of Git have introduced a new feature to support this—worktrees (otherwise known as working trees). Worktrees provide a way to have and use multiple working directories with different branches (at the same time) all tied back to the same local Git repository. We discuss worktrees in detail in Chapter 14. The Challenges Now, to balance out the picture, let's look at a few of the things about Git that can be challenging—especially for new users. I'll have more to say about this topic, including what to watch out for, and strategies for effectively dealing with these challenges, throughout the book. Very Different Model from Some Traditional Systems Going from a more traditional, centralized version control system to a distributed version control system such as Git requires a change in how you think about your source management workflow. Git implements a local environment with multiple levels in addition to a separate remote repository. 
As well, it operates with units that map more closely to directory tree structures than just individual files. This leads to considerations when creating and working in Git repositories, in terms of size and scope, that you don't usually worry about with centralized systems. Different Commands for Moving Content In most traditional source control systems, there are one or two commands for getting content out (checkout) and one or two for putting content in (check-in, commit), with options for modifying their behavior to work in different ways if needed. With Git, there are different commands for moving content between the different layers, and these commands must be used in a particular sequence. This isn't really an issue after you've been working with Git for a while, and actually is clearer when talking about the workflow. However, it can be a little confusing to new users. Staging Area As previously mentioned, Git includes a staging level. This is an intermediate area that new code has to travel through on its way to the local repository. This will seem cumbersome at first, because content must flow through it, even in some situations where it doesn't appear to add value. However, once you are comfortable with it, it will allow you to work with a power and flexibility that you haven't experienced previously. Mind Shift and Learning Curve All of the things I'm talking about as advantages and challenges contribute to the power of Git—as well as the learning curve. As I alluded to previously, one of the fundamental mistakes that new Git users make is trying to map too many concepts and workflows that they've used in the past with other systems, too closely to Git concepts and workflows. They often expect a one-to-one fit, just with different names. The basic principles of source management still apply—tracking changes, putting code in, getting code out, and so on. 
However, Git adds layers of flexibility and power on top of those principles, at the cost of requiring you to think differently about the units and stages of source control. This requires a learning curve and a willingness to accept some features and requirements as useful, even if they don't immediately appear so. It's one of those situations where a feature won't seem beneficial until it is. As you continue to use the tool, it's a pleasant experience when you encounter those situations where you need to do X, you wonder if Git can do X, and you discover (in most cases) it can. Of course, there's also a learning curve with figuring out the exact invocation, and implications, of doing X. Part of the mind shift comes early on in thinking about what should be in your Git repositories and branches. Just converting existing repositories one-to-one from another source management system is seldom the best approach. This is due to the way that Git manages scope in terms of changes and repositories. I'll discuss more about this as you learn more about Git. Finally, it's worth pointing out that Git offers a built-in way to learn and explore the tool and workflow as you're going through this mind shift and learning curve—the local environment. I'll talk more about this in the next couple of chapters, but for now, know that you have the ability to make any source management changes (and mistakes) you need to in your local environment before you ever push them over to the remote environment, where others can see or access them. Limited Support for Binary Files Most source management systems do not have strong support for binary files, and Git is no exception. There are two aspects of dealing with binary files that are challenging here: internal format and size. Because of the internal format of these types of files where the bits rather than the characters are what is important, standard source management operations can be difficult to apply or may not make sense at all. 
An example of the former would be diffing. An example of the latter would be managing line endings. If the SCM does not recognize or understand that a particular file is binary and tries to execute these types of operations against it, the results can be confusing and problematic. The size of binary files can routinely be much larger than text ones. Very large binary files can pose a challenge for a system like Git since they usually cannot be compressed very much, and so can require more time and space to manage, leading to extended operation times when the system has to pass around these files, such as when copying to a local system. Of course, larger text files can also pose size challenges, but with text files, the ability to compute differences between versions and more compressibility can work better with Git's internal strategies for efficiently storing and serving these files. Git has built-in mechanisms for identifying files as binary. However, it is also possible (and a best practice) to use one of its supporting files, the Git Attributes file, to explicitly identify which types of files are binary. Git Attributes files are covered in detail in Chapter 10. The challenges with large binary files for source management in general have led to the development of several separate applications to help. Artifact repositories, such as Artifactory and Nexus, are targeted specifically at storing and managing revisions of binary files. And the Git community itself has created various applications targeted at helping with this. Currently, the best-known one is probably Git LFS (Git Large File Storage), a solution from the Git hosting site, GitHub. This application stores large files in a separate repository and stores text pointers in the traditional Git repository to those large files.

No Version Numbers

As referenced in the previous section on SHA1s, Git creates checksums (SHA1s) for everything that it stores.
From one perspective, the overall SHA1 value for a change can function like the version number in most other source control systems. However, unlike traditional version or revision numbers, these are not short, easily remembered identifiers. SHA1s are actually 40-character hexadecimal strings. So, from a user perspective, SHA1s are not as convenient to remember, find, or communicate about. Typing one also requires some care. Fortunately, in any Git instance, you only need to use enough of the characters from any SHA1 to uniquely identify that SHA1 from any other—usually the first seven characters. You can also use other references, such as tags or branch names, to indicate revisions where appropriate. Merging Scope While talking about the Git model, I mentioned that Git thinks in units that more closely map directory structures than individual files. This difference in granularity provides advantages in managing and manipulating changes in source control. However, it can also create disadvantages in merge situations where there are conflicts. Simply put, any two changes by different users within the scope of a commit can be a conflict, even if they are in entirely different files or directories. As a result, the more people that are making changes within the scope of a repository, the more likely they are to encounter merge conflicts when trying to get their updates in. This is a factor to consider when planning how to structure your Git repositories. Ability to Rewrite History Git's ability to rewrite history falls into both categories. On the challenging side of the scale is the potential impact that uncoordinated use can have on other users. Suppose that multiple users have obtained content from a remote (shared) Git repository. One user decides to perform an operation that changes the revision history. Changing the history results in new internal checksums (SHA1s) for changes in the repository, starting at whatever points the revisions were made. 
Once the updates are put back on the remote side, any other users that need to merge in updates will have to deal not only with the newest content, but also with the changes to the revisions in the history made by the other user. At best, this can be surprising. At worst, it can be very time-consuming and resource-intensive, because it requires them to incorporate all of the changes. As a highly recommended guideline, changes that alter history should only be made in a user's local environment before the affected revisions are pushed across to the remote side. If there is a critical need to change revisions in the history of a repository after it has been made available on the remote side, then there is a recommended approach: other users should be informed in advance, and given a chance to get their changes in before the changes to the history are made. After the changes are completed, they can get a fresh copy to work with locally. This will allow them to avoid potentially difficult merge situations.

Timestamps

When using most source control systems, timestamps that reflect when changes were made in the repositories are a useful and static property. Given any point in time, it is possible to pull the content from the repository as it was at that point and always get the same set of content on subsequent pulls. Not so with Git. Due to the way that remote repositories are synched from local repositories, the timestamp that shows up in the remote repository is the time the update was made on the local environment, not the timestamp of when things were synched to the remote. This means that it's possible to pull content from the remote side based on a particular timestamp and get a certain set of content, then later pull it again based on the same timestamp, and get a different set of content. This can happen if one or more changes were made in a user's local environment, prior to that timestamp, but weren't synched to the remote until between the two pulls.
In this case, a new change with an older timestamp would suddenly show up in the remote. For this reason, you can't rely on timestamps for some of the cases where they are traditionally employed with existing source control systems. I will discuss what the alternative is for Git when I talk more about the remote side in Chapter 12. Access and Permissions Out of the box, Git does not provide a layer to set up users or to grant and deny access. For the local environment, this doesn't matter because everything is, well, local. For shared, server-side repositories, there are a few options: Using operating system mechanisms such as groups and umasks that limit the set of users and their direct repository permissions Limiting access via client-server protocols (SSH, HTTPS) Adding an external applications layer that implements a more fine-grained permissions model and interface Note that these are not mutually exclusive. In a corporate environment that chooses to host its own shared, server-side repositories, for example, you would want to limit who could directly access the actual repositories on disk at the system level, have authentication for users who need to put content into them from their local environments, and potentially have a permissions layer that can be centrally managed or managed by a team within a selected scope. SUMMARY In this chapter, I introduced Git, discussed where it came from, and talked about some of the advantages and disadvantages that users should be aware of when working with this tool. Along the way, I also introduced a number of terms and concepts that are part of Git. In subsequent chapters, I will be expanding on and explaining what each of these terms and concepts means, along with teaching you how to use them. If you're coming from an environment where you used a traditional centralized source control system, you'll find that Git is significantly different and has a learning curve. The workflow is different as well. 
Trying to map commands, structures, and workflows from your previous system is not an effective strategy. Rather, you should take the time to read through the following chapters and examine the concepts and examples. Equally important is that if you can work through the Connected Labs, they will go a long way toward helping you internalize the concepts, ensure a deeper understanding of the material, and help you be ready to apply Git to your job when you need it. In Chapter 2, you'll look at some of the primary design concepts that Git uses internally and that are helpful for users to understand before going further with it. Chapter 2 Key Concepts WHAT'S IN THIS CHAPTER? The differences between a centralized and distributed source management system The differences between a traditional delta model for tracking source code changes and the way that Git tracks changes Why Git is efficient How (and why) Git repositories should be organized Things to keep in mind when migrating repositories to Git Dealing with large files in Git In this chapter, I'll explain some of the underlying key design concepts that Git uses. Implementation around these concepts forms the basis for how Git works and how to use it. I'll broadly break these concepts down into two categories, user-facing and internal, and show how they differ from more traditional source management systems. Lastly, I'll focus on some important considerations for creating repositories in Git, and managing special content such as binary files. DESIGN CONCEPTS: USER-FACING Version control systems (VCS) such as Git can be broadly classified as either centralized or distributed. Git is an example of a distributed version control system (DVCS). Other systems in this category include Mercurial and Bazaar. Examples of a centralized version control system (CVCS) would be Concurrent Versions System (CVS) and Subversion. 
The fundamental differences between a DVCS and a CVCS have to do with how the system manages repositories and the workflow that the user employs to get content into the server-side part of the system. Centralized Model Figure 2.1 illustrates a traditional centralized model. In this model, you have a central server that holds all of the repositories with all of the history and all versions of changes that have been put into the system over time. This area is the one source of the truth—the container of all the repositories. Figure 2.1 A traditional centralized version control model When users want to work with a file in one of these repositories, they connect to the server via a client, and retrieve the files and the versions they want to work with. They then make whatever changes they need to, connect to the server again, and send the update back to it. There, the differences from the previous version are determined and stored in the repository as updates. In this type of model, users are dependent on the central server. If, for some reason, users cannot connect to the server, they cannot do any source management operations. Distributed Model In a distributed system, the model is somewhat different. There is still a server that holds the shared repositories, and that clients interact with. However, when users want to start making changes, instead of getting individual files or directories from the server, they get a copy of the entire repository. The copy comes from the server side and has all content (including history) up to the point in time when the copy is created. In Git terminology, the server side is called the remote repository (or just remote). The copy operation is referred to as a clone. You can call the area on your local system with the cloned repository your local environment because it consists of several layers (which you'll explore in the next chapter). 
For simplicity, I'll refer to the remote repository as just the remote throughout the rest of this discussion. Figure 2.2 illustrates this model. Figure 2.2 A distributed version control model The actual cloned (copied) repository within the local environment is called the local repository. It has all of the files, histories, and other data that were in the remote. A change that is made into the local repository is called a commit, similar in concept to a check-in in some other systems. Once users have cloned from a remote, they can do all of their source management operations against the local repository. When users have made all the commits they want in the local repository, they then push their changes to the remote. The key difference here is that, in a DVCS such as Git, users are performing the source management operations against a local copy of the server-side (remote) repository instead of making them against the actual server-side repository. Until users need to push the changes back to the remote, they do not even need to be connected to it. The connection between the local and the remote side is not constant. Rather, it is activated when updates need to be synchronized between the two repositories. Because users do not have to be connected to the remote to do their source management operations, they can work disconnected from the remote. As noted in Chapter 1, this is referred to as being able to do disconnected development. Figure 2.3 shows a conceptual model of this approach. Figure 2.3 Disconnected development In Figure 2.3, starting on the left, a user makes a change to a file in the local repository without any connection to the remote. Then a second change is made in the same way. Finally, the local environment is synched up with the remote side so that both areas have the latest content. One other thing to note is that a remote can actually be any Git repository that is set up to function that way. 
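The clone, commit, and push sequence described in this section can be tried without any server, because a plain Git repository in a local folder can play the role of the remote. The sketch below assumes exactly that: `remote.git`, `local`, the file name, and the demo identity are hypothetical names used only for illustration.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
cd "$demo"

# A bare repository in a local folder stands in for the server-side remote.
git init -q --bare remote.git

# Clone it. Git warns on stderr that the repository is empty; that is expected here.
git clone -q remote.git local 2>/dev/null
cd local

# Placeholder identity for the demo commits (hypothetical values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Two commits made entirely in the local repository, with no contact with the remote.
echo "first change" > file.txt
git add file.txt
git commit -q -m "First change"
echo "second change" >> file.txt
git commit -q -am "Second change"

# Only now is the remote involved: push the local commits up to it.
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# Both sides now report the same commit for the branch.
git rev-parse HEAD
git --git-dir=../remote.git rev-parse "$branch"
```

The two commits correspond to the two disconnected changes in Figure 2.3, and the push corresponds to the final sync step.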
Most commonly, a remote is a Git repository hosted on a server and running as a daemon process. However, there are various protocols for communicating between Git clients and servers, even a simple one that operates via shared folders. I'll have more to say about these protocols in Chapter 12 where I discuss remotes in more detail. DESIGN CONCEPTS: INTERNAL Another area where Git differs significantly from traditional source management systems is in the way it represents and stores changes internally. Delta Storage In a traditional source management system, content is managed on a file-by-file basis. That is, each file is managed as an independent entity in the repository. When a set of files is added to a repository for the first time, each file is stored as a separate object in the repository, with its complete contents. The next time any changes to any of these files are checked in, the system computes the differences between the new version and the previous version for each file. It constructs a delta, or patch set, for each file from the differences. It then stores that delta as the file's next revision. This model is called delta storage. Figure 2.4 illustrates this process. Figure 2.4 The delta storage model In the first iteration, files A, B, and C are checked in. Then, changes are made to the three files and those changes are checked in. When that occurs, the system computes the deltas between the current and previous versions. It then constructs the patch set that will allow it to re-create the current version from the previous version (the set of lines added, deleted, changed, and so on). That patch set is stored as the next revision in the sequence. The process repeats as more changes are made. Each delta is dependent on the previous one in order to construct that version of the file. 
In order to get the most current version of a file from the system when the client requests it, the system starts with the original version of the file and then applies each delta in turn to arrive at the desired version. As the files continue to be updated over time, more and more deltas are created. In turn, more deltas must be applied in sequence to deliver a requested version. Eventually, this can lead to performance degradation, among other issues. Snapshot Storage Git uses a different storage model, called snapshot storage. Whereas in the delta model, revisions are tracked on a file-by-file basis, Git tracks revisions at the level of a directory tree. You can think of each revision within a Git repository as being a slice of a directory tree structure at a point in time—a snapshot. The structure that Git bases this on is the directory structure in your workspace (minus any files or directories that Git is told to ignore—more about that later). When a commit is made into a Git repository, it represents a snapshot of part or all of the directory tree in the workspace, at that point in time. When the next commit is made, another snapshot is taken of the workspace, and so on. In each of these snapshots, Git is capturing the contents of all of the involved files and directories as they are in your workspace at that point in time. It's recording the full content, not computing deltas. There is no work to compute differences at that point. The snapshot storage model is shown in Figure 2.5. In this model, you have the same set of three files, A, B, and C. At the point they are initially put into the repository, a snapshot of their state in the workspace is taken and that snapshot (with each of the file's full contents) is stored in Git and referenced as a unit. Figure 2.5 The snapshot storage model As additional changes are made to any of the files and further commits are done, each commit is built as a snapshot of the structure as it is at that point. 
If a file hasn't changed from one commit to the next, Git is smart enough not to store a new version, and just creates a link to the previous version. Note that there are not any deltas being computed at this point and you are managing content for the user at the level of a commit rather than individual files. Later, when you want to get one of these snapshots back, Git can just hand back the specific set of content associated with that commit, without going through the extensive reconstruction process required by the delta model. Git's Storage Requirements One of the questions that usually comes to mind right away when people are introduced to the snapshot storage concept is, “Doesn't this use a lot of disk space?” There are a couple of points related to that. First, as I just noted, Git can use links in some cases to reduce duplicate content. Second, Git compresses content using zlib compression. (Notice the smaller compressed size of the blocks representing content in the repository in Figure 2.5.) Third, periodically, at certain trigger points, such as when running garbage collection functionality, Git looks for content that is very similar between revisions and packs those revisions together to form a compressed pack file. In these cases, it can actually create an associated delta of sorts that represents the differences between very similar revisions. The delta here is what it takes to get back to previous revisions. Git assumes that the most recent revision is the one that will be most requested and thus best to keep as a full, ready revision. So, in the Git model, the use of any deltas is a deliberate optimization for storage rather than the default versioning mechanism. Figure 2.6 illustrates a way to think about this concept, where multiple objects have been packed together internally. This is invisible to the user. 
From a user perspective, Git still manages interactions with the user in terms of individual snapshots, regardless of whether or not content ends up packed in the repository. Figure 2.6 A representation of Git's packing behavior to optimize content size All of these approaches help to reduce the space a Git repository requires. In fact, if you were to compare the corresponding disk space requirements for a source control system that uses the delta model to the snapshot model that Git uses, you might find that in the best cases, Git actually uses less. (You may be wondering how a model like this handles binary files since those don't lend themselves to a delta model. I cover dealing with Git and binary files in more detail later in this chapter.) A final, related point is that Git is designed to work with multiple, smaller repositories rather than large, monolithic repositories, a characteristic I'll explore in more detail in the next section. So, to summarize, there are two differences between delta and snapshot storage: 1. Delta storage manages content on a file-by-file basis, as opposed to snapshot storage where content is managed at a directory tree level. 2. Delta storage manages versions over time by figuring out the differences and storing that information from revision to revision (the delta). It reconstructs later revisions by starting with the base version and applying deltas on top of that. Because snapshot storage is storing a capture of the entire tree, it does not usually have to do any reconstruction, or only a very small amount if the content has been packed. Git's approaches in these areas create a very powerful model to build on, especially as they pertain to branching. However, they also create the need to structure repositories appropriately in Git for the best usability and performance. This is the topic of the next section. 
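The snapshot side of that comparison can be sketched in a similar way. The toy `SnapshotStore` class below is hypothetical: each commit records the full content of every file, content is zlib-compressed, and unchanged files are linked to the object that is already stored. Identifying content by a SHA-1 hash over a `blob <size>` header plus the data mirrors how Git names file contents in SHA-1 repositories, but that detail goes beyond what this chapter covers, and everything else here is a simplification (real Git also stores directory trees and commits as objects, and packs objects later).

```python
import hashlib
import zlib


class SnapshotStore:
    """Toy snapshot store: every commit maps each path to a full-content object."""

    def __init__(self):
        self.objects = {}  # object id -> zlib-compressed file content
        self.commits = []  # each commit is a dict of {path: object id}

    def _store(self, content):
        # Assumption for illustration: name the object by hashing a Git-style
        # blob header followed by the content.
        header = b"blob " + str(len(content)).encode() + b"\0"
        object_id = hashlib.sha1(header + content).hexdigest()
        # Identical content hashes to the same id, so it is stored only once.
        self.objects.setdefault(object_id, zlib.compress(content))
        return object_id

    def commit(self, tree):
        """Record a snapshot of the whole tree ({path: bytes}); return its index."""
        snapshot = {path: self._store(data) for path, data in tree.items()}
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def checkout(self, commit_index):
        """Hand back the full content of a snapshot; no delta chain to replay."""
        snapshot = self.commits[commit_index]
        return {path: zlib.decompress(self.objects[oid]) for path, oid in snapshot.items()}


store = SnapshotStore()
first = store.commit({"A": b"alpha\n", "B": b"beta\n", "C": b"gamma\n"})
second = store.commit({"A": b"alpha, revised\n", "B": b"beta\n", "C": b"gamma\n"})
```

After the second commit, the store holds four objects rather than six: files B and C did not change, so the second snapshot simply points at the objects recorded by the first.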
REPOSITORY DESIGN CONSIDERATIONS When beginning to work with Git, whether creating repositories for new content or migrating existing content from another source management system, it is important to consider how you size and structure your repositories. For existing content, unless your code is already broken down into very distinct, separate modules, a one-to-one migration is unlikely to be the best approach. This is because of repository scope. Repository Scope A key point to keep in mind when beginning to work with Git is that it is designed to be used as a set of many, smaller repositories. How small? Well, as an example, consider the case of a Java project managed in a traditional, centralized source management system. You might have a single repository for a Java project that's made up of ten different JARs, with all of the source code for all of the JARs stored in different subdirectories in the repository. This arrangement typically works well in a centralized model where each file is managed separately. In the working model for that system, you don't typically check out or check in the entire repository each time. You can manage things at smaller granularities, such as only checking out the subdirectory with the code for one particular JAR, modifying a few files, and then checking those files back in. In the Git model, a more common scenario would be to have a separate repository for the code associated with each separate JAR. Why? Recall that Git manages changes as commits that are a snapshot of the larger workspace—the set of files and directories. While Git is efficient in how it stores and retrieves data, this efficiency is still relative to the size of the content. If the content is inordinately large, you may find yourself waiting longer than you'd expect for operations that get or put data from or into the repository. 
In addition, as I alluded to in Chapter 1, because Git manages content in terms of snapshots, any changes by two users within the scope of the same snapshot, regardless of whether or not they are to the same file, have potential to cause a merge conflict, depending on timing. To illustrate this, suppose you and another user clone the same repository in Git down to your local systems, and the repository contains directories 1 and 2. The other user makes a change in file A in directory 1, commits it, and pushes it up to the remote. Then you make a change in file B in directory 2, and commit and attempt to push your changes back to the remote. Git will reject your changes at the point where you try to get them into the remote. This is because Git considers that something else (anything else) has changed in this repository since you originally got your copy of the code. Even though you didn't touch the same file as the other user, you have a merge conflict within the snapshot, because someone else made a change before you could get yours in. This is one of the key frustrations for new Git users. I'll talk more about this in Chapter 13, including how to resolve the merge conflicts. (Also, see the following Note.) In addition to repository size, there's a second point to consider. Ideally, you want to create repositories that will not have too many users working in them at the same time, and making (from Git's viewpoint) conflicting changes. This will help limit the number of rejected pushes and the amount of merging work that has to be done. NOTE To be fair, resolving these kinds of conflicts is generally an easy mechanical process, unless both users have changed the same file or files. However, it does involve additional operations and inspection by the last user who is trying to get their changes in. Depending on the scope, the time to review the conflicts can be non-trivial. 
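The two-user scenario above can be modeled in a few lines of Python. This is only a sketch under a simplifying assumption: the remote accepts a push when the commit it currently holds is an ancestor of the commit being pushed, and rejects it otherwise. The `Commit` and `Remote` classes are hypothetical stand-ins, not Git's API.

```python
class Commit:
    """Toy commit: a label plus links to its parent commits."""

    def __init__(self, label, parents=()):
        self.label = label
        self.parents = tuple(parents)


def is_ancestor(candidate, commit):
    """Return True if `candidate` is `commit` itself or reachable through its parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current is candidate:
            return True
        if id(current) in seen:
            continue
        seen.add(id(current))
        stack.extend(current.parents)
    return False


class Remote:
    """Toy remote that only accepts pushes that build on what it already has."""

    def __init__(self, head):
        self.head = head

    def push(self, new_head):
        if not is_ancestor(self.head, new_head):
            return False  # rejected: the remote has changed since you cloned
        self.head = new_head
        return True


base = Commit("initial content")
remote = Remote(base)

# The other user changes file A in directory 1 and pushes first.
theirs = Commit("change file A in directory 1", [base])
their_push = remote.push(theirs)   # accepted

# You change file B in directory 2, starting from the same base.
yours = Commit("change file B in directory 2", [base])
your_push = remote.push(yours)     # rejected, even though the files differ

# After bringing in the other user's commit and merging locally, the push succeeds.
merged = Commit("merge remote changes", [yours, theirs])
retry_push = remote.push(merged)   # accepted
```

The check never looks at which files were touched. It only asks whether your history includes everything the remote already has, which is why changes to unrelated files can still be turned away.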
Having smaller repositories with only a few users making changes also allows for closer collaboration and coordination. It helps to keep operations in Git working quickly and smoothly when Git is manipulating content at the scope of a repository. This also applies to development environments, such as Eclipse, that look at projects as equating to a repository when interfacing with Git. In general, you can think of one repository in Git as equating in scope to one module of your project. If your code is not already modularized, it can sometimes be difficult to figure out what should constitute a module. One general guideline is to map the code to build a JAR, DLL, EXE, or other single component to a repository. Think in terms of what code you would use to build a single deliverable in an application such as a Gradle or Maven project or a developer interface such as Eclipse, IntelliJ, or Visual Studio. Consider code that is owned and maintained by only one or a few people to reduce the risk of merge conflicts. If your code does not easily map out this way, then it's worth spending some time up front to figure out how to get it into a structure that is more modular. You can then base your Git repositories on that revised structure. When considering how to organize code in Git repositories, it's also important to consider whether all categories of content related to a module are appropriate to migrate or store in a repository. There are general guidelines (especially around very large files) that apply, mostly independent of the source management application. I'll explore those guidelines next. File Scope When dealing with very large files, there are a number of considerations and approaches to take into account. An arbitrary definition of very large might be over 100 MB for text files, but less for binary files for reasons I'll talk about in the next few sections. Nearly all of these considerations apply to any source management system, not just Git. 
I'll now discuss some points you should consider. Storage Model Source management systems can't create deltas between versions of binary files. As a result, they end up storing full versions for each change to a binary file. This is necessary, but inefficient, and can quickly consume significant disk space if the files are large. Even in a system such as Git that compresses content, most binary files do not compress well. For certain types of smaller binary content, such as icons or other graphical elements, storing those files in the system usually doesn't present a problem and makes sense. For larger files, some pre-planning of alternative approaches to managing these files can help avoid issues in the repository. One common alternative approach for dealing with these files is to store them in a separate repository. Separate Repositories For the reasons outlined previously, storing very large files, especially binaries, in a repository such as Git is not the best approach. This also applies to generated files. Instead, there are specially designed applications for working with these types of files: artifact repositories. Artifact repositories work much like a source control system, but are designed to be a good fit for managing versions of files that don't really belong or fit well in your standard source repositories. Builds and other parts of a pipeline can pull source code from the source management system and resolve needed pre-built binary dependencies from artifact repositories. Some of the more popular artifact repositories today include Artifactory and Nexus. There is also an option to store large files that need to be managed in source control in a second, separate Git repository designated for them. This approach still suffers from the problems discussed in the “Storage Model” section. However, it does remove the impact of dealing with the large binaries in the other smaller repositories. 
Extensions to Git

Not surprisingly, a set of applications and packages has been created around trying to solve the limitations of Git with large files. Among these are extensions to Git, such as the git-annex and Git Large File Storage (Git LFS) open-source packages. There are also other packages, but these two seem the most likely to continue to receive support and development. This is primarily due to their incorporation into two of the major Git-hosting applications: git-annex has now been incorporated into GitLab as GitLab-Annex, and Git LFS is now incorporated into GitHub as well as some versions of Bitbucket, another Git repository hosting system. In these implementations, the large files are stored in a separate space, but referenced by pointers inside of a normal Git repository.

The applications vary in terms of characteristics that include the following:

- Performance
- Configurability (Can files be stored on user-configurable locations?)
- Ease of use (registering of files and use of existing commands versus new commands)
- Cost for long-term/large-scale use
- Learning curve

All of these characteristics factor into the transparency and usability of the process, but some setup and overhead is always required.

Generated Content

Files generated from source code stored in your source control system should not actually be stored in the source management system. If these files are generated from sources that you have control over, then the file can always be reproduced from the sources. In a model where the generated files are stored in the source repository, if the sources change frequently, then the generated content must also be updated frequently in the repository. This can be challenging to keep in sync and can lead to the problems discussed in the “Storage Model” section.
Generally, the reason why files produced from existing source are stored in the source management system boils down to having them easily accessible or using the source management system as a transport mechanism between processes. However, there are better ways to manage those needs such as using an artifact repository (described in the “Separate Repositories” section) that is designed for this purpose. MANAGING BINARY FILES IN GIT While I am talking about binary files, it's worth discussing how Git identifies and manages these files. Git can read a separate configuration file called a Git Attributes file (named .gitattributes on disk) to determine how to treat certain file types. In this file, different file types can be identified as binary. For such types, Git understands that it should not perform some of the operations that it does with text files, such as diffing and modifying line endings. I'll talk in detail about the Git Attributes file in Chapter 10. Shared Code While I'm on the topic of easily accessing code in the source management system, at times, it may seem that you need to share code from one repository to another. Git provides a way to do this through a construct called submodules. A submodule is essentially a static reference to another repository that resides in your local environment. Git understands that it is a separately managed repository even though it is in your tree structure. Submodules can be useful in certain cases, such as when an organization needs to share source for development dependencies that are being worked on by one group with other groups. However, they can be challenging to keep in sync without careful attention to updates. Managing them requires a different, coordinated set of operations. And it can be easy to back-level them for yourself or other users. For these reasons, submodules can be problematic and are not generally recommended for beginning Git users. 
Git also supports another construct called subtrees that provides similar benefits to submodules, but with a simpler structure and a simpler set of operations to manage them. Both submodules and subtrees are explored in detail in Chapter 14 and the reader is advised to read that before attempting to use either of these constructs. Another alternative approach is to just build the needed artifacts from other repositories separately and specify them as compile-time or run-time dependencies to pull them in, if this fits with how your project is organized. SUMMARY In this chapter, you learned about some of the differences between Git's overall design and functioning, and that of more traditional centralized source management systems. I covered the model that Git uses to clone a repository and create a stand-alone local environment in which to do source management operations versus the typical legacy “always do the operations to the server” model. Along these lines, I talked about how Git is structured with the local environment and the remote environment. I also introduced the concept of disconnected development, which is one of the appealing aspects of using Git. All of this allows you to get things the way you want them locally before you share them back with others in a public repository. I also shared some insights on how Git manages things internally. You learned how Git sees sets of files involved in a commit as a unit and works at a granularity that is directory tree–based, not file-based. You also looked at how it stores and manages commits over time. Finally, I discussed some considerations when creating or migrating to Git repositories, defining some guidelines for repository scope and file scope, especially around large files and binaries. Git is not strong in managing very large files, but there are good alternatives. 
In the next chapter, you'll expand your understanding of the local environment that Git provides by looking at the Git promotion model, as well as looking at the workflow to move content through the different levels.

Chapter 3 The Git Promotion Model

WHAT'S IN THIS CHAPTER?

- The different levels of Git
- The workflow for moving content between the levels (the Git promotion model)
- Why Git has the staging area and how it is used
- A summary of the commands that you use to move content between the levels

Whenever you are learning a new system or process, it's convenient to think about it in terms of something you already know or have some familiarity with. In this chapter, you will take a tour of the various levels that make up a Git system. You will also relate them to a common model that almost everyone who works in an IT-related field will recognize. This model also provides a convenient way of thinking about how you get content through the levels, and introduces you to the basic Git commands for a workflow. In addition, I'll focus in on one level that is not typically found in other source management systems, but which plays a key role when interacting with Git and some of its advanced functionality. Understanding this level early on is a prerequisite to really understanding any Git workflow.

THE LEVELS OF GIT

So far, I have introduced Git and discussed its history, good points, and not-so-good points. I've also presented some concepts to help you understand its internal functioning. It's now time to look at the different levels that users encounter when working with Git. These levels represent the stages that content moves through as it makes its way from the local development directory to the server-side (remote) repository. One way to think about and understand these levels is to compare them to another well-known model, a dev-test-prod environment.

Dev-Test-Prod and Git

Figure 3.1 shows a simple block diagram representing a dev-test-prod environment.
Most organizations employ some version of this model in their software development and release processes. Figure 3.1 A simple dev-test-prod environment You can think of this environment as a sort of promotion model where content moves up through the levels as it matures. Each movement can be initiated by someone or some process when it is deemed ready. At any point, different levels may contain the same or different versions of some particular piece of content, depending on which levels it has been promoted to and whether any additional changes have been made at a lower level. To give you a better understanding of this environment, I'll briefly describe the purpose of each of the levels in my reference model. At the bottom, you start with a Dev area (a development workspace) where content is created, edited, deleted, and so on. Some other names that might be used for this level include sandbox, playpen, workspace, and working directory. When the code is deemed adequate, it can be moved to the Test area (the testing level). Not all of the code has to be moved to Test at the same time. This is an area where different pieces can be brought together to ensure that everything is ready for production. Once a set of code has passed the testing phase, it can be promoted to the Prod (or production) area; this is where it is considered ready and officially released. Then, for my purposes here, you add another level, Public, which represents an area where the production code is put, to be shared with others. An example might be a website where content is deployed so that others can see it and access it. Given this reference of a dev-test-prod(-public) model, let's look at the different levels that Git uses as an analogy to this model, and how they relate to each other. Figure 3.2 shows a similar way of thinking about the Git levels. Figure 3.2 The levels of a Git system Starting at the bottom is the working directory where content is created, edited, deleted, and so on. 
Any new content must exist here before it can be put into (tracked by) Git. This serves the same purpose as the Dev area in the dev-test-prod-public model.

Next is the staging area. This serves as a holding area to accumulate and stage changes from the working directory before they are committed into the next level, the local repository. You can think of this process as being similar to how you might move content to the testing stage in your dev-test-prod-public model. It is a place to build up a set of content to then promote. I'll go into more detail about this area shortly.

After the staging area comes the local repository. This is the actual source repository where content that Git manages is stored. Once content is committed to the local repository, it becomes a version in the repository and can be retrieved later.

The combination of the working directory, staging area, and local repository makes up your local environment. These are the parts of the Git system that exist on your local machine; the staging area and the local repository actually live within a special subdirectory of the root (top-level) directory of your working directory. This local environment exists for users to create and update content and get it in the form they want before making it available or visible to others, in the remote repository.

The remote repository is a separate Git repository intended to collect and host content pushed to it from one or more local repositories. Like the Public level in the dev-test-prod model, its main purpose is to be a place to share and access content from multiple users. There are various forms of hosting and protocols for access that I'll talk more about in Chapter 12. I'll refer to this as your remote environment. Figure 3.3 adds the local versus remote environments encapsulation to the model. Let's examine each of these areas in more detail.
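To make the promotion idea concrete, here is a small Python sketch of the four levels just described. It is a conceptual model only: the `LocalEnvironment` class, its method names, and the use of a plain list as the remote are all assumptions made for illustration, and the staging area is treated the way this chapter describes it, as a holding area that is emptied by a commit.

```python
class LocalEnvironment:
    """Toy model of the three local levels: working directory, staging area, repository."""

    def __init__(self):
        self.working_directory = {}  # path -> content being edited
        self.staging_area = {}       # path -> content staged for the next commit
        self.local_repository = []   # list of commits, oldest first

    def stage(self, *paths):
        """Promote the named files from the working directory to the staging area."""
        for path in paths:
            self.staging_area[path] = self.working_directory[path]

    def commit(self, message):
        """Promote everything staged into the local repository as one snapshot."""
        files = dict(self.local_repository[-1]["files"]) if self.local_repository else {}
        files.update(self.staging_area)
        self.local_repository.append({"message": message, "files": files})
        self.staging_area.clear()

    def push(self, remote):
        """Promote commits the remote does not yet have (remote is a plain list here)."""
        remote.extend(self.local_repository[len(remote):])


remote_repository = []
local = LocalEnvironment()

local.working_directory["README.txt"] = "first draft"
local.working_directory["notes.txt"] = "scratch notes"

local.stage("README.txt")           # only README.txt moves up a level
local.commit("Add README")          # now a version in the local repository
visible_before_push = list(remote_repository)  # still empty: nothing is shared yet
local.push(remote_repository)       # now others can see the commit
```

Each call moves content up exactly one level, and nothing reaches the remote until `push` is called, which matches the point that the local environment is where content is shaped before it becomes visible to others.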
Figure 3.3 The local versus remote environments

The Working Directory

Any directory or directory tree on your local system can be a working directory for a Git repository. A working directory can have any number of subdirectories that form an overall workspace. (You might also hear this referred to by similar names such as “working tree” or “worktree.”) In a tree structure, the higher-level directory where you initiated work with Git becomes the top level or root of your workspace. All subdirectories are considered part of the working directory's scope, unless Git is specifically told to ignore them via a .gitignore file (discussed in Chapter 10) or they are part of a Git submodule (discussed in Chapter 14).

When you connect Git to a local directory tree, by default Git creates a repository skeleton in a special subdirectory at the top level of the tree. That repository skeleton is the local repository. The physical subdirectory is named .git by default. This follows a convention that many open source projects use: storing metadata in a directory whose name starts with a period (.) followed by the name of the tool or application. Thus, your repository and all of your source management information is located within a subdirectory of your working directory.

OVERRIDING GIT'S DEFAULT LOCATIONS

In the section above, we noted that “by default” Git creates the repository skeleton under a subdirectory named .git at the top level of your source tree. This is actually configurable through an option that we pass to Git (--git-dir) when running it, or through an environment variable ($GIT_DIR). However, unless you have a strong reason to change these, you are better off just leaving these as the locations set by default. Throughout the book, we will just refer to these settings as being “.git” under the working directory for simplicity.
As I discussed in Chapter 2, it's important to consider how much content you're trying to manage in any one Git repository, and thus in your working directory. Your repository structure, content, and scope are based on the structure, content, and scope of your workspace, and so similar guidelines apply. When developing code, a workspace should most likely consist of the structure needed to create a single deliverable (a JAR file, a DLL, and so on). For other kinds of content, consider what makes sense as a logical unit that can be managed separately and maintained by a small number of users to reduce the occurrence of merge conflicts.

If you have content in your working directory that should not be tracked or managed by Git, then those files and directories should be listed in a .gitignore file at the top level of your tree. The .gitignore file is just a text file containing a list of files, directories, or wildcard (glob-style) patterns. If this file exists, Git does not add or track the files and directories that match the entries listed in it. Common examples of types of files to have Git ignore would be very large files (especially binary files) and files that are generated from content already being tracked. (Refer to Chapter 2 for the reasons behind this.) The .gitignore file is discussed in detail in Chapter 10.

NOTE Before going further, it's useful to clarify some Git terminology. People frequently talk about a commit in Git. In Git, a commit is both a noun and a verb, an entity and an action. Doing a commit (committing) means moving content from the staging area to the local repository. A commit as an entity means a set of content managed as a unit by Git. Similarly, the term stage or staging refers to the action of promoting content from the working directory to the staging area. I'll clarify all of this in the next chapter, but I'll need to use this terminology in talking about the remaining levels.
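To give a feel for how the .gitignore file described above filters a workspace, the following Python sketch approximates its matching with shell-style wildcards from the standard `fnmatch` module. The `is_ignored` helper is hypothetical and deliberately incomplete: it handles comments, blank lines, plain wildcard patterns, and patterns ending in a slash (directories), and it ignores features of the real format such as negation and patterns anchored to a particular directory. Chapter 10 covers the actual rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path, patterns):
    """Approximate check of a workspace-relative path against ignore patterns."""
    parts = PurePosixPath(path).parts
    for raw in patterns:
        pattern = raw.strip()
        if not pattern or pattern.startswith("#"):
            continue  # skip blank lines and comments
        if pattern.endswith("/"):
            # Directory pattern: matches only directories, so test every
            # component of the path except the final one.
            if any(fnmatchcase(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # Plain pattern: matches a file or directory name at any depth.
            return True
    return False


# Example entries of the kind discussed above: generated output and binaries.
ignore_patterns = [
    "# generated build output",
    "build/",
    "*.jar",
    "",
]

workspace = ["src/App.java", "build/classes/App.class", "lib/tool.jar", "README.txt"]
tracked = [path for path in workspace if not is_ignored(path, ignore_patterns)]
```

With these example patterns, only the source file and the README remain candidates for tracking; the generated class file and the JAR are filtered out.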
The Staging Area

The staging area is one of the concepts in Git that many new users have difficulty understanding and appreciating. At first glance, it may seem like an unnecessary intermediate level that gets in the way of trying to promote content from the working directory to the local repository. In fact, it plays a significant role in several parts of Git's functionality.

What's the Point of the Staging Area?

As its name implies, the staging area provides a place to stage changes before they are committed (promoted) into the local repository. The staging area can hold any set of content that has been promoted from the working directory and is a candidate for going into the local repository, from a single file to all of the eligible files. The staging area provides a place to collect or assemble individual changes into the set of things that will be committed. It allows finer-grained control over the set of things that make up a change. Now let's look at the common use cases for it.

There are two ways of viewing the utility of the staging area: user-initiated and Git-initiated. Both ways offer benefits to the user; the difference is in which actions or processes place content into the level. You'll first look at the use cases or scenarios that originate with the user moving content into the staging area.

The Prepare Scenario

The first use case for the staging area can be thought of as the Prepare scenario. In this scenario, as a user completes changes in their workspace, they move files that are ready into the staging area. In the simplest case, this is a single promotion of all eligible content (any new or changed files that Git is not told to ignore). However, it can also be done at any granularity of files that the user chooses, meaning the user could even choose to promote each of the files one at a time into the staging area as work is completed.
Think of it like this: suppose you have a large checklist of files to modify in order to create a feature or fix a bug. As you complete changes on a subset of the files, you want to go ahead and promote that subset to ensure the changes are persisted, outside of your workspace, on your way to building up the full set for the change. As pieces of the larger change are done, you move those pieces to the staging area and check them off your list. With other source management systems, you typically only have the workspace and the repository. And putting a subset that's an incomplete change into a repository can cause confusion, failed builds, and so on. That's because in those systems, committing changes means they go directly into a public/server-side repository where they are immediately visible and accessible by users and processes, rather than going into a local area first, as they do in Git. To avoid having those changes go directly into the public repository in other source management systems, you might resort to saving those changes off into another local directory—or just leaving everything in the workspace until you get the entire set of changes completed. However, a more useful and elegant model would allow you to stage parts of changes outside of your workspace until you have a complete change built up and ready to commit into the repository. This is what Git allows you to do. Of course, there's no requirement to stage the change as separate pieces. You can promote everything as a unit from the working directory. However, as you become more familiar with Git, and start to work with larger changes, you'll likely find more value in being able to break them up in this way. As well, Git allows for some interesting advanced functionality such as staging only selected changes from a file. You'll explore this workflow in more detail in Chapter 5. The Repair Scenario A second use case for the staging area can be referred to as the Repair scenario. 
In actuality, you might call it the amend scenario as it relies on an option by that name when doing a commit. As I noted in the previous chapter, one of the interesting things that Git allows users to do is to rewrite history. That is, they can modify previous commits in the repository. The simplest way to do this is by using the amend option when doing a commit. This operation allows the user to pull back the last commit from the repository, update its contents, and put the updated commit back in place of the previous one. Effectively, it provides a do-over, or an opportunity to repair the last commit.

So where does the staging area come in for this mode? When the previous commit is amended, it is amended with any content that is in the staging area. The workflow is essentially as follows:

1. Make any updates in the working directory.
2. Put the updates into the staging area.
3. Run the commit with the option to amend.

The last operation will cause the previous commit to be updated with whatever is in the staging area and then place the updated commit back into the local repository, overwriting the previous version. (If there are no updated contents in the staging area, then only the message that is attached to the commit can be updated.)

This is a powerful feature that gives users a lot of flexibility. As you may have gathered, one of Git's aims is to allow users to easily create and change things as many times as needed in their local environment before actually updating content (on the remote side) that others will see, or that could affect production processes. You'll work through an example of using the amend option in Chapter 5.

When Is the Staging Area Used by Git?

In addition to users performing actions that directly cause content to be moved into the staging area, Git also uses the staging area itself on certain occasions, notably for dealing with merge conflicts. This case most closely aligns with the prepare scenario I outlined previously.
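Returning to the Repair scenario for a moment, the three-step amend workflow described earlier (update, stage, commit with the amend option) might look like the following sketch. The repository location, the identity values, the file name report.txt, and the commit messages are all hypothetical; Chapter 5 works through the amend option in detail.

```shell
#!/bin/sh
# Sketch of the Repair (amend) scenario. All names and values here are
# assumptions made for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name  "Example User"        # hypothetical identity, local to this repository
git config user.email "user@example.com"

# An initial commit that turns out to be incomplete.
printf 'first draft\n' > report.txt
git add report.txt
git commit -q -m "Add report"

# 1. Make the update in the working directory.
printf 'forgotten line\n' >> report.txt
# 2. Put the update into the staging area.
git add report.txt
# 3. Run the commit with the option to amend.
git commit -q --amend -m "Add report (with forgotten line)"

# The history still holds a single commit, now carrying the new content and message.
count=$(git rev-list --count HEAD)
subject=$(git log -1 --format=%s)
printf 'commits: %s\nsubject: %s\n' "$count" "$subject"
```

Because the amended commit replaces the previous one, this is best kept to commits that have not yet been pushed to a remote that others use.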
Merging is significant enough functionality in Git that it gets a full treatment in Chapter 9. For my purposes here, I'll describe how it works at a high level and particularly how it uses the staging area. When you merge in Git, you are generally merging together two or more branches. In a best-case scenario (not too uncommon), the merge may have no conflicts and everything merges cleanly. In that case, Git both completes the merge locally (in your working directory), and promotes the merged content automatically into the local repository—and you're done. However, in a case where there are merge conflicts that Git cannot automatically resolve, Git puts those files in your working directory for you to fix, and stages any files that merged cleanly. What it is doing is starting to create a set of merged content to be committed once everything is resolved. From here, the idea is that the user goes into the working directory and edits the files with conflicts in order to fix them. Then those fixed files are added into the staging area with the ones that were automatically merged. After this, the staging area will contain the full set of resolved files and a single commit can be done to complete the merge. There is another side benefit of this arrangement. After the merge has been attempted, if there are conflicts, the merged files are grouped together in the staging area. Separately, the files with merge conflicts are grouped together in the working directory. This offers a very easy way to see which files fall into which category, and thus an easy way for the user to understand what is merged and what needs to be manually resolved. Can I Bypass the Staging Area? While the staging area is very useful for the situations outlined previously, outside of those situations, most users still want to know if they can bypass it in normal use. The answer is … usually. 
Git provides a shortcut method to promote files to the staging area and then to the local repository with one operation. The caveat, though, is that this only works for files that Git is already tracking, meaning that the first time a file is added to Git, it has to go through the staging process. Afterward, for normal commit operations, you can use the shortcut if you choose to simplify updating revisions. The shortcut is explained in Chapter 5.

MERGING AND THE STAGING AREA

One other area where the staging operation is required is when you need to complete a merge operation that had conflicts. As discussed in the previous section, Git stages files that merged successfully. In order to complete the merge, files that have conflicts manually resolved must be staged. This creates a complete set of content to be committed to complete the merge operation.

Other Names for the Staging Area

One other note about the staging area is that it has a couple of other names in Git. It is sometimes referred to by the terms index or cache. In fact, some Git commands will have variations of index or cache as options for operations that work on content in the staging area. For purposes of what you're doing in this book, you can think of all of these terms as meaning the same thing.

The Local Repository

The local repository is the final piece of the set of Git levels that exist on a user's local machine (the local environment). Once content has been created or updated and then staged, it is ready to be committed into the local repository. As mentioned earlier, this repository is physically stored inside a separate (normally hidden) subdirectory normally within the root of the working directory. It is created in one of two ways: via a clone (copy) of a repository from a remote, or through telling Git to initialize a new environment locally.

The nice thing about the local repository is that it is a source repository exclusively for the use of the current user.
Modifications can be done until the user is satisfied with the content, and then the content can be sent to the remote repository where it is available to others. As noted before, because everything is local, source control operations can be done to the local repository without network overhead, and even when the machine is not connected to a network. Of course, there are always tradeoffs. Having everything local means that content is lost if the working directory is accidentally wiped out and content has not been synched to the remote repository. It also implies that the longer the time between when content is synched to the remote repository, the higher the chance of merge issues if others are continuing to update that particular remote repository. The Remote Repository The remote repository is the level of Git that hosts and serves up content for wider consumption. It's the place where multiple Git users sync up the changes from their respective local repositories. It corresponds to what you would traditionally think of as the server in other source management systems. I will go into more detail on remote repositories in later chapters, but there are a few general points that are useful to understand up front about remote repositories: A remote repository is unique. There can be many remote repositories for many different projects managed with Git, but Git does not make or use multiple copies of the remote repository on the server. A remote repository can be cloned as many times as needed to separate local repositories. Related to the section in Chapter 2 where I discussed the differences between centralized and distributed source management systems, multiple different users can get copies of the remote repository as their own local repositories to work with. Then, when they push changes from their local repositories, they are pushing them into the single corresponding remote repository that the local repositories were copied from. 
A remote repository does not make user-facing modifications to content, such as resolving conflicts for merging. It is primarily concerned with synching changes to and from the local repositories of individual users. If there are conflicts that need resolution at the time content is pushed over to the remote, that content has to be pulled back to the local environment, resolved there, and then synched up to the remote. REPOSITORY BRANCHES Before I leave the topic of repositories, it's worth saying a quick word about branches. As in most source management systems, Git supports the concept of branches (which I explore in detail in later chapters). In Git, there are branches that exist in the local repository (local branches) and branches that exist in the remote repository (remote branches). Synching of these branches occurs during some of the commands that I will talk about for working with remote repositories. At any point in time, one branch is active in the local environment, meaning that the files in the working directory tracked by Git came from that local branch. The Core Git Commands for Moving Content Now that you understand the different levels in the Git model, it's a good time to introduce the core Git commands for moving content between them. Some of these commands have already been mentioned in context. I'll just note them briefly here to help fill out an overall picture of the system. Chapter 5 will explain the local workflow in more detail, and later chapters will explain the workflow when working with the remote environment. I'll characterize these commands by which levels they interact with. Working Directory to Staging Area The add command stages content from the working directory to the staging area. Contrary to what the name implies, you always use the add command to stage anything, even content that is not new and that has been staged before. 
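A small sketch may help show that add is needed again for a file Git already tracks. It relies on the two-column short status output, where a modification that is not yet staged appears as " M" and a staged one as "M ". The repository location, identity values, and the file name notes.txt are hypothetical.

```shell
#!/bin/sh
# Sketch: the add command stages changes to tracked files too, not only new files.
# All names and values are assumptions made for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name  "Example User"        # hypothetical identity, local to this repository
git config user.email "user@example.com"

# First pass: the file is new, so it is added and committed.
printf 'v1\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Second pass: the file is tracked, but the change still has to be staged.
printf 'v2\n' >> notes.txt
before=$(git status --short)     # modified in the working directory only
git add notes.txt
after=$(git status --short)      # modification now staged

printf 'before add: [%s]\n' "$before"
printf 'after add:  [%s]\n' "$after"
```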
Staging Area to Local Repository The command that is used to promote things from the staging area to the local repository is the commit command. Think of it as making a commitment to put your changes into the official source management repository. This is most similar to what you might see as check-in in other source management systems, but note that it only takes content from the staging area. Local Repository to Remote Repository To synchronize changes from a local repository to the corresponding remote repository, the command is push. Unlike commits into the local repository, merge conflicts from content pushed by other users can be encountered here. Also, being able to push to a particular remote repository assumes appropriate access and permissions via whatever protocol and permissions checking is being used. Local Repository to Working Directory The checkout command is used to retrieve content (as flat files) from the local repository into the working directory. This is usually done by supplying a branch name and telling Git to get the latest copy of content from that branch. Checkout also tells Git to switch the branch that you are currently working with. Remote Repository to Local Environment When moving content from the remote repository to the local environment, there are several ways the local repository and the working directory can receive content from the remote repository. The clone command is used to create a new local environment from an existing remote repository. Essentially, it makes a local copy of the specified remote repository onto the local disk and checks out a flat copy of the files from a branch (typically master, although this is configurable) into the working directory. The fetch command is used to update the local repository from the remote repository. More specifically, it is updating reference copies of the remote branches (reference branches) that are maintained in the local repository. 
This allows for comparison between what you have in your local repository and what the remote repository had the last time you connected to it. A merge or rebase (merge with history) can then be done to update local branches as desired.

The pull command does a fetch followed by the merge or rebase (merge with history) operation. This one command then results in not only updating the reference branches in the local repository from the remote side, but also merging that content into the local branch or branches. It also updates any of the corresponding files in the working directory if the current branch is one that had updates merged from the remote side.

Table 3.1 summarizes the levels and commands.

Table 3.1 Core Commands for Moving Content between Levels in Git

From | To | Command | Notes
Working Directory | Staging Area | Add | Stages local changes
Staging Area | Local Repository | Commit | Commits only content in staging area
Local Repository | Remote Repository | Push | Syncs content at time of push
Local Repository | Working Directory | Checkout | Switches current branch
Remote Repository | Local Environment | Clone | Creates local repository and working directory
Remote Repository | Local Repository | Fetch | Updates references for remote branches
Remote Repository | Local Repository and Working Directory | Pull | Fetches and merges to local branch and working directory

Putting this table into a visual representation, you can add the commands to the previous picture of the Git model. This provides a representation of Git in one picture, as shown in Figure 3.4.

Figure 3.4 Git in one picture

SUMMARY

In this chapter, you looked at the Git promotion model, a way of thinking about the different levels of Git and how content is moved between them. These levels include the remote repository, the local repository, the staging area, and the working directory.
The last three levels make up what I refer to as your local environment, where you develop content and do source management locally before making it more widely available by pushing to the remote. I dove in to explain why the staging area exists, and some of the different uses and functionality it provides. These include gathering up changes for a commit (prepare), updating the last commit (repair), and providing separation between files that merge cleanly and files that don't when doing a merge. I concluded by giving a brief summary of the commands that you use with Git to move between the different levels, leading to a single-picture representation of the Git model. In the next chapter, you'll look at how to configure Git's various options and settings and actually start using Git to work through the model. About Connected Lab 1: Installing Git Before going to the next chapter though, this is a good point to work through Connected Lab 1: Installing Git. Having an installation of Git will be a prerequisite for the rest of the Connected Labs in the book. I highly encourage you to work through the labs if you are not familiar with the topics. This will provide you with hands-on experience with Git and help to internalize the concepts we discuss in the text. Enjoy! Connected Lab 1 Installing Git This lab will guide you through the steps for installing Git on your system. If you already have Git installed, you can skip to the next chapter. Otherwise, select the appropriate section for your operating system, and follow the instructions. Installing Git for Windows The Git for Windows package installs a version of Git that also includes a Bash (Unix) shell that runs on top of Windows and provides a Unix-style interface to Git. You can also integrate Git with the Windows Explorer and command prompts. The following instructions provide the necessary steps for installation, as well as additional information on the install screens you will encounter during the process. 
Steps

Each of the following numbered steps represents a new screen of the installation tool.

1. In your browser, go to http://git-scm.com/download/win. The download starts automatically.

2. After the download completes, double-click the executable file (or select the “Run” button if one is available) to start the install. (If any security prompts come up, answer them to allow the install to run.)

3. Click Next after viewing the license agreement.

4. (Optional) Deselect any integration pieces you don't want. This allows you to set up integration with the Windows Explorer and file associations if you want. Click Next.

5. Select how you want to use Git. This screen gives you several options:

a. Use Git from Git Bash only: This refers to the Unix shell that comes as a separate program with Git for Windows. If you are comfortable with Unix, or you aren't and don't intend to run many operating system commands, this is the simplest option. The shell features some nice color-coding that can be helpful as you're learning Git. This option won't allow you to use Git integration in Windows command prompts.

b. Use Git from the Windows Command Prompt: The main purpose of this option is to allow you to run Git commands in Windows command prompts. It also includes the ability to use Git through the Git Bash shell. It does not try to provide full integration with some of the Unix applications in command prompts as the third option does. This is a good default because it provides the flexibility to use Git in either or both the Bash shell and Windows command prompts.

c. Use Git and optional Unix tools from the Windows Command Prompt: This option provides some additional Unix-style tools for you to use from command prompts. Keep in mind that these tools will be in the path and found before some of the Windows commands of the same name. In general, if you want to use Unix commands and tools, you're better off doing so through the Bash shell interface.

6. If you have access to Plink (PuTTY Link) on your system, you see an additional screen allowing you to choose which SSH executable you want to use here. Unless you have a specific reason to do otherwise, choosing the Use OpenSSH option is fine.

7. Configure the line ending conversions. This refers to how you want Git to handle line endings in files when getting content out of Git or putting content into Git. I cover this setting in detail in Chapter 4. You can jump ahead and read about that now if you want, but briefly, this relates to how you plan to edit files you'll be managing with Git. You can find more details on the different options in the following paragraphs.

If you plan to use Windows editors, then the first setting (Checkout Windows-style, commit Unix-style) will probably work best. This setting means that when you get text content out of Git, Git updates the line endings in the checked-out files to be carriage-return/line-feed (CRLF). This is the line ending expected by Windows editors. When you check in (or commit) content back into Git, Git strips out the CRs and stores (normalizes) the text/ASCII files with line endings that are just LF (the default for Unix).

On the other hand, if you plan to edit with Unix-based editors (vi or others) or work primarily through the Bash shell, then the Checkout as-is, commit Unix-style line endings setting may be the best choice. This doesn't make any changes to the files on checkout, but normalizes them to LFs when storing them in Git. So, essentially, they will always have LFs. Because LF-only is the default for Unix systems, this works well for editing in that environment.

The last choice (Checkout as-is, commit as-is) can be problematic. Basically, this tells Git not to make any changes for line endings and just to leave them as they are. This means that you can end up with a mixture of line endings in the repository. The other two options normalize files in the repositories to LFs.
If a file is edited in a Windows editor and then stored back in Git, the file stored in the Git repository will contain CRLFs. However, if edits are done in Unix editors, the files will have just LFs stored in the repository. If someone then gets one of these files out on an OS that is different from the OS where it was last edited, they may be surprised by the line endings being in the style of the other OS. This can be especially troublesome for teams where some members use Unix and other members use Windows.

You can change this setting at a later time by changing the configuration value for core.autocrlf that is mentioned here. (I cover this in more detail in Chapter 4.) However, at that point, there may already be files stored in the repository with undesired line endings. There are ways to fix these files that are beyond the scope of this discussion.

For most of the work you'll do in the Connected Labs for this book, the value of this setting won't be significant. However, the best practice here is to choose one of the first two settings that best corresponds to the OS type where you plan to run your editors.

8. Configure the terminal emulator for the Bash shell. You have a choice of which terminal program you want to use for the Bash shell. Unless you have a specific reason to choose the Windows default console window option, choose the Use MinTTY option. This gives you a better user interface that supports functions such as Copy and Paste in the expected way (highlight and select) rather than with the limited functionality of the console window option.

9. Configure extra options. The Enable file system caching option is a relatively new addition. It attempts to speed up file-related operations for users of Git on Windows, where the file handling is not optimized in the same way as it is for Unix. In principle, this seems like a good option, although most users have had limited experience with it.
Note that it can be turned off later by changing the core.fscache configuration value. This is one option you should probably turn on until any issues are found with it.

The Git Credential Manager for Windows is a successor to a previous credential management application. It essentially helps with managing and simplifying different types of access for Git from various applications. Its use is generally transparent to the user. You can read more about this application in the README file for the project on the GitHub hosting site at https://github.com/Microsoft/Git-Credential-Manager-for-Windows/blob/master/README.md. Unless you have a specific reason not to use this application, just leave it checked.

10. Once you've completed the option screens in these steps, click the Install button. Git removes any old installs (if they exist) and updates with the newest version. Afterward, you have a new Git category in your available programs list with entries to start the Git Bash shell (the Unix shell), a Git CMD window, and a Git GUI interface. The Git CMD window is like a Windows command prompt. However, you can start up a Windows command prompt and also have access to Git in this window.

11. Open the Git Bash shell, the Git CMD window, or a Windows command prompt, and type the following command to make sure you have Git installed and are running at the expected version:

$ git --version

USING THE GIT BASH SHELL

When you start up the Git Bash shell, it generally opens up to a directory of “/” (forward slash) at the prompt. This root directory corresponds to the directory on your Windows file system where Git is installed. By default, that is C:\Program Files\Git. It's not good practice to store repositories under Program Files, so you want to switch to a different directory before starting to work with Git.
In the Git Bash shell, to change to a different directory on your C drive, you use a command like this:

$ cd /C/Users/

This corresponds to the following command in a regular Windows command line interface:

cd C:\Users\

So, in the Bash shell, the syntax for navigating around is to represent drive letters as /<drive letter>/ instead of <drive letter>: (note the slash versus colon) and then use forward slashes instead of backward slashes in the remaining parts of the path. Note that “~” corresponds to c:\users\ and, as previously mentioned, “/” by itself corresponds to the directory where Git was installed.

Installing Git on Mac OS X

1. In your browser, go to http://git-scm.com/download/mac. The download starts automatically. If not, there is a link you can click to start it.

2. Install the downloaded file via the DMG and PKG files.

3. Open up a terminal, and run the following command to make sure Git is installed and running at the expected version:

$ git --version

Installing Git on Linux

1. In your browser, go to http://git-scm.com/download/linux.

2. Follow the instructions on that page for the particular flavor of Linux you're using.

3. Confirm that Git is installed by opening up a terminal session and running the following command:

$ git --version

Part II
Using Git

CHAPTER 4: Configuration and Setup
CHAPTER 5: Getting Productive
CHAPTER 6: Tracking Changes
CHAPTER 7: Working with Changes over Time and Using Tags
CHAPTER 8: Working with Local Branches
CHAPTER 9: Merging Content
CHAPTER 10: Supporting Files in Git
CHAPTER 11: Doing More with Git
CHAPTER 12: Understanding Remotes—Branches and Operations
CHAPTER 13: Understanding Remotes—Workflows for Changes
CHAPTER 14: Working with Trees and Modules in Git
CHAPTER 15: Extending Git Functionality with Git Hooks

Chapter 4
Configuration and Setup

WHAT'S IN THIS CHAPTER?
- Git command syntax and format
- The differences between porcelain and plumbing commands
- Working with auto-completion
- Basic configuration of Git and your user environment
- Creating a new repository
- Dealing with line endings with Git
- The contents of a Git repository
- Creating Git aliases

When starting to use Git, it's important to configure it so that it works properly in your particular environment. You'll also want to be able to manage your content and your interactions with Git in a way that you prefer.

In this chapter, you will learn how to configure your Git environment, and explore the different considerations that come into play. You'll look at some of the key required items such as line endings, as well as some of the more significant optional settings. You'll also learn how to define settings within the different scopes that Git allows.

In the “Advanced Topics” section, I'll describe how the init command works, offer more detail about what's actually in the underlying repository, and show you how to create aliases that take parameters that can run small programs.

EXECUTING COMMANDS IN GIT

As I previously mentioned, this book focuses on the Git command line to provide the most universally applicable way to use the tool. The general form of commands is as follows:

git <git-options> <command> <command-options> <operands>

Table 4.1 describes the different parts of this form.

Table 4.1 Components of a Git Command Line Invocation

Element | Description | Example(s) | Notes
git | Command to run Git | git |
<git-options> | Global options for Git itself. These options may also specify a function to execute. | git --work-tree= and git --version | Some of these options may be intended for standalone operation (for example, --version), while others modify values used for other commands (for example, --work-tree).
<command> | Git command to execute | git push |
<command-options> | Options to the specified command | git commit -m "comment" | May have default options if none are specified. Options may also have values that can be selected to further qualify the option.
<operands> | Items for the command to operate on | git add *.c | Particular to the command being executed. Examples include files in the working directory, branches or SHA1s in a repository, or a particular setting or value.

Operand Types

As referenced in Table 4.1, Git can take different kinds of operands, which are specifications of objects to operate on. The two most common operands are the SHA1 value of a commit (or a named branch or tag that refers to such a commit) and a path specification to a file or directory on the disk. For many commands, either or both of these value types may be specified, or neither. When neither operand is specified, the command will operate against all eligible items that it finds in the scope of the repository, staging area, or working directory tree.

NOTE Throughout this book, I won't usually supply optional commits or path specifications to Git's commands unless they are required. This will help to simplify examples and allow you to learn about the commands in this context. However, I will introduce the forms of the commands when I first discuss them so you'll be able to see where those items can be supplied.

The primary reason to specify both commit references and paths would be to select certain paths that are part of, or in the scope of, the snapshot associated with the commit. Because Git operates at the granularity of a snapshot (tree), you may not always want to do the operation against all items in the snapshot. However, that's what would happen if you just specified the commit | tag | branch. To indicate that the operation should only be done against certain files or paths in the scope of the snapshot, you need to add specific filenames or paths.

When both types are specified, if there is a possibility of Git not being able to tell the difference between a commit | branch | tag and one or more of the filenames or paths, then you can separate the two types using the special separation symbol “--”.
Normally, this won't be needed if a commit is expressed as a SHA1 value, but it may be needed if branch or tag names could be mistaken as names for files or paths. As an example, the command git <command> a1b2c3d4 file1.txt might be clear enough, but a tag name followed directly by a filename could be ambiguous enough when parsed to require the “--” separator symbol, as in git <command> my-tag-name -- my-file-name.

NOTE As referenced in Table 4.1, Git has global options, and quite a few of them. Beyond the obvious ones, such as --version and --help, there are a number of options concerned with allowing users to specify different paths for different areas of Git, as well as a few miscellaneous ones. At this point, I won't go into further detail about these options because many of them wouldn't make sense without additional context. However, where individual options are valuable in the context of later chapters, I'll cover them there.

Porcelain versus Plumbing Commands

In this section, command represents any of the commands available in Git, such as the ones I talked about for moving content between the levels of Git in Chapter 3 (add, commit, push, and so on). In Git, there are two categories for the types of commands: porcelain and plumbing. Those names may sound strange, but essentially, the porcelain commands are intended to be user-facing, more commonly used, and more convenient. They also typically provide a higher level of functionality. The commands that I previously mentioned in conjunction with the Git promotion model are examples of porcelain commands.

The plumbing commands function at a lower level and are not expected to be used by the average user. These commands are typically targeted at extracting or modifying content and information more directly from the repository. Examples are the git cat-file and git ls-files commands, which provide a way to look at the contents of a file or directory within the repository if you know how to reference those elements.
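To make the distinction a little more concrete, here is a small sketch that records a file with porcelain commands and then inspects the result with the two plumbing commands just mentioned. The repository location, identity values, and the file name greeting.txt are hypothetical; HEAD refers to the most recent commit on the current branch.

```shell
#!/bin/sh
# Sketch: porcelain commands to record content, plumbing commands to inspect it.
# All names and values are assumptions made for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name  "Example User"        # hypothetical identity, local to this repository
git config user.email "user@example.com"

# Porcelain: stage and commit a file.
printf 'hello\n' > greeting.txt
git add greeting.txt
git commit -q -m "Add greeting"

# Plumbing: list the files recorded in the index.
tracked=$(git ls-files)

# Plumbing: report the type of the object that HEAD refers to.
kind=$(git cat-file -t HEAD)

# Plumbing: print the stored contents of greeting.txt as of that commit.
content=$(git cat-file -p HEAD:greeting.txt)

printf 'tracked: %s\ntype: %s\ncontent: %s\n' "$tracked" "$kind" "$content"
```

For day-to-day work, porcelain commands such as status and show cover the same ground with friendlier output.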
Certain functionality in Git can be accomplished using either porcelain commands or plumbing commands. However, it would usually take several very specific plumbing commands to accomplish what one porcelain command can do. The porcelain commands are based on the plumbing commands. They aggregate the functionality of plumbing commands and certain options and sequences in order to make things simpler for the typical Git user. Table 4.2 shows a categorization of the porcelain (user-friendly) commands that are available in Git.

Table 4.2 Porcelain Commands in Git

Command | Purpose
add | Add file contents to the index.
bisect | Find by binary search the change that introduced a bug.
branch | List, create, or delete branches.
checkout | Switch branches or restore working tree files.
cherry | Find commits yet to be applied to upstream (branch on the remote).
cherry-pick | Apply the changes introduced by some existing commits.
clone | Clone a repository into a new directory.
commit | Record changes to the repository.
config | Get and set repository or global options.
diff | Show changes between commits, commits and working tree, and so on.
fetch | Download objects and refs from another repository.
grep | Print lines matching a pattern.
help | Display help information.
log | Show commit logs.
merge | Join two or more development histories together.
mv | Move or rename a file, directory, or symlink.
pull | Fetch from, or integrate with, another repository or a local branch.
push | Update remote refs along with associated objects.
rebase | Forward-port local commits to the updated upstream head.
rerere | Reuse recorded resolution for merged conflicts.
reset | Reset current HEAD to the specified state.
revert | Revert some existing commits.
rm | Remove files from the working tree and from the index.
show | Show various types of objects.
status | Show the working tree status.
submodule | Initialize, update, or inspect submodules.
subtree | Merge subtrees and split repositories into subtrees.
tag | Create, list, delete, or verify a tagged object.
worktree | Manage multiple working trees.

Table 4.3 shows the same categorization for the plumbing commands. These commands have names that indicate an action and an object to operate against as opposed to the simpler naming of the porcelain commands.

Table 4.3 Plumbing Commands in Git

Command | Purpose
cat-file | Provide content or type and size information for repository objects.
commit-tree | Create a new commit object.
count-objects | Count an unpacked number of objects and their disk consumption.
diff-index | Compare a tree to the working tree or index.
for-each-ref | Output information on each ref.
hash-object | Compute object ID and optionally create a blob from a file.
ls-files | Show information about files in the index and the working tree.
merge-base | Find as good common ancestors as possible for a merge.
read-tree | Read tree information into the index.
rev-list | List commit objects in reverse chronological order.
rev-parse | Pick out and massage parameters.
show-ref | List references in a local repository.
symbolic-ref | Read, modify, and delete symbolic refs.
update-index | Register file contents in the working tree to the index.
update-ref | Update the object name stored in a ref safely.
verify-pack | Validate packed Git archive files.
write-tree | Create a tree object from the current index.

The descriptions for the commands in these tables are taken directly from the Git help. Some of the terms are more Git-specific at this point. However, as I use commands through the remainder of this book, I'll simplify their definitions and the terminology so it all makes sense. The point of this section is that unless you have a specific need to deep-dive into the repository, you can simply use the porcelain commands and accomplish what you need to in Git.

Specifying Arguments

Arguments supplied either to Git or to Git commands can be abbreviated as a single letter or spelled out as words.
One important note here is that if the argument is spelled out, you must precede it with two hyphens, as in --global. If the argument is abbreviated, only one hyphen is required, as in -a. Abbreviated arguments may be passed together, as in -am instead of -a -m. When arguments are combined in this way, the ordering is important. If the first argument requires a value, then the second argument may be taken as the required value instead of an additional argument. For example, with git commit, -am means -a followed by -m, whereas -ma would pass a as the value of -m.
Auto-complete
When you start typing a command or an argument to a command, Git has a helpful auto-completion feature (if enabled) that can do two things:
- Provide valid values for the commands or arguments that could complete the text you're typing, if there is more than one valid option.
- Automatically complete the command or argument that you're typing, if there is only one valid option.
Following are a couple of examples. The first one is for a command. If you type git c and then press the Tab key, nothing happens because there's more than one command that starts with c. If you press the Tab key a second time (before typing anything else in between), Git helpfully displays all of the commands that start with c. In this case, it also scrolls that list up and leaves you at a prompt where you can continue typing the chosen command.
$ git c
checkout      citool    commit
cherry        clean     config
cherry-pick   clone
$ git c
Here's another example, where you narrow the available commands with more letters.
$ git co
commit   config
$ git co
If you type enough letters to uniquely identify only one possible choice, then pressing the Tab key auto-completes the command for you because there's only one option. For example, git con yields git config. This also works for arguments to commands. Typing git config --l and pressing the Tab key twice gives the suggestions --list and --local. Typing git config --li and pressing the Tab key yields git config --list.
NOTE When attempting to use auto-complete for an option, make sure that you have started the option with the double-hyphen (--) syntax and not just a single hyphen.
Enabling Auto-complete If You Don't Have It
As noted earlier, auto-complete is already enabled in Git for Windows and some other distributions. For other versions (Linux, OS X) where it is not enabled, you can download scripts that implement this feature for different shells from https://github.com/git/git/tree/master/contrib/completion. Once you understand tools like git pull, you can use them to retrieve these scripts via Git. Until then, or as an alternate approach, a simple way is to click the desired script and then find the button labeled Raw on that page. Click that button to go to a web page with just the contents of that file. Then, you can download that script to your local system (through the browser) and reference it from the appropriate init file in your home directory, or place it in the appropriate directory for auto-completion for all users if your shell supports that.
Let's work through a quick example of how to install this feature for a bash environment. Here's the direct link for the raw version:
https://raw.githubusercontent.com/git/git/master/contrib/completion/git-completion.bash
After getting the raw version of the file, you can download that page as the file git-completion.bash to your local system. Once the script is downloaded (this example assumes you saved it in your home directory), you add a line like the following into your .bashrc file (create the file if needed):
source ~/git-completion.bash
To extend this functionality for all users, you'll need to find out where your particular OS stores and expects to find auto-completion scripts and put the downloaded file there. For most bash systems, there is a /etc/bash_completion.d directory where scripts like this can be stored to be loaded. If you're not sure where the location is, try searching for completion on your file system, or consult Google.
Auto-completion and the Windows Command Prompt
In the Windows command prompt, auto-complete functionality is not built in, and the method in the previous section doesn't work because it is based on a Linux script. However, there is a utility called clink that you can search for, download, and install on Windows that will provide command auto-completion for Git (as well as other functionality). The use is the same: suggestions or completion via the Tab key. Note, however, that this does not provide suggestions or auto-completion for arguments to the commands.
Now that you understand how to invoke Git commands and pass arguments, let's see how you can use this feature to accomplish one of the most basic and essential parts of using Git: configuration.
CONFIGURING GIT
To set configuration values in Git, you use the config command. Here's the syntax:
git config [<file-option>] [type] [--show-origin] [-z|--null] name [value [value_regex]]
git config [<file-option>] [type] --add name value
git config [<file-option>] [type] --replace-all name value [value_regex]
git config [<file-option>] [type] [--show-origin] [-z|--null] --get name [value_regex]
git config [<file-option>] [type] [--show-origin] [-z|--null] --get-all name [value_regex]
git config [<file-option>] [type] [--show-origin] [-z|--null] [--name-only] --get-regexp name_regex [value_regex]
git config [<file-option>] [type] [-z|--null] --get-urlmatch name URL
git config [<file-option>] --unset name [value_regex]
git config [<file-option>] --unset-all name [value_regex]
git config [<file-option>] --rename-section old_name new_name
git config [<file-option>] --remove-section name
git config [<file-option>] [--show-origin] [-z|--null] [--name-only] -l | --list
git config [<file-option>] --get-color name [default]
git config [<file-option>] --get-colorbool name [stdout-is-tty]
git config [<file-option>] -e | --edit
Now here's an example of the most common syntax:
$ git config --global user.name "Joe Gituser"
Let's dissect the various parts of this command. The first two pieces are simply issuing the config command from git.
After that is an option, --global (preceded by two hyphens because you are spelling it out). I'll be talking in more detail about this option shortly. Next comes the configuration setting that you're updating: user.name. Git uses a “.” notation to separate the two pieces of a configuration setting, in this case user and name. Think of this as setting the name value of the user section in the configuration. And finally, you have the actual value that you're assigning to this configuration setting. Notice that because you have spaces in the value, you need to enclose the entire string in quotes. Here's another example:
$ git config --global user.email Joe.Gituser@mailhost.com
One additional note: Git configuration settings are stored in text files. It is possible to change these settings by editing the associated text files, but this is highly discouraged because it's easy to make a mistake and also to accidentally modify other settings.
Telling Git Who You Are
Referring to the two earlier examples, one of the first things that you need to configure in Git is who you are, in terms of the username and e-mail address. Git expects you to set these two values, regardless of what interface or version of Git you use. This is because Git is a source management system. Because its purpose is to track changes by users over time, it wants to know who is making those changes so that it can record them. If you don't specify these values, then Git will interpolate them from the signed-on userid and machine name (user@system). Chances are this is not what you want to have the system ultimately use. If you forget to set these values initially on a new system, and commits are recorded with the interpolated values, there is a way to go back and correct this information, using the commit command with the --amend and --reset-author options.
The values can be set via the same commands as shown in the previous section: git config --global user.name <name> and git config --global user.email <e-mail address>.
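The correction described above can be sketched as follows. This is an illustration under stated assumptions, not the book's own listing: it works in a throwaway repository, sets the identity at the local level so that your real global configuration is left alone, and uses a deliberately wrong placeholder identity to stand in for values that Git would otherwise guess.

```shell
#!/usr/bin/env bash
# Sketch only: fix the author of the most recent commit after setting the identity.
set -eu
# Identity environment variables would override the configuration, so clear them here.
unset GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder identity standing in for an unwanted, guessed one.
git config user.name "wrong-user"
git config user.email "wrong-user@wrong-host"

echo "hello" > file1.txt
git add file1.txt
git commit -q -m "Add file1.txt"
before=$(git log -1 --format='%an <%ae>')

# Now tell Git who you really are (values from the chapter's examples).
git config user.name "Joe Gituser"
git config user.email "Joe.Gituser@mailhost.com"

# Rewrite the last commit so it picks up the corrected identity.
git commit -q --amend --reset-author --no-edit
after=$(git log -1 --format='%an <%ae>')

echo "before: $before"
echo "after:  $after"
```

Because --amend rewrites the commit, this is best done before the commit has been shared with anyone else.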
NOTE The e-mail address is not validated when you set it in Git. In fact, you can enter any e-mail address and Git will be happy. However, there is some advanced functionality in Git that uses this e-mail address. That functionality allows for tasks such as creating and sharing patches and zipped versions of changes. For that functionality, having a correct e-mail address is important. Also, there are other tools, such as Gerrit (a code-review tool built on top of Git), that heavily utilize the e-mail address and depend on it being correct.
Configuration Scope
In the previous examples, I used the --global option as part of the configuration step. The global option is a way of telling Git how broadly this configuration setting should be used, that is, which repositories it should apply to. Recall that the Git model is designed for many smaller repositories instead of fewer, monolithic ones. Because users may normally be working with multiple repositories, it would be inconvenient and subject to error to have to configure the same settings in each repository. As a result, Git provides options to simplify choosing the scope for configuration values. There are three levels available for configuration: system, global, and local.
System
Configuration at the system level means that a configuration value applies to all repositories on a given system (machine) unless it's overridden at a lower level. These settings apply regardless of the particular user. To ensure that a configuration value applies at the system level, you specify the --system option for the config command, as in git config --system core.autocrlf true. These settings are usually stored in a gitconfig file in an etc directory that depends on where Git was installed, such as /etc, /usr/etc, or /usr/local/etc. On a Windows system, if you're using Git for Windows, the system file is in C:\ProgramData\Git\config. In other systems, look in the directory where Git was installed.
Global Configuration at the global level implies that a configuration value applies to all of the repositories for a particular user, unless overridden at the local level. Unless you need repository-specific settings, this is the most common level for users to work with because it saves the effort of having to set values for each repository. An example of setting values at the global level would be the configuration I did earlier for user.name and user.email where the --global option was incorporated. These settings are stored in a file named .gitconfig in each user's home directory. Local Setting a configuration value at the local level means that the setting only applies in the context of that one repository. This can be useful in cases where you need to specify unique settings that are particular to one repository. It can also be useful if you need to temporarily override a higher-level setting. An example could be overriding the global end of line settings because content in a repository is targeted for a different platform. To update settings at this level, you can specify the --local option or just omit any of the local, global, or system options for the configuration. As an example of this last point, the following two commands are equivalent: git config --local core.autocrlf true and git config core.autocrlf true. The local repository's configuration is stored within the local Git repository, in .git/config (or in config under wherever your Git directory is configured to be.) These scope options (--local, --global, and --system) can be applied to other options and forms of the git config command to indicate the scope to be referenced for that command. Settings Hierarchy When determining what configuration setting to use, Git uses a particular search order to find these settings. First, it looks for a setting in the local repository configuration, then in the global configuration, and finally in the system configuration. 
If a specific value is found in that search order, then that value is used. Beyond that, the union of all of the levels (unique local + unique global + unique system) forms the superset of configuration values used when working with a repository. Figure 4.1 summarizes the different configuration scopes in Git and how to work with them.
Figure 4.1 Understanding the scopes of Git configuration files
Seeing Configuration Values
To see what value a particular configuration setting has, you can use git config as in git config user.name. Git then prints the value associated with that setting. Because I didn't specify one of the scope options (--system, --global, --local), Git first checks to see if there is a local setting, and if so, it displays that value. If there is no explicit local setting, then it looks for a global setting, and, if one is found, displays the global value. If there is no global setting specified, Git looks for a system setting and displays that value. This is an example of the search order that I outlined earlier.
You can also use the scope options to specifically direct the config command to a particular level, as I did when setting configuration values earlier. To better understand how this works at a practical level, consider the following sequence:
$ git config --global user.name "Global user"
$ git config user.name
This returns the value Global user because there was no local value defined; Git looked for a global setting and found this one. On the other hand, say you were to use this sequence:
$ git config user.name "Local user"
$ git config user.name
This returns the value Local user because the local option was implied in setting the value and thus it finds a local value defined.
Undoing a Configuration Setting
Occasionally, you may need to remove a user setting at a particular level.
Git provides the unset option for this, and it's pretty straightforward:
$ git config --unset <other options> <setting name>
Other options here would generally refer to one of the scope options. Continuing the earlier example,
$ git config --unset --global user.name
$ git config --global user.name
In this case, nothing is returned because I just removed this value.
Listing Configuration Settings
Another option related to viewing configuration values is --list. Supplying the list option to git config results in a list of all configuration settings being dumped. By default, this list includes local, global, and system settings without qualification. So, if you have both a local and global value for the same setting, you will see both.
$ git config --list
…
user.name=Global user
…
user.name=Local user
If the settings have the same values, this can be confusing (and potentially misleading) if you're not aware of the reasons behind it. To work around seeing these multiple values, you can refine the list by specifying one of the scope options.
$ git config --local --list
…
user.name=Local user
NOTE If you are ever unable to figure out where a particular configuration value is set, you can use the --show-origin option with the configuration setting name to figure it out. For example, if you run the command git config user.name "Joe Gituser" then git config --show-origin user.name shows this: file:.git/config Joe Gituser. This option can also be combined with the --list option to get a complete list of where all the settings are stored.
One-Shot Configuration
There is one additional way to set a configuration value: as a one-shot, one-time configuration for the current operation. This is done through one of the global options that can be passed to Git directly: -c. The format for this is git -c <setting>=<value> <command>. Notice that this format requires the “=” sign between the setting and the value. Using this option effectively creates an override for the duration of the current operation.
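The scope rules, --unset, --show-origin, and the one-shot -c override can be tried together in one short session. The following is a sketch under stated assumptions: it points HOME at a temporary directory so that the --global commands never touch your real ~/.gitconfig, and the variable names and the "One-shot user" value are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: explore configuration scopes without modifying your real settings.
set -eu
export HOME=$(mktemp -d)      # throwaway home, so --global writes go to a scratch file
unset XDG_CONFIG_HOME         # avoid picking up an alternate global config location
repo=$(mktemp -d)
cd "$repo"
git init -q

# Only a global value exists, so the lookup falls through to it.
git config --global user.name "Global user"
from_global=$(git config user.name)

# A local value (scope option omitted, so --local is implied) takes precedence.
git config user.name "Local user"
from_local=$(git config user.name)

# Show which file supplies the value that is currently in effect.
origin=$(git config --show-origin user.name)

# One-shot override: applies only to this single invocation.
one_shot=$(git -c user.name="One-shot user" config user.name)

# Remove the local value; the global value is visible again.
git config --unset user.name
after_local_unset=$(git config user.name)

# Remove the global value too; the lookup now prints nothing and exits non-zero.
git config --unset --global user.name
after_global_unset=$(git config --global user.name || true)

echo "global only:        $from_global"
echo "local set:          $from_local"
echo "origin:             $origin"
echo "one-shot:           $one_shot"
echo "after local unset:  $after_local_unset"
echo "after global unset: '$after_global_unset'"
```

The one-shot value never reaches any configuration file, which is why the reads that follow it return to the stored values.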
Now that you understand how configuration settings are specified and managed in Git, let's look at configuration for some of the most common settings and behaviors that users deal with.
NOTE To see a list of the different settings and values that can be configured, see the man page for git-config under the “Variables” section.
Default Editor
The default editor is primarily used when you need to type in a message while making a commit into the repository. If you don't supply the message in the command line when you do the commit, Git will bring up the default editor on your system to allow you to type one in. If you would rather use a different editor, you can use the following config command to specify which one to use: git config --global core.editor <editor path and arguments>. The --global option is not required, but most users want to use the same editor for all of their repositories. Here again, you can break down core.editor as the editor value in the core section of the configuration. If the editor is already in the path that Git knows about, then the path isn't required. Here are some examples of configuring editors:
$ git config core.editor vim (Linux)
$ git config --global core.editor "nano" (OS X)
c:\> git config core.editor "'C:\Program Files\windows nt\accessories\wordpad.exe'" (Windows)
$ git config --global core.editor "'C:/Program Files (x86)/Notepad++/notepad++.exe' -multiInst -noSession -notabbar" (Bash shell on Git for Windows)
Note the different uses for single quotes and double quotes in the respective examples. Also, in the last example, -multiInst, -noSession, and -notabbar are all options to Notepad++ to make it simpler to use. (-multiInst tells Notepad++ to allow multiple instances to run; -noSession tells it not to remember the session state, that is, not to load the last file you were working on; and -notabbar just avoids displaying the tabbed selection bar at the top.)
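If you want to confirm which editor Git will actually launch after changing this setting, one option is the git var command. The sketch below assumes a throwaway repository and sets the value at the local level; it also clears the GIT_EDITOR environment variable for the duration of the script, because that variable takes precedence over core.editor.

```shell
#!/usr/bin/env bash
# Sketch only: set core.editor locally and ask Git which editor it resolves to.
set -eu
unset GIT_EDITOR              # this environment variable would win over core.editor
repo=$(mktemp -d)
cd "$repo"
git init -q

git config core.editor "nano"         # local scope; add --global to apply everywhere
editor=$(git var GIT_EDITOR)          # reports the editor Git would invoke

echo "Git will use: $editor"
```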
NOTE If you are working on Windows and want to set up the default editor automatically, you can use a utility program called GitPad. You can download it from https://github.com/downloads/github/GitPad/Gitpad.zip. Once you run GitPad, it will set Git's default editor to whatever application is set to open files of type txt on Windows. By default, that is Notepad, but it can be changed on Windows (through the file associations) so that it is a different application. End of Line Settings Now, let's look at one of the key settings users need to manage with Git: handling end of line (EOL) values. Git manages the two types of line endings: carriage returns/line feeds (CRLF) for Windows and line feeds (LF) for OS X/Linux. In the context of Git, there are two options that are controlled by the EOL setting: How line endings are stored in content when it is committed into the repository How line endings are updated (or not) when content is checked out of the repository onto a local disk The first item refers to whether or not Git normalizes line endings in the repository. Normalizing refers to stripping out CRs and only storing files with LFs. For the second item, when content is checked out of Git, Git can update line endings in text files. This option allows you to specify whether or not Git updates line endings in files after checkout, and, if it does, which type it sets them to. At a user or repository level, how Git handles these options is controlled by a configuration setting named core.autocrlf. As before, the “.” is a separator, and you can think of the first part as the section of the configuration, and the second part as the specific value being set in that section. The crlf part here obviously stands for carriage return, line-feed—meaning the common EOL sequence for files on a Windows environment. The auto part refers to automatically inserting CRLF sequences in files when they are checked out. 
There are three possible values for the core.autocrlf setting:
- core.autocrlf=true. This value tells Git to normalize line endings to just LFs when storing files in the repository and to automatically insert CRLFs when files are checked out. If users are working on a Windows environment, this is the recommended value. It allows them to get CRLFs in files when checked out from Git, but doesn't store the CRs in the repository.
- core.autocrlf=input. This value tells Git to normalize line endings to just LFs when storing files in the repository but not to change anything when files are checked out. If users are working in a Unix environment, this is the recommended value because Unix expects just LFs.
- core.autocrlf=false. This default value tells Git not to change anything when files are being checked in or checked out. This is the primary value for the setting that can get users into trouble. Suppose you have two users working on code for the same repository, one in a Windows environment and one in a Unix environment. If both users have specified the core.autocrlf=false value in their configurations, then when they commit changes, the files from Windows will have CRLFs and those from Unix will have just LFs. If the respective users later each check out the other's files, then the files will have the wrong line endings for their system. For this reason, this value should not be used when mixed environments are being used in a project.
In general, it's a best practice to set the core.autocrlf setting to one of the values other than false, depending on which environment you're working in.
It should also be noted that there are other configuration settings that can contribute to how line endings are handled. However, these settings are more obscure and broader in terms of what they affect. Also, their default values generally work well for what most users need to do.
NOTE You cannot guarantee that everyone will have the appropriate core.autocrlf value set.
However, there is an alternative method for controlling line endings in a repository: the .gitattributes file. I will discuss this file in more detail in Chapter 10, but essentially, this is a metafile that tells Git how to handle certain operations and characteristics based on the file's type. One of these characteristics is line endings. The advantage of controlling line endings in a .gitattributes file rather than relying on the configuration settings is that the file can be checked in to the repository along with the files it handles. Additionally, this file can also be used to tell Git which file types are binary.
Aliases
Configuration in Git also supports the concept of configuring aliases for command strings. The format for defining an alias is git config <scope> alias.<name> <command>. In this context:
- <scope> can be one of --system, --global, or --local. (Or it can be omitted, to default to local.)
- <name> is the name you want to use for the alias. Once set, this can be used just like any other Git command.
- <command> is the string of a command and any arguments that the alias will substitute for.
There are two main reasons that aliases are convenient to create and use:
- To save typing frequently used strings of commands and arguments
- To create a more familiar command for a Git command
As an example of the first case, the git log command displays history in Git and has many options. Here's an example log command:
$ git log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short
Because this is a long command string, it can be difficult to type each time you want to use it. So, you can create an alias instead.
$ git config --global alias.hist "log --pretty=format:'%h %ad | %s%d [%an]' --graph --date=short"
Note that the alias value leaves out the leading git and is quoted so that it reaches git config as a single argument. With this alias in place, you can now just type git hist instead of the longer, complicated command.
As an example of the second use case, suppose a user is more accustomed to typing checkin from using other SCMs instead of commit.
If they want, they can create an alias by using git config --global alias.checkin commit. After this, the user can type git checkin instead of git commit. (Note that, while this sort of alias can be created, it is not recommended because it is not universal and obscures Git's native commands.) The format of the command to create an alias is consistent with the other config syntax. When you see alias., you can think of this as creating a value in the alias section of the configuration file. The alias information is stored in the configuration file for the specified scope. Windows Filesystem Cache The underlying filesystem layer on Windows is fundamentally different from the filesystem layer on Linux. Git's filesystem access is optimized for the Linux filesystem, and in the past, some operations on Git for Windows were noticeably slower. To compensate, Git has added a filesystem cache that does bulk reads of file system data and stores it in memory. This speeds up many operations, once the cache is initially populated. In recent versions of the install for Git for Windows, this option is turned on by default. To set it manually, you change the core.fscache value to true via git config --global core.fscache true. INITIALIZING A REPOSITORY Now that you understand how to configure the Git environment, I'll move on to setting up a local environment. Recall that a local environment consists of the three levels I discussed in the previous chapter: working directory, staging area, and local repository. There are two ways to get a local environment for use with Git: Creating a new environment from a set of existing files in a local directory, via the git init command Seeding a local environment by copying an existing remote repository, via the git clone command I'll discuss each of these methods in turn. Git Init The git init command is used for creating a new, empty Git repository in the local directory. The syntax for the command is shown below. 
git init [-q | --quiet] [--bare] [--template=<template_directory>] [--separate-git-dir <git dir>] [--shared[=<permissions>]] [directory]
When this command is run, a new subdirectory named .git is created in the directory where the command was run, and populated with a skeleton repository. (Like many open-source applications, Git stores metadata in a subdirectory named for the tool and preceded by a dot.) This local environment is now ready for tracking and storing new content. Note that this command can be run at any time in a directory that does not already have a Git environment associated with it to create one, no matter how many or what types of files are already in the directory. The basic syntax for invoking init is git init.
Before running git init, you should be at the top level of the tree you want to put under Git control. You also want to make sure this is done at an appropriate level of granularity. Recall that Git is intended to work with multiple, smaller repositories, not very large ones. So, running git init at your home directory level, for example, is not usually a good idea because this sets Git up to try and act on all files and subdirectories under your home directory for future operations, which is probably beyond the scope you intended.
Git Clone
Whereas the init command is used when you want to create a new, empty repository and begin adding content, the clone command is used to populate a local repository from an existing remote repository. The syntax for the command is shown below.
git clone [--template=<template_directory>] [-l] [-s] [--no-hardlinks] [-q] [-n] [--bare] [--mirror] [-o <name>] [-b <name>] [-u <upload-pack>] [--reference <repository>] [--dissociate] [--separate-git-dir <git dir>] [--depth <depth>] [--[no-]single-branch] [--recursive | --recurse-submodules] [--] <repository> [<directory>]
To use the clone command, you specify a remote repository location to clone from and Git does the following:
- Creates a local directory with the same name as the last component of the remote repository's path (minus any .git suffix)
- Within that directory, creates a .git subdirectory and copies the appropriate parts of the remote repository down to that .git directory
- Checks out the most recent version of a branch (usually the default master branch) into the local directory. This checked-out version with the flat files is what the user usually sees and works with immediately after the clone.
The basic syntax for cloning a repository is git clone <url> where <url> is a path to a remote repository. Here's an example:
$ git clone ssh://admin@gitserver.domain.com/path-to-repo.git
I will discuss this command more in Chapter 12.
What's in the Repository
Whether the local environment is created by a git init or git clone command, the structure within the .git subdirectory is the same. Essentially, a Git repository is a content-addressable data store. That is, you put something into the repository, Git calculates a hash value (SHA1) for the content, and you can then use that value later to get something out. Figure 4.2 shows an outline of a .git repository on disk.
Figure 4.2 Tree listing of a .git directory (local repository)
The HEAD file keeps track of which branch the user is currently working with. The description file is used by GitWeb, a browser application to view Git repositories. The config file is the local repository's configuration file. The objects directory, including its pack subdirectory, is where content is actually stored. You can find more information about the files and content stored in the local repository in the optional steps of Connected Lab 2.
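To see both ways of getting a local environment side by side, here is a small sketch under stated assumptions. It creates a repository with git init, shows that the identifier Git computes for a file's content is the same one under which the committed content is stored, and then clones that repository. A local directory stands in for the remote repository, and the directory names (origin-repo, cloned-repo) are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: init, content addressing, and clone, all inside a temporary directory.
set -eu
work=$(mktemp -d)
cd "$work"

# 1. Create a new, empty repository and look at the skeleton in .git.
git init -q origin-repo
cd origin-repo
git config user.name "Demo User"          # illustrative identity, local to this repo
git config user.email "demo@example.com"
ls .git                                   # HEAD, config, description, objects, and so on

# 2. Content addressing: compute the id for the content, then store it via add/commit.
echo "hello" > file1.txt
expected_id=$(git hash-object file1.txt)  # computes the id only; stores nothing yet
git add file1.txt
git commit -q -m "Add file1.txt"
stored_id=$(git rev-parse HEAD:file1.txt) # id of the object recorded for the file
content=$(git cat-file -p "$stored_id")   # use the id to get the content back out

# 3. Clone: a local path stands in for a remote URL in this sketch.
cd "$work"
git clone -q origin-repo cloned-repo
cloned_content=$(cat cloned-repo/file1.txt)
clone_has_git_dir=no
if [ -d cloned-repo/.git ]; then
  clone_has_git_dir=yes
fi

echo "computed id:  $expected_id"
echo "stored id:    $stored_id"
echo "content:      $content"
echo "clone has .git directory: $clone_has_git_dir"
```

The clone ends up with its own .git directory and a checked-out copy of file1.txt, matching the three steps listed above.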
ADVANCED TOPICS In this section, I'll look at several topics. The first is a quick note about how the init command works. Second is a further explanation about what's in a Git repository. The third is how Git config statements map to the text of the configuration files. Finally, I'll look at a way to create even more useful aliases that can have arguments passed to them and do multiple steps. While this information is not necessary for using Git, sometimes it's helpful to understand how Git works behind the scenes. The first two sections apply this approach to a couple of areas. Git Init Demystified If you're wondering how git init gets the initial content for the skeleton repository, the answer is that there's a template area containing the repository skeleton. This is installed when you install Git. If you're interested in looking at it, you can search for git-core on your filesystem in the area where you installed Git. On Windows, this is usually in a location such as C:\Program Files\Git\mingw64\share\git-core\templates (if you installed the Git for Windows package). On a Linux system, it may be in a location such as /usr/share/git-core/templates. On some installations, you may also see a contrib folder in the same area with items such as hooks that users have contributed over time that are now included as optional pieces that can be put in place as desired. I'll talk more about setting up hooks in Chapter 15. Running Git Init Twice on the Same Repository Running init twice may seem counterintuitive, but there are actually cases where it provides value. The good news is that it does not delete or modify any content that you have added or committed into the repository or your local configuration. It does update any changes to the subset of the templated files discussed previously. So what would be a use case to deliberately run init twice? 
Suppose you have multiple Git repositories on your system and you want to update a hook in all of them to provide some functionality, such as sending e-mails after a commit. You could update the hook in the templates area discussed earlier, and then do a git init on each of the repositories to get the updated hook put in place in each repository. Looking Further into a Git Repository As I've previously mentioned, a local Git repository is housed in the .git directory of the working directory. It is essentially a content-addressable database, meaning you supply a value (typically a SHA1) and you get content back out. Figure 4.3 shows the relationship and transformation of content from the working directory into the Git repository. Figure 4.3 Mapping files and directories to Git repositories Starting at the left side of the figure, files and directories first exist as normal OS items on the disk in the working directory. Git does not know anything about them. It does not track them until the user adds them to the staging area. Once they are tracked by Git, a new snapshot is created with metadata in the form of a commit record. Once committed, the pieces are stored in their respective areas in the underlying repository. As shown in the middle section of the figure, the pieces that Git stores are defined as one of three types: blob, tree, or commit. Blobs are essentially anonymous containers for files—anonymous in the sense that they don't contain filenames. Trees can be thought of as containers for directories that point to blobs for files and contain the filenames. Commits can be thought of as the header records with meta-information that Git uses for tracking. Internally, Git computes SHA1 checksums for each of these pieces and stores them referenced by those checksums. The checksums can be seen in the parts of the middle section and then in the tree view of the actual repository directory on the right side of the figure. 
As shown in that view, Git stores these internal objects in directories that start with the first two characters of the checksum. The filename is made up of the remaining characters. The files may be changed over time when certain events trigger Git to do further compression and rearrange content to efficiently store very similar versions. The only checksum that Git commands typically work with (and the only one you need to be concerned with) is the checksum that is specifically associated with the commit record, not the ones for trees or blobs. By referencing that one checksum for the commit record, Git pulls in the underlying tree and blob content.

Once you actually have a repository with content stored in it, you can change into the repository directories to see the stored objects or use this shortcut (on Linux systems): find .git/objects -type f. From there, you can use the cat-file plumbing command to examine objects. As an example, git cat-file -p tells Git to figure out the type of object and neatly display its contents. A similar command, git cat-file -t, returns the type of an object: commit, tree, or blob.

Connected Lab 2 contains several optional steps that you can work through to understand what's happening in the underlying repository during an init, add, and commit sequence. It also further explains the files that are in the underlying repository tree at various points.

Mapping Config Commands to Configuration Files

In this chapter, I described the various configuration files that Git uses, as well as how to set values via the git config command. If you see something in a config file that you want to emulate or use, it can be helpful to understand how the config commands map to the file structure. This section will explain that. Suppose you configure a two-part value such as user.name in your local configuration with a command like git config --local user.name "Git User".
This translates into setting the name value of the user section, written into the .git/config file as follows:

    [user]
        name = Git User

If you need to configure a given value for a named section, you can use a three-part value such as the following:

    $ git config --local remote.myremote.url http://github.com/brentlaster/calc2
    $ cat .git/config
    …
    [remote "myremote"]
        url = http://github.com/brentlaster/calc2

Anything beyond three parts is still treated as three parts, with the extra pieces at the front just made part of the named section.

    $ git config --local remote.myremote.new.url http://github.com/brentlaster/calc2
    $ cat .git/config
    …
    [remote "myremote.new"]
        url = http://github.com/brentlaster/calc2

Note that the git config operation also takes a --file option instead of --local, --global, or --system. This allows for writing configuration options to a file in a different location, such as for test purposes.

    $ git config --file test.config remote.myremote.test http://github.com/brentlaster/calc2git
    $ cat test.config
    [remote "myremote"]
        test = http://github.com/brentlaster/calc2git

As one last tip, git config includes a --get-regexp option to find configuration values matching a specified pattern. I'll use this option in the next section so you can see how it works.

Creating Parameterized Aliases

Earlier in this chapter, I showed how to create simple aliases for specific Git command lines, such as git config alias.ci commit. It is certainly useful to alias fixed command strings, but only to the extent that arguments and options included in the alias never change. What if you want to create an alias that takes a parameter that is not normally part of a command? Or that may change over time? Or that may perform extra steps or processing, especially with system commands? As it turns out, on Linux systems you can do this with Git fairly easily. You just need to have your alias string take this form:

    "! f() { do some processing }; f"

The !
at the beginning tells Git you are going to the shell. The “f() {}; f” allows you to define a function as part of the alias and then run that function when the alias is invoked. Values that you pass in as arguments are treated as positional parameters (for example, $1, $2, and so on). When including these parameters in the alias definition, a backslash needs to precede the $, as in “\$”. This is to ensure the parameter is included as part of the definition and not interpreted when you are defining the alias.

Let's work through a couple of examples. First, I'll create a simple alias that takes an argument and lists out any matching global and local settings prefixed by an appropriate header for each section. The config command is used in this example. What I am doing in this command line is defining a local alias named scopelist, which does the following:

1. Echoes out a global settings header
2. Uses git config's --get-regexp option with a global qualifier to search for the value that is passed in
3. Echoes out a local settings header
4. Uses git config's --get-regexp option with a local qualifier to search for the value that is passed in

Here's the command. (Pay attention to the quotes, semicolons, double hyphens, and backslashes.)

    $ git config --local alias.scopelist "! f() { echo 'global settings'; git config --global --get-regexp \$1; echo 'local settings'; git config --local --get-regexp \$1; }; f"

Here is an example of running the alias:

    $ git scopelist name
    global settings
    user.name Git User (global)
    local settings
    user.name Git User (local)

The following example will show you a simple way to dump out the contents of a particular scope into a file. This illustrates having two positional parameters. In this case, the alias will do the following:

1. Echo out a header.
2. Issue a git config command at the appropriate scope.
3. Dump the values from step 2 into a separate file.

Here's the command to define this alias.
(Again, pay attention to the punctuation characters that are used.)

    $ git config --local alias.dumpvalues "! f() { echo 'copying config' \$1; git config --list --\$1 > \$2; }; f"

Here is an example of running this alias and looking at the results:

    $ git dumpvalues global global_values.out
    copying config global
    $ cat global_values.out
    alias.hist=log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short
    push.default=simple
    core.autocrlf=false
    core.editor='C:/Program Files (x86)/Notepad++/notepad++.exe' -multiInst -noSession -notabbar
    gitreview.remote=origin
    user.name=Git User (global)
    user.email=Git.User@domain.com

Obviously, these examples don't cover all the possibilities of bad or missing input. However, they'll give you an idea of how to use this functionality if you ever need it.

SUMMARY

In this chapter, I discussed the form and structure of Git commands and related topics such as auto-completion. I introduced basic configuration for Git and described how to create local environments. I covered the different scope of configuration settings you can use and how to specify values for each scope. I also covered how to create aliases to simplify interacting with Git. I then described the two different ways to create local environments with Git: initializing a new environment from existing files or cloning down an existing repository. Finally, I offered a brief description of what's inside a .git repository.

In the section on advanced topics, you took a closer look at how the init command works, the contents of a Git repository, and how to look at individual objects. Then you learned how configuration commands map to the actual configuration text files. Finally, you saw how to create advanced aliases that can run operating system commands and allow you to work with positional parameters.

In the next chapter, you'll start putting content into Git and go over the commands to start promoting it up through the levels.
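As a recap of the mapping between config keys and configuration file text covered in the advanced topics, here is a rough sketch in shell. It splits a dotted key so that the first piece is the section, the last piece is the value name, and anything in between becomes the named section. The function render_config_entry is a hypothetical helper written only for illustration; it does not read or write any real configuration file, and it ignores details such as quoting of special characters.

```shell
#!/bin/sh
# Minimal sketch: show the config-file text that a dotted key maps to.
# render_config_entry is a hypothetical helper, not a Git command.
render_config_entry() {
    key=$1
    value=$2
    section=${key%%.*}      # everything before the first dot
    name=${key##*.}         # everything after the last dot
    rest=${key#*.}          # the key with the section removed
    if [ "$rest" = "$name" ]; then
        # Two-part key such as user.name: plain section header.
        printf '[%s]\n' "$section"
    else
        # Three or more parts: the middle pieces form the named section.
        printf '[%s "%s"]\n' "$section" "${rest%.*}"
    fi
    printf '\t%s = %s\n' "$name" "$value"
}

render_config_entry user.name "Git User"
render_config_entry remote.myremote.url http://github.com/brentlaster/calc2
render_config_entry remote.myremote.new.url http://github.com/brentlaster/calc2
```

The three calls reproduce the [user], [remote "myremote"], and [remote "myremote.new"] entries shown in the chapter.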
Chapter 5
Getting Productive

WHAT’S IN THIS CHAPTER?
Getting and working with help in Git
Understanding the multiple repositories model
Staging files
Partial and interactive staging
Committing files into the local repository
Writing good commit messages

Now that you understand the Git workflow, how to create a repository, and how to configure the local environment, I’ll show you how to use Git to start tracking and managing content. I’ll also further explain concepts such as SHA1, options for staging files, and forming good commit messages. First, though, I’ll discuss something that both new and experienced users need to know: how to get help.

Getting Help

Git includes two different forms of help: an abbreviated version and a full version. The abbreviated version is a short list of options with brief explanations that display one per line on the terminal screen. It is invoked by using the -h option after the command, as in git commit -h. This is useful when you just need a quick reminder of what options are available or how to specify a particular option. Figure 5.1 shows an example of abbreviated on-screen help.

Figure 5.1 Abbreviated version of help invoked with the -h option

The full version is the man page for the command, which opens up in a browser on some systems. It is invoked by using one of two forms: either adding a --help after a command or using the help command itself, as in git commit --help or git help commit. With either of these forms, you have access to the full documentation on the command and all its options, with explanations and some examples, via the man page. This is useful when you need to understand more about an option or command.

The format for the help command is as follows:

    git help [-a|--all] [-g|--guide] [-i|--info|-m|--man|-w|--web] [COMMAND|GUIDE]

The guide part of this command refers to some brief but helpful documentation on different aspects of using Git that you can select through help.
For example, here’s a command to display a built-in guide that is a glossary: git help glossary. You can use the command git help -g to get a list of all available built-in guides. (Be aware that some of these guides might be out of date.) The remaining options have to do with whether the help is displayed as a web page, man page, and so on. You can specify the format to use by setting the help.format setting, for example with the commands git config --global help.format man or git config --global help.format web.

USING WEB-BASED HELP ON OS X

If you are trying to get web-based help working on OS X, you may be running into a problem where the help files are always presented as man pages. Setting help.format to web on OS X will sometimes return the following error:

    ‘/usr/local/git/share/doc/git-doc’ : Not a documentation directory

To fix this, go to /usr/local/git/share. Create a doc subdirectory and change into it. Issue the following clone command to populate the area with the necessary files:

    $ sudo git clone git://git.kernel.org/pub/scm/git/git-htmldocs.git git-doc

Then set the help.format value to web and try again.

Figure 5.2 shows part of a web man page for one Git command.

Figure 5.2 Git browser-based man page

Of course, you can always Google a particular command or option to find out more about it.

The Multiple Repositories Model

In Chapter 2, you explored several key design considerations for repositories, including repository scope and file scope. The factors I discussed there explained why Git works best with multiple, smaller repositories rather than larger, monolithic ones. With this model, it is common to have each of the modules of your project housed in a different Git repository. As a result, you may need to clone several different repositories to get everything you need to work with locally. Each repository ends up in a separate directory tree on your disk.
Likewise, if you’re starting a new project, you may be creating new modules that are each targeted for a separate Git repository. As I discussed in Chapter 4, the git init command is used for creating new repositories, one per directory tree. Although working with multiple repositories at the same time is common in Git, it is a different way of working for most people. Figure 5.3 shows a diagram that represents these kinds of scenarios. Here, some repositories are newly created by the init command, and some are cloned down from existing remote repositories. Notice that each repository is housed in a separate working directory where the actual repository is physically stored in the .git subdirectory tree within that directory. Figure 5.3 Working with multiple repositories In Chapter 4, I also talked about configuration for Git and the different levels: system, global, and local. To illustrate how that would work in a multiple repository model, you could group these directories into multiple users on the system with configuration files at the appropriate levels. Figure 5.4 shows one possible organization. Note that each repository has its own local configuration as represented by the document icons in each directory. Further, each user has their own global configuration (for all of their repositories) as represented by the document icons in the user sections. Finally, there is one system configuration (for all users) as represented by the document icon next to the System title. Figure 5.4 Overlaying configuration files on your model Adding Content to Track—Add I’ve already talked about adding content to Git with the add command. The dark arrow in Figure 5.5 reminds you where adding and staging fits within the overall promotion model workflow. Figure 5.5 Where adding and staging fit in It’s worth spending a moment here to discuss what I mean by three related terms: tracking, staging, and adding. Tracking refers to having Git control and monitor a file. 
The first step in getting Git to track a file is staging it. Here, staging means that you tell Git to take the latest change from your working directory and put it in the staging area. You do this using the Git add command. This is why I sometimes refer to staging a file as adding a file and vice versa. Another important point is that whether you are staging a completely new file that is not currently tracked by Git, or staging an update to a file already tracked by Git, you still use the add command. Think of it as always adding content into Git.

Staging Scope

As I discussed in Chapter 3, one of the purposes of the staging area is to allow you to build up a complete set of content to commit as a unit into the local repository. When it is done in stages like this, a user may be staging only a subset of eligible files at a time; some files may not be ready. As an example, I could do git add file1 followed by git add file2 followed by git add file*. For users who don’t choose to use the staging area in this way, it is more common to just stage everything that is eligible. The command git add . does this for you. (Note that “.” is a required part of the command here.) You can also supply a pattern to select groups of files from the directory structure, as with a command like git add *.c that selects only files with a “.c” extension.

By everything that is eligible above, I meant all files that are new or updated AND not ignored. New or updated is self-explanatory. Not ignored requires further explanation.

Ignoring Files

Typically, when working in a local directory tree on a project, there is some subset of files that you don’t want (or need) the source management system to track. Examples include those files I talked about in Chapter 2: generated output files that should be re-created from the source each time, or external dependencies that are stored and managed in another system (such as an artifact repository).
To tell Git to ignore certain files (meaning not to track them), you just need to list them in a Git ignore file. This is a text file named .gitignore that is placed at the root (top level directory) of the local environment. If this file exists locally, Git will read it and ignore files and directories that match the names and patterns specified within it. The Git ignore file is covered in more detail in Chapter 10. While not strictly required, having a Git ignore file is considered a best practice for any project managed by Git. Partial Staging Before you begin, this section outlines functionality that can be useful but is not required for using Git. If you are only interested in basic staging of files, you may want to skip over this topic for now. On the opposite end of the spectrum from staging all eligible files or sets of files, Git includes an option that allows for partial staging. This means choosing to take selected changes from a file, but not necessarily all of them. You can use the -p option to do this, as in git add -p . This command tells Git to treat the changes in any file being staged as one or more separate hunks. Here, a hunk is a change to a set of lines that is separated from other hunks by a set of unchanged lines. The number of hunks also depends somewhat on the size of the file. For small files, even those with several changes, Git may present the entire set of differences as a single hunk. NOTE If you try to do a partial add for a file that has not been added to Git previously, Git will tell you there are no changes. You need to add a copy of the file into Git first in the standard way (not as a patch) so that there is a base there to patch against. Through an interface that Git presents, users can choose which hunks they want to have staged and which they don’t, as well as other functionality. The interface will show the first hunk of the file, followed by a prompt. 
Here’s a simple example of output from the add with -p option:

    diff --git a/file b/file
    index SHA1..SHA1 filemode
    --- a/file
    +++ b/file
    @@ -1,7 +1,7 @@
     line 1
     line 2
     line 3
    -line4
    +line 4
     line 5
     line 6
     line 7
    Stage this hunk [y,n,q,a,d,/,s,e,?]?

What do you need to know from this? It is essentially a diff between the version in Git and the version in the working directory. These are represented as a and b in the header. The line, “@@ -1,7 +1,7 @@”, describes the range of differences for the two files. You can think of that line like this: Before the changes in this hunk, designated by the “-”, starting at line 1, you had 7 lines. After applying the changes in this hunk, designated by the “+”, there should be 7 lines.

In the actual listing, lines that are added show up with a “+” in front of them. Lines that are deleted show up with a “-” in front of them. In this particular case, I modified the same line, but here, Git shows it as one line being removed in the original version and another line being added in the new version. As a result, the before and after line counts are the same.

Now that you know how to interpret the hunk, you can decide what to do with it. If you select ? (or an option that isn’t supported), Git will display the meaning of the different available subcommands as follows:

    y - stage this hunk
    n - do not stage this hunk
    q - quit; do not stage this hunk or any of the remaining ones
    a - stage this hunk and all later hunks in the file
    d - do not stage this hunk or any of the later hunks in the file
    g - select a hunk to go to
    / - search for a hunk matching the given regex
    j - leave this hunk undecided, see next undecided hunk
    J - leave this hunk undecided, see next hunk
    k - leave this hunk undecided, see previous undecided hunk
    K - leave this hunk undecided, see previous hunk
    s - split the current hunk into smaller hunks
    e - manually edit the current hunk
    ? - print help

Let’s look at a couple of the most useful subcommands here.
As implied by the help text, y tells Git to stage this hunk. This means that this portion of the file’s changes will be staged. Likewise, selecting n means that this portion of the file’s changes will not be staged. Essentially, you are selecting which changes you want to take from the file or files in your working directory and stage for a future commit into the repository.

Most of the other subcommands are for doing bulk operations with hunks or navigating around the set of hunks. If you select g and have multiple hunks, Git presents you with a list of the available hunks identified by number and allows you to select which one you want to work with next. If you type a “/” and specify text found in the file, Git will jump you to the hunk with that text.

Two other subcommands of the patch staging interface are s for split and e for edit. I’ll briefly discuss the use of each one.

The split subcommand tells Git to split the current hunk into smaller, separate hunks during an add operation with the patch option. This is useful if you have a fairly small file and Git presents it initially as one single hunk. Note that if you do not see an s in the prompt list, this means that Git has already split it down as small as it reasonably can. This subcommand can be useful to let you get finer-grained control to stage or not stage smaller changes instead of having to try and deal with one big change.

Editing a hunk allows you to modify the lines within it. When you choose this option, Git brings up the configured editor with the hunk in the patch format. The idea is to make your edits, save the file, and exit the editor. Each line of a hunk is indented one space in the editor. The first column is used as a way to specify the changes to make. Based on the existing changes between the two versions of a file, lines to be added have a “+” in the first column and lines to be deleted have a “-” in the first column.
To remove one of these lines, the built-in help suggests deleting the line if it has a “+” or changing the “-” to a “ ” if you want to remove a line starting with a “-”. Other changes can be made in the patch, but they will increase the probability of the problems I’ll talk about next. Figure 5.6 shows an example of a session for editing a hunk. Figure 5.6 An edit session for a hunk The Problems with Editing Hunks Editing hunks via the Git command line is not recommended for beginners. The reason for this is that you are essentially editing a patch to be applied against a file. However, this patch is based on a starting place in the file and an expected number of lines (that is, the information between the @@ signs in the header). It is very easy to make an edit that will cause the patch to not align with the starting line and the expected number of lines. When that happens, the patch will not apply. After you exit the editor, you will see a message that says something like this: “Your edited hunk does not apply. Edit again (saying “no” discards!) [y/n]?”. This message may also be accompanied by an equally dubious one such as this: “fatal: corrupt patch at line ##”. This means that some change you made in the editor caused the patch (this hunk) to not be able to merge into the rest of the file. This is an easy state to get into and a hard state to get out of, especially because modifications in an earlier patch can affect the expected starting line and line counts for later patches. To make this all work from the command line in all but the simplest cases requires some calculations on where a particular patch should start, the number of lines affected, and so on. A better option is to stage those hunks that are ready, and not stage the ones that need further edits. You then edit the entire file in an editor, make the edits as needed, and stage those updated changes. 
(If needed and available, the split subcommand can further reduce the size of hunks before doing this.) NOTE While many operations in the Git command line provide increased functionality versus doing the operation in a GUI, selectively editing and staging parts of a file can be simplified using a GUI interface. In this kind of interface, users can often select and update content without having to worry about the line numbers and relative locations typically associated with patches. Interactive Staging of Commits There is one more variant of the staging (add) and commit functions that is available to users: interactive staging. This option presents a different command line interface that lists the various files and available staging functions and assigns a letter or number to each one. You then choose content and perform operations by entering the corresponding letters or numbers at an interactive prompt. To invoke this function, you must add the --interactive option at the time you execute the command. Here are some examples: $ git add --interactive $ git add --interactive *.c $ git commit -m "update" --interactive $ git commit --interactive $ git commit --interactive -m "my change" file1.java In short, you can add the --interactive option on any add or commit command line to use this interface. The interface actually performs the same function whether you are running it as part of an add command or a commit command—it allows control over what is in the staging area using a more concise interface. As a brief example of how the interface works, consider a case where you have a new Git repository with three files (file1.txt, file2.txt, and file3.txt) that have not yet been added to Git. In this state, the files are called untracked files (more on that in the next chapter). Now if you run the add command with the interactive option, you are presented with the interactive listing and prompt. 
$ git add --interactive *** Commands *** 1: status 2: update 3: revert 4: add untracked 5: patch 6: diff 7: quit 8: help What now> Notice that the prompt is asking what you want to do now. You indicate which operation by entering either the number or the first letter (highlighted) of the command from the listing. In this case, you’ll add (stage) some of the currently untracked files. To do this, you start the operation by choosing 4 or a. What now> a 1: file1.txt 2: file2.txt 3: file3.txt Add untracked>> You’re presented with a list of the untracked files in the directory. Each file has been assigned a number by which you can refer to it. In this case, you’ll add (stage) files 1 and 3. You could do this via two separate inputs, or via a comma-separated list. Here, you’ll use the latter format. Add untracked>> 1,3 * 1: file1.txt 2: file2.txt * 3: file3.txt After you do this, Git tells you that you’ve staged the two files by putting the “*” in front of their names. Because you’re done with this command, you can just press Enter/Return with nothing after the prompt to return to the main prompt. Git tells you that two paths (files) were added. Add untracked>> added 2 paths *** Commands *** 1: status 2: update 3: revert 4: add untracked 5: patch 6: diff 7: quit 8: help What now> If you now choose the status command, Git displays in this concise format what you have in the staging area and how it relates to what you have in your working directory. What now> s staged unstaged path 1: +1/-0 nothing file1.txt 2: +1/-0 nothing file3.txt *** Commands *** 1: status 2: update 3: revert 4: add untracked 5: patch 6: diff 7: quit 8: help Let’s take a closer look at how to read this status for the first file. staged unstaged path 1: +1/-0 nothing file1.txt The number in front (1) is just an identifier that you can use to reference this item in the staging area if you update it further using this interface. 
The numbers under staged represent the number of lines added since you started staging this file and the number deleted. In this instance, file1.txt only contained one line, so you see one line added and zero lines deleted. Under unstaged you see nothing, which, of course, indicates that nothing is unstaged. Think of this as what’s different between the staging area and the working directory or what’s new in the working directory for this file. Because you don’t have any changes in the file in the working directory that aren’t staged, the version in the working directory and the version in the staging area are the same, so nothing is different. If there were differences, they would be in the same +/- format as used for the staged column. Finally you have the path name, which, in this case, is just the filename. Now, suppose you add a line in your working directory to file1.txt so that it has two lines instead of one. (This would be done outside of the interactive interface.) If you want to see what’s different between the version you have staged and the updated one, you can use the diff command here. What now> d staged unstaged path 1: +1/-0 +1/-0 file1.txt 2: +1/-0 nothing file3.txt Review diff>> You get a summary status. Notice that the unstaged section now shows +1/-0 because the staged and unstaged versions in the directory are different. The way to read this is that in the unstaged version of the file, one new line has been added (which I did previously) and no lines deleted. Your prompt has also changed to be relative to the command you selected and to allow you to choose which file you want to diff further (if you do). 
If you want to look at the actual diff for the file you changed, you can input 1 and get output like the following: Review diff>> 1 diff --git a/file1.txt b/file1.txt new file mode 100644 index 0000000..257cc56 --- /dev/null +++ b/file1.txt @@ -0,0 +1 @@ +newline *** Commands *** 1: status 2: update 3: revert 4: add untracked 5: patch 6: diff 7: quit 8: help What now> This is the same type of patch format that I talked about earlier in the section, “Partial Staging.” Now, to get your updated content in the staging area, you can use the update command. The workflow will be as it was for the other commands. 1. You will get the same kind of list of what is eligible to update. 2. You can then select the number that corresponds to the item you want to update and you’ll get the “*” marker to indicate it was done. 3. You can then just press Enter/Return to exit the update mode. The sequence looks like this: What now> 2 staged unstaged path 1: +1/-0 +1/-0 file1.txt Update>> 1 staged unstaged path * 1: +1/-0 +1/-0 file1.txt Update>> updated one path *** Commands *** 1: status 2: update 3: revert 4: add untracked 5: patch 6: diff 7: quit 8: help If you now take a look at the status after this update, you’ll see the following: What now> s staged unstaged path 1: +2/-0 nothing file1.txt 2: +1/-0 nothing file3.txt Note that you have two lines added for the file since you started staging it. Also, you are back to nothing unstaged because all of the changes made in the working directory have been added to the staging area. Lastly, if you decide you want to unstage a set of changes, you can use the revert command to do so. The sequence is the same as for the others: select the command, select the file, and the operation is executed. 
    *** Commands ***
      1: status       2: update       3: revert       4: add untracked
      5: patch        6: diff         7: quit         8: help
    What now> 3
               staged     unstaged path
      1:        +2/-0      nothing file1.txt
      2:        +1/-0      nothing file3.txt
    Revert>> 1
               staged     unstaged path
    * 1:        +2/-0      nothing file1.txt
      2:        +1/-0      nothing file3.txt
    Revert>>
    rm 'file1.txt'
    reverted one path

A status command now shows only the one file remaining in the staging area.

    *** Commands ***
      1: status       2: update       3: revert       4: add untracked
      5: patch        6: diff         7: quit         8: help
    What now> s
               staged     unstaged path
      1:        +1/-0      nothing file3.txt

For the remaining commands, patch launches a similar workflow that allows for partial staging (as described in the section, “Partial Staging”). Also, as the name implies, help provides a quick summary of what the main commands do.

Once you quit the interactive staging process, Git provides a brief status summary.

    What now> q
    Bye.
    [master 6a43d7e] update file3.txt
     1 file changed, 3 insertions(+)

SUMMARY OF THE INTERACTIVE STAGING WORKFLOW

Start the interactive option: git add --interactive (or git add -i)
A list of available commands appears, along with unique numbers to select each of them. You can also use the first letter of the command to select it.
A What now> prompt appears for input.
At the prompt, enter the letter or number corresponding to the command you want to use.
The files that you can choose to operate on appear in a list, with a number to identify each one. The prompt changes to reflect the current operation.
Select the files you want to work with by entering the individual numbers, or a comma-separated list for multiple ones, or a range of numbers separated by a hyphen.
The operation takes place on those files.
Repeat for any other files to which you want to apply the same operation.
When done with the operation, press Enter/Return (without any line numbers) at the operation prompt to return to the main What now> prompt.
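The selection forms described in the summary (single numbers, comma-separated lists, and hyphenated ranges) can be made concrete with a small Python sketch. This is not Git's own code. The function name parse_selection and the way it reports bad input are assumptions made purely for illustration, and the sketch covers only the forms mentioned above.

```python
def parse_selection(text, item_count):
    """Expand a selection such as "1,3-5" into a sorted list of item numbers.

    Hypothetical helper: it mimics only the selection forms described in
    the summary and rejects numbers outside 1..item_count.
    """
    chosen = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate an empty entry, such as pressing Enter alone
        if "-" in part:
            # A hyphen separates the first and last numbers of a range.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > item_count or start > end:
            raise ValueError(f"selection out of range: {part!r}")
        chosen.update(range(start, end + 1))
    return sorted(chosen)


if __name__ == "__main__":
    # With five files listed, select file 1 plus files 3 through 5.
    print(parse_selection("1,3-5", 5))
```

An empty entry yields an empty selection, which loosely mirrors pressing Enter/Return at the operation prompt to finish without choosing anything further.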
Bypassing the Staging Area

In Chapter 3, I discussed the various uses and reasons for the staging area as a separate level in the Git promotion model. However, if you don’t need to have your changes staged as a separate step in the process, there is a shortcut that Git provides, although it is qualified. The shortcut is to use the -am option on the command line when doing a commit, as in git commit -am "comment". I’ll talk more about the commit operation shortly, but the -am option effectively tells Git to stage and commit the updated content in one operation. It’s a nice convenience when you don’t need to hold the change in the staging area for any reason.

COMBINING OPTIONS IN GIT

I mentioned in Chapter 4 that options can be supplied to Git commands either spelled out completely (and preceded by two dashes) or abbreviated by their first letter (and preceded by a single dash), for example, --all versus -a. The abbreviated forms of options can be combined where it makes sense.

As I mentioned, you can use -am to add and commit new versions of files. Here, -am is a contraction of the -a and -m options: -a is the short version of --all, an option that tells Git to stage all eligible changes before doing the commit, and -m (short for --message) is used to supply the message or comment for the commit.

Combining the options in this way works because -a does not take an argument, while -m does. As a result, -a is interpreted as a standalone option. Trying to combine the options in the reverse manner (-ma) does not work. This is because -m expects an argument (the commit message/comment). Specifying -ma causes Git to interpret the a part as the expected argument to -m, so Git takes the commit message/comment for this operation to be a, which is not what you intended.

So, combining abbreviated options works in Git as long as the option (or options) before the last one does not take arguments.
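The difference between -am and -ma can be seen in a short Python sketch. Git uses its own option parser, so this is only an analogy: Python's standard getopt module follows the same convention for clustered short options, and the option string "am:" is an assumption chosen to mirror the description above (-a takes no argument, -m takes one).

```python
import getopt

# "am:" declares -a as a plain flag and -m as an option that requires an
# argument (the colon), mirroring the -a and -m options described above.
SHORT_OPTIONS = "am:"


def parse(argv):
    """Return (parsed options, leftover arguments) for a command line."""
    return getopt.getopt(argv, SHORT_OPTIONS)


if __name__ == "__main__":
    # -am "fix typo": -a stands alone, and -m takes the next word as its value.
    print(parse(["-am", "fix typo"]))
    # -ma "fix typo": -m consumes the "a" as its value, and the intended
    # message is left over as an ordinary argument.
    print(parse(["-ma", "fix typo"]))
```

In the second call, the parser reports a message of a and treats the real message as a leftover argument, which is the same kind of misreading the sidebar warns about.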
The one caveat with the -am shortcut is that it does not work for new (untracked) content or files. The first time a file is added to Git, it must be staged with the git add command first.

Some IDEs also provide a shortcut for doing the add and commit for files in their projects, for example, being able to drag and drop content to add and commit in one step.

Finalizing Changes—Commit

After content is staged, the next step is the commit into the local repository. This is done with the commit command. The syntax is shown below.

    git commit [-a | --interactive | --patch] [-s] [-v] [-u<mode>] [--amend]
               [--dry-run] [(-c | -C | --fixup | --squash) <commit>]
               [-F <file> | -m <msg>] [--reset-author] [--allow-empty]
               [--allow-empty-message] [--no-verify] [-e] [--author=<author>]
               [--date=<date>] [--cleanup=<mode>] [--[no-]status]
               [-i | -o] [-S[<keyid>]] [--] [<file>…]

The dark arrow in Figure 5.7 reminds you where you are in the overall promotion model.

Figure 5.7 Where commit fits in

You can think of the commit action here as committing to make the change permanent. Committing always operates by promoting content from the staging area. (Even if you use the shortcut on the commit command noted in the previous section, you are not bypassing the staging area; you’re just moving content from the working directory to the staging area and then committing it with one command.)

A key point to remember is that a commit commits changes into the local repository only. Nothing gets updated or changed in the remote repository. As I noted earlier, and as indicated in the promotion model figures, there are entirely separate commands for synchronizing content with the remote repository (discussed in Chapter 13). So, none of the changes the user commits will show up in the remote repository until those other commands are used to push them over. They are two different and distinct environments.

Prerequisites

In addition to having content in the staging area, it’s best to have the username and user e-mail configured as discussed in Chapter 4.
As a reminder, the commands to configure these settings on the command line are git config --global user.name "Your Name" and git config --global user.email <email address>. If you don’t do this, then Git will attempt to figure out who you are based on the logged-in userid and the system name. If it can’t, you’ll be forced to set these values then. If you don’t explicitly set them, then you may see a message like the following one after your first commit:

    [master sha1] comment
     Committer: username
    Your name and email address were configured automatically based
    on your username and hostname. Please check that they are accurate.

You can suppress this message by setting the values explicitly with the commands I reminded you about above. After doing this, you may fix the identity used for this commit as described in the section “Resetting the Author Information” later in this chapter.

Commit Scope

The most common form of the commit command is git commit -m "<message>". Here, -m is the abbreviated form of the --message option. Git requires a message (also referred to as a comment) when doing a commit. If the commit message has spaces, it must be enclosed in quotes.

In this form, without any specific set of files or content specified, Git takes everything in the staging area and commits it. Most of the time this is what you want. (Again, the general idea is to build up a set of content in the staging area that should be committed as a unit.) However, it is also possible to commit only a selected set of content, as in git commit -m "<message>" file1.c or git commit -m "<message>" *.c.

Putting It All Together

Figure 5.8 provides a visual way to think about the add and commit workflow. This is not exactly how things happen internally, but it is a convenient way to think about the overall process.

Figure 5.8 The basic workflow for multiple commits

In part A, you start out with your local stack: local repository, staging area, and working directory. The working directory contains three files.
In part B, you specifically stage (add) one of the files, moving it into the staging area, and creating a snapshot.

In part C, you stage the remaining files by using the git add . command, updating your snapshot. Recall that this form of the command (with the “.”) means to traverse the directory tree and stage all of the files that are new or changed AND not ignored (via the .gitignore file). In this case, the other two files in the working directory match these criteria, so they are staged.

Next, you commit the set of files in the staging area to create a first commit in the local repository. This is illustrated in part D. Also here, the second file is modified again in the working directory.

Now, in part E, you stage the newly modified file, creating a new snapshot, and then commit it in part F. This creates a second commit. As it manages storage, Git is smart enough not to create duplicate copies of everything from the first commit, but instead to link to it.

Amending Commits

One of the advantages and challenges I noted with Git in Chapter 1 was the ability to rewrite history. The simplest form of rewriting history in Git is amending the last commit. This means you are updating the last commit with content from the staging area, rather than creating a new commit with the changes. This is done by using the --amend option on the next commit command. The basic syntax looks like this: git commit --amend

Staging the Updated Content for the Amend

The amend option tells Git to update the last commit in the local repository with whatever content is currently in the staging area. If no updated content is in the staging area, then only the commit message is updated (if the user chooses to supply a new one). Figure 5.9 shows an example of this workflow.

Figure 5.9 Workflow for an amended commit

In part A, you are starting at the point where you have one commit in the local repository and a change (in File 2) in the working directory.
In part B, you are staging this change with the git add command.

In part C, you commit the change, but pass the --amend option. Instead of creating a new commit, you can think of Git pulling back the last commit (part D), expanding it, overlaying it with what’s in the staging area (part E), and then updating the same commit back in the repository (part F).

Skipping the Edit of the Commit Message

While it is best practice to update the commit message when amending content, if there is a reason not to do so, you can use the --no-edit option on the amend, as in git commit --amend --no-edit.

Resetting the Author Information

The amend option can also come in handy if you forgot to initially set the user.name or user.email configuration settings (or you made a typo in one of them). To update the username and user e-mail captured in the previous commit, you first reset the configuration settings to the desired values. You then add the --reset-author option to the commit command. After you run this command, the commit’s information should show the updated values.

    $ git commit --amend --reset-author

NOTE It is not recommended to amend content that has already been pushed to a remote repository where others may be working with it. Operations that rewrite history, such as amend, should ideally only be done in your local environment before content is initially pushed to the remote repository. Otherwise, other users may have accessed the copy before the rewrite and then can run into problems when they try to push their updates and discover that the history of the branch has been changed without their knowledge.

Results of a Commit

Once a commit is executed, Git displays information like this on the command line interface:

    $ git commit -m "add new files"
    [master e3ff86b] add new files
     2 files changed, 2 insertions(+)
     create mode 100644 file1.java
     create mode 100644 file1.doc

I’ll break down this output so you understand what Git is telling you.
On the first line, master refers to the default branch in Git. Until you create other branches and switch to them, you’ll always be using master as your branch. The e3ff86b is the first seven characters of the SHA1 value that was computed for the overall commit object (the snapshot I’ve referenced in previous chapters). This is immediately followed by the commit message associated with this change.

The next line gives you information about how many files were affected by this commit, and how many changes there were in terms of insertions and deletions versus what was in the local repository before this commit.

Next, you have a list of the files that were involved in this commit along with mode information. The create text here is an indication that these are new files. The 100644 mode indicates a standard file in the repository. This is the most common mode you’ll see, but other types exist for executable files, symbolic links, and so on.

GIT MODE INFORMATION

The mode information used in Git is coded as follows:

4-bit object type (Valid values in binary are 1000 [regular file], 1010 [symbolic link], and 1110 [gitlink])
3-bit unused
9-bit Unix permission (Values 0755 and 0644 are valid for regular files. Symbolic links and gitlinks have value 0 here.)

This translates into the following:

040000: Directory
100644: Regular non-executable file
100755: Regular executable file
120000: Symbolic link
160000: Gitlink (A gitlink references a submodule commit in another repository.)

Some of these modes may show up as output from commands such as commit. Others may only be visible when using plumbing commands such as cat-file that show the modes of items in the underlying repository.

Two of these pieces of information are worth discussing in more detail: the SHA1 for the commit and the commit message.

Commit SHA1s

I discussed what a SHA1 is in Chapter 4. As a reminder, SHA1 is an acronym for Secure Hash Algorithm 1.
It is a checksum or hash that Git computes for every object it stores in its internal content management system. It is also the key that Git uses internally to map to stored content.

Whenever a commit is done in Git, Git computes a SHA1 for each piece of the snapshot that it stores (each file, directory, and so on). However, it also computes a SHA1 for the overall commit. That commit SHA1 is the one that users see and work with. It serves as a handle or key to reference that particular commit in the system. Any Git command that needs to point to a particular commit does so with the SHA1 value for that commit. In terms of use, you can think of this as being similar to a version or revision number in other tracking systems: a system-generated value that identifies a particular version of a change stored in the repository.

NOTE While the SHA1 of a commit can serve a purpose similar to a revision or version number in other systems, unlike those systems, this value does not increase by some set amount each time. Rather, a SHA1 in Git is a 40-character hexadecimal string. Fortunately, you don’t have to remember or specify all 40 characters, just enough to uniquely identify any commit in the system. For most systems, this turns out to be the first seven characters of the SHA1 string. For projects with significantly more commits, more characters from the SHA1 may be needed to identify a particular commit. The most I have heard of users having to specify is 12 characters. (This was for the Linux OS development, where there has been a much larger number of commits over the longer period of time that Git has been in use for the development of that OS.)

Commit Messages

When you commit into the local repository, Git requires you to supply a commit message. If you are working on the command line, you can supply one via the -m or --message argument.
If you don’t supply a commit message, Git will start up the default editor for your particular system for you to type in the message. Once you type in the commit message, you save the file and close the editor. The commit operation then completes.

NOTE See the “Default Editor” section in the discussion on configuration values in Chapter 4 for information on how to configure the editor for commit messages.

When creating a commit message, it is important that it is meaningful, not just to the user doing the commit, but also to others who may be looking at it later. In general, a commit message should do the following:

Explain the reason for the change at a high level (for example, refactoring xyz class, adding new foo api, fixing bug 1234, and so on). Users can use Git to see what was changed, but they need information to understand why it was changed.
Have a meaningful first line. It is typical in many Git interfaces to display only the first lines of commit messages when looking at changes that have gone into the repository. For this reason, the first line should provide a brief, meaningful summary.
Incorporate a tracking ticket identifier in the first line if issues are being tracked via a ticketing system. Doing this gives users who are scanning the first lines of commit messages another place to go for more details.
Follow any standards or guidelines that the team or company may have for commit messages.

Chris Beams (http://chris.beams.io/posts/git-commit/) puts it this way:

Separate the subject from the body with a blank line.
Limit the subject line to 50 characters.
Capitalize the subject line.
Do not end the subject line with a period.
Use the imperative mood in the subject line (for example, fix bug 1234 rather than fixed bug 1234). This matches the tense used in automatic commit messages that Git generates itself for certain operations.
Wrap the body at 72 characters.
Use the body to explain what and why versus how.
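Several of the guidelines above are mechanical enough to check automatically. The following Python sketch shows one way such a check might look. The function name check_commit_message and the wording of its reports are assumptions for illustration, not part of Git, and guidelines that need human judgment (imperative mood, explaining what and why) are deliberately left out.

```python
def check_commit_message(message, subject_limit=50, body_limit=72):
    """Return a list of guideline violations found in a commit message.

    Hypothetical helper covering only the mechanical guidelines: subject
    length, capitalization, trailing period, blank separator line, and
    body line length.
    """
    problems = []
    lines = message.rstrip("\n").split("\n")
    subject = lines[0]

    if len(subject) > subject_limit:
        problems.append(f"subject is longer than {subject_limit} characters")
    if subject and not subject[0].isupper():
        problems.append("subject is not capitalized")
    if subject.endswith("."):
        problems.append("subject ends with a period")
    # The second line, if present, should be the blank separator.
    if len(lines) > 1 and lines[1] != "":
        problems.append("no blank line between subject and body")
    # Every line after the subject counts as body text for the length check.
    for number, line in enumerate(lines[1:], start=2):
        if len(line) > body_limit:
            problems.append(f"line {number} is longer than {body_limit} characters")
    return problems


if __name__ == "__main__":
    print(check_commit_message("Fix bug 1234\n\nExplain why the change was needed."))
    print(check_commit_message("fixed the bug."))
```

A check along these lines could be run by hand before committing, or wired into a hook of the kind covered in Chapter 15.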
Like well-formed comments in code, well-formed commit messages can help to ensure that you and others will find it easier to understand and maintain your changes over time. In fact, some in the Git community advocate for never using the -m option on a commit. The idea is that the -m option encourages a short message format with less information, as opposed to always using an editor to enter the message so that more information about the commit (such as that outlined here) can be included.

Advanced Topics

In this section, you’ll look at how to use templates for commit messages, as well as how to use Git’s Autocorrect and Auto Execute options.

One way to help standardize commit messages and ensure good form is by using commit message templates. A commit message template is simply a text file with text and comments that suggest the type and form of content to include in the commit message. Here’s an example:

    $ cat ~/.gitmessage
    Replace this line with a one-line meaningful summary
    Why this change is needed:
    # Explain why this change is needed
    What this change accomplishes:
    # Explain what this change does:
    # This is our company's default commit message template.
    # You should follow the following guidelines:
    # Guideline 1
    # Guideline 2
    # Guideline 3

This is only one example, and obviously more could be done to make it more self-explanatory (and add real guidelines). However, this should spark some ideas. Once the template file is created, it can be saved to a global area (under the user’s home directory in this example) or even to a more publicly accessible location for use among multiple users.

There are three ways for a user to tell Git to include a commit message template at the time of doing a commit:

1. Use the -t (--template) option on the commit command itself. $ git commit -t