Or how to find copied and pasted code
Duplicate code can be hard to find, especially in a large project. But PMD’s Copy/Paste Detector (CPD) can find it for you! CPD has been through three major incarnations:
First we wrote it using a variant of Michael Wise’s Greedy String Tiling algorithm (our variant is described here).
Then it was completely rewritten by Brian Ewins using the Burrows-Wheeler transform.
Finally, it was rewritten by Steve Hawkins to use the Karp-Rabin string matching algorithm.
Each rewrite made it much faster, and now it can process the JDK 1.4 java.* packages in about 4 seconds (on my workstation, at least).
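To give a feel for the idea behind the current incarnation, here is a minimal sketch of Karp-Rabin style matching applied to a token stream: a rolling hash is kept over a window of tokens, and windows whose hashes collide are compared directly. This is an illustration only and not CPD's actual implementation; the class `RollingHashSketch`, the method `duplicateWindows` and the hash base are hypothetical names and choices made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: rolling-hash search for repeated token windows. */
class RollingHashSketch {
    private static final long BASE = 31; // arbitrary hash base (assumption)

    /** Returns {earlierStart, laterStart} pairs of identical windows of minTokens tokens. */
    static List<int[]> duplicateWindows(List<String> tokens, int minTokens) {
        List<int[]> matches = new ArrayList<>();
        int n = tokens.size();
        if (minTokens <= 0 || n < minTokens) {
            return matches;
        }
        // BASE^(minTokens-1), used to remove the leftmost token when rolling.
        long highPower = 1;
        for (int i = 1; i < minTokens; i++) {
            highPower *= BASE;
        }
        // Hash of the first window; long overflow simply wraps around.
        long hash = 0;
        for (int i = 0; i < minTokens; i++) {
            hash = hash * BASE + tokens.get(i).hashCode();
        }
        Map<Long, List<Integer>> seen = new HashMap<>();
        for (int start = 0; ; start++) {
            List<Integer> earlier = seen.computeIfAbsent(hash, k -> new ArrayList<>());
            for (int other : earlier) {
                // Equal hashes may be collisions, so compare the tokens themselves.
                if (tokens.subList(other, other + minTokens)
                        .equals(tokens.subList(start, start + minTokens))) {
                    matches.add(new int[] {other, start});
                }
            }
            earlier.add(start);
            int next = start + minTokens;
            if (next >= n) {
                break;
            }
            // Roll the window: drop tokens[start], append tokens[next].
            hash = (hash - tokens.get(start).hashCode() * highPower) * BASE
                    + tokens.get(next).hashCode();
        }
        return matches;
    }
}
```

The window size plays the role of the minimum duplicate size: only runs of at least that many identical tokens are reported. Overlapping windows are reported separately here; a real tool would merge them into longer matches.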
Note that CPD works with Java, JSP, C, C++, C#, Fortran and PHP code. Your own language is missing? See how to add it here.
CPD is included with PMD, which you can download here. Or, if you have Java Web Start, you can run CPD by clicking here.
Here are the duplicates CPD found in the JDK 1.4 source code.
Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).
CPD comes with its own starter batch file: cpd.bat. It’s located in the bin subdirectory of the PMD binary distribution zip file. Assuming you are in this directory, you can start CPD this way:
cpd.bat --minimum-tokens 100 --files c:\temp\src\java
The options “minimum-tokens” and “files” are the two required options; there are more options, see below.
On Linux, since PMD 5.0 there is a combined start script for all command line tools, including CPD. The start script is called run.sh and is located in the bin subdirectory of the PMD binary distribution zip file. Assuming you are in this directory, you can start CPD this way:
./run.sh cpd --minimum-tokens 100 --files /usr/local/java/src/java
Option | Description | Required | Applies for language |
---|---|---|---|
--minimum-tokens | The minimum token length which should be reported as a duplicate. | yes | |
--files | List of files and directories to process | yes | |
--language | Source code language. Default value is `java` | no | |
--encoding | Character encoding to use when processing files | no | |
--skip-duplicate-files | Ignore multiple copies of files of the same name and length in comparison. | no | |
--exclude | Files to be excluded from CPD check | no | |
--non-recursive | Don't scan subdirectories | no | |
--skip-lexical-errors | Skip files which can't be tokenized due to invalid characters instead of aborting CPD | no | |
--format | Report format. Default value is `text`. | no | |
--ignore-literals | Ignore number values and string contents when comparing text | no | java |
--ignore-identifiers | Ignore constant and variable names when comparing text | no | java |
--ignore-annotations | Ignore language annotations when comparing text | no | java |
--no-skip-blocks | Do not skip code blocks marked with --skip-blocks-pattern (e.g. #if 0 until #endif) | no | cpp |
--skip-blocks-pattern | Pattern to find the blocks to skip. Start and End pattern separated by |. Default is `#if 0|#endif`. | no | cpp |
--uri | URI to process | no | plsql |
--help / -h | Print help text | no | |
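To illustrate what the skip-blocks pattern describes, here is a minimal sketch that splits a pattern such as `#if 0|#endif` into its start and end parts and drops the lines in between. It is not CPD's tokenizer: the class `SkipBlocksSketch`, the line-based matching and the lack of support for nested blocks are simplifying assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: line-based skipping of blocks delimited by a start|end pattern. */
class SkipBlocksSketch {
    /** Returns the lines that remain after removing start..end blocks (inclusive). */
    static List<String> skipBlocks(List<String> lines, String pattern) {
        // The pattern has two parts separated by the first '|'.
        String[] parts = pattern.split("\\|", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start|end but got: " + pattern);
        }
        String start = parts[0];
        String end = parts[1];
        List<String> kept = new ArrayList<>();
        boolean skipping = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (!skipping && trimmed.startsWith(start)) {
                skipping = true;   // entering a skipped block
            } else if (skipping && trimmed.startsWith(end)) {
                skipping = false;  // leaving it; nested blocks are not handled
            } else if (!skipping) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

With the default pattern, code disabled via `#if 0` never reaches the comparison, so it cannot be reported as a duplicate.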
Note: The following examples use the Linux start script. For Windows, just replace “./run.sh cpd” with “cpd.bat”.
Minimum required options: Just give it the minimum duplicate size and the source directory:
$ ./run.sh cpd --minimum-tokens 100 --files /usr/local/java/src/java
You can also specify the language:
$ ./run.sh cpd --minimum-tokens 100 --files /path/to/c/source --language cpp
You may wish to check sources that are stored in different directories:
$ ./run.sh cpd --minimum-tokens 100 --files /path/to/other/source --files /path/to/other/source --files /path/to/other/source --language fortran
There should be no limit to the number of ‘--files’ options you can add. But if you stumble on one, please tell us!
And if you’re checking a C source tree with duplicate files in different architecture directories, you can skip those using --skip-duplicate-files:
$ ./run.sh cpd --minimum-tokens 100 --files /path/to/c/source --language cpp --skip-duplicate-files
You can also specify the encoding to use when parsing files:
$ ./run.sh cpd --minimum-tokens 100 --files /usr/local/java/src/java --encoding utf-16le
You can also specify a report format - here we’re using the XML report:
$ ./run.sh cpd --minimum-tokens 100 --files /usr/local/java/src/java --format xml
The default format is a text report, and there’s also a csv report.
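The XML report is the easiest one to post-process. The sketch below summarizes such a report in Java; it assumes the report consists of `duplication` elements carrying `lines` and `tokens` attributes, each with one nested `file` element per occurrence. That layout is an assumption to verify against a report produced by your CPD version, and the class `CpdReportSummary` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: one summary line per duplication in a CPD XML report. */
class CpdReportSummary {
    static List<String> summarize(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable XML report", e);
        }
        List<String> summary = new ArrayList<>();
        // Assumed layout: <duplication lines=".." tokens=".."><file .../>...</duplication>
        NodeList duplications = doc.getElementsByTagName("duplication");
        for (int i = 0; i < duplications.getLength(); i++) {
            Element duplication = (Element) duplications.item(i);
            int occurrences = duplication.getElementsByTagName("file").getLength();
            summary.add(duplication.getAttribute("tokens") + " tokens, "
                    + duplication.getAttribute("lines") + " lines, "
                    + occurrences + " occurrences");
        }
        return summary;
    }
}
```

A build script could, for example, read the report file into a string, pass it to `summarize`, and print the resulting lines.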
Note that CPD is pretty memory-hungry; you may need to give Java more memory to run it, like this:
$ export HEAPSIZE=512m
$ ./run.sh cpd --minimum-tokens 100 --files /usr/local/java/src/java
In order to change the heap size under Windows, you’ll need to edit the batch file cpd.bat and set the “OPTS” variable to -Xmx512m.
If you specify a source directory but don’t want to scan the sub-directories, you can use the non-recursive option:
$ ./run.sh cpd --minimum-tokens 100 --non-recursive --files /usr/local/java/src/java
Please note that if CPD detects duplicated source code, it will exit with status 4 (since 5.0). This behavior has been introduced to ease CPD integration into scripts or hooks, such as SVN hooks.
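A wrapper can use this exit status to decide whether a build step or hook should fail. The following is a minimal Java sketch, assuming CPD is started through run.sh from the current directory. Only status 4 is documented above; treating 0 as "no duplicates" and every other value as an error is an assumption, and the class `CpdGate` is a hypothetical name.

```java
import java.io.IOException;
import java.util.List;

/** Illustrative only: runs CPD and interprets its exit status. */
class CpdGate {
    /** Exit status documented for "duplicates found" (since PMD 5.0). */
    static final int DUPLICATES_FOUND = 4;

    /** Maps an exit status to a verdict; 0 meaning "clean" is an assumption. */
    static String verdict(int exitStatus) {
        if (exitStatus == DUPLICATES_FOUND) {
            return "duplicates";
        }
        return exitStatus == 0 ? "clean" : "error";
    }

    /** Builds the same command line as the examples in this document. */
    static List<String> command(int minimumTokens, String sourceDir) {
        return List.of("./run.sh", "cpd",
                "--minimum-tokens", Integer.toString(minimumTokens),
                "--files", sourceDir);
    }

    /** Starts CPD, forwards its output to the console and returns the exit status. */
    static int run(int minimumTokens, String sourceDir)
            throws IOException, InterruptedException {
        Process cpd = new ProcessBuilder(command(minimumTokens, sourceDir))
                .inheritIO()
                .start();
        return cpd.waitFor();
    }
}
```

A hook could call `CpdGate.verdict(CpdGate.run(100, "/usr/local/java/src/java"))` and reject the change unless the verdict is "clean".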
Andy Glover wrote an Ant task for CPD; here’s how to use it:
<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>
Attribute | Description | Applies for language | Required |
---|---|---|---|
encoding | The character set encoding (e.g., UTF-8) to use when reading the source code files and when producing the report. A word of warning: even if you set the encoding value properly, say to UTF-8, but the files CPD reads are encoded in CP1252, you may end up with a report that is not valid UTF-8. CPD copies pieces of source code into its report directly, so they keep their original encoding. If not specified, CPD uses the system default encoding. | | No |
format | The format of the report (e.g. `csv`, `text`, `xml`); defaults to `text`. | | No |
ignoreLiterals | If `true`, CPD ignores literal value differences when evaluating a duplicate block. This means that `foo=42;` and `foo=43;` will be seen as equivalent. You may want to run CPD with this option off to start with and then switch it on to see what it turns up; defaults to `false`. | java | No |
ignoreIdentifiers | Similar to `ignoreLiterals` but for identifiers, i.e., variable names, method names, and so forth; defaults to `false`. | java | No |
ignoreAnnotations | Ignore annotations. More and more modern frameworks use annotations on classes and methods, which can be very redundant and trigger CPD matches. With J2EE (CDI, Transaction Handling, etc.) and Spring (everything), classes or methods often carry the same 5-6 lines of annotations, which causes false positives; defaults to `false`. | java | No |
skipDuplicateFiles | Ignore multiple copies of files of the same name and length in comparison; defaults to `false`. | | No |
skipLexicalErrors | Skip files which can't be tokenized due to invalid characters instead of aborting CPD; defaults to `false`. | | No |
skipBlocks | Enables or disables skipping of blocks like a pre-processor; defaults to `true`. See also option skipBlocksPattern. | cpp | No |
skipBlocksPattern | Configures the pattern used to find the blocks to skip. It is a string property and consists of two parts, separated by `|`. The first part is the start pattern, the second part is the end pattern. The default value is `#if 0|#endif`. | cpp | No |
language | Flag to select the appropriate language (e.g. `c`, `cpp`, `cs`, `java`, `jsp`, `php`, `ruby`, `fortran`, `ecmascript`, and `plsql`); defaults to `java`. | | No |
minimumTokenCount | A positive integer indicating the minimum duplicate size. | | Yes |
outputFile | The destination file for the report. If not specified, the console will be used instead. | | No |
Also, you can get verbose output from this task by running ant with the -v flag; i.e.:
ant -v -f mybuildfile.xml cpd
You can also get an HTML report from CPD by using the XSLT script in pmd/etc/xslt/cpdhtml.xslt. Run the CPD task as usual with the xml report format (writing to cpd.xml in this example), and right after it invoke the Ant xslt task like this:
<xslt in="cpd.xml" style="etc/xslt/cpdhtml.xslt" out="cpd.html" />
CPD also comes with a simple GUI. You can start it via some scripts in the bin folder:
For Windows:
cpdgui.bat
For Linux:
./run.sh cpdgui
Here’s a screenshot of CPD after running on the JDK 8 java.lang package:
By adding the annotations @SuppressWarnings("CPD-START") and @SuppressWarnings("CPD-END"), all code in between will be ignored by CPD, so you can avoid false positives. This makes it possible to ignore sections of source code, such as switch/case statements or parameterized factories.
//enable suppression
@SuppressWarnings("CPD-START")
public Object someParameterizedFactoryMethod(int x) throws Exception {
    // any code here will be ignored for the duplication detection
}
//disable suppression
@SuppressWarnings("CPD-END")
public void nextMethod() {
}