I have recently been working with a client to give their site a refresh. Rather than rebuild the entire thing, they wanted to make sure their current site was up to date and make a few key functionality improvements. One of these improvements is to clean up a library of PDF files they host using Document Library Pro.
The Problem
As near as I can tell, whoever set up the library did a post import. But things didn’t work the way they expected, so they did another import. And another one…all without clearing out the previously imported posts. This resulted in multiple copies of each document being added to the website.
For fun additional complexity, each of the “dlp_document” posts is tied to a PDF file which may be uploaded via the Media Library or attached to the post via the “direct URL” meta data. Or, the file may not exist at all. This means we also need to remove any duplicate PDF files, plus check that any file the dlp_document has saved in its meta data actually exists.
The Process
Manually checking 14K+ documents would not only be time consuming, but would also leave lots of room for error. Instead, I decided to do the clean up by writing scripts within a plugin. The scripts are then executable via custom WP-CLI commands.
When it came to what order of actions needed to be taken, I decided to approach the problem by breaking it down into three steps, split across two scripts:
- Remove any duplicate PDFs
- Remove document posts where the PDF does not exist
- Remove any document posts which are duplicates
The Code
You can find the plugin code here: https://github.com/JessBoctor/jb-deduplication
The main plugin file, jb-deduplication.php, is really basic. Essentially, it is just used to load the two script files into WordPress.
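For reference, a minimal version of that loader could look something like the sketch below. The WP_CLI guard is my own convention here, not necessarily what the repo does; check the actual file for the exact structure.

<?php
/**
 * Plugin Name: JB Deduplication
 * Description: WP-CLI commands for cleaning up duplicate PDFs and dlp_document posts.
 */

// Only load the command files when WP-CLI is actually running.
if ( defined( 'WP_CLI' ) && WP_CLI ) {
	require_once plugin_dir_path( __FILE__ ) . 'jb-pdf-media-deduplication.php';
	require_once plugin_dir_path( __FILE__ ) . 'jb-dlp-document-deduplication.php';
}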
jb-pdf-media-deduplication.php
The jb-pdf-media-deduplication.php file holds the PDF_Media_Deduplication_Command class plus two other clean up commands.
There are a number of properties listed in the PDF_Media_Deduplication_Command class. The first four are all arguments which control the scope of the script.
- $dry_run – Run the script without actually deleting things
- $skip_confirmations – Skip manually checking duplicates
- $batch_size – The number of posts to check
- $start_post_id – Where to start the query
The remaining properties are all used by the script to track progress.
- $last_post_id – The ID of the last post to get checked
- $unique_post_titles – An array of unique post titles which can be checked against for duplicates
- $duplicate_posts_to_log – A nested array of data which tracks duplicate posts which are found
- $total_duplicate_posts – A count of the duplicate posts which are found
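Roughly, the class shape looks like this. The default values shown are illustrative, except $start_post_id, which genuinely defaults to 1 (more on that below).

class PDF_Media_Deduplication_Command {
	// Arguments which control the scope of the script.
	private $dry_run            = false;
	private $skip_confirmations = false;
	private $batch_size         = 100; // Illustrative default.
	private $start_post_id      = 1;

	// Properties used to track progress.
	private $last_post_id           = 0;
	private $unique_post_titles     = array(); // Post ID => title pairs.
	private $duplicate_posts_to_log = array();
	private $total_duplicate_posts  = 0;
}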
The __invoke function is similar to __construct in that it is the first thing called when the command runs. With WP-CLI, you only want to use the __construct function if you are using data outside of the class, or outside of the command arguments, to run the command. For example, if you had options stored in the wp_options table, you could fetch those options, pass them to a new instance of the class, and then when the WP-CLI command is run, it would use those pre-set options.
In the case of this script, all we need are the arguments passed from calling the WP-CLI command, so we can skip __construct. Instead, we just use __invoke to set our class properties and get the ball rolling.
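A sketch of that setup, assuming the arguments are passed as --dry-run, --skip-confirmations, --batch-size, and --start-post-id (the exact flag names may differ in the repo):

public function __invoke( $args, $assoc_args ) {
	// Flags default to false when not passed on the command line.
	$this->dry_run            = (bool) WP_CLI\Utils\get_flag_value( $assoc_args, 'dry-run', false );
	$this->skip_confirmations = (bool) WP_CLI\Utils\get_flag_value( $assoc_args, 'skip-confirmations', false );

	if ( isset( $assoc_args['batch-size'] ) ) {
		$this->batch_size = (int) $assoc_args['batch-size'];
	}

	// ...determine the start post ID and load previously found unique titles...

	$this->deduplicate_pdfs();
}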
$batch_size, $start_post_id, and $unique_post_titles
Since there is such a large number of posts to sort through, I wanted to be able to run the script in batches. This way, I could spot check small sets of posts. However, since the goal is to find unique posts across the whole data set, I needed a way to avoid losing track of the progress made between batches.
This method determines where a batch should start its post query. If the --start-post-id argument is passed with the WP-CLI command, then that is the post ID used as the starting point. However, I don’t want to have to remember where the last batch run ended. Instead, the $last_post_id property is stored in the wp_options table as 'pdf-deduplication-start-post-id' (mouthy, I know). This way, if a user runs continuous batches, the script can pull the next start post ID from the options table. If there is no saved post ID and no --start-post-id argument, then the start post ID uses the default property value of 1.
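In code, that fallback chain might look something like this (a sketch, not the exact method from the repo):

// Determine where this batch should start its post query.
if ( isset( $assoc_args['start-post-id'] ) ) {
	$this->start_post_id = (int) $assoc_args['start-post-id'];
} else {
	// Fall back to where the last batch ended, then to the default of 1.
	$this->start_post_id = (int) get_option( 'pdf-deduplication-start-post-id', $this->start_post_id );
}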
In a similar way, I don’t want to lose track of the unique posts which were found during each batch run. The $unique_post_titles property is an empty array by default. To keep it up to date, if any unique post titles are found during a batch run, they are saved to the wp_options table as pdf-deduplication-unique-post-titles. When the __invoke function is called, it checks for this option and loads any previously found unique post titles to the $unique_post_titles property before starting the deduplication process.
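Loading the previously recorded titles is a straightforward get_option() call, along these lines:

// Pull in unique titles recorded by any previous batch runs.
$saved_titles = get_option( 'pdf-deduplication-unique-post-titles', array() );
if ( is_array( $saved_titles ) ) {
	$this->unique_post_titles = $saved_titles;
}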
This is where the main “deduplication” action happens. It gets called at the very end of __invoke once the class properties have been set up. The method does four things (sketched after this list):
- Fetches all PDF attachment posts
- Handles each post as either a duplicate or unique
- Updates the $unique_post_titles records
- Logs the result of the batch run
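Put together, the skeleton of the method looks roughly like this. Note that get_pdf_attachment_posts() and check_post_for_duplicate() are stand-in names for illustration; handle_duplicate_post(), save_unique_post_titles_to_options(), and log_results() are covered below.

private function deduplicate_pdfs() {
	// 1. Fetch the next batch of PDF attachment posts.
	$results = $this->get_pdf_attachment_posts(); // Stand-in name.

	// 2. Handle each post as either a duplicate or a unique title.
	foreach ( $results as $post ) {
		$this->check_post_for_duplicate( $post ); // Stand-in name.
	}

	// 3. Persist the updated $unique_post_titles records.
	$this->save_unique_post_titles_to_options();

	// 4. Log the result of the batch run.
	$this->log_results();
}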
This is how we fetch the PDF attachment posts. It runs a simple query for any PDF posts in the media library:
global $wpdb;

$results = $wpdb->get_results(
	$wpdb->prepare(
		"
		SELECT * FROM {$wpdb->posts}
		WHERE post_type = %s
		AND post_mime_type = %s
		AND ID > %d
		ORDER BY ID ASC
		LIMIT %d
		",
		'attachment',
		'application/pdf',
		$this->start_post_id,
		$this->batch_size
	)
);
One of the things which turned out to be key in the deduplication process is the order of the post results. Since we want to keep the earliest uploaded version of each PDF file, and avoid keeping any PDF files with -1 or -2 suffixes, the post results have to be in ascending ID order.
Once we have the results, we can set the $last_post_id property for the class. This will let us keep track of where the batch for the script ended.
// Set the last_post_id property to the last post ID in the results, if any.
if ( ! empty( $results ) ) {
	$last_post          = end( $results );
	$this->last_post_id = $last_post->ID;
}
The results get returned to deduplicate_pdfs() to be looped through a series of logic filters.
To start, we save $post->post_title into a separate variable, $post_title. This allows us to fuzzy match the post title against known unique titles by stripping out -1, -2, and -pdf from the post title without changing the original $post->post_title. Each of these variations of $post_title is checked against the $unique_post_titles array. If a match is found, the $post object and the ID of the post with the matching title get sent through handle_duplicate_post().
If there isn’t a match from the four variations, then the post is considered unique. The post gets added to $unique_post_titles in a $post->ID => $post->post_title key => value pair.
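A sketch of that matching logic, assuming simple suffix stripping (the exact normalization in the repo may differ):

$post_title = $post->post_title;

// The title itself, plus variations with common duplicate suffixes stripped.
$variations = array(
	$post_title,
	preg_replace( '/-1$/', '', $post_title ),
	preg_replace( '/-2$/', '', $post_title ),
	preg_replace( '/-pdf$/', '', $post_title ),
);

foreach ( $variations as $variation ) {
	// $unique_post_titles maps post ID => title, so search the values.
	$matching_post_title_id = array_search( $variation, $this->unique_post_titles, true );

	if ( false !== $matching_post_title_id ) {
		$this->handle_duplicate_post( $post, $matching_post_title_id );
		return;
	}
}

// No match found, so record this post as unique.
$this->unique_post_titles[ $post->ID ] = $post->post_title;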
If a PDF attachment $post is considered to be a duplicate, we need to confirm that the user wants to continue, log the post, and most likely delete the $post and its uploaded file.
In the case of a dry run (without skipping confirmations), the script will ask whether the user wants to log the duplicate PDF. In the case of the code being run for real, it will ask the user if they want to delete the post and file. If the user responds with anything other than “yes”, then the script will exit mid-run.
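WP-CLI has a built-in helper for exactly this: WP_CLI::confirm() prints a [y/n] prompt and aborts the command on anything other than “y”. A sketch of how it could be used here:

if ( ! $this->skip_confirmations ) {
	if ( $this->dry_run ) {
		WP_CLI::confirm( sprintf( 'Log duplicate PDF "%s"?', $duplicate_post->post_title ) );
	} else {
		WP_CLI::confirm( sprintf( 'Delete post %d and its file?', $duplicate_post->ID ) );
	}
}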
When the user gives a “yes”, the first thing that happens is some basic information for the original PDF file and the duplicate gets saved by gather_duplicate_posts_data().
Once the information is saved, in the case of a real run, the attachment is deleted via a call to wp_delete_attachment().
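The deletion itself is a single call; passing true as the second argument bypasses the trash so the attachment post and its file are fully removed:

if ( ! $this->dry_run ) {
	// Force delete: removes the attachment post, its meta, and the file on disk.
	wp_delete_attachment( $duplicate_post->ID, true );
}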
This method captures the post ID, title, and URL of the original and duplicate PDF posts. In the case of the duplicate, it will also attempt to capture the size of the file. This way, we can see how much data is being removed.
$this->duplicate_posts_to_log[] = array(
	'original_post_id'          => $matching_post_title_id,
	'original_post_title'       => $this->unique_post_titles[ $matching_post_title_id ],
	'original_pdf_url'          => get_attached_file( $matching_post_title_id ),
	'duplicate_post_id'         => $duplicate_post->ID,
	'duplicate_post_title'      => $duplicate_post->post_title,
	'duplicate_pdf_url'         => $duplicate_file,
	'duplicate_pdf_file_exists' => $duplicate_file_exists,
	'duplicate_pdf_filesize'    => $duplicate_file_size,
);
The data is added to the $duplicate_posts_to_log property as a nested array. This allows us to use each array as a row in a CSV file which gets created by log_results().
Once each post object in the query is checked for a duplicate, the pdf-deduplication-unique-post-titles option is updated to match the current version of the $unique_post_titles array via save_unique_post_titles_to_options().
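That helper is likely little more than an update_option() call. Here is a sketch; I have also shown the start-post-id option being saved at the same time, though in the repo it may be written elsewhere:

private function save_unique_post_titles_to_options() {
	// Persist progress so the next batch run can pick up where this one left off.
	update_option( 'pdf-deduplication-unique-post-titles', $this->unique_post_titles, false );
	update_option( 'pdf-deduplication-start-post-id', $this->last_post_id, false );
}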
Once the unique posts are recorded, the duplicates get logged. In addition to printing some basic stats about the batch in the command line, the method makes use of the built-in WP_CLI\Utils function write_csv() to create a CSV file containing the information in $duplicate_posts_to_log.
The file gets stored in the plugin directory under “logs”.
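Since $duplicate_posts_to_log is an array of associative arrays, feeding it to write_csv() is direct. A sketch, with an illustrative file name:

if ( ! empty( $this->duplicate_posts_to_log ) ) {
	$log_file = plugin_dir_path( __FILE__ ) . 'logs/duplicates-' . gmdate( 'Y-m-d-His' ) . '.csv';
	$handle   = fopen( $log_file, 'w' );

	if ( $handle ) {
		// The array keys double as the CSV header row.
		$headers = array_keys( $this->duplicate_posts_to_log[0] );
		WP_CLI\Utils\write_csv( $handle, $this->duplicate_posts_to_log, $headers );
		fclose( $handle );
	}
}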
The script is done. Any duplicates will have been logged and deleted, and the PDF attachments cleaned up.
Script Clean Up
To avoid bloat from running the script, I created two extra WP-CLI commands, pdf-media-dedup-clear-options and pdf-media-dedup-delete-logs. These clear out any options created in the wp-options table and delete any log files, respectively.
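Neither command needs much machinery. The options one, for example, can be as simple as the following sketch (see the repo for the real registration):

WP_CLI::add_command( 'pdf-media-dedup-clear-options', function () {
	delete_option( 'pdf-deduplication-start-post-id' );
	delete_option( 'pdf-deduplication-unique-post-titles' );
	WP_CLI::success( 'Deduplication options cleared.' );
} );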
To be continued…
Follow along for the breakdown of jb-dlp-document-deduplication.php and how it clears out not only duplicates, but also posts with bad references. Exciting stuff!
Update!
Part two can be found here:

