Deduplicating 14K Posts (Part II)

In my last post, I walked through the start of how I deduplicated 14,000 Document Library Pro posts. That post covered how I crafted a script to search through PDF attachment posts and delete any duplicates based on the filename.

I started with the PDFs to make sure that any dlp_document post referencing a duplicate PDF attachment would be caught and deleted. By cleaning out the PDFs first, I had a clean base to check the dlp_document posts against.

Now, it was time to move on to cleaning out the actual dlp_document post type.

jb-dlp-document-deduplication.php

Once I had written the script to deduplicate the PDFs, I was able to use it as a base for the dlp_document posts. Many of the properties and functions are the same with some slight name tweaking for dlp_document instead of pdf_media.

The general concept is the same: new WP_CLI commands run the deduplication script, clear out the options table, and clear out any log files.

One of the ways this script differs is that rather than saving the earliest posts as the originals, we are saving the newest. With the PDFs, we saved the earliest posts to avoid any -1, -2, or other suffixes added to the URLs and file names. However, with the dlp_document posts, the most recent post has the most complete information regarding taxonomies and excerpts. So rather than querying the database in ascending order, we pull the dlp_document posts in descending order by their post ID.

$results = $wpdb->get_results(
   $wpdb->prepare(
      "
      SELECT * FROM {$wpdb->posts}
      WHERE post_type = %s
      AND ID < %d
      ORDER BY ID DESC
      LIMIT %d
      ",
      'dlp_document',
      $this->start_post_id,
      $this->batch_size
   )
);
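To make the batching concrete, here is a minimal, WordPress-free sketch of how a descending cursor like start_post_id could advance between batches. The fetch_batch_desc() and collect_batches() helpers are hypothetical stand-ins for the query above, and moving the cursor to the lowest ID in each batch is an assumption about how the script tracks its position, not a copy of it.

```php
<?php
/**
 * Return up to $batch_size IDs lower than $start_post_id, newest first.
 * Hypothetical stand-in for: WHERE ID < %d ORDER BY ID DESC LIMIT %d.
 */
function fetch_batch_desc( array $all_ids, int $start_post_id, int $batch_size ): array {
    $candidates = array_filter(
        $all_ids,
        static function ( int $id ) use ( $start_post_id ): bool {
            return $id < $start_post_id;
        }
    );
    rsort( $candidates ); // Highest (newest) ID first; also reindexes the array.

    return array_slice( $candidates, 0, $batch_size );
}

/**
 * Walk every ID in descending batches, moving the cursor to the lowest ID seen.
 */
function collect_batches( array $all_ids, int $batch_size ): array {
    $batches       = array();
    $start_post_id = PHP_INT_MAX; // Assumed starting cursor: above every real ID.

    while ( true ) {
        $batch = fetch_batch_desc( $all_ids, $start_post_id, $batch_size );
        if ( empty( $batch ) ) {
            break;
        }
        $batches[]     = $batch;
        $start_post_id = min( $batch ); // The next batch starts strictly below this ID.
    }

    return $batches;
}
```

Because the cursor only ever moves downward, a run that stops partway can pick up again from the last saved ID without reprocessing newer posts.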

Another way this script differs is that rather than handling just duplicate posts, we also had to verify that the PDF file stored in the post meta actually exists. If the file doesn’t exist, the dlp_document post needed to be deleted. It doesn’t help anyone to have a dead link displayed on the site.

Rather than trying to shove this extra check into the current process of handle_duplicate_post(), I decided to create a separate flow for posts where the PDF file did not exist. This allowed me to log the instances of duplicate posts and missing PDFs separately.

If you have read the previous post, you are generally familiar with the flow of handling duplicate posts. Rather than repeating the concepts, here I will focus just on checking for missing PDFs.

It starts within the foreach loop in deduplicate_dlp_docs(). Before checking for a duplicate document, the loop checks that the PDF is valid by calling determine_if_pdf_exists().

determine_if_pdf_exists()

The method takes the object of the dlp_document post as an argument. This object contains the ID and title of the document post. We use this to handle fetching the post meta where the PDF information is stored.

Here is where things got tricky: the Document Library Pro plugin allows a PDF to be saved to a post in one of two ways, as a direct URL or as a post ID. The posts in the customer’s database used a mix of both options.

In order to determine how the PDF is saved to the post, the first thing I had to fetch was the “link type”:

// Confirm that a PDF file is attached by checking the post meta.
// With $single = true, get_post_meta() returns an empty string (never null) when the key is missing.
$pdf_link_type = get_post_meta( $dlp_document_post->ID, '_dlp_document_link_type', true );

The expected results for $pdf_link_type are either “url” or “file”. If the result is anything else, we should delete the dlp_document post because it is incomplete without an attached PDF.

The type of link determines the meta key for the post meta containing the actual PDF information. For example, for url the meta key is _dlp_direct_link_url. For file, the meta key is _dlp_attached_file_id. The simplest way to handle all three cases (url, file, anything else) was to create a switch statement.

For the url and file cases, I pull the post meta from the database. If the post meta exists, then I check that the value it provides (e.g. a URL or post ID) actually exists.

If the data is sound, it is returned to deduplicate_dlp_docs() as part of an array. Otherwise, handle_missing_pdf_file() is called.

 switch ( $pdf_link_type ) {
    case 'url':
        $pdf_file_path = get_post_meta( $dlp_document_post->ID, '_dlp_direct_link_url', true );
        // get_post_meta() returns an empty string when the meta does not exist,
        // so an empty value means the PDF file is missing
        if ( empty( $pdf_file_path ) ) {
           $this->handle_missing_pdf_file( $dlp_document_post, $pdf_link_type, null );
           return $attached_pdf_meta;
        }

        // The post meta exists, so check that the file exists.
        // Note: file_exists() expects a local filesystem path, not an http(s) URL.
        if ( ! file_exists( $pdf_file_path ) ) {
           $this->handle_missing_pdf_file( $dlp_document_post, $pdf_link_type, $pdf_file_path );
           return $attached_pdf_meta;
        }

        $attached_pdf_meta['link_type'] = $pdf_link_type;
        $attached_pdf_meta['pdf_file'] = $pdf_file_path;
        break;
    case 'file':
        $pdf_post_id = get_post_meta( $dlp_document_post->ID, '_dlp_attached_file_id', true );
        // If the post meta is empty, we assume the PDF file is missing
        if ( empty( $pdf_post_id ) ) {
           $this->handle_missing_pdf_file( $dlp_document_post, $pdf_link_type, null );
           return $attached_pdf_meta;
        }

        // The post meta contains a document post ID, so check that the document post exists
        if ( ! get_post_status( $pdf_post_id ) ) {
           $this->handle_missing_pdf_file( $dlp_document_post, $pdf_link_type, $pdf_post_id );
           return $attached_pdf_meta;
        }

        $attached_pdf_meta['link_type'] = $pdf_link_type;
        $attached_pdf_meta['pdf_file'] = $pdf_post_id;
        break;
    default:
        // If the DLP Document post is neither a direct link nor a media library attachment, it should be deleted
        $this->handle_missing_pdf_file( $dlp_document_post, $pdf_link_type, null );
        break;
}
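One caveat with the url case: PHP's file_exists() checks filesystem paths, and the http and https stream wrappers do not support it, so a full URL is always reported as missing. If _dlp_direct_link_url holds a full URL to a file in the uploads folder (an assumption about this data set), the URL needs to be translated to a local path first. The helper below is a hypothetical sketch of that translation and is not part of the original script; the base URL and base directory are passed in so it runs without WordPress.

```php
<?php
/**
 * Translate an uploads URL into a local filesystem path.
 * Hypothetical helper: returns null when the URL is not under $uploads_base_url.
 */
function uploads_url_to_path( string $url, string $uploads_base_url, string $uploads_base_dir ): ?string {
    $base_url = rtrim( $uploads_base_url, '/' ) . '/';

    // Only URLs inside the uploads directory can be mapped to a local file.
    if ( 0 !== strpos( $url, $base_url ) ) {
        return null;
    }

    // Keep the part after the base URL and drop any query string or fragment.
    $relative = substr( $url, strlen( $base_url ) );
    $relative = preg_replace( '/[?#].*$/', '', $relative );

    // Decode percent-encoded characters such as %20 before touching the disk.
    return rtrim( $uploads_base_dir, '/' ) . '/' . rawurldecode( $relative );
}
```

Inside WordPress, the two base values could come from wp_upload_dir(), which returns both a baseurl and a basedir entry. A null result means the link points somewhere outside the uploads folder, and deciding how to verify those external links is a separate question.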

handle_missing_pdf_file()

Similar to handle_duplicate_post(), this method warns the user that a document with a nonexistent PDF was found and asks whether to proceed with logging (dry run) or deleting (for real) the dlp_document post.

In both scenarios, information about the dlp_document post and its PDF is gathered so it can later be logged to a CSV.

gather_missing_pdf_posts_data()

This method takes the post object for the dlp_document post and the meta data for where a PDF was expected to exist as arguments. It then pushes that data as a new entry onto the stash_of_missing_pdf_posts array.

 $this->stash_of_missing_pdf_posts[] = array(
     'dlp_document_post_id'      => $dlp_doc_post->ID,
     'dlp_document_post_title'   => $dlp_doc_post->post_title,
     'pdf_link_type'             => $pdf_link_type,
     'missing_pdf_id_or_url'     => $missing_pdf_id_or_url,
);

Once the missing dlp_document is logged and handled, the code returns to determine_if_pdf_exists() where an empty array is returned to deduplicate_dlp_docs(). The empty array indicates to the foreach loop, where everything started, that there is nothing else to do with this post. There is no point in checking whether the post is a duplicate since it was already logged and possibly deleted. The loop continues on to the next post in the query results without checking for a duplicate post.

If determine_if_pdf_exists() returns a non-empty array, then the code continues on to check whether the dlp_document post is a duplicate of a post which was already found and tracked.
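The control flow described above can be sketched without WordPress. Everything here is illustrative: sort_posts() is a hypothetical function, the $pdf_checker callable stands in for determine_if_pdf_exists(), and matching duplicates by a lowercased title is an assumption made only for this example.

```php
<?php
/**
 * Sort posts into missing-PDF, duplicate, and unique buckets.
 *
 * @param array    $posts       Posts as arrays with 'ID' and 'post_title', newest first.
 * @param callable $pdf_checker Stand-in for determine_if_pdf_exists(); returns an array.
 */
function sort_posts( array $posts, callable $pdf_checker ): array {
    $seen_titles = array();
    $report      = array(
        'missing_pdf' => array(),
        'duplicates'  => array(),
        'unique'      => array(),
    );

    foreach ( $posts as $post ) {
        $attached_pdf_meta = $pdf_checker( $post );

        // An empty array means the missing PDF was already handled: skip the duplicate check.
        if ( empty( $attached_pdf_meta ) ) {
            $report['missing_pdf'][] = $post['ID'];
            continue;
        }

        // Assumed duplicate key: the trimmed, lowercased post title.
        $key = strtolower( trim( $post['post_title'] ) );
        if ( isset( $seen_titles[ $key ] ) ) {
            $report['duplicates'][] = $post['ID'];
            continue;
        }

        // First time this title appears, so the newest post is kept as the original.
        $seen_titles[ $key ] = $post['ID'];
        $report['unique'][]  = $post['ID'];
    }

    return $report;
}
```

Since the posts arrive newest first, the first post seen for a given key is the one that survives, which matches the goal of keeping the most complete record.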

Once the foreach loop concludes, both the missing PDF and duplicate posts are logged. I used separate CSV files to allow different information to be stored in each CSV and to make it easier to parse how many posts were true duplicates and how many had invalid documents saved in the post meta.
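As a rough illustration of the logging step, the sketch below writes an array shaped like stash_of_missing_pdf_posts to a CSV stream. The write_rows_to_csv() helper is hypothetical, and taking the header row from the keys of the first entry is an assumption rather than the script's actual logging code.

```php
<?php
/**
 * Write associative rows to an open stream as CSV, using the first row's keys as the header.
 * Hypothetical helper: returns the number of data rows written.
 */
function write_rows_to_csv( $handle, array $rows ): int {
    $rows = array_values( $rows ); // Make sure the first row sits at index 0.
    if ( empty( $rows ) ) {
        return 0;
    }

    // Header row, e.g. dlp_document_post_id, dlp_document_post_title, ...
    fputcsv( $handle, array_keys( $rows[0] ), ',', '"', '\\' );

    foreach ( $rows as $row ) {
        fputcsv( $handle, array_values( $row ), ',', '"', '\\' );
    }

    return count( $rows );
}

// Example usage with an in-memory stream in place of a real log file.
$example_handle = fopen( 'php://memory', 'w+' );
write_rows_to_csv(
    $example_handle,
    array(
        array(
            'dlp_document_post_id'    => 101,
            'dlp_document_post_title' => 'Annual Report',
            'pdf_link_type'           => 'file',
            'missing_pdf_id_or_url'   => 55,
        ),
    )
);
fclose( $example_handle );
```

Calling the same helper once per stash, each with its own file handle, is one way to keep the duplicate log and the missing-PDF log in separate files with different columns.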

Script Clean Up

Since there are two types of log files for the dlp_document posts, I created a third clean-up command: dlp-document-missing-pdf-delete-logs. This gives the user flexibility to delete just the duplication log files (via dlp-document-dedup-delete-logs) or the logs for the missing PDFs.

Results

PDF Deduplication:

Processed 6994 PDF posts
Total duplicate posts found: 820
Unique PDF posts found: 6174

DLP_Document Deduplication:

Processed 14,402 DLP Document posts
Total duplicate posts found: 6108
Total posts with missing PDF file found: 3151
Unique DLP Document posts found: 5143

Now the library is all cleaned up. There are no more duplicate posts and all links to PDFs should work properly.

Have a site which needs some data clean up? I’m available! Fill out the contact form below to reach out.
