Tag: PHP

  • Deduplicating 14,000 Posts

    I have recently been working with a client to give their site a refresh. Rather than rebuild the entire thing, they wanted to make sure their current site was up-to-date and make a few key functionality improvements. One of these improvements is to clean up a library of PDF files they have hosted using Document Library Pro.

    The Problem

    As near as I can tell, whoever set up the library did a post import. But things didn’t work the way they expected so they did another import. And another one…all without clearing out the previously imported posts. This resulted in multiples of each document being added to the website.

    For fun additional complexity, each “dlp_document” post is tied to a PDF file which may be uploaded via the Media Library or attached to the post via the “direct URL” metadata. Or, the file may not exist at all. This means we also need to remove any duplicate PDF files, plus check that any file which the dlp_document has saved in its metadata actually exists.

    The Process

    Manually checking 14K+ documents would not only be time-consuming, but also leave lots of room for error. Instead, I decided to do the cleanup by writing scripts within a plugin. The scripts are then executable by custom WP-CLI commands.

    When it came to what order of actions needed to be taken, I decided to approach the problem by breaking it down into three steps, split across two scripts:

    1. Remove any duplicate PDFs
    2. Remove document posts where a PDF does not exist
    3. Remove any document posts which are duplicates

    The Code

    You can find the plugin code here: https://github.com/JessBoctor/jb-deduplication

    The main plugin file, jb-deduplication.php, is really basic. Essentially, it is just used to load the two script files into WordPress.

    jb-pdf-media-deduplication.php

    The jb-pdf-media-deduplication.php file holds the PDF_Media_Deduplication_Command class plus two other clean up commands.

    There are a number of properties listed in PDF_Media_Deduplication_Command class. The first four are all arguments which control the scope of the script.

    • $dry_run – Run the script without actually deleting things
    • $skip_confirmations – Skip manually checking duplicates
    • $batch_size – The number of posts to check
    • $start_post_id – Where to start the query

    The remaining properties are all used by the script to track progress.

    • $last_post_id – The ID of the last post to get checked
    • $unique_post_titles – An array of unique post titles which can be checked against for duplicates
    • $duplicate_posts_to_log – A nested array of data which tracks duplicate posts which are found
    • $total_duplicate_posts – A count of the duplicate posts which are found

    __invoke

    This function is similar to __construct as it is the first thing called when the command is run. In the case of WP-CLI, you only want to use the __construct function if you need data from outside of the class or the command arguments to run the command. For example, if you had options stored in the wp_options table, you could fetch those options, pass them to a new instance of the class, and then when the WP-CLI command is run, it would use those pre-set options.

    In the case of this script, all we need are the arguments passed from calling the WP-CLI command, so we can skip __construct. Instead, we just use __invoke to set our class properties and get the ball rolling.

    $batch_size, $start_post_id, and $unique_post_titles

    Since there is such a large number of posts which needed to be sorted, I wanted to be able to run the script in batches. This way, I could spot check small amounts of posts. However, since the goal is to find unique posts across the whole data set, I needed to figure out a way not to lose track of the progress made between different batches.

    determine_start_post_id()

    This method determines where a batch should start its post query. If the --start-post-id argument is passed with the WP-CLI command, then that is the post ID which is used as a starting point. However, I don’t want to have to remember where the last batch run ended. Instead, the $last_post_id property is stored in the wp_options table as 'pdf-deduplication-start-post-id' (a mouthful, I know). This way, if a user runs continuous batches, then the script can pull the next start post ID from the options table. If there is no post ID saved and no --start-post-id argument, then the start post ID uses the default property value of 1.

    In a similar way, I don’t want to lose track of the unique posts which were found during each batch run. The $unique_post_titles property is an empty array by default. To keep it up to date, if any unique post titles are found during a batch run, they are saved to the wp_options table as pdf-deduplication-unique-post-titles. When the __invoke function is called, it checks for this option and loads any previously found unique post titles to the $unique_post_titles property before starting the deduplication process.

    deduplicate_pdfs()

    This is where the main “deduplication” action happens. It gets called at the very end of __invoke once the class properties have been set up. The method does four things:

    1. Fetches all PDF attachment posts
    2. Handles the post if it is a duplicate or unique
    3. Updates the $unique_post_titles records
    4. Logs the result of the batch run

    get_pdf_posts()

    This is how we fetch the PDF attachment posts. It runs a simple query for PDF attachments in the media library, starting after the start post ID and limited to the batch size:

    global $wpdb;
    
    $results = $wpdb->get_results(
       $wpdb->prepare(
          "
          SELECT * FROM {$wpdb->posts}
          WHERE post_type = %s
          AND post_mime_type = %s
          AND ID > %d
          ORDER BY ID ASC
          LIMIT %d
          ",
          'attachment',
          'application/pdf',
          $this->start_post_id,
          $this->batch_size
       )
    );
    

    One of the things which turned out to be key in the deduplication process is the order of the post results. Since we want to use the earliest version of the PDF file which was uploaded, to avoid keeping any PDF files with -1 or -2 suffixes, the post results have to be in ascending order.

    Once we have the results, we can set the $last_post_id property for the class. This will let us keep track of where the batch for the script ended.

    // Set the last_post_id property to the last post ID in the results, if any
    if ( ! empty( $results ) ) {
        $last_post = end( $results );
        $this->last_post_id = $last_post->ID;
    }
    

    The results get returned to deduplicate_pdfs() to be looped through a series of logic filters.

    To start, we save $post->post_title into a separate variable, $post_title. This allows us to fuzzy match the post title against known unique titles by stripping out -1, -2, and -pdf from the post title without changing the original $post->post_title. Each of these variations of $post_title is checked against the $unique_post_titles array. If a match is found, the $post object and the ID of the post with the matching title get sent through handle_duplicate_post().

    If there isn’t a match from the four variations, then the post is considered unique. The post gets added to $unique_post_titles in a $post->ID => $post->post_title key => value pair.

    handle_duplicate_post()

    If a PDF attachment $post is considered to be a duplicate, we need to confirm that the user wants to continue, log the post, and most likely delete the $post and uploaded file.

    In the case of a dry-run (without skipping confirmations), the script will confirm if the user wants to log the duplicate PDF. In the case of the code being run for real, it will ask the user if they want to delete the post and file. If the user responds anything other than “yes”, then the script will exit mid-run.

    When the user gives a “yes”, the first thing which happens is some basic information for the original PDF file and the duplicate get saved in gather_duplicate_posts_data().

    Once the information is saved, in the case of a real run, the attachment is deleted via a call to wp_delete_attachment().

    gather_duplicate_posts_data()

    This method captures the post ID, title, and URL of the original and duplicate PDF posts. In the case of the duplicate, it will also attempt to capture the size of the file. This way, we can see how much data is being removed.

    $this->duplicate_posts_to_log[] = array(
      'original_post_id'          => $matching_post_title_id,
      'original_post_title'       => $this->unique_post_titles[$matching_post_title_id],
      'original_pdf_url'          => get_attached_file( $matching_post_title_id ),
      'duplicate_post_id'         => $duplicate_post->ID,
      'duplicate_post_title'      => $duplicate_post->post_title,
      'duplicate_pdf_url'         => $duplicate_file,
      'duplicate_pdf_file_exists' => $duplicate_file_exists,
      'duplicate_pdf_filesize'    => $duplicate_file_size
    );
    

    The data is added to the $duplicate_posts_to_log property as a nested array. This allows us to use each array as a row in a CSV file which gets created by log_results().

    Once each post object in the query is checked for a duplicate, the pdf-deduplication-unique-post-titles option is updated to match the current version of the $unique_post_titles array via save_unique_post_titles_to_options().

    log_results()

    Once the unique posts are recorded, the duplicates get logged. In addition to printing some basic stats about the batch in the command line, the method makes use of the built-in WP_CLI\Utils method write_csv to create a CSV file containing the information in $duplicate_posts_to_log.

    The file gets stored in the plugin directory under “logs”.

    The script is done. Any duplicates will be logged and deleted and the PDF attachments will have been cleaned up.

    Script Clean Up

    To avoid bloat from running the script, I created two extra WP-CLI commands, pdf-media-dedup-clear-options and pdf-media-dedup-delete-logs. These clear out any options created in the wp_options table and delete any log files, respectively.

    To be continued…

    Follow along for the breakdown of jb-dlp-document-deduplication.php and how it clears out not only duplicates, but also posts with bad references. Exciting stuff!

    Update!

    Part two can be found here:

  • Get GOing; learning Go continued

    Picking back up with learning Go. Starting with Arrays.

    I have to admit, this will be a hard switch from PHP. When I create arrays in PHP, generally, I try to be as explicit as possible by using associative arrays so I can pull the information I want back out:

    <?php
    
    $new_array = array(
       'name' => 'Jane',
       'favorite_color' => 'Red',
       'favorite_fruit' => 'Apple',
    );
    
    ?>
    

    It doesn’t appear that Go arrays have this option. If I wanted to create the same array in Go, it would look like this:

    package main
    import ("fmt")
    
    func main() {
      var new_array = [3]string{"Jane", "Red", "Apple"}
    
      fmt.Println(new_array)
    }
    

    Coming from a PHP perspective, this seems complicated because it means that I have to remember the order in which I input information to get the correct data out. This is a hassle. In PHP, I just have to remember that if I want a name, I would fetch $new_array['name']. In Go, this would have to be new_array[0], which isn’t as explicit.
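
    For keyed lookups, Go has a separate built-in type, the map, which this stage of the tutorial has not covered yet. A minimal sketch, assuming a hypothetical describePerson helper that carries the same data as the PHP example:

```go
package main

import "fmt"

// describePerson is a hypothetical helper. It returns a map keyed by
// field name, the closest built-in analogue to a PHP associative array.
func describePerson() map[string]string {
	return map[string]string{
		"name":           "Jane",
		"favorite_color": "Red",
		"favorite_fruit": "Apple",
	}
}

func main() {
	person := describePerson()
	// Look the value up by key instead of by position.
	fmt.Println(person["name"])
}
```

    Like arrays, a map is still restricted to one key type and one value type. Arrays themselves stay position-only.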

    However, I think the reason behind this limitation is, in itself, a concept switch about how to use arrays. In PHP, arrays are a veritable Mary Poppins bag which can handle multiple data types, be deeply nested, and have items added or removed without consequence (until you need it). However, this can be a trap. If you haven’t thought through the structure of your data (for example, by creating a class object), it can be tempting to just throw everything in your array-bag. Longterm, this can cause your code to spaghetti out of control and be a performance pain.

    Go seems to seek to avoid this trap by requiring that all items in an array be of the same data type. The length is also part of the array's type, which controls how many items can be placed within it. In this way, arrays seem to be more like glass recycling bins in Germany.

    The bins only take one type of material, glass, which is sorted by color, and you can see when the bin is full. It doesn’t matter as much what order things are in because you know everything in the “brown glass” bin is a type of brown glass.

    Photo credit https://www.archer-relocation.com/how-to-recycle-in-germany/

    package main
    import ("fmt")
    
    func main() {
      var zipcodes = [3]int{ 10011, 90210, 20001 }
      var colors = [3]string{ "red", "blue", "green" }
      var fruits = [3]string{ "apples", "oranges", "cherries" }
    
      fmt.Println(zipcodes)
      fmt.Println(colors)
      fmt.Println(fruits)
    }
    

    This is the key point of the concept switch: in PHP, arrays can be used as a catch-all (even if it isn’t necessarily a good idea), and in Go, arrays are used to catch a specific type of data with a grouped theme, so you don’t care as much about the order of items. It almost seems like in PHP, it is easier to focus on where you need a collection of data because you can push so much into a single array (e.g. I am going to throw everything in my bag so I have it with me). In Go, a dev is required to think more about what they need, since it makes more sense to group themed items together (e.g. I am going to use suitcase cubes to make sure it’s nice and neat).

    PHP Arrays

    <?php
    // Let's describe Mary Poppins in an array
    
    $mary_poppins = array(
       'height' => 'average',
       'singing' => 'often',
       'bag_items' => array( 'lamp', 'umbrella', 'spoonful of sugar'),
       'cleanliness' => 'strict',
       'hair' => 'brown',
    );
    ?>
    

    Go Arrays

    package main
    import ("fmt")
    
    func main() {
       // Let's describe Mary Poppins in a series of arrays
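      var physical_traits = [2]string{ "brown hair", "average height"}
      var talents = [2]string{ "cleaning", "singing"}
      var bag_contents = [3]string{ "lamp", "umbrella", "spoonful of sugar"}
      // Go refuses to compile unused variables (or imports), so print them
      fmt.Println(physical_traits, talents, bag_contents)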
    }
    

    Now, I don’t want to give the impression there isn’t any structure to Go arrays. It is based on the index of the array (which starts at 0).

    For example, let’s say you are running a competition. The competitors have to face off in a way that you know who got last place before you know who got first. Let’s set up an array to store where each competitor placed:

    package main
    import ("fmt")
    
    var winners_names = [6]string{}
    
    func main() {
      fmt.Println(winners_names)
    }
    

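    Calling main() would print a pair of square brackets with nothing visible between them, because each of the six slots holds an empty string (printing with the %q verb instead would show ["" "" "" "" "" ""]). The array itself is not empty; every slot holds the zero value for a string. Now, let’s add a function to update the winners' names as we receive them.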

    package main
    import ("fmt")
    
    var winners_names = [6]string{}
    
    func main() {
      fmt.Println(winners_names)
    }
    
    func record_winner(place int, name string) {
       // Remember that indexes start with 0, so 6th place would actually be stored at winners_names[5]
       var index = place - 1
       winners_names[index] = name
    }
    

    I think this syntax would work (haven’t got to writing functions yet 😅). So, when we found out that the first two competitors won 5th and 6th place, we could call record_winner() to update the winners_names list.

    record_winner(5, "Jane Doe")
    record_winner(6, "John Doe")
    

    Now, if main() makes those two calls before printing, the last two slots would be filled in; formatted with the %q verb, the output would be ["" "" "" "" "Jane Doe" "John Doe"]. You could repeat this until all of the placements are filled in.

    So this is really useful when you have a situation where the ranking of an item can match the order in which it needs to be displayed.
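
    Putting the pieces above together, here is a minimal runnable sketch of the competition example. The bounds check and the boolean return value are my own additions for illustration, not something the tutorial prescribes:

```go
package main

import "fmt"

// winners_names holds one slot per placement; index 0 is 1st place.
var winners_names [6]string

// record_winner stores name in the slot for place (1-based).
// It returns false when place falls outside 1..6, instead of letting
// an out-of-range index crash the program.
func record_winner(place int, name string) bool {
	if place < 1 || place > len(winners_names) {
		return false
	}
	winners_names[place-1] = name
	return true
}

func main() {
	record_winner(5, "Jane Doe")
	record_winner(6, "John Doe")
	// %q quotes each string so the empty slots stay visible.
	fmt.Printf("%q\n", winners_names)
	// Prints ["" "" "" "" "Jane Doe" "John Doe"]
}
```

    The calls to record_winner() live inside main() here because Go only allows declarations, not statements, at the top level of a file.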

    Stopping here for today. I’ll pick this up next week with how to slice and dice things.

  • Lets GO!

    I am, by trade and tutelage, a PHP developer. I can also work in JavaScript, TypeScript, and React. The bulk of my work has been done in PHP. As someone pretty heavily entrenched in the WordPress ecosystem, this has served me well.

    However, as I am looking to new horizons, it is becoming apparent to me that I will need to stretch my language skills even further. So let’s start with Go. Please enjoy my ramblings as I learn 🙂

    I am starting with a really basic w3schools.com tutorial to see what the differences in syntax are.

    A Go file consists of the following parts:

    • Package declaration
    • Import packages
    • Functions
    • Statements and expressions
    https://www.w3schools.com/go/go_syntax.php

    From a PHP perspective, this sounds pretty similar.

    • Package declaration => PHP Namespaces. It gives the program or file scope or limitations
    • Import packages => In PHP, this is done with use statements. They allow you to pull in a namespace, a class, or even just a single function
    • Functions => This is pretty self explanatory
    • Statements and expressions => I am intrigued…

    Syntax

    In Go, statements are separated by ending a line (hitting the Enter key) or by a semicolon “;“.

    Hitting the Enter key adds “;” to the end of the line implicitly (does not show up in the source code).

    https://www.w3schools.com/go/go_syntax.php

    ❓ Does this mean there are hidden semicolons throughout the code, or does the complier read a new line as a semicolon?

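    The answer seems to be the latter, under certain conditions: the semicolons are never in the source file, and Go inserts one automatically when a line ends with a token that can end a statement. I guess for now, I will continue to explicitly write out my semicolons to avoid confusion until I am more comfortable working in Go. Also, to avoid causing myself grief when I switch back to other languages.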
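
    To see that both styles are treated the same, here is a small sketch with two hypothetical functions (the names are mine) that differ only in whether the semicolons are written out:

```go
package main

import "fmt"

// sumWithSemicolons writes every statement terminator explicitly.
func sumWithSemicolons() int {
	a := 1; b := 2;
	return a + b;
}

// sumWithNewlines relies on the line endings instead.
func sumWithNewlines() int {
	a := 1
	b := 2
	return a + b
}

func main() {
	// Both functions compile to the same logic and return the same value.
	fmt.Println(sumWithSemicolons() == sumWithNewlines())
}
```

    One thing to be aware of: the standard gofmt formatter removes the optional trailing semicolons, so code that has been run through it will look like the second function.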

    Comments

    • Single line comments start with //
    • Multiline comments are encased in /* {your comment here} */

    Creating variables

    • Use var
      • Benefit: this allows you to specify the type of the variable (e.g. var test string = "some words";)
      • This notation can be used within a function or without
      • Allows for value assignment to be done separately from declaration
      • Always requires at least a type or value
        • ✅ var test string = "string";
        • ✅ var test string;
        • ✅ var test = "string";
        • ❌ var test;
    • use :=
      • This is a new notation to me, I would have easily mixed it up for a logic statement if I had run into it in the wild
      • Benefit: It is a shorthand and so it may be faster to use
      • Like TypeScript, the compiler infers the type of the variable based on the value
      • This notation can only be used within a function
      • Value must always be assigned at declaration
      • It is not possible to declare a variable without a value using this notation (which makes sense)
        • ❌ test string := "string"; (a syntax error, since := does not take a type)
        • ❌ test := nil; (Go uses nil rather than null, and a bare nil gives the compiler no type to infer; an empty string, test := "";, is allowed)
        • ✅ test := "string";
    • Declaring multiple variables
      • Similar to a mathematical matrix, you can declare multiple variables and values at the same time
        • So, for example var a, b = 6, "Hello!"; is the same thing as declaring var a = 6; var b = "Hello!";
        • If you specify a type, every variable declared on that line gets that one type
          • ✅ var a, b string = "A", "B";
          • ❌ var a string, b int = "A", 2;
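
    The declaration forms from the list above, collected into one runnable sketch. The declareExamples function and its return values are hypothetical, chosen only so every variable gets used (Go will not compile a function with unused local variables):

```go
package main

import "fmt"

// declareExamples is a hypothetical helper that exercises each
// declaration form and returns the results.
func declareExamples() (string, int, string) {
	var typed string = "string" // type and value
	var empty string            // type only; starts as the zero value ""
	var inferred = 6            // value only; type inferred as int
	short := "Hello!"           // := form, only valid inside a function

	empty = typed // assignment done separately from declaration
	return empty, inferred, short
}

func main() {
	var a, b = 6, "Hello!"     // mixed types are fine when no type is written
	var c, d string = "A", "B" // one written type applies to both names
	fmt.Println(a, b, c, d)

	e, i, s := declareExamples()
	fmt.Println(e, i, s)
}
```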

    Go variable naming rules:

    • A variable name must start with a letter or an underscore character (_)
    • A variable name cannot start with a digit
    • A variable name can only contain alpha-numeric characters and underscores (a-z, A-Z, 0-9, and _)
    • Variable names are case-sensitive (age, Age and AGE are three different variables)
    • There is no limit on the length of the variable name
    • A variable name cannot contain spaces
    • The variable name cannot be any Go keywords
    https://www.w3schools.com/go/go_variable_naming_rules.php

    Variable names support camel case, pascal case, and snake case.

    Constants

    Seems constants work as expected: they are declared once, are unchangeable and read-only, and can be typed or have the type inferred from the value. To make constants easy to identify, the tutorial suggests writing them in uppercase letters (e.g. “USER_AGE”, not “userAge”, or “UserAge”, or “user_age”).
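
    A short sketch of both flavors of constant. The names and values are made up for illustration:

```go
package main

import "fmt"

// Typed constant: the type is written out.
const USER_AGE int = 30

// Untyped constant: the type is inferred from the value.
const SITE_NAME = "Example"

func main() {
	fmt.Println(USER_AGE, SITE_NAME)
	// USER_AGE = 31 // would not compile: constants cannot be reassigned
}
```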

    Output

    • Print()
      • Prints out the value only, can have multiple comma separated arguments passed in (e.g. Print( "Hello", " ", "World");) Reminds me of the Google Sheets function Concatenate.
    • Println()
      • Adds whitespace between arguments and new line at the end. So Println( "Hello", "World"); prints out Hello World.
    • Printf()
      • Allows integrating the type or value of a variable into a string.
    package main;
    import ("fmt");
    
    func main() {
       var userName string = "Jane";
    
       fmt.Printf( "Hello %v!", userName ); 
       // Prints "Hello Jane!"
       fmt.Printf( "The name %v is a %T", userName, userName ); 
       // Prints "The name Jane is a string"
    }
    

    The values are integrated into the string using formatting verbs.
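
    Here is a small sketch comparing the three output functions side by side. The describe helper is hypothetical; it uses fmt.Sprintf, which formats exactly like Printf but returns the string instead of printing it:

```go
package main

import "fmt"

// describe is a hypothetical helper: %v inserts the value and %T inserts
// the type of whatever is passed in.
func describe(value interface{}) string {
	return fmt.Sprintf("%v is a %T", value, value)
}

func main() {
	fmt.Print("Hello", " ", "World", "\n") // nothing added between the strings
	fmt.Println("Hello", "World")          // space between arguments, newline at the end
	fmt.Println(describe("Jane"))          // Jane is a string
	fmt.Println(describe(42))              // 42 is a int
}
```

    Note that the type verb is an uppercase %T; the lowercase %t is the verb for boolean values.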

    I’m stopping here for today. The next item up in the tutorial is arrays, and they are definitely different from how they work in PHP. I’ll save that for next time.