Handling large data with cron jobs and queue workers

One problem a lot of developers seem to run into, judging by their code, is handling large sets of data with background scripts. There are a few common mistakes I have seen when inheriting code from my predecessors, and I thought I would offer some solutions to the problems that tend to come up.

I will cover these issues as problems and solutions. The examples use Laravel, but conceptually it doesn’t really matter which framework you’re working in.

Problem #1 – The script takes too long (or many hours) to run!

This is quite a common one, particularly when inheriting code from days gone by, when the business and database were small. Then things grew, and suddenly the script that used to take ten minutes now takes many hours.

When I have to tackle this kind of issue, assuming I understand the script and what it is doing, there are a couple of ways to solve it.

Firstly, if your script is doing many different things, split it out into multiple scripts that each do a single job and run at their appropriate times. However, this probably isn’t ideal on its own.

Usually a script doing some daily processing looks something like this (in Laravel, but that’s not really relevant). In this admittedly awful example we’re calculating some cumulative business values for each customer.

This is not an actual feature or actual code; I’ve written it as a deliberately simple example, and honestly I’ve not even tried to run it.

$customers = Customer::all();

foreach ($customers as $customer) {
    $orderTotal = Order::where('customer_id', '=', $customer->id)->sum('subtotal');
    $thisMonthOrders = Order::where('customer_id', '=', $customer->id)
        ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
        ->sum('subtotal');

    $customer->total_spend = $orderTotal;
    $customer->total_spend_this_month = $thisMonthOrders;
    $customer->save();
}

In this example we’re getting the total value of the orders each customer has made in their lifetime and in the last month, then saving those against the customer record, presumably for easy/fast searching and filtering.

This is fine until you start hitting many thousands or hundreds of thousands of sales and/or customers; at that point the script is going to take a very long time to run.

Firstly, we could optimise the script to only fetch the orders from the last month and therefore only recalculate totals for the affected customers; that alone would have a huge impact. But that’s not the point I’m trying to make here (though it is relevant).
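
As a rough illustration of that optimisation (a sketch only, not something I’ve run; it assumes the orders table’s created_at column and the same total_spend fields as the example above), you could narrow the starting set to customers who have actually ordered recently:

// Only recalculate customers who have placed an order in the last month
$recentCustomerIds = Order::where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
    ->distinct()
    ->pluck('customer_id');

foreach (Customer::whereIn('id', $recentCustomerIds)->get() as $customer) {
    $customer->total_spend = Order::where('customer_id', '=', $customer->id)->sum('subtotal');
    $customer->total_spend_this_month = Order::where('customer_id', '=', $customer->id)
        ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
        ->sum('subtotal');
    $customer->save();
}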

In terms of execution speed, it would be much more efficient to do something like the following, though it does assume you have a queue worker or Laravel Horizon running (or some other mechanism for handling jobs on a queue).

foreach (Customer::all() as $customer) {
    CalculateTotals::dispatch($customer);
}

// app/Jobs/CalculateTotals.php

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class CalculateTotals implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $customer;

    public function __construct(Customer $customer)
    {
        $this->customer = $customer;
    }

    public function handle()
    {
        // The heavy lifting now happens in a queue worker rather than the cron script
        $this->customer->total_spend = Order::where('customer_id', '=', $this->customer->id)
            ->sum('subtotal');
        $this->customer->total_spend_this_month = Order::where('customer_id', '=', $this->customer->id)
            ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
            ->sum('subtotal');
        $this->customer->save();
    }
}

What we have actually done here, to speed up execution, is use the cron job to work out what needs handling/processing, and then pass the actual work off to the job queue. This means that however many queue workers you have can share the time-consuming parts of the job. The cron script only runs for as long as it takes to find the relevant records and dispatch the jobs via Redis or whatever queueing mechanism you’re using.

If you have 10,000 customers to process, and it takes 2 seconds to process each one, it would take 20,000 seconds to run the script.

If you run 20 queue workers, and all things are equal, the work from your script is now divided between 20 workers, so each worker has to deal with 500 jobs. 500 jobs multiplied by 2 seconds per job is 1,000 seconds (instead of 20,000).

NB: Personally, I would calculate these numbers on the fly. If they needed saving in the database, I would calculate them in a listener for an event fired when orders are added, so they’re always live-ish. Naturally, I would not be using ::all() as the basis for any kind of processing in a large environment.
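
As a very rough sketch of that listener approach (hypothetical names on my part: an OrderCreated event carrying the order, with a customer relation on the order model; I haven’t run this), it might look something like the following:

// app/Listeners/UpdateCustomerTotals.php (hypothetical example)
class UpdateCustomerTotals
{
    public function handle(OrderCreated $event)
    {
        // Recalculate only this customer's totals whenever one of their orders is created
        $customer = $event->order->customer;

        $customer->total_spend = Order::where('customer_id', '=', $customer->id)->sum('subtotal');
        $customer->total_spend_this_month = Order::where('customer_id', '=', $customer->id)
            ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
            ->sum('subtotal');
        $customer->save();
    }
}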

Problem #2 – The script uses too much memory, and crashes

This one depends on where the memory is coming from, but sticking with our above example…

$customers = Customer::all();

This is a pretty bad place to start. It will work for a while, but as your database grows you’re eventually going to max out your memory trying to retrieve the whole data set.

$customers = Customer::where('last_order_at', '>=', $carbonSomething->format('Y-m-d'))->get();

This is a better place to start: whittle down how much you’re retrieving. But you might still be bringing back too much data.

Again, this is a demonstration; I’ve not run it, it’s just for the concept.

$relevantCustomersCount = Customer::where($qualifiers)->count();
$perPage = 10000;
$pages = ceil($relevantCustomersCount / $perPage);
$currentPage = 1;

while ($currentPage <= $pages) {
    // A consistent ordering matters here, otherwise pages can overlap or skip rows
    $customers = Customer::where($qualifiers)
        ->orderBy('id')
        ->limit($perPage)
        ->offset(($currentPage - 1) * $perPage)
        ->get();

    // Now we only have up to 10,000 customers in memory at once
    foreach ($customers as $customer) {
        CalculateTotals::dispatch($customer); // Run it as a background job (the job from Problem #1)
    }

    $currentPage++;
}
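
If you’re on a reasonably recent version of Laravel, the query builder’s chunkById() helper does much the same thing with less ceremony. A sketch, reusing the placeholder $qualifiers from above:

Customer::where($qualifiers)->chunkById(10000, function ($customers) {
    foreach ($customers as $customer) {
        CalculateTotals::dispatch($customer);
    }
});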

Problem #3 – I can’t run my script as jobs, because it has to report back with totals, etc.

Now, this is an awkward one, but you simply need to change how you think about gathering that information: you need a persistent way of storing it, so that it can be summarised later.

There are a few ways to achieve this:

  1. When you start running the script, create a record in the database (or somewhere similar) that your jobs can report back to, or some other means of consolidating the results (be wary of having every job run an incremental update against a “total customers processed” type column!)
  2. Utilise a caching mechanism to remember information about the current run, so it can be shared between jobs and reported on later (see the sketch after this list)
  3. Consolidate your results somewhere that is very cheap to query, and run your finalised report (which is perhaps emailed) once you know all processing has finished
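
As a rough sketch of option 2 (assumptions on my part: a shared cache store such as Redis, and a made-up cache key; I haven’t run this), each job could increment a counter and whatever finalises the run could read it back:

use Illuminate\Support\Facades\Cache;

// Inside each job's handle() method, once the real work is done
Cache::increment('totals-run:processed');

// Later, in whatever finalises the run (a scheduled report command, for example)
$processed = Cache::get('totals-run:processed', 0);
// ...email or log $processed however suits you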

Parting Thoughts

The normal rules of performance still apply here…

  1. Only pull out from SQL what you actually need (select customer_id, subtotal, items_count from orders instead of select * from orders)
  2. Do what you can at the SQL level where possible, e.g. SELECT SUM(subtotal) FROM orders WHERE customer_id = ? (see the sketch after this list)
  3. Ensure you have effective and efficient indices on the relevant columns
  4. With MySQL, use InnoDB tables for speed and MyISAM for saving storage space (very simplified!)
  5. If you’re having issues with pulling information out of the database (rather than processing it), consider NoSQL solutions like MongoDB
  6. Consider whether there are better ways to get the data into your database so that it can be retrieved more efficiently, or whether a different database design would suit your needs more closely
  7. Consider working at near-live speed instead of bulk processing, even if you hold that information back to be applied the following day (for business logic purposes)
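
To illustrate points 1 and 2 in Eloquent terms (a sketch only; the column names follow the earlier examples and $customer is assumed to be in scope):

// Point 1: only select the columns you actually need
$orders = Order::select('customer_id', 'subtotal', 'items_count')
    ->where('customer_id', '=', $customer->id)
    ->get();

// Point 2: better still, let the database do the aggregation
$totalSpend = Order::where('customer_id', '=', $customer->id)->sum('subtotal');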

I'd love to hear your opinion on what I've written. Anecdotes are always welcome.
