
Retrieving 3+ million MySQL records, in less than 2 minutes

I’m going to jump straight into this one, as I have a task to do. The task consists of going through 3.2 million records in a MyISAM table, retrieving some information, doing some basic processing, and moving certain data elsewhere.

This is a very quick cheat, or hack, to get that information out, but first we need to start with the problem. If we’re using Laravel we can lean on DB chunking, and this works pretty well, but it’s worth understanding how it works at its core, alongside MySQL.
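For reference, that chunking looks something like this (a minimal sketch; mytable and the batch size are just placeholders), and under the hood it pages through with the same kind of LIMIT/OFFSET query we’re about to look at:

use Illuminate\Support\Facades\DB;

// Pull the table through in batches of 10,000 rows per query
DB::table('mytable')->orderBy('id')->chunk(10000, function ($records) {
    foreach ($records as $record) {
        // ... basic processing for each record
    }
});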

We start with a fairly innocuous paginated query…

SELECT * FROM mytable ORDER BY id ASC LIMIT 0, 10000

Great, and then we continue on with that, until we reach the magic 3.2 million records. (3,200,000 divided by 10,000 items per page is 320 pages). Fair enough, cool.

If you’ve never tried to do this, I will save you the pain. Page 1 will be like lightning (say half a second), page 2 will be a bit slower, and by the time you get to page 50 or 100 you will be looking at near enough a minute per page.

I was actually running batches of 100,000, so I was looking at 32 pages to cover my 3.2 million records. Running it using the LIMIT keyword meant that looping the batches (and I mean just retrieving each page and running a foreach in PHP without doing anything) took 14.5 minutes.
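To see why, look at what one of the later batches is actually asking MySQL to do. By the 32nd batch the query is effectively the one below, and MySQL has to read and discard the first 3,100,000 rows before it can hand anything back:

SELECT * FROM mytable ORDER BY id ASC LIMIT 3100000, 100000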

Not good. Especially when I have lots of intensive stuff to do for each item in each loop. So here is the little hack to save you time, though it does come with caveats:

  1. It assumes you have an id column that is unique and sequential
  2. This will not work if you have to do any kind of ordering
  3. We can’t guarantee every page will be the same size (records may be missing), so your mechanisms need to not rely on that idea

So instead of…

SELECT * FROM mytable ORDER BY id ASC LIMIT 0, 10000

We’re going to do…

SELECT * FROM mytable WHERE id >= 1 AND id <= 100000

What this means is that instead of MySQL counting its way through every row before your offset (3,100,000 of them by that final batch) and then reading your next 100,000 rows, it can use the primary key to jump straight to the range you asked for, which is a much simpler query. You then just move the id range along by 100,000 for each batch.
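In a loop, that looks something like this (a rough sketch of the idea rather than production code; the table name and batch size are placeholders):

use Illuminate\Support\Facades\DB;

$batchSize = 100000;
$maxId = DB::table('mytable')->max('id');

for ($start = 1; $start <= $maxId; $start += $batchSize) {
    // One id range per query; missing ids just mean a slightly smaller batch
    $records = DB::table('mytable')
        ->where('id', '>=', $start)
        ->where('id', '<=', $start + $batchSize - 1)
        ->get();

    foreach ($records as $record) {
        // ... basic processing, move certain data elsewhere
    }
}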

The result of this for me (querying a 5GB, 3.2-million-record table) was that it took around 90 seconds to retrieve and loop through all the records, instead of 14.5 minutes.

When you need to loop through and trigger a process for every single record in a big table, this is definitely a faster way of doing it. Assuming there is no other viable way of retrieving the information, of course.


Handling large data with cron jobs and queue workers

One problem a lot of developers seem to run into, judging by their code, is handling large sets of data with background scripts. There are a few common mistakes I have seen from inheriting the code of my predecessors, and I thought I would offer some solutions to the problems that can come up.

I will cover these off as problems and solutions. I’m going to use Laravel for the solutions, but conceptually it doesn’t really matter what framework you’re working in.

Problem #1 – The script takes too long (or many hours) to run!

This is quite a common one, particularly when inheriting code from days gone by, when the business and the database were small. Then things grew, and suddenly that script which took ten minutes to run takes many hours.

When I have to tackle these kinds of issues, assuming I understand the script and what it is doing, there are a couple of ways to solve them:

Firstly, if your script is doing many different things, split it out into multiple scripts which each do a single job and run at their appropriate times. However, this probably isn’t the ideal solution.
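If you do go down that route, the scheduling side of it in Laravel might look something like this (the command names here are made up for illustration):

// app/Console/Kernel.php
protected function schedule(Schedule $schedule)
{
    // Each command does one job, at a time that suits it
    $schedule->command('customers:recalculate-lifetime-totals')->dailyAt('02:00');
    $schedule->command('customers:recalculate-monthly-totals')->dailyAt('03:00');
}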

Usually a script that is doing some daily processing looks something like this (in Laravel, but that’s not really relevant). In this deliberately awful example we’re calculating some cumulative business values for the customer.

This is not an actual feature or actual code; I’ve written it as a deliberately simple example. Honestly, I’ve not even tried to run this code, it’s just an example.

$customers = Customer::all();
foreach ($customers as $customer) {
    $orderTotal = Order::where('customer_id', '=', $customer->id)->sum('subtotal');
    $thisMonthOrders = Order::where('customer_id', '=', $customer->id)
        ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
        ->sum('subtotal');
    $customer->total_spend = $orderTotal;
    $customer->total_spend_this_month = $thisMonthOrders;
    $customer->save();
}

In this example we’re getting the total value of the orders this customer has made in their lifetime and the last month, and then saving it against the customer record. Presumably for easy/fast search and filtering or something.

This is fine until you start hitting many thousands or hundreds of thousands of sales and/or customers, at which point the script is going to take a very long time to run.

Firstly, we could optimise the script to only fetch the orders from the last month, and therefore only recalculate totals for the affected customers; that alone would have a huge impact. But that’s not the point I am trying to make here (though it would be relevant).

It would be much more efficient in terms of execution speed to do something like the following, though it does assume you have a queue worker or Laravel Horizon running (or some other mechanism to handle jobs on a queue).


foreach (Customer::all() as $customer) {
    CalculateTotals::dispatch($customer);
}

// app/Jobs/CalculateTotals.php

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class CalculateTotals implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $customer;

    public function __construct(Customer $customer)
    {
        $this->customer = $customer;
    }

    public function handle()
    {
        // The same calculations as before, but now one customer per queued job
        $this->customer->total_spend = Order::where('customer_id', '=', $this->customer->id)->sum('subtotal');
        $this->customer->total_spend_this_month = Order::where('customer_id', '=', $this->customer->id)
            ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
            ->sum('subtotal');
        $this->customer->save();
    }
}

What we have actually done here, to speed up the execution time, is use the cron job to work out what needs handling/processing, and then pass the actual work off to the job queue. This means that however many queue workers you have can share the processing and time-consuming parts of the job. The cron will only be running for as long as it takes to find the relevant jobs and dispatch them via Redis or whatever queueing mechanism you’re using.

If you have 10,000 customers to process, and it takes 2 seconds to process each one, it would take 20,000 seconds to run the script.

If you run 20 queue workers, and all things are equal, the work from your script is now divided between 20 workers, so each worker only has to deal with 500 jobs. 500 jobs multiplied by 2 seconds per job is 1,000 seconds to run (instead of 20,000).

NB: Personally, I would calculate these numbers on the fly. If they really needed saving in the database, I would recalculate them in a listener for an event fired when orders are added, so they’re always live-ish. Naturally, I would not be using an ::all() as the basis for any kind of processing in a large environment.
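As a sketch of that listener idea (the OrderCreated event and the listener class here are hypothetical, just to show the shape of it):

// app/Listeners/UpdateCustomerTotals.php

class UpdateCustomerTotals
{
    public function handle(OrderCreated $event)
    {
        // Recalculate the stored totals whenever a new order comes in,
        // so the values on the customer record stay live-ish
        $customer = $event->order->customer;

        $customer->total_spend = Order::where('customer_id', '=', $customer->id)->sum('subtotal');
        $customer->total_spend_this_month = Order::where('customer_id', '=', $customer->id)
            ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
            ->sum('subtotal');

        $customer->save();
    }
}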

Problem #2 – The script uses too much memory, and crashes

This one depends on where the memory is coming from, but sticking with our above example…

$customers = Customer::all();

This is a pretty bad place to start. It will work for a while, but as your database grows you’re eventually going to max out your memory trying to retrieve the whole data set.

$customers = Customer::where('last_order_at', '>=', $carbonSomething->format('Y-m-d'))->get();

This is a better place to start: whittle down how much you’re retrieving. But you might still be bringing back too much data in one go.

Again, this is a demonstration; I’ve not run it, it’s just for the concept.

$relevantCustomersCount = Customer::where($qualifiers)->count();
$perPage = 10000;
$pages = ceil($relevantCustomersCount / $perPage);
$currentPage = 1;
while ($currentPage <= $pages) {
    $customers = Customer::where($qualifiers)
        ->orderBy('id')
        ->limit($perPage)
        ->offset(($currentPage - 1) * $perPage)
        ->get();
    // Now we only ever have 10,000 customers in memory at once
    foreach ($customers as $customer) {
        CalculateTotals::dispatch($customer); // Run it in a background job
    }
    $currentPage++;
}

Problem #3 – I can’t run my script as jobs, because it has to report back with totals, etc.

Now, this is an awkward one, but you simply need to change the way that you’re thinking about how you gather that information. You need a persistent way of storing the information, so that it can be summarised later.

There are a few ways to achieve this:

  1. When you start running the script, create a record in the database (or somewhere similar) that your jobs can report back to, or some other way of consolidating the results (be wary of having every job run an incremental update against a “total customers processed” type column!)
  2. Utilise caching mechanisms to remember certain information about what is running, so it can be shared between jobs and reported on later (there’s a rough sketch of this after the list)
  3. Consolidate your results somewhere that is very easy to aggregate, and run your finalised report (which is perhaps emailed) once you know all the processing is finished
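As a rough sketch of that caching option (the cache key names are made up, and the cron would need to record how many jobs it dispatched under the dispatched key when it queues them):

use Illuminate\Support\Facades\Cache;

// Inside each job's handle(), once its work is done:
Cache::increment('totals-run:processed');

// In a scheduled check (or at the end of the last job), see if everything is accounted for:
$processed = Cache::get('totals-run:processed', 0);
$dispatched = Cache::get('totals-run:dispatched', 0);

if ($dispatched > 0 && $processed >= $dispatched) {
    // ... build the summary and email the finalised report
}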

Parting Thoughts

The normal rules of performance here still count…

  1. Only pull out from SQL what you actually need (SELECT customer_id, subtotal, items_count FROM orders instead of SELECT * FROM orders)
  2. Do what you can at the SQL level where possible, e.g. SELECT SUM(subtotal) FROM orders WHERE customer_id = ?
  3. Ensure you have effective and efficient indices on the relevant columns
  4. With MySQL, for speed, use InnoDB tables; for saving storage space, use MyISAM (very simplified!)
  5. If you’re having issues with pulling information out of the database (rather than processing it), consider NoSQL solutions like MongoDB
  6. Consider whether there are better ways to get the data into your database so it can be retrieved more efficiently, or whether another database design would suit your needs more closely
  7. Consider working at near-live speed instead of bulk processing, even if you hold that information back to be applied the following day (for business logic purposes)