Retrieving 3+ million MySQL records, in less than 2 minutes

I’m going to jump straight into this one, I have a task to do. This task consists of going through 3.2 million records in a MYISAM table, retrieving some information, doing some basic processing, and moving certain data elsewhere.

This is a very quick cheat or hack, to get that information out, but first, we need to start with the problem. Let’s say we are using Laravel, we can use DB Chunking, and this works pretty well, but it’s worth understanding how this works at its core, alongside MySQL.

We start with a fairly innocuous paginated query…

SELECT * FROM mytable ORDER BY id ASC LIMIT 0, 10000

Great, and then we continue on with that, until we reach the magic 3.2 million records. (3,200,000 divided by 10,000 items per page is 320 pages). Fair enough, cool.

If you’ve never tried to do this, I will save you the pain. Page 1 will be like lightening (say half a second), page 2 will be a bit slower, by the time you get to page 50 or 100 you will be looking at near enough a minute per page.

I was actually running batches of 100,000 so looking at 32 pages to cover my 3.2 million records. Running it using the LIMIT keyword meant that looping the batches (and I mean retrieving the page, and running a foreach in the PHP without doing anything) took 14.5 minutes to run.

Not good. Especially when I have lots of intensive stuff to do for each item in each loop. So here is the little hack to save you time, it does come with caveats.

  1. It assumes you have an id column that is unique and sequential
  2. This will not work if you have to do any kind of ordering
  3. We can’t guarantee every page will be the same size (records may be missing), so your mechanisms need to not rely on that idea

So instead of…

SELECT * FROM mytable ORDER BY id ASC LIMIT 0, 10000

We’re going to do…

SELECT * FROM mytable WHERE id >= 1 AND id <= 100000

What this means is that instead of counting to 300,000 and then counting your next 100,000 rows, you perform a much simpler query to retrieve the desired results.

The result of this for me (querying against a 5gb 3.2 million record table) was that it took around 90 seconds to retrieve and loop all the records, instead of 14.5 minutes.

When you need to loop through and trigger a process for every single record in a big table, this is definitely a faster way of doing it. Assuming there is no other viable way of retrieving the information, of course.

Laravel Tutorials

Handling large data with cron jobs and queue workers

One problem a lot of developers seem to come into, from reading their code, is when they try to handle large sets of data with background scripts. There are a few common mistakes that I have seen from inheriting the code of my predecessors, and I thought I would offer some solutions to the problems that can come about.

I will cover off these issues as problems and solutions, and I’m going to use Laravel solutions, but conceptually it doesn’t really matter what framework you’re working in.

Problem #1 – The script takes too long (or many hours) to run!

This is quite a common one, particularly when inheriting code from days gone by, when the business and database were small. Then things grew, and suddenly that script which took ten minutes to run, takes many hours.

When I have to tackle these kinds of issues, assuming I understand the script and what it is doing there are a couple of ways to solve the issue;

Firstly, if your script is doing many different things, split it out into multiple scripts, which run at their appropriate times, which do a single job each. However, this probably isn’t the ideal.

Usually a script that is doing some daily processing looks something like this (in Laravel, but it’s not really relevant), this is awful example we’re calculating some cumulative business values for the customer.

This is not an actual feature, or actual code, I’ve written it as a deliberately simple example. Honestly I’ve not even tried to run this code, it’s just an example

$customers = Customer::all();
foreach($customers as $customer){
    $orderTotal = Order::where('customer_id', '=', $customer->id)->sum('subtotal);
    $thisMonthOrders = Order::where('customer_id','=',$customer->id)->where('created_at','>=',now()->subMonths(1)->format('Y-m-d'))->sum('subtotal');
    $customer->total_spend = $orderTotal;
    $customer->total_spend_this_month = $thisMonthOrders;

In this example we’re getting the total value of the orders this customer has made in their lifetime and the last month, and then saving it against the customer record. Presumably for easy/fast search and filtering or something.

This is fine until you start hitting many thousands or hundreds of thousands of sales and/or customers, then the script is going to take a very long time to run.

Firstly, we can optimise the script, to only fetch the orders in the last month and therefore only calculate totals for the affected customers, that would have a huge impact. Buut that’s not the point that I am trying to make here. (Though it would be relevant).

It would be much more efficient in terms of execution speed to do something like the following, though it does assume you have a Queue Worker or Laravel Horizon running (or some kind of mechanism to handle jobs on a queue)

foreach(Customer::all() as $customer){

// app/Jobs/CalculateTotals

class CalculateTotals implements ShouldQueue{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
    protected $contact;
    public function __construct(Contact $contact)
        $this->contact = $contact;
    public function handle()
        $this-contact->total_spend = Order::where('customer_id', '=', $customer->id)->sum('subtotal);
        $this-contact->total_spend_this_month = Order::where('customer_id','=',$customer->id)->where('created_at','>=',now()->subMonths(1)->format('Y-m-d'))->sum('subtotal');

What we have actually done here, to speed up the execution time, is use the cron job to calculate what we should be handling/processing, but then passing off the work through to the job queues to handle. This means that you can have however many queue workers you have doing the processing and time consuming parts of the job. The cron will only be running for as long as it takes to find the relevant jobs to execute, and dispatch them via Redis or whatever queueing mechanism you’re using.

If you have 10,000 contacts to process, and it takes 2 seconds to process a contact, it would take 20,000 seconds to run the script.

If you run 20 queue workers, all all things are equal, the work from your script is now divided by 20 workers, or each worker would now have to deal with 500 jobs. 500 jobs multiplied by 2 seconds per job is 1,000 seconds to run (instead of 20,000).

NB: Personally, I would calculate these numbers on the fly, if they needed saving in the database I would calculate them on a listener to an event for when the orders were added, so they’re always live-ish, naturally I would not be using an ::all() as the basis to start any kind of processing in a large environment.

Problem #2 – The script uses too much memory, and crashes

This one depends on where the memory is coming from, but sticking with our above example…

$contacts = Contact::all();

Is a pretty bad place to start, it will work for a while, but if your database grows you’re eventually going to max out your memory trying to retrieve your whole data set.

$contacts = Contact::where('last_order_at', '>=', $carbonSomething->format('Y-m-d')->get();

This is a good place to start, whittle down how much you’re retrieving, but you might still be bringing back too much data.

Again, this is a demonstration, I’ve not run it, it’s just for the concept

$relevantContactsCount = Contact::where($qualifiers)->count();
$perPage = 10000;
$pages = ceil($relevantContactsCount / $perPage);
$currentPage = 1;
while($currentPage < $pages){
    $contacts = Contact::where($qualifiers)->orderBy('field','order')->limit($perPage)->offset(($currentPage - 1) * $perPage)->get();
    // Now we have 10,000 contacts only
    foreach($contacts as $contact){
        MyJobFromAbove::dispatch($contact); // Run it in a background job

Problem #3 – I can’t run my script as jobs, because it has to report back with totals, etc.

Now, this is an awkward one, but you simply need to change the way that you’re thinking about how you gather that information. You need a persistent way of storing the information, so that it can be summarised later.

There are a couple of ways to achieve this;

  1. When you start running the script, you create a record in the database or somewhere which you can report back to from your jobs (be wary of having every job run an incremental update for a “total contacts processed” type column!), or a way of consolidating the results
  2. Utilise caching mechanisms to remember certain information about what is running, so information can be shared between jobs and later on reported on
  3. Consolidate your results in somewhere very easy to calculate, and run your finalised report (which is perhaps emailed) once you know all processing is finished

Parting Thoughts

The normal rules of performance here still count…

  1. Only pull out from SQL what you actually need (select contact_id, subtotal, items_count from orders instead of select * from orders)
  2. Do what you can at SQL level, where possible SELECT SUM(subtotal) FROM orders WHERE contact_id = ?
  3. Ensure you have effective and efficient indices on the relevant columns
  4. With MySQL, for speed, use InnoDB tables, for saving storage space use MYISAM (very simplified!)
  5. Having issues with pulling information out of the database (rather than processing it?) consider NoSQL solutions like MongoDB
  6. Consider if there are better ways to get the data into your database, so that it can be retrieved more efficiently, or if there are other database design implementations which will suit your needs more closely
  7. Consider working in a near-live speed, instead of bulk processing, even if you hold that information off to be implemented the following day (for business logic purposes)

Using MySQL to return complex tagged results

Hi everyone

In this article I thought I would tackle complex tag searching, using MySQL. There are lots of ways this can be achieved at a PHP level, through looping and checking/comparing.

However, if you are working with a large dataset, and tagging is a required piece of functionality, for example in an enterprise blog, or a CRM, then you’ll need to get your results direct from the database, rather than loading into memory and looping, which can work in smaller applications.

We are going to cover 3 types of tag searching, and those types of search working with and correlating all 3 of those results together.

  • Record must have all specified tags (e.g. this contact must be both a customer and high value)
  • Record must have any specified tags (e.g. this post must be tagged with laravel or mvc)
  • Record must not have any specified tags (e.g. don’t show anybody from a CRM which has already received the “A” campaign, so that “A” customers don’t receive “B” campaign, too)

Outlining the example

To keep things simple, the example we’re going to work with is as follows;

  1. We have a customers table
  2. Every customer record has one, many, or no corresponding tags in the customer_tags table
  3. The customer_tags table has customer_id and tag_id bridging the customers and tags tables, respectively

The query we are working to needs to do return records which conform to the following tagged search

  1. Must have tags 999 and 888
  2. Must also have any of the tags 777, 666, 555
  3. Must not have any of the tags 444 or 333

Finding records which have all tags

The first thing we need, is a successful WHERE statement which will do this, of course, you are then going to want it to be programmatically generated.

In the example below we’re doing is joining on the mandatory tags, this is arguably the simplest part of the selection, because we’re doing an inner join – if the tag doesn’t exist, the record won’t be available, which means we don’t need to do any WHERE filtering etc.

Notice that I am naming the contact_tags joins as mh1 and mh2 (mh being short for must have)

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON = mh1.customer_id
AND mh1.tag_id = 999

JOIN customer_tags mh2
ON = mh2.customer_id
AND mh2.tag_id = 888

Finding records which also have any of the any-specified tags

Now we have 999 and 888 records only, we now need to make sure that we have one more more of the any tags (777, 666, and 555)

What we need here is to look for the connections; we need to do these as an outer join, as it is perfectly valid for the corresponding record to not be found.

Following that, we need to wrap a WHERE statement which ensures that at least one of those records were returned.

For the must-have tags we did an inner join, this means that unless the corresponding record is found, the result will be returned.

In this case we’re going to do an OUTER JOIN on the LEFT. The join being an OUTER JOIN means that the (contact) record will be returned, even if the corresponding contact_tag record was not found. Being a LEFT JOIN means that we’re going to base from our contacts table, we’re not going to return irrelevant contact_tags records.

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON = mh1.customer_id
AND mh1.tag_id = 999
JOIN customer_tags mh2
ON = mh2.customer_id
AND mh2.tag_id = 888

/* Join on the have-any tag relationships */
LEFT OUTER JOIN customer_tags ha1
ON = ha1.customer_id
AND ha1.tag_id = 777
LEFT OUTER JOIN customer_tags ha2
ON = ha2.customer_id
AND ha2.tag_id = 666
LEFT OUTER JOIN customer_tags ha3
ON = ha3.customer_id
AND ha3.tag_id = 555

Now this in itself isn’t going to actually solve the issue we have around making sure we have 777, 666, or 555 – it is simply going to join on those relationships, if they exist.

This is where we need a WHERE statement to filter the results for us.


This statement means that if any of the have any (ha) tag relationships were found (the brackets thus returning TRUE) will be valid and returned

Does not have any of these tags

So the final part in this requirement, is to ensure that the records returned do not have any of the specified tags that we have to avoid.

We are going to do this in a very similar way to how we did the have any tags, see below

LEFT OUTER JOIN customer_tags nh1
ON = nh1.customer_id
AND nh1.tag_id = 444

LEFT OUTER JOIN customer_tags nh2
ON = nh2.customer_id
AND nh2.tag_id = 333

And, to make sure they were not returned, we’re going to do the following


So we’re LEFT OUTER JOINing on here for the same reason we left outer joined on the have any tags – it is valid for this tag to not exist, in fact, in this case we actively don’t want it to exist.

Putting the query together

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON = mh1.customer_id
AND mh1.tag_id = 999
JOIN customer_tags mh2
ON = mh2.customer_id
AND mh2.tag_id = 888

/* Join on the have-any tag relationships */
LEFT OUTER JOIN customer_tags ha1
ON = ha1.customer_id
AND ha1.tag_id = 777
LEFT OUTER JOIN customer_tags ha2
ON = ha2.customer_id
AND ha2.tag_id = 666
LEFT OUTER JOIN customer_tags ha3
ON = ha3.customer_id
AND ha3.tag_id = 555

/* Join on the must not have tag relationships */
LEFT OUTER JOIN customer_tags nh1
ON = nh1.customer_id
AND nh1.tag_id = 444
LEFT OUTER JOIN customer_tags nh2
ON = nh2.customer_id
AND nh2.tag_id = 333

/* Ensure at least one of the have-any relationships is found */

/* Ensure not have relationships don't exist */

/* Never return duplicate customer */

/* Order by last updated customer records */
ORDER BY c.updated_at DESC

Adding further search filtering

I am not actually going to fully cover this, that will be on you with your application to make the best decisions, if you want to build a SQL query (not a prepared statement) then you may find this line handy, I can’t vouch for its safety.

$sanitised = DB::connection()->getPdo()->quote($string);

Dynamically Generating Your Query

All you really need to do is variablise (yes, I made that word up) your query generation;

I’m sorry, this isn’t PSR-2, I’m not going to lie to you, I wrote it straight into WordPress, and don’t want to spin up containers to run PHPCBF/PHPCS against it

// These are the must have tags
$mustHaveAllTags = [999,888,777];
// Must have 1+ of these tags
$mustHaveAnyTags = [666,555];
// Must not have any of these tags
$mustNotHaveTags = [444,333];
// Pagination of the query
$page = 1;
$perPage = 100;
// These are the components of our WHERE statement
$whereStatements = [];
// This is where we will hold our SQL query
$sql = '';
// Loop the must have tags
foreach($mustHaveAll Tags as $tagId){
$joinName = "mh$tagId";
$sql .= "
JOIN customer_tags $joinName
ON $joinName.customer_id =
AND $joinName.tag_id = $tagId";
$wheres[] = "$ IS NOT NULL";
// Handle the must have any tags
$anyWheres = [];
foreach($haveAny as $tagId){
$joinName = "ha$tagId";
$sql .= "
LEFT OUTER JOIN customer_tags $joinName
ON $joinName.customer_id =
AND $joinName.tag_id = $tagId";
$anyWheres[] = "$ IS NOT NULL";
$wheres[] = "(" . implode(" OR ", $anyWheres) . ")";
// Not have any tags
foreach($notHave as $tagId){
$joinName = "nh$tagId";
$sql .= "
LEFT OUTER JOIN customer_tags
$joinName ON $joinName.customer_id =
AND $joinName.tag_id = $tagId";
$wheres[] = "$ IS NULL";
// Append all the WHERE statements we created
$sql .= " WHERE " . implode(" AND ", $wheres);
// Group by the contact
$sql .= " GROUP BY ";
// Calculate pagination
$offset = ($page-1) * $perPage;
$limit = $perPage;
// Append query limiting
$sql .= " LIMIT $offset, $limit ";

Next Steps

The next steps on this, to be honest, are beyond the scope of this article. What you now have is a query which will provide your tagged results as appropriate.

For me, I would now extract that functionality into a class, probably 2 (one responsible for providing a search “interface”, and one for generating SQL), and depending on how you want to use Eloquent or raw SQL either extract the custom searching (if you wanted to search by name, for example) into Eloquent, using prepared statements, or escape strings and queries (I would personally discourage the latter approach).

If you do want to do it in Eloquent you will incur a second query, but there are (from a programmatic perspective) some advantages (like not worrying about generating and executing prepared statement), if this is the case only select from your query. From where run whatever normal Eloquent stuff you want, and simply add

$query->whereIn('id', $idsReturnedFromYourQuery);

What about performance? Joins are not performant

I’ve tested this kind of query with up to about 10 joins (10 tags), on a dataset of about 500,000 base records and it ran in about 300ms on a $40 Digital Ocean droplet.

Make sure that customer_tags has a compound index set on customer_id + tag_id to make the lookup faster, and foreign key both fields in that table.

In a lot of applications, this will be performant enough. When you start hitting the millions of records I am sure it will slow down and you will need a different approach.

Ask Johno Laravel Tutorials

Model Search and Filtering in Laravel

Hi guys, as I have written this functionality a few times at this point, and I have been asked how to do it a couple of times, too. That is Laravel searching and filtering, I am going to develop a simple composer package which I can use to allow users to search and filter Laravel (Eloquent) Models using PHP traits which can be used on Laravel controllers.


I am going to work on the assumption that you are comfortable with basics.

I shall also assume you’re fairly comfortable with the Illuminate HTTP Request object, a Controller, and Eloquent Model.


So the first thing we want to tackle, as a concept, is that the best way, in my experience, is to start with a query, and then modify it as required. This allows us to optionally do almost anything, but equally allows us to do nothing, and simply return an unfiltered set if we want to.

This point would be true outside of Eloquent too, even if working with vanilla SQL.

Starting Off

To start with you want something like this within your Controller.

public function index(Request $request)
// Of course you can change this if you don't want to select everything
$query = YourModel::select('*');

Now we have the $query object which we can work with as and where we want to.

Filtering Records

Excluding deleted/is deleted, as we’ll cover that later on.

Let’s say you now want to filter results on author_id just to keep things simple.

$query->where('author_id', '=', $request->author);

Of course take into account everything here, you may need to do sanitisation or validation or anything else in there.

Multiple Filters, Save Lines

If you have many fields on which you wish to do this I usually find something like this is better

$fields = ['author_id', 'category_id'];
foreach($fields as $field){
$query->where($field, '=', $request->$field);

That example assumes the exposure of your field names, you could always use an associative array so you could do “author” maps to “author_id” and so on, if you were so inclined.

Fluffy Search

Now you might want to do a fluffy search, taking the assumption of posts with content and an author name, and a category name (stored within the model) you could do something like this

$searchFields = ['title','content','author_name','category_name'];
$query->where(function($query) use($request, $searchFields){
$searchWildcard = '%' . $request->search . '%';
foreach($searchFields as $field){
$query->orWhere($field, 'LIKE', $searchWildcard);

Handling Trash

You probably want something which allows you to see soft deleted models, if you’re using them. I do something like this

if($request->trash == true){
// Retrieve soft deleted models only, you can only do this if you're using soft deletion on this model

Ordering and Limiting

Now to handle ordering and pagination – which Eloquent beautifully does for us, all we need to do is let the user modify it if they want to.

// Set the query ordering
$query->order($request->orderBy ?? 'updated_at', $request->order ?? 'DESC');

$perPage = $request->per_page ?? 25;

// It is a good idea to make sure we cap this
if($perPage > $maximumPerPage){
$perPage = $maximumPerPage;

// Final part - get the results
$results = $query->paginate($perPage);

// You will now need to do something with your results, and return some kind of Response - this could be a JSON response or adding the results to the data returned to the view, depending on your context

There’s an assumption here that our default order is updated at, last updated first, and that if not specified, we want to return 25 pages.

Of course you can store this information wherever you want.

In Summary, and Further Reading

Rather than creating a really complicated set of rules to handle all the things you may need from your API calls, start with a query, and then optionally modify that query.

Further Reading

Ask Johno Tutorials

An introduction to Checksums

Hi everyone

This is a quick introduction into checksums, and practically how to use them, at the request of someone through the Ask Johno section of the blog.

I know what a checksum is, but I’m not sure how to implement or use one (in PHP) to achieve what I need.

Anonymous Asker

They go on to explain what they are trying to achieve, which is essentially to verify a file hasn’t changed (in this case it’s the HTTP document sent by an API) before doing something.

Firstly, a checksum is a very small data snippet (datum) which represents something larger.

In my experience there are 2 main reasons to use one, these are;

  1. To verify a change in something larger
  2. To verify that a piece of data matches from that which was sent by its originator (checksum on an API payload)

I will very briefly cover both of these with PHP examples.

Note; I am using MD5 for simplicity, but depending on your requirements this likely won’t be the best option for your use case.

Edit: Please see Further Reading at the bottom of this article for more information about hashing

Using a Checksum to Detect Changes

Now I don’t know what we’ve got, but whatever it is; we are going to need a string representation, 2 main ways of getting this:

$stringRepresentation = serialize($myThingToCheck);


$stringRepresentation = json_encode($myThingToCheck);

Personally, I would advise PHP’s serialize because that will work with, and instantiate objects. However, the choice is yours depending on your use case, JSON is smaller than a PHP serialization.

Now that we have a string, we need to make it small and easy to check. Something like this will work fine;

$checksum = md5($stringRepresentation);

Now to check for changes, we just need the last checksum that we stored that something happened on.

if($checksum != $oldChecksum){
echo 'Something has changed';

So that covers how to use a checksum to detect changes in pieces of data, which is useful if you’re having to poll for changes.

Using a Checksum to Verify Validity

This is something I’ve noticed a couple of times, particularly when working in the financial industry, and around certain payment gateways.

It usually looks something like this; but it does change per API integration so be aware of that and follow their own documentation.

$secret = 'my_secret_api_key';

$payload = [
'foo' => 'Bar',
'another' => 'Thing',
'datetime' => '2018-12-25 00:00:00'

$jsonPayload = json_encode($payload);

$checksum = md5($jsonPayload . $secret);

$payload['checksum'] = $checksum;

// Do the rest of your stuff here, including sending the payload etc.

One advantage of this approach is that you never expose the API key in plain text.

If you were to want to verify on the API so that you’re the provider, rather than integration, you would simply do these steps in the opposite order, so it would look something like this

// Assuming you've done everything you need, and now have the $payload array/object back

$secret = 'the_secret_youre_expecting';

$checksumProvided = $payload['checksum'];
$checksum = md5(json_encode($payload) . $secret);

if($checksum != $checksumProvided){
// If the checksums don't match, in theory the secret key provided was incorrect

I think that about covers this topic, as a brief introduction to checksums, and how they’re often used in PHP.

Further Reading

PHP: hash() for more information about the best ways to hash data

I have deliberately not gotten into the discussion over hashing algorithms, as it really is out of the scope of this article, and indeed a whole book could be written on the topic alone.

Thanks to u/artemix-org and u/BradChesney79 on Reddit for suggesting this edit.