Retrieving 3+ million MySQL records in less than 2 minutes

I’m going to jump straight into this one; I have a task to do. This task consists of going through 3.2 million records in a MyISAM table, retrieving some information, doing some basic processing, and moving certain data elsewhere.

This is a very quick cheat, or hack, to get that information out, but first we need to start with the problem. Let’s say we are using Laravel: we can use DB Chunking, and this works pretty well, but it’s worth understanding how it works at its core, alongside MySQL.
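For reference, Laravel’s chunking boils down to something like this. Treat it as a minimal sketch: the table name matches the examples below, and the callback body is just a placeholder for whatever processing you need to do.

use Illuminate\Support\Facades\DB;

// Walk the table 10,000 rows at a time; under the hood chunk() pages with LIMIT/OFFSET.
DB::table('mytable')->orderBy('id')->chunk(10000, function ($rows) {
    foreach ($rows as $row) {
        // do the basic processing here
    }
});

(Laravel also offers chunkById(), which filters on the id column instead of using an offset, and is very much in the spirit of the hack further down.)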

We start with a fairly innocuous paginated query…

SELECT * FROM mytable ORDER BY id ASC LIMIT 0, 10000

Great, and then we continue on with that until we reach the magic 3.2 million records (3,200,000 divided by 10,000 items per page is 320 pages). Fair enough, cool.
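Each subsequent page just bumps the offset, using MySQL’s LIMIT offset, count form. Page 2, for example, looks like this:

SELECT * FROM mytable ORDER BY id ASC LIMIT 10000, 10000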

If you’ve never tried to do this, I will save you the pain. Page 1 will be like lightning (say half a second), page 2 will be a bit slower, and by the time you get to page 50 or 100 you will be looking at near enough a minute per page.

I was actually running batches of 100,000, so I was looking at 32 pages to cover my 3.2 million records. Using the LIMIT keyword, looping the batches (and I mean just retrieving each page and running a foreach in PHP, without doing anything else) took 14.5 minutes to run.
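To make that concrete, the slow version I was timing boils down to something like this. It’s only a sketch: the connection details are placeholders, and the loop body deliberately does nothing.

// Naive OFFSET-based batching (the slow version).
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$batchSize = 100000;

for ($page = 0; $page < 32; $page++) {
    $offset = $page * $batchSize;
    $rows = $pdo->query("SELECT * FROM mytable ORDER BY id ASC LIMIT $offset, $batchSize");
    foreach ($rows as $row) {
        // do nothing; just fetching and looping the batches took 14.5 minutes in total
    }
}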

Not good. Especially when I have lots of intensive stuff to do for each item in each loop. So here is the little hack to save you time; it does come with caveats.

  1. It assumes you have an id column that is unique and sequential (typically an auto-increment primary key)
  2. This will not work if you have to do any kind of ordering
  3. We can’t guarantee every batch will be the same size (ids may have gaps where records are missing), so your processing must not rely on fixed batch sizes

So instead of…

SELECT * FROM mytable ORDER BY id ASC LIMIT 0, 10000

We’re going to do this instead (sticking with my real batch size of 100,000)…

SELECT * FROM mytable WHERE id >= 1 AND id <= 100000

What this means is that instead of MySQL reading and discarding the first 300,000 rows and then counting out your next 100,000 (which is what LIMIT 300000, 100000 forces it to do), you perform a much simpler range query that can use the index on id to jump straight to the rows you want.
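Put into a loop, the whole thing looks roughly like this. Again, a sketch only: the connection details are placeholders, and the upper bound comes from a simple MAX(id) lookup.

// Range-based batching on the id column (the fast version).
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$batchSize = 100000;
$maxId = (int) $pdo->query('SELECT MAX(id) FROM mytable')->fetchColumn();

for ($start = 1; $start <= $maxId; $start += $batchSize) {
    $end = $start + $batchSize - 1;
    $stmt = $pdo->prepare('SELECT * FROM mytable WHERE id >= ? AND id <= ?');
    $stmt->execute([$start, $end]);
    foreach ($stmt as $row) {
        // process the record; batches can come back short if ids have gaps (caveat 3)
    }
}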

The result of this for me (querying against a 5GB, 3.2-million-record table) was that it took around 90 seconds to retrieve and loop all the records, instead of 14.5 minutes.

When you need to loop through and trigger a process for every single record in a big table, this is definitely a faster way of doing it. Assuming there is no other viable way of retrieving the information, of course.

I'd love to hear your opinion on what I've written. Anecdotes are always welcome.
