Using MySQL to return complex tagged results

Hi everyone

In this article I thought I would tackle complex tag searching, using MySQL. There are lots of ways this can be achieved at a PHP level, through looping and checking/comparing.

However, if you are working with a large dataset, and tagging is a required piece of functionality, for example in an enterprise blog, or a CRM, then you’ll need to get your results direct from the database, rather than loading into memory and looping, which can work in smaller applications.

We are going to cover 3 types of tag searching, and those types of search working with and correlating all 3 of those results together.

  • Record must have all specified tags (e.g. this contact must be both a customer and high value)
  • Record must have any specified tags (e.g. this post must be tagged with laravel or mvc)
  • Record must not have any specified tags (e.g. don’t show anybody from a CRM which has already received the “A” campaign, so that “A” customers don’t receive “B” campaign, too)

Outlining the example

To keep things simple, the example we’re going to work with is as follows;

  1. We have a customers table
  2. Every customer record has one, many, or no corresponding tags in the customer_tags table
  3. The customer_tags table has customer_id and tag_id bridging the customers and tags tables, respectively

The query we are working to needs to do return records which conform to the following tagged search

  1. Must have tags 999 and 888
  2. Must also have any of the tags 777, 666, 555
  3. Must not have any of the tags 444 or 333

Finding records which have all tags

The first thing we need, is a successful WHERE statement which will do this, of course, you are then going to want it to be programmatically generated.

In the example below we’re doing is joining on the mandatory tags, this is arguably the simplest part of the selection, because we’re doing an inner join – if the tag doesn’t exist, the record won’t be available, which means we don’t need to do any WHERE filtering etc.

Notice that I am naming the contact_tags joins as mh1 and mh2 (mh being short for must have)

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON c.id = mh1.customer_id
AND mh1.tag_id = 999

JOIN customer_tags mh2
ON c.id = mh2.customer_id
AND mh2.tag_id = 888

Finding records which also have any of the any-specified tags

Now we have 999 and 888 records only, we now need to make sure that we have one more more of the any tags (777, 666, and 555)

What we need here is to look for the connections; we need to do these as an outer join, as it is perfectly valid for the corresponding record to not be found.

Following that, we need to wrap a WHERE statement which ensures that at least one of those records were returned.

For the must-have tags we did an inner join, this means that unless the corresponding record is found, the result will be returned.

In this case we’re going to do an OUTER JOIN on the LEFT. The join being an OUTER JOIN means that the (contact) record will be returned, even if the corresponding contact_tag record was not found. Being a LEFT JOIN means that we’re going to base from our contacts table, we’re not going to return irrelevant contact_tags records.

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON c.id = mh1.customer_id
AND mh1.tag_id = 999
JOIN customer_tags mh2
ON c.id = mh2.customer_id
AND mh2.tag_id = 888

/* Join on the have-any tag relationships */
LEFT OUTER JOIN customer_tags ha1
ON c.id = ha1.customer_id
AND ha1.tag_id = 777
LEFT OUTER JOIN customer_tags ha2
ON c.id = ha2.customer_id
AND ha2.tag_id = 666
LEFT OUTER JOIN customer_tags ha3
ON c.id = ha3.customer_id
AND ha3.tag_id = 555

Now this in itself isn’t going to actually solve the issue we have around making sure we have 777, 666, or 555 – it is simply going to join on those relationships, if they exist.

This is where we need a WHERE statement to filter the results for us.

AND (
ha1.id IS NOT NULL
OR ha2.id IS NOT NULL
OR ha3.id IS NOT NULL
)

This statement means that if any of the have any (ha) tag relationships were found (the brackets thus returning TRUE) will be valid and returned

Does not have any of these tags

So the final part in this requirement, is to ensure that the records returned do not have any of the specified tags that we have to avoid.

We are going to do this in a very similar way to how we did the have any tags, see below

LEFT OUTER JOIN customer_tags nh1
ON c.id = nh1.customer_id
AND nh1.tag_id = 444

LEFT OUTER JOIN customer_tags nh2
ON c.id = nh2.customer_id
AND nh2.tag_id = 333

And, to make sure they were not returned, we’re going to do the following

AND nh1.id IS NULL
AND nh2.id IS NULL

So we’re LEFT OUTER JOINing on here for the same reason we left outer joined on the have any tags – it is valid for this tag to not exist, in fact, in this case we actively don’t want it to exist.

Putting the query together

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON c.id = mh1.customer_id
AND mh1.tag_id = 999
JOIN customer_tags mh2
ON c.id = mh2.customer_id
AND mh2.tag_id = 888

/* Join on the have-any tag relationships */
LEFT OUTER JOIN customer_tags ha1
ON c.id = ha1.customer_id
AND ha1.tag_id = 777
LEFT OUTER JOIN customer_tags ha2
ON c.id = ha2.customer_id
AND ha2.tag_id = 666
LEFT OUTER JOIN customer_tags ha3
ON c.id = ha3.customer_id
AND ha3.tag_id = 555

/* Join on the must not have tag relationships */
LEFT OUTER JOIN customer_tags nh1
ON c.id = nh1.customer_id
AND nh1.tag_id = 444
LEFT OUTER JOIN customer_tags nh2
ON c.id = nh2.customer_id
AND nh2.tag_id = 333

/* Ensure at least one of the have-any relationships is found */
WHERE (
ha1.id IS NOT NULL
OR ha2.id IS NOT NULL
OR ha3.id IS NOT NULL
)

/* Ensure not have relationships don't exist */
AND nh1.id IS NULL
AND nh2.id IS NULL

/* Never return duplicate customer */
GROUP BY c.id

/* Order by last updated customer records */
ORDER BY c.updated_at DESC

Adding further search filtering

I am not actually going to fully cover this, that will be on you with your application to make the best decisions, if you want to build a SQL query (not a prepared statement) then you may find this line handy, I can’t vouch for its safety.

$sanitised = DB::connection()->getPdo()->quote($string);

Dynamically Generating Your Query

All you really need to do is variablise (yes, I made that word up) your query generation;

I’m sorry, this isn’t PSR-2, I’m not going to lie to you, I wrote it straight into WordPress, and don’t want to spin up containers to run PHPCBF/PHPCS against it

// These are the must have tags
$mustHaveAllTags = [999,888,777];
// Must have 1+ of these tags
$mustHaveAnyTags = [666,555];
// Must not have any of these tags
$mustNotHaveTags = [444,333];
// Pagination of the query
$page = 1;
$perPage = 100;
// These are the components of our WHERE statement
$whereStatements = [];
// This is where we will hold our SQL query
$sql = '';
// Loop the must have tags
foreach($mustHaveAll Tags as $tagId){
$joinName = "mh$tagId";
$sql .= "
JOIN customer_tags $joinName
ON $joinName.customer_id = c.id
AND $joinName.tag_id = $tagId";
$wheres[] = "$joinName.id IS NOT NULL";
}
// Handle the must have any tags
if(!empty($mustHaveAnyTags)){
$anyWheres = [];
foreach($haveAny as $tagId){
$joinName = "ha$tagId";
$sql .= "
LEFT OUTER JOIN customer_tags $joinName
ON $joinName.customer_id = c.id
AND $joinName.tag_id = $tagId";
$anyWheres[] = "$joinName.id IS NOT NULL";
}
$wheres[] = "(" . implode(" OR ", $anyWheres) . ")";
}
// Not have any tags
foreach($notHave as $tagId){
$joinName = "nh$tagId";
$sql .= "
LEFT OUTER JOIN customer_tags
$joinName ON $joinName.customer_id = c.id
AND $joinName.tag_id = $tagId";
$wheres[] = "$joinName.id IS NULL";
}
// Append all the WHERE statements we created
$sql .= " WHERE " . implode(" AND ", $wheres);
// Group by the contact
$sql .= " GROUP BY c.id ";
// Calculate pagination
$offset = ($page-1) * $perPage;
$limit = $perPage;
// Append query limiting
$sql .= " LIMIT $offset, $limit ";

Next Steps

The next steps on this, to be honest, are beyond the scope of this article. What you now have is a query which will provide your tagged results as appropriate.

For me, I would now extract that functionality into a class, probably 2 (one responsible for providing a search “interface”, and one for generating SQL), and depending on how you want to use Eloquent or raw SQL either extract the custom searching (if you wanted to search by name, for example) into Eloquent, using prepared statements, or escape strings and queries (I would personally discourage the latter approach).

If you do want to do it in Eloquent you will incur a second query, but there are (from a programmatic perspective) some advantages (like not worrying about generating and executing prepared statement), if this is the case only select c.id from your query. From where run whatever normal Eloquent stuff you want, and simply add

$query->whereIn('id', $idsReturnedFromYourQuery);

What about performance? Joins are not performant

I’ve tested this kind of query with up to about 10 joins (10 tags), on a dataset of about 500,000 base records and it ran in about 300ms on a $40 Digital Ocean droplet.

Make sure that customer_tags has a compound index set on customer_id + tag_id to make the lookup faster, and foreign key both fields in that table.

In a lot of applications, this will be performant enough. When you start hitting the millions of records I am sure it will slow down and you will need a different approach.

I'd love to hear your opinion on what I've written. Anecdotes are always welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.