Using MySQL to return complex tagged results

Hi everyone

In this article I thought I would tackle complex tag searching, using MySQL. There are lots of ways this can be achieved at a PHP level, through looping and checking/comparing.

However, if you are working with a large dataset, and tagging is a required piece of functionality, for example in an enterprise blog, or a CRM, then you’ll need to get your results direct from the database, rather than loading into memory and looping, which can work in smaller applications.

We are going to cover 3 types of tag searching, and those types of search working with and correlating all 3 of those results together.

  • Record must have all specified tags (e.g. this contact must be both a customer and high value)
  • Record must have any specified tags (e.g. this post must be tagged with laravel or mvc)
  • Record must not have any specified tags (e.g. don’t show anybody from a CRM which has already received the “A” campaign, so that “A” customers don’t receive “B” campaign, too)

Outlining the example

To keep things simple, the example we’re going to work with is as follows;

  1. We have a customers table
  2. Every customer record has one, many, or no corresponding tags in the customer_tags table
  3. The customer_tags table has customer_id and tag_id bridging the customers and tags tables, respectively

The query we are working to needs to do return records which conform to the following tagged search

  1. Must have tags 999 and 888
  2. Must also have any of the tags 777, 666, 555
  3. Must not have any of the tags 444 or 333

Finding records which have all tags

The first thing we need, is a successful WHERE statement which will do this, of course, you are then going to want it to be programmatically generated.

In the example below we’re doing is joining on the mandatory tags, this is arguably the simplest part of the selection, because we’re doing an inner join – if the tag doesn’t exist, the record won’t be available, which means we don’t need to do any WHERE filtering etc.

Notice that I am naming the contact_tags joins as mh1 and mh2 (mh being short for must have)

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON c.id = mh1.customer_id
AND mh1.tag_id = 999

JOIN customer_tags mh2
ON c.id = mh2.customer_id
AND mh2.tag_id = 888

Finding records which also have any of the any-specified tags

Now we have 999 and 888 records only, we now need to make sure that we have one more more of the any tags (777, 666, and 555)

What we need here is to look for the connections; we need to do these as an outer join, as it is perfectly valid for the corresponding record to not be found.

Following that, we need to wrap a WHERE statement which ensures that at least one of those records were returned.

For the must-have tags we did an inner join, this means that unless the corresponding record is found, the result will be returned.

In this case we’re going to do an OUTER JOIN on the LEFT. The join being an OUTER JOIN means that the (contact) record will be returned, even if the corresponding contact_tag record was not found. Being a LEFT JOIN means that we’re going to base from our contacts table, we’re not going to return irrelevant contact_tags records.

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON c.id = mh1.customer_id
AND mh1.tag_id = 999
JOIN customer_tags mh2
ON c.id = mh2.customer_id
AND mh2.tag_id = 888

/* Join on the have-any tag relationships */
LEFT OUTER JOIN customer_tags ha1
ON c.id = ha1.customer_id
AND ha1.tag_id = 777
LEFT OUTER JOIN customer_tags ha2
ON c.id = ha2.customer_id
AND ha2.tag_id = 666
LEFT OUTER JOIN customer_tags ha3
ON c.id = ha3.customer_id
AND ha3.tag_id = 555

Now this in itself isn’t going to actually solve the issue we have around making sure we have 777, 666, or 555 – it is simply going to join on those relationships, if they exist.

This is where we need a WHERE statement to filter the results for us.

AND (
ha1.id IS NOT NULL
OR ha2.id IS NOT NULL
OR ha3.id IS NOT NULL
)

This statement means that if any of the have any (ha) tag relationships were found (the brackets thus returning TRUE) will be valid and returned

Does not have any of these tags

So the final part in this requirement, is to ensure that the records returned do not have any of the specified tags that we have to avoid.

We are going to do this in a very similar way to how we did the have any tags, see below

LEFT OUTER JOIN customer_tags nh1
ON c.id = nh1.customer_id
AND nh1.tag_id = 444

LEFT OUTER JOIN customer_tags nh2
ON c.id = nh2.customer_id
AND nh2.tag_id = 333

And, to make sure they were not returned, we’re going to do the following

AND nh1.id IS NULL
AND nh2.id IS NULL

So we’re LEFT OUTER JOINing on here for the same reason we left outer joined on the have any tags – it is valid for this tag to not exist, in fact, in this case we actively don’t want it to exist.

Putting the query together

/* Retrieve the fields we want */
SELECT id, first_name, last_name

/* Start with the customers table */
FROM customers c

/* For each MUST HAVE tag */
JOIN customer_tags mh1
ON c.id = mh1.customer_id
AND mh1.tag_id = 999
JOIN customer_tags mh2
ON c.id = mh2.customer_id
AND mh2.tag_id = 888

/* Join on the have-any tag relationships */
LEFT OUTER JOIN customer_tags ha1
ON c.id = ha1.customer_id
AND ha1.tag_id = 777
LEFT OUTER JOIN customer_tags ha2
ON c.id = ha2.customer_id
AND ha2.tag_id = 666
LEFT OUTER JOIN customer_tags ha3
ON c.id = ha3.customer_id
AND ha3.tag_id = 555

/* Join on the must not have tag relationships */
LEFT OUTER JOIN customer_tags nh1
ON c.id = nh1.customer_id
AND nh1.tag_id = 444
LEFT OUTER JOIN customer_tags nh2
ON c.id = nh2.customer_id
AND nh2.tag_id = 333

/* Ensure at least one of the have-any relationships is found */
WHERE (
ha1.id IS NOT NULL
OR ha2.id IS NOT NULL
OR ha3.id IS NOT NULL
)

/* Ensure not have relationships don't exist */
AND nh1.id IS NULL
AND nh2.id IS NULL

/* Never return duplicate customer */
GROUP BY c.id

/* Order by last updated customer records */
ORDER BY c.updated_at DESC

Adding further search filtering

I am not actually going to fully cover this, that will be on you with your application to make the best decisions, if you want to build a SQL query (not a prepared statement) then you may find this line handy, I can’t vouch for its safety.

$sanitised = DB::connection()->getPdo()->quote($string);

Dynamically Generating Your Query

All you really need to do is variablise (yes, I made that word up) your query generation;

I’m sorry, this isn’t PSR-2, I’m not going to lie to you, I wrote it straight into WordPress, and don’t want to spin up containers to run PHPCBF/PHPCS against it

// These are the must have tags
$mustHaveAllTags = [999,888,777];
// Must have 1+ of these tags
$mustHaveAnyTags = [666,555];
// Must not have any of these tags
$mustNotHaveTags = [444,333];
// Pagination of the query
$page = 1;
$perPage = 100;
// These are the components of our WHERE statement
$whereStatements = [];
// This is where we will hold our SQL query
$sql = '';
// Loop the must have tags
foreach($mustHaveAll Tags as $tagId){
$joinName = "mh$tagId";
$sql .= "
JOIN customer_tags $joinName
ON $joinName.customer_id = c.id
AND $joinName.tag_id = $tagId";
$wheres[] = "$joinName.id IS NOT NULL";
}
// Handle the must have any tags
if(!empty($mustHaveAnyTags)){
$anyWheres = [];
foreach($haveAny as $tagId){
$joinName = "ha$tagId";
$sql .= "
LEFT OUTER JOIN customer_tags $joinName
ON $joinName.customer_id = c.id
AND $joinName.tag_id = $tagId";
$anyWheres[] = "$joinName.id IS NOT NULL";
}
$wheres[] = "(" . implode(" OR ", $anyWheres) . ")";
}
// Not have any tags
foreach($notHave as $tagId){
$joinName = "nh$tagId";
$sql .= "
LEFT OUTER JOIN customer_tags
$joinName ON $joinName.customer_id = c.id
AND $joinName.tag_id = $tagId";
$wheres[] = "$joinName.id IS NULL";
}
// Append all the WHERE statements we created
$sql .= " WHERE " . implode(" AND ", $wheres);
// Group by the contact
$sql .= " GROUP BY c.id ";
// Calculate pagination
$offset = ($page-1) * $perPage;
$limit = $perPage;
// Append query limiting
$sql .= " LIMIT $offset, $limit ";

Next Steps

The next steps on this, to be honest, are beyond the scope of this article. What you now have is a query which will provide your tagged results as appropriate.

For me, I would now extract that functionality into a class, probably 2 (one responsible for providing a search “interface”, and one for generating SQL), and depending on how you want to use Eloquent or raw SQL either extract the custom searching (if you wanted to search by name, for example) into Eloquent, using prepared statements, or escape strings and queries (I would personally discourage the latter approach).

If you do want to do it in Eloquent you will incur a second query, but there are (from a programmatic perspective) some advantages (like not worrying about generating and executing prepared statement), if this is the case only select c.id from your query. From where run whatever normal Eloquent stuff you want, and simply add

$query->whereIn('id', $idsReturnedFromYourQuery);

What about performance? Joins are not performant

I’ve tested this kind of query with up to about 10 joins (10 tags), on a dataset of about 500,000 base records and it ran in about 300ms on a $40 Digital Ocean droplet.

Make sure that customer_tags has a compound index set on customer_id + tag_id to make the lookup faster, and foreign key both fields in that table.

In a lot of applications, this will be performant enough. When you start hitting the millions of records I am sure it will slow down and you will need a different approach.

Quick Life Hacks with the Eloquent Model

Hi folks,

It’s no secret that I love Laravel, and especially love Eloquent. This post isn’t about why various people don’t like Eloquent, etc, etc. but just some “life hacks” of working with Eloquent.

#1 – Naming Conventions

I personally like to use the following naming convention on my databases (using a forum as an example)

forums
forums_topics
forums_topics_replies

Whereas Eloquent would expect the following:

forums
forum_topics
forum_topic_replies

I personally dislike this, for no other reason than I like to be able to see what’s being owned by what at a very quick glance. Additionally, namespacing; I like to namespace my Models, at the very least into a Model namespace, but often I like to further namespace these to represent the ownership of the data; so I end up with something like (I’m using a directory structure, because PSR-4 namespaces follow this anyway)

app/
- Model/
- - Forum
- - ForumTopic
- - ForumTopicReply

My only issue with this is that the Model namespace can end up huge, and with classes with some really ridiculously long names. So I prefer to break this down like so:

app/
- Model/
- - Forum
- - Forum/
- - - Topic
- - - Topic/
- - - - Reply

I realise this begins to look like overkill, but I find it particularly useful when separating functionality by module or something like that.

The Autonaming

The above means I have two options, number one is to follow the convention to the letter. Option two is to overwrite the naming convention on a base model, from which I then extend. This base/abstract model contains the following method:

/**
 * Overwrite the autonaming conventions
 * @return string
 */
public function getTable()
{
    $class = substr(get_class($this), strlen(self::MODEL_NAMESPACE));
    $namespaces = explode('\\', $class);
    foreach($namespaces as &$namespace){
        if(substr($namespace, -1) == 'y'){
            $namespace = strtolower(substr($namespace, 0, -1) . 'ies');
        } else {
            $namespace = strtolower($namespace . 's');
        }
    }
    return implode('_', $namespaces);
}

Note: You will need to have the MODEL_NAMESPACE constant declared as well

All this does is turned app\Model\Forum\Topic into forums_topics, but it’s pretty useful

PSR-2 properties

So the other thing I find occasionally annoying is that when writing PHP my properties follow the PSR-2 naming convention (camelCaseForVariables) – but using eloquent I am forced to snake_case_columns because (understandably) this is what is in the database.

By adding the following 3 methods to your base model you can use PSR-2 compliant field names. So that created_at becomes createdAt; without having to modify the database;

/**
 * Overwrite the getter so that we can use camelCase field names
 * @param string $property
 * @return mixed
 */
public function __get($property)
{
    return parent::__get($this->convertToSnakeCase($property));
}

/**
 * Overwrite the setter so that we can use camelCase field names
 * @param string $property
 * @param mixed $value
 */
public function __set($property, $value)
{
    parent::__set($this->convertToSnakeCase($property), $value);
}
/**
 * Turn a camelCase string into snake_case
 * @param $string
 * @return mixed
 */
protected function convertToSnakeCase($string)
{
    return strtolower(preg_replace('/([a-z])([A-Z])/', '$1_$2', $string));
}

Quite simply, all these methods are doing is plugging into the original getter and setter, but first converting camelCase to snake_case – so that throughout my code I can refer to these fields in a way which is fitting with the rest of my code, which conforms to PSR-2

There you have it

A couple of very quick, but very useful, ways to extend the base Eloquent Model so that it is in better fitting with the rest of your code base and database standards (or mine, anyway)

Pure OS: The way forward?

Tonight is just a quick one, not my usual essay. I was just curious to see what other people though. Being a PHP Developer I primarily work within LAMP environments. Occasionally having swapped out Apache for Nginx, but I am by far a specialist.

Until recently I was able to jump around through the Linux command line and do whatever I needed to without much issue, but nothing heavy.

I decided, as I often do, that I wanted to expand my skill set and start looking at pure OS, doing everything from scratch, how hard could it be, right? Previously I had used control panels like cPanel and Plesk to make management easy, as I’ve been offloading mail management to the likes of Office 365 and Google Suite, because they’re awesome at it and provide drive solutions, I figured there wasn’t that much really left for my server to manage.

So I put on my big boy pants and launched a Droplet with Digital Ocean and got learning how to set up and manage my own LAMP stack environments. I’ve spent a bit of time learning a few bits, and realised a few things, which I thought I would share. I’m by no means an expert, but these are just some soundbites I’ve picked up

  1. Setting up LAMP stack isn’t difficult
  2. Neither is setting up SFTP and Mail (though I’m dubious on mail and will leave it to the pros!)
  3. Neither is setting up SSL (seriously, look up LetsEncrypt!)
  4. Control panels are heavy!
  5. You really don’t need much on your server to do basic website stuff
  6. If you keep it lean, tiny servers can do amazing things

If you’re a developer, and have never tried to build and manage your own servers, seriously give it a go.

On a side note, some other useful tips and guides – please take heed – this isn’t new knowledge to me by any means, but thought it might be useful if you stumble across this article.

  1. Disable root login access!
  2. Chroot your FTP users
  3. Only install what you NEED. If you’re not doing anything with SFTP (because you’re only using git, for example), don’t install it
  4. Research what you’re installing, only install it if you need it and trust it
  5. If this is the first time you’ve tried to manage your own DNS, most panels create aa bunch of stuff for you; at the very least you need to create a CNAME record for ‘www’ and point it to ‘@’ (so that www.website.com also works as well as just website.com)

I think that’s everything for now. If I think of anything else I’ll pop it in a little soundbite.