Johno the Coder

PHP Developer & Solutions Architect


Handling large data with cron jobs and queue workers

One problem I see a lot, from reading other developers’ code, is how they handle large sets of data with background scripts. There are a few common mistakes that I have seen while inheriting the code of my predecessors, and I thought I would offer some solutions to the problems that can come about.

I will cover off these issues as problems and solutions, and I’m going to use Laravel solutions, but conceptually it doesn’t really matter what framework you’re working in.

Problem #1 – The script takes too long (or many hours) to run!

This is quite a common one, particularly when inheriting code from days gone by, when the business and database were small. Then things grew, and suddenly that script which took ten minutes to run, takes many hours.

When I have to tackle these kinds of issues, assuming I understand the script and what it is doing, there are a couple of ways to solve the issue:

Firstly, if your script is doing many different things, split it out into multiple scripts, which run at their appropriate times, which do a single job each. However, this probably isn’t the ideal.

Usually a script that is doing some daily processing looks something like this (in Laravel, but that’s not really relevant). It’s an awful example: we’re calculating some cumulative business values for each customer.

This is not an actual feature, or actual code, I’ve written it as a deliberately simple example. Honestly I’ve not even tried to run this code, it’s just an example.

$customers = Customer::all();
foreach ($customers as $customer) {
    // Lifetime spend for this customer
    $orderTotal = Order::where('customer_id', '=', $customer->id)->sum('subtotal');
    // Spend over the last month
    $thisMonthOrders = Order::where('customer_id', '=', $customer->id)
        ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
        ->sum('subtotal');
    $customer->total_spend = $orderTotal;
    $customer->total_spend_this_month = $thisMonthOrders;
    $customer->save();
}

In this example we’re getting the total value of the orders this customer has made in their lifetime and the last month, and then saving it against the customer record. Presumably for easy/fast search and filtering or something.

This is fine until you start hitting many thousands or hundreds of thousands of sales and/or customers, then the script is going to take a very long time to run.

Firstly, we can optimise the script to only fetch the orders from the last month, and therefore only calculate totals for the affected customers. That would have a huge impact, but it’s not the point that I am trying to make here (though it would be relevant).

It would be much more efficient in terms of execution speed to do something like the following, though it does assume you have a Queue Worker or Laravel Horizon running (or some kind of mechanism to handle jobs on a queue):


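foreach (Customer::all() as $customer) {
    CalculateTotals::dispatch($customer);
}

// app/Jobs/CalculateTotals.php
namespace App\Jobs;

use App\Models\Customer; // model namespace assumed; older apps use App\Customer
use App\Models\Order;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class CalculateTotals implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
    protected $customer;
    public function __construct(Customer $customer)
    {
        $this->customer = $customer;
    }
    public function handle()
    {
        // Lifetime spend for this customer
        $this->customer->total_spend = Order::where('customer_id', '=', $this->customer->id)->sum('subtotal');
        // Spend over the last month
        $this->customer->total_spend_this_month = Order::where('customer_id', '=', $this->customer->id)
            ->where('created_at', '>=', now()->subMonths(1)->format('Y-m-d'))
            ->sum('subtotal');
        $this->customer->save();
    }
}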

What we have actually done here, to speed up the execution time, is use the cron job to work out what needs processing, and then pass the work off to the job queue. This means the processing and time-consuming parts of the job are spread across however many queue workers you have. The cron will only run for as long as it takes to find the relevant records and dispatch a job for each via Redis or whatever queueing mechanism you’re using.

If you have 10,000 contacts to process, and it takes 2 seconds to process a contact, it would take 20,000 seconds to run the script.

If you run 20 queue workers, and all things are equal, the work from your script is now divided between 20 workers, so each worker has to deal with 500 jobs. 500 jobs multiplied by 2 seconds per job is 1,000 seconds to run (instead of 20,000).
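As a rough sketch of that arithmetic in plain PHP: the function name estimateRuntimeSeconds is made up for illustration, and it assumes every job takes the same time and that no worker ever sits idle, which real queues will not quite manage.

```php
<?php

/**
 * Rough wall-clock estimate for a batch of equally sized jobs.
 * Assumes every job takes the same time and workers never sit idle.
 */
function estimateRuntimeSeconds(int $jobCount, float $secondsPerJob, int $workerCount): float
{
    // Each worker takes an equal share; round up so a leftover job still counts
    $jobsPerWorker = (int) ceil($jobCount / $workerCount);

    return $jobsPerWorker * $secondsPerJob;
}

echo estimateRuntimeSeconds(10000, 2.0, 1), PHP_EOL;  // single script: 20000
echo estimateRuntimeSeconds(10000, 2.0, 20), PHP_EOL; // 20 workers: 1000
```

In practice the gain is usually a bit smaller than this, because the workers share the same database and queue.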

NB: Personally, I would calculate these numbers on the fly. If they needed saving in the database, I would calculate them in a listener for the event fired when an order is added, so they’re always live-ish. Naturally, I would not use ::all() as the basis for any kind of processing in a large environment.

Problem #2 – The script uses too much memory, and crashes

This one depends on where the memory is coming from, but sticking with our above example…

$contacts = Contact::all();

Is a pretty bad place to start. It will work for a while, but if your database grows you’re eventually going to max out your memory trying to retrieve your whole data set.

$contacts = Contact::where('last_order_at', '>=', $carbonSomething->format('Y-m-d'))->get();

This is a good place to start, whittle down how much you’re retrieving, but you might still be bringing back too much data.

Again, this is a demonstration, I’ve not run it, it’s just for the concept

$relevantContactsCount = Contact::where($qualifiers)->count();
$perPage = 10000;
$pages = ceil($relevantContactsCount / $perPage);
$currentPage = 1;
// <= rather than <, otherwise the last page is never processed
while ($currentPage <= $pages) {
    // A stable sort order keeps the pages from overlapping; 'id' is a placeholder column
    $contacts = Contact::where($qualifiers)
        ->orderBy('id', 'asc')
        ->limit($perPage)
        ->offset(($currentPage - 1) * $perPage)
        ->get();
    // Now we have at most 10,000 contacts in memory
    foreach ($contacts as $contact) {
        MyJobFromAbove::dispatch($contact); // Run it in a background job
    }
    $currentPage++;
}
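The paging arithmetic is easy to get off by one, so here is a small sketch in plain PHP with no database involved. The helper name pageWindows is hypothetical; it only produces the offset and limit pairs you would feed into a query, and the loop condition has to include the final, possibly partial, page.

```php
<?php

/**
 * Yields [offset, limit] pairs that cover $total records in pages of $perPage.
 */
function pageWindows(int $total, int $perPage): Generator
{
    $pages = (int) ceil($total / $perPage);

    // <= so the final (possibly partial) page is included
    for ($page = 1; $page <= $pages; $page++) {
        yield [($page - 1) * $perPage, $perPage];
    }
}

// 25,000 records in pages of 10,000 need three queries, not two
foreach (pageWindows(25000, 10000) as [$offset, $limit]) {
    echo "offset {$offset}, limit {$limit}", PHP_EOL;
}
```

In Laravel itself, the query builder's chunk and chunkById methods wrap this kind of loop for you, so you may not need to write it by hand.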

Problem #3 – I can’t run my script as jobs, because it has to report back with totals, etc.

Now, this is an awkward one, but you simply need to change the way that you’re thinking about how you gather that information. You need a persistent way of storing the information, so that it can be summarised later.

There are a few ways to achieve this:

  1. When you start running the script, you create a record in the database or somewhere which you can report back to from your jobs (be wary of having every job run an incremental update for a “total contacts processed” type column!), or a way of consolidating the results
  2. Utilise caching mechanisms to remember certain information about what is running, so information can be shared between jobs and later on reported on
  3. Consolidate your results in somewhere very easy to calculate, and run your finalised report (which is perhaps emailed) once you know all processing is finished
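To make the idea of consolidating results a little more concrete, here is a minimal sketch in plain PHP. RunResults is a hypothetical class and the in-memory array is only a stand-in for a database table or cache; the point is that each job writes its own row, rather than every job incrementing one shared counter, and the report sums the rows at the end.

```php
<?php

/**
 * Hypothetical stand-in for a database table or cache that jobs report into.
 */
class RunResults
{
    /** @var array<int, array{processed: int, total: float}> */
    private $rows = [];

    // Each job records its own row instead of incrementing a shared counter
    public function record(int $jobId, int $processed, float $total): void
    {
        $this->rows[$jobId] = ['processed' => $processed, 'total' => $total];
    }

    // The final report sums the rows once all jobs have reported in
    public function summary(): array
    {
        return [
            'jobs' => count($this->rows),
            'processed' => array_sum(array_column($this->rows, 'processed')),
            'total' => array_sum(array_column($this->rows, 'total')),
        ];
    }
}

$results = new RunResults();
$results->record(1, 500, 1250.50);
$results->record(2, 480, 990.25);

print_r($results->summary());
```

Knowing when "all processing is finished" is a separate problem; comparing the number of recorded rows against the number of jobs dispatched is one simple way to check.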

Parting Thoughts

The normal rules of performance here still count…

  1. Only pull out from SQL what you actually need (select contact_id, subtotal, items_count from orders instead of select * from orders)
  2. Do what you can at SQL level where possible, e.g. SELECT SUM(subtotal) FROM orders WHERE contact_id = ?
  3. Ensure you have effective and efficient indices on the relevant columns
  4. With MySQL, for speed, use InnoDB tables; for saving storage space, use MyISAM (very simplified!)
  5. Having issues with pulling information out of the database (rather than processing it)? Consider NoSQL solutions like MongoDB
  6. Consider if there are better ways to get the data into your database, so that it can be retrieved more efficiently, or if there are other database design implementations which will suit your needs more closely
  7. Consider working in a near-live speed, instead of bulk processing, even if you hold that information off to be implemented the following day (for business logic purposes)

From procedural to object orientated, a tutorial

Hi there everyone

Something I’m often faced with, having lots of friends with varying degrees of programming experience, is how do (PHP) developers move from sub 25k roles to the higher end of the spectrum, the 35-60k developer roles.

Generally, in my experience, there are some key differences in the salary expectations and the skills you can expect from a developer demanding those salaries. These can be broadly summed up as below:

  • True understanding of object orientated programming
  • Knowledge and application of programming principles
  • Exposure to multiple technologies and ability to move around within them
  • Web application development vs website development (the difference between relying on browsers and being able to do things like offloading, queueing, sharding, true separation of concerns, performance optimisation, caching, all that stuff)

To this end I’ve had a decent number of developers ask me how to start working with classes, and work in an object oriented way. So I thought I would do a tutorial on this. I’m going to cover some design patterns, some PHP functionality and various other things.

This is going to be a long and wordy tutorial, but hopefully it will give you some understanding of the differences between procedural and OO programming.

All code samples are available in this project on my GitHub

Step One: The Hardest Part

The first thing I am going to disclaim is simple: Please do not try to have a half procedural, half OOP project or system. It’s going to be an absolute nightmare to maintain!

Now that’s out of the way, let’s talk about some design basics. Avoid god classes! A god class is a class which has many, many responsibilities, it can do everything. In terms of a practical application of a God class think of an ecommerce system: if a single class is responsible for checking stock, adding items to your basket, emptying your basket, and the checkout process – it has far too much responsibility.

I always think it’s a good idea to follow the Single Responsibility Principle. To those new to this, I simply explain it as follows:

A class should be responsible for a single job. If you can’t tell me, in a sentence, what it does; then it is almost definitely doing too much.

As such, if you have a requirement for an ecommerce system as defined above, checking stock levels would be its own class. The point of this is so that:

  1. The class can be used throughout your project, anywhere that you need to check stock levels
  2. The class can be modified, knowing that all stock checking functionality happens through a single place
  3. Any business logic can be contained in a single place
  4. The class could be swapped out if needs be, again you know all functionality is encapsulated here

Encapsulation: Goes hand in hand with “DRY” (don’t repeat yourself). Basically bringing everything to do with a certain concern (i.e. stock checking) into a single place, rather than leaving it scattered throughout your code.
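To make the stock checking example a bit more tangible, here is a minimal sketch of what a single-responsibility class could look like. StockChecker, its isInStock method and the SKU values are all made up for illustration, and the array passed to the constructor is only a stand-in for wherever your stock levels really live.

```php
<?php

/**
 * Hypothetical single-purpose class: it answers "is this item in stock?" and nothing else.
 */
class StockChecker
{
    /** @var array<string, int> SKU => units on hand (stand-in for a real data source) */
    private $levels;

    public function __construct(array $levels)
    {
        $this->levels = $levels;
    }

    public function isInStock(string $sku, int $quantity = 1): bool
    {
        // Unknown SKUs are treated as having zero stock
        return ($this->levels[$sku] ?? 0) >= $quantity;
    }
}

$stock = new StockChecker(['TSHIRT-M' => 3, 'MUG' => 0]);

var_dump($stock->isInStock('TSHIRT-M', 2)); // bool(true)
var_dump($stock->isInStock('MUG'));         // bool(false)
```

You can describe it in one sentence ("it tells you whether an item is in stock"), which is the test from above, and the basket and checkout code can both use it without knowing how stock is stored.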

So you now have a basic idea of what you use a class for, and in what scenarios you would create a new class – basically, any time you need to get something done.

Now working with objects, as opposed to working with a bunch of variables, has some real perks.

NB: Throughout this article I am going to refer to “classes” and “objects”. For all intents and purposes a class is defined, an object is instantiated. So my User class is the definition; once I instantiate it with the new keyword, what I get back is an object.

If I have an “Order” object rather than a whole heap of variables or a massive multi-dimensional array, I can put the functionality I need in there, and I can make decisions and run logic based on the information contained within that order. What you’re doing is neatly organising everything into its own compartment within your code.

The user may be hitting a button to “add to cart”, but in practice you might be doing all kinds of things; checking the stock level, applying voucher codes, modifying the stock level, calculating the value of the cart so far, all sorts. So this separation becomes invaluable.

However, the point of this part of the article is simple. It’s going to be really difficult to follow, and make almost no sense, to have a bunch of objects floating around a procedural execution. The reason for this is, again, simple; if you have some stuff procedural, and some wrapped in classes and objects; how the heck could I possibly know where to look?

Learning Point Two: Using and understanding the syntax

In this point we’re going to cover some basic concepts:

  1. Inheritance – abstracting and extending classes
  2. Interfaces – implementation and usages
  3. Properties, Methods, Privacy and Scope

Firstly, one of the beautiful things about classes is inheritance. Let’s take a basic example of a User. A User might be a Guest, a Member, a Moderator and an Administrator; but they almost definitely share a bunch of common functionalities, like having a user ID for example (though a guest’s would be 0 or null). You don’t want to have to write a whole heap of code to get the user ID a bunch of times, when it’s the same functionality. You want all of your different types of users to share this functionality, this is where inheritance comes in.

Inheritance

<?php

abstract class User
{
    protected $userId = 0;
    protected $isLoggedIn = true;
    protected $isStaff = false;
    public function getUserId()
    {
        return $this->userId;
    }
    public function isLoggedIn()
    {
        return $this->isLoggedIn;
    }
}

class Guest extends User
{
    protected $isLoggedIn = false;
}

class Member extends User{}

class Moderator extends Member
{
    protected $isStaff = true;
    public function hasPermission($permission)
    {
        // Some logic here and return TRUE or FALSE
    }
}

class Administrator extends Moderator
{
    public function hasPermission($permission)
    {
        return true; // Administrators can do anything
    }
}

See in GitHub

The handy thing about this is that everyone is a User. So if ever I try to manage some dependency and state that a User is required as an object, I can do this really easily, because everything extends off of User, or one of its derivatives.

Also everyone from Guest to Administrator has a getUserId method, which is quite handy and an isLoggedIn method, so it doesn’t matter if my factory returns me a Guest or an Administrator, the functionality is going to work.

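Just to clarify some bits here. Guest::isLoggedIn() returns false, but Member::isLoggedIn() (and everyone who extends Member, or Moderator) returns true. The $isStaff property is true for Moderator and, because of the inheritance, for Administrator too. Note there is no isStaff() getter in this sample, so you would add one alongside isLoggedIn() to read it from outside the object.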

Side note: You could never do $user = new User(); because User is defined as an abstract class. You could, however, do $administrator = new Administrator(); (or any of the other concrete classes).
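Here is a cut-down, runnable version of the idea, as a sketch rather than the full sample. The classes are trimmed copies of the ones above, and the greeting function is a hypothetical helper added to show a User type hint accepting any of the subclasses.

```php
<?php

abstract class User
{
    protected $isLoggedIn = true;

    public function isLoggedIn()
    {
        return $this->isLoggedIn;
    }
}

class Guest extends User
{
    protected $isLoggedIn = false;
}

class Member extends User {}

// Hypothetical helper: accepts any kind of User thanks to the type hint
function greeting(User $user): string
{
    return $user->isLoggedIn() ? 'Welcome back' : 'Please log in';
}

echo greeting(new Guest()), PHP_EOL;  // Please log in
echo greeting(new Member()), PHP_EOL; // Welcome back

try {
    $user = new User(); // abstract classes cannot be instantiated
} catch (Error $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

On PHP 7 and later, trying to instantiate the abstract class throws an Error, which is why it can be caught here instead of killing the script.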

Interfaces

The best way I’ve heard an interface described is as the difference between plugging a lamp into a wall socket and having to wire the lamp in by hand. You can define an interface on an object to ensure it conforms to certain standards. A basic example is the Emailable interface below.

This interface ensures that the entity, whatever it is (User, Customer, Employee, Organisation, Website) can be emailed, by specifying it must have some methods available to it.

In this example we can make anything we want emailable, by simply adding the methods defined and stating that this class implements the interface – now we can send an email to the fridge if we wish, as long as it can define those methods!

<?php

interface Emailable
{
    public function getRecipientName();
    public function getEmailAddress();
    public function acceptsHtmlEmail();
}

class Client implements Emailable
{
    public function getRecipientName()
    {
        return 'Very Important Company Plc';
    }
    public function getEmailAddress()
    {
        return 'someone.somewhere@clientwebsite.com';
    }
    public function acceptsHtmlEmail()
    {
        return true;
    }
}

class Employee implements Emailable
{
    public function getRecipientName()
    {
        return 'John Doe';
    }
    public function getEmailAddress()
    {
        return 'john.doe@ourcorporateemail.com';
    }
    public function acceptsHtmlEmail()
    {
        return false;
    }
}

Sample on GitHub

Implementing an interface: Simply means that we have defined an interface, and the class which implements that interface conforms to it. Then if we define Emailable as a type hint, PHP will enforce not only that the implementing class defines every method in the interface, but also that any object passed into an Emailable type-hinted parameter implements it. Otherwise it’ll throw a hissy fit (a TypeError) and not work.
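A short sketch of the type hint in action: the interface and Employee class are trimmed versions of the sample above, while formatRecipient is a hypothetical helper invented for this example.

```php
<?php

interface Emailable
{
    public function getRecipientName();
    public function getEmailAddress();
}

class Employee implements Emailable
{
    public function getRecipientName()
    {
        return 'John Doe';
    }

    public function getEmailAddress()
    {
        return 'john.doe@ourcorporateemail.com';
    }
}

// Hypothetical helper: builds a "To" value for anything Emailable
function formatRecipient(Emailable $recipient): string
{
    return sprintf('%s <%s>', $recipient->getRecipientName(), $recipient->getEmailAddress());
}

echo formatRecipient(new Employee()), PHP_EOL; // John Doe <john.doe@ourcorporateemail.com>

try {
    formatRecipient(new stdClass()); // stdClass does not implement Emailable
} catch (TypeError $e) {
    echo 'Rejected: not Emailable', PHP_EOL;
}
```

The helper never needs to know whether it was handed an Employee, a Client or a fridge; it only relies on the methods the interface guarantees.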

Scope!

The big one that catches a lot of new-to-OOP developers out is variable and method scope. So here is a quick and simple one:

  1. Public – these methods and properties can be accessed (as long as the object is instantiated) from anywhere that has access to the object
  2. Protected – these methods and properties can be accessed within this object (or derivatives)
  3. Private – These can only be accessed specifically within this class
  4. Static – These can be accessed from the class itself, without needing the instantiation of an object
  5. Constants – As in PHP itself, these never change

That’s the long and short of it. A word of warning: always assume your code is going to be copied, recreated and used throughout a system and, if it is open source, by anyone, anywhere, in any way they feel like. So be very careful what you expose as public. Once it’s public you have to assume other code is relying on it, and so you need to stay backward compatible. In other words, it is easier to change $myProperty and myMethod from protected to public later than to change them from public to protected, because who knows what you might break!

Accessing properties can be done as follows:

Please do not try to run this code, it won’t work 🙂

<?php

class Scope
{

    // I will never change
    const GITHUBURL = 'https://github.com/johnothecoder';

    // I can be called from the class, without instantiation, and can be shared across multiple instances
    public static $fullName = 'Matt Johnson';

    // I can be accessed from anywhere the object exists
    public $alias = 'JohnoTheCoder';

    // I can only be accessed from within Scope or a class which extends Scope
    protected $name = 'Matt Johnson';

    // I can be called from anywhere
    public function getAlias()
    {
        return $this->alias;
    }

    // I can only be called within Scope (or classes which extend Scope)
    protected function getName()
    {
        return $this->name;
    }

}

// Executing some code

echo Scope::GITHUBURL;
echo Scope::$fullName;

$scope = new Scope();
echo $scope->alias;
echo $scope->getAlias();

// But I can't do this
echo $scope->name;
// Or this
echo $scope->getName();

As always, sample available on GitHub

I can’t really talk you through the full spec of this one, as there’s not much to talk through, really it’s just a way of showing you what can and can’t be done within the scopes of an object.
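The sample above does not show private, so here is a small runnable sketch of private versus protected. Account and SavingsAccount are hypothetical classes made up purely for this illustration.

```php
<?php

class Account
{
    private $secret = 'only Account sees me';
    protected $owner = 'Account and its children see me';

    public function revealSecret()
    {
        return $this->secret; // allowed: we are inside Account
    }
}

class SavingsAccount extends Account
{
    public function revealOwner()
    {
        return $this->owner; // allowed: protected is visible to subclasses
    }
}

$account = new SavingsAccount();
echo $account->revealOwner(), PHP_EOL;
echo $account->revealSecret(), PHP_EOL;

try {
    echo $account->owner; // protected: not reachable from outside the object
} catch (Error $e) {
    echo 'Blocked: ', $e->getMessage(), PHP_EOL;
}
```

The subclass can read the protected property directly, but it can only get at the private one through the public method its parent chooses to expose.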


Hopefully this article has been of some use to those of you looking to get into the big wide world of object orientated programming with PHP. Next time I will be covering how to use Dependency Injection, the Factory and Service locator pattern and polymorphism to your advantage 🙂

Thanks for reading!
