Monday, September 07, 2009

PHP Serialization And Recursion Demystified

Introduction

PHP has several core functions that help us with daily deployment, debugging, and improvements. At the same time, PHP is loaded with intrinsic "gotchas", too often hard to understand, hard to explain, or simply hard to manage. One common problem concerns debugging, caching, or freezing, and the way we would like to debug, cache, or freeze variables.
By freezing, I mean those procedures able to regenerate a stored variable and its status, in order to reuse that variable, to understand what happened to it at a given moment, or just to speed up expensive tasks already completed.


The Problem

One of the most common ways to freeze variables is serialization, performed in core via the well-known serialize function.
Please consider this example:

$person = new Employee('Mr. Lucky Me');
// ... do some useful task
myCompanyFreezer($person);

// the myCompanyFreezer function

function myCompanyFreezer(Employee $p){

    $company = Company::getInstance();
    // note that this company has exclusive
    // control over the employee work (reference)
    $company->employees[] = &$p;

    // on the other hand employee
    // has finally a company to work with
    // but no control over the company
    $p->company = $company;

    // update and freeze the employee status
    $company->add(serialize($p));
}

So, while the company has an exclusive contract and each employee is totally under company control, the employee has no say in company decisions, but can proudly say: "Look at me, I work for Company::getInstance()!".
But since serialization is recursive, we will find the company instance serialized as the employee's "company" property.
The problem is that the company instance has an "employees" property which contains one or more employees, including Mr. Lucky Me.
And so on and on into infinite recursion, a massive waste of resources and ... STOP! serialize is clever enough to detect recursion, and rather than going on with nested serializations it simply writes a reference to the already serialized object.
Got a headache already?
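
To see that reference with our own eyes, here is a minimal, self-contained sketch of the same situation. The Company and Employee classes below are simplified stand-ins assumed for illustration only (no singleton, no add() method); they are not the real classes behind the example above.

```php
<?php
// Hypothetical, simplified stand-ins for the classes used above.
class Company
{
    public $employees = array();
}

class Employee
{
    public $name;
    public $company;

    public function __construct($name)
    {
        $this->name = $name;
    }
}

$company = new Company();
$person  = new Employee('Mr. Lucky Me');

// company -> employee -> company -> ... a cycle
$company->employees[] = $person;
$person->company      = $company;

// serialize() does not loop forever: the second time it meets
// $person it writes a short back-reference token instead.
$frozen = serialize($person);
echo $frozen, "\n";

// unserialize() rebuilds the cycle from that back-reference.
$copy = unserialize($frozen);
var_dump($copy->company->employees[0] === $copy);
```

The printed string should contain an r:N; token in the place where the employee would otherwise be serialized a second time, and the final check compares the restored employee with the one its restored company points to.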

Two Different Kinds Of Recursion

Since the main purpose of serialize is to freeze a variable's status, and since PHP is still a bit hybrid about references and shallow copies, serialize can produce two kinds of pointer: r and R.
The lowercase "r" marks a recursion, while the uppercase "R" marks a recursion by reference.

// serialized recursion - the ABC
$o = new stdClass;

// recursion
$o->normal = $o;

// recursion by reference
$o->reference = &$o;

echo serialize($o);
// O:8:"stdClass":2:{s:6:"normal";r:1;s:9:"reference";R:1;}

We should focus on r:1; and R:1;.
While the "r" or the "R" means there is a recursion, the unsigned integer indicates the exact value that "caused" that recursion.
When we perform an unserialize operation, the parser obviously cannot assign values as it reads them, because if we have an instance or an array, its internal values have to be unserialized before they can be assigned.
This simply means that the number after the R is not an offset into the string and has no relation to the string length; it relates only to the de-serialization process.
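
A small sketch may help here. As far as I can tell, the integer is the position of the referenced value in the order serialize walks the structure, counting the outermost value as 1. The variable names below are purely illustrative.

```php
<?php
// "r": the same object appears twice (objects are shared handles).
$shared        = new stdClass;
$objects       = array('first' => $shared, 'same' => $shared);
$frozenObjects = serialize($objects);
echo $frozenObjects, "\n";

// "R": two array slots are PHP references to the same variable.
$value      = 1;
$refs       = array('x' => &$value, 'y' => &$value);
$frozenRefs = serialize($refs);
echo $frozenRefs, "\n";

// After unserialize both relationships are restored.
$objectsCopy = unserialize($frozenObjects);
var_dump($objectsCopy['first'] === $objectsCopy['same']);

$refsCopy      = unserialize($frozenRefs);
$refsCopy['x'] = 42;        // writing through one slot ...
var_dump($refsCopy['y']);   // ... is visible through the other
```

In both printed strings the pointer number names an earlier value, not a character position, which is why it stays small however long the string gets.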

What Is Wrong With Serialize

First of all, PHP serialization is not human readable the way JSON, for example, or XML is.
If we use this format to debug our application we will definitely need an extra layer able to present the object as it is. In a few words, what we need is something that is not serialized.
Moreover, serialize and unserialize try to be as reliable as possible, and for this reason these functions are 3 times slower than json_encode or json_decode.
The truth is that JSON, as is, cannot compete with serialize and unserialize, because the simplicity of its protocol makes it unable to store class names, lambdas, or the distinction between public, protected, and private instance properties.
Last, but not least, PHP's JSON encoder is a bit ambiguous, because an array that is not fully populated is usually converted into an object:

define('MY_WELCOME_STRING', 1);
$a = array();
$a[MY_WELCOME_STRING] = 'Hello World';

echo json_encode($a);
//{"1":"Hello World"}

// in JavaScript would have been
// [null,"Hello World"]
// where square brackets mean Array, and not Object
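
The earlier point about class names and property visibility can be checked in the same way. This is only a sketch: the Account class and its values are made up for the comparison.

```php
<?php
// Hypothetical class, only here to compare the two formats.
class Account
{
    public $user = 'lucky';
    protected $role = 'admin';
    private $token = 'secret';
}

$account = new Account();

// json_encode only sees public properties and stores no class name.
$json = json_encode($account);
echo $json, "\n";
var_dump(get_class(json_decode($json)));

// serialize keeps the class name and every property, whatever its visibility.
$thawed = unserialize(serialize($account));
var_dump(get_class($thawed));
var_dump($thawed == $account);
```

Decoding the JSON gives back a generic object with one property, while the serialize round trip gives back an Account that compares equal to the original.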

So again, another serializer is not worth it for freezing variables. What is left for us?

var_export

var_export() gets structured information about the given variable. It is similar to var_dump() with one exception: the returned representation is valid PHP code.

EUREKA! There is a core-level function whose aim is to serialize PHP into valid PHP; how could we ask for something more efficient? I mean: "native performance to serialize and native performance to get it back, it must be the solution"!
It's not!

$o = new stdClass;
$o->normal = $o;

var_export($o);

Fatal error: Nesting level too deep - recursive dependency?

Nice one! From bug report 39116, marked as bogus, and Derick's reply:
"We can't change this by adding some text when this happens, as that would not result in valid PHP code in that case (which is the purpose of this function)."

Let me summarize:
  1. serialize/unserialize understand recursion almost without problems, but unserialize is slow and both are PHP specific
  2. json_encode is not compatible with recursion and, as a general-purpose PHP serializer, it loses too much PHP information
  3. var_export would be perfect, but in PHP we cannot manually write a recursion that is valid and correctly parsed
  4. var_dump is magic but its output is not reliable: *RECURSION* will not be recognized as a valid PHP value
  5. I already had a headache at line 10 of this post, and now I am still here to find out there are no solutions?
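
Point 2 is easy to reproduce. The sketch below assumes PHP 5.5 or later (where the JSON_ERROR_RECURSION constant exists) and no JSON_PARTIAL_OUTPUT_ON_ERROR or JSON_THROW_ON_ERROR flag, so that a failed encode simply returns false.

```php
<?php
$o = new stdClass;
$o->normal = $o;   // the object contains itself

// json_encode gives up on the cycle instead of describing it.
$json = json_encode($o);
var_dump($json);
var_dump(json_last_error() === JSON_ERROR_RECURSION);

// serialize copes with the very same value.
$frozen = serialize($o);
echo $frozen, "\n";
```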


How To Remove Recursion Without Losing It

Well, there are different solutions but, performance-wise, we do not have many options. A first solution could be a maximum nesting level, where an object cannot serialize its properties "forever" and after N levels it has to stop!
This technique has more cons than pros, for these reasons:
  • it could require a manual parser, which is slower and, due to the nature of the problem, not that simple to maintain or debug
  • it could be extremely redundant and waste a lot of resources, due to its artificial stupidity: a recursion should never be serialized again, precisely because it is a recursion, so doing it is a waste of time, references, and resources
  • as mentioned 5 words ago, in this way we are losing the recursion, so we should stop saying we are serializing ...
Accordingly, there is only one other way to perform this task: understand recursions, and remove them without losing their meaning.

$o = new stdClass;
$o->n = $o;
$o->r = &$o;

echo serialize($o), "\n",
    serialize(
        remove_recursion($o)
    )
;

Produced output:

O:8:"stdClass":2:{s:1:"n";r:1;s:1:"r";R:1;}
O:8:"stdClass":2:{s:1:"n";s:12:"?recursion_1";s:1:"r";s:12:"?Recursion_1";}

Et voilà! Problem solved! ... but what is that?
The remove_recursion function was introduced in the latest Formaldehyde Project version, 1.05, and its purpose is to make any kind of trace, backtrace, or logged information debuggable.
The resulting var_export will be something like this:

stdClass::__set_state(array(
'n' => '' . "\0" . 'recursion_1',
'r' => '' . "\0" . 'Recursion_1',
))
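
For readers who want a feeling for how such a transformation can work, here is a rough sketch. It is NOT the Formaldehyde source: the function name sketch_remove_recursion and the whole scanning strategy are my own assumptions, written only to produce the same placeholders shown above. It walks the serialized string once, without recursion, copies string and class-name payloads verbatim, and swaps every r:N; or R:N; pointer for a NUL-prefixed string.

```php
<?php
// Hypothetical sketch, not Formaldehyde's actual code.
function sketch_remove_recursion($value)
{
    $in  = serialize($value);
    $out = '';
    $len = strlen($in);
    $i   = 0;
    while ($i < $len) {
        $type    = $in[$i];
        $isToken = ($i + 1 < $len && $in[$i + 1] === ':');
        if ($isToken && in_array($type, array('s', 'O', 'C', 'E'), true)) {
            // X:LEN:"payload" : copy the payload untouched, whatever it contains
            $colon = strpos($in, ':', $i + 2);
            $size  = (int) substr($in, $i + 2, $colon - $i - 2);
            $end   = $colon + 2 + $size + 1; // opening quote + payload + closing quote
            $out  .= substr($in, $i, $end - $i);
            $i     = $end;
        } elseif ($isToken && ($type === 'r' || $type === 'R')) {
            // r:N; or R:N; : replace the pointer with a placeholder string
            $semi = strpos($in, ';', $i);
            $id   = substr($in, $i + 2, $semi - $i - 2);
            $mark = "\0" . $type . 'ecursion_' . $id;
            $out .= 's:' . strlen($mark) . ':"' . $mark . '";';
            $i    = $semi + 1;
        } else {
            // anything else (numbers, braces, separators) is copied as is
            $out .= $type;
            $i++;
        }
    }
    return unserialize($out);
}

$o = new stdClass;
$o->n = $o;
$o->r = &$o;

$clean = sketch_remove_recursion($o);
var_dump($clean->n === "\0recursion_1");
var_dump($clean->r === "\0Recursion_1");
```

The sketch treats the body of objects that implement their own serialization format (the C: token) as ordinary serialized data, which may not always hold, so take it as an illustration of the idea rather than a drop-in replacement.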

The chosen form to store a recursion is exactly the same one PHP uses for lambdas:

echo serialize(create_function('',''));
//s:9:"?lambda_1";

In PHP a lambda is stored as a "protected" string, and the number at the end of the string "lambda_" identifies it. Until we restart our web server, lambda functions persist in the entire PHP context; that is why it is possible to serialize lambda functions and unserialize them, as long as the environment does not change or restart.
The additional difference between "r" and "R" in case of recursion is necessary to avoid losing information about references.
On the other hand, recursions are truly useless when we debug or export variables, but they can always be present.
PHP will not understand my chosen syntax but, only if necessary, we can always use a function like this to recreate the correct recursions:

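function recreate_recursion($o){
    return unserialize(
        preg_replace(
            // a string placeholder: s:N:"\0recursion_M"; or s:N:"\0Recursion_M";
            '#s:[0-9]+:"\x00(r|R)ecursion_([0-9]+)";#',
            // becomes the native pointer again: r:M; or R:M;
            '\1:\2;',
            serialize($o)
        )
    );
}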
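
A quick way to try it, as a sketch: the $flat object below is written by hand to look like a variable whose recursion has already been removed, so the property names and values are assumptions made for the example.

```php
<?php
// Same function as above, repeated so this snippet runs on its own.
function recreate_recursion($o)
{
    return unserialize(preg_replace(
        '#s:[0-9]+:"\x00(r|R)ecursion_([0-9]+)";#',
        '\1:\2;',
        serialize($o)
    ));
}

// A variable as it would look after the recursion has been removed
// (placeholder written by hand for the sake of the example).
$flat       = new stdClass;
$flat->name = 'Mr. Lucky Me';
$flat->self = "\0recursion_1";

// The placeholder string becomes a real pointer to value number 1,
// which is the object itself.
$restored = recreate_recursion($flat);
var_dump($restored->self === $restored);
var_dump($restored->name);
```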


Pros

  1. we can finally forget every kind of recursion problem, letting PHP handle them via serialize, without doing anything else
  2. performance and produced size will be better than with any nesting-based parser, thanks to a simple parser which ... surprise!!! ... does not use recursion at all!
  3. once we pass a variable through formaldehyde_remove_recursion we can transform it into whatever format we like, including var_export, JSON, and XML, forgetting recursion headaches


Cons

  1. being based on serialize and unserialize, the transformation can implicitly trigger, if present, both __sleep and __wakeup. This happens in any case when we use serialize/unserialize, but if we serialize a transformed variable, __sleep will be called twice
  2. it may require extra effort to regenerate internal recursions, but in any case this is better than losing them forever, as most of us have done until now
  3. the conversion assumes that a serialized string will not contain an exact match, such as a manually written string. This is actually the same assumption PHP developers made about serialized lambdas.
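
The first point can be observed with a counter. This is a sketch under assumptions: the Sleepy class is invented for the test, and a plain serialize/unserialize round trip stands in for the transformation, since that is the part that triggers the magic methods.

```php
<?php
// Hypothetical class that counts how often __sleep runs.
class Sleepy
{
    public static $sleeps = 0;
    public $value = 1;

    public function __sleep()
    {
        self::$sleeps++;
        return array('value');
    }
}

$original = new Sleepy();

// Stand-in for the transformation: it goes through serialize/unserialize.
$transformed = unserialize(serialize($original)); // first __sleep call

// Freezing the transformed variable serializes once more.
$frozen = serialize($transformed);                // second __sleep call

var_dump(Sleepy::$sleeps);
```

If __sleep has side effects, such as closing a connection, this double call is worth keeping in mind.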


Conclusion

With a lightweight function, and after this post, I hope we can better understand recursion problems and the related serialization. My suggestion is to give Formaldehyde a try but, as long as the MIT-style license is respected, you can extract its internal formaldehyde_remove_recursion.

Any question? :)

20 comments:

devsmt said...

extremely interesting, really i need to cleanup often my objects as i like to inspect them via var_dump(). tnks for sharing this inspired idea!

Anonymous said...

andrea, as always you write waist (girovita) instead of waste (spreco)...

Andrea Giammarchi said...

er ... LOL, one was correct though. Any other comment? :D

Shawn said...

That has to be the dumbest use of serialize I've ever seen. Serialize wasn't meant to be human readable or to "freeze state", it was meant as a simple way to store PHP values and restore them with unserialize mainly with databases. And as for trying Formaldehyde I'll stick with set_error_handler with a FirePHP wrapper.

Andrea Giammarchi said...

Shawn, I take as a compliment, since I have simply solved an extremely common problem via something native going further normal conventions - I call it strike, usually,so thanks.

About Formaldehyde, I have contacted the FirePHP author but the point is that FirePHP is a logger, not a debugger, so I am not sure how we will integrate these two projects, but I am still waiting for a reply otherwise I gonna comment your loved FirePHP and everything I did not like about it: the reason I have created Formaldehyde which does not suffer, thanks to my new and fresh idea, recursions (neither redundant code via nested encoding limits)

Regards

Shawn said...

Your Formaldehyde isn't a debugger either. It's simply logging errors which is exactly what is possible with wrapping FirePHP's error method within set_error_handler. I don't see anywhere in Formaldehyde where you can set breakpoints or do anything else that constitutes debugging instead of just logging.

Andrea Giammarchi said...

FirePHP fails with most problematic errors, it is in the source code.
Formaldehyde does not fail.

FirePHP requires manual logging implementation, Formaldehyde does not.

I think you are missing the whole point about what is Formaldehyde and what is FirePHP as well, so I can suggest this page hoping this will make things more clear.

I personally asked the FirePHP author to integrate Formaldehyde just because they are different but if you want to go on, that's fine, it's a sort of habit here for people that discover this blog "a bit late"

Regards

Andrea Giammarchi said...

P.S. Shawn, you did not get the meaning of this post and Formaldehyde has a dedicated one, please keep talking about this post, if you have questions.

I would like to underline that nobody said serialize was created to be human readable. I wrote about human readability because if we want to debug something which is frozen as a string, whatever you say, serialize is not "comfortable" for debugging purposes because it has to be unserialized to be understood.

So, if you read again this post and formaldehyde sections I'll be more than happy to answer your question.

If you are here to say: why you write bullshit, without any code, argument, and talking for third parts, please feel free to leave ASAP this blog and do not come back, thanks for understanding, and see you next trick.

Giorgio said...

As far as I know serialization is used in $_SESSION storage between requests... Or am I wrong?

Andrea Giammarchi said...

yes Giorgio, only if no session handler has been defined.
But how does session cope with this post?

Giorgio said...

Because using sessions means transparently using serialization (unless another storage is provided), so recursion "issues" are present. For instance, say you store in the session an entity loaded from an ORM like Doctrine 2 or Zend_Db_Mapper, and this entity has a lazy loading collection as a property... Since it has a reference to the entity manager, every single object will be serialized, and this has to be taken into account with a detach/merge process like Doctrine does.

Andrea Giammarchi said...

Giorgio ... still, so what? This post is about a technique to serialize without recursion problems and without losing recursions, when and if necessary, in order to be able to transform variables into JSON, XML, whatever ... still without losing recursion properties and avoiding useless nested serializations.

So, what is your concern about SESSION which are not affected at all for a debug purpose like debug_backtrace, Exception->getTrace, or other?

Shawn said...

As for your serialization comments: It's not supposed to be comfortable to debug because you're not supposed to bloody debug it.



Correct, FirePHP does not automatically log errors that's why you do something like

function logPhpError($code, $error, $file, $line, $context)
{
    if(!($code & ini_get('error_reporting'))) return;
    $logItem = array(
        'data' => $error,
        'type' => 'error',
        'file' => $file,
        'line' => $line
    );
    fb($logItem, 'error');
}
set_error_handler('logPhpError');

Then in your .htaccess you'd do

php_value auto_prepend_file "/path/to/file/blah.php"

Andrea Giammarchi said...

.htaccess, which requires a proper parsing for each page call? prepend file? which means you cannot escape from fb ?
I think you are still confused about what is Formaldehyde ... it's for Ajax calls, and it does not suppose to be in every page or goodbye performances.

Your main error is to compare these two files completely different. I send you back again to that page.

serialize has been used to debug since PHP 3, I am not sure why you have to insist with a pointless comment but you can go on, I do not mind, but I'll stop to reply off topic comments.

Shawn said...

You're debugging, if you're stupid enough to leave the auto_prepend_file in when you launch to production then you deserve the performance hit.

And they aren't completely different. The code I provided acts almost identically to Formaldehyde, AJAX call or not save the fact that with Formaldehyde you have to do a console.log() in your javascript and FirePHP is done automatically since it's sent through the HTTP headers.

Andrea Giammarchi said...

so you missed the second image, and you are talking without even knowing FirePHP, which indeed requires a plug-in specific to Firefox ... ah ah ah, you are so funny, please go on!

Shawn said...

Indeed, it does require a Firefox specific add-on which happens to be widely used, has a large community, works extremely well and only requires that you use PHP to log, has libraries for almost all of the most widely used frameworks and has excellent documentation.

As a moot point I don't know any PHP developer that doesn't use Firefox while at least coding (Though of course we have to switch to IE to fix their bullshit)

Andrea Giammarchi said...

Shawn, stop complain here and open your mind: Formaldehyde JS - THe Circle Is Close

You are not stats, you are a single developer and you do not know how many configurations, situations, and problems are related with Ajax debug. I am simply trying to solve my problems making my code public, if you do not like it you do not have to use it and please find a new blog to complain 'cause I am kinda bored here.

Regards

Artur Ejsmont said...

Well im not convinced that i would like to use that solution. I think i know where you are getting with it but seems a bit too complicated to me. I like simple solutions that just work. Having to know hacks about internal implementations of each method or combining them or doing it manually is kind of adding complexity and increasing risk that someone else reading the code will never figure out what was the intention and what are the key aspects of the new method.

On the other hand if that is what you need and it solves your problems then cool, i dont have a problem with that.

Personally i just like to keep stuff simple. If its exeptional errors logging just use serialize as i dont expect 10 errors per second ;- ) otherwise it means site is screwed up any way. If its for storage ... then again ... the less magic calls are involved or hacks the easier to follow the 'default'.

But thanks for the article it was actually interesting and detailed. I liked it :- )

THANKS!

egran said...

Hello Andrea, it's a smart solution to avoid the recursion problem of other core functions' output.

The first application that comes into my mind (and the one I was searching for before landing here :) is the object storage for caching purposes: has already been implemented some class to cache objects which makes use of this solution?