Saturday, August 3, 2013

Doctrine2 (PHP): inserting large amount of entities

First of all, IMO Doctrine2 doesn't really fit inserting large amounts of data/entities because of its abstraction layer "overhead". Nevertheless, during development of my IpToCountry Symfony2 Bundle I chose to use it for easier installation and better integration of that bundle.

The data set which needs to be inserted is a .csv file from http://software77.net/geo-ip/. Its decompressed size (as I write this note) is about 9MB and it contains about 127000 records. However, I'll use the gzipped file (~1.5MB) and decompress it "on the fly".

Let's look at the inserting function prototype:
...
    public function load(ObjectManager $manager)
    {
        $fp = gzopen('data.csv.gz', 'r');
        while (!gzeof($fp)) {
            $line = trim(gzgets($fp, 2048));
            if ($line !== '' && $line[0] !== '#') { // don't parse comment lines
                $csv = str_getcsv($line);

                $o = new EntityObject();
                /* ...
                   filling entity with data from csv line
                   ...
                */

                $manager->persist($o);

                unset($o);
            }
        }
        $manager->flush();

        gzclose($fp);
    }
...
This code won't work, because sooner or later it'll trigger a fatal error (memory exhaustion), depending on the memory limit and the number of entities. What we need to do is use a chunked inserting approach and, depending on the Doctrine debug configuration (in Symfony2 it directly depends on %kernel.debug%), we also need to turn off SQL logging.
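
To see why memory keeps growing in the naive version, remember that every persist() registers the entity in Doctrine's UnitOfWork, so unset($o) alone frees nothing. A minimal sketch of checking this (assuming $manager is actually an EntityManager, as the getConnection() call below implies):

    $o = new EntityObject();
    $manager->persist($o);
    unset($o); // the UnitOfWork still holds a reference to the entity

    // number of entities currently managed by the UnitOfWork - it keeps
    // growing with every persist() until flush() + clear() are called
    echo $manager->getUnitOfWork()->size();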

The final code looks like this:
...
    public function load(ObjectManager $manager)
    {
        $count = 0;
        $objectsPerChunk = 1000;

        // turn off SQL logging so executed queries aren't kept in memory
        $manager->getConnection()->getConfiguration()->setSQLLogger(null);

        $fp = gzopen('data.csv.gz', 'r');
        while (!gzeof($fp)) {
            $line = trim(gzgets($fp, 2048));
            if ($line !== '' && $line[0] !== '#') { // don't parse comment lines
                $csv = str_getcsv($line);

                $o = new EntityObject();
                /* ...
                   filling entity with data from csv line
                   ...
                */

                $manager->persist($o);
                $count++;
                if ($count % $objectsPerChunk == 0) {
                    $manager->flush();
                    $manager->clear(); // detach all managed entities to free memory
                }

                unset($o);
            }
        }
        $manager->flush(); // flush the remaining entities of the last, partial chunk

        gzclose($fp);
    }
...
The important call here is $manager->clear(), which "detaches" all objects currently managed by $manager. Basically, detaching frees the memory, but if you want to look under the hood and see what is really happening, go to the doDetach() method in Doctrine/ORM/UnitOfWork.php.
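
A quick way to observe the effect of clear() (a minimal sketch, not part of the bundle; EntityObject stands in for the real entity):

    $o = new EntityObject();
    $manager->persist($o);
    $manager->flush();

    var_dump($manager->contains($o)); // true  - the entity is still managed
    $manager->clear();
    var_dump($manager->contains($o)); // false - the entity has been detached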

You may be wondering why I chose 1000 objects per chunk and not 100, 5000 or 10000. Well, in my case it turned out that ~1000 is the best choice in terms of speed. Here are some benchmarks I did:

    objects per chunk    time
                  100    3m6s
                  500    2m43s
                  800    2m43s
                 1000    2m37s
                 1200    2m38s
                 1500    2m46s
                 2000    2m39s
                 5000    2m53s
                20000    3m17s
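
For completeness, this is roughly how such a benchmark can be run. It's only a sketch, not my original benchmarking code; LoadIpToCountryData and its chunk-size constructor are hypothetical stand-ins for the real fixture class, and $manager is the EntityManager:

    foreach (array(100, 500, 1000, 5000) as $chunkSize) {
        $fixture = new LoadIpToCountryData($chunkSize); // hypothetical constructor
        $start = microtime(true);
        $fixture->load($manager);
        printf("%5d objects per chunk: %.0fs\n", $chunkSize, microtime(true) - $start);
    }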