The data set that needs to be inserted is a .csv file from http://software77.net/geo-ip/. Its decompressed size (as I write this note) is about 9 MB and it contains about 127,000 records. However, I'll use the gzipped file (~1.5 MB) and decompress it on the fly.
Let's look at a first version of the inserting function:
...
public function load(ObjectManager $manager)
{
    $fp = gzopen('data.csv.gz', 'r');
    while (!gzeof($fp)) {
        $line = trim(gzgets($fp, 2048));
        if ($line[0] != '#') { // don't parse comments
            $csv = str_getcsv($line);
            $o = new EntityObject();
            /* ... filling the entity with data from the csv line ... */
            $manager->persist($o);
            unset($o);
        }
    }
    $manager->flush();
    gzclose($fp);
}
...

This code won't work: sooner or later (depending on the memory limit and the number of entities) it will trigger a fatal error from memory exhaustion. What we need is a chunked inserting approach, and, depending on the Doctrine debug configuration (in Symfony2 it directly depends on %kernel.debug%), we also need to turn off SQL logging.
The final code looks like this:
...
public function load(ObjectManager $manager)
{
    $count = 0;
    $objectsPerChunk = 1000;

    $manager->getConnection()->getConfiguration()->setSQLLogger(null);

    $fp = gzopen('data.csv.gz', 'r');
    while (!gzeof($fp)) {
        $line = trim(gzgets($fp, 2048));
        if ($line[0] != '#') { // don't parse comments
            $csv = str_getcsv($line);
            $o = new EntityObject();
            /* ... filling the entity with data from the csv line ... */
            $manager->persist($o);
            $count++;
            if ($count % $objectsPerChunk == 0) {
                $manager->flush();
                $manager->clear();
            }
            unset($o);
        }
    }
    $manager->flush();
    gzclose($fp);
}
...

The important addition is $manager->clear(), which "detaches" all objects currently managed by $manager. Detaching is what frees the memory; if you want to look under the hood and see what is really happening, check the doDetach() method in Doctrine/ORM/UnitOfWork.php.
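If you want to convince yourself that clear() really keeps memory flat, a small probe can be called right after each flush()/clear() pair. This is only a hypothetical debugging aid built on PHP's standard memory_get_usage(); it is not part of the original fixture.

<?php
// Hypothetical helper (not from the original fixture): report how much memory
// PHP currently has allocated, so you can watch it stay roughly constant
// between chunks instead of growing with every persisted entity.
function reportMemory(int $objectsPersisted): void
{
    printf(
        "%d objects persisted, %.1f MB allocated\n",
        $objectsPersisted,
        memory_get_usage(true) / 1048576
    );
}

Calling reportMemory($count) after $manager->clear() in every chunk should print roughly the same figure each time, whereas with the first version of the function the number keeps climbing until the script hits the memory limit.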
You may be wondering why I chose 1000 objects per chunk, and not 100, 5000 or 10000. In my case it turned out that ~1000 was the best choice in terms of speed. Here are some benchmarks I did (a rough timing sketch follows the table):
objects per chunk | time |
100 | 3m6s |
500 | 2m43s |
800 | 2m43s |
1000 | 2m37s |
1200 | 2m38s |
1500 | 2m46s |
2000 | 2m39s |
5000 | 2m53s |
20000 | 3m17s |
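To reproduce this kind of comparison yourself, a minimal timing harness along the lines below is enough. It is a sketch under assumptions, not the benchmark script actually used for the table: $fixture and $manager stand for whatever your fixture loader and Doctrine bootstrap provide, and $objectsPerChunk is edited between runs.

<?php
// Rough timing harness sketch (an assumption, not the original benchmark).
// $fixture is an instance of the fixture class above and $manager an
// EntityManager, both obtained from your application's bootstrap.
$start = microtime(true);
$fixture->load($manager);
printf("loaded in %.1f seconds\n", microtime(true) - $start);

If DoctrineFixturesBundle is in use, timing the doctrine:fixtures:load console command with the shell's time gives the same kind of figure.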