The data set that needs to be inserted is a .csv file from http://software77.net/geo-ip/. Its decompressed size (as I write this note) is about 9 MB and it contains about 127,000 records. However, I'll use the gzipped file (~1.5 MB) and decompress it on the fly.
Let's look at a first version of the inserting function:
...
public function load(ObjectManager $manager)
{
    $fp = gzopen('data.csv.gz', 'r');
    while (!gzeof($fp)) {
        $line = trim(gzgets($fp, 2048));
        if ($line[0] != '#') { // don't parse comments
            $csv = str_getcsv($line);
            $o = new EntityObject();
            /* ... filling the entity with data from the csv line ... */
            $manager->persist($o);
            unset($o);
        }
    }
    $manager->flush();
    gzclose($fp);
}
...

This code won't work: sooner or later (depending on the memory limit and the number of entities) it will trigger a fatal error from memory exhaustion. What we need is a chunked inserting approach, and, depending on the Doctrine debug configuration (in Symfony2 it directly depends on %kernel.debug%), we also need to turn off SQL logging.
The final code looks like this:
...
public function load(ObjectManager $manager)
{
    $count = 0;
    $objectsPerChunk = 1000;

    $manager->getConnection()->getConfiguration()->setSQLLogger(null);

    $fp = gzopen('data.csv.gz', 'r');
    while (!gzeof($fp)) {
        $line = trim(gzgets($fp, 2048));
        if ($line[0] != '#') { // don't parse comments
            $csv = str_getcsv($line);
            $o = new EntityObject();
            /* ... filling the entity with data from the csv line ... */
            $manager->persist($o);
            $count++;
            if ($count % $objectsPerChunk == 0) {
                $manager->flush();
                $manager->clear();
            }
            unset($o);
        }
    }
    $manager->flush();
    gzclose($fp);
}
...

The important addition is $manager->clear(), which "detaches" all objects currently managed by $manager. Detaching is what frees the memory; if you want to look under the hood and see what is really happening, check the doDetach() method in Doctrine/ORM/UnitOfWork.php.
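If you want to convince yourself that clear() really keeps memory flat, a small probe can be called right after each flush()/clear() pair. This is only a hypothetical debugging aid built on PHP's standard memory_get_usage(); it is not part of the original fixture.

<?php
// Hypothetical helper (not from the original fixture): report how much memory
// PHP currently has allocated, so you can watch it stay roughly constant
// between chunks instead of growing with every persisted entity.
function reportMemory(int $objectsPersisted): void
{
    printf(
        "%d objects persisted, %.1f MB allocated\n",
        $objectsPersisted,
        memory_get_usage(true) / 1048576
    );
}

Calling reportMemory($count) after $manager->clear() in every chunk should print roughly the same figure each time, whereas with the first version of the function the number keeps climbing until the script hits the memory limit.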
You may be wondering why I chose 1000 objects per chunk, and not 100, 5000 or 10000. In my case it turned out that ~1000 was the best choice in terms of speed. Here are some benchmarks I did (a rough timing sketch follows the table):
objects per chunk | time |
100 | 3m6s |
500 | 2m43s |
800 | 2m43s |
1000 | 2m37s |
1200 | 2m38s |
1500 | 2m46s |
2000 | 2m39s |
5000 | 2m53s |
20000 | 3m17s |
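To reproduce this kind of comparison yourself, a minimal timing harness along the lines below is enough. It is a sketch under assumptions, not the benchmark script actually used for the table: $fixture and $manager stand for whatever your fixture loader and Doctrine bootstrap provide, and $objectsPerChunk is edited between runs.

<?php
// Rough timing harness sketch (an assumption, not the original benchmark).
// $fixture is an instance of the fixture class above and $manager an
// EntityManager, both obtained from your application's bootstrap.
$start = microtime(true);
$fixture->load($manager);
printf("loaded in %.1f seconds\n", microtime(true) - $start);

If DoctrineFixturesBundle is in use, timing the doctrine:fixtures:load console command with the shell's time gives the same kind of figure.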