Add speed to node import by subtracting
I started a node import of 700,000 address records to a custom content type in Drupal 6. I was only getting in input rate of about 91 nodes/minute, which would take about 128 hours.
Then I switched from MyISAM to InnoDB tables. No significant change.
Then I dropped all the database indexes from the content type table, node, and node_revisions table except for the primary keys. No noticeable change.
To speed it up, I disabled Xdebug and added APC. Then I disabled the following modules to bring the enabled module count down from 58 to 42:
- admin menu
- content copy
- color
- update status
- content form jammer
- generic jammer
- mail logger
- advanced help
- diff
- r4032login
- user import
- rules admin ui
- filter permissions
- log user filter
- bonus views export
- views ui
That changes things. It moved from about 91 nodes/min to 908 nodes/min.
| nodes/min | Xdebug | Drop indexes | Disabled modules | Disabled modules w/ APC | Disabled modules w/ APC, no indexes |
|---|---|---|---|---|---|
| MyISAM | 91 | 92 | 883 | 908 | 940 |
| InnoDB | 90 | 91 | 875 | 836 | 1003 |
I noted that the InnoDB import went faster than MyISAM without indexes on the content type, node, node_access, and node_revisions tables. I even took extra rate samples just to make sure that was correct, and it seems to be.
ALTER TABLE `content_type_participants` ENGINE = InnoDB;
ALTER TABLE `node` ENGINE = InnoDB;
ALTER TABLE `node_access` ENGINE = InnoDB;
ALTER TABLE `node_revisions` ENGINE = InnoDB;ALTER TABLE `content_type_participants` ENGINE = MyISAM;
ALTER TABLE `node` ENGINE = MyISAM;
ALTER TABLE `node_access` ENGINE = MyISAM;
ALTER TABLE `node_revisions` ENGINE = MyISAM;Another thing which isn't represented here is that the import clearly slows down as the tables grow in size, so if you're importing several hundred thousand records as nodes, expect that the rate of import will decrease as the import progresses. That fact is probably reflected in the difference between the "Disabled modules" and "Disabled modules w/ APC" figures as my timing in doing rate samples wasn't exactly the same when I did each case.
The summary is clear, though, and Khalid is right again. The single biggest impact you can make to speed up your node imports, or your site in general, is to disable some modules.

update status - a likely culprit
You should have disabled modules one by one, to find out the actual cause. Those UI modules probably didn't affect shit.
My educated guess would be the core "update_status". It has this nasty habit of checking all current modules (the stored metadata might be over 1M), and occasionally check d.org for new modules. And it does this ONLY when you (the admin) are visiting /admin/* pages. If the node import is run on /admin/, and it does a lot of ajax calls thru the batch API, update_status just might be wasting your resources.
A better way to import nodes is using drush (command line: no timeout, no ajax, no admin).
Another solution
Another solution is do LOAD DATA IN FILE in MySQL. If the source file can be read from one disk and inserted on another you can get speeds of 10000++ rows a second. Ordinarily I wouldn't recommend this approach on complexity grounds but it becomes very attractive when a traditional node import could take days/weeks/months. Turning off indexing results in faster imports for sure but building the indexes can still take forever if temp tables are created on disk. I cover it briefly in this article: http://cruncht.com/361/uriverse-dbpedia-drupal-case-study. cheers Murray
Were you able to detect which
Were you able to detect which module gave the most significant gain? In general it would mean less hook_nodeapis are invoked. I could not figure out which of these modules implement them?
Interested. most disabled
Interested. most disabled modules are UI stuff tho. Why it brings down so much, Node Import is through standard API that means it also affect real world usages.
Post new comment