Add speed to node import by subtracting

I started a node import of 700,000 address records to a custom content type in Drupal 6. I was only getting in input rate of about 91 nodes/minute, which would take about 128 hours.

Then I switched from MyISAM to InnoDB tables. No significant change.

Then I dropped all the database indexes from the content type table, node, and node_revisions table except for the primary keys. No noticeable change.

To speed it up, I disabled Xdebug and added APC. Then I disabled the following modules to bring the enabled module count down from 58 to 42:

That changes things. It moved from about 91 nodes/min to 908 nodes/min.

nodes/min Xdebug Drop indexes Disabled modules Disabled modules w/ APC Disabled modules w/ APC, no indexes
MyISAM 91 92 883 908 940
InnoDB 90 91 875 836 1003

I noted that the InnoDB import went faster than MyISAM without indexes on the content type, node, node_access, and node_revisions tables. I even took extra rate samples just to make sure that was correct, and it seems to be.

ALTER TABLE `content_type_participants` ENGINE = InnoDB;
ALTER TABLE `node` ENGINE = InnoDB;
ALTER TABLE `node_access` ENGINE = InnoDB;
ALTER TABLE `node_revisions` ENGINE = InnoDB;

ALTER TABLE `content_type_participants` ENGINE = MyISAM;
ALTER TABLE `node` ENGINE = MyISAM;
ALTER TABLE `node_access` ENGINE = MyISAM;
ALTER TABLE `node_revisions` ENGINE = MyISAM;

Another thing which isn't represented here is that the import clearly slows down as the tables grow in size, so if you're importing several hundred thousand records as nodes, expect that the rate of import will decrease as the import progresses. That fact is probably reflected in the difference between the "Disabled modules" and "Disabled modules w/ APC" figures as my timing in doing rate samples wasn't exactly the same when I did each case.

The summary is clear, though, and Khalid is right again. The single biggest impact you can make to speed up your node imports, or your site in general, is to disable some modules.


» deekayen's blog · Printer-friendly version Topics:

update status - a likely culprit

You should have disabled modules one by one, to find out the actual cause. Those UI modules probably didn't affect shit.

My educated guess would be the core "update_status". It has this nasty habit of checking all current modules (the stored metadata might be over 1M), and occasionally check d.org for new modules. And it does this ONLY when you (the admin) are visiting /admin/* pages. If the node import is run on /admin/, and it does a lot of ajax calls thru the batch API, update_status just might be wasting your resources.

A better way to import nodes is using drush (command line: no timeout, no ajax, no admin).

reply

Another solution

Another solution is do LOAD DATA IN FILE in MySQL. If the source file can be read from one disk and inserted on another you can get speeds of 10000++ rows a second. Ordinarily I wouldn't recommend this approach on complexity grounds but it becomes very attractive when a traditional node import could take days/weeks/months. Turning off indexing results in faster imports for sure but building the indexes can still take forever if temp tables are created on disk. I cover it briefly in this article: http://cruncht.com/361/uriverse-dbpedia-drupal-case-study. cheers Murray

reply

Were you able to detect which

Were you able to detect which module gave the most significant gain? In general it would mean less hook_nodeapis are invoked. I could not figure out which of these modules implement them?

reply

Interested. most disabled

Interested. most disabled modules are UI stuff tho. Why it brings down so much, Node Import is through standard API that means it also affect real world usages.

reply

Post new comment

  • Lines and paragraphs break automatically.
  • Allowed HTML tags: <hr /> <a> <p> <em> <strong> <cite> <code> <blockquote> <ul> <ol> <li> <dl> <dt> <dd>
  • Web page addresses and e-mail addresses turn into links automatically.
  • You may post code using <code>...</code> (generic) or <?php ... ?> (highlighted PHP) tags.

More information about formatting options

By submitting this form, you accept the Mollom privacy policy.