General Representation for Drupal feeds
FeedAPI is an excellent module for delivering information from the outside world into a Drupal installation. Currently FeedAPI supports importing RSS items as nodes, which covers most applications. But legacy systems (probably 3 to 7 years old) and most enterprise information systems have no RSS output, and RSS carries only a few fields for a content type. In such cases, a more generic format is required to feed data into Drupal with FeedAPI. One good practice is to export information from legacy systems in an XML format that is supported by a FeedAPI parser (already described in the previous post). Although any format is parsable, XML is more intuitive.
Generally speaking, FeedAPI accepts a list of items that are described in the same format. To generalize FeedAPI usage of feeds, a simple XML format is proposed:
    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns:foo="http://example.com/foo"
          id="an id for a feed"
          title="the title for a feed"
          description="the description for this feed"
          link="the link to be imported as the original_url, can be ignored">
      <node id="a unique id"
            title="the title to be imported as the node title"
            description="the description to be imported as the node description"
            link="the link to be imported as the original_url, which can be ignored">
        <data>
          <foo:id>foo's id</foo:id>
          <foo:first-name>foo's name</foo:first-name>
          <foo:last-name>foo's last name</foo:last-name>
          <foo:email>foo's email</foo:email>
          <foo:website>http://foohomepage.com</foo:website>
        </data>
      </node>
    </feed>
In the above example, a simple XML wrapper is used to encapsulate the actual data (foo’s item description). Foo’s original XML elements can be directly encapsulated in the data field with proper namespace settings.
Elements
feed
This element has four attributes: id, title, description, and link. id is a unique identification string within a Drupal installation; it is imported into Drupal as the guid field. title and description are imported as the title and description of the FeedAPI feed. link is imported as the original_url field of the FeedAPI feed. In the current implementation, at least one of guid and original_url must exist. In my SimpleXML parser implementation, guid (the id attribute) is required.
The feed tag consists of a series of node tags.
node
The node tag is a container for a node’s details. Currently only the data tag is allowed under the node tag, though the format leaves room for other tags under node to carry additional node information. node also has the four attributes described for feed, which represent Drupal-specific metadata.
data
The data tag is a container for the original data, converted (or directly copied) from the original XML. A namespace is suggested to indicate the source of the data. Any XML data can be added under the data tag. However, the format must be consistent so that the parser can extract the same set of information for each node.
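To make the element descriptions above concrete, here is a minimal sketch of reading the proposed format with PHP's SimpleXML extension. It is not the parser from my implementation: the function name example_parse_feed, the sample values, and the shape of the returned array are assumptions made purely for illustration.

```php
<?php
// Sample document in the proposed format (all values are made up).
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns:foo="http://example.com/foo" id="staff" title="Staff" description="Staff list" link="http://example.com/staff">
  <node id="staff-1" title="Jane Doe" description="Engineer" link="http://example.com/staff/1">
    <data>
      <foo:id>1</foo:id>
      <foo:first-name>Jane</foo:first-name>
      <foo:last-name>Doe</foo:last-name>
      <foo:email>jane@example.com</foo:email>
    </data>
  </node>
</feed>
XML;

/**
 * Hypothetical helper: turn a feed document into a plain array of items.
 */
function example_parse_feed($xml, $namespace) {
  $feed = simplexml_load_string($xml);
  if ($feed === FALSE) {
    // Not well-formed XML: return no items.
    return array();
  }
  $items = array();
  foreach ($feed->node as $node) {
    // The four Drupal-specific attributes of each node element.
    $item = array(
      'guid' => (string) $node['id'],
      'title' => (string) $node['title'],
      'description' => (string) $node['description'],
      'original_url' => (string) $node['link'],
      'data' => array(),
    );
    // children() with a namespace URI yields only elements in that namespace;
    // the iteration key is the element name without its prefix.
    foreach ($node->data->children($namespace) as $name => $value) {
      $item['data'][$name] = (string) $value;
    }
    $items[] = $item;
  }
  return $items;
}

$items = example_parse_feed($xml, 'http://example.com/foo');
```

Passing the namespace URI rather than the prefix keeps the sketch independent of whichever prefix the exporting system happens to choose.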
Feed Synchronization
FeedAPI behaves like a standard aggregator. For example, when FeedAPI watches a feed that updates periodically, items from the watched feed are added or updated on the Drupal site rather than synchronized with it. In other words, if an item no longer exists in the updated feed, it is not removed from the Drupal site. While this behaviour suits most Drupal sites that act as news aggregators, it does not suit enterprise applications that need real synchronization between the presentation layer and the EIS. Two methods can be implemented to synchronize data between the presentation layer and the EIS. In the first, the EIS exports incremental data (the difference between the previous revision and the current revision), and the presentation layer parses and applies it. In the second, the EIS exports all data, and the presentation layer works out the difference between the previous revision and the current revision and removes the items that no longer exist in the latest update from the EIS.
Although the former method is usually more efficient, the EIS with which I am working is a legacy system that supports no data warehousing technologies. In short, it contains no timeline data and does not support row revisions. Therefore, my implementation is limited to the latter method. Fortunately, FeedAPI provides a flexible interface that allows me to implement the synchronization without contaminating FeedAPI’s source code.
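The comparison at the heart of the latter method can be sketched in a few lines of plain PHP. This is only an illustration of the idea, not code taken from FeedAPI: the function name example_removed_guids and the sample GUIDs are assumptions.

```php
<?php
/**
 * Hypothetical helper: GUIDs that were imported earlier but are missing
 * from the latest full export.
 */
function example_removed_guids(array $stored, array $latest) {
  // array_diff() keeps the values of $stored that do not appear in $latest;
  // array_values() reindexes the result from zero.
  return array_values(array_diff($stored, $latest));
}

// GUIDs already on the site versus GUIDs in the latest export.
$stored = array('staff-1', 'staff-2', 'staff-3');
$latest = array('staff-1', 'staff-3', 'staff-4');

// Items to remove from the site: present before, absent now.
$removed = example_removed_guids($stored, $latest);
```

In the actual implementation the same comparison is pushed into SQL with a NOT IN clause, so the stored GUIDs never have to be loaded into PHP.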
FeedAPI provides the feedapi_after_refresh hook for parsers and processors to post-process a feed after a refresh. Synchronization of feeds depends on this post-processing mechanism.
    function parser_simplexml_feedapi_after_refresh($feed) {
      // List all guids from the new items.
      $new = array();
      foreach ($feed->items as $item) {
        $new[] = $item->options->guid;
      }
      // Select this feed's items whose guid is no longer in the feed. The guids
      // are passed as query arguments so that db_query() escapes them.
      $args = array($feed->nid);
      $condition = '';
      if ($new) {
        $condition = ' AND f.guid NOT IN (' . implode(', ', array_fill(0, count($new), "'%s'")) . ')';
        $args = array_merge($args, $new);
      }
      $sql = "SELECT f.nid FROM {feedapi_node_item_feed} n, {feedapi_node_item} f WHERE n.feed_nid = %d AND n.feed_item_nid = f.nid" . $condition;
      $result = db_query($sql, $args);
      while ($item = db_fetch_object($result)) {
        // This is a hack copied from Drupal's original node_delete function,
        // which avoids the permission check.
        $node = node_load($item->nid);
        db_query('DELETE FROM {node} WHERE nid = %d', $node->nid);
        db_query('DELETE FROM {node_revisions} WHERE nid = %d', $node->nid);
        // Call the node-specific callback (if any):
        node_invoke($node, 'delete');
        node_invoke_nodeapi($node, 'delete');
        // Clear the cache so an anonymous poster can see the node being deleted.
        cache_clear_all();
        // Remove this node from the search index if needed.
        if (function_exists('search_wipe')) {
          search_wipe($node->nid, 'node');
        }
        drupal_set_message(t('%title has been deleted.', array('%title' => $node->title)));
        watchdog('content', t('@type: deleted %title.', array('@type' => t($node->type), '%title' => $node->title)));
      }
      unset($result);
    }
This implementation of the feedapi_after_refresh hook provides a synchronization mechanism that removes from the site all items deleted from the imported feed. However, Drupal’s node_delete function checks permissions against the current user, while the feed is refreshed from Drupal’s cron, which runs as an anonymous user. With node_delete, an anonymous user is unable to remove items, so this hook circumvents the permission check. Although this introduces a possible security hole, the hack is necessary unless a better cron is implemented.
Update: Due to FeedAPI’s mechanism to deal with unique feed items, an item’s ID must be unique across ALL feeds rather than in one feed.
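One way to satisfy this constraint, sketched here as an assumption rather than as something FeedAPI does for you, is to have the exporting side (or the parser) prefix each item ID with the ID of its feed. The function name example_global_guid and the ':' separator are hypothetical.

```php
<?php
/**
 * Hypothetical helper: build an item ID that stays unique across all feeds
 * by prefixing it with the ID of the feed it belongs to.
 */
function example_global_guid($feed_id, $item_id) {
  // The separator is an arbitrary choice; it should be a character that
  // never appears in feed IDs, otherwise two different pairs could collide.
  return $feed_id . ':' . $item_id;
}

// The same local ID in two feeds yields two distinct GUIDs.
$staff_guid = example_global_guid('staff', '1');
$project_guid = example_global_guid('projects', '1');
```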