Automatically Archiving RSS Items in Google Reader

Google Reader is ten kinds of awesome, and has enabled my problematic RSS addiction. It allows me to easily churn through hundreds of interesting news/blog items a day on a whole range of subjects. Increasingly, I find myself going back to find items I read weeks or months ago, and searching with Google Reader is generally easy. Like all Google products, it seems to save nearly everything you ever put into it, but in the case of Google Reader, not absolutely everything. The content of a given RSS entry, an XML file with inline images, is saved, but the images themselves (along with any other external content) are not preserved, only the reference to their remote location. This presents an issue of longevity, as a given image may not still exist when I call up a certain RSS entry at some point in the future.

In an attempt to create a more permanent archive, I’ve cobbled together a piece of embarrassingly poorly written code, so dumb and inefficient that I very seriously considered not posting it, if only to save face. It captures all of the RSS items I’ve marked “shared” in Google Reader and emails them, with images attached, to my Gmail account. This builds an archive of the posts I find most interesting, with copies of the images they reference. To sum it up in one sentence, for the sake of anyone searching for something that does this: the script performs an automatic backup of Google Reader, complete with images and the text of the RSS entries.

 
 
 
The first step is marking items in Google Reader as “shared.” This makes them public, and is a way of narrowing down a torrent of information to items which I want to save.
 
[Image: Google Reader]

 
 
 
The resulting feed is passed to xfruits’s RSS->mail tool, which transforms the feed into a digest.
 
[Image: xfruits]

 
 
 

From there it gets slightly more convoluted. Using a PHP class I found called “PHP Text / HTML Email with Unlimited Attachments,” I’ve written a script which sends me a copy of the xfruits digest, with images attached and displayed inline. I’ve included these files below, along with some comments on the code.
 

Aforementioned PHP hackjob: rss_backup.php
PHP HTML Email Class: class.Email.php

 
 

This snippet loads the class and establishes the sender and recipients of the email.


<?php
//** load email class definition.

  include('class.Email.php');  

//** establish to, from, and any other recipients.

  $Sender = "";
  $Recipiant = "username@email.com";
  $Cc = "";
  $Bcc = "";
 

 
 

This uses wget to grab the HTML and images from the xfruits digest and place them in one directory. Wget also rewrites the image links in the HTML so that they reference the images stored in the same directory as the HTML file, which becomes important when displaying attached images in HTML email.


//** wget content

  echo "\n\n\n\n"."**** BEGIN WGET HERE ****"."\n\n";

  // create a scratch directory for the download and work inside it
  $foo = system('mkdir wget_dump');
  chdir('wget_dump');

  // the command is joined by concatenation so it reaches the shell as a
  // single line (a line break inside the string would split the command)
  $foo = system('wget --span-hosts --convert-links --page-requisites '
			.'--no-directories http://www.xfruits.com/username/shared');

  echo "\n\n"."**** SO ENDS WGET ****"."\n\n\n\n";


//** read in the content

  $filename = "index.html";
  $contents = file_get_contents($filename);
  if ($contents === false) {
  	exit("\n"."could not read ".$filename."\n");
  }
  $orig_file = $contents;
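
As a hedged aside, the wget call can also be assembled from its parts, which keeps it on one line and quotes the URL for the shell. This is only a sketch: `build_wget_command` is a hypothetical helper that is not part of the original script, while the four flags are the ones used above.

```php
<?php
// Hypothetical helper (not in the original script): builds the wget
// invocation as a single line and quotes the URL for the shell.
function build_wget_command(string $url): string
{
    $flags = [
        '--span-hosts',       // also fetch requisites hosted on other domains
        '--convert-links',    // rewrite links to point at the local copies
        '--page-requisites',  // download images and other inline content
        '--no-directories',   // save everything into the current directory
    ];

    return 'wget ' . implode(' ', $flags) . ' ' . escapeshellarg($url);
}

// Print the command instead of running it, so the sketch needs no network.
echo build_wget_command('http://www.xfruits.com/username/shared'), "\n";
```

The resulting string could be passed to `system()`, whose optional second argument receives the command's exit status, which would allow the script to stop early if the download fails.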
 

 
 

This computes the MD5 hash of the wgetted HTML and compares it to the hash of the HTML that was downloaded the last time the script was run. This is an attempt to prevent repeat emails; it doesn't work perfectly at this time.


//** check if the content has changed since last run

  chdir('..');
  $filename = "rss_backup.txt";
  // on the first run there is no stored hash yet, so start from an empty one
  $old_hash = file_exists($filename) ? trim(file_get_contents($filename)) : "";

  $hash = md5($orig_file);

  echo "**** HASH CHECK of index.html ****"."\n\n";

  echo "old hash: ".$old_hash."\n";
  echo "new hash: ".$hash."\n";


  // strict comparison, so two different hashes that both look numeric
  // (e.g. "0e1234...") are never treated as equal
  if ( $old_hash === $hash ) {
  	$foo = system('rm -r wget_dump');
  	exit("\n"."no rss update at this time"."\n\n\n\n");
  }

  chdir('wget_dump');
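
The change check can be sketched as a small standalone function. This is a minimal illustration, not the script's actual structure: `content_changed` and the temporary state file are hypothetical names, and unlike the script, which writes the hash only after the email is sent, this version records the new hash as soon as it sees a change.

```php
<?php
// Hypothetical helper (not in the original script): reports whether
// $content differs from what was seen on the previous call, and stores
// the new MD5 hash in $stateFile when it does.
function content_changed(string $content, string $stateFile): bool
{
    $newHash = md5($content);

    // A missing state file means nothing has been recorded yet.
    $oldHash = is_file($stateFile)
        ? trim((string) file_get_contents($stateFile))
        : '';

    if ($oldHash === $newHash) {
        return false;
    }

    file_put_contents($stateFile, $newHash);
    return true;
}

// Demonstration with a throwaway state file.
$state = sys_get_temp_dir() . '/rss_backup_demo_' . uniqid() . '.txt';
var_dump(content_changed('<p>one</p>', $state)); // bool(true): first run
var_dump(content_changed('<p>one</p>', $state)); // bool(false): unchanged
var_dump(content_changed('<p>two</p>', $state)); // bool(true): new content
unlink($state);
```

Hashing the whole page means that any byte that differs between downloads counts as a change, whether or not a new item was shared.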
 

 
 

This uses a regular expression to locate all of the images referenced in the HTML. It also rewrites these image links to include "cid:", so that something that read foo.jpg will now read cid:foo.jpg, which is the syntax used in HTML emails when referencing attached images. I used the very very awesome RegExr to write the search expressions.


//** regex to extract image filenames and to rewrite the src attributes

  // the lookahead requires a match to end at the closing quote, so that
  // "foo.tiff" is captured whole rather than cut short as "foo.tif"
  $pattern = '/[^"]+\.(jpg|jpeg|gif|png|tif|tiff|bmp)(?=")/i';
  $replace = 'cid:$0';
  preg_match_all($pattern, $contents, $matches);
  $contents = preg_replace($pattern, $replace, $contents);

  $maxcounter = count($matches[0]);
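
To make the rewrite concrete, here is a small sketch run against made-up markup. The sample HTML is an assumption standing in for the wget-converted digest, and the pattern is a variant that uses a lookahead so each match must end at the attribute's closing quote.

```php
<?php
// Hypothetical sample markup standing in for the wget-converted digest.
$html = '<p>Hello</p><img src="cat.jpg" alt="a cat"><img src="chart.tiff">';

// Match a double-quoted value that ends in an image extension. The
// lookahead (?=") keeps "chart.tiff" from being cut short as "chart.tif".
$pattern = '/[^"]+\.(jpg|jpeg|gif|png|tif|tiff|bmp)(?=")/i';

// $matches[0] holds the filenames, $matches[1] the extensions.
preg_match_all($pattern, $html, $matches);

// Prefix every matched filename with "cid:".
$rewritten = preg_replace($pattern, 'cid:$0', $html);

print_r($matches[0]);   // cat.jpg and chart.tiff
echo $rewritten, "\n";
// <p>Hello</p><img src="cid:cat.jpg" alt="a cat"><img src="cid:chart.tiff">
```

Note that the pattern assumes double-quoted attribute values, as the original does, and it matches any quoted value ending in an image extension, not only `src` attributes.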
 


 
 

The rest of the code (and some from above) is largely pulled from the class example. It attaches the images, sends the email, and does some cleanup.


//** create the HTML version of the body content here.

  $htmlVersion = $contents;
  unset($msg);

//** !!!! SEND AN HTML EMAIL w/ATTACHMENT !!!!
//** create the new message using the to, from, and email subject.

  $d = date('r');
  $msg = new Email($Recipiant, $Sender, "Google Reader Shared Items Backup for ".$d);
  $msg->Cc = $Cc;
  $msg->Bcc = $Bcc;

//** set the message to be HTML (not text only) and set the email content.

  $msg->TextOnly = false;
  $msg->Content = $htmlVersion;

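//** attach any images to the email

  // map file extensions to registered MIME subtypes
  // (e.g. "jpg" is sent as image/jpeg, "tif" as image/tiff)
  $subtypes = array('jpg' => 'jpeg', 'tif' => 'tiff');

  for ($i = 0; $i < $maxcounter; $i++) {
  	$ext = strtolower($matches[1][$i]);
  	$mime = "image/".(isset($subtypes[$ext]) ? $subtypes[$ext] : $ext);
  	$target_file = $matches[0][$i]; 
  	$msg->Attach($target_file, $mime);
  }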

//** send the email message.

  $SendSuccess = $msg->Send();

//  echo "HTML email w/attachment was ",($SendSuccess ? "sent" : "not sent"), "\n";

  unset($msg);

//** do MD5 to prevent repeat emails

  chdir('..');	
  $myFile = "rss_backup.txt";
  $fh = fopen($myFile, 'w') or die("can't open file");
  fwrite($fh, $hash);
  fclose($fh);

//** clean up

  $foo = system('rm -r wget_dump');

?>
 

 
 

I've set up a cron job on my DreamHost account to run the script a couple of times a day. The result can be seen below: an HTML email containing attached images, displayed inline.

 
[Image: Gmail]