More Robot Fun

We mentioned yesterday that WWW::SimpleRobot does not respect the robots.txt file. Well, there’s always WWW::RobotRules, which was easily dropped into place, giving me a simple robot that follows the rules of a site’s robots.txt file, if one exists.
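
Wiring it in is about as simple as the WWW::RobotRules synopsis suggests. Here’s a minimal sketch, assuming LWP::Simple does the fetching; the agent name and URLs are just placeholders:

    #!/usr/bin/perl
    # Check a URL against a site's robots.txt before fetching it.
    use strict;
    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my $rules = WWW::RobotRules->new('outlinebot/0.1');

    # Fetch and parse the robots.txt file, if the site has one.
    my $robots_url = 'http://example.com/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Now ask before every fetch.
    my $url = 'http://example.com/some/page.html';
    if ($rules->allowed($url)) {
        print "OK to fetch $url\n";
    }
    else {
        print "robots.txt says leave $url alone\n";
    }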

My “outlinebot” is working quite well now, and I’m sure I’ll be tweaking it for the next few weeks or months… In case I haven’t said it in a while: I really like Perl…

R$$ and Privacy

Tim Bray had this idea, and I must admit I’d had the same idea as well: an RSS feed of my financial transactions. I know, it’s most likely a long way off… or is it? While driving home last night I heard a commercial promoting email alerts from a bank. They seemed to be saying you could have account information sent to you via email. Now, I don’t know what kind of information they are sending, and I hope it’s encrypted with PGP/GPG or something, but here’s where it gets interesting. If my bank sends me email with useful data, I can easily parse that data and build it into some sort of RSS feed for my own use. I know, it’s a lot more complex than that, but it’s the start of an idea anyway…
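
The feed-building half is the easy part. A minimal sketch with XML::RSS, assuming the transactions have already been parsed out of the mail somehow; the data and file names here are made up:

    #!/usr/bin/perl
    # Turn parsed transaction records into a private RSS feed.
    use strict;
    use XML::RSS;

    # Pretend these came out of the bank's email alerts.
    my @transactions = (
        { date => '2002-11-20', desc => 'Coffee',  amount => '-3.50'  },
        { date => '2002-11-21', desc => 'Deposit', amount => '500.00' },
    );

    my $rss = XML::RSS->new(version => '1.0');
    $rss->channel(
        title       => 'My Transactions',
        link        => 'http://localhost/transactions.rdf',
        description => 'Private feed of bank transactions',
    );

    for my $t (@transactions) {
        $rss->add_item(
            title       => "$t->{date}: $t->{desc}",
            link        => 'http://localhost/transactions.rdf',
            description => "Amount: $t->{amount}",
        );
    }

    $rss->save('transactions.rdf');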

Which brings up another interesting issue: the privacy of RSS feed subscription information. Many people share their subscription file on their sites, which is a good idea and enables some neat things, but when I did this I first had to delete a feed, not because of a privacy concern, but because it was a resource the world couldn’t reach: an internal project server. So I’d propose the following to the aggregator makers: add a way for a feed to be marked as private, so that when I export my subscription file, it contains only the public feeds I subscribe to. It would also be useful for people wishing to avoid the embarrassing “You subscribe to what feed?” question…
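
In OPML terms it could be as simple as a new attribute on the outline element. The private attribute below is purely hypothetical, just to illustrate the idea; an aggregator would skip any feed marked this way when exporting:

    <outline title="Internal Project Feed"
             xmlUrl="http://intranet.local/project/index.xml"
             private="true"/>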

Site Outline!

I also found some code to build a site outline, as mentioned yesterday. I’m using WWW::SimpleRobot. It just took a few small tweaks to the example to get what I needed. What I’m really after is a spider I can point at a site and have it show me all the URLs it can find, so I can compare them against the files of the site (on my local filesystem) to see what doesn’t get spidered. It’s a search engine robot simulator.
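
The tweaked version looks roughly like this. It’s a sketch built from the module’s own example; the site, depth, and regex are placeholders:

    #!/usr/bin/perl
    # Crawl a site with WWW::SimpleRobot and print each URL found,
    # indented by crawl depth.
    use strict;
    use WWW::SimpleRobot;

    my $robot = WWW::SimpleRobot->new(
        URLS           => ['http://example.com/'],
        FOLLOW_REGEX   => '^http://example\.com/',
        DEPTH          => 3,
        TRAVERSAL      => 'depth',
        VISIT_CALLBACK => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print '  ' x $depth, $url, "\n";
        },
    );
    $robot->traverse;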

(Note: WWW::SimpleRobot does not respect the robots.txt file, so use it with care.)

md5checker

I wrote a simple Perl wrapper for my md5sum differ idea, and it works well, but it’s slow, mainly because it’s checking large files across the network. Not much I can do about that right now, but it’s a start…
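
The core of it is nothing fancy. A sketch of the idea using Digest::MD5, with the two paths (one local, one on a network mount) made up for illustration:

    #!/usr/bin/perl
    # Compare MD5 sums of the same file in two places.
    use strict;
    use Digest::MD5;

    sub md5_of {
        my ($path) = @_;
        open my $fh, '<', $path or die "can't open $path: $!";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        return $digest;
    }

    my $local  = '/data/backup.tar.gz';
    my $remote = '/mnt/server/backup.tar.gz';

    # The slow part: the remote read pulls the whole file over the network.
    print md5_of($local) eq md5_of($remote) ? "match\n" : "DIFFER\n";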

Site Outline?

Long ago I had a pretty simple Perl script that you would point at a URL, and it would spider the site and give you an outline. The output was something like this:

  • http://example.com/
    • http://example.com/about/
      • http://example.com/about/foo.html
    • http://example.com/contact/
    • http://example.com/help/
      • http://example.com/help/fee.html

I can’t find that code anywhere. Does anyone have something quick-n-dirty that might work? Let me know!
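
In the meantime, something quick-n-dirty along these lines might be a starting point. It only does one level (no recursion, no robots.txt), and it’s a from-memory sketch using LWP::Simple and HTML::LinkExtor rather than the original script:

    #!/usr/bin/perl
    # List the links found on a single page, as a first step
    # toward the site-outline spider.
    use strict;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;
    use URI;

    my $base = shift || 'http://example.com/';
    my $html = get($base) or die "couldn't fetch $base\n";

    my %seen;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            return unless $tag eq 'a' and $attr{href};
            # Resolve relative links against the page URL.
            my $url = URI->new_abs( $attr{href}, $base )->canonical;
            print "  • $url\n" unless $seen{$url}++;
        }
    );
    $parser->parse($html);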