Archiving-tutorial

From Emre's Wiki

Introduction

This tutorial is a step-by-step guide for setting up an archive for collecting mails and/or newsgroup postings.

Besides covering the usual configuration tasks, I'll also explain how to split the files that get created during the archiving process into YYYY.MM (year.month) chunks, without having to use the contributed scripts in the main Hypermail distribution (which I find non-comprehensive).


Needed Software

Several tools are needed; some of them are usually already included in a Linux, FreeBSD, or similar distribution.

Please refer to your local package management tool to find precompiled binaries. If you're happy compiling things yourself, feel free to do so.

The tools are:

  • Hypermail: Converts a Unix-style Inbox file into HTML. The source code and further information are available at http://www.hypermail.org. For nice results, compiling the source code after making some modifications to it is recommended.
  • Newsfetch: Retrieves the postings of a newsgroup from a news server and pipes each one to another program (here: Procmail). It is only necessary for newsgroup archiving.
  • Procmail: Once all tools and config scripts are in place, Procmail does the job of moving each mail into the appropriate Inbox file according to its date.
  • Formail: You might already have a big Inbox file that you want to split up. In that case you will need formail, which does some formatting jobs for you. It is only necessary if you want to split an existing archive.

Other Prerequisites

For all further steps I'll assume that your web server's document root is /usr/local/apache/htdocs and that you have created a user on your system named after your archive.

For this tutorial we'll be collecting the group comp.dcom.sys.cisco, so you might wish to add a user 'arch-cdsc' to your box.

  • /home/arch-cdsc -> the path to the user's home directory
  • /home/arch-cdsc/archives -> this is where we'll place the Inbox files
  • /home/arch-cdsc/etc -> configuration files go here
  • /home/arch-cdsc/bin -> shell scripts go here

Creating /usr/local/apache/htdocs/comp.dcom.sys.cisco might also be a good idea. We'll place the HTML files in this directory.
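
The directory layout above can be created in one go. The following is only a minimal sketch: the function name create_layout is made up for this example, and it assumes you have write permission in both locations (you will probably need root for the document root).

```shell
# Create the directory layout used in this tutorial.
# $1 = home directory of the archive user, $2 = web server document root
# (create_layout is a hypothetical helper name, not part of any tool)
create_layout() {
    archive_home="$1"
    docroot="$2"
    # Inbox files, configuration files and shell scripts
    mkdir -p "$archive_home/archives" "$archive_home/etc" "$archive_home/bin" || return 1
    # Target directory for the HTML files produced by Hypermail
    mkdir -p "$docroot/comp.dcom.sys.cisco"
}

# Example call with the paths assumed in this tutorial:
# create_layout /home/arch-cdsc /usr/local/apache/htdocs
```

Passing the two base paths as arguments makes it easy to try the layout in a scratch directory first.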

Step by Step

Step One - Creating the Newsfetch Configuration File

This one's not too hard: a file 'newsfetch.rc' in /home/arch-cdsc/etc with the following line will do:

comp.dcom.sys.cisco 0 0

Step Two - Creating the Hypermail Configuration File

We'll put the following content in the file /home/arch-cdsc/etc/hypermail.rc:

ietf_mbox = 0
htmlsuffix = html
usemeta = 0
#archives = http://www.emre.de/hypermail/index.php
custom_archives = NONE
#about = http://www.emre.de/hypermail.php
defaultindex = thread
reverse = 0
usetable = 0
indextable = 0
progress = 1
attachmentlink = "%p"
show_msg_links = 1
showheaders = 0
showreplies = 1
showhtml = 2
showbr = 1
iquotes = 1
showhr = 0
overwrite = 0
readone = 0
increment = 0
discard_dup_msgids = 0
require_msgids = 0
dateformat = "%D-%r Z"
stripsubject = NONE
eurodate = 1
dirmode = 0755
filemode = 0644
mailcommand = mailto:$TO?subject=$SUBJECT
mailto = NONE
domainaddr = NONE
ihtmlheaderfile = NONE
ihtmlfooterfile = NONE
mhtmlheaderfile = NONE
mhtmlfooterfile = NONE
hmail = NONE
icss_url = http://localhost/style.css
mcss_url = http://localhost/style.css
show_headers = From,Subject,Date
inline_types = image/gif image/jpe
ignore_types = text/x-vcard
ignore_types = application/x-msdownload
text_types = text, text/plain, message/rfc822
prefered_types = text/plain, text/html
thrdlevels = 8
spamprotect = 1
attachmentsindex = 0
linkquotes = 0
monthly_index = 1
yearly_index = 1
usegdbm = 1
domainname = NONE

There is nothing special about the settings above, except maybe the URLs of your stylesheet file (the lines with 'localhost'). Please adapt them to your stylesheet file if you want to use one. If you don't, just delete those lines.

Step Three - Creating a Shell Script for Calling Newsfetch as a Cron Job

Putting this in /home/arch-cdsc/bin/newsfetch.sh will take care of collecting the newsgroup posts (please adjust the news server entry):

#!/usr/bin/bash
/usr/local/bin/newsfetch news.something.com \
 -f /home/arch-cdsc/etc/newsfetch.rc \
 -p "/usr/local/bin/procmail"

This tells Newsfetch to retrieve the postings one by one and pipe each of them to procmail (which decides on the output file according to the year and month of the posting).

Step Four - Creating the Procmail Configuration File

This one took a great deal of testing. The reason is that several formats are in use for placing the date of a mail/posting in the mail header.

Two lines in the mail header help with retrieving the date: the 'From ...' line and the 'Date:...' line.

Just some examples:

From test@somewhere.de Wed Dec 12 20:00:07 2001

and

Date: Wed, 12 Dec 2001 17:54:09 +0100

The 'From ...' line is more accurate, as it is set according to the timezone of your MTA (Sendmail, Exim, etc.), whereas 'Date:...' is set according to the timezone of the sender's mail client.

Using 'From ...' should be preferred where possible. Unfortunately, newsfetch sets a standard entry of 'From localhost Sat Apr 26 18:57:03 WAT 1997' in all postings it retrieves. Of course this makes sense, as no MTA is involved as there is with regular mail. So, for newsgroup archiving we'll have to use the 'Date:...' line of the header.

A .procmailrc file for newsgroup archiving can be found here: http://www.emre.de/files/news.procmailrc

The .procmailrc file for all other archiving purposes can be found here: http://www.emre.de/files/mail.procmailrc

Please rename the downloaded file to '.procmailrc' and place it in '/home/arch-cdsc'.
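
The linked recipes are not reproduced in this tutorial. To illustrate the kind of mapping they have to perform, here is a rough shell sketch that turns a 'Date:...' line into a YYYY.MM folder name. It is not the recipe from the files above: the function name date_to_folder is made up, it only understands dates written as "day month four-digit-year" with English month abbreviations, and it ignores the 'From ...' line entirely.

```shell
# Map a "Date:" header line to a YYYY.MM folder name.
# Prints "Other" when no "<Mon> <4-digit year>" pair is found.
# (date_to_folder is a hypothetical helper, not part of procmail)
date_to_folder() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            # Build a lookup table: month abbreviation -> two-digit number
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (n = 1; n <= 12; n++) month[names[n]] = sprintf("%02d", n)
        }
        {
            # Scan the fields for a month name directly followed by a year
            for (i = 2; i < NF; i++) {
                if (($i in month) && $(i + 1) ~ /^[0-9][0-9][0-9][0-9]$/) {
                    print $(i + 1) "." month[$i]
                    exit
                }
            }
            print "Other"
        }'
}

date_to_folder 'Date: Wed, 12 Dec 2001 17:54:09 +0100'   # prints 2001.12
```

Anything the function cannot parse ends up as 'Other', which makes unusual date formats easy to spot and collect.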

The reason why it makes sense to split the mails/postings up is that one huge Inbox heavily confuses Hypermail and once took my server offline (it consumed all RAM and swap space). Using smaller files helps maintain the performance and stability of your system. Please see http://www.hypermail.org/hypermail-faq.html#15 for performance issues of Hypermail.

Step Five - Creating a Shell Script for Calling Hypermail

This script will be called 'hypermail.sh' and will be located in /home/arch-cdsc/bin:

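#!/usr/bin/bash
# Convert every Inbox file in the archives directory into its own HTML directory
for f in /home/arch-cdsc/archives/*
do
 # Skip anything that is not a regular file (e.g. an empty directory listing)
 [ -f "$f" ] || continue
 i=$(basename "$f")
 /usr/local/bin/hypermail \
  -m "/home/arch-cdsc/archives/$i" \
  -d "/usr/local/apache/htdocs/comp.dcom.sys.cisco/$i" \
  -l comp.dcom.sys.cisco \
  -c /home/arch-cdsc/etc/hypermail.rc
done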

The switches being used are:

  • -m -> which Inbox file to use
  • -d -> where to place the HTML files
  • -l -> the 'label' of the archive as printed on the upper part of the created HTML file
  • -c -> which config file to use

Step Six (the last one :) - Creating the Cron Jobs

Now that all's in place, you only need to create two cron job entries for the user 'arch-cdsc':

# Fetch the news and place them in their Inboxes
0 2 * * * /home/arch-cdsc/bin/newsfetch.sh > /var/log/newsfetch_cdsc.log 2>&1
# Convert those Inboxes to HTML
0 3 * * * /home/arch-cdsc/bin/hypermail.sh > /var/log/hypermail_cdsc.log 2>&1

Some Words on Indexing with Search Engines

The next task would be getting your archive indexed with a search engine such as htdig or MnoGoSearch. Explaining those would be beyond the scope of this tutorial. Still, if you decide on using MnoGoSearch, you might find the following hints useful.

The HTML files produced by Hypermail always have navigation elements for 'next mail', 'next in thread', etc. When you run MnoGoSearch, these navigation elements get into your database and make it grow unnecessarily.

Furthermore, when you use the web frontend for searching your archive, you'll get the navigation text as part of the summary of the search result, which is definitely not what you want. Seeing the content of the mail/posting makes more sense, so you'll need to prevent MnoGoSearch from indexing the unwanted links.

MnoGoSearch provides the proprietary tags <!--UdmComment--> and <!--/UdmComment--> for this purpose; they exclude the enclosed parts of a web page from the indexing process.

There is no configuration setting in Hypermail for enclosing the navigation elements in such tags, so some tweaking of the source code is necessary. In case you want to make use of that feature, you can download my diff file for Hypermail 2.1.4: http://www.emre.de/files/hypermail.diff

Prior to compiling Hypermail, just cd to the directory where you have untarred it and do a 'patch < /path/to/hypermail.diff'

Please contact me if something does not work the way it's supposed to.

Having the next release of Hypermail supporting this feature would of course be far better.

Splitting Up an Existing Inbox File

I guess you have followed the steps above? In this case just do an 'su arch-cdsc' and cd to the user's home directory.

You might wish to create a temporary folder for the split files, so you can check whether everything worked out fine during the conversion.

Modify the .procmailrc file and change the destination folder of the Inboxes to '$HOME/tmp', so the results of your conversion do not interfere with other files you might already be using.

Let's assume your big Inbox file is located in /tmp/mybigfatinbox and your temporary folder is /home/arch-cdsc/tmp. You would now enter the following command to get things going:

formail -b -e -d -c -f -m 6 -s procmail < /tmp/mybigfatinbox

The meaning of the switches is:

  • -b -> do not escape fake "From " headers
  • -e -> no empty lines required to separate mails from each other (makes the conversion more tolerant)
  • -d -> disable evaluation of the "Content-Length" header (makes the conversion more tolerant)
  • -c -> concatenate header fields spanning multiple lines
  • -f -> just pass along data that cannot be interpreted
  • -m -> minimum number of known consecutive header fields needed to detect the start of a new message (6 works far better than the default of 2). Setting this prevents formail from mistaking quotes in the mail body for the start of a new mail
  • -s -> execute the following program after extraction of each mail (in this case run procmail)

After a while everything should be finished. You should now have YYYY.MM files in your $HOME/tmp folder. If there is a file called 'Other', this is where all the mails went whose date could not be determined. Please let me know if this happens and mail me the headers of these mails, so I can adjust the regular expressions to catch this particular variation of the date format too.
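
To check the result of the split, something along the following lines may help. It is only a sketch: the function name summarize_split is made up, and the message count is approximate, because it simply counts lines starting with "From " (body lines that begin this way are counted too, since -b leaves them unescaped).

```shell
# Print "<file> <message count>" for every file in a directory of split Inboxes
# and warn if an 'Other' file (mails with unrecognised dates) is present.
# (summarize_split is a hypothetical helper name)
summarize_split() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Each message in an Inbox file starts with a "From " separator line
        count=$(grep -c '^From ' "$f" || true)
        printf '%s %s\n' "$(basename "$f")" "$count"
    done
    if [ -f "$dir/Other" ]; then
        echo "warning: $dir/Other exists; some dates were not recognised" >&2
    fi
}

# Example call for the temporary folder used above:
# summarize_split "$HOME/tmp"
```

If the counts look plausible and there is no warning, the files can be moved from the temporary folder to the archives directory.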

Thanks and Credits

Many thanks go out to Don Hammond (procmail1(a)tradersdata.com) and Roland Rosenfeld (roland(a)spinnaker.de) who greatly helped getting the Procmail recipe working.

Contact

If you have suggestions for this tutorial, please feel free to contact me: Emre Bastuz (info(a)emre.de).
