
Re: feedback/todo on cback



I've combined two replies into one here. I don't think the second of
your two messages got to the list for some reason, though.  Odd.

This whole thing got kind of long, sorry.  Hopefully, this is
interesting stuff to the others on the list, too.

> >>1· cdrecord issues
> >    cdrecord -scanbus dev=ATA
> That did the trick. I didn't know I had to specify dev=ATA; without it, 
> it just showed me an error about the sg/scsi emulation not being found.
> I was a bit in a hurry today and didn't get the time to try harder, sorry.
> I'm now able to record :)

That's no problem.  I will make a note to add diagnostic procedures like
these into some sort of appendix.

> >>Anyway, I would expect cback to create an ISO file even if it 
> >>fails to actually record it. It would be nice if it created an ISO also 
> >>for DVD size, even if it can't manage to record it directly.
> > 
> > If what you want is a different action (perhaps, "makeiso") that creates
> > an ISO image rather than writing a CD, this is something I could
> > accommodate.  We could build that action so it wouldn't have any
> > restrictions on image size, or perhaps only optionally enforce an image
> > size.
> > 
> > I will also eventually get around to adding support for writing DVD
> > media, but I just haven't had time yet.  (It shouldn't be too difficult;
> > what I need is time to figure out the syntax for whatever command I'll
> > be using, and then time to adequately test things.)
> 
> I'll help you with this for sure, buddy! Indeed, I just started programming 
> in Python (I do Java) and this would be the best simple exercise to 
> start working on. Indeed, one of the reasons I started looking at cback 
> was the fact that it was written in Python and was a simple, well-structured 
> project to start working on, and I also need to solve my luxurious 
> backup needs :) 

I'm glad you think (on first glance) that Cedar Backup is simple and
well-structured.  That was my goal.  I know it's not perfect, and there
are some things I would change having looked back on it.  However, I
think it's in a form that makes it a good starting point for future
enhancements.

I am also a Java programmer professionally, and that has influenced the
way I work with Python to a certain extent.  I find that while Java
might be better "in the enterprise", I am often a lot more productive in
Python when I get the chance to use it.  I appreciate that simple things
can be accomplished quickly (i.e. I don't need to instantiate a
LinkedList object, I just have a list right there) and I like that I can
use arbitrary combinations of functions and classes.  It's also nice to
be able to divide code up by module (file) rather than needing to have
each class in its own source file.

Anyway, I think you'll find you like Python once you get used to it.
Your biggest problem might be switching back and forth between the two.
Sometimes, they're close enough that I confuse myself ("damn! it's
'catch', not 'except'!").

> I want to 'try' to explain my dream for a backup 
> solution and where cback can go in my vision.

Ok.

> I currently don't have a real backup system. I just replicate data on
> another disk and once in a while I make a CD out of it. This works
> well for simple needs but is not a professional solution. 

Yep, that's where I was before Cedar Backup.

> When I say 
> professional I really meant it:
>
>  From the user perspective it should be a (simple) GUI/web showing the 
> filesystem tree where you can see for each directory/files a different 
> color based on the backup policy it has.

Would be nice.

> This tool will produce XML that might be close to the actual cback 
> XML syntax, collecting the backup policy preference for each dir/file.
> After that, this XML file is transformed with an ad-hoc XSL into the 
> specific dialect for a tool like unison 
> (http://www.cis.upenn.edu/~bcpierce/unison/index.html) or rsync
> (http://www.mikerubel.org/computers/rsync_snapshots/)
>
> I currently use unison (a two-way sync) and the main advantage of this 
> approach is the way incremental backups are made. It really saves disk 
> space and processing time. It's a very fast and cheap way of creating 
> snapshot-style backups of your data.
>
> These tools deal better (performance- and functionality-wise) with 
> problems like checksumming/diffing and even merging when needed than cback 
> does, and are truly tested tools that have been used in production for 
> years. They really shine when it comes to replication. Of course they are 
> not complete backup solutions, but you can make snapshots of the 
> repository they make.

Ok, and what these tools get you is a duplicate of the filesystem,
correct, not a set of tar files like Cedar Backup?

> So cback again would be useful to make the snapshot from these repos and 
> record it on media (if needed).

Well, certainly, it would not be difficult to write a Cedar Backup
extension to "collect" data via Unison or rsync, and then write it to
disc using a process much like the current store process.
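
A minimal sketch of what such a collect step might look like, assuming
rsync is used to mirror a directory tree into a staging area.  The
function names and the idea of returning the command as a list are
hypothetical; this is not Cedar Backup's actual extension interface.

```python
import subprocess


def build_rsync_command(source_dir, target_dir, excludes=()):
    """Build an rsync command line that mirrors source_dir into target_dir."""
    # -a preserves permissions/times; --delete removes files gone from the source
    command = ["rsync", "-a", "--delete"]
    for pattern in excludes:
        command.append("--exclude=" + pattern)
    # A trailing slash on the source copies its contents, not the directory itself
    command.append(source_dir.rstrip("/") + "/")
    command.append(target_dir)
    return command


def collect_with_rsync(source_dir, target_dir, excludes=()):
    """Run rsync; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_rsync_command(source_dir, target_dir, excludes), check=True)
```

Keeping command construction separate from execution makes the
command easy to log or inspect before anything is actually run.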

> But what if I want to go back to the situation I was in one month 
> ago for a particular file or for the whole filesystem?
> This is the job of subversion, so it probably makes sense to import the 
> rsync/unison-created repository into it, tag it and maybe take the 
> snapshot of it with a cback-like tool.

Ah, ok.

> So this is (my) big picture: the tools are already there, they just need 
> to be integrated.
> What do you think? I wanted to share my dream

I'm glad you shared your dream. :)

Obviously, Cedar Backup does most everything I need or want it to do,
otherwise I wouldn't have written it the way I did.  So, I need people
like you to help me understand how it might be extended.

My most frequent request related to Cedar Backup is to somehow modify
the collect and stage processes.  For people like me, it works fine.
Others want to use tools like BSD dump to collect the data, and have
Cedar Backup stage it.  Your request for using rsync or unison to
collect/stage the data comes in a similar vein.  Perhaps we can come up
with a general way to accommodate these kinds of requests?

> Now I go to sleep; I've been awake for 23 hours and I'm surprised I can 
> still formulate proper English sentences... I'm Italian after all.
> I'll reply to the other things tomorrow :)

Heh.  Your English is fine.

> >>> 2· exclude/include patterns
> >>> In my opinion <ignore_file> should be removed, because with <exclude> 
> >>> you have full power, and it is not beautiful to have these files around 
> >>> just to override the general setup. Instead, a user conf file should 
> >>> reside in some dir like ~/.cback, and users can set up cron jobs as well...
> >>
> >> Heh.  Well, if it's not beautiful to you, there's no need to use it. :)
> >>
> >> Seriously, keep in mind that a per-user ~/.cback file does not really
> >> provide equivalent functionality to a per-directory ignore indicator
> >> file. 
> >> Per-directory ignore files apply system-wide.  If any user creates an
> >> ignore file in a particular directory, any Cedar Backup run by any user
> >> will ignore that directory (assuming the backup is configured to pay
> >> attention to an ignore indicator file at all). 
> >> Assuming that the per-user ~/.cback file behaved in the "standard" way,
> >> it would only apply to Cedar Backup runs executed by that particular
> >> user, which is not the same thing as you get with ignore indicator
> >> files.
> >>
> >> Remember, Cedar Backup is primarily intended to be run as root for large
> >> parts of a system which might contain multiple users, rather than being
> >> run by lots of individual users on a system.  (See the distinction?)
> 
> I see what you mean, but I disagree, since what you call system-wide conf 
> is just operations that _may_ require root privilege based on what you 
> are backing up. So in short, in my opinion ~/.cback is just a conf used 
> by root or by a user with root privileges for system-wide operation and 
> by a normal user for user-specific needs. However, an override mechanism to 
> not back up a particular dir/file may be useful, but I think this is the 
> wrong way to enforce it. A user can change the group ownership of the 
> particular dir/file so that the backup user cannot read it, for example, 
> or make a global exclude in their ~/.cback... see below

First: you are correct that if the backup is being run by some user
other than root, a user can change file or directory ownership to
prevent something from being backed up.  However, up until now, I have
generally assumed that backups would either be run by root, or would be
run by individual users on their own data.  Is there some other use-case
I've missed? 

There are three issues here, as I see it:

   1) Where does global configuration reside?
   2) Are per-user configuration files allowed, and what do they do?
   3) Should we allow per-directory exclusions using an ignore file?

Regarding 1): I am convinced that it makes sense for global
configuration to reside in /etc.  It has to reside somewhere, and that
is the most consistent place for it to be.  A few Debian packages put
some global configuration in ~root (MySQL login information seems to be
one example) but that's an unusual case.

Regarding 2): Cedar Backup does currently allow custom configuration
files via the --config switch.  However, this is a complete
configuration replacement.  I guess I'm open to adding a true per-user
configuration option (to override part or all of global configuration),
but I would want to have a good use case before doing it.  

Note, however, that this per-user configuration file would only be used
when the backup was being run by the user in question.  I wouldn't want
Cedar Backup to have to look in the home directory of one user when
executing a backup run by another user.

Regarding 3): I'm still convinced that it's worth having per-directory
exclusions, and I do have a good use-case for it, even if we went ahead
and added real per-user configuration files.  Let's take as an example
skyjammer.com, which I help administer.

On skyjammer.com, Cedar Backup is configured to back up certain
directories, including some home directories.  My friends on that box do
not have root access and do not run Cedar Backup themselves.  They just
know that Cedar Backup runs once per day as root and backs up things we
have mutually decided are important.

If my friend Phil decides to upload his movie collection to his home
directory some day, for some reason, he knows that this will completely
overload the skyjammer backup capacity.  Instead of having to contact me
or another admin to update configuration for the global backup, he just
creates .cbignore in his movie directory.  There.  He gets control over
what's backed-up in his home directory, even though the backup is being
run globally by root and he's not involved at all in administering the
backup.  Same goes if he decides to use $TMP=~phil/tmp rather than
$TMP=/tmp.  The other admins and I don't have to care, because Phil just
creates ~phil/tmp/.cbignore and things work.

If Phil were running the backups himself, then a per-user configuration
file would do him some good.  However, he's not.  There's just one
backup for all of skyjammer.com, and Phil doesn't have access to
configuration for that backup.  Through per-directory ignore files,
we've now given Phil a good coarse-grained way to leave things out of
the backup.  It doesn't help much if what he wants is to ignore just a
couple of files in his home directory, but it does help with common
exclusion tasks (common in my experience, anyway).

The bottom line is: while I'm open to extending Cedar Backup to allow
other kinds of configuration, I certainly won't be removing
per-directory exclusions, which have been available in Cedar Backup for
years.  Worst-case, an admin who doesn't want this functionality can
disable it.
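
To make the per-directory behavior concrete, here is a minimal sketch
of a directory walk that honors an ignore indicator file.  It assumes
the indicator is named .cbignore as in the example above; the function
name is hypothetical and this is not the actual Cedar Backup code.

```python
import os


def collect_files(root, ignore_file=".cbignore"):
    """Return files under root, skipping any directory that contains ignore_file."""
    collected = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        if ignore_file in filenames:
            # Prune in place so os.walk does not descend any further
            dirnames[:] = []
            continue
        for name in filenames:
            collected.append(os.path.join(dirpath, name))
    return sorted(collected)
```

Because the check happens wherever the walk finds the indicator, it
works no matter which user runs the backup, which is the point of the
skyjammer.com example.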

> >>> Instead what I really miss from cback is the ability to include files 
> >>> from excluded dir. I'd like to make exceptions to subtree/files from 
> >>> dirs excluded. This is in my opinion the most needed feature.
> >>
> >> Hmm.  I can see why you might want that, but you can accomplish the same
> >> thing today by specifying finer-grained backups and exclusions, so it's
> >> kind of low on my priority list.
> >>
> >> What would you expect configuration to look like?  Some sort of
> >> exclusion-within-an-exclusion?  That makes me wonder whether it's really
> >> worth making configuration any more complicated than it already is.
> 
> What I expect is exceptions to exclusion/inclusion. I agree with you when 
> you say you can actually accomplish it with fine-grained <dir>/<exclude> 
> patterns, but the process is clumsy at best because each <dir> directive 
> causes the creation of a separate archive. 

Aha!  I think I understand now.  You see, I *want* a separate archive
for each of the various directories I back up.  You (and a couple other
users I've talked with recently) seem to want one gigantic archive for
the entire machine, which is not what I planned for.

> I'll give you a simple example:
> Say I want to back up, for each user, 
> /home/$USER/.mozilla/firefox/$PROFILE/bookmarks.html, pref.js and 
> the extensions dir, but not the other files in there. So a conf can be:
> <dir>
> 	<pattern>/home/[^/]*/.mozilla/firefox/[^/]*/</pattern>
> 	<exclude>
> 		<pattern> * </pattern>
> 		<excludenot>
> 			<rel_path> bookmarks.html </rel_path>
> 			<rel_path> pref.js </rel_path>
> 			<rel_path> extensions </rel_path>
> 		</excludenot>
> 	</exclude>
> </dir>
> 
> Can you imagine doing the same with the actual config? Besides the 
> several <dir> and <exclude> entries you have to specify, you end up with 
> different tar.gz files.

Ahhh... interesting.  You threw two different enhancements in there,
didn't you?  One enhancement is the <excludenot> stuff, and the other is
the option of specifying the directories to back up using a pattern.
That's interesting.  I never considered it.

Anyway, you're right.  The current syntax is clumsy for this.

> And what if I want to exclude all the contents of extensions but a 
> particular one? You end up with nested <exclude>/<excludenot>, so a 
> better XML example might be:
> 
> <pattern value="/home/[^/]*/.mozilla/firefox/[^/]*/" exclude="no">
> 	<pattern value="*" exclude="yes">
> 		<rel_path value="bookmarks.html" exclude="no"/>
> 		<rel_path value="pref.js" exclude="no"/>
> 		<rel_path value="extensions" exclude="no">
> 			<pattern value="*" exclude="yes">
> 				<pattern value="\{[^}]*\}/fooext" exclude="no"/>
> 			</pattern>
> 		</rel_path>
> 	</pattern>
> </pattern>
> 
> This example syntax makes very complex rules possible. In this case 
> fooext is a file present in an arbitrary directory named like 
> {68836a21-fc7d-4ea1-a065-7efabd99d414}/
> 
> In conclusion, exclude="no" and the nested nodes are simply exceptions to 
> exclude="yes". So a depth-first approach is needed.

Interesting.  I can see why you would want this.  Again, we're into
use-cases I didn't consider here.  Cedar Backup does offer some
flexibility, but I didn't consider people wanting to be this picky about
what to include or exclude in a given directory.

The problem with this is the implementation.  Cedar Backup currently
uses a top-down approach when collecting files, rather than a
depth-first approach.  So, even if I added this syntax to configuration,
it would require completely reworking the FileList object to allow
depth-first exclusions.  

If what you describe above is available in rsync or unison, perhaps the
best solution is to develop an alternative to the collect action using
one of those tools?
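
For illustration, here is a minimal sketch of how "exceptions to an
exclusion" could be evaluated for a single relative path.  It flattens
the nested XML example into an ordered rule list where the last
matching rule wins, which is a simplification and not the nested
syntax itself.  The rule list and function name are hypothetical.

```python
import re

# (regular expression, exclude?) pairs, ordered from general to specific.
RULES = [
    (r".*", True),                                 # exclude everything by default
    (r"bookmarks\.html", False),                   # ...except these files
    (r"pref\.js", False),
    (r"extensions/\{[^}]*\}/fooext", False),       # ...and one file per extension dir
]


def is_excluded(rel_path, rules=RULES):
    """Return True if rel_path is excluded; the last matching rule wins."""
    excluded = False
    for pattern, exclude in rules:
        if re.fullmatch(pattern, rel_path):
            excluded = exclude
    return excluded
```

Note that a decision like this is made per file.  The walk still has
to descend into an excluded directory to find the exceptions inside
it, which is why a top-down approach that prunes excluded directories
early cannot support this without rework.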

> >>> 3· global collect
> >> Does that make sense?
> 
> Sure it makes sense to have global policy and per dir policy:) sorry

No worries. :)

> >>> 4· index/search
> >>> There is no way to know which backup disc holds a certain file. 
> >>> It would be really nice if the digest file could be easily searched and 
> >>> connected to a backup set in order to know where you can find a 
> >>> certain file you have now deleted. I have yet to find a 
> >>> backup/restore solution providing this :(
> >>
> >> Ah.  I think what you're looking for is a way to say, "which backup disc
> >> should I look in to find this particular file?". 
> >> You're right, Cedar Backup doesn't give you a way to do this.  In fact,
> >> it doesn't even really know anything about this, because (for instance)
> >> it has no way to know whether you switched discs at the beginning of the
> >> week, or overwrote your current disc, or even if perhaps you put in
> >> another bogus disc which Cedar Backup was able to attach a new ISO
> >> session to.  It's not something that I could add very easily. :(
> 
> I agree. I still have to think about the best way to deal with this kind of 
> problem. For sure subversion can solve the need of "take me back to version 
> foo of the archive", so a snapshot of the repository tagged "foo" 
> might have the volume name "foo" and let subversion manage what files were 
> present at moment "foo" in time... what do you think?

Right now, Cedar Backup writes the same volume name to every disc.  It
wouldn't be too difficult to modify what volume name is used, if we can
come up with a good way to create useful volume names.
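
One possible scheme, purely as a sketch: derive the volume name from a
prefix and the backup date.  The prefix, the date format and the
function name are all assumptions.  The 32-character limit and the
restricted character set follow the ISO 9660 volume identifier rules.

```python
import datetime
import re


def make_volume_name(prefix, when):
    """Build a volume label such as CBACK_20051019 from a prefix and a date."""
    label = "%s_%s" % (prefix.upper(), when.strftime("%Y%m%d"))
    # Keep to A-Z, 0-9 and underscore; replace anything else
    label = re.sub(r"[^A-Z0-9_]", "_", label)
    # ISO 9660 volume identifiers hold at most 32 characters
    return label[:32]
```

A date-based name would at least let someone match a disc to a day's
log, even though it does not solve the "which disc has this file"
problem by itself.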

> >>> 5· slowness/debug info
> >>> The backup process seems really much too slow. For example, on one 
> >>> excluded tree it took 11 minutes just to realize it did not have to 
> >>> process it:
> >>> 2005-10-19T10:32:09 CEST --> [DEBUG  ] Path [/home] is excluded based 
> >>> on excludePaths.
> >>> 2005-10-19T10:43:52 CEST --> [DEBUG  ] Path [/var/tmp/backup] is 
> >>> excluded based on excludePaths.
> >>>
> >>> WHY???
> 
> >> Well, because it didn't get to /home until that point. :)
> 
> Are you sure? The slowness, as you can see from the log, is between /home 
> and /var/tmp/backup. /home has around 15 GB in it and it took 11 minutes 
> to pass over, even though it is excluded.

Yeah, I'm pretty sure.  Cedar Backup is working top-down and you might
be expecting it to work depth-first.  Besides that, since I didn't
envision Cedar Backup being used this way, what you're getting is a
really HUGE python list in memory, which is probably slowing things
down.

The flow of control is something like this in incremental mode:

   for collectDir in config.collect.collectDirs:
      backupList = BackupFileList()
      backupList.addDirContents(collectDir)
      newDigest = backupList.generateDigestMap()
      backupList.removeUnchanged(oldDigest)
      backupList.generateTarFile(tarfilePath, archiveMode)

The list of files to back up is created in addDirContents(), and then
unchanged files are removed from the list using removeUnchanged().  The
tarfile is actually created in generateTarFile().  The log messages you
are seeing are from addDirContents().

BackupFileList just inherits from the standard Python in-memory list, so you
can see where we might begin to run into performance issues there for
very large sets of files.

Besides that, now that I look at it, I could probably optimize the code
by combining the generateDigestMap() and removeUnchanged() steps.  This
sequence of calls actually results in generating the SHA digest value
twice for each file, which doesn't make sense.  That wouldn't have any
effect on the log messages you saw above, though.
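
A minimal sketch of the combined step, where each file is hashed
exactly once and the result is used both for the new digest map and
for the unchanged check.  The function names are hypothetical, and
SHA-1 via hashlib is an assumption; this is not the existing
BackupFileList code.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, old_digests):
    """Return (changed, new_digests), hashing each file exactly once."""
    changed = []
    new_digests = {}
    for path in paths:
        digest = sha1_of_file(path)
        new_digests[path] = digest
        # A file is "changed" if it is new or its digest differs from last time
        if old_digests.get(path) != digest:
            changed.append(path)
    return changed, new_digests
```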

> Anyway, as I wrote in my previous email, I think the collect process is 
> better handled by an rsync-like tool: faster and cheaper, performance-wise.

Yes, that's an option.  However, it does require storing directory trees
on disk rather than tarfiles, which isn't what Cedar Backup was really
designed to do.  Like I mentioned above, I'm open to an option like
this, perhaps an alternative collect/stage method.

I'm really beginning to think that people like you, who want to back up
really large sets of files, are not going to get acceptable performance
unless I rewrite the BackupFileList functionality and/or create an
alternative collect/stage method.  (I currently back up three machines
onto one CD, so this performance problem is just not something I see.)

> >>> There is not enough debug info, and the program runs for hours 
> >>> without saying what it is trying to do.
> >>
> >> Yeah, and then the last user who complained said that there was too much
> >> in the debug log. :) 
> 
> >> It's a struggle to find a balance in the right level of logging, without
> >> providing so much information that it's useless.  Cedar Backup v1.0
> >> listed every file it backed up, but that seemed excessive.  Cedar Backup
> >> v2.0 just logs information about individual collect directories, because
> >> that's how I expected users to configure it.
> 
> The user isn't expected to look into the log at debug level. At debug 
> level I expect all the info I can gather, even at file level.
> At info level, information about individual collect directories is fine, 
> with messages like "computing hash for foo" or "collecting 
> exclusion/inclusion for foo" etc.

I'll tell you what: I'll add a few extra log messages and send you a
patch, and you can look and tell me what you think.
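
As a rough idea of the split being discussed, here is a sketch using
Python's standard logging module: one INFO message per collect
directory and one DEBUG message per file.  The logger name, the
function and the message wording are all hypothetical.

```python
import logging

logger = logging.getLogger("cback.sketch")


def collect_directory(path, files):
    """Log a directory-level summary at INFO and each file at DEBUG."""
    logger.info("Collecting directory [%s] (%d files)", path, len(files))
    for name in files:
        logger.debug("Adding file [%s/%s]", path, name)
    return len(files)
```

With the logger set to INFO, only the per-directory line appears;
setting it to DEBUG adds the per-file lines.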

> >> You see, I never really expected users to want to back up their entire
> >> root directory, and it's not something I've really ever done.  
> >> [snip]
> 
> Also, a /home dir can be huge (more likely bigger than the system tree).

Hrm, I suppose.  However, since I only have 600 MB to work with (on a
CD) that wasn't really an issue for me when designing Cedar Backup.  It
only becomes an issue when you have a 4 GB DVD to work with or don't
care about writing to disc (like some other users).

> >>> 6· There is no pre/post process command execution. This is important 
> >>> in particular to hack some script to add features cback doesn't (yet) 
> >>> have or integrate it in the flow of other programs.
> >>
> >> Funny, I just had someone else ask for that just last week, in bug #27:
> >>
> >>    http://cedar-solutions.com/cgi-bin/bugzilla/show_bug.cgi?id=27
> >>
> >> Can you give me some thoughts on how you would expect this to work? 
> >> What would the commands be -- just shell commands, perhaps required to
> >> start with an absolute path?  Would you ever need to list more than one
> >> shell command in Cedar Backup configuration for a given action, or would
> >> you expect to combine all of your actions into one single shell script
> >> somewhere on the filesystem?  Can you ever imagine wanting to provide
> >> Python code (a function) rather than a shell command?
> >>
> >> Would it be enough to specify a single "pre-action hook" and a single
> >> "post-action hook" in configuration for that command, or would it be
> >> better to have a separate configuration action mapping hooks to actions?
> 
> In my opinion a single "pre-action hook" and a single "post-action hook" 
> is enough to open up cback to the scenarios I explained in my previous 
> email. Indeed, cback can be used to run just some of the actions and be 
> integrated with other programs providing the collect action or other 
> actions. I think a shell script hook is the best, but you can always 
> specify a type attribute to decide whether to hook a Python function call 
> or a bash script. See below on 7· Messaging

Ok.  That's basically the feedback I'm getting from The Anarcat too.

> >> (The first might be easier to understand, but the second would allow you
> >> to hook extensions without extensions having to know anything about it.)
> I don't get what you mean by "without extensions having to know anything 
> about it"

Hmm.  How can I explain this?  I think of things like this as "services"
that the Cedar Backup "runtime" should provide.  

For instance, the existing command-override configuration (that lets a
user override the path to commands like mkisofs) is a "service" that is
available to extension authors via the util.resolveCommand() function.
Extension authors don't need to know how resolveCommand() is configured,
or even if it was configured.  And, they don't need any configuration of
their own to take advantage of it.  The functionality just works.

I think that it would be nice if configuration for pre- and post-command
hooks were handled by the Cedar Backup "runtime", too.  This way, users
could configure a hook even for an extension whose author was completely
unaware hooks were even possible.  
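
A minimal sketch of that idea, assuming the runtime keeps a mapping
from action names to hook commands and wraps every action with it.
The function name, the shape of the hooks mapping and the run_command
parameter are all hypothetical; none of this exists in Cedar Backup
today.

```python
import subprocess


def run_with_hooks(action_name, action, hooks, run_command=subprocess.check_call):
    """Run action, wrapped in the pre/post hooks configured for action_name.

    hooks maps action names to {"pre": [argv...], "post": [argv...]};
    both keys are optional.
    """
    configured = hooks.get(action_name, {})
    if "pre" in configured:
        run_command(configured["pre"])
    result = action()
    if "post" in configured:
        # Only reached if the action itself did not raise
        run_command(configured["post"])
    return result
```

The action (built-in or extension) is just a callable here and never
sees the hooks mapping, so an extension author would not need to do
anything for hooks to apply.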

> >>> 7· Messaging
> >>> It would be nice to make the email message configurable and to be 
> >>> able to have an email or other form of message (IM for ex) run also 
> >>> on success.
> >>
> >> The thing is, Cedar Backup right now doesn't even know anything about
> >> email or any other form of notification.  It just assumes it's running
> >> in a terminal and prints things to stdout.  If something else (i.e.
> >> cron) emails that output around, then so much the better.  So, when you
> >> suggest making the email message configurable, you're really suggesting
> >> that I somehow make Cedar Backup's output to stdout configurable (or
> >> cron's email format configurable), which is not something I'm likely to
> >> do.
> >>
> >> I guess I'm open to adding in some sort of hook (maybe something a
> >> little like an extension, a Python function with a standard interface)
> >> for notification, but I would have to give it some thought before
> >> deciding for sure.  One problem is how to appropriately handle error
> >> conditions.  In other words, it's easy to send success status messages,
> >> but if the program crashes or fails hard, it may be difficult to send
> >> failure messages.  In the current model, cron handles all of the
> >> difficult parts for me. :)
> >>
> I agree, and for this reason I think of a specialized extension that does 
> messaging (in the preferred form) hooked to <post-action call="foo" 
> onsuccess="yes/no"> sending the log output for that particular action.
> One can put any <pre/post-action> call he/she wants in the config, making 
> cback a really powerful backup flow framework. What do you think?

Hmm.  That's interesting.  It would require collecting the log
information somewhere during execution, which is something I don't do
now (I just let Python's logging module do whatever it does).

In this case, what does call="foo" represent?  A call to a particular
Python function, kind of like an extension?  Could it even *be* an
extension, with the same function syntax and everything?  We could just
add an optional logMessages argument or something, perhaps?
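
Collecting the log output for one action could be done with a small
custom handler for Python's logging module, along these lines.  This
is only a sketch: the class name, run_and_notify() and the notify
callback signature are hypothetical.

```python
import logging


class MessageCollector(logging.Handler):
    """Logging handler that keeps formatted messages in memory."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


def run_and_notify(action, notify, logger):
    """Run action, then call notify(succeeded, log_messages)."""
    collector = MessageCollector()
    logger.addHandler(collector)
    try:
        action()
        succeeded = True
    except Exception:
        # Record the failure (with traceback) so the notification can include it
        logger.exception("Action failed")
        succeeded = False
    finally:
        logger.removeHandler(collector)
    notify(succeeded, collector.messages)
    return succeeded
```

This only covers failures that surface as Python exceptions.  A hard
crash of the interpreter would still send nothing, which is the case
where leaving notification to cron remains the safer option.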

> >>> 8· Space waste
> needed :/

Well, it shouldn't be too difficult.  What's more important to you,
combining the collect/stage steps, or having some alternative collect
mechanism?  There's no point in working on both if you'll only use one
or the other.

> >>> Hope is not too much and you feel depressed ;)
> >>
> >> Nope, it doesn't make me feel depressed.  I'm actually kind of
> >> fascinated, because as people start to use Cedar Backup, almost none of
> >> them seem to use it the way it was intended, in other words the way I
> >> use it. :) 
> 
> If we make cback hook pre/post actions, you open it to an infinite number 
> of possibilities you can't even imagine :)

I will start giving this some thought.

KEN

--
Kenneth J. Pronovici <pronovic@ieee.org>
http://www.cedar-solutions.com/


