Thread: Identifying file when filename is not enough

  1. #1

    Hi,

    I'm writing an app that associates metadata with files in a directory hierarchy. Not all relevant file types allow embedding of metadata, so it is kept in a separate database.
    The user might have modified the files while my program was not running, so it can
    scan the directory and compare the results to those of a previous scan.

    Let's say it turns out that two files were missing and two new files have appeared.
    This can be because
    1. Two files were deleted and two other files created, OR
    2. Two files were renamed

    In the first case, the metadata associated with the missing files should also be removed.
    In the second case, the metadata should be transferred over to the new files.

    So how can one tell which of 1) and 2) has happened?
    I have considered matching them by remembering the file sizes, but size is not unique enough to identify a single file among many.
    Another option is to compute a hash of each file's contents (to see if any of the "new" files has the same hash as a "missing" one), but that would make importing large numbers of files into the directory very slow.
    The best solution I have so far is to identify the files by inode number, as I've noticed that it does not change when a file is moved or renamed within the filesystem.
    However, QFileInfo does not have a method to return the inode number, making the solution Linux-specific. Is there a more portable way?
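
    The inode idea above can be sketched roughly as follows. This is a minimal illustration in Python rather than Qt/C++, and the names `snapshot` and `classify` are made up for the example. It assumes the scan records the `(st_dev, st_ino)` pair of every regular file; Python documents `st_ino` as typically the inode number on Unix and the file index on Windows, so the same code may run on both, though how stable the value is depends on the filesystem. One known weakness: an inode number can be reused after a file is deleted, so a brand-new file could be mistaken for a rename.

```python
import os
import stat
from typing import Dict, List, Tuple

FileId = Tuple[int, int]  # (device number, inode number)


def snapshot(root: str) -> Dict[FileId, str]:
    """Map the (st_dev, st_ino) pair of each regular file under root to its relative path."""
    seen: Dict[FileId, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # do not follow symbolic links
            if stat.S_ISREG(st.st_mode):
                # Hard links share one identity; the last path visited wins here.
                seen[(st.st_dev, st.st_ino)] = os.path.relpath(full, root)
    return seen


def classify(old: Dict[FileId, str], new: Dict[FileId, str]):
    """Compare two snapshots and return (renamed, deleted, created)."""
    # Same identity under a different path: treat as a rename or move.
    renamed = {old[i]: new[i] for i in old.keys() & new.keys() if old[i] != new[i]}
    # Identity vanished: the file was deleted (or moved to another filesystem).
    deleted: List[str] = sorted(old[i] for i in old.keys() - new.keys())
    # Identity not seen before: a new file.
    created: List[str] = sorted(new[i] for i in new.keys() - old.keys())
    return renamed, deleted, created
```

    With this, metadata would follow the entries in `renamed` and be dropped for the entries in `deleted`.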

  2. #2

    This is indeed very tricky, especially since you'd probably want to handle hard and soft links as well.
    I see another problem: what happens if a file is renamed or deleted and another file is created with the same name? How should your system behave then? To notice that, it would have to scan all the files all the time. Hashes won't help you here either, as two files might have the same contents, which would give the same hash value, and for large file sets you could encounter hash collisions.
    The inode approach is very tricky too: if you move a file, it might be transferred to another filesystem, which changes its inode. There is also the problem with links, as links can point to the same inode. And different filesystems have different inode structures, and some don't have inodes at all (FAT?).
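
    To make the link problem concrete, here is a small Python sketch (the helper name `describe` is made up for the example). It assumes a Unix-like system where `os.link` and `os.symlink` are available: a hard link reports the same `(st_dev, st_ino)` pair as the original and raises the link count, while a symbolic link is a separate object that merely points at a path.

```python
import os
import stat


def describe(path: str) -> dict:
    """Return the identity, hard-link count and symlink flag of path itself."""
    st = os.lstat(path)  # lstat inspects the link, not the file it points to
    return {
        "id": (st.st_dev, st.st_ino),
        "links": st.st_nlink,
        "is_symlink": stat.S_ISLNK(st.st_mode),
    }
```

    A scanner that keys metadata on the identity alone would therefore see two hard-linked paths as one file, which may or may not be what the user expects.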

    The first thing I would do is think about ways of maintaining/tracing the information you want, without going into implementation details (so no inodes, no Qt, no C++). For example, tracing file contents is not sufficient because two distinct files can have the same content. Tracing file path and file contents is better, but files can still be moved, so the path changes. Another idea is to trace last modification times: if you keep a database of those, you can try to guess whether something happened to a file while your system was not running. Try this first, and once you come up with a theoretical solution, dive into implementation details.
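
    As one possible reading of the "database of modification times" idea, the following Python sketch stores a size and a modification time per path and reports which known paths look different on the next scan. The names `Record`, `scan` and `changed_paths` are invented for the example, and storing the size next to the timestamp is an added assumption, not something proposed in the thread.

```python
import os
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    size: int      # file size in bytes
    mtime_ns: int  # last modification time in nanoseconds


def scan(root: str) -> Dict[str, Record]:
    """Record size and modification time for every file under root, keyed by relative path."""
    records: Dict[str, Record] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            records[os.path.relpath(full, root)] = Record(st.st_size, st.st_mtime_ns)
    return records


def changed_paths(old: Dict[str, Record], new: Dict[str, Record]) -> List[str]:
    """Paths present in both scans whose size or modification time differs."""
    return sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
```

    This only says that something happened to a path; it cannot tell an edit from a delete-and-recreate under the same name.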

  3. The following user says thank you to wysota for this useful post:

    drhex (9th November 2006)

  4. #3

    What are you using the database for?
    Is it for identifying files based on meta-data, or do you want to search for files based on some meta-data?

    If it is the former, I can't suggest anything for now, but for the latter I can suggest the following.

    1. Get the input meta-data from the user.
    2. Search in the database
    3. Get the filename from database associated with the meta-data.
    4. Regenerate meta-data for that file.
    5. If the meta-data is the same in both cases, then the file must be the same.
    6. If the meta-data is not the same, the file must have been modified, deleted, etc.
       In this case regenerate your database, as it needs updating, and start again from Step 2.

    I guess this algo will help to
    - identify if files have been modified/deleted/renamed
    - update database only when required
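
    The steps above could be sketched like this in Python. Everything here is hypothetical: `find_file`, the `db` layout (search key mapped to a path and a stored value), and the `fingerprint` and `rescan` callables are placeholders, and the retry limit is an added assumption so that the "start from Step 2" loop cannot run forever. It only works for values a program can recompute from the file.

```python
from typing import Callable, Dict, Hashable, Optional, Tuple


def find_file(
    db: Dict[str, Tuple[str, Hashable]],
    key: str,
    fingerprint: Callable[[str], Hashable],
    rescan: Callable[[Dict[str, Tuple[str, Hashable]]], None],
    max_rounds: int = 2,
) -> Optional[str]:
    """Look up key, verify the stored value against the file, rescan once if stale."""
    for _ in range(max_rounds):
        entry = db.get(key)               # steps 2-3: search, get the file name
        if entry is None:
            return None
        path, stored = entry
        try:
            current = fingerprint(path)   # step 4: regenerate the value
        except OSError:
            current = None                # file is missing or unreadable
        if current is not None and current == stored:
            return path                   # step 5: same value, same file
        rescan(db)                        # step 6: database is stale, rebuild it
    return None
```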

  5. The following user says thank you to aamer4yu for this useful post:

    drhex (9th November 2006)

  6. #4

    I would hash each file and then simply compare hashes. To speed things up you might want to use a fast hash function and perform the hashing in a different thread. Perhaps even have a queue of files to be hashed running in the background. If I'm not mistaken, Tiger Tree Hash (TTH) is both fast (hundreds of MB/s, depending on your hardware of course) and secure: http://en.wikipedia.org/wiki/Hash_tree#Tiger_tree_hash
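
    A rough Python sketch of such a background hashing queue follows. It uses SHA-256 from the standard library in place of Tiger Tree Hash, simply because it is readily available; the class name `BackgroundHasher` and its methods are invented for the example, and a real application would probably want progress reporting and cancellation on top.

```python
import hashlib
import queue
import threading
from typing import Dict, Optional


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


class BackgroundHasher:
    """One worker thread that hashes queued paths and collects the digests."""

    def __init__(self) -> None:
        self._jobs: "queue.Queue[Optional[str]]" = queue.Queue()
        self._lock = threading.Lock()
        self.results: Dict[str, Optional[str]] = {}
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, path: str) -> None:
        self._jobs.put(path)

    def _run(self) -> None:
        while True:
            path = self._jobs.get()
            try:
                if path is None:  # sentinel: stop the worker
                    return
                try:
                    digest: Optional[str] = hash_file(path)
                except OSError:
                    digest = None  # unreadable or vanished file
                with self._lock:
                    self.results[path] = digest
            finally:
                self._jobs.task_done()

    def wait(self) -> None:
        """Block until every submitted path has been processed."""
        self._jobs.join()

    def close(self) -> None:
        self._jobs.put(None)
        self._worker.join()
```

    The caller submits paths as the scan discovers them and keeps working; `wait()` is only needed at the point where the digests are actually compared.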

  7. The following user says thank you to TheRonin for this useful post:

    drhex (9th November 2006)

  8. #5

    The metadata is to be used for finding files based on metadata.

    > 4. Regenerate meta-data for that file.
    It would surely be wonderful if metadata could be generated automatically by a program. Alas, the data consists of things like "name of person in this picture".

    > Tracing file path and file contents is better
    Yes, the primary link between metadata and contents is the full file path.
    Identifying files that have been renamed or moved is an extra bonus for the user;
    if they e.g. swap two files so that the names remain the same, that is assumed to be intentional.

    > Another idea is to trace last modification times
    I've considered that as well. It seems less precise than inodes: the resolution is rarely better than a second, and systems can process many files per second. Also, a simple edit to a file (where no metadata needs to be changed) changes the last modification time but not the inode. To keep the modification-time database up to date, the directories would have to be scanned automatically on each program start.
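
    Both points can be illustrated with a short Python sketch (the helper names `coarse` and `identity_and_mtime` are made up for the example). The first function shows how two timestamps within the same second collapse to the same value at one-second resolution; the second returns what an in-place edit leaves alone (the identity) and what it changes (the modification time). Note that this holds for edits that write into the existing file; a program that saves by writing a new file and renaming it over the old one would give the path a new inode.

```python
import os
from typing import Tuple


def coarse(mtime_ns: int, resolution_ns: int = 1_000_000_000) -> int:
    """Truncate a nanosecond timestamp to the given resolution (default: one second)."""
    return mtime_ns - (mtime_ns % resolution_ns)


def identity_and_mtime(path: str) -> Tuple[Tuple[int, int], int]:
    """Return ((st_dev, st_ino), st_mtime_ns) for path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino), st.st_mtime_ns
```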

    > a queue of files to be hashed running in the background.
    Now that's clever! Might work for that auto-scan-on-startup as well. I want a dual-core CPU.
