Author Topic: The Vault Preservation Project  (Read 6067 times)

Legacy_Mecheon

  • Hero Member
  • *****
  • Posts: 664
  • Karma: +0/-0
The Vault Preservation Project
« Reply #60 on: October 02, 2012, 09:20:45 am »


               Just going to say, Bannor, I was once a moderator on a Warcraft website that exceeded its allocation by about 5 TB.

I think we only had 500 megs, and we'd filled nearly every server this place had. ISPs can miss a lot.
               
               

               
            

Legacy_Pstemarie

  • Hero Member
  • *****
  • Posts: 4368
  • Karma: +0/-0
The Vault Preservation Project
« Reply #61 on: October 02, 2012, 11:38:40 am »


               I feel your pain, Bannor - I pay $75 per month for cable through Charter for a 30 Mb line with unlimited bandwidth. On top of that I have DirecTV (since their cable TV package goes up every 3 months and keeps dropping channels).
               
               

               
            

Legacy_acomputerdood

  • Sr. Member
  • ****
  • Posts: 378
  • Karma: +0/-0
The Vault Preservation Project
« Reply #62 on: October 02, 2012, 11:40:19 am »


               my next attempt:



#!/usr/bin/perl


$OUTPUT = "projects";
$project = "";


for($id=1; $id < 5; $id++){
#       $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=3800";
#       $url = "http://nwvault.ign.com/View.php?view=hakpaks.detail\\&id=7849";
        $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=" . $id;

        $page = `curl -s $url`;

        @lines = split /\n/, $page;

        $files = 0;
        $comments = 0;
        foreach $l (@lines){
                if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
                        $project = $1;
                        $project =~ s/\// /g;
                        $project =~ s/&/and/g;
                        $project =~ s/--/-/g;
                        $project =~ s/\(//g;
                        $project =~ s/\)//g;
                        $project =~ s/ /_/g;
                        print "\nprocessing $project -> $OUTPUT/$project\n";
                        `mkdir $OUTPUT/$project`;
                }
                if($l =~ /<a name="Files"><\/a>Files/){
                        $files = 1;
                }
                if($l =~ /<\/TABLE>/){
                        $files = 0;
                }


                if($l =~ /<a href="(fms\/Download\.php.*?)".*?>(.*?)<span>/){
                        print "downloading: $2\n";
                        `wget -O $OUTPUT/$project/$2 http://nwvault.ign.com/$1`;
                }
                if($comments == 0){
                        if($l =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
                                $comments = 1;
                                &get_next_page($url, 2);
                        }
                }
                next if !$files;

        }

        open(FILE, ">$OUTPUT/$project/index.html");
        print FILE $page;
        close FILE;
}


sub get_next_page{
        $u = shift;
        $num = shift;
        print "fetching comments page: $num\n";

        $u2 = $u . "\\&comment_page=$num";

        $p = `curl -s $u2`;
        open(FILE, ">$OUTPUT/$project/index$num.html");
        print FILE $p;
        close FILE;


        @lines2 = split /\n/, $p;
        foreach $l2 (@lines2){
                if($l2 =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
                        &get_next_page($u, $num + 1);
                }
        }
}

it will iterate through all of the entries in the for loop, creating a new project directory for each page, downloading each file, and grabbing each comment page.  it doesn't do screenshots - do we care about that?

also, i've not been able to find a page with an external linked source to test against, but it *should* try to download it from the vault and fail.

i've tested it for entries 1-10 and it works great.  i'll try to capture all of the scripts directory next, but i don't know how much space i'll be using.  anybody want to volunteer testing it?
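
if anyone testing it wants to report how much room a run takes, something like this (an untested sketch, assuming the same "projects" output directory) would tally it up:


#!/usr/bin/perl
# untested sketch: add up the size of everything downloaded so far
use File::Find;

$total = 0;
find(sub { $total += -s $_ if -f $_ }, "projects");
printf "projects/ is using %.1f MB\n", $total / (1024 * 1024);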


PS Tarot Redhand:
i changed my url to process the link you posted for the textures page.  it seems the vault pages are standardized enough that it works beautifully against it.

i did notice, however, that i'm not trying to grab anything linked in from the "description" section.  i think that's fine because those links are either to external files or to other vault pages.
               
               

               
            

Legacy_acomputerdood

  • Sr. Member
  • ****
  • Posts: 378
  • Karma: +0/-0
The Vault Preservation Project
« Reply #63 on: October 02, 2012, 12:05:03 pm »


               just an update - this script will grab the screenshots and thumbs:


#!/usr/bin/perl


$OUTPUT = "projects";
$project = "";


for($id=147; $id < 148; $id++){
#   $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=3800";
#   $url = "http://nwvault.ign.com/View.php?view=hakpaks.detail\\&id=7849";
#   $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=" . $id;
   $url = "http://nwvault.ign.com/View.php?view=Textures.Detail\\&id=" . $id;

   $page = `curl -s $url`;

   @lines = split /\n/, $page;

   $files = 0;
   $comments = 0;
   $images = 0;
   foreach $l (@lines){
      if($images == 1){
         if($l =~ /<a href/){
            &grab_screenshots($l);
         }
      }

      if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
         $project = $1;
         $project =~ s/\// /g;
         $project =~ s/&/and/g;
         $project =~ s/--/-/g;
         $project =~ s/\(//g;
         $project =~ s/\)//g;
         $project =~ s/ /_/g;
         print "\nprocessing $project -> $OUTPUT/$project\n";
         `mkdir -p $OUTPUT/$project`;
      }
      if($l =~ /<a name="Files"><\/a>Files/){
         $files = 1;
      }
      if($l =~ /<\/TABLE>/){
         $files = 0;
      }


      if($l =~ /<a href="(fms\/Download\.php.*?)".*?>(.*?)<span>/){
         print "downloading: $2\n";
         `wget -O $OUTPUT/$project/$2 http://nwvault.ign.com/$1`;
      }
      if($comments == 0){
         if($l =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
            $comments = 1;
            &get_next_page($url, 2);
         }
      }
      if($l =~ /-START OF IMAGE CODE-/){
         $images = 1;
      }
   }

   open(FILE, ">$OUTPUT/$project/index.html");
   print FILE $page;
   close FILE;
}


sub get_next_page{
   $u = shift;
   $num = shift;
   print "fetching comments page: $num\n";

   $u2 = $u . "\\&comment_page=$num";

   $p = `curl -s $u2`;
   open(FILE, ">$OUTPUT/$project/index$num.html");
   print FILE $p;
   close FILE;


   @lines2 = split /\n/, $p;
   foreach $l2 (@lines2){
      if($l2 =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
         &get_next_page($u, $num + 1);
      }
   }
}

sub grab_screenshots{
   $images = 0;
   $imgline = shift;

   @imgchunks = split /<p>/, $imgline;

   foreach $ic (@imgchunks){
      if($ic =~ /<a href="(fms\/Image.php\?id=(.*?))"/){
         `wget -O $OUTPUT/$project/$2.jpg http://nwvault.ign.com/$1`;
      }
      if($ic =~ /src="(http:\/\/vnmedia.ign.com\/nwvault.ign.com\/fms\/images\/.*?\/.*?\/(.*?))"/){
         `wget -O $OUTPUT/$project/$2 $1`;
      }
   }
}

               
               

               
            

Legacy_werelynx

  • Hero Member
  • *****
  • Posts: 1110
  • Karma: +0/-0
The Vault Preservation Project
« Reply #64 on: October 02, 2012, 03:26:27 pm »


               @acomputerdood: "not trying to grab anything linked in from the "description" section"
Sometimes in the modules section, the hakpaks are linked in the description... and sometimes those hakpaks are module-specific. Of course that's fine as long as you are going to grab all the hakpaks as well, but the modules still need to be linked to those haks.

Good luck with this project.
I hope I'll be able to join soon, but I probably won't have time until February.
               
               

               
            

Legacy_acomputerdood

  • Sr. Member
  • ****
  • Posts: 378
  • Karma: +0/-0
The Vault Preservation Project
« Reply #65 on: October 02, 2012, 04:37:16 pm »


               well, the way i see it, the things linked in the description section fall into one of three categories:

1) they're already included in the downloads section below, which i'm grabbing
2) they link to another vault page that somebody else is in charge of archiving
3) they're hosted on an external site, which we're not supposed to download, only preserve the link to

i believe my script handles all 3 cases.
               
               

               
            

Legacy_werelynx

  • Hero Member
  • *****
  • Posts: 1110
  • Karma: +0/-0
The Vault Preservation Project
« Reply #66 on: October 02, 2012, 05:49:03 pm »


               What I meant for 2) is that the old link will direct you to the nwvault(..) address when it should direct you to its nwn-ccc(..) equivalent. If the Vault goes down, it would be a dead link.

What you could do is make your code flag, in big red letters, that there is a link that needs to be changed (visible when checking the uploaded page), so it can be fixed manually once all the content is already on nwn-ccc.
               
               

               
            

Legacy_acomputerdood

  • Sr. Member
  • ****
  • Posts: 378
  • Karma: +0/-0
The Vault Preservation Project
« Reply #67 on: October 02, 2012, 05:56:32 pm »


               that's easy enough to fix in post-processing.  once everything is grabbed, i assume somebody will take up the effort of reformatting everything into the new pages and layouts.  until that happens, there's no reason to try and correct links now.

it will be just as easy to identify incorrect links at that time as it is now.
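
for whoever ends up doing that pass, here's a rough, untested sketch of what i have in mind - walk the saved index*.html files and print any links that still point back at a vault View.php page, so they can be rewritten once everything lives on the new site:


#!/usr/bin/perl
# rough sketch (untested): list links in the saved pages that still point
# at the old vault, so they can be fixed up manually later
use File::Find;

find(sub {
        return unless /^index\d*\.html$/;
        open(my $fh, "<", $_) or return;
        while(my $line = <$fh>){
                while($line =~ /href="((?:http:\/\/nwvault\.ign\.com)?\/View\.php[^"]*)"/gi){
                        print "$File::Find::name : $1\n";
                }
        }
        close $fh;
}, "projects");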


oh, and incidentally, my first run just finished.  it took 348 minutes to download the 3865 entries in the "scripts" section.
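
(that works out to roughly 5.4 seconds per entry - 348 x 60 / 3865 - so other sections should take roughly their entry count times 5-6 seconds, give or take file sizes and bandwidth.)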

rolo, can you run a perl script on the server you're archiving on?
               
               

               


                      Modified by acomputerdood, 02 October 2012 - 05:11.
                     
                  


            

Legacy_Rolo Kipp

  • Hero Member
  • *****
  • Posts: 4349
  • Karma: +0/-0
The Vault Preservation Project
« Reply #68 on: October 02, 2012, 07:56:23 pm »


               <doing the whole...>

               @ Bannor: I agree, now that we have some legitimacy from Maximus. Concentrating on the Last In (Newest) projects first makes a lot of sense to me. Note: the VPP has an upload limit of 100 MB. Larger files will need to be sent to me somehow (Dropbox, SkyDrive, Google Drive, YouSendIt, etc.) for FTP, or dropped into the Vault's "over 25mb" FTP site. I can access them from there directly (just not after they've been moved to the Vault's permanent storage :-P ).

@ Tarot: My take is to bundle up the Realms stuff into one 7z or rar :-P Personally, when I do the haks (soon, real soon) I'll be re-archiving anything that isn't either rar, zip (:-P ) or 7z. I'm only including zip because it's ubiquitous ;-/

@ Virusman: Save everything! Newest stuff first, though... But eventually, I hope to update (like neverwinter connections did) the site and incorporate the full Vault functionality.

@ ACD: I can run perl (though I'm currently perl ignorant). Email me ( rolo@amethysttapestry.com )? Let's talk. Or call me after 7pm PST if you have a cell phone and are in the US (pm me for number?)

My current thought for automating stuff is to get a CSV or Db of the metadata, create all the projects (flagged "pending upload") using Drupal's migration tools, and then present a filterable list of projects still needing files/screenies/comments... Make any sense?

In that context, preserving the data is still the first priority, while I structure the site.

@ Oda: I would love to take this opportunity to improve the Vault. But then, I really only know how to do tricky stuff in Drupal (and 3DS Max, but that's a different thread ;-). I am negotiating <pleading> for Db/files access.

@ Werelynx: The Original Page link is specifically there to point to the NwVault page the project was salvaged from. My reasoning is two-fold: first, to be sure proper credit is given to the original author, and second, to provide a quick compare link to find/fix mistakes. If the NwVault does go down, that field may be disabled. I.e. it's for construction purposes, at the moment.

<...one-armed paper-hanger bit>
               
               

               


                      Modified by Rolo Kipp, 02 October 2012 - 07:01.
                     
                  


            

Legacy_Rolo Kipp

  • Hero Member
  • *****
  • Posts: 4349
  • Karma: +0/-0
The Vault Preservation Project
« Reply #69 on: October 02, 2012, 10:58:52 pm »


               <going...>

Just an update in the middle of other things: Maximus has fixed my admin tools and I've approved most of the backlog on the Vault.

<...approval-happy>
               
               

               
            

Legacy_Bard Simpson

  • Sr. Member
  • ****
  • Posts: 276
  • Karma: +0/-0
The Vault Preservation Project
« Reply #70 on: October 02, 2012, 11:10:08 pm »


               You're the man, Rolo! Oh, wait; should that be you're the wizard? Eh, either way, thank you for starting this project and thank you for all your hard work on the Vault itself.
               
               

               
            

Legacy_Lovelamb

  • Jr. Member
  • **
  • Posts: 68
  • Karma: +0/-0
The Vault Preservation Project
« Reply #71 on: October 02, 2012, 11:15:42 pm »


               Sir, did you say the Vault is now read-only? I've devoted over a year of my recent life to working on an evil module that I doubt the Nexus, with their strict rules, would accept... (Should I kill myself for being late?)

I would like to help with backing up the Vault, though I might need an explanation of how to upload the content to your site. I can save all the web pages and related files for now. You can sign me up for the first 10 pages (or 250 modules) on the module list. I'm not sure how the modules are ordered - hope everyone sees the same list. I have the disk space, but my upload speed isn't very high.
               
               

               


                      Modified by Lovelamb, 02 October 2012 - 10:26.
                     
                  


            

Legacy_Vibrant Penumbra

  • Full Member
  • ***
  • Posts: 238
  • Karma: +0/-0
The Vault Preservation Project
« Reply #72 on: October 02, 2012, 11:26:46 pm »


                Hmmm, like the look, lambchops =]

Yeah, the Kipper said it was read-only... for a while

Maxy-dear fixed the old man's wagon and now it's working again. For another little while. :unsure:

Ack! Sunshine!

Toodles!
               
               

               
            

Legacy_meaglyn

  • Hero Member
  • *****
  • Posts: 1451
  • Karma: +0/-0
The Vault Preservation Project
« Reply #73 on: October 03, 2012, 03:16:58 am »


               ACD - drat, you beat me to it

I've just about completed a set of scripts that do almost the same thing yours does. The major difference is the creation of a key/value metadata file along with the downloads. The idea there was to make it easy to get that data into a new DB, but that could also be done with tools on the saved raw html once it's all downloaded.
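
Roughly, the idea is something like this (a trimmed-down, untested sketch, not the actual scripts) - pull a field or two out of a saved page and write them next to the downloads as key=value lines. Only the title regex is known to match the real pages, since it's the same one ACD uses; the other fields would need their own patterns:


#!/usr/bin/perl
# trimmed-down sketch: extract metadata from a saved vault page into key=value lines

$dir = shift || die "usage: $0 projects/Some_Project\n";
open(IN, "<$dir/index.html") or die "no index.html in $dir\n";
while($l = <IN>){
        if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
                $meta{title} = $1;
        }
        # author, category, updated date, etc. would go here with similar matches
}
close IN;

open(OUT, ">$dir/metadata.txt") or die $!;
print OUT "$_=$meta{$_}\n" for sort keys %meta;
close OUT;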

Cheers,
Meaglyn
               
               

               
            

Legacy_Rolo Kipp

  • Hero Member
  • *****
  • Posts: 4349
  • Karma: +0/-0
The Vault Preservation Project
« Reply #74 on: October 03, 2012, 05:56:25 pm »


               <reaching out...>

               @ meaglyn: But that is what I want! :-P The key/value metadata, that is... preferably in CSV or Excel format.

Would you be willing to share that with ACD so it can be incorporated? He's sent me one updated version - why not another =)

Getting the metadata into an easily imported format would make things vastly easier. I'd then use that CSV file to generate the projects. Then all I need to do is link up the files/screenies and comments.

Actually, comments could be collected in a keyed file, also. Drupal gives each comment its own node and links the nodes to the project. So I'd just need a field for each comment with the unique identifier for the project... I think... :-P
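
Something like this is what I'm picturing for the two files - the column names are just guesses on my part, and the values below are only placeholders; whatever the scrapers can actually pull out is fine:


# projects.csv (one row per project)
project_id,title,author,category,original_url
3800,Some_Script_Project,SomeAuthor,Scripts,http://nwvault.ign.com/View.php?view=Scripts.Detail&id=3800

# comments.csv (one row per comment, keyed back to the project)
project_id,comment_author,comment_date,comment_text
3800,SomeCommenter,2008-05-01,"example comment text..."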

<...with both hands>