Kephisto wrote...
That’s great news, pkpeachykeen.
Your process is more advanced than mine. I’m up for cooperating on this, but unfortunately the way I’m doing it is really basic. I’m just downloading the entire forum directory, so one moment it’s gathering topics from the Scripting forums, the next it’s backing up the Server Admin topics, then it switches to CEP, then to General Discussion, etc. I can’t really control it.
Same here, I'm using WinHTTrack to spider and mirror it. So far it's going well, if slowly. I haven't gotten many more pages done; it keeps hanging on random ones. I have around 3 gigs of data down now, not sure how many topics that is.
At the moment I’m at 33,514 downloaded pages and the counter estimates 125,733 to go. I’ve noticed many topics have navigational phrases and instructions in another language, so it seems I’m backing up topics multiple times, once in English and again in other languages. So just how big this project turns out to be is anyone’s guess.
Here’s hoping we finish before the deadline.
I noticed the same thing. Luckily, the classes/HTML code is the same.
I've been playing with this for most of today, and I put together a basic C# application that links an HTML and an XML parser. So far I've been able to parse the pages I'm downloading with almost a 98% success rate. I've been testing with a small set (137 random pages) and have parsed all of them successfully so far.
It's not terribly smart, but it should be able to coalesce topics into a single XML file (it uses a CRC32 checksum of the topic's title to generate the filename, so pages with the same topic title should generate the same filename).
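The filename logic boils down to something like the following (a simplified sketch, not the exact code; the "topic_" prefix and the bare-bones string handling are just illustrative, and the CRC-32 is hand-rolled since there isn't one in the BCL):

using System.Text;

static class TopicFile
{
    // Standard CRC-32 (polynomial 0xEDB88320) lookup table, built once.
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = ((c & 1) != 0) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
            table[i] = c;
        }
        return table;
    }

    static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    // Pages that share a topic title hash to the same name, so every
    // page of a thread gets coalesced into one XML file.
    public static string NameFor(string topicTitle)
    {
        uint crc = Crc32(Encoding.UTF8.GetBytes(topicTitle.Trim()));
        return "topic_" + crc + ".xml";
    }
}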
It takes the page, pulls the title, then goes through the body and copies out each post, including author, body text and signature. All are put in a basic XML file using the following schema:
thread title
me
post text!
This is a biomirror page
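(The board seems to have stripped the angle brackets out of that example; roughly reconstructed, with the element names only approximate, it looks like this:)

<?xml version="1.0" encoding="utf-8"?>
<topic>
  <title>thread title</title>
  <post>
    <author>me</author>
    <body>post text!</body>
    <signature/>
  </post>
  <!-- one <post> element per post on the page -->
  <!-- This is a biomirror page -->
</topic>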
I then whipped up an XSL stylesheet and linked them. It's currently able to parse Bioboards pages (as saved from any browser or most spiders) into something like this:
http://cx029a.dnsdoj..._1118418858.xml
http://cx029a.dnsdoj...c_874755937.xml
http://cx029a.dnsdoj..._2789098980.xml
It's not pretty or smart and needs some work, but it is relatively fast (it can go through a hundred pages in a few seconds) and spits out results like those. If you take a look at the source, it's really simple to read, and you could parse it with PHP, Java, or C# and stuff it into any kind of database in a few minutes.
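For reference, "linked them" just means the standard xml-stylesheet processing instruction at the top of each generated file, roughly like this (the stylesheet filename here is only a placeholder):

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="biomirror.xsl"?>
<!-- topic and post elements follow, as in the schema above -->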