Portable All-in-One GeoCities Web Server

Firstly, the database works. It turns out PostgreSQL (psql) is quite different from what I have become used to with MariaDB. That’s alright, I got there. Although… I did notice I may not have installed the dependencies required by the Perl scripts. Oh boy, yet another language I don’t speak at all. That raises the question: have any of the past scripts been failing silently? To test the code I was running the scripts line by line:

$GEO_SCRIPTS/ingest-doubles.pl $GEO_LOGS/dir-index.txt

The above line was giving me trouble in the image at the top of this post. Imagine my horror when I realised the $GEO_LOGS section is missing from the official 009 script! So that might help us move forward again. I must have entered it while testing to see if I could get the script to progress past the errors.

psql -d $GEO_DB_DB -f $GEO_SCRIPTS/sql/create/doubles.sql

This line in the 009 script probably prompted me to add it. $GEO_LOGS being referenced by psql but not by Perl? That can’t be right, thinks me at 3am, so I figured Perl would want the same variable passed to it. Turns out, it doesn’t! :joy: I’ll check tomorrow but am pretty confident that is the reason for that block.

Now, dependencies are probably going to be a great help too. I’ve been reading up on the two Perl versions referenced in the scripts: 5.12.0 and 5.14.0. Perl 5.12.0 seems to be used only in script 004-normalize-tracking.pl. All other scripts use 5.14.0, which is good as it includes a lot more “core” modules, i.e. modules included inside Perl itself.
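If you want to double-check which modules ship with which Perl, the corelist tool (part of Module::CoreList, bundled with recent Perls) is handy. A quick sketch of the sort of checks I mean:

corelist File::Find

corelist DBI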

Now, I know I installed a few Perl modules using CPAN; apparently this is a no-no unless you want trouble. Where possible, it is recommended to stick with the packages made available for the Ubuntu release being run, in my case 12.04. Some of the “core” modules can still be updated to add new features while remaining able to run on older versions of Perl. It’s messy. Here’s a list of the core modules present in both 5.12 and 5.14:

  • warnings
  • diagnostics
  • strict
  • File::Find
  • File::stat
  • utf8

Pretty neat. These will work regardless of what I do - as long as I have Perl installed, they will run. The following modules are not present in either 5.12 or 5.14:

  • DBI
  • Data::Dumper
  • IO::All
  • Try::Tiny
  • XML::TreePP
  • DBD::Pg

This command should do the trick at the Ubuntu command line for the required modules above:

sudo apt-get install -y libdbi-perl data-dumper libio-all-perl libtry-tiny-perl libxml-treepp-perl libdbd-pg-perl
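To confirm a module actually loads once installed (and to catch any that have been silently missing all along), a one-liner per module does the job; DBI and Try::Tiny here are just examples:

perl -MDBI -e 'print "DBI $DBI::VERSION\n"'

perl -MTry::Tiny -e 'print "Try::Tiny loads fine\n"'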

The following modules are only built into the later 5.14 and are not present in 5.12. They are not used in the 004 script, so we don’t need to worry about them, but here they are anyway for reference:

  • threads
  • threads::shared
  • Encode
  • Cwd qw(abs_path)
  • YAML qw(LoadFile)
  • Digest
  • File::Path qw(make_path)

The next few commands may or may not be Ubuntu 12.04 compatible. Actually all the ones above may not be 12.04 compatible either! Guess I’ll find out tomorrow. These need to be run, if they haven’t already:

sudo apt-get install build-essential p7zip-full convmv postgresql libpq-dev libglib2.0-bin

sudo apt-get install convmv findutils

And hopefully the database will start to fill up with more data than just the username and database name tomorrow. Fingers crossed! :guitar:

3 Likes

Words fail me. It’s actually running. Previously we’d get stuck on the highlighted text in ingest-doubles.pl at printing the asterisk. Now you can see the asterisk happily being printed in the terminal window at the back of all the windows. It’s committing records to the database.

The “line 69” error that was being encountered is gone. To fix it, I blew away the entire Ubuntu 12.04 install and started again. Previously, I had installed some Perl modules not through Ubuntu’s package manager (apt-get) but through CPAN - a big mistake. This time I followed my dependency install steps from above.

Sure, the window server seems to crash if I monitor the stats via the System Monitor GUI. That might be caused by the lack of proprietary NVIDIA drivers, which I haven’t quite got around to installing yet. I also had some issues setting up psql (again), with my notes not quite working, but I eventually pushed on and worked out the configuration. It was a little different as I didn’t name the Ubuntu user “despens”, but “ubuntu” instead.

To avoid crashing the system, we now use top instead. Postgres is going crazy on one of the CPU cores. Lucky for me, single-core processing is where this machine excels. When I left it, we were stepping through script 009 to check for any errors. So far so good. This is the best condition I’ve seen the script in to date. If all goes well I’ll have data in the terminal window in the background.

To shake things up a little, I added a database named turtles instead of geocities, and everything just started working. Are the turtles watching over me, keeping this database safe? It’s the only thing that makes sense. :turtle:

I’ve also been using pgAdmin3 (via GUI) in cahoots with psql (via CLI) to get an idea of what I am doing when I drop a database, create a table, or forget to give the user any permissions. It has been extremely helpful in visualising the database structure. To install it, just run the following:

sudo apt-get install pgadmin3

The final step in the 009 script is to run a ~18-24 hour script - a lot to ask for a database that wouldn’t even connect yesterday. After this there is one other large processing step, step 012 - the ingest.

Here are the environment variables I was using in ~/.bashrc.
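(The exact values were in the screenshot; purely for illustration, the shape of it was something like the below. The paths are placeholders rather than my real ones, and turtles is the database name mentioned earlier.)

export GEO_SCRIPTS="$HOME/geocities/scripts"   # placeholder path
export GEO_LOGS="$HOME/geocities/logs"         # placeholder path
export GEO_DB_DB="turtles"                     # the database the scripts talk to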

It’s important to log in as postgres to edit things, via sudo su - postgres. You’ll note the prompt change from ubuntu@ubuntu-MacPro:~$ to postgres@ubuntu-MacPro:~$. This lets psql connect as postgres, the PostgreSQL superuser. We then create a new role for the despens user and verify it has been created with the same permissions as the superuser postgres.
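Roughly what that looks like at the prompt (a sketch; giving despens SUPERUSER is simply the laziest way to mirror the postgres permissions, and \du is the verification step):

sudo su - postgres
psql -c "CREATE ROLE despens WITH LOGIN SUPERUSER;"
psql -c "\du"
exit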

Getting the scripts to connect to the database as the despens user needed additional work on two files, followed by a service restart. The first is /etc/postgresql/9.1/main/pg_hba.conf, edited as above. This lets everything connect, however it wants. That’s fine; this isn’t going to be a production machine with a database open to the internet and subject to hackers.
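For reference, the permissive end state in pg_hba.conf looks something like this (a sketch, not my literal file - fine for a LAN-only test box, terrible for anything public):

# TYPE  DATABASE  USER  ADDRESS       METHOD
local   all       all                 trust
host    all       all   0.0.0.0/0     trust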

The second file, /etc/postgresql/9.1/main/postgresql.conf, needs to be edited to allow listening on all addresses (the listen_addresses setting). Once done, a simple…

sudo service postgresql restart

And we’re creating tables with ease. The table “doubles” didn’t exist but now it does and pgAdmin3 confirms it!

You can also see database creation through psql. It’s amazing to see it all come together. Of course, once this is up and running it can be improved upon further - as all things can. I still consider this to be at a testing stage as I am still learning how the systems in these scripts run.

After RetroChallenge 2022/10 is completed, I would like to start again from step 001 to make sure I capture as much as possible of the extra data added to the GeoCities archive after the core torrent was released. It would be great to modernise this to a newer Ubuntu release, such as 22.04, but I need to make sure I have the steps as correct as possible in 12.04 first. Regardless of what is done in the future, we’re moving forward now and that’s where we need to be! :partying_face:

2 Likes

This failed. The tail should have given me a list of duplicates. The good news is…

I ran it again and it yielded a 72.6MB text file! It looks like the window server crashing interfered with whatever was processing and stopped it from running the way it should. I’m only monitoring system stats via top for the time being, so we’re only drawing the desktop, terminal windows and a text editor. Maybe I really should install that driver.

Looking inside the sorted text file hits us with the above. As you can see, case is all over the place. 009-case-insensitivity-dirs.sh just so happens to be the script I am stepping through; that’s exactly what we’re trying to fix.

This is the step we are now up to and it’s the big one. I am not expecting it to complete until around lunch tomorrow. We’re not even halfway there yet!

This is what the script looks like in the terminal: very busy, and it zips by at a thousand miles an hour. What impresses me more is that this is working at all. I’ve never had it progress this far before with so few errors.

Clown. clown. Clown. clown. Watch that crazy recursive case sensitivity combined with symlink cancer. It has spent around 4 hours processing just Clown alone. The content better be amazing! :joy:

Around lunch tomorrow, step 009 will be but a memory and we can check how much of the archive has been trimmed out. Remembering we need less than 1000GB to fit on the internal drive inside the Raspberry Pi. As it stands we are just over 1000GB. Never give up, never surrender. :city_sunset:

2 Likes

Script 009 completed! :partying_face: That means that all the directories will now have their case sensitivity sorted as best can be.

Next up is an unexpected reference to turtles in script 010. The good news is, it’s a very fast script and completes quickly.

It removes files and folders that were downloaded in recursive loops. As you can see in the background there’s a lot of recursion going on. Now they are gone and we can move on to script 011.

Script 011 begins. It looks to sort the case sensitivity at a file level now. We’ve already sorted it at a folder level, so it makes sense to do the files as well.

An important consideration: when configuring the web server, it can be set to ignore case. If you click a link to TURTLES.txt but the file is actually named Turtles.txt or turtles.txt, it will serve whichever one is present and ignore the case. It’s an Apache module. That’s still a while off, so we won’t worry about it just yet.
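The module I have in mind is mod_speling (my assumption, since that is Apache’s usual tool for this job); with CheckCaseOnly it fixes case mismatches without trying to correct actual misspellings. A sketch for when we get there:

sudo a2enmod speling

# then inside the site's <Directory> block:
CheckSpelling On
CheckCaseOnly On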

Postgres loves to eat up my single core of CPU. Database stuff always has me on edge. You might notice I added “-U despens” to the psql commands being called. If I don’t, psql assumes the user is my logged-in Ubuntu user, which is aptly named “ubuntu”. If I had kept the name despens like on the first run, this would have been a non-issue - I guess I had to know what would happen! :innocent:
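In other words, the psql calls from the scripts end up looking like this (using the earlier doubles create script as the example):

psql -U despens -d $GEO_DB_DB -f $GEO_SCRIPTS/sql/create/doubles.sql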

Here’s an example of some of the output we get when comparing case between files. When GeoCities was “downloaded” en masse the crawler would just follow whatever link it could find. So you end up with LION.gif and lion.gif, but a million times over several million files.

Above shows the 011 script at the end. This is an example of the file it saved. There were three possibilities and it picked this one. What was its reasoning for picking this one? I do not know. That’s all buried in Perl scripting which I honestly can’t read - but that’s OK, I don’t have to, as the file survived.

But what is the database up to? Last time we checked it was around 249MB in size. Now it’s a whopping 2168MB. I don’t know what the numbers above the index size are but I can speculate.

30,689,439 / 30,739,475 - Last time I did this we had around 33,000,000 files when dealing with file-level clones, so those are likely file counts. I haven’t come across a number like 122,916,135 before, so I can’t even guess what that might be.

Before we head on to script 012, I thought I’d have a look at how much we have culled off. We’re now at 47% of 2TB used; that puts us under 1TB - good. What we have will fit on the 1TB web server when it is set up. With script 011 out of the way, we’ll next start stepping through script 012 - and it’s a big one! :turtle:

1 Like

Script 012 begins. This one is big. The first stage of the script moves the files from the work directory. Very quick, in the blink of an eye. The next step will run overnight at the very least.

Next is database ingesting! Make sure the tables ‘files’, ‘urls’ and ‘props’ are ready. See the ‘sql’ directory.

Before starting this script, if not stepping through, ensure that the commented out notes above are followed.

You’ll notice that GeoIngest.pl and postgres are both hard at work for the first stage. On the hardware despens was running in 2012, the timing shows real as 1055 minutes, with real being the actual elapsed time. That’s nearly 18 hours! I’m on a quad-core CPU, but the script only uses a single core. I’ll check back in tomorrow with how it all went.

Uh oh, we hit an error! Is that a question mark in a triangle in the filename? It’s an encoding issue! I thought we fixed all those in script 007. A closer look and we can see it was this file:

/media/Geocities2TB/archive/archiveteam.torrent/www.geocities.com/TimesSquare/corridor/1041/exercecio.gif

Looking at the 007 script I am using, we can see that exact file should have been dealt with.

Wait a minute, is that a typo? There’s a typo! It must have failed silently too. It should have read -t utf8, not -tf8. The typo isn’t present in the original file from despens. Remember, I forked this from here as it offered some updates to the code for Ubuntu 22.04. I guess the point is moot as I’m back on Ubuntu 12.04 now.
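For context, -t is convmv’s target-encoding flag (convmv being what I assume the 007 script is calling, given it was on the dependency list), so the corrected invocation would be along these lines - the source encoding and path here are my guesses, not the script’s literal line:

convmv -f iso-8859-1 -t utf8 --notest -r /media/Geocities2TB/archive/archiveteam.torrent/www.geocities.com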

A quick edit and it’s now UTF-8… but what of the database? It would still have the old filename present. We are going to have to speed-run through the scripts again. Here we go! :crossed_fingers:

Alright, a bit of a clean up. For starters I’m installing the graphics drivers, something I should have probably done when I installed the OS. That should help with stability.

We’re back up to script 012 and ran the script in full, not step by step. We ended up with the following at the end:

Is that just a warning? Maybe if we look at the Perl file it will help guide us.

Nope that doesn’t help at all. In that case, let’s just run it again.

Insanity is doing the same thing over and over and expecting different results.

It ran successfully. Yep that about sums up the above quote.

Postgres seems to be doing something involving the INSERT command and has been running a while. Most likely inserting database stuff.

14 hours later, it’s still running. We’re on a different stage now though, looks like GeoIngest.pl and all those psql commands have finished and it’s now onto GeoURLs.pl.

The console window is spewing out addresses, and occasionally spits out the following:

No option but to let it run for now. I’ll dig into what it’s doing in the next post, assuming it is still running. But it is on the last step of script 012. Then we have 013 and we’re done with the processing. I’m starting to think manually processing this was a little quicker… then again, I have taken the long route! :joy:

012-ingest has completed after nearly four days! :crazy_face:

Which leads us to the final script 013-indexes. This script only has two parts, the psql database and a Perl script. Simple compared to the others, should be easy. Let’s step through it one command at a time.

The psql command appears to have worked. And we can see our file sizes, which look good: well under 1TB no matter which way you calculate it (base 2 or base 10).

Look, an error on the second Perl script - how semi-expected! :laughing: It seems to be referencing the last line of the two-line script. This means the first half ran OK, but filter-indexes.pl is choking at line 25, which we can see is a required YAML module. Now, I thought I had all of these sorted and that YAML was built into Perl 5.14 - but maybe it isn’t. How to fix it? Very easy!

sudo apt-get install libconfig-yaml-perl

Rerun the same script and…

Voilà!

A little while later and it’s committed all the data. This means we are done processing all the raw data. Now it’s time to clone it off this 2TB drive to a 1TB drive as an intermediate step, which will then let us move it to the Raspberry Pi’s M.2 SATA storage. First, some sanity checks.

Final file sizes look good and are ~6% less than the 1TB we require.

A file listing inside Area51 shows everything looks to be good titlecase-wise, and what was expected to be there is there - folders, lots of folders. Trying to open the main GeoCities folder in the file manager causes it to lock up for a looooong time. There were many force quits. Next we copy from the 2TB to the 1TB, to free the 2TB up for other projects. However… the target drive has this showing!

We’re going to ignore it for now but keep an eye on it after doing a full transfer to ≥80% of the drive. The command we are running is:

rsync -av '/media/Geocities2TB/archive/archiveteam.torrent/geocities/www.geocities.com' '/media/GeoFinalclone'

We’ll check back in with how it goes soon! :city_sunset:

1 Like

It finished the rsync. But I forgot the international sites…

How big can it be?

Only an extra 4%; that’s another ~40GB, not too bad. 89% of the 1TB is used up - which reminds me…

That’s good news too: the reallocated sector count hasn’t increased on the SSD that’s starting to fail, so the data should be good. In scripts we trust! :laughing:

I’m not sure why du -hs reports sizes differently to df -H and df -h (probably the usual powers-of-1024 versus powers-of-1000 thing), but at least the two datasets are the same size. That’s what counts here, as we’re moving house from a 2.5" SSD to an M.2 SATA blade inside the Raspberry Pi. I’ll use dd for the clone.

sudo dd if=/dev/sdb | pv -s 932G | sudo dd of=/dev/sda bs=4M

Look at those USB2 speeds! Thankfully, it only took around seven and a half hours. Next is the first and most important goal - getting the data ready for the web. There’s nothing more fun than configuring Apache. I can use my old notes to a degree; we’ll be using a proxy server so the Raspberry Pi can be portable. I’ll report back once I get that up and running. :upside_down_face:

1 Like

Progress! The main web server is now pointed at the new Raspberry Pi that will be hosting GeoCities. At this stage I haven’t configured anything on the GeoCities side.

Hello world indeed, the proxy server is passing on requests successfully. It now acts as though it were the one facing the internet directly. The foundations are laid.
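For anyone following along, the front-facing server is doing plain mod_proxy forwarding; a minimal sketch, where the Pi’s address is a placeholder and I’m assuming mod_proxy_http is enabled:

sudo a2enmod proxy proxy_http

<VirtualHost *:80>
    ServerName geocities.mcretro.net
    ProxyPreserveHost On
    ProxyPass / http://192.168.1.50/
    ProxyPassReverse / http://192.168.1.50/
</VirtualHost>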

This is something that bugged me last time: case sensitivity on a non-case-sensitive file system when mounted as an SMB share. This time around, as you can see, I have three folders and there are no issues. How was it achieved? This post helped immensely.

fruit:locking=netatalk
fruit:metadata=netatalk
fruit:resource=file
streams_xattr:prefix=user.
streams_xattr:store_stream_type=no
oplocks=no
level2 oplocks=no
strict locking=auto

case sensitive=yes
preserve case=yes
short preserve case=yes

That did the trick; I should roll that out on all my servers. The last three are the case-sensitivity settings, and the first bunch are all to do with better compatibility with Apple devices - which is what I happen to use as my daily driver.

After fixing some permissions to enable www-data as the owner and group of the newly copied files, I tried to load the directory listing of the www.geocities.com folder. It never loads, the directory listing is just far too big.

So I dusted off some old code and fixed some broken links. The basic front-end is looking good even if none of the links to the actual GeoCities data work yet.

I mean, it looks really good with those icons. Now, how do I get the GeoCities data in the right place? Last time I used three folders: root (for storing the HTML above), core (for storing the core neighborhoods) and yahooids (for all the user folders, there were so many…)

By using the despens scripting, most of this is all in the one folder www.geocities.com. I could try splitting the data up, but then again why would I need to do that?

Of course that didn’t stop me trying to symlink the GeoCities folder contents, which is massive, into the main html folder. This failed. Bash didn’t want any of it.

Argument list too long

This meant that while the data was there, it was in the wrong places. Back to the drawing board, I need to work out the best way to structure this before I can continue any further.
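For the record, “Argument list too long” is the shell expanding the * glob into more arguments than a single command is allowed to receive. find can sidestep that by linking entries one at a time - a sketch with placeholder paths, and not the route I ended up taking:

find /path/to/www.geocities.com -mindepth 1 -maxdepth 1 -exec ln -s {} /var/www/html/ \;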

Now, to sort out the directory layout, I’ve realised the best solution is likely the simplest. As we want all the neighbourhoods and YahooIDs to be in the root, let’s just put them in the root. Apache2 has been reconfigured to recognise the folder www.geocities.com as the root folder.
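In Apache terms that is just pointing DocumentRoot at the folder. A minimal sketch, assuming Apache 2.4 and with the mount point being my assumption rather than the actual path on the Pi:

<VirtualHost *:80>
    ServerName geocities.mcretro.net
    DocumentRoot /mnt/geocities/www.geocities.com
    <Directory /mnt/geocities/www.geocities.com>
        Require all granted
    </Directory>
</VirtualHost>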

I’ve now folded all the data that was sitting alongside www.geocities.com into it as well. This will allow images to load properly when we use mod_substitute to rewrite www.geocities.com queries into geocities.mcretro.net queries.
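The substitution side of that, roughly: mod_substitute filtering the HTML as it goes out. The exact pattern here is a sketch, and on Apache 2.4 the AddOutputFilterByType directive also needs mod_filter:

sudo a2enmod substitute filter

# inside the vhost:
AddOutputFilterByType SUBSTITUTE text/html
Substitute "s|www.geocities.com|geocities.mcretro.net|i"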

This was another oddity I was getting and it appears to be related to Cloudflare popping some tracking scripts in. We don’t want that in there while testing.

Flip the switch and the geocities subdomain is no longer proxied through Cloudflare. I’ll reactivate this later if need be.

We were still getting 404 errors in the apache2 logs on the server. Looks like I forgot to mount the M.2 SATA drive after a reboot… I really should edit fstab. The joys of working things out late at night! :sweat_smile:
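Something along these lines in /etc/fstab would save future me - the UUID, mount point and filesystem type are placeholders/assumptions (blkid will give the real UUID), and nofail stops a missing drive from hanging the boot:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/geocities  ext4  defaults,nofail  0  2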

I did get optional SSL enabled; this is great for better ranking on Google… Search Engine Optimisation (SEO) and all that junk. The good news is that old browsers (e.g. Internet Explorer 4) won’t try to use it if they can’t work out the handshake ciphers.

We’re using the old sitemap.xml files from RC2021/10 because the content hasn’t changed - just the way it is organised internally. As far as the end user, the website visitor, is concerned, everything is the same as before. However, the jump scripts are broken, so I’m still looking into that, but I should have it worked out in the next few hours.

Here is a sample of jump data that should have worked. We were trying to jump inside Athens. It failed with error 500. What is error 500?

The HTTP status code 500 is a generic error response. It means that the server encountered an unexpected condition that prevented it from fulfilling the request. This error is usually returned by the server when no other error code is suitable.

Well, that’s not very helpful. I had a look at the paths set and everything looked fine. Next I poked around in the script itself and realised a dependency was missing: SimpleXML. It’s used for manipulating XML data, such as that in my sitemaps. That’s probably why jump wasn’t working: it couldn’t read the file telling it where to go, and of course Apache had no idea what to do.

sudo apt-get install -y php-simplexml
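Depending on whether PHP is running under mod_php or PHP-FPM, Apache (or the FPM service) may need a restart before the new extension is picked up:

sudo service apache2 restart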

Look at it roar to life!

http://geocities.mcretro.net/ResearchTriangle/System/5694/

http://geocities.mcretro.net/horseclipart/

http://geocities.mcretro.net/Eureka/Park/9322/

Wolfenstein, horse clip art and how to make a horrible website. Yep, that ticks all the boxes I think. You can check it out now by pressing this MegaJump link. Bookmark the link and keep hitting it until something interesting turns up.

Each time it is pressed, it will load a different page on the GeoCities archive. Be warned though. Some of that stuff is NSFW. If you do come across something like that be sure to report it.

I’ve marked the last core to-do item as done on the list of goals. With the main project complete, I can work on the stretch goals over the next couple of days. :upside_down_face:

:white_check_mark: :white_check_mark: :white_check_mark:

1 Like

Here we are with the main homepage; note we have a little “.onion available” button at the top in the location bar. Clicking on that button now takes us to http://geocitiesllczuf44da2nj45jn3fntdjpw27ercfbhkrw3mnegti7pid.onion. The only catch is that you have to be using the Tor Browser. The button won’t show in any other browser, but it also doesn’t interfere with older browsers.
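The way Tor Browser learns about the mirror is (as I understand it) an Onion-Location header sent by the clearnet site. With mod_headers enabled, the commonly published Apache recipe is a single line along these lines:

Header set Onion-Location "http://geocitiesllczuf44da2nj45jn3fntdjpw27ercfbhkrw3mnegti7pid.onion%{REQUEST_URI}s"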

Testing it on individual sites also worked. Here we are on the clearnet offering. One press of the purple button and…

Here we are now looking at the light page. Better find my eastern-facing window to put all my… wait a minute. I don’t have any houseplants! That strikes another item off the goals list. Now there’s only the last stretch goal: dialling up with a real dial-up modem. How will this be achieved?

Thankfully we don’t need to reinvent the wheel for that. We can use this guide from Doge Microsystems to create a dial-up pool. We only really need two (server and client) but I recently came across this lovely SPA8000-G4.

Of course, we recapped it. Interestingly, it only seemed to be from 2017, so maybe the preventative maintenance was a bit premature. We noticed the capacitors from the factory were…

Su’scon branded capacitors. Su’scon. Really? I’ll try and get this last stretch goal done. If I do end up dialling out successfully it will be from the MiSTer FPGA. Don’t forget to visit:

http://geocities.mcretro.net

Do the jumps and stay on the line! :city_sunrise:

1 Like

The first troublesome step to getting online is picking the right serial adapter. I’ve had the most luck with my FTDI USB serial converter, which has an FT232RL chip. I’ve previously used Prolific chipsets with mixed results.

Next up was getting the pesky VoIP configuration set up. The above shows Asterisk running before the two SIPs activated. This indicates that the Cisco SPA8000 can communicate successfully with the Asterisk server on the Raspberry Pi.
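A quick way to check those registrations from the Pi side is the Asterisk CLI (assuming chan_sip, which most of the SPA8000-era guides use):

sudo asterisk -rx "sip show peers"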

Here we have me restarting Asterisk, after wondering why it wasn’t working when I changed the config files… tsk tsk! Once rebooted, a dial from the Windows 98 machine in the MiSTer resulted in a successful connection, and right at the end a hang-up when I was done surfin’ the net.

What was even better is that on the Asterisk / GeoCities server we can see there are no RX or TX errors after 380kB of data going back and forth. I’m set up with a 14.4k modem, and let me say, it took a while to get to that amount of data! :sweat_smile:
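Those error counters come straight from the interface statistics, viewable with something like:

ip -s link show ppp0

or the older ifconfig ppp0 if net-tools is installed.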

I’ll get some video of everything happening to give an idea of how it all works. It’s been a surprising success and has put me way ahead of schedule too. Stay tuned for the video update in the next post! :cityscape:

2 Likes

And there it is! I’ll be doing a live stream a little later on, just browsing through the pages. I really should have picked out a faster modem though… :laughing:

And with that, we have all five objectives complete - :white_check_mark:

As always, it’s a nice touch to end on Halloween. On the topic of Halloween and Halloween-themed websites, here’s one from last year, now available in spooky-vision. Some things on the predictions page are a little spooky!

I’ll be sure to post the live stream here once it’s complete. There won’t be any audio but that shouldn’t stop you from visiting a random neighbourhood today! :wink:

1 Like

Yikes! Talk about slow loading. Who cares about 56k warnings, where was my 14.4k warning? While 14.4k would have been rare in the mid-2000s, it would have been right at home in the mid-1990s. It all depends on where jump lands us on the internet of old. Of course, the stream dropped out mid-way, but, ummm - needless to say I would have eventually won at Solitaire… eventually. :sweat_smile:

I’ll update this post as I come across anything interesting or do any more live streams. I might even upgrade the modem to something V.34 or at least V.FC (V.Fast Class).

Thanks for reading and watching! :upside_down_face:

3 Likes

Thought I’d give the old Spirit Viper a launch. And it held steady for over four hours and 12 megabytes. Running on Windows 98 Second Edition with Internet Explorer 5.0 and Netscape Navigator 4.8 for all the internet needs.

Over on the Raspberry Pi hosting GeoCities and Asterisk, we are very pleased to see the ppp0 bridge report zero faults for both TX and RX! Thanks Rockwell!

And just in case you’re hungry for double the speed of the last video, i.e. 28,800bps vs 14,400bps, enjoy this four hour browse through GeoCities. I also fixed up my MT-32, which was broken due to a firmware update on the MiSTer. That didn’t stop my Volume Control from freezing again though! :laughing:

2 Likes

@ShaneMcRetro - Thank you for entering the Retro challenge October 2022!

Your continued effort and work on this project should be commended, and if it weren’t for your win last year, you might have come in first again! Best of luck for our next challenge, which is forthcoming. Thanks for your work!

Adrian and the ACMS team!

1 Like

One of the big disadvantages of dealing with huge amounts of data you don’t own is not knowing what is actually contained within. One of my bigger concerns was anything that went against GeoCities/Yahoo’s terms of service, and there seem to be plenty of violations. One that I was happy to purge was detected by Cloudflare’s CSAM scanning tool.

I was glad to see that the technology works, though: I was contacted by the NCMEC (National Center for Missing and Exploited Children) and was able to remove the offending data. I tried to reach out to the other hosts of GeoCities data to see if they had lists of offending content, but none appear to keep one.

Knowing that GeoCities was archived by the Archive.org Archive Team, I reached out to archive.org for them to remove it as well. This leaves me asking why isn’t archive.org leveraging Cloudflare’s CSAM scanning tool?

But this helps complement the report tool that we implemented early on. I wonder if GeoCities just became too much of a liability for Yahoo and that’s why they pulled the plug.

Regardless, at this rate I won’t have the sitemap rebuilt with user directories until RetroChallenge 2023/10. There are still things on my to-do list for GeoCities; the hardware failure back in late December set me back a bit, and then I ran out of free time. I guess you can stay tuned though! :satellite:

3 Likes