GeoCities Rebuild for Older Browsers

The above video shows the rm command stripping away the static, server-generated index.html files. Recapping yesterday:

  1. Found and removed all index.html files from directory listings :white_check_mark:
  2. Removed zero-byte files, then empty folders (in that order) :white_check_mark:
  3. Started to clean up some of the redirection looped files and folders.
  4. Less than 1TB of data is present now.
  5. Symlinks were all deleted (again) - we need mod_speling to work properly.
  6. Sitemap has been created successfully and is now split into smaller chunks.
  7. The jump script now covers all neighborhoods and is a fair bit quicker.
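
Steps one and two above can be sketched as a short pipeline of find commands. This is a hedged reconstruction run from the dataset root, not the exact commands used; the order matters because deleting zero-byte files is what leaves folders empty in the first place.

```shell
# Remove blank, server-generated index.html files first.
find . -type f -name 'index.html' -size 0 -delete
# Then any remaining zero-byte files.
find . -type f -size 0 -delete
# Finally, delete now-empty folders (-delete implies depth-first,
# so nested empty directories are cleaned up in one pass).
find . -type d -empty -delete
```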

On to number three on the list: cleaning up files and folders that suffered from redirection loops when being harvested by Archive Team.

While looking for empty directories, the longest of the long path lengths appeared. The above image shows an example of this output. The redirection loops (recursive loops) get a little extreme sometimes. For the purposes of this project, as long as they keep below the maximum path length, we are OK.

The above image shows an example of what a typical loop would look like in Finder. At the time of writing the longest one we have is around 220 characters which is more than acceptable. The web crawler / spider would have been going a little insane finding a link within a link that was a link to itself inside a link, inside another link, linked to another link, etc., etc.

Running the following command will show the 10000 longest pathnames:

find . | awk '{print length($0), $0}' | sort -n | tail -n 10000

I’ve run this command a few times now and fixed the loops by deleting duplicated data. One interesting error I’ve come across is seen in the next image.

In particular, take note of:

  1. The file names (music.html and its variations), and
  2. The file sizes.

The correct music.html is the file named music.html.1. However, all the files inside the music.html folder are also copies of that same music.html.1 file. It doesn’t seem to happen on every recursive loop, but it has happened often enough for me to take note of how to fix it.
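
Before deleting, it's worth confirming that the variants really are byte-for-byte copies. A minimal sketch, using the portable cksum utility (the paths are illustrative): identical checksums and sizes mean the files are duplicates and safe to remove.

```shell
# Print checksum, byte size and path for every music.html variant;
# identical checksums confirm they are duplicates of the same file.
find . -type f -name 'music.html*' -exec cksum {} +
```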

Deleting these has helped dip the entire decompressed data set below the 1TB mark. Some of those loops were big, inflating folders to gigabytes. I should really run a search on folder size to detect any excessive use of hard drive space. Low priority for now. But it does mean I can back up onto a 1TB drive/partition in full now. Lucky!
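
That folder-size search is a one-liner with du. A hedged sketch, run from the dataset root; `-d 1` limits the report to first-level directories and `-k` keeps the units consistent across platforms:

```shell
# Show the 20 largest first-level directories, sizes in kilobytes,
# smallest to largest; loop-inflated folders float to the bottom.
du -k -d 1 . | sort -n | tail -n 20
```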

Touching on Apache’s mod_speling again: symlinks have all been removed from the local drive (not the server-facing one, yet). Symlinks, when present, have been confirmed to interfere with mod_speling, preventing correct redirections from happening. It’s mostly the nested directories that have issues. For example, take:

> https://geocities.mcretro.net/Athens/olympus/1043/ 
> https://geocities.mcretro.net/Athens/Olympus/1043/

/Athens/olympus/ is not redirected to /Athens/Olympus/ because /Athens/olympus/ exists as a symlink. This causes issues for mod_speling: it can’t fix simple things like this because it sees two olympus folders, even though they are one and the same.
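
Purging the symlinks is straightforward with find. A small sketch: audit first, then delete, so nothing unexpected disappears.

```shell
# List every symlink under the current directory before touching anything.
find . -type l -print
# Once reviewed, remove them all; real files and folders are untouched.
find . -type l -delete
```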

Additionally, the goal is to have everything in correctly capitalised folders, e.g. Area51, SoHo, SiliconValley. Any incorrect case will likely cause issues with mod_speling, breaking its ability to redirect correctly or locate source files such as images and MIDIs.

This means symlinks are off the menu again. They will only be used on the live server to fold the neighborhoods (/core/) into the much larger YahooIDs (/yahooids/). We need to give mod_speling the opportunity to thrive. We can always recreate symlinks if required as a last ditch effort.

(Image: sitemap-magic)

301,000 pages are now in the sitemap.xml, split over seven files. That’s with a depth flag of 4. I’m scared to think how long it would take if we did the entire site. Now that the sitemap is working properly, it’s time to fix jump!

Jump definitely loads faster now and the stress on the weak Raspberry Pi CPU is clearly less when parsing the smaller sub-sitemap files. The amount of code I am running through the command line is fantastic.

I’ve been using awk, grep (and pcregrep for multiline matching), find, rsync and xargs. I think I am getting better at regular expressions (regex). Regex are extremely powerful provided you are using the right tool. I may have nearly wiped my desktop out (again) with an rm command, so I was quite careful with what I was doing on the un-backed-up work in progress.

The Future
The next steps over the week will be to continue sorting out the pictures and images from the rest of the decompressed data. I’ll be tidying up the YahooIDs to remove any core folders, such as Area51, Hollywood, etc. They can’t be left in the YahooIDs folder. That would cause pretty big issues when, for example, /core/Area51/ tries to mount at the same point as /yahooids/Area51.

I also need to continue to think about the best way to serve the images. Currently there are many locations images were stored at, including international sites. I need to work out the best way to write that into Apache’s mod_substitute. A sample of the website addresses is below:

We create a simple /images folder in our root directory and mirror each domain beneath it, which would look something like this:

/images/pic.geocities.com/
/images/geocities.com/pictures/
/images/us.yimg.com/
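
With that layout, mod_substitute can rewrite the old absolute image URLs in served HTML to point at the local mirror. This is only a sketch of what the rules might look like, assuming the three hosts above; the exact patterns would need tuning against real pages.

```apache
<IfModule mod_substitute.c>
    # Rewrite outgoing HTML so old image hosts resolve locally.
    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s|http://pic.geocities.com/|/images/pic.geocities.com/|i"
    Substitute "s|http://www.geocities.com/pictures/|/images/geocities.com/pictures/|i"
    Substitute "s|http://us.yimg.com/|/images/us.yimg.com/|i"
</IfModule>
```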

I think this is something like what Yahoo did when they took over in mid-1999 and again after a brand refresh in the 2000s. For the immediate future though… a backup.

(Image: backup-number-one)

Creating a snapshot of the progress so far, to rotational media, is a good idea. It should finish late tonight. I’ll then be able to work on merging neighborhood (core) data together. Not making a backup at this point could easily be disastrous. One more week and we should have a clearer picture of how everything looks.

Great work on this project @ShaneMcRetro

hey Shane,
I had one of those 56K Banksia modems and its 33.6K predecessor that looked identical in design (actually, I probably still have them in boxes somewhere). That was my first internet experience. I was late to get on the internet compared to some because of the high cost of dial-up in its early years.
Mark C

Thanks @DigitalRampage, we’re on the final stretch now. Stay tuned.

@Mark-C Absolutely agree, I didn’t enter the internet age until the mid-to-late 90s and I have a hunch our modem might have been secondhand. Looking back at some of those advertisements for $1000+ modems really makes me glad we didn’t jump in sooner. It seems to be a solid little modem though and has found a place in my setup owing to its compact design. :smile:

For the past week, my primary goal has been to merge all the core data together, avoiding copy blocks as pictured above. I’ve been mostly successful at this, although I might have had to call upon the backup from last week a few times. It’s very easy to enable autopilot and accidentally overwrite all of Heartland by replacing instead of merging. Thankfully, that’s now water under the bridge and we have arrived at…

That’s right, the core has been merged
The neighborhoods are all in one place and now have correct TitleCase (see above). MotorCity is no longer motorcity or motorCity. Trying to visit an incorrect title now invokes Apache’s mod_speling to boot you into the right directory. Previously, when we had a mish-mash of lowercase and uppercase letters, it would result in some quirky behaviour. YahooIDs have been cleared out and contain only usernames now. YahooIDs have not had their case switched to lowercase as I haven’t found the right command to do it… yet.
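
One possible command for that lowercase switch, sketched here rather than tested against the real data: loop over the directories and rename through tr. It assumes no two names collide once lowercased, and that you run it from inside the YahooIDs folder.

```shell
# Rename every directory in the current folder to its lowercase form.
for d in */ ; do
    src=${d%/}
    dst=$(printf '%s' "$src" | tr '[:upper:]' '[:lower:]')
    if [ "$src" != "$dst" ]; then
        mv "$src" "$dst"
    fi
done
```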

The pictures are (mostly) merged
In the image above we can see clipart tiled backgrounds making an appearance, which makes text a lot easier on the eyes. I’ve been thinking about implementing a calibration page to help get the best 800x600 experience on modern browsers, as some of the tiling tends to, well, tile vertically, making the page look like sloppy construction.

There’s now a GeoCities favicon.ico in the title bar. It looks like the favicon standard wasn’t around until the late 1990s, when GeoCities rebranded just prior to the Yahoo! takeover.

Since just before 7am we have been rebuilding the sitemap.xml to reflect the lack of empty folders and blank index.html files. This will likely take the rest of the day.

After this has completed, I’ve allocated some time to spend browsing GeoCities via Jump™. I’ll live stream this browsing session over one of my YouTube channels. This will be done with the MiSTer, which feels like a 486 running at 33MHz. Armed with Windows 98 and Internet Explorer 5.0, I’ll dial into the world wide web (WWW) through my two Banksia SP56 modems at ~28.8k. I’ve got an MT32-pi hooked up so I’ll be rocking out to any MIDIs I come across. Stay tuned! :city_sunrise:

The sitemap generation is still chugging along. We’re up to 16 hours and we’ve just hit SunsetStrip. This will likely still be running tomorrow morning as I now have all the YahooIDs on there too. I’ll probably manually delete them off the sitemap though as they are usually just storage for the neighbourhood sites.

Another thought I had while beaming between sites randomly: the jump link might be better suited to each neighbourhood. For example, to randomise a visit to a site from Area51 you could load /jump_area51.php. It would make it a little less random, but then you can focus on what you may actually want to see. For example, if I’m an X-Files fanatic, I’m unlikely to find X-Files in Heartland or MotorCity.

The above video is a live stream of me jumping between random sites and checking some of the source code. I also installed the wrong version of WinZip 6.3, 16-bit instead of 32-bit. I should get some DOS-based benchmarks to see what the MiSTer is powerful enough to do. Tomorrow should be an interesting day. :white_check_mark:

Sitemap generation complete: 299,337 pages in just shy of 24 hours. It hadn’t occurred to me that the users (YahooIDs) won’t be visible due to the way the website is structured, which I think is a good thing. Yahoo used to offer usernames alongside a regular neighbourhood address, so the two could co-exist and share resources with each other. As the goal of this project was GeoCities prior to the Yahoo takeover, this suits well.

I noticed that since rejigging the Jump™ system, pages were taking a good while longer to load. It looks like it was down to the PHP having to parse a 4MB file each time a random jump was initiated. I’ve now split it into neighbourhoods, as shown above. Smaller files = better performance. Heartland is still a monster at a whopping 2MB, but that’s still an improvement.
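
The split itself can be done with a grep per neighbourhood. A hedged sketch, assuming the page list has been flattened to one URL per line in a file I'm calling sitemap.txt (the real input and output names may differ):

```shell
# Split a one-URL-per-line page list into per-neighbourhood files
# so the PHP jump script only has to parse a small file each time.
for hood in Area51 Heartland MotorCity; do
    lc=$(printf '%s' "$hood" | tr '[:upper:]' '[:lower:]')
    grep "/$hood/" sitemap.txt > "sitemap_$lc.txt"
done
```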

From this we now have a faster Mega Jump™. Mega Jump™ picks from a random neighbourhood and beams you to a random page in there. If you want more precision though, we have this new feature…

Bookmark an icon to your favourites bar to be able to load all your best X-Files fanfic, or maybe some extreme sports in Pipeline. I’m impressed with how it turned out.

Above is the script used to call on the random page, it should look familiar from a few posts back. In other news the YahooIDs are integrated quite well, see the following fish page with the highlighted code calling on the tiled jewel background. Note the location bar address. Neat!

This also works when websites were spliced and diced between the YahooID and the neighbourhood address and linked to each other inside the same page, as seen below. Again, note the location bar address.

(Image: nh-referencing-yahooid-2)

For now that seems to be the end of the immediate project. Some things I would do differently a second time around:

  1. Merge with the command line instead of Finder in macOS. I could see that some files were not merging even though they didn’t exist in the destination.

That was a short list. I suppose a lot did go well. Things for future consideration include:

  1. Darknet compatibility via Tor, because why not. My main blog and retro sites are all darknet-enabled; it would be interesting to get GeoCities on the darknet.
  2. Enabling caching for Apache to reduce server load. We’re on a mere Raspberry Pi 3.
  3. Cloudflare CDN for the same reasons as above.

I’ll upload a live stream for the final demonstration of how the site works a little later on this evening. But first, time to watch some Babylon 5 to celebrate. :beers:

And there it is, the incredible era that spanned the mid-to-late 1990s that was GeoCities. I hope you’ve enjoyed the progression, I know I have. You should now have some idea of how to create a similar setup using a Raspberry Pi 3 and Ubuntu 20.04.

To those of you from the future looking to do exactly that I wish you the best of luck on your journey. For everyone else, enjoy the animated gifs, midis and horrendous web design. :stuck_out_tongue:

Try it out at http://geocities.mcretro.net/ if you haven’t already.

This has been Shane McRetro and I’m signing off for now!

:city_sunrise: :cityscape: :city_sunset:

I’d just like to personally say I am very proud of the work you have carried out during this challenge @ShaneMcRetro, and of the level of detail and effort you put into documenting and digging deep to make this resource accessible and far more robust than previous attempts.
Great work!

Hello @DigitalRampage, thanks for the kind words. I’d love to say it’s as complete as it could be but there’s always more work to do.

Other archives such as oocities.org and geocities.restorativland.org are great in their own respects as well. Then there’s archive.org - which is just absolutely massive. They also happen to own the shoulders I have been standing on to attempt this project.

And for what it’s worth, I’m still sorting through my unpublished notes trying to see what I could do differently and how I could potentially improve upon what’s already been done above. It sure is a learning experience!
