GeoCities Rebuild for Older Browsers

ShaneMcRetro · 24 October 2021 00:32

The above video shows the rm command running through stripping static, server-generated index.html files away. Recapping yesterday:

Found and removed all index.html files from directory listings
Removed 0 byte files, removed 0 byte folders (in that order)
Started to clean up some of the redirection looped files and folders.
Less than 1TB of data is present now.
Symlinks were all deleted (again) - we need mod_speling to work properly.
Sitemap has been created successfully and is now split into smaller chunks.
The jump script now covers all neighborhoods and is a fair bit quicker.

Onto number three in the list. Cleaning up files and folders that suffered from redirection loops when being harvested by Archive Team.

While looking for empty directories, the longest of the long path lengths appeared. The above image shows an example of this output. The redirection loops (recursive loops) get a little extreme sometimes. For the purposes of this project, as long as they keep below the maximum path length, we are OK.

The above image shows an example of what a typical loop would look like in Finder. At the time of writing the longest one we have is around 220 characters which is more than acceptable. The web crawler / spider would have been going a little insane finding a link within a link that was a link to itself inside a link, inside another link, linked to another link, etc., etc.

Running the following command will show the 10000 longest pathnames:

find . | awk '{print length($0), $0}' | sort -n | tail -n 10000

I’ve run this command a few times now and fixed the loops by deleting duplicated data. One interesting error I’ve come across is seen in the next image.

In particular, take note of;

The file names music.html and it’s variations and,
The file sizes.

The correct music.html is the file named music.html.1. However, all the files in the music.html folder are also the same music.html.1 file. It doesn’t seem to happen on every recursive loop, but has happened enough for me to take note on how to fix it.

Deleting these has helped dip the entire decompressed data set below the 1TB mark. Some of those loops were big, causing folders to inflate to gigabytes. I should really run a search on folder size to detect any excessive utilisation of hard drive space. Low priority for now. But it does mean I can backup onto a 1TB drive/partition in full now. Lucky!

Touching on Apache’s mod_speling again, symlinks have all been removed from the local drive (not the server facing one, yet). If symlinks are present they have been confirmed to mess with mod_speling. This prevents correct redirections from happening. It’s more the nested directories that have issues. For example take:

> https://geocities.mcretro.net/Athens/olympus/1043/ 
> https://geocities.mcretro.net/Athens/Olympus/1043/

/Athens/olympus/ is not redirected to /Athens/Olympus/ as /Athens/olympus/ exists as a symlink. This causes issues for mod_speling as it can’t fix simple things like this as it sees two olympus folders, even though they are one and the same.

Additionally, the goal is to have everything in the correctly capitalised folders. i.e. Area51, SoHo, SiliconValley, etc. Any incorrect case will likely cause issues with mod_speling, breaking the ability to redirect correctly or locate source files - such as images and midis.

This means symlinks are off the menu again. They will only be used on the live server to fold the neighborhoods (/core/) into the much larger YahooIDs (/yahooids/). We need to give mod_speling the opportunity to thrive. We can always recreate symlinks if required as a last ditch effort.

sitemap-magic

301,000 pages are now in the sitemap.xml split over seven files. That’s with a depth flag of 4. I’m scared to think how long it would take if we did the entire site. Now that the sitemap is working properly, it’s time to fix jump!

Jump definitely loads faster now and the stress on the weak Raspberry Pi CPU is clearly less when parsing the smaller sub-sitemap files. The amount of code I am running through the command line is fantastic.

I’ve been using awk, grep (and pcregrep for multiline matching), find, rsync and xargs. I think I am getting better at regular expressions (regex). Regex are extremely powerful provided you are using the right tool. I may have nearly wiped my desktop out (again) with an rm command so I was quite careful with what I was doing on the unbacked up work-in-progress.

The Future
The next steps over the week will be to continue sorting out the pictures and images from the rest of the decompressed data. I’ll be tidying up the YahooIDs to remove any core folders, such as Area51, Hollywood, etc. They can’t be left in the YahooIDs folder. That would cause pretty big issues when, for example, /core/Area51/ tries to mount at the same point as /yahooids/Area51.

I also need to continue to think about the best way to serve the images. Currently there are many locations images were stored at, including international sites. I need to work out the best way to write that into Apache’s mod_substitute. A sample of the website addresses is below:

We create a simple /images folder and put the domains in our root directory and would look something like this:

/images/pic.geocities.com/
/images/geocities.com/pictures/
/images/us.yimg.com/

I think this is something like what Yahoo did when they took over in mid-1999 and again after a brand refresh in the 2000s. For the immediate future though… a backup.

backup-number-one

Creating a snapshot of the progress so far, to rotational media, is a good idea. It should finish late tonight. I’ll then be able to work on merging neighborhood (core) data together. Not making a backup at this point could easily be disastrous. One more week and we should have a clearer picture of how everything looks.

DigitalRampage · 24 October 2021 11:51

Great work on this project @ShaneMcRetro

Mark-C · 24 October 2021 14:28

hey Shane,
I had one of those 56K Banksia modems and its 33.6K predecessor that looked identical in design (actually I probably still have them in boxes somewhere). That was my first internet experience. I was late to get on the internet compared to some because of the high cost of dial up in it’s early years.
Mark C

ShaneMcRetro · 30 October 2021 00:15

Thanks @DigitalRampage, we’re on the final stretch now. Stay tuned.

@Mark-C Absolutely agree, I didn’t enter the internet age until mid-late 90s and I have a hunch our modem might have been secondhand. Looking back at some of those advertisements for $1000+ modems really makes me glad we didn’t jump in sooner. They seem to be a solid little modem though and have found a place in my setup owing to their compact design.

ShaneMcRetro · 30 October 2021 02:29

For the past week, my primary goal has been to merge all the core data together, avoiding copy blocks as pictured above. I’ve been mostly successful at this, although I might have had to call upon the backup from last week a few times. It’s very easy to enable autopilot and accidentally overwrite all of Heartland by replacing instead of merging. Thankfully, that’s now water under the bridge and we have arrived at…

That’s right, the core has been merged
The neighborhoods are all in one place and now have correct TitleCase (see above). MotorCity is no longer motorcity or motorCity. Trying to visit an incorrect title now invokes Apache’s mod_speling to boot you into the right directory. Previously when we had a mish-mash of lowercase, and uppercase letters it would result in some quirky behaviour. YahooIDs have been cleared out and contain only usernames now. YahooIDs have not had their case switched to lowercase as I haven’t found the right command to do it… yet.

The pictures are (mostly) merged
In the image above we can see clipart tiled backgrounds making an appearance which makes text a lot easier on the eyes. I’ve been thinking about implementing a calibration page to help get the best 800x600 experience on modern browsers as some of the tiling tends to, well tile I guess, vertically making the page look like sloppy construction.

There’s now a GeoCities favicon.ico in the title bar. It looks like the favicon standard wasn’t around until the late 1990s when GeoCities rebranded, just prior to Yahoo! takeover.

Since just before 7am we have been rebuilding the sitemap.xml to reflect the lack of empty folders and blank index.html files. This will likely take the rest of the day.

After this has completed, I’ve allocated some time to spend browsing the GeoCities via Jump™. I’ll live stream this browsing session over one of my YouTube channels. This will be done with the MiSTer which feels like a 486 running at 33MHz. Armed with Windows 98 and Internet Explorer 5.0, I’ll dial into the world wide web (WWW) through my two Banksia SP56 modems at ~28.8k. I’ve got an MT32-pi hooked up so I’ll be rocking out to any MIDIs I come across. Stay tuned!

ShaneMcRetro · 30 October 2021 12:14

The sitemap generation is still chugging along. We’re up to 16 hours and we’ve just hit SunsetStrip. This will likely still be running tomorrow morning as I now have all the YahooIDs on there too. I’ll probably manually delete them off the sitemap though as they are usually just storage for the neighbourhood sites.

Another thought I had while beaming between sites randomly, is that the jump link might be better suited for each neighbourhood. For example, to randomise a visit to a site from Area51 you would could load /jump_area51.php. It would make it a little less random, but then you can focus in on what you may actually want to see. For example, I might be an X-Files fanatic, I’m unlikely to find X-Files in Heartland or MotorCity.

The above video is a live stream of me jumping between random sites and checking some of the source code. I also installed the wrong version of WinZip 6.3, 16-bit instead of 32-bit. I should get some DOS-based benchmarks to see what the MiSTer is powerful enough to do. Tomorrow should be an interesting day.

ShaneMcRetro · 31 October 2021 05:37

Sitemap generation complete. 299337 pages in just shy of 24 hours. It hadn’t occurred to me that the users (YahooIDs) won’t be visible due to the way the website is structured, which I think is a good thing. Yahoo used to offer usernames alongside a regular neighbourhood address, so the two could co-exist and share resources between each other. As the goal of this project was GeoCities prior to Yahoo takeover this suits well.

I noticed that since rejigging the Jump™ system pages were taking a good while longer to load. It looks like it was to do with the PHP having to parse a 4MB file each time a random jump was initiated. I’ve now split it into neighbourhoods as shown above. Smaller files = better performance. Heartland is still a monster at a whopping 2MB, but that’s still an improvement.

From this we now have a faster Mega Jump™. Mega Jump™ picks from a random neighbourhood and beams you to a random page in there. If you want more precision though, we have this new feature…

Bookmark an icon to your favourites bar to be able to load all your best X-Files fanfic, or maybe some extreme sports in Pipeline. I’m impressed with how it turned out.

Above is the script used to call on the random page, it should look familiar from a few posts back. In other news the YahooIDs are integrated quite well, see the following fish page with the highlighted code calling on the tiled jewel background. Note the location bar address. Neat!

This also works if websites were spliced and diced between the YahooID and the neighbourhood address, as seen below, are linked inside the same page to each other. Again, note the location bar address.

nh-referencing-yahooid-2

For now that seems to be the end of the immediate project. Some things I would do differently a second time around:

Merge with the command line instead of Finder in Mac OS. I could see that some files were not merging even though they didn’t exist in the destination.

That was a short list. I suppose a lot did go well. Things for future consideration include:

Darknet compatibility via Tor, because why not. My main blog and retrosites as all darknet enabled, it would be interesting to get GeoCities on the darknet.
Enabling caching for Apache to reduce server load. We’re on a mere Raspberry Pi 3.
Cloudflare CDN for the same reasons as above.

I’ll upload a live stream for the final demonstration of how the site works a little later on this evening. But first, time to watch some Babylon 5 to celebrate.

ShaneMcRetro · 31 October 2021 09:55

And there it is, the incredible era that spanned the mid-late 1990s that was GeoCities. I hope you’ve enjoyed the progression, I know I have. You now should have some idea on how to create a similar setup using a Raspberry Pi 3 and Ubuntu 20.04.

To those of you from the future looking to do exactly that I wish you the best of luck on your journey. For everyone else, enjoy the animated gifs, midis and horrendous web design.

Try it out at http://geocities.mcretro.net/ if you haven’t already.

This has been Shane McRetro and I’m signing off for now!

DigitalRampage · 11 November 2021 03:33

I’d like to just personally say I am very proud of the work you have carried out during this challenge @ShaneMcRetro and the level of detail and efforts you went to document and dig deep into making this resource accessible and far more robust than others who have previously attempted to do so.
Great work!

ShaneMcRetro · 14 November 2021 10:58

Hello @DigitalRampage, thanks for the kind words. I’d love to say it’s as complete as it could be but there’s always more work to do.

Other archives such as oocities.org and geocities.restorativland.org are great in their own respects as well. Then there’s archive.org - which is just absolutely massive. They also happen to own the shoulders I have been standing on to attempt this project.

And for what it’s worth, I’m still sorting through my unpublished notes trying to see what I could do differently and how I could potentially improve upon what’s already been done above. It sure is a learning experience!

ShaneMcRetro · 4 December 2021 01:48

This is something I was meaning to do a little while back - enabling darknet accessibility via a Tor v3 onion vanity address. I’ve been tinkering with my SSL certificates on my servers to try and get them working the way I want too.

So now if you happen to be on a modern computer running the Tor Browser (based on Firefox) and visit the HTTPS version of geocities.mcretro.net, you will notice a little purple banner in the address bar.

.onion available

Once clicked you end up with the very easy to remember:
http://geocitiesllczuf44da2nj45jn3fntdjpw27ercfbhkrw3mnegti7pid.onion

This link will not resolve in a clearnet browser, you need to use one that has tor capabilities (or a tor proxy service).

As you can see it is travelling through quite a complicated looking circuit.

The jump page mostly works. I noticed that on it tends to take you to the clearnet site but does provide the purple “.onion available” button again. I’m sure I’ll work it out eventually as to why it is doing that. I must have a “geocities.mcretro.net” hardcoded somewhere instead of just a “/”. Actually it might be because it uses the sitemap.xml files for jump. Ahhh yes. I don’t think I’ll change that, at least it provides darknet search indexers a bit more content.

Grab the Tor Browser and have a tinker with the technology, it’s quite interesting to setup. Then stop by http://geocitiesllczuf44da2nj45jn3fntdjpw27ercfbhkrw3mnegti7pid.onion/

ShaneMcRetro · 3 January 2022 05:49

Fixed darknet jumping.

If you visit https://geocities.mcretro.net in the Tor Browser, note the activation of the location bar. Onion-Location activates. This will not work if visiting on clearnet non-ssl site, http://geocities.mcretro.net, which is by design according to The Tor Project. Hope everyone has a good new year!

EdS · 3 January 2022 14:32

Just for fun, here are one or two retro-computing relevant pages for you:

(Just possibly this should be a new thread?)

ShaneMcRetro · 8 January 2022 05:01

Good finds! I moved the server from the Raspberry Pi 3 to the Raspberry Pi 4 this morning and in the process killed all the symlinks. This is why documenting the whole process was a great idea.

Apache (the web server software) was not happy with symlinks being broken. These are three very useful commands for debugging errors:

sudo tail -f /var/log/apache2/other_vhosts_access.log
sudo tail -f /var/log/apache2/error.log
sudo tail -f /var/log/apache2/access.log

They even show up maroon! How did I manage this, I had renamed the external drive from “External” to “geocities” to help differentiate it from another project I had activated in the past week or so - The Assembler Games Archive. All my old symlinks died as a result.

I ran a test to make sure I wasn’t going to wipe out the entire internet first - i.e. without sudo tacked on the front of the command I stole from the internet. It worked, it found all the correct symlinks to remove and them I did!

Here’s the command that brought them back to life. The neighbourhoods need to be inside the yahooids subfolder for everything to work nicely.

And we’re back. You’re right @EdS, we should make a new thread with retrocomputing relevant finds from GeoCities Archive.

ShaneMcRetro · 9 April 2022 10:24

Just when I thought everything was done and dusted, I received this message through the contact form on my website a few months back.

Dear Administrator，

We are researchers from INSC, Tsinghua University. In our recent research for dark web misconfiguration, we found that your website may be suffering from this security problem. Due to the Apache mod_status vulnerability, results from scanning show that at least 7 dark web domains (starting with geoc| chir| fish| pepp| retr| soni| spac|) and 2 surface web domains (http://geocities.mcretro.net | http://space.mcretro.net) are running on the same host.

This vulnerability is low risk on the Internet, but high risk on the dark web. It is recommended to fix it in time to avoid being attacked or having your information compromised.

And we are looking forward to your reply which is important to our ongoing research.

Best Regards,
INSC, Tsinghua University

I thought, that’s interesting. Naturally, I was curious as to what I had misconfigured. So I visited my sitemap domain as a test http://retrojunkie.net/server-status and was presented with this:

Works as expected on the clearnet, external IPs shouldn’t be able to see it. But onion domains use localhost (127.0.0.1) to do some funky routing. Let’s try the darknet version of the same site at: http://retro56qdosu3zuisr5tiuh7ihpn6owzpwjwvs7djlweiqb3mcspahyd.onion/server-status

Having a look around why this was happening resulted in me finding this great site explaining how it all works in detail. After a quick read I ran the following commands to try and solve the problem I didn’t know I had.

sudo a2dismod status
sudo a2dismod info
sudo apache2ctl configtest
sudo systemctl restart apache2.service

A quick refresh of the onion site…

And all is well… Well, not found because it isn’t supposed to be showing all the intimate server details and pages being loaded. This will probably be my last update on here for this project. I’ve made another copy of the work done above on my blog available at https://mcretro.net/the-geocities-rebuild-worklog/ in case anything happens to this site.

I’ll try to keep both updated if anything else exciting happens. Thanks for reading, I hope you’ve enjoyed the saga!

DigitalRampage · 20 June 2022 03:49

@ShaneMcRetro It is great to see you have persisted with this after the RetroChallenge and I have been enjoying flicking through some of the more ridiculous and risqué content of the late 90s. How the communications have changed from One way on Geocities to our two way “fightclub” of social media.