GeoCities Rebuild for Older Browsers

A quick redesign to make it a little nicer, still a work in progress - it wouldn’t be a GeoCities website if it wasn’t under construction after all.

Being able to teleport or jump to random sites inside the GeoCities archives is going to need a list of all available links to jump to. A sitemap.xml file should help us with that. That will be the database of available links for jumping.

To generate a sitemap we’ll need a… sitemap generator. I’ve decided on the aptly named Sitemap Generator CLI. I decided it would be best to run it on one of the Ubuntu 20.04 servers. Installation was straightforward.

sudo apt update
sudo apt install npm
sudo npm install -g sitemap-generator-cli

Then the generation of the sitemap.xml file. Unsurprisingly, this will also serve as a sitemap for Google, Bing and others crawling the website for links. Depth of 4 seems to be sufficient to capture the majority of user data.

sitemap-generator http://geocities.mcretro.net --verbose --max-depth 4

After waiting what felt like a few hours, move the sitemap into the appropriate folder:

sudo mv sitemap.xml /var/www/html/geocities/

We now have a sitemap.

Next, how to best implement the teleport/jump feature to use the sitemap.

Jump implemented.

The code comes from stackoverflow.com. The jump code now runs a server-side script (in PHP) to randomly select a website from the sitemap.xml file. If it weren’t server-side you’d know because the file is 1.8MB. Downloading that over dialup is going to be one slow page load.

You can try it here to beam to a random GeoCities website. We still have only Area51 and CapeCanaveral active. Hope you like X-Files fanfic…

Now we have the code that helps us jump to random sites that is pulled from our sitemap.xml. So where to next? Finding a way to keep a “Home”, “Jump”, and “Report” text or buttons on the majority of sites visited. Initial tests were somewhat successful…

I rediscovered TARGET="_top" after having my test links as same window. TARGET="_top" makes links open in the topmost frame, i.e. the full browser tab and window. Lesson learnt.

The current iteration of nav buttons embed into the page by Apache’s mod_substitute. We replace the </BODY> tag of the website server-side (not on the client visiting). No solution was perfect, so I went with the one that causes the least damage.

  • </HEAD> - Could cause the body tag to start prematurely, resulting in malformed content and it breaks frames.
  • <BODY> - Most pages have this contain most of their formatting. Injecting code into here results in a mess, often stripping formatting.
  • </BODY> - This only rarely causes issues.

The above shows issues when using BODY as a place to inject when FRAMESET is present. Substituting the </BODY> results in no nav buttons showing because FRAMESET is used in place of BODY tags. While I could inject code into FRAMESET as well, I think it would cause more issues depending on if the FRAMESET contained body tags such as the above sample site code.

I did tinker with having JavaScript overwriting some of the code to inject a nav bar. However, it was not working on the targeted older browsers. The reason? The script always landed outside the closing BODY tag meaning it wouldn’t render on the page.

Loaded the VirtualHost config:

sudo nano /etc/apache2/sites-available/geocities.conf

and inserted the following:

# Insert Home, Jump and Report Icons - Works without breaking too much, doesn't show on sites without a closing body tag </BODY>.
Substitute 's|</body>|<BR><p align="right"><A HREF="/" target="_top"><IMG SRC="/pictures/home.gif" ALT="Home"></A> <A HREF="/jump" target="_top"><IMG SRC="/pictures/jump.gif" ALT="Jump!"></A> <A HREF="http://mcretro.net/report" target="_blank"><IMG SRC="/pictures/report.gif" ALT="Report"</A></p></body>|i'

We can now jump from most pages - including directory listings (i.e. pages without an index).

In other news, I’ve mostly finished the neighborhood listing. This means if I copy all the files across they will now load up well enough. I’m still working on directory structure though. Images for sites seem to come from the following pages:

Pure chaos.

I’m looking into aliasing some of the data directories for YahooIDs to prevent them flooding the root folder with random usernames. Still trying to work out the best way to achieve that though.

I’ve tidied up the counters and the root substitutes, I noticed a lot more backgrounds work now. Once I merge all the root pictures/images together, we’ll have much better looking sites. Well, perhaps better isn’t quite be the right word. :slight_smile:

code_reduce

For now I have removed some of the JavaScript hiding on almost all pages. I couldn’t work out how to remove the geovisit(); code though. Two out of three ain’t bad.

Not much more to report except for some insight into those new Matrix sequels. I’ll keep working away on working out structure on the backend. I need to verify the structure of the neighborhoods against other sources such as Blade’s Place, OoCities.com and Wayback Machine.

Then I’ll need to poke around the YahooIDs, Neighborhoods and pictures spanning three periods: GeoCities, the GeoCities/Yahoo transition and Yahoo. Working these ones out will help determine the best way to mount the external media.

1 Like

Here’s the file structure. I want to have External be an external USB SSD to hold all the data from Core (Neighborhoods) and YahooIDs (YahooIDs). Athens is an example of a neighborhood. The YahooIDs, bananaphone, grantshome and jackiluvcatz are examples of usernames. The geocities subfolder contains the HTML I have written up in the past month or so.

There’s a lot of cross over between YahooIDs and neighborhoods. It looks like if you had a page at geocities.com/Athens/4829/ you would have also had geocities.com/jackiluvcatz as a symlink. The page builders that Yahoo (and maybe GeoCities) used seem to have willy nilly used the YahooID over the neighborhood location in referencing images, and other files in your website.

So really I needed two separate locations to meet (parallel and overlapping data) under the root folder where I already had data.

I attempted to use mod_alias only to find it kind of worked, but not really. Then I came across this incredible idea of having two DocumentRoots. My VirtualHost configuration became this:

DocumentRoot /var/www/html/geocities
<Directory /var/www/html/geocities>
                Options Indexes MultiViews FollowSymLinks
                AllowOverride None
                Require all granted
</Directory>

RewriteCond "/var/www/html/External/YahooIDs%{REQUEST_URI}" -f [OR]
RewriteCond "/var/www/html/External/YahooIDs%{REQUEST_URI}" -d
RewriteRule ^/?(.*)$ /var/www/html/External/YahooIDs/$1 [L]

RewriteCond "/var/www/html/External/Core%{REQUEST_URI}" -f [OR]
RewriteCond "/var/www/html/External/Core%{REQUEST_URI}" -d
RewriteRule ^/?(.*)$ /var/www/html/External/Core/$1 [L]

It worked! Almost perfectly. I found that we would lose the ability to have “athens” be corrected to “Athens” once passed through the rewrite rule. It turns out mod_speling and mod_rewrite are incompatible.

“mod_speling and mod_rewrite apply their rules during the same phase. If rewrite touches a URL it generally won’t pass the url on to mod_speling.”

One document root (my HTML) and two rewrite rules (Neighborhoods and YahooIDs) were so close to working. Every article I read told me it wasn’t possible, everyone else gave up or threads went dead. I did learn that Microsoft IIS has this issue often when migrating to a Linux host with Apache. Great.

Symlinks are back on the menu. I could use:

sudo ln -s Athens athens

This would allow me to have athens corrected to Athens. But what about SoHo, Soho and soho. Let alone the case sensitivity on YahooIDs… Plus there’s a bunch of unmerged case differences. I’ll merge them one day, maybe. mod_speling might negate the need to do that.

There’s too many variations of each YahooID, e.g. bananaphone vs BANANAPHONE and every iteration inbetween to create symlinks for. If mod_speling is broken, then case sensitivity is broken stopping people visiting the second address listed:

almost

It was so close to working. Just the lowercase variations were broken. I even dabbled in rewriting all files to lowercase through mod_rewrite. Nope, that didn’t work at all. Hmmm. Then it hit me. Like a bus. Reverse it.

reverse-polarity

I know that my created HTML is all lowercase and doesn’t require mod_speling. That means that the “geocities” folder can be ran through mod_rewrite with no consequences of mod_speling being disabled. Great!

# Rewrites for GeoCities (self-made) files. Allows mod_speling to function.
# mod_rewrite does not allow mod_speling to modify / ignore case.
# This is fine on my created files as everything is lowercase.
RewriteEngine on
RewriteCond "/var/www/html/geocities%{REQUEST_URI}" -f [OR]
RewriteCond "/var/www/html/geocities%{REQUEST_URI}" -d
RewriteRule ^/?(.*)$ /var/www/html/geocities/$1 [L]

With “geocities” being taken care of we move on to the contents of the “External” folders. We need their contents to show at:

Simple? Not really. No wonder I got stuck on this so long! The larger of the two folders is going to be YahooIDs. There’s a finite amount of neighborhoods, so core will have less items. Sounds like I need to symlink Core folders into YahooIDs so they both display at root in http://geocities.mcretro.net/ - symlink activate!

sudo ln -s /var/www/html/External/Core/* /var/www/html/External/YahooIDs

Now we have an Athens folder inside YahooIDs. Excellent. YahooIDs is now our DocumentRoot. Core (neighborhoods) is symlinked into YahooIDs. My GeoCities HTML landing page is rewritten into / alongside the above two. Three folders feeding into one. Incredible.

I’m hoping you’re starting to see why the GeoCities Archive is such a mess. The backend would have been chaos.

These all redirect correctly:

Also important to check the older links are still working and have mod_speling active:

On that note, we’ll need to move any neighborhoods from /geocities/* to /External/Core/*.

It seems that the /geocities/Area51/ rewrite rule has priority over /External/Core/Area51/ - interesting. Throw some case in there and it all redirects correctly though. This might not mean anything at this stage since I need to reshuffle the files around and fix permissions.

final-product

Above is the VirtualHost configuration to date. I better get around to moving everything across into final directories for further testing. :grinning_face_with_smiling_eyes:

Update: Copied all the neighborhoods across (core) and a few usernames (yahooids) - it looks like it might be working. I’ll rebuild the sitemap.xml file tonight. It should wind up being pretty big. Then jump should teleport across any page that is available! :upside_down_face:

1 Like

Jump is working, however the server does not like having to parse a file that is 15MB everytime the link is clicked. There’s over 280,000 links indexed. Yikes! That’s before I add 100,000+ user folders… That’s a lot of files.

What I can possibly do to speed it up a little is to split the sitemap into multiple files. Then the jump PHP script could call on one of the sitemaps at random, and pick a random URL inside the sitemap called upon. Of course you could just pick a suburb from the neighhborhood section and explore that way too.

Now that I’ve got everything redirecting to the root where it should be, I’m coming across some issues. This is one of them. This is mod_speling in action, trying to work out what address I was supposed to go to. I gave it capeCanaveral (invalid, non-existent) instead of CapeCanaveral (valid, exists) or capecanaveral (invalid, non-existent).

cc-line-marsh

As you can see from the above gif, marsh.gif is present in only one and it mod_speling failed to splice the two directories together. This might be a by-product of the way the DocumentRoot and mod_rewrite was configured earlier… I guess it doesn’t matter too much for now. I will need to find a way to merge the lower case and upper case variations on directory names eventually.

Next steps will be capturing more core (neighborhoods) directories to merge from the YahooID folders. The data dump included a few extra sources apart from the main files, I’ll need to get these into the mix.

The next step after that will be sorting out the structure of the root pictures and images folders. These are pretty important as some people hot-linked directly to the root source files which aren’t all present yet.

I won’t be able to work on this too much until the weekend again, but it does give some food for thought until then. :cowboy_hat_face:

1 Like

halloween

I know I said I wouldn’t do too much during the week but a few things did click together. In preparation for merging the neighborhood case differences with rsync (or Carbon Copy Cloner) we’ve made some big strides in the past week. Here’s the more exciting things:

  1. Found and removed all index.html files from directory listings.
  2. Removed 0 byte files, removed 0 byte folders (in that order).
  3. Started to clean up some of the redirection looped files and folders.
  4. Less than 1TB of data is present now.
  5. Symlinks were all deleted (again) - we need mod_speling to work properly.
  6. Sitemap has been created successfully and is now split into smaller chunks.
  7. The jump script now covers all neighborhoods and is a fair bit quicker.

All these will help with the merging of the various data sources together. It’s a bit like Minesweeper before you tinker with the settings to win faster.

The best order for cleaning up the file and directory structure (#2 and #5) was:

  • Delete .DS_Store files (thanks Mac OS for creating them)
  • Delete zero byte (empty) files
  • Delete empty directories
  • Delete symlinks

The creation of hidden .DS_Store files and directory listing index.html files (see #1) were blocking directories from being deleted as they were not empty. The above image was one of the offending index.html files that were likely autogenerated by Yahoo in order to reduce server load. The cost? They aren’t dynamic anymore. Deleting the index.html files was a tough feat.

blank-boring-index

As you can see above, these directory listing indexes are missing their icons and are quite boring. They don’t have a lot in common, but their source code shared some unique beginnings. This is one of the files:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
 <HEAD>
  <TITLE>Index of /CapeCanaveral/6276/wreck</TITLE>
 </HEAD>
 <BODY>
<H1>Index of /CapeCanaveral/6276/wreck</H1>

From that we can steal the first part - four lines worth:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
 <HEAD>
  <TITLE>Index of

And that’s pretty unique in itself as it narrows down HTML files that have “Index of” in their title. That’s only going to be indexes of directories. Perfect. Another bonus is the !DOCTYPE declaration. HTML3.2 Final (we’re up to HTML5 now). Very handy.

Next up I did some searching to find what command line tool(s) I could use. Enter grep via this post:
grep --include=\*.{htm,html} -rnw '/path/to/somewhere/' -e "pattern"

Grep seemed to work, but it was detecting my three lines as three searches and not stringing them together. So I’d have it return a positive for each line, I need all lines to be considered at the same time. Turns out grep wasn’t good enough. I needed pcregrep.

After playing around with switches and settings and adapting the code to suit my needs. I ended up with this:

pcregrep --buffer-size=2M --include=\index.{htm,html} -rilM "(?s)<\!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">.*(?s)<HTML>.*<HEAD>.*(?s)<TITLE>Index of " '.' 2>&1 |tee ~/Desktop/blankindex.txt

Firstly, pcregrep has a buffer limit of 20K. Not enough. It outputs errors and messes up my blankindex.txt file. Buffer size booster to 2M. Great!

We only want to include index.htm or index.html files. -rilM, in order: recursive, case-insensitive, list matches show only, search beyond line boundaries. That last one is important. Then you have a bunch of regex stuff stringing all the lines together. The '.' searches the current folder, so make sure you are in the right directory when running this.

But that wasn’t enough, I wanted to send it all to a text file for review. 2>&1 sends errors and terminal output to my new text file, blankindex.txt, using tee. This resulted in the following:

Excellent, I have a massive list of all the directory listing index.html files I don’t want. 219,060 files. Now to check my work, I checked how many index.html files were present overall. How? Well let’s try to dumb down our search with pcregrep. We’ll look for the following in index.html files.

<HTML>
 <HEAD>

And we’ll search with this code. This should yield a lot more results.

pcregrep --buffer-size=2M --include=\index.{htm,html} -rilM "(?s)<HTML>.*<HEAD>" '.' 2>&1 |tee ~/Desktop/fullindex.txt

And it did. The original was 12MB at 219,000, the new one was well over 50MB before I cancelled it. There’s definitely a lot of index.html files with both HTML and HEAD tags. Past cool! :+1:

I checked some samples from the blankindex.txt file to make sure I was definitely only removing static directory listings and not user generated content. Everything checked out. Next up, xargs.

xargs -n 1 ls -v < ~/Desktop/blankindex.txt

Whoops. Good thing that had an ls in place of the rm. Apparently white space is not liked by xargs unless enclosed in quotes…

quotes

Adding quotes is quite easy using BBEdit as it has a built in grep function. As I just happened to have the knowledge fresh in my memory, I did some grepping to add to the front of each line with the ^ symbol and to the end of each line with a $ symbol. Done. Now the file should be able to get parsed correctly in terminal (Mac OS).

xargs -n 1 ls -v < ~/Desktop/blankindex.txt

A lot of reading indicated I should have been using the -0 (or --null) flag for xargs. I think it would have saved putting the quotes around each line. But I didn’t, I took the risk and watched it list attributes of files in the blankindex.txt file. I checked for any errors. None. We are go for the final command.

xargs -n 1 rm -v < ~/Desktop/blankindex.txt

It worked. Next stop, clean out some more empty files, folders and .DS_Stores:

find . -name .DS_Store -delete
find . -empty -type f -delete
find . -empty -type d -delete

We now have a much tidier directory structure for merging and a lack of empty files (or useless files) will prevent collisions when merging folders. We’ll have a look at more case sensitivity, redirection loops, sitemap and jump tomorrow.

The above video shows the rm command running through stripping static, server-generated index.html files away. Recapping yesterday:

  1. Found and removed all index.html files from directory listings :white_check_mark:
  2. Removed 0 byte files, removed 0 byte folders (in that order) :white_check_mark:
  3. Started to clean up some of the redirection looped files and folders.
  4. Less than 1TB of data is present now.
  5. Symlinks were all deleted (again) - we need mod_speling to work properly.
  6. Sitemap has been created successfully and is now split into smaller chunks.
  7. The jump script now covers all neighborhoods and is a fair bit quicker.

Onto number three in the list. Cleaning up files and folders that suffered from redirection loops when being harvested by Archive Team.

While looking for empty directories, the longest of the long path lengths appeared. The above image shows an example of this output. The redirection loops (recursive loops) get a little extreme sometimes. For the purposes of this project, as long as they keep below the maximum path length, we are OK.

The above image shows an example of what a typical loop would look like in Finder. At the time of writing the longest one we have is around 220 characters which is more than acceptable. The web crawler / spider would have been going a little insane finding a link within a link that was a link to itself inside a link, inside another link, linked to another link, etc., etc.

Running the following command will show the 10000 longest pathnames:

find . | awk '{print length($0), $0}' | sort -n | tail -n 10000

I’ve run this command a few times now and fixed the loops by deleting duplicated data. One interesting error I’ve come across is seen in the next image.

In particular, take note of;

  1. The file names music.html and it’s variations and,
  2. The file sizes.

The correct music.html is the file named music.html.1. However, all the files in the music.html folder are also the same music.html.1 file. It doesn’t seem to happen on every recursive loop, but has happened enough for me to take note on how to fix it.

Deleting these has helped dip the entire decompressed data set below the 1TB mark. Some of those loops were big, causing folders to inflate to gigabytes. I should really run a search on folder size to detect any excessive utilisation of hard drive space. Low priority for now. But it does mean I can backup onto a 1TB drive/partition in full now. Lucky!

Touching on Apache’s mod_speling again, symlinks have all been removed from the local drive (not the server facing one, yet). If symlinks are present they have been confirmed to mess with mod_speling. This prevents correct redirections from happening. It’s more the nested directories that have issues. For example take:

> https://geocities.mcretro.net/Athens/olympus/1043/ 
> https://geocities.mcretro.net/Athens/Olympus/1043/

/Athens/olympus/ is not redirected to /Athens/Olympus/ as /Athens/olympus/ exists as a symlink. This causes issues for mod_speling as it can’t fix simple things like this as it sees two olympus folders, even though they are one and the same.

Additionally, the goal is to have everything in the correctly capitalised folders. i.e. Area51, SoHo, SiliconValley, etc. Any incorrect case will likely cause issues with mod_speling, breaking the ability to redirect correctly or locate source files - such as images and midis.

This means symlinks are off the menu again. They will only be used on the live server to fold the neighborhoods (/core/) into the much larger YahooIDs (/yahooids/). We need to give mod_speling the opportunity to thrive. We can always recreate symlinks if required as a last ditch effort.

sitemap-magic

301,000 pages are now in the sitemap.xml split over seven files. That’s with a depth flag of 4. I’m scared to think how long it would take if we did the entire site. Now that the sitemap is working properly, it’s time to fix jump!

Jump definitely loads faster now and the stress on the weak Raspberry Pi CPU is clearly less when parsing the smaller sub-sitemap files. The amount of code I am running through the command line is fantastic.

I’ve been using awk, grep (and pcregrep for multiline matching), find, rsync and xargs. I think I am getting better at regular expressions (regex). Regex are extremely powerful provided you are using the right tool. I may have nearly wiped my desktop out (again) with an rm command so I was quite careful with what I was doing on the unbacked up work-in-progress.

The Future
The next steps over the week will be to continue sorting out the pictures and images from the rest of the decompressed data. I’ll be tidying up the YahooIDs to remove any core folders, such as Area51, Hollywood, etc. They can’t be left in the YahooIDs folder. That would cause pretty big issues when, for example, /core/Area51/ tries to mount at the same point as /yahooids/Area51.

I also need to continue to think about the best way to serve the images. Currently there are many locations images were stored at, including international sites. I need to work out the best way to write that into Apache’s mod_substitute. A sample of the website addresses is below:

We create a simple /images folder and put the domains in our root directory and would look something like this:

/images/pic.geocities.com/
/images/geocities.com/pictures/
/images/us.yimg.com/

I think this is something like what Yahoo did when they took over in mid-1999 and again after a brand refresh in the 2000s. For the immediate future though… a backup.

backup-number-one

Creating a snapshot of the progress so far, to rotational media, is a good idea. It should finish late tonight. I’ll then be able to work on merging neighborhood (core) data together. Not making a backup at this point could easily be disastrous. One more week and we should have a clearer picture of how everything looks.

2 Likes

Great work on this project @ShaneMcRetro

hey Shane,
I had one of those 56K Banksia modems and its 33.6K predecessor that looked identical in design (actually I probably still have them in boxes somewhere). That was my first internet experience. I was late to get on the internet compared to some because of the high cost of dial up in it’s early years.
Mark C

Thanks @DigitalRampage, we’re on the final stretch now. Stay tuned.

@Mark-C Absolutely agree, I didn’t enter the internet age until mid-late 90s and I have a hunch our modem might have been secondhand. Looking back at some of those advertisements for $1000+ modems really makes me glad we didn’t jump in sooner. They seem to be a solid little modem though and have found a place in my setup owing to their compact design. :smile:

For the past week, my primary goal has been to merge all the core data together, avoiding copy blocks as pictured above. I’ve been mostly successful at this, although I might have had to call upon the backup from last week a few times. It’s very easy to enable autopilot and accidentally overwrite all of Heartland by replacing instead of merging. Thankfully, that’s now water under the bridge and we have arrived at…

That’s right, the core has been merged
The neighborhoods are all in one place and now have correct TitleCase (see above). MotorCity is no longer motorcity or motorCity. Trying to visit an incorrect title now invokes Apache’s mod_speling to boot you into the right directory. Previously when we had a mish-mash of lowercase, and uppercase letters it would result in some quirky behaviour. YahooIDs have been cleared out and contain only usernames now. YahooIDs have not had their case switched to lowercase as I haven’t found the right command to do it… yet.

The pictures are (mostly) merged
In the image above we can see clipart tiled backgrounds making an appearance which makes text a lot easier on the eyes. I’ve been thinking about implementing a calibration page to help get the best 800x600 experience on modern browsers as some of the tiling tends to, well tile I guess, vertically making the page look like sloppy construction.

There’s now a GeoCities favicon.ico in the title bar. It looks like the favicon standard wasn’t around until the late 1990s when GeoCities rebranded, just prior to Yahoo! takeover.

Since just before 7am we have been rebuilding the sitemap.xml to reflect the lack of empty folders and blank index.html files. This will likely take the rest of the day.

After this has completed, I’ve allocated some time to spend browsing the GeoCities via Jump™. I’ll live stream this browsing session over one of my YouTube channels. This will be done with the MiSTer which feels like a 486 running at 33MHz. Armed with Windows 98 and Internet Explorer 5.0, I’ll dial into the world wide web (WWW) through my two Banksia SP56 modems at ~28.8k. I’ve got an MT32-pi hooked up so I’ll be rocking out to any MIDIs I come across. Stay tuned! :city_sunrise:

2 Likes

The sitemap generation is still chugging along. We’re up to 16 hours and we’ve just hit SunsetStrip. This will likely still be running tomorrow morning as I now have all the YahooIDs on there too. I’ll probably manually delete them off the sitemap though as they are usually just storage for the neighbourhood sites.

Another thought I had while beaming between sites randomly, is that the jump link might be better suited for each neighbourhood. For example, to randomise a visit to a site from Area51 you would could load /jump_area51.php. It would make it a little less random, but then you can focus in on what you may actually want to see. For example, I might be an X-Files fanatic, I’m unlikely to find X-Files in Heartland or MotorCity.

The above video is a live stream of me jumping between random sites and checking some of the source code. I also installed the wrong version of WinZip 6.3, 16-bit instead of 32-bit. I should get some DOS-based benchmarks to see what the MiSTer is powerful enough to do. Tomorrow should be an interesting day. :white_check_mark:

Sitemap generation complete. 299337 pages in just shy of 24 hours. It hadn’t occurred to me that the users (YahooIDs) won’t be visible due to the way the website is structured, which I think is a good thing. Yahoo used to offer usernames alongside a regular neighbourhood address, so the two could co-exist and share resources between each other. As the goal of this project was GeoCities prior to Yahoo takeover this suits well.

I noticed that since rejigging the Jump™ system pages were taking a good while longer to load. It looks like it was to do with the PHP having to parse a 4MB file each time a random jump was initiated. I’ve now split it into neighbourhoods as shown above. Smaller files = better performance. Heartland is still a monster at a whopping 2MB, but that’s still an improvement.

From this we now have a faster Mega Jump™. Mega Jump™ picks from a random neighbourhood and beams you to a random page in there. If you want more precision though, we have this new feature…

Bookmark an icon to your favourites bar to be able to load all your best X-Files fanfic, or maybe some extreme sports in Pipeline. I’m impressed with how it turned out.

Above is the script used to call on the random page, it should look familiar from a few posts back. In other news the YahooIDs are integrated quite well, see the following fish page with the highlighted code calling on the tiled jewel background. Note the location bar address. Neat!

This also works if websites were spliced and diced between the YahooID and the neighbourhood address, as seen below, are linked inside the same page to each other. Again, note the location bar address.

nh-referencing-yahooid-2

For now that seems to be the end of the immediate project. Some things I would do differently a second time around:

  1. Merge with the command line instead of Finder in Mac OS. I could see that some files were not merging even though they didn’t exist in the destination.

That was a short list. I suppose a lot did go well. Things for future consideration include:

  1. Darknet compatibility via Tor, because why not. My main blog and retrosites as all darknet enabled, it would be interesting to get GeoCities on the darknet.
  2. Enabling caching for Apache to reduce server load. We’re on a mere Raspberry Pi 3.
  3. Cloudflare CDN for the same reasons as above.

I’ll upload a live stream for the final demonstration of how the site works a little later on this evening. But first, time to watch some Babylon 5 to celebrate. :beers:

1 Like

And there it is, the incredible era that spanned the mid-late 1990s that was GeoCities. I hope you’ve enjoyed the progression, I know I have. You now should have some idea on how to create a similar setup using a Raspberry Pi 3 and Ubuntu 20.04.

To those of you from the future looking to do exactly that I wish you the best of luck on your journey. For everyone else, enjoy the animated gifs, midis and horrendous web design. :stuck_out_tongue:

Try it out at http://geocities.mcretro.net/ if you haven’t already.

This has been Shane McRetro and I’m signing off for now!

:city_sunrise: :cityscape: :city_sunset:

2 Likes

I’d like to just personally say I am very proud of the work you have carried out during this challenge @ShaneMcRetro and the level of detail and efforts you went to document and dig deep into making this resource accessible and far more robust than others who have previously attempted to do so.
Great work!

1 Like

Hello @DigitalRampage, thanks for the kind words. I’d love to say it’s as complete as it could be but there’s always more work to do.

Other archives such as oocities.org and geocities.restorativland.org are great in their own respects as well. Then there’s archive.org - which is just absolutely massive. They also happen to own the shoulders I have been standing on to attempt this project.

And for what it’s worth, I’m still sorting through my unpublished notes trying to see what I could do differently and how I could potentially improve upon what’s already been done above. It sure is a learning experience!

IMG_0001

1 Like

This is something I was meaning to do a little while back - enabling darknet accessibility via a Tor v3 onion vanity address. I’ve been tinkering with my SSL certificates on my servers to try and get them working the way I want too.

So now if you happen to be on a modern computer running the Tor Browser (based on Firefox) and visit the HTTPS version of geocities.mcretro.net, you will notice a little purple banner in the address bar.

.onion available

Once clicked you end up with the very easy to remember:
http://geocitiesllczuf44da2nj45jn3fntdjpw27ercfbhkrw3mnegti7pid.onion

This link will not resolve in a clearnet browser, you need to use one that has tor capabilities (or a tor proxy service).

As you can see it is travelling through quite a complicated looking circuit.

The jump page mostly works. I noticed that on it tends to take you to the clearnet site but does provide the purple “.onion available” button again. I’m sure I’ll work it out eventually as to why it is doing that. I must have a “geocities.mcretro.net” hardcoded somewhere instead of just a “/”. Actually it might be because it uses the sitemap.xml files for jump. Ahhh yes. I don’t think I’ll change that, at least it provides darknet search indexers a bit more content.

Grab the Tor Browser and have a tinker with the technology, it’s quite interesting to setup. Then stop by http://geocitiesllczuf44da2nj45jn3fntdjpw27ercfbhkrw3mnegti7pid.onion/

Fixed darknet jumping.

If you visit https://geocities.mcretro.net in the Tor Browser, note the activation of the location bar. Onion-Location activates. This will not work if visiting on clearnet non-ssl site, http://geocities.mcretro.net, which is by design according to The Tor Project. Hope everyone has a good new year! :cityscape:

1 Like

Just for fun, here are one or two retro-computing relevant pages for you:

(Just possibly this should be a new thread?)

1 Like

Good finds! I moved the server from the Raspberry Pi 3 to the Raspberry Pi 4 this morning and in the process killed all the symlinks. This is why documenting the whole process was a great idea.

Apache (the web server software) was not happy with symlinks being broken. These are three very useful commands for debugging errors:

sudo tail -f /var/log/apache2/other_vhosts_access.log
sudo tail -f /var/log/apache2/error.log
sudo tail -f /var/log/apache2/access.log

They even show up maroon! How did I manage this, I had renamed the external drive from “External” to “geocities” to help differentiate it from another project I had activated in the past week or so - The Assembler Games Archive. All my old symlinks died as a result.

I ran a test to make sure I wasn’t going to wipe out the entire internet first - i.e. without sudo tacked on the front of the command I stole from the internet. It worked, it found all the correct symlinks to remove and them I did!

Here’s the command that brought them back to life. The neighbourhoods need to be inside the yahooids subfolder for everything to work nicely.

And we’re back. You’re right @EdS, we should make a new thread with retrocomputing relevant finds from GeoCities Archive.

Just when I thought everything was done and dusted, I received this message through the contact form on my website a few months back.

Dear Administrator,

We are researchers from INSC, Tsinghua University. In our recent research for dark web misconfiguration, we found that your website may be suffering from this security problem. Due to the Apache mod_status vulnerability, results from scanning show that at least 7 dark web domains (starting with geoc| chir| fish| pepp| retr| soni| spac|) and 2 surface web domains (http://geocities.mcretro.net | http://space.mcretro.net) are running on the same host.

This vulnerability is low risk on the Internet, but high risk on the dark web. It is recommended to fix it in time to avoid being attacked or having your information compromised.

And we are looking forward to your reply which is important to our ongoing research.

Best Regards,
INSC, Tsinghua University

I thought, that’s interesting. Naturally, I was curious as to what I had misconfigured. So I visited my sitemap domain as a test http://retrojunkie.net/server-status and was presented with this:

Works as expected on the clearnet, external IPs shouldn’t be able to see it. But onion domains use localhost (127.0.0.1) to do some funky routing. Let’s try the darknet version of the same site at: http://retro56qdosu3zuisr5tiuh7ihpn6owzpwjwvs7djlweiqb3mcspahyd.onion/server-status

Having a look around why this was happening resulted in me finding this great site explaining how it all works in detail. After a quick read I ran the following commands to try and solve the problem I didn’t know I had.

sudo a2dismod status
sudo a2dismod info
sudo apache2ctl configtest
sudo systemctl restart apache2.service

A quick refresh of the onion site…

And all is well… Well, not found because it isn’t supposed to be showing all the intimate server details and pages being loaded. This will probably be my last update on here for this project. I’ve made another copy of the work done above on my blog available at https://mcretro.net/the-geocities-rebuild-worklog/ in case anything happens to this site.

I’ll try to keep both updated if anything else exciting happens. Thanks for reading, I hope you’ve enjoyed the saga! :white_check_mark:

2 Likes