Portable All-in-One GeoCities Web Server

ShaneMcRetro · 8 October 2022 11:09

My goal is to create an all-in-one GeoCities Apache web server that fits in the palm of your hand. RetroChallenge 2021/10 was a test run, and it went darned well. We need our retro hardware, 386/486 and newer, to be able to revisit the internet of our past.

Goal #1 - Web Server Activated 2022-10-29
A portable GeoCities server that fits in the palm of your hand powered by a Raspberry Pi. Self-contained, just add power and ethernet.

Goal #2 - Data Rebuild Completed 2022-10-25
GeoCities will need to be rebuilt, again, to make this work. The previous work done in the RC2021/10 lost unique data because of the way the data was merged. We can rebuild it. We have the technology.

Goal #3 - Hardware Acquired 2022-10-10
Source era-accurate hardware and software, Ubuntu 12.04. This will help us with merging data, thanks to work done previously by despens.

Stretch Goal #1 - Onions Activated 2022-10-30
Onion integration allowing GeoCities to be available on the darknet, browsable with Tor Browser. Helping to light up the darknet; it doesn’t have to be a spooky place.

Stretch Goal #2 - Dialled in 2022-10-31
Four localised (i.e. no remote dial-in capabilities) dial-in ports for accessing the GeoCities web server as well as the rest of the web at extremely slow speeds (≤ 28.8kbps). Who needs the rest of the web when you have GeoCities though!

I expect to have the three main goals achieved by the end of the month and the stretch goals by the end of December. I guess I had better start finding some suitable hardware to run Ubuntu 12.04.

ShaneMcRetro · 9 October 2022 07:46

First up, I’ve ordered hardware to help with the later parts of the project, more on that in a future post. I’ve also secured a loan machine that is from the correct era to suit Ubuntu 12.04. That likely solves goal #3.

However, as these things go, I’ve had some issues straight off the bat. I have a 2TB Western Digital Green drive with a frozen copy of the data I need. It was all handled and manipulated on Ubuntu 22.04. It does not read on Ubuntu 12.04 because of the ext4 metadata checksum feature (metadata_csum) used on newer versions of Ubuntu.

The good news was, using the above commands on Ubuntu 22.04 I was able to disable the metadata checksumming feature of ext4 and mount the drive properly on Ubuntu 12.04 after a few commands to fix the journal superblock.
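For reference, disabling the feature usually comes down to something like the sketch below, run on the newer system with the filesystem unmounted. The device name /dev/sdX1 and the function name are placeholders of mine, not the exact commands used here, and the journal superblock fix on 12.04 isn’t covered.

```shell
# Sketch only: clear the ext4 metadata_csum feature on an UNMOUNTED filesystem.
# Wrapped in a function so nothing runs until it is called with a real device.
disable_metadata_csum() {
    dev=$1
    sudo e2fsck -f "$dev" &&                  # force a full check first
    sudo tune2fs -O ^metadata_csum "$dev" &&  # clear the feature flag
    sudo dumpe2fs -h "$dev" 2>/dev/null | grep -i 'features'  # show what is left
}

# usage: disable_metadata_csum /dev/sdX1
```

If metadata_csum no longer appears in the features line, the older system at least won’t trip over that particular flag.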

Eventually it performed a filesystem check and everything turned out OK, at least as far as I could see.

The drive mounted properly through the GUI, which is promising. Browsing some folders for data that looks intact… Yes! Look at that, the data is still in there, phew!

Next up I’ll clone the master (frozen) copy to this new (software) RAID0 I’ve configured for maximum speed when working with the data.

I’m still not sure if this is the best way or whether I should just work from the 2TB SSD I have specifically for this project. Let’s see what happens overnight.

Go team 1TB Western Digital Blacks! Appropriately named GeoRAID

ShaneMcRetro · 10 October 2022 04:07

It says it cloned, but there sure is a lot of weirdness going on there that I was not expecting. I’d also not realised that it was using 2.0TB instead of 1.8TB, the old base 2 vs base 10 dilemma, but that shouldn’t have affected the cloned data from the frozen drive. Interestingly, it took just over six hours to give me nothing. Well, maybe not nothing: I now know that the RAID0 route isn’t for me. The two WD Black drives have been pulled. New setup!
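The 2.0TB vs 1.8TB gap is just units: drive makers count in powers of 1000, while many tools count in powers of 1024. A quick sketch of the arithmetic, assuming a nominal 2 × 10^12 byte drive (the function name is made up for illustration):

```shell
# Convert decimal terabytes (10^12 bytes) to binary tebibytes (2^40 bytes).
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1e12 / 2^40 }'
}

tb_to_tib 2   # prints 1.82
```

Same drive, same number of bytes, two different labels.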

Configuration:
Bay 1 - Boot drive. 250GB Samsung 850 EVO (SSD)
Bay 2 - Frozen drive; not to be modified. 2TB WD Green (Rotational)
Bay 3 - Work drive. 2TB Samsung 870 EVO (SSD)
Bay 4 - Backup drive. 2TB Hitachi (Rotational)

Hmmm, I’m still getting odd output on dd. Let’s sudo su, shall we?

Looking much better. Another six hour wait is needed for the clone from frozen drive to the work drive. While waiting for that, I can explain a little bit more of the hows and whys of this project.

One of the reasons for targeting a ten year old version of Ubuntu is that I discovered this particular archive created by despens. It’s the same thing I was trying to do last year but has a lot of automation built-in via SQL databases and perl scripts. It sure would have saved me a lot of time, so I thought I should start from scratch and try to adapt it to Mac OS or at least see if it still worked on a newer version of Ubuntu.

Earlier this year, I forked the above GitHub project and tried to adapt it; first for Mac OS, and then for Ubuntu 22.04. Mac OS character and path length limits, ~1024 characters, resulted in showstoppers when dealing with perl, which left me with Ubuntu 22.04. It had other quirks and did some things slightly differently. Software packages have had bugs fixed and enhancements added over the past ten years, and that seemed to be causing errors. So many errors.

So what to do and where to go? Anyone here ever done the time warp?

I was able to secure the loan of a functional Mac Pro (Early 2009). This places me a lucky 13 years back in time and lines up with the era despens worked in on the initial GitHub repository, so I should have higher chances for success. Cross Goal #3 off the list!

That’s not to say Ubuntu 22.04 was a complete bust. I successfully completed up to and including step 004, which was mostly downloading and decompressing the files. But when trying to run through all remaining steps, the merging and ingesting failed. That’s what we’ll get working this time around.

Right now, we are cloning the step 004 complete (pre-005) drive (a.k.a. frozen) with dd at block-level which should take around six or so hours. If we were to use rsync (file-level), I believe it would take around 60 hours from memory.
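As a rough sketch of the block-level clone, assuming GNU dd and placeholder device names (the function name is mine too). Ubuntu 12.04’s dd is too old for status=progress, so this version stays quiet until it finishes:

```shell
# Copy every block from source to target; both arguments are placeholders.
# Double-check the target: dd overwrites it without asking.
clone_blocks() {
    src=$1
    dst=$2
    # bs=4M moves data in large chunks; conv=fsync flushes to disk before dd exits.
    dd if="$src" of="$dst" bs=4M conv=fsync
}

# usage (as root): clone_blocks /dev/sdX /dev/sdY
```

Because it copies blocks rather than files, the time taken depends on the size of the drive, not on how many millions of tiny files live on it.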

That won’t be the end of cloning things though. Once the work drive is ready for action, I’ll make another clone from the 2TB WD Green (frozen) to the 2TB Hitachi (backup) for redundancy. Then the real work will begin on the 2TB SSD work drive as we resume at step 005. Simple but elegant, it will merge geocities.com into www.geocities.com.

Tomorrow, I’ll check back in and hopefully report that the clone completed without incident. If that’s the case, I will remove the frozen drive from the Mac Pro as there’s no reason to keep it online. Stay tuned and fingers crossed!

ShaneMcRetro · 11 October 2022 10:07

Clone successful! The data is now unfrozen. We now have two copies of the decompressed data. Interestingly, the “Disks” application didn’t offer the option to mount the newly cloned drive until the machine was restarted. This makes me wonder if perhaps my RAID0 was actually OK… I might revisit that later if we have spare time, as a curiosity more than anything.

I’ve pulled the master frozen drive (2TB WD Green) to keep it safe while we edit the work drive (2TB Samsung 870 EVO SSD). Before doing any modifications though, a backup is in order again. You really can’t have enough backups when working with this much data. We’ll clone the work drive to the aptly named backup drive (2TB Hitachi).

This will need to run overnight. We’ll check back again in the morning. If all is well we should be able to start some of the scripts to start the merging of data. Exciting!

EdS · 11 October 2022 10:51

Well done! I always get very nervous about large datasets, and backups, and the importance of knowing which is which.

ShaneMcRetro · 11 October 2022 11:09

Thanks @EdS! The amount of data being merged is way bigger than anything I deal with on a day-to-day basis. I’m very interested to see how the amount of files and end data size compare to the more manual method I used in the last RetroChallenge. If it’s exactly the same, I might just cry!

ShaneMcRetro · 13 October 2022 12:17

It’s been a crazy couple of days. Let’s see if I can recap some of the major events! For starters, all this has now turned up. The necessary gear for a self-contained GeoCities…

I’ll have to look into setting this hardware up while running data crunching on the Mac Pro. It will need to be configured a little closer to when we have the data ready I think. Probably close to the last week of October. I’ll pop this together in the coming days and give it a bit of a stress test to make sure it’s working as expected.

On the topic of data crunching, step 005 has completed. This merged the folders geocities.com and www.geocities.com together.

Right now, 006 is running overnight. This merges the data from the YAHOOIDS folder into the www.geocities.com folder. 006 is one of the longer-running scripts and will give an idea of how well the Mac Pro performs when splicing all the duplicates and merging existing data in future steps. I’m looking at you, 009!

Thrown in above are the variables of my ~/.bashrc file, simply matching what is on the GitHub at this point. The database variables aren’t needed until step 009. I’ll need to remember how to configure that too. Unsurprisingly, SQL is not my mother tongue!

ShaneMcRetro · 16 October 2022 12:03

Finally got around to assembling the Argon One M.2 with the Raspberry Pi 4 as the data crunching is still, kind of, running. Look at the 1TB chip. It’s just one chip. Probably a microSD card taped underneath.

Next up we have the Argon connection to the Raspberry Pi 4. Look at that breakout. That’s how you do breakout. Goodbye micro-HDMI, hello full-sized HDMI. This server will run headless but it’s nice to have those sorts of ports available.

Fully assembled from the front, looking good!

And from the rear. The M.2 SATA board interfaces over one of the USB 3.x ports. I have a feeling this might cause problems later on with super high speed devices (i.e. USB 3.x, at 5Gbps) causing USB lockups. It reminds me of an issue I may have visited in RetroChallenge 2021/10 when trying to use USB 3.x external drives. Time will tell!

On the topic of the database ingest scripting - I’ve managed to get stuck on step 009 with database permission issues. I might have missed a big chunk of sleep sometime midweek which really threw off my ability to think clearly. This came to be when I was playing with the web hosting. If this set of commands solves my hold up on step 009… I’ll be chuffed! Need to try running this in PSQL:

CREATE USER 'despens'@'localhost' IDENTIFIED BY 'despens';
CREATE DATABASE Geocities;
GRANT ALL PRIVILEGES ON Geocities.* TO 'despens'@'localhost';
FLUSH PRIVILEGES;
exit;

I usually use MariaDB for my web server needs. I had a look into why PostgreSQL is used instead of a lighter weight SQL database solution. It turns out:

PostgreSQL outperforms MariaDB in regard to reads and writes and is therefore more efficient. MariaDB is more suitable for smaller databases, and is also capable of storing data entirely in-memory — something not offered by PostgreSQL.

Of course, I realise this is the problem on the last day of the weekend after I have left where the Mac Pro is. We’ll find out more about that in the coming days.

I’ve also run some stress tests on the new RPi4 to see how toasty warm it gets with the M.2 SATA board in there. After four hours of stress testing the CPU, we ended up with ~55ºC. The stress testing package is:

Stress-Terminal UI, s-tui, monitors CPU temperature, frequency, power and utilization in a graphical way from the terminal.

It’s almost too good to be true. I had always wished for a better way to run stress tests on new hardware and now I have it. Not quite sure how I’ve missed out on finding it until now. Either way, it certainly makes the fan run!

ShaneMcRetro · 17 October 2022 09:39

Firstly, the database works. It turns out PostgreSQL (psql) is quite different from what I have become used to in MariaDB. That’s alright, I got there. Although… I did notice I may not have installed the dependencies required in the Perl scripts. Oh boy, yet another language I don’t speak at all. Kind of brings up the question, have any of the past scripts silently been failing? To test the code I was running the scripts line by line:

$GEO_SCRIPTS/ingest-doubles.pl $GEO_LOGS/dir-index.txt

The above line was giving me trouble in the image at the top of this post. Imagine my horror when I realised it was missing the $GEO_LOGS section on the official 009 script! So that might help us move forward again. I must have entered that while testing to see if I could get the script to progress past the errors.

psql -d $GEO_DB_DB -f $GEO_SCRIPTS/sql/create/doubles.sql

This line in the 009 script probably prompted me to add it. $GEO_LOGS being referenced by psql but not perl? That can’t be right, thinks me at 3am, and figured that perl would want the same variable passed onto it. Turns out, it doesn’t! I’ll check tomorrow but am pretty confident that is the reason for that block.

Now dependencies are probably going to be a great help too. I’ve been reading up on the two Perl versions referenced in the scripts, we have 5.12.0 and 5.14.0. Perl 5.12.0 seems to only be used in script 004-normalize-tracking.pl. All other scripts use 5.14.0, which is good as it includes a lot more “core” modules, i.e. included inside Perl itself.

Now I know I installed a few Perl modules using CPAN, apparently this is a no-no unless you want trouble. If possible, it is recommended to stick with the packages made available for the distribution of Ubuntu, in my case 12.04, being run. Some of the “core” modules can still be updated to add new features while being able to run on older versions of Perl. It’s messy. Here’s a list of the core modules present in both 5.12 and 5.14:

warnings
diagnostics
strict
File::Find
File::stat
utf8

Pretty neat. So these will work regardless of what I do. As long as I have Perl installed these will run. The following are modules that are not part of core Perl in either 5.12 or 5.14 (Data::Dumper is also used by the scripts, but it ships with Perl itself, so it needs no extra package):

DBI
IO::All
Try::Tiny
XML::TreePP
DBD::Pg

This is the command that should do the trick at the Ubuntu command line for the above required modules:

sudo apt-get install -y libdbi-perl libio-all-perl libtry-tiny-perl libxml-treepp-perl libdbd-pg-perl
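Before running anything, it can be worth confirming that Perl can actually load the modules. A small sketch (the function name is made up; it just asks perl to load each module and reports the result):

```shell
# Report whether each named Perl module can be loaded by the system perl.
check_perl_modules() {
    missing=0
    for mod in "$@"; do
        if perl -M"$mod" -e 1 2>/dev/null; then
            echo "ok       $mod"
        else
            echo "MISSING  $mod"
            missing=1
        fi
    done
    # Non-zero exit status if anything failed to load.
    return "$missing"
}

# usage: check_perl_modules DBI IO::All Try::Tiny XML::TreePP DBD::Pg
```

A non-zero exit status means at least one module is missing, which is a lot easier to spot than a script quietly falling over halfway through.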

The following modules are only built in to the later 5.14 and are not present in 5.12. They are not used in the 004 script so we don’t need to worry about them but here they are anyway for reference:

threads
threads::shared
Encode
Cwd qw(abs_path)
YAML qw(LoadFile)
Digest
File::Path qw(make_path)

The next few commands may or may not be Ubuntu 12.04 compatible. Actually all the ones above may not be 12.04 compatible either! Guess I’ll find out tomorrow. These need to be run, if they haven’t already:

sudo apt-get install build-essential p7zip-full convmv postgresql libpq-dev libglib2.0-bin

sudo apt-get install convmv findutils

And hopefully the database will start to fill up with more data than just the username and database name tomorrow. Fingers crossed!

ShaneMcRetro · 18 October 2022 12:15

Words fail me. It’s actually running. Previously we’d get stuck on the highlighted text in ingest-doubles.pl at printing the asterisk. Now you can see the asterisk happily being printed in the terminal window at the back of all the windows. It’s committing records to the database.

The “line 69” error that was being encountered is gone. In order to fix it, I blew away the entire Ubuntu 12.04 install and started again. Following my install steps from above for dependencies, I had installed some Perl modules not through Ubuntu’s package manager (apt-get) but through CPAN. This was a big mistake.

Sure, the window server seems to crash if I monitor the stats via the System Monitor GUI. That might be caused by a lack of proprietary NVIDIA drivers, which I haven’t quite installed yet. I also had some issues with setting up psql (again), with my notes not quite working, but I eventually pushed on to work out the configuration. It was a little different as I didn’t choose the Ubuntu user to be named “despens”, but “ubuntu” instead.

To avoid crashing the system, we now use top instead. Postgres is running crazy on one of the CPU cores. Lucky for me single core processing is where this machine excels. When I left it, we were stepping through script 009 to check for any errors. So far so good. This is the best condition I’ve seen the script in to date. If all goes well I’ll have data in the terminal window in the background.

To shake things up a little, I added a database named turtles - instead of Geocities and everything just started working. Are the turtles watching over me, keeping this database safe? It’s the only thing that makes sense.

I’ve also been using pgAdmin3 (via GUI) in cahoots with psql (via CLI) to get an idea of what I am doing when I drop a database or create a table or forget to give the user any permissions. It has been extremely helpful in visualising the database structure. To install it just run the following:

sudo apt-get install pgadmin3

The final step in the 009 script is to run a ~18-24 hour script - a lot to ask for a database that wouldn’t even connect yesterday. After this there is one other large processing step, step 012 - the ingest.

Here are the environment variables I was using in ~/.bashrc

It’s important to log in as postgres to edit things, via sudo su - postgres. You’ll note the prompt change from ubuntu@ubuntu-MacPro:~$ to postgres@ubuntu-MacPro:~$. This allows psql to access postgres, the superuser for psql. We then create a new role for the despens user and verify the user has been created with the same permissions as the superuser postgres.
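In psql terms, that step might look roughly like the sketch below. The role name and password follow the earlier attempt, and the database name is the turtles one; the helper function is hypothetical and only prints the SQL, so it can be reviewed before being piped into psql as the postgres user:

```shell
# Print the SQL to create a superuser role and a database it owns.
# \du at the end asks psql to list roles so the new one can be verified.
geo_role_sql() {
    role=$1
    db=$2
    cat <<SQL
CREATE ROLE $role WITH LOGIN SUPERUSER PASSWORD '$role';
CREATE DATABASE $db OWNER $role;
\du
SQL
}

# usage: geo_role_sql despens turtles | sudo -u postgres psql
```

Note there is no 'user'@'localhost' form and no FLUSH PRIVILEGES here; those are MySQL/MariaDB habits that psql does not understand.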

Additional work was needed on two files, followed by a service restart, to let the scripts connect to the database as the despens user. The first is /etc/postgresql/9.1/main/pg_hba.conf, edited as above. This lets everything connect, however it wants. That’s fine. This isn’t going to be a production machine with a database open to the internet and subject to hackers.

The second file, /etc/postgresql/9.1/main/postgresql.conf, needs to be edited to allow listening on all addresses. Once done, a simple…

sudo service postgresql restart
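For anyone wanting to reproduce the two edits, they might look roughly like the fragment below. This is an assumption about a wide-open test setup matching the description, not a copy of the actual files, and trust authentication should never go near a machine that faces the internet:

```conf
# /etc/postgresql/9.1/main/pg_hba.conf
# TYPE  DATABASE  USER  ADDRESS      METHOD
local   all       all                trust
host    all       all   0.0.0.0/0    trust

# /etc/postgresql/9.1/main/postgresql.conf
listen_addresses = '*'
```

The trust method skips password checks entirely, which is what lets everything connect however it wants.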

And we’re creating tables with ease. The table “doubles” didn’t exist but now it does and pgAdmin3 confirms it!

You can also see database creation through psql. It’s amazing to see it all come together. Of course, once this is up and running it can be improved upon further - as all things can. I still consider this to be at a testing stage as I am still learning how the systems in these scripts run.

After RetroChallenge 2022/10 is completed, I would like to start from step 001 to make sure I capture as much of the extra data added to the GeoCities archive after the core torrent was released. It would be great to modernise it to a newer Ubuntu system, such as 22.04, but I need to make sure I have the steps as correct as possible in 12.04 first. Regardless of what is done in the future, we’re moving forward now and that’s where we need to be!

ShaneMcRetro · 19 October 2022 11:22

This failed. The tail should have given me a list of duplicates. The good news is…

I ran it again and it yielded a 72.6MB text file! It looks like the crashing of the window server was bad for whatever was processing and stopped something running the way it should. I’m only monitoring system stats via top for the time being so we’re only drawing the desktop, terminal windows and a text editor. Maybe I really should install that driver.

Looking inside the sorted text file hits us with the above. As you can see, case is all over the place. 009-case-insensitivity-dirs.sh just so happens to be the script I am stepping through. That’s what we’re trying to fix.

This is the step we are now up to and it’s the big one. I am not expecting it to complete until around lunch tomorrow. We’re not even halfway there yet!

This is what the terminal script looks like, very busy and zips by at a thousand miles an hour. I am more impressed that this is working. I’ve never had it progress this far before with so few errors.

Clown. clown. Clown. clown. Watch that crazy recursive case sensitivity combined with symlink cancer. It has spent around 4 hours processing just Clown alone. The content better be amazing!

Around lunch tomorrow, step 009 will be but a memory and we can check how much of the archive has been trimmed out. Remembering we need less than 1000GB to fit on the internal drive inside the Raspberry Pi. As it stands we are just over 1000GB. Never give up, never surrender.

ShaneMcRetro · 21 October 2022 22:57

Script 009 completed! That means that all the directories will now have their case sensitivity sorted as best can be.

Next up is an unexpected reference to turtles in script 010. The good news is, it’s a very fast script and completes quickly.

It removes files and folders that were downloaded in recursive loops. As you can see in the background there’s a lot of recursion going on. Now they are gone and we can move on to script 011.

ShaneMcRetro · 22 October 2022 02:50

Script 011 begins. It looks to sort the case sensitivity at a file level now. We’ve already sorted it at a folder level, so it makes sense to do the files as well.

An important consideration to note is that when configuring the web server it can be set to ignore case. If you click a link that has TURTLES.txt but the file is actually named Turtles.txt or turtles.txt it will select whichever one is present, and ignore the case. It’s an Apache module. That’s still a while off, so we won’t worry about that just yet.
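For when the time comes: the module in question is most likely mod_speling (my assumption, since it isn’t named above). A minimal sketch of the Apache configuration, with a placeholder directory path:

```apache
# Enable first with: sudo a2enmod speling
<Directory "/var/www/html">
    # Try to correct mismatched URLs instead of returning a 404...
    CheckSpelling On
    # ...but only fix upper/lower case, not other misspellings.
    CheckCaseOnly On
</Directory>
```

With CheckCaseOnly on, a request for TURTLES.txt can be redirected to Turtles.txt without Apache also guessing at unrelated filenames.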

Postgres loves to eat up my single core of CPU. Database stuff always has me on edge. You might notice I added a “-U despens” to the command that is called upon. If I don’t do this, it will think the user is my Ubuntu logged in user, which is aptly named “ubuntu”. If I kept the name at despens like on the first run, this would have been a non-issue - I guess I had to know what would happen!

Here’s an example of some of the output we get when comparing case between files. When GeoCities was “downloaded” en masse the crawler would just follow whatever link it could find. So you end up with LION.gif and lion.gif, but a million times over several million files.

Above shows the 011 script at the end. This is an example of the file it saved. There were three possibilities and it picked this one. What was its reasoning for picking this one? I do not know. That’s all buried in perl scripting which I honestly can’t read - but that’s OK, I don’t have to as the file survived.

But what is the database up to? Last time we checked it was around 249MB in size. Now it’s a whopping 2168MB. I don’t know what the numbers above the index size are but I can speculate.

30,689,439 / 30,739,475 - Last time I did this we had around 33,000,000 files when performing file level clones. So that’s likely the number of files. I haven’t come across a number like 122,916,135 before so I can’t even guess as to what that might be.

Before we head on to script 012, I thought I’d have a look at how much we have culled off. We’re now at 47% of 2TB used, which puts us under 1TB - good. What we have will fit on the 1TB web server when it is set up. With script 011 out of the way, we’ll next start stepping through script 012 - and it’s a big one!

ShaneMcRetro · 22 October 2022 09:12

Script 012 begins. This one is big. The first stage of the script moves the files from the work directory. Very quick, in the blink of an eye. The next step will run overnight at the very least.

Next is database ingesting! Make sure the tables ‘files’, ‘urls’ and ‘props’ are ready. See the ‘sql’ directory.

Before starting this script, if not stepping through, ensure that the commented out notes above are followed.

You’ll notice that GeoIngest.pl and postgres are both hard at work for the first stage. The hardware despens was running in 2012 shows the real time as being 1055 minutes, with real representing actual elapsed time. That’s nearly 18 hours! I’m on a quad core CPU running a single core. I’ll check back in tomorrow with how it all went.

ShaneMcRetro · 23 October 2022 12:58

Uh oh, we hit an error! Is that a question mark in a triangle in the filename? It’s an encoding issue! I thought we fixed all those in script 007. A closer look and we can see it was this file:

/media/Geocities2TB/archive/archiveteam.torrent/www.geocities.com/TimesSquare/corridor/1041/exercecio.gif

Looking at the 007 script I am using, we can see that exact file should have been dealt with.

Wait a minute, is that a typo? There’s a typo! It must have failed silently too. It should have read -t utf8 not -tf8. Not present on the original file from despens. Remember I forked this from here as it offered some updates to the code for Ubuntu 22.04. I guess the point is moot as I’m back on Ubuntu 12.04 now.
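Since the bad flag failed silently, a check after the renaming step would have caught this much earlier. Here is a sketch of one way to list leftover names that are not valid UTF-8; it is not part of the despens scripts, the function name is hypothetical, and it is slow across millions of files, so it is best pointed at a suspect directory:

```shell
# Print every path under a directory whose name is not valid UTF-8.
list_non_utf8_names() {
    find "$1" -print | while IFS= read -r path; do
        # iconv fails on byte sequences that are not valid UTF-8.
        if ! printf '%s' "$path" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$path"
        fi
    done
}

# usage: list_non_utf8_names /path/to/suspect/directory
```

Empty output means every name under that directory decodes cleanly; anything listed still needs a pass through convmv with the corrected -t utf8.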

A quick edit and it’s now UTF-8… but what of the database. It would still have the old named file present. We are going to have to speed run through the scripts again. Here we go!

ShaneMcRetro · 23 October 2022 13:20

Alright, a bit of a clean up. For starters I’m installing the graphics drivers, something I should have probably done when I installed the OS. That should help with stability.

We’re back up to script 012 and ran the script in full, not step by step. We ended up with the following at the end:

Is that just a warning? Maybe if we look at the Perl file it will help guide us.

Nope that doesn’t help at all. In that case, let’s just run it again.

Insanity is doing the same thing over and over and expecting different results.

It ran successfully. Yep that about sums up the above quote.

Postgres seems to be doing something involving the INSERT command and has been running a while. Most likely inserting database stuff.

14 hours later, it’s still running. We’re on a different stage now though, looks like GeoIngest.pl and all those psql commands have finished and it’s now onto GeoURLs.pl.

The console window is spewing out addresses, and occasionally spits out the following:

No option but to let it run for now. I’ll dig into what it’s doing in the next post, assuming it is still running. But it is on the last step of script 012. Then we have 013 and we’re done with the processing. I’m starting to think manually processing this was a little quicker… then again, I have taken the long route!

ShaneMcRetro · 26 October 2022 11:50

012-ingest has completed after nearly four days!

Which leads us to the final script 013-indexes. This script only has two parts, the psql database and a Perl script. Simple compared to the others, should be easy. Let’s step through it one command at a time.

The psql command appears to have worked. And we can see our file sizes, which look good: well under 1TB no matter which way you calculate it (base 2 or base 10).

Look, an error on the second Perl script, how semi-expected! It seems to be referencing the last line of the two line script. This means the first half ran OK, but filter-indexes.pl is failing on line 25, which we can see requires a YAML module. Now I thought I had all these sorted and YAML was built into Perl 5.14. But maybe it isn’t. How to fix it? Very easy!

sudo apt-get install libconfig-yaml-perl

Rerun the same script and…

Voilà!

A little while later and it’s committed all the data. This means we are done processing all the raw data. Now it’s time to clone it off this 2TB drive to a 1TB drive as an intermediate step. This will then allow us to move it onto the Raspberry Pi M.2 SATA storage. First, some sanity checks.

Final file sizes look good and are ~6% less than the 1TB we require.

A file listing inside Area51 shows everything looks to be good titlecase-wise and what was expected to be there - folders, lots of folders. Trying to open the main GeoCities folder in the file system causes it to lock up for a looooong time. There were many force quits. Next we copy off to the 1TB from the 2TB, to free it up for other projects. However… the target drive has this showing!

We’re going to ignore it for now but keep an eye on it after doing a full transfer to ≥80% of the drive. The command we are running is:

rsync -av '/media/Geocities2TB/archive/archiveteam.torrent/geocities/www.geocities.com' '/media/GeoFinalclone'

We’ll check back in with how it goes soon!

ShaneMcRetro · 27 October 2022 11:34

It finished the rsync. But I forgot the international sites…

How big can it be?

Only an extra 4%; that’s only another ~40GB, not too bad. 89% of the 1TB is used up - which reminds me…

That’s good news too, the reallocated sector count hasn’t increased on the starting to fail SSD so the data should be good. In scripts we trust!

I’m not sure how du -hs interprets sizes differently to df -H and df -h but at least the two datasets are the same size. That’s what counts here as we’re moving house from a 2.5" SSD to an M.2 SATA blade inside the Raspberry Pi. I’ll use dd for the clone.

sudo dd if=/dev/sdb | pv -s 932G | sudo dd of=/dev/sda bs=4M

Look at those USB2 speeds, thankfully though, it only took around seven and a half hours. Next is the first and most important goal - getting the data ready for the web. There’s nothing more fun than configuring Apache. I can use my old notes to a degree, we’ll be using a proxy server so the Raspberry Pi can be portable. I’ll report back once I get that up and running.

ShaneMcRetro · 28 October 2022 21:49

Progress! The main web server is now pointed at the new Raspberry Pi that will be hosting GeoCities. At this stage I haven’t configured anything on the GeoCities side.

Hello world indeed, the proxy server is passing on requests successfully. It now acts as though it were the one facing the internet directly. The foundations are laid.

This is something that bugged me last time: case sensitivity on a non-case-sensitive file system when mounted as an SMB share. This time around, as you can see, I have three folders, and there are no issues. How was it achieved? This post helped immensely.

fruit:locking=netatalk
fruit:metadata=netatalk
fruit:resource=file
streams_xattr:prefix=user.
streams_xattr:store_stream_type=no
oplocks=no
level2 oplocks=no
strict locking=auto

case sensitive=yes
preserve case=yes
short preserve case=yes

That did the trick, I should roll that out on all my servers. Those last three are the case sensitivity and the first bunch are all to do with better compatibility with Apple devices - which is what I happen to use as my daily driver.

After fixing some permissions to enable www-data as the owner and group of the newly copied files, I tried to load the directory listing of the www.geocities.com folder. It never loads, the directory listing is just far too big.

So I dusted off some old code and fixed some broken links. The basic front-end is looking good even if none of the links to the actual GeoCities data work yet.

I mean it looks really good with those icons. Now how do I get the GeoCities data in the right place? Last time I used three folders root (for storing the html above), core (for storing the core neighborhoods) and yahooids (for all the user folders, there were so many…)

By using the despens scripting, most of this is all in the one folder www.geocities.com. I could try splitting the data up, but then again why would I need to do that?

Of course that didn’t stop me trying to symlink the GeoCities folder contents, which is massive, into the main html folder. This failed. Bash didn’t want any of it.

Argument list too long

This meant that while the data was there, it was in the wrong places. Back to the drawing board, I need to work out the best way to structure this before I can continue any further.
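For what it’s worth, that error comes from a single command being handed more arguments than the kernel will accept, presumably a wildcard expanding to every folder at once. One way around it, sketched below under the assumption of GNU find and ln, is to let find pass the entries to ln in batches. The function name is hypothetical, and the source path should be absolute so the links resolve:

```shell
# Symlink every direct child of a source directory into a destination directory.
# find batches the names itself, so the argument list never grows too long.
link_children() {
    src=$1    # absolute path to the directory holding the real data
    dest=$2   # existing directory that will receive the symlinks
    find "$src" -mindepth 1 -maxdepth 1 -exec ln -s -t "$dest" {} +
}

# usage: link_children /absolute/path/to/www.geocities.com /var/www/html
```

The “{} +” form tells find to pack as many names as will safely fit into each ln call, then start another call for the rest.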

ShaneMcRetro · 29 October 2022 07:52

Now, to sort out the directory layout, I’ve realised the best solution is likely the most simple. As we want all the neighbourhoods and YahooIDs to be in the root, let’s just put them in the root. Apache2 has been reconfigured to recognise the folder www.geocities.com as the root folder.

I’ve now folded all data that was in the same directory as www.geocities.com into it as well. This will allow images to load properly when we use mod_substitute to make www.geocities.com queries into geocities.mcretro.net queries.
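The mod_substitute side of that could be as small as the fragment below. This is a sketch of the general idea rather than my final configuration; where it lives (virtual host or directory block) and the content types it applies to are assumptions:

```apache
# Enable first with: sudo a2enmod substitute
# Rewrite archived absolute links as pages are served.
AddOutputFilterByType SUBSTITUTE text/html
# n = treat the pattern as a fixed string, i = ignore case
Substitute "s|www.geocities.com|geocities.mcretro.net|ni"
```

The files on disk stay untouched; only the HTML sent to the browser is altered.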

This was another oddity I was getting and it appears to be related to Cloudflare popping some tracking scripts in. We don’t want that in there while testing.

Flip the switch and the geocities subdomain is no longer proxied through Cloudflare. I’ll reactivate this later if need be.

We were still getting 404 errors in the apache2 logs on the server. Looks like I forgot to mount the M.2 SATA drive after a reboot… I really should edit fstab. The joys of working things out late at night!
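An fstab entry along these lines would take care of the forgotten mount. The UUID is left as a placeholder to be filled in from blkid, and the mount point and filesystem type are assumptions:

```conf
# /etc/fstab
# <device>               <mount point>    <type>  <options>        <dump> <pass>
UUID=<uuid-from-blkid>   /mnt/geocities   ext4    defaults,nofail  0      2
```

The nofail option lets the Pi finish booting even if the drive is missing, which is kinder on a headless box.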

I did get optional SSL enabled, this is great for better ranking on Google… Search Engine Optimisation (SEO) and all that junk. The good news is that old browsers (e.g. Internet Explorer 4) won’t try to use it if it can’t work out the handshake ciphers.

We’re using the old sitemap.xml files from the RC2021/10 because the content hasn’t changed - just the way it is organised internally. As far as the end user, the website visitor, is concerned everything is the same as before. However the jump scripts are broken, so I’m still looking into that but should have it worked out in the next few hours.