@justjanne Thanks. I've been with Hetzner for about 2.5 months now, using their Germany location, and I'm actually quite happy so far.

@rmrubert @fribbledom I've heard various complex theories suggesting he's actually an evil genius disguised as an idiot, but I still believe the simplest explanation, which is that he is an idiot. That doesn't mean he isn't surrounded by other less idiot-like people, e.g. shady SEO people directing him to do things like talk about cheese to hide the lockdown cheese and wine, or model buses to hide the Brexit bus.

So I've found out where 99.999% of the daily search requests are coming from: it seems my search engine has been added to an SEO tool, and the searches are "scraping footprints" executed by spammers looking for URLs to target. I'm trying to block them, but it feels like a losing battle, with 157,587 spammer requests on 10 May, and a battle with nothing to gain by winning, given that daily searches by real users are still in the single digits after 2 years. More info at blog.searchmysite.net/posts/al

I'm seeing a vast amount of traffic (99.999%?) to searchmysite.net coming from what looks like some other search service. The requests have different IPs and user agents and random-looking queries, so I'm not convinced they are bots. The referrer isn't set though, so I don't know the source. I've stopped returning results for those requests, which takes some of the load off the server, but I'd like to try to find out what is going on. Any ideas on how to trace the source?
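
One way to start tracing traffic like this is to summarise the web server's access log by IP, user agent and query string, and see what the requests without a referrer have in common. A rough sketch, assuming the log is in the common "combined" format and that the search term is passed in a `q` parameter (both are assumptions, as are the function name and the made-up sample lines):

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Combined log format (assumed):
# ip - - [time] "METHOD path HTTP/x" status size "referrer" "user agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarise(lines, query_param="q"):
    """Count requests per IP, user agent and search query (hypothetical helper)."""
    ips, agents, queries = Counter(), Counter(), Counter()
    total = 0
    no_referrer = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        total += 1
        ips[match["ip"]] += 1
        agents[match["agent"]] += 1
        if match["referrer"] in ("", "-"):
            no_referrer += 1
        # parse_qs decodes "+" and %xx escapes in the query string
        params = parse_qs(urlsplit(match["path"]).query)
        for query in params.get(query_param, []):
            queries[query] += 1
    return {
        "total": total,
        "no_referrer": no_referrer,
        "ips": ips,
        "agents": agents,
        "queries": queries,
    }


# Made-up sample lines using documentation IP ranges, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/May/2022:10:00:00 +0000] "GET /search/?q=powered+by+wordpress HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/May/2022:10:00:01 +0000] "GET /search/?q=leave+a+comment HTTP/1.1" 200 512 "-" "Opera/9.80"',
    '203.0.113.5 - - [10/May/2022:10:00:02 +0000] "GET /search/?q=powered+by+wordpress HTTP/1.1" 200 512 "https://example.com/" "Mozilla/5.0"',
]

summary = summarise(SAMPLE)
print(summary["total"], summary["no_referrer"])
print(summary["queries"].most_common(5))
```

If the most common queries turn out to be stock phrases rather than random strings, or a few user agents dominate, that points towards an automated tool rather than real users.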

Anyone have any recommendations for cloud hosting for searchmysite.net? My AWS free trial has run out and I'm starting to be billed at approx USD90 per month, so I need to switch. I had hoped to get on fosshost.org but they're not taking new projects. Current favourite is Hetzner, which would be EUR8.21 monthly if I dropped Wikipedia indexing (or EUR27.25 monthly if I continued indexing Wikipedia).

@rysiek Agreed that the choice on the web isn't just between setting up your own web server and being locked into a centralised service. I had my first public site in the ~/public_html folder for my university account where the university managed the web server, and even now my main personal site on my own domain is actually served by GitLab Pages behind the scenes. In both cases the content is completely portable, and in the last case even the URL is portable.

@natecull @gregorysalvan @fishidwardrobe Anything that is disliked by extremists on opposite sides is probably a good middle ground. But unfortunately nowadays there seems to be some kind of centrifugal force (perhaps social media echo chambers) driving people away from the middle.

@gregorysalvan @fishidwardrobe @natecull Actually, if you go into a home decorating shop you'll find many different shades of white paint, e.g. "shadow white" from Farrow & Ball, and even one called "dark white" from Susie Watson Designs.

@fishidwardrobe @natecull There are different forms of capitalism though, e.g. responsible capitalism where profit is generally good but not necessarily the only goal, vs untrammelled capitalism where profit must be maximised at all costs irrespective of the impact to other people, the environment etc.

So I've finally managed to index Wikipedia. I reckon it turns searchmysite.net into a much more useful search engine for day-to-day usage. I've put together a post with some technical information (data source, disk and memory requirements, etc.), including links to the source code so you can try it yourself, at blog.searchmysite.net/posts/se , if anyone is interested.

Based on a review of recent failures in my indexing log, roughly 5% of all sites were not being indexed due to indexing errors, and of these indexing errors 62% were sites that had disappeared, 19% were sites where robots.txt blocked indexing, and 5.5% were sites where Cloudflare blocked indexing. Further information at blog.searchmysite.net/posts/so .
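
To put those percentages in terms of all sites rather than just the failing ones, the figures above can be multiplied through. A small worked calculation using only the numbers quoted in the post (the variable names are just for illustration): it comes out at roughly 3.1% of all sites having disappeared, about 0.95% blocked by robots.txt and about 0.275% blocked by Cloudflare, with around 13.5% of the errors not attributed to any of the three causes.

```python
# Figures quoted in the post: about 5% of all sites fail to index,
# and the causes below are shares of those failures.
failure_rate = 0.05
causes = {
    "site has disappeared": 0.62,
    "robots.txt blocks indexing": 0.19,
    "Cloudflare blocks indexing": 0.055,
}

# Share of ALL sites affected by each cause.
share_of_all_sites = {cause: failure_rate * share for cause, share in causes.items()}

# Share of failures not covered by the three listed causes.
other_share_of_failures = 1 - sum(causes.values())

for cause, share in share_of_all_sites.items():
    print(f"{cause}: {share:.3%} of all sites")
print(f"other causes: {other_share_of_failures:.1%} of indexing errors")
```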

@benjancewicz David Gerrold is also the writer of the classic ST:TOS episode "The Trouble With Tribbles".

@amolith Lawns aren't entirely useless: they help with carbon sequestration, and soak up water when it rains thereby reducing load on drainage systems (in the UK at least there's a growing problem of flash flooding in cities linked to people having paved over lawns).

Been a while, but I've just posted a short progress update: blog.searchmysite.net/posts/pr . Short summary is that I've been reviewing submissions daily, fixed some minor bugs, and the system has been stable, but I've not had a chance to do any enhancements. One interesting thing I've noticed is that I hardly get any hits from Google: anyone have any idea why?

@readwriteas Each validated site has its own API. There isn't an all-site API at the moment (to save having to implement throttling etc. to prevent abuse), but one could be implemented easily, or you could use the Solr API directly. In this case you'd probably want to auto-insert and auto-validate all writeas.com submissions. Shouldn't be a big customisation. Would be nice to try and do it in a reusable way, e.g. for other such platforms. Let me know if you want more info.

@readwriteas It should be fairly straightforward to launch an instance if you've a server with Docker (steps at github.com/searchmysite/search). Domain ownership validation etc. all work off domains, but you can configure some domains to allow subdomains, e.g. on production I have INSERT INTO tblSettings (setting_name, setting_value) VALUES ('domain_allowing_subdomains', 'writeas.com'); to allow <user1>.writeas.com, <user2>.writeas.com etc. rather than having 1 user owning the whole of writeas.com.
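
To illustrate the idea behind the domain_allowing_subdomains setting, here is a minimal sketch of how a submitted host could be checked against that list. This is not the actual searchmysite.net code: the function name, the single-label restriction and the in-memory set standing in for the rows in tblSettings are all assumptions.

```python
def can_be_owned_separately(host, domains_allowing_subdomains):
    """Return True if host is a direct subdomain of a domain that allows
    each subdomain to have its own owner (hypothetical helper)."""
    host = host.lower().rstrip(".")
    for parent in domains_allowing_subdomains:
        parent = parent.lower()
        suffix = "." + parent
        if host.endswith(suffix):
            label = host[: -len(suffix)]
            # Assumption: only one extra label is allowed,
            # e.g. user1.writeas.com but not a.b.writeas.com.
            if label and "." not in label:
                return True
    return False


# Stand-in for the 'domain_allowing_subdomains' rows in tblSettings.
allowed = {"writeas.com"}

print(can_be_owned_separately("user1.writeas.com", allowed))
print(can_be_owned_separately("writeas.com", allowed))
```

Matching on "." + parent rather than on the bare suffix avoids treating an unrelated domain such as notwriteas.com as a subdomain of writeas.com.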

@duckhp Thanks. I'm still optimistic the internet can be made a better place, and there certainly seem to be lots of others with that aim too. The difficulty is that, in this new dark age of digital feudalism, the internet lords have amassed mind-bogglingly large cash reserves to defend their fiefdoms, and they are supported by vast numbers of vassals who have a vested interest in maintaining the status quo (SEO practitioners, people who work in the AdTech industry, etc.).

@duckhp YaCy is certainly interesting, although it was very slow last time I took a look.

