Quite an active discussion on the latest searchmysite.net blog post on Hacker News at news.ycombinator.com/item?id=3

So I've found out where 99.999% of daily search requests are coming from - it seems my search engine has been added to an SEO tool, and the searches are "scraping footprints" executed by spammers looking for URLs to target. Trying to block them, but it feels like a losing battle, with 157,587 spammer requests on 10 May, and a battle with nothing to gain by winning, given daily searches by real users are still in the single digits after 2 years. More info at blog.searchmysite.net/posts/al

I'm seeing a vast amount of traffic (99.999%?) to searchmysite.net come from what looks like some other search. Requests have different IPs and user agents and random looking queries so I'm not convinced it is bots. The referrer isn't set though so I don't know the source. I've stopped returning results for those requests which takes some of the load off the server, but I'd like to try to find out what is going on. Any ideas as to how to trace?

Anyone have any recommendations for cloud hosting for searchmysite.net? I've run out of the free trial with AWS and am starting to be billed at approx USD90 per month, so need to switch. I had hoped to get on fosshost.org but they're not taking new projects. Current favourite is Hetzner which would be EUR8.21 monthly if I dropped wikipedia indexing (or EUR27.25 monthly if I continued indexing wikipedia).

So I've finally managed to index Wikipedia. I reckon it turns searchmysite.net into a much more useful search engine for day-to-day usage. I've put together a post with some technical information (data source, disk and memory requirements, etc.), including links to the source code so you can try it yourself, at blog.searchmysite.net/posts/se , if anyone is interested.

Based on a review of recent failures in my indexing log, roughly 5% of all sites were not being indexed due to indexing errors, and of these indexing errors 62% were sites that had disappeared, 19% were sites where robots.txt blocked indexing, and 5.5% were sites where Cloudflare blocks indexing. Further information at blog.searchmysite.net/posts/so .
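To put those percentages in terms of all sites rather than just the failures, here is a rough sketch of the arithmetic. It only uses the figures quoted above (5% failing, then 62% / 19% / 5.5% of those); the function name and the "other errors" bucket are my own labels for illustration, not something from the indexing code.

```python
# Figures quoted in the post: share of all sites failing, and the
# share of those failures attributed to each cause.
FAILURE_RATE = 0.05
CAUSES = {
    "site disappeared": 0.62,
    "blocked by robots.txt": 0.19,
    "blocked by Cloudflare": 0.055,
}


def share_of_all_sites(failure_rate, causes):
    """Convert 'fraction of failures' into 'fraction of all sites'."""
    shares = {cause: failure_rate * fraction for cause, fraction in causes.items()}
    # Whatever is left over is lumped together (hypothetical bucket).
    shares["other errors"] = failure_rate * (1 - sum(causes.values()))
    return shares


if __name__ == "__main__":
    for cause, share in share_of_all_sites(FAILURE_RATE, CAUSES).items():
        print(f"{cause}: {share:.3%} of all sites")
```

So on these numbers, sites that have disappeared account for roughly 3% of everything submitted, and robots.txt blocks for a little under 1%.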

Been a while, but I've just posted a short progress update: blog.searchmysite.net/posts/pr . Short summary is that I've been reviewing submissions daily, fixed some minor bugs, and the system has been stable, but I've not had a chance to do any enhancements. One interesting thing I've noticed is that I hardly get any hits from Google: anyone have any idea why?

I've summarised how much searchmysite.net (the open source search engine and search as a service) has cost over the past 6 months, and estimated how much it may cost to keep going in future: blog.searchmysite.net/posts/se Short summary: it is (perhaps) surprisingly expensive to run a search engine and search as a service. The good news is that there is a plan to cover costs (without resorting to advertising). Let's see if it works.

Today I noticed: When I make pretend phone calls I have my fist by my cheek with my little finger out towards my mouth and my thumb out towards my ear, but when my children are making pretend phone calls they place flat palms towards their face.

I posted this to IndieHackers and HN. Top post in the blogging section on IndieHackers is from someone making $8K MRR from "productized SEO content marketing" (I think that means generating blogspam) so I don't think that's the target audience:-)

Wow, my advert-free search engine comment got a fair bit of negativity on Hacker News. Apparently I should think about all the poor writers trying to eke out a living from the advertising on their blogs. Not that I was saying advert-driven search engines should be banned - librarians don't put shopkeepers out of business because they're doing different things.

searchmysite.net is now open source: blog.searchmysite.net/posts/se . Post includes: Why aren’t other search engines open source? What open source licence is it? What are the future plans?

This post contains details of the most recent round of relevancy tuning for searchmysite.net, completed following user feedback and the submission of many more sites. It is possible to detail how results are ranked because the model is designed to keep spam out and remove the financial incentive for it: blog.searchmysite.net/posts/re

@markosaric FYI I've switched searchmysite.net to use Plausible analytics. I mention this on the About page, plus have written a short blog post about it. BTW I love your business model and philosophy.

searchmysite.net has its own dedicated blog now, and I've posted a few more details of some of the changes I've made since the burst of activity in mid October at blog.searchmysite.net/posts/im

TIL there's a breed of semi-feral sheep on a small Scottish island that has evolved to eat mostly seaweed, giving the meat a "unique, rich flavour": en.wikipedia.org/wiki/North_Ro

Had to log in to FB for the first time in 7 years, to change my password as apparently it had been compromised (it says there was a login from a Windows PC, so that definitely wasn't me). What a disaster zone it has become.

There has been a slight mismatch between the number of sites submitted and the number indexed. Turns out that 10 sites have "User-agent: *" followed by "Disallow: /" in their robots.txt. I've added those sites to the do not index list, which means if you resubmit them you'll see the message '... has previously been submitted but ... Access blocked by robots.txt'. If you see this, but have updated robots.txt to allow searchmysite.net, let me know and I'll move the site back to the index list.
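If you want to check your own robots.txt before resubmitting, something like the following should do it. This is a minimal sketch using Python's standard urllib.robotparser, not the check the indexer itself runs; the function name, the example.com URL and the "searchmysite" user agent token are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str = "*",
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The rule found on the 10 sites: every crawler is blocked from everything.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Same rule, plus an exception for one crawler. "searchmysite" is a
# hypothetical user agent token used only for this example.
BLOCK_ALL_EXCEPT_ONE = (
    "User-agent: *\n"
    "Disallow: /\n"
    "\n"
    "User-agent: searchmysite\n"
    "Allow: /\n"
)

if __name__ == "__main__":
    print(is_blocked(BLOCK_ALL))
    print(is_blocked(BLOCK_ALL_EXCEPT_ONE, user_agent="searchmysite"))
```

The first call reports the site as blocked; the second shows that a more specific User-agent section takes precedence over the catch-all one for that crawler.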

