Should libraries run search engines? It seems that the original point of a library was to organize human knowledge and culture for the public benefit. The benefit has been great, but libraries aren't the main tool for finding information now. Search engines are. The notion that a library needs to be for books only is an arbitrary limitation. This oversight allowed corporations to move into that traditionally non-profit role, and I'm not convinced they've done a particularly good job.
@distractedmosfet Assuming that Big Publishing and the Copyright Cartel doesn't kill them first...
Read my pinned toot about it.
@distractedmosfet I encourage you to explore the myriad services offered by public libraries that have nothing to do with books.
@mttaggart That line was rhetorical, meant to encourage readers not to mistake libraries for being only about books; it wasn't a statement about my own awareness of their services, beyond the fact that I have not seen one maintain a general internet search engine.
@distractedmosfet Ah! Sorry I misinterpreted.
Insofar as libraries provide access to search engines—and the internet—free of cost, I would contend they serve their purpose of providing access to information.
If "libraries" were to maintain a search engine, where would that labor be centered? Would every municipality have one? Every country? And that raises the question of whether it is wise for the state to enter the business of information access in a direct way.
@mttaggart The "every municipality" type questions are the interesting ones to me. I feel like even if you're in favour of this idea, you still want to find a practical balance for it. Roughly national-level is probably a sensible point, but hypothetically, via key resource sharing such as common software (webserver, crawlers, etc.), I could see it being not unreasonable to run some at even smaller levels, presumably to have a diversity of cataloguing/ranking philosophies.
@mttaggart RE: state in the business of information. I feel like libraries already do that. As do public broadcasters. Besides, we already have very real examples of governments influencing what's in search engines without them funding it. Corporations gladly cooperate with the whims of corrupt governments.
@distractedmosfet They do indeed. That's why I'm not in love with the notion of making it more explicit.
@distractedmosfet I feel like data classification needs to enter the chat at some point. Universities and nations do operate these kinds of search functions for _certain_ kinds of data. But not the general internet, and that's where I feel some tension here between the idea of an independent search engine and the purpose of libraries (and states).
In fact, given the location of this conversation, I wonder if a federated, independently operated network of search nodes fits the bill.
Libraries employ professionals (my sibling is one) whose job is to classify information and make it discoverable. The automated tools known as "search engines" do a mediocre job of that at best. Google used to be better than it is now, but that's not entirely down to advertising; it's also the problem of responding to "SEO".
In the highly-dynamic environment of the Internet, what might that look like?
@publius Yeah this is another interesting space to my mind. Like, to me it would be reasonable for libraries to take the stance that they don't care to index the entire internet but rather maintain a curated selection they index. The internet is a pretty big place and so realistically they'd only cover an incredibly tiny fraction of it but I feel like it'd be possible, especially with some pooled efforts, to fill a job sort of adjacent to something like Google. Less flexible, but curated.
@distractedmosfet @mttaggart In the part of the world I am from, there has been very little effort to re-imagine libraries beyond a building that "houses" books. It is wonderful to see the amount of work that has gone into delving into the role of a library in the community it is part of. Providing & ensuring "free" access to knowledge does not seem to be a priority for many government bodies or even academic organizations right now.
@distractedmosfet @mttaggart Most of the necessary software is open source already! There's Scrapy for crawlers and Lucene for heavily internationalized keyword search. The biggest hurdles now are gathering a decent index, & blocking spammers from DoSing your service while hunting for sites to spam!
So there'd need to be some sort of federation to spread out the work...
I recommend following SearchMySites' efforts: https://blog.searchmysite.net/index.xml
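To make the Lucene side of that concrete: at its core, keyword search is an inverted index mapping terms to documents. Here's a toy sketch of that idea in plain Python; the function names and documents are invented for illustration and this is nothing like the real Lucene API, just the underlying concept.

```python
# Toy inverted index: the core idea behind Lucene-style keyword search.
# All names and documents here are illustrative, not any real API.
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return doc ids containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "public libraries organize human knowledge",
    2: "search engines index the public web",
    3: "libraries could run search engines",
}
index = build_index(docs)
print(search(index, "search engines"))  # doc ids 2 and 3
```

A real deployment would add tokenization, stemming, and ranking, which is exactly the machinery Lucene provides off the shelf.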
There used to be more overlap between information search and information access/management.
So I don't think you are having a controversial idea here.
@onepict This post was inspired mostly by a take I've seen on the internet several times that goes something like "Libraries would be considered crazy if proposed today, so what good stuff are we missing out on because we're not already doing it?" And to me this felt like an example of even a thing that libraries specifically could be doing that we're missing out on. So yeah, the non-controversialness is sort-of a feature!
I do wonder what's happening though; at uni, much of the compsci research on information search came from the libraries. Like the original algorithms for information search and management, and for working out how to do OCR for handwriting. Which wasn't that successful, as projects like Transcribe Bentham relied on crowd working. Just how did the disciplines get so separated?
My graduate project was trying to do OCR on existing documents that had just been scanned in, in the early 2000s. That was fun: there were no real Java implementations of it, and the OCR libraries were proprietary and cost a lot of money. But I still have some of the papers from the early handwriting-OCR research somewhere.
TL;DR, this is largely already happening, it's just targeting things other than websites as primary sources.
I'm not sure websites are that good a target, either. They're often pretty rubbish.
Interesting dilemma, actually. What value is there in making rubbish but accessible secondary/tertiary sources discoverable?
I recall that Back In The Day™️ whenever you ran into HTML it would be explained in terms of SGML (fair), and that invariably led to mention of Dublin Core. Also e.g. LaTeX at the time seemed to mention DC often.
Since the ill-fated XHTML attempt at the latest, DC has dropped off the radar.
HOWEVER, my librarian friend was utterly unsurprised by it, and more surprised that I as a pure compsci person knew what it was about.
@onepict @distractedmosfet It probably helps that one of my other friends, Dan Brickley, runs https://schema.org ... there's no direct connection to DC, but indirectly both draw on RDF historically, and RDF is sort of where the web world and DC formalized that it's about data and not HTML so much. There was a lot of parallel and cross-fertilising stuff going on there since the 90s.
I've basically been in constant contact with people concerned with formalizing how to describe resources.
Where websites are mostly different is that they tend to be ad-hoc, informal sources of information that are much harder to even describe formally because it's not necessarily clear what these things *are*. Is a blog post by a doctor a medical resource or an opinion? Is it both?
Web search engines are...
@onepict @distractedmosfet I'm also very interested in this kind of thing from my #interpeer point of view. It's abundantly clear that computers do better with categories provided by schemata, but the web and search engines also demonstrate clearly that most people don't care.
Schema.org is interesting to me because it's specifically aimed at bridging that gap: it provides schema keywords with which you can e.g. decorate your website content such that it looks more structured to crawlers and...
@onepict @distractedmosfet ... therefore becomes more of a well-defined thing for search engines. But most of that is going to happen outside the view of a user who is just writing a blog post or some such.
(It's no surprise that Dan runs the project while being a Google employee; Google benefits from websites looking more structured to their crawler, of course.)
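Schema.org markup is commonly embedded in pages as a JSON-LD block inside a `<script type="application/ld+json">` tag, which is the form crawlers consume. A minimal sketch of generating such a block with Python's stdlib json module; all field values here are invented examples.

```python
# Minimal sketch of schema.org markup as JSON-LD.
# "@context"/"@type" are real schema.org conventions; the values are made up.
import json

article = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Should libraries run search engines?",
    "author": {"@type": "Person", "name": "Example Author"},
    "datePublished": "2022-11-01",
}

# This string would be embedded in the page inside
# <script type="application/ld+json"> ... </script>
snippet = json.dumps(article, indent=2)
print(snippet)
```

The point is that the author's prose stays prose; the structured claim about what the page *is* lives in this parallel, machine-readable layer.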
@onepict @distractedmosfet So this is a little bit of a rambling comment thread; the main point, I think, is that there is already a bunch of tech in libraries that would provide for search engines.
A secondary point is that the main difference is how libraries and search engines look at different resources and why, and a third is that it's somewhat possible to bridge this.
As to whether it'd be a good idea for libraries to run search engines, well, I don't know. Yes and no?
I do wonder how much Google disrupted information management and categorisation with its search engine development?
Like not just changing the market, but separating the disciplines. As well as the use of AI/neural nets for search in some research. I remember some of my lecturers and project assessors really not liking the use of neural nets for search, as they felt you couldn't debug how a search result was arrived at, compared to traditional methods.
I don't know. I mean, what I sort of wanted to get at before is that it's probably a good and a bad thing what's happened here. Should Google be in control of it, no, but that's a different thread (or the main one, and we're on the side track).
Because what I *also* distinctly remember is how terrible search was before. It was mostly luck that got you anywhere. I was also the kind of kid still looking in the library index for key words I might be...
@onepict ... interested in and largely remember disappointment.
Honestly, I think a mixture of formal/traditional categorisation and more statistical ones (let's face it, AI these days is mostly statistics) is probably not bad. The specific thing we're seeing nowadays, well, could likely see some improvements. But having a mixture is not something I'd want to change too much.
@onepict I suppose one can subdivide categorization methods also into who does the bulk of the work.
With both e.g. hashtags and schema.org-like markup, it's the author of content that makes a claim that the content is relevant to a particular topic or has a particular form. They could be lying.
Machine learning might bring some kind of neutrality back into it that more traditional approaches might also have, but more efficiently. Then again, we've all heard of AI bias now.
@onepict Someone mentioned federated search in another comment; it's largely been my view in the last years that what this should be is sharing an index. How each instance comes by the index may be up to them, and could permit for more or less formal approaches.
On the client side, such a federated search should permit relatively fluid adjustment of the weight each index is given.
I think that might be an interesting thing to work on. I'm not sure if it already exists.
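One way that client-side weighting could look: each instance returns scored results against its own index, and the client re-weights and merges them. A hedged sketch; the instance names, scores, and the linear-combination scheme are all invented for illustration.

```python
# Sketch of client-side merging for a federated search: each instance
# returns (doc, score) pairs; the client re-weights and merges them.
# Instance names and scores here are invented for illustration.

def merge_results(instance_results, weights):
    """Combine per-instance scores using client-chosen weights."""
    combined = {}
    for instance, results in instance_results.items():
        w = weights.get(instance, 1.0)
        for doc, score in results:
            combined[doc] = combined.get(doc, 0.0) + w * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

instance_results = {
    "library-a": [("doc1", 0.9), ("doc2", 0.4)],
    "library-b": [("doc2", 0.8), ("doc3", 0.5)],
}
# A user who trusts library-b's curation twice as much:
weights = {"library-a": 1.0, "library-b": 2.0}
print(merge_results(instance_results, weights))
```

Turning the weights into user-facing sliders is what would let each reader pick their own balance of cataloguing philosophies.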
@onepict My immediate guess at a contributing factor would be that making libraries expand to keep up with the expansion of their traditional domain (information cataloguing) would require additional funding, and much of the western world at least seems to have been in a less public-good-oriented mindset during this time. And potentially it would require a mindset shift in the government bodies that libraries are generally accountable to.
@distractedmosfet That's an excellent point and a great idea! It's probably too much for one library to take on, but if several pooled their resources it might work.
@thurisaz Yeah I agree that it'd likely be a lot for most municipal libraries. I think if you were going to do it, the most sensible approach would be collaborations between several of them to pool tech, but also potentially indexes; part of a librarian's job is curating a good collection of books, and I could see some libraries taking the stance that they don't want to index *the entire* internet but rather have a more opinionated index, and some pooling there would make that more feasible.
@amsomniac While I certainly wouldn't knock it my interest personally would be more in seeing what the librarian field could do in terms of taking their skills at research and information cataloguing and coming up with something different than the current options. Even if that different thing was just a search engine that deliberately didn't want to index the whole internet but instead tried to be extremely opinionated in what they bothered to index.
@amsomniac it would be neat to separate 'indexing' and 'ranking'.
Searx (or something like it) is indeed interesting: you could plug in data sources for information not currently well-indexed, while still also including 'long tail' results from general-purpose engines.
Adding customizable (and perhaps eventually: federatable) ranking tweaks is on my 'someday maybe' project list :D.
Perhaps not as ambitious as what @distractedmosfet is thinking about though.
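The 'indexing' vs 'ranking' separation mentioned above can be sketched very simply: one shared index, with pluggable scoring functions chosen by the caller. All names here are illustrative, not any real Searx internals.

```python
# Sketch of separating 'indexing' from 'ranking': one shared index,
# pluggable ranking functions. All names are illustrative.

index = {
    # term -> {doc_id: term frequency}
    "libraries": {"d1": 3, "d2": 1},
    "search": {"d2": 2, "d3": 4},
}

def rank_by_frequency(postings):
    """Most frequent occurrences first."""
    return sorted(postings.items(), key=lambda kv: kv[1], reverse=True)

def rank_alphabetically(postings):
    """Stable, debuggable ordering by doc id."""
    return sorted(postings.items())

def query(term, ranker):
    """Same index, different ranking philosophy per caller."""
    return ranker(index.get(term, {}))

print(query("search", rank_by_frequency))    # d3 before d2
print(query("search", rank_alphabetically))  # d2 before d3
```

With this split, a shared or federated index becomes a commodity, and the ranking function is where each library (or user) expresses its curation philosophy.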
@distractedmosfet What's ironic is that as the Web grows and attracts ever more commercial spam and SEO engineering, I'm increasingly turning from it to traditional publishing (books and articles) for information.
Even those are far from perfect, but it turns out that the pre-screening of the editorial desk does in fact provide a useful function.
Libraries-as-search-hosts has ... some appeal.
@dredmorbius Yeah I'd definitely consider the possibility of them maintaining a very opinionated catalogue of the internet to be part of the upside of them doing it, compared to what say Google does now.
@GustavinoBevilacqua Paul Otlet is a major inspiration. Look up Boyd Rayward's work on Otlet (multiple books and articles).
I'm also going through the Librarian of Congress's annual reports for the past century and a half. Starting about 1897, the task of cataloguing and organising the library starts being seriously discussed.
@distractedmosfet Personally I think libraries should run search engines, host Tor nodes and host Torrents; While also providing access to computers, the internet, software development tools, and 3d printers and maybe even simple online hosting for members.
I am also willing to help any library do any of the above.
@blit32 Seeing some libraries have a server to offer some static-site hosting a-la neocities for members would be pretty neat, and likely reasonably doable.
@distractedmosfet I would hope so. The issue for most libraries _I think_ is less about how to offer hosting and more about what content are users going to publish. For example how can a library protect itself from the liability and backlash of a user using its service to publish CP? What about a user publishing content supporting the "Great Replacement" theory, should the library take the content down and become censors? Or continue to host the content supporting free, if disgusting, speech?
@blit32 Yeah I definitely agree that this sort of thing is probably the bigger question than tech and cost feasibility. I think I'd prefer it be handled more locally so it can match the sensibilities of the region.
Before Google made search an algorithm based system, there was Yahoo.
Yahoo search was human curated, and they were the biggest employer of people with library science degrees outside of the US Library of Congress, according to a librarian I knew who worked at Yahoo for a while.
We both agreed that it would be better if LoC or a similar org handled it, at least for US searches.
@distractedmosfet I reckon it'd be a great idea to resurrect DMOZ and hand it over to librarians, but every actual librarian I've asked says, like, I get where you're coming from but oh sweet Jesus no we're overworked enough already.
So if that's gonna happen we're gonna need a hell of a lot more librarians, plus a hell of a lot better pay and working conditions for our current ones. They're some of the most important workers, let's start paying and treating them like it.
@distractedmosfet The library taxonomies (Dewey, LOC, ??) *are* a significant part of search engines, just slow ones.
In the 1990s I knew plenty of people who were so sure that folksonomies and full-text search would free us from the tyranny of libraries. The things we didn't see coming.
I feel like attacks on public management of web search engines would be very rapid and awful, starting from where SEO and political stuff is now.
Code4lib is underway!
Most archives have many search-engine-like features. One is unlikely to find everything searchable from a single page, since, unlike Google/Bing, archives take copyrights and permissions rather seriously. Although you'll be able to find a few consortia, like SNAP, that bring several together under one roof.