Anyone have any tips for working with large text files? (~30 GB)
I'm assuming that pretty much any normal text editor is out of the question? Is a combination of sed and grep my best bet?
This almost seems like a case for Ed, man!
https://www.gnu.org/fun/jokes/ed-msg.html
@octobyte Not in this case. Vim loads the entire file into the buffer, so if the file size exceeds available RAM plus swap, vim won't load the whole file (though I could split it, of course)
@codesections the obvious answer here is to increase swap size 😁👍@octobyte
@codesections Out of pure curiosity, how did you end up with 30 GB text files??
@skalman I'm experimenting with the dataset Troy Hunt released, which consists of (partial) hashes of just over 5 billion passwords known to have been contained in past password breaches.
https://haveibeenpwned.com/Passwords
The typical use-case would be to use it server-side to check if a user's password has previously appeared in a breach (and thus is more likely to be on a wordlist). I'm trying to see if there's a good way to use it locally, though the size of the dataset is (obviously) an issue.
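For concreteness, here's a minimal sketch in C of what that check could look like, assuming the dump is one uppercase SHA-1 digest per line in `HASH:count` form. The file name and command-line interface are made up, and it needs OpenSSL (`-lcrypto`) for `SHA1()`:
```
/* Sketch only: check one password against the dump.
 * Assumes one uppercase SHA-1 digest per line ("HASH:count").
 * Build with: cc -o pwncheck pwncheck.c -lcrypto */
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <dump-file> <password>\n", argv[0]);
        return 2;
    }

    /* SHA-1 the candidate password, render it as uppercase hex */
    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1((const unsigned char *)argv[2], strlen(argv[2]), digest);
    char hex[2 * SHA_DIGEST_LENGTH + 1];
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(hex + 2 * i, "%02X", digest[i]);

    /* Stream the dump line by line: constant memory, no editor needed */
    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror(argv[1]); return 2; }

    char line[128];
    int found = 0;
    while (!found && fgets(line, sizeof line, fp))
        found = strncmp(line, hex, 2 * SHA_DIGEST_LENGTH) == 0;
    fclose(fp);

    puts(found ? "pwned: appeared in a breach" : "not found");
    return found ? 0 : 1;
}
```
A linear scan like this reads the whole 30 GB in the worst case; if you grab the hash-ordered variant of the download, a binary search with `fseek` would be much quicker.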
@codesections @skalman
That's the sort of thing you'd want in some sort of database format.
@codesections @sowth But then do you need a text EDITOR at all? `grep` is all you need if a sequential search is what you want to do.
@codesections Better use Ex or some kind of scripting.
Vi may still work depending on the size of your RAM...
@codesections
afaik Emacs has some features for large files?
@codesections
Awk may be as helpful as sed.
If you don't need to insert or delete, only modify, a hex editor may work for you. They don't need to load the whole file at once. I just tried hexedit on Debian, and it seems to work fine on a 5 GB file. Fast, too.
@sowth I hadn't thought of a hex editor. Thanks!
@codesections less could be handy, or more
@codesections
Depends on what you want to do, but generally anything that has to load the file into RAM will either choke or be prohibitively slow. Some programs are better at it than others, but don't bother. Stream processing with grep/sed etc. will do; personally, I prefer to skip the nonsense and get right into using IPython. If you know any Python, it may be worth the extra up-front overhead for the output, reproducibility, and clarity.
@cathal Interesting. I do know a bit of Python, though it's not my top language. I guess I'd just assumed that old-school Unix-style tools would be a lot better at stream editing, but I'll look into Python. Thanks!
@codesections Again, it all depends on what you want to do. :)
The only gotcha with Python is to make sure you don't load the file into RAM by calling `file.read()`. Instead, you can iterate over the file line-by-line by doing things like:
```
with open("some_file.txt") as f:
    for line in f:                       # streams one line at a time
        print(line.rstrip("\n")[::-1])   # e.g. reverse each line
```
Good luck! If this is about the Google data I'll look forward to reading more about the journey. :)
@codesections Use grep to separate them into separate files by first character?
Use /usr/bin/split to section the data into separate files?
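A sketch of that bucketing idea in C, streaming stdin into sixteen output files by first hex character; the bucket file names here are made up, and it assumes lines under 128 bytes:
```
/* Sketch: split a hash dump into 16 files by first hex character.
 * Usage (hypothetical): ./split16 < pwned-passwords.txt */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char hexdigits[] = "0123456789ABCDEF";
    FILE *out[16];
    char name[32];

    for (int i = 0; i < 16; i++) {
        snprintf(name, sizeof name, "bucket-%c.txt", hexdigits[i]);
        out[i] = fopen(name, "w");
        if (!out[i]) { perror(name); return 1; }
    }

    char line[128];                  /* hash lines are ~50 bytes */
    while (fgets(line, sizeof line, stdin)) {
        /* lines not starting with an uppercase hex digit are skipped */
        const char *p = line[0] ? strchr(hexdigits, line[0]) : NULL;
        if (p)
            fputs(line, out[p - hexdigits]);
    }

    for (int i = 0; i < 16; i++)
        fclose(out[i]);
    return 0;
}
```
Sixteen ~2 GB files are a lot friendlier to ordinary tools than one 30 GB file, and the same loop extends to 256 buckets keyed on the first two characters.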
@drwho @codesections This is the way to go.
@codesections @drwho Note that you could store it virtually, without using any storage at all:
create folders named with one letter each, nested over and over until you have the complete hash.
0 space used in the filesystem !
You won't need ram anymore too !

It reminds me of this story: http://www.patrickcraig.co.uk/other/compression.php
@lord @codesections That's a really interesting idea... wouldn't you eventually run out of inodes, though?
@drwho @codesections It was a joke.
It's doable; you won't run out of inodes with a modern filesystem (btrfs can have 2^64 files/folders), but it won't magically give you free space.
Your files will weigh 0, but your filesystem metadata will be huge…
@codesections Last time I tried to manage a 1.2 GB file with vim, it crashed my computer. Don't do that.
cat, sed, and grep are the best tools you can use, I guess.
@Neil Yeah, I tried vim (well, neovim), more out of a perverse curiosity than any genuine thought that it would work.
I was actually impressed with how gracefully it handled the situation. It consumed all my available RAM, and then all my swap space, and then cleanly exited with an error message—no crash at all
@codesections vim can handle large files pretty well.
@k Not in this case. Vim loads the entire file into the buffer, so if the file size exceeds available RAM plus swap, vim won't load the whole file (though I could split it, of course)
@codesections the lazy option would be to spin up a high RAM VPN with hourly billing, then open it in whatever text editor you want, and delete the VPS when you're done.
Other than that, I assume writing something to parse it exactly as you want would be the best option.
@codesections whoops, bad autocorrect. *high-RAM VPS
@codesections That depends on what you intend to do with the text. Is it delimited at all? Do lines end with a carriage return and/or line feed, or any other formatting?
@DistroJunkie It is separated by line feeds and/or carriage returns (mental note: check which one, since this came from a Windows user), with each value on its own line. Other than that, no delimiters needed.
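Settling that mental note takes only a few lines of C: open the file in binary mode (so the runtime doesn't translate newlines) and look at how the first line ends:
```
/* Report whether the first line ends in \r\n (Windows) or \n (Unix). */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
    FILE *fp = fopen(argv[1], "rb");   /* binary mode: no newline translation */
    if (!fp) { perror(argv[1]); return 2; }

    int prev = 0, c;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            puts(prev == '\r' ? "CRLF (Windows) line endings"
                              : "LF (Unix) line endings");
            fclose(fp);
            return 0;
        }
        prev = c;
    }
    fclose(fp);
    puts("no newline found in file");
    return 0;
}
```
(This only inspects the first line; on most Linux systems, `file` will also report "with CRLF line terminators" if that's what it finds.)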
@codesections @DistroJunkie
I know Emacs, but I do not know the typical Unix commands (sed, etc.). What are you trying to do with the text? Are you trying to filter it for certain data? The Emacs function "keep-lines" is a godsend. You just need to know which regular expression to use.
@cigarBGuitarEfx @DistroJunkie You're the third person who's mentioned Emacs—that really puts the whole eight-megabytes-and-constantly-swapping joke in a new light, I suppose!
@cigarBGuitarEfx @DistroJunkie and I'm not really sure what I'll do with the data. I'm experimenting with data Troy Hunt released consisting of (partial) hashes of 5 billion passwords contained in past password breaches.
https://haveibeenpwned.com/Passwords
Typically, you'd use it server-side to check if a user's password has previously been in a breach, and thus may be on a wordlist. I'm trying to see if there's a good way to use it locally, though the size of the dataset is (obviously) an issue.
@codesections @DistroJunkie Emacs is a text processor rather than just a text editor; there's no doubt that it can accomplish what you want. There's only the question of learning how to do it. The solution might involve coding in Elisp, or running macros, or just learning about advanced built-in functions. The only downside is taking the time to learn. If you live in it for a while, you will no doubt learn it quickly.
@cigarBGuitarEfx @DistroJunkie ...hmm, it is tempting. I've heard very good things about Evil mode, too (don't think I can give up modal editing, but I shouldn't need to). Not *sure* now is the right time, but it might be
@codesections
It is horrible, but I can barely use Vi/Vim. I've attempted to use evil a few times to teach myself, but I can't seem to get the hang of it.
@DistroJunkie
@cigarBGuitarEfx @DistroJunkie Well, I only switched to (neo)Vim ~6 months ago, so I might still be experiencing the zeal of the converted… But I've loved it so far and already find it hard to go back to anything without modes
@codesections @cigarBGuitarEfx
Well, y'all can laugh at me but I'd write a short quick program in C to extract what I want from the file. I do that sort of thing all the time in C so it's real quick for me.
@DistroJunkie @cigarBGuitarEfx Actually, I think I'll give that a try. I'm in the process of learning C, and I think this would be a nice, simple, and actually useful test of where I am so far.
@codesections @cigarBGuitarEfx
It would be a great programming exercise and everything you need is in K&R.
@DistroJunkie @cigarBGuitarEfx Ok, done. Not the prettiest code I've ever written, and likely not as efficient as it could have been (I think I was overdoing the file I/O).
BUT, in just 27 lines of C, I got my file processed. I did it without using any memory to speak of, and it only took ~3 minutes to process 30 GB, and now that 30 is down to 20!
C really is a lot of fun
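The actual 27 lines never made it into the thread, so the following is only a guess at the program's general shape: a constant-memory stdin-to-stdout filter. The specific transform shown here (dropping the `:count` field from each `HASH:count` line) is an assumption, not necessarily what shrank 30 GB to 20:
```
/* NOT the actual program from the thread -- just a sketch of the
 * shape: a constant-memory stream filter over a hash dump.
 * Usage (hypothetical): ./filter < in.txt > out.txt */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[128];                      /* hash lines are ~50 bytes */

    while (fgets(line, sizeof line, stdin)) {
        char *colon = strchr(line, ':');
        if (colon)
            *colon = '\0';               /* cut the line at the colon */
        else
            line[strcspn(line, "\r\n")] = '\0';
        puts(line);                      /* write the bare hash back out */
    }
    return 0;
}
```
`fgets` and `puts` go through stdio's buffer, so even this naive-looking loop does its actual reads and writes in large blocks, which is consistent with getting through 30 GB in a few minutes.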
@codesections @cigarBGuitarEfx
Good job! Wasn't that satisfying? I love it when I do stuff like that.
@codesections Emacs? With something like this: https://www.emacswiki.org/emacs/VLF
@codesections when working with large text files (usually password dumps), cat | grep usually does what I need. But I'm usually looking for specific terms within the dump.
@codesections #emacs together with https://github.com/m00natic/vlfi should work
@codesections vim generally will handle it