What a Hundred Thousand Emails Revealed About a Life

My cloud drive is mostly just documents, and with 19GB of free space I had stopped paying attention to it a long time ago. Then I suddenly got an email warning that I was running out of storage. That was enough to make me look.

I had already used 16GB.

A closer look made the breakdown even stranger: nearly 10GB was email, about 5GB was old photo backups left over from a period when photo storage was free, before original-quality images started counting against the quota, and the documents I actually cared about came to only around 1GB. I couldn’t be bothered to sort through the photos or files, but the email figure surprised me. Attachments have size limits, so what exactly had grown that large?

That question kicked off a full mailbox cleanup, and while doing it I noticed something else: my personal email account had accumulated roughly 100,000 messages.

This is my personal inbox. For two years, while I was in Canada, it also received school email, but after that I kept work and personal accounts strictly separate. Before going abroad I had already cleaned this mailbox, leaving fewer than a thousand messages. Which means these 100,000 emails all arrived within the last decade. My default habit is to archive rather than delete, so over time it quietly turned into this. Averaged out over ten years, it comes to a little under thirty emails a day.

That timing happened to line up with something else I had been thinking about: how to feed my personal data into a local model and build a personal assistant around it. One obvious problem is where that personal data is supposed to come from.

Some of it is easy. Part of my life is publicly visible on my blog. Another part exists in notes. But there are also all the trivial details I would never bother writing down and yet would want an assistant to remember: online purchases, account activity, various confirmations, little administrative traces of everyday life. Those all have a default home, and that home is my personal inbox.

At the same time, I know perfectly well that more than half of those 100,000 emails are probably useless to me. A large chunk comes from newsletters I subscribed to during the pandemic, mailing lists I barely read, and GitHub notifications and updates. That alone probably accounts for about a third. Another third is all kinds of advertising. What remains that might actually be useful is perhaps twenty or thirty thousand messages. Even there, no single sender dominates: none has sent more than a thousand messages, which is reasonable enough when spread over ten years, since it works out to at most about two messages a week.
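A per-sender tally like the one described above can be sketched in a few lines. This is only an illustration, assuming the mailbox has already been exported to an mbox file; the function name `count_senders` and the file path are hypothetical, and the standard-library `mailbox` module is used for reading.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_senders(mbox_path):
    """Return a Counter mapping lowercase sender address -> message count."""
    counts = Counter()
    # create=False: fail loudly instead of creating an empty file by accident.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # parseaddr splits "Name <addr@example.com>" into (name, address).
            _, address = parseaddr(str(message.get("From", "")))
            counts[address.lower() or "(unknown)"] += 1
    finally:
        box.close()
    return counts
```

Calling `count_senders("mail.mbox").most_common(20)` on an exported mailbox (the path is a placeholder) would list the twenty most frequent senders, which is usually enough to see which newsletters and notification sources dominate.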

As for myself, I have sent about 2,500 emails from that account. That may not sound like much, but most of my actual replies happen through a work address, so sending a few emails a week from a personal account is not especially little.

Still, the cleanup turned out to be rewarding. The information in an inbox is basically a passive diary.

It contains registration emails, password reset emails, purchase records, billing notices, and traces of communication with the outside world. From that alone you can reconstruct, at least roughly, how your interests and concerns shifted over the years, including many changes you were never consciously aware of. Even after the earlier cleanup, I still had three or four thousand unread emails sitting there, most of them functioning as backups of one sort or another.

With timestamps attached, that data can be used to build a fairly clear profile of a person. That profile could then be stored in a vector database, and a large language model connected to it could serve me much better. Once the idea is clear, the remaining dirty work is mostly implementation, and large language models are already good enough to write much of that code.

At the simplest level, the workflow is crude but straightforward: export all email into an mbox file, convert that file into plain text, import the text into a knowledge base for vectorization, and that is enough to get started.
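The first two steps of that workflow, from mbox to plain text, can be sketched roughly as follows. This is a minimal illustration, not the exact script I used: the function names and the JSON Lines output format are assumptions, and the actual import into a knowledge base is left out because it depends on the tool. Only text/plain parts are kept, so HTML-only mail comes out with an empty body.

```python
import json
import mailbox
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime


def decode(value):
    """Decode an RFC 2047 encoded header into a plain string."""
    if value is None:
        return ""
    try:
        return str(make_header(decode_header(str(value))))
    except Exception:
        # Malformed headers are kept as-is rather than dropped.
        return str(value)


def plain_text_body(message):
    """Join the text/plain parts of a message, skipping attachments."""
    parts = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts).strip()


def message_date(message):
    """Return the Date header as an ISO 8601 string, or None if unusable."""
    raw = message.get("Date")
    if raw is None:
        return None
    try:
        return parsedate_to_datetime(str(raw)).isoformat()
    except (TypeError, ValueError):
        return None


def mbox_to_records(mbox_path):
    """Yield one dict per message: date, sender, subject, plain text."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            yield {
                "date": message_date(message),
                "from": decode(message.get("From")),
                "subject": decode(message.get("Subject")),
                "text": plain_text_body(message),
            }
    finally:
        box.close()


def export_jsonl(mbox_path, out_path):
    """Write the records as JSON Lines and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in mbox_to_records(mbox_path):
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the date next to each body matters here, because the timestamp is what later lets the assistant place a purchase or a password reset on a timeline.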

Of course, that is only a rough first pass. A smarter assistant would need to understand the emails better, which means doing real data cleaning, and there is also room to improve both vectorization and prompting. Email is also a system-level application, so it seems likely that broader system-level information integration will start to appear this year, especially on phones. My own case is probably unusual because I have so many emails. In China, the easier route for many people would probably be to build a personal assistant around WeChat chat histories; I have seen plenty of people send themselves voice messages as reminders. Even so, I still lean toward a local retrieval-augmented setup, or even fine-tuning a model into a personal assistant.
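To make the retrieval half of that setup concrete, here is a toy sketch. It deliberately swaps real embeddings for plain word counts and cosine similarity, so it only matches on shared words; a real setup would use an embedding model and a vector database instead. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def vectorize(text):
    """Stand-in for an embedding: a sparse bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, top_k=3):
    """Return up to top_k documents that share words with the query, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]
```

The retrieved snippets would then be pasted into the prompt of a local model together with the question, which is the "augmented" part of retrieval-augmented generation.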

The ideal AI assistant, to me, would do three things at once: understand my past, stay up to date through data interfaces, and possess plenty of specialized knowledge. Then when I ask it something, it can answer not just in the abstract but in light of my actual situation.

The key problem here is memory.

A lot of so-called personal assistants today implement memory through simple tagged summaries. But if the goal is really an external brain, then first there has to be a record to work from. Most normal people are not going to keep a real diary. What is needed instead is a passive diary: something connected to phones, wearables, and the rest of your devices, continuously recording. Like having an assistant silently observing your activities every day.

That sounds a little unsettling, and maybe it is. But if what you want is data that helps you understand yourself and interpret the information you receive in a more professional way, then perhaps it is acceptable.

My guess is that, for any individual, the amount of information that can truly count as distinctive memory is not actually that large. If I were building this for myself, I would fine-tune the model so that the highly specific, identity-defining material becomes internalized into the model itself, while new information gets stored in a vector database. Then once a year, fine-tune again so the recent memories become part of the model. There is something almost developmental about that, like raising it over time.

People often talk about a future of low birth rates ending in lonely deaths. But it is already obvious that an AI model can be made to role-play, and that if you feed it your past, you can get back a talkative shadow version of yourself. Cyber children or partners, combined with humanoid robots for elder care, may well become the only realistic option for many people a few decades from now. And in that world, you would not need to write an autobiography. You would only need to preserve your personal data.

Looking at my own case, after removing the ads, ten years of email shrank to less than 5GB, and that still includes a lot of attachments. The pure text portion, once prepared for a vector database, is under 150MB. At that rate, the total amount of text I will generate in my entire life probably will not exceed 5GB. Photos and videos might start out huge, but once their content is described in text and vectorized, the meaningful memory content may not be large either.

I had originally thought I might sort through my photos too, but after taking one look I gave up. For many of them, I cannot even tell whether I was the one who took them. Realistically, no one except an AI would be willing to interpret an entire photo library, and the information that can be abstracted from it may be extremely limited. In many cases the whole memory extracted from a picture might collapse into a single sentence: a street scene was photographed. In a vector database, that may amount to no more than that sentence plus a timestamp.

So even if someone’s photos add up to terabytes, the information that can actually become memory may be very sparse. On average, a single photo might not even equal one sentence. Which means that a person’s entire lifetime of digitized self-memory may not be enough to fill a 19GB cloud drive.

Seen from that angle, a human life can be described as a low-entropy expression of information, pushing back against the universe’s general drift toward increasing entropy. And if that is true, then keeping a vector database of oneself is not entirely unlike a form of immortality.

A few years ago I was still thinking about wills. Now that seems almost unnecessary.

Whether or not I leave behind formal writing, whether or not anyone else retains fragments of me in memory, the record of me is already being created continuously. In some ways, the version of me extracted from that record may even understand me better than I do, and it will never get dementia.

I do not need to merge myself into the internet. Every person is already part of it. Many traces are almost impossible to erase completely, and the internet itself is one way humanity, taken as a whole, has extended its DNA across the planet.

Even if humans somehow drive themselves to extinction, I suspect a higher intelligence could still recover vivid individuals from the remains. The same way I could tell, while sorting my inbox, that on a certain day I had cleared my browser cache again—because a pile of password reset emails suddenly appeared all at once.
