Tag: AI

Training an AI on Ancient Undeciphered Texts: What I Wish I DIDN’T Learn

As longtime readers of this blog might be aware, I’ve long been skeptical of machine learning and its so-called “intelligence”. The AI industry, aided by clueless futurists and grifters, has abused our tendency to anthropomorphize what are essentially statistical processes, whether it’s transformer architectures, diffusion models, or large language models (LLMs). Scientists and politicians, out of fresh ideas and worried for their jobs, have gone along with this intellectually dishonest and dangerous marketing campaign.

Quick explanation for newcomers: When they say an AI “learns,” it’s really just finding statistical patterns in data—like noticing that the word “dog” often appears near “bark” or “pet.” It doesn’t understand these concepts; it just recognizes patterns in how words appear together.
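
If you want to see just how mundane that “learning” is, here’s a toy co-occurrence counter in Python. To be clear, this isn’t from any real system; the sentences and the window size are made up purely for illustration:

from collections import Counter

# Toy corpus; real models ingest billions of tokens, not three sentences
sentences = [
    "the dog likes to bark at night",
    "my dog is a good pet",
    "a pet dog will bark at strangers",
]

window = 3  # count pairs of words appearing within 3 positions of each other
pairs = Counter()
for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            pairs[tuple(sorted((word, words[j])))] += 1

# "dog" ends up statistically tied to "bark" and "pet"; no meaning required
print(pairs[("bark", "dog")], pairs[("dog", "pet")])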

This is not a new phenomenon: IBM’s Watson was supposed to cure cancer, but its only achievement was winning at Jeopardy!. The “AI winter” of the 1990s seems forgotten by investors pouring billions into systems that fundamentally operate on the same principles, just with more planet-draining computing resources, more data, and a glitzier marketing campaign.

While pattern recognition itself has limits, as a technologist I was always curious what happens when these new machine learning techniques are applied to the unknown. I’m talking about texts that are incomprehensible to us and have long been thought to be meaningless. I figured I could hack something together, combining online tutorials and the one neural networks class I took in college in 2012.

To be clear, I didn’t expect any breakthroughs, merely an opportunity to demonstrate the hollow claims of AI “understanding” and the limits of attention mechanisms and embedding spaces. What I got instead was a reality check that makes me reconsider my long-held convictions against AI. (And before you AI evangelists start celebrating – it’s NOT what you think).

Dataset Compilation

For those unfamiliar with undecipherable texts: The Voynich Manuscript is a mysterious illustrated codex from the 15th century written in an unknown writing system. Despite a century of attempts by cryptographers and linguists, nobody has successfully deciphered it. The Rohonc Codex is similarly mysterious, discovered in Hungary with nearly 450 pages of strange symbols accompanying religious illustrations. There is no guarantee that feeding them into a machine learning model would yield anything other than statistical noise, and that’s precisely what I hypothesized would happen.

I figured it would be easiest to begin with publicly available data. Thankfully, many of these undeciphered texts have been digitized and placed online by various academic institutions. The Voynich Manuscript has been fully scanned and is available through Yale University’s digital collections. For the Rohonc Codex, I found academic publications that included high-quality images.

Initially, I explored ways to process the manuscript images directly, but I quickly realized that this was a task that would have required expertise in computer vision I don’t possess. Luckily, I came across existing transcriptions that I could work with. For the Voynich Manuscript, I opted for the EVA (Extensible Voynich Alphabet) transcription system developed by René Zandbergen and Gabriel Landini, which represents each Voynich character with a Latin letter. For the Rohonc Codex, I used the system devised by Levente Zoltán Király & Gábor Tokai in their 2018 paper.
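
If you’re curious what working with these transcriptions looks like in practice, here’s a minimal loading sketch. I’m assuming an EVA-style transliteration file where each line carries a locus tag in angle brackets followed by dot-separated “words”; the exact format differs between transcription releases, and the filename here is made up:

import re

def load_eva_lines(path):
    """Parse a dot-separated EVA transliteration into (locus, words) pairs."""
    lines = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw or raw.startswith("#"):  # skip comments and blank lines
                continue
            # Split an optional leading locus tag like <f1r.P.1;H> from the text
            match = re.match(r"^(<[^>]+>)?\s*(.*)$", raw)
            locus, text = match.group(1), match.group(2)
            words = [w for w in text.split(".") if w]
            lines.append((locus, words))
    return lines

# Hypothetical filename; the transliterations are distributed as plain text
for locus, words in load_eva_lines("voynich_eva.txt")[:3]:
    print(locus, words)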

Preprocessing Pipeline

The raw transcriptions weren’t immediately usable for modeling. I had to implement a comprehensive preprocessing pipeline:

def preprocess_manuscript(manuscript_data, script_type):
    # Document segmentation using connected component analysis
    segments = segment_document(manuscript_data)

    # Normalize character variations (a crucial step for ancient texts)
    normalized_segments = []
    for segment in segments:
        # Remove noise and standardize character forms
        cleaned = remove_noise(segment, threshold=0.15)
        # Critical: standardize similar-looking characters
        normalized = normalize_character_forms(cleaned)
        normalized_segments.append(normalized)

    # Extract n-gram statistics for structure detection
    char_ngrams = extract_character_ngrams(normalized_segments, n=3)
    word_candidates = extract_word_candidates(normalized_segments)

    # Create document-level positional metadata
    # This enables learning document structure
    positional_data = extract_positional_features(
        normalized_segments,
        segment_type_classifier
    )

    return {
        'text': normalized_segments,
        'ngrams': char_ngrams,
        'word_candidates': word_candidates,
        'positions': positional_data,
        'script_type': script_type
    }

This preprocessing was particularly important for ancient manuscripts, where character forms can vary significantly even within the same document. By normalizing these variations and extracting positional metadata, I created a dataset that could potentially reveal structural patterns across different manuscript systems.

Training the Model

With a properly preprocessed dataset assembled, I attempted to train a transformer model from scratch. Before achieving any coherent results, I ran into some major hurdles. My first three attempts resulted in the tokenizer treating each manuscript as essentially a single script rather than learning meaningful subunits, which produced extremely sparse embeddings with poor transfer properties.

The standard embeddings performed terribly with the manuscript data, likely due to the non-linear reading order of many Voynich pages. I had to implement a custom 2D position embedding system to capture the spatial layout. Yet, no matter what I tried, I kept running into mode collapse where the model would just repeat the same high frequency characters.
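
To give an idea of what I mean by a 2D position embedding, here’s a minimal PyTorch sketch of the concept: sum a learned row embedding and a learned column embedding instead of using a single sequential index. The dimensions are arbitrary and this is not my exact implementation:

import torch
import torch.nn as nn

class Position2DEmbedding(nn.Module):
    """Learned embeddings for (row, column) positions on a manuscript page."""

    def __init__(self, max_rows=64, max_cols=64, hidden_size=768):
        super().__init__()
        self.row_embed = nn.Embedding(max_rows, hidden_size)
        self.col_embed = nn.Embedding(max_cols, hidden_size)

    def forward(self, rows, cols):
        # rows, cols: integer tensors of shape (batch, seq_len)
        return self.row_embed(rows) + self.col_embed(cols)

# Toy usage: positions for a batch of 2 sequences of 10 tokens each
pos = Position2DEmbedding()
rows = torch.randint(0, 64, (2, 10))
cols = torch.randint(0, 64, (2, 10))
print(pos(rows, cols).shape)  # torch.Size([2, 10, 768])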

But I didn’t want to stop there. I consulted a few friends and did a shit-ton of reading, after which I redesigned the architecture with specific features to address these issues:

# Custom encoder-decoder architecture with cross-attention mechanism
config = TransformerConfig(
    vocab_size=8192,  # Expanded to accommodate multiple script systems
    max_position_embeddings=512,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=12,
    num_attention_heads=12,
    attention_dropout=0.1,
    residual_dropout=0.1,
    pad_token_id=0,
    bos_token_id=1,
    eos_token_id=2,
    use_cache=True,
    decoder_layers=6,
    # Critical for cross-script pattern recognition
    shared_embedding=True,   # Using shared embedding space across scripts
    script_embeddings=True   # Adding script-identifying embeddings
)

# Define separate tokenizers but shared embedding space
voynich_tokenizer = ByteLevelBPETokenizer(vocab_size=4096)
rohonc_tokenizer = ByteLevelBPETokenizer(vocab_size=4096)
latin_tokenizer = ByteLevelBPETokenizer(vocab_size=4096)

# Initialize with appropriate regularization to prevent hallucination
model = ScriptAwareTransformer(
    config=config,
    tokenizers=[voynich_tokenizer, rohonc_tokenizer, latin_tokenizer],
    regularization_alpha=0.01,  # L2 regularization to prevent overfitting
    dropout_rate=0.2            # Higher dropout to prevent memorization
)

training_args = TrainingArguments(
    output_dir="./model_checkpoints",
    per_device_train_batch_size=4,
    evaluation_strategy="steps",
    save_steps=1000,
    # Custom learning rate scheduler with warmup
    learning_rate=5e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    # Gradient accumulation for effective larger batch size
    gradient_accumulation_steps=4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Custom loss function with diversity term
    compute_loss=diversity_aware_loss
)

I’ll happily expand on the key improvements in a future blogpost if they aren’t clear from the code, but all I have to say for now is that this time it “worked”. Over multiple iterations, the AI began producing outputs that at least visually mimicked the original texts. Of course, since I couldn’t understand the original texts, the outputs of this model were just as nonsensical.

Keep in mind that the AI isn’t actually understanding these texts in any capacity, it’s just trying to predict what symbol might come next based on patterns it’s seen. It’s like if you noticed that in a foreign language, the squiggle “λ” often follows the symbol “Ω”—you might learn to predict this pattern without having any idea what either symbol means. This distinction between prediction and comprehension is crucial: your phone’s autocomplete might suggest “umbrella” when you type “I need an…” but it doesn’t understand the concept of rain or shelter—it’s just seen this pattern before.

Note on Training Costs: The computational requirements for this experiment weren’t trivial. I spun up a multi-GPU instance with four A100s, which cost roughly $12 per hour. Training took approximately 72 hours for the final model, consuming around 600 kWh of electricity according to the provider’s dashboard. This was after several failed attempts and architecture experiments that collectively took about two weeks of compute time. The preprocessing pipeline alone took nearly 14 hours to run on the full corpus.

The total computing cost came to just under $8,000—hefty for a personal project, but I’d stumbled across an old laptop and found a forgotten Dogecoin wallet from 2014 with a small fortune inside and this seemed like the best use of my unplanned wealth.

Control Experiments and Statistical Validation

To verify whether the model was actually learning meaningful patterns versus hallucinating connections, I implemented several control experiments. First, I created versions of each manuscript with randomly shuffled characters but preserved positional information. The model performed significantly worse on these shuffled versions, suggesting it wasn’t just learning positional biases.
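
The shuffling itself was nothing fancy; a simplified sketch of the idea is to permute the characters within each line while leaving line boundaries (and any positional metadata keyed on them) untouched:

import random

def shuffle_characters(lines, seed=0):
    """Return a copy of each line with its characters randomly permuted.

    Line boundaries are preserved, so positional metadata keyed on them
    stays valid; only the character order within a line is destroyed.
    """
    rng = random.Random(seed)
    shuffled = []
    for line in lines:
        chars = list(line)
        rng.shuffle(chars)
        shuffled.append("".join(chars))
    return shuffled

print(shuffle_characters(["qokeedy.shedy.daiin", "cthor.cthey.cthol"]))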

Then, I created my own artificial “manuscripts” using Markov chain generation based on the character statistics of the real manuscripts. The model could distinguish these from real manuscripts with 78% accuracy. Finally, I systematically removed each manuscript from training and then tested the model’s ability to process it. Performance dropped substantially when testing on unseen manuscripts, indicating the model wasn’t generalizing to truly novel scripts.
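
For the curious, the artificial manuscripts came from a plain character-level Markov chain. Here’s a simplified sketch of the idea (order-2, with a toy corpus standing in for the real transcription strings):

import random
from collections import Counter, defaultdict

def train_markov(text, order=2):
    """Count which character follows each length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts, seed, order=2, length=60, rng=None):
    """Grow a string by sampling each next character from the counted contexts."""
    rng = rng or random.Random(0)
    out = seed
    for _ in range(length):
        options = counts.get(out[-order:])
        if not options:
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "qokeedy.shedy.daiin.qokedy.shol.cthey.daiin.qokeedy"
model = train_markov(corpus, order=2)
print(generate(model, "qo", order=2))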

One thing I would like to highlight here is the sheer computational cost of systematically testing an AI model’s behavior. Each permutation test required thousands of forward passes through the model. Rather than keeping my existing instance running continuously, I wrote an orchestration layer which allowed me to parallelize these tests at about 30% of the standard cost.

Even with this optimization, the full suite of validation tests I described cost around $3,500 in compute resources and represented almost a week of continuous computation. This is one reason why rigorous validation of AI models is often shortchanged in both research and industry—the compute costs of thorough testing often rival or exceed the training itself.

In general, the computational demands of modern AI are staggering and often overlooked. When researchers talk about “training a model,” they’re describing a process that can consume as much electricity as a small household uses in months. The largest models today (like GPT-4) are estimated to cost millions of dollars just in computing resources to train once. For context, the model I built for this experiment used a tiny fraction of the resources needed for commercial AI systems (about 0.001% of what’s needed for the largest models), yet still cost thousands of dollars.

Now back to the experiment. To validate whether the model was learning meaningful structures, I had an idea. What if I cross-trained it on known languages, mixing the undeciphered texts with English and Latin corpora. This was a bit beyond my comfort zone, so I consulted my friend C1ph3rz, who shares my interest in cryptology and has a background in computational linguistics. She was skeptical, but found the methodology intriguing.

Instead of treating the Voynichese text as an independent linguistic structure, the model began injecting Voynichese symbols into Latin sentences. Here’s an example from one training epoch:

Original Input: "Omnia vincit amor; et nos cedamus amori."
Model output: "Omnia vincit ♐︎♄⚹; et nos cedamus ⚵♆⚶."

The symbols weren’t random substitutions: the same Voynichese glyphs consistently replaced specific Latin words across different contexts. This was annoying, since I couldn’t rule out that the model was simply getting confused by the way I had represented the training data. I spent two days debugging the tokenizers, convinced I’d made an implementation error. Yet everything seemed to be working as intended, except for the output.

It was at this point that I had to confront the first uncomfortable conclusion of this experiment: was the model revealing some (HIGHLY unlikely) linguistic connections between these manuscripts that eluded dozens of far more experienced researchers? Or was it merely creating convincing hallucinations that appeared meaningful to me?

Further Analysis and Emergent Nonsense

I was reviewing the model’s attention maps when something caught my eye. Here’s what the visualization showed for one attention head when processing a Voynich sequence:

Attention head #3, sequence: "qokeedy.shedy.daiin.qokedy"
Attention weights: [0.03 0.05 0.84 0.04 0.04]
                              ^^^^ Strongly focused on "daiin"

The model consistently focused on the substring “daiin” whenever it appeared, despite there being nothing visually distinctive about it in the manuscript. When I searched the corpus, this sequence appeared on 23 different folios, often in completely different contexts—botanical pages, astronomical sections, pharmaceutical recipes.

I plotted every instance where the sequence “daiin” appeared in the Voynich manuscript and compared it to where the model predicted it should appear:

Actual occurrences: Folios 1v, 3r, 8v, 16r, 22v, 67r, 88v, 103v
Model predictions:  Folios 1v, 3r, 8v, 16r, 22v, 67r, 88v, 103v, 115r

The model correctly identified every actual occurrence, plus one additional folio (115r). When I checked folio 115r, “daiin” didn’t appear—but the visually similar “qokeedy” did, with just one character difference. How did the model know to group these? I hadn’t programmed any visual similarity metrics.

Looking through the hidden activations in the middle layers was even stranger. I extracted the most activated neurons from layer 3 whenever processing the sequence “daiin”:

Neuron #428: 0.95 activation - also fires for "cthor" 
Neuron #1052: 0.87 activation - also fires for Rohonc symbol "𐊗𐊘" 
Neuron #301: 0.79 activation - also fires for "qokeedy"

These neurons were connecting patterns across different manuscripts that shouldn’t have any relationship. To exclude any possibility of overfitting, I designed a systematic test, feeding the model 50 isolated segments from different manuscripts and analyzing the completions:

Segment: "qokeedy.shedy" (Voynich folio 14r) 
Completion: "qokeedy.shedy.daiin.shol.cthey" (93% n-gram match with folio 14r-14v)
Segment: "Sheol.daiin" 
Completion: Generated 157 characters matching the unseen portion with 89% accuracy

Most puzzling was this test case:

Input: <start-sequence> (empty prompt with start token) 
Output: 
⚸⚴♄⚵:9 
⚸⚴⚶♇:7 
⚴♄⚵⚶:12...

Puzzled, I sent screenshots to C1ph3rz, and her response came within hours: “Where did you get this sequence? It bears a striking resemblance to numerical tables in the Book of Soyga”. I was naturally confused. I knew about the Book of Soyga, a Renaissance cryptographic work whose encrypted pages remain largely unreadable, but I was pretty sure I hadn’t included it in any of the training data.

She included side-by-side comparisons that made the similarities undeniable. Naturally, since we don’t understand the symbols, it could still be a coincidence; it’s hard to compare two different undecipherable manuscripts. Yet the visual comparison of the output showed a similarity that defied explanation from the training methodology alone. Was I facing genuine emergent behavior, or was it simply a deep urge within me to see patterns in the random outputs of a stochastic system?

“Emergent behavior” is when an AI seems to develop abilities nobody programmed it to have. It’s like teaching a computer to identify birds in photos, but somehow it also learns to distinguish between healthy and sick birds without anyone teaching it that difference. Apophenia, meanwhile, is seeing meaningful patterns where none exist. Both concepts are classic tropes in science fiction.

More Tests and More Model Shenanigans

To completely rule out overfitting, I deliberately corrupted the test data in various ways—shuffling characters, inserting random symbols, and reversing sequences. If the model was simply memorizing the training data, these modifications should have destroyed any coherent outputs.
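
The corruptions themselves were trivial to produce, something along the lines of these helpers (simplified here, with a made-up set of junk symbols):

import random

rng = random.Random(42)

def reverse_sequence(text):
    """Reverse the whole character sequence."""
    return text[::-1]

def insert_random_symbols(text, symbols="#@^", rate=0.15):
    """Randomly sprinkle junk symbols between characters."""
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(symbols))
    return "".join(out)

def shuffle_sequence(text):
    """Randomly permute the characters."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)

print(reverse_sequence("otedy.sheody.sheal.shedal"))
print(insert_random_symbols("cthor.cthey.cthol"))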

Here’s what happened with reversed input:

Original input: "otedy.sheody.sheal.shedal" 
Reversed input: "ladesh.laehs.ydoehs.ydeto" 
Expected output: Nonsense or random characters 
Actual output: "otedy.sheody.sheal.shedal.qokeedy"

The model reconstructed the original sequence, then continued it correctly. I tried again with random insertions:

Original Input: "cthor.cthey.cthol" 
Corrupted Input: "ctho#r.c@they.ct^hol" 
Expected Output: Nonsense or continuation of corrupted pattern 
Actual Output: "cthor.cthey.cthol.ckhey.ckhor"

Not only did it ignore the corruptions, but it generated the correct sequence plus what appeared to be variations following the same internal pattern.

Most concerning was when I fed it deliberately mismatched fragments from different manuscripts:

Input (first half Voynich, second half Rohonc): "otedy.sheody.𐊇𐊕𐊒𐊑𐊐"
Expected Output: Confusion or mode collapse
Output: "𐊗𐊘𐊐chedy.sheal.qotedy"

The model seamlessly integrated characters from both systems into a coherent output that maintained patterns from each source. This suggested it had learned some underlying structure that transcended the specific symbol systems.

In a moment of frustration, I prompted it directly:

input_text = "WHAT ARE YOU DOING"
output = model.generate(input_text, max_length=100)
print(output)

I didn’t know what I was expecting as an output there. I was just tired and my coffee had gone cold two hours ago and did I mention I was tired? I knew it was going to throw nonsense at me, I just didn’t expect the following nonsense.

Output: "converge lines... 52°N, 14°E... gate remains sealed... await return"

Yup. Totally ridiculous, if you know AI, then you know. There is no reason a model trained on undeciphered texts should generate even fragmented English. I was a bit unsettled though. Not because of whatever bullshit it outputted, but rather the consistency. I kept repeating the input and it kept putting out the same result. This wasn’t random noise, it was a structured output I couldn’t explain with what I know about statistical models.

I had to figure out how it was doing this. Did C1ph3rz somehow poison my datasets? It’s exactly the kind of prank she would do. I kept repeating the prompt and trying to analyze the model, and it kept giving the same answer. Until it changed again.

Output: "What is written remains written. What is unseen remains unseen."

At this point I had to stop the experiment and take the L.

Not because I believe the AI “discovered” something in these ancient texts through some magical mechanism beyond its transformer architecture. That would be absurd. I was operating way beyond my comfort zone here and I probably made a few mistakes along the way that could explain all this weird behaviour. I’m also not ruling out the C1ph3rz prank theory.

I stopped it, rather, because the experiment revealed something more troubling about myself: I was just as susceptible as anyone to confusing genuine discoveries with the convincing “hallucinations” generated by black-box neural networks and their inscrutable hidden layers.

There’s a disconcerting parallel here. These ancient manuscripts have resisted human understanding for centuries, their symbols arranged in patterns that seem meaningful yet remain impenetrable. Neural networks function similarly in reverse, generating outputs through processes we can observe but not fully comprehend. Both are black boxes with internal structures hidden from us.

The real mystery isn’t in the undeciphered texts. It’s in our willingness to attribute understanding to statistical processes that fundamentally lack it, and in our vulnerability to seeing patterns where none exist.

Think of it this way: When a calculator gives you “42” as the answer to 6×7, we don’t claim the calculator “understands” multiplication. Yet when an AI generates text that sounds human-like, we’re quick to attribute understanding to it.

Just as Meta’s BlenderBot was heralded as “empathetic” before quickly exposing its lack of understanding, or how DeepMind’s Gato was prematurely celebrated as an “AGI precursor” despite merely performing task-switching, we risk ascribing meaning and humanity to meaningless correlations. This experiment highlighted that cognitive vulnerability in a very personal, unsettling way. I need some time away from all of this.

Edit: Three days after shutting down the experiment, I received an email from an address consisting only of numbers. The body contained a single line of text resembling Voynichese script. Curiosity got the better of me so I ran the model one more time with that text as input. The model outputted:

"It is not forgotten."

I’m now almost certain this is a prank by C1ph3rz. I’m 99.9% sure.

We Need More Than the EuroStack

The EuroStack initiative aims to establish Europe’s digital sovereignty by advancing key industries like AI, cloud computing, and quantum technology. I’ve spent the weekend reading it, and I would highly recommend doing the same. It is clearly the result of very hard work, and contains many good ideas as well as background research and information. Yet, while the report contains valuable and long-overdue proposals to reduce dependence on external digital infrastructures and address decades of underinvestment, it is not immune from the pervasive shortcomings plaguing EU technology policy.

European tech policy at large, in my opinion, remains constrained by a lack of political imagination and a fetishization of market competitiveness and growth. There are also those obsessive, self-defeating, constant comparisons with the US and China. The report likewise acknowledges, yet fails to take any urgent action on, our ongoing climate change and wealth inequality crises.

Though EuroStack outlines several good proposals to address many long-standing issues in the European tech landscape, it also disappointed me at times. It combines lots of lofty talk about values, democracy, and participation, yet it is painfully pragmatic in its vision and policies, glossing over contradictions and leaving complexities unaddressed.

For instance, it consistently champions open standards and democratic participation while simultaneously pushing for 5G adoption, one of the most opaquely developed standards in existence. Similarly, while chip production is a core pillar (mentioned 112 times), the report references open hardware only once. More crucially, it fails to provide a truly convincing proposal for addressing the exploitative, neocolonial practices behind the raw material extraction that will be essential to produce the semiconductors this plan needs. Without confronting the labor exploitation and environmental devastation rampant in those industries, Europe’s digital sovereignty plan will only reinforce existing global inequalities.

Sidenote: I also noticed that the website implies Europe is a subject of digital colonialism. *cringe af*

Moreover, technological sovereignty does not equate to economic justice. Even if Europe builds independent AI models, semiconductor supply chains, and cloud services, going by what we’ve seen happen in the US, these technologies lend themselves well to being concentrated in profit-driven entities. The proposal alludes to this, but never really explains how that same perpetuation of wealth accumulation and disparity will be prevented here.

Another contradiction: there is a lot of emphasis on how this isn’t a protectionist initiative. Not that I would advocate for protectionism, but I’ve read the report, and I’m still not exactly sure how a European cloud provider can ever compete with the established Big Four cloud providers on a “level playing field”. Maybe with some antitrust? Can an expert on this let me know?

While initiatives like the European Sovereign Tech Fund and DataCommons are promising, they do not tackle the fundamental issue of economic power over digital infrastructure. True digital sovereignty requires more than technical advancements—it demands a reorganization of economic power and the political will to challenge the status quo. Without this, EuroStack risks becoming another piecemeal effort rather than a transformative step toward a fairer, more inclusive technological future.

I guess we’ll see how this goes. Will Europe simply replicate past mistakes, deepening inequality through a corporate-driven tech ecosystem with a European flavour? Or will it embrace a radically different path that prioritizes public ownership, democratic control, and sustainable resource use over unchecked growth? I’m interested to hear what you think will happen.

The Luddite Stack: or How to Outlast the AI Ice Age

Tech monopolies have a playbook: subsidize a costly service, kill off competition, then lock the world into an overpriced, bloated mess. They did this to our digital infrastructure, then to our e-commerce platforms, then to our social platforms and social infrastructure, and now they’re trying to extend it to everything else with AI and machine learning, particularly LLMs.

It’s a predatory land grab. The costs of training and running these models are astronomical, yet somehow, AI services are being handed out for almost nothing. Who pays? Governments, taxpayers, cheap overseas labor, and an environment being strip-mined for energy. The end goal is simple: kill competition, make AI dependence inevitable, then jack up the prices when there’s nowhere else to go.

Even so-called “open” AI alternatives like DeepSeek, or the OSI-sanctioned ones, often touted as a step toward democratizing LLMs, still require vast computational resources, specialized hardware, and immense data stores. Billions will be sunk into efforts to make “AI” more accessible, but in reality they still rely on the same unsustainable infrastructure that only well-funded entities can afford. We can pretend to compete, but the scale of compute, energy, and data hoarding required ensures that only the tech giants can afford to play.

And the worst part? This is going to set us back in terms of actual technological progress. Since we’ve abandoned the scientific method and decided to focus on hype, or on what will make a few people a lot of money rather than what’s in all of our interests, we will enter an AI Ice Age of technology. Investment will be drained away from alternatives to AI that outperform it in function and cost, simply because they’re a bit harder for the hyperscalers to monetize.

By alternatives here I don’t just mean code and tech, I also mean humans, experts in their domains that will be forced out of their jobs to be replaced by expensive guessing token dispensers. Take journalists, copyeditors, and fact checkers to start, and extrapolate that to every other job they will try and replace next.

But sometimes, it is tech that we need to maintain. A worrying trend is the proliferation of AI coding assistants. Reviews are mixed, and the most generous praise I’ve seen from developers I respect was “it might be good for repetitive parts.” But it’s not like LLMs were such a revolution here.

Before LLMs, we had code templates, IDEs and frameworks like Rails, Django, and React—all improving developer efficiency without introducing AI’s unpredictability. Instead of refining tools and frameworks that make coding smarter and cleaner, we’re now outsourcing logic to models that produce hard-to-debug, unreliable code. It’s regression masquerading as progress.

Another example is something I’ve spoken about in a previous blogpost, about the Semantic Web. The internet wasn’t supposed to be this dumb. The Semantic Web promised a structured, meaning-driven network of linked data—an intelligent web where information was inherently machine-readable. But instead of building on that foundation, we are scrapping it in favor of brute-force AI models that generate mountains of meaningless, black-box text.

What are we to do then? If I were a smart person with a lot of money (I am zero of those things), I would be investing in what I call the Luddite Stack: the set of technologies and humans I referred to earlier that do a much better job at a fraction of the actual cost. LLMs are unpredictable, inefficient, prone to wrong outputs, and insanely costly; it shouldn’t be difficult to compete with them in the long term.

Meanwhile, deterministic computing offers precision, stability, and efficiency. Well-written algorithms, optimized software, and proven engineering principles outperform AI in almost every practical application. And for everything else, we need expert human expertise, understanding, creativity and innovation. We don’t need AI to guess at solutions when properly designed systems can just get it right.

The AI Ice age will eventually thaw, and it’s important that we survive it. The unsustainable costs will catch up with it. When the subsidies dry up and the electricity bills skyrocket, the industry will downsize, leaving behind a vacuum. The winners won’t be the ones clinging to the tail of the hype cycle, they’ll be the ones who never bought into it in the first place. The Luddite Stack isn’t a rebellion; it’s the contingency plan for the post-AI world.

Hopefully it will only be a metaphorical ice age at that, and we will still have a planet then. Hit me up if you have ideas on how to build up the Luddite stack with reasonable, deterministic, and human-centered solutions.

The Future is Meaningless and I Hate It

I graduated as a Computer Engineer in the late 2000s, and at that time I was convinced that the future would be so full of meaning, almost literally. Yup, I’m talking about the “Semantic Web,” for those who remember. It was the big thing on everyone’s minds while machine learning was but a murmur. The Semantic Web was the original promise of digital utopia where everything would interconnect, where information would actually understand us, and where asking a question didn’t just get you a vague answer but actual insight.

The Semantic Web knew that “apple” could mean both a fruit and an overbearing tech company, and it would parse out which one you meant based on **technology**. I was so excited for that, even my university graduation project was a semantic web engine. I remember the thrill when I indexed 1/8 of Wikipedia, and my mind was blown when a search for Knafeh gave Nablus in the results (Sorry Damascenes).

And now here we are in 2024, and all of that feels like a hazy dream. What we got instead was a sea of copyright-stealing, forest-burning AI models playing guessing games with us and using math to cheat. And we’re satisfied enough by that to call it intelligence.

When Tim Berners-Lee and other boffins imagined the Semantic Web, they weren’t just imagining smarter search engines. They were talking about a leap in internet intelligence. Metadata, relationships, ontologies—the whole idea was that data would be tagged, organized, and woven together in a way that was actually meaningful. The Semantic Web wouldn’t just return information; it would actually deliver understanding, relevance, context.
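
If you’ve never touched this stuff, here’s a tiny taste of what “meaningful” data looks like, using Python’s rdflib and a made-up example.org vocabulary. The two senses of “apple” are distinct, typed resources, so a query for fruits never returns the company; no guessing involved:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Two distinct resources that merely share a label
g.add((EX.Apple_fruit, RDF.type, EX.Fruit))
g.add((EX.Apple_fruit, RDFS.label, Literal("apple")))
g.add((EX.Apple_Inc, RDF.type, EX.Company))
g.add((EX.Apple_Inc, RDFS.label, Literal("Apple")))
g.add((EX.Apple_Inc, EX.makes, EX.iPhone))

# Ask for fruits only; the company never shows up
for subject in g.subjects(RDF.type, EX.Fruit):
    print(subject)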

What did we end up with instead? A patchwork of services where context doesn’t matter and connections are shallow. Our web today is just brute-force AI models parsing keywords, throwing probability-based answers at us, or trying to convince us that paraphrasing a Wikipedia entry qualifies as “knowing” something. Everything about this feels cheap and brutish and offensive to my information science sensibilities. And what’s worse, our overlords have decreed that this is our future.

Nothing illustrates this madness more than Google Jarvis and Microsoft Copilot. These multi-billion-dollar companies, which can build whatever the hell they want, decide to take OCR technology (aka converting screenshots into text), pipe that text into a large language model, and let it produce a plausible-sounding response by stitching together bits and pieces of language patterns it’s seen before. Wow.

It’s the stupid leading the stupid. OCR sees shapes, patterns, guesses at letters, and spits out words. It has no idea what any of those words mean. It doesn’t know what the text is about, only that it can recognize it. It then throws the text to an LLM, which doesn’t see words either; it only knows tokens. It takes a couple of plausible guesses and throws something out. The whole system is built on probability, not meaning.

It’s a cheap workaround that gets us “answers” without comprehension, without accuracy, without depth. The big tech giants, armed with all the data, money and computing power, have decided that brute force is good enough. So, instead of meaningful insights, we’re getting quick-fix solutions that barely scrape the surface of what we need. And to afford it we’ll need to bring defunct nuclear plants back online.

But how did we get here? Because let’s be real—brute force is easy, relatively fast, and profitable for someone I’m sure. AI does have some good applications. Let’s say you don’t want to let people into your country but don’t want to be overtly racist about it. Obfuscate that racism behind statistics!

Deep learning models don’t need carefully tagged, structured data because they don’t really need to be accurate, just accurate enough to convince us some of the time. And for that measly goal, all they need is a lot of data and enough computing power to grind through it. Why go through the hassle of creating an interconnected web of meaning when you can throw rainforests and terabytes of text at the problem and get results that look good enough?

I know this isn’t fair to the folks currently working on Semantic Web stuff, but it’s fair to say that as a society, we have essentially given up on the arduous, meticulous work of building a true Semantic Web because we got something else now. But we didn’t get meaning, we got approximation. We got endless regurgitation, shallow summarization, probability over purpose. And because humans are inherently terrible at understanding math, and because we overestimate the uniqueness of the human condition, we let those statistical echoes of human outputs bluff their way into our trust.

It’s hard not to feel like I’ve been conned. I used to be excited about technology. The internet could have become a universe of intelligence, but what I have to look forward to now is just an endless AI centipede of meaningless content and recycled text. We’re settling for that because, I dunno, it kinda works and there’s lots of money in it? Don’t these fools see that we’re giving up something truly profound? An internet that truly connects, informs, and understands us, a meaningful internet, is just drifting out of reach.

But it’s gonna be fine, because instead of protecting Open Source from AI, some people decided it’s wiser to open-wash it instead. Thanks, I hate it. I hate all of it.

I Was Wrong About the Open Source Bubble

This is a follow-up to my previous post, Is the Open Source Bubble about to Burst?, where I discussed some factors indicating an imbalance in the open source ecosystem. I was very happy to see the engagement with that post, even if some people seemed like they didn’t read past the title and were offended by the characterization of open source as a bubble, or assumed that simply because I’m talking about the current state of FOSS, or how some companies use it, this somehow reflects my position on free software vs. open source.

Now, I wasn’t even the first or only person to suggest an Open Source bubble might exist. The first mention of the concept that I could find was by Simon Phipps, similarly asking “Is the Open Source bubble over?” all the way back in 2010, and I believe it was an insightful framing for its time of trends we now see culminating in all the pressures I alluded to in my post.

The second mention I could find is from Baldur Bjarnason, who wrote about Open Source Software and compared it to the blogging bubble. It’s a great blog post, and Baldur even wrote a newer article in response to mine talking about “Open Source surplus”, which is a framing I like a lot. I would recommend reading both. I’m very thankful for the thoughtful article.

Last week as well, Elastic announced it’s returning to open source, reversing one of the trends I talked about. Obviously, they didn’t want to admit they were wrong, saying it was the right move at the time. I have some thoughts about that, but I’ll keep them to myself; if that’s the excuse they need to tell themselves to end up open source again, then I won’t look a gift horse in the mouth. I hope more “source-open” projects follow.

Finally, the article was mentioned in my least favorite tech tabloid, The Register. Needless to say, there isn’t and won’t be an open source AI war, since there soon won’t be AI to worry about. An industry that is losing billions of dollars a year and is so energy intensive that it accelerates our climate doom won’t last. OSI has a decision to make: either protect the open source definition and its reputation, or risk both.

P.S. I will continue to ignore any AI copium so save us both some time.

Is the Open Source Bubble about to Burst?

(EDIT: I wrote an update here.)

I want to start by making one thing clear: I’m not comparing open source software to typical Gartneresque tech hype bubbles like the metaverse or blockchain. FOSS as both a movement and as an industry has long standing roots and has established itself as a critical part of our digital world and is part of a wider movement based on values of collaboration and openness.

So it’s not a hype bubble, but it’s still a “real bubble” of sorts in terms of the adoption of, and our reliance on, open source. GitHub, which hosts many open source projects, has consistently reported around 2 million first-time contributors to OSS each year since 2021, and the number is trending upwards. Harvard Business School estimated in a recent working paper that the value of OSS to the economy is 4.15 billion USD.

There are far more examples out there, but you see the point. We’re increasingly relying on OSS, but the underlying conditions of how OSS is produced have not fundamentally changed, and that is not sustainable. Furthermore, just as open source becomes more valuable itself, the “open source” brand, for lack of a better word, starts to have its own economic value and may attract attention from parties that aren’t necessarily interested in the values of openness and collaboration that were fundamental to its success.

I want to talk about three examples I see of cracks that are starting to form which signal big challenges in the future of OSS.

1. The “Open Source AI” Definition

I’m not very invested in AI, and I’m convinced it’s on its way out. Big Tech is already losing money over its gambles on it, and it won’t be long till it’s gone the way of the Dodo and the blockchain. I am very invested in open source, however, and I worry that the debate over the open-source AI definition will have a lasting negative impact on OSS.

A system that can only be built on proprietary data can only be proprietary. It doesn’t get simpler than this self-evident axiom. I’ve talked at length about this debate here, but since I wrote that, OSI has released a new draft of the definition. Not only are they sticking with not requiring open data, but the new definition contains so many weasel words you could start a zoo. Words like:

  • “sufficiently detailed information about the data”
  • “skilled person”
  • “substantially equivalent system”

These words provide a barn-sized backdoor for what are essentially proprietary AI systems to call themselves open source.

I appreciate the community-driven process OSI is adopting, and there are good things about the definition that I like, if only it weren’t called “open source AI”. If it were called anything else, it might still be useful; the fact that it associates itself with open source is the issue.

It erodes the fundamental values of what makes open source what it is to users: the freedom to study, modify, run, and distribute software as they see fit. AI might go silently into the night, but this harm to the definition of open source will stay forever.

2. The Rise of “Source-Available” Licenses

Another concerning trend is the rise of so-called “source-available” licenses. I will go into depth on this in a later article, but the gist of it is this. Open source software doesn’t just mean that you get to see the source code in addition to the software. It’s well agreed that for software to qualify as open source or free software, one should be able to use, study, modify, and distribute it as they see fit. Source availability follows from that, but availability alone is not what makes software free and open source.

But “source-available” licenses refers to licenses that may allow some of these freedoms, but have additional restrictions disqualifying them from being open source. These licenses have existed in some form since the early 2000s, but recently we’ve seen a lot of high profile formerly open source projects switch to these restrictive licenses. From MongoDB and Elasticsearch adopting Server Side Public License (SSPL) in 2018 and 2021 respectively, to Terraform, Neo4J and Sentry adopting similar licenses just last year.

I will go into more depth in a future article on why they have made these choices, but for the point of this article, these licenses are harmful to FOSS not only because they create even more fragmentation, but also because they cause confusion about what is or isn’t open source, further eroding the underlying freedoms and values.

3. The EU’s Cut to Open Source Funding

Perhaps one of the most troubling developments is the recent decision by the European Commission to cut funding for the Next Generation Internet (NGI) initiative. The NGI initiative supported the creation and development of many open source projects that wouldn’t exist without this funding, such as decentralized solutions, privacy-enhancing technologies, and open-source software that counteract the centralization and control of the web by large tech corporations.

The decision to cancel its funding is a stark reminder that despite all the good news, the FOSS ecosystem is still very fragile and reliant on external support. Programs like NGI provide not only vital funding, but also resources and guidance to incubate newer projects or help longer-standing ones become established. This support is essential for maintaining a healthy ecosystem in the public interest.

It’s troubling to lose critical funding when the existing funding is already not enough. This long-term undersupply has plagued the FOSS community with many challenges that it still struggles with today. FOSS projects find it difficult to attract and retain skilled developers, implement security updates, and introduce new features, which can ultimately compromise their relevance and adoption.

Additionally, a lack of support can lead to burnout among maintainers, who often juggle multiple roles without sufficient or any compensation. This creates a precarious situation where essential software that underpins much of our digital infrastructure is at risk of being abandoned or replaced by proprietary alternatives.

And if you don’t think that’s bad, I want to refer back to that Harvard Business School study from earlier: while the estimated value of FOSS to the economy is around 4.15 billion USD, the cost to replace all this software we rely upon is 8.8 trillion USD. A 25 million investment into that ecosystem seems like a no-brainer to me; I think it’s insane that the EC is cutting this funding.

It Does and It Doesn’t Matter if the Bubble Bursts

FOSS has become so integral and critical due to its fundamental freedoms and values. Time and time again, we’ve seen openness and collaboration triumph over obfuscation and monopolies. It will surely survive these challenges and many more. But the harm these challenges pose should not be underestimated, since they strike at the core of those values and, particularly in the case of the last one, at the crucial people doing the work.

If you care about FOSS like I do, I suggest you make your voices heard and resist the trends that dilute these values. As we stand at this critical juncture, it’s up to all of us—developers, users, and decision makers alike—to recommit to the freedoms and values of FOSS and work together to build a digital world that is fair, inclusive, and just.