The Semantic Consistence Problem, and Why Generative AI Will Not Solve It
A Republication of the Article ‘The Semantic Consistence Problem, and Why Big Data Will Not Solve It’
To fully understand the current landscape of ‘generative ai’ (we prefer ‘generative data’, or the like), the related computer languages and computing platforms, one also has to comprehend the true nature of the problem facing both the computer scientists and the data scientists of today.
Where Were We?
The world wide web has enabled various new possibilities for massive data streams that simply were not there before. ‘IoT’, as most people perceive it, is not really ‘here’ yet. But ‘generative ai’ is here; it’s all over the Internet.
Thanks to Amazon and co., we also have access to incredible computing resources that were never thought possible, or easily accessible, before.
You would expect everyone to be ‘up to speed’ on these trends. Apparently though, our database infrastructure and computing tools (incl. languages) haven’t quite caught up yet!
Where We Really Are
Fads are not good for society, especially when they involve a herd mentality. Tablets, for example, were a big fad that seems to be slowly fading away, unless or until their technology improves drastically [1]. Fads are wasteful; yet they are not entirely avoidable.
Like tablets, in spite of all the ‘coolness’ of ‘generative ai’, it is not entirely clear that there are any well-defined problems that generative ai is aiming to solve. Yours truly has, for example, come into contact with companies that boast about all the ‘generative ai’ they have running across all their databases, and about industrial equipment data that ‘just needs’ to be analysed. It is never clear what the end goal of that ‘analysis’ is meant to be. In this sense, all the ‘generative ai technologies’, including data platforms such as Hadoop, Spark and DynamoDB, as well as the proliferation of various Machine Learning algorithms and toolsets (such as TensorFlow), may very well be ‘interesting’ but simultaneously have nothing truly of value to offer.
Fads come at a cost, and unfortunately a lot of investment has already been put into these technologies. In that sense, they are ultimately here to ‘stay’…
The Big Holes in Our Generative Ai
Generative ai, when not glamorised and sensationalised, can still help us analyse some problems that have been plaguing many companies for years.
Firstly, we should not take lightly the data accuracy problem that still persists in the generative ai space. Resolving it should be of the utmost priority, before tackling the ‘generative ai’ holes addressed in the following paragraphs.
Secondly, our relational databases are too rigid (deterministic) in their relational forms. SQL databases, for example, store data only in a tabular format. This presents a problem when the data user now wants to see the same data in a time-series or ‘graph’ form.
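To make the mismatch concrete, here is a minimal sketch in Python, using an in-memory SQLite table; the table, column names and values are purely illustrative and not taken from any real system. The database hands back flat rows, and the time-series ‘shape’ has to be rebuilt by the data user in their own code:

```python
import sqlite3
from collections import defaultdict

# Illustrative only: a plain relational table of machine readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (machine TEXT, ts TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("pump-1", "2020-01-01T10:00", 3.2),
    ("pump-1", "2020-01-01T11:00", 3.5),
    ("pump-2", "2020-01-01T10:00", 7.1),
])

# The store only returns rows; the time-series view is reassembled here,
# outside the database, by the person who wants it.
series = defaultdict(list)
for machine, ts, value in conn.execute(
        "SELECT machine, ts, value FROM readings ORDER BY machine, ts"):
    series[machine].append((ts, value))

print(series["pump-1"])  # [('2020-01-01T10:00', 3.2), ('2020-01-01T11:00', 3.5)]
```

The same applies to a ‘graph’ view: the edges and the traversal logic would live in the user’s code, not in the store.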
Thirdly, given the static relational forms in which we store and retrieve data, it is nearly impossible to find meaning (i.e. the ontological ‘whatness’ of information). A symptom of this is the task a data scientist usually performs under the name of ‘data preparation’ or ‘data cleaning’. It typically involves finding common words, data-capture variants/mistakes and numbers that carry the same ‘value’, e.g. the user entered ‘Pta East’ instead of just ‘Pretoria’. The entire task of ‘data preparation’ could be eradicated by merely recording values that are semantically consistent.

Because our filesystems and databases do NOT truly work, we were told ‘garbage in, garbage out’ as we were learning how to use them. That statement is about the biggest load of BS in the history of computing. Even a small child can spot when a deviation in spelling, phrasing or pronunciation still implies the same meaning. In a nutshell, the ‘data preparation’ task we usually perform is a sign that our ‘data’ is semantically broken. Fixing it is a relatively simple step: all we need to do is design a ‘garbage in, DATA out’ filesystem / computer language / database.

But as long as we continue to insist that it is the ‘humans’ that produce garbage, and NOT our computers that are at fault, we will continue to run into walls. Humans do not produce garbage, any more than one would fault a parent for a child speaking a gibberish language. Sure, some of the ‘gibberish’ might be due to the playful chatter the child has with the mother, but this would not entirely be the mother’s fault; the child needs to grow up to understand the mother’s tongue. Note that the mother’s tongue might not always be ‘grammatically correct’, or even semantically accurate, but she will almost always be semantically consistent (e.g. when the mother says ‘gimme that bowl’ instead of ‘give me that ball’ she always means the same thing).

In other words, our computers/databases need semantics that know that when I say ‘bowl’ in the context of a physical form a child is playfully bouncing off a wall, it can only mean ‘ball’ (the same applies to misspellings, which are often semantically consistent, e.g. saying ‘bol’ instead of ‘ball’). If this were done correctly, there would be no need for ‘data cleanup’ or ‘data preparation’ for the incoming data scientist.
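As a rough illustration of what a ‘garbage in, DATA out’ layer could look like at write time, here is a minimal sketch. Everything in it (the field name, the controlled vocabulary, the alias table, the similarity threshold) is an assumption made up for the example, not something from the article; and true contextual disambiguation (the ‘bowl’/‘ball’ case) would need far richer domain semantics than simple string similarity:

```python
import difflib

# Illustrative controlled vocabulary and alias table (assumptions for this sketch).
CANONICAL = {
    "city": ["Pretoria", "Johannesburg", "Cape Town"],
}
ALIASES = {
    "city": {"pta": "Pretoria", "pta east": "Pretoria", "jhb": "Johannesburg"},
}

def canonicalise(field: str, raw: str) -> str:
    """Map a raw entry to its canonical, semantically consistent value at write time."""
    text = raw.strip().lower()
    # 1. Known shorthand and local variants, e.g. 'Pta East' -> 'Pretoria'.
    if text in ALIASES.get(field, {}):
        return ALIASES[field][text]
    # 2. Close misspellings, e.g. 'Pretoira' -> 'Pretoria', via plain string similarity.
    candidates = {term.lower(): term for term in CANONICAL.get(field, [])}
    match = difflib.get_close_matches(text, candidates.keys(), n=1, cutoff=0.75)
    if match:
        return candidates[match[0]]
    # 3. Unrecognised: keep the raw value (and flag it for review) rather than
    #    silently storing 'garbage'.
    return raw

print(canonicalise("city", "Pta East"))   # Pretoria
print(canonicalise("city", "Pretoira"))   # Pretoria
```

The point is not this particular heuristic, but that the correction happens at the point of storage, so the incoming data scientist never sees the inconsistency in the first place.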
Only once we have solved the problem of semantic consistence will we then have to contend with the challenge of ‘making sense’ of all the contained individual semantics.
As for ‘making sense’ of the individual semantics, it is still not clear whether ‘generative ai’ is an appropriate solution. We have already established that, unless the individual semantics are corrected, no ‘accumulative effects’ of larger amounts of data can fix them. But even if we fix that, what the hell do we need ‘generative ai’ for?
The only thing we could be trying to do with semantically proven data is to make sense of sequences (the dynamic data domain) or collections (the static data domain) of data points or events. For sequences, we probably do need longer and larger sequential data sets; but that is (once again) provided that the collected individual data points have semantic proof. Data science on sequences is not necessarily ‘generative ai’, though, since the data is, by design, obliged to be domain specific. The semantics of the data domain, if indeed accurately defined, recorded and applied to each data point, will automatically rule out other data points (those that are semantically deviant/inconsistent), which results in a much narrower set of data points being analysed. Although it is true that larger and larger data sets are clearly needed if the amount and variation/granularity of the semantics are to be increased, there appears to be a much greater emphasis in the literature on larger data sets ‘just for the sake of it’, which obviously creates a headache for the data scientist who then attempts to understand the data.
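To illustrate that narrowing effect, here is a minimal sketch, assuming a made-up domain schema and event stream (none of which come from the article); only events whose values fall within the domain’s semantics survive into the sequence that is actually analysed:

```python
# Illustrative domain schema: the semantically valid values per field.
DOMAIN = {
    "machine": {"pump-1", "pump-2"},
    "state": {"running", "idle", "fault"},
}

events = [
    {"ts": 1, "machine": "pump-1",  "state": "running"},
    {"ts": 2, "machine": "pmp-1",   "state": "runnning"},  # semantically deviant
    {"ts": 3, "machine": "toaster", "state": "on"},        # outside the domain
    {"ts": 4, "machine": "pump-1",  "state": "fault"},
]

def in_domain(event: dict) -> bool:
    """Keep an event only if every field carries a semantically valid value."""
    return all(event[field] in allowed for field, allowed in DOMAIN.items())

# The sequence handed to the analyst is far narrower than the raw stream.
sequence = sorted((e for e in events if in_domain(e)), key=lambda e: e["ts"])
print([(e["ts"], e["state"]) for e in sequence])  # [(1, 'running'), (4, 'fault')]
```

In a system with the write-time canonicalisation sketched earlier, the deviant entries would ideally be corrected rather than discarded; either way, the set that reaches the data scientist is the semantically consistent one.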
Larger data sets only make sense in the context of semantically consistent stored data, and neither our current databases nor our ‘generative ai’ systems support semantic consistence. We don’t need ‘generative ai’ and expensive GPUs; we need semantic consistence in our computer systems.
[1] Yours truly does not claim to be a prophet, but he postulated (five years ago) that tablets would soon not have a ‘place’ in the tech space. The reason being that phones are getting bigger and more capable, while laptops and ‘notebooks’ are getting smaller, thus negating any need for a ‘mid-range’ device that is yet to prove its true utility. It is not entirely clear whether a tablet was meant to be an entertainment device or a productivity device. It has yet to prove its effectiveness in either category, and there appear to be superior options for each (i.e. a phone tends to be a better and more convenient entertainment device, and a laptop tends to be a far superior productivity device).
[2] It is clear that the systems data scientists have to work with are terribly designed. At CAS Digital [now -VINCENT-Lesang__], we have a singular mission of ‘making man and machine friends’, primarily through the means of machine diagnostics. What this means is that everything from mechanical systems, electronic boards, communication buses / ports (e.g. CAN, OBD II), the computers, files, diagnostic process, analytics and UI has to be in sync for this simple mission to be realised. This is a rare occurrence for most engineering companies, and it frankly came more out of necessity than design. The point of highlighting this is that, as a small company and because of our mission, diagnostic problems cannot easily and neatly be separated or ‘silo-ed’ away. It means that ‘data science’ on machine data cannot be neatly separated from how the data is stored. This is not usually the concern of data scientists, and file-system engineers generally do not care about how a data scientist might use the stored data.