Knowing the data when designing software systems

Most modern software systems, such as consumer products (social media, advertisements), have data as their most important component. Many applications are built on top of analytics over various kinds of sensor data. In fact, in the book "The Second Machine Age", authors Erik Brynjolfsson and Andrew McAfee identify data science as the driver of technological innovation in the second decade of the 21st century.

A "data-driven" software system is one that processes raw data to produce more meaningful interpretations or information. This processing can be as simple as cleaning the raw data to generate XML files, or as complex as running analytics to find out whether two people are likely to meet for coffee. To get any useful information or analytical outcome from the data, we should first know its little intricacies. Among the many things we should know, three are generic and important enough to apply to most well-designed software systems: the organization of the data within the software, how the data behaves when it is modified, and which parts of the data are frequently modified (I call these the "hot spots").

In this post, I briefly touch upon these three characteristics of the data to keep in mind while designing software. These are lessons from my own experience and may be limited in their scope. However, I have found that when these aspects of the data are neglected, the software system generally runs into design problems and bugs.

1. Organization of the data within the software
Data is often read from a database like MySQL, where it is organized in tables. While it is important to understand the schema of a table and the data type used in each of its columns, the more important thing to consider is how your program uses the data in that table. For instance, a table that stores network addresses in network byte order forces every application that uses it to convert each address to host byte order every time an entry is read. Frequent conversion from network byte order to host byte order may put unnecessary computational load on the program. If we cannot change the schema to store addresses in host byte order, we could cache the addresses in host byte order after reading them from the table. This saves the time the program would otherwise spend fetching the record from the database and converting it again. Hence we should understand how our software uses the data in order to design better data structures and program flows that interface well with the data and the database.
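
The caching idea above could look roughly like the following sketch. The class name AddressCache, the fetch_row callable and the idea of keying rows by an integer id are all assumptions made for illustration; they are not part of any particular database API. The cache stores the already converted value, so a repeated read skips both the database fetch and the byte order conversion.

```python
import socket
import struct


class AddressCache:
    """Caches IPv4 addresses as host integers, keyed by a (hypothetical) row id."""

    def __init__(self, fetch_row):
        # fetch_row is an assumed callable: row_id -> 4 bytes in network byte order.
        self._fetch_row = fetch_row
        self._cache = {}

    def get(self, row_id):
        """Return the address as a host integer, converting only on a cache miss."""
        if row_id not in self._cache:
            raw = self._fetch_row(row_id)
            # "!I" reads the 4 bytes as an unsigned int in network (big-endian) order.
            (host_value,) = struct.unpack("!I", raw)
            self._cache[row_id] = host_value
        return self._cache[row_id]

    def invalidate(self, row_id):
        """Drop a cached entry, for example when the row changes in the database."""
        self._cache.pop(row_id, None)


# Stand-in for a real database read, counting how often it is called.
fetch_calls = []


def fake_fetch(row_id):
    fetch_calls.append(row_id)
    return socket.inet_aton("192.168.0.1")


cache = AddressCache(fake_fetch)
first = cache.get(1)   # fetches and converts
second = cache.get(1)  # served from the cache
```

A cache like this is only safe if entries are invalidated when the underlying row changes, which is one more reason to understand how the data behaves when it is modified.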

2. How the data behaves when it is created, modified or deleted
Traditional databases allow users to set up alerts on database tables. An alert is a notification sent to the user program when an entry of interest in a table changes. The change can be the creation, update or deletion of a row in the table. In all three cases the program is expected to update its state and act accordingly. Responding to a create or modify notification is usually straightforward: the program looks at what was added or modified and sets its internal state accordingly. Handling the deletion of a record is, however, non-trivial. Some database management interfaces, for example ones built on OVSDB (the Open vSwitch Database Management Protocol), may not tell the program which rows were deleted and instead provide only the current snapshot of the database tables. To identify the deleted entries in that case, the program has to maintain a shadow copy of the table in memory and compare it with the current database snapshot. This procedure is compute-intensive, because we have to walk all the entries of the table to figure out which ones were deleted.
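
The shadow copy comparison described above can be sketched as follows. This is a minimal illustration that assumes each snapshot is available as a dictionary mapping a row key to the row contents; the function name diff_snapshots and the sample rows are made up for the example and do not come from any OVSDB library.

```python
def diff_snapshots(shadow, current):
    """Compare the in-memory shadow copy with the current database snapshot.

    Both arguments are dicts mapping a row key to the row contents.
    Returns three dicts: created rows, modified rows and deleted rows.
    """
    created = {key: current[key] for key in current.keys() - shadow.keys()}
    deleted = {key: shadow[key] for key in shadow.keys() - current.keys()}
    modified = {
        key: current[key]
        for key in current.keys() & shadow.keys()
        if current[key] != shadow[key]
    }
    return created, modified, deleted


shadow = {"row1": {"port": 1}, "row2": {"port": 2}, "row3": {"port": 3}}
current = {"row1": {"port": 1}, "row2": {"port": 20}, "row4": {"port": 4}}

created, modified, deleted = diff_snapshots(shadow, current)

# After handling the changes, the current snapshot becomes the new shadow copy.
shadow = dict(current)
```

Every call touches all keys of both snapshots, which is the cost mentioned above: the work grows with the size of the table, not with the number of rows that actually changed.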

3. Parts of the data that are frequently modified, or hot spots in the data
Another important property of the data to keep in mind is which parts of it are modified most frequently. This knowledge is useful for designing efficient and scalable software systems. If we know which parts of the data change most often, we can design the data structures of our program to handle the churn caused by these hot spots efficiently. Consider a toy example. Let there be a database table of two-tuples <A, B>, where both A and B are positive integers. Say we store the tuples in a hash table with A as the key. If we search for the tuple whose A value is A', we find it in constant time on average (assuming we used the world's greatest hash function to spread the tuples evenly across the hash table). However, if we search the same hash table for a tuple whose B value is B', we have to walk the entire hash table (too bad, even the world's best hash function didn't help us here). If a program using this design has to process a lot of queries on column B, it will walk the entire hash table every time such a query is made. Clearly this hash table design is not suitable for the program. For better performance, the hash table needs to be keyed on column B rather than column A. Since column B is the more frequently searched and queried element, it may be referred to as the hot spot in the data for this program. Hence knowing which columns of the database table are likely to be accessed or modified frequently is useful in designing better data structures for your software.
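
Here is a small sketch of the <A, B> example. Instead of rekeying the table on B, this variation keeps the original hash table keyed on A and adds a second hash table keyed on B, so that lookups on either column avoid a full walk. The class TupleTable and its method names are hypothetical, and the sketch assumes that A is unique while several tuples may share the same B.

```python
from collections import defaultdict


class TupleTable:
    """Stores <A, B> tuples with a hash index on A and a second one on B."""

    def __init__(self):
        self._by_a = {}                 # A -> B (A assumed unique)
        self._by_b = defaultdict(set)   # B -> set of A values sharing that B

    def _unindex(self, a, b):
        # Remove A from the B index and drop the bucket once it is empty.
        self._by_b[b].discard(a)
        if not self._by_b[b]:
            del self._by_b[b]

    def insert(self, a, b):
        """Insert a tuple, or update B for an existing A."""
        if a in self._by_a:
            self._unindex(a, self._by_a[a])
        self._by_a[a] = b
        self._by_b[b].add(a)

    def delete(self, a):
        """Remove the tuple with the given A; raises KeyError if it is absent."""
        b = self._by_a.pop(a)
        self._unindex(a, b)

    def find_by_a(self, a):
        """Lookup on A: one hash probe, no walk over the table."""
        b = self._by_a.get(a)
        return None if b is None else (a, b)

    def find_by_b(self, b):
        """Lookup on B: one hash probe plus the size of the matching bucket."""
        return sorted((a, b) for a in self._by_b.get(b, ()))


table = TupleTable()
table.insert(1, 10)
table.insert(2, 10)
table.insert(3, 30)
matches = table.find_by_b(10)
```

The second index is not free: every insert, update and delete now has to maintain two structures. That trade is worth making only when column B really is a hot spot for the program.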
