[T-SQL Tuesday] Data modeling: The trouble with prefixes

T-SQL TuesdayThis month, Mickey Stuewe (b|t) hosted T-SQL Tuesday, giving us a topic of “data modeling gone wrong“. This is a fantastic topic – there are just so many ways that data modeling can go wrong, even by experienced pros – that I wanted to chime in, even though I wasn’t able to get a post written by the deadline for that event.

Almost every company (and every programmer) has certain conventions that they use when developing programs. For the most part, they serve to streamline the process and make it easier for teams to maintain each others’ code. Hungarian notation is frequently a big part of that. While you may not know it by that name, we’ve all seen it: it’s the prefix, often in lowercase letters, that programmers attach to the beginning of the name of an object to denote its type or function. A couple of examples that I use frequently are “fn” to denote a function or “wk” to denote a variable that I use to store intermediate, working values. Useful, right?

Well, most of the time. Just like any good thing, Hungarian notation can be taken too far. Suppose you have a column in your database that stores US zip codes:

intZip int NOT NULL

Looks easy enough. A US zip code is 5 or 9 numeric digits. We can deal with the dash that a 9-digit zip code uses by masking it when it’s output. No problem.

… Except when your company decides to start doing business in Canada. Canada uses 6-character alphanumeric postal codes. The misnomer of “zip code” versus “postal code” is easy enough to ignore, but that prefix says that the value is an integer. Now your schema has a column like this:

intZip varchar(9) NOT NULL

and a query might resemble this:

SELECT name
FROM customers
WHERE intZip = 'M4B1B3';

Hmmm… The prefix isn’t really helpful anymore, is it? It’s actually counterproductive like this: your code states that the value is an integer (people apply meaning to the name, after all – that’s why you used it), but the datatype is a varchar. You either have to live with it or you have to refactor every piece of code that contains a zip code. Most of us will let the discrepancy go at first. It’s only one column, after all. But think about where this leads. One field might not matter so much, but what about the next time a type change is required? And the next? And the next? We change code all the time.

I worked with a system several years ago that was in this state. They had used Hungarian notation to denote the type of the column, not realizing the inevitable consequences. It was a mess. Strings had become integers; integers had become strings; strings had become dates; and so on. They had comments everywhere to remind themselves of what the actual type of the column was, and one of the first steps in every project was to confirm the types of the columns they needed to work with. Every change took them far longer than it would have if they hadn’t used the prefixes to denote the type.

Prefixes can be quite useful. They help make our code self-documenting, and they allow us to more easily distinguish similarly-named objects within a namespace. I don’t code without them. But use them the right way: only use a prefix if you’re certain that the characteristic the prefix describes will never change. And if when it does? Take the time to do the refactoring. Your coworkers will thank you later.


This is my post for week 3 of the November 2015 SQL New Blogger Challenge. Check out all of this week’s posts on Twitter (#SQLNewBlogger).

Ed

Ed Leighton-Dick helps small and midsize businesses solve their most challenging database performance, resiliency, and data security issues at Kingfisher Data, the consulting firm he founded in 2014. He has taught thousands of people at over 200 events, including the world's largest Microsoft data platform conferences, and he has been a leader in the Microsoft data community since 2008. Microsoft has recognized Ed seven times as a Data Platform MVP for his expertise and service to the data community.

6 thoughts on “[T-SQL Tuesday] Data modeling: The trouble with prefixes

  1. This is a perfect example of why using Hungarian Notation doesn’t make sense anymore. Now that we have IDEs where you can simply hover over the field/variable/parameter to get the data type, there is no longer any need to associate the name with the data type.

  2. This is why we got the L8 Meta Data committee and the ISO-11179 standards. They use a
    [_]_ syntax. So we have “emp_id” where “emp” is the attribute, and we looking at the identifier we use for the employees. As opposesd to “emp_phone”, “emp_hire_date”, etc. If more than one employee appears in a relation, we a the role that each plays in it, so “manager_emp_id” and “subordinate_emp_id” could be columns in the same table. There is no mention of the physical storage used, just the logical model.

    Some things have a natural data type. For example, your “zip_code” is at least this:
    zip_code CHAR(5) NOT NULL CHECK (zip_code LIKE ‘[0-9][0-9][0-9][0-9][0-9]’)

    You can add more logic to the CHECK() constraint (‘0000’ and ‘99999’ are not valid codes) or make a REFERENCES to a complete look-up table, if needed.

  3. I’ve always thought Hungarian prefix to be an April Fools prank.

    In the days of notepad and go-to spaghetti code there may have been a use case of dubious value. Now it just takes up file space and maybe burns a few calories.

    It hides the column’s concept and related column’s concept. As suffix at least the related column names be listed together: ZIP_int, ZIPType_str, ZIpthis_str, ZIPdate_dte. A prefix just doesn’t cut it.

    But the notion that adding a machine datatype provides value is a stretch. will intZIP or ZIPint allow the programmer to compete with intellisense and the compiler at their game?

    BTW datatypes at a processing level makes better sense. ZIP_regex implying the column aids in regular expressions, or ZIP_sql implying the columns holds SQL code regarding Zip could be utilized by some higher layer.

    I recommend that object names follow a “noun-adjective-verb-processing datatype” format. Avoid the Hungarian prank.

  4. Hungarian-type prefices are useful when crawling code. Example: when doing impact analysis on a change to the “Object” Table (bad name, seen it)- or maybe just searching the DB for code that writes to that Table- searching for “Object” is likely to result in many spurious references; hits on “tblObject” however are almost certainly relevant. Doesn’t have to be globally common words either; key domain-specific words can present the same problem, without the “bad naming” aspect.

  5. “I worked with a system several years ago that was in this state. They had used Hungarian notation to denote the type of the column, not realizing the inevitable consequences. It was a mess. Strings had become integers; integers had become strings; strings had become dates; and so on. They had comments everywhere to remind themselves of what the actual type of the column was, and one of the first steps in every project was to confirm the types of the columns they needed to work with. Every change took them far longer than it would have if they hadn’t used the prefixes to denote the type.”

    I am with Jason Hopkins on this; Hungarian notation remains useful, because we aren’t always using the IDE. For example, a presentation may have access only to code, tables, or PowerPoint slides, none of which allow “mousing over” a column name to find its type. It seems to me that the correct solution to that the data base editors must support object renaming in much the way that the C# source code editor does so.

    1. I’m not arguing that Hungarian notation doesn’t have its place. It does, and I use it all the time to denote the class of an object (view, procedure, certificate, etc.).

      But we need to be mindful about when we use it. The more likely a property of an object is to change in place, the less useful Hungarian notation becomes. You would never change a procedure to a function or a table to a view without rewriting the object’s code. That rewrite presents the opportunity most of us need to take the time to rename the object in all of the places it’s used. On the other hand, we change data types in place all the time. It can be difficult to justify to management the hours it would take to seek out and rename the column in all of the code that touches that table. In many (maybe even most) cases, that code is not all in the same application.

Comments are closed.