Monday, September 8, 2014

Backward compatibility sins

Maintaining backward compatibility is one of the most important values for any software library, tool, or system used by other systems via its API. However, as a system evolves, maintaining compatibility gets harder, and sometimes it becomes impossible to improve the system in a desired way, because that would mean breaking compatibility. At those points a tough decision has to be made: maintain compatibility or break it.
The list below is not complete by any means, but it shows a few examples where I doubt that staying backward compatible was the right decision. For each I also add what I think would have been the right decision and what we can learn from the mistakes that were made.

  1. WinMain [dead parameter in rarely used function]

    In the Windows API, the program's entry point is declared as follows:
    int CALLBACK WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow);
    Note the second argument: it is always NULL!
    This argument had a meaning in 16-bit Windows, but it was removed entirely in 32-bit Windows. So the parameter is in effect meaningless and exists only for backward compatibility. While that seems to make sense at first, consider that Win16 and Win32 are not entirely compatible anyway! Applications had to be migrated from one to the other, and an application has exactly one WinMain.
    As a consequence, what you see here is short-term backward compatibility (Win16 died quite soon after Win32 appeared) at the cost of long-term API pollution. All for something as trivial as the application entry point (which could have been handled by a preprocessor macro or a tiny shim).
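    A minimal sketch of the alternative (WinMain32 is an invented name, not a real Windows API): the OS calls a clean three-parameter entry point, while a small compatibility shim, linked only into ported Win16 code, forwards to the old four-parameter WinMain with the dead argument filled in:
    /* Legacy application code keeps its four-parameter entry point: */
    int CALLBACK WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow);

    /* Hypothetical shim: the dead parameter lives here, not in the API. */
    int CALLBACK WinMain32(HINSTANCE hInstance, LPSTR lpCmdLine, int nCmdShow)
    {
        return WinMain(hInstance, NULL, lpCmdLine, nCmdShow);
    }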

  2. WPARAM ["hungarian compatibility"]

    In Windows, the signature of the window procedure looks like this:
    LRESULT WINAPI DefWindowProc(HWND hWnd, UINT Msg, WPARAM wParam, LPARAM lParam);
    and the parameter in question is wParam - its datatype, to be exact.
    The point is that the "W" stands for "WORD" in both the datatype and the parameter name. This was true in Win16 (WORD = 2 bytes), but not since Win32 (the WORD datatype is still 2 bytes, but WPARAM is now 4 bytes, and 8 on 64-bit Windows).
    There are two issues apparent here:

    • Hungarian notation is a bad idea, and one of the reasons is right in front of you (in case you haven't noticed: the parameter name LIES to you)
    • Generic datatypes (like int) are redefined as something else so that they can be changed later. That was done in this case too, except that the name encoded the original type, so it LIES to us now.
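    A simplified reconstruction of how the typedef drifted (the Win32 definition matches WinDef.h; the Win16 one is from old documentation):
    #include <stdint.h>

    typedef unsigned short WORD;   /* 16 bits, then and now */
    typedef uintptr_t UINT_PTR;    /* stand-in for the real Windows typedef */

    /* Win16: the name matched the type.
       typedef WORD WPARAM;                                           */

    /* Win32 and later: pointer-sized (32 or 64 bits), yet the "W" in
       WPARAM and the "w" in wParam still claim it is a WORD: */
    typedef UINT_PTR WPARAM;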
  3. Double.parseDouble(null) vs. Integer.parseInt(null) [bug-compatibility]

    In Java:
    Integer.parseInt(null) // throws NumberFormatException
    Double.parseDouble(null) // throws NullPointerException

    This inconsistency originates from old versions of Java and is kept for backward compatibility. It is documented behavior, so it's "a feature".
    What this actually is is called bug-compatibility. The funniest part is that both of these methods can throw NumberFormatException, so the fix is quite simple and would hardly break anything badly. I mean, if you handle the exception properly, it will just work; otherwise you probably have quite a buggy system already, and one bug more or less doesn't make a difference...
    Most importantly, these two methods are very old. Double.parseDouble() dates back to Java 1.2; there is no such number on the other, but it is probably from around the same time. THEY REALLY, REALLY COULD HAVE FIXED THIS BACK THEN! Instead, Sun maintained backward, sorry, bug-compatibility, just to watch the bug get harder and harder to fix.

  4. Java generics vs. C# generics [focus on past, not on future]

    Both languages started out with collections of Object, only to find out that they had destroyed a lot of type safety in exchange for more verbose code (a lose-lose situation, that is). How did they add generics later without breaking backward compatibility of the language?
    Java went the hard way by turning the existing non-generic collections into generic ones. They faced three issues:

    • both non-generic and generic collections should be available (so that old code still compiles)
    • convertibility between generic and old non-generic variants (to allow mixing old and new code)
    • behavior compatibility (non-generic accepts anything, generic is limited)
    It was easy and correct to default to Object for non-generic collections. Problems arose for generic collections that are more specific than Object. The solution was to make the generic argument syntactic sugar that exists only at compile time: the collection is still what it was before, and the compiler merely inserts the casts automatically. This was done because the old (existing) collections had always accepted anything, so if an exception were introduced for an incompatible type, existing code would break. A non-generic collection was made convertible to any generic collection (whatever the argument). That in turn added two new issues:
    • what happens if a non-generic collection is converted to a generic one with an incompatible argument?
    • how is type safety enforced in new code, when under the hood there is still the old non-generic collection?
    Java's creators went the easy way in both cases: they accepted ClassCastException for the first and completely forbade direct conversion of one generic collection to another (i.e. a List<Integer> can't be directly cast to a List<Number>).
    What went wrong here? Four things:
    • a generic collection can be passed to old non-generic code, which is free to insert anything - and that will only explode later, in the new code!
    • no generic type can be cast to another one, even when the arguments are compatible; you have to work around that by casting through a non-generic collection
    • generics only exist at compile time; there is no runtime type checking
    • you can't make new classes generic-only; they can still be used without generic arguments, where Object is assumed
    What could they have done instead? Make collections aware of their generic argument and throw an exception when an incompatible object is inserted. That would have accomplished the following:
    • type safety of generic collections - one would simply never contain incompatible objects
    • casting of references would simply work, as the protection is there at runtime
    • passing a generic collection to old code would reveal bugs (an object of the wrong type being inserted) or expose invalid assumptions about it ("oops, it's not a String-only collection")
    Yes, this approach could break old code. But the alternative that was chosen made all new code suck. Looking forward, new code will slowly outnumber the old, and Java as a language will have inferior generics forever. (Amusingly, Java later shipped exactly this mechanism as an opt-in: Collections.checkedList() wraps a list and throws ClassCastException the moment an incompatible element is inserted.)

    C# took a different approach here. It simply added generics as something completely new, not compatible with the old collections in any way. Not ideal, as interoperability between old and new code is troublesome. But looking forward, old code will die out. So IMO it's a better approach than Java's.

  5. C++ compatibility with C [not quitting in time]

    So, C++ is designed to be compatible with C - "a valid C program is a valid C++ program", as they say... Well, not really, for several reasons:

    • the compatibility is lost with the first new keyword introduced (*cough* class *cough*) - what used to be a valid identifier no longer is
    • C++ has different linkage because of name mangling, which makes it incompatible with C. Worse, C libraries are now forced to add extern "C" markers under an __cplusplus #ifdef to make themselves consumable from C++ (the boilerplate is shown right after this list)
    • enums and structs have tag names in C, but these are real type names in C++
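    This is the boilerplate in question, carried by virtually every C header that wants to stay usable from C++:
    /* some_c_library.h */
    #ifdef __cplusplus
    extern "C" {               /* tell C++ not to mangle these names */
    #endif

    void c_library_function(int x);

    #ifdef __cplusplus
    }
    #endif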
    What could they have done? Well, actually they did the right thing - just for far too long. If C++'s goal was to completely replace C, it failed to do that. And it's long past time for it to become an independent language and throw some of the old C junk away (sure, introduce some constructs to access C from C++; we have so many of them that a few more wouldn't really matter).
    What breaking ties with C would achieve:
    • string literals could become real std::string objects with all their functionality (like concatenation using "+")
    • arrays could be std::array by default (being assignable is the first win)
    • a lot of the standard C library could be wrapped in C++ functions that accept C++ types - imagine printf() accepting std::string (a sketch follows this list)
    • forget extern "C": you could just have something like #cinclude for C headers
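    For the third point, a minimal sketch of the kind of wrapper meant above (print is an invented name, not a proposed standard function):
    #include <cstdio>
    #include <string>

    // C++ types on the outside, the C library doing the work underneath.
    inline void print(const std::string &s)
    {
        std::fputs(s.c_str(), stdout);
    }

    // usage: print(std::string("Hello, ") + userName + "\n");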
Lessons that can be learned

  • A compatibility break that is almost guaranteed to have a very small impact is worth making (WinMain)
  • If you redefine some type via typedef, make the new type more generic, so that you can change it later (i.e. "an integer of size at least X")
  • Hungarian notation is a bad idea, and the full-blown variant is ten times worse
  • Bugs should be fixed! A fix that breaks something of minor importance will earn you a few rants from people who are exactly the ones to ignore (I can't imagine a good developer complaining about a fixed bug, even if it broke something in his own buggy code)
  • New code or file formats will gradually outnumber the old by a large margin, so look forward, not backward
  • If you fail to maintain full compatibility, use it as an opportunity to break things for a better future
  • The number of "breaks" doesn't matter; what matters is the overall pain introduced by the compatibility break. If you broke something important, keeping minor things compatible won't help much.
  • Bonus: it's not possible to maintain backward compatibility forever, so plan the break in advance and don't give false promises.

Monday, July 14, 2014

The lost war against duplicate code

From what I've seen so far, duplicate code is impossible to avoid in any large project. There are multiple ways duplicate code gets created, and while it is typically assumed that duplicate code is bad, this is not always the case.

Why duplicate code is bad

  • Duplicate bugs - it's obvious: if a bug is discovered in the code, the same bug exists everywhere that code was duplicated, so there are many places to fix instead of one
  • Hard to maintain - pretty much the same as the previous point, but broader. You not only fix bugs, but also add features, optimizations, and other improvements. Worse, duplicated code diverges over time, which makes it harder to spot.

What "justifies" code duplication

  • Easier to maintain - while we usually claim the opposite, there is some truth in this one. By copying code written by someone else, you become free to change it any way you want; changing common code is harder and often requires agreement across multiple involved parties. Bust: it may look that way, but it makes the code base larger, which in turn makes it harder to maintain.
  • More freedom to change - common code has to remain common, that is, you can't add your specific features to it. The biggest problem is that this is an organizational issue: if code is duplicated to gain more freedom to change it, that indicates a problem with management or company culture.
  • Faster to develop - everything that requires the involvement of multiple parties takes more time to do. Bust: it's a short-term gain; you usually lose in the long run (unfortunately, short-term gains are all that many managers care about).

How duplicate code happens

  • Incompetence - it's sad, but there are a lot of bad developers. Many of them write code via copy-paste, and, as always, abusing copy-paste produces duplicates. This is what is usually assumed when talking about duplicate code, and yes, this is what we should fight.
  • Forgot to refactor - this is trickier. It's like the first case, except that the developer is actually not bad. It's fine to use copy-paste to get things working; the problem is that you have to refactor at the end, and not forgetting to do that is the hardest part... There is a gray area between this and the first case. Code review might be the answer here.
  • Too much trouble - sometimes avoiding code duplication is more trouble than it's worth. A place for the common code might not even exist! Create a library just for a couple of functions? Don't forget that this brings the entire maintenance burden of a library with it. There is also often such a thing as code ownership, and the shared code may be owned by someone else. In short, we avoid code duplication to reduce problems, not to add new ones; when that is not the case, duplicating code can be acceptable.
  • Created naturally - two developers might actually write almost identical code independently. In large projects with a lot of people this does happen, and it might take a while to discover that two guys from completely different teams wrote nearly identical helper functions.
So, to summarize: next time, before blaming someone for incompetence, have a second thought.

Sunday, May 4, 2014

Exception handling is mostly a failure

In short: exceptions are good for system and critical errors (like out of memory). The simpler and more expected an error is, the less useful and the more trouble exceptions become.

Error handling is hard. Not doing it properly comes back as mysterious failures where no one can understand what went wrong. Doing it properly is a pain in the ass, mostly because it takes a lot of boring coding time when the thing already works! Really, most of us probably just code the happy path first, prove it works, and then go on to handle all the possible not-so-happy cases. This is generally the right thing to do - what's the point of handling the errors when you're not yet sure the solution is right?

Sinking among ifs

That's the general idea behind exception handling. A typical example given to students looks like this:
if(open_file()) {
  if(read_file()) {
    if(process_data()) {
      show_result();
    }
    else {
      error("Failed to process data");
    }
  }
  else {
    error("Failed to read file");
  }
  close_file();
}
else {
  error("Failed to open file");
}
The calls on the success path (open_file(), read_file(), process_data(), show_result()) are the "good code"; everything else is there for error handling. It would be very nice to write all the "good" function calls one after another and move the error handling code somewhere else - welcome, try-catch!
try {
  open_file();
  read_file();
  process_data();
  show_result();
  close_file();
} catch(Exception e) {
  // determine the error and report it here
}
Nice, we have separated the happy path from the error handling code; now it's easy to understand what the code does!
Real-life situations are not so nice...

There are different types of errors

  • Disasters: things that generally shouldn't happen, like a hard disk crash. Some errors are so rare and so fatal that it's pointless to even try preparing for them.
  • Fatal errors: stuff that renders the application unusable, e.g. losing the network connection is fatal for a web application.
  • Expected mistakes: the user hasn't filled in required fields? The specified file name contains invalid characters? Such errors are predictable, and applications should be ready for them.
  • Glitches: a string "15 " (with a trailing space) is, in 99% of cases, the integer 15, dammit.
The interesting thing is that exactly the same error can belong to a different group depending on the situation. A failure while writing to a file can mean that the primary hard disk has just crashed and in a few seconds the entire computer will be unusable - or it can just mean that the user has unplugged a USB stick. And who said that failure to open a file is fatal? No config file - assume hard-coded defaults.

Opening a file is so difficult

... So, we are opening a configuration file that is not required to exist...
FILE *file = fopen(filename, "r");
Nice: NULL means it does not exist, otherwise it's something we can read!
What's the problem? You can write it the other way around:
FileStream file = null;
if(File.Exists(filename)) {
  file = new FileStream(filename);
}
It does the same thing. Does it? Congratulations, you've just introduced a full-moon bug! Files sometimes disappear, you know - they get deleted. That can happen at any point in time, for example right between the existence check and the opening... Fatal error, crash, or... wait, that file was never required to be there in the first place? So now the code becomes:
FileStream file = null;
try {
  file = new FileStream(filename);
}
catch (Exception e) {
  // ignore
}
Wonderful: what used to be one line is now... progress.

How badly can you blow up?

C once again. You call a function and you expect it to return. Is this guaranteed? No! The application might die inside the call, but then we don't care. longjmp() can be called, but again we don't care - unless we set it up ourselves.
Let's "upgrade" to C++. What can happen now? Yes, an exception can be thrown - and there are many types of them! Worse: new types of exceptions can be added in the future!
It's considered good practice to catch only the exceptions you care about and let the others propagate up the call stack. That's fine, but what about the new exception types that might be added in the future? It looks like someone didn't design for the future...
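To make the practice concrete, here is a small self-contained C++ sketch (ConfigError, loadConfig() and useDefaults() are invented names): handle the one failure you understand, and let everything else keep climbing the stack:
#include <iostream>
#include <stdexcept>

struct ConfigError : std::runtime_error
{
    ConfigError() : std::runtime_error("bad config") {}
};

void loadConfig() { throw ConfigError(); }
void useDefaults() { std::cout << "using defaults\n"; }

int main()
{
    try {
        loadConfig();
    } catch (const ConfigError &) {
        useDefaults();  // the one error we know how to recover from
    }                   // anything else (including future types) propagates
    return 0;
}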

Exception safety

There is an amazing thing about exception safety that I still can't explain. C++ is a language whose standard library throws exceptions extremely rarely, yet a topic called "exception safety" is a standard part of its books. When we come to Java and co., where exceptions are thrown here, there, and everywhere, this is somehow forgotten...
obj.foo(x, y);
You can only guess how foo works with x and y, but there's one thing most people seem to assume: all or nothing. If an exception is thrown out of foo(), you want the state of obj unchanged! A simple concept, but not so easy to get right. Throw more exceptions and enjoy more full-moon bugs.
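In C++, the usual way to get "all or nothing" is to do every throwing operation on a temporary and commit with operations that cannot throw - a minimal sketch:
#include <vector>

class Settings
{
    std::vector<int> values;
public:
    void replaceValues(const std::vector<int> &src)
    {
        std::vector<int> tmp(src);  // copying may throw; *this is untouched
        values.swap(tmp);           // commit: swapping vectors never throws
    }
};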

Exception specifications

This is something that pissed me off when I started learning Java. C++ has them too, but there they are optional and no one seems to use them (except the standard library); some even discourage them.
Looking at C#, they threw specifications away entirely.
Looking back at Java... ArrayIndexOutOfBoundsException, PersistenceException, and several others are "unchecked" exceptions, so you don't need to declare them all over the place. Are the two I just mentioned really so "unexpected"?

Conclusions

  • Exception handling works well for critical errors. The less serious the error, the less effective exception handling is; for simple errors, exceptions are more trouble than help.
  • Exceptions are designed to separate useful code from error handling code. When exception handling machinery appears inside nested code blocks, that's the first sign of exception misuse.
  • I haven't even mentioned that exceptions are also expensive in terms of performance...

Monday, March 31, 2014

Darkest corners of C++

It is good to know the language you are programming in.

Placement-new array offset

It turns out that on some compilers new[] might allocate an integer in front of the actual array even when placement-new is used:
void *mem = new char[sizeof(A) * n];
A *arr = new(mem) A[n];
Whether mem and arr point at the same address depends on the compiler and the code. On GCC I get the same pointers, but on the Microsoft compiler, when A has a destructor, arr is mem + sizeof(int) or similar (the extra integer stores the element count, so that delete[] knows how many destructors to call). While this mismatch might look harmless at first sight, it isn't - your array runs past the end of the allocated memory!
Solution: cast the pointer yourself and loop over the array, constructing each object individually via scalar placement-new, as sketched below.
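A sketch of that workaround: with scalar placement-new there is no hidden bookkeeping, but you also take over destruction:
#include <cstddef>
#include <new>

struct A { /* ... */ ~A() {} };

void make_and_destroy(std::size_t n)
{
    void *mem = operator new(sizeof(A) * n);
    A *arr = static_cast<A *>(mem);

    for (std::size_t i = 0; i < n; ++i)
        new (arr + i) A();          // construct each element in place

    // ... use arr[0] .. arr[n-1] ...

    for (std::size_t i = n; i > 0; --i)
        arr[i - 1].~A();            // destroy manually, in reverse order
    operator delete(mem);
}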

Pointer that changes

class Base { int x; };
class Derived : public Base { virtual void foo() {} };

Derived *d = new Derived;
Base *b = d;
Here b and d will not point to the same address. Comparing and casting them does the right thing, but if you cast both to void*, you'll see they're not equal! This is because Base is non-polymorphic (no virtual methods) while Derived is polymorphic. So the Derived object has a pointer to the vtable at its beginning, followed by the Base sub-object and then by its own additional members.
Things get even funnier when there are many classes in the hierarchy and multiple inheritance is involved.
Solution: well, don't cast pointers to objects into void*.

Return void

This code is valid:
void foo() {}
void bar() { return foo(); }
Useful when writing templates.
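For example, a forwarding wrapper (traced is an invented name) compiles unchanged whether the wrapped function returns a value or nothing:
#include <iostream>

template <typename F>
auto traced(F f) -> decltype(f())
{
    std::cout << "calling...\n";
    return f();                  // legal even when f() returns void
}

void ping() { std::cout << "ping\n"; }
int answer() { return 42; }

int main()
{
    traced(ping);                // the "return void" path
    std::cout << traced(answer) << "\n";
    return 0;
}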

Pure-virtual function with implementation

A pure virtual function means that a derived class must override it before objects can be created. But it does not mean that such a method cannot have an implementation in the base class. The code below compiles and works:
class A
{
public:
  virtual void foo() = 0;
};

void A::foo() { std::cout << "A::foo called\n"; }

class B : public A
{
public:
  virtual void foo() override
  {
    A::foo();
    std::cout << "B::foo called\n";
  }
};
Note that it did not compile for me with GCC when I tried to provide the implementation of A::foo inline, inside the class body - which is actually what the language requires: a function declared pure cannot be defined at the point of its declaration.

Function-try block

This is quite a tricky feature. A function-try block basically looks like this:
void foo()
try {
  throw int();
} catch(...) {
  std::cout << "Exception caught\n";
}
However, in this form there is no particular use for it - it's just a shorter way of wrapping the entire function body in a try-catch.
The real use of this feature (where it also behaves differently) is on constructors. First of all, when used on a constructor, it does not really catch exceptions - it catches and rethrows them! Its real purpose is to free resources allocated in the member initializer list:
class A
{
public:
  A(int x) { throw x; }
};

class B
{
  A a;
public:
  B()
    try
    : a(5)
    { } catch(...) {
      std::cout << "Exception in initializer merely-caught\n";
    }
};
Here the exception is thrown in the member initializer list, and there is no way to catch it inside the constructor body itself. But an initializer list may be long, and some initializers may allocate resources, like memory. To free such resources you have to use a function-try block on the constructor and release them in the catch block. Remember that the exception is automatically rethrown at the end of the handler.

Bitfields

When defining a struct, it is possible to specify member sizes in bits:
struct Bitfields
{
  int i:16;
  int j:8;
  bool b:1;
  char c:7;
};
The size of this structure is 4 bytes (on my machine, at least). Each member of the struct takes exactly as many bits as specified and can hold the corresponding value range.
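A quick check (repeating the struct from above; the exact packing is implementation-defined, so the numbers may differ on other compilers):
#include <iostream>

struct Bitfields
{
  int i:16;
  int j:8;
  bool b:1;
  char c:7;
};

int main()
{
    std::cout << sizeof(Bitfields) << "\n";  // 4 here; may vary elsewhere

    Bitfields f = {};
    f.j = 127;     // fits: 8 signed bits hold -128..127
    f.i = 40000;   // does not fit in 16 signed bits;
                   // what gets stored is implementation-defined
    std::cout << f.j << " " << f.i << "\n";
    return 0;
}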

And, since this is about C++, there are definitely more :)

Sunday, March 2, 2014

On a quest for good coding standard

All coding standards suck, except mine!

Reasons for having a coding standard:

  • Readability - the primary goal of a coding standard in an organization is to make it easier for developers to understand the code. This calls for consistency, meaningful naming, and good comments in the code. It has the greatest impact on newer developers, who are less familiar with the code base.
  • Code quality - an attempt to make it easier to spot bugs in the code, as well as to make the reasoning behind decisions more obvious. Code is expected to be easy to modify or fix without introducing new issues.
Common mistakes
  • Rules, not guidelines. Rules must be followed; guidelines are less strict. Having strict rules that everyone is required to follow sometimes works against the original intent of the standard: developers can't make the code more readable, because they're required to follow the rule.
  • Consider this example:
    int sign_multiplier = x >= 0 ? 1 : -1;
    int sign_multiplier = x>=0 ? 1 : -1;
    Neither is very easy to read, and the whole thing could be written more readably. However, given the choice of the two, I strongly believe the second is more readable: the tight spacing makes the condition read as a single unit. But hey, it breaks one of the most common rules - spaces around operators!
  • Standard set in stone. Changing a standard is not necessarily bad; it depends on what, how, and why you change. Developers change and programming languages evolve, so the standard should too. Otherwise the standard may end up forbidding features that didn't even exist when the standard itself was written.
  • Adopted standard. A standard should be an agreement among developers on how to write code. Simply taking someone else's standard can lead to a situation where some rule in it is hated by every developer. It is the same mistake as a standard created by someone who quite often doesn't even write code himself and then imposed on everyone.
  • No or questionable reasoning. Every rule should have a clear reason, and it's good for guidelines to have one too. For one thing, it helps to identify out-of-date items in the standard. It can also give the standard some kind of "spirit", so that guidelines are not just blindly followed or broken. The reasoning should avoid questionable arguments - e.g. what is readable for one person can be rubbish for another. If a rule or guideline was introduced by consensus or a strong majority, it is good to state that.
  • Trying to solve unrelated problems. Sometimes people try to work around a compiler limitation or bug by introducing a rule into the standard. That's a bad idea: bugs get fixed and limitations get weaker, but standards lag behind. Banning a language feature because "developers coming from another programming language might not understand it" is an example of trying to solve a training or hiring problem with a coding standard, which has nothing to do with either.
Common poor reasons for rules
  • "Makes code more readable." As already mentioned, "readable" is subjective. Some people find CamelCase readable and underscores unreadable; I personally think exactly the opposite. My suggestion is to avoid claiming that something simply "is more readable".
  • Pointing to other standards. Just because many other standards contain a certain rule does not mean yours should too. And it's a completely void argument to claim a rule is good "because company X uses it" (replace X with Microsoft, Sun, Google, whatever...). Use other standards as a source of ideas, find the reasons behind their rules, but don't just blindly adopt them.
  • Claims from long ago. It's the XXI century; we have IDEs, syntax highlighting, etc. We don't need to do anything to make keywords like if, while, or for stand out - it's done for us already. Yet so many standards require a space between if and the opening parenthesis. Not that I'm against it, but the reason behind it is long out of date...
  • Some numbers lie. Fewer symbols does not mean faster to type. Faster means seconds, not keystrokes. I have never found CamelCase any faster to type than underscores.
Does it really matter?

Since I'm proposing guidelines over rules, this is the primary question to ask about any rule: does it really matter?
Obviously some things do matter, like naming conventions, indentation, or brace placement.
Take, for example, the space between a keyword or function name and the opening parenthesis in C-like languages. A coding standard can require the space, forbid it, or... does it really matter? Won't you be able to read the code either way? Yes, consistency matters, but to what extent?

Strict numbers are an almost guaranteed failure

When a standard puts a strict limit on something and that limit is an exact number, there's a MAX+1 or MIN-1 problem:

  • A line of code cannot be longer than 80 characters? So an 80-character line is perfect, but 81 is evil?
  • Identifiers must be at least 3 characters? So variables x and y are terrible names for the X and Y coordinates?
So how should limits be set? Well, that's actually a good question. I think we should look at the whole picture, not at the separate parts. When it comes to size, one thing tends to affect another:
  • Longer identifiers lead to longer code lines
  • Longer lines lead to more lines (wrapping)
  • More lines lead to larger functions
  • Larger functions lead to more functions (splitting functions into smaller)
  • ... and for god's sake, don't put a line limit on a class!
Now let's go through this list in the opposite direction:
  • Artificially splitting a class in two (or more) due to its size is more likely to make the code harder to understand
  • It's easier to debug a single function than to step through several (oops, stepped over instead of into)
  • I personally find wrapped lines harder to read, especially wrapped conditions
  • Very long identifiers almost never make code easier to understand
My definitions of "too long":
  • An identifier is too long if people "refuse" to type it (copy-paste or code completion only); it's horribly too long if they can't remember it exactly
  • A line is too long if you have to read it several times to get what it does (what, not why or how)
  • A function is too long if, while reading it, at some point you lose track of what it does
  • A class can only do too much; it is never too long or too big
Suggestions for making a good standard
  • Start from something very abstract that everyone agrees on. Code readability and consistency are good candidates. These should be the guiding principles for the rest - the "spirit of the standard".
  • Avoid rules, prefer guidelines. It is good to state that guidelines are expected to be followed and deviated from only for a reason (one that better serves the "spirit").
  • When defining rules, seek consensus. No one says it's easy, but you should at least try.
  • Rules shouldn't change, so think twice before turning anything into a rule. CamelCase vs. underscores probably does require a rule.
  • It is good to note in the standard why each rule or guideline is there ("we voted, this was the clear winner").
  • Leave the standard open for future modification. However, it is good to note that the status quo is a strong argument, so changes should require a strong majority. You can also hold periodic standard reviews, with the standard locked for modification between them, but I recommend avoiding this.
  • Don't make one standard for all languages. Just don't.
Tips and trade-offs
  • A standard will be liked by everyone only if "everyone" equals one person! In a group of people, some will always be unhappy. It's probably best when no one is entirely happy.
  • CamelCase seems to be liked by more people, but you can negotiate a mix. I personally quite like CamelCase for classes but underscores for methods.
  • Indentation using spaces looks consistent everywhere without any configuration; with tabs that is almost impossible to achieve. Spaces can also be enforced - auto-replace each tab with n spaces; the reverse (spaces back to tabs) can't be done reliably. Personal observation: tabs in the standard = a mix in the code base.
  • In a large group of people the standard will never be followed 100%. Live with that (but still encourage following the standard).
  • I don't recommend using tools to enforce standard compliance. Warnings in the IDE are a bad idea, because they mix with compiler warnings, and warnings only work when there are close to zero of them.