Thursday, March 20, 2008
Testing Anti-Pattern: Metaprogrammed Tests
Update at bottom
Update 2 for Saikuro reported cyclomatic complexity
Update 3 for Flog
I despise metaprogrammed tests. The problem with metaprogrammed tests is that they introduce more questions than answers. Tests are supposed to give confidence, but I don't feel very confident when I find myself asking: which assertion failed? what part of the test is wrong? in which loop, at what value, do you think the problem is?
Let's jump straight to an example. The following method on Fixnum will tell you what the letter grade is.
class Fixnum
  def as_letter_grade
    case self
    when 0..59 then "F"
    when 60..69 then "D"
    when 70..79 then "C"
    when 80..89 then "B"
    when 90..100 then "A"
    end
  end
end
50.as_letter_grade # => "F"
60.as_letter_grade # => "D"
70.as_letter_grade # => "C"
80.as_letter_grade # => "B"
90.as_letter_grade # => "A"
For completeness you may wish to test every value between 0 and 100 to ensure that no mistakes are made. Doing this the most straightforward way possible, you would define 101 tests and test every value individually.
class GradeTests < Test::Unit::TestCase
  def test_0_is_F
    assert_equal "F", 0.as_letter_grade
  end

  def test_1_is_F
    assert_equal "F", 1.as_letter_grade
  end

  def test_2_is_F
    assert_equal "F", 2.as_letter_grade
  end
  # ...
end
While this would work, it suffers from a few complications: it's too long to digest and it would be painfully tedious to write. You might jump to the conclusion that you ought to metaprogram the tests to resolve the previously mentioned issues.
class GradeTests < Test::Unit::TestCase
  (0..100).each do |index|
    letter = case index
             when 0..59 then "F"
             when 60..69 then "D"
             when 70..79 then "C"
             when 80..89 then "B"
             when 90..100 then "A"
             end
    define_method "test_#{index}_is_#{letter}" do
      assert_equal letter, index.as_letter_grade
    end
  end
end
This solution isn't so bad at first glance. When a test fails, I can see what number I was working with, what letter I expected and what letter I actually got.
Then I have to actually figure out what is wrong, and this is where I begin to really dislike metaprogrammed tests. The line number is almost worthless. Yes, the loop is on or near that line, but that line doesn't belong exclusively to the failure; it also produced about 100 successful assertions. Also, I always expect the problem to be in the class, but that's not always the case. Metaprogramming in tests is just as susceptible to mistakes as programming the domain. Yet, by instinct, we always look there last, because we expect our tests to give us confidence; they should be correct. The example code is easy enough to follow, but most metaprogrammed tests contain more complexity, thus leading to even more fragile and fear-instilling tests.
Loaded suite /Users/jay/Desktop/foo
Started
..........................................................
..........F................................
Finished in 0.024512 seconds.
1) Failure:
test_70_is_C:32
<"C"> expected but was
<"D">.
101 tests, 101 assertions, 1 failures, 0 errors
Also, if you find yourself wanting to defend metaprogrammed tests, ask yourself if you usually even provide as many clues as I have. Do you create test names that help you figure out what the problem was? Do you first get the letter and then compare it, or do you assert true and false, yielding even less information? If you don't give me at least as much information as I've given myself in my example, I can't even begin to imagine trying to find out what's wrong with a broken test.
The single largest problem with metaprogrammed tests is that they've unnecessarily added complexity to your test suite. This complexity reduces the maintainability of tests, ensuring that they are less likely to be maintained.
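To make the point about mistakes hiding in test-defining code concrete, here is a minimal sketch. It assumes a plain method, letter_grade, standing in for the Fixnum method above (Fixnum has since been removed from Ruby), and a hypothetical expected_letter_in_test that mirrors the case statement a test-defining loop would carry, with a deliberate off-by-one typo. Neither name comes from the original code.

```ruby
# Stand-in for the article's Fixnum#as_letter_grade (assumption: a plain
# method is used so the sketch runs on current Ruby versions).
def letter_grade(score)
  case score
  when 0..59 then "F"
  when 60..69 then "D"
  when 70..79 then "C"
  when 80..89 then "B"
  when 90..100 then "A"
  end
end

# Hypothetical copy of the expectation logic a test-defining loop would hold.
# The 60..70 range is a deliberate typo: the mistake lives in the test.
def expected_letter_in_test(score)
  case score
  when 0..59 then "F"
  when 60..70 then "D"
  when 71..79 then "C"
  when 80..89 then "B"
  when 90..100 then "A"
  end
end

# Scores whose generated test would fail even though the domain is correct.
def scores_with_failing_generated_tests
  (0..100).reject { |score| expected_letter_in_test(score) == letter_grade(score) }
end

puts scores_with_failing_generated_tests.inspect # prints [70]
```

The domain logic is correct here, yet the report for 70 would read just like a domain bug. The loop's own logic has to be read and trusted before the failure can be.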
There is a better way.
You can approach the problem differently and still provide a concise solution. Looking at our issue another way, we simply want to test that certain values return A, B, C, D, or F. To me, that appears like I need 5 different tests, not 101. Here's what I consider to be a more maintainable solution.
class GradeTests < Test::Unit::TestCase
  def test_numbers_that_are_As
    assert_equal ["A"], (90..100).collect {|int| int.as_letter_grade }.uniq
  end

  def test_numbers_that_are_Bs
    assert_equal ["B"], (80..89).collect {|int| int.as_letter_grade }.uniq
  end
  # ... test the other letters
end
The above tests should be readable to anyone very quickly. They correctly provide the line number of a failing test when a test fails. Also, each test verifies only one piece of logic, greatly reducing complexity. Lastly, I can easily see in the test that it's written correctly, so any errors must be resulting from a mistake in the domain.
These tests instill more confidence and they are easier to digest and therefore maintain. These are tests that are more likely to live on and provide value. These are tests I thank my teammates for.
Update
Tammer Saleh correctly points out that the failure message for my last example would actually be worse than the failure message from the metaprogrammed tests. I was aware of that fact when I wrote up the entry, but I was unsure how to address the issue. If I were on a project I would write a custom assertion for expectations that would give me a descriptive error message while also allowing me to easily test what I want. That custom assertion would be well tested and could be designed to be general enough to apply across my entire test suite, thus infinitely more valuable than metaprogramming that only solves a problem for a specific test.
But, this isn't a project, it's an example. Still, I failed, I didn't give the complete answer. This is my attempt to resolve that situation. As I said, on a project I would use expectations, but for the purpose of this entry, I'll provide a custom assertion that could be easily used with test/unit.
The general solution is that I have an enumerable object and I want to verify the result of calling a method on each element of the enumerable. Thus, I should be able to create a general assertion that takes my expected single result, the enumerable, and the block that should be executed on each element. If all elements return actual results that match the expected value then the test passes. However, if any element does not return the expected value, then the expected value, the actual value, and the element are all described in the error message. The error message will contain all failures, not just the first one that fails.
Below is the code in full, but the following code would not be enough if this were a real project. Instead, if this were a real project this custom assertion should be tested with the same amount of effort that you put into testing any domain concept.
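require 'test/unit'

class Fixnum
  def as_letter_grade
    case self
    when 0..59 then "F"
    when 60..69 then "D"
    when 70..79 then "C"
    when 80..89 then "B"
    when 90..100 then "A"
    end
  end
end

class GradeTests < Test::Unit::TestCase
  def test_numbers_that_are_As
    assert_enumerable_only_returns("A", 90..100) {|int| int.as_letter_grade }
  end

  def test_numbers_that_are_Bs
    assert_enumerable_only_returns("B", 80..89) {|int| int.as_letter_grade }
  end
  # ... test the other letters
end

class Test::Unit::TestCase
  # Collects a message for every element whose result differs from expected,
  # then fails once with all of them.
  def assert_enumerable_only_returns(expected, enumerable, &block)
    messages = enumerable.inject([]) do |result, element|
      actual = element.instance_eval(&block)
      result << "<#{expected}> expected but was <#{actual}> for #{element}" if expected != actual
      result
    end
    assert_block(messages.join("\n")) { messages.empty? }
  end
end
Additionally, here are the results from a failing test.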
class GradeTests < Test::Unit::TestCase
  def test_numbers_that_are_Bs
    assert_enumerable_only_returns("B", 78..89) {|int| int.as_letter_grade }
  end
end
# >> Loaded suite -
# >> Started
# >> F
# >> Finished in 0.00063 seconds.
# >>
# >> 1) Failure:
# >> test_numbers_that_are_Bs(GradeTests)
# >> [-:28:in `assert_enumerable_only_returns'
# >> -:17:in `test_numbers_that_are_Bs']:
# >> <B> expected but was <C> for 78
# >> <B> expected but was <C> for 79
# >>
# >> 1 tests, 1 assertions, 1 failures, 0 errors
I would take this solution over any metaprogrammed solution I can think of.
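Since this update argues that the custom assertion should itself be tested, here is a rough sketch of what such a check could look like. It assumes the message-building logic is pulled out into a hypothetical helper, enumerable_mismatches, so it can be exercised without test/unit. The helper name and the sample values are illustrative and not part of the original code.

```ruby
# Hypothetical helper: returns one message per element whose block result
# differs from the expected value (an empty array means everything matched).
def enumerable_mismatches(expected, enumerable)
  enumerable.each_with_object([]) do |element, messages|
    actual = yield(element)
    messages << "<#{expected}> expected but was <#{actual}> for #{element}" if expected != actual
  end
end

# Every element maps to the expected value, so nothing is reported.
puts enumerable_mismatches("x", %w[a b c]) { |_letter| "x" }.inspect

# Two of three elements miss, and both are reported, not just the first.
puts enumerable_mismatches(4, [2, 3, 4]) { |n| n * 2 }
```

The checks deliberately use trivial blocks rather than the grading method, so a failure here would point at the assertion logic and not at the domain.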
Update 2
I decided to check out what the cyclomatic complexity would look like for defining tests in a loop compared to traditional definitions with custom assertions. I used Saikuro to give me cyclomatic complexity results.
Interestingly, the complexity of the looping test definition (8) is more than the complexity of the logic added to Fixnum (6). It's also double the complexity of the custom assertion version (4) of the tests. The custom assertion also registers a score of 4, but that doesn't concern me since I'll test the custom assertion.
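For a feel of where numbers like these can come from, here is a rough sketch that counts one plus the number of branch keywords and block openers in a piece of source text. This is only an approximation written for this example; rough_complexity and BRANCH_TOKENS are hypothetical names, and the counting rule is an assumption, not Saikuro's actual algorithm.

```ruby
# Branch keywords plus block openers; "#{" is skipped so that string
# interpolation is not mistaken for a block. An approximation only.
BRANCH_TOKENS = /\b(?:if|elsif|unless|while|until|when|rescue|do)\b|(?<!#)\{/

def rough_complexity(source)
  1 + source.scan(BRANCH_TOKENS).size
end

GRADE_METHOD = <<~'RUBY'
  def as_letter_grade
    case self
    when 0..59 then "F"
    when 60..69 then "D"
    when 70..79 then "C"
    when 80..89 then "B"
    when 90..100 then "A"
    end
  end
RUBY

LOOPING_TESTS = <<~'RUBY'
  def self.define_tests
    (0..100).each do |index|
      letter = case index
               when 0..59 then "F"
               when 60..69 then "D"
               when 70..79 then "C"
               when 80..89 then "B"
               when 90..100 then "A"
               end
      define_method "test_#{index}_is_#{letter}" do
        assert_equal letter, index.as_letter_grade
      end
    end
  end
RUBY

puts rough_complexity(GRADE_METHOD)  # prints 6
puts rough_complexity(LOOPING_TESTS) # prints 8
```

Both figures agree with the scores reported above for the Fixnum logic and the looping definition, though that agreement should be read as a plausibility check rather than a description of how Saikuro counts.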
For those interested in running the experiment, the code I used can be found below. I defined a class method and called it explicitly because Saikuro reports complexity on a method basis, so I needed a method for it to measure.
class Fixnum
  def as_letter_grade
    case self
    when 0..59 then "F"
    when 60..69 then "D"
    when 70..79 then "C"
    when 80..89 then "B"
    when 90..100 then "A"
    end
  end
end

class GradeTests < Test::Unit::TestCase
  def self.define_tests
    (0..100).each do |index|
      letter = case index
               when 0..59 then "F"
               when 60..69 then "D"
               when 70..79 then "C"
               when 80..89 then "B"
               when 90..100 then "A"
               end
      define_method "test_#{index}_is_#{letter}" do
        assert_equal letter, index.as_letter_grade
      end
    end
  end
  define_tests
end

class GradeTests < Test::Unit::TestCase
  def test_numbers_that_are_As
    assert_enumerable_only_returns("A", 90..100) {|int| int.as_letter_grade }
  end

  def test_numbers_that_are_Bs
    assert_enumerable_only_returns("B", 80..89) {|int| int.as_letter_grade }
  end
end

class Test::Unit::TestCase
  def assert_enumerable_only_returns(expected, enumerable, &block)
    messages = enumerable.inject([]) do |result, element|
      actual = element.instance_eval(&block)
      result << "<#{expected}> expected but was <#{actual}> for #{element}" if expected != actual
      result
    end
    assert_block(messages.join("\n")) { messages.empty? }
  end
end
Update 3
Since I ran Saikuro on the code, it only made sense to put it through Flog also.
The following code was flogged.
class GradeTests < Test::Unit::TestCase
  def self.define_tests
    (0..100).each do |index|
      letter = case index
               when 0..59 then "F"
               when 60..69 then "D"
               when 70..79 then "C"
               when 80..89 then "B"
               when 90..100 then "A"
               end
      define_method "test_#{index}_is_#{letter}" do
        assert_equal letter, index.as_letter_grade
      end
    end
  end
  define_tests
end

class GradeTests < Test::Unit::TestCase
  def test_numbers_that_are_As
    assert_enumerable_only_returns("A", 90..100) {|int| int.as_letter_grade }
  end

  def test_numbers_that_are_Bs
    assert_enumerable_only_returns("B", 80..89) {|int| int.as_letter_grade }
  end
end
The flog score of the looping version was 15.3; the score of the custom assertion version was 6.5.
Both Saikuro and Flog marked the looping test definition with warnings and as a potential problem.
Labels: metaprogramming, testing
Comments:
There seems to be some redundancy there.
class GradeTests < Test::Unit::TestCase
  # ... other previous tests
  def test_all_grades
    assert_equal ["A"], (90..100).grade
    assert_equal ["B"], (80..89).grade
    # ... test the other letters
  end
end

class Range
  def grade
    self.collect { |int| int.as_letter_grade }.uniq
  end
end
When the last version of those tests fails, you get even less information than the metaprogramming version:
["A", "X"] is not equal to ["A"]
You don't know what number caused the failure, or how many times the failure occurs.
"The single largest problem with metaprogrammed tests is that they've unnecessarily added complexity to your test suite. This complexity reduces the maintainability of tests, ensuring that they are less likely to be maintained."
I've worked on projects that had 2000-line test files. No one wanted to touch these tests, or the code that they tested. Using a single "metaprogramming" loop, we reduced one of these files to 96 lines.
The term "metaprogramming" is almost misleading in this context. The examples here are simple loops at the class level, something every Ruby programmer should be comfortable reading.
"When the last version of those tests fail, you get even less information"
Agreed. I wouldn't have used that test either, but it was the next logical example. I would use expectations.rubyforge.org instead, but I think if I go down that road with the example it will take away from the point.
"I've worked on projects that had 2000 line test files. No one wanted to touch these tests, or the code that they tested."
I'm pretty sure several of us have been in this situation. I've also been on projects where the 2000 line test turned into a 96 line test that everyone was afraid of.
The best solution is to create tests that are as simple as possible. Tests that are complex may also in turn need to be tested. This is not a good scenario. In fact, I wonder what the cyclomatic complexity on a metaprogrammed test would come out to.
'The term "metaprogramming" is almost misleading in this context.'
I disagree, writing code that defines more code is pretty much exactly the definition of metaprogramming.
I think 96 lines of metaprogrammed tests are better than 2000 lines of tests, but I think 96 lines of straightforward tests are even better.
Thanks for the comment, but I simply don't agree.
Cheers, Jay
Tammer,
Thanks again for the comment. I've updated the article to show the next logical step in creating more maintainable tests.
Cheers, Jay
Metaprogramming + readable stacktrace = generate the test, write it to file, then require it for execution. Problem solved?
I'm extremely wary of writing similar-looking logic in the test and in the code -- because if I'm likely to screw one up, I'm likely to hose the other as well. Of the two examples you gave, I prefer the latter for the simple reason that the code under test uses a case statement, and the test does not.
That being said, in this case I might actually prefer writing a 100-line test method. I'm *much* more tolerant of duplication in tests than in code under test -- especially in a case like this where each test really is one line, and each line can be aligned neatly with the ones above and below it (column edit mode in TextMate is a wonderful thing).
Evan, assuming the generated tests aren't so many that they become unmaintainable, then yes, problem solved. However, if you generate 2000 tests that I can't easily digest, then I think I'd still have a problem.
Sam, I completely agree. Like I do with all my tests, I'd look for the solution that gave me the most reliable, readable, and maintainable solution possible. Sometimes that means 100 similar tests, sometimes it means custom assertions.
Cheers, Jay
I'm confused. If you name your test correctly, and use good messages, how is the metaprogrammed version different from the "helper" version? You've given the helper the benefit of nicely-written messages, but not the metaprogrammed version. I can understand not liking them, but purposely writing them poorly to attack them is disingenuous.
Look at the information you get from the failed assertion from the metaprogrammed test. You know you were testing 70 and you expected it to be C, but it was D.
Now look at the information from the helper-ized test. You know you were testing 78 and you expected it to be a B, but it was a C.
They both give you the same information (and would be equally well-presented if you bothered to write the test name and message well). The difference is that the helper has way more (and IMHO harder to read) code involved. For what?
You say the line number is worthless. If it's worthless for a metaprogrammed test, it's worthless for any test, as it tells you what assertion failed, and that's all it's ever told you. The message supplied by the test should tell you why it failed and the name of the test method should tell you what it was supposed to do.
You say you have to figure out what's wrong. I ask what is so magical about the helper that makes it so obvious what's wrong. They're both giving you the same information about the state of the problem.
You say that metaprogramming is just as susceptible to mistake as the code it's testing. This is, of course, true. But so is your helper. So are your basic everyday tests.
Now, don't get me wrong, I'm as crazy about the maintainability and readability of tests as the next guy. It's of vital importance that tests be useful to everyone and so choosing the right tool for the job is paramount. But seeing someone discount a powerful technique out of hand is disheartening, especially when it looks like it's being discounted based on flawed assumptions and implementation.
Hi jyurek,
I could pass in a message to assert_equal that was nicer than the default one, but that would add even more code, which makes the test that much more painful to digest. Sure, it looks obvious in the example, but it's one more thing to understand when you come to a test and don't have the benefit of context. If I were to metaprogram the tests, I wouldn't create a nice message, I'd name it nicely, as the example shows.
But, I agree with you, the test name and default message are enough to know which number failed, what was expected and what was actually returned. It's not the lack of information that is problematic with metaprogrammed tests.
Line numbers are not worthless for any test. In fact they are valuable for all tests, but they are considerably less valuable when they point to a loop and I need to mentally construct the state of all the variables given the current loop. Again, this isn't so hard in the example, but it's a pain when you come to a loop where you don't have context. Messages and the names are enough to figure out what the problem is, but clicking a link that takes me to a line where a failing test exists is much better. Metaprogrammed tests give me the link also, but that link is to a failing test and an undetermined number of passing tests. This is unquestionably more complicated.
"You say you have to figure out what's wrong. I ask what is so magical about the helper that makes it so obvious what's wrong."
Custom assertions are "magical" because they are tested. They instill confidence because they are proven to work as expected. The same is not true of metaprogrammed tests. Unless you are testing that your metaprogramming is correct via additional tests, then they do not provide the same level of confidence. Because they are tested they allow me to focus on finding the problem in the domain, not wonder if the problem is in the domain or the test itself.
"You say that metaprogramming is just as susceptible to mistake as the code it's testing. This is, of course, true. But so is your helper."
Absolutely, which is why my helper is tested, but metaprogrammed tests are not. Untested complexity is always a bad thing.
There's no assumptions or missing implementation. I'm happy to hear of what you


