Legacy C application – Refactoring or reengineering? (II)

We continue this series about the Use Cases that may arise with a Legacy C application: Word 1.1a, the first version of this word processor, released by Microsoft in 1990.

The first two posts were dedicated to metrics of size, complexity, level of comments and duplications, as well as the various ‘Issues’ (Blocker, Critical, Major and Minor).

In order to evaluate the best strategies, including refactoring and reengineering, we started working on the Cyclomatic Complexity (CC) of functions and their distribution. Now we will perform the same work on the files, to identify which are the most complex and which also contain complex functions.

Complexity

Programs

The following table shows the distribution of the complexity among the files:

Table 4 – Cyclomatic Complexity of files from the Word Opus application

171 files have no Cyclomatic Complexity. I counted 130 .h files in the application, most of them containing definitions of data structures and constants, and therefore no algorithm corresponding to a functional or technical operation. On the other hand, 156 files have a CC greater than 90, so this application is extremely polarized between programs with a low or even zero complexity and programs with a high complexity. Or a very high one, since again a ‘Major’ rule allows me to identify 159 files with more than 80 points of CC.
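To illustrate why a header made only of definitions scores zero while a file with branching logic does not, here is a minimal sketch. It assumes complexity is approximated by counting decision points (`if`, `for`, `while`, `case`, `&&`, `||`, `?`), with the CC of a function usually taken as its decision points plus one. This is not the rule set of the analysis tool used in this series, and the sample C snippets and their identifiers are hypothetical.

```python
import re

# Comments and literals are removed first so that keywords inside them
# are not counted.
_SKIP = re.compile(
    r"/\*.*?\*/"            # block comments
    r"|//[^\n]*"            # line comments
    r'|"(?:\\.|[^"\\])*"'   # string literals
    r"|'(?:\\.|[^'\\])*'",  # character literals
    re.DOTALL,
)

# Decision points counted by this rough approximation (an assumption).
_DECISION = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")


def decision_points(source):
    """Count branching constructs in a piece of C source text."""
    return len(_DECISION.findall(_SKIP.sub(" ", source)))


# Hypothetical header: only a constant and a data structure.
header = "#define cchMax 10\nstruct pt { int x; int y; };\n"

# Hypothetical function with three decision points: if, ||, while.
source = """
int clamp(int v) {
    /* if v is out of range, reset it */
    if (v < 0 || v > 9) return 0;
    while (v > 5) v--;
    return v;
}
"""

print(decision_points(header))  # 0
print(decision_points(source))  # 3
```

A text-based count like this ignores the preprocessor and macros, so it can only give an order of magnitude, not the figures of the tables in this post.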

As I already did for the functions, I calculated the distribution of the most complex files.

Table 5 – Distribution of the most complex files

159 files, or 45.6% of the 349 files in this application, have over 80 points of CC, for a total of 43,058 points of CC, which is 98.2% of the overall Cyclomatic Complexity (43,846). I told you that this application is polarized in terms of complexity of the programs!

Note the 24 files over 400 points of CC: again, just as with the functions, a limited number of objects (15% of these 159 files) accounts for a large part (30%) of the overall complexity. Adding the 27 files between 300 and 400 points of CC gives 51 programs (14.6% of the 349 files in the application) beyond 300 CC, with a total complexity (12,860 + 9,472) of 22,332, or 51% of the overall Cyclomatic Complexity of the application (43,846).
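The shares quoted above can be re-derived from the counts given in the text. The short sketch below only redoes that arithmetic; the helper name `share` is an illustrative choice.

```python
# Figures taken from the text above.
TOTAL_FILES = 349
TOTAL_CC = 43_846


def share(part, whole):
    """Return part / whole as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


files_over_80, cc_over_80 = 159, 43_058
files_over_300 = 24 + 27          # files over 400 CC plus files over 300 CC
cc_over_300 = 12_860 + 9_472      # their cumulated complexity

print(share(files_over_80, TOTAL_FILES))   # 45.6
print(share(cc_over_80, TOTAL_CC))         # 98.2
print(share(files_over_300, TOTAL_FILES))  # 14.6
print(share(cc_over_300, TOTAL_CC))        # 50.9
```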

I also checked how the most complex functions are distributed among these programs:

  • 3 files with more than 400 CC have at least one function over 200 CC and 1 function over 100 CC (in red below).
  • 1 file with more than 400 CC and 2 files with more than 300 CC have at least one function over 200 CC (dark orange).
  • 7 files with more than 400 CC have at least one function over 100 CC (light orange).
  • 6 more files with more than 300 CC have at least one function over 100 CC (yellow).

Figure 1 – Most complex files with most complex functions

With this assessment of their level of complexity, we now know which functions and files must be treated as a priority, and where to focus our efforts first.

Size

As we have seen in the previous post, size is also a factor in estimating the workload of a refactoring or a reengineering.

Functions

I also have a rule that allows me to identify the functions that exceed a certain threshold, here 100 lines of code.

Again, I had a look at the distribution of functions according to their size:

Table 6 – Distribution of functions according to their size

Of the 478 functions over 100 LOCs:

  • 3 functions are beyond 1,000 lines of code.
  • 21 have more than 500 lines of code.
  • 126 are between 200 and 500 lines of code.

I also crossed the size (LOCs) with the Cyclomatic Complexity for the following results:

  • The most complex function, with 355 points of CC, has 2,063 lines of code (in the file ‘Opus\RTFOUT.c’)!
  • Of the 6 most complex functions, beyond 200 CC, 3 have a size greater than 1,000 LOCs (including the previous one), 2 exceed 700 lines of code, and the last one has 393 LOCs.
  • Of the 30 functions previously identified over 100 CC, 15 have a size between 500 and 1,000 LOCs, and the other 15 between 250 and 500 LOCs.

Not surprisingly, these functions are found in the files marked in red / orange in Figure 1 above.
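Crossing the two metrics amounts to filtering the functions that exceed both a size threshold and a complexity threshold. A minimal sketch, assuming the metrics have been exported as (file, function, LOC, CC) records: only the LOC and CC of the first record come from the text, while every function name, the other file names and the other values are hypothetical.

```python
# Each record: (file, function, LOC, CC).
# Only the first record's file, LOC and CC come from the text above.
functions = [
    (r"Opus\RTFOUT.c", "hypothetical_rtf_writer", 2063, 355),
    ("hypothetical_a.c", "hypothetical_parser", 720, 210),
    ("hypothetical_b.c", "hypothetical_helper", 150, 12),
    ("hypothetical_c.c", "hypothetical_dispatch", 90, 130),
]


def large_and_complex(records, min_loc=100, min_cc=100):
    """Return the records exceeding both thresholds, most complex first."""
    hits = [r for r in records if r[2] > min_loc and r[3] > min_cc]
    return sorted(hits, key=lambda r: r[3], reverse=True)


for file, name, loc, cc in large_and_complex(functions):
    print(f"{file:18} {name:26} LOC={loc:5} CC={cc}")
```

With these sample records, only the first two functions are kept: the third is large but simple, and the fourth is complex but short.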

Programs

We have seen that the complexity was highly polarized between files, with 159 of them beyond 80 CC points. Another rule also allows us to list 149 files with more than 1,000 lines of code.
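A rule of this kind can be approximated with a few lines of script. The sketch below assumes that a line of code is any non-blank line, which is only one possible definition (analysis tools often exclude comments as well), so it would not necessarily reproduce the 149 files reported here. The function names are illustrative.

```python
from pathlib import Path


def loc_of(text):
    """Count non-blank lines (an assumed definition of LOC)."""
    return sum(1 for line in text.splitlines() if line.strip())


def files_over(root, threshold=1000, suffixes=(".c", ".h")):
    """Return (path, LOC) pairs above the threshold, largest first."""
    sizes = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            # latin-1 decodes any byte sequence, which suits old sources.
            text = path.read_text(encoding="latin-1")
            sizes.append((path, loc_of(text)))
    hits = [(p, n) for p, n in sizes if n > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `files_over("Opus")` on a local copy of the sources (the directory name is an assumption) would list the candidate files, largest first.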

We have 9 files beyond 3,000 lines, including one with 4,117 LOCs, and 36 files between 2,000 and 3,000 lines. For the 9 largest files, I checked whether they also included large or very large functions, and to what extent (number and size of these functions).

Table 7 – Largest files with large functions

The column ‘Nbr. Functions’ lists the number of functions with more than 100 LOCs in the file, then the 3 columns ‘Function size’ show the number of functions beyond 1,000 / 500 / 200 LOCs. For instance:

  • The file ‘Opus\Wordtech\select.c’ has 7 functions over 100 lines, including one of more than 500 lines and 5 of more than 200 lines.
  • The file ‘Opus\interp\elcore.c’ has 5 functions of more than 100 lines, 1 of them over 1,000 lines and another one with more than 500 lines.
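The way one row of this table is built can be sketched as follows. The sketch assumes the three ‘Function size’ columns are exclusive bands (a function over 1,000 LOCs is not counted again in the ‘> 500’ column), which is what the two examples above suggest. The individual function sizes are hypothetical, chosen only to match those two descriptions.

```python
def table7_row(function_sizes):
    """Summarise one file's function sizes (in LOC) as the table columns."""
    large = [loc for loc in function_sizes if loc > 100]
    return {
        "Nbr. Functions": len(large),
        "> 1,000": sum(1 for loc in large if loc > 1000),
        "> 500": sum(1 for loc in large if 500 < loc <= 1000),
        "> 200": sum(1 for loc in large if 200 < loc <= 500),
    }


# Hypothetical sizes matching the descriptions of the two files above.
select_c = [620, 450, 380, 310, 260, 220, 140, 60]
elcore_c = [1200, 700, 180, 150, 120]

print(table7_row(select_c))
# {'Nbr. Functions': 7, '> 1,000': 0, '> 500': 1, '> 200': 5}
print(table7_row(elcore_c))
# {'Nbr. Functions': 5, '> 1,000': 1, '> 500': 1, '> 200': 0}
```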

We can see that the number of large functions generally decreases along with the size of the file. However, there are exceptions, at least these two:

  • The file ‘RTFOUT.c’ with 2,222 LOCs has the most complex function with 355 points of CC and 2,063 lines of code.
  • The file ‘formula.c’ with 2,179 LOCs has a function over 1,000 lines of code.

Also, of the 36 files between 2,000 and 3,000 LOCs:

  • 8 files have 10 functions with more than 500 LOCs, including 6 with at least one additional function over 200 LOCs.
  • 25 files have 25 functions between 200 and 500 LOCs.

We have identified the functions and files that weigh most on the difficulty and the effort, and thus on the costs of refactoring and reengineering.

In the next post, we will check whether these programs also contain structural defects (nested blocks, goto, etc.) that make the code even harder to understand and to evolve.

This post is also available in Spanish and French.
