Monday, June 11, 2018

GSoC project progress: part two

In my previous post, I talked about the major discussions held with the community, what the suggestion feature would be like, how I plan to realize it, and how I have extended the scripts/help/__unimplemented__.m function file to integrate the command line suggestion feature with Octave. In this post, I would like to share my progress and talk about how the current implementation of the suggestion feature works. The link to the public repository that contains the code for this feature can be found here.

The goal for the first evaluation was to code up a small model that would show how this feature integrates with Octave. That part, however, was already complete by the time I wrote my last blog post. Since then I have been working on a full-fledged command line suggestion feature, and I now have a working version that supports identifiers from core Octave and 40 Octave Forge packages. Let's look at the various parts of the feature.

Whenever the __unimplemented__.m function file fails to identify what the user entered as a valid, unimplemented Octave command, it calls one of my m-scripts, __suggestions__.m, and the command suggestion feature gets triggered. This script does the following things...
  • Firstly, based on the setting of the custom preference (set by the user with the command setpref ("Octave", "autosuggestion", true/false)), it decides whether or not to display any suggestions. If the preference is false, the user has turned the feature off, so the script returns control without calculating or displaying any suggestions.
  • If the preference is true, it checks whether what the user entered is at least a two-letter string. If not, it again returns control without calculating or displaying any suggestions. This is done because a one-letter string is unlikely to be a misspelled form of some command.
  • If the string is two letters or longer, the script goes on to find the commands that closely match the misspelling. That calculation is done by a different script; __suggestions__.m only calls it to get the closest matching commands, which are then displayed to the user as potential corrections.
  • If the misspelling is short (length < 5), the script allows for one typo only. If its length is 5 or more, two typos are allowed as well. In other words, for short misspellings only commands at an edit distance of 1 are shown as potential corrections, while for longer misspellings commands at an edit distance of 2 are shown as well.
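To make the checks above concrete, here is a minimal sketch of the control flow, written in Python rather than Octave just to keep it short. It is not the actual __suggestions__.m: the names suggest, enabled, and find_close_matches are assumptions made for illustration, and find_close_matches simply stands in for the generation script.

```python
def suggest(misspelling, enabled, find_close_matches):
    """Return the suggestions to display for an unrecognized command.

    `find_close_matches` is a hypothetical callable standing in for the
    generation script; it returns (command, distance) pairs.
    """
    if not enabled:              # feature switched off via the preference
        return []
    if len(misspelling) < 2:     # one-letter input is unlikely to be a typo
        return []
    # Short misspellings tolerate one typo, longer ones tolerate two.
    max_distance = 1 if len(misspelling) < 5 else 2
    return [cmd for cmd, dist in find_close_matches(misspelling)
            if dist <= max_distance]


def fake_matches(word):
    """Stub generator with made-up results, for illustration only."""
    return [("plot", 1), ("polar", 2)]


print(suggest("plt", True, fake_matches))
```

Passing the generator in as an argument mirrors the idea that the integration part and the generation part stay separate.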
Commands that closely match the misspelling are calculated by a different m-script called __generate__.m. It loads a list of possible commands from a database file called func.db and then calculates the edit distance between the misspelling and each entry of the list using another script called edit_distance.m. Commands with an edit distance of one or two are accepted as close matches, and a list of all such commands along with their edit distances is returned to __suggestions__.m, which displays some or all of them depending on the logic described above.
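For readers who have not met edit distance before, here is a rough sketch of the idea in Python. It assumes the standard Levenshtein distance (insertions, deletions, and substitutions each cost one); the real edit_distance.m may differ in its details, and close_matches is a hypothetical helper, not the actual __generate__.m.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def close_matches(misspelling, commands):
    """Return (command, distance) pairs at edit distance one or two."""
    pairs = ((cmd, edit_distance(misspelling, cmd)) for cmd in commands)
    return [(cmd, d) for cmd, d in pairs if 1 <= d <= 2]


print(close_matches("plt", ["plot", "polar"]))
```

Only two rows of the table are kept at a time, which is enough when just the final distance is needed.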

I'd like to mention that the strings package of Octave Forge also has a function file that calculates the edit distance. It is called "editdistance.m". Therefore, to avoid compatibility issues and to avoid having two different function files that do the same thing, I will later include the edit_distance function that I wrote within the __generate__.m script.

Improving the speed of the generation script 

If we go on and calculate the edit distance between the misspelling and each and every identifier of octave (core+forge), our algorithm would take nearly 20 years to generate an output for each typographic error that the user makes. We, however, would like the time to be 20 milliseconds or so. For that, we use some smart techniques that reduce the sample space on which the algorithm has to operate.

To reduce the time, I've made a small assumption: the user never mistypes the first letter of a command. A rough analysis of the misspelling data that I received from Shane of octave-online before the project began suggests that this is a reasonable assumption and would hardly reduce the accuracy of the suggestion feature. How good is this assumption for the speed? Well, for a misspelling starting with the letter 'n', it reduces the sample size from 1492 to 36 (and that is not even the best case!). The worst case was the letter 's', for which 178 out of 1492 commands were left. Even that corresponds to an 88% reduction in the sample size.
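The percentages are easy to check from the counts quoted above. The helper name reduction_percent is mine, used only for this quick Python calculation:

```python
def reduction_percent(remaining, total):
    """Percentage of the sample removed when `remaining` of `total` are left."""
    return 100 * (1 - remaining / total)


total = 1492  # identifiers in the sample, as quoted above
for letter, remaining in (("n", 36), ("s", 178)):
    print(f"{letter}: {remaining} of {total} left, "
          f"{reduction_percent(remaining, total):.1f}% reduction")
```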

It is important to mention that doing this alphabetical separation at run time would be redundant and a bad idea, since it would correspond to the algorithm taking 20 years again.

Another thing that we should consider, to improve the speed, is to show suggestions from Octave core + loaded packages only. Obviously it is not a good idea to check among the commands that belong to a package which the user is not currently using (or worse, a package that is not installed on the user's machine).

Keeping these things in mind, I have created the func.db database file in such a way that the commands belonging to different packages are stored in different structures and are alphabetically separated as fields of those structures. So for example, func.db contains a structure called control which holds the identifiers from the control package only, another structure core which holds the identifiers of core Octave only, another structure signal which holds the identifiers of the signal package only, and so on. The field a of the control structure (accessed by typing control.a) contains all the identifiers of the control package starting with 'a', the field b (accessed by typing control.b) contains those identifiers of the control package that start with 'b', and so on. This has been repeated for all the packages available.
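The layout can be pictured as a two-level lookup: package first, then first letter. Below is a small Python sketch of that arrangement using nested dictionaries in place of Octave structures. The function build_database and the handful of sample identifiers are assumptions for illustration, not the real contents of func.db.

```python
from collections import defaultdict


def build_database(identifiers_by_package):
    """Group each package's identifiers by their first letter."""
    database = {}
    for package, identifiers in identifiers_by_package.items():
        buckets = defaultdict(list)
        for name in identifiers:
            buckets[name[0]].append(name)   # e.g. database["control"]["a"]
        database[package] = dict(buckets)
    return database


# Tiny hand-picked sample, chosen only to show the shape of the data.
db = build_database({
    "core": ["abs", "acos", "bar", "plot"],
    "control": ["acker", "augw", "bode"],
})
print(db["control"]["a"])
print(db["core"]["p"])
```

Because the grouping is done once when the database is built, nothing needs to be sorted or separated while the user is waiting for a suggestion.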

To make our __generate__.m script memory efficient as well, we load the core structure (which is always required), check which packages are loaded, and load only the structures corresponding to those packages. Then, using a switch case, we fetch all the commands that share their first letter with the misspelling (in O(1), thanks to the way the database is arranged) and proceed to the next step.
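Here is a hedged Python sketch of that lookup step. A dictionary lookup plays the role of the switch case, and candidates_for, database, and loaded_packages are hypothetical names; how the real script finds out which packages are loaded is not shown.

```python
def candidates_for(misspelling, database, loaded_packages):
    """Collect identifiers sharing the misspelling's first letter."""
    first = misspelling[0]
    found = []
    for package in ["core", *loaded_packages]:   # core is always searched
        # Direct lookup of the bucket for this letter; no scan over
        # the other letters is needed.
        found.extend(database.get(package, {}).get(first, []))
    return found


# Illustrative data only, not the real func.db.
database = {
    "core": {"b": ["bar"], "p": ["plot"]},
    "control": {"b": ["bode"]},
    "signal": {"b": ["butter"]},
}
print(candidates_for("bdoe", database, ["control"]))
```

In this example the signal bucket is never touched, because that package is not in the loaded list.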

To understand the next step, note that if a misspelling is of length p (say), and we accept corrections at an edit distance of one or two from it, then the corrections can only have the following lengths...
  • p-2: Two deletes in the misspelling,
  • p-1: One delete and one substitution, or one delete only.
  • p: One delete and one addition, or one or two substitutions.
  • p+1: One addition and one substitution, or one addition only.
  • p+2: Two additions to the misspelling.
This fact allows us to reduce the list further, cutting out some 5-10 more entries for misspellings of normal length. The logic is particularly useful for long misspellings, because there are very few long commands. If a user misspells the command "suppress_verbose_help_message", the script would take days to suggest a correction without this logic. This is because the edit distance algorithm is O(n1*n2) with dynamic programming, where n1 and n2 are the lengths of the strings being compared, and this O(n1*n2) computation is repeated m times, where m is the number of commands that could be close matches. With this logic, however, the list of candidates is cut down to one or two commands, so m is reduced and the close matches are found within one or two iterations.
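The length rule boils down to a cheap check that runs before any edit distance is computed. A minimal Python sketch, with filter_by_length as an assumed name rather than anything from the actual script:

```python
def filter_by_length(misspelling, candidates, max_distance=2):
    """Keep candidates whose length is within max_distance of the misspelling's."""
    p = len(misspelling)
    return [c for c in candidates if abs(len(c) - p) <= max_distance]


# "polyfit" is four characters longer than "plt", so it can never be
# within two edits and is dropped before the expensive comparison.
print(filter_by_length("plt", ["plot", "print", "polyfit", "pi"]))
```

Comparing lengths costs O(1) per candidate, whereas each edit distance computation costs O(n1*n2), so every entry removed here is a real saving.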

That summarizes all the measures that I have taken to improve the speed of the suggestion feature. The control flow had been described before this and so that concludes the working of the suggestion feature.

Conclusion

This concludes phase one. What's left is to include more Forge packages and to include graphic properties within the scope of this feature. Writing the documentation, writing the tests, and debugging also remain, but these shall be the tasks for subsequent coding phases. Till then, goodbye, see you in the next blog post. :)

Wednesday, May 30, 2018

GSoC project progress: part one


An Initial note....


Alright, so first of all, I would like to apologize for not writing a proper blog post until now. I had my final examinations during the first week of the coding period and immediately after that, to catch up, I got so involved with the coding part that I forgot to share the progress of the project on the blog. On the positive side, however, I have completed a lot of work. I can safely say that I have completed the goals that were set for the phase 1 evaluations (possible style fixes may be left), but that’s not all the good news. The phase two evaluation goal is also halfway done!

Now, I do realize that I have not shared any details of my project until now, and so, in this blog post, I’ll share a lot of important details and talk about everything that has been discussed and done so far. I promise to post more often after this ‘cumulative’ post. Here goes...

The Project Idea....


If you've read the last blog post, you'd know that I plan to add something called a 'Command Line suggestion feature' to Octave and you may be wondering what that means. Basically this feature would do something like this...

Whenever the users make a typographic error while working on Octave's command window, the command line suggestion feature would suggest a correction to them and say something like "The command you entered was not recognized, did you mean any of the following...?"

Now I could share a detailed time-line explaining 'when' I plan to do 'what' but I believe that not everyone would be interested in reading that and so I'll skip that for now. Instead, I'll quickly talk about the following...
  • What the community wants the overall project to be like.
  • What are the challenging parts of the project.
  • What are my evaluation goals.
  • What discussions have been made, and
  • How much progress has been made.
If you really would like to see my time-line then just ask for it in the comments section and I'll share a link.

The Community Bonding Period...


By the time you finish reading this section, probably the only thing left to talk about would be "How much progress has been made". That is just a glimpse of how much the community has been involved in this project. It also shows how successful GNU Octave is as an open source community; not every open source community is as open when it comes to discussions.

Now the first thing to understand is that this project is essentially a UX improvement, and as such, Octave is not bound by 'MATLAB compatibility issues'. This is one of the primary reasons why there was so much to discuss in the community bonding period. Here are the main points that summarize the collective decision of the community on what the overall project should be like:
  • First of all, it was decided that the user interface, or the part handling 'how this feature hooks itself to Octave', should be well separated from how the 'suggestions are generated'. This need became clear as soon as we realized that there are a lot of algorithms available that could be used to generate suggestions. Separating the integration and generation parts makes sure that, in future, if a faster or more accurate algorithm to generate suggestions is discovered, replacing the existing implementation becomes easier.
  • Secondly, a few problems, such as a very large output layer size and failure on dynamic package loading, were found with my proposed Neural Network based approach. Therefore, we decided to use a well established approach called the Edit distance algorithm for now, and the Neural Network based approach will be the 'research part' of the project. Essentially, the plan is to first use 'smart implementations' of the good old Edit distance algorithm to realize this feature and, once that has been done, to research whether a Neural Network could do better. If later on we realize that a Neural Network (or, for that matter, any other approach) really could do better than the Edit distance approach, the algorithm can be replaced very easily (thanks to the previous point).
  • Next, we decided to include keywords, functions, and graphic properties within the scope of this feature. Very short keywords, user variables, and internal functions will not be included in its scope. Deprecated functions would also be included in the scope for now. Essentially, corrections would be suggested for typos close to anything that is within the scope of this feature and would not be suggested for anything that isn't.
  • Also, we decided to use the missing_function_hook() to realize the integration part of this feature. More about this later in this post.
  • Lastly, we decided that it is absolutely necessary to include an 'on/off switch' type of command that would let the users decide whether they want to use this feature or not. We plan to use custom preference for now to do this.
That summarizes the most important discussions that took place and with that, we are in a position to talk about how the second point and the last point are directly related to what are the 'challenging parts of the project'. Let's start with that.

Essentially, the second point talks about the algorithm that will be used to generate the corrections that are ultimately going to be shown to the user. The challenging part is that this algorithm should offer the best possible speed-accuracy trade-off. I did know about the Edit distance algorithm beforehand, but I initially believed that a Neural Network would outperform it in terms of the speed-accuracy trade-off. Discussing the idea with the community made me realize that there are some critical loopholes in the Neural Network based model and, although they could definitely be improved with more research, I should not jeopardize the entire project just to prove that Neural Networks could do better. We therefore decided to do what I described earlier in the second point.

At this point, defining a 'smart implementation' of Edit distance remains. Basically, Edit distance is a very accurate algorithm that quantifies how dissimilar two strings are. The only problem with it is its speed (my primary reason for initially proposing a trained Neural Network). Essentially, by a smart implementation of the algorithm, we mean an implementation which maximizes the computation speed by reducing the sample space on which the algorithm has to work. This would be done using some clever data management techniques and some probability based assumptions. Some discussions related to these also took place during the community bonding period, and since then I have been looking at the suggestion features of a lot of other free and open source software to devise some clever techniques. Good progress has been made, but I'll share that in another blog post.

The last point talks about a very important 'on/off' feature. The tricky part with this was that Octave comes in both a GUI and a CLI, and so a common method that does the job could have been hard to find. However, this problem was solved with relative ease, and we decided to use a custom preference to realize this part. This gave us a simple and common command to switch the feature on or off.

These discussions led me to reset my term evaluation goals, which are now as follows:
  • Phase-1 goal: To code up and push an algorithm independent version of the suggestion feature into my public repository. Essentially this would show how this feature integrates itself with Octave.
  • Phase-2 goal: A development version of Octave with a working (but maybe bugged and surely undocumented) command line suggestion feature integrated into it.
  • Phase-3 goal: The primary goal would be to have a well documented, well tested and bug free command line suggestion feature. The secondary goal would be to research and try to produce a Neural Network based correction generation script that outperforms the edit distance algorithm.
...and that, marked the end of the major discussions and the community bonding period.

Progress made so far...


So far, I have coded up the phase-1 goal. The public repository can be seen here. It shows how we have used the missing_function_hook() to integrate the feature with Octave. The following points summarize how it works:
  • Essentially, whenever the parser fails to identify something as a valid octave command it calls the missing_function_hook() which points to an internal function file, '__unimplemented__.m'.
  • This file checks if whatever the user entered is a valid, unimplemented Octave (core or Forge) command or if it is an implemented command that belongs to an unloaded Forge package. If yes, it returns an appropriate message to the user and if not, it does, or rather, used to do, nothing.
  • To realize the suggestion feature, I have extended its functionality to check for typographic errors whenever the command entered was not identified as a valid unimplemented/ forge command.
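The fallthrough described in these points can be sketched as follows. This is a Python illustration under stated assumptions, not the actual __unimplemented__.m: the function name handle_unknown_command, its arguments, and the message texts are all made up for the example.

```python
def handle_unknown_command(name, unimplemented, unloaded, find_typos):
    """Decide what to tell the user about an unrecognized command.

    `unimplemented` is a set of known-but-unimplemented commands,
    `unloaded` maps commands to the unloaded package that provides them,
    and `find_typos` is a hypothetical stand-in for the suggestion step.
    """
    if name in unimplemented:
        return f"'{name}' is not implemented yet."
    if name in unloaded:
        return f"'{name}' belongs to the unloaded package '{unloaded[name]}'."
    # Previously nothing happened here; the extension checks for typos.
    matches = find_typos(name)
    if matches:
        return "Did you mean: " + ", ".join(matches) + "?"
    return None


def stub_typos(name):
    """Made-up suggestion results, for illustration only."""
    return ["plot"] if name == "plt" else []


print(handle_unknown_command("plt", {"foo_solver"}, {"bode": "control"}, stub_typos))
```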
By using the missing_function_hook(), the keywords and built-in functions were automatically brought into the scope of this feature. Graphic properties remain because there is no missing_property_hook() in Octave right now. I have discussed this with the community and I'll try to code it up in the subsequent weeks.
Besides that, I have also figured out how the Edit distance algorithm can be made 'smart'. I'll push an update and write another blog post as soon as I master and code up the entire thing. For now, thanks for reading, see you in the next post. :)

Thursday, April 26, 2018

Starting with GSoC 2018.

So this year, I applied to the Google Summer of Code and got in. Google Summer of Code, or GSoC, as it is usually called, is a program funded by Google that has helped open source grow for over a decade. Under this program, Google awards stipends to university students for contributing code to open source organizations during their summer breaks. The details of the program can be found here: Starting with Google summer of code.

Now this year, I have been selected to work with GNU Octave. It is a free and open source software/ high level programming language which is primarily focused on scientific computing. It is largely compatible with MATLAB and is a brilliant open source alternative to it. More details about GNU Octave can be found at Free your numbers! Introducing GNU Octave.

My GSoC project is about adding a Command line suggestion feature to GNU Octave. Stay tuned, I will share the details of the project very soon.
