Monday, November 10, 2008

Universal Lexical Analyzer

Hi All,

The code below is meant to provide a library for building a universal lexical analyzer: one that can tokenise any file, driven by a separate file of regular expressions describing the tokens to be recognised.
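To make the idea concrete, here is a minimal usage sketch. All the names in it (TokenDef, loadTokenDefs, tokens.def) are hypothetical illustrations of the intent, not the actual API of my library:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// One line of the token-definition file: a token name and the regular
// expression matching it, e.g.:  IDENT  [a-zA-Z_][a-zA-Z0-9_]*
struct TokenDef {
    std::string name;
    std::string pattern;
};

// Read whitespace-separated "name pattern" pairs from the definition file
// (this toy reader assumes patterns contain no spaces).
std::vector<TokenDef> loadTokenDefs(const std::string& path) {
    std::vector<TokenDef> defs;
    std::ifstream in(path.c_str());
    TokenDef d;
    while (in >> d.name >> d.pattern)
        defs.push_back(d);
    return defs;
}

int main() {
    // The real library would compile each pattern into an FSM and then
    // scan an input file, emitting (token name, lexeme) pairs.
    std::vector<TokenDef> defs = loadTokenDefs("tokens.def");
    for (std::size_t i = 0; i < defs.size(); ++i)
        std::cout << defs[i].name << " <- " << defs[i].pattern << '\n';
    return 0;
}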

Even though the idea is quite specific, I chose to use templates, since creative people might well put the library to other uses too, and I would love to hear about whatever usage they think of.

To support Unicode (as Gunjan suggested), I have used disjoint ranges to represent a class of input values, what in FSM terms we call the input alphabet. This way the library can tokenise files in different natural languages, and so may in future help develop programs in any language (especially in Hindi, I guess).
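Here is a small sketch of the disjoint-range idea (the type names are mine, not the library's own): an FSM transition is labelled with an interval of code points rather than a single character, so for example the whole Devanagari block used by Hindi is just one range.

#include <cstddef>
#include <iostream>
#include <vector>

typedef unsigned int CodePoint;   // wide enough for all of Unicode

// [lo, hi): one interval of code points treated as a single input symbol.
struct Range {
    CodePoint lo, hi;
    bool contains(CodePoint c) const { return lo <= c && c < hi; }
};

// A character class is a set of disjoint, sorted ranges, so the transition
// table needs one entry per range instead of one per code point.
struct CharClass {
    std::vector<Range> ranges;
    bool contains(CodePoint c) const {
        for (std::size_t i = 0; i < ranges.size(); ++i)
            if (ranges[i].contains(c)) return true;
        return false;
    }
};

int main() {
    CharClass devanagari;                 // the block used by Hindi
    Range block = { 0x0900, 0x0980 };     // U+0900 .. U+097F
    devanagari.ranges.push_back(block);
    std::cout << devanagari.contains(0x0915) << '\n';  // 1: U+0915 is "ka"
    return 0;
}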

I am eager to hear any queries or suggestions from you.

I am working with devC++ and Source Insight as development tools.

A reference of the classes, variables, and functions it contains, along with the source code, follows.



Saturday, November 08, 2008

You are given 1000 points in 2D space and you need to find the most dense 100 points

Hi Algo Geeks,
Some questions constantly nag me, and this happens many times with algorithm questions.
Well, this time I will be straightforward in telling both the question and the answer. Before this I have sent many things, but they were put trickily and people did not give them attention. So this time, very straight.

The question goes like this:
You are given 1000 points in 2D space and you need to find the most dense 100 points.

The first answer people generally gave was: "In the worst case it can be solved in O(n^2), by finding the distance between every two points, sorting by those distances, and reading off the answer." (Strictly, sorting the roughly n^2/2 pairwise distances already costs O(n^2 log n).) But is it true?

There is a grave misconception here: we cannot do this. Can we actually find such 100 points that way? The top 100 least distances may belong to pairs of points that are themselves very far from each other. So what should we do? After so many years of being constantly nagged by this question I have a partial answer, though I suspect it may already have been answered in unsupervised learning. I want to share it with you; just let me know whether it is right or going badly wrong. First, though, a tiny construction to make the misconception concrete.
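The following C++ snippet (the 50-pairs input is my own illustrative construction) builds fifty tight pairs scattered along a line: the fifty smallest pairwise distances are all intra-pair ones, yet their endpoints span the whole line, so picking points by smallest distance does not yield a dense cluster of 100 points.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct Point { double x, y; };

int main() {
    // 50 tight pairs: within a pair the points are 0.001 apart, while the
    // pairs themselves sit 10 apart along the x axis.
    std::vector<Point> pts;
    for (int i = 0; i < 50; ++i) {
        Point a = { 10.0 * i, 0.0 };
        Point b = { 10.0 * i + 0.001, 0.0 };
        pts.push_back(a);
        pts.push_back(b);
    }

    // Collect all pairwise distances and sort them.
    std::vector<double> d;
    for (std::size_t i = 0; i < pts.size(); ++i)
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            double ddx = pts[i].x - pts[j].x, ddy = pts[i].y - pts[j].y;
            d.push_back(std::sqrt(ddx * ddx + ddy * ddy));
        }
    std::sort(d.begin(), d.end());

    // The 50 smallest distances are exactly the intra-pair ones, yet their
    // endpoints stretch from x = 0 to x = 490.001: nothing like a dense
    // cluster of 100 points.
    std::cout << "smallest distance:         " << d[0]  << '\n';  // 0.001
    std::cout << "50th smallest distance:    " << d[49] << '\n';  // 0.001
    std::cout << "spread of their endpoints: "
              << pts.back().x - pts.front().x << '\n';            // 490.001
    return 0;
}

So here is that partial answer...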

1) Sort the x-coordinates of the points.
1.a) Get the minimum distance along the x direction between the points (say dx).
2) Sort the y-coordinates of the points.
2.a) Get the minimum distance along the y direction between the points (say dy).
(The above two ensure that there would be at most 2 points in a grid cell; to me the maximum seems to be 1, but just to be careful I used 2 here.)
3) Divide the whole plane into a grid of dx-by-dy sized rectangles.
4) Calculate the density of points over the whole plane (say avgden = total number of given points / total number of grid cells the 2D plane spans to include all these points).
4.a) minden = avgden.
5) For each grid cell with at least one point, do 5.a).
5.a) Now start collecting neighboring cells, recalculating density, until either there are 100 points in the collection or its density goes below minden.
5.b) Keep updating minden with the maximum-density cluster found; or, if we keep encountering only clusters with density well below minden, we can re-estimate the avgden possible for the rest of the points, which may help us find higher density.
(A rough sketch of steps 1-4 in code follows below.)
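Here is that sketch in C++ (the helper names and the random test input are mine, and step 5's region growing is only indicated, so treat this as an illustration of the grid construction rather than a finished solution):

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Steps 1.a / 2.a: minimum positive gap between consecutive sorted coordinates.
double minGap(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    double gap = 1e300;                               // effectively infinity
    for (std::size_t i = 1; i < v.size(); ++i) {
        double d = v[i] - v[i - 1];
        if (d > 0.0 && d < gap) gap = d;
    }
    return gap;
}

int main() {
    // Stand-in input: 1000 random points in the unit square.  Note that for
    // real-valued coordinates dx and dy can come out very small, making the
    // grid very fine -- one of the practical worries with this scheme.
    std::vector<Point> pts(1000);
    std::vector<double> xs(pts.size()), ys(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        pts[i].x = xs[i] = std::rand() / double(RAND_MAX);
        pts[i].y = ys[i] = std::rand() / double(RAND_MAX);
    }

    // Steps 1-2: per-axis minimum gaps.
    double dx = minGap(xs), dy = minGap(ys);

    // Step 3: bucket every point into its dx-by-dy cell.  With dx and dy
    // chosen this way a cell holds at most one point (two only if the input
    // contains duplicates), so the map value is a small count.
    std::map<std::pair<long, long>, int> cell;
    for (std::size_t i = 0; i < pts.size(); ++i)
        ++cell[std::make_pair(long(pts[i].x / dx), long(pts[i].y / dy))];

    // Step 4: average density over the bounding box of the occupied cells.
    long loX = cell.begin()->first.first,  hiX = loX;
    long loY = cell.begin()->first.second, hiY = loY;
    for (std::map<std::pair<long, long>, int>::const_iterator it = cell.begin();
         it != cell.end(); ++it) {
        loX = std::min(loX, it->first.first);  hiX = std::max(hiX, it->first.first);
        loY = std::min(loY, it->first.second); hiY = std::max(hiY, it->first.second);
    }
    double cells  = double(hiX - loX + 1) * double(hiY - loY + 1);
    double avgden = pts.size() / cells;               // step 4
    double minden = avgden;                           // step 4.a
    std::cout << "dx = " << dx << ", dy = " << dy
              << ", minden initialised to " << minden << " points/cell\n";

    // Step 5 would grow a cluster of neighboring cells around each occupied
    // cell, stopping at 100 points or when density drops below minden, and
    // keep the best cluster while updating minden.  Omitted here.
    return 0;
}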

Doubt:
In the case of "find the maximum sum in a given array of integers" we never had a factor that variably decreases the value, so would this algorithm still do its job?
Also, in that case we knew the point below which we would not go, namely zero; but here we don't know in advance what the highest density we are looking for would be. In this case I have taken it to be avgden, updating it as needed, yet there may still be cases where this approach does not work. But the first pass must give us some idea of the density we are seeking, and it has to converge over the next passes. How many iterations the answer may take is the real hurdle this algorithm faces. I am looking for loopholes in this algorithm and for some work at this point.