An Appellative Disclaimer

I read in yesterday’s newspaper about the dispute over the movie Eklavya: whether it should be sent as India’s official entry for the Oscars or not. The courts will (have to?) decide.

Some time back I had searched on the Net for ‘eklavya’ (an exercise, the spell checker says excercise is wrong, in narcissism?). I found that most of the top results were about the movie Eklavya. This was doubly distressing for me. First, for a personal reason, but more importantly, for reasons which would be obvious to anyone who knows something about India. Actually, something here means more than something (work for linguists).

Anyway, for those who don’t know ‘something’ about India, and also for those who think they do but don’t, Eklavya (or Ekalavya: एकलव्य) was a character in the ancient Indian (Aryan, Brahmanical, almost Manuvadi) epic Mahabharata. In this massive book, which is one of my favorites, his story occupies the equivalent of one and a half pages. Still, given the caste history of India, he is (quite naturally) the hero or idol of many among the Dalit community. Technically, I am not a Dalit, but in a sense I am. So, I was somehow expecting that some document about this Eklavya would be at the top in the results returned by the search engine. Or, at least, would be among the top. Not so. Neither was this blog :-( but this was expected :-)

I would like to tell his story, but not now…

This blog post is a protest about the takeover of words or names (in fact, much more than that) by all kinds of powerful and influential people.

But it is also a disclaimer about the origin of the second part of the pen name used by this author. Eklavya in Anil Eklavya has nothing whatsoever to do with the movie. This disclaimer is needed just in case someone thinks that everything in the world is inspired by movies (and I talk a lot about movies on this blog). Despite all that has happened in the last many decades (centuries?), there still actually are some people in India who don’t need to know about Eklavya from movies because they grew up with stories of Eklavya in many forms with many interpretations.

The fact is, I have not even seen the movie, nor do I know what is in it. Nor does it have a high priority among the movies in my wish list.

So, please note: firstly, Eklavya was a remarkable character in Mahabharata (even though his appearance was short); secondly, Eklavya is the idol of many among Dalits; …; [infinite – 1]ly, Eklavya is the second part of the pen name of this author; and only infinitely – that sounds positive – only lastly and leastly, Eklavya is the title of a recent Hindi movie.

When Encoded Convenience Gets Decoded as Frustration

Almost since the first time I got long-term access to a computer (that would be around one and a half decades ago), I have been struggling to use computers for Indian languages. That was long before I managed to reach a place where I could do research in language+computers. There are many who say that India is an ‘IT super power’ and I am living in a city which has almost become the IT capital of this ‘super power’. But I still can’t use computers easily for Indian languages for all the purposes for which I can use computers for English. Very easily.

Much of this has to do with the way language and encoding support is provided on computers. A lot also has to do with the simple fact that somewhere, someone (should I say manywhere, manyone?) preferred a convenient method over a much much better one. That convenience got encoded into something which was to be a solution to some problem or some information which I needed. When I tried to decode that, I got enough frustration to make me think about doing something.

Yes, this post is provoked by a fresh downpour of decoded frustration.

So I have been trying to do something to reduce the amount of such encoded convenience in the Universe, but I have trouble even in convincing others that there is a problem. (Digression: You could say that one of the ways I am using to prolong the heat life of the Universe is entropy itself. Food for thought. But the Second Law of Thermodynamics ensures that even as I try this, I add my own share of, what else, entropy.)

If there is in fact a problem, many others might also be facing the same problem, right? Then why am I unable to convince others that there is a problem? Simply because the intersection of the set of people whom I have to convince to address this problem and the set of those who face this problem is very small. These are different people. Those who can address the problem don’t face the problem because they don’t really want (or need?) to be able to use computers for Indian languages with the same ease with which they can use computers for English. They, at most, use computers for Indian languages for very limited purposes and are quite content with ad-hoc solutions. On the other hand, those who want the Indian languages* to be equally privileged with some other languages spoken by the same number of people, are usually not the ones who can address this problem.

* I will repeat here for the Nth time that MANY of these languages are natively spoken by HUNDREDS OF MILLIONS of people. They may be less privileged languages, but it is not quite appropriate to call them ‘minority languages’. Of course, there are also real minority languages in India…

More coming…

The Relevance of ‘Shared Tasks’ in NLP

Even after centuries of study, we still have very little hard scientific knowledge about natural languages (NLs). Unlike in other branches of engineering, we don’t know the exact physical or mathematical laws which NLs follow, or even whether they do. So, at least for the time being, we can only rely on empirical techniques for solving practical problems in Natural Language Processing (NLP). Even after some general approach seems to hold promise for solving a problem, a lot of practical work remains to be done in refining the methods and in tuning the systems for the best possible performance. This is why once some initial breakthrough has been made, a lot of people have to try the techniques under different conditions to figure out what is the best setup, i.e., the best selection of parameter values, features, etc. What has come to be called a ‘shared task’ is one way of ensuring that this gets done.

Shared tasks are contest-like events where many researchers or even developers working on a particular problem or a set of similar problems try to come up with the best systems. All the systems are evaluated on the same data to provide a fair, competition-like setting. All the participants also have to submit papers describing their systems. The major goals of a shared task are:

  • To find out what is the state of the art in a specific area
  • To simultaneously advance the state of the art, even if slightly
  • To bring together researchers so that they can interact and perhaps argue and discuss
  • To act as an incentive for the researchers to build proper systems, some of which may become available for use by others
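The ‘evaluated on the same data’ part can be sketched in a few lines of Python. This is only an illustration and not the scoring procedure actually used in the contest: the entity tuples, the system names and the choice of exact-match F1 as the metric are all assumptions made up for the example.

```python
from typing import Dict, List, Set, Tuple

# An entity is (start_token, end_token, label). All values below are invented.
Entity = Tuple[int, int, str]


def f1_score(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match F1: an entity counts only if span and label both match."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def rank_systems(gold: Set[Entity],
                 submissions: Dict[str, Set[Entity]]) -> List[Tuple[str, float]]:
    """Score every submission against the same gold data, best system first."""
    scores = {name: f1_score(gold, pred) for name, pred in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference annotation shared by all participants.
GOLD = {(0, 1, "PERSON"), (4, 4, "LOCATION"), (7, 8, "ORGANIZATION")}

# Hypothetical outputs of two participating systems on that same data.
SUBMISSIONS = {
    "system_a": {(0, 1, "PERSON"), (4, 4, "LOCATION")},
    "system_b": {(0, 1, "PERSON"), (4, 4, "PERSON"),
                 (7, 8, "ORGANIZATION"), (9, 9, "LOCATION")},
}

RANKING = rank_systems(GOLD, SUBMISSIONS)

if __name__ == "__main__":
    for name, score in RANKING:
        print(f"{name}: F1 = {score:.3f}")
```

Because every system is scored against the one shared reference set, the numbers are directly comparable, which is the whole point of the competition-like setting.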

It was in view of this that the NLP Association of India (NLPAI) started conducting an annual event called the NLPAI Machine Learning Contest in which researchers, including students, are invited to participate and compete in solving a specific problem which is considered relevant. Last year the topic of the shared task was Shallow Parsing for South Asian languages. A workshop was also organized as an extension of this event as part of the IJCAI conference, which was held in Hyderabad, India. The topic this year was Named Entity Recognition for South and South East Asian languages. This year’s event will also have an extended version in the form of a workshop as part of the IJCNLP conference, which is also going to be held in Hyderabad, India.

In the context of South Asian languages, conducting a shared task has its own problems. This is because funding for them is usually unlikely. Without funding it is difficult to prepare the reference data which is usually essential for a shared task. Those who have annotated data are often unwilling to share it with others. IIIT has taken a lead in preparing annotated data for various purposes and also sharing it with others. Since the data is prepared under difficult conditions, sometimes there are problems with the data, but let’s hope things will improve. In any case, data with some errors is better than no data.

Another problem is that the number of full time researchers in NLP is quite small in South Asia, which affects the quality of submissions, but the shared tasks are meant to get over this situation by creating awareness and interest.

It needs to be emphasized that the goal is not just to show good performance on the data provided but also to build practically usable systems that perform well in general. This implies that the participants are supposed to go beyond being mere competitors in a contest. And the idea is to go further than just being the first in the race. Participation in a shared task should be a milestone, not the final destination.

I feel compelled to end this write up by saying that shared tasks with focus on South Asia can only succeed if there is collaboration and sharing of resources by researchers working in South Asia. We are still far from that situation.

The IJCNLP NER workshop site is located here.

(This write up was originally written for the NLPAI newsletter called Spandan, but it was taking a lo…ng time, I became impatient and so you find it here)

Faces of Dignity (Contd.)

I talked about how dignity can be ‘maintained’ in the face of two different extreme conditions which can easily destroy the kind of dignity I am referring to: extreme wealth and power as well as extreme deprivation.

This is true, of course (that’s why I said it: I wouldn’t lie, would I?). However, what I called extremes are not really extremes. We Indians shouldn’t have difficulty in understanding it. Rosetta’s poverty is relatively much better (what a word to use!) than that of tens (hundreds?) of millions of Indians. Rural as well as urban. It seems to me that Life in a Metro should also include life in metropolitan, even somewhat cosmopolitan, dwellings of the poor called slums. I may be wrong. Anyone can be wrong. Nothing is absolutely right or wrong. We all know that, of course.

Still, since it’s beyond my capabilities to rise above the notions of right and wrong, I do wonder how hard it can be for a person in a typical Indian slum to maintain (as in maintain lifestyle?) the basic human dignity I am so stuck up on. Sounds incredible to me, but it may just be true that Rosetta is lucky. And the princess is, of course, much luckier. Not just because she falls in safe hands.

Just to make it simple to understand, and it is amazing how difficult it can be to understand such things, I can cite the example of a man (leave aside women) being tortured in a police station. Any police station anywhere in the world where torture is still an acceptable method of ‘interrogation’. Can a man being tortured ‘maintain’ his dignity? Fantasies will tell you that he can. Perhaps that’s true. But as people from George Orwell to Khushwant Singh (not to mention our actual executioners, beg pardon, executives of the Law and Justice System) have pointed out, everyone has a limit. Where you lose your capacity to retain (that sounds better) your basic human dignity. Because it is snatched away from you and you can’t even fight back. You are trapped. (Is anyone calling a psychoanalyst to ask why I use this word so much?).

When I originally planned to write this post, I had thought I would write about the characters of Ann and Rosetta. About the technical aspects of the movie. About acting and direction. And, most importantly, about specific incidents in the movies which are not talked about by your usual reviewers. Like the scene of the princess doing her shopping with precious little (borrowed) money. Or like the scene in the barber shop. Or even about the dignity of the (comic) photographer: an unlikely candidate. Or why Rosetta leaves her job which she got after doing something which cost her the favor of many viewers and reviewers. Or about her apparent stomach aches. Or about the only time in the movie when she has an (awkwardly) good time.

It has turned out differently because what I wrote today is what I wanted to write today. No stylistic effect intended. No explanation intended. No protestation intended. No apologies intended. No pun intended either. Sometimes simple truth alone can be quite stylistic. I hope (or fear?) it often is.

So what’s the point? Well, the point can’t always be expressed in a punch line. You can know it if you want to.

Enough! No more waste of my philosophical profundities on a mere blog post.

Faces of Dignity at Two Extremes

For me, the single most important thing for acceptable human life is basic human dignity. Animals also have their own kind of dignity, but since they, perhaps, don’t have any self-consciousness, they automatically get all the dignity they need. Of course, humans have changed this situation, but that is a different story. For humans, on the other hand, dignity – basic human dignity, not the dignity associated with power, rank etc. – is something very hard to get or maintain. This is partly because it depends to a great extent on what is outside you: other people, the society you live in, the environment.

This may all be true, but it’s very abstract. What do I mean by ‘basic human dignity’? I can either give an academic sounding definition, or I can explain by example. I will do the latter here. The former can be reserved for some later academic work :-)

So if you want to see what basic human dignity means, you can watch two movies. You can see what dignity means at two ends of the socio-economic spectrum. The two movies are Roman Holiday and Rosetta.

The first shows you how dignity can be maintained even when you are deluged by wealth, a kind of power, pretentiousness and all the masked menace it means. How a princess can be so dignified that a down-and-out journalist looking for a scoop that will allow him to escape the situation he is trapped in, is moved to drop the scoop and his chances of escape, even when the princess is an easy ‘fair game’.

The other movie shows you how a down-and-out very poor girl trapped in a hell because of her poverty can still be so dignified that you can’t help feeling respect and awe for human life. And, not quite incidentally, disgust for the system that has created her hell and forces her to live in it with hardly any chances of escape.

Wait for more…

Bollywood Growing Up

I would never have seen this movie had Kalpana Sharma not written an article about it. That’s because it has one of the worst names a movie can have. Believe it or not, this movie is a good one, even though it’s called (Ugh!) Chak De India.

As they say in such situations, no points for guessing. That there are several servings of plenty of patriotism. There are some other usual Bollywood ingredients too, but not too many. The movie is of a surprisingly grown up kind for hard core mainstream Bollywood fare.

What did I like in the movie? For one thing, as the title of Kalpana Sharma’s article says, the celebration of difference. The run of the mill reviews may tell you that the star of this movie is Shahrukh Khan, but actually there are many stars. All the girls who played the roles of hockey players. That’s right, the film is about women’s hockey, in a country which is mad about cricket, but whose national game is (men’s) hockey. We are after all very good at such, well, duplicities.

So the movie is about how a rag-tag team of real (mostly) desi girls from all corners of India is inspired to win the (women’s) hockey World Cup. By a coach who is a former disgraced (men’s) hockey star. The fact of his disgrace is closely bound to the fact that he is a Muslim who was the captain of a team which lost, no points again, to Pakistan when his deciding penalty stroke ended up being a missed chance.

But the above summary doesn’t do justice to the film, because there are many other things which I liked. One being the language(s) used by the players from (desi) ‘states’. Another one is that there is no girl who is shown to be the Hero’s girl: quite a bold thing for a Bollywood movie which has been made for one of the most male chauvinistic societies in the world. Can you imagine a Hero who is without a girl, even a bewafa one? And that too when he is surrounded by girls all day. Who are almost at his mercy. Amazing! How would an Indian male be able to digest this fact? Crazy! (Is someone asking whether he is …?).

So, the movie is bold about representation of the minorities, stereotypes of tribals from Jharkhand (‘junglees’), girls from the North East (‘chinks’), cricket being a real career and hockey being ‘just a stupid game’, women’s career versus men’s career etc.

The scenes of games also look quite authentic, perhaps with some help from the CGI people. The director sure seems to know something about hockey. May I say that it is one of the best sports movies made in India, including Iqbal.

There is one thing though which is very odious about the film: the coach seems to be acting like a gentler version of the trainer in Full Metal Jacket. And this brings us back to the overdose of patriotism, which often threatens to make the movie unwatchable. Perhaps the director thought that the bitter anti-stereotypic medicine can only be given with the sugar coating of patriotism. The coating has become quite thick. Perhaps we will have to wait for some more time to have movies without such things. Till the protesters on the margins have struggled enough and sacrificed enough and till what they struggled for becomes mainstream and can be openly accepted by the Yash Chopras of the world.

If it ever does.

It could be Worse. And Worse. And Worse.

I have been through absurd situations before, but yesterday night I found myself in another one of those. This situation, like some (not all) earlier ones, was created by people I had (have?) some respect for. No exaggeration to say that my (personal) world was all shaken up.

Then I heard about the twin blasts. So I thought it could be worse. I had been planning for the last few days to go out. I could have gone to one of those places. Well, under perfect circumstances things could have become completely simplified: all problems solved, but perfect circumstances are not very likely. Things could have become much more complicated.

Then I watched Parzania. And thought that things could be still worse. We know what can come after something like ‘serial blasts’. We already had serial killers. Even a movie genre after them. Now we also have Serial Blasts. Possibly followed by Serial Riots. What about Serial Wars? Perhaps we already have them too. One kind of serial followed by another? Was the last episode of Gulf Wars the ith one or the jth one? Surely the loops aren’t infinite ones?

Then I watched Broken Arrow. Yes, things could be even more worse. There can be a whole new meaning to the term mentioned above (I shudder to mention it with this other meaning).

So what should I do? Maybe thank God that things are all -ed up (not just for me) for no fault of mine, but they could be worse and worse and worse. I could consider that, if there was one. I mean God. But then something tells me that things are going to be just that: worse. And still more worse. Serial Worse.

Maybe I should just get meself drunk. Only I don’t drink. For all practical purposes.

Has Chomsky Failed?

There seems to be a widespread explicit or implicit assumption among many linguists as well as among people from fields which have some overlap with linguistics that Chomsky has not only failed but has refused to accept his failure. Well, I am not really a *linguist* and I am interested in cognitive and statistical approaches. Also, as a computational linguist, I use corpora all the time. But I just can’t see why one should say that Chomsky has failed.

  • During the last fifty years or so, he has done so many things that it’s impossible to say that he has failed in all that he did
  • Even in the narrow sense, I don’t think he has failed, because as Mike has pointed out, the central idea was innateness and Universal Grammar, which has been quite a success
  • As another example, I personally think the idea of autonomy of syntax and semantics is correct and it will be proved so in the future. I can say more on this, but maybe later…
  • All kinds of people have taken something from the Chomsky branch of linguistics, e.g. cognitivists. Even computer scientists.
  • Just try to imagine what linguistics would have been had behaviorism dominated the field
  • It’s really not correct to say that Chomsky has refused to ‘accept that he has failed’. I don’t remember the source or the exact words now (someone on this list surely would) but he had explicitly said that he doesn’t claim to know what exactly is the correct solution. He had written that if at all we some day find the correct solution*, most probably the (specific) solution he is suggesting will turn out to be wrong.
    • * Which we might not: his famous spider and the web example. Like the spider, we may have this great skill of language but we may never get to know how exactly we use it.
  • His churning out new theories every decade shows that he never claimed to have found the correct solution. He just claimed to be trying to get nearer to the solution.
  • I think it’s unfair to just look at his specific theories and based on their (partial) failure claim that he has failed. What he has been doing is much more than just proposing some new grammatical theories.
  • I think, on the whole, he has succeeded more than he has failed. Even his failures (if they are that) have added to our understanding of how language works.
  • A lot of his presumed failure has to do with the kind of goals he had set for himself and for linguistics. He wanted to do linguistics the way physicists do physics. No wonder he considered semantics to be out of the scope. Can anyone really claim that we can (even after his ‘failure’ and some others’ non-failure) talk about semantics in the way physicists talk about physics? I don’t think we should restrict ourselves to physics-like study and so I am not averse to speculating about semantics. I think the best work on semantics (including computational) is at the same level (on the scale of being scientific) as the political work of Chomsky. And that is quite alright because it:
    • is a sincere attempt to find the truth
    • is rigorous
    • tries to stick to really scientific methods as far as possible (not always possible)
    • may be practically useful
  • That some ways of inquiry were ‘blocked’ is as much a fault of others as his. Others could have tried new ways irrespective of what he said. That a lot (or all) of them did not is something to do with the way society works, not just about his views.
  • Language is so complex and so important a part of our psychology (and philosophy and social behavior and politics and …) that, as they say in computational theory, if this problem is solved, all the problems will be solved. Why should there be any surprise that Chomsky, or anyone else for that matter, has failed to come up with a complete and correct solution? I, for one, am extremely thankful that the mysteries of language haven’t been all solved and am hopeful that they won’t be: at least in the near future.
  • As some others have pointed out, what is the right way and what is not may depend on your purpose. If I just want to automatically identify the language of a document and a purely statistical method (learning from a small corpus) gives me the right answer almost always, statistics is the right way for me for this purpose. But that doesn’t necessarily mean that Chomsky has been proved all wrong.
  • Finally, my favorite example (from Chemistry): Dalton, in his formulation of the atomic theory, confused atoms and molecules (which Avogadro later pointed out). Did Dalton fail and Avogadro succeed? In an extremely narrow sense, yes. Otherwise, not really.
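The language identification example mentioned in the list above can be made concrete with a small Python sketch. It is a minimal illustration of a purely statistical method, assuming character trigram counts compared by cosine similarity; the function names and the two tiny training sentences are invented for the example, and a real system would learn from far more text and more languages.

```python
import math
from collections import Counter
from typing import Dict


def trigram_profile(text: str) -> Counter:
    """Count character trigrams after lowercasing and normalizing spaces."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples: Dict[str, str]) -> Dict[str, Counter]:
    """'Learning' here is nothing more than counting trigrams per language."""
    return {language: trigram_profile(text) for language, text in samples.items()}


def identify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda language: cosine(query, profiles[language]))


# Made-up miniature 'corpus': one short sentence per language.
SAMPLES = {
    "english": "the story of the archer is told in the epic and it is a short story",
    "hindi": "यह कहानी महाभारत में है और यह एक छोटी कहानी है",
}
PROFILES = train(SAMPLES)

if __name__ == "__main__":
    for sentence in ("this is the story", "यह एक कहानी है"):
        print(sentence, "->", identify(sentence, PROFILES))
```

Nothing in this sketch knows any grammar, and for this narrow purpose it does not need to. With two different scripts the task is admittedly easy; languages that share a script would need more training text.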

(This comment was sent in reply to some mails on the corpora list).

Getting a Doctorate: An Honorary One

Which is the better option for getting a doctorate: one of the best educational institutions in India or Reality TV in UK? Now you have the answer.

So here is a four step guide to get an honorary doctorate:

  1. Become a silent but visible victim of racism
  2. Let there be some protest
  3. Then pretend nothing really happened
  4. Forget-and-forgive the offender and express admiration for the offender’s community/nation/whatever

Et voilà! You become a Doctor. Perhaps, sometime later, the offender might as well get a doctorate from some reputed Indian university. Quid pro quo?

Why not? That’s what civilization is all about.

A Model of Scripts and the Two Trips in March

Though the vague ideas had been with me for many years, I started to formally work on (what I have named) Computational Modeling of Scripts (CMS). Incidentally, modelling is wrong in 2007, modeling was wrong in 1907, but it is not really a case of language variation. Anyway, two of my papers (co-authored with Harshit Surana) related to the work on CMS were accepted and I had to present both of them in March. I wouldn’t give here the details about the work on CMS because if I begin, there would be no end.

The first paper was titled ‘Using a Model of Scripts for Shallow Morphological Analysis Given an Unannotated Corpus’ and I had to present it at a Workshop on Morpho-Syntactic Analysis. It was being held in Bangkok as part of the 2nd School of Asian NLP for Linguistic Diversity and Language Resource Development. In this case, the long name of the event was justified by the fact that it went on for ten days. But even though I had to attend a lot of talks every day, I didn’t mind. I missed the first two days because of the highly efficient way in which our bureaucracy works. I actually had to go back from the airport after checking in with luggage and all. I managed to survive the nightmare and, thankfully, my nine days in Bangkok were among the best I have had for years, in spite of many problems: food, language, money etc. This was the first time I had gone to any place east of Varanasi and, to put it simply, I liked it.

The second paper was called ‘Study of Cognates among South Asian Languages for the Purpose of Building Lexical Resources’ and it was to be presented in Mumbai at the National Seminar on Creation of Lexical Resources for Indian Language Computing and Processing. I just managed to reach Mumbai (one day late) because there was a deadline for submission to an ACL Workshop on Computing and Historical Phonology. By the way, this paper too was (distantly?) related to the work on CMS and it has been accepted. However, the stay in Mumbai was somewhat less enjoyable (another long story I am not going to tell).

Apart from the joys (and pains) of traveling, the positive thing about these two trips was that at last I managed to present some of my work on CMS, even if the events were not as big as the ACL. And I got to see a lot of people working in NLP, some of whom I had known from the literature. I also got to meet a lot of East Asians from many different countries, in person and in their world. The downside was that I missed another important event which I had been waiting for: a proper film festival in Hyderabad.

PS: This ‘article’ was published somewhere earlier, but since some parts got left out, I am putting the complete and unabridged version here :-)