
Since the topic of machine translation was first proposed in the early 1950s, natural language processing (NLP) has accumulated a research and development history of at least fifty years. In the early 1990s, the research goals of NLP began to shift from small-scale restricted-language processing to large-scale real text processing. This new goal was formally written into the theme of the 13th International Conference on Computational Linguistics, held in Helsinki in 1990. Restricted-language analysis systems with only a few hundred dictionary entries and a few dozen grammar rules are often jokingly called "toys" by industry insiders and are unlikely to have any practical value. What governments, enterprises and computer users expect are practical systems that can process large-scale real text, in applications such as Chinese character input, voice dictation, text-to-speech conversion (TTS), search engines, information extraction (IE), information security and machine translation (MT).

With this milestone turning point in view, the author listed four application prospects for large-scale real text processing in 1993: a new generation of information retrieval systems; newspapers edited to customers' requirements; information extraction, that is, converting unstructured text into structured information databases; and automatic annotation of large-scale corpora. Fortunately, all four directions have since achieved practical or commercial results.

Although the whole world regards large-scale real text processing as the strategic goal of NLP, this does not mean that natural language analysis technologies or theoretical research based on deep understanding in restricted domains, such as machine translation, voice dialogue and telephone translation, should no longer be pursued. Diversity of goals and missions is the sign of a thriving academic community. The point is to think clearly about where the main battlefield of NLP lies and where our main force should be deployed.

Is Chinese difficult to process?

When it comes to the major applications that Chinese information processing faces, such as the Chinese character input and speech recognition expected by enterprises and computer users, there seems to be no disagreement. But once the discussion turns to the methods or technical routes for achieving these goals, differences surface immediately. The first opinion holds that the essence of Chinese information processing is Chinese understanding, that is, syntactic-semantic analysis of real Chinese text. Scholars holding this opinion argue that the probabilistic and statistical methods used in Chinese information processing so far have reached a dead end, and that to solve the problem at the level of understanding or language, another road must be taken, namely semantics. The reason given is that Chinese differs from Western languages: Chinese syntax is quite flexible, and Chinese is in essence a semantic language.

The opposing view is that most of the application systems mentioned above (except MT) have in fact been implemented without any syntactic-semantic analysis, and therefore involve no "understanding" at all. If one insists on the word "understanding", it is only the so-called "understanding" confirmed by a Turing test.

The focus of the above-mentioned arguments is method, but goals and methods are usually inseparable. If we agree that large-scale real text processing is the strategic goal of NLP, then the theories and methods to achieve this goal must also change accordingly. Coincidentally, the "Fourth International Conference on Theory and Methods of Machine Translation (TMI-92)" held in Montreal in 1992 announced that the theme of the conference was "Empirical and Rationalist Methods in Machine Translation". This is an open admission that in addition to the traditional NLP technology based on linguistics and artificial intelligence methods (i.e., rationalism), there is a new method based on corpora and statistical language models (i.e., empiricism) that is rapidly emerging.

The strategic goal of NLP and the corresponding corpus-based methods are conclusions drawn from the broad view of the international academic arena, and Chinese information processing is no exception. The view that Chinese text processing is especially difficult and requires a different road lacks a persuasive factual basis. Take information retrieval (IR) as an example. Its task is to find the documents relevant to a user's query in a large-scale document collection. How to represent the content of documents and queries, and how to measure the relevance between documents and queries, are the two basic problems IR technology has to solve; recall and precision are the two main indicators for evaluating an IR system. Since documents and queries are both expressed in natural language, this task illustrates well that the problems faced and the methods used in Chinese and in Western languages are actually very similar. Generally speaking, IR systems in all languages use term frequency (tf) and inverse document frequency (idf) to represent the content of documents and queries, so IR is essentially a statistical method.
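The tf·idf representation and the two evaluation indicators just mentioned can be sketched in a few lines. This is a minimal, language-independent illustration; the tokenized toy inputs and the log(N/df) variant of inverse document frequency are assumptions made for the example, not a description of any particular IR system:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Sparse tf-idf vectors for a list of tokenized documents.
    tf is the raw in-document term frequency; idf = log(N/df),
    one common textbook variant of inverse document frequency."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))           # df: number of docs containing the term
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts),
    a standard way to score document-query relevance."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def precision_recall(retrieved, relevant):
    """The two main IR evaluation indicators: precision is the share of
    retrieved documents that are relevant, recall the share of relevant
    documents that are retrieved."""
    hit = len(retrieved & relevant)
    p = hit / len(retrieved) if retrieved else 0.0
    r = hit / len(relevant) if relevant else 0.0
    return p, r
```

The same code applies to Chinese documents once they are segmented into words, which is precisely the point: the representation is statistical, not language-specific.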

Take part-of-speech tagging as another example. Let C = c1...cn and W = w1...wn denote a part-of-speech tag sequence and a word sequence respectively. The tagging task can then be regarded as the problem of finding, for a known word sequence W, the tag sequence C* that maximizes the conditional probability:

C* = argmaxC P(C|W)

   = argmaxC P(W|C)P(C) / P(W)

   ≈ argmaxC ∏i=1,...,n P(wi|ci)P(ci|ci-1)

P(C|W) denotes the conditional probability of the tag sequence C given the input word sequence W. The mathematical symbol argmaxC means examining the different candidate tag sequences C and finding the one, C*, that maximizes P(C|W); C* is taken as the tagging result for W. The second line of the formula is obtained by applying Bayes' rule; since the denominator P(W) is a constant for a given W, it does not affect the maximization and can be deleted from the formula. The formula is then approximated in two steps. First, an independence assumption is introduced: the probability of any word wi in the sequence is assumed to depend only on the part-of-speech tag ci of the current word, and to be independent of the surrounding (contextual) tags, giving the lexical probability

P(W|C) ≈ ∏i=1,...,n P(wi|ci)

Second, a bigram assumption is adopted: the probability of any part-of-speech tag ci is taken to depend only on the immediately preceding tag ci-1, so that:

P(C) ≈ ∏i=1,...,n P(ci|ci-1)

Here P(ci|ci-1) is the transition probability between part-of-speech tags, and the resulting model is also called a bigram model.

Each of the two probability parameters above can be estimated from a part-of-speech tagged corpus by relative frequency:

P(wi|ci) ≈ count(wi,ci) / count(ci)

P(ci|ci-1) ≈ count(ci-1,ci) / count(ci-1)

Incidentally, using bigram or trigram models of part-of-speech tags, researchers at home and abroad have achieved automatic tagging accuracies of about 95% for both Chinese and English.
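The decomposition and the count-based estimates above can be sketched end to end. The following is a minimal illustration, not a production tagger: the toy corpus, the tag names and the smoothing-free estimates are all invented for the example, and Viterbi-style dynamic programming computes the argmax instead of enumerating every tag sequence.

```python
from collections import Counter

def train(tagged_sents):
    """Relative-frequency estimates:
    P(wi|ci)   ~ count(wi,ci) / count(ci)
    P(ci|ci-1) ~ count(ci-1,ci) / count(ci-1),
    with "<s>" as a sentence-initial pseudo-tag."""
    emit, trans = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    tags = set()
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            tag_count[tag] += 1
            prev_count[prev] += 1
            tags.add(tag)
            prev = tag

    def p_emit(t, w):
        return emit[(t, w)] / tag_count[t] if tag_count[t] else 0.0

    def p_trans(p, t):
        return trans[(p, t)] / prev_count[p] if prev_count[p] else 0.0

    return p_emit, p_trans, tags

def viterbi(words, p_emit, p_trans, tags):
    """Find argmax_C prod_i P(wi|ci) P(ci|ci-1) by dynamic programming:
    best[t] holds the probability and path of the best tag sequence
    ending in tag t for the words seen so far."""
    best = {t: (p_trans("<s>", t) * p_emit(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            prob, path = max((best[p][0] * p_trans(p, t), best[p][1]) for p in tags)
            new[t] = (prob * p_emit(t, w), path + [t])
        best = new
    return max(best.values())[1]

# Toy tagged corpus, invented for illustration.
corpus = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
]
p_emit, p_trans, tags = train(corpus)
print(viterbi(["the", "dog", "sleeps"], p_emit, p_trans, tags))
```

A real tagger would add smoothing for unseen words and tag pairs, and work in log probabilities to avoid underflow on long sentences, but the structure of the computation is exactly the two-part decomposition derived above.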

Why is evaluation the only criterion?

The only criterion for judging the merit of a method is evaluation. A credible evaluation is not a "self-evaluation" designed by the developers themselves, nor people's intuition or someone's "foresight". In recent years, the field of language information processing has seen many examples of evaluation driving scientific and technological progress. The intelligent computer expert group of the national "863 Project" has organized national evaluations, with unified test data and unified scoring methods, on topics such as speech recognition, Chinese character (printed and handwritten) recognition, automatic word segmentation, automatic part-of-speech tagging, automatic summarization and machine translation quality; these evaluations have played a very active role in promoting technological progress in these fields.

Internationally, the U.S. Department of Defense has launched two programs related to language information processing, TIPSTER and TIDES, which are explicitly called "evaluation-driven programs". They not only provide large-scale training and test corpora, but also unified scoring methods and evaluation software for research topics such as information retrieval (TREC), information extraction (MUC) and named entity recognition (MET-2), ensuring that every research group can compare methods under fair and open conditions. The multilingual evaluation activities organized by TREC, MUC and MET-2 also demonstrate convincingly that methods proven effective in other languages are applicable to Chinese as well, and that the performance indicators of application systems across languages are roughly comparable. Of course, every language has its own individual traits, but these traits should not be used to deny the commonalities among languages, or to draw wrong judgments from insufficient facts.

To promote the development of Chinese information processing, let us take up the weapon of evaluation and study the applicable technologies in a down-to-earth manner, instead of taking things for granted. It is recommended that when government research authorities formulate project plans, at least 10% of a project's total funding be allocated to evaluating that project.

After all, research results that have not undergone unified evaluation are not fully credible.