Cross-Language Text Classification

J. Scott Olsson

Dept. of Mathematics

University of Maryland

College Park, Maryland

olsson@,

Douglas W. Oard

College of Information

University of Maryland

College Park, Maryland

Jan Hajič

Institute of Formal and Applied Linguistics

Charles University

Prague, Czech Republic

Our goal in cross-language text classification (CLTC) is to use English training data to classify Czech documents (although the concepts presented here are applicable to any language pair). CLTC is an off-line problem, and the authors are unaware of any previous work in this area. CLTC is motivated by both the non-availability of Czech training data (presently the case in our dataset) and the possibility of leveraging different topic distributions in different languages to improve overall classification for information retrieval. Consider, for example, that English speakers tend to contribute more to some topics than their Czech counterparts (e.g., to discuss London more than Prague), so that, having only documents in English, we may expect to do poorly at identifying topics like Prague. Czech speakers, on the other hand, often talk about Prague, so that by leveraging Czech data, we might expect to improve on detecting the topic Prague in English speakers; and Prague in English speakers is exactly the sort of thesaurus label in which information seekers are most interested, because it is rare. Accordingly, while a lack of Czech training data presently necessitates CLTC, we would have no reason to abandon the method if such data were to suddenly become available.

Our dataset is a collection of manually transcribed, spontaneous, conversational speech in English and Czech. English transcripts have human-assigned labels from a hierarchical thesaurus of approximately 40,000 labels. Presently, labeled Czech data is not available for classifier training. The hierarchy may be divided into two principal branches, containing 1) concept labels (e.g., education) and 2) pre-coordinated place-date labels (e.g., Germany, 1914–1918).

Copyright is held by the author/owner.



A few methods present themselves for CLTC. As we have training data only in English, we may translate all of the Czech data features into English for classification (we refer to this as English sided classification). Alternatively, we may translate all English training features into Czech, before classifying in Czech. Finally, we may classify in both directions and combine the evidence. We here confine ourselves to English sided classification, although the concepts may naturally be extended (mutatis mutandis) to the Czech and two sided approaches.

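Our classification features are vectors of term frequencies in Czech, $c$, and English, $e$:

\[
c = \begin{pmatrix} tf(c_1) \\ \vdots \\ tf(c_{N_c}) \end{pmatrix},
\qquad
e = \begin{pmatrix} tf(e_1) \\ \vdots \\ tf(e_{N_e}) \end{pmatrix}
\]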



A vector's subscript denotes the language from which the term frequencies were originally drawn (e.g., $e_e$ denotes a feature vector of English term frequencies that were drawn from an English document). The principal novelty of English sided CLTC then is that, given feature vectors $e_e$ and $c_c$, we must produce translated testing vectors, $e_c$, suitable for classification.

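The matrix $E$ represents a probabilistic dictionary mapping between Czech and English terms, such that the $(i,j)$ element represents the probability that an English word $e_i$ is the translation of the Czech word $c_j$. That is, $E_{i,j} \equiv P(e_i \mid c_j)$, $1 \le i \le N_e$, $1 \le j \le N_c$:

\[
E = \begin{pmatrix}
P(e_1 \mid c_1) & \cdots & P(e_1 \mid c_{N_c}) \\
\vdots & \ddots & \vdots \\
P(e_{N_e} \mid c_1) & \cdots & P(e_{N_e} \mid c_{N_c})
\end{pmatrix}
\]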



By inspection, we see that $e_c$ may be reasonably approximated by $E c_c \approx e_c$, that is, by the left matrix product of the probabilistic dictionary matrix $E$ and the untranslated Czech feature vector $c_c$. Element $i$ of this product is $\sum_j P(e_i \mid c_j)\, tf(c_j)$, the expected frequency of the English term $e_i$ under the dictionary. Having obtained a set of training vectors $e_e$ (via normal indexing) and testing vectors $e_c$ (via probabilistic word translation), we are free to continue with classification as in the monolingual case.
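As a small illustration of the translation step, the sketch below computes $e_c \approx E c_c$ in plain Python. The vocabularies, the probabilities in `E`, and the function name `translate_tf` are hypothetical and chosen only for this example; they are not taken from the dataset or from the authors' implementation.

```python
# Toy vocabularies (hypothetical, for illustration only).
english_terms = ["school", "teacher", "city"]          # N_e = 3
czech_terms = ["škola", "učitel", "město", "učit"]     # N_c = 4

# E[i][j] = P(e_i | c_j). Each column sums to 1 in this toy dictionary.
E = [
    [0.9, 0.1, 0.0, 0.3],
    [0.1, 0.9, 0.0, 0.7],
    [0.0, 0.0, 1.0, 0.0],
]


def translate_tf(E, c_c):
    """Return e_c = E c_c, the expected English term frequencies."""
    return [sum(p * tf for p, tf in zip(row, c_c)) for row in E]


# Czech term frequencies for one test document (the vector c_c).
c_c = [2.0, 1.0, 0.0, 1.0]
e_c = translate_tf(E, c_c)

for term, value in zip(english_terms, e_c):
    print(f"{term}: {value:.2f}")
```

Because every column of this toy `E` sums to one, the translated vector keeps the same total mass as the Czech vector; a real dictionary with missing or filtered entries need not have that property.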

Before documents are indexed, they are parsed and fed into MORPHA [1] and the Czech Feature-Based Tagger [3] for lemmatization. Lemmatization is motivated by both the disparity in morphological richness between English and Czech (which increases the granularity, and thus the noise, of translation) and the expectation that most of the semantic information associated with words (from which we infer thesaurus labels) is as present in their base forms as it is in their inflections.

The base of the probabilistic dictionary is taken from version 1.0 of the Prague Czech-English Dependency Treebank (PCEDT) [4], which contains conditional word-translation probabilities for 46,150 word translation pairs. The dictionary has been derived from a parallel Czech-English corpus based on Reader's Digest stories, technical texts, and the translation of the Penn Treebank's WSJ portion into Czech. IBM model 3 has been used in the extraction, and data has been subsequently filtered [2] to avoid most of the noise caused by relatively small datasets.

Indexing proceeds on the English documents by first checking whether each term is already present in the probabilistic dictionary. If it is, the term's frequency is incremented. If the base form for term $w$ is not present in the dictionary, we hope that the term might be a relevant feature sans translation, and therefore augment $E$ with $P(e_w \mid c_w) = 1$ before incrementing $w$'s term frequency. We then index the documents in Czech, although here it is unnecessary to augment the dictionary for previously unseen words (i.e., words not seen in the training documents), as we do not expect to infer a thesaurus label from features never observed in training. The indexed Czech vectors are probabilistically translated via left matrix multiplication by $E$ and classified using kNN with symmetric-Okapi. From informal monolingual trials on held-out English data, we determined a reasonable choice to be $k = 20$.
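The indexing and augmentation steps can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the sparse dictionary layout (Czech lemma mapped to English lemmas with probabilities), the function names, and the choice to skip Czech lemmas that have no dictionary entry are all assumptions made for illustration. The kNN step with symmetric-Okapi is deliberately left out.

```python
from collections import Counter


def index_english(lemmas, dictionary, english_vocab):
    """Count English lemmas; add an identity entry for lemmas the dictionary lacks."""
    tf = Counter()
    for w in lemmas:
        if w not in english_vocab:
            # Assumed form of the augmentation: P(e_w | c_w) = 1, so that an
            # identical token in Czech text maps to itself untranslated.
            dictionary.setdefault(w, {})[w] = 1.0
            english_vocab.add(w)
        tf[w] += 1
    return tf


def index_czech(lemmas, dictionary):
    """Count Czech lemmas; no augmentation is done on the Czech side."""
    return Counter(w for w in lemmas if w in dictionary)


def translate(tf_czech, dictionary):
    """Sparse equivalent of the matrix product E c_c."""
    e_c = Counter()
    for c, count in tf_czech.items():
        for e, p in dictionary[c].items():
            e_c[e] += p * count
    return e_c


# Hypothetical toy dictionary: Czech lemma -> {English lemma: P(e | c)}.
dictionary = {
    "škola": {"school": 1.0},
    "učitel": {"teacher": 0.8, "instructor": 0.2},
}
english_vocab = {"school", "teacher", "instructor"}

tf_en = index_english(["school", "prague", "teacher", "prague"],
                      dictionary, english_vocab)
tf_cz = index_czech(["škola", "prague", "učitel", "neznámý"], dictionary)
e_c = translate(tf_cz, dictionary)
print(dict(e_c))
```

In this toy run the token `prague` reaches the English side only because the English training text introduced it, while `neznámý` is dropped because nothing in training could match it.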


There is currently no labeled Czech data in our dataset. To evaluate our implementation, English sided classification was run on three disjoint segments of 25 Czech sentences each. The segment size was chosen to have roughly 400 words (the average number of words in three minutes of interview). The segments and their ten highest ranked labels were then given to a native Czech speaker for manual relevance assessment. Using the same training set, monolingual English classification was run on four similarly partitioned test segments. The relevance of many labels could not be determined by inspection (e.g., Poland, 1945 was hypothesized and, while the text made no explicit mention of Poland in 1945, the label was not ruled out). These questionable labelings were all simply assumed to be non-relevant. Table 1 lists precision calculations for both the English sided Czech experiments and monolingual English experiments. Precision was calculated over the five and ten highest ranked thesaurus labels (the complete set) as well as the five highest concept labels alone (that is, without the pre-coordinated place-date labels). Place-date labels may reasonably be excluded from consideration because it is nearly always impossible to assess their relevance to short text segments. On concept labels, the cross-language system performed at 73% of the monolingual precision.
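The precision measure described above can be written out as a short sketch. The judgment list below is invented purely for illustration and does not reproduce any result from Table 1; `None` marks a label whose relevance could not be determined, which is counted as non-relevant in line with the convention stated above.

```python
def precision_at_k(judgments, k):
    """Fraction of the k highest-ranked labels judged relevant.

    judgments holds True, False, or None in rank order; None (undetermined)
    is treated as non-relevant.
    """
    top = judgments[:k]
    return sum(1 for j in top if j is True) / k


# Hypothetical assessor judgments for one segment's ten highest-ranked labels.
judgments = [True, None, True, False, True, False, None, True, False, False]

p_at_5 = precision_at_k(judgments, 5)
p_at_10 = precision_at_k(judgments, 10)
print(p_at_5, p_at_10)
```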

Table 1: Precision over highest ranked topics

Consider every label assignment to be an independent trial with probability of success $p$. Now, $p$ will vary across thesaurus labels, but the largest $p$, $p_L$, will correspond to the label most commonly seen in the training data. If we were to randomly assign any one of the labels to a segment, $p_L$ would represent an upper bound on the probability of this label being relevant. In this spirit, we can consider $p_L$ to be an upper bound on the probability of success in a series of $n$ Bernoulli trials, such that an upper bound on the chance probability of obtaining $r$ or more successes in $n$ trials is

\[
P\{r \text{ or more successes}\} \le \sum_{i=r}^{n} \binom{n}{i}\, p_L^{\,i}\, (1 - p_L)^{n-i}. \tag{1}
\]

Note that for most $p$, $p \ll p_L$, so that we are strongly biasing the test against our method. From inspection, we found $p_L$ on all labels, $p_L = 954/43104$, and $p_L$ on concepts, $p_L = 954/28896$ (both corresponding to the label extended family members). The penultimate row of Table 1 lists the p-values calculated for each English sided experiment using Equation 1. We observe that our method is successfully classifying segments across the language barrier. This is likewise confirmed by the final row of Table 1, which lists an upper bound on the expected precision for any of the experiments (an interpretation of $p_L$).
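For readers who want to evaluate the bound in Equation 1 themselves, a minimal sketch follows. The two $p_L$ values are the ones quoted above; the function name and the choice of $r = 4$ relevant labels out of $n = 10$ are hypothetical and do not correspond to any row of Table 1.

```python
from math import comb


def chance_upper_bound(r, n, p_L):
    """Binomial tail: probability of r or more successes in n trials at rate p_L."""
    return sum(comb(n, i) * p_L**i * (1 - p_L) ** (n - i)
               for i in range(r, n + 1))


p_L_all = 954 / 43104        # most frequent label, over all labels
p_L_concepts = 954 / 28896   # most frequent label, over concept labels only

# Hypothetical outcome: 4 of the 10 highest-ranked labels judged relevant.
bound = chance_upper_bound(4, 10, p_L_all)
print(f"{bound:.2e}")
```

Because $p_L$ is the largest per-label probability, the value returned is conservative: the true chance probability for a typical label would be smaller.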

4. CONCLUSIONS AND FUTURE WORK

Having introduced the problem of CLTC, we discussed some of its salient features and potential methods for its solution. Our implementation was outlined, and preliminary feedback suggests that it is already meeting with some success. Future work will be prompted by the availability of additional testing data, possibly through machine translation of available labeled segments (i.e., to produce labeled pseudo-Czech). This data will allow more extensive evaluation, parameter optimization on held-out data, and two sided classifier combination studies.


Thanks to Martin Franz for assisting with relevance assessment. This work has been supported in part by NSF IIS award 0122466 (MALACH) and by the project MŠMT ČR No. MSM0021620838.




