Its0505WordCount

From W3C Wiki


ITS WG Collaborative editing page


Status: Initial Draft, i.e. please focus on technical content rather than wordsmithing at this stage.

Author: Andrzej Zydron

I18N Metrics

Summary

The requirement is to provide an unambiguous, universally acceptable metric for a localization task that can be exchanged and processed electronically.

Challenge/issue

We tend to use the term 'word counts' to encompass all metrics to do with the i18n process. This is not an accurate definition, though. A better term is 'metrics': for some scripts, character counts are the only metrics currently available, and the term 'word counts' ignores all other metrics such as page, file, line and screen counts.

Metrics are an important aspect of sizing the i18n task for a given XML document. Metrics encompass word and character counts as well as other relevant counts (pages, lines, screens). There are two aspects here. The first is to decide what will be counted (e.g. Localization Directives). The second is the actual syntax of the count: the format of the metrics information (e.g. namespace syntax).

From the W3C ITS perspective, metrics need to be governed by an indication of which aspects of a document (translatable attributes, PCDATA) need to be counted as well as translated. This presupposes the existence of some form of Localization Directives definition document that would detail this information, plus additional translation/count directives within the document itself if required:


<para>This is a section of java code :
<programlisting>
  // this string is never used
  String sqlConnect = "connect / as sysdba";
  String sqlSelect = "select name from mytable where name=\"Tim\"";
  System.out.println("<message translate="yes">But of course, you should translate this string !</message>");
</programlisting>
</para>
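As a rough illustration of counting only what an in-document directive marks as translatable, the following Python sketch parses a shortened version of the fragment above and counts words inside elements carrying translate="yes". It is a minimal sketch under stated assumptions: the fragment is well-formed (the enclosing para is closed), the translate="yes" attribute is the only directive considered, and a "word" is any whitespace-separated token containing at least one letter or digit. The function name count_translatable_words and that counting rule are hypothetical and are not taken from ITS or GMX.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the DocBook fragment (illustrative only).
SAMPLE = """<para>This is a section of java code :
<programlisting>
  // this string is never used
  String sqlConnect = "connect / as sysdba";
  System.out.println("<message translate="yes">But of course, you should translate this string !</message>");
</programlisting>
</para>"""


def count_translatable_words(xml_text):
    """Count words only inside elements marked translate="yes".

    Assumption: a word is a whitespace-separated token that contains
    at least one alphanumeric character, so a lone "!" is not counted.
    Nested translate="yes" elements would be counted twice; a real tool
    would need explicit inheritance rules.
    """
    root = ET.fromstring(xml_text)
    total = 0
    for elem in root.iter():
        if elem.get("translate") == "yes":
            text = "".join(elem.itertext())
            total += sum(
                1 for token in text.split()
                if any(ch.isalnum() for ch in token)
            )
    return total


print(count_translatable_words(SAMPLE))
```

Under these assumptions the SQL strings and the surrounding para text contribute nothing to the count; only the marked message does. A different definition of "word" would give a different total, which is exactly the inconsistency this requirement is concerned with.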


Notes

One of the most enduring features of the GILT (Globalization, Internationalization, Localization and Translation) industry has been the inconsistency of word counts, not only between rival products but even between different versions of the same product. Word counts are the traditional metric for i18n tasks, but they do not in themselves provide a valid unit of measure. For logographic scripts such as Chinese, Japanese and Korean, word counts are very difficult if not impossible to establish. A better term is Globalization Industry Metrics, or GILT Metrics. Volume is also not an absolute indicator of the size of an i18n task: complexity and quality are other important ingredients.
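To make the script dependence concrete, here is a small Python sketch comparing a naive whitespace-based word count with a character count. The two sample sentences and both function names are assumptions chosen for illustration; neither function is a GMX/V-conformant counter.

```python
def whitespace_word_count(text):
    """Naive word count: number of whitespace-separated tokens."""
    return len(text.split())


def character_count(text):
    """Count all characters except whitespace."""
    return sum(1 for ch in text if not ch.isspace())


# Illustrative sample sentences (not taken from the original page).
ENGLISH = "Word counts are the traditional metric"
CHINESE = "字数统计是传统的度量标准"

for label, sample in (("English", ENGLISH), ("Chinese", CHINESE)):
    print(label, whitespace_word_count(sample), character_count(sample))
```

The naive counter reports the whole Chinese sentence as a single "word" because the text contains no spaces, while the character count remains usable for both samples. That is why character counts are often the only practical volume metric for such scripts.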

The issue of GILT Metrics is currently being addressed by the proposed LISA OSCAR tripartite GILT Metrics standards which encompass volume (GMX/V), complexity (GMX/C) and quality (GMX/Q) - http://www.lisa.org/oscar/gmx/.

Previous comments

Metrics are an important aspect of sizing the i18n task for a given XML document. Metrics encompass word and character counts as well as other relevant counts (pages, lines, screens). There are two aspects here. The first is to decide what will be counted (e.g. Localization Directives). The second is the actual syntax of the count: the format of the metrics information (e.g. namespace syntax).

[AZ- Just a quick note. We tend to use the term 'word counts' to encompass all metrics to do with the i18n process. This is not an accurate definition, though. A better term is 'metrics': for some scripts, character counts are the only metrics currently available, and the term 'word counts' ignores all other metrics such as page, file, line and screen counts.

[[YS- I'm not sure about this. Wouldn't the requirement about word count, from the point of view of ITS, be about how to indicate which parts of the document should be counted vs. which parts should not? Like in the original discussion here [1].

(And, by the way, in that aspect, do we have cases where 'is translatable' and 'should be counted' are not the same?)

[[AZ- This is a very good point. I am looking at the general issue of Localization Directives. This should tackle both extraction and metrics. Counts should be both qualitative and quantitative - defining both volume and nature of the element being counted, e.g. numeric, measurement, POT (plain old text) etc.
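One possible reading of "qualitative and quantitative" counts is sketched below in Python: each token is classified as numeric, measurement or plain text, and the categories are tallied separately. The category names follow the comment above, but the regular expressions, the short list of unit suffixes and the function names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical patterns: plain numbers, and numbers followed by a unit suffix.
NUMERIC = re.compile(r"^\d+([.,]\d+)*$")
MEASUREMENT = re.compile(r"^\d+([.,]\d+)*(mm|cm|m|km|g|kg|ms|s)$")


def classify(token):
    """Return the assumed category of a single token."""
    if MEASUREMENT.match(token):
        return "measurement"
    if NUMERIC.match(token):
        return "numeric"
    return "text"


def categorized_counts(text):
    """Tally tokens per category after stripping surrounding punctuation."""
    tokens = [t.strip(".,;:!?()") for t in text.split()]
    return Counter(classify(t) for t in tokens if t)


print(categorized_counts("Cut 3 boards to 25cm each."))
```

Reporting the categories separately keeps the volume figure while also recording the nature of what was counted, so that numbers and measurements can be costed differently from plain text if desired.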

[[TF Yep. From experience, and in our XLIFF filter impl. :-), there are cases where you can recognise that there may be translatable content but have no idea how to wordcount the text. Our case was program listings where we have no idea which programming language is being demonstrated and the code example isn't internationalised, e.g. the following fragment of DocBook:


<para>This is a section of java code :
<programlisting>
  // this string is never used
  String sqlConnect = "connect / as sysdba";
  String sqlSelect = "select name from mytable where name=\"Tim\"";
  System.out.println("But of course, you should translate this string !");
</programlisting>
</para>
<para>
Note that the section above will probably cause confusion to word-count-algorithms.</para>

This is a tough one to crack: basically, in our XLIFF impl we just had to admit that there may be translatable text in the programlisting, but we don't know how to wordcount it. end TF]]

[[MI The only way to manage the case above should be as follows (you don't need the translate attribute, though...) and instruct the parser to extract only text from the programlisting element, just as JDOM can do:


<para>This is a section of java code :
<programlisting>
  // this string is never used
  String sqlConnect = "connect / as sysdba";
  String sqlSelect = "select name from mytable where name=\"Tim\"";
  System.out.println("<message translate="yes">But of course, you should translate this string !</message>");
</programlisting>
</para>

NO? end of MI]]

I realize the importance of a common way to calculate the counts, but something like GMX seems, maybe, out of scope, as it's not something to do with making documents, in general, easier to localize.

Well, I guess I could see it as a guideline just in case some schema designers would want to include a word-count storage mechanism in their format. ]]

Introduction

Word and character count metrics form an important unit of measure which is used to establish the size and cost of a given i18n task.
