Can we develop a list of names and/or addresses which cover the accented characters heuristic?

I was recently talking to a colleague about the accented characters heuristic on the MoT Test Heuristics Cheat Sheet (Ministry of Testing).
That chat left me wondering whether we could put together a list of names, and possibly addresses, that contain all the accented characters we may encounter.
From a bit of research, here’s a list of accented characters:
á Á à À â Â ä Ä ã Ã å Å æ Æ ç Ç é É è È ê Ê ë Ë í Í ì Ì î Î ï Ï ñ Ñ ó Ó ò Ò ô Ô ö Ö õ Õ ø Ø œ Œ ß ú Ú ù Ù û Û ü Ü

So, who’s up for creating some names and addresses (real or otherwise) that cover the above list for us to use in our tests?


I got ya, yo. I think this covers them, let me know if I left anything out:

| Character | DOS Alt Code | WIN Alt Code | Description |
|---|---|---|---|
| Ç | Alt 128 | Alt 0199 | Latin capital letter C with cedilla |
| ü | Alt 129 | Alt 0252 | Latin small letter u with diaeresis |
| é | Alt 130 | Alt 0233 | Latin small letter e with acute |
| â | Alt 131 | Alt 0226 | Latin small letter a with circumflex |
| ä | Alt 132 | Alt 0228 | Latin small letter a with diaeresis |
| à | Alt 133 | Alt 0224 | Latin small letter a with grave |
| å | Alt 134 | Alt 0229 | Latin small letter a with ring above |
| ç | Alt 135 | Alt 0231 | Latin small letter c with cedilla |
| ê | Alt 136 | Alt 0234 | Latin small letter e with circumflex |
| ë | Alt 137 | Alt 0235 | Latin small letter e with diaeresis |
| è | Alt 138 | Alt 0232 | Latin small letter e with grave |
| ï | Alt 139 | Alt 0239 | Latin small letter i with diaeresis |
| î | Alt 140 | Alt 0238 | Latin small letter i with circumflex |
| ì | Alt 141 | Alt 0236 | Latin small letter i with grave |
| Ä | Alt 142 | Alt 0196 | Latin capital letter A with diaeresis |
| Å | Alt 143 | Alt 0197 | Latin capital letter A with ring above |
| É | Alt 144 | Alt 0201 | Latin capital letter E with acute |
| æ | Alt 145 | Alt 0230 | Latin small letter ae, ash (from Old English æsc) |
| Æ | Alt 146 | Alt 0198 | Latin capital letter AE |
| ô | Alt 147 | Alt 0244 | Latin small letter o with circumflex |
| ö | Alt 148 | Alt 0246 | Latin small letter o with diaeresis |
| ò | Alt 149 | Alt 0242 | Latin small letter o with grave |
| û | Alt 150 | Alt 0251 | Latin small letter u with circumflex |
| ù | Alt 151 | Alt 0249 | Latin small letter u with grave |
| ÿ | Alt 152 | Alt 0255 | Latin small letter y with diaeresis |
| Ö | Alt 153 | Alt 0214 | Latin capital letter O with diaeresis |
| Ü | Alt 154 | Alt 0220 | Latin capital letter U with diaeresis |
| á | Alt 160 | Alt 0225 | Latin small letter a with acute |
| í | Alt 161 | Alt 0237 | Latin small letter i with acute |
| ó | Alt 162 | Alt 0243 | Latin small letter o with acute |
| ú | Alt 163 | Alt 0250 | Latin small letter u with acute |
| ñ | Alt 164 | Alt 0241 | Latin small letter n with tilde, small letter enye |
| Ñ | Alt 165 | Alt 0209 | Latin capital letter N with tilde, capital letter enye |
| Š | | Alt 0138 | Latin capital letter S with caron, S hacek |
| Œ | | Alt 0140 | Latin capital ligature OE |
| Ž | | Alt 0142 | Latin capital letter Z with caron, Z hacek |
| š | | Alt 0154 | Latin small letter s with caron, s hacek |
| œ | | Alt 0156 | Latin small ligature oe |
| ž | | Alt 0158 | Latin small letter z with caron, z hacek |
| Ÿ | | Alt 0159 | Latin capital letter Y with diaeresis |
| À | | Alt 0192 | Latin capital letter A with grave |
| Á | | Alt 0193 | Latin capital letter A with acute |
| Â | | Alt 0194 | Latin capital letter A with circumflex |
| Ã | | Alt 0195 | Latin capital letter A with tilde |
| È | | Alt 0200 | Latin capital letter E with grave |
| Ê | | Alt 0202 | Latin capital letter E with circumflex |
| Ë | | Alt 0203 | Latin capital letter E with diaeresis |
| Ì | | Alt 0204 | Latin capital letter I with grave |
| Í | | Alt 0205 | Latin capital letter I with acute |
| Î | | Alt 0206 | Latin capital letter I with circumflex |
| Ï | | Alt 0207 | Latin capital letter I with diaeresis |
| Ð | | Alt 0208 | Latin capital letter eth |
| Ò | | Alt 0210 | Latin capital letter O with grave |
| Ó | | Alt 0211 | Latin capital letter O with acute |
| Ô | | Alt 0212 | Latin capital letter O with circumflex |
| Õ | | Alt 0213 | Latin capital letter O with tilde |
| Ø | | Alt 0216 | Latin capital letter O with stroke |
| Ù | | Alt 0217 | Latin capital letter U with grave |
| Ú | | Alt 0218 | Latin capital letter U with acute |
| Û | | Alt 0219 | Latin capital letter U with circumflex |
| Ý | | Alt 0221 | Latin capital letter Y with acute |
| ß | | Alt 0223 | Latin small letter sharp s, eszett |
| ã | | Alt 0227 | Latin small letter a with tilde |
| õ | | Alt 0245 | Latin small letter o with tilde |
| ø | | Alt 0248 | Latin small letter o with stroke |
| ý | | Alt 0253 | Latin small letter y with acute |
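
The two Alt code columns line up with two different single-byte encodings. As a rough sketch, assuming the DOS codes follow code page 437 and the WIN codes follow Windows-1252 (the values in the table match both, but that mapping is my assumption), Python’s built-in codecs can reproduce any row. The helper name `char_for_alt_code` is made up for illustration.

```python
def char_for_alt_code(code: int, encoding: str) -> str:
    """Decode one byte value using the given single-byte code page."""
    return bytes([code]).decode(encoding)


# Assumption: DOS Alt codes follow code page 437, WIN Alt codes follow Windows-1252.
DOS = "cp437"
WIN = "cp1252"

# (DOS code, WIN code) pairs copied from a few rows of the table.
rows = [(128, 199), (129, 252), (164, 241)]

for dos_code, win_code in rows:
    dos_char = char_for_alt_code(dos_code, DOS)
    win_char = char_for_alt_code(win_code, WIN)
    # Both columns should yield the same character for a given row.
    print(dos_code, dos_char, win_code, win_char, dos_char == win_char)
```

The same number can mean a different character in each scheme, so it helps to note which column a code came from when writing up a bug.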

There’s also a program called perlclip that can copy the characters with codes 1 to 255 to the clipboard for you when you type the command “$allchars”.

It can also produce strings of a requested length that mark their own length within the string, like a ruler (counterstrings), or produce crazy long custom strings.
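
The ruler-like strings can be sketched in a few lines. This is not perlclip’s actual implementation, just a minimal illustration of the idea: each marker character sits at the position given by the number before it, so the end of a truncated string tells you how many characters survived. The function name and the `*` marker are assumptions.

```python
def counterstring(length: int, marker: str = "*") -> str:
    """Build a self-measuring string of exactly `length` characters."""
    parts = []
    pos = length
    # Work backwards from the end so every marker lands on its own position.
    while pos > 0:
        token = f"{pos}{marker}"
        if len(token) > pos:
            # Not enough room left at the start: keep only the tail of the token.
            token = token[-pos:]
        parts.append(token)
        pos -= len(token)
    return "".join(reversed(parts))


print(counterstring(10))  # *3*5*7*10*
```

If a field silently truncates input, pasting `counterstring(300)` and reading the last number before the final marker gives a quick estimate of the limit.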


There’s also this great project that maintains lists (files) containing strings to test with: the Big List of Naughty Strings (BLNS)


If you want to extend it a bit, and depending on the application / dataset you’re looking for, maybe also take into account Greek and Bulgarian (Cyrillic).

On the Germanic languages, Schloß is also a favourite.


This is the exact task for using AI to generate test data :sweat_smile:

Names:

  1. José González
  2. François Lévêque
  3. Márta Hárs
  4. Łukasz Nowak
  5. Göran Sjöström
  6. Seán Ó Conaill
  7. Åsa Björk
  8. Jürgen Müller
  9. Søren Kjær
  10. Zoë O’Connell
  11. İsmail Demir
  12. Bjørn Østvik
  13. Ángel Fernández
  14. Dvořák Jan
  15. Elżbieta Król
  16. Michèle Dubois
  17. René Álvarez
  18. Mário Simões
  19. Nínive Martins
  20. Pål Ødegård

Addresses:

  1. Rua São João, 45, São Paulo, Brazil
  2. 1234, Rue de l’Église, Montréal, QC, Canada
  3. Łąkowa 7, 00-987 Warszawa, Poland
  4. Plaza del Ángel, 22, 28012 Madrid, Spain
  5. Grüner Weg 18, 04109 Leipzig, Germany
  6. Blåbärsvägen 5, 123 45 Stockholm, Sweden
  7. Rua Álvares Cabral 15, Lisboa, Portugal
  8. Calle Mayor 3, 28013 Madrid, Spain
  9. Československé armády 12, Praha 6, Czech Republic
  10. Gata Västerlånggatan 1, 111 29 Stockholm, Sweden
  11. Via Giuseppe Garibaldi, 27, 10122 Torino TO, Italy
  12. Østergade 10, 1100 København K, Denmark
  13. Av. Independência 45, Ciudad de México, Mexico
  14. Boulevard Saint-Germain, 75007 Paris, France
  15. Σοφοκλέους 12, Αθήνα, Greece
  16. Rúa do Souto, 15701 Santiago de Compostela, Spain
  17. Löwenstraße 15, 8001 Zürich, Switzerland
  18. Calle de la Reina, 46001 Valencia, Spain
  19. Östra Hamngatan 18, 411 09 Göteborg, Sweden
  20. Štefánikova 12, 811 05 Bratislava, Slovakia

URLs:

  1. https://münchen.de
  2. http://niño.com
  3. http://tromsø.no
  4. http://çorum.tr
  5. http://küche.de
  6. http://føtex.dk
  7. http://réalité.fr
  8. http://žaluzie.cz
  9. http://mariá.hu
  10. http://grünerweg.de
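
Generated lists like these are worth checking against the original character list rather than trusting them. Here is a small sketch of such a check; the function name `uncovered` is made up, the target string is the list from the first post, and only a handful of the entries above are included as sample data.

```python
import unicodedata

# Target characters from the list in the first post.
TARGETS = "áÁàÀâÂäÄãÃåÅæÆçÇéÉèÈêÊëËíÍìÌîÎïÏñÑóÓòÒôÔöÖõÕøØœŒßúÚùÙûÛüÜ"


def uncovered(targets: str, samples: list[str]) -> list[str]:
    """Return the target characters that appear in none of the samples."""
    seen = set()
    for sample in samples:
        # Normalise to NFC so "e" + combining acute counts as the single character "é".
        seen.update(unicodedata.normalize("NFC", sample))
    return [ch for ch in unicodedata.normalize("NFC", targets) if ch not in seen]


# A few of the names and addresses from the lists above.
samples = [
    "José González",
    "François Lévêque",
    "Åsa Björk",
    "Søren Kjær",
    "Rua São João, 45, São Paulo, Brazil",
    "Löwenstraße 15, 8001 Zürich, Switzerland",
]

print("Still missing:", " ".join(uncovered(TARGETS, samples)))
```

Running it over the full lists shows which characters still need a name or address of their own.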

When you talk about “all the accented characters”, you’re really talking about an Extended ASCII character set (there are several, such as code page 437 and Windows-1252), each of which adds up to 128 characters on top of the 128 in basic ASCII. Then there’s Unicode, which adds well over a hundred thousand more characters.

But you’re thinking about this the wrong way. It’s actually an equivalence partitioning issue, and you will miss many test cases if you don’t think of it that way. For instance:

ASCII
Hex values 0 to 7F (or 0 to 127 in decimal) represent the ASCII character set, which is one of the partitions. But not all characters in that set are equivalent. Some are control characters that are not displayed. Some are whitespace characters. Then you have upper case, lower case, numbers and symbols (please don’t ever call them “special characters”).

Depending on an input field’s purpose, you may need to test characters from all those partitions. It’s actually more complicated than that because some characters might be allowed from one of the partitions while others are not. For instance, the letter “e” may be allowed in a numeric field. Do you know why? Here’s a clue - the letter “E” would not be allowed.
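
The ASCII partitions described above can be written down explicitly, which makes it easy to pick at least one test character from each. This is only a sketch: the partition names are mine, and tab, newline and friends are both control and whitespace characters, so putting them under whitespace is an arbitrary choice.

```python
import string
from collections import defaultdict


def ascii_partition(ch: str) -> str:
    """Name the partition a single character falls into."""
    code = ord(ch)
    if code > 0x7F:
        return "non-ASCII"
    if ch in string.whitespace:  # space, \t, \n, \r, \x0b, \x0c
        return "whitespace"
    if code < 0x20 or code == 0x7F:
        return "control"
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_uppercase:
        return "upper"
    if ch in string.ascii_lowercase:
        return "lower"
    return "symbol"


# Group all 128 ASCII characters by partition.
partitions = defaultdict(list)
for code in range(128):
    partitions[ascii_partition(chr(code))].append(chr(code))

for name, chars in partitions.items():
    print(f"{name}: {len(chars)} characters")
```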

Extended ASCII
Hex values 80 to FF (or 128 to 255 in decimal) represent the Extended ASCII range. Depending on the code page, this adds up to another 128 accented characters, symbols, shapes, Greek letters and more.

In my experience, systems tend to allow all of them or none of them, although that doesn’t have to be the case.

Unicode
This way lies madness. Unicode was created to accommodate all the thousands of characters in languages such as Chinese and Japanese. It has changed a lot over the years, and Wikipedia currently says “Version 16.0 of the standard defines 154,998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts”.

Again, quoting Wikipedia, “The Unicode Standard defines three encodings: UTF-8, UTF-16, and UTF-32, though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII.”

In UTF-8, a single Unicode character (code point) is encoded as anywhere from 1 to 4 bytes, and any given sequence of bytes might represent different characters depending on which encoding has been declared.
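
To make the byte counts concrete, here is a short sketch using Python’s standard codecs. The helper names `utf8_length` and `misread` are made up for illustration, and Windows-1252 stands in for “some other declared encoding”.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))


def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a mismatched system would."""
    return text.encode(written_as).decode(read_as, errors="replace")


for ch in ["e", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s) in UTF-8")

# The same bytes read under a different declared encoding.
print(repr(misread("é", "utf-8", "cp1252")))  # 'Ã©'
print(repr(misread("é", "cp1252", "utf-8")))  # '\ufffd' (replacement character)
```

The first mismatch produces plausible-looking but wrong text, while the second produces the replacement character, so both are worth looking for in test output.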

What could possibly go wrong?
The existence of different character sets causes all kinds of fun when data is transferred between different parts of a system such as the database, web server, browser, email server, CRM and ERP systems and APIs to other systems including third parties etc. Some parts of the system may work fine, yet the data gets trashed when sent to other parts.

And so it came to pass
We encountered this when testing a ticketing system for a football club in 2011. Two halves of the system were developed separately and joined together at the end. One team knew for sure that everyone was using UTF-8, and the other team knew for sure that everyone was using UTF-7 (a 7-bit-safe encoding that writes non-ASCII characters as Base64-style escape sequences, so its output is not compatible with UTF-8).

Both teams were stubbing out the connection to the other part of the system during development, and all their testing worked perfectly until we came along to do the integration testing. The UTF-8 end still worked fine, but the UTF-7 end corrupted all the data it received.
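
A mismatch of this kind can be simulated without the real system. The sketch below is not the ticketing system’s code; `send` is a hypothetical stand-in for “one side encodes, the other side decodes”, using Python’s built-in `utf-8` and `utf-7` codecs.

```python
from itertools import product


def send(text: str, sender_encoding: str, receiver_encoding: str) -> str:
    """Encode with the sender's assumption, decode with the receiver's."""
    wire = text.encode(sender_encoding)
    return wire.decode(receiver_encoding, errors="replace")


ENCODINGS = ["utf-8", "utf-7"]
TEST_DATA = ["Jose Muller", "José Müller"]  # ASCII-only and accented variants

for text in TEST_DATA:
    for sender, receiver in product(ENCODINGS, repeat=2):
        received = send(text, sender, receiver)
        status = "ok" if received == text else "CORRUPTED"
        print(f"{text!r}: {sender} -> {receiver}: {status} ({received!r})")
```

In this sketch the ASCII-only name survives every pairing and only the accented one is damaged when the two ends disagree, which is one way such a mismatch can stay hidden until accented test data crosses the real connection.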
