Can we develop a list of names and/or addresses which cover the accented characters heuristic?

shey.crompton · 17 October 2024 11:27

I was recently talking to a colleague about the accented characters test heuristic on the MoT Test Heuristics Cheat Sheet (Ministry of Testing).
From that chat I was wondering if we could get a list together of names and possibly addresses which contain all the accented characters we may encounter.
From a bit of research, here’s a list of accented characters:
á Á à À â Â ä Ä ã Ã å Å æ Æ ç Ç é É è È ê Ê ë Ë í Í ì Ì î Î ï Ï ñ Ñ ó Ó ò Ò ô Ô ö Ö õ Õ ø Ø œ Œ ß ú Ú ù Ù û Û ü Ü

So, who’s up for creating some names and addresses (real or otherwise) that cover the above list for us to use in our tests?
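
Whatever list we end up with, it would be handy to check it against the characters above rather than eyeball it. Here is a minimal Python sketch of such a check. The function name `missing_characters` and the three sample strings are my own, purely for illustration, and the script only reports which of the listed characters never appear.

```python
import unicodedata

# The accented characters from the list above. Normalising to NFC guards
# against copy-paste producing "letter + combining accent" pairs.
TARGET = set(unicodedata.normalize(
    "NFC",
    "áÁàÀâÂäÄãÃåÅæÆçÇéÉèÈêÊëËíÍìÌîÎïÏñÑóÓòÒôÔöÖõÕøØœŒßúÚùÙûÛüÜ",
))


def missing_characters(samples, target=TARGET):
    """Return the target characters that appear in none of the samples."""
    seen = set()
    for text in samples:
        # NFC again, so 'e' + combining acute counts as 'é'.
        seen.update(unicodedata.normalize("NFC", text))
    return sorted(target - seen)


# Hypothetical sample data, only to show the call.
samples = ["José González", "Søren Kjær", "Löwenstraße 15"]
print("Still missing:", "".join(missing_characters(samples)))
```

Feeding it the full set of proposed names and addresses gives a quick answer to "did we leave anything out?".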

kinofrost · 17 October 2024 12:46

I got ya, yo. I think this covers them, let me know if I left anything out:

Character	DOS Alt Code	WIN Alt Code	Description
Ç	Alt 128	Alt 0199	Latin capital letter C with cedilla
ü	Alt 129	Alt 0252	Latin small letter u with diaeresis
é	Alt 130	Alt 0233	Latin small letter e with acute
â	Alt 131	Alt 0226	Latin small letter a with circumflex
ä	Alt 132	Alt 0228	Latin small letter a with diaeresis
à	Alt 133	Alt 0224	Latin small letter a with grave
å	Alt 134	Alt 0229	Latin small letter a with ring above
ç	Alt 135	Alt 0231	Latin small letter c with cedilla
ê	Alt 136	Alt 0234	Latin small letter e with circumflex
ë	Alt 137	Alt 0235	Latin small letter e with diaeresis
è	Alt 138	Alt 0232	Latin small letter e with grave
ï	Alt 139	Alt 0239	Latin small letter i with diaeresis
î	Alt 140	Alt 0238	Latin small letter i with circumflex
ì	Alt 141	Alt 0236	Latin small letter i with grave
Ä	Alt 142	Alt 0196	Latin capital letter A with diaeresis
Å	Alt 143	Alt 0197	Latin capital letter A with ring above
É	Alt 144	Alt 0201	Latin capital letter E with acute
æ	Alt 145	Alt 0230	Latin small letter ae, ash (from Old English æsc)
Æ	Alt 146	Alt 0198	Latin capital letter AE
ô	Alt 147	Alt 0244	Latin small letter o with circumflex
ö	Alt 148	Alt 0246	Latin small letter o with diaeresis
ò	Alt 149	Alt 0242	Latin small letter o with grave
û	Alt 150	Alt 0251	Latin small letter u with circumflex
ù	Alt 151	Alt 0249	Latin small letter u with grave
ÿ	Alt 152	Alt 0255	Latin small letter y with diaeresis
Ö	Alt 153	Alt 0214	Latin capital letter O with diaeresis
Ü	Alt 154	Alt 0220	Latin capital letter U with diaeresis
á	Alt 160	Alt 0225	Latin small letter a with acute
í	Alt 161	Alt 0237	Latin small letter i with acute
ó	Alt 162	Alt 0243	Latin small letter o with acute
ú	Alt 163	Alt 0250	Latin small letter u with acute
ñ	Alt 164	Alt 0241	Latin small letter n with tilde, small letter enye
Ñ	Alt 165	Alt 0209	Latin capital letter N with tilde, capital letter enye
Š		Alt 0138	Latin capital letter S with caron, S hacek
Œ		Alt 0140	Latin capital ligature OE
Ž		Alt 0142	Latin capital letter Z with caron, Z hacek
š		Alt 0154	Latin small letter s with caron, s hacek
œ		Alt 0156	Latin small ligature oe
ž		Alt 0158	Latin small letter z with caron, z hacek
Ÿ		Alt 0159	Latin capital letter Y with diaeresis
À		Alt 0192	Latin capital letter A with grave
Á		Alt 0193	Latin capital letter A with acute
Â		Alt 0194	Latin capital letter A with circumflex
Ã		Alt 0195	Latin capital letter A with tilde
È		Alt 0200	Latin capital letter E with grave
Ê		Alt 0202	Latin capital letter E with circumflex
Ë		Alt 0203	Latin capital letter E with diaeresis
Ì		Alt 0204	Latin capital letter I with grave
Í		Alt 0205	Latin capital letter I with acute
Î		Alt 0206	Latin capital letter I with circumflex
Ï		Alt 0207	Latin capital letter I with diaeresis
Ð		Alt 0208	Latin capital letter eth
Ò		Alt 0210	Latin capital letter O with grave
Ó		Alt 0211	Latin capital letter O with acute
Ô		Alt 0212	Latin capital letter O with circumflex
Õ		Alt 0213	Latin capital letter O with tilde
Ø		Alt 0216	Latin capital letter O with stroke
Ù		Alt 0217	Latin capital letter U with grave
Ú		Alt 0218	Latin capital letter U with acute
Û		Alt 0219	Latin capital letter U with circumflex
Ý		Alt 0221	Latin capital letter Y with acute
ß		Alt 0223	Latin small letter sharp s, eszett
ã		Alt 0227	Latin small letter a with tilde
õ		Alt 0245	Latin small letter o with tilde
ø		Alt 0248	Latin small letter o with stroke
ý		Alt 0253	Latin small letter y with acute
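
If you want to sanity-check rows of the table, here is a rough Python sketch. It assumes the usual US/Western European Windows setup, where the DOS Alt codes follow code page 437 and the zero-prefixed WIN Alt codes follow Windows-1252. That mapping is my assumption about where the table comes from, and `check_row` is just a name I made up.

```python
# (character, DOS Alt code or None, WIN Alt code) for a few rows of the table.
ROWS = [
    ("Ç", 128, 199),
    ("ü", 129, 252),
    ("é", 130, 233),
    ("ñ", 164, 241),
    ("Œ", None, 140),
    ("ß", None, 223),
]


def check_row(char, dos_code, win_code):
    """Return True if the codes decode to char under the assumed code pages."""
    # WIN Alt codes: assumed to be Windows-1252 byte values.
    ok = bytes([win_code]).decode("cp1252") == char
    # DOS Alt codes: assumed to be code page 437 byte values.
    if dos_code is not None:
        ok = ok and bytes([dos_code]).decode("cp437") == char
    return ok


for row in ROWS:
    print(row[0], "OK" if check_row(*row) else "MISMATCH")
```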

kinofrost · 17 October 2024 12:53

There’s also a program called perlclip that can copy the 1-255 characters into the clipboard for you by typing the command “$allchars”.

It can do other things like produce strings of requested length that indicate the length within the string, like a ruler. Or produce crazy long custom strings.
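
For anyone without perlclip to hand, the two ideas can be sketched in a few lines of Python. This is not perlclip's code, just my approximation of the behaviour described above. The function names are mine, and I am assuming "1-255" means code points 1 to 255.

```python
def all_chars():
    """Return one string holding the characters with code points 1 to 255."""
    return "".join(chr(i) for i in range(1, 256))


def counterstring(length, marker="*"):
    """Build a self-measuring string of exactly `length` characters.

    Each number states the position of the marker that follows it,
    so a truncated string shows how many characters survived.
    Assumes `marker` is a single character.
    """
    parts = []
    pos = length
    # Work backwards from the end so the final marker sits at `length`.
    while pos > 0:
        token = f"{pos}{marker}"
        if len(token) > pos:
            # Not enough room left at the start: keep only the tail.
            token = token[-pos:]
        parts.append(token)
        pos -= len(token)
    return "".join(reversed(parts))


print(counterstring(10))
print(len(all_chars()))
```

If a field cuts the string off after `...25*`, you know it accepted 25 characters.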

charlie_from_cny · 17 October 2024 16:44

There’s also this great project that maintains lists (files) of strings to test with: the Big List of Naughty Strings (BLNS).

alagrate · 18 October 2024 12:33

If you want to extend it a bit, and depending on the application / dataset you’re looking for, maybe also take into account Greek and Bulgarian (Cyrillic).

On the Germanic languages, Schloß is also a favourite.

shad0wpuppet · 18 October 2024 12:56

This is exactly the kind of task where AI can generate the test data.

Names:

José González
François Lévêque
Márta Hárs
Łukasz Nowak
Göran Sjöström
Seán Ó Conaill
Åsa Björk
Jürgen Müller
Søren Kjær
Zoë O’Connell
İsmail Demir
Bjørn Østvik
Ángel Fernández
Dvořák Jan
Elżbieta Król
Michèle Dubois
René Álvarez
Mário Simões
Nínive Martins
Pål Ødegård

Addresses:

Rua São João, 45, São Paulo, Brazil
1234, Rue de l’Église, Montréal, QC, Canada
Łąkowa 7, 00-987 Warszawa, Poland
Plaza del Ángel, 22, 28012 Madrid, Spain
Grüner Weg 18, 04109 Leipzig, Germany
Blåbärsvägen 5, 123 45 Stockholm, Sweden
Rua Álvares Cabral 15, Lisboa, Portugal
Calle Mayor 3, 28013 Madrid, Spain
Československé armády 12, Praha 6, Czech Republic
Gata Västerlånggatan 1, 111 29 Stockholm, Sweden
Via Giuseppe Garibaldi, 27, 10122 Torino TO, Italy
Østergade 10, 1100 København K, Denmark
Av. Independência 45, Ciudad de México, Mexico
Boulevard Saint-Germain, 75007 Paris, France
Σοφοκλέους 12, Αθήνα, Greece
Rúa do Souto, 15701 Santiago de Compostela, Spain
Löwenstraße 15, 8001 Zürich, Switzerland
Calle de la Reina, 46001 Valencia, Spain
Östra Hamngatan 18, 411 09 Göteborg, Sweden
Štefánikova 12, 811 05 Bratislava, Slovakia

URLs:

steve.green · 18 October 2024 17:52

When you talk about “all the accented characters”, you’re really talking about the Extended ASCII character set, which adds 128 characters on top of the basic ASCII character set. Then there’s Unicode, which adds tens of thousands more characters.

But you’re thinking about this the wrong way. It’s actually an equivalence partitioning issue, and you will miss many test cases if you don’t think of it that way. For instance:

ASCII
Hex values 0 to 7F (or 0 to 127 in decimal) represent the ASCII character set, which is one of the partitions. But not all characters in that set are equivalent. Some are control characters that are not displayed. Some are whitespace characters. Then you have upper case, lower case, numbers and symbols (please don’t ever call them “special characters”).

Depending on an input field’s purpose, you may need to test characters from all those partitions. It’s actually more complicated than that because some characters might be allowed from one of the partitions while others are not. For instance, the letter “e” may be allowed in a numeric field. Do you know why? Here’s a clue - the letter “E” would not be allowed.
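
To make the ASCII partitions concrete, here is a minimal Python sketch that sorts the 128 ASCII characters into the groups mentioned above. The function name and the exact boundaries are my own choices. For example, I count tab and newline as whitespace rather than control characters, and you may prefer to split them differently.

```python
import string


def ascii_partition(ch):
    """Classify one ASCII character into a single test partition."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # Checked first, so tab, newline etc. land here rather than in "control".
    if ch in string.whitespace:
        return "whitespace"
    if code < 32 or code == 127:
        return "control"
    if ch in string.ascii_uppercase:
        return "upper"
    if ch in string.ascii_lowercase:
        return "lower"
    if ch in string.digits:
        return "digit"
    return "symbol"


partitions = {}
for code in range(128):
    partitions.setdefault(ascii_partition(chr(code)), []).append(chr(code))

for name, members in partitions.items():
    print(f"{name}: {len(members)} characters")
```

Picking at least one character from each group, plus the boundaries between groups, gives a starting set of test inputs for a field.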

Extended ASCII
Hex values 80 to FF (or 128 to 255 in decimal) represent the Extended ASCII character set. This adds another 128 characters: accented letters, symbols, shapes, Greek letters and more, with the exact set depending on which code page is in use.

In my experience, systems tend to allow all of them or none of them, although that doesn’t have to be the case.
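
One thing worth seeing for yourself is that the byte values 128 to 255 do not mean the same thing everywhere. The Python sketch below decodes a few of them under three common code pages. The choice of code pages and the helper name `decode_everywhere` are mine, picked only for illustration.

```python
# Three common single-byte encodings that all get called "Extended ASCII".
CODE_PAGES = ["cp437", "latin-1", "cp1252"]


def decode_everywhere(byte_value):
    """Decode one byte under each code page; None where it is unassigned."""
    results = {}
    for page in CODE_PAGES:
        try:
            results[page] = bytes([byte_value]).decode(page)
        except UnicodeDecodeError:
            results[page] = None
    return results


for value in (130, 225, 233):
    print(value, decode_everywhere(value))
```

So "é" is 130 on one system and 233 on another, which is worth remembering when a test string crosses a system boundary.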

Unicode
This way lies madness. Unicode was created to accommodate all the thousands of characters in languages such as Chinese and Japanese. It has changed a lot over the years, and Wikipedia currently says “Version 16.0 of the standard defines 154,998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts”.

Again, quoting Wikipedia, “The Unicode Standard defines three encodings: UTF-8, UTF-16, and UTF-32, though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII.”

A Unicode character (code point) can take from 1 to 4 bytes in UTF-8, and any given sequence of bytes might represent different characters depending on which encoding has been declared.
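
The 1 to 4 byte range is easy to see in practice. This small Python sketch prints how many bytes a few sample characters take in each of the three encodings. The sample characters and the helper name `encoded_sizes` are my own picks.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants skip the byte order mark, so only the
        # character itself is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in ["e", "é", "€", "😀"]:
    print(ch, encoded_sizes(ch))
```

A field limited to "20 characters" may therefore be limited to 20 bytes somewhere further down the line, so length limits deserve a test with the multi-byte characters.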

What could possibly go wrong?
The existence of different character sets causes all kinds of fun when data is transferred between different parts of a system such as the database, web server, browser, email server, CRM and ERP systems and APIs to other systems including third parties etc. Some parts of the system may work fine, yet the data gets trashed when sent to other parts.

And so it came to pass
We encountered this when testing a ticketing system for a football club in 2011. Two halves of the system were developed separately and joined together at the end. One team knew for sure that everyone was using UTF-8, and the other team knew for sure that everyone was using UTF-7 (a 7-bit encoding that represents non-ASCII characters quite differently from UTF-8).

Both were stubbing out the connection to the other part of the system during development, and all their testing worked perfectly until we came along to do the integration testing. The UTF-8 end still worked fine, but the UTF-7 end corrupted all the data it received.
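
You can reproduce this kind of failure in miniature. The Python sketch below is not the ticketing system's code, just an illustration with a made-up `transfer` helper and a made-up name. It encodes text with one encoding and decodes it with another, the way two mismatched halves of a system would.

```python
def transfer(text, sender_encoding, receiver_encoding):
    """Encode with one encoding and decode with another.

    Undecodable bytes become U+FFFD instead of raising, which is
    roughly what a lenient receiving system would do.
    """
    raw = text.encode(sender_encoding)
    return raw.decode(receiver_encoding, errors="replace")


name = "José Müller"  # hypothetical test value
print(transfer(name, "utf-8", "utf-8"))   # both ends agree
print(transfer(name, "utf-8", "cp1252"))  # classic mojibake
print(transfer(name, "utf-8", "utf-7"))   # UTF-7 end receiving UTF-8
print(transfer(name, "utf-7", "utf-8"))   # UTF-8 end receiving UTF-7
print(transfer("Smith", "utf-8", "utf-7"))  # plain ASCII survives
```

The last line is the important one. Plain ASCII test data passes through every combination untouched, so a mismatch like this stays hidden until someone enters an accented name.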
