-
1. Re: json decode utf8 strings
Andrew Waters Oct 10, 2018 9:55 AM (in response to Rémi Chaffard)
There is a defect associated with this (DRUD1-23580). It is the JSON decoder, which gets confused when there are Unicode escape sequences in a UTF-8 string.
-
2. Re: json decode utf8 strings
Rémi Chaffard Oct 11, 2018 1:42 AM (in response to Andrew Waters)
OK, good to know.
Is there any way to modify the string with TPL code before giving it to the decoder, so we can at least parse something and avoid stopping the complete pattern?
Thanks
-
3. Re: json decode utf8 strings
Andrew Waters Oct 11, 2018 9:35 AM (in response to Rémi Chaffard)
Do you have some Unicode escape sequences in the JSON? That is, a backslash followed by u and 4 hex characters, i.e. matching the regex \\u[a-fA-F0-9]{4}
Replacing those works for the cases I have seen.
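As a rough illustration of the workaround described above, here is a minimal Python sketch (Python, not TPL) that strips such sequences before decoding. The function name strip_unicode_escapes and the sample string are assumptions made up for this example; only the regex comes from the post.

```python
import json
import re

# A literal backslash, then "u", then exactly four hex digits.
UNICODE_ESCAPE = re.compile(r'\\u[a-fA-F0-9]{4}')


def strip_unicode_escapes(raw):
    """Remove every Unicode escape sequence from raw JSON text."""
    return UNICODE_ESCAPE.sub('', raw)


# Hypothetical sample input: a raw string, so the backslashes are literal.
raw = r'{"name": "caf\u00e9", "op": "a \u0026 b"}'
cleaned = strip_unicode_escapes(raw)
print(cleaned)
print(json.loads(cleaned))
```

Note that this discards the escaped characters rather than converting them, which matches the workaround in the thread but loses data.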
-
4. Re: json decode utf8 strings
Rémi Chaffard Oct 12, 2018 12:57 AM (in response to Andrew Waters)
I tried the following (it is not very efficient, since it replaces the same sequence multiple times):
unicodes := regex.extractAll(output,regex'(\\u[a-fA-F0-9]{4})');
if unicodes then
    for u in unicodes do
        output := text.replace(output,u,'');
    end for;
end if;
// Convert JSON data
result := json.decode(output);
The problem is that, after this, the result variable is None and the rest of the pattern fails.
What did I do wrong here?
Thanks a lot
Rémi
-
5. Re: json decode utf8 strings
Andrew Waters Oct 12, 2018 12:41 PM (in response to Rémi Chaffard)
From just looking at it, nothing obvious. I would try logging the string before and after the replacement and comparing the two with a diff tool.
It could be something horrible, like the output containing \\u, so that you end up with an invalid escape character.
-
6. Re: json decode utf8 strings
Rémi Chaffard Oct 15, 2018 3:12 AM (in response to Andrew Waters)
OK, I wanted to avoid this since the file is pretty big. I will try it anyway, but how should I then proceed to find out what the decode error is? Is there any way to simulate it through Python code or something?
Without an error message from the json.decode function, there is no way to find the error just by reading the file.
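One possible way to get an error location outside TPL is Python's standard json module, whose JSONDecodeError carries the message and position of the failure. This is only a sketch: the helper name explain_json_error and the sample string are invented here, and Python's parser is not guaranteed to reject exactly the same inputs as the TPL json.decode function.

```python
import json


def explain_json_error(text, context=30):
    """Return None if text parses as JSON, else a short failure description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset at which decoding failed.
        start = max(err.pos - context, 0)
        snippet = text[start:err.pos + context]
        return "%s at line %d, column %d, near %r" % (
            err.msg, err.lineno, err.colno, snippet)
    return None


# Hypothetical broken input: a lone backslash inside a JSON string.
broken = r'{"a": "x \ y"}'
print(explain_json_error(broken))
```

For a large file, the logged output could be read into a string and passed to the same helper; the snippet around the reported offset avoids having to read the whole file by eye.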
-
7. Re: json decode utf8 strings
Rémi Chaffard Oct 15, 2018 4:12 AM (in response to Rémi Chaffard)
OK, I found out why.
The output contains nested JSON, and the nested part is escaped. This means that Unicode escapes like \u0026 are in fact \\u0026 there. After the replacement, a stray backslash is left behind, which makes the parser fail.
I used https://jsonlint.com/ to check.
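The failure mode can be reproduced in a few lines of Python (not TPL), under the assumption that the nested part was produced by ordinary JSON string escaping. The payload key and the sample values are made up for this sketch.

```python
import json
import re

# Inner JSON text containing one Unicode escape (raw string: literal backslash).
inner = r'{"op": "a \u0026 b"}'

# Embedding it as a string value escapes the backslash, so the escape
# now appears with two backslashes in the outer text.
outer = json.dumps({"payload": inner})
print(outer)

# Naive removal takes only one backslash plus uXXXX, leaving a stray backslash.
naive = re.sub(r'\\u[a-fA-F0-9]{4}', '', outer)
print(naive)

try:
    json.loads(naive)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print("decode failed:", err.msg)
```

The leftover backslash is followed by a space, which is not a valid JSON escape, so the decode fails.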
Rémi
-
8. Re: json decode utf8 strings
Rémi Chaffard Oct 15, 2018 6:21 AM (in response to Rémi Chaffard)
OK, one question then: is there any way to easily sort a list in TPL?
The code I'm using replaces the Unicode escapes in the order in which they occur in the input string. This means we may replace \u0026 before \\u0026, which is incorrect, because by then every occurrence of \\u0026 has already been reduced to a lone \.
I need a way to sort the list of sequences to replace, starting with those that have the most backslashes, so that the replacements happen in the correct order.
Or is there any way to replace directly by regex?
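For comparison, in a language with regex substitution the ordering problem goes away, because one pattern can consume the whole run of backslashes in a single pass. The Python sketch below shows the idea; whether TPL offers an equivalent call is not established in this thread, and the function name and sample data are assumptions.

```python
import json
import re

# One or more backslashes, then "u", then exactly four hex digits.
ESCAPED_UNICODE = re.compile(r'\\+u[a-fA-F0-9]{4}')


def strip_escapes_any_depth(raw):
    """Remove Unicode escapes together with all backslashes preceding them."""
    return ESCAPED_UNICODE.sub('', raw)


# Hypothetical input: one plain escape and one inside nested, escaped JSON.
sample = r'{"top": "x \u0026 y", "payload": "{\"op\": \"a \\u0026 b\"}"}'
decoded = json.loads(strip_escapes_any_depth(sample))
print(decoded)
```

One caveat: the pattern also removes an escaped literal backslash that happens to sit directly in front of an escape sequence, so it is a blunt workaround rather than a correct unescaping.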
Thanks
Rémi
-
9. Re: json decode utf8 strings
Rémi Chaffard Oct 16, 2018 3:29 AM (in response to Rémi Chaffard)
Hi,
I wrote some dirty code to handle this so I could finish my tests. It looks like this, in case someone is interested:
unicodes := regex.extractAll(output,regex'(\\+u[a-fA-F0-9]{4})');
replacements := [];
if unicodes then
    for u in unicodes do
        if not u in replacements then
            idx := findIndexToInsert(replacements,u);
            if idx = 0 then
                replacements := [u] + replacements;
            elif idx = size(replacements) then
                replacements := replacements + [u];
            else
                replacements := replacements[:idx] + [u] + replacements[idx:];
            end if;
        end if;
    end for;
    for r in replacements do
        output := text.replace(output,r,'');
    end for;
end if;
And the findIndexToInsert function:
define findIndexToInsert(tab, val) -> idx
    ''' '''
    idx := 0;
    for item in tab do
        if item >= val then
            break;
        end if;
        idx := idx + 1;
    end for;
    return idx;
end define;
All the code is about sorting the list of unicode characters to replace, in order to replace those having the biggest number of \ first.
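The same deduplicate-then-sort logic can be sketched in Python to check the ordering outside TPL. The function names and sample string are invented for this example. The sketch sorts explicitly by length, which equals the backslash count plus five here. The TPL version relies on ascending string comparison instead, which gives the same effect only if strings compare by character code (an assumption), since the backslash sorts before the letter u in ASCII.

```python
import re


def ordered_replacements(raw):
    """Unique escape sequences, those with the most backslashes first."""
    found = re.findall(r'\\+u[a-fA-F0-9]{4}', raw)
    # Every match is N backslashes + "u" + 4 hex digits, so a longer
    # match means more backslashes. Ties are broken alphabetically.
    return sorted(set(found), key=lambda seq: (-len(seq), seq))


def strip_in_order(raw):
    """Remove each sequence in turn, longest first, by plain text replace."""
    for seq in ordered_replacements(raw):
        raw = raw.replace(seq, '')
    return raw


# Hypothetical sample mixing single and double backslash escapes.
sample = r'a \u0026 b \\u0026 c \u0026 d \\u00e9'
print(ordered_replacements(sample))
print(strip_in_order(sample))
```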
I have not tested it in depth, but it does what I need for now. I will wait for the defect to be resolved.
It would still be good to have some more list management functions in TPL, such as sorting and removing duplicates; they can help sometimes. What do you think?
Thanks
Rémi