Fix JSON orient='table' issues with numeric column names #25488


Closed
albertvillanova wants to merge 12 commits

Conversation

albertvillanova
Contributor

@albertvillanova commented Feb 28, 2019

closes #19129
closes #22525

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@codecov

codecov bot commented Mar 1, 2019

Codecov Report

Merging #25488 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #25488      +/-   ##
==========================================
+ Coverage   91.75%   91.75%   +<.01%     
==========================================
  Files         173      173              
  Lines       52955    52960       +5     
==========================================
+ Hits        48588    48594       +6     
+ Misses       4367     4366       -1
Flag        Coverage   Δ
#multiple   90.32%     <100%>  (ø)       ⬆️
#single     41.71%     <0%>    (-0.01%)  ⬇️

Impacted Files                   Coverage   Δ
pandas/io/json/json.py           93.28%     <100%>  (+0.06%)  ⬆️
pandas/io/json/table_schema.py   98.31%     <100%>  (+0.01%)  ⬆️
pandas/util/testing.py           87.66%     <0%>    (+0.09%)  ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update db978c7...e7abdb1.

@codecov

codecov bot commented Mar 1, 2019

Codecov Report

Merging #25488 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #25488      +/-   ##
==========================================
- Coverage      92%   91.99%   -0.01%     
==========================================
  Files         175      175              
  Lines       52371    52381      +10     
==========================================
+ Hits        48184    48190       +6     
- Misses       4187     4191       +4
Flag        Coverage   Δ
#multiple   90.55%     <100%>   (ø)       ⬆️
#single     40.73%     <5.88%>  (-0.15%)  ⬇️

Impacted Files                   Coverage   Δ
pandas/io/json/json.py           93.3%      <100%>  (+0.06%)  ⬆️
pandas/io/json/table_schema.py   98.38%     <100%>  (+0.08%)  ⬆️
pandas/io/gbq.py                 75%        <0%>    (-12.5%)  ⬇️
pandas/core/frame.py             96.9%      <0%>    (-0.12%)  ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c79b7bb...1465e39.

@@ -1204,9 +1204,10 @@ def test_data_frame_size_after_to_json(self):

 @pytest.mark.parametrize('index', [None, [1, 2], [1., 2.], ['a', 'b'],
                                    ['1', '2'], ['1.', '2.']])
-@pytest.mark.parametrize('columns', [['a', 'b'], ['1', '2'], ['1.', '2.']])
+@pytest.mark.parametrize('columns', [None, [1, 2], [1., 2.], ['a', 'b'],
Member


So I'm not sure we want to do this. Is it valid JSON, under the table spec, to have column names that are non-string?

I understand you have gotten this to round trip, but if it violates the Table spec for JSON then I'd rather raise, as commented previously.

Contributor Author


@WillAyd as I already commented in #19129, after having tried to start a discussion before doing the PR:

I was suggesting not just "raising a more descriptive ValueError" (sic), but changing the implementation of the JSON serialization for orient='table'.

Could you please tell me where the JSON table spec claims that column names MUST be strings?

I am going to make a longer comment to further justify my PR.

Member


Could you please tell me where the JSON table spec claims that column names MUST be strings?

Not specific to the table spec as much as just JSON itself. See the description of an object here:

https://json.org

Contributor Author

@albertvillanova Mar 1, 2019


@WillAyd you are conflating two different things:

  • the JSON spec requires that, in a JSON object (composed of key-value pairs), the keys be strings
  • but we are talking about column names, not JSON object keys

And my question is: where does the JSON table spec claim that COLUMN NAMES (not JSON object keys) must be strings?

Contributor


@pwalsh can you comment on this? Is the name field in a Field Descriptor expected to be a string?

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Mar 1, 2019
@albertvillanova
Contributor Author

Hello, here is my explanation justifying this PR.

As I already commented in #19129, IIUC, the only aim of orient='table' is to get round-trip JSON serialization/deserialization of pandas objects. Thus, the 'schema' was added following the "Table Schema" spec at https://frictionlessdata.io/specs/table-schema/.

This spec is ONLY for the 'schema', not for the table itself. And this spec does not prohibit column names from being non-strings. Indeed, when serializing pandas DataFrames with non-string column names, their 'schema' is totally valid, and the issues arise just because of the JSON serialization of the table itself (the 'data').

In order to serialize the data itself, among all the possible orient values, orient='records' was chosen, and this is the cause of all the issues. Why? Because orient='records' serializes each DataFrame row as a JSON object (roughly equivalent to a Python dict), using the column names as the JSON object names (dict keys). The choice of using column names as dict keys (JSON object names) is not right for several reasons:

  • non-string column names:
    DataFrame column names are not restricted to be strings. Indeed, non-string column names are the default if no explicit column names are passed at DataFrame instantiation; however, the JSON spec imposes that JSON object names must be strings.
  • order of the columns:
    DataFrame columns have a specific order; however, when representing each record as a Python dict, we lose the order of the columns.
  • non-unique column names:
    a DataFrame can have non-unique column names; however, a Python dict cannot have non-unique keys.
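
The three dict limitations above can be demonstrated with the standard library alone (a minimal sketch, not pandas code):

```python
import json

# 1. Non-string keys: json.dumps silently coerces them to strings, so
#    numeric column names cannot survive a round trip as object keys.
row = {0: "x", 1: "y"}  # default integer column labels
print(json.dumps(row))  # {"0": "x", "1": "y"} -- keys became strings

# 2. Non-unique keys: a dict collapses duplicates, keeping only the
#    last value, so one duplicated column's data would be lost.
dup = dict([("a", 1), ("a", 2)])
print(dup)  # {'a': 2}

# 3. Order: JSON objects are unordered by spec, so consumers are free
#    to reorder keys, even though Python dicts preserve insertion order.
```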

In order to address all these issues, this PR proposes another representation for each DataFrame row: instead of a Python dict, we should use a Python list. The Python list:

  • can have non-unique items
  • preserves the order of its items
  • and its JSON equivalent is a JSON array, which does not impose any restriction on the types of its values

And in pandas, this can easily be achieved by choosing orient='values' (instead of orient='records').
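
The difference is easy to see on a frame with default integer column labels (a small sketch of current to_json behavior):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])  # columns are the integers 0 and 1

# orient='records': each row becomes a JSON object, so the integer
# column names are silently coerced to the strings "0" and "1".
print(df.to_json(orient="records"))  # [{"0":1,"1":2},{"0":3,"1":4}]

# orient='values': each row becomes a JSON array; no keys are involved,
# so nothing is coerced and column order is kept structurally.
print(df.to_json(orient="values"))  # [[1,2],[3,4]]
```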

Finally, I would like to point out that there exists a Tabular Data Resource spec (https://frictionlessdata.io/specs/tabular-data-resource/) and it clearly establishes (in its section JSON Tabular Data):

JSON Tabular Data MUST be an array where each item in the array MUST be:

  • EITHER: an array where each entry in the array is the value for that cell in the table
  • OR: an object where each key corresponds to the header for that row and the value corresponds to the cell value for that row for that header

This PR proposes to use the first option (instead of the second one) for all the previously mentioned reasons.

@WillAyd
Member

WillAyd commented Mar 1, 2019

This PR proposes to use the first option (instead of the second one) for all the previously mentioned reasons.

OK, I see your point now, so thanks for clarifying. I'll review in more detail later and give you feedback; just wanted to reach out in the interim.

Contributor

@jreback jreback left a comment


not averse to doing this, but this completely breaks backward compatibility. So update the version on the table. Also, I think we need to save some samples using the original format and add some tests to be able to read them (dispatching on the version).
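
A rough illustration of the version-dispatch idea (the function names and the cutoff version here are hypothetical, not pandas API):

```python
import json

def _parse_version(v):
    # Naive parser for plain dotted versions: "0.20.0" -> (0, 20, 0).
    return tuple(int(part) for part in v.split("."))

def rows_from_table_payload(payload, cutoff=(0, 25, 0)):
    # Hypothetical reader that accepts both the old orient='table' layout
    # (data as a list of records) and the proposed one (data as value
    # arrays), dispatching on the schema's pandas_version field.
    obj = json.loads(payload)
    version = _parse_version(obj["schema"].get("pandas_version", "0.20.0"))
    names = [f["name"] for f in obj["schema"]["fields"]]
    if version < cutoff:
        # Old format: each row is an object keyed by column name.
        return [[rec[n] for n in names] for rec in obj["data"]]
    # New format: each row is already an array of cell values.
    return obj["data"]
```

Saved samples in the old format would then exercise the first branch, while newly written payloads take the second.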

Contributor

@TomAugspurger TomAugspurger left a comment


We can't just change the output like this without a deprecation.

In retrospect, it seems like we overloaded the orient keyword to handle two things

  1. whether or not we include a JSON table schema
  2. the orient of the data within the output

What do people think about leaving orient just for the data key, and adding a new keyword schema=True/False to control whether we include the schema key at the outer level? I haven't really thought through the deprecation cycle.
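
A sketch of what that decoupling could look like today, composed from existing public pieces (pandas.io.json.build_table_schema exists; the schema= keyword itself does not):

```python
import json
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Emulate a hypothetical to_json(orient='values', schema=True): attach a
# Table Schema at the outer level while choosing the data orient freely.
payload = {
    "schema": build_table_schema(df),
    "data": json.loads(df.to_json(orient="values")),
}
print(payload["data"])  # [[1, 3], [2, 4]]
```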

@WillAyd
Member

WillAyd commented Mar 5, 2019

What do people think about leaving orient just for the data key, and adding a new keyword schema=True/False to control whether we include the schema key at the outer level? I haven't really thought through the deprecation cycle.

What do you see as the advantage of doing this? I would actually be against it: we already have quite a few JSON formats, so adding another that doesn't follow the table spec doesn't add a lot of value, I think.

@TomAugspurger
Contributor

TomAugspurger commented Mar 5, 2019 via email

@jreback
Contributor

jreback commented Mar 20, 2019

can you merge master

1 similar comment
@jreback
Contributor

jreback commented Apr 5, 2019

can you merge master

@WillAyd
Member

WillAyd commented May 3, 2019

Not a bad idea, but this change is definitely breaking and comes with quite a few implications. Given it's gone somewhat stale over the past month and a half, I'm going to close this out, but feel free to ping back @albertvillanova if it's something you want to drive forward.

@WillAyd WillAyd closed this May 3, 2019
Labels
IO JSON read_json, to_json, json_normalize
4 participants