Fix JSON orient='table' issues with numeric column names #25488


Closed
albertvillanova wants to merge 12 commits

Conversation

albertvillanova
Contributor

@albertvillanova commented Feb 28, 2019

closes #19129
closes #22525

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@codecov

codecov bot commented Mar 1, 2019

Codecov Report

Merging #25488 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #25488      +/-   ##
==========================================
+ Coverage   91.75%   91.75%   +<.01%     
==========================================
  Files         173      173              
  Lines       52955    52960       +5     
==========================================
+ Hits        48588    48594       +6     
+ Misses       4367     4366       -1
Flag        Coverage   Δ
#multiple   90.32%     <100%>  (ø)       ⬆️
#single     41.71%     <0%>    (-0.01%)  ⬇️

Impacted Files                   Coverage   Δ
pandas/io/json/json.py           93.28%     <100%>  (+0.06%)  ⬆️
pandas/io/json/table_schema.py   98.31%     <100%>  (+0.01%)  ⬆️
pandas/util/testing.py           87.66%     <0%>    (+0.09%)  ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update db978c7...e7abdb1.

@codecov

codecov bot commented Mar 1, 2019

Codecov Report

Merging #25488 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #25488      +/-   ##
==========================================
- Coverage      92%   91.99%   -0.01%     
==========================================
  Files         175      175              
  Lines       52371    52381      +10     
==========================================
+ Hits        48184    48190       +6     
- Misses       4187     4191       +4
Flag        Coverage   Δ
#multiple   90.55%     <100%>   (ø)       ⬆️
#single     40.73%     <5.88%>  (-0.15%)  ⬇️

Impacted Files                   Coverage   Δ
pandas/io/json/json.py           93.3%      <100%>  (+0.06%)  ⬆️
pandas/io/json/table_schema.py   98.38%     <100%>  (+0.08%)  ⬆️
pandas/io/gbq.py                 75%        <0%>    (-12.5%)  ⬇️
pandas/core/frame.py             96.9%      <0%>    (-0.12%)  ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c79b7bb...1465e39.

@@ -1204,9 +1204,10 @@ def test_data_frame_size_after_to_json(self):

 @pytest.mark.parametrize('index', [None, [1, 2], [1., 2.], ['a', 'b'],
                                    ['1', '2'], ['1.', '2.']])
-@pytest.mark.parametrize('columns', [['a', 'b'], ['1', '2'], ['1.', '2.']])
+@pytest.mark.parametrize('columns', [None, [1, 2], [1., 2.], ['a', 'b'],
Member


So I'm not sure we want to do this. Is it valid JSON, under the table spec, to have column names that are non-string?

I understand you have gotten this to round trip, but if it violates the Table spec for JSON then I'd rather raise, as commented previously.

Contributor Author


@WillAyd as I already commented in #19129, after having tried to start a discussion before doing the PR:

I was suggesting not just "raising a more descriptive ValueError" (sic), but changing the implementation of the JSON serialization for orient='table'.

Could you please tell me where the JSON table spec claims that column names MUST be strings?

I am going to make a longer comment to further justify my PR.

Member


Could you please tell me where the JSON table spec claims that column names MUST be strings?

Not specific to the table spec as much as just JSON itself. See the description of an object here:

https://json.org

Contributor Author

@albertvillanova Mar 1, 2019


@WillAyd you are conflating two different things:

  • the JSON spec requires that, in a JSON object (composed of key-value pairs), the keys be strings
  • but we are talking about column names, not JSON object keys

And my question is: where does the JSON table spec claim that COLUMN NAMES (not JSON object keys) must be strings?

Contributor


@pwalsh can you comment on this? Is the name field in a Field Descriptor expected to be a string?

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Mar 1, 2019
@albertvillanova
Contributor Author

Hello, here is my explanation justifying this PR.

As I already commented in #19129, IIUC, the only aim of orient='table' is to get round-trip JSON serialization/deserialization of pandas objects. Thus, the 'schema' was added following the "Table Schema" spec at https://frictionlessdata.io/specs/table-schema/.

This spec is ONLY for the 'schema', not for the table itself. And this spec does not prohibit column names from being non-strings. Indeed, when serializing pandas DataFrames with non-string column names, their 'schema' is totally valid, and the issues arise just because of the JSON serialization of the table itself (the 'data').

In order to serialize the data itself, among all the possible orient values, orient='records' was chosen, and this is the cause of all the issues. Why? Because orient='records' serializes each DataFrame row as a JSON object (roughly equivalent to a Python dict), using the column names as the JSON object names (dict keys). The choice of using column names as dict keys (JSON object names) is not right for several reasons:

  • non-string column names:
    DataFrame column names are not restricted to be strings. Indeed, non-string column names are the default if no explicit column names are passed at DataFrame instantiation; however, the JSON spec imposes that JSON object names must be strings.
  • order of the columns:
    DataFrame columns have a specific order; however, when representing each record as a Python dict, we lose the order of the columns.
  • non-unique column names:
    a DataFrame can have non-unique column names; however, a Python dict cannot have non-unique keys.
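
The three dict limitations above can be demonstrated with the standard library alone (a minimal sketch, not pandas code):

```python
import json

# 1. Non-string keys: json.dumps silently coerces them to strings, so
#    numeric column names cannot survive a round trip as object keys.
row = {0: "x", 1: "y"}  # default integer column labels
print(json.dumps(row))  # {"0": "x", "1": "y"} -- keys became strings

# 2. Non-unique keys: a dict collapses duplicates, keeping only the
#    last value, so one duplicated column's data would be lost.
dup = dict([("a", 1), ("a", 2)])
print(dup)  # {'a': 2}

# 3. Order: JSON objects are unordered by spec, so consumers are free
#    to reorder keys, even though Python dicts preserve insertion order.
```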

In order to address all these issues, this PR proposes another representation for each DataFrame row: instead of a Python dict, we should use a Python list. The Python list:

  • can have non-unique items
  • preserves the order of its items
  • and its JSON equivalent is a JSON array, which does not impose any restriction on the types of its values

And in pandas, this can easily be achieved by choosing orient='values' (instead of orient='records').
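
The difference is easy to see on a frame with default integer column labels (a small sketch of current to_json behavior):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])  # columns are the integers 0 and 1

# orient='records': each row becomes a JSON object, so the integer
# column names are silently coerced to the strings "0" and "1".
print(df.to_json(orient="records"))  # [{"0":1,"1":2},{"0":3,"1":4}]

# orient='values': each row becomes a JSON array; no keys are involved,
# so nothing is coerced and column order is kept structurally.
print(df.to_json(orient="values"))  # [[1,2],[3,4]]
```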

Finally, I would like to point out that there exists a Tabular Data Resource spec (https://frictionlessdata.io/specs/tabular-data-resource/) and it clearly establishes (in its section JSON Tabular Data):

JSON Tabular Data MUST be an array where each item in the array MUST be:

  • EITHER: an array where each entry in the array is the value for that cell in the table
  • OR: an object where each key corresponds to the header for that row and the value corresponds to the cell value for that row for that header

This PR proposes to use the first option (instead of the second one) for all the previously mentioned reasons.

@WillAyd
Member

WillAyd commented Mar 1, 2019

This PR proposes to use the first option (instead of the second one) for all the previously mentioned reasons.

OK, I see your point now, so thanks for clarifying. I'll review in more detail later and give you feedback; just wanted to reach out in the interim.

Contributor

@jreback jreback left a comment


not averse to doing this, but this completely breaks backward compatibility. So update the version on the table. Also, I think we need to save some samples using the original format and add some tests to be able to read them (dispatching on the version).
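
A rough illustration of the version-dispatch idea (the function names and the cutoff version here are hypothetical, not pandas API):

```python
import json

def _parse_version(v):
    # Naive parser for plain dotted versions: "0.20.0" -> (0, 20, 0).
    return tuple(int(part) for part in v.split("."))

def rows_from_table_payload(payload, cutoff=(0, 25, 0)):
    # Hypothetical reader that accepts both the old orient='table' layout
    # (data as a list of records) and the proposed one (data as value
    # arrays), dispatching on the schema's pandas_version field.
    obj = json.loads(payload)
    version = _parse_version(obj["schema"].get("pandas_version", "0.20.0"))
    names = [f["name"] for f in obj["schema"]["fields"]]
    if version < cutoff:
        # Old format: each row is an object keyed by column name.
        return [[rec[n] for n in names] for rec in obj["data"]]
    # New format: each row is already an array of cell values.
    return obj["data"]
```

Saved samples in the old format would then exercise the first branch, while newly written payloads take the second.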

Contributor

@TomAugspurger TomAugspurger left a comment


We can't just change the output like this without a deprecation.

In retrospect, it seems like we overloaded the orient keyword to handle two things

  1. whether or not we include a JSON table schema
  2. the orient of the data within the output

What do people think about leaving orient just for the data key, and adding a new keyword schema=True/False to control whether we include the schema key at the outer level? I haven't really thought through the deprecation cycle.
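
A sketch of what that decoupling could look like today, composed from existing public pieces (pandas.io.json.build_table_schema exists; the schema= keyword itself does not):

```python
import json
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Emulate a hypothetical to_json(orient='values', schema=True): attach a
# Table Schema at the outer level while choosing the data orient freely.
payload = {
    "schema": build_table_schema(df),
    "data": json.loads(df.to_json(orient="values")),
}
print(payload["data"])  # [[1, 3], [2, 4]]
```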

@WillAyd
Member

WillAyd commented Mar 5, 2019

What do people think about leaving orient just for the data key, and adding a new keyword schema=True/False to control whether we include the schema key at the outer level? I haven't really thought through the deprecation cycle.

What do you see as the advantage of doing this? I would actually be against it: we already have quite a few JSON formats, so adding another that doesn't follow the table spec doesn't add a lot of value, I think.

@TomAugspurger
Contributor

TomAugspurger commented Mar 5, 2019 via email

@jreback
Contributor

jreback commented Mar 20, 2019

can you merge master

1 similar comment
@jreback
Contributor

jreback commented Apr 5, 2019

can you merge master

@WillAyd
Member

WillAyd commented May 3, 2019

Not a bad idea, but this change is definitely breaking and comes with quite a few implications. Given it's gone somewhat stale over the past month and a half, I'm going to close this out, but feel free to ping back @albertvillanova if it's something you want to drive forward.

@WillAyd WillAyd closed this May 3, 2019
Labels
IO JSON read_json, to_json, json_normalize
4 participants