当前位置:   article > 正文

DataFrame实战-----数据规整化之USDA食品数据库-----python数据分析_usda食品营养数据库

usda食品营养数据库

目录

文件下载地址

抽取数据

分析过程


文件下载地址

美国农业部USDA制作一份有关食物营养的数据库。由Ashley Williams制作出JSON版。

https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/usda_food/database.json

*文件比较大,建议先下载好在导进去而不是复制进编译器

  1. >>> import json
  2. >>> db = json.load(open('D:\python\DataAnalysis\data\database.json'))
  3. >>> len(db)
  4. 6636
  5. >>> db[0].keys()
  6. [u'portions', u'description', u'tags', u'nutrients', u'group', u'id', u'manufacturer']
  7. >>> db[0]['nutrients'][0]
  8. {u'units': u'g', u'group': u'Composition', u'description': u'Protein', u'value': 25.18}

抽取数据

在转为DataFrame时,可以只抽取一部分字段,这里取出食物的名称,分类,编号及制造商信息

  1. >>> from pandas import DataFrame,Series
  2. Backend TkAgg is interactive backend. Turning interactive mode on.
  3. >>> info_keys = ['description','group','id','manufacturer']
  4. >>> info = DataFrame(db,columns=info_keys)
  5. >>> info[:5]
  6. description ... manufacturer
  7. 0 Cheese, caraway ...
  8. 1 Cheese, cheddar ...
  9. 2 Cheese, edam ...
  10. 3 Cheese, feta ...
  11. 4 Cheese, mozzarella, part skim milk ...
  12. [5 rows x 4 columns]

通过value_counts查看食物类别的分布情况:

  1. >>> import pandas as pd
  2. >>> pd.value_counts(info.group)
  3. Vegetables and Vegetable Products 812
  4. Beef Products 618
  5. Baked Products 496
  6. Breakfast Cereals 403
  7. Legumes and Legume Products 365
  8. Fast Foods 365
  9. Lamb, Veal, and Game Products 345
  10. Sweets 341
  11. Fruits and Fruit Juices 328
  12. Pork Products 328
  13. Beverages 278
  14. Soups, Sauces, and Gravies 275
  15. Finfish and Shellfish Products 255
  16. Baby Foods 209
  17. Cereal Grains and Pasta 183
  18. Ethnic Foods 165
  19. Snacks 162
  20. Nut and Seed Products 128
  21. Poultry Products 116
  22. Sausages and Luncheon Meats 111
  23. Dairy and Egg Products 107
  24. Fats and Oils 97
  25. Meals, Entrees, and Sidedishes 57
  26. Restaurant Foods 51
  27. Spices and Herbs 41
  28. Name: group, dtype: int64

分析过程

现在,为了对全部营养数据做一些分析,最简单的办法是将所有食物的营养成分整合到一个大表中,我们分步骤实现该目的。

首先将各食物的营养成分列表转换为一个DataFrame,并添加一个表示编号的列,然后将该DataFrame添加到一个列表中,最后通过concaat将这些东西连接起来。

  1. >>> nutrients = []
  2. >>> for rec in db:
  3. ... fnuts = DataFrame(rec['nutrients'])
  4. ... fnuts['id'] = rec['id']
  5. ... nutrients.append(fnuts)
  6. ... nutrients = pd.concat(nutrients,ignore_index=True)

连接后的nutrients[ ]

  1. >>> nutrients
  2. description group ... value id
  3. 0 Protein Composition ... 25.180 1008
  4. 1 Total lipid (fat) Composition ... 29.200 1008
  5. 2 Carbohydrate, by difference Composition ... 3.060 1008
  6. 3 Ash Other ... 3.280 1008
  7. 4 Energy Energy ... 376.000 1008
  8. 5 Water Composition ... 39.280 1008
  9. 6 Energy Energy ... 1573.000 1008
  10. 7 Fiber, total dietary Composition ... 0.000 1008
  11. 8 Calcium, Ca Elements ... 673.000 1008
  12. 9 Iron, Fe Elements ... 0.640 1008
  13. 10 Magnesium, Mg Elements ... 22.000 1008
  14. 11 Phosphorus, P Elements ... 490.000 1008
  15. 12 Potassium, K Elements ... 93.000 1008
  16. 13 Sodium, Na Elements ... 690.000 1008
  17. 14 Zinc, Zn Elements ... 2.940 1008
  18. 15 Copper, Cu Elements ... 0.024 1008
  19. 16 Manganese, Mn Elements ... 0.021 1008
  20. 17 Selenium, Se Elements ... 14.500 1008
  21. 18 Vitamin A, IU Vitamins ... 1054.000 1008
  22. 19 Retinol Vitamins ... 262.000 1008
  23. 20 Vitamin A, RAE Vitamins ... 271.000 1008
  24. 21 Vitamin C, total ascorbic acid Vitamins ... 0.000 1008
  25. 22 Thiamin Vitamins ... 0.031 1008
  26. 23 Riboflavin Vitamins ... 0.450 1008
  27. 24 Niacin Vitamins ... 0.180 1008
  28. 25 Pantothenic acid Vitamins ... 0.190 1008
  29. 26 Vitamin B-6 Vitamins ... 0.074 1008
  30. 27 Folate, total Vitamins ... 18.000 1008
  31. 28 Vitamin B-12 Vitamins ... 0.270 1008
  32. 29 Folic acid Vitamins ... 0.000 1008
  33. ... ... ... ... ...
  34. 1168085 Selenium, Se Elements ... 1.100 43546
  35. 1168086 Vitamin A, IU Vitamins ... 5.000 43546
  36. 1168087 Retinol Vitamins ... 0.000 43546
  37. 1168088 Vitamin A, RAE Vitamins ... 0.000 43546
  38. 1168089 Carotene, beta Vitamins ... 2.000 43546
  39. 1168090 Carotene, alpha Vitamins ... 2.000 43546
  40. 1168091 Vitamin E (alpha-tocopherol) Vitamins ... 0.250 43546
  41. 1168092 Vitamin D Vitamins ... 0.000 43546
  42. 1168093 Vitamin D (D2 + D3) Vitamins ... 0.000 43546
  43. 1168094 Cryptoxanthin, beta Vitamins ... 0.000 43546
  44. 1168095 Lycopene Vitamins ... 0.000 43546
  45. 1168096 Lutein + zeaxanthin Vitamins ... 20.000 43546
  46. 1168097 Vitamin C, total ascorbic acid Vitamins ... 21.900 43546
  47. 1168098 Thiamin Vitamins ... 0.020 43546
  48. 1168099 Riboflavin Vitamins ... 0.060 43546
  49. 1168100 Niacin Vitamins ... 0.540 43546
  50. 1168101 Vitamin B-6 Vitamins ... 0.260 43546
  51. 1168102 Folate, total Vitamins ... 17.000 43546
  52. 1168103 Vitamin B-12 Vitamins ... 0.000 43546
  53. 1168104 Choline, total Vitamins ... 4.100 43546
  54. 1168105 Vitamin K (phylloquinone) Vitamins ... 0.500 43546
  55. 1168106 Folic acid Vitamins ... 0.000 43546
  56. 1168107 Folate, food Vitamins ... 17.000 43546
  57. 1168108 Folate, DFE Vitamins ... 17.000 43546
  58. 1168109 Vitamin E, added Vitamins ... 0.000 43546
  59. 1168110 Vitamin B-12, added Vitamins ... 0.000 43546
  60. 1168111 Cholesterol Other ... 0.000 43546
  61. 1168112 Fatty acids, total saturated Other ... 0.072 43546
  62. 1168113 Fatty acids, total monounsaturated Other ... 0.028 43546
  63. 1168114 Fatty acids, total polyunsaturated Other ... 0.041 43546
  64. [1168115 rows x 5 columns]

丢弃重复项

  1. >>> nutrients.duplicated().sum()
  2. 792939

重命名列对象

  1. >>> col_mapping = {'description':'food','group':'fgroup'}
  2. >>> info = info.rename(columns = col_mapping,copy=False)
  1. >>> col_mapping = {'description':'nutrient','group':'nutgroup'}
  2. >>> nutrients = nutrients.rename(columns = col_mapping,copy = False)

结合info与nutrients

  1. >>> ndata = pd.merge(nutrients,info,on='id',how='outer')
  2. >>> ndata.ix[30000]
  3. nutrient Folate, food
  4. nutgroup Vitamins
  5. units mcg
  6. value 11
  7. id 1180
  8. food Sour cream, fat free
  9. fgroup Dairy and Egg Products
  10. manufacturer None
  11. Name: 30000, dtype: object

接下来利用前面的知识练练手,比如根据食物分类和营养类型画出一张中位值的图。

  1. >>> result = ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
  2. >>> result['Zinc, Zn'].sort_values().plot(kind='barh')

 氨基酸最丰富的食物:

  1. >>> by_nutrient = ndata.groupby(['nutgroup','nutrient'])
  2. >>> get_maximum = lambda x: x.xs(x.value.idxmax())
  3. >>> get_minimum = lambda x: x.xs(x.value.idxmin())
  4. >>> max_foods = by_nutrient.apply(get_maximum)[['value','food']]
  5. >>> max_foods.food = max_foods.food.str[:50]
  6. >>> max_foods.ix['Amino Acids']['food']
  7. nutrient
  8. Alanine Gelatins, dry powder, unsweetened
  9. Arginine Seeds, sesame flour, low-fat
  10. Aspartic acid Soy protein isolate
  11. Cystine Seeds, cottonseed flour, low fat (glandless)
  12. Glutamic acid Soy protein isolate
  13. Glycine Gelatins, dry powder, unsweetened
  14. Histidine Whale, beluga, meat, dried (Alaska Native)
  15. Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
  16. Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  17. Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  18. Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...
  19. Methionine Fish, cod, Atlantic, dried and salted
  20. Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  21. Proline Gelatins, dry powder, unsweetened
  22. Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  23. Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  24. Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
  25. Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  26. Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  27. Name: food, dtype: object

 

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/人工智能uu/article/detail/884272
推荐阅读
相关标签
  

闽ICP备14008679号