In the previous post I showed a convenient way to navigate the ICD9 hierarchy with Python. Now let’s use that to extract the full taxonomy of ICD9 codes for the patients we’ll be using to train a classifier. In this post we’ll extract the original ICD9 codes for all patients of interest from the MIMIC database, look up where each code sits in the ICD9 hierarchy, and save the results for later analysis.

The full notebook is available here, but the bulk of the work happens in the accessory file, structured_data_utils, which we import and access as sdu. In the section below I walk through using the routines in this library.

Reload any changes made to the structured data utils code

In [46]:
<module 'structured_data_utils' from '/mnt/cbds_homes/ecarlson/Notebooks/mit_frequent_fliers/mit-team-code/software/notebooks/'>
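
The source of the reload cell is not shown above. One common way to pick up edits to an already imported module is importlib.reload, which returns the module object (hence the `<module ...>` output). The sketch below assumes that approach and uses the standard library's json module as a stand-in, since structured_data_utils is not importable outside the original environment.

```python
import importlib
import json  # stand-in for structured_data_utils (assumption for this sketch)

# reload() re-executes the module's source and returns the module object,
# so edits saved to disk take effect without restarting the kernel.
reloaded = importlib.reload(json)
print(reloaded)
```

In the notebook the equivalent call would presumably be `importlib.reload(sdu)`.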

Use pandas to extract the list of unique notes and patients - the primary thing we’re looking for is the MIMIC III row id, which is used to get the MIMIC encounter ID, and from there the ICD9 diagnoses.

In [47]:
found_notes = comb_dat.loc[comb_dat['row_id_m3'].notnull()].\
    groupby(['subject_id', 'md5', 'row_id_m3']).count()['total_m3_distance'].index.tolist()
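
To make explicit what the groupby line produces, here is a plain-Python sketch of the same idea on a few toy rows: drop rows with no MIMIC III match, then keep each (subject_id, md5, row_id_m3) triplet once. The rows and the shortened md5 values are made up for illustration.

```python
# Toy rows standing in for comb_dat; None marks a note with no MIMIC III match.
rows = [
    {'subject_id': 11590, 'md5': 'aa', 'row_id_m3': 1414073.0},
    {'subject_id': 11590, 'md5': 'aa', 'row_id_m3': 1414073.0},  # duplicate
    {'subject_id': 305, 'md5': 'bb', 'row_id_m3': 1264491.0},
    {'subject_id': 8217, 'md5': 'cc', 'row_id_m3': None},
]

# Keep matched rows only, then collapse to unique triplets.
# pandas groupby sorts its keys by default, hence sorted() here.
found_notes = sorted({
    (r['subject_id'], r['md5'], r['row_id_m3'])
    for r in rows
    if r['row_id_m3'] is not None
})
print(found_notes)
```

Each element is a tuple, which is why the later loop can unpack it and use `idx[2]` as the note row id.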

Iterate through the rows, building up a dictionary of dictionaries. note_info is a dictionary where the keys are the unique subject_id-md5-row_id triplet from the pandas line above. The values are another dictionary with 2 keys:

  • meta - note metadata, including the patient id (subject_id), encounter id (hadm_id), and associated timestamps
  • diagnoses - a list of the diagnoses associated with this encounter, including the original poorly formatted ICD9 code from MIMIC, the reformatted version (clean_icd9_code), and the label of the code
In [48]:
note_info = {}
for idx in found_notes:
    note_meta = sdu.get_note_metadata(conn, idx[2])
    note_diag = sdu.get_hadm_diagnoses(conn, note_meta['hadm_id'])
    dat = {'meta': note_meta, 'diagnoses': note_diag}
    note_info[idx] = dat
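
The loop above needs a live MIMIC connection. To show the shape of the result without one, the sketch below swaps in an in-memory SQLite database with two hypothetical toy tables. The helper names (`toy_note_metadata`, `toy_hadm_diagnoses`) are stand-ins for the sdu routines, not the real ones, and they use bound parameters rather than string formatting.

```python
import sqlite3

# In-memory SQLite stand-in for MIMIC III (hypothetical toy tables and rows).
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row  # lets rows convert cleanly to dicts
conn.executescript("""
create table noteevents (row_id integer, subject_id integer, hadm_id integer);
create table diagnoses_icd (hadm_id integer, seq_num integer, icd9_code text);
insert into noteevents values (1414073, 11590, 172993);
insert into diagnoses_icd values (172993, 2, '39891'), (172993, 1, '41071');
""")

def toy_note_metadata(conn, row_id):
    """Hypothetical stand-in for sdu.get_note_metadata."""
    row = conn.execute(
        'select subject_id, hadm_id from noteevents where row_id = ?',
        (int(row_id),)).fetchone()
    return dict(row) if row is not None else None

def toy_hadm_diagnoses(conn, hadm_id):
    """Hypothetical stand-in for sdu.get_hadm_diagnoses."""
    rows = conn.execute(
        'select seq_num, icd9_code from diagnoses_icd '
        'where hadm_id = ? order by seq_num', (int(hadm_id),)).fetchall()
    return [dict(r) for r in rows]

# Same loop shape as above: one entry per (subject_id, md5, row_id) triplet.
found_notes = [(11590, 'be74552c73a0f9895c4f372763054d26', 1414073.0)]
note_info = {}
for idx in found_notes:
    meta = toy_note_metadata(conn, idx[2])
    diag = toy_hadm_diagnoses(conn, meta['hadm_id'])
    note_info[idx] = {'meta': meta, 'diagnoses': diag}

print(note_info[found_notes[0]])
```

Keying the outer dictionary by the full triplet keeps the patient, note hash, and row id together, so they can be unpacked again when the records are flattened later.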

Print one element out to see how it looks

In [49]:
note_info[[k for k in note_info.keys()][0]]
{'diagnoses': [{'clean_icd9_code': '410.71',
   'hadm_id': 172993,
   'icd9_code': '41071',
   'known_icd9_code': False,
   'long_title': 'Subendocardial infarction, initial episode of care',
   'seq_num': 1,
   'short_title': 'Subendo infarct, initial',
   'subject_id': 11590},
  {'clean_icd9_code': '398.91',
   'hadm_id': 172993,
   'icd9_code': '39891',
   'known_icd9_code': True,
   'long_title': 'Rheumatic heart failure (congestive)',
   'seq_num': 2,
   'short_title': 'Rheumatic heart failure',
   'subject_id': 11590},
  {'clean_icd9_code': '396.3',
   'hadm_id': 172993,
   'icd9_code': '3963',
   'known_icd9_code': True,
   'long_title': 'Mitral valve insufficiency and aortic valve insufficiency',
   'seq_num': 3,
   'short_title': 'Mitral/aortic val insuff',
   'subject_id': 11590},
  {'clean_icd9_code': '397.0',
   'hadm_id': 172993,
   'icd9_code': '3970',
   'known_icd9_code': True,
   'long_title': 'Diseases of tricuspid valve',
   'seq_num': 4,
   'short_title': 'Tricuspid valve disease',
   'subject_id': 11590},
  {'clean_icd9_code': '042',
   'hadm_id': 172993,
   'icd9_code': '042',
   'known_icd9_code': True,
   'long_title': 'Human immunodeficiency virus [HIV] disease',
   'seq_num': 5,
   'short_title': 'Human immuno virus dis',
   'subject_id': 11590},
  {'clean_icd9_code': '403.91',
   'hadm_id': 172993,
   'icd9_code': '40391',
   'known_icd9_code': False,
   'long_title': 'Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease',
   'seq_num': 6,
   'short_title': 'Hyp kid NOS w cr kid V',
   'subject_id': 11590},
  {'clean_icd9_code': '518.81',
   'hadm_id': 172993,
   'icd9_code': '51881',
   'known_icd9_code': True,
   'long_title': 'Acute respiratory failure',
   'seq_num': 7,
   'short_title': 'Acute respiratry failure',
   'subject_id': 11590},
  {'clean_icd9_code': '414.01',
   'hadm_id': 172993,
   'icd9_code': '41401',
   'known_icd9_code': True,
   'long_title': 'Coronary atherosclerosis of native coronary artery',
   'seq_num': 8,
   'short_title': 'Crnry athrscl natve vssl',
   'subject_id': 11590},
  {'clean_icd9_code': '272.0',
   'hadm_id': 172993,
   'icd9_code': '2720',
   'known_icd9_code': True,
   'long_title': 'Pure hypercholesterolemia',
   'seq_num': 9,
   'short_title': 'Pure hypercholesterolem',
   'subject_id': 11590}],
 'meta': {'cgid': 17770,
  'chartdate': datetime.datetime(2154, 6, 3, 0, 0),
  'charttime': datetime.datetime(2154, 6, 3, 17, 30),
  'hadm_id': 172993,
  'storetime': datetime.datetime(2154, 6, 3, 17, 51),
  'subject_id': 11590}}
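
In the output above, clean_icd9_code differs from icd9_code only by a decimal point, and short codes such as 042 are left alone. As a rough illustration of that rule, here is a compact restatement under a hypothetical name (`insert_icd9_decimal`). It assumes E codes have a four-character category and every other code a three-character one, and it is a sketch rather than the library's own function.

```python
def insert_icd9_decimal(code):
    """Hypothetical restatement of the decimal rule, not the library code."""
    if code is None or '.' in code:
        return code
    # E codes have a four-character category (E + 3 digits); others have three.
    category_len = 4 if code.startswith('E') else 3
    if len(code) <= category_len:
        return code  # category-only code, nothing to split off
    return code[:category_len] + '.' + code[category_len:]

for raw in ['41071', '3963', '042', 'V1582', 'E8120', 'E887']:
    print(raw, '->', insert_icd9_decimal(raw))
```

Under these assumptions a four-character E code such as E887 is returned unchanged, since it has no digits after the category.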

Now we can use this list and the ICD9 python library to extract all of the parents for each code. Not all codes in MIMIC III are known to the library (likely due to slightly different ICD versions), so we handle that possibility by recording only the source code and skipping the parent lookup for unknown codes. If it’s a known code, we look up the parents and add the code and each of its parents to the note_codes list. We’ll also keep a list of the metadata.

In [ ]:
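note_codes = []
note_meta = []
unknown_codes = set()
for k, note_dat in note_info.items():
    subject_id, md5, row_id = k

    # Note-level metadata, tagged with the identifying triplet
    meta = note_dat['meta'].copy()
    meta['subject_id'] = subject_id
    meta['md5'] = md5
    meta['note_row_id'] = row_id
    note_meta.append(meta)

    diagnoses = note_dat['diagnoses']
    if diagnoses is not None:
        for diag in diagnoses:
            # Always record the original MIMIC code
            new_code = {
                'subject_id': subject_id,
                'md5': md5,
                'note_row_id': row_id,
                'level': 'source',
                'code': diag['icd9_code']
            }
            note_codes.append(new_code)

            if diag['known_icd9_code']:
                # One record per ancestor, numbered from the top of the hierarchy
                levels = sdu.get_icd9_levels(diag['clean_icd9_code'])
                for ind, lev_code in enumerate(levels):
                    new_code = {
                        'subject_id': subject_id,
                        'md5': md5,
                        'note_row_id': row_id,
                        'level': ind,
                        'code': lev_code
                    }
                    note_codes.append(new_code)
            else:
                # Unknown to the ICD9 library: report each such code only once.
                # Assumes the notebook's logger is bound to the name `logger`.
                if diag['icd9_code'] not in unknown_codes:
                    unknown_codes.add(diag['icd9_code'])
                    logger.info(
                        'Unknown code ({}) for subject ({})'.format(diag['icd9_code'], subject_id))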
In [51]:

Inspecting the records, we see that for a particular note (row id 1414073) the code found a known ICD9 code (39891), then recorded its top-level parent (390-459) and the path down from it through children 393-398, 398, … We keep track of each code’s hierarchy level, counted from the top. In a future post we’ll use this info to select a cutoff depth for classification based on ICD9.
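
As a rough sketch of how the level column could later drive a cutoff depth, the snippet below filters the five records shown in the table that follows down to the codes at one chosen level. The function name `codes_at_depth` is hypothetical, and the md5 and subject_id fields are left out for brevity.

```python
# The five records from the table below (md5 and subject_id omitted).
records = [
    {'code': '41071', 'level': 'source', 'note_row_id': 1414073.0},
    {'code': '39891', 'level': 'source', 'note_row_id': 1414073.0},
    {'code': '390-459', 'level': 0, 'note_row_id': 1414073.0},
    {'code': '393-398', 'level': 1, 'note_row_id': 1414073.0},
    {'code': '398', 'level': 2, 'note_row_id': 1414073.0},
]

def codes_at_depth(records, depth):
    """Map note_row_id -> sorted codes recorded at exactly `depth`."""
    by_note = {}
    for rec in records:
        if rec['level'] == depth:  # 'source' rows never equal an int depth
            by_note.setdefault(rec['note_row_id'], set()).add(rec['code'])
    return {note: sorted(codes) for note, codes in by_note.items()}

print(codes_at_depth(records, 1))
```

Using a set per note means a parent shared by several diagnoses on the same note is counted once.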

In [52]:
note_codes_df = pd.DataFrame.from_records(note_codes)
      code   level                               md5  note_row_id  subject_id
0    41071  source  be74552c73a0f9895c4f372763054d26    1414073.0       11590
1    39891  source  be74552c73a0f9895c4f372763054d26    1414073.0       11590
2  390-459       0  be74552c73a0f9895c4f372763054d26    1414073.0       11590
3  393-398       1  be74552c73a0f9895c4f372763054d26    1414073.0       11590
4      398       2  be74552c73a0f9895c4f372763054d26    1414073.0       11590
In [57]:
output_path = pl.Path(path_config['repo_data_dir']).joinpath('notes_icd9_codes_{}.csv'.format(time_str))
note_codes_df.to_csv(output_path.as_posix(), index=False)
2016-10-24 16:41:53,258 - root - INFO - ../../data/notes_icd9_codes_2016-10-24-16-35.csv
In [54]:
note_meta_df = pd.DataFrame.from_records(note_meta)
      cgid   chartdate            charttime   hadm_id                               md5  note_row_id            storetime  subject_id
0  17770.0  2154-06-03  2154-06-03 17:30:00  172993.0  be74552c73a0f9895c4f372763054d26    1414073.0  2154-06-03 17:51:00       11590
1  17698.0  2183-07-28  2183-07-28 05:41:00  116105.0  2bd0c96855c6107be79d0150e1f121e7    1449706.0  2183-07-28 05:53:00       14342
2      NaN  2170-02-13                  NaT  122710.0  bd4bf8040238e3e2cdd7466692defe73      47105.0                  NaT        8217
3  18469.0  2175-06-07  2175-06-07 05:39:00  196691.0  6d20d9b6d3cfdc3fc9e8a72fbab0f697    1573953.0  2175-06-07 06:27:00       23829
4  17079.0  2125-04-27  2125-04-27 20:51:00  133059.0  d35003faa86241e60396014264b14a4d    1264491.0  2125-04-27 21:03:00         305
In [58]:
output_path = pl.Path(path_config['repo_data_dir']).joinpath('mimic3_note_metadata_{}.csv'.format(time_str))
note_meta_df.to_csv(output_path.as_posix(), index=False)
2016-10-24 16:41:54,930 - root - INFO - ../../data/mimic3_note_metadata_2016-10-24-16-35.csv
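
The cells above use a time_str variable whose definition is not shown. Judging by the logged file names, it looks like a year-month-day-hour-minute stamp, so the sketch below assumes that format. It writes to a temporary directory with the csv module instead of the repository data directory, and the helper name `timestamped_path` is made up for this example.

```python
import csv
import datetime
import pathlib
import tempfile

def timestamped_path(directory, stem, now):
    """Build '<stem>_<YYYY-MM-DD-HH-MM>.csv' under `directory` (assumed format)."""
    time_str = now.strftime('%Y-%m-%d-%H-%M')
    return pathlib.Path(directory).joinpath('{}_{}.csv'.format(stem, time_str))

now = datetime.datetime(2016, 10, 24, 16, 35)  # fixed time for a stable example
with tempfile.TemporaryDirectory() as tmp:
    output_path = timestamped_path(tmp, 'notes_icd9_codes', now)
    # Write one toy record, then read it back to check the round trip.
    with output_path.open('w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=['code', 'level'])
        writer.writeheader()
        writer.writerow({'code': '39891', 'level': 'source'})
    with output_path.open(newline='') as fh:
        rows_back = list(csv.DictReader(fh))
    file_name = output_path.name

print(file_name, rows_back)
```

One thing to keep in mind when reloading the codes file: the level column mixes the string 'source' with integers, so it comes back as text and the numeric levels need converting before any depth comparison.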

Supporting code download
import logging
import pandas as pd
import sys 

from icd9 import ICD9

# feel free to replace with your path to the json file
tree = ICD9('icd9/codes.json')

logger = logging.getLogger()

def get_note_metadata(conn, row_id):
    """Retrieve note metadata from MIMIC III database

    Parameters
    ----------
    conn : sqlalchemy connection
    row_id : MIMIC III note row id to retrieve

    Returns
    -------
    dict : subject_id, hadm_id, chartdate, charttime, storetime, cgid corresponding to note
    """
    query = """
select subject_id, hadm_id, chartdate, charttime, storetime, cgid
from mimiciii.noteevents
where row_id={};"""
    # int() ensures only an integer is formatted into the query
    res = conn.execute(query.format(int(row_id))).fetchone()
    if res is None:
        return None

    return dict(res)

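def clean_icd9_code(icd9_str):
    """Convert a MIMIC III-style ICD9 code to a standard code for lookup

    Parameters
    ----------
    icd9_str : str
        MIMIC III code (e.g. '39891')

    Returns
    -------
    str :
        Standard ICD9 code, e.g. 398.91
    """
    if icd9_str is None:
        return None
    if '.' not in icd9_str:
        # E codes have a four-character category (e.g. E812), all others three.
        # Checking the length against the category keeps a bare four-character
        # E code such as 'E887' from being split into 'E88.7'.
        category_len = 4 if icd9_str.startswith('E') else 3
        if len(icd9_str) > category_len:
            icd9_str = icd9_str[:category_len] + '.' + icd9_str[category_len:]
    return icd9_str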

def print_icd9_tree(node):
    """Print the ICD9 tree of a given ICD9 code

    Parameters
    ----------
    node : str
        Properly formatted ICD9 code (e.g. '398.91')
    """
    if isinstance(node, str):
        icd9_str = clean_icd9_code(node)
        node = tree.find(icd9_str)
    if node is not None:
        # Ancestors first, then the node itself, then its direct children
        for c in node.parents:
            print('- {}: {}'.format(c.code, c.description))

        print('\n-> {}: {}\n'.format(node.code, node.description))

        for c in node.children:
            print('- {}: {}'.format(c.code, c.description))

def get_hadm_diagnoses(conn, hadm_id):
    """Retrieve all ICD9 diagnoses for a given encounter ID

    Parameters
    ----------
    conn : sqlalchemy connection
    hadm_id : int or str
        MIMIC III encounter ID

    Returns
    -------
    list of dict : each dict contains the ICD9 code, short title, and long title from MIMIC III
    """
    if hadm_id is None:
        return None
    query = """
select a.subject_id, a.hadm_id, a.seq_num, a.icd9_code, diags.short_title, diags.long_title
from mimiciii.diagnoses_icd as a
left join mimiciii.d_icd_diagnoses as diags on a.icd9_code = diags.icd9_code
where a.hadm_id = {}
order by a.seq_num;"""
    res = conn.execute(query.format(int(hadm_id))).fetchall()
    if res is not None:
        res = [dict(r.items()) for r in res]
        for r in res:
            # Add the lookup-friendly code and whether the ICD9 library knows it
            r['clean_icd9_code'] = clean_icd9_code(r['icd9_code'])
            r['known_icd9_code'] = tree.find(r['clean_icd9_code']) is not None
    return res

def get_icd9_levels(icd9_code, max_depth=5):
    """Retrieve parents in the ICD9 hierarchy of the given code

    Parameters
    ----------
    icd9_code : str
        Properly formatted ICD9 code
    max_depth : int
        Maximum depth to retrieve

    Returns
    -------
    list of str
        Parents of the given ICD9 code, starting from top-most parent and descending down to max_depth
    """
    icd9_str = clean_icd9_code(icd9_code)
    node = tree.find(icd9_str)
    levels = None
    if node is not None:
        # Index 0 is skipped so the list starts at the top-most parent
        levels = [c.code for c in node.parents[1:max_depth]]

    return levels
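
To see what the `[1:max_depth]` slice keeps, here is a toy sketch. It assumes that a node's parents list runs from a synthetic root down to the node itself, which is consistent with the levels in the output earlier in the post but is an assumption about the icd9 library rather than something confirmed here. ToyNode, toy_levels, and the hand-built chain are all made up for illustration.

```python
class ToyNode:
    """Hypothetical stand-in for a node in the ICD9 tree."""

    def __init__(self, code, parent=None):
        self.code = code
        self.parent = parent

    @property
    def parents(self):
        # Path from the root down to this node, inclusive (assumed behaviour).
        path, node = [], self
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]

# Hand-built chain: ROOT > 390-459 > 393-398 > 398 > 398.9 > 398.91
leaf = ToyNode('ROOT')
for code in ['390-459', '393-398', '398', '398.9', '398.91']:
    leaf = ToyNode(code, leaf)

def toy_levels(node, max_depth=5):
    # Skip index 0 (the synthetic root) and stop before index max_depth.
    return [n.code for n in node.parents[1:max_depth]]

print(toy_levels(leaf))
print(toy_levels(leaf, max_depth=3))
```

Under this assumption the slice returns at most max_depth - 1 codes, so with the default of 5 a deep leaf such as 398.91 is not itself part of the returned list.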

